
100 Data Science in R Interview Questions and Answers for 2016
04 Dec 2015

Latest update made on November 22, 2016.


R programming is one of the languages that data scientists have to
be familiar with. In most data science job interviews, questions
about coding in R will be asked, and applicants are expected to be
well versed in the nitty-gritty of R. We got together with our data
science faculty, who are experts in the field and have worked as
senior data scientists themselves, to put together a list of
questions that might be asked in data science interviews. These
interview questions relate only to R programming, and though the
list is not exhaustive, it will be useful to go through it while
preparing for data science jobs.
Data Science is a vast field, but when it comes to cracking
interviews for data science jobs, knowledge of either R or Python is
important to get started with your Data Science career. Being a Data
Scientist means that you are usually applying for a top level
position. A fresh college graduate will not be hired for a data
scientist position, as this is one position that demands experience,
maturity and in-depth knowledge of data science concepts and of the
industry that data scientists get hired in. The interview questions
below are specific to the R programming language, which has remained
the preferred language for data scientists through the years.
In our previous post, 100 Data Science Interview Questions, we
listed the general statistics, data, mathematics and conceptual
questions that are asked in interviews. These articles have been
divided into 3 parts, each focusing on one topic. Below are some of
the questions that may be asked during a data science interview and
that relate to R programming specifically.

Data Science Interview Questions and Answers in R Programming
1) How can you merge two data frames in R language?
Data frames in R language can be combined column-wise using the
cbind() function, or merged on common rows or columns using the
merge() function.
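As a minimal sketch of the two approaches, the example below uses two small made-up data frames; the names customers, orders, id, name and amount are assumptions for illustration only.

```r
# Hypothetical example data; all names are illustrative assumptions.
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(id = c(1, 2, 3), amount = c(10, 25, 40))

# cbind() places columns side by side; the rows must already line up.
side_by_side <- cbind(customers, amount = orders$amount)

# merge() matches rows using the common column "id".
merged <- merge(customers, orders, by = "id")
print(merged)
```

cbind() is only safe when both data frames are in the same row order, whereas merge() aligns rows by key.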
2) Explain about data import in R language
R Commander is one tool that can be used to import data in R
language. To start the R Commander GUI, the user must load the
package by typing library(Rcmdr) into the console. There are 3
different ways in which data can be imported with R Commander:

Users can select the data set in the dialog box or enter the
name of the data set (if they know it).

Data can also be entered directly using the editor of R
Commander via Data->New Data Set. However, this works well only
when the data set is not too large.

Data can also be imported from a URL, from a plain text file
(ASCII), from any other statistical package or from the clipboard.
3) Two vectors X and Y are defined as follows X <- c(3, 2, 4)
and Y <- c(1, 2). What will be output of vector Z that is
defined as Z <- X*Y.
In R language, when two vectors have different lengths, the elements
of the shorter vector are recycled until every element of the longer
vector has been multiplied. Here Y is treated as c(1, 2, 1), and R
also warns that the longer object length is not a multiple of the
shorter object length.
The output of the above code will be
3 4 4
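The recycling behaviour can be checked directly. This is a small sketch; the variable Z_explicit is an illustrative name that spells out what R does implicitly.

```r
X <- c(3, 2, 4)
Y <- c(1, 2)

# R warns because length(X) is not a multiple of length(Y);
# suppressWarnings() only hides that warning here.
Z <- suppressWarnings(X * Y)

# The same result with the recycling written out by hand.
Z_explicit <- X * rep(Y, length.out = length(X))

print(Z)
```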
4) How are missing values and impossible values
represented in R language?
NaN (Not a Number) is used to represent impossible values whereas
NA (Not Available) is used to represent missing values. The best way
to answer this question would be to mention that deleting missing
values is not a good idea because the probable cause for missing
value could be some problem with data collection or programming
or the query. It is good to find the root cause of the missing values
and then take the necessary steps to handle them.
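A short sketch of how the two values behave; the vector x is made up for illustration.

```r
x <- c(1, NA, 0 / 0)        # 0 / 0 evaluates to NaN

missing_flags    <- is.na(x)   # TRUE for both NA and NaN
impossible_flags <- is.nan(x)  # TRUE only for NaN

print(missing_flags)
print(impossible_flags)
```

Note that is.na() also reports NaN as missing, so is.nan() is needed to tell the two apart.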
5) R language has several packages for solving a particular
problem. How do you make a decision on which one is the
best to use?

The CRAN package ecosystem has more than 6000 packages. The best
way for beginners to answer this question is to mention that they
would look for a package that follows good software development
principles. The next thing would be to look for user reviews and find
out if other data scientists or analysts have been able to solve a
similar problem.
6) Which function in R language is used to find out whether
the means of 2 groups are equal to each other or not?
t.test()
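As an illustration, the sketch below runs a two-sample test on the sleep data set that ships with R; the choice of data set and the variable names are assumptions made for the example.

```r
# sleep is a built-in data set with two groups of 10 observations.
group1 <- sleep$extra[sleep$group == 1]
group2 <- sleep$extra[sleep$group == 2]

# By default t.test() performs a Welch two-sample t-test.
result <- t.test(group1, group2)
print(result$p.value)
```

A small p-value suggests that the two group means differ.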
7) What is the best way to communicate the results of data
analysis using R language?
The best possible way to do this is to combine the data, code and
analysis results in a single document using knitr for reproducible
research. This helps others to verify the findings, add to them and
engage in discussions. Reproducible research makes it easy to redo
the experiments by inserting new data and applying it to a different
problem.
8) How many data structures does R language have?
R language has homogeneous and heterogeneous data structures.
Homogeneous data structures hold objects of the same type: vector,
matrix and array. Heterogeneous data structures can hold objects of
different types: data frame and list.
9) What is the value of f(2) for the following R code?

b <- 4
f <- function(a)
{
  b <- 3
  b^3 + g(a)
}
g <- function(a)
{
  a*b
}

The answer to the above code snippet is 35. The value of a passed
to the function is 2 and the value for b defined in the function f(a)
is 3. So the output would be 3^3 + g(2). The function g is defined in
the global environment and it takes the value of b as 4 (due to
lexical scoping in R), not 3, returning a value 2*4 = 8 to the
function f. The result will be 3^3 + 8 = 35.
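10) What is the process to create a table in R language
without using external files?
MyTable <- data.frame()
MyTable <- edit(MyTable)
The above code will open a spreadsheet-like data editor for entering
data into MyTable. edit() returns the edited copy, so the result has
to be assigned back to MyTable.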
11) Explain about the significance of transpose in R language
Transpose t() is the easiest method for reshaping the data before
analysis. It swaps the rows and columns of a matrix or data frame.
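A minimal sketch of t() on a small made-up matrix:

```r
m  <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns
tm <- t(m)                   # 3 rows, 2 columns

print(dim(m))
print(dim(tm))
```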

12) What are with() and by() functions used for?
The with() function is used to evaluate an expression within a given
dataset and the by() function is used for applying a function to each
level of a factor or factors.
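The sketch below applies both functions to the built-in mtcars data set; the choice of data set and columns is an assumption for illustration.

```r
# with(): evaluate an expression using the columns of a data set.
avg_mpg <- with(mtcars, mean(mpg))

# by(): apply a function to the rows belonging to each level of a factor.
mpg_by_cyl <- by(mtcars, mtcars$cyl, function(d) mean(d$mpg))

print(avg_mpg)
print(mpg_by_cyl)
```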
13) dplyr package is used to speed up data frame management
code. Which package can be integrated with dplyr for large
fast tables?
data.table
14) In base graphics system, which function is used to add
elements to a plot?
text() adds elements to an existing plot. boxplot() draws a new plot
by default.
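15) What are the different types of sorting algorithms
available in R language?
Bucket Sort
Selection Sort
Quick Sort
Bubble Sort
Merge Sort
These are general sorting algorithms that can be implemented in R.
The built-in sort() function selects its algorithm through the method
argument, which accepts "auto", "radix", "quick" and "shell".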
15) What is the command used to store R objects in a file?
save(x, file = "x.Rdata")
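A small round-trip sketch, assuming a temporary file is acceptable for the demonstration; the object x and the path are illustrative.

```r
x <- c(1, 2, 3)

# A temporary file is used so the example leaves nothing behind.
path <- tempfile(fileext = ".RData")
save(x, file = path)

rm(x)       # remove the object from the workspace
load(path)  # load() restores it under its original name
print(x)
```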
16) What is the best way to use Hadoop and R together for
analysis?
HDFS can be used for storing the data long-term. MapReduce jobs
submitted from either Oozie, Pig or Hive can be used to encode,
improve and sample the data sets from HDFS into R. This helps to
leverage complex analysis tasks on the subset of data prepared in R.
17) What will be the output of log (-5.8) when executed on R
console?
Executing the above on R console will return NaN (Not a Number)
along with the warning "NaNs produced", because it is not possible
to take the log of a negative number.
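18) How is a Date object represented internally in R
language?
A Date object is stored as the number of days since 1970-01-01,
which can be seen by removing its class:
unclass(as.Date("2016-10-05"))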
19) What will be the output of the below code?
printmessage <- function(a) {
  if (is.na(a))
    print("a is a missing value!")
  else if (a < 0)
    print("a is less than zero")
  else
    print("a is greater than or equal to zero")
  invisible(a)
}
printmessage(NA)

The output for the above R programming code will be "a is a missing
value!". The function is.na() is used to check if the input passed
is a missing value.
20) Which package in R supports the exploratory analysis of
genomic data?
adegenet
21) What is the difference between data frame and a matrix
in R?
Data frame can contain heterogeneous inputs while a matrix cannot.
In matrix only similar data types can be stored whereas in a data
frame there can be different data types like characters, integers or
other data frames.
22) How can you add datasets in R?
rbind() function can be used to add datasets in R language, provided
the columns in the datasets are the same.
23) How do you split a continuous variable into different
groups/ranks in R?
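The list gives no answer for this question. One common approach, offered here as a sketch rather than as the authors' intended answer, is the base R function cut(); the data, break points and labels below are made up for illustration.

```r
values <- c(5, 17, 30, 45, 70)   # hypothetical continuous variable

# Intervals are (0, 18], (18, 40] and (40, Inf] by default (right-closed).
groups <- cut(values,
              breaks = c(0, 18, 40, Inf),
              labels = c("low", "medium", "high"))

print(table(groups))
```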
24) What are factor variables in R language?
Factor variables are categorical variables that hold either string or
numeric values. Factor variables are used in various types of
graphics and particularly for statistical modelling where the correct
number of degrees of freedom is assigned to them.
25) What is the memory limit in R?
8TB is the memory limit for 64-bit system memory and 3GB is the
limit for 32-bit system memory.
26) What are the data types in R on which binary operators
can be applied?
Scalars, Matrices and Vectors.
27) How do you create log linear models in R language?
Using the loglm() function from the MASS package
28) What will be the class of the resulting vector if you
concatenate a number and NA?
numeric
29) What is meant by K-nearest neighbour?
K-Nearest Neighbour is one of the simplest machine learning
classification algorithms. It is a supervised learning method based
on lazy learning. In this algorithm the function is approximated
locally and any computations are deferred until classification.
30) What will be the class of the resulting vector if you
concatenate a number and a character?
character
31) Write code to build an R function powered by C?
32) If you want to know all the values in c(1, 3, 5, 7, 10)
that are not in c(1, 5, 10, 12, 14). Which in-built function in
R can be used to do this? Also, how can this be achieved
without using the in-built function?
Using in-built function - setdiff(c(1, 3, 5, 7, 10), c(1, 5, 10, 12, 14))
Without using in-built function -
c(1, 3, 5, 7, 10)[!c(1, 3, 5, 7, 10) %in% c(1, 5, 10, 12, 14)]
Both return 3 7.
33) How can you debug and test R programming code?
R code can be tested using Hadley Wickham's testthat package. For
debugging, base R provides functions such as traceback(), debug()
and browser().
34) What will be the class of the resulting vector if you
concatenate a number and a logical?
numeric
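The coercion rules behind the concatenation questions in this list can be checked with class(). This is a short sketch; the variable names are illustrative.

```r
with_na        <- class(c(1, NA))    # NA is coerced to numeric
with_character <- class(c(1, "a"))   # the number is coerced to character
with_logical   <- class(c(1, TRUE))  # TRUE is coerced to 1

print(c(with_na, with_character, with_logical))
```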
35) Write a function in R language to replace the missing
value in a vector with the mean of that vector.
mean_impute <- function(x) {x[is.na(x)] <- mean(x, na.rm = TRUE); x}
36) What happens if the application object is not able to
handle an event?
The event is dispatched to the delegate for processing.
37) Differentiate between lapply and sapply.
If the programmer wants the output to be simplified to a vector or a
matrix, then the sapply function is used, whereas if the programmer
wants the output to be a list then lapply is used. There is one more
function known as vapply, which is preferred over sapply as vapply
allows the programmer to specify the output type. The disadvantage
of using vapply is that it is more verbose.
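A minimal sketch of the three functions on a small made-up list:

```r
values <- list(a = 1:3, b = 4:6)

as_list   <- lapply(values, sum)              # always a list
as_vector <- sapply(values, sum)              # simplified to a named vector
checked   <- vapply(values, sum, integer(1))  # result type declared up front

print(as_list)
print(as_vector)
```

vapply() stops with an error if a result does not match the declared type and length, which is why it is the safer choice inside functions.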
38) Differentiate between seq(6) and seq_along(6)
seq(6) will produce a sequential vector from 1 to 6, c(1,2,3,4,5,6),
whereas seq_along(6) will produce a sequence as long as its argument.
The argument 6 is a vector of length 1, so the result is just 1.
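The difference is easy to verify. In this sketch, the character vector is an illustrative input.

```r
from_seq       <- seq(6)                      # 1 2 3 4 5 6
from_seq_along <- seq_along(6)                # 1, because 6 has length 1
along_letters  <- seq_along(c("x", "y", "z")) # 1 2 3

print(from_seq)
print(from_seq_along)
print(along_letters)
```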
39) How will you read a .csv file in R language?
read.csv() function is used to read a .csv file in R language. Below
is a simple example -
filecontent <- read.csv("sample.csv")
print(filecontent)
40) How do you write R commands?
Commands are typed at the console prompt or saved in a script file
and executed. A line that begins with a hash symbol (#) is a comment
and is not executed.
41) How can you verify if a given object X is a matrix data
object?
If the function call is.matrix(X) returns TRUE then X can be termed
as a matrix data object.
42) What do you understand by element recycling in R?
If two vectors with different lengths perform an operation, the
elements of the shorter vector will be re-used to complete the
operation. This is referred to as element recycling.
Example - Vector A <- c(1,2,0,4) and Vector B <- c(3,6), then the
result of A*B will be (3,12,0,24). Here 3 and 6 of vector B are
repeated when computing the result.
43) How can you verify if a given object X is a matrix data
object?
If the function call is.matrix(X) returns TRUE then X can be
considered as a matrix data object, otherwise not.

44) How will you measure the probability of a binary
response variable in R language?
Logistic regression can be used for this and the function glm() in R
language provides this functionality when called with
family = binomial.
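A minimal sketch of a logistic regression, assuming the built-in mtcars data set with am (0/1 transmission type) as the binary response and wt as the predictor; both choices are for illustration only.

```r
# family = binomial requests a logistic regression.
model <- glm(am ~ wt, data = mtcars, family = binomial)

# type = "response" returns fitted probabilities rather than log-odds.
probs <- predict(model, type = "response")
print(head(probs))
```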
45) What is the use of sample and subset functions in R
programming language?
sample() function can be used to select a random sample of size n
from a huge dataset.
subset() function is used to select variables and observations from
a given dataset.
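Both functions are sketched below on the built-in mtcars data set; the seed, the sample size and the filter condition are arbitrary choices made for the example.

```r
set.seed(42)  # arbitrary seed so the sample is reproducible

# sample() draws 5 random row numbers, which are then used to index rows.
picked <- mtcars[sample(nrow(mtcars), 5), ]

# subset() keeps rows matching a condition and the selected columns.
efficient <- subset(mtcars, mpg > 25, select = c(mpg, cyl))

print(picked)
print(efficient)
```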
46) There is a function fn <- function(a, b, c, d, e) a + b * c - d / e.
Write the code to call fn on the vector c(1,2,3,4,5) such that the
output is same as fn(1,2,3,4,5).
do.call(fn, as.list(c(1, 2, 3, 4, 5)))
47) How can you resample statistical tests in R language?
The coin package in R provides various options for re-randomization
and permutation-based statistical tests. When test assumptions
cannot be met, this package serves as an alternative to classical
methods as it does not assume random sampling from well-defined
populations.
48) What is the purpose of using the next statement in R
language?
If a developer wants to skip the current iteration of a loop in the
code without terminating it then they can use the next statement.
Whenever R comes across the next statement in the code, it skips the
rest of the current iteration and jumps to the next iteration of the
loop.
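A short sketch of next inside a for loop; the loop range and the variable odd_values are illustrative.

```r
odd_values <- c()
for (i in 1:6) {
  if (i %% 2 == 0) next          # skip even numbers, continue the loop
  odd_values <- c(odd_values, i)
}
print(odd_values)
```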
49) How will you create scatterplot matrices in R language?
A matrix of scatterplots can be produced using pairs(). The pairs
function takes various parameters like formula, data, subset, labels,
etc.
The two key parameters required to build a scatterplot matrix are

formula - A formula like ~a+b+c. Each term gives a separate variable
in the pairs plot, and the terms should be numerical vectors. It
represents the series of variables used in pairs.

data - It represents the dataset from which the variables have to be
taken for building a scatterplot.
50) How will you check if an element 25 is present in a
vector?
There are various ways to do this -

i. It can be done using the match() function. match() returns the
position of the first appearance of a particular element.

ii. The other is to use %in%, which returns a Boolean value, either
TRUE or FALSE.

iii. is.element() function also returns a Boolean value, either TRUE
or FALSE, based on whether the element is present in a vector or not.
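The three options can be compared side by side. This is a small sketch with a made-up vector v.

```r
v <- c(10, 25, 40, 25)

position   <- match(25, v)       # index of the first occurrence
found_in   <- 25 %in% v          # logical
found_elem <- is.element(25, v)  # logical

print(position)
print(found_in)
print(found_elem)
```

match() returns NA when the element is absent, so %in% or is.element() is more convenient when only a yes/no answer is needed.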
51) What is the difference between library() and require()
functions in R language?
There is no real difference between the two if the packages are not
being loaded inside a function. The require() function is usually
used inside functions; it returns FALSE and gives a warning whenever
a particular package is not found. On the flip side, the library()
function gives an error message if the desired package cannot be
loaded.
52) What are the rules to define a variable name in R
programming language?
A variable name in R programming language can contain letters and
digits along with the special characters dot (.) and underscore (_).
Variable names in R language can begin with a letter or the dot
symbol. However, if the variable name begins with a dot symbol it
should not be followed by a digit.
53) What do you understand by workspace in R programming
language?
The current R working environment of a user, which holds
user-defined objects like lists, vectors, etc., is referred to as
the Workspace in R language.
54) Which function helps you perform sorting in R language?
order()
55) How will you list all the data sets available in all R
packages?
Using the below line of code -
data(package = .packages(all.available = TRUE))
56) Which function is used to create a histogram visualisation
in R programming language?
hist()
57) Write the syntax to set the path for current working
directory in R environment.
setwd(dir_path)
58) How will you drop variables using indices in a data
frame?
Let's take a dataframe -
df <- data.frame(v1=c(1:5),v2=c(2:6),v3=c(3:7),v4=c(4:8))
df

##   v1 v2 v3 v4
## 1  1  2  3  4
## 2  2  3  4  5
## 3  3  4  5  6
## 4  4  5  6  7
## 5  5  6  7  8

Suppose we want to drop variables v2 & v3. The variables v2 and v3
can be dropped using negative indices as follows -
df1 <- df[-c(2,3)]
df1

##   v1 v4
## 1  1  4
## 2  2  5
## 3  3  6
## 4  4  7
## 5  5  8


59) What will be the output of runif(7)?
It will generate 7 random numbers between 0 and 1.
60) What is the difference between rnorm and runif functions?
rnorm function generates "n" normal random numbers based on the
mean and standard deviation arguments passed to the function.
Syntax of rnorm function - rnorm(n, mean = , sd = )
runif function generates "n" uniform random numbers in the interval
between the minimum and maximum values passed to the function.
Syntax of runif function - runif(n, min = , max = )
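A brief sketch of both generators; the seed and the parameter values are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

normal_draws  <- rnorm(5, mean = 10, sd = 2)  # centred on 10, unbounded
uniform_draws <- runif(5, min = 2, max = 5)   # always between 2 and 5

print(normal_draws)
print(uniform_draws)
```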

61) What will be the output on executing the following R
programming code?
mat <- matrix(rep(c(TRUE,FALSE),8),nrow=4)
sum(mat)
The output is 8.
62) How will you combine multiple different strings like
"Data", "Science", "in", "R", "Programming" as a single
string "Data_Science_in_R_Programming"?
paste("Data", "Science", "in", "R", "Programming", sep="_")
63) Write a function to extract the first name from the string
"Mr. Tom White".
substr("Mr. Tom White", start=5, stop=7)
64) Can you tell if the equation given below is linear or not?
Emp_sal = 2000 + 2.5(emp_age)^2
Yes, it is a linear equation, as it is linear in the coefficients
even though emp_age is squared.
65) What will be the output of the following R programming
code ?
var2<- c("I","Love,"DeZyre")
var2
It will give an error because the closing quote after Love is
missing.
66) What will be the output of the following R programming
code?
x<-5
if(x%%2==0)
print("X is an even number")
else
print("X is an odd number")
Executing the above code will result in an error as shown below -

## Error: :4:1: unexpected 'else'
## 3: print("X is an even number")
## 4: else
##

R programming language does not know if the else is related to the
first if or not, as the first if() is a complete command on its own.
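One way to avoid the error, sketched here with the same logic, is to wrap the branches in braces so that else sits on the same line as the closing brace of the if block. The variable msg is introduced only so the result can be inspected.

```r
x <- 5
if (x %% 2 == 0) {
  msg <- "X is an even number"
} else {
  msg <- "X is an odd number"
}
print(msg)
```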
67) I have a string "contact@dezyre.com". Which string
function can be used to split the string into two different
strings "contact@dezyre" and "com"?
This can be accomplished using the strsplit function which splits a
string based on the identifier given in the function call. The output
of strsplit() function is a list. The split argument is treated as a
regular expression by default, in which "." matches any character,
so the dot has to be matched literally with fixed = TRUE.
strsplit("contact@dezyre.com", split = ".", fixed = TRUE)
Output of the strsplit function is -
## [[1]]
## [1] "contact@dezyre" "com"
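Because strsplit() interprets split as a regular expression unless told otherwise, a bare "." matches every character. The sketch below shows two ways of splitting on a literal dot; the variable names are illustrative.

```r
email <- "contact@dezyre.com"

# Option 1: treat the split string as fixed text.
parts_fixed <- strsplit(email, split = ".", fixed = TRUE)

# Option 2: escape the dot inside the regular expression.
parts_regex <- strsplit(email, split = "\\.")

print(parts_fixed[[1]])
```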
68) What is R Base package?
R Base package is the package that is loaded by default whenever the
R programming environment is loaded. The R base package provides
basic functionalities in the R environment like arithmetic
calculations and input/output.
69) How will you merge two dataframes in R programming
language?
The merge() function is used to combine two dataframes and it
identifies common rows or columns between the 2 dataframes.
By default, merge() finds the intersection between two different
sets of data.
The merge() function in R language takes a long list of arguments
as follows
Syntax for using the merge function in R language -
merge(x, y, by.x, by.y, all.x or all.y or all)

x represents the first dataframe.

y represents the second dataframe.

by.x - Variable name in dataframe x that is common in y.

by.y - Variable name in dataframe y that is common in x.

all.x - It is a logical value that specifies the type of merge. all.x
should be set to TRUE if we want all the observations from dataframe
x. This results in Left Join.

all.y - It is a logical value that specifies the type of merge. all.y
should be set to TRUE if we want all the observations from dataframe
y. This results in Right Join.

all - The default value for this is set to FALSE, which means that
only matching rows are returned, resulting in Inner join. This should
be set to TRUE if you want all the observations from dataframe x and
y, resulting in Outer join.
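The four join types can be sketched with two small made-up data frames that share an id column; all names and values are assumptions for illustration.

```r
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
y <- data.frame(id = c(2, 3, 4), b = c("u", "v", "w"))

inner <- merge(x, y, by = "id")                # ids 2, 3
left  <- merge(x, y, by = "id", all.x = TRUE)  # ids 1, 2, 3
right <- merge(x, y, by = "id", all.y = TRUE)  # ids 2, 3, 4
outer <- merge(x, y, by = "id", all = TRUE)    # ids 1, 2, 3, 4

print(outer)
```

Rows without a partner in the other data frame are filled with NA.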
70) Write the R programming code for an array of words so
that the output is displayed in decreasing frequency order.
R Programming Code to display output in decreasing frequency order -
tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), decreasing = TRUE)
depth <- 3
tt[1:depth]

Output -
 a a1  b
 3  3  2


71) How to check the frequency distribution of a categorical
variable?
The frequency distribution of a categorical variable can be checked
using the table function in R language. The table() function
calculates the count of each category of a categorical variable.
gender = factor(c("M","F","M","F","F","F"))
table(gender)
Output of the above R Code -
gender
F M
4 2
Programmers can also calculate the % of values for each categorical
group by storing the output in a dataframe and adding a percent
column as shown below -
t = data.frame(table(gender))
t$percent = round(t$Freq / sum(t$Freq)*100, 2)

Gender Frequency Percent
F      4         66.67
M      2         33.33

72) What is the procedure to check the cumulative
frequency distribution of any categorical variable?
The cumulative frequency distribution of a categorical variable can
be checked using the cumsum() function in R language.
Example -
gender = factor(c("f","m","m","f","m","f"))
y = table(gender)
cumsum(y)
Output of the above R code -
f m
3 6
73) What will be the result of multiplying two vectors in R
having different lengths?
The multiplication of the two vectors will be performed and the
output will be displayed with a warning message like "longer
object length is not a multiple of shorter object length". Suppose
there is a vector a <- c(1, 2, 3) and vector b <- c(2, 3), then the
multiplication of the vectors a*b will give the result 2 6 6 with
the warning message. The multiplication is performed in a
sequential manner, but since the lengths are not the same, the first
element of the smaller vector b will be multiplied with the last
element of the larger vector a.

1. What is R?
R is a programming language which is used for developing statistical
software and data analysis.
2. How are R commands written?
Commands are typed at the console or in a script. A line that starts
with #, like #division, is a comment and is not executed.
3. What is t.test() in R?
The t.test() function is used to determine whether the means of two
groups are equal or not.
4.What are the disadvantages of R Programming?
The disadvantages are: Lack of standard GUI
Not good for big data.
Does not provide spreadsheet view of data.
5. What is the use of with() and by() functions in R?
with() function applies an expression to a dataset.
#with(data,expression)
by() function applies a function to each level of a factor or factors.
#by(data,factorlist,function)
6. In R programming, how missing values are represented?
In R missing values are represented by NA which should be in capital letters.
7. What is the use of subset() and sample() functions in R?
subset() is used to select variables and observations, and the
sample() function is used to generate a random sample of size n from
a dataset.
8. Explain what is transpose?
Transpose is used for reshaping of the data which is used for analysis. Transpose
is performed by t() function.
9.What are the advantages of R?
The advantages are: It is used for managing and manipulating of data.
No license restrictions
Free and open source software.
Graphical capabilities of R are good.
Runs on many Operating system and different hardware and also run on 32 &
64 bit processors etc.
10. What is the function used for adding datasets in R?
For adding two datasets rbind() function is used but the column of two datasets
must be same.
Syntax: rbind(x1,x2) where x1,x2: vector, matrix, data frames.

11. How can you produce correlations and covariances?
Correlations are produced by the cor() function and covariances by
the cov() function.
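A short sketch of cor() and cov() on two columns of the built-in mtcars data set; the columns are chosen only for illustration.

```r
x <- mtcars$mpg
y <- mtcars$wt

correlation <- cor(x, y)  # Pearson correlation by default
covariance  <- cov(x, y)

print(correlation)
print(covariance)
```

The two are linked by the identity cov(x, y) = cor(x, y) * sd(x) * sd(y).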
12.What is difference between matrix and dataframes?
Dataframe can contain different type of data but matrix can contain only similar
type of data.
13. What is the difference between lapply and sapply?
lapply always returns its output as a list, whereas sapply simplifies
the output to a vector or matrix where possible.
14. What is the difference between seq(4) and seq_along(4)?
seq(4) produces the vector from 1 to 4 (c(1,2,3,4)), whereas
seq_along(4) produces a vector as long as its argument, which has
length 1, so the result is c(1).
15. Explain how you can start the R commander GUI?
Loading the package with library(Rcmdr) starts the R commander GUI.
16. What is the memory limit of R?
In 32 bit system memory limit is 3Gb but most versions limited to 2Gb and in 64
bit system memory limit is 8Tb.
17.How many data structures R has?
There are 5 data structure in R i.e. vector, matrix, array which are of
homogenous type and other two are list and data frame which are
heterogeneous.
18. Explain how data is aggregated in R?
There are two methods that is collapsing data by using one or more BY variable
and other is aggregate() function in which BY variable should be in list.
19. How many sorting algorithms are available?
There are 5 types of sorting algorithms in use, which are: Bubble Sort
Selection Sort
Merge Sort
Quick Sort
Bucket Sort
20.How to create new variable in R programming?
For creating new variable assignment operator <- is used
For e.g. mydata$sum <- mydata$x1 + mydata$x2
21.What are R packages?
Packages are collections of data, R functions and compiled code in a
well-defined format, and these packages are stored in a library.
22.What is the workspace in R?
Workspace is the current R working environment which includes any user defined
objects like vector, lists etc.
23. What is the function which is used for merging of data frames
horizontally in R?
merge() function is used to merge two data frames.
E.g. Sum <- merge(dataframe1, dataframe2, by = "ID")
24. What is the function which is used for merging of data frames
vertically in R?
rbind() function is used to merge two data frames vertically.
E.g. Sum <- rbind(dataframe1, dataframe2)
25. What is power analysis?
It is used in experimental design. It determines the sample size
needed to detect an effect of a given size with a given degree of
confidence.
26. Which package is used for power analysis in R?
The pwr package is used for power analysis in R.
27. Which method is used for exporting the data in R?
There are many ways to export data into other formats like SPSS,
SAS, Stata and Excel spreadsheets.
28. Which packages are used for exporting of data?
For Excel the xlsReadWrite package is used, and for SAS, SPSS and
Stata the foreign package is used.
29. How impossible values are represented in R?
In R NaN is used to represent impossible values.
30. Which command is used for storing R objects into a file?
The save command is used for storing R objects into a file.
Syntax: >save(z, file = "z.Rdata")
31. Which command is used for restoring R objects from a file?
The load command is used for restoring R objects from a file.
Syntax: >load("z.Rdata")
32.What is the use of coin package in R?
coin package is used to achieve the re randomization or permutation based
statistical tests.
33.Which function is used for sorting in R?
order() function is used to perform the sorting.
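34. What is the use of tapply?
tapply() applies a function to subsets of a vector, where the subsets
are defined by the levels of one or more factors.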
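tapply() applies a function to groups of values in a vector, with the groups given by a factor. A minimal sketch on the built-in mtcars data set, where the choice of columns is an assumption for illustration:

```r
# Mean miles per gallon for each number of cylinders.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(mpg_by_cyl)
```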
35.What happens when the application object does not handle an event?
the event will be dispatched to your delegate for processing.
36.Explain app specific objects which store the app contents?
Data model objects are app specific objects and store apps content. Apps can
also use document objects.
37.Explain the purpose of using UIWindow object?
UIWindow object coordinates the one or more views presenting on the screen.
38.Tell me the super class of all view controller objects?
UIView Controller class.
39. How to create axes in the graph?
Custom axes are created using the axis() function.
40. What is the use of abline() function?
abline() function adds a reference line to a graph.
Syntax:- abline(h=yvalues, v=xvalues)
41.Why vcd package is used?
vcd package provides different methods for visualizing multivariate categorical
data.
42. What is GGobi?
GGobi is an open source program for visualization for exploring high dimensional
typed data.
43. What is iPlots?
It is a package which provides bar plots, mosaic plots, box plots,
parallel plots, scatter plots and histograms.
44. What is the use of lattice package?
The lattice package improves on base R graphics by giving better
defaults, and it has the ability to easily display multivariate
relationships.
45. What is fitdistr() function?
It is used to provide the maximum likelihood fitting of univariate distributions. It
is defined under the MASS package.
46.Which data structures are used to perform statistical analysis and create
graphs.
Data structures are vectors, arrays, data frames and matrices.
47. What is the use of sink() function?
It redirects R output to a file or connection.
48. Why is the library() function used?
library(package) loads a package. Called with no arguments, library()
shows the packages which are installed.
49. Why is the search() function used?
This function shows which packages are currently loaded.
50. On which types of data do binary operators work?
Binary operators work on matrices, vectors and scalars.
51. What is the use of the doBy package?
It is used to define the desired table using function and model formula.
52. Which function is used to create frequency table?
Frequency table is created by table() function.
53. Define loglm() function.
loglm() function is used to create log-linear models.
54. What is the use of corrgram() function?
corrgram() function is used to plot correlograms.
55. How to create scatterplot matrices?
pairs() or splom() function is used to create scatterplot matrices.
56. What is npmc?
It is a package which gives nonparametric multiple comparisons.
57. What is the use of diagnostic plots?
They are used to check normality, heteroscedasticity and influential
observations.
58.Define anova() function.
anova() is used to compare the nested models.
59.What is cv.lm() function?
It is defined under the DAAG package which is used for k-fold validation.
60. Define stepAIC() function.
It is defined under the MASS package and performs stepwise model
selection by exact AIC.
61. Define leaps().
It is used to perform the all-subsets regression and it is defined under the leaps
package.
62.Define relaimpo package.
It is used to measure the relative importance of each of the predictor in the
model.
63. Why is the car package used?
It provides a variety of regression tools, including scatter plots,
added-variable plots and enhanced diagnostics.
64. Define robust package.
It provides a library of robust methods including regression.
65. What is robustbase?
It is a package which provides basic robust statistics including model selection
methods.
66. Define plotmeans().
It is defined under the gplots package and produces a mean plot for
single factors, including confidence intervals.
67.What is the full form of MANOVA?
MANOVA stands for multivariate analysis of variance.
68. What is the use of MANOVA?
By using MANOVA we can test more than one dependent variable simultaneously.
69. Define mshapiro.test().
It is a function defined in the mvnormtest package. It produces the
Shapiro-Wilk test for multivariate normality.
70. Define bartlett.test().
bartlett.test() is used to provide a parametric k-sample test of the
equality of variances.
71.What is fligner.test()?
It is a function which provides a non-parametric k sample test of the equality of
variances.
72. Define hovplot().
It is defined in the HH package and provides a graphic test of
homogeneity of variance based on Brown-Forsythe.
73.Which variables are represented by lower case letters?
Numerical variables are represented by lower case letters.
74. Which variables are represented by upper case letters?
Categorical factors are represented by upper case letters.
75.What is logistic regression?
Logistic regression is used to predict the binary outcome from the given set of
continuous predictor variables.
76.Define Poison regression.
It is used to predict the outcome variable which represents counts from the given
set of continuous predictor variable.
77. Define survival analysis.
It includes a number of techniques used for modeling the time to an event.
78. What is the use of the survfit() function?
It estimates a survival distribution for one or more groups.
79. Define survdiff().
It tests for differences in survival distributions between two or more groups.
80. What is coxph()?
It is a function used to model the hazard function on a set of predictor
variables.
81. In which package is survival analysis defined?
Survival analysis is defined in the survival package.
82. What is the use of the MASS package?
The MASS package includes functions that perform linear and quadratic
discriminant function analysis.
83. Define qda().
qda() prints a quadratic discriminant function.
84. Define lda().
lda() prints the discriminant functions, which are based on centered
variables.
85. What is the use of the forecast package?
It provides functions for automatic selection of ARIMA and
exponential smoothing models.
86. Define auto.arima().
It is used to handle seasonal as well as non-seasonal ARIMA models.
87. What is the principal() function?
It is defined in the psych package and is used to extract and rotate principal
components.
88. What is FactoMineR?
It is a package whose analyses can include both quantitative and qualitative variables, as well as
supplementary variables and observations.
89. What is the full form of CFA?
CFA stands for Confirmatory Factor Analysis.
90. What is the use of the boot.sem() function?
It is used to bootstrap a structural equation model.
91. What is the full form of SEM?
SEM stands for Structural Equation Modeling.
92. Which function performs classical multidimensional scaling?
The cmdscale() function performs classical multidimensional scaling.
93. Define isoMDS().
This function is defined in the MASS package and performs nonmetric
multidimensional scaling.
94. Which function performs individual differences scaling?
It is done by the indscal() function.
95. What is the pvclust() function?
It belongs to the pvclust package and provides p-values for hierarchical
clustering.
96. Define cluster.stats().
It is defined in the fpc package and provides a method for comparing the similarity of
two cluster solutions using different validation criteria.
97. Why do we use the party package?
It is used to provide non-parametric regression for ordinal, nominal, censored
and multivariate responses.
98. Which package provides bootstrapping?
The boot package provides bootstrapping.
99. Define the matlab package.
The matlab package includes wrapper functions and variables that are used to
replicate MATLAB function calls.
100. What is the use of the Matrix package?
The Matrix package includes functions that support sparse and dense
matrices, using libraries such as LAPACK and BLAS.

R Programming: 35 Job Interview Questions and Answers

Posted by Laetitia Van Cauwenberge on December 6, 2015 at 9:00am



Read the questions. At the bottom, you will find a link to the answers.

The Questions
First Set
1. Explain what R is.
2. List some of the functions that R provides.
3. Explain how you can start the R Commander GUI.
4. How can you import data in R?
5. Mention what the R language does not do.
6. Explain how R commands are written.
7. How can you save your data in R?
8. Mention how you can produce correlations and covariances.
9. Explain what t-tests are in R.
10. Explain what the with() and by() functions in R are used for.
11. What are the data structures in R that are used to perform statistical analyses and
create graphs?
12. Explain the general format of matrices in R.
13. How are missing values represented in R?
14. Explain what transpose is.
15. Explain how data is aggregated in R.
16. What is the function used for adding datasets in R?
17. What is the use of the subset() function and the sample() function in R?
18. Explain how you can create a table in R without an external file.
You can find the answers here.
Second Set
1. Data structure -- How many data structures R has? How do you build a binary
search tree in R?
2. Sorting -- How many sorting algorithms are available? Show me an example in R.
3. Low level -- How do you build an R function powered by C?
4. String -- How do you implement string operations in R?
5. Vectorization -- If you want to do Monte Carlo simulation in R, how do you improve
the efficiency?
6. Function -- How do you take function as argument of another function? What is the
apply() function family?
7. Threading -- How do you do multi-threading in R?
8. Memory limit and database -- What is the memory limit of R? How do you avoid it?
How do you use SQL in R?
9. Testing -- How do you do testing and debugging in R?
10. Software development -- How do you develop a package? How do you do version
control?
You can find the answers here.
Third Set
1. If I have a data.frame df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c(7, 8,
9))...

How do I select the c(4, 5, 6)?

How do I select the 1?

How do I select the 5?

What is df[, 3]?

What is df[1,]?

What is df[2, 2]?


2. What is the difference between a matrix and a dataframe?
3. If I concatenate a number and a character together, what will the class of the resulting
vector be?
4. What if I concatenate a number and a logical?
5. What if I concatenate a number and NA?
6. What is the difference between sapply and lapply? When should you use one versus
the other? Bonus: When should you use vapply?
7. What is the difference between seq(4) and seq_along(4)?
8. What is f(3) where:
y <- 5
f <- function(x) { y <- 2; y^2 + g(x) }
g <- function(x) { x + y }

Why?
9. I want to know all the values in c(1, 4, 5, 9, 10) that are not in c(1, 5, 10, 11, 13).
How do I do that with one built-in function in R? How could I do it if that function didn't exist?

10. Can you write me a function in R that replaces all missing values of a vector with the
mean of that vector?
11. How do you test R code? Can you write a test for the function you wrote in #10?
12. Say I have...
fn <- function(a, b, c, d, e) a + b * c - d / e
How do I call fn on the vector c(1, 2, 3, 4, 5) so that I get the same result as fn(1, 2,
3, 4, 5)? (No need to tell me the result, just how to do it.)

13. Consider:
dplyr <- "ggplot2"
library(dplyr)
Why does the dplyr package get loaded and not ggplot2?
14. Consider:
mystery_method <- function(x) { function(z) Reduce(function(y, w) w(y), x, z) }
fn <- mystery_method(c(function(x) x + 1, function(x) x * x))
fn(3)
What is the value of fn(3)? Can you explain what is happening at each step?

1.) If I have a data.frame df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c(7, 8, 9))...
1a.) How do I select the c(4, 5, 6)?
1b.) How do I select the 1?
1c.) How do I select the 5?
1d.) What is df[, 3]?
1e.) What is df[1,]?
1f.) What is df[2, 2]?
Answers: (a) df[[2]] or df$b, (b) df[[1]][[1]] or df$a[[1]], (c) df[[2]][[2]] or df$b[[2]],
(d) 7 8 9, (e) 1 4 7, (f) 5.
2.) What is the difference between a matrix and a dataframe?
Answer: A dataframe can contain heterogeneous inputs and a matrix cannot.
(You can have a dataframe of characters, integers, and even other
dataframes, but you can't do that with a matrix -- a matrix must be all the
same type.)
3a.) If I concatenate a number and a character together, what will the class
of the resulting vector be?
3b.) What if I concatenate a number and a logical?
3c.) What if I concatenate a number and NA?
Answers: (a) character, (b) numeric, (c) numeric.
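These coercion rules are easy to check at the console. A minimal sketch:

```r
# c() coerces all elements to the most flexible type present.
num_chr <- class(c(1, "a"))    # number + character
num_lgl <- class(c(1, TRUE))   # number + logical
num_na  <- class(c(1, NA))     # number + NA
c(num_chr, num_lgl, num_na)
```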
4.) What is the difference between sapply and lapply? When should you use
one versus the other? Bonus: When should you use vapply?
Answer: Use lapply when you want the output to be a list, and sapply when you
want the output simplified to a vector or a matrix. Generally vapply is preferred
over sapply because you can specify the output type of vapply (but not sapply).
The drawback is that vapply is more verbose and harder to use.
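The difference is easiest to see side by side. The sketch below uses a small made-up list; the object names are illustrative.

```r
x <- list(a = 1:3, b = 4:6)

as_list   <- lapply(x, mean)              # always a list
as_vector <- sapply(x, mean)              # simplified to a named numeric vector
checked   <- vapply(x, mean, numeric(1))  # same result, but the type is enforced

# sapply cannot promise a type: with empty input it returns an empty list,
# whereas vapply still returns a vector of the declared type.
empty_s <- sapply(list(), mean)
empty_v <- vapply(list(), mean, numeric(1))
```

This is the reason vapply tends to be the safer choice inside functions and packages, where the caller relies on the return type.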
5.) What is the difference between seq(4) and seq_along(4)?
Answer: seq(4) produces a vector from 1 to 4 (c(1, 2, 3, 4)),
whereas seq_along(4) produces a vector of length(4), or 1 (c(1)).
6.) What is f(3) where:
y <- 5
f <- function(x) { y <- 2; y^2 + g(x) }
g <- function(x) { x + y }

Why?
Answer: 12. In f(3), y is 2, so y^2 is 4. When evaluating g(3), y is the globally
scoped y (5) instead of the y that is locally scoped to f, so g(3) evaluates to 3
+ 5 or 8. The rest is just 4 + 8, or 12.
7.) I want to know all the values in c(1, 4, 5, 9, 10) that are not in c(1, 5, 10, 11,
13). How do I do that with one built-in function in R? How could I do it if that
function didn't exist?
Answer: setdiff(c(1, 4, 5, 9, 10), c(1, 5, 10, 11, 13)) and c(1, 4, 5, 9, 10)[!c(1, 4, 5, 9, 10)
%in% c(1, 5, 10, 11, 13)].
8.) Can you write me a function in R that replaces all missing values of a
vector with the mean of that vector?
Answer:
mean_impute <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }

9.) How do you test R code? Can you write a test for the function you wrote
in #8?
Answer: You can use Hadley's testthat package. A test might look like this:
library(testthat)
test_that("It imputes the mean correctly", {
  # the mean of 1, 2 and 6 is 3, so NA is replaced by 3
  expect_equal(mean_impute(c(1, 2, NA, 6)), c(1, 2, 3, 6))
})

10.) Say I have...


fn <- function(a, b, c, d, e) a + b * c - d / e

How do I call fn on the vector c(1, 2, 3, 4, 5) so that I get the same result as fn(1,
2, 3, 4, 5)? (No need to tell me the result, just how to do it.)
Answer: do.call(fn, as.list(c(1, 2, 3, 4, 5)))
11.)
dplyr <- "ggplot2"
library(dplyr)

Why does the dplyr package get loaded and not ggplot2?
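Answer: library() does not evaluate its argument. It captures the expression you typed, in the manner of deparse(substitute(dplyr)), so it sees the name dplyr rather than the value "ggplot2". To load the package whose name is stored in the variable, use library(dplyr, character.only = TRUE).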
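The capture mechanism can be reproduced without loading any package. In the sketch below, show_name() is a hypothetical helper written for this illustration; it is not part of library() itself.

```r
# Hypothetical helper that captures its argument as typed, without evaluating it
show_name <- function(pkg) deparse(substitute(pkg))

dplyr <- "ggplot2"
captured <- show_name(dplyr)  # the symbol that was typed
stored   <- dplyr             # the value held by the variable
c(captured = captured, stored = stored)
```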
12.)
mystery_method <- function(x) { function(z) Reduce(function(y, w) w(y), x, z) }

fn <- mystery_method(c(function(x) x + 1, function(x) x * x))


fn(3)

What is the value of fn(3)? Can you explain what is happening at each step?
Answer:
Best seen in steps.
fn(3) requires mystery_method to be evaluated first.
mystery_method(c(function(x) x + 1, function(x) x * x)) evaluates to...
function(z) Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), z)

Now, we can see the 3 in fn(3) is supposed to be z, giving us...


Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), 3)

This Reduce call is unusual in that it takes three arguments. A three-argument
Reduce call will initialize at the third argument, which is 3.
The inner function, function(y, w) w(y), is meant to take an argument and a
function and apply that function to the argument. Luckily for us, we have
some functions to apply.
That means we initialize at 3 and apply the first function, function(x) x + 1. 3 +
1 = 4.
We then take the value 4 and apply the second function. 4 * 4 = 16.
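One way to watch these steps is to ask Reduce for its intermediate values with accumulate = TRUE. This is a sketch that reuses the two functions from the question; the variable names are illustrative.

```r
fs <- list(function(x) x + 1, function(x) x * x)

# Same reducer as in mystery_method: apply function w to the running value y
apply_next <- function(y, w) w(y)

steps  <- Reduce(apply_next, fs, 3, accumulate = TRUE)  # start value, then each step
result <- Reduce(apply_next, fs, 3)                     # final value only
steps
result
```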

R INTERVIEW QUESTIONS AND ANSWERS


Deepanshu Bhalla

R is one of the most popular programming languages for performing statistical
analysis and predictive modeling. Many recent surveys and studies claim that
R holds a good percentage of market share in the analytics industry. A data
scientist role generally requires a candidate to know the R or Python programming
language. People who know the R programming language are generally paid
more than Python and SAS programmers. R has also improved a lot
in recent years: it supports parallel
computing and integration with big data technologies.

R Interview Questions and Answers

The following is a list of the most frequently asked R programming interview
questions with detailed answers. It includes basic, advanced and tricky
questions related to R. It also covers interview questions related to data
science with R.

1. How to determine data type of an object?
class() is used to determine the data type of an object. See the example below:
x <- factor(1:5)
class(x)
It returns "factor".

Object Class

To determine the structure of an object, use the str() function:
str(x) reports "Factor w/ 5 levels", followed by the level labels and the integer codes.
Example 2:
xx <- data.frame(var1=c(1:5))
class(xx)
It returns "data.frame".
str(xx) returns 'data.frame': 5 obs. of 1 variable: $ var1: int 1 2 3 4 5

2. What is the use of mode() function?


It returns the storage mode of an object.
x <- factor(1:5)
mode(x)
The above mode function returns numeric.

Mode Function

x <- data.frame(var1=c(1:5))
mode(x)
It returns list.
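Because class(), mode() and typeof() answer slightly different questions, it can help to compare them on the same objects. A minimal sketch:

```r
x <- factor(1:5)
d <- data.frame(var1 = 1:5)

# class: how the object behaves; mode/typeof: how it is stored
x_info <- c(class = class(x), mode = mode(x), typeof = typeof(x))
d_info <- c(class = class(d), mode = mode(d), typeof = typeof(d))
x_info
d_info
```

A factor is stored as integer codes, which is why mode() reports numeric, and a data frame is stored as a list of columns.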

3. Which data structure is used to store categorical variables?
R has a special data structure called "factor" to store categorical variables.
Making a variable a factor tells R that it is nominal or ordinal.
gender = c(1,2,1,2,1,2)
gender = factor(gender)
gender

4. How to check the frequency distribution of a categorical variable?
The table function is used to calculate the count of each category of a
categorical variable.
gender = factor(c("m","f","f","m","f","f"))
table(gender)

Output

If you want to include the % of values in each group, you can store the result
in a data frame using the data.frame function and then calculate the column
percent.
t = data.frame(table(gender))
t$percent= round(t$Freq / sum(t$Freq)*100,2)

Frequency Distribution

5. How to check the cumulative frequency distribution of a categorical variable
The cumsum function is used to calculate the cumulative sum of the frequencies of a
categorical variable.
gender = factor(c("m","f","f","m","f","f"))
x = table(gender)
cumsum(x)

Cumulative Sum

If you want to see the cumulative percentage of values, see the code
below :
t = data.frame(table(gender))
t$cumfreq = cumsum(t$Freq)
t$cumpercent= round(t$cumfreq / sum(t$Freq)*100,2)

Cumulative Frequency Distribution

6. How to produce histogram


The hist function is used to produce the histogram of a variable.
df = sample(1:100, 25)
hist(df, right=FALSE)

Produce Histogram with R

To improve the layout of histogram, you can


use the code below
colors = c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")
hist(df, right=FALSE, col=colors, main="Main Title ", xlab="X-Axis Title")

7. How to produce bar graph


First calculate the frequency distribution with table function and then
apply barplot function to produce bar graph
mydata = sample(LETTERS[1:5],16,replace = TRUE)
mydata.count= table(mydata)
barplot(mydata.count)

To improve the layout of bar graph, you can


use the code below:

colors = c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")


barplot(mydata.count, col=colors, main="Main Title ", xlab="X-Axis Title")

Bar Graph with R

8. How to produce Pie Chart


First calculate the frequency distribution with table function and then
apply pie function to produce pie chart.
mydata = sample(LETTERS[1:5],16,replace = TRUE)
mydata.count= table(mydata)
pie(mydata.count, col=rainbow(12))

Pie Chart with R

9. Multiplication of 2 vectors having different length
For example, you have two vectors as defined below:
x <- c(4,5,6)
y <- c(2,3)

If you run z <- x*y, what would be the output? What would be the length of z?
It returns 8 15 12 with a warning that the longer object length is not a multiple of the shorter object length. The length of z
is 3 as it has three elements.

Multiplication of vectors

First step: it multiplies the first element of vector x, i.e. 4,
with the first element of vector y, i.e. 2, and the result is 8. In the second step, it
multiplies the second element of vector x, i.e. 5, with the second element of vector y,
i.e. 3, and the result is 15. In the next step, R recycles the shorter vector: it multiplies the first element of
the smaller vector (y) with the last element of the bigger vector (x), giving 12.
Suppose the vector x would contain four elements as shown below :
x <- c(4,5,6,7)
y <- c(2,3)
x*y
It returns 8 15 12 21. It works like this : (4*2) (5*3) (6*2) (7*3)
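The recycling rule can be made explicit with rep_len(), which repeats the shorter vector up to the required length. A small sketch using the same numbers:

```r
x <- c(4, 5, 6, 7)
y <- c(2, 3)

recycled <- rep_len(y, length(x))  # y repeated to the length of x: 2 3 2 3
implicit <- x * y                  # R recycles y automatically
explicit <- x * recycled           # the same computation written out
implicit
explicit
```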

10. What are the different data structures R contains?
R contains primarily the following data structures:
1. Vector
2. Matrix
3. Array
4. List
5. Data frame
6. Factor
The first three data types (vector, matrix, array) are homogeneous in
behavior. It means all contents must be of the same type. The fourth and
fifth data types (list, data frame) are heterogeneous in behavior. It implies
they allow different types. The factor data type is used to store
categorical variables.

Explanation : Data Types (Structures) in R


11. How to combine data frames?
Let's prepare 2 vectors for demonstration :
x = c(1:5)
y = c("m","f","f","m","f")
The cbind() function is used to combine data frame by columns.
z=cbind(x,y)

cbind : Output

The rbind() function is used to combine data frame by rows.


z = rbind(x,y)

rbind : Output

While using the cbind() function, make sure the number of rows is
equal in both datasets. While using the rbind() function, make sure both
the number and names of columns are the same. If the column names
differ, rbind() on data frames stops with an error, and on matrices it binds by
position, so values can end up under the wrong column.
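The example above binds plain vectors, which yields a character matrix. For actual data frames the behaviour can be sketched as follows; the data frames and column names are made up for illustration.

```r
df_a <- data.frame(id = 1:3, sex = c("m", "f", "f"))
df_b <- data.frame(score = c(10, 20, 30))
df_c <- data.frame(id = 4:5, sex = c("m", "f"))

wide <- cbind(df_a, df_b)  # same number of rows: columns are added
long <- rbind(df_a, df_c)  # same column names: rows are appended

# Mismatched column names make rbind() fail for data frames
mismatch <- tryCatch(
  rbind(df_a, data.frame(id = 6, gender = "m")),
  error = function(e) "error"
)
```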

12. How to combine data by rows when the number of columns differs?
When the number of columns in the datasets is not equal, the rbind() function
cannot combine the data by rows. For example, we have two data
frames df and df2. The data frame df has 2 columns and df2 has only 1
variable. See the code below:
df = data.frame(x = c(1:4), y = c("m","f","f","m"))
df2 = data.frame(x = c(5:8))
The bind_rows() function from dplyr package can be used to combine data
frames when number of columns do not match.
library(dplyr)
combdf = bind_rows(df,df2)

13. What are valid variable names in R?
A valid variable name consists of letters, numbers and the dot or underline
characters. A variable name can start with either a letter or a dot that is not followed
by a number.
A variable name such as .1var is not valid. But .var1 is valid.
A variable name cannot be a reserved word. The reserved words are listed
below:
if else repeat while function for in next break
TRUE FALSE NULL Inf NaN NA NA_integer_ NA_real_ NA_complex_
NA_character_
A variable name can be at most 10,000 bytes long.
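A quick way to test a candidate name is to compare it with what make.names() would turn it into, since make.names() leaves syntactically valid names unchanged. The helper is_valid_name() below is a hypothetical function written for this sketch.

```r
# Hypothetical helper: a name is valid if make.names() does not need to alter it
is_valid_name <- function(x) x == make.names(x)

candidates <- c(".var1", ".1var", "my_var", "if", "2x")
valid <- is_valid_name(candidates)
setNames(valid, candidates)
```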

14. What is the use of with() and by() functions? What are its alternatives?
Suppose you have a data frame as shown below:
df=data.frame(x=c(1:6), y=c(1,2,4,6,8,12))
You are asked to perform this calculation: (x+y) + (x-y). Most R
programmers write code like the line below:
(df$x + df$y) + (df$x - df$y)
Using the with() function, you can refer to your data frame once and make the above
code more compact and simpler:
with(df, (x+y) + (x-y))
A similar effect can be achieved with the pipe operator and mutate() in the dplyr package. See the
code below:
library(dplyr)
df %>% mutate((x+y) + (x-y))

by() function in R
The by() function is similar to GROUP BY in SQL. It is used to
perform calculations by a factor or a categorical variable. In the example
below, we are computing the mean of variable var2 by a factor var1.
df = data.frame(var1=factor(c(1,2,1,2,1,2)), var2=c(10:15))
with(df, by(df, var1, function(x) mean(x$var2)))
The group_by() function in the dplyr package can perform the same task.
library(dplyr)
df %>% group_by(var1)%>% summarise(mean(var2))

15. How to rename a variable?


In the example below, we are renaming variable var1 to variable1.
df = data.frame(var1=c(1:5))
colnames(df)[colnames(df) == 'var1'] <- 'variable1'
The rename() function in dplyr package can also be used to rename a
variable.
library(dplyr)
df= rename(df, variable1=var1)

16. What is the use of which() function in R?


The which() function returns the position of elements of a logical vector
that are TRUE. In the example below, we are figuring out the row number
wherein the maximum value of a variable x is recorded.
mydata=data.frame(x = c(1,3,10,5,7))
which(mydata$x==max(mydata$x))
It returns 3 as 10 is the maximum value and it is at 3rd row in the variable x.

17. How to calculate first non-missing value in variables?
Suppose you have three variables X, Y and Z and you need to extract the first
non-missing value in each row of these variables.
data = read.table(text="
X Y Z
NA 1 5
3 NA 2
", header=TRUE)
The coalesce() function in dplyr package can be used to accomplish this
task.
library(dplyr)
data %>% mutate(var=coalesce(X,Y,Z))

COALESCE Function in R

18. How to calculate max value for rows?


Let's create a sample data frame
dt1 = read.table(text="
X Y Z
7 NA 5
2 4 5
", header=TRUE)

With the apply() function, we can tell R to apply the max function row-wise.
The na.rm = TRUE argument tells R to ignore missing values while calculating the
max value. If it is not used, a row containing a missing value would return NA.
dt1$var = apply(dt1,1, function(x) max(x,na.rm = TRUE))

Output

19. Count number of zeros in a row


dt2 = read.table(text="
A B C
8 0 0
6 0 5
", header=TRUE)
apply(dt2,1, function(x) sum(x==0))

20. Does the following code work?
ifelse(df$var1==NA, 0,1)
It does not work. A logical comparison with NA returns NA, not TRUE or
FALSE.
This code works:
ifelse(is.na(df$var1), 0,1)
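The behaviour is easy to confirm on a small vector. In this sketch, v is a made-up example vector:

```r
v <- c(5, NA, 7)

cmp     <- v == NA                # every comparison with NA yields NA
missing <- is.na(v)               # the correct test for missingness
flags   <- ifelse(is.na(v), 0, 1) # 0 where missing, 1 otherwise
cmp
missing
flags
```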

21. What would be the final value of x


after running the following program?
x=3
mult <- function(j)
{
x=j*2
return(x)
}
mult(2)
[1] 4

Answer: The value of x will remain 3.

This is because the x assigned inside the function is local to the function; the x defined outside the function is not changed. If you want to change the value of
x after running the function, you can use the following program:
x=3
mult <- function(j)
{
x <<- j * 2
return(x)
}
mult(2)
x
The operator "<<-" tells R to search in the parent environment for an
existing definition of the variable we want to be assigned.

22. How to convert a factor variable to numeric
Applied to a factor, the as.numeric() function returns the internal integer codes of the levels and not
the original values. Hence, it is required to convert a factor variable to
character before converting it to numeric.
a <- factor(c(5, 6, 7, 7, 5))
a1 = as.numeric(as.character(a))
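The pitfall becomes visible when the two conversions are compared directly. A small sketch with the same factor:

```r
a <- factor(c(5, 6, 7, 7, 5))

codes  <- as.numeric(a)                # internal level codes, not the values
values <- as.numeric(as.character(a))  # the original numbers
via_levels <- as.numeric(levels(a))[a] # equivalent: convert the levels, then index
codes
values
```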

23. How to concatenate two strings?


The paste() function is used to join two strings. A single space is the
default separator between two strings.
a = "Deepanshu"
b = "Bhalla"
paste(a, b)
It returns "Deepanshu Bhalla"
If you want to change the default single space separator, you can add
sep="," keyword to include comma as a separator.
paste(a, b, sep=",") returns "Deepanshu,Bhalla"

24. How to extract first 3 characters from a word
The substr() function is used to extract substrings from a character vector. The
syntax of the substr function is substr(character_vector, starting_position,
end_position)
x = "AXZ2016"
substr(x,1,3)

Character Functions Explained


25. How to extract last name from full name
The last name is the end string of the name. For example, Jhonson is the last
name of "Dave,Jon,Jhonson".
dt2 = read.table(text="
var
Sandy,Jones
Dave,Jon,Jhonson
", header=TRUE)
The word() function of stringr package is used to extract or scan word from a
string. -1 in the second parameter denotes the last word.
library(stringr)
dt2$var2 = word(dt2$var, -1, sep = ",")

26. How to remove leading and trailing spaces
The trimws() function is used to remove leading and trailing spaces.
a = " David Banes "
trimws(a)
It returns "David Banes".

27. How to generate random numbers between 1 and 100
The runif() function is used to generate uniformly distributed random numbers.
rand = runif(100, min = 1, max = 100)
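Note that runif() returns real numbers, not integers. If whole numbers are wanted, sample() is one alternative; this is a suggestion added here and is not part of the original answer. The seed is fixed only to make the sketch repeatable.

```r
set.seed(123)  # arbitrary seed, for repeatability only

reals    <- runif(5, min = 1, max = 100)      # real numbers between 1 and 100
integers <- sample(1:100, 5, replace = TRUE)  # whole numbers from 1 to 100
reals
integers
```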

28. How to apply LEFT JOIN in R?
LEFT JOIN implies keeping all rows from the left table (data frame) along with the
matching rows from the right table. In the merge() function, all.x=TRUE
denotes a left join.
df1=data.frame(ID=c(1:5), Score=runif(5,50,100))
df2=data.frame(ID=c(3,5,7:9), Score2=runif(5,1,100))
comb = merge(df1, df2, by ="ID", all.x = TRUE)
Left Join (SQL Style)
library(sqldf)
comb = sqldf('select df1.*, df2.* from df1 left join df2 on df1.ID = df2.ID')
Left Join with dplyr package
library(dplyr)
comb = left_join(df1, df2, by = "ID")
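Because the examples above use random scores, the effect of the join is easier to see with fixed values. The sketch below uses base R's merge() on two small made-up data frames.

```r
left  <- data.frame(ID = 1:5, Score = c(50, 60, 70, 80, 90))
right <- data.frame(ID = c(3, 5, 7), Score2 = c(10, 20, 30))

# all.x = TRUE keeps every row of `left`; unmatched rows get NA in Score2
comb <- merge(left, right, by = "ID", all.x = TRUE)
comb
```

ID 7 exists only in the right table, so it does not appear in the result.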

Joining and Merging with R

29. How to calculate cartesian product of two datasets
The cartesian product implies the cross product of two tables (data frames). For
example, if df1 has 5 rows and df2 has 5 rows, the combined table would
contain 25 rows (5*5).
comb = merge(df1,df2,by=NULL)
CROSS JOIN (SQL Style)
library(sqldf)
comb2 = sqldf('select * from df1 join df2 ')

30. Unique rows common to both the

datasets
First, create two sample data frames
df1=data.frame(ID=c(1:5), Score=c(50:54))
df2=data.frame(ID=c(3,5,7:9), Score=c(52,60:63))
library(dplyr)
comb = intersect(df1,df2)
library(sqldf)
comb2 = sqldf('select * from df1 intersect select * from df2 ')

Output : Intersection with R

31. How to measure execution time of a program in R?
There are multiple ways to measure the running time of code. Some frequently
used methods are listed below:
R Base Method
start.time <- Sys.time()
runif(5555,1,1000)
end.time <- Sys.time()
end.time - start.time
With tictoc package
library(tictoc)
tic()
runif(5555,1,1000)
toc()

32. Which package is generally used for fast


data manipulation on large datasets?
The package data.table performs fast data manipulation on large datasets.
See the comparison between dplyr and data.table.
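# Load required packages first, so that data.table functions are available
library(tictoc)
library(dplyr)
library(data.table)
# Load data
library(nycflights13)
data(flights)
# as.data.table() makes a copy, so flights itself stays a tibble for dplyr
df = as.data.table(flights)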
# Using data.table package
tic()
df[arr_delay > 30 & dest == "IAH",
.(avg = mean(arr_delay),
size = .N),
by = carrier]
toc()
# Using dplyr package
tic()
flights %>% filter(arr_delay > 30 & dest == "IAH") %>%
group_by(carrier) %>% summarise(avg = mean(arr_delay), size = n())
toc()
Result: the data.table package took 0.04 seconds, whereas the dplyr package took
0.07 seconds. So, data.table took approximately 40% less time than dplyr in this run. Since the
dataset used in the example is of medium size, the absolute difference
between the two is small. As the size of the data grows, the difference in
execution time gets bigger.

33. How to read large CSV file in R?


We can use fread() function of data.table package.

library(data.table)
yyy = fread("C:\\Users\\Dave\\Example.csv", header = TRUE)
We can also use read.big.matrix() function of bigmemory package.

34. What is the difference between the


following two programs ?
1. temp = data.frame(v1<-c(1:10),v2<-c(5:14))
2. temp = data.frame(v1=c(1:10),v2=c(5:14))
In the first case, it created two vectors v1 and v2 and a data frame temp
which has 2 variables with improper variable names. The second code
creates a data frame temp with proper variable names.
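The two forms can be compared by inspecting the resulting column names. In this sketch the data frames are called temp1 and temp2 to keep them apart; those names are not from the original.

```r
temp1 <- data.frame(v1 <- c(1:10), v2 <- c(5:14))  # assignment inside the call
temp2 <- data.frame(v1 = c(1:10), v2 = c(5:14))    # named arguments

names(temp1)  # names derived from the deparsed expressions
names(temp2)  # "v1" "v2"
side_effect <- exists("v1") && exists("v2")  # the first form also created v1 and v2
```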

35. How to remove all the objects


rm(list=ls())

36. What are the various sorting algorithms in R?
Major five sorting algorithms:
1. Bubble Sort
2. Selection Sort
3. Merge Sort
4. Quick Sort
5. Bucket Sort

37. Sort data by multiple variables


Create a sample data frame
mydata = data.frame(score = ifelse(sign(rnorm(25))==-1,1,2),
experience= sample(1:25))
Task : You need to sort score variable on ascending order and then sort
experience variable on descending order.

R Base Method
mydata1 <- mydata[order(mydata$score, -mydata$experience),]
With dplyr package
library(dplyr)
mydata1 = arrange(mydata, score, desc(experience))

38. Drop Multiple Variables


Suppose you need to remove 3 variables - x, y and z from data frame
"mydata".
R Base Method
df = subset(mydata, select = -c(x,y,z))
With dplyr package
library(dplyr)
df = select(mydata, -c(x,y,z))

40. How to save everything in R session


save.image(file="dt.RData")

41. How R handles missing values?


Missing values are represented by capital NA.
To create a new data without any missing value, you can use the code
below :
df <- na.omit(mydata)

42. How to remove duplicate values by a


column
Suppose you have a data consisting of 25 records. You are asked to remove
duplicates based on a column. In the example, we are eliminating duplicates
by variable y.
data = data.frame(y=sample(1:25, replace = TRUE), x=rnorm(25))
R Base Method
test = subset(data, !duplicated(data[,"y"]))

dplyr Method
library(dplyr)
test1 = distinct(data, y, .keep_all= TRUE)

43. Which packages are used for transposing


data with R
The reshape2 and tidyr packages are most popular packages for reshaping
data in R.

Explanation : Transpose Data


44. Calculate number of hours, days, weeks,
months and years between 2 dates
Let's set 2 dates :
dates <- as.Date(c("2015-09-02", "2016-09-05"))
difftime(dates[2], dates[1], units = "hours")
difftime(dates[2], dates[1], units = "days")
floor(difftime(dates[2], dates[1], units = "weeks"))
floor(difftime(dates[2], dates[1], units = "days")/365)
With lubridate package
library(lubridate)
interval(dates[1], dates[2]) %/% hours(1)
interval(dates[1], dates[2]) %/% days(1)
interval(dates[1], dates[2]) %/% weeks(1)
interval(dates[1], dates[2]) %/% months(1)
interval(dates[1], dates[2]) %/% years(1)

The months unit is not supported by the base difftime() function, so
we can use the interval() function of the lubridate package.

45. How to add 3 months to a date


library(lubridate)  # months(3) as a time period comes from lubridate
mydate <- as.Date("2015-09-02")
mydate + months(3)
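If an extra package is not wanted, base R can step a date forward by months with seq(). This is an alternative sketched here, not the method of the original answer.

```r
mydate <- as.Date("2015-09-02")

# seq() on dates accepts steps such as "3 months"; take the second element
plus3 <- seq(mydate, by = "3 months", length.out = 2)[2]
plus3
```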

46. Extract date and time from timestamp


mydate <- as.POSIXlt("2015-09-27 12:02:14")
library(lubridate)
date(mydate) # Extracting date part
format(mydate, format="%H:%M:%S") # Extracting time part
Extracting various time periods
day(mydate)
month(mydate)
year(mydate)
hour(mydate)
minute(mydate)
second(mydate)

47. What are various ways to write loop in R


There are primarily three ways to write a loop in R:
1. For Loop
2. While Loop
3. Apply family of functions such as apply, lapply, sapply etc.

48. Difference between lapply and sapply in R


lapply returns a list when we apply a function to each element of a data
structure, whereas sapply tries to simplify the result and returns a vector or matrix when possible.

49. Difference between sort(), rank() and


order() functions?
The sort() function is used to sort a 1 dimension vector or a single variable of
data.
The rank() function returns the ranking of each value.
The order() function returns the indices that can be used to sort the data.
Example :
set.seed(1234)
x = sample(1:50, 10)
x

[1] 6 31 30 48 40 29 1 10 28 22
sort(x)
[1] 1 6 10 22 28 29 30 31 40 48
It sorts the data on ascending order.
rank(x)
[1] 2 8 7 10 9 6 1 3 5 4
2 implies the number in the first position is the second lowest and 8 implies
the number in the second position is the eighth lowest.
order(x)
[1] 7 1 8 10 9 6 3 2 5 4
7 implies the 7th value of x is the smallest value, so 7 is the first element of
order(x), and 1 implies the first value of x is the second smallest.
If you run x[order(x)], it would give you the same result as the sort() function.
The difference between these two functions shows up with two or more dimensions
of data (two or more columns). In other words, the sort() function cannot be
used for more than 1 dimension, whereas order() can be used to sort a data frame by one or more columns.
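To illustrate the multi-column case, the sketch below sorts a small made-up data frame by one column ascending and another descending using order().

```r
d <- data.frame(grp = c("b", "a", "b", "a"), val = c(2, 9, 7, 4))

# order() returns row positions; grp ascending, then val descending
idx    <- order(d$grp, -d$val)
sorted <- d[idx, ]
sorted

# For a single vector, x[order(x)] and sort(x) agree
x <- c(6, 31, 30, 48, 40, 29, 1, 10, 28, 22)
same <- identical(x[order(x)], sort(x))
```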

50. Extracting Numeric Variables


cols <- sapply(mydata, is.numeric)
abc = mydata [,cols]

Data Science with R Interview


Questions
The list below contains most frequently asked interview questions for a role
of data scientist. Most of the roles related to data science or predictive
modeling require candidate to be well conversant with R and know how to
develop and validate predictive models with R.

51. Which function is used for building linear


regression model?
The lm() function is used for fitting a linear regression model.

52. How to add interaction in the linear regression model?
An interaction can be created using the colon sign (:). For example, x1 and x2
are two predictors (independent variables). The interaction between the
variables can be formed like x1:x2.
See the example below:
linreg1 <- lm(y ~ x1 + x2 + x1:x2, data=mydata)
The above code is equivalent to the following code:
linreg1 <- lm(y ~ x1*x2, data=mydata)
x1*x2 - It implies including both main effects (x1 + x2) and the interaction
(x1:x2).
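The equivalence of the two formulas can be checked on simulated data. In this sketch the data, the coefficients used to generate y and the seed are all made up for illustration.

```r
set.seed(1)  # arbitrary seed, for repeatability only
mydata <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
mydata$y <- 1 + 2 * mydata$x1 - mydata$x2 +
  0.5 * mydata$x1 * mydata$x2 + rnorm(50)

fit_long  <- lm(y ~ x1 + x2 + x1:x2, data = mydata)  # terms spelled out
fit_short <- lm(y ~ x1 * x2, data = mydata)          # shorthand

same_fit <- isTRUE(all.equal(coef(fit_long), coef(fit_short)))
names(coef(fit_short))
```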

53. How to check autocorrelation assumption for linear regression?
Use the durbinWatsonTest() function of the car package.

54. Which function is useful for developing a


binary logistic regression model?
glm() function with family = "binomial"

55. How to perform stepwise variable


selection in logistic regression model?
Run step() function after building logistic model with glm() function.

56. How to do scoring in the logistic


regression model?
Run predict(logit_model, validation_data, type = "response")

57. How to split data into training and

validation?
dt = sort(sample(nrow(mydata), nrow(mydata)*.7))
train<-mydata[dt,]
val<-mydata[-dt,]
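The split and the scoring step from the previous answers can be put together in a short end-to-end sketch. It uses the built-in mtcars data with am as the binary outcome and wt as the predictor; these choices, the seed and the object names are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the split is repeatable

# 70% of the row indices for training, the rest for validation
idx   <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
val   <- mtcars[-idx, ]

# Fit on the training rows, then score the held-out rows
logit_model <- glm(am ~ wt, data = train, family = binomial)
val$score   <- predict(logit_model, newdata = val, type = "response")
head(val[, c("am", "wt", "score")])
```

With so few rows glm() may warn about fitted probabilities close to 0 or 1; a real model would need more data.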

58. How to standardize variables?


data2 = scale(data)
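By default scale() centers each column to mean 0 and rescales it to standard deviation 1. A quick check on a few numeric columns of the built-in mtcars data (the column choice is arbitrary):

```r
m <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
z <- scale(m)  # center = TRUE, scale = TRUE by default

col_means <- colMeans(z)      # approximately 0 for every column
col_sds   <- apply(z, 2, sd)  # 1 for every column
round(col_means, 10)
col_sds
```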

59. How to validate cluster analysis


Validate Cluster Analysis
60. Which are the popular R packages for
decision tree?
rpart, party

61. What is the difference between rpart and party package for developing a decision tree
model?
rpart is based on the Gini index, which measures impurity in a node, whereas the
ctree() function from the "party" package uses a significance test procedure in
order to select variables.

62. How to check correlation with R?


cor() function

63. Have you heard of the 'relaimpo' package?


It is used to measure the relative importance of independent variables in a
model.

64. How to fine tune random forest model?


Use tuneRF() function

65. What does shrinkage define in a gradient boosting model?
Shrinkage is used for reducing, or shrinking, the impact of each additional
fitted base-learner (tree).

66. How to make data stationary for ARIMA


time series model?
Use the ndiffs() function, which returns the number of differences required to make
the data stationary.

67. How to automate arima model?


Use auto.arima() function of forecast package

68. How to fit proportional hazards model in


R?
Use coxph() function of survival package.

69. Which package is used for market basket


analysis?
arules package

70. Parallelizing Machine Learning Algorithms