
Uploaded by Tata Sairamesh

DATA science


Answers for 2016

04 Dec 2015

R programming is one of the languages that data scientists are expected to know. Most data science job interviews include questions on coding in R, and applicants are expected to be well versed in its finer details. We got together with our data science faculty, experts who have worked as senior data scientists themselves, to compile a list of questions that might be asked in data science interviews. These questions relate only to R programming; the list is not exhaustive, but it is useful to go through while preparing for data science jobs.


Data science is a vast field, but when it comes to cracking interviews for data science jobs, knowledge of either R or Python is important for getting started. Being a data scientist usually means applying for a top-level position. A fresh college graduate is unlikely to be hired as a data scientist, because the role demands experience, maturity, and in-depth knowledge of data science concepts and of the industry that hires data scientists. R has remained a preferred language for data scientists through the years.

In our previous post, 100 Data Science Interview Questions, we listed the general statistics, data, mathematics and conceptual questions that are asked in interviews. The articles are divided into 3 parts, each covering the interview questions for one topic. Below are some of the questions that may be asked during a data science interview and that relate specifically to R programming.

and Answers in R Programming

1) How can you merge two data frames in R language?

Data frames in R can be combined manually with the cbind() function, which binds them side by side, or joined with the merge() function on common rows or columns.

2) Explain about data import in R language

R Commander can be used to import data in R. To start the R Commander GUI, the user loads the package by typing library(Rcmdr) into the console. There are 3 different ways in which data can be imported:

i. Users can select the data set in the dialog box or enter its name.

ii. Data can be entered directly in the R Commander editor, which works when the data set is not too large.

iii. Data can be imported from a plain-text (ASCII) file, from any other statistical package or from the clipboard.

3) Two vectors X and Y are defined as follows: X <- c(3, 2, 4) and Y <- c(1, 2). What will be the output of vector Z that is defined as Z <- X*Y?

In R, when the vectors have different lengths, the elements of the shorter vector are reused (recycled) until every element of the longer vector has been multiplied. Here Y is treated as 1, 2, 1, and R also warns that the longer length is not a multiple of the shorter one.

The output of the above code will be

3 4 4

4) How are missing values and impossible values represented in R language?

NaN (Not a Number) is used to represent impossible values, whereas NA (Not Available) is used to represent missing values. The best way to answer this question is to mention that deleting missing values is not a good idea, because the probable cause of a missing value could be some problem with data collection, programming or the query. It is better to find the root cause of the missing values and then take the necessary steps to handle them.
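The distinction between the two markers can be checked directly at the console. The following is a minimal sketch; the vector names x and y are illustrative and do not come from the original text.

```r
x <- c(1, NA, 3)       # NA marks a missing observation
y <- 0 / 0             # an undefined arithmetic result gives NaN

is.na(x)               # FALSE TRUE FALSE: flags the missing element
is.nan(y)              # TRUE: y is "not a number"
is.na(y)               # TRUE as well: is.na() also flags NaN
is.nan(NA)             # FALSE: a plain NA is missing, not NaN
mean(x, na.rm = TRUE)  # 2: the NA is dropped before averaging
```

Note that na.rm = TRUE only hides the missing values for one calculation; it does not explain why they are missing.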

5) R language has several packages for solving a particular

problem. How do you make a decision on which one is the

best to use?

The CRAN package ecosystem has more than 6000 packages. The best way for beginners to answer this question is to mention that they would look for a package that follows good software development principles. The next step would be to look for user reviews and find out whether other data scientists or analysts have been able to solve a similar problem with it.

6) Which function in R language is used to find out whether

the means of 2 groups are equal to each other or not?

t.test()
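As a rough illustration of comparing two group means, the sketch below calls base R's t.test() on two small made-up samples. The data values and the names group_a and group_b are assumptions for the example only.

```r
# Two illustrative samples (invented values, not from the source)
group_a <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)
group_b <- c(6.2, 6.8, 7.1, 6.5, 6.9, 7.3)

# By default this runs Welch's two-sample t-test (unequal variances)
result <- t.test(group_a, group_b)

result$p.value    # a small p-value suggests the means differ
result$estimate   # the two sample means
```

A p-value below the chosen significance level (often 0.05) leads to rejecting the hypothesis that the two means are equal.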

7) What is the best way to communicate the results of data

analysis using R language?

The best possible way to do this is to combine the data, code and analysis results in a single document using knitr for reproducible research. This helps others verify the findings, add to them and engage in discussions. Reproducible research makes it easy to redo the experiments with new data or to apply the analysis to a different problem.

8) How many data structures does R language have?

R has homogeneous and heterogeneous data structures. Homogeneous data structures hold objects of the same type: vector, matrix and array. Heterogeneous data structures can hold objects of different types: data frames and lists.

9) What is the value of f (2) for the following R code?

b <- 4

f <- function (a)

{

b <- 3

b^3 + g (a)

}

g <- function (a)

{

a*b

}

The answer to the above code snippet is 35. The value of a passed to the function is 2, and the value of b defined inside f is 3, so the result is 3^3 + g(2). The function g is defined in the global environment, so, because of lexical scoping in R, it uses the global value of b, which is 4 and not 3, and returns 2*4 = 8 to f. The result is 3^3 + 8 = 35.

10) What is the process to create a table in R language without using external files?

MyTable <- data.frame()

MyTable <- edit(MyTable)

The above code opens a spreadsheet-like data editor for entering data into MyTable. The edited copy is returned by edit(), so it is assigned back to MyTable.


11) Explain about the significance of transpose in R language

Transpose t() is the easiest method for reshaping the data before analysis.

12) What is the use of the with() and by() functions in R?

The with() function is used to apply an expression to a given dataset, and the by() function is used to apply a function to each level of a factor.

13) The dplyr package is used to speed up data frame management code. Which package can be integrated with dplyr for large, fast tables?

data.table

14) In base graphics system, which function is used to add

elements to a plot?

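text(), and similarly points(), lines() and legend(). boxplot() adds to an existing plot only when it is called with add = TRUE.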

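15) What are the different types of sorting algorithms available in R language?

Bucket Sort

Selection Sort

Quick Sort

Bubble Sort

Merge Sort

These are general sorting algorithms that can be implemented in R. The built-in sort() function itself offers the methods "radix", "quick" and "shell".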

15) What is the command used to store R objects in a file?

save(x, file = "x.Rdata")

16) What is the best way to use Hadoop and R together for

analysis?

HDFS can be used for storing the data for long-term. MapReduce

jobs submitted from either Oozie, Pig or Hive can be used to encode,

improve and sample the data sets from HDFS into R. This helps to

leverage complex analysis tasks on the subset of data prepared in R.

17) What will be the output of log (-5.8) when executed on R console?

Executing the above on the R console returns NaN (Not a Number) together with a warning that NaNs were produced, because it is not possible to take the log of a negative number.

18) How is a Date object represented internally in R language?

Internally a Date is stored as the number of days since 1970-01-01, which can be inspected with

unclass(as.Date("2016-10-05"))
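To make the internal representation concrete, the sketch below strips the class from a Date and compares the result with a plain day count. The variable name d is illustrative.

```r
d <- as.Date("2016-10-05")      # the date string must be quoted

class(d)                         # "Date"
unclass(d)                       # 17079: days since 1970-01-01
unclass(as.Date("1970-01-01"))   # 0: the origin itself

# The same day count, obtained by subtracting the origin
as.numeric(d - as.Date("1970-01-01"))
```

Because a Date is just a number with a class attribute, date arithmetic such as d + 30 works directly.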

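19) What will be the output of the below code?

printmessage <- function (a) {

if (is.na (a))

print ("a is a missing value!")  # message text assumed; the print lines are missing in the source

else if (a < 0)

print ("a is less than zero")  # assumed

else

print ("a is greater than or equal to zero")  # assumed

invisible (a)

}

printmessage (NA)

Because the argument is NA, the first branch runs and the function reports that a is a missing value. The function is.na () is used to check if the input passed is a missing value.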

20) Which package in R supports the exploratory analysis of

genomic data?

adegenet

21) What is the difference between data frame and a matrix

in R?

Data frame can contain heterogeneous inputs while a matrix cannot.

In matrix only similar data types can be stored whereas in a data

frame there can be different data types like characters, integers or

other data frames.

22) How can you add datasets in R?

The rbind() function can be used to add datasets in R, provided that the columns in the datasets are the same.

23) How do you split a continuous variable into different

groups/ranks in R?
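The source gives no answer to this question. One common approach, offered here only as a sketch, is base R's cut(), which converts a numeric vector into a factor of intervals. The data, break points and labels below are illustrative assumptions.

```r
age <- c(23, 37, 45, 52, 68, 71)   # made-up continuous variable

# Intervals are (0,30], (30,50], (50,100] because right = TRUE by default
age_group <- cut(age,
                 breaks = c(0, 30, 50, 100),
                 labels = c("young", "middle", "senior"))

age_group          # a factor with one level per interval
table(age_group)   # counts per group
```

For rank-based groups, the break points could instead be taken from quantile(age), which is a design choice that depends on the analysis.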

24) What are factor variable in R language?

Factor variables are categorical variables that hold either string or numeric values. Factor variables are used in various types of graphics and in statistical modelling, where the correct number of degrees of freedom is assigned to them.

25) What is the memory limit in R?

8TB is the memory limit for 64-bit system memory and 3GB is the

limit for 32-bit system memory.

26) What are the data types in R on which binary operators

can be applied?

Scalars, Matrices and Vectors.

27) How do you create log linear models in R language?

Using the loglm () function

28) What will be the class of the resulting vector if you

concatenate a number and NA?

numeric

29) What is meant by K-nearest neighbour?

K-Nearest Neighbour is one of the simplest machine learning classification algorithms. It is a supervised learning method based on lazy learning: in this algorithm the function is only approximated locally, and all computation is deferred until classification.

30) What will be the class of the resulting vector if you

concatenate a number and a character?

character

31) Write code to build an R function powered by C?

32) If you want to know all the values in c (1, 3, 5, 7, 10) that are not in c (1, 5, 10, 12, 14), which in-built function in R can be used? Also show how this can be achieved without using the in-built function.

Using in-built function - setdiff(c(1, 3, 5, 7, 10), c(1, 5, 10, 12, 14))

Without using in-built function - c(1, 3, 5, 7, 10)[!c(1, 3, 5, 7, 10) %in% c(1, 5, 10, 12, 14)]

Both return 3 7.

33) How can you debug and test R programming code?

R code can be tested using Hadley Wickham's testthat package.

34) What will be the class of the resulting vector if you

concatenate a number and a logical?

numeric
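The coercion rules behind the three "class of the resulting vector" questions can be verified with class(). This is a small sketch using literal values only.

```r
class(c(1, NA))     # "numeric": NA adopts the type of its neighbours
class(c(1, "a"))    # "character": the number is converted to "1"
class(c(1, TRUE))   # "numeric": TRUE becomes 1

# Mixing all three: everything is coerced to the most general type
c(1, TRUE, "a")     # "1" "TRUE" "a"
```

The general ordering is logical, then numeric, then character: a vector always takes the most general type among its elements.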

35) Write a function in R language to replace the missing

value in a vector with the mean of that vector.

mean_impute <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }

36) What happens if the application object is not able to

handle an event?

The event is dispatched to the delegate for processing.

37) Differentiate between lapply and sapply.

If the programmer wants the output to be a vector or a matrix, then sapply is used, whereas if the programmer wants the output to be a list, then lapply is used. There is one more function, vapply, which is preferred over sapply because it lets the programmer specify the output type. The disadvantage of vapply is that it is more verbose and somewhat harder to use.
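The difference in return types is easiest to see side by side. The sketch below applies mean() to a small list; the list name values and its contents are assumptions for illustration.

```r
values <- list(a = 1:3, b = 4:6)

as_list   <- lapply(values, mean)   # always a list: $a 2, $b 5
as_vector <- sapply(values, mean)   # simplified to a named numeric vector

# vapply() must be told the type and length of each result,
# and fails loudly if the function returns something else
checked <- vapply(values, mean, FUN.VALUE = numeric(1))
```

Inside functions and packages, vapply() is often the safer choice because sapply() can silently change its return type depending on the input.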

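38) Differentiate between seq (6) and seq_along (6)

seq_along(6) produces a vector with the same length as its argument, which here is the single value 1, whereas seq(6) produces a sequential vector from 1 to 6, c(1, 2, 3, 4, 5, 6).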
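A short sketch of the two functions, using only literal inputs, shows where they agree and where they differ.

```r
seq(6)                        # 1 2 3 4 5 6
seq_along(6)                  # 1: the argument has length one
seq_along(c("a", "b", "c"))   # 1 2 3: one index per element

# With an empty vector, seq_along() gives an empty index,
# so a for loop over it runs zero times
seq_along(integer(0))         # integer(0)
```

This is why seq_along(x) is usually preferred over 1:length(x) when looping over the elements of a vector that may be empty.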

39) How will you read a .csv file in R language?

The read.csv() function is used to read a .csv file in R. Below is a simple example:

filecontent <- read.csv("sample.csv")

print(filecontent)

40) How do you write R commands?

Commands are written as ordinary R expressions, one per line. A line, or the rest of a line, that begins with a hash symbol (#) is a comment and is not executed.

41) How can you verify if a given object X is a matrix data object?

If the function call is.matrix(X) returns TRUE then X can be termed as a matrix data object.

42) What do you understand by element recycling in R?

If two vectors with different lengths perform an operation the

elements of the shorter vector will be re-used to complete the

operation. This is referred to as element recycling.

Example: with vectors A <- c(1, 2, 0, 4) and B <- c(3, 6), the result of A*B is (3, 12, 0, 24). Here 3 and 6 of vector B are repeated when computing the result.
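Element recycling can be reproduced in a few lines. The sketch below covers one case where the lengths divide evenly and one where they do not; the vector names follow the examples in this document.

```r
A <- c(1, 2, 0, 4)
B <- c(3, 6)
A * B                # 3 12 0 24: B is reused as 3, 6, 3, 6

X <- c(3, 2, 4)
Y <- c(1, 2)
# Lengths 3 and 2 do not divide evenly, so R would emit a warning here;
# it is suppressed only to keep this sketch quiet
Z <- suppressWarnings(X * Y)
Z                    # 3 4 4: Y is reused as 1, 2, 1
```

The warning in the second case is worth heeding in real code, since uneven recycling is often a sign of a length mismatch bug.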

43) How can you verify if a given object X is a matrix data

object?

If the function call is.matrix(X) returns TRUE then X can be considered a matrix data object, otherwise not.

response variable in R language?

Logistic regression can be used for this and the function glm () in R

language provides this functionality.

45) What is the use of sample and subset functions in R

programming language?

The sample() function can be used to select a random sample of size n from a huge dataset.

The subset() function is used to select variables and observations from a given dataset.
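Both functions can be tried on a small data frame. In the sketch below, the data frame df, its columns and the threshold of 80 are illustrative assumptions.

```r
set.seed(42)   # fix the seed so the random draw is repeatable

df <- data.frame(id = 1:10,
                 score = c(55, 72, 90, 61, 48, 83, 77, 95, 66, 59))

# sample(): draw 3 row numbers at random, without replacement
picked <- df[sample(nrow(df), 3), ]

# subset(): keep rows that satisfy a condition and choose the columns
high <- subset(df, score > 80, select = c(id, score))
high$id   # 3 6 8
```

sample() answers "which rows, chosen at random", while subset() answers "which rows and columns, chosen by a rule".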

46) There is a function fn <- function(a, b, c, d, e) a + b * c - d / e. Write the code to call fn on the vector c(1,2,3,4,5) such that the output is the same as fn(1,2,3,4,5).

do.call(fn, as.list(c(1, 2, 3, 4, 5)))

47) How can you resample statistical tests in R language?

The coin package in R provides various options for re-randomization and permutation-based statistical tests. When test assumptions cannot be met, this package serves as an alternative to classical methods, as it does not assume random sampling from well-defined populations.

48) What is the purpose of using Next statement in R

language?

If a developer wants to skip the current iteration of a loop in the

code without terminating it then they can use the next statement.

Whenever the R parser comes across the next statement in the

code, it skips evaluation of the loop further and jumps to the next

iteration of the loop.
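A minimal sketch of next inside a for loop follows. The loop collects odd numbers and skips even ones; the variable name odd_values is illustrative.

```r
odd_values <- c()

for (i in 1:6) {
  if (i %% 2 == 0) next            # skip even numbers, continue with the next i
  odd_values <- c(odd_values, i)   # reached only for odd i
}

odd_values   # 1 3 5
```

Unlike break, which leaves the loop entirely, next abandons only the current iteration.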

49) How will you create scatterplot matrices in R language?

A matrix of scatterplots can be produced using pairs(). The pairs() function takes various parameters like formula, data, subset, labels, etc.

The two key parameters required to build a scatterplot matrix are:

formula: a formula such as ~ a + b + c, in which each term gives a separate variable in the pairs plot; the terms should be numerical vectors. It basically represents the series of variables used in pairs.

data: the data set from which the variables have to be taken for building the scatterplot matrix.

50) How will you check if an element 25 is present in a

vector?

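There are various ways to do this:

i. match() returns the position of the first appearance of a particular element.

ii. %in% returns a Boolean value, either true or false.

iii. is.element() also returns a Boolean value, either true or false, based on whether the element is present in the vector or not.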
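Base R offers several membership checks, among them match(), %in% and is.element(). The sketch below applies them to a small made-up vector v.

```r
v <- c(10, 25, 40, 25)   # illustrative vector

match(25, v)        # 2: position of the first occurrence
25 %in% v           # TRUE
is.element(25, v)   # TRUE
match(99, v)        # NA: the value is absent
any(v == 25)        # TRUE: an equivalent check by direct comparison
```

match() is the one to use when the position matters; the others only report presence.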

51) What is the difference between library() and require()

functions in R language?

There is no real difference between the two when packages are loaded at the top level of a script. The require() function is usually used inside functions: it returns FALSE and gives a warning when a package is not found. The library() function, on the other hand, stops with an error if the desired package cannot be loaded.
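The different failure behaviour can be observed by asking for a package that is not installed. In this sketch the package name is hypothetical and is assumed not to exist on the system.

```r
pkg <- "aPackageThatDoesNotExist"   # hypothetical name, assumed not installed

# require() returns FALSE (with a warning) instead of stopping
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)
loaded    # FALSE

# library() signals an error, which is caught here so it can be inspected
outcome <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
outcome   # "error"
```

Because require() only warns, its return value should be checked; otherwise the failure surfaces later as a confusing "could not find function" error.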

52) What are the rules to define a variable name in R

programming language?

A variable name in R can contain letters and digits along with the special characters dot (.) and underscore (_). Variable names can begin with a letter or the dot symbol. However, if the variable name begins with a dot, it must not be followed by a digit.

53) What do you understand by workspace in R programming language?

The current R working environment of a user, which holds user-defined objects like lists, vectors, etc., is referred to as the workspace in R.

54) Which function helps you perform sorting in R language?

order() (it returns the ordering permutation; sort() returns the sorted values directly)

55) How will you list all the data sets available in all R packages?

Using the below line of code:

data(package = .packages(all.available = TRUE))

56) Which function is used to create a histogram visualisation in R programming language?

hist()

57) Write the syntax to set the path for current working directory in R environment.

setwd(dir_path), where dir_path is the directory path given as a character string

58) How will you drop variables using indices in a data

frame?

Let's take a dataframe

df <- data.frame(v1 = c(1:5), v2 = c(2:6), v3 = c(3:7), v4 = c(4:8))

df

## v1 v2 v3 v4

## 1 1 2 3 4

## 2 2 3 4 5

## 3 3 4 5 6

## 4 4 5 6 7

## 5 5 6 7 8

Columns can be dropped using negative indices as follows. Here the second and third columns, v2 and v3, are dropped:

df1 <- df[-c(2,3)]

df1

## v1 v4

## 1 1 4

## 2 2 5

## 3 3 6

## 4 4 7

## 5 5 8

59) What will be the output of runif(7)?

It will generate 7 random numbers between 0 and 1.

60) What is the difference between rnorm and runif functions?

The rnorm function generates "n" normal random numbers based on the mean and standard deviation arguments passed to the function.

Syntax of rnorm function: rnorm(n, mean = , sd = )

The runif function generates "n" uniform random numbers in the range of minimum and maximum values passed to the function.

Syntax of runif function: runif(n, min = , max = )
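The two generators can be compared by drawing a sample from each and looking at simple summaries. The sample size, seed and parameter values in this sketch are arbitrary choices for illustration.

```r
set.seed(1)   # make the random draws repeatable

# 1000 draws from a normal distribution with mean 50 and sd 10
n_draws <- rnorm(1000, mean = 50, sd = 10)

# 1000 draws spread evenly between 0 and 1
u_draws <- runif(1000, min = 0, max = 1)

mean(n_draws)    # close to 50
sd(n_draws)      # close to 10
range(u_draws)   # always inside [0, 1]
```

Normal draws cluster around the mean and have no hard limits, whereas uniform draws never leave the interval between min and max.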

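61) What will be the output of the following R programming code?

mat <- matrix(rep(c(TRUE,FALSE),8), nrow=4)

sum(mat)

8, because sum() counts each TRUE as 1 and the matrix holds eight TRUE values.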

62) How will you combine multiple different strings like "Data", "Science", "in", "R", "Programming" as a single string "Data_Science_in_R_Programming"?

paste("Data", "Science", "in", "R", "Programming", sep = "_")

63) Write a function to extract the first name from the string "Mr. Tom White".

substr("Mr. Tom White", start = 5, stop = 7)

64) Can you tell if the equation given below is linear or not?

Emp_sal = 2000 + 2.5 * (emp_age)^2

Yes, it is a linear equation in the regression sense, because it is linear in the coefficients even though emp_age enters as a squared term.

65) What will be the output of the following R programming

code ?

var2<- c("I","Love,"DeZyre")

var2

It will give an error.

66) What will be the output of the following R programming

code?

x<-5

if(x%%2==0)

print("X is an even number")

else

print("X is an odd number")

Executing the above code will result in an error as shown below -

## 3: print("X is an even number")

## 4: else

##

R does not know whether the else is related to the first if, because at the top level the first if() is a complete command on its own.
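One way to avoid this error, sketched below, is to wrap the branches in braces and keep else on the same line as the closing brace. The variable message_text is introduced here only for illustration.

```r
x <- 5

if (x %% 2 == 0) {
  message_text <- "X is an even number"
} else {   # "} else {" tells R the statement is not finished yet
  message_text <- "X is an odd number"
}

print(message_text)   # "X is an odd number"
```

The same if/else written inside a function body or a pair of enclosing braces would also parse, because R then reads the whole block before evaluating it.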

67) I have a string "contact@dezyre.com". Which string

function can be used to split the string into two different

strings contact@dezyre and com ?

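This can be accomplished using the strsplit function, which splits a string based on the separator given in the function call. The output of the strsplit() function is a list. Because the split argument is treated as a regular expression by default, in which "." matches any character, the dot has to be matched literally with fixed = TRUE.

strsplit("contact@dezyre.com", split = ".", fixed = TRUE)

Output of the strsplit function is

## [[1]]

## [1] "contact@dezyre" "com"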

68) What is R Base package?

The R base package is the package that is loaded by default whenever the R programming environment is started. It provides basic functionalities in the R environment, like arithmetic calculations and input/output.

69) How will you merge two dataframes in R programming

language?

The merge() function is used to combine two dataframes; it identifies common rows or columns between the 2 dataframes. The merge() function basically finds the intersection between two different sets of data.

The merge() function in R takes a long list of arguments, as follows.

Syntax for using the merge function in R: merge(x, y, by.x, by.y, all.x or all.y or all)

x and y: the two dataframes to be merged.

by.x and by.y: the columns of X and of Y, respectively, on which the merge is performed.

all.x: should be set to TRUE if we want all the observations from dataframe X. This results in a left join.

all.y: should be set to TRUE if we want all the observations from dataframe Y. This results in a right join.

all: the default value is FALSE, which means that only matching rows are returned, resulting in an inner join. This should be set to TRUE if you want all the observations from dataframes X and Y, resulting in an outer join.
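The four join types map onto the all, all.x and all.y arguments of merge(). The sketch below uses two tiny made-up dataframes that share an id column; the names and values are illustrative.

```r
X <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
Y <- data.frame(id = c(2, 3, 4), score = c(80, 90, 70))

inner <- merge(X, Y, by = "id")                 # ids 2, 3
left  <- merge(X, Y, by = "id", all.x = TRUE)   # ids 1, 2, 3
right <- merge(X, Y, by = "id", all.y = TRUE)   # ids 2, 3, 4
outer <- merge(X, Y, by = "id", all = TRUE)     # ids 1, 2, 3, 4

left$score   # NA 80 90: id 1 has no match in Y
```

Rows without a partner in the other dataframe are kept only when the corresponding all argument is TRUE, and their missing columns are filled with NA.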

70) Write the R programming code for an array of words so

that the output is displayed in decreasing frequency order.

tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), decreasing = TRUE)

depth <- 3

tt[1:depth]

Output

a a1 b

3 3 2

71) How will you check the frequency distribution of a categorical variable?

The frequency distribution of a categorical variable can be checked using the table function in R. The table() function calculates the count of each category of a categorical variable.

gender <- factor(c("M","F","M","F","F","F"))

table(gender)

Output of the above R code

gender

F M

4 2

Programmers can also calculate the percentage of values for each category by storing the output in a dataframe and adding a percent column, as shown below -

data.frame(table(gender))

Gender Frequency Percent

F 4 66.67

M 2 33.33

72) What is the procedure to check the cumulative frequency distribution of a categorical variable?

The cumulative frequency distribution of a categorical variable can be checked using the cumsum() function in R.

Example

gender <- factor(c("f","m","m","f","m","f"))

y <- table(gender)

cumsum(y)

Output of the above R code

f m

3 6

73) What will be the result of multiplying two vectors in R

having different lengths?

The multiplication of the two vectors will be performed and the output will be displayed with a warning message like "longer object length is not a multiple of shorter object length". Suppose there is a vector a <- c(1, 2, 3) and a vector b <- c(2, 3); then the multiplication a*b gives 2 6 6 along with the warning. The multiplication proceeds element by element, but since the lengths differ, the first element of the shorter vector b is reused and multiplied with the last element of the longer vector a.

1. What is R?

R is a programming language which is used for developing statistical

software and data analysis.

2. How R commands are written?

Commands are written as ordinary R expressions. A line, or the rest of a line, starting with # is a comment, like #division, and is not executed.

3. What is t.test() in R?

The t.test() function is used to determine whether the means of two groups are equal or not.

4.What are the disadvantages of R Programming?

The disadvantages are: Lack of standard GUI

Not good for big data.

Does not provide spreadsheet view of data.

5.What is the use of With () and By () function in R?

with() function applies an expression to a dataset.

#with(data,expression)

by() function applies a function to each level of a factor.

#by(data,factorlist,function)

6. In R programming, how missing values are represented?

In R missing values are represented by NA which should be in capital letters.

7.What is the use of subset() and sample() function in R?

Subset() is used to select the variables and observations and sample() function

is used to generate a random sample of the size n from a dataset.

8. Explain what is transpose?

Transpose is used for reshaping of the data which is used for analysis. Transpose

is performed by t() function.

9.What are the advantages of R?

The advantages are: It is used for managing and manipulating of data.

No license restrictions

Free and open source software.

Graphical capabilities of R are good.

Runs on many Operating system and different hardware and also run on 32 &

64 bit processors etc.

10. What is the function used for adding datasets in R?

For adding two datasets rbind() function is used but the column of two datasets

must be same.

Syntax: rbind(x1,x2) where x1,x2: vector, matrix, data frames.

11. Which functions produce correlations and covariances in R?

Correlations are produced by the cor() function and covariances by the cov() function.

12.What is difference between matrix and dataframes?

Dataframe can contain different type of data but matrix can contain only similar

type of data.

13.What is difference between lapply and sapply?

lapply returns its output in the form of a list, whereas sapply returns its output in the form of a vector or matrix.

14. What is the difference between seq(4) and seq_along(4)?

seq(4) gives the vector from 1 to 4, c(1,2,3,4), whereas seq_along(4) gives a vector as long as its argument; 4 has length 1, so the result is c(1).

15. Explain how you can start the R commander GUI?

The command library(Rcmdr) is used to start the R Commander GUI.

16. What is the memory limit of R?

In 32 bit system memory limit is 3Gb but most versions limited to 2Gb and in 64

bit system memory limit is 8Tb.

17.How many data structures R has?

There are 5 data structure in R i.e. vector, matrix, array which are of

homogenous type and other two are list and data frame which are

heterogeneous.

18. Explain how data is aggregated in R?

There are two methods: collapsing the data by one or more BY variables, and using the aggregate() function, in which the BY variables must be supplied as a list.
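As a sketch of the aggregate() route, the example below sums a numeric column within each level of a grouping column. The data frame sales and its columns are illustrative assumptions.

```r
sales <- data.frame(region = c("N", "S", "N", "S", "N"),
                    amount = c(10, 20, 30, 40, 50))

# BY variables passed as a (named) list; the result column is called x
by_list <- aggregate(sales$amount,
                     by = list(region = sales$region),
                     FUN = sum)

# The formula interface is an equivalent, often more readable, alternative
by_formula <- aggregate(amount ~ region, data = sales, FUN = sum)

by_list$x   # 90 60: totals for N and S
```

Any summary function can be passed as FUN, for example mean or length, to collapse the data in a different way.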

19. How many sorting algorithms are available?

There are 5 types of sorting algorithms in common use:

Bubble Sort

Selection Sort

Merge Sort

Quick Sort

Bucket Sort

20.How to create new variable in R programming?

For creating new variable assignment operator <- is used

For e.g. mydata$sum <- mydata$x1 + mydata$x2

21.What are R packages?

Packages are collections of data, R functions and compiled code in a well-defined format; these packages are stored in a library.

22.What is the workspace in R?

Workspace is the current R working environment which includes any user defined

objects like vector, lists etc.

23.What is the function which is used for merging of data frames horizontally

in R?

Merge()function is used to merge two data frames

E.g. total <- merge(dataframe1, dataframe2, by = "ID")

24.what is the function which is used for merging of data frames vertically in

R?

rbind() function is used to merge two data frames vertically.

E.g. total <- rbind(dataframe1, dataframe2)

25.What is the power analysis?

It is used in experimental design to determine the sample size needed to detect an effect of a given size.

26.Which package is used for power analysis in R?

The pwr package is used for power analysis in R.

27.Which method is used for exporting the data in R?

There are many ways to export the data into another formats like SPSS, SAS ,

Stata , Excel Spreadsheet.

28.Which packages are used for exporting of data?

For excel xlsReadWrite package is used and for sas,spss ,stata foreign package is

implemented.

29. How impossible values are represented in R?

In R NaN is used to represent impossible values.

30.Which command is used for storing R object into a file?

The save command is used for storing R objects into a file.

Syntax: > save(z, file = "z.Rdata")

31. Which command is used for restoring R object from a file?

The load command is used for restoring R objects from a file.

Syntax: > load("z.Rdata")
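A round trip through save() and load() can be sketched as follows. The temporary file location is an assumption made so the example does not write into the working directory.

```r
z <- c(3, 1, 4)

# Temporary location chosen for this sketch only
path <- file.path(tempdir(), "z.RData")

save(z, file = path)   # write the object to disk
rm(z)                  # remove it from the workspace

load(path)             # restores z under its original name
z                      # 3 1 4
```

Note that load() recreates objects under the names they were saved with, so it can overwrite an existing object of the same name.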

32.What is the use of coin package in R?

coin package is used to achieve the re randomization or permutation based

statistical tests.

33.Which function is used for sorting in R?

order() function is used to perform the sorting.

34.What is the use of tapply?

IOS-6.1.3

35.What happens when the application object does not handle an event?

the event will be dispatched to your delegate for processing.

36.Explain app specific objects which store the app contents?

Data model objects are app specific objects and store apps content. Apps can

also use document objects.

37.Explain the purpose of using UIWindow object?

UIWindow object coordinates the one or more views presenting on the screen.

38.Tell me the super class of all view controller objects?

UIView Controller class.

39.How to create axes in the graph?

Custom axes are created using the axis() function.

40.What is the use of abline() function?

The abline() function adds a reference line to a graph.

Syntax:- abline(h=yvalues, v=xvalues)

41.Why vcd package is used?

vcd package provides different methods for visualizing multivariate categorical

data.

42. What is GGobi?

GGobi is an open source program for visualization for exploring high dimensional

typed data.

43.What is iPlots?

It is a package which provide bar plots, mosaic plots, box plots, parallel plots,

scatter plots and histograms.

44.What is the use of lattice package?

The lattice package improves on base R graphics by providing better defaults, and it can easily display multivariate relationships.

45. What is fitdistr() function?

It is used to provide the maximum likelihood fitting of univariate distributions. It

is defined under the MASS package.

46.Which data structures are used to perform statistical analysis and create

graphs.

Data structures are vectors, arrays, data frames and matrices.

47.What is the use of sink() function?

It defines the direction of output.

48. Why library() function is used?

Called without arguments, this function lists the packages that are installed; called with a package name, it loads that package.

49.Why search() function is used?

By this function we see that which packages are currently loaded.

50. On which type of data binary operators are worked?

Binary operators are worked on matrices, vectors and scalars.

51. What is the use of doBY package?

It is used to define the desired table using function and model formula.

52. Which function is used to create frequency table?

Frequency table is created by table() function.

53.Define loglm() function.

Loglm() function is used to create log-linear models.

54.What is the use of corrgram() function?

corrgram() function is used to plot correlograms.

55.How to create scatterplot matrices?

The pairs() or splom() function is used to create scatterplot matrices.

56. What is npmc?

It is a package which gives nonparametric multiple comparisons.

57. What is the use of diagnostic plots?

It is used to check the normality, heteroscedasticity and influential observations.

58.Define anova() function.

anova() is used to compare the nested models.

59.What is cv.lm() function?

It is defined under the DAAG package which is used for k-fold validation.

60. Define stepAIC() function.

It is defined under the MASS package and performs stepwise model selection by exact AIC.

61. Define leaps().

It is used to perform the all-subsets regression and it is defined under the leaps

package.

62.Define relaimpo package.

It is used to measure the relative importance of each of the predictor in the

model.

63.Why car package is used?

also enhanced diagnostic.

64. Define robust package.

It provides a library of robust methods including regression.

65. What is robustbase?

It is a package which provides basic robust statistics including model selection

methods.

66. Define plotmeans().

It is defined under the gplots package and produces a mean plot for single factors, including confidence intervals.

67.What is the full form of MANOVA?

MANOVA stands for multivariate analysis of variance.

68. What is the use of MANOVA?

By using MANOVA we can test more than one dependent variable simultaneously.

69. Define mshapiro.test( ).

It is a function defined in the mvnormtest package. It performs the Shapiro-Wilk test for multivariate normality.

70. Define bartlett.test().

bartlett.test() is used to provide a parametric k-sample test of the equality of variances.

71.What is fligner.test()?

It is a function which provides a non-parametric k sample test of the equality of

variances.

72.Define hovplot().

It is defined in the HH package and provides a graphic test of homogeneity of variance based on Brown-Forsythe.

73.Which variables are represented by lower case letters?

Numerical variables are represented by lower case letters.

74. Which variables are represented by upper case letters?

Categorical factors are represented by upper case letters.

75.What is logistic regression?

Logistic regression is used to predict the binary outcome from the given set of

continuous predictor variables.

76. Define Poisson regression.

It is used to predict the outcome variable which represents counts from the given

set of continuous predictor variable.

77.Define Survival analysis.

It includes number of techniques which is used for modeling the time to an event.

78. What is the use survfit() function?

It estimates a survival distribution for one or more groups.

79. Define survdiff().

It determines the differences in survival distribution between two or more groups.

80.What is coxph()?

It is a function which is used to model the hazard function on the set of predictor

variable.

81. In which package survival analysis is defined?

Survival analysis is defined under the survival package.

MASS functions include those functions which performs linear and quadratic

discriminant function analysis.

83. Define qda().

qda() prints a quadratic discriminant function.

84.Define lda().

lda() is used to print the discriminant functions which is based on centered

variable.

85. What is the use of the forecast package?

It provides functions for the automatic selection of ARIMA and exponential smoothing models.

86. Define auto.arima().

It automatically selects an ARIMA model and handles seasonal as well as non-seasonal models.

87. What is the principal() function?

It is defined in the psych package and is used to extract and rotate principal components.

88. What is FactoMineR?

It is a package for multivariate exploratory analysis that handles both quantitative and qualitative variables. It also supports supplementary variables and observations.

89.What is the full form of CFA?

CFA stands for Confirmatory Factor Analysis.

90.What is the use of boot.sem() function?

It is used to bootstrap the structural equation model.

91.What is the full form of SEM?

SEM stands for Structural Equation Modeling.

92. Which function performs classical multidimensional scaling?

cmdscale() function is used to perform classical multidimensional scaling.
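A short sketch of classical scaling, using the built-in eurodist distance object (an assumption chosen for illustration):

```r
# eurodist holds road distances between 21 European cities (base R data).
coords <- cmdscale(eurodist, k = 2)

# Each city now has two coordinates that approximately preserve
# the original pairwise distances.
head(coords)
```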

93.Define isoMDS().

This function is defined under the MASS package which performs nonmetric

multidimensional scaling.

94. Which function performs individual differences scaling?

It is done by the indscal() function.

95. What is the pvclust() function?

It comes from the pvclust package and provides p-values for hierarchical clustering.

96. Define cluster.stats().

It is defined in the fpc package and provides a method for comparing the similarity of two cluster solutions using a variety of validation criteria.

97. What do we use the party package for?

It provides non-parametric regression trees for ordinal, nominal, censored and multivariate responses.

98. Which package provides bootstrapping?

The boot package provides bootstrapping.

99. Define the matlab package.

The matlab package includes wrapper functions and variables that are used to replicate MATLAB function calls.

100. What is the use of the Matrix package?

The Matrix package provides classes and methods for sparse and dense matrices, with operations that build on libraries such as LAPACK and BLAS.


Read the questions. At the bottom, you will find a link to the answers.

The Questions

First Set

1. What is R?

2. List some of the functions that R provides.

3. How can you start the R Commander GUI?

4. How can you import data in R?

5. What does the R language not do?

6. How are R commands written?

7. How can you save your data in R?

8. How can you produce correlations and covariances?

9. What are t-tests in R?

10. What are the with() and by() functions used for in R?

11. Which data structures in R are used to perform statistical analyses and create graphs?

12. Explain the general format of matrices in R.

13. How are missing values represented in R?

14. What is a transpose?

16. Which function is used for adding datasets in R?

17. What is the use of the subset() and sample() functions in R?

18. How can you create a table in R without an external file?

You can find the answers here.

Second Set

1. Data structure -- How many data structures does R have? How do you build a binary search tree in R?

2. Sorting -- How many sorting algorithms are available? Show me an example in R.

3. Low level -- How do you build an R function powered by C?

4. String -- How do you implement string operations in R?

5. Vectorization -- If you want to do Monte Carlo simulation in R, how do you improve the efficiency?

6. Function -- How do you pass a function as an argument to another function? What is the apply() function family?

7. Threading -- How do you do multi-threading in R?

8. Memory limit and database -- What is the memory limit of R? How do you avoid it? How do you use SQL in R?

9. Testing -- How do you do testing and debugging in R?

10. Software development -- How do you develop a package? How do you do version control?

You can find the answers here.

Third Set

1. If I have a data.frame df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c(7, 8, 9))...

What is df[1,]?

2. What is the difference between a matrix and a dataframe?

3. If I concatenate a number and a character together, what will the class of the resulting

vector be?

4. What if I concatenate a number and a logical?

5. What if I concatenate a number and NA?

6. What is the difference between sapply and lapply? When should you use one versus

the other? Bonus: When should you use vapply?

7. What is the difference between seq(4) and seq_along(4)?

8. What is f(3) where:

y <- 5

f <- function(x) { y <- 2; y^2 + g(x) }

g <- function(x) { x + y }

Why?

9. I want to know all the values in c(1, 4, 5, 9, 10) that are not in c(1, 5, 10, 11, 13).

How do I do that with one built-in function in R? How could I do it if that function didn't exist?

10. Can you write me a function in R that replaces all missing values of a vector with the

mean of that vector?

11. How do you test R code? Can you write a test for the function you wrote in #10?

12. Say I have...

fn <- function(a, b, c, d, e) a + b * c - d / e

How do I call fn on the vector c(1, 2, 3, 4, 5) so that I get the same result as fn(1, 2,

3, 4, 5)? (No need to tell me the result, just how to do it.)

13. dplyr <- "ggplot2"

library(dplyr)

Why does the dplyr package get loaded and not ggplot2?

14. mystery_method <- function(x) { function(z) Reduce(function(y, w) w(y), x, z) }

fn <- mystery_method(c(function(x) x + 1, function(x) x * x))

fn(3)

What is the value of fn(3)? Can you explain what is happening at each step?

1.) If I have a data.frame df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c(7, 8, 9))...

1a.) How do I select the c(4, 5, 6)?

1b.) How do I select the 1?

1c.) How do I select the 5?

1d.) What is df[, 3]?

1e.) What is df[1,]?

1f.) What is df[2, 2]?

Answers: (a) df[[2]] or df$b, (b) df[[1]][[1]] or df$a[[1]], (c) df[[2]][[2]] or df$b[[2]],

(d) 7 8 9, (e) 1 4 7, (f) 5.

2.) What is the difference between a matrix and a dataframe?

Answer: A dataframe can contain heterogeneous inputs and a matrix cannot. (You can have a dataframe of characters, integers, and even other dataframes, but you can't do that with a matrix -- a matrix must be all the same type.)

3a.) If I concatenate a number and a character together, what will the class

of the resulting vector be?

3b.) What if I concatenate a number and a logical?

3c.) What if I concatenate a number and NA?

Answers: (a) character, (b) numeric, (c) numeric.

4.) What is the difference between sapply and lapply? When should you use

one versus the other? Bonus: When should you use vapply?

Answer: Use lapply when you want the output to be a list, and sapply when you want the output simplified to a vector or a matrix. Generally vapply is preferred over sapply because you can specify the output type of vapply (but not sapply). The drawback is that vapply is more verbose and harder to use.
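The difference is easiest to see side by side. The sketch below uses a hypothetical helper called square; the names are illustrative only.

```r
nums <- 1:3
square <- function(i) i^2

as_list <- lapply(nums, square)              # always a list
as_vector <- sapply(nums, square)            # simplified to a numeric vector
checked <- vapply(nums, square, numeric(1))  # same values, type is enforced

# With empty input, sapply() falls back to an empty list,
# while vapply() still returns the declared type.
empty_sapply <- sapply(integer(0), square)
empty_vapply <- vapply(integer(0), square, numeric(1))
```

The empty-input case is the usual argument for vapply in package code: the return type does not depend on the data.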

5.) What is the difference between seq(4) and seq_along(4)?

Answer: seq(4) produces a vector from 1 to 4 (c(1, 2, 3, 4)), whereas seq_along(4) produces a vector of length(4), or 1 (c(1)).
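The distinction matters most when iterating over a vector that may be empty. A small sketch with made-up vectors:

```r
x <- c("a", "b", "c")
seq(length(x))      # 1 2 3
seq_along(x)        # 1 2 3

empty <- character(0)
seq(length(empty))  # 1 0 (counts down from 1 to 0)
seq_along(empty)    # integer(0)
```

A loop written as for (i in seq_along(empty)) therefore runs zero times, while for (i in seq(length(empty))) runs twice with invalid indices.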

6.) What is f(3) where:

y <- 5

f <- function(x) { y <- 2; y^2 + g(x) }

g <- function(x) { x + y }

Why?

Answer: 12. In f(3), y is 2, so y^2 is 4. When evaluating g(3), y is the globally

scoped y (5) instead of the y that is locally scoped to f, so g(3) evaluates to 3

+ 5 or 8. The rest is just 4 + 8, or 12.

7.) I want to know all the values in c(1, 4, 5, 9, 10) that are not in c(1, 5, 10, 11,

13). How do I do that with one built-in function in R? How could I do it if that

function didn't exist?

Answer: setdiff(c(1, 4, 5, 9, 10), c(1, 5, 10, 11, 13)) and c(1, 4, 5, 9, 10)[!c(1, 4, 5, 9, 10) %in% c(1, 5, 10, 11, 13)].

8.) Can you write me a function in R that replaces all missing values of a

vector with the mean of that vector?

Answer:

mean_impute <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }

9.) How do you test R code? Can you write a test for the function you wrote in #8?

Answer: You can use Hadley's testthat package. A test might look like this:

library(testthat)

test_that("It imputes the mean correctly", {

expect_equal(mean_impute(c(1, 2, NA, 6)), c(1, 2, 3, 6))

})

10.) Say I have...

fn <- function(a, b, c, d, e) a + b * c - d / e

How do I call fn on the vector c(1, 2, 3, 4, 5) so that I get the same result as fn(1, 2, 3, 4, 5)? (No need to tell me the result, just how to do it.)

Answer: do.call(fn, as.list(c(1, 2, 3, 4, 5)))

11.)

dplyr <- "ggplot2"

library(dplyr)

Why does the dplyr package get loaded and not ggplot2?

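Answer: library() does not evaluate its argument. It captures the name as typed, along the lines of deparse(substitute(dplyr)), so it looks for a package called "dplyr". To load the package named by the variable, call library(dplyr, character.only = TRUE).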

12.)

mystery_method <- function(x) { function(z) Reduce(function(y, w) w(y), x, z) }

fn <- mystery_method(c(function(x) x + 1, function(x) x * x))

fn(3)

What is the value of fn(3)? Can you explain what is happening at each step?

Answer:

Best seen in steps.

fn(3) requires mystery_method to be evaluated first.

mystery_method(c(function(x) x + 1, function(x) x * x)) evaluates to...

function(z) Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), z)

Calling fn(3) therefore evaluates...

Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), 3)

A three-argument Reduce call initializes the accumulator with its third argument, which is 3.

The inner function, function(y, w) w(y) is meant to take an argument and a

function and apply that function to the argument. Luckily for us, we have

some functions to apply.

That means we initialize at 3 and apply the first function, function(x) x + 1. 3 + 1 = 4.

We then take the value 4 and apply the second function. 4 * 4 = 16.
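The intermediate values can be checked directly by asking Reduce() to keep them. This is a sketch using the accumulate argument of base R's Reduce(); the variable names steps and values are illustrative.

```r
steps <- list(function(x) x + 1, function(x) x * x)

# accumulate = TRUE keeps every intermediate value, starting with the
# initial value 3.
values <- Reduce(function(y, w) w(y), steps, 3, accumulate = TRUE)
values  # 3 4 16
```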

Deepanshu Bhalla

analysis and predictive modeling. Many recent surveys and studies claimed

"R" holds a good percentage of market share in analytics industry. Data

scientist role generally requires a candidate to know R/Python programming

language. People who know R programming language are generally paid

more than python and SAS programmers. In terms of advancement in R

software, it has improved a lot in the recent years. It supports parallel

computing and integration with big data technologies.

Questions with detailed answer. It includes some basic, advanced or tricky

questions related to R. Also it covers interview questions related to data

science with R.

class() is used to determine the data type of an object. See the example below:

x <- factor(1:5)

class(x)

It returns factor.

Object Class

str(x) returns "Factor w/ 5 levels"

Example 2 :

xx <- data.frame(var1=c(1:5))

class(xx)

It returns "data.frame".

str(xx) returns 'data.frame' : 5 obs. of 1 variable: $ var1: int

It returns the storage mode of an object.

x <- factor(1:5)

mode(x)

The above mode function returns numeric.

Mode Function

x <- data.frame(var1=c(1:5))

mode(x)

It returns list.

categorical variables?

R has a special data structure called "factor" to store categorical variables.

It tells R that a variable is nominal or ordinal by making it a factor.

gender = c(1,2,1,2,1,2)

gender = factor(gender)

gender

a categorical variable?

The table function is used to calculate the count of each category of a categorical variable.

gender = factor(c("m","f","f","m","f","f"))

table(gender)

Output

If you want to include the % of values in each group, you can store the result in a data frame using the data.frame function and then calculate the column percent.

t = data.frame(table(gender))

t$percent= round(t$Freq / sum(t$Freq)*100,2)

Frequency Distribution

distribution of a categorical variable

The cumsum function is used to calculate the cumulative sum of the frequency counts of a categorical variable.

gender = factor(c("m","f","f","m","f","f"))

x = table(gender)

cumsum(x)

Cumulative Sum

If you want to see the cumulative percentage of values, see the code

below :

t = data.frame(table(gender))

t$cumfreq = cumsum(t$Freq)

t$cumpercent= round(t$cumfreq / sum(t$Freq)*100,2)

The hist function is used to produce the histogram of a variable.

df = sample(1:100, 25)

hist(df, right=FALSE)

use the code below

colors = c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")

hist(df, right=FALSE, col=colors, main="Main Title ", xlab="X-Axis Title")

First calculate the frequency distribution with table function and then

apply barplot function to produce bar graph

mydata = sample(LETTERS[1:5],16,replace = TRUE)

mydata.count= table(mydata)

barplot(mydata.count)

use the code below:

barplot(mydata.count, col=colors, main="Main Title ", xlab="X-Axis Title")

First calculate the frequency distribution with table function and then

apply pie function to produce pie chart.

mydata = sample(LETTERS[1:5],16,replace = TRUE)

mydata.count= table(mydata)

pie(mydata.count, col=rainbow(12))

length

For example, you have two vectors as defined below:

x <- c(4,5,6)

y <- c(2,3)

the output? What would be the length of z?

It returns 8 15 12 with the warning message as shown below. The length of z

is 3 as it has three elements.

Multiplication of vectors

R first multiplies the first element of vector x, i.e. 4, with the first element of vector y, i.e. 2, and the result is 8. In the second step, it multiplies the second element of vector x, i.e. 5, with the second element of vector y, i.e. 3, and the result is 15. In the next step, the shorter vector y is recycled: R multiplies the first element of y (2) with the last element of the longer vector x (6), giving 12.

Suppose the vector x would contain four elements as shown below :

x <- c(4,5,6,7)

y <- c(2,3)

x*y

It returns 8 15 12 21. It works like this : (4*2) (5*3) (6*2) (7*3)
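The recycling behaviour described above can be reproduced with the snippet below. The name z for the product and the use of suppressWarnings() are assumptions added for illustration.

```r
x <- c(4, 5, 6)
y <- c(2, 3)

# Lengths 3 and 2: y is recycled to c(2, 3, 2) and R warns because
# 3 is not a multiple of 2. suppressWarnings() only hides the message.
z <- suppressWarnings(x * y)
z          # 8 15 12
length(z)  # 3

# Lengths 4 and 2: y is recycled exactly twice, so there is no warning.
c(4, 5, 6, 7) * y  # 8 15 12 21
```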

contain?

R contains primarily the following data structures :

1. Vector

2. Matrix

3. Array

4. List

5. Data frame

6. Factor

The first three data types (vector, matrix, array) are homogeneous in

behavior. It means all contents must be of the same type. The fourth and

fifth data types (list, data frame) are heterogeneous in behavior. It implies

they allow different types. And the factor data type is used to store

categorical variable.

11. How to combine data frames?

Let's prepare 2 vectors for demonstration :

x = c(1:5)

y = c("m","f","f","m","f")

The cbind() function is used to combine data frame by columns.

z=cbind(x,y)

cbind : Output

z = rbind(x,y)

rbind : Output

When using the cbind() function, make sure the number of rows is equal in both datasets. When using the rbind() function, make sure both datasets have the same number of columns and the same column names. Otherwise, wrong data may be appended to columns or records might go missing.
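The example above binds two plain vectors. As a sketch of the same functions applied to actual data frames (the data frames df_a, df_b and df_c are hypothetical):

```r
df_a <- data.frame(id = 1:3, gender = c("m", "f", "f"))
df_b <- data.frame(score = c(90, 85, 70))
df_c <- data.frame(id = 4:5, gender = c("m", "f"))

by_columns <- cbind(df_a, df_b)  # same number of rows: 3 x 3 result
by_rows <- rbind(df_a, df_c)     # same column names: 5 x 2 result

# cbind() on plain vectors of different types builds a character matrix,
# not a data frame, because a matrix can hold only one type.
m <- cbind(1:3, c("m", "f", "f"))
class(m[1, 1])  # "character"
```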

different number of columns?

When the number of columns in the datasets is not equal, the rbind() function cannot combine the data by rows. For example, we have two data frames, df and df2. The data frame df has 2 columns and df2 has only 1 variable. See the code below:

df = data.frame(x = c(1:4), y = c("m","f","f","m"))

df2 = data.frame(x = c(5:8))

The bind_rows() function from dplyr package can be used to combine data

frames when number of columns do not match.

library(dplyr)

combdf = bind_rows(df,df2)

A valid variable name consists of letters, numbers and the dot or underline

characters. A variable name can start with either a letter or the dot followed

by a character (not number).

A variable name such as .1var is not valid. But .var1 is valid.

A variable name cannot be a reserved word. The reserved words are listed below:

if else repeat while function for in next break

TRUE FALSE NULL Inf NaN NA NA_integer_ NA_real_ NA_complex_ NA_character_

A variable name can be at most 10,000 bytes long.
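One way to check these rules is base R's make.names(), which is not mentioned in the original answer. It returns a syntactically valid version of each string, so a name that comes back unchanged was already valid. The candidate names below are illustrative.

```r
candidates <- c(".var1", ".1var", "if", "my var")

# A name is valid when make.names() leaves it unchanged.
is_valid <- candidates == make.names(candidates)
setNames(is_valid, candidates)
```

Only .var1 passes: .1var starts with a dot followed by a digit, if is a reserved word, and "my var" contains a space.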

functions? What are its alternatives?

Suppose you have a data frame as shown below -

df=data.frame(x=c(1:6), y=c(1,2,4,6,8,12))

You are asked to perform this calculation: (x+y) + (x-y). Most R programmers would write it like the code below:

(df$x + df$y) + (df$x - df$y)

Using the with() function, you can refer to the data frame once and make the above code more compact:

with(df, (x+y) + (x-y))

A similar result can be obtained with the pipe operator and mutate() from the dplyr package. See the code below:

library(dplyr)

df %>% mutate((x+y) + (x-y))

by() function in R

The by() function is similar to GROUP BY in SQL. It is used to perform a calculation by a factor or a categorical variable. In the example below, we compute the mean of variable var2 by the factor var1.

df = data.frame(var1=factor(c(1,2,1,2,1,2)), var2=c(10:15))

with(df, by(df, var1, function(x) mean(x$var2)))

The group_by() function in the dplyr package can perform the same task.

library(dplyr)

df %>% group_by(var1)%>% summarise(mean(var2))

In the example below, we are renaming variable var1 to variable1.

df = data.frame(var1=c(1:5))

colnames(df)[colnames(df) == 'var1'] <- 'variable1'

The rename() function in dplyr package can also be used to rename a

variable.

library(dplyr)

df= rename(df, variable1=var1)

The which() function returns the position of elements of a logical vector

that are TRUE. In the example below, we are figuring out the row number

wherein the maximum value of a variable x is recorded.

mydata=data.frame(x = c(1,3,10,5,7))

which(mydata$x==max(mydata$x))

It returns 3 as 10 is the maximum value and it is at 3rd row in the variable x.

in variables?

Suppose you have three variables X, Y and Z and you need to extract the first non-missing value in each row of these variables.

data = read.table(text="

X Y Z

NA 1 5

3 NA 2

", header=TRUE)

The coalesce() function in dplyr package can be used to accomplish this

task.

library(dplyr)

data %>% mutate(var=coalesce(X,Y,Z))

COALESCE Function in R

Let's create a sample data frame

dt1 = read.table(text="

X Y Z

7 NA 5

2 4 5

", header=TRUE)

With the apply() function, we can tell R to apply the max function row-wise. The argument na.rm = TRUE tells R to ignore missing values while calculating the max value. If it is not used, max would return NA for any row containing a missing value.

dt1$var = apply(dt1,1, function(x) max(x,na.rm = TRUE))

Output

dt2 = read.table(text="

A B C

8 0 0

6 0 5

", header=TRUE)

apply(dt2,1, function(x) sum(x==0))

ifelse(df$var1==NA, 0,1)

It does not work. A logical comparison with NA returns NA, not TRUE or FALSE.

This code works:

ifelse(is.na(df$var1), 0,1)
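A small sketch that makes the difference visible, using a hypothetical vector var1 instead of a data frame column:

```r
var1 <- c(10, NA, 30)

var1 == NA   # NA NA NA: a comparison with NA is never TRUE or FALSE
is.na(var1)  # FALSE TRUE FALSE

flag <- ifelse(is.na(var1), 0, 1)
flag         # 1 0 1
```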

after running the following program?

x=3

mult <- function(j)

{

x=j*2

return(x)

}

mult(2)

[1] 4

Answer : The value of 'x' will remain 3. See the output shown in the image

below-

Output

x after running the function, you can use the following program:

x=3

mult <- function(j)

{

x <<- j * 2

return(x)

}

mult(2)

x

The operator "<<-" tells R to search in the parent environment for an

existing definition of the variable we want to be assigned.

numeric

Calling as.numeric() directly on a factor returns the internal integer codes of its levels and not the original values. Hence, a factor variable must be converted to character before it is converted to numeric.

a <- factor(c(5, 6, 7, 7, 5))

a1 = as.numeric(as.character(a))
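The snippet below contrasts the two conversions on the same factor. The variable names codes, values and values2 are illustrative.

```r
a <- factor(c(5, 6, 7, 7, 5))

codes <- as.numeric(a)                 # 1 2 3 3 1 (positions in levels(a))
values <- as.numeric(as.character(a))  # 5 6 7 7 5 (the original numbers)

# Equivalent approach: index the converted levels by the integer codes.
values2 <- as.numeric(levels(a))[a]
```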

The paste() function is used to join two strings. A single space is the

default separator between two strings.

a = "Deepanshu"

b = "Bhalla"

paste(a, b)

It returns "Deepanshu Bhalla"

If you want to change the default single space separator, you can add

sep="," keyword to include comma as a separator.

paste(a, b, sep=",") returns "Deepanshu,Bhalla"

word

The substr() function is used to extract strings in a character vector. The

syntax of substr function is substr(character_vector, starting_position,

end_position)

x = "AXZ2016"

substr(x,1,3)

25. How to extract last name from full name

The last name is the end string of the name. For example, Jhonson is the last

name of "Dave,Jon,Jhonson".

dt2 = read.table(text="

var

Sandy,Jones

Dave,Jon,Jhonson

", header=TRUE)

The word() function of stringr package is used to extract or scan word from a

string. -1 in the second parameter denotes the last word.

library(stringr)

dt2$var2 = word(dt2$var, -1, sep = ",")

spaces

The trimws() function is used to remove leading and trailing spaces.

a = " David Banes "

trimws(a)

It returns "David Banes".

The runif() function is used to generate random numbers from a uniform distribution.

rand = runif(100, min = 1, max = 100)

LEFT JOIN implies keeping all rows from the left table (data frame) together with the matching rows from the right table. In the merge() function, all.x=TRUE denotes a left join.

df1=data.frame(ID=c(1:5), Score=runif(5,50,100))

df2=data.frame(ID=c(3,5,7:9), Score2=runif(5,1,100))

comb = merge(df1, df2, by ="ID", all.x = TRUE)

Left Join (SQL Style)

library(sqldf)

comb = sqldf('select df1.*, df2.* from df1 left join df2 on df1.ID = df2.ID')

Left Join with dplyr package

library(dplyr)

comb = left_join(df1, df2, by = "ID")

datasets

The cartesian product implies cross product of two tables (data frames). For

example, df1 has 5 rows and df2 has 5 rows. The combined table would

contain 25 rows (5*5)

comb = merge(df1,df2,by=NULL)

CROSS JOIN (SQL Style)

library(sqldf)

comb2 = sqldf('select * from df1 join df2 ')

datasets

First, create two sample data frames

df1=data.frame(ID=c(1:5), Score=c(50:54))

df2=data.frame(ID=c(3,5,7:9), Score=c(52,60:63))

library(dplyr)

comb = intersect(df1,df2)

library(sqldf)

comb2 = sqldf('select * from df1 intersect select * from df2 ')

program in R?

There are multiple ways to measure the running time of code. Some frequently used methods are listed below.

R Base Method

start.time <- Sys.time()

runif(5555,1,1000)

end.time <- Sys.time()

end.time - start.time

With tictoc package

library(tictoc)

tic()

runif(5555,1,1000)

toc()
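Base R also offers system.time(), which is not listed above but wraps the same idea in a single call. A minimal sketch:

```r
# system.time() evaluates the expression and reports user, system and
# elapsed time in seconds.
timing <- system.time(runif(5555, 1, 1000))
timing["elapsed"]
```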

data manipulation on large datasets?

The package data.table performs fast data manipulation on large datasets.

See the comparison between dplyr and data.table.

# Load required packages

library(nycflights13)

library(tictoc)

library(dplyr)

library(data.table)

# Load data

data(flights)

df = as.data.table(flights) # a data.table copy; flights itself stays a tibble for dplyr

# Using data.table package

tic()

df[arr_delay > 30 & dest == "IAH",

.(avg = mean(arr_delay),

size = .N),

by = carrier]

toc()

# Using dplyr package

tic()

flights %>% filter(arr_delay > 30 & dest == "IAH") %>%

group_by(carrier) %>% summarise(avg = mean(arr_delay), size = n())

toc()

Result: the data.table package took 0.04 seconds, whereas the dplyr package took 0.07 seconds, so data.table needed roughly 40% less time than dplyr in this run. Since the dataset used in the example is of medium size, the absolute difference is barely noticeable. As the size of the data grows, the difference in execution time gets bigger.

We can use fread() function of data.table package.

library(data.table)

yyy = fread("C:\\Users\\Dave\\Example.csv", header = TRUE)

We can also use read.big.matrix() function of bigmemory package.

following two programs ?

1. temp = data.frame(v1<-c(1:10),v2<-c(5:14))

2. temp = data.frame(v1=c(1:10),v2=c(5:14))

In the first case, it created two vectors v1 and v2 and a data frame temp

which has 2 variables with improper variable names. The second code

creates a data frame temp with proper variable names.

rm(list=ls())

in R?

The five major sorting algorithms:

1. Bubble Sort

2. Selection Sort

3. Merge Sort

4. Quick Sort

5. Bucket Sort

Create a sample data frame

mydata = data.frame(score = ifelse(sign(rnorm(25))==-1,1,2),

experience= sample(1:25))

Task: You need to sort the score variable in ascending order and then sort the experience variable in descending order.

R Base Method

mydata1 <- mydata[order(mydata$score, -mydata$experience),]

With dplyr package

library(dplyr)

mydata1 = arrange(mydata, score, desc(experience))

Suppose you need to remove 3 variables - x, y and z from data frame

"mydata".

R Base Method

df = subset(mydata, select = -c(x,y,z))

With dplyr package

library(dplyr)

df = select(mydata, -c(x,y,z))

save.image(file="dt.RData")

Missing values are represented by capital NA.

To create a new data without any missing value, you can use the code

below :

df <- na.omit(mydata)

column

Suppose you have a data consisting of 25 records. You are asked to remove

duplicates based on a column. In the example, we are eliminating duplicates

by variable y.

data = data.frame(y=sample(1:25, replace = TRUE), x=rnorm(25))

R Base Method

test = subset(data, !duplicated(data[,"y"]))

dplyr Method

library(dplyr)

test1 = distinct(data, y, .keep_all= TRUE)

data with R

The reshape2 and tidyr packages are most popular packages for reshaping

data in R.

44. Calculate number of hours, days, weeks,

months and years between 2 dates

Let's set 2 dates :

dates <- as.Date(c("2015-09-02", "2016-09-05"))

difftime(dates[2], dates[1], units = "hours")

difftime(dates[2], dates[1], units = "days")

floor(difftime(dates[2], dates[1], units = "weeks"))

floor(difftime(dates[2], dates[1], units = "days")/365)

With lubridate package

library(lubridate)

interval(dates[1], dates[2]) %/% hours(1)

interval(dates[1], dates[2]) %/% days(1)

interval(dates[1], dates[2]) %/% weeks(1)

interval(dates[1], dates[2]) %/% months(1)

interval(dates[1], dates[2]) %/% years(1)

The months unit is not available in the base difftime() function, so we can use the interval() function of the lubridate package.

library(lubridate) # months(3) below is the lubridate period constructor

mydate <- as.Date("2015-09-02")

mydate + months(3)

mydate <- as.POSIXlt("2015-09-27 12:02:14")

library(lubridate)

date(mydate) # Extracting date part

format(mydate, format="%H:%M:%S") # Extracting time part

Extracting various time periods

day(mydate)

month(mydate)

year(mydate)

hour(mydate)

minute(mydate)

second(mydate)

There are primarily three ways to write a loop in R:

1. For loop

2. While loop

3. Apply family of functions such as apply, lapply, sapply etc.

lapply returns a list when we apply a function to each element of a data structure, whereas sapply returns a vector when the results can be simplified.

order() functions?

The sort() function is used to sort a 1 dimension vector or a single variable of

data.

The rank() function returns the ranking of each value.

The order() function returns the indices that can be used to sort the data.

Example :

set.seed(1234)

x = sample(1:50, 10)

x

[1] 6 31 30 48 40 29 1 10 28 22

sort(x)

[1] 1 6 10 22 28 29 30 31 40 48

It sorts the data in ascending order.

rank(x)

[1] 2 8 7 10 9 6 1 3 5 4

2 implies the number in the first position is the second lowest and 8 implies

the number in the second position is the eighth lowest.

order(x)

[1] 7 1 8 10 9 6 3 2 5 4

7 implies the 7th value of x is the smallest value, so 7 is the first element of order(x), and 1 implies the first value of x is the second smallest.

If you run x[order(x)], it would give you the same result as sort() function.

The difference between these two functions lies in two or more dimensions

of data (two or more columns). In other words, the sort() function cannot be

used for more than 1 dimension whereas x[order(x)] can be used.
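To illustrate the multi-column case, the sketch below sorts a small hypothetical data frame called scores by two columns using order().

```r
scores <- data.frame(name = c("a", "b", "c", "d"),
                     group = c(2, 1, 2, 1),
                     points = c(10, 30, 20, 40),
                     stringsAsFactors = FALSE)

# order() returns row indices, so it can sort a whole data frame:
# ascending by group, then descending by points within each group.
sorted <- scores[order(scores$group, -scores$points), ]
sorted$name  # "d" "b" "c" "a"
```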

cols <- sapply(mydata, is.numeric)

abc = mydata [,cols]

Questions

The list below contains most frequently asked interview questions for a role

of data scientist. Most of the roles related to data science or predictive

modeling require candidate to be well conversant with R and know how to

develop and validate predictive models with R.

regression model?

The lm() function is used for fitting a linear regression model.

regression model?

An interaction can be created using the colon sign (:). For example, x1 and x2 are two predictors (independent variables). The interaction between the variables can be formed like x1:x2.

See the example below:

linreg1 <- lm(y ~ x1 + x2 + x1:x2, data=mydata)

The above code is equivalent to the following code:

linreg1 <- lm(y ~ x1*x2, data=mydata)

x1*x2 - It implies including both main effects (x1 + x2) and interaction (x1:x2).
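The equivalence of the two formulas can be confirmed on real data. The sketch below assumes the built-in mtcars dataset, with mpg, wt and hp standing in for y, x1 and x2.

```r
explicit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
shorthand <- lm(mpg ~ wt * hp, data = mtcars)

# Both fits contain an intercept, two main effects and one interaction.
coef(explicit)
all.equal(coef(explicit), coef(shorthand))  # TRUE
```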

for linear regression?

durbinWatsonTest() function

binary logistic regression model?

glm() function with family = "binomial"

selection in logistic regression model?

Run step() function after building logistic model with glm() function.

regression model?

Run predict(logit_model, validation_data, type = "response")

validation?

dt = sort(sample(nrow(mydata), nrow(mydata)*.7))

train<-mydata[dt,]

val<-mydata[-dt,]

data2 = scale(data)

Validate Cluster Analysis

60. Which are the popular R packages for

decision tree?

rpart, party

party package for developing a decision tree

model?

rpart is based on Gini Index which measures impurity in node. Whereas

ctree() function from "party" package uses a significance test procedure in

order to select variables.

cor() function

It is used to measure the relative importance of independent variables in a

model.

Use tuneRF() function

boosting model?

Shrinkage is used for reducing, or shrinking, the impact of each additional

fitted base-learner (tree).

time series model?

Use ndiffs() function which returns the number of difference required to make

data stationary.

Use auto.arima() function of forecast package

R?

Use coxph() function of survival package.

analysis?

arules package
