
3) Which technique is used to predict categorical responses?

Classification techniques are widely used in mining to classify data sets and predict categorical responses.

4) What is logistic regression? Or state an example when you have used logistic regression recently.

Logistic regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election or not. In this case, the outcome of the prediction is binary, i.e. 0 or 1 (Win/Lose). The predictor variables here would be the amount of money spent on election campaigning for a particular candidate, the amount of time spent campaigning, etc.
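As a rough illustration (the numbers below are made up, not from the original), a minimal scikit-learn sketch of fitting such a model could look like this:

# Hypothetical campaign data: [money spent, hours campaigned] -> 1 = Win, 0 = Lose
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[120, 300], [80, 150], [200, 400], [60, 100], [150, 350]])
y = np.array([1, 0, 1, 0, 1])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(model.predict_proba([[100, 250]]))  # predicted probabilities of [Lose, Win]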

5) What are Recommender Systems?

A subclass of information filtering systems that are meant to predict the preferences
or ratings that a user would give to a product. Recommender systems are widely
used in movies, news, research articles, products, social tags, music, etc.

6) Why does data cleaning play a vital role in analysis?

Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because, as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data they generate. Cleaning can take up to 80% of the total time, making it a critical part of the analysis task.

7) Differentiate between univariate, bivariate and multivariate analysis.

These are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point of time. For example, a pie chart of sales by territory involves only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the difference between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analysing sales volume against spending is an example of bivariate analysis.

Analysis that deals with the study of more than two variables to understand their effect on the responses is referred to as multivariate analysis.

8) What do you understand by the term Normal Distribution?

Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, there is a chance that data is distributed around a central value without any bias to the left or right, reaching a normal distribution in the form of a bell-shaped curve. The random variable is then distributed as a symmetrical bell-shaped curve.

Image credit: mathisfun.com

9) What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.

10) What is Interpolation and Extrapolation?

Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond their range.

11) What is power analysis?

An experimental design technique for determining the sample size required to detect an effect of a given size with a given level of confidence.

12) What is K-means? How can you select K for K-means?

13) What is Collaborative filtering?

The process of filtering used by most recommender systems to find patterns or information by collaborating viewpoints, various data sources and multiple agents.

14) What is the difference between Cluster and Systematic Sampling?

Cluster sampling is a technique used when it becomes difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner, so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is the equal-probability method.

15) Are expected value and mean value different?

They are not different, but the terms are used in different contexts. Mean is generally referred to when talking about a probability distribution or sample population, whereas expected value is generally referred to in a random variable context.

For Sampling Data

The mean value is the only value that comes from the sampling data. The expected value is the mean of all the means, i.e. the value that is built from multiple samples; the expected value is the population mean.

For Distributions

The mean value and expected value are the same irrespective of the distribution, under the condition that the distribution is over the same population.

16) What does the P-value signify about the statistical data?

The P-value is used to determine the significance of results after a hypothesis test in statistics. The P-value helps the reader draw conclusions and is always between 0 and 1.

P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.

P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.

P-value = 0.05 is the marginal value, indicating it is possible to go either way.
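As a quick illustration (toy numbers, assuming SciPy is available), a one-sample t-test returns a p-value that can be read against the 0.05 threshold described above:

# Toy data; H0: the population mean is 5.0
from scipy import stats

sample = [5.1, 4.9, 5.3, 5.2, 4.8, 5.0, 5.4, 5.1]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

if p_value <= 0.05:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis")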

17) Do gradient descent methods always converge to the same point?

No, they do not, because in some cases they reach a local minimum or local optimum point and never reach the global optimum. It depends on the data and the starting conditions.

18) What are categorical variables?


19) A test has a true positive rate of 100% and a false positive rate of 5%. There is a population with a 1/1000 rate of having the condition the test identifies. Given a positive test, what is the probability of having that condition?

Let's suppose you are being tested for a disease. If you have the illness, the test will say you have the illness. However, if you don't have the illness, 5% of the time the test will say you have the illness, and 95% of the time it will give the accurate result that you don't have the illness. Thus there is a 5% error rate when you do not have the illness.

Out of 1000 people, the 1 person who has the disease will get a true positive result.

Of the remaining 999 people, 5% will get a false positive result, i.e. close to 50 people will test positive without having the disease.

This means that out of 1000 people, 51 people will test positive for the disease even though only one person has the illness. So there is only about a 2% probability of you having the disease even if your report says that you have it.
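A short sketch verifying this arithmetic with Bayes' theorem (using the rates stated in the question):

# P(disease | positive test) with TPR = 100%, FPR = 5%, prevalence = 1/1000
prevalence = 1 / 1000
tpr = 1.0    # P(positive | disease)
fpr = 0.05   # P(positive | no disease)

p_positive = tpr * prevalence + fpr * (1 - prevalence)
p_disease_given_positive = tpr * prevalence / p_positive
print(round(p_disease_given_positive, 4))  # ~0.0196, i.e. roughly 2%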

20) How can you make data normal using the Box-Cox transformation?

21) What is the difference between Supervised Learning and Unsupervised Learning?

If an algorithm learns something from the training data so that the knowledge can be applied to the test data, it is referred to as Supervised Learning. Classification is an example of Supervised Learning. If the algorithm does not learn anything beforehand because there is no response variable or any training data, it is referred to as Unsupervised Learning. Clustering is an example of Unsupervised Learning.

22) Explain the use of Combinatorics in data science.


23) Why is vectorization considered a powerful method for optimizing
numerical code?
24) What is the goal of A/B Testing?
It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example of this could be improving the click-through rate for a banner ad.

25) What is an Eigenvalue and Eigenvector?

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. The eigenvalue can be thought of as the strength of the transformation in the direction of the eigenvector, or the factor by which the compression or stretching occurs.
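A minimal NumPy sketch (toy matrix, not from the original) of computing eigenvalues and eigenvectors:

import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # a small symmetric, covariance-like matrix

eigenvalues, eigenvectors = np.linalg.eig(M)
print(eigenvalues)    # e.g. [3. 1.]
print(eigenvectors)   # columns are the corresponding eigenvectors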

26) What is Gradient Descent?

27) How can outlier values be treated?

Outlier values can be identified by using univariate or any other graphical analysis method. If there are only a few outlier values, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outliers. The most common ways to treat outlier values are:

1) To change the value and bring it within a range.

2) To simply remove the value.

28) How can you assess a good logistic model?

There are various methods to assess the results of a logistic regression analysis:

Using a classification matrix to look at the true negatives and false positives.

Concordance, which helps identify the ability of the logistic model to differentiate between the event happening and not happening.

Lift, which helps assess the logistic model by comparing it with random selection.

29) What are the various steps involved in an analytics project?

Understand the business problem.

Explore the data and become familiar with it.

Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.

After data preparation, start running the model, analyse the results and tweak the approach. This is an iterative step until the best possible outcome is achieved.

Validate the model using a new data set.

Start implementing the model and track the results to analyse the performance of the model over time.

30) How can you iterate over a list and also retrieve element indices at the
same time?
This can be done using the enumerate function, which takes every element in a sequence, just as in a list, and yields its index alongside it.
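For example:

colors = ["red", "green", "blue"]
for index, value in enumerate(colors):
    print(index, value)
# 0 red
# 1 green
# 2 blue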

31) During analysis, how do you treat missing values?

The extent of the missing values is identified after identifying the variables with missing values. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation) or they can simply be ignored. There are various factors to be considered when answering this question:

Understand the problem statement and the data before giving the answer; getting into the data is important.

A default value can be assigned, which may be the mean, minimum or maximum value. For a categorical variable, the missing value is assigned a default value.

If the data follows a distribution, for example a normal distribution, impute with the mean value.

Whether missing values should be treated at all is another important point to consider: if 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.

32) Explain the Box-Cox transformation in regression models.

33) Can you use machine learning for time series analysis?

Yes, it can be used, but it depends on the application.


35) What is the difference between Bayesian Inference and Maximum Likelihood Estimation (MLE)?

36) What is Regularization and what kind of problems does regularization solve?
37) What is multicollinearity and how can you overcome it?

38) What is the curse of dimensionality?

39) How do you decide whether your linear regression model fits the data?
40) What is the difference between squared error and absolute error?

41) What is Machine Learning?

The simplest way to answer this question: we give the data and an equation to the machine and ask the machine to look at the data and identify the coefficient values in the equation.

For example, for the linear regression y = mx + c, we give the data for the variables x and y, and the machine learns the values of m and c from the data.
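A minimal sketch of the idea with NumPy (toy data generated from y = 2x + 1):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])

m, c = np.polyfit(x, y, deg=1)  # least-squares fit of a degree-1 polynomial
print(m, c)                     # approximately 2.0 and 1.0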

42) How are confidence intervals constructed and how will you interpret
them?
43) How will you explain logistic regression to an economist, physical scientist and biologist?
44) How can you overcome Overfitting?
45) Differentiate between wide and tall data formats?
46) Is Naïve Bayes bad? If yes, under what aspects?
47) How would you develop a model to identify plagiarism?
48) How will you define the number of clusters in a clustering algorithm?
49) Is it better to have too many false negatives or too many false
positives?
50) Is it possible to perform logistic regression with Microsoft Excel?
51) What do you understand by Fuzzy merging? Which language will you use to handle it?
52) What is the difference between skewed and uniform distribution?
53) You created a predictive model of a quantitative outcome variable
using multiple regressions. What are the steps you would follow to
validate the model?
54) What do you understand by Hypothesis in the context of Machine Learning?

55) What do you understand by Recall and Precision?


56) How will you find the right K for K-means?
57) Why does L1 regularization cause parameter sparsity whereas L2 regularization does not?
58) How can you deal with different types of seasonality in time series
modelling?
59) In experimental design, is it necessary to do randomization? If yes,
why?
60) What do you understand by conjugate-prior with respect to Naïve Bayes?
61) Can you cite some examples where a false positive is more important than a false negative?
62) Can you cite some examples where a false negative is more important than a false positive?
63) Can you cite some examples where both false positives and false negatives are equally important?
64) Can you explain the difference between a Test Set and a Validation
Set?
A validation set can be considered as part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. On the other hand, a test set is used for testing or evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarized as:

Training Set is used to fit the parameters, i.e. the weights.

Test Set is used to assess the performance of the model, i.e. to evaluate its predictive power and generalization.

Validation Set is used to tune the parameters.
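A minimal sketch (toy data) of carving one dataset into train, validation and test sets with scikit-learn:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy feature matrix
y = np.arange(50) % 2               # toy labels

# 60% train, 20% validation, 20% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# fit the weights on X_train, tune parameters against X_val,
# and report final performance once on X_test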


65) What makes a dataset gold standard?
66) What do you understand by statistical power of sensitivity and how do
you calculate it?
67) What is the importance of having a selection bias?
68) Give some situations where you will use an SVM over a RandomForest
Machine Learning algorithm and vice-versa.

69) What do you understand by feature vectors?


70) How do data management procedures like missing data handling make
selection bias worse?
71) What are the advantages and disadvantages of using regularization
methods like Ridge Regression?
72) What do you understand by long and wide data formats?
73) What do you understand by outliers and inliers? What would you do if
you find them in your dataset?
74) Write a program in Python which takes input as the diameter of a coin
and weight of the coin and produces output as the money value of the
coin.

Data Science Puzzles - Brainstorming / Puzzle Questions asked in Data Science Job Interviews
1) How many Piano Tuners are there in Chicago?
To solve this kind of a problem, we need to know

How many Pianos are there in Chicago?

How often would a Piano require tuning?

How much time does it take for each tuning?

We need to build these estimates to solve this kind of problem. Let's assume Chicago has close to 10 million people and on average there are 2 people in a house. For every 20 households there is 1 piano. Now the question of how many pianos there are can be answered: 1 in 20 households has a piano, so there are approximately 250,000 pianos in Chicago.

Now the next question is: how often would a piano require tuning? There is no exact answer to this question. It could be once a year or twice a year. You need to approach this question carefully, as the interviewer is trying to test whether you take this into consideration or not. Let's suppose each piano requires tuning once a year, so on the whole 250,000 piano tunings are required.

Let's suppose that a piano tuner works for 50 weeks in a year, considering a 5-day week. Thus a piano tuner works for 250 days in a year. Let's suppose tuning a piano takes 2 hours; then in an 8-hour workday the piano tuner would be able to tune only 4 pianos. At this rate, a piano tuner can tune 1,000 pianos a year.

Thus, 250 piano tuners are required in Chicago considering the above estimates.
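The back-of-the-envelope arithmetic above, spelled out in code (every input is a rough assumption, not a measured fact):

population = 10_000_000
people_per_household = 2
households_per_piano = 20
tunings_per_piano_per_year = 1
pianos_tuned_per_day = 8 // 2        # 8-hour day, 2 hours per tuning
working_days_per_year = 50 * 5       # 50 weeks, 5 days a week

pianos = population / people_per_household / households_per_piano   # 250,000
tunings_needed = pianos * tunings_per_piano_per_year                # 250,000 per year
tunings_per_tuner = pianos_tuned_per_day * working_days_per_year    # 1,000 per year
print(tunings_needed / tunings_per_tuner)                           # ~250 tuners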

2) There is a race track with five lanes. There are 25 horses of which you
want to find out the three fastest horses. What is the minimal number of
races needed to identify the 3 fastest horses of those 25?
Divide the 25 horses into 5 groups of 5 horses each. Racing each group (5 races) determines the winner of each group. A race between the 5 group winners (the 6th race) determines the fastest horse overall. A final (7th) race between the 2nd and 3rd place horses from the winning group, the 1st and 2nd place horses from the group whose winner came second, and the winner of the group whose winner came third determines the second and third fastest horses of the 25. So 7 races are needed.
3) Estimate the number of french fries sold by McDonald's every day.
4) How many times in a day do a clock's hands overlap?
5) You have two beakers. The first beaker contains 4 litres of water and the second one contains 5 litres of water. How can you pour exactly 7 litres of water into a bucket?
6) A coin is flipped 1000 times and 560 times heads show up. Do you think
the coin is biased?
7) Estimate the number of tennis balls that can fit into a plane.
8) How many haircuts do you think happen in the US every year?
9) In a city where residents prefer only boys, every family in the city
continues to give birth to children until a boy is born. If a girl is born, they
plan for another child. If a boy is born, they stop. Find out the proportion
of boys to girls in the city.
Probability and Statistics Interview Questions for Data Science

1. There are two companies manufacturing electronic chips. Company A manufactures defective chips with a probability of 20% and good quality chips with a probability of 80%. Company B manufactures defective chips with a probability of 80% and good chips with a probability of 20%. If you get just one electronic chip, what is the probability that it is a good chip?

2. Suppose that you now get a pack of 2 electronic chips coming from the same company, either A or B. When you test the first electronic chip it appears to be good. What is the probability that the second electronic chip you received is also good?

3. A coin is tossed 10 times and the results are 2 tails and 8 heads. How will you analyse whether the coin is fair or not? What is the p-value for the same?

4. In continuation of the above question, if each coin is tossed 10 times (100 tosses are made in total), will you modify your approach to test the fairness of the coin or continue with the same?

5. An ant is placed on an infinitely long twig. The ant can move one step backward or one step forward with the same probability during discrete time steps. Find the probability with which the ant will return to the starting point.

Frequently Asked Open Ended Interview Questions for Data Scientists
1. Which is your favourite machine learning algorithm and why?

2. In which libraries for Data Science in Python and R does your strength lie?

3. What kind of data is important for specific business requirements and how, as a data scientist, will you go about collecting that data?

4. Tell us about the biggest data set you have processed till date and for what kind of analysis.

5. Which data scientists do you admire the most and why?

6. Suppose you are given a data set; what will you do with it to find out if it suits the business needs of your project or not?

7. What were the business outcomes or decisions for the projects you worked on?

8. What unique skills do you think you can add to our data science team?

9. Which are your favorite data science startups?

10. Why do you want to pursue a career in data science?

11. What have you done to upgrade your skills in analytics?

12. What has been the most useful business insight or development you have found?

13. How will you explain an A/B test to an engineer who does not know statistics?

14. When does parallelism help your algorithms run faster and when does it make them run slower?

15. How can you ensure that you don't analyse something that ends up producing meaningless results?

16. How would you explain to the senior management in your organization why a particular data set is important?

17. Is more data always better?

18. What are your favourite imputation techniques to handle missing data?

19. What are your favorite data visualization tools?

20. Explain the life cycle of a data science project.


Suggested Answers by Data Scientists for Open Ended Data Science
Interview Questions
How can you ensure that you don't analyse something that ends up producing meaningless results?

Understand whether the model chosen is correct or not. Start from the point where you did univariate or bivariate analysis, analysed the distribution of the data and the correlation of variables, and built the linear model. Linear regression has an inherent requirement that the errors (residuals) should be normally distributed; if they are not, we cannot use linear regression. This is an inductive approach to finding out whether the analysis using linear regression will yield meaningless results or not.

Another way is to train and test data sets by sampling them multiple times.
Predict on all those datasets to find out whether or not the resultant models are
similar and are performing well.

By looking at the p-values, the R-squared values, the fit of the function, and how the treatment of missing values could have affected the results, data scientists can analyse whether something will produce meaningless results or not.
- Gaganpreet Singh, Data Scientist

These are some of the more general questions around data, statistics and data science that can be asked in interviews. We will come up with more questions specific to a language, Python or R, in subsequent articles, and fulfil our goal of providing a set of 100 data science interview questions and answers.

1 What is Odds ratio?

Ans. Odds are determined from probabilities and range between 0 and infinity. Odds are defined as the
ratio of the probability of success and the probability of failure.
E.g. Suppose that seven out of 10 males are admitted to an engineering school while three of 10 females
are admitted. The probabilities for admitting a male are,
p = 7/10 = .7

q = 1 - .7 = .3

If you are male, the probability of being admitted is 0.7 and the probability of not being admitted is 0.3.
Here are the same probabilities for females,
p = 3/10 = .3

q = 1 - .3 = .7

If you are female it is just the opposite, the probability of being admitted is 0.3 and the probability of not
being admitted is 0.7.
Now we can use the probabilities to compute the odds of admission for both males and females,
odds(male) = .7/.3 = 2.33333
odds(female) = .3/.7 = .42857
Next, we compute the odds ratio for admission,
OR = 2.3333/.42857 = 5.44
Thus, for a male, the odds of being admitted are 5.44 times larger than the odds for a female being
admitted.
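The same calculation in a few lines of Python:

p_male, p_female = 0.7, 0.3

odds_male = p_male / (1 - p_male)        # 2.333...
odds_female = p_female / (1 - p_female)  # 0.42857...
print(round(odds_male / odds_female, 2)) # 5.44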
2 What is Pseudo R-squared?

Ans.

Analogous to the R-squared measure for linear regression.

Equals 1 - (Residual deviance / Null deviance).

A measure of how much of the deviance is explained by the model. Ideally it should be close to 1.
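A tiny sketch of the calculation (the deviance values below are hypothetical, as if read off a fitted model's summary):

null_deviance = 366.42      # deviance of the intercept-only model (hypothetical)
residual_deviance = 234.67  # deviance of the fitted model (hypothetical)

pseudo_r_squared = 1 - residual_deviance / null_deviance
print(round(pseudo_r_squared, 3))  # ~0.36: share of deviance explained by the model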

3 What is the mean of a t-distribution?

Ans. The t distribution has the following properties:

The mean of the distribution is equal to 0.

The variance is equal to v / (v - 2), where v is the degrees of freedom (see last section)
and v >2.

The variance is always greater than 1, although it is close to 1 when there are many degrees of
freedom. With infinite degrees of freedom, the t distribution is the same as the standard
normal distribution.

4 What is the standard deviation of the z-distribution?

Ans. The Z-distribution is a normal distribution with mean zero and standard deviation 1.
5 What is the Standard Error of Estimate?

Ans. The standard error is an estimate of the standard deviation of a statistic. The standard error is computed from known sample statistics.

6 Why do we smooth time series?

Ans. Smoothing is usually done to help us better see patterns and trends in time series. Generally we smooth out the irregular roughness to see a clearer signal. For seasonal data, we might smooth out the seasonality so that we can identify the trend. Smoothing doesn't provide us with a model, but it can be a good first step in describing various components of the series.

A time series is a sequence of observations which are ordered in time. Inherent in the collection of data taken over time is some form of random variation. There exist methods for reducing or canceling the effect due to random variation. A widely used technique is "smoothing". This technique, when properly applied, reveals more clearly the underlying trend, seasonal and cyclic components.

Smoothing techniques are used to reduce irregularities (random fluctuations) in time series data. They provide a clearer view of the true underlying behavior of the series. Moving averages rank among the most popular techniques for the preprocessing of time series. They are used to filter random "white noise" from the data, to make the time series smoother, or even to emphasize certain informational components contained in the time series.

Exponential smoothing is a very popular scheme to produce a smoothed time series. Whereas in moving averages the past observations are weighted equally, exponential smoothing assigns exponentially decreasing weights as the observations get older. In other words, recent observations are given relatively more weight in forecasting than older observations. Double exponential smoothing is better at handling trends. Triple exponential smoothing is better at handling parabolic trends.

Exponential smoothing is a widely used method of forecasting based on the time series itself. Unlike regression models, exponential smoothing does not impose any deterministic model to fit the series other than what is inherent in the time series itself.
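A minimal sketch of simple exponential smoothing (toy data; alpha controls how fast old observations fade):

def exponential_smoothing(series, alpha):
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

data = [3, 10, 12, 13, 12, 10, 12]
print(exponential_smoothing(data, alpha=0.3))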

7 What is autoregression of a time series?

Ans. Autoregression is a stochastic process used in statistical calculations in which future values are estimated based on a weighted sum of past values. An autoregressive process operates under the premise that past values have an effect on current values. A process considered AR(1) is a first-order process, meaning that the current value is based on the immediately preceding value. An AR(2) process has the current value based on the previous two values.

An autoregressive model is one in which a value from a time series is regressed on previous values from that same time series, for example y_t on y_(t-1):

y_t = b0 + b1*y_(t-1) + e_t

In this regression model, the response variable in the previous time period has become the predictor, and the errors have our usual assumptions about errors in a simple linear regression model. The order of an autoregression is the number of immediately preceding values in the series that are used to predict the value at the present time. So the preceding model is a first-order autoregression, written as AR(1).

If we want to predict y this year (y_t) using measurements of global temperature in the previous two years (y_(t-1), y_(t-2)), then the autoregressive model for doing so would be:

y_t = b0 + b1*y_(t-1) + b2*y_(t-2) + e_t

This model is a second-order autoregression, written as AR(2), since the value at time t is predicted from the values at times t-1 and t-2. More generally, a kth-order autoregression, written as AR(k), is a multiple linear regression in which the value of the series at any time t is a (linear) function of the values at times t-1, t-2, ..., t-k.
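A rough sketch (toy series) of fitting the AR(1) equation above, y_t = b0 + b1*y_(t-1) + e_t, by ordinary least squares with NumPy:

import numpy as np

y = np.array([2.1, 2.5, 2.4, 2.9, 3.1, 3.0, 3.4, 3.6])  # toy series

y_lag = y[:-1]                                    # y_(t-1)
y_now = y[1:]                                     # y_t
X = np.column_stack([np.ones_like(y_lag), y_lag])

b0, b1 = np.linalg.lstsq(X, y_now, rcond=None)[0]
print(b0, b1)                                     # intercept and AR(1) coefficient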
8 What is lag?

Ans. Shifting the time base back by a given number of observations.

The coefficient of correlation between two values in a time series is called the autocorrelation function (ACF). For example, the ACF for a time series y_t is given by:

Corr(y_t, y_(t-k))

This value of k is the time gap being considered and is called the lag. A lag 1 autocorrelation (i.e., k = 1 in the above) is the correlation between values that are one time period apart. More generally, a lag k autocorrelation is the correlation between values that are k time periods apart.
9 What do we mean by differencing time series data?

Ans. The entire purpose of differencing the data is to make it stationary.

10 Why do we difference time series data?

Ans. The two most common ways to make a non-stationary time series stationary are:

Differencing (see the sketch below)

Transforming
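A sketch of the differencing option above with pandas (toy series):

import pandas as pd

series = pd.Series([10, 12, 15, 19, 24, 30])
differenced = series.diff().dropna()   # y_t - y_(t-1)
print(differenced.tolist())            # [2.0, 3.0, 4.0, 5.0, 6.0]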
11 What is homoscedasticity?

Ans. Homoscedasticity describes a situation in which the error term (that is, the noise or
random disturbance in the relationship between the independent variables and the
dependent variable) is the same across all values of the independent variables.
12 What is the formula for calculating sample variance?
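The original leaves this unanswered; the standard definition divides the sum of squared deviations from the mean by n - 1, as in this small sketch:

data = [4, 7, 13, 16]
n = len(data)
mean = sum(data) / n
sample_variance = sum((x - mean) ** 2 for x in data) / (n - 1)  # Bessel's correction
print(sample_variance)  # 30.0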

13 Why do we calculate variance first in order to find the standard deviation?

Ans.
In the calculation we first take the deviation from the mean. The total deviation from the mean is zero, which is why we square the deviations, calculate the variance, and then take the square root of the variance to get the SD.
Also, squaring exaggerates the dispersion, and in some contexts the squared unit does not mean anything (e.g. Rs^2), which is another reason to return to the original units by taking the square root.
14 What is degree of freedom?
Ans. The number of degrees of freedom generally refers to the number of independent observations in a
sample minus the number of population parameters that must be estimated from sample data.
For example, the exact shape of a t distribution is determined by its degrees of freedom. When the t
distribution is used to compute a confidence interval for a mean score, one population parameter (the
mean) is estimated from sample data. Therefore, the number of degrees of freedom is equal to the sample
size minus one.
The following is a layman example:
Imagine a set of three numbers whose mean is 3. There are lots of sets of three numbers with a mean of 3, but for any set the bottom line is this: you can freely pick the first two numbers, any numbers at all, but the third (last) number is out of your hands as soon as you have picked the first two. Say our first two numbers are 1 and 6, giving us a set of two freely picked numbers and one number that we still need to choose, x: [1, 6, x]. For this set to have a mean of 3, we don't have anything to choose about x; x has to be 2, because (1 + 6 + 2) / 3 is the only way to get to 3. So the first two values were free for you to choose, and the last value is set accordingly to reach the given mean. This set is said to have two degrees of freedom, corresponding to the number of values that you were free to choose (that is, that were allowed to vary freely).

15 What is autocorrelation of second order?

Ans. A first-order autoregressive error process is u_t = ρ*u_(t-1) + e_t.

The coefficient ρ is called the first-order autocorrelation coefficient and takes values from -1 to +1.

The autocorrelation is of second order when: u_t = ρ1*u_(t-1) + ρ2*u_(t-2) + e_t
16 What is multi-collinearity?
Ans. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple
regression model are highly correlated. We have perfect multicollinearity if the correlation between two
independent variables is equal to 1 or -1. In practice, we rarely face perfect multicollinearity in a data set.
More commonly, the issue of multicollinearity arises when there is a high degree of correlation (either
positive or negative) between two or more independent variables.
Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple
regression model are highly correlated. In this situation the coefficient estimates may change erratically in
response to small changes in the model or the data. Multicollinearity does not reduce the predictive power
or reliability of the model as a whole; it only affects calculations regarding individual predictors. That is,
a multiple regression model with correlated predictors can indicate how well the entire bundle of
predictors predicts the outcome variable, but it may not give valid results about any individual predictor,
or about which predictors are redundant with others.
17 Which command can we use in R for exponential smoothing with trend and seasonal
parameters?
Ans. HoltWinters(data)

18 How do you define Deviance?

Ans. Deviance is a measure of goodness of fit of a generalized linear model. The higher the deviance, the poorer the model is.
Deviance is a measure of how well the model fits the data. It helps to assess the goodness-of-fit of the estimated model.
It is 2 times the negative log likelihood of the dataset given the model, i.e. -2LL. The minimum value for -2LL is 0, which corresponds to a perfect fit (likelihood = 1, so -2LL is 0).
The contribution of several coefficients simultaneously is assessed by deviance tests (similar to the F test).
If you think of deviance as analogous to variance, then the null deviance is similar to the variance of the data around the average rate of positive examples. The residual deviance is similar to the variance of the data around the model. The first thing we can do with the null and residual deviances is check whether the model's probability predictions are better than guessing the average rate of positives.

In logistic regression analysis, deviance is used in lieu of sum of squares calculations. Deviance is analogous to the sum of squares calculations in linear regression and is a measure of the lack of fit to the data in a logistic regression model. When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model. This computation gives the likelihood-ratio test:

D = -2 * ln(likelihood of the fitted model / likelihood of the saturated model)

Here D represents the deviance and ln represents the natural logarithm. The log of this likelihood ratio (the ratio of the fitted model to the saturated model) will produce a negative value, hence the need for a negative sign.

19 What is Null Deviance?


Ans. The null deviance represents the difference between a model with only the intercept (which means
"no predictors") and the saturated model.
20 What is R-squared?
Ans. The coefficient of determination (denoted by R2) is a key output of regression analysis. It is
interpreted as the proportion of the variance in the dependent variable that is predictable from the
independent variable.

The coefficient of determination ranges from 0 to 1.

An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.

An R2 of 1 means the dependent variable can be predicted without error from the independent
variable.

An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An
R2 of 0.10 means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20 means
that 20 percent is predictable; and so on.

21 Why do we need adjusted R-squared?


Ans.

22 How do you calculate R-squared?

Ans.
23 What is a Type I error rate?
24 What is a Type II error rate?

Ans.

25 When do you need a Dummy variable?

Ans. In regression, when we have an independent variable that is categorical in nature, we need dummy variables. A dummy variable or indicator variable is an artificial variable created to represent an attribute with two or more distinct categories/levels.
If there are n categories of an independent variable then we have to create n-1 dummies.
If we take n dummies for this variable then we run into the dummy variable trap.
The dummy variable trap is concerned with cases where a set of dummy variables is so highly collinear with each other that OLS cannot identify the parameters of the model. That happens mainly if you include all dummies from a certain variable, e.g. you have 3 dummies for education: "no degree", "high school", and "college". If you include all the dummies in the regression together with an intercept (a vector of ones), then this set of dummies will be linearly dependent with the intercept and OLS cannot solve.
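A minimal pandas sketch (hypothetical education column) of creating n-1 dummies and so avoiding the dummy variable trap:

import pandas as pd

df = pd.DataFrame({"education": ["no degree", "high school", "college", "high school"]})
dummies = pd.get_dummies(df["education"], drop_first=True)  # keeps n-1 dummy columns
print(dummies)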
26 What is a Stationary Time Series?
Ans. A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time.
27 What do you mean by a Time Series data?
Ans. A time series data is a sequence of numerical data points in successive order, usually occurring in
uniform intervals. In plain English, a time series is simply a sequence of numbers collected at regular
intervals over a period of time.
28 How do you find the sample size from an Anova output in R?
Ans. Just add the df of the residual and the variable(s) plus 1 from the ANOVA table:
Sample size = df(residual) + df(variable(s)) + 1

29 Why can't we build a 100% confidence interval?

Ans. As we sample from the population to estimate a population parameter, there is always a chance of noise or error because we are not taking the whole population. That's why we can't build a 100% CI.
30 What is a point estimate?
Ans. A point estimate of a population parameter is a single value used to estimate the population parameter. For example, the sample mean x̄ is a point estimate of the population mean μ.
31 What is AIC in Logistic Regression?

Ans. The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. The
chosen model is the one that minimizes the Kullback-Leibler distance between the model and the truth.
It's based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that has a good fit to the truth but few parameters. It is defined as:
AIC = -2 * ln(likelihood) + 2K

32 Does IQR include the median of a sample? Why?


Ans. The median is the middle value of the distribution and the IQR = Q3 - Q1 represents the middle 50% of the data under the distribution, hence the IQR always contains the median.
33 What is right skew and left skew?
Ans. Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive (right skew, longer right tail), negative (left skew, longer left tail), or even undefined.

34 What is the Confidence Interval for the response variable in Regression?

Ans. In all likelihood, your model will not perfectly predict Y. The SEE can be extended to determine the confidence interval for a predicted Y value. A common CI to test for a predicted value is 95%.
Your regression parameters, the y-intercept (b0) and slope coefficient (b1), will need to be tested for significance before you can generate a confidence interval around your model's projected Y value at an expected X value.
35 What is Prediction Interval for the response variable in Regression?
Ans. In statistical inference, specifically predictive inference, a prediction interval is an
estimate of an interval in which future observations will fall, with a certain probability, given what has
already been observed.

36 What is the Null value in a hypothesis test?

Ans. The null hypothesis, denoted by H0, is usually the hypothesis that sample observations result purely from chance.
37 For a sample of say size 38, what will be the ratio of Margin of Error and Standard Error if the
population standard deviation is not known?
38 What are the parameters of a Binomial Distribution?
Ans. n = the number of independent trials
p = the probability of success
39 What is the null hypothesis in the test of Chi-square Independence Test?
Ans. H0: Variable A and Variable B are independent.
40 What is a Control Group in an experiment? What kind of treatment does the group receive and why?
Ans. The control group in an experiment is the group who does not receive any treatment and is used as a
benchmark against which other test results are measured. This group includes individuals who are very
similar in many ways to the individuals who are receiving the treatment, in terms of age, gender, race or
other factors.

41 What is a Blocking variable? Give a good example.


Ans. A categorization variable for information within a dataset that is used to control, test, or manipulate
the distribution and statistical results. These variables should be observed, reliable, and unchanging. For
example, a dataset can be blocked based on race or gender.
Age classes and gender

In studies involving human subjects, we often use gender and age classes as the blocking factors. We
could simply divide our subjects into age classes, however this does not consider gender. Therefore we
partition our subjects by gender and from there into age classes. Thus we have a block of subjects that is
defined by the combination of factors, gender and age class.

42 What is the difference between observational studies and an experiment?


Ans. Both studies involve observation, so don't let the titles fool you. The fundamental difference is that
an experiment requires you to observe what happens when you make some sort of change. An
observational study finds the relationship between changes that already exist. It is the method used when
it would be impractical or unethical to cause the change yourself. Examples are studying the effects of smoking on birth defect rates (unethical), or what things people spend over $1000 on when unexpectedly given a large amount of money (impractical). But if you actually had the money to give people and then found out what they spent it on, that would be an experiment.
An experimental study involves a hypothesis and a variable. An observational study is simply a set of
observations without a hypothesis or a variable.
43 What is replication in design of an experiment?
Ans. Replication is also the idea behind large sample sizes. When repeating observations across a larger
group, the variability between the sample size and the true population is reduced. Replication relies on
tests that are repeatable in the same or similar fashion and are based on a random sampling of the
population.
44 How many parameter(s) does the Poisson Distribution have?
Ans. One parameter:
λ > 0 (real), the average number of events in an interval.

45 What do you mean by independent events?

Ans. If the occurrence of one event doesn't provide any information about, or doesn't affect, the occurrence of the other event, then the two events are called independent events.

46 What are mutually exclusive events?

Ans. Two events are mutually exclusive if they cannot occur at the same time.

47 What do you understand by the number of Fisher scoring iterations?

Ans. The reference to Fisher scoring iterations has to do with how the model was estimated. A
linear model can be fit by solving closed form equations. Unfortunately, that cannot be done with most
GLiMs including logistic regression. Instead, an iterative approach (the Newton-Raphson
algorithm by default) is used. Loosely, the model is fit based on a guess about what the estimates might
be. The algorithm then looks around to see if the fit would be improved by using different estimates
instead. If so, it moves in that direction (say, using a higher value for the estimate) and then fits the model
again. The algorithm stops when it doesn't perceive that moving again would yield much additional
improvement. This line tells you how many iterations there were before the process stopped and output
the results.
48 How is Sensitivity defined?
Ans. Accuracy: the proportion of the total number of predictions that were correct.
Positive Predictive Value or Precision: the proportion of positive cases that were correctly identified.
Negative Predictive Value: the proportion of negative cases that were correctly identified.
Sensitivity or Recall: the proportion of actual positive cases which are correctly identified.
Specificity: the proportion of actual negative cases which are correctly identified.

49 What is an ROC?
Ans. The ROC (Receiver Operating Characteristic) curve is used to visualize the performance of a binary classifier. The ROC curve is the plot of sensitivity against (1 - specificity). (1 - specificity) is also known as the false positive rate and sensitivity is also known as the true positive rate. For a good model the curve should rise steeply towards a sensitivity of 1; the maximum possible area under the curve is 1, and the greater the area, the better the predictive ability of the model.
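A minimal scikit-learn sketch (toy labels and scores) of computing the curve and the area under it:

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # (1 - specificity) vs sensitivity
print(roc_auc_score(y_true, y_scores))              # area under the ROC curve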

50 How many degrees of freedom does an F statistic have?

Ans. Two, one for the numerator and one for the denominator, since the F statistic = variance of the group means / mean of the within-group variances.

51 How would you define Specificity?


Ans. The proportion of actual negative cases which are correctly identified.
52 What percentile does median represent?
Ans. 50%
53 For a normally distributed sample, what percentage of observations is greater than mean?

Ans. 50%
54 If X and Y are two variables having a coefficient of correlation of 0.3, and X and Z are two variables having a coefficient of correlation of -0.7, which correlation is stronger?
Ans. X and Z: the relation is inverse, but its magnitude (0.7) is greater than that of X and Y (0.3).
55 What kind of distribution would you expect the residuals in a linear regression model to have?
Ans. Normal distribution
56 What is VIF? Explain how it is calculated.
Ans. Each of the predictor variables is regressed upon the other predictor variables. If that R-squared is high, then this variable has collinearity with the others; the VIF for that predictor is 1 / (1 - R-squared) from this regression.

A VIF >= 1.5 (not more than 2) means there is multicollinearity in the data.
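A rough sketch (simulated, deliberately correlated predictors) of computing VIF as 1 / (1 - R-squared):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)   # correlated with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"VIF for predictor {j}: {1 / (1 - r2):.2f}")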
57 What is the major difference between Stepwise Regression and Forward Regression?
Ans.

58 Why would you use a Kruskal -Wallis test?


Ans. The Kruskal-Wallis H test (sometimes also called the "one-way ANOVA on ranks") is a rank-based
nonparametric test that can be used to determine if there are statistically significant differences between
two or more groups of an independent variable on a continuous or ordinal dependent variable. It extends
the Mann-Whitney U test when there are more than two groups.
59 What do you mean by a quadratic model?

Ans. A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial (quadratic) equation:

y=a+b*x^2

In this regression technique, the best fit line is not a straight line. It is rather a curve that fits into the data
points.
60 What are the minimum and maximum values of r, the coefficient of correlation?
Ans. -1 to +1
61 What do you mean by Se, the Standard Error of Estimate?
Ans. The standard error of the estimate is a measure of the accuracy of predictions. Recall that the regression line is the line that minimizes the sum of squared deviations of prediction (also called the sum of squares error). The standard error of the estimate is closely related to this quantity and is defined below:

σ_est = sqrt( Σ(Y - Y')² / N )

where σ_est is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N is the number of pairs of scores. The numerator is the sum of squared differences between the actual scores and the predicted scores.

62 What do you mean by Five Number Summary?


Ans. For a set of data: the minimum, first quartile, median, third quartile, and maximum. Note: a boxplot is a visual display of the five-number summary.
63 When do you use Tukey-Kramer procedure?
Ans. To test for differences in pairs of means
64 When do we conduct a Randomized Block Design?
Ans. I) Divide the group of experimental units into n homogeneous groups of size t.
II) These homogeneous groups are called blocks.
III) The treatments are then randomly assigned to the experimental units in each block, one treatment to a unit in each block.
65 How do you test the overall significance of a Binary Logistic Regression model?
Ans. AIC (low value is better)
SC (Schwarz Criterion)
Hosmer-Lemeshow statistic
ROC
66 What is the Power of a hypothesis test?
Ans. Power is the probability of rejecting the null hypothesis when in fact it is false. Power is the probability of making a correct decision (to reject the null hypothesis) when the null hypothesis is false.
Power = 1 - P(Type II Error) = 1 - β
67 When do you use Durbin-Watson test?
Ans. The Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis.
68 Under what condition will the mean of sample proportions be equal to the population proportion if you do repeated sampling?
Ans. The Central Limit Theorem for Proportions.
Let p be the probability of success and q be the probability of failure. The sampling distribution for samples of size n is approximately normal with mean p and standard deviation sqrt(pq/n).

69 What is the mean of Z-distribution?


Ans. Mean = 0
70 What do you mean by the Positive Predictive Value of a clinical test?
Ans. Positive predictive value is the probability that subjects with a positive screening test truly have the disease.
Negative predictive value is the probability that subjects with a negative screening test truly don't have the disease.
Using a 2x2 table where A = true positives, B = false positives, C = false negatives and D = true negatives:

Sensitivity: A/(A+C) x 100

Specificity is the fraction of those without disease who will have a negative test result:

Specificity: D/(D+B) x 100

Sensitivity and specificity are characteristics of the test. The population does not affect the results.
Positive Predictive Value: A/(A+B) x 100
Negative Predictive Value: D/(D+C) x 100
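The four measures above computed from a 2x2 table, with hypothetical counts:

# A = true positives, B = false positives, C = false negatives, D = true negatives
A, B, C, D = 90, 10, 5, 895

sensitivity = A / (A + C) * 100
specificity = D / (D + B) * 100
ppv = A / (A + B) * 100
npv = D / (D + C) * 100
print(sensitivity, specificity, ppv, npv)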

71 What is blinding in an experiment?


Ans. Blinding is the practice of not telling subjects whether they are receiving a placebo. In this way,
subjects in the control and treatment groups experience the placebo effect equally. Often, knowledge of
which groups receive placebos is also kept from analysts who evaluate the experiment. This practice is
called double blinding.
72 What do you mean by double blinding in an experiment?
Ans. Often, knowledge of which groups receive placebos is also kept from analysts who evaluate the
experiment. This practice is called double blinding.
73 What is the value of VIF above which you will conclude that there is very high multicollinearity among the predictors?
Ans. Industry standard: 1.5 (not more than 2)
Book: >= 10
74 Which distribution will you apply to represent occurrences of rare events?
Ans. Poisson Distribution

75 What benefit do you achieve by a Randomized Block Design?


Ans. This design reduces variability within treatment conditions and potential confounding, producing a
better estimate of treatment effects.
76 If alpha is 0.01, what is the area under the lower tail of the rejection area in a two tailed
hypothesis test?
Ans. 0.005
77 What do the edges of the box in a Box Plot represent? What does the line within the box?
Ans. Lower Edge = Q1

Upper Edge = Q3

Line within box = Median

78 Which axis represents the quantitative variable on a Histogram?


Ans. X axis
79 Why do we go for sampling instead of a census?
Ans. - Time
- Cost
- Incremental improvement in sampling error is very marginal after a certain sample size is
achieved
- Useful when all the items of a population are not accessible or available
- Helps when sampling is done in destructive testing
- When element of population is changing
80 What is the main difference between stratified sampling and cluster sampling?

Ans.

81 Does correlation imply causation? What is required to explore causation?


Ans. Correlation does not imply causation
1. Temporal Priority - The Cause precedes the effect.
2. Concomitant Variation -There must be an association or correlation between the treatment and
outcome.
3. Elimination of other plausible alternatives - You must control for spurious or other possible causes.
82 What type of explanatory variable you deal with in ANOVA?
Ans. Categorical
83 Explain how you would draw a histogram of the residuals of a linear regression model MODEL1 in R to test normality. State the command.
Ans. hist(residuals(MODEL1))

A residuals-vs-fitted plot and a formal normality test can supplement the histogram:
plot(fitted(MODEL1), residuals(MODEL1), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(smooth.spline(fitted(MODEL1), residuals(MODEL1)))
adTest(residuals(MODEL1))   # Anderson-Darling normality test (adTest comes from an add-on package)
84 For an Odds Ratio of 1, what is p?

Ans. P = .5
85 What are the parameters of an ARIMA model?
Ans. ARIMA (p,d,q)

p is the number of autoregressive terms,
d is the number of non-seasonal differences needed for stationarity, and
q is the order of the moving average part.

86 What is PACF for a time series data?


Ans. The partial autocorrelation function (PACF) gives the partial correlation of a time series with its
own lagged values, controlling for the values of the time series at all shorter lags. It contrasts with
the autocorrelation function, which does not control for other lags.
This function plays an important role in data analyses aimed at identifying the extent of the lag in
an autoregressive model. The use of this function was introduced as part of the Box-Jenkins approach to time series modelling, where by plotting the partial autocorrelation functions one can determine the appropriate lags p in an AR(p) model or in an extended ARIMA(p,d,q) model.

87 What are the components of a Time Series data? Explain them in terms of time periods.
Ans. Trend
A trend exists when there is a long-term increase or decrease in the data. It does not have to be
linear. Sometimes we will refer to a trend changing direction when it might go from an
increasing trend to a decreasing trend.
Seasonal
A seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the
year, the month, or day of the week). Seasonality is always of a fixed and known period.
Cyclic
A cyclic pattern exists when data exhibit rises and falls that are not of fixed period. The duration
of these fluctuations is usually of at least 2 years.
Irregular
These are sudden changes occurring in a time series which are unlikely to be repeated. It is that component of a time series which cannot be explained by trend, seasonal or cyclic movements. Because of this, these variations are sometimes called the residual or random component. These variations, though accidental in nature, can cause a continual change in the trend, seasonal and cyclical oscillations during the forthcoming period. Floods, fires, earthquakes, revolutions, epidemics, strikes etc. are the root causes of such irregularities.
88 What is the basic purpose of ANOVA?
Ans. The purpose of analysis of variance (ANOVA) is to test for significant differences between means.
89 Is Sampling Distribution a probability distribution? Explain.
Ans. Suppose that we draw all possible samples of size n from a given population. Suppose further that
we compute a statistic (e.g., a mean, proportion, standard deviation) for each sample. The probability
distribution of this statistic is called a sampling distribution. And the standard deviation of this
statistic is called the standard error.
90 For two mutually exclusive events A and B, what is the probability of A or B?
Ans. P(A or B) = P(A) + P(B)
91 Which one is more severe, Type I or Type II? Why?
Ans. In most fields of science, Type II errors are not seen to be as problematic as a Type I error. With the Type II error, a chance to reject the null hypothesis was lost, and no conclusion is inferred from a non-rejected null. The Type I error is seen as more serious, because you have wrongly rejected the null hypothesis.
To me, both of these situations are dangerous. In one case the patient receives unnecessary treatment. In the other, the patient doesn't receive needed treatment.
Here's another example. Suppose our hypothesis is "This material is not radioactive." A Type 1 error would be to conclude that the material is radioactive when it really isn't. A Type 2 error would be to conclude that the material is not radioactive when it really is. In this case, the Type 2 error is more dangerous.
Finally, here is a really vivid example. Suppose our hypothesis is "This gun isn't loaded." A Type 1 error would be to conclude that it is loaded when it really isn't. A Type 2 error would be to conclude it isn't loaded when it really is. The Type 2 error is clearly more dangerous!
In general, a Type 1 error is not more dangerous than a Type 2 error. It depends on the nature of your hypothesis.

92 Is 95% C. I. wider than 99% C.I.? Why or Why not?


Ans. No; a 99% confidence interval is wider (has more values) than a 95% confidence interval.
For example, we could have X̄ = 100 with a standard error of 10. The 95% C.I. would be 100 ± 1.96 × 10, while the 99% C.I. would be 100 ± 2.58 × 10, obviously making it wider than the 95% C.I.
93 What is the p-value in a hypothesis test?
Ans. A P-value measures the strength of evidence against a null hypothesis. Suppose the test statistic in a hypothesis test is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.
94 How would you identify a paired test?

Ans. The sampling method for each sample is simple random sampling.

The test is conducted on paired data. (As a result, the data sets are not independent.)

The sampling distribution is approximately normal, which is generally true if any of the following
conditions apply.

The population distribution is normal.

The population data are symmetric, unimodal, without outliers, and the sample
size is 15 or less.

The population data are slightly skewed, unimodal, without outliers, and the sample
size is 16 to 40.

The sample size is greater than 40, without outliers.

95 What should be the nature of Time Series data with which you can develop an ARIMA
model?
Ans. Stationary
96 What is the basic purpose of regression?
Ans. Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and independent (predictor) variable(s). This technique is used for forecasting, time series modelling and finding the causal-effect relationship between variables.
You can even use this regression curve to predict values of y for x's that are outside of the range of the
current dataset (extrapolation). For example, relationship between rash driving and number of road
accidents by a driver is best studied through regression.
97 Why would you use Tukey's 4-quadrant approach for model building?

Ans.
98 What is an outlier in the context of regression?
Ans. Data points that diverge in a big way from the overall pattern are called outliers. There are four ways
that a data point might be considered an outlier.

It could have an extreme X value compared to other data points.

It could have an extreme Y value compared to other data points.

It could have extreme X and Y values.

It might be distant from the rest of the data, even without extreme X or Y values

99 Regarding variances of a linear regression model, what should be your assumption?


Ans. We assume constant variance (homoscedasticity). If the variance of Y is not constant, then the error variance will not be constant. The most common form of such heteroscedasticity in Y is that the variance of Y may increase as the mean of Y increases, for data with positive X and Y.
