
Linear Regression

Case Study 1 : Boston Data


• As a builder, you are planning to build a
residential apartment. Find the price per
square foot (Price_per_sqrt).

• Step 1: Try to figure out Parameters


???????????????????
Parameters
• Quality of neighbourhood
• Floor size
• Size of the apartment
• Distance from town
• Amount of pollution
• Crime_rate
• No_Of_Schools
Price_per_sqrt = A + B * Distance_Town + C * Amt_Pollution + ......
Price_per_sqrt is the dependent variable.
Linear regression technique: predictive analysis
(predicts the value of a continuous variable)

A -> Intercept of the equation (statistical term)
     In business terms: base value

B, C -> Coefficients of the independent variables (IDV)
        Slopes of the equation


Price of the apartment = ????? when size = 1500 sq feet.
Slope: the amount by which the price increases
for each unit increase in size.

Residual: the difference between the actual value
and the value on the line.

Error: variation the model cannot explain because data
on some factor is missing for any reason, e.g. crime_rate.

• There could be many such lines that pass through
the data.
So, the objective is to find the line that best fits the data.


Line of best fit
• The best-fit line passes as close as possible to
all the data points.
• It minimises the residuals (under OLS, the sum of
their squares).

Method to find the best-fit line:
Ordinary Least Squares (OLS)
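A minimal sketch of what OLS computes for one independent variable, assuming a tiny made-up dataset (the values of x and y are illustrative, not from the case study). It works out the intercept A and slope B from the closed-form formulas and checks them against lm().

```r
# Made-up illustrative data (not from the case study)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Closed-form OLS estimates for y = A + B * x
B <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
A <- mean(y) - B * mean(x)

# lm() should return the same intercept and slope
fit <- lm(y ~ x)
print(c(A = A, B = B))
print(coef(fit))

# Sum of squared residuals: the quantity OLS minimises
ssr <- sum(resid(fit)^2)
print(ssr)
```

Any other line through these points would give a larger sum of squared residuals than ssr.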
Check:
• Which is the dependent variable?
• Which are the independent variables?
• What kind of correlation does the plot show?
If the relationship is not linear, apply a log,
inverse, or other transformation to make it linear.
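As a rough illustration of such a transformation, the sketch below assumes made-up data that follow an exponential curve. Taking the log of y makes the relationship linear, so lm() can recover the underlying rate.

```r
# Made-up data with an exponential (non-linear) relationship
x <- 1:10
y <- 3 * exp(0.5 * x)

# log(y) = log(3) + 0.5 * x is linear in x, so a linear model fits it
fit_log <- lm(log(y) ~ x)
print(coef(fit_log))
```

The same idea applies to an inverse transformation: fit the model on 1 / y (or 1 / x) when that is the scale on which the scatter plot looks straight.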
Multicollinearity
• Correlated independent variables

Example: two members of a team with similar
analytical views working on the same project
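One simple way to look for multicollinearity is to inspect the correlations among the independent variables. The sketch below assumes the Boston data is the copy shipped in the MASS package and uses 0.8 as the cutoff; both are assumptions, not something the slides specify.

```r
library(MASS)  # assumption: the case-study data is MASS::Boston

# Correlation matrix of the independent variables only (medv is the target)
idv  <- Boston[, setdiff(names(Boston), "medv")]
cors <- cor(idv)

# Pairs of independent variables with |correlation| above 0.8 (assumed cutoff)
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
pairs_found <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
print(pairs_found)
```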
Find the two independent variables most strongly
correlated with medv, the variable to be predicted.

Answer: rm and lstat
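A possible way to arrive at these two variables, assuming the data is MASS::Boston: rank every independent variable by the absolute value of its correlation with medv.

```r
library(MASS)  # assumption: the case-study data is MASS::Boston

# Correlation of every column with medv, dropping medv itself
cor_medv <- cor(Boston)[, "medv"]
cor_medv <- cor_medv[names(cor_medv) != "medv"]

# Rank by strength of correlation, ignoring the sign
ranked <- sort(abs(cor_medv), decreasing = TRUE)
print(head(ranked, 2))
```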
Randomly split the data into two:
training and validation datasets
(important for predictive analysis)
Approx. 70% to train,
30% to test
Install caTools for the split function.
To split the dataset, every record is
marked TRUE or FALSE.

65% - TRUE (training)
35% - FALSE (testing)
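The split could look roughly like this, assuming the MASS::Boston data and an arbitrary seed. sample.split() is the caTools function that returns the TRUE/FALSE vector; the base-R branch is only a fallback for when caTools is not installed.

```r
library(MASS)  # assumption: the case-study data is MASS::Boston
set.seed(123)  # arbitrary seed so the split is reproducible

if (requireNamespace("caTools", quietly = TRUE)) {
  # TRUE for roughly 65% of the records, FALSE for the rest
  is_train <- caTools::sample.split(Boston$medv, SplitRatio = 0.65)
} else {
  # Base-R fallback: mark a random 65% of the rows as TRUE
  is_train <- seq_len(nrow(Boston)) %in%
    sample(nrow(Boston), round(0.65 * nrow(Boston)))
}

train <- subset(Boston, is_train == TRUE)
test  <- subset(Boston, is_train == FALSE)
print(table(is_train))
```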
Train and test the dataset.
Use . for all variables.

Linear equation:
medv = 36.459488 + (-0.10) * crim + 0.04 * zn + .....
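A sketch of the model fit, assuming the MASS::Boston data. It is fitted on the full dataset here, which should give coefficients close to the equation quoted above; fitting on a training subset would give slightly different numbers.

```r
library(MASS)  # assumption: the case-study data is MASS::Boston

# "." on the right-hand side means "all other columns" as independent variables
model <- lm(medv ~ ., data = Boston)

# Intercept and one coefficient per independent variable
print(round(coef(model), 4))
```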
Compare the predicted output with
the original one.
Predict command
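The predict step might be sketched as follows, assuming the MASS::Boston data, a base-R 70/30 split, and an arbitrary seed. The comparison table puts the actual and predicted medv side by side.

```r
library(MASS)  # assumption: the case-study data is MASS::Boston
set.seed(123)  # arbitrary seed so the split is reproducible

# Roughly 70% of the rows for training, the rest for testing
train_rows <- sample(nrow(Boston), round(0.7 * nrow(Boston)))
train <- Boston[train_rows, ]
test  <- Boston[-train_rows, ]

# Fit on the training data, predict on the held-out data
model     <- lm(medv ~ ., data = train)
predicted <- predict(model, newdata = test)

# Actual vs predicted values, plus one summary number for the error
comparison <- data.frame(actual = test$medv, predicted = predicted)
print(head(comparison))
rmse <- sqrt(mean((comparison$actual - comparison$predicted)^2))
print(rmse)
```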
Case Study - 2
• A company is facing a high churn_rate this year
and is trying to find out the reasons behind it.
Since Salary_Hike is considered the major reason,
let us take the company's data and try to find
the relationship between these two variables.
• Linear regression is a powerful technique used
for predicting the unknown value of a variable
(the dependent variable) from the known values of
other variables (the independent variables).
- A dependent variable is the variable to be predicted
or explained in a regression model. This variable is
assumed to be functionally related to the
independent variable.
- An independent variable is the variable related to the
dependent variable in a regression equation. It is
used in a regression model to estimate the value of
the dependent variable.
In this case study: Dependent variable: Churn_data
Independent variable: Salary_Hike
Equation of the Regression Line

• Y-intercept (a) is the value of the dependent variable (y)
when the value of the independent variable (x) is zero. It is
the point at which the line cuts the y-axis.

• Slope (b) is the change in the dependent variable for a unit
increase in the independent variable. It is the tangent of
the angle the line makes with the x-axis.
From the graph we can see that
as the salary_hike decreases, the
Churn_out rate increases
Function: lm() in R
• lm() is used to fit linear models, i.e. to
calculate the regression function.

Syntax: lm(formula, data= )

abline() draws a regression line
on the current plot.

lwd -> line width

segments() draws line segments
between pairs of points.
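A small sketch tying these functions together. The salary_hike and churn_rate values are made up for illustration and are not the company's data.

```r
# Made-up illustrative values, not the company's data
salary_hike <- c(2, 4, 6, 8, 10, 12)       # hike in percent (hypothetical)
churn_rate  <- c(30, 27, 22, 20, 15, 12)   # churn in percent (hypothetical)

# Fit the linear model: churn_rate explained by salary_hike
fit <- lm(churn_rate ~ salary_hike)
print(coef(fit))

# Scatter plot with the fitted regression line (lwd sets the line width)
plot(salary_hike, churn_rate, pch = 16)
abline(fit, col = "red", lwd = 2)

# Vertical segments from each point to the line, i.e. the residuals
segments(salary_hike, churn_rate, salary_hike, fitted(fit), col = "blue")
```

segments() takes the start coordinates (x0, y0) and end coordinates (x1, y1) of each segment, so passing the observed and fitted values draws one residual per point.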
Case Study - 3
• A computer manufacturing company is trying to
analyse how the price of a computer depends on
independent variables such as CPU speed, hard
disk, RAM, screen size, CD (yes/no), produced by
a premium company (yes/no), and so on. Based on
this data, the company wants to decide on the
price of a new PC configuration.

• Is the case solvable by a linear model?


Use specified dataset
Speed and price are proportionally related.
When speed=50 ... Price = 4900
Create Linear regression model
for the case study 3
Predict from the linear model
created
Predict command
Compare the predicted
Output to the original
dataset Price values
Calculate residual values for the
model
Residual values
Now plot the predicted and
residual values obtained.
To make the graph clearer, plot the
predicted and residual values
as a line graph.
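The steps for this case study might be sketched as below. The pc data frame is a hypothetical stand-in: its column names and values are assumptions, since the slides do not list the actual dataset.

```r
# Hypothetical stand-in for the computer dataset (names and values are assumptions)
pc <- data.frame(
  price = c(1500, 1800, 1600, 1850, 3300, 3700, 1700, 2000),
  speed = c(25, 33, 25, 25, 33, 66, 25, 50),
  ram   = c(4, 2, 4, 8, 16, 16, 4, 2)
)

# Create the linear regression model and predict from it
model     <- lm(price ~ speed + ram, data = pc)
predicted <- predict(model, newdata = pc)

# Residual = actual price minus predicted price
residual <- pc$price - predicted
print(data.frame(price = pc$price, predicted = predicted, residual = residual))

# Line graph: actual price in blue, predicted value in red
plot(pc$price, type = "l", col = "blue", xlab = "Record", ylab = "Price")
lines(predicted, col = "red")
```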
Here the blue lines represent the price and the red lines
represent the predicted values generated for the data.
As seen from the graph, most of the predicted values
overlap with the actual values.
The closer the R-squared value is to 1, the better the
model fits the data.

The large amount of overlap indicates that the model fits well.


Case Study - 4
• In a college, a comparative study was done to
predict the grades of students at college level
on the basis of the marks scored by the students
in high school.

Student HS_Grade College_Grade
1 2.0 1.6
2 2.2 2.0
3 2.6 1.8
4 2.7 2.8
5 2.8 2.1
6 3.1 2.0
7 2.9 2.6
8 3.2 2.2
9 3.3 2.6
10 3.6 3.0
Correlation analysis from scatter plot: positive (+ve)
Major facts to Analyze
1. P – value : The p-value for each term tests
the null hypothesis that the coefficient is
equal to zero (no effect). A low p-value
(< 0.05) indicates that you can reject the null
hypothesis.
2. Estimate : A one-unit increase in HS grade increases
the predicted college grade by 0.6465 (magnitude = 0.6465
and direction is +ve).
Major facts to Analyze
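3. Multiple R-squared : Ideal value is 1.
- Sensitive to the number of independent
variables: it never decreases when more are added.

4. Adjusted R-squared : Penalises additional
independent variables, so it rises only when a new
variable actually improves the model.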
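These four quantities can be read off the model summary. The sketch below uses the ten HS_Grade and College_Grade records from this case study; the variable names model and s are just illustrative.

```r
# The ten records from the case study
grades <- data.frame(
  HS_Grade      = c(2.0, 2.2, 2.6, 2.7, 2.8, 3.1, 2.9, 3.2, 3.3, 3.6),
  College_Grade = c(1.6, 2.0, 1.8, 2.8, 2.1, 2.0, 2.6, 2.2, 2.6, 3.0)
)

model <- lm(College_Grade ~ HS_Grade, data = grades)
s     <- summary(model)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
print(s$coefficients)

estimate <- s$coefficients["HS_Grade", "Estimate"]   # slope
p_value  <- s$coefficients["HS_Grade", "Pr(>|t|)"]   # p-value of the slope
r2       <- s$r.squared                              # multiple R-squared
adj_r2   <- s$adj.r.squared                          # adjusted R-squared
print(c(estimate = estimate, p_value = p_value, r2 = r2, adj_r2 = adj_r2))
```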
Hypothesis
• Null Hypothesis : College grade does not depend
on HS grade (the coefficient is zero).

• Alternate Hypothesis : College grade depends on
HS grade.
Analyze
Student HS_Grade College_Grade
1 2.0 1.6
2 2.2 2.0
3 2.6 1.8
4 2.7 2.8
5 2.8 2.1
6 3.1 2.0
7 2.9 2.6
8 3.2 2.2
9 3.3 2.6
10 3.6 3.0
Compare the predicted and original values
Residual evaluated
Combined all steps ......
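Put together, the steps for this case study might look like the following sketch. It uses the ten records from the table; new_student is a hypothetical record added only to show a prediction for unseen data.

```r
# The ten records from the case study
grades <- data.frame(
  HS_Grade      = c(2.0, 2.2, 2.6, 2.7, 2.8, 3.1, 2.9, 3.2, 3.3, 3.6),
  College_Grade = c(1.6, 2.0, 1.8, 2.8, 2.1, 2.0, 2.6, 2.2, 2.6, 3.0)
)

# Step 1: fit the linear model
model <- lm(College_Grade ~ HS_Grade, data = grades)

# Step 2: compare the predicted and original values
grades$Predicted <- predict(model, newdata = grades)

# Step 3: evaluate the residuals (actual minus predicted)
grades$Residual <- grades$College_Grade - grades$Predicted
print(grades)

# Step 4: predict for a new, hypothetical student
new_student <- data.frame(HS_Grade = 3.0)
new_pred    <- predict(model, newdata = new_student)
print(new_pred)
```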
