20 LinearRegression

Clicker Question #1
What is your best guess of the correlation between x and y?
A. -2
B. -1
C. -0.7
D. 0
E. 0.7
F. 1
G. 2

More on Linear Regression
Overview
• Residuals
• Sums of squares
• Regression models
• Assumptions for Linear Regression

Residuals
ŷ = −3.4 + 2.5x
• A residual is the distance between an observed data point and the fitted line
• residual = y − ŷ
Finding the line
• Statistical software can find the line that best fits the data
• This line of best fit is calculated by minimizing the distance between the observed data values and the line of best fit
• That is, we minimize the residual sum of squares:
  minimize ∑ residual²
  minimize ∑ (y − ŷ)²
• This is also known as the method of least squares, and the resulting line is the least squares regression line
• The sum of the residuals about the regression line is always zero

Clicker Question #2
Student Q earned 3 points of extra credit and scored an 88 on the exam. What is the residual for student Q?
ŷ = 72.3 + 3.9x
Enter your numeric answer. Round to one decimal place.
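The least squares idea from "Finding the line" can be sketched in a few lines of Python. This is only an illustration: the data values and the names `least_squares_fit` and `residual_sum_of_squares` are made up for this example and do not come from the lecture. It fits the line with the standard closed-form formulas, then checks that the residuals sum to (numerically) zero and that moving the line away from the fit increases the residual sum of squares.

```python
from statistics import mean

def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the line minimizing the residual sum of squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residual_sum_of_squares(intercept, slope, xs, ys):
    """Sum of squared residuals (y - y_hat) for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
extra_credit = [0, 1, 2, 3, 4, 5]
exam_score = [70, 78, 79, 88, 85, 94]

b0, b1 = least_squares_fit(extra_credit, exam_score)
residuals = [y - (b0 + b1 * x) for x, y in zip(extra_credit, exam_score)]
best_rss = residual_sum_of_squares(b0, b1, extra_credit, exam_score)
# Shift the intercept slightly: any other line should fit worse.
nudged_rss = residual_sum_of_squares(b0 + 0.5, b1, extra_credit, exam_score)

print(f"fitted line: y_hat = {b0:.2f} + {b1:.2f} x")
print(f"sum of residuals: {sum(residuals):.6f}")
print(f"RSS at the fitted line: {best_rss:.2f}")
print(f"RSS after shifting the intercept by 0.5: {nudged_rss:.2f}")
```

In practice statistical software does this fit for you; the sketch only makes the minimization visible.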
R²
• The strength of the fit of a linear model is most commonly evaluated using R²
• This can be calculated in two ways:
  1. R² = square of the correlation coefficient (r)
     – Always between 0 and 1
  2. R² = explained variability / total variability
• Interpreted as the percentage of variability in y explained by x
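As a rough check that the two ways of calculating R² agree, the sketch below computes both on a small data set. The data are hypothetical and the variable names are chosen for this example only. "Explained variability" is taken here to be the sum of squared deviations of the fitted values from the mean of y, and "total variability" the sum of squared deviations of y from its mean.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]

x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Way 1: square the correlation coefficient r.
r = sxy / sqrt(sxx * syy)
r2_from_r = r ** 2

# Way 2: explained variability divided by total variability.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in xs]
explained = sum((f - y_bar) ** 2 for f in fitted)
total = syy
r2_from_variability = explained / total

print(f"r = {r:.4f}")
print(f"R^2 from r: {r2_from_r:.4f}")
print(f"R^2 from variability: {r2_from_variability:.4f}")
```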
The idea behind explained and unexplained variability
• If a student still needed to take exam 2, what would be my best guess of this student’s score?
• What other information could I use to make a better guess?

Partitioning Variability
R² and sums of squares
• Sums of squares are summarized in an ANOVA (Analysis of Variance) table
• In R, ANOVA tables are produced with linear regression results
• A higher R² indicates that x is better at predicting y (because x explains a large share of the variability in y)

Linear Regression output from R
64% of the variation in the rate of Nobel laureates can be explained by the amount of chocolate consumed
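To make the link between sums of squares and an ANOVA table concrete, here is a minimal Python sketch of the quantities such a table usually reports for a regression with one predictor. It does not reproduce R's output format; the function name `anova_for_simple_regression`, the dictionary keys, and the data are assumptions made for this example.

```python
from statistics import mean

def anova_for_simple_regression(xs, ys):
    """Sums of squares, degrees of freedom, mean squares and F for y ~ x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * x for x in xs]

    ss_model = sum((f - y_bar) ** 2 for f in fitted)          # explained
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained
    ss_total = sum((y - y_bar) ** 2 for y in ys)

    df_model, df_resid = 1, n - 2
    ms_model = ss_model / df_model
    ms_resid = ss_resid / df_resid
    return {
        "ss_model": ss_model, "ss_resid": ss_resid, "ss_total": ss_total,
        "df_model": df_model, "df_resid": df_resid,
        "ms_model": ms_model, "ms_resid": ms_resid,
        "f_stat": ms_model / ms_resid,
        "r_squared": ss_model / ss_total,
    }

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]
table = anova_for_simple_regression(xs, ys)

print(f"{'Source':<10}{'Df':>4}{'Sum Sq':>10}{'Mean Sq':>10}")
print(f"{'x':<10}{table['df_model']:>4}{table['ss_model']:>10.3f}{table['ms_model']:>10.3f}")
print(f"{'Residuals':<10}{table['df_resid']:>4}{table['ss_resid']:>10.3f}{table['ms_resid']:>10.3f}")
print(f"F = {table['f_stat']:.2f}, R^2 = {table['r_squared']:.3f}")
```

The model and residual sums of squares add up to the total, which is the partition of variability that R² summarizes.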
r vs R²
• Both r and R² describe the strength of association.
• r falls between -1 and +1
  It represents the slope of the regression line when x and y have equal standard deviations
  β̂₁ = r × (s_y / s_x)
  r = β̂₁ when s_y = s_x
• R² falls between 0 and 1
  It is the proportion of variation in y accounted for by x
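The relationship between the slope and r can be checked numerically. The sketch below uses hypothetical data and a helper named `slope_and_r`, both invented for this example. It verifies that the least squares slope equals r × (s_y / s_x), and that after converting x and y to z-scores (so both have standard deviation 1) the slope equals r.

```python
from statistics import mean, stdev

def slope_and_r(xs, ys):
    """Return (least squares slope, correlation coefficient r)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [12.0, 15.5, 21.0, 22.5, 30.0, 31.0]

slope, r = slope_and_r(xs, ys)
s_x, s_y = stdev(xs), stdev(ys)

# Standardize both variables so that each has standard deviation 1.
zx = [(x - mean(xs)) / s_x for x in xs]
zy = [(y - mean(ys)) / s_y for y in ys]
z_slope, _ = slope_and_r(zx, zy)

print(f"slope = {slope:.4f}, r * (s_y / s_x) = {r * s_y / s_x:.4f}")
print(f"slope on standardized data = {z_slope:.4f}, r = {r:.4f}")
```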
Clicker Question #3
What can we conclude about the relationship between gestation days and birth weight of a baby?
A. β₁ is significantly greater than 0, so there is no association between gestation days and birth weight.
B. β₁ is significantly greater than 0, so there is an association between gestation days and birth weight.
C. β₁ is not significantly greater than 0, so there is no association between gestation days and birth weight.
D. β₁ is not significantly greater than 0, so there is an association between gestation days and birth weight.

The regression model
We are modeling an average response (μ_y), but we say there is variation at the individual level about the average response
The regression model
• The population regression equation approximates the actual relationship between x and the population mean of y
• At a given value of x, the equation ŷ = β̂₀ + β̂₁x predicts a single value of the response variable
• At each fixed value of x, variability occurs in the y values around their mean, μ_y
• The distribution of y values at a fixed value of x is a conditional distribution.
• An additional parameter σ describes the standard deviation of each conditional distribution

Example
• There were 32 students who earned 4.75 points of extra credit on Exam 2.
• ŷ = β̂₀ + β̂₁x
• ŷ = 57.85 + 4.73 × 4.75 = 80.3
• This is both the predicted exam grade for
  – Any individual student who earns 4.75 extra credit points
  – The average of all students who earn 4.75 extra credit points.
• There is variation about the predicted mean grade
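The example can be reproduced with a short Python sketch. The intercept and slope are the values given in the example; the value of σ and the simulated grades are purely hypothetical, since the slides do not give σ. The simulation assumes normally distributed grades around the predicted mean, in line with the regression model described above.

```python
import random

INTERCEPT = 57.85   # estimated intercept from the example
SLOPE = 4.73        # estimated slope from the example
SIGMA = 8.0         # hypothetical standard deviation, NOT from the lecture

def predicted_mean_grade(extra_credit_points):
    """Predicted exam grade at a given number of extra credit points."""
    return INTERCEPT + SLOPE * extra_credit_points

mu = predicted_mean_grade(4.75)
print(f"predicted grade at x = 4.75: {mu:.1f}")  # 80.3

# Simulate 32 students who all earned 4.75 points: same predicted mean,
# but individual grades vary around it (the conditional distribution).
random.seed(0)
simulated_grades = [random.gauss(mu, SIGMA) for _ in range(32)]
print(f"lowest simulated grade:  {min(simulated_grades):.1f}")
print(f"highest simulated grade: {max(simulated_grades):.1f}")
```

The single number 80.3 is the estimate of the mean of this conditional distribution; individual students are expected to fall on both sides of it.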
Clicker Question #4
What is the estimated regression equation?
A. ŷ = −51.211 + 0.613 × gestation
B. ŷ = 0.613 − 51.211 × gestation
C. ŷ = 0.613 + 0.213 × gestation
D. ŷ = −51.211 + 0.213 × gestation
Assumptions for Linear Regression
1. The data were gathered using randomization.
2. The observations are independent.
3. Linearity: The population means of y at different values of x have a straight-line relationship with x, that is, μ_y = β₀ + β₁x
4. (4a) The population values of y at each value of x follow a normal distribution (nearly normal residuals), (4b) with the same standard deviation (σ) at each x value (constant variance in y for all x)

Checking assumptions
[Figure panels: Linearity assumption (3) violated; Constant variance assumption (4b) violated]
Checking assumptions
• Linear relationship between gestation period and birth weight of baby: Assumption #3 satisfied
• Constant variance in baby’s birth weight for each length of pregnancy: Assumption #4b satisfied
• How about assumption #4a (normal distribution of y for specific values of x)?
Checking assumptions with residuals
• We can use plots of residuals (residual plots) to check the assumptions for linear regression
• Check to see if the residuals are approximately normally distributed. This can be used to assess the first part of assumption 4 (4a): the population values of y at each value of x follow a normal distribution.
• Regular residual = y − ŷ ~ N(0, σ)
• Standardized residual = (y − ŷ) / se(y − ŷ) ~ N(0, 1)
  – Standardized residual allows us to identify potential outliers
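A minimal sketch of using standardized residuals to spot a potential outlier is given below. Several things here are assumptions rather than the lecture's method: the data are invented, each residual is divided by the overall residual standard error s = √(SSE / (n − 2)) instead of a per-observation standard error, and the cutoff of 2 is only a common rule of thumb.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: roughly y = 1 + 2x, with one unusual point at x = 6.
xs = list(range(1, 11))
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 19.0, 15.1, 16.8, 19.2, 20.9]

# Least squares fit.
x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Regular residuals: y - y_hat.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Residual standard error (simplifying assumption: one value for all points).
n = len(xs)
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standardized residuals and a rule-of-thumb cutoff (assumed, not from the slides).
standardized = [e / s for e in residuals]
CUTOFF = 2.0
flagged = [i for i, z in enumerate(standardized) if abs(z) > CUTOFF]

for i in flagged:
    print(f"potential outlier: x = {xs[i]}, y = {ys[i]}, z = {standardized[i]:.2f}")
```

A histogram of `standardized` would be the natural next step for checking assumption 4a; plotting is left out to keep the sketch free of third-party libraries.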
Checking assumptions with residuals
• Check assumptions using residual plots
  – 3. Linearity
  – 4b. Constant variance
• Check histogram for
  – 4a. Normal distribution

Clicker Question #5
Which assumption may be violated? Select the BEST answer.
A. Linear relationship given by μ_y = β₀ + β₁x
B. The population values of y conditioned on each value of x follow a normal distribution
C. The population values of y conditioned on each value of x have the same standard deviation (σ) at each x value
D. None of the above
