Académique Documents
Professionnel Documents
Culture Documents
Residuals
Overview
𝑦" = -3.4 + 2.5𝑥
1
Finding the line Clicker Question #2:
• Statistical software can find the line the best Student Q earned 3 points of
fits the data extra credit and scored an 88
• This line of best fit is calculated by on the exam. What is the
minimizing the distance between the residual for student Q?
observed data values and the line of best fit
Enter your numeric answer.
• That is, we minimize the residual sum of
Round to one decimal place.
squares
minimize ∑ 𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 0
minimize ∑ 𝑦 − 𝑦" 0 𝑦" = 72.3 + 3.9𝑥
• This is also known as the method of least
squares, or a least squares regression line
• The sum of the residuals about the
regression line is always zero
R2
Overview
• The strength of the fit of a linear model is most commonly evaluated
• Residuals using R2
• This can be calculated in two ways:
• Sums of squares 1. R2 = square of the correlation coefficient (r)
– Always between 0 and 1
• Regression models 2. R2 =
123456718 95:65;646<=
<><54 95:65;646<=
• Assumptions for Linear Regression • Interpreted as the percentage of variability in y explained by x
2
The idea behind explained and unexplained Partitioning Variability
variability
3
r vs R2
Overview
• Both r and R2 describe the strength of association.
• r falls between -1 and +1 • Residuals
It represents the slope of the regression line when x and y have
equal standard deviations
• Sums of squares
@ 1 = 𝑟 × ( 𝒔𝒚 )
𝜷
𝒔𝒙 • Regression models
r = 𝛽H 1 , when sy = sx
• Assumptions for Linear Regression
• R2 falls between 0 and 1
It is the proportion of variation in y accounted for by x
4
The regression model Example
• The population regression equation approximates the actual • There were 32 students who earned 4.75 points
relationship between x and the population mean of y of extra credit on Exam 2.
• At a given value of x, the equation • 𝑦" = 𝛽HP + 𝛽HK 𝑥
𝑦" = 𝛽HP + 𝛽HK 𝑥 predicts a single value of • 𝑦" = 57.85 + 4.73 × 4.75 = 80.3
the response variable
• At each fixed value of x, variability occurs • This is both the predicted exam grade for
in the y values around their mean, 𝜇= – Any individual student who earns 4.75 extra credit points
• The distribution of y values at a fixed value – The average of all students who earn 4.75 extra credit
points.
of x is a conditional distribution.
• There is variation about the predicted mean
• An additional parameter 𝜎 describes the grade
standard deviation of each conditional
distribution
Clicker Question #4
Overview
• Residuals
• Sums of squares
What is the estimated regression equation? • Regression models
A. 𝑦" = −51.211 + 0.613×𝑔𝑒𝑠𝑡𝑎𝑡𝑖𝑜𝑛
B. 𝑦" = 0.613 − 51.211×𝑔𝑒𝑠𝑡𝑎𝑡𝑖𝑜𝑛
• Assumptions for Linear Regression
C. 𝑦" = 0.613 + 0.213×𝑔𝑒𝑠𝑡𝑎𝑡𝑖𝑜𝑛
D. 𝑦" = −51.211 + 0.213×𝑔𝑒𝑠𝑡𝑎𝑡𝑖𝑜𝑛
5
Assumptions for Linear Regression Checking assumptions
1. The data were gathered using randomization.
2. The observations are independent.
3. Linearity: The population means of y at different values
of x have a straight-line relationship
with x, that is, 𝝁𝒚 = 𝜷𝟎 + 𝜷𝟏 𝒙
4. (4a) The population values of y at each value
of x follow a normal distribution (nearly
normal residuals), (4b) with the same
standard deviation (𝜎) at each x value
(constant variance in y for all x)
Linearity assumption (3) violated Constant variance assumption (4b) violated
6
Checking assumptions with residuals Checking assumptions with residuals
• We can use plots of residuals (residual plots) to check the assumptions for
linear regressions
Check to see if the residuals are approximately normally distributed. This • Regular residual = 𝑦 − 𝑦" ~ 𝑁 0, 𝜎
can be used to assess the first part of assumption 4(4a): The population =f="
values of y at each value of x follow a normal distribution. • Standardized residual = g1 ~𝑁(0,1)
j
hih
– Standardized residual allows us to identify potential outliers