Review of ANOVA and Linear Regression

Review of simple ANOVA
ANOVA: for comparing means between more than 2 groups
Hypotheses of One-Way ANOVA
 $H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_c$
  All population means are equal
  i.e., no treatment effect (no variation in means among groups)
 $H_1$: Not all of the population means are the same
  At least one population mean is different
  i.e., there is a treatment effect
  Does not mean that all population means are different (some pairs may be the same)
The F-distribution
 A ratio of variances follows an F-distribution:
$$\frac{\sigma^2_{between}}{\sigma^2_{within}} \sim F_{n,m}$$
 The F-test tests the hypothesis that two variances are equal.
 F will be close to 1 if sample variances are equal.
$$H_0: \sigma^2_{between} = \sigma^2_{within}$$
$$H_a: \sigma^2_{between} \neq \sigma^2_{within}$$
How to calculate ANOVAs by hand…
n = 10 obs./group, k = 4 groups

Treatment 1   Treatment 2   Treatment 3   Treatment 4
y1,1          y2,1          y3,1          y4,1
y1,2          y2,2          y3,2          y4,2
y1,3          y2,3          y3,3          y4,3
y1,4          y2,4          y3,4          y4,4
y1,5          y2,5          y3,5          y4,5
y1,6          y2,6          y3,6          y4,6
y1,7          y2,7          y3,7          y4,7
y1,8          y2,8          y3,8          y4,8
y1,9          y2,9          y3,9          y4,9
y1,10         y2,10         y3,10         y4,10
The group means:
$$\bar{y}_{i\cdot} = \frac{1}{10}\sum_{j=1}^{10} y_{ij}, \qquad i = 1, 2, 3, 4$$
The (within) group variances:
$$s_i^2 = \frac{\sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2}{10 - 1}, \qquad i = 1, 2, 3, 4$$
Sum of Squares Within (SSW), or Sum of Squares Error (SSE)
The (within) group variances:
$$s_i^2 = \frac{\sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2}{10 - 1}, \qquad i = 1, 2, 3, 4$$
Adding up the numerators of the four group variances gives the Sum of Squares Within (SSW) (or SSE, for chance error):
$$SSW = \sum_{j=1}^{10} (y_{1j} - \bar{y}_{1\cdot})^2 + \sum_{j=1}^{10} (y_{2j} - \bar{y}_{2\cdot})^2 + \sum_{j=1}^{10} (y_{3j} - \bar{y}_{3\cdot})^2 + \sum_{j=1}^{10} (y_{4j} - \bar{y}_{4\cdot})^2 = \sum_{i=1}^{4}\sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2$$
Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)
Overall mean of all 40 observations (“grand mean”):
$$\bar{y}_{\cdot\cdot} = \frac{1}{40}\sum_{i=1}^{4}\sum_{j=1}^{10} y_{ij}$$
Sum of Squares Between (SSB): variability of the group means compared to the grand mean (the variability due to the treatment).
$$SSB = 10 \times \sum_{i=1}^{4} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$$
Total Sum of Squares (TSS, also written SST)
Squared difference of every observation from the overall mean (the numerator of the variance of Y!):
$$TSS = \sum_{i=1}^{4}\sum_{j=1}^{10} (y_{ij} - \bar{y}_{\cdot\cdot})^2$$
Partitioning of Variance
$$\sum_{i=1}^{4}\sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2 + 10 \times \sum_{i=1}^{4} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{4}\sum_{j=1}^{10} (y_{ij} - \bar{y}_{\cdot\cdot})^2$$
SSW + SSB = TSS
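The partition can be checked numerically. The following is a minimal sketch in Python; the toy data and the function name `sums_of_squares` are made up for illustration and do not come from the slides.

```python
from statistics import mean

def sums_of_squares(groups):
    """Return (SSW, SSB, TSS) for a list of groups of observations."""
    all_obs = [y for g in groups for y in g]
    grand_mean = mean(all_obs)
    group_means = [mean(g) for g in groups]
    # SSW: squared deviations of observations from their own group mean
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    # SSB: squared deviations of group means from the grand mean,
    # each weighted by the number of observations in the group
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # TSS: squared deviations of every observation from the grand mean
    tss = sum((y - grand_mean) ** 2 for y in all_obs)
    return ssw, ssb, tss

# Hypothetical toy data: 3 groups with 3 observations each
toy = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
ssw, ssb, tss = sums_of_squares(toy)
print(ssw, ssb, tss)
```

For this toy input the within and between pieces add up to the total, which is the identity SSW + SSB = TSS.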

ANOVA Table
Source of variation                d.f.    Sum of squares   Mean Sum of Squares   F-statistic                  p-value
Between (k groups)                 k-1     SSB              SSB/(k-1)             [SSB/(k-1)] / [SSW/(nk-k)]   Go to F(k-1, nk-k) chart
Within (n individuals per group)   nk-k    SSW              s^2 = SSW/(nk-k)
Total variation                    nk-1    TSS

SSB: sum of squared deviations of group means from grand mean
SSW: sum of squared deviations of observations from their group mean
TSS: sum of squared deviations of observations from grand mean; TSS = SSB + SSW
Example
Heights (in inches), 10 observations in each of 4 groups:
Group 1   Group 2   Group 3   Group 4
60        50        48        47
67        52        49        67
42        43        50        54
67        67        55        67
56        67        56        68
62        59        61        65
64        67        61        65
59        64        60        56
72        63        59        60
71        65        64        65
Example
Step 1) calculate the sum of squares between groups:
Mean for group 1 = 62.0
Mean for group 2 = 59.7
Mean for group 3 = 56.3
Mean for group 4 = 61.4
Grand mean = 59.85
SSB = [(62-59.85)^2 + (59.7-59.85)^2 + (56.3-59.85)^2 + (61.4-59.85)^2] x n per group = 19.65 x 10 = 196.5
Example
Step 2) calculate the sum of squares within groups:
(60-62)^2 + (67-62)^2 + (42-62)^2 + (67-62)^2 + (56-62)^2 + (62-62)^2 + (64-62)^2 + (59-62)^2 + (72-62)^2 + (71-62)^2 + (50-59.7)^2 + (52-59.7)^2 + (43-59.7)^2 + (67-59.7)^2 + (67-59.7)^2 + (59-59.7)^2 + … (sum of 40 squared deviations) = 2060.6
Step 3) Fill in the ANOVA table
Source of variation   d.f.   Sum of squares   Mean Sum of Squares   F-statistic   p-value
Between               3      196.5            65.5                  1.14          .344
Within                36     2060.6           57.2
Total                 39     2257.1
INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R^2 = “Coefficient of Determination” = SSB/TSS = 196.5/2257.1 = 9%
Coefficient of Determination
$$R^2 = \frac{SSB}{SSB + SSE} = \frac{SSB}{SST}$$
The amount of variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).
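Steps 1 to 3 of the height example can be reproduced in a few lines. This is a sketch in plain Python using the 40 heights from the Example slide; the variable names are illustrative. The p-value (.344) requires the F distribution and is not computed here.

```python
from statistics import mean

# Heights (inches) from the Example slide, one list per treatment group
groups = [
    [60, 67, 42, 67, 56, 62, 64, 59, 72, 71],
    [50, 52, 43, 67, 67, 59, 67, 64, 63, 65],
    [48, 49, 50, 55, 56, 61, 61, 60, 59, 64],
    [47, 67, 54, 67, 68, 65, 65, 56, 60, 65],
]
k = len(groups)        # number of groups
n = len(groups[0])     # observations per group (equal group sizes)

group_means = [mean(g) for g in groups]
grand_mean = mean([y for g in groups for y in g])

# Step 1: sum of squares between groups
ssb = n * sum((m - grand_mean) ** 2 for m in group_means)
# Step 2: sum of squares within groups
ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
# Step 3: fill in the ANOVA table
tss = ssb + ssw
ms_between = ssb / (k - 1)
ms_within = ssw / (n * k - k)
f_stat = ms_between / ms_within
r_squared = ssb / tss

print(f"SSB={ssb:.1f} SSW={ssw:.1f} TSS={tss:.1f}")
print(f"F={f_stat:.2f} on ({k - 1}, {n * k - k}) d.f., R^2={r_squared:.1%}")
```

The printed values should agree with the table: SSB 196.5, SSW 2060.6, TSS 2257.1, F about 1.14, and R^2 of roughly 9%.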
ANOVA example
Table 6. Mean micronutrient intake from the school lunch by school

                        S1 [a], n=25   S2 [b], n=25   S3 [c], n=25   P-value [d]
Calcium (mg)   Mean     117.8          158.7          206.5          0.000
               SD [e]   62.4           70.5           86.2
Iron (mg)      Mean     2.0            2.0            2.0            0.854
               SD       0.6            0.6            0.6
Folate (μg)    Mean     26.6           38.7           42.6           0.000
               SD       13.1           14.5           15.1
Zinc (mg)      Mean     1.9            1.5            1.3            0.055
               SD       1.0            1.2            0.4

[a] School 1 (most deprived; 40% subsidized lunches).
[b] School 2 (medium deprived; <10% subsidized).
[c] School 3 (least deprived; no subsidization, private school).
[d] ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England-are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.
Answer
Step 1) calculate the sum of squares between groups:
Mean for School 1 = 117.8
Mean for School 2 = 158.7
Mean for School 3 = 206.5
Grand mean: 161
SSB = [(117.8-161)^2 + (158.7-161)^2 + (206.5-161)^2] x 25 per group = 98,113
Answer
Step 2) calculate the sum of squares within groups:
S.D. for S1 = 62.4
S.D. for S2 = 70.5
S.D. for S3 = 86.2
Each group's sum of squared deviations equals (n-1) x S.D.^2, with n-1 = 24. Therefore, sum of squares within is:
(24)[62.4^2 + 70.5^2 + 86.2^2] = 391,066
Answer
Step 3) Fill in your ANOVA table
Source of variation   d.f.   Sum of squares   Mean Sum of Squares   F-statistic   p-value
Between               2      98,113           49056                 9             <.05
Within                72     391,066          5431
Total                 74     489,179
**R^2 = 98113/489179 = 20%
School explains 20% of the variance in lunchtime calcium intake in these kids.
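The calcium answer uses only the published means and standard deviations, so it can be rebuilt from summary statistics. Below is a sketch in plain Python with illustrative variable names. Evaluating the Step 1 expression exactly as written gives an SSB of about 98,545 rather than the 98,113 shown. The gap is small and may come from rounding in the source, and F of about 9 and R^2 of about 20% hold either way.

```python
# Summary statistics for calcium (mg) from Table 6
means = [117.8, 158.7, 206.5]   # School 1, 2, 3
sds = [62.4, 70.5, 86.2]
n = 25                          # children per school
k = len(means)                  # number of schools

# With equal group sizes the grand mean is the mean of the group means
grand_mean = sum(means) / k

# Between: n times the squared deviations of group means from the grand mean
ssb = n * sum((m - grand_mean) ** 2 for m in means)
# Within: each group's sum of squares is (n - 1) * SD^2
ssw = (n - 1) * sum(s ** 2 for s in sds)

ms_between = ssb / (k - 1)
ms_within = ssw / (n * k - k)
f_stat = ms_between / ms_within
r_squared = ssb / (ssb + ssw)

print(f"SSB={ssb:,.0f} SSW={ssw:,.0f}")
print(f"F={f_stat:.1f} on ({k - 1}, {n * k - k}) d.f., R^2={r_squared:.0%}")
```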
Beyond one-way ANOVA
Often, you may want to test more than 1 treatment. ANOVA can accommodate more than 1 treatment or factor, so long as they are independent. Again, the variation partitions beautifully!
TSS = SSB1 + SSB2 + SSW

Linear regression review
What is “Linear”?
 Remember this:
 Y = mX + B?
[Figure: a straight line with slope m and intercept B]
What’s Slope?
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
Regression equation…
Expected value of y at a given level of x:
$$E(y_i \mid x_i) = \alpha + \beta x_i$$
Predicted value for an individual…
$$y_i = \alpha + \beta x_i + \text{random error}_i$$
 $\alpha + \beta x_i$: fixed, exactly on the line
 $\text{random error}_i$: follows a normal distribution
Assumptions (or the fine print)
 Linear regression assumes that…
 1. The relationship between X and Y is linear
 2. Y is distributed normally at each value of X
 3. The variance of Y at every value of X is the same (homogeneity of variances)
 4. The observations are independent**
 **When we talk about repeated measures starting next week, we will violate this assumption and hence need more sophisticated regression models!
The standard error of Y given X (Sy/x) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
[Figure: regression line with the same spread Sy/x marked at each value of X]
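Regression Picture
[Figure: scatterplot with the fitted line $\hat{y}_i = \beta x_i + \alpha$ and the naïve mean $\bar{y}$. For an observation $y_i$: A is the distance from $y_i$ to $\bar{y}$, B is the distance from $\hat{y}_i$ to $\bar{y}$, and C is the distance from $y_i$ to $\hat{y}_i$.]
*Least squares estimation gave us the line (β) that minimized C^2
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
A^2 = B^2 + C^2
SStotal = SSreg + SSresidual
 SStotal (A^2): total squared distance of observations from naïve mean of y. Total variation.
 SSreg (B^2): distance from regression line to naïve mean of y. Variability due to x (regression).
 SSresidual (C^2): variance around the regression line. Additional variability not explained by x; this is what the least squares method aims to minimize.
R^2 = SSreg/SStotal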
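The decomposition can be seen on a tiny dataset. This is a sketch in plain Python; the five (x, y) points are hypothetical, and the slope uses the standard closed-form least squares solution (sum of cross-products over sum of squares of x), which the slides do not spell out.

```python
from statistics import mean

def least_squares(x, y):
    """Return (alpha, beta) for the line y = alpha + beta*x that minimizes squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

alpha, beta = least_squares(x, y)
y_bar = mean(y)
y_hat = [alpha + beta * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)                    # A^2 terms
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)                  # B^2 terms
ss_residual = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))    # C^2 terms
r_squared = ss_reg / ss_total

print(alpha, beta, ss_total, ss_reg, ss_residual, r_squared)
```

For these points SSreg and SSresidual sum to SStotal, and R^2 is their ratio SSreg/SStotal.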
Recall example: cognitive function and vitamin D
 Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men.
 Cognitive function is measured by the Digit Symbol Substitution Test (DSST).
1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D
Mean= 63 nmol/L
Standard deviation = 33 nmol/L
Distribution of DSST
Normally distributed
Mean = 28 points
Standard deviation = 10 points
Four hypothetical datasets
 I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):
  0
  0.5 points per 10 nmol/L
  1.0 points per 10 nmol/L
  1.5 points per 10 nmol/L
Dataset 1: no relationship
Dataset 2: weak relationship
Dataset 3: weak to moderate relationship
Dataset 4: moderate relationship
The “Best fit” line
Dataset 1. Regression equation: E(Yi) = 28 + 0*vit Di (in 10 nmol/L)
Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is!
Dataset 2. Regression equation: E(Yi) = 26 + 0.5*vit Di (in 10 nmol/L)
Dataset 3. Regression equation: E(Yi) = 22 + 1.0*vit Di (in 10 nmol/L)
Dataset 4. Regression equation: E(Yi) = 20 + 1.5*vit Di (in 10 nmol/L)
Note: all the lines go through the point (63, 28)!
Significance testing…
Slope
Distribution of slope: $\hat{\beta} \sim T_{n-2}(\beta, \text{s.e.}(\hat{\beta}))$
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
$$T_{n-2} = \frac{\hat{\beta} - 0}{\text{s.e.}(\hat{\beta})}$$
Example: dataset 4
 Slope (beta) = 0.15 points per nmol/L (i.e., 1.5 points per 10 nmol/L)
 Standard error (beta) = 0.03
 T98 = 0.15/0.03 = 5, p<.0001
 95% Confidence interval = 0.09 to 0.21
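The test statistic and interval for dataset 4 follow directly from the slope and its standard error. This is a sketch in plain Python; the critical value 1.984 is an assumed, approximate two-sided 95% value for 98 degrees of freedom taken from a t table, since the standard library has no t distribution.

```python
beta_hat = 0.15   # estimated slope, DSST points per nmol/L (dataset 4)
se_beta = 0.03    # standard error of the slope
n = 100           # men in the hypothetical study
df = n - 2        # degrees of freedom for the slope test

# T statistic for H0: beta = 0
t_stat = (beta_hat - 0) / se_beta

# Assumption: approximate 97.5th percentile of t with 98 d.f., from a t table
t_crit = 1.984
ci_low = beta_hat - t_crit * se_beta
ci_high = beta_hat + t_crit * se_beta

print(f"T{df} = {t_stat:.1f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

Rounded to two decimals the interval matches the 0.09 to 0.21 on the slide. Because it excludes 0, it agrees with rejecting H0.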

Multiple linear regression…
 What if age is a confounder here?
  Older men have lower vitamin D
  Older men have poorer cognition
 “Adjust” for age by putting age in the model:
  DSST score = intercept + slope1 × vitamin D + slope2 × age
2 predictors: age and vit D…
Different 3D view…
Fit a plane rather than a line…
On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.
Equation of the “Best fit” plane…
 DSST score = 53 + 0.0039 × vitamin D (in 10 nmol/L) - 0.46 × age (in years)
 P-value for vitamin D >> .05
 P-value for age < .0001
 Thus, relationship with vitamin D was due to confounding by age!
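Evaluating the fitted plane at a few points makes "held constant" concrete. This is a sketch in plain Python; the function name `predicted_dsst` and the example ages and vitamin D levels are illustrative choices, and only the three coefficients come from the slide.

```python
def predicted_dsst(vit_d_10nmol, age_years):
    """Fitted plane from the slide: 53 + 0.0039*vitamin D (per 10 nmol/L) - 0.46*age (years)."""
    return 53 + 0.0039 * vit_d_10nmol - 0.46 * age_years

# Effect of one extra unit of vitamin D (10 nmol/L) at two different ages
effect_at_50 = predicted_dsst(7.3, 50) - predicted_dsst(6.3, 50)
effect_at_70 = predicted_dsst(7.3, 70) - predicted_dsst(6.3, 70)

# Effect of ten extra years of age at a fixed vitamin D level
age_effect = predicted_dsst(6.3, 60) - predicted_dsst(6.3, 50)

print(f"{effect_at_50:.4f} {effect_at_70:.4f} {age_effect:.1f}")
```

The vitamin D effect is the same tiny 0.0039 points at either age, while ten years of age shifts the prediction by 4.6 points. That is the pattern behind the confounding conclusion.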
Multiple Linear Regression
 More than one predictor…
$$E(y) = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z + \dots$$
Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
Functions of multivariate
analysis:
 Control for confounders
 Test for interactions between predictors
(effect modification)
 Improve predictions
ANOVA is linear regression!
 Divide vitamin D into three groups:
  Deficient (<25 nmol/L)
  Insufficient (>=25 and <50 nmol/L)
  Sufficient (>=50 nmol/L), reference group
$$\text{DSST} = \alpha + \beta_{\text{insufficient}} \times (1 \text{ if insufficient}) + \beta_{\text{deficient}} \times (1 \text{ if deficient})$$
where α is the value for the sufficient group.
This is called “dummy coding”, where multiple binary variables are created to represent being in each category (or not) of a categorical variable.
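Dummy coding for the three vitamin D groups can be sketched as follows. The function name `dummy_code` and the sample values are hypothetical; the cut-offs are the ones listed above, with the sufficient group as the reference (both indicators zero).

```python
def dummy_code(vit_d_nmol_l):
    """Return (deficient, insufficient) 0/1 indicators; sufficient is the reference group."""
    deficient = 1 if vit_d_nmol_l < 25 else 0
    insufficient = 1 if 25 <= vit_d_nmol_l < 50 else 0
    return deficient, insufficient

# Hypothetical vitamin D levels in nmol/L
for level in (10, 30, 63):
    print(level, dummy_code(level))
```

Two indicators are enough for three categories, because a man who is neither deficient nor insufficient must be in the reference group.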
The picture…
[Figure: DSST by vitamin D group; the marked contrasts are Sufficient vs. Insufficient and Sufficient vs. Deficient]
Results…
Parameter Estimates
Variable       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1    40.07407             1.47817          27.11     <.0001
deficient      1    -9.87407             3.73950          -2.64     0.0096
insufficient   1    -6.87963             2.33719          -2.94     0.0041
 Interpretation:
  The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
  The insufficient group has a mean DSST 6.88 points lower than the reference (sufficient) group.
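The parameter estimates translate back into group means, which is the sense in which ANOVA is linear regression. This sketch in plain Python uses only the numbers in the table; the dictionary names are illustrative.

```python
# Estimates and standard errors from the Parameter Estimates table
estimates = {"Intercept": 40.07407, "deficient": -9.87407, "insufficient": -6.87963}
std_errors = {"Intercept": 1.47817, "deficient": 3.73950, "insufficient": 2.33719}

# Intercept = mean of the reference (sufficient) group;
# each dummy coefficient = that group's mean minus the reference mean
group_means = {
    "sufficient": estimates["Intercept"],
    "deficient": estimates["Intercept"] + estimates["deficient"],
    "insufficient": estimates["Intercept"] + estimates["insufficient"],
}

# Each t value is the estimate divided by its standard error
t_values = {name: estimates[name] / std_errors[name] for name in estimates}

for name, value in group_means.items():
    print(f"mean DSST, {name}: {value:.2f}")
for name, value in t_values.items():
    print(f"t value, {name}: {value:.2f}")
```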
