
What Statistics Books Try To Teach You But Don't
Joe King
University of Washington
Contents
I Introduction to Statistics 4
1 Principles of Statistics 5
1.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Types of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Sample vs. Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Type I & II Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 What does Rejecting Mean? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Writing in APA Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Final Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Description of A Single Variable 8
2.1 Where's the Middle? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Skew and Kurtosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Testing for Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Final Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
II Correlations and Mean Testing 13
3 Relationships Between Two Variables 14
3.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Pearson's Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 R Squared . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 Point Biserial Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5 Spurious Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 Final Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Means Testing 16
4.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.1 Independent Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.2 Dependent Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.3 Effect Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
III Latent Variables 20
5 Latent Constructs and Reliability 21
5.1 Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
IV Regression 22
6 Regression: The Basics 23
6.1 Foundation Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2 Final Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3 Bibliographic Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7 Linear Regression 25
7.1 Basics of Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.1.1 Sums of Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.2.1 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7.2.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
7.3 Interpretation of Parameter Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7.3.1 Continuous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7.3.1.1 Transformation of Continuous Variables . . . . . . . . . . . . . . . . . . . . . 29
7.3.1.1.1 Natural Log of Variables . . . . . . . . . . . . . . . . . . . . . . . . 30
7.3.2 Categorical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.3.2.1 Nominal Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.3.2.2 Ordinal Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.4 Model Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.5 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.6 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.6.1 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.6.1.1 Normality of Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.6.1.1.1 Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.6.1.1.2 Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.7 Final Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
8 Logistic Regression 36
8.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2 Regression Modeling Binomial Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.2 Regression for Binary Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.2.1 Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.2.2.2 Probit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.2.2.3 Logit or Probit? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.2.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.3 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Part I
Introduction to Statistics
Chapter 1
Principles of Statistics
Statistics is scary to most students, but it does not have to be. The trick is to build up your knowledge base one step at a time so you acquire the building blocks necessary to understand more advanced statistics. This book moves from a very simple understanding of variables and statistics to more complex analyses for describing data. It gives several formulas for calculating parameters, yet rarely will you have to work these out on paper or enter the numbers into a spreadsheet. This first chapter looks at some of the basic principles of statistics: the basic concepts needed to understand statistical inference. They may seem simple, and many may already be familiar, but it is best to start any work on statistics with the basic principles as a strong foundation.
1.1 Variables
First we start with the basics. What is a variable? Essentially, a variable is a construct we observe. There are two kinds of variables: manifest (or observed) variables and latent variables. Latent variables are ones we cannot measure directly; we infer them by measuring other manifest variables (socio-economic status is a classic example). Manifest variables we measure directly and can model, or we can use them to construct more complex latent variables; for example, we may measure parents' education and parents' income and combine those into the construct of socio-economic status.
1.1.1 Types of Variables
There are four primary categories of manifest variables: nominal, ordinal, interval, and ratio. The first two are categorical variables. Nominal variables are strictly categorical and have no discernible hierarchy or order; examples include race, religion, or state of residence. Ordinal variables are also categorical but have a natural order. Likert scales (strongly disagree, disagree, neutral, agree, strongly agree) are one of the most common examples of an ordinal variable. Other examples include class standing (freshman, sophomore, junior, senior) and level of education obtained (high school, bachelor's, master's, etc.).
The continuous variables are interval and ratio. These do not have a set number of categories but can take any value between two values. Exam scores are a continuous variable; your score may take any value from 0% to 100%. An interval scale has no absolute zero, so we cannot make judgements about the ratio between two values. Temperature is a good example: Celsius and Fahrenheit have no meaningful absolute zero within the temperatures we experience, so we cannot say 30 degrees Fahrenheit is twice as warm as 15 degrees Fahrenheit. A ratio scale is still continuous but has an absolute zero, so we can make such judgements: a student who got an 80% on an exam did twice as well as the student who got a 40%.
1.1.2 Sample vs. Population
One of the primary interests in statistics is generalizing from a sample to a population. A population does not always have to be the population of a state or nation, as we usually think of the word. Suppose the head of UW Medicine asked me to run a workplace climate survey of all the nursing staff at UW Medical Center. While there are a lot of nurses there, I could conceivably give my survey to each and every one of them. I would then have no problem of generalizability, because I would know the attitudes of my entire population.
Unfortunately statistics is rarely this clean, and you will usually not have access to an entire population. I must therefore collect data that is representative of the population I want to study; this is a sample. This matters because different notation is used for samples versus populations. For example,
x̄ generally denotes a sample mean, while μ is used for the population mean. Rarely will you know the population mean, which is where this becomes a huge issue. Many statistics books put all their notation at the beginning of the book, yet I feel this is not a good idea. I will introduce notation as it becomes relevant, and discuss it specifically when necessary. Do not be alarmed if you find yourself coming back to earlier chapters to recall notation; it happens to everyone, and committing it to memory is truly a lifelong affair.
1.2 Terminology
There is also the matter of terminology, discussed here before the primary methods because the terminology can get confusing. Unfortunately statistics tends to change its terminology and to have multiple words for the same concept, which differ between journals, disciplines, and coursework.
One area where this is most true is in talking about types of variables. We classified variables above by how they are measured, but how they fit into our research question is different. Basic statistics books still talk about variables as independent or dependent. Although these terms have fallen out of favor in a lot of disciplines, especially in the methodology literature, they still carry weight and will be used here. We will identify which variables are independent and dependent for each model when we get to it, but in general the dependent variable is the one we are interested in knowing about. In short, we want to know how our independent variables influence our dependent variable(s).
Of course there are different names for the dependent and independent variables depending on what we are studying. Almost universally the dependent variable is also called the outcome variable, which seems justified given it is the outcome we are studying. It is the independent variable that has been given many names: regressor (in regression models), predictor (again generally in regression), or covariate. I prefer the second term and do not like the third. Regressor seems too tied to regression modelling and not as general as predictor. Covariate has different meanings for different tests, so in my opinion it can be confusing. Predictor can also be confusing, because some people may conflate it with causation, which would be a very wrong assumption to make. I will usually use the terms independent variable or predictor, for lack of better options and because these are the most common in the literature.
1.3 Hypothesis Testing
The basis from which we start our research is the null hypothesis, which simply says there is no relationship between the variables we are studying. When we reject the null hypothesis, we accept the alternative hypothesis, which says the null hypothesis is not true and there is a significant relationship between the variable(s) we are studying.
1.3.1 Assumptions
There are many assumptions we must make in our analysis in order for our coefficients to be unbiased.
1.3.2 Type I & II Error
So we have a hypothesis associated with a research question. This book will look at ways to explore hypotheses and how we can either support or fail to support them. First we must establish a few basics about hypothesis testing. We need some basis for determining whether the claims we are testing are true, yet we also do not want to make hasty judgements about whether our hypothesis is correct. This leads to two primary errors. A Type I error is rejecting the null hypothesis when it is in fact true. A Type II error is failing to reject the null hypothesis when it is in fact false. While we attempt to avoid both types of error, the latter is more acceptable than the former, because we do not want to hastily claim an important relationship between variables when none exists. If we say there is no relationship when in
fact there is one, that is the more conservative error, and one that future research will hopefully correct.
1.3.3 What does Rejecting Mean?
When we try to reject the null hypothesis, we must first set our critical value, which by convention is generally 0.05. Whether this convention is still of practical use given today's computing technology is currently debated. When we reject the null hypothesis, all we are saying is that the chance of finding a result as large or larger, if the null were true, is less than the significance level. This does not mean that your research question merits any major practical effect. Rejecting the null hypothesis may be important, but so can failing to reject it. For example, if lower-income and higher-income groups at a school were performing significantly differently on exams five years ago, and I came in and tested again and found no statistically significant difference, I would find that highly important: it would mean the test scores had changed and there is now some relative parity.
The next concern is practical significance. My result may be statistically significant, yet there may be no real reason to think it will make a difference if implemented in policy or clinical settings. This is where other measures come into play, like effect sizes, which will be discussed later. One should also note that large sample sizes can make even a very small statistic statistically significant, and a small sample size can mask a significant result. All of these must be considerations. One should not take a black-and-white approach to answering research questions; something is not simply significant or not.
1.4 Writing in APA Style
One thing to be careful about is writing up your results and presenting them in a manner that is both ethical and concise. This includes graphics, tables, and paragraphs: they should make the main points you want to convey without misrepresenting your results. If you are going to do a lot of writing for publication, you should pick up a copy of the APA Manual (American Psychological Association, 2009).
1.5 Final Thoughts
A lot was covered in this first chapter. These concepts will be revisited in later sections as we begin to implement them, and many books and articles expand on them further. I ask that you keep an open mind as researchers and realize that statistics can never tell us the truth; it can only hint at it, or point us in the right direction, and the process of scientific inquiry never ends.
Chapter 2
Description of A Single Variable
When we have variables, we want to understand their nature. Our first job is to describe our data before we start any tests. There are two things we want to know about our data: where the center of its mass is, and how far from that center the data are distributed. The middle is captured by measures of central tendency (discussed momentarily); the spread around the middle tells us how much variability there is in our data, also called uncertainty or dispersion. These concepts are more generally known as the location and scale parameters: location is where on the real number line the middle of the distribution lies, and scale is how far away from the middle the data go. These concepts are common to all statistical distributions, although for now our focus is on the normal distribution. This is also known as the Gaussian distribution and is widely used in statistics for its satisfying mathematical properties, which allow us to run many types of analyses.
2.1 Where's the Middle?
The best way to describe data is to use measures of central tendency, or what the middle of a set of values is. These include the mean, median, and mode.
The equation for the mean is in 2.1. It has some notation that requires discussion, as you will see it in a lot of formulas. The Σ is the summation sign, which tells us to sum everything to its right. The i = 1 below the summation sign simply means start at the first value of the variable, and the N at the top means go all the way to the end (the number of responses seen in that variable).

\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}    (2.1)
If we return to our x vector we get 2.2:

\bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3    (2.2)
Our mean is influenced by all the numbers equally, so our example variable y gives a different mean, by formula 2.3:

\bar{y} = \frac{1 + 1 + 2 + 3 + 4 + 5}{6} = \frac{16}{6} = 2.67    (2.3)
The addition of the extra one weighed our mean down. As we will see, individual values can change our mean dramatically, especially when the number of values is low. Finally, we represent the mean in several ways: the Greek letter μ represents the population mean, while the mean of a sample is denoted with a bar on top, so we would say x̄ = 3. The mean is also known as the expected value, so we can write
it as E(x) = 3.
For categorical data there are two useful measures. The first is the median, which is simply the middle number of an ordered set, so for the set of values in 2.4:

Median of {1, 2, 3, 4, 5} = 3    (2.4)

If there is an even number of values, we take the mean of the two middle values, as in 2.5:

Median of {1, 1, 2, 3, 4, 5} = (2 + 3)/2 = 2.5    (2.5)
The mode is simply the most common number in a set; in the last example, 1 is the mode since it occurs twice while the others occur once. You may have bi-modal data, where two numbers tie for the most occurrences, or even more modes.
These last two measures of the middle of a distribution are of most interest for categorical data. The mode is rarely useful for interval or ordinal data, although the median can be of help with such data. The mean, more commonly referred to as the average, is computed by taking the sum of all the values and dividing by the number of values; it is the most relevant measure for continuous data and the one that will be used most in statistics.
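As a small illustration, here is how these measures look in R (the software used later in this book) for the example vectors. Base R has no built-in mode function for this purpose, so a tiny helper is defined here:

x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 3, 4, 5)

mean(x)     # 3
mean(y)     # 2.67
median(x)   # 3
median(y)   # 2.5, the mean of the two middle values

# a small helper to find the most common value(s)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(y)   # 1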
2.2 Variation
We now know how to get the mean, but much of the time we also want to know how much variation is in our data. When we talk about variation we are asking why we get different values in the data set. Going back to our previous example of [1, 2, 3, 4, 5], we want to know why we got these values and not all 3s or 4s. A more practical example: why does one student score 40 on an exam, another 80, another 90, another 50, and so on? This measure of variation is called the variance. It is also called the dispersion parameter in the statistics literature, and the word dispersion will come up in discussion of other models.
The variance is found by first taking the difference between each value and the sample mean. Those differences are squared, and their sum is divided by the number of observations, as shown in formula 2.6 for the variable x. Taking the square root of the variance gives the standard deviation.
Var(x) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}    (2.6)
Formula 2.7 below shows how we take the formula above and use our previous variable x to calculate the
sample variance.
9
CHAPTER 2. DESCRIPTION OF A SINGLE VARIABLE 2.3. SKEW AND KURTOSIS
Var(x) = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5}
       = \frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5}
       = \frac{4 + 1 + 0 + 1 + 4}{5}
       = \frac{10}{5}
       = 2    (2.7)
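A quick check of this arithmetic in R. Note that R's built-in var() divides by N - 1 rather than N, so the formula in 2.6 is computed directly here:

x <- c(1, 2, 3, 4, 5)
sum((x - mean(x))^2) / length(x)   # 2, the variance as defined in equation 2.6
var(x)                             # 2.5, R's sample variance (divides by N - 1)
sqrt(var(x))                       # the standard deviation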
A plot of the normal distribution with lines pointing to the distance between 1, 2 and 3 standard deviations
is shown in 2.1.
Figure 2.1: Normal Distribution (bands at 1 standard deviation = 68.2%, 2 standard deviations = 95.4%, 3 standard deviations = 99.7%)
Now we start getting into the discussion of distributions, specifically the normal distribution. The standard deviation is one property of the normal distribution. It is a great way to understand how the data are spread out and gives us an idea of how close to the mean our sample is. The rule for the normal distribution is that about 68% of the population is within one standard deviation of the mean, about 95% is within two standard deviations, and about 99.7% is within three standard deviations. This is shown in Figure 2.1, which has a mean of 20 and a standard deviation of 2.
There are two other summaries of variation that are good to see. One is the interquartile range, which covers the middle 50% of the data, from the 25th percentile up to the 75th percentile. A good graphing technique for this is the box and whisker plot, shown in Figure 2.2. The line in the middle of the box is the middle of the distribution, the box is the interquartile range, the whiskers extend beyond the box (commonly to 1.5 times the interquartile range), and the dots beyond the whiskers are outliers, data points unusually far from the rest.
2.3 Skew and Kurtosis
Two other concepts that help us evaluate a single variable are skew and kurtosis. They are not talked about as much, but they are still important. Skew is when more of the sample lies on one side of the mean than the other. Negative skew is where the peak of the curve is to the right of the mean (the tail going to the left); positive skew is where the peak is to the left and the tail goes to the right.
Kurtosis is how flat or peaked a distribution looks. A distribution with a more peaked shape is called leptokurtic, and a flatter shape is called platykurtic. Although skewness and kurtosis can make a distribution violate normality, they do not always.
Figure 2.2: Box and Whisker Plot
2.4 Testing for Normality
Can we test for normality? We can, and should. One way is to use descriptive statistics and look at a histogram. We can overlay a normal curve on the histogram and see whether the data look normal. This is not a test per se, but it gives a good idea of what our data look like. This is shown in Figure 2.3.

Figure 2.3: A histogram of the normal distribution above with the normal curve overlaid
We could also examine a P-P plot. This is a plot with a reference line at a 45 degree angle going from the bottom left to the upper right of the plot; the closer the points are to that line, the closer the distribution is to normality. The same principle is behind the Q-Q plot (Q meaning quantiles).
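A rough sketch of these checks in R, using a simulated normal variable since the data behind the figures are not included here:

set.seed(1)
z <- rnorm(200, mean = 20, sd = 2)    # simulated variable standing in for the one plotted

hist(z, freq = FALSE)                                    # histogram on the density scale
curve(dnorm(x, mean = mean(z), sd = sd(z)), add = TRUE)  # normal curve overlaid
qqnorm(z); qqline(z)                                     # Q-Q plot against the reference line
shapiro.test(z)                                          # a formal test of normality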
2.5 Data
I will try to give examples of data analysis and its interpretation. One good data set is on cars released in 1993 (Lock, 1993); the names of the variables and more information on the data set can be found in Appendix ??.
2.6 Final Thoughts
A lot of concepts necessary for a basic understanding of statistics were discussed here, but do not feel you have to have this entire chapter memorized; you may need to come back to these concepts from time to time. Do not focus on memorizing formulas either; focus on what the formulas tell you about the concepts. With today's computing power your concern will be understanding what the output is telling you and how to connect that to your research question. While it is good to know how the numbers are
calculated, the point is to understand how to use them in your tests.
Part II
Correlations and Mean Testing
Chapter 3
Relationships Between Two Variables
In the first part of this book we looked at describing variables. Now we look at how variables are related and how to test the strength of those relationships. This is a difficult task; it takes time to master not only the concepts but their implementation. Course homework is actually the easiest way to do statistics: you are given a research question, told what to run, and asked to report your results. In real analysis you will have to decide for yourself which test best fits your data and your research question. While I will provide some equations, it is best to look at them just to see what they are doing and what they mean; it is less important to memorize them. This part will look at basic correlations and testing of means (t-tests and ANOVA).
Much of statistics is correlational research: research where we look at how one variable changes when another changes, yet causal inferences will not be made. It is very tempting to use the word cause or to imply some directionality in your research, but you need to refrain from doing so unless you have a lot of evidence to justify it, as the standards for determining causality are high. If you wish to learn more about causality, see Pearl (2009a; 2009b).
3.1 Covariance
Before discussing correlations we have to discuss the idea of covariance. One of the most basic ways to associate variables is with a variance-covariance matrix. A matrix is like a spreadsheet, each cell holding a value. The diagonal going from upper left to lower right holds the variance of each variable (since it is the same variable in the row as in the column); the other cells hold the covariance between two variables. The idea of covariance is similar to variance, except we want to know how one variable varies with another: if one changes in one direction, how does the other variable change? Do note we are only talking about continuous variables here (for the most part interval and ratio scales are treated the same, and the distinction is rarely made in statistical testing, so when I say continuous it may be either interval or ratio without compromising the analysis). The formula for covariance is in 3.1.
Cov(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}    (3.1)
As one can see, it takes the deviations from each mean, multiplies them together, sums them, and divides by the sample size. This gives a good measure of the relationship between the two variables. While this concept is necessary and a bedrock of many statistical tools, it is not very intuitive: it is not standardized in any way that allows a quick reading of the strength of the relationship. This is what leads us to correlations.
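A small R illustration with made-up numbers; note that R's cov() divides by N - 1 rather than by N as in 3.1:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)    # an invented second variable

sum((x - mean(x)) * (y - mean(y))) / length(x)   # covariance as in 3.1 (divides by N)
cov(x, y)                                        # R's version divides by N - 1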
3.2 Pearson's Correlation
A correlation is essentially a standardized covariance: we divide the covariance by the standard deviations, as in 3.2:
r_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}    (3.2)
If we dissect this formula it is not as scary as it looks. The numerator is essentially the covariance (before dividing by N). The denominator is the variance of x and the variance of y multiplied together; taking the square root simply converts that to standard deviations. This puts the correlation coefficient on a metric of -1 to 1. A correlation of 0 means no association whatsoever, and a correlation of 1 is a perfect correlation. So let's say
we are looking at the association of temperatures between two cities: if city A's temperature went up by one degree, city B's would also go up by one degree if their correlation were 1 (remember that a correlation is standardized, so it ignores the units of measurement). If the correlation is -1, it is a perfect inverse correlation, so if the temperature in city A goes up one degree, city B's goes DOWN one degree. In social science, correlations are never this clean or clear to interpret, and since the underlying metrics can differ you must be careful about when you compute a correlation and how you interpret it. Also remember a correlation is non-directional: an r of .5 describes the association between city A and city B symmetrically; it does not say that one city's temperature drives the other's.
Pearson's correlations are reported with an r and then the coefficient, followed by the significance level; for example, r = 0.5, p < .05 if significant.
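A minimal R sketch with invented temperature readings for the two cities; cor.test() returns the significance test that goes with the reported r:

cityA <- c(55, 60, 62, 58, 65, 70, 68)   # invented daily temperatures
cityB <- c(54, 63, 60, 59, 66, 72, 67)

cor(cityA, cityB)        # Pearson's r
cor.test(cityA, cityB)   # r together with its significance test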
3.3 R Squared
When we have a Pearson's correlation coefficient we can take its square, which is called the proportion of variance explained. So if we get a correlation of .5, its square is .25, and we can say that 25% of the variation in one variable is accounted for by the other variable. Of course, as the correlation increases, so does the amount of variance explained.
3.4 Point Biserial Correlation
One special case where a categorical variable can be used with Pearson's r is the point-biserial correlation. If you have a binary variable, you can calculate its correlation with a continuous variable. This is similar to the t-test we will examine later: it looks at whether there is a significant difference between the two groups of the dichotomous variable. When we ask whether it is significant, we want to determine whether the difference is due to random chance. We already know there will be random variability in any sample we take; we want to know whether the difference between the two groups is due to this randomness or to a genuine difference between the groups.
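Because the point-biserial correlation is just Pearson's r with one variable coded 0/1, it can be computed the same way; the data below are made up for illustration:

group <- c(0, 0, 0, 0, 1, 1, 1, 1)            # a dichotomous variable coded 0/1
score <- c(40, 55, 50, 45, 70, 65, 80, 60)    # a continuous variable

cor(group, score)        # the point-biserial correlation
cor.test(group, score)   # its significance test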
3.5 Spurious Relationships
So let's say we get a Pearson's r = .5; what now? Can we say there is a direct relationship between the variables? No, because we don't know whether the relationship is direct or not. There are many examples of spurious relationships. For example, if I looked at the rate of illness students report to their university health center and the relative timing of exams, I would most likely find a moderate correlation. Now, before any student starts using this as a reason to cancel tests, there is no reason to believe your exams are causing you to get sick! What is it then? Something we DIDN'T measure: stress. Stress weakens the immune system, and stress is higher during periods of examinations, so you are more likely to get ill. If we just looked at correlations we would only be looking at the surface, so take the results but use them with caution, as they may not be telling the whole story.
3.6 Final Thoughts
This may seem like a short chapter given the heavy use of correlations, but much of it will be reused in future statistical analyses. The primary point to take away is that none of this measures causality, and that point cannot be stressed enough. Correlations are a good way of looking at associations, but that is all; they help us explore data and work towards more advanced statistical models that can support or fail to support our hypotheses. Correlations can be used, but use them with caution.
Chapter 4
Means Testing
This chapter goes a bit deeper into exploring the differences between groups. If we have a nominal or ordinal variable and we want to see whether its categories differ statistically on a continuous variable, there are several tests we can use. We already looked at the point-biserial correlation, which is one such test. This chapter examines the t-test, which gives a bit more detail, and Analysis of Variance (ANOVA), which applies when the number of groups is greater than 2 (the letter denoting the number of groups is generally k, as n denotes sample size, so ANOVA applies when k > 2, i.e. k ≥ 3). Here we want to know whether the difference in the means is statistically significant.
4.1 Assumptions
The first assumption we make is that the continuous variable we are measuring is normally distributed; we learned to check that earlier. Another assumption is homogeneity of variance, which means the variance is the same for both groups. It does not have to be exactly the same, only similar; it will differ somewhat due to randomness, and the question is whether it differs enough to be statistically different. If this assumption is untenable we have to correct the degrees of freedom, which influences whether our t-statistic is significant.
This can be shown in the two figures below. Figure 4.1 shows a difference in means (means of 10 and 20) but with the same variance of 4.
Figure 4.1: Same Variance (two distributions with means of 10 and 20 and equal variance of 4)
Figure 4.2 has the same means, but one variance is 4 and the other is 16 (a standard deviation of 4).
4.2 T-Test
The t-test is similar to the point-biserial in that we want to know whether two groups are statistically different.
Look at the first equation, 4.1: the numerator is the difference between the means, and the denominator combines the variability of the two groups. The variance of a sample is denoted s², and n is
the sample size for that group. This is shown in 4.1.

Figure 4.2: Different Variances (same means, one variance of 4 and one of 16)
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}    (4.1)
The degrees of freedom are given by 4.2 (the Welch-Satterthwaite approximation):

df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}    (4.2)
The above equations allow for unequal sample sizes and variances. The equations simplify if the groups have the same variance or the same sample size, although this generally occurs only in experimental settings where sample size and other parameters can be more strictly controlled.
In the end we want to see whether there is a statistical difference between the groups. We can see how this works with data from the base year (1988) of the National Education Longitudinal Study (NELS:88). If we look at the difference in science scores by gender, we can run a t-test, and we find a significant mean difference. The means by gender are in Table 4.1.
          Mean      SD
Male      52.1055   10.42897
Female    51.1838   10.03476

Table 4.1: Means and Standard Deviations of Male and Female Test Scores
Our analysis shows t(10963) = 4.712, p < .05. However, the test of whether the variances are equal is significant, F = 13.2, p < .05, so we have to use the results that do not assume equal variances. This changes our results to t(10687.3) = 4.701, p < .05. You can see the main difference is that our degrees of freedom dropped, and thus our t-statistic dropped.
This time it did not matter; our sample size was so large that both versions were significant, but in some tests this will not be the case. If the t-test assuming equal variances rejects the null hypothesis but the test assuming unequal
variances does not, then even if Levene's test is not significant you should be cautious about how you write it up.
4.2.1 Independent Samples
The example above was an independent-samples t-test: the participants are independent of each other, and so their responses are too.
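A sketch of the independent-samples test in R. The NELS data are not included here, so scores are simulated with roughly the means and standard deviations from Table 4.1; R's t.test() applies the Welch (unequal-variance) correction by default:

set.seed(88)
male   <- rnorm(500, mean = 52.1, sd = 10.4)   # simulated science scores
female <- rnorm(500, mean = 51.2, sd = 10.0)

t.test(male, female)                     # Welch test, unequal variances (R's default)
t.test(male, female, var.equal = TRUE)   # classic test assuming equal variances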
4.2.2 Dependent Samples
This is a slightly different version of the t-test where you still have two means but the samples are not independent of each other. A classic example is a pre-test/post-test design; another is longitudinal data where a measure was collected in one year and the same measure was taken again at a later date.
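A sketch of the paired version in R with simulated pre-test and post-test scores; the paired = TRUE argument is what makes it a dependent-samples test:

set.seed(1)
pre  <- rnorm(30, mean = 50, sd = 10)
post <- pre + rnorm(30, mean = 3, sd = 5)   # the same people measured again later

t.test(post, pre, paired = TRUE)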
4.2.3 Effect Size
The effect size r is used here. The equation for it is in 4.3:
r = \sqrt{\frac{t^2}{t^2 + df}}    (4.3)
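Applying 4.3 to the t and degrees of freedom reported for the NELS example gives a very small effect, which illustrates the earlier point that a huge sample can make a trivial difference statistically significant:

t_stat <- 4.712     # t value reported for the gender difference above
dfree  <- 10963     # its degrees of freedom
sqrt(t_stat^2 / (t_stat^2 + dfree))   # about 0.045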
4.3 Analysis of Variance
Analysis of Variance (ANOVA) is used when you have more than two groups. Here we will look at race groups and standardized test scores. The problem we encounter is determining which groups are significantly different. ANOVA adds some steps to the analysis. First, all of the means are compared (the equations for this are quite complex, so we will just go through the analysis steps): we test whether any of the means are statistically different. This is called an omnibus test and follows the F distribution (the F and t distributions are similar to the normal but have fatter tails, which allows for more extreme values; this is of little consequence for applied analysis). We get an F statistic for both Levene's test and the omnibus test. In this analysis we have five group means, shown in Table 4.2:
Race                        Mean    SD
Asian, Pacific Islander     56.83   10.69
Hispanic                    46.72    8.53
Black, Not Hispanic         45.44    8.29
White, Not Hispanic         52.91   10.03
American Indian, Alaskan    45.91    8.13

Table 4.2: Means and Standard Deviations of Race Groups' Test Scores
Table 4.3 shows the mean differences. After we reject the omnibus test, we need to see where the significant differences between groups lie. We do this with post-hoc tests. For simplicity I have put the results in a matrix where the numbers are the differences between the groups; those marked with an asterisk (*) are statistically significant. This is not how SPSS presents it (it gives rows of pairwise comparisons), but this layout is easily made. There are many post-hoc tests one can do. The ones used below are Tukey and Games-Howell, and both reject the same mean-difference pairs. There are a lot more post-hoc tests, but these two do different things: Tukey adjusts for different sample sizes, while Games-Howell corrects for heterogeneity of variance. If you run a few types of post-hoc tests and the results agree, this gives credence to your conclusion. If not, you should go back to see whether there is a real difference and re-examine your assumptions.
Race Groups    Asian-PI    Hispanic    Black      White     AI-Alaskan
Asian-PI       0
Hispanic       10.1092*    0
Black          11.3907*    1.2815*     0
White          3.9193*     -6.1899*    -7.4714*   0
AI-Alaskan     10.9178*    0.8086      -0.4729    6.9985*   0

Note: PI = Pacific Islander; AI = American Indian

Table 4.3: Mean Differences Among Race Groups
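A sketch of the same workflow in R with simulated scores (the group means and SDs of Table 4.2 are used as rough targets); TukeyHSD() gives the Tukey post-hoc comparisons, while Games-Howell requires an add-on package:

set.seed(1)
scores <- data.frame(
  race  = factor(rep(c("Asian-PI", "Hispanic", "Black", "White", "AI-Alaskan"), each = 100)),
  score = c(rnorm(100, 56.8, 10.7), rnorm(100, 46.7, 8.5), rnorm(100, 45.4, 8.3),
            rnorm(100, 52.9, 10.0), rnorm(100, 45.9, 8.1))
)

fit <- aov(score ~ race, data = scores)
summary(fit)    # omnibus F test across all five groups
TukeyHSD(fit)   # Tukey post-hoc comparisons for every pair of groups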
Part III
Latent Variables
Chapter 5
Latent Constructs and Reliability
Sometimes in statistics we have constructs we want to study but cannot measure directly. This means we have to use multiple measures (called manifest variables), which together measure the construct we are trying to understand. Unfortunately we cannot say for certain that the variables we measure are informative about the overall construct we want to study, so we need measures to test this. Some examples of latent variables are socio-economic status and intelligence. We cannot measure socio-economic status directly, but we can look at income, education, neighborhood, and other measures to get an overall gauge of the construct.
5.1 Reliability
One measure of reliability (also called internal consistency) is Cronbach's alpha. It is on a metric from 0 to 1: the closer to one, the more reliable the measure. A measure below .7 is usually considered too weak to be reliable, but the required level depends on your measure (tests with critical consequences, like standardized test scores, may be held to a higher standard). In the end it comes down to the researcher to defend whether a measure is reliable or not.
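As a rough sketch, Cronbach's alpha can be computed directly from its definition in R (packages such as psych also provide it); the items below are simulated:

set.seed(1)
latent <- rnorm(200)                                    # the unobserved construct
items  <- sapply(1:4, function(i) latent + rnorm(200))  # four noisy indicators of it

k         <- ncol(items)
item_vars <- apply(items, 2, var)
total_var <- var(rowSums(items))
(k / (k - 1)) * (1 - sum(item_vars) / total_var)        # Cronbach's alpha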
Part IV
Regression
Chapter 6
Regression: The Basics
Regression techniques make up a major portion of social science statistical inference. Regression models are also called linear models (this will be generalized later, but for now we will stick with linear models) because we try to fit a line to our data. These methods allow us to create models to predict variables of interest. This section will go quite deep, since regression involves a lot of concepts, but as in past portions of this book we will take it one step at a time, starting with basic principles and moving to more advanced ones. The principle of regression is that we have a set of variables (known as predictors, or independent variables) that we want to use to predict an outcome (known as the dependent variable, a term that has fallen out of favor in more advanced statistics classes and works). We then have a slope for each independent variable, which tells us the relationship between that predictor and the outcome.
If you find yourself not understanding something, come back to the more fundamental portions of regression and it will sink in. This method is so broad that people spend careers learning and using it, so you are not expected to pick it up in one quarter; you are just laying the foundations for its use.
6.1 Foundation Concepts
So how do we predict an outcome? It comes back to the concept of variance. Remember that early in this book we looked at variance as simply variation in a variable: different cases have different values (i.e. different students have different scores on a test). Regression allows us to use a set of predictors to explain the variation in our outcome.
Now we will look at the equations themselves and the notation we will use. The basic equation of a regression model (or linear model) is 6.1.
y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \epsilon    (6.1)
This basic equation may look scary, but it is not. It has some basic parts that will be relevant to understanding these models, so let us go left to right. The y is our outcome variable, the variable whose behavior we want to predict. The β₀ is the intercept of the model, where the regression line crosses the y-axis on a coordinate plane. The βᵢxᵢ term is actually two components together: the x's are the predictor variables, and the β's are the slopes for each predictor, telling us the relationship between that predictor and the outcome variable. The summation sign is there, yet unlike other times it has been used, at the top is the letter p instead of n: p stands for the number of predictors, so we sum over predictors rather than cases. The ε is the error term, which takes into account the variability in the model that the predictors do not explain.
6.2 Final Thoughts
This brief chapter introduced regression as a concept, or more generally linear modeling. I do not say linear regression (the subject of the next chapter), as that is just one form of regression; many more types will be covered in future chapters. There are many books on regression, and at the end of each chapter I will note very good ones. One extraordinary one is Gelman and Hill (2007), which I refer to a lot in creating this chapter.
6.3 Bibliographic Note
Many books have been written on regression. I have used many as inspiration and references for this work, although much of the information is freely available online. On top of Gelman and Hill (2007) for doing regression, there are Everitt, Hothorn, and Group (2010), Chatterjee and Hadi (2006), the free book Faraway (2002), and other excellent books available for purchase (Faraway, 2004; Faraway, 2005). More theory-based books are Venables and Ripley (2002), Andersen and Skovgaard (2010), Bingham and Fry (2010), Rencher and Schaalje (2008), and Sheather (2009). As you can tell, most of these books use R, which is my preferred statistical package. Some books focus on SPSS and do a good job of it, one notable one being Field (2009); more advanced but still very good are Tabachnick and Fidell (2006) and Stevens (2009). Stevens (2009) would not make a good textbook but is an excellent reference, including SPSS and SAS instructions and syntax for almost all multivariate applications in the social sciences, and is a necessary reference for any social scientist.
Chapter 7
Linear Regression
Let's focus for a while on one type of regression: linear regression. This requires an outcome variable that is continuous and normally distributed. When we have a continuous, normally distributed outcome we can use least squares to calculate the parameter estimates. Other forms of regression use maximum likelihood, which will be discussed in later chapters (although for this model the least squares estimates are also the maximum likelihood estimates).
7.1 Basics of Linear Regression
This first regression technique we will learn, and the most common one used, is for an outcome that is continuous in nature (interval or ratio, it does not matter). Linear regression uses an analytic technique called least squares. We will see how it works graphically and then how the equations give us the numbers for our analysis.
Linear regression looks at the plot of x and y and tries to fit the straight line that is closest to all of the points. Figure 7.1 shows how this is done: I randomly drew values for both x and y, and the line is the regression line that best fits the data. As the plot shows, the line does not fit perfectly; it is just the best-fitting line. The differences between the actual data and the line are what are termed residuals, since they are what is not captured by the model. The better the line fits and the smaller the residuals, the more strongly the predictor predicts the outcome.
Figure 7.1: Simple Regression Plot (randomly drawn x and y values with the best-fitting regression line)
7.1.1 Sums of Squares
When discussing the sums of squares we get two equations. One is the sum of squares for the model, shown in 7.1: the difference between our predicted values and the mean of the outcome. This measures how much the model captures, and we want this number to be as high as possible.
SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2    (7.1)
The second is the residual (or error) sum of squares: the difference between the predicted and actual values of the outcome. We want this to be as low as possible; it is shown in 7.2.
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (7.2)
The total sum of squares can be found by summing SSR and SSE, or directly by 7.3.
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2    (7.3)
Table 7.1 shows how this is commonly arranged. We usually report the sums of squares and degrees of freedom along with the F statistic; the mean squares are less important but will be shown for the purposes of the examples in this book.
Source             Sums of Squares   DF          Mean Square               F Ratio
Regression         SSR               p           MSR = SSR / p             F = MSR / MSE
Residual (Error)   SSE               n - p - 1   MSE = SSE / (n - p - 1)
Total              SST               n - 1

Table 7.1: ANOVA Table
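These quantities can be pulled out of a fitted model in R. The sketch below assumes the cars data described earlier correspond to the Cars93 data shipped with the MASS package, which appears to match the output shown later in this chapter:

library(MASS)                                # Cars93 data (assumed to match the cars data used here)
fit <- lm(Price ~ MPG.city, data = Cars93)

SSE <- sum(residuals(fit)^2)                       # residual (error) sum of squares, eq. 7.2
SSR <- sum((fitted(fit) - mean(Cars93$Price))^2)   # model sum of squares, eq. 7.1
SST <- SSE + SSR                                   # total sum of squares, eq. 7.3

c(SSR = SSR, SSE = SSE, SST = SST)
SSR / SST    # equals the Multiple R-squared reported by summary(fit)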
7.2 Model
First let's look at the simplest model: with one predictor we have a simple linear regression, 7.4. As shown, β₀ is the intercept parameter, where the regression line crosses the y-axis when x = 0. The β₁ is the parameter estimate for the predictor x beside it; it gives the magnitude and direction of the relationship to the outcome variable. Finally ε is the residual, how much the data deviate from the regression line. This is also called the error term: the difference between the predicted values of the outcome and the actual values.
y = \beta_0 + \beta_1 x_1 + \epsilon    (7.4)
With more than one predictor we have multiple linear regression; a model with two or more predictors looks like 7.5. Note the subscript p stands for the number of predictors, so there is a βx term for each independent variable.
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon    (7.5)
7.2.1 Simple Linear Regression
If we have the raw data, we can find the estimates by hand. In the era of very high-speed computers it is rare that you will have to compute these statistics manually, but we should still look at the equations to see how the slopes are derived. Equation 7.6 shows how to calculate the beta coefficient for a simple linear regression. We square the deviations so we get an approximation of the distance from the best-fitting line: if we just added the raw deviations, some would be below the line and some above, giving negative and positive values respectively, and they would sum to zero (this is one of the assumptions about the error term). Squaring removes this issue.

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (7.6)
Equation 7.7 shows how the intercept parameter is calculated in a simple linear regression; this is where the regression line crosses the y-axis when x = 0.

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}    (7.7)
Finally we come to our residuals. When we plug values of x into the equation, we get the fitted values, the values predicted by the regression equation, signified by ŷ. When we subtract the predicted (fitted) value from the actual outcome value, we get the residual, 7.8. This shows how far our actual values lie from the line and gives us an idea of which values are furthest from the regression line.

\epsilon = y - \hat{y}    (7.8)
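A quick check of equations 7.6 through 7.8 in R with simulated data, compared against lm():

set.seed(1)
x <- rnorm(50)
y <- 2 + 1.5 * x + rnorm(50)    # simulated data with a known slope and intercept

b1  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # equation 7.6
b0  <- mean(y) - b1 * mean(x)                                      # equation 7.7
res <- y - (b0 + b1 * x)                                           # residuals, equation 7.8

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))    # should match b0 and b1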
We can also find how much of the variability in our outcome is being explained by our predictors. When we run this model we get a Pearson's correlation coefficient (r). We can square this number (as we did for correlations) and get the proportion of variance explained. This can be written in several equivalent ways; see 7.9.
r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}    (7.9)
We do need to adjust our r-squared value to account for the complexity of the model. Whenever we add a predictor we will always explain more variance; the question is whether it is truly explaining variance for theoretical reasons or just adding variance explained by chance. The adjusted r-squared should be comparable to the unadjusted value; if they are substantially different, you should look at your model more closely. The adjusted r-squared can be particularly sensitive to sample size, so small samples will show larger differences between the two values. It is best to report both if they differ by a non-trivial amount.
\text{Adjusted } r^2 = 1 - (1 - r^2)\frac{n - 1}{n - p - 1} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)}    (7.10)
We can look at an example with data. Let's return to our cars example and see whether we can predict the price of a vehicle from its miles per gallon (MPG) of fuel used while driving in the city.
> mod1<-lm(Price~MPG.city);summary(mod1)
Call:
lm(formula = Price ~ MPG.city)
Residuals:
Min 1Q Median 3Q Max
-10.437 -4.871 -2.152 1.961 38.951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.3661 3.3399 12.685 < 2e-16 ***
MPG.city -1.0219 0.1449 -7.054 3.31e-10 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 7.809 on 91 degrees of freedom
Multiple R-squared: 0.3535, Adjusted R-squared: 0.3464
F-statistic: 49.76 on 1 and 91 DF, p-value: 3.308e-10
We find that MPG in the city is a significant predictor of price. Our first test is similar to ANOVA: the F test, for which we reject the null hypothesis, F(1, 91) = 49.76, p < .001. We then look at the significance of our individual predictor. It is significant; here we report two statistics, the parameter estimate (β) and the t-test associated with it. Miles per gallon in the city is significant with β = -1.0219, t(91) = -7.054, p < .001. The first interesting thing is that the relationship is inverse: as one variable increases, the other decreases. Here we can say that for every one-MPG increase in city fuel economy, there is a drop in price of about $1,000. (I say thousands of dollars and not dollars because that is the unit of measurement of the price variable; be sure when interpreting results to use the original unit of measurement unless the data are transformed, which will be discussed later.) We can also look at the r² value to see how well the model fits: r² = 0.3535 and adjusted r² = 0.3464. The adjusted value is only slightly lower, which is not a major issue, so we can trust this value.
7.2.2 Multiple Linear Regression
Multiple linear regression is similar to simple regression except we place more than one predictor in the equation. This is how most models in social science are run, since we expect more than one variable to be related to our outcome.
y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \epsilon    (7.11)
Let's go back to the data and add fuel tank capacity to the model alongside miles per gallon in the city.
> mod3<-lm(Price~MPG.city+Fuel.tank.capacity);summary(mod3)
Call:
lm(formula = Price ~ MPG.city + Fuel.tank.capacity)
Residuals:
Min 1Q Median 3Q Max
-18.526 -4.055 -2.055 2.618 38.669
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.1104 11.6462 0.868 0.38763
MPG.city -0.4608 0.2395 -1.924 0.05747 .
Fuel.tank.capacity 1.1825 0.4104 2.881 0.00495 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 7.514 on 90 degrees of freedom
Multiple R-squared: 0.4081, Adjusted R-squared: 0.395
F-statistic: 31.03 on 2 and 90 DF, p-value: 5.635e-11
We reject the null hypothesis with F(2, 90) = 31.03, p < .05. We have r² = .408 and adjusted r² = 0.395, so this model fits well and we can explain around 40% of the variance with these two predictors. Interestingly, miles per gallon fails to remain significant in the model, β = -0.4608, t(90) = -1.924, p = 0.057. This is one of those times where significance is close, and people who hold rigidly to the alpha of .05 would say this is not important. I do not hold such views: while MPG seems less important than in the last model, it is still worth mentioning as a possible predictor, although in the presence of fuel tank capacity it has less predictive power.
Fuel tank capacity is strongly related to price: β = 1.1825, t(90) = 2.881, p < .05. The relationship here is positive, so the greater the fuel tank capacity, the higher the price. We might speculate that larger vehicles, with larger tanks, are more expensive. We have also seen consistently that miles per gallon in the city is inversely related to price, which may also be about size: larger vehicles may get worse fuel efficiency but be more expensive, while smaller cars may be more fuel efficient and yet cheaper. I am not an expert on vehicle pricing, so we will just trust the data from this small sample.
7.3 Interpretation of Parameter Estimates
7.3.1 Continuous
When a variable is continuous, interpretation is generally straightforward. We interpret the coefficient to mean that a one unit increase in the predictor corresponds to an increase in y of β units. So let's say you have the equation y = β₀ + 2x + ε. Here the 2 is the parameter estimate (β), so we say that for each unit increase in x, y increases by 2 units. When saying the word unit we are referring to the original measurements of the individual variables. So if x is income in thousands of dollars and y is test scores, then each one thousand dollar increase in income (x) corresponds to a score 2 points higher on the exam.

This changes if we transform our variables. If we standardize our x values, we would say that for each standard deviation increase in x, y increases by two units. If we standardize both y and x, we would say a one standard deviation increase in x corresponds to a two standard deviation increase in y.
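As a rough sketch of my own (not from the original text), standardized coefficients can be obtained in R by scaling the variables before fitting; here I assume the Price and MPG.city columns used in the models above are still attached.

> mod1z <- lm(scale(Price) ~ scale(MPG.city))
> coef(mod1z)  # slope is now in standard deviation units of both x and y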
If we take the log of our outcome, then we would say that a one thousand dollar increase in income corresponds to a 2 log-unit increase in y. One thing to note is that when statisticians (or almost all scientists) say log, they mean the natural log. To transform back to the original units you take the exponential function, so e^y if you had taken the log of the outcome (reasons for doing this will be discussed in testing assumptions). If we take the log of both y and x, then we can talk about percents: a one percent increase in x means a 2 percent increase in y. Although to get back to original units, exponentiation is still necessary.
If we look at our models above, in the simple linear regression model with just MPG in the city, for each increase of one MPG in the city the price goes down by 1.0219 thousand dollars; the coefficient is negative, so the relationship is inverse. In our multiple regression model we see that for each one gallon increase in fuel tank capacity the price increases by 1.1825 thousand dollars; this coefficient is positive.
7.3.1.1 Transformation of Continuous Variables
Sometimes it is necessary to transform our variables. This can be done to make interpretation easier or more relevant to our research question, or to allow our model to meet its assumptions.
7.3.1.1.1 Natural Log of Variables Here we will explore what happens when we take the log of continuous variables.
> mod2<-lm(log(Price)~MPG.city);summary(mod2)
Call:
lm(formula = log(Price) ~ MPG.city)
Residuals:
Min 1Q Median 3Q Max
-0.58391 -0.19678 -0.04151 0.19854 1.06634
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.15282 0.13741 30.223 < 2e-16 ***
MPG.city -0.05756 0.00596 -9.657 1.33e-15 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.3213 on 91 degrees of freedom
Multiple R-squared: 0.5061, Adjusted R-squared: 0.5007
F-statistic: 93.26 on 1 and 91 DF, p-value: 1.33e-15
Here we have taken the natural logarithm of our outcome variable. This will be shown later to be advantageous when looking at our assumptions and violations of them. It can also make model interpretation different and sometimes easier. Now, instead of the original units, the outcome is in log units, so we would say that for each one MPG increase, log price decreases by 0.0576 log units, which corresponds to roughly a 5.6% decrease in price (since e^−0.0576 ≈ 0.944); the coefficient is negative, so the relationship is still inverse. Notice the percent of variance explained increased dramatically, from 35% to 50%; this is due to the transformation, and r² values are not directly comparable once the outcome has been transformed.
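As a quick sketch of my own, using the mod2 object fit just above, the log-outcome coefficient can be converted to an approximate percent change, and fitted values can be put back on the original price scale with exp():

> b <- coef(mod2)["MPG.city"]
> (exp(b) - 1) * 100       # approximate percent change in Price per one MPG increase
> head(exp(fitted(mod2)))  # back-transform fitted log prices (ignores retransformation bias)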
> mod3<-lm(log(Price)~log(MPG.city));summary(mod3)
Call:
lm(formula = log(Price) ~ log(MPG.city))
Residuals:
Min 1Q Median 3Q Max
-0.61991 -0.21337 -0.03462 0.19766 1.05362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5237 0.4390 17.14 <2e-16 ***
log(MPG.city) -1.5119 0.1421 -10.64 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.3052 on 91 degrees of freedom
Multiple R-squared: 0.5543, Adjusted R-squared: 0.5494
F-statistic: 113.2 on 1 and 91 DF, p-value: < 2.2e-16
This model looks at what happens when we take the natural log of both the outcome and the predictor. This is also interpreted differently, but now the relationship is in percent terms for both variables. So for each one percent increase in MPG in the city, the price decreases by 1.512 percent. The model estimates have changed because of the transformation.
7.3.2 Categorical
When our predictors are categorical, we need to be careful about how they are modeled. They cannot simply be entered as numerical codes or words; this would cause the estimates to be wrong, as the model would treat them as continuous variables.
7.3.2.1 Nominal Variables
For nominal variables we must recode the levels of the factor. One way to do this is dummy coding, where we create one indicator variable per level, coded 1 for observations in that level and 0 otherwise. If we denote the number of levels as k, then the total number of dummy variables we can include for a factor is k − 1. For example, suppose we have a variable of different sporting events, such as football, basketball, soccer, and baseball. The total number of dummy variables we can have is 3. The coding can be done in statistics programs as shown in Table 7.2.
Factor Levels Dummy 1 Dummy 2 Dummy 3
Football 1 0 0
Basketball 0 1 0
Soccer 0 0 1
Baseball 0 0 0
Table 7.2: How Nominal Variables are Recoded in Regression Models using Dummy Coding
As you can see, the baseball level of our sports variable is all zeros. This is the baseline group, against which the other groups are compared. This works well when there is a natural baseline group (like treatment vs. control in medical studies). Our example, however, does not have a natural baseline, so we can use another type of coding, called contrast coding.
Factor Levels Dummy 1 Dummy 2 Dummy 3
Football -1 0 0
Basketball 0 -1 0
Soccer 0 0 -1
Baseball 1 1 1
Table 7.3: How Nominal Variables are Recoded in Regression Models using Contrast Coding
As you can see, the codes in each column sum to 0. Of course, in real data sets the groups may not be of equal size; the different levels may have different numbers of observations. So if there were 25 participants who played football and only 23 baseball players, finding contrast weights that sum to zero is more difficult. Luckily, many software programs handle this type of coding automatically. If these coded variables were our only predictors, the model would be equivalent to an analysis of variance, and the intercept would be the mean of the baseline group. This is not so if more predictors are added, as that would be an analysis of covariance.
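As a minimal sketch of my own (the sport variable here is made up for illustration), R handles both schemes through the contrasts of a factor: treatment (dummy) coding is the default, and sum-to-zero contrast coding can be requested with contr.sum.

> sport <- factor(c("Football", "Basketball", "Soccer", "Baseball"))
> contrasts(sport)                  # default treatment (dummy) coding: k - 1 = 3 columns
> contrasts(sport) <- contr.sum(4)  # switch to sum-to-zero (contrast) coding
> contrasts(sport)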
7.3.2.2 Ordinal Variables
For ordinal variables, we can generally keep them in the model as a single variable and not require dummy coding. This is because the assumption of linearity is relatively tenable: we expect the categories to be naturally ordered and increasing. The interpretation is that as you go up one category, the value of y changes by the amount of the parameter estimate (the coefficient for that variable).
7.4 Model Comparisons
In many research settings we want to know how much we add to the fit of a model when we add or remove one or more predictors. When we do model comparisons, we must ensure the models are nested: we add or remove predictor(s), but the models are otherwise fit to the same data and outcome. For example, in the models above we looked at MPG and fuel tank capacity. We may want to know how much adding fuel tank capacity improves model fit, or how adding MPG to a model that already contains fuel tank capacity compares. We cannot directly compare, in this framework, a simple regression with only fuel capacity against another model with only MPG.
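As an illustration of my own, nested linear models can be compared in R with an F test through anova(); here I refit the two models above under new names (the name mod3 is reused later in the chapter for a different model).

> m_small <- lm(Price ~ MPG.city)
> m_large <- lm(Price ~ MPG.city + Fuel.tank.capacity)
> anova(m_small, m_large)  # F test for the improvement in fit from adding fuel tank capacity

If the F test is significant, the added predictor improves the fit beyond what we would expect by chance.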
7.5 Assumptions
The assumptions for regression depend on the kind of regression being used. For continuous outcomes, the assumptions are that the errors are homoscedastic and normally distributed, that the outcome is linearly related to the predictors, and that the observations are independent of one another. We first look at the assumptions of linear regression and how to test them; then we will discuss corrections when they are violated.
7.6 Diagnostics
We need to make sure our model is meeting our assumptions, and we need to see whether we can correct for the times our assumptions are violated.
7.6.1 Residuals
So first we need to look at our residuals. Remember, residuals are the observed y values minus the predicted y values. For this exercise I will use the cars data from above, as it is a good data set for discussing regression. For the purposes of looking at our assumptions, let us stick with the simple regression where price of vehicles is our outcome and miles per gallon in the city is our predictor. Here I will provide R commands and output along with a discussion of them.
7.6.1.1 Normality of Residuals
First let's look at our assumption of normality. We assume our errors are normally distributed with mean 0 and some unknown variance. We can test this formally; my preferred test is the Shapiro–Wilk test, which works well for sample sizes from 3 to 5000 (Shapiro and Wilk, 1965).
7.6.1.1.1 Tests Let's look at the above models and see if our normality assumption is met. First we test mod1, which uses the variables in their original form.
> shapiro.test(residuals (mod1))
Shapiro-Wilk normality test
data: residuals(mod1)
W = 0.8414, p-value = 1.434e-08
As you can see, the results aren't pretty: we reject the null hypothesis, W = 0.8414, p < .05, which means there is enough evidence to say that the sample deviates from the theoretical normal distribution. For this test the null hypothesis is that the sample does conform to a normal distribution, so unlike most testing, we do not want to reject the null here.
> shapiro.test(residuals (mod2))
Shapiro-Wilk normality test
data: residuals(mod2)
W = 0.9675, p-value = 0.02022
The second model, with the log of the outcome, helps some, but we still cannot say our assumption is tenable, W = 0.9675, p < .05.
> shapiro.test(residuals (mod3))
Shapiro-Wilk normality test
data: residuals(mod3)
W = 0.9779, p-value = 0.1154
This time we cannot reject the null hypothesis, W = 0.9779, p > .05, so taking the log of both our outcome and predictor lets the residuals approximate the normal distribution; at the very least we can say there is not enough evidence to conclude the distribution is significantly different from the theoretical (expected) normal distribution.
7.6.1.1.2 Plots Now let's look at plots. Two plots are important: a histogram and a QQ plot. The histogram lets us look at the frequency of values, and the QQ plot plots our residuals against what we would expect from a theoretical normal distribution. In the QQ plots the line represents where we want our residuals to fall; points on the line match the theoretical normal distribution.
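As a sketch of my own, plots like those in Figures 7.2 through 7.4 can be produced with hist() and qqnorm(); I use raw residuals here, though the figure captions refer to studentized residuals, which rstudent() would give.

> res <- residuals(mod1)        # rstudent(mod1) would give studentized residuals instead
> par(mfrow = c(1, 2))          # histogram and QQ plot side by side
> hist(res, freq = FALSE, main = "Distribution of Residuals", xlab = "residuals(mod1)")
> qqnorm(res, main = "Normal QQ Plot", ylab = "Residuals")
> qqline(res)                   # reference line for a perfectly normal sample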
[Figure: two panels — Distribution of Residuals (density histogram of residuals(mod1)) and Normal QQ Plot (Theoretical Quantiles vs. Residuals)]
Figure 7.2: Histogram of Studentized Residuals for Model 1
The first set of plots shows us what we expected from the statistics above. Our residuals do not conform to a normal distribution: we can see heavy right skew in the residuals, and the QQ plot is very non-normal at the extremes.
[Figure: two panels — Distribution of Residuals (density histogram of residuals(mod2)) and Normal QQ Plot (Theoretical Quantiles vs. Residuals)]
Figure 7.3: Histogram of Studentized Residuals for Model 2
As we saw in the statistics, taking the log of our outcome improved things, but not quite enough to make the assumption of normality tenable; we still see too much right skew in the distribution.
[Figure: two panels — Distribution of Residuals (density histogram of residuals(mod3)) and Normal QQ Plot (Theoretical Quantiles vs. Residuals)]
Figure 7.4: Histogram of Studentized Residuals for Model 3
This looks much better! The distribution is looking much more normal. The QQ plot still shows some deviation at the top and bottom, but the Shapiro–Wilk test gives no evidence against normality, so the assumption is tenable and this is acceptable.
7.7 Final Thoughts
Linear regression is used very widely in statistics, most notably because of the pleasing mathematical properties of the normal distribution. Its ease of interpretation and wide implementation in software packages add to its appeal. One should be cautious in using it, though, and check that the residuals are approximately normally distributed.
Chapter 8
Logistic Regression
So now we begin to discuss what happens when our outcome is not continuous. Logistic regression deals with the case where the outcome is binary, that is, it can take only one of two values (almost universally coded 0 and 1). This has many applications: graduate or not graduate, contract an illness or not, get a job or not, and so on. It does pose some problems for interpretation at times, because it is not as straightforward to work with.
8.1 The Basics
So we have to model events that take on values of 0 or 1. The problem with linear regression here is that it fits a straight line, which cannot respect the fact that our values are bounded. This means we must turn to a different distribution than the normal distribution.
8.2 Regression Modeling Binomial Outcomes
Contingency tables are useful when we have one categorical covariate, but they are not possible when we have a continuous predictor or multiple predictors. Even when there is one variable of interest in relation to the outcome, researchers still try to control for the effects of other covariates. This leads to the use of a regression model to test the relationship between a binary outcome and one or several predictors.
8.2.1 Estimation
The basic regression model taught in introductory statistics classes is linear regression. This has a continuous outcome, and estimation is done by least squares; that is, a line is fit to the data so that the sum of squared differences between the data points and the line is minimized. With a binomial outcome we cannot use this estimation technique. The binomial model estimates proportions, which are bounded between 0 and 1, and a least squares model may give estimates outside these bounds. Therefore we turn to maximum likelihood and a class of models known as Generalized Linear Models (GLMs).¹
\underbrace{E(y)}_{\text{Random Component}} = \underbrace{\beta_0 + \sum_{i=1}^{p} \beta_i x_i}_{\text{Systematic Component}}{}^{2} \qquad (8.1)
The random component is the outcome variable; it is called the random component because we want to know why there is variation in this variable. The systematic component is the linear combination of our covariates and the parameter estimates. When our outcome is continuous we do not have to worry about establishing a linear relationship, as we assume one exists if the covariates are related to the outcome. When we have a categorical outcome we cannot have this linear relationship directly, so GLMs provide a link function that puts the mean of the outcome on a scale where a linear relationship with the covariates can hold.
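As a small simulated sketch of my own (not the author's data), the point about bounds can be seen directly: least squares on a 0/1 outcome can produce fitted values outside 0 and 1, while a logit-link GLM cannot.

> set.seed(1)
> x <- rnorm(200, sd = 3)
> y <- rbinom(200, size = 1, prob = plogis(0.5 * x))   # simulated 0/1 outcome
> range(fitted(lm(y ~ x)))                             # least squares: can leave [0, 1]
> range(fitted(glm(y ~ x, family = binomial)))         # logit link: stays inside (0, 1)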
8.2.2 Regression for Binary Outcomes
Two of the most common link functions are the logit and probit functions. These allow us to look at a linear relationship between our outcome and our covariates. In Figure 8.1 you can see there is not a lot of difference between logit and probit; the difference is in the interpretation of coefficients (discussed below). The green line shows how a traditional regression line is not an appropriate fit, because the fitted line goes outside the 0 to 1 range of the outcome (the blue dots). The logit and probit fits model the probability of a success. The figure also shows that there is little difference in the actual model fit between the two, and logit and probit models will usually lead to very similar substantive conclusions; the primary difference is in the interpretation of the results. While we do not have a true r² coefficient, there is a pseudo r², created by Nagelkerke (1992), which gives a general sense of how much variation is being explained by the predictors.
¹ For SPSS users, do not confuse this with the General Linear Model procedure, which performs ANOVA, ANCOVA and MANOVA.
² Some authors use α to denote the intercept term, although β₀ is still the most popular notation and will continue to be used here.
[Figure: fitted probability π(x) versus x, with the vertical axis running from 0.0 to 1.0 and logit, probit, and OLS regression lines overlaid]
Figure 8.1: Logit, Probit and OLS regression lines; data simulated from R
8.2.2.1 Logit
The most common model in education is the logit model, also known as logistic regression. There are two equations we can work with: equation 8.2 gives the log odds of a positive response (a success).

\mathrm{logit}[\pi(x)] = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_p x_p \qquad (8.2)
The probability of a positive response is calculated from equation 8.3.

\pi(x) = \frac{e^{\beta_0 + \beta_p x_p}}{1 + e^{\beta_0 + \beta_p x_p}} \qquad (8.3)
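As a tiny numeric sketch of my own connecting equations 8.2 and 8.3, with made-up coefficient values, the probability can be recovered from the linear predictor by hand or with R's built-in inverse logit.

> eta <- -0.5 + 1.2 * 1      # hypothetical beta0 + beta1 * x, with x = 1
> exp(eta) / (1 + exp(eta))  # probability from equation 8.3
> plogis(eta)                # the same value via R's built-in inverse logit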
Fitted values (either log odds or probabilities) are usually what statistical programs report, using the covariate values from the sample. A researcher can also plug in covariate values for hypothetical participants and obtain a probability for those values. One caution is to ensure that the values you plug in are within the range of the data (i.e., if your sample ages are 18 to 24, do not solve the equation for a 26 year old, since the model was fit with data that did not include that age).
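As a hedged sketch of my own (the data, model, and variable names here are simulated for illustration, not taken from the text), this is how fitted log odds and probabilities for a hypothetical participant can be requested in R.

> set.seed(2)
> d <- data.frame(age = sample(18:24, 100, replace = TRUE), gpa = runif(100, 2, 4))
> d$grad <- rbinom(100, 1, plogis(-4 + 0.1 * d$age + 0.8 * d$gpa))    # simulated outcome
> fit <- glm(grad ~ age + gpa, family = binomial, data = d)
> predict(fit, newdata = data.frame(age = 22, gpa = 3.1))                     # log odds
> predict(fit, newdata = data.frame(age = 22, gpa = 3.1), type = "response")  # probability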
8.2.2.2 Probit
The probit function is similar: it assumes an underlying latent normal distribution and maps probabilities, which are bounded between 0 and 1, onto z scores using the inverse of the standard normal cumulative distribution function, as shown in equation 8.4. Agresti (2007, p. 72) gives the example that a probability of 0.05 corresponds to a probit of −1.645, that is, 1.645 standard deviations below the mean.

\Phi^{-1}[\pi(x)] = \beta_0 + \beta_p x_p \qquad (8.4)
8.2.2.3 Logit or Probit?
As can be seen in Figure 8.1, the model fit for logistic and probit regression is very similar, and this is usually true. It is also possible to rescale coefficients from the logit to the probit scale or vice versa. Amemiya (1981) showed that dividing a logit coefficient by approximately 1.6 gives the corresponding probit coefficient. Gelman (2006) ran simulations and found divisors between 1.6 and 1.8 to work well, which also corresponds to Agresti (2007), who mentions the scaling factor being between 1.6 and 1.8.
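A simulated sketch of my own shows this rough scaling between the two sets of coefficients.

> set.seed(3)
> x <- rnorm(500)
> y <- rbinom(500, 1, plogis(0.8 * x))
> b_logit  <- coef(glm(y ~ x, family = binomial(link = "logit")))["x"]
> b_probit <- coef(glm(y ~ x, family = binomial(link = "probit")))["x"]
> b_logit / b_probit   # typically lands around 1.6 to 1.8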
8.2.3 Model Selection
Researchers tend to fit multiple models to try to find the best fitting model consistent with their theoretical framework. There are several ways to evaluate which model fits best. Sequential model building is a technique frequently used to look at the addition of predictors to a regression model, and the same framework is used with other regression models as well. In linear regression, nested models are compared with an F test; models estimated by maximum likelihood use the likelihood ratio test, which follows a chi-squared distribution. Shmueli (2010) examines the differences between building a model to explain the relationship of predictors to an outcome and building a model to predict an outcome from future data sources. That article also discusses information criteria, such as the AIC and BIC measures, used to assess model fit.
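As a sketch of my own on simulated data, sequential model building for a binomial outcome can be carried out with a likelihood ratio test and information criteria.

> set.seed(4)
> d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
> d$y <- rbinom(300, 1, plogis(-0.2 + 0.9 * d$x1 + 0.4 * d$x2))   # simulated outcome
> m1 <- glm(y ~ x1,      family = binomial, data = d)
> m2 <- glm(y ~ x1 + x2, family = binomial, data = d)
> anova(m1, m2, test = "Chisq")  # likelihood ratio test for the added predictor
> AIC(m1, m2); BIC(m1, m2)       # information criteria; smaller values indicate better fit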
8.3 Further Reading
This chapter borrows heavily from Alan Agresti (2007), who is well known and respected for his work in categorical data analysis. Books that cover many statistical models yet still do a good job with logistic regression are Tabachnick and Fidell (2006) and Stevens (2009). The first works well as a textbook; Stevens is a dense book, but it has both SPSS syntax and SAS code and works well as a must-have reference. Gelman and Hill (2007) is rapidly becoming a classic in statistical inference; its computation is focused on R, which has not yet hit mainstream academia much, but they do have some supplemental material at the end of the book for other programs. For those who have an interest in R, another great book is by Faraway (2005). Andy Field (2009) has a classic book, Discovering Statistics Using SPSS, which blends SPSS and statistical concepts very nicely and is good at explaining difficult statistical concepts. Students who wish to explore categorical data analysis conceptually have a few good books to choose from; I recommend Agresti (2002), a different book from his 2007 text, with a focus on theory yet still a lot of great examples of application. Long's (1997) book explores maximum likelihood methods focusing on categorical outcomes.
It combines conceptual and mathematical treatments of maximum likelihood. A classic by McCullagh and Nelder (1989) is the seminal work on generalized linear models (the citation here is their well known second edition).
8.4 Conclusions
This chapter looked at logistic regression in an introductory manner. There is more to analyzing binomial outcomes, and reading some of the works above can help. This is especially important for researchers whose outcomes will be binomial. These principles also act as a starting point for learning about other categorical outcomes, such as nominal outcomes with more than two categories or ordinal outcomes (often used for Likert scales).
Bibliography
Agresti, A. (2007, March). An Introduction to Categorical Data Analysis. Hoboken, NJ: Wiley-Blackwell.
doi:10.1002/0470114754
Amemiya, T. (1981). Qualitative response models: a survey. Journal of Economic Literature, 19(4), 1483–1536. doi:10.2298/EKA0772055N
Andersen, P. K., & Skovgaard, L. T. (2010). Regression with Linear Predictors. Statistics for Biology and
Health. New York, NY: Springer New York.
Chatterjee, S., & Hadi, A. S. (2006). Regression analysis by example (4 ed). Hoboken, NJ: Wiley-Interscience.
Everitt, B. S., & Hothorn, T. (2010). A Handbook of Statistical Analyses Using R, Second Edition. Boca Raton, FL: Chapman and Hall/CRC.
Faraway, J. J. (2005). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models (Chapman & Hall/CRC Texts in Statistical Science). Boca Raton, FL: Chapman and Hall/CRC.
Faraway, J. J. (2004). Linear Models with R (Chapman & Hall/CRC Texts in Statistical Science). Boca
Raton, FL: Chapman and Hall/CRC.
Faraway, J. J. (2002). Practical Regression and ANOVA using R.
Field, A. (2009). Discovering Statistics Using SPSS (Introducing Statistical Methods). Thousand Oaks, CA:
Sage Publications Ltd.
Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. New
York: Cambridge University Press.
Gelman, A. (2006). Take logit coefficients and divide by approximately 1.6 to get probit coefficients. Retrieved from http://www.andrewgelman.com/2006/06/take_logit_coef/
Lock, R. (1993). 1993 new car data. Journal of Statistics Education, 1(1). Retrieved from http://www.amstat.org/PUBLICATIONS/JSE/v1n1/datasets.lock.html
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks,
CA: SAGE Publications.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models, Second Edition (Chapman & Hall/CRC
Monographs on Statistics & Applied Probability). Boca Raton, FL: Chapman and Hall/CRC.
Nagelkerke, N. J. D. (1992). Maximum likelihood estimation of functional relationships. Springer-Verlag New
York.
Pearl, J. (2009a). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.
Pearl, J. (2009b). Causality: Models, Reasoning and Inference. Cambridge University Press.
Rencher, A., & Schaalje, B. (2008). Linear Models in Statistics (2nd ed.). Wiley-Interscience.
Shapiro, S. S., & Wilk, M. B. (1965, December). An analysis of variance test for normality (complete samples). Biometrika, 52(3-4), 591–611. doi:10.1093/biomet/52.3-4.591
Sheather, S. J. (2009). A modern approach to regression with R. New York, NY: Springer Verlag. Retrieved from http://www.springerlink.com/content/978-0-387-09607-0
Shmueli, G. (2010, August). To Explain or to Predict? Statistical Science, 25(3), 289–310.
Stevens, J. P. (2009). Applied Multivariate Statistics for the Social Sciences, Fifth Edition. New York, NY:
Routledge Academic.
Tabachnick, B. G., & Fidell, L. S. (2006). Using Multivariate Statistics (5th Ed.). Upper Saddle River, NJ:
Allyn & Bacon.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th Ed.). New York, NY: Springer.