Basics
The purpose of statistics is to design studies and analyze the data that those studies produce. In other words, it is the science of collecting and learning from data. Examples include:
- Predicting the outcome of an election based on a survey of 500 registered voters
- Designing and properly analyzing clinical trials to get FDA approval for an experimental drug
- Clustering online shoppers based on their purchasing behavior so that advertising can be targeted to specific groups likely to be interested in a product
- Analyzing data produced in an fMRI study to ascertain how the brain responds to certain types of tasks or stimuli
In short, any field that requires the collection of data and drawing subsequent conclusions uses statistics. The idea is that we are interested in some numerical summary (e.g. average) or characteristic of some extremely large group, but we can't observe the whole group directly. We must observe what we can, describe it, and decide what we think is true.
Parameter: A numerical value summarizing the entire population. E.g., the true proportion of voters in Georgia who will vote for Barack Obama, the true average gas mileage of every Cadillac Escalade on the road, etc.
Statistic: A numerical value summarizing the sample. E.g., the proportion of our 200 voters who said they would vote for Obama, the maximum time to finish a race out of 15 selected athletes, etc.
A note on notation
The convention in statistics is to (usually) use Greek letters to denote parameters and Latin letters to denote statistics:
- Population mean μ; sample mean x̄
- Population standard deviation σ; sample standard deviation s
- Population correlation ρ; sample correlation r
Of course, there are exceptions. We will often use p to denote a population proportion and p̂ when referring to a sample proportion. (π is also used for population proportions, but this can be confusing when the context is not clear.)
Aspects of Statistics
Statistics, broadly speaking, can be divided into three categories:
1. Design - Deciding how to collect data so that the most information possible can be extracted to answer the specific questions of interest
2. Descriptive Statistics - Methods of graphically displaying and numerically summarizing collected data
3. Inferential Statistics - Using the collected information to draw conclusions and/or make predictions about the population as a whole
Experimental design primarily falls under (1). However, we will also be considering how to analyze data after it has been collected in an experiment to draw appropriate conclusions. This aspect falls under (3).
Categorical data are typically summarized with contingency tables or bar charts. Pie charts are also used, but these are pretty much useless from a statistics perspective.
Quantitative variables, on the other hand, are numerical measures of subjects, e.g. height, weight, test score, etc. We further make a distinction between discrete and continuous variables: Discrete - Variables can only take on a countable number of values. (Note that countable is not the same as finite.) If you can count the possible values that could possibly be observed, then the (quantitative) variable is considered discrete. Example: Discrete variables include the number of students in a randomly selected class on campus, the number of times a 6 is rolled out of 15 rolls of a die, the number of words on a page of notes, etc.
Continuous - If a variable can (theoretically) take on any value in a continuum of possible values, then it is considered continuous. For continuous variables, the number of possibilities that can be observed is uncountable. Example: Height, weight, and temperature may all be considered to be continuous variables. For example, there is no way to count all the possible weights of cars we could observe.
When dealing with quantitative data (as will usually be the case in this course), it is generally a good idea to plot your data. There are several ways in which we can graphically depict such data, including histograms,
stem-and-leaf plots,
and dotplots.
Often, at least one of the purposes of plotting your data in, e.g., a histogram, is to get some idea of the shape of the population from which our observations came. A common assumption in statistical inference is that our data are normally distributed, so that the population as a whole follows a normal (symmetric and bell-shaped) distribution. Certainly this assumption is not tenable if we plot a reasonably large sample and find that it is skewed.
Skewed Right - The right tail of the distribution is stretched out more than the left tail
Skewed Left - The left tail of the distribution is stretched out more than the right tail
Symmetric - When split down the middle, the distribution is a mirror image of itself
As our sample gets larger and larger, and the bins of the histogram get smaller and smaller, we end up with a smooth curve that describes how values in the population are distributed. The actual smoothed histogram of the entire population (called a probability density function) gives a complete picture of how the values in the population are distributed so that we can evaluate probabilities of observing events within this population. In a sense, most of the science of statistics boils down to questions of what this probability distribution (or density) looks like. Arguably the two most often asked questions of probability distributions are (1) what is the center of this distribution, or a typical value we're likely to observe, and (2) how spread out is the distribution?
Remember that we denote the sample mean with x̄ and the population mean with μ.
Median - This value occupies the middle position when the data are arranged in ascending order. In other words, the median separates the top 50% of the data from the bottom 50%. To compute the median:
1. Arrange the data in ascending order
2. (a) If n is odd, the median is the (n + 1)/2 value in the ordered list
   (b) If n is even, we take the median to be the average of the middle observations. That is, we average the n/2 and n/2 + 1 values.
Mode - The mode is just the value that occurs most often in a dataset; i.e. it is the value that corresponds with the highest frequency. Note that the mode does not have to be unique. We can have bimodal datasets, whose histograms show two distinct peaks.
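These measures of center can all be computed with Python's standard `statistics` module. The small dataset below is invented for illustration; note how the single outlier drags the mean well away from the median:

```python
# Measures of center with the standard library.
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 11, 100]  # invented data; 100 is an outlier

print(statistics.mean(data))      # pulled upward by the outlier (~16.4)
print(statistics.median(data))    # middle of the 9 ordered values: 7
print(statistics.multimode([1, 2, 2, 3, 3]))  # a bimodal dataset: [2, 3]
```

The gap between the mean (about 16.4) and the median (7) is exactly the sensitivity-to-outliers point made below.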
When thinking about the best number to use to describe a typical value, you want to think about the spread of the data and whether there are any outliers. A mean is sensitive to every value in the dataset, so that even one outlier can pull the average quite far off from what we would consider a typical value. The median is much more robust (resistant) to extreme values. Further, we can compare the mean and the median to get some idea of the shape of the distribution:
- Mean > Median: the data are skewed right
- Mean < Median: the data are skewed left
- Mean = Median: the data are symmetric
Boxplots are graphical representations of the five-number summary (Min, Q1, Median, Q3, Max).
Z-scores measure how far a value is away from its mean in terms of number of standard deviations:

    z-score of observation i = (x_i − x̄) / s

Recall that z-scores (i) scale all of our data to a common metric so that they can be compared, (ii) should usually be between −3 and 3, if we're using the right mean and standard deviation, and (iii) follow a standard normal distribution if the data are themselves normally distributed.
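The formula above is a one-liner in Python; the five data points are invented for illustration:

```python
# z_i = (x_i - xbar) / s, using the sample standard deviation.
import statistics

x = [4.0, 5.0, 6.0, 8.0, 12.0]   # invented data
xbar = statistics.mean(x)        # 7.0
s = statistics.stdev(x)          # sample (n - 1) standard deviation
z = [(xi - xbar) / s for xi in x]
print(z)
```

A handy sanity check: z-scores always sum to zero, since the deviations from the mean cancel.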
Probability Distributions
Probability
Probability, simply stated, is how we quantify uncertainty. For our purposes, we can think of the probability of a particular outcome as the long-run relative frequency of its occurrence. In other words, when I repeat a trial a large number of times, the ratio of the number of times I observe that particular outcome to the total number of trials is approximately its probability. The larger the number of trials, the closer the ratio is to the actual probability. This is a case of the Law of Large Numbers. Another way of thinking about probability is the proportion of the population that satisfies the event of interest. Example: 57% of all college students in the United States are female. So, if the population is all college students in the United States, then the probability of randomly selecting a female is P(female) = .57.
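The Law of Large Numbers is easy to watch in action: simulate fair coin flips and track the relative frequency of heads as the number of trials grows. This is just an illustrative sketch using Python's random module:

```python
# Law of Large Numbers: relative frequency of heads approaches 0.5.
import random

random.seed(1)  # fixed seed so the run is reproducible
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)  # the ratio drifts toward 0.5 as n grows
```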
A few basic rules about probability:
- 0 ≤ P(A) ≤ 1, for any event A.
- If A_1, ..., A_n partition the sample space so that each possible outcome is in one and only one event A_i, then Σ_{i=1}^n P(A_i) = 1.
- P(A^c) = 1 − P(A)
- P(A or B) = P(A) + P(B) − P(A and B)
- P(A|B) = P(A and B) / P(B)
Two events are said to be independent if knowing that one occurred does not affect any probability statements we make about the other. This is another assumption we will usually make about any data we analyze. This is actually an extremely important assumption and why it is important to make sure we have a random sample. Example: If I get heads on the first flip of a coin, what does that tell me about the probability of getting heads on the second flip?
Example: Suppose we draw an ace from an ordinary 52-card deck and don't replace it. What is the probability of getting another ace on the second draw? Is it the same as the probability of an ace on the first draw? Are the two draws independent or dependent?
Lastly, note that if two events A and B are independent, then P (A and B ) = P (A)P (B ).
Probability Distributions
We have already noted that distributions can be described with histograms for an entire population. Specically, probability distributions are a way of specifying all possible values and the probabilities with which any of those values may occur. The mean of the probability distribution of X is called the expected value of X , denoted E (X ). We usually describe the probability distribution of continuous random variables with a curve. In such cases, the probability of a range of values occurring is just the area under the curve between the values specied.
Note that calculating an area always involves length × height. For probability curves, the length is the length of the interval of values we're interested in and the (varying) height is determined by the function. Note that when we consider something like P(X = 0), the length of this interval is 0. Hence, for any continuous random variable, the probability that it is exactly equal to any number is 0. By far the most important distribution we deal with in statistics is the normal distribution. This distribution is characterized by the usual symmetric, bell-shaped curve. It's very difficult to even approximate areas under a normal curve by hand, so we have to use a normal table or a computer to calculate probabilities associated with normal distributions. However, there are some special cases we can figure out via the Empirical Rule.
Example: Suppose X follows a normal distribution with mean μ and standard deviation σ, for which we will use the notation X ~ N(μ, σ²). Using the Empirical Rule, find P(μ − σ < X < μ + 2σ).
Recall that if X ~ N(μ, σ²), then Z = (X − μ)/σ ~ N(0, 1). Example: For a standard normal random variable Z, find P(0 < Z < 2). Compare this to the previous example.
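This is one place where "use a computer" is a one-liner. The standard normal CDF can be written with the error function from Python's math module, and the exact answer can be compared with the Empirical Rule's rough value of 0.475:

```python
# P(0 < Z < 2) for a standard normal, via the error function.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

p = phi(2.0) - phi(0.0)
print(round(p, 4))  # 0.4772, close to the Empirical Rule's 0.475
```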
The Central Limit Theorem (CLT) says that the sampling distribution of the sample mean will be well approximated by a normal distribution as long as the mean is based on adding up n independent random variables, provided n is large. The CLT is useful because it tells us what the sampling distribution of X̄ looks like regardless of the population distribution. Of course, if we have a small sample (or dependent observations!), the result is of little use. However, if the population itself is normal, then X̄ is automatically normally distributed regardless of the sample size. So, concerning the sampling distribution of X̄, we have two things that are always true, and one that is sometimes true:
The Sampling Distribution of X̄
Regardless of how the population is distributed:
- E(X̄) = μ, i.e. the mean of the distribution of the sample mean is the same as the mean of the population
- se(X̄) = σ/√n, i.e. the standard error of the sample mean is the population standard deviation divided by the square root of the sample size
If the sample size is large, then X̄ is approximately N(μ, σ²/n), by the CLT.
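All three facts can be checked by simulation. The sketch below draws many samples of size n = 50 from a decidedly non-normal population (exponential, which has mean 1 and standard deviation 1) and looks at the collection of sample means; the choice of population and sample size is just for illustration:

```python
# Simulating the sampling distribution of the sample mean.
import random
import statistics

random.seed(2)
n, reps = 50, 4000
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(statistics.mean(means))   # close to the population mean, 1
print(statistics.stdev(means))  # close to sigma/sqrt(n) = 1/sqrt(50) ~ 0.141
```

A histogram of `means` would also look bell-shaped, even though the exponential population is strongly skewed right.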
Example: The average fat content of hot dogs is 18 grams. Suppose the standard deviation is 3. What is the probability that the average of 47 randomly selected hot dogs exceeds 19.2 g?
Now suppose that the fat content of hot dogs is normally distributed. What is the probability that the fat content of a randomly selected hot dog exceeds 19.2?
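One way to work both hot-dog questions is with the normal CDF from the previous example; this sketch just translates the two z-score calculations into code:

```python
# Hot-dog example: mu = 18 g, sigma = 3 g.
# (a) P(Xbar > 19.2) for n = 47, using the CLT: se = 3/sqrt(47).
# (b) P(X > 19.2) for a single hot dog, assuming a normal population.
from math import erf, sqrt

def phi(z):                          # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

se = 3 / sqrt(47)
p_mean = 1 - phi((19.2 - 18) / se)   # very small: averages vary little
p_one  = 1 - phi((19.2 - 18) / 3)    # about 0.34 for one hot dog
print(p_mean, p_one)
```

The contrast is the whole point: a single hot dog exceeds 19.2 g about a third of the time, but the average of 47 almost never does.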
Inference
Point Estimate - A single number used as our best guess for the value of an unknown parameter
Interval Estimate - An interval of numbers that we use as a set of plausible values of the parameter of interest. These intervals are constructed from properties that we know to be true of sampling distributions
Confidence Intervals
Interval estimation is done using confidence intervals (CIs). These are just intervals that we say contain the true value of the parameter with a certain level of confidence. The general form for most (but not all!) confidence intervals we will be concerned with is given by

    point estimate ± (critical value) × (standard error).

The product on the right-hand side of the expression is called the margin of error. The critical value is totally determined by our desired level of confidence. That is, for a C% confidence interval, the critical value is the number of standard errors away from the mean within which the middle C% of the probability is contained under the sampling distribution of the point estimate.
Example: Under certain conditions, the sampling distribution of the sample proportion (p̂) is normal. So, the critical value for 95% confidence intervals about proportions is approximately 1.96.
NOTE: We must be very careful about how we interpret confidence intervals. Let's say a 95% CI for the average number of peaches produced by a tree in some orchard is given by (112, 148). We would then say, "We are 95% confident that the true average number of peaches produced by trees in this orchard is between 112 and 148." This is not the same as saying the true average has a 95% chance of being between 112 and 148. Since the true average is (assumed to be) a fixed number, it's either in that interval or it's not. We're saying that we have no idea if the true value was captured in that particular interval, but we know that 95% of all the intervals we could ever construct using that method would capture it.
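The "95% of all the intervals" interpretation can be demonstrated by simulation. Here the population (normal, μ = 130, σ = 20) and sample size are invented so that we know the truth and can count how many known-σ intervals capture it:

```python
# Coverage of 95% confidence intervals when the true mean is known.
import random
import statistics

random.seed(3)
mu, sigma, n, reps = 130.0, 20.0, 25, 2000
hits = 0
for _ in range(reps):
    xbar = statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    half = 1.96 * sigma / n ** 0.5        # known-sigma margin of error
    hits += (xbar - half) < mu < (xbar + half)

print(hits / reps)  # close to 0.95
```

Each individual interval either captured μ = 130 or it didn't; it is the procedure that succeeds about 95% of the time.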
The t distribution is used as an approximation to the normal distribution, accounting for the additional uncertainty introduced by estimating the standard deviation.
Properties of the t distribution: 1. It is bell-shaped and symmetric about 0 2. Its spread and shape, and hence the probabilities associated with it, depend on the degrees of freedom, df . 3. The t distribution has more probability in the tails than a standard normal, but its shape gets closer to a normal as df increases
With the t distribution in place of the standard normal, the general form for a confidence interval for means then becomes:

    x̄ ± t*_{n−1} · s/√n

Here, t*_{n−1} refers to the appropriate critical value from a t distribution with df = n − 1. Strictly speaking, the t distribution only holds exactly when the population is normally distributed (i.e. the CLT doesn't quite work). However, using this distribution still works quite well with mild departures from normality, so if we believe the population is close enough to being normal, we can still use the t as a basis for inference.
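As a worked sketch of the formula above: the ten data points are invented, and the critical value t*_{9, .975} = 2.262 is taken from a t table (df = n − 1 = 9), since computing t critical values needs a table or statistical software:

```python
# A 95% t confidence interval for a mean, computed by hand.
import statistics

x = [10.2, 9.8, 11.1, 10.5, 9.9, 10.8, 10.1, 10.4, 9.7, 10.6]  # invented
n = len(x)
xbar = statistics.mean(x)            # 10.31
s = statistics.stdev(x)              # sample standard deviation
t_crit = 2.262                       # t table value: 95% level, df = 9

half = t_crit * s / n ** 0.5         # margin of error
print((xbar - half, xbar + half))    # 95% CI for the population mean
```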
Hypothesis Testing
The goal of interval estimation is to produce a set of plausible values for a parameter, based on the data that we have observed. Alternatively, we may be particularly interested in one specific value for a parameter and want to see if this is a plausible value. This is the realm of hypothesis testing, a.k.a. significance testing. The logic of hypothesis testing is to assume some value is true, then assess how probable our actual observations would be under this assumed distribution. If the results are extremely unlikely under the assumption, then either (a) we have observed something that rarely occurs, or (b) our assumption was wrong to begin with and we reject the assumed value of the parameter as being implausible.
Steps of a Hypothesis Test
1. Assumptions: We must make certain assumptions about our data for the testing procedure to be valid. The conditions must hold to make sure that we're using the correct reference distribution for our test statistics.
2. Hypotheses: The null and alternative hypotheses must be specified so that the test will, in fact, answer the question we have. The null hypothesis, H0, gives us the assumed value with which probabilities will be evaluated. The alternative hypothesis, HA (sometimes denoted H1), is what we actually believe to be true or hope to show. It also dictates how the p-value will be calculated.
3. Test Statistic: This is the quantity we calculate to quantify how extreme our observed data would be if H0 were true. It is (hopefully) chosen to be a random quantity whose probability distribution (sampling distribution) we know exactly under H0, so we can determine the probability of observing it or observing something more extreme.
4. p-value: The p-value is the probability that our chosen test statistic takes on the value we observed, or something more extreme, if the null were true.
The direction of what we mean by extreme is determined by our alternative of interest: Right-tailed Test - The alternative is that the true parameter is greater than the null value, e.g. HA: μ > μ0.
Left-tailed Test - The alternative is that the parameter is less than the null value, e.g. HA: μ < μ0.
Two-tailed Test - The alternative is that the parameter is not equal to the null value (just something else), e.g. HA: μ ≠ μ0.
Since the p-value is representative of how likely our observations would be assuming H0, small values are evidence against the null. That is, the smaller the p-value, the less plausible H0 is. If the p-value is too small, we reject H0 in favor of the alternative.
5. Conclusions: To determine how small is too small, we usually compare the p-value to some pre-specified significance level, α. The most common choices are α = .01, .05, or .1. If p-value < α, we reject H0 and say that the results are statistically significant. Otherwise, we fail to reject H0. Failing to reject the null hypothesis is not the same as claiming the null to be true. It simply says we don't have enough evidence to think otherwise. Remember: "Absence of evidence is not evidence of absence." - Carl Sagan
NOTE: It is important to draw the appropriate conclusions in the context of the problem. If a psychologist is interested in whether or not attention span decreases in children after watching Spongebob Squarepants, you don't go back to them and say, "we rejected the null hypothesis." Make sure you are able to translate the results of the significance test back into a context meaningful to the problem at hand.
The test statistic is

    t = (x̄ − μ0) / (s/√n),

which follows a t distribution with df = n − 1, if H0 is true. Hence the name, t test.
4. p-value: We always use the null distribution to evaluate the p-value:
(a) HA: μ < μ0:
(b) HA: μ > μ0:
(c) HA: μ ≠ μ0:
5. Conclusions: Again, make sure that you can interpret the results in the context of the problem. There is no fixed rule for this; just always keep in mind the big picture of what you're trying to do.
The two things statisticians really consider when evaluating a testing procedure are the probabilities of committing each type of error, α := P(Type I Error) and β := P(Type II Error). The α used here is the exact same α used in deciding to reject H0. That is, the chosen significance level is the probability of committing a Type I error. This follows directly from how we constructed the testing procedure to begin with. For example, when using α = .05, we reject when there is less than a 5% chance of observing what we did under H0. But there is a 5% chance of getting a result like this when H0 is, in fact, true, so P(Type I Error) = .05. Ideally, we want both α and β to be small. However, these can be competing goals. For example, letting α get really small obviously decreases the chance of committing a Type I error. However, if α is too small, it will be hard to ever reject H0, even when it's false, so the chance of a Type II error increases. Statisticians generally consider a Type I error to be worse than a Type II error (i.e. hanging an innocent man is worse than letting a guilty one go free). Thus, we set α to be some desired level, then do what we can with β subject to that constraint.
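The claim that the significance level is the Type I error rate can be verified by simulation: repeatedly test a true null hypothesis and count false rejections. The setup below (two-tailed z test of H0: μ = 0 with known σ = 1, n = 30) is invented for illustration:

```python
# When H0 is true, we should falsely reject about alpha of the time.
import random
import statistics
from math import erf, sqrt

def phi(z):                          # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

random.seed(4)
n, reps, alpha = 30, 4000, 0.05
rejections = 0
for _ in range(reps):
    # Generate data for which H0: mu = 0 really is true.
    xbar = statistics.mean(random.gauss(0, 1) for _ in range(n))
    z = xbar / (1 / sqrt(n))
    p_value = 2 * (1 - phi(abs(z)))  # two-tailed p-value
    rejections += p_value < alpha

print(rejections / reps)  # close to alpha = 0.05
```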
All the techniques discussed thus far involve the analysis of a single population. That is, there has only been one variable with which we were concerned and we wanted to estimate or perform tests about parameters governing that variable's distribution. There are many practical situations, though, where we wish to analyze samples from two or more separate populations to see how they compare to each other. This is certainly true in the analysis of experiments, where each treatment condition has its own population of possible observations. Fortunately, many statistical procedures for the analysis of a single population can be extended to answer analogous questions about two or more groups. Exactly which procedure is appropriate depends on what we know about the samples and what we're willing to assume about them.
To test this, we randomly select students and ask them to estimate their average number of study hours per week. We do this prior to Spring Break and after, so that we are comparing two populations. The first population is that of the study hours per week for all students prior to spring break, which has its own mean (μ1) and its own standard deviation (σ1). Likewise, the second population is that of the study hours per week for all students after spring break, with mean μ2 and standard deviation σ2. For comparing two groups, we can always use a test statistic of the general form,

    Test stat. = (point estimate − H0 value) / (s.e. of point estimate).
The details (the appropriate standard error, really) depend on (i) whether the two sampled groups are independent, and (ii) whether we believe σ1² = σ2².
(i) σ1² = σ2²: If we believe the two population variances are equal, we can pool the two sample variances into a single estimate,

    s_p² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2).

We can then write the standard error of x̄1 − x̄2 as

    s.e.(x̄1 − x̄2) = √[s_p² (1/n1 + 1/n2)],

so that our test statistic is

    t = (x̄1 − x̄2) / √[s_p² (1/n1 + 1/n2)].

We compare this to a t distribution with df = n1 + n2 − 2 to get the appropriate p-value. The corresponding confidence interval for this test is

    x̄1 − x̄2 ± t*_{n1+n2−2} √[s_p² (1/n1 + 1/n2)].
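The pooled calculation is mechanical once written out; the two samples below are invented for illustration:

```python
# Pooled two-sample t statistic (equal-variance assumption).
import statistics

x1 = [24.1, 26.3, 23.8, 25.0, 27.2, 24.9]  # invented sample 1
x2 = [21.7, 22.9, 20.8, 23.5, 22.1]        # invented sample 2
n1, n2 = len(x1), len(x2)
v1, v2 = statistics.variance(x1), statistics.variance(x2)

# Pooled variance: a weighted average of the two sample variances.
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se = (sp2 * (1 / n1 + 1 / n2)) ** 0.5
t = (statistics.mean(x1) - statistics.mean(x2)) / se
df = n1 + n2 - 2   # compare t to a t distribution with this df
print(t, df)
```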
(ii) σ1² ≠ σ2²: In this case, we still have the two sample variances from each group. However, there's no sense in combining them because they are estimating two different variances. Rather than pooling, we use the unpooled estimate of the standard error,

    s.e.(x̄1 − x̄2) = √(s1²/n1 + s2²/n2),

so that the test statistic is

    t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2).

It turns out that this test statistic isn't exactly distributed as a t. Its exact distribution is unknown, but it can be approximated with a t distribution with degrees of freedom given by

    ν := (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ].
Notice that by collapsing each pair of observations down to a single observed difference, we have reduced this to a one-sample testing problem. We can simply use another one-sample t test for this:

    t = x̄_d / (s_d/√n),

where x̄_d is the sample average of differences and s_d is the sample standard deviation of those differences. We compare this to a t distribution with df = n − 1 to find the p-value. Note that n is the number of pairs of observations (e.g. the number of students), not the total number of observations n1 + n2. The corresponding confidence interval follows:

    x̄_d ± t*_{n−1} s_d/√n
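The paired procedure is just the one-sample t test applied to the differences. The before/after study-hour numbers below are invented for illustration:

```python
# Paired t test: collapse each (before, after) pair to one difference.
import statistics

before = [20, 15, 12, 18, 25, 10, 16, 14]  # invented study hours
after  = [16, 14, 10, 15, 20, 11, 12, 13]
d = [b - a for b, a in zip(before, after)]  # one difference per student

n = len(d)  # n is the number of PAIRS, not the total observation count
t = statistics.mean(d) / (statistics.stdev(d) / n ** 0.5)
print(t)    # compare to a t distribution with df = n - 1 = 7
```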