
Basic Statistics and R: An Introductory Tutorial

M. Ehsan Karim, Eugenia Yu, Derrick Lee
University of British Columbia
June 3, 2008

Contents
1 Introduction to R
    1.1 Getting Started: Introduction to R
        1.1.1 Downloading R
        1.1.2 Creating Vectors
        1.1.3 Combining Vectors
    1.2 Basic Mathematical Functions in R
    1.3 Basic Statistical Functions in R
    1.4 Graphics in R
    1.5 Quitting and Saving your Data in R
    1.6 Searching Help for R commands
        1.6.1 R help files from the web
        1.6.2 R Books
        1.6.3 R help system

2 Dealing with External Data and Simple Analysis
    2.1 Data Set for Lab
    2.2 Reading Data
    2.3 Data Manipulation
    2.4 Graphical Interpretation and Transformations
    2.5 Outliers
    2.6 Graphical Comparisons

3 Simple and Multiple Linear Regression
    3.1 Required Datasets
    3.2 Association of Variables
    3.3 Introduction to Simple Linear Regression
        3.3.1 Reading Data for Simple Linear Regression
        3.3.2 Data Analysis - I: Simple Linear Regression
        3.3.3 Diagnosis for Simple Linear Regression
    3.4 Introduction to Correlation
        3.4.1 Correlation and Causality
    3.5 Introduction to Multiple Linear Regression
        3.5.1 Reading Data for Multiple Linear Regression
        3.5.2 Data Analysis - II: Multiple Linear Regression
        3.5.3 Diagnosis for Multiple Linear Regression and Model Selection
        3.5.4 Assumptions and Troubleshooting

4 Confidence Intervals and Hypothesis Testing
    4.1 Confidence Intervals, Hypothesis Testing
        4.1.1 Confidence Intervals
        4.1.2 Hypothesis Testing
    4.2 Required Datasets and Analysis in R
        4.2.1 Hypothesis Testing using Z test
        4.2.2 Hypothesis Testing using t test
        4.2.3 CIs, Hypothesis Testing, and Two Samples
        4.2.4 Hypothesis Testing and Confidence Intervals using external dataset

5 Central Limit Theorem (CLT) and Analysis of Variance (ANOVA)
    5.1 Central Limit Theorem (CLT)
    5.2 Analysis of Variance (ANOVA)
    Appendix: Uniform Distribution
        Continuous Uniform Distribution

Abstract

This is the accumulated lab material for Stat 241/251, an introductory statistics course for engineers in the Department of Statistics, contributed by many TAs and instructors (not all of whom are named here). Each chapter consists of material for one computer lab (these are not the course notes), given during a short summer-session course in 2008; the full-length course materials are more extensive and are not fully covered here.

Vancouver, Canada June 3, 2008

M. Ehsan. Karim http://ehsan.karim.googlepages.com/

Chapter 1 Introduction to R
1.1 Getting Started: Introduction to R

1.1.1 Downloading R

R is a free version of the statistical programming language known as S, and you can download it at: http://www.r-project.org.

1.1.2 Creating Vectors

First, we will show you how to create a vector. To create a vector, type the command c(x1, x2, ..., xn), where each xi is a piece of information. The c stands for concatenate, which means to link together; an easier analogy is that it creates a column of data. For now, create any vector of data you want, of whatever size, and hit the enter key. What you should see next is [1] followed by your vector of data. Note: the elements of a vector are not restricted to numerical data; in fact, we can place words in our vector, with the stipulation that you place quotation marks around the non-numeric data.

Now we face a dilemma: what is the point of creating the vector if we cannot manipulate it easily? We would have to keep typing in the same data over and over again whenever we wanted to do anything with it. The next step is to learn how to save your data to a variable. To save a variable, simply type any word, followed by <- or by an = sign, and then your data:

myvect <- c(1,2,3,4,5,6,7) or myvect = c(1,2,3,4,5,6,7)
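Putting the steps above together, here is a minimal runnable sketch (the variable names are just examples):

```r
# Create a vector of seven numbers and save it to a variable
myvect <- c(1, 2, 3, 4, 5, 6, 7)

# The = form is equivalent for simple assignment
myvect_alt = c(1, 2, 3, 4, 5, 6, 7)

# Vectors can hold words too, as long as they are quoted
words <- c("rain", "volume", "year")

myvect   # printing shows [1] followed by the values
```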

What this does is set the word myvect as a variable that stores the vector you created; the action is indicated by the <- or = sign. Note: in most cases any word you choose will be fine, but some words are reserved for use by R, so if you get an error, chances are that is the reason. You might be asking yourself, after creating myvect, whether there is an easier way to create a vector containing a sequence. The answer is yes; in fact there are several methods. Two very useful options are:

Option A: Using the colon (x:y), where x is the starting point and y is the ending point. So if we type:

myvect2 <- 8:14 or myvect2 <- c(8:14)

we create a sequence that starts at 8 and increases by 1 up to 14. Note: both forms are equivalent, and this option only increases by a step size of 1 unit.

Option B: The other option is seq(x,y), which creates a sequence just like Option A; if we type seq(8,14), we get the same result. This, though, has an advantage over Option A, because with seq() you can choose the step size. So, for example, if you wanted only the odd numbers from 1 to 14, you would type:

myvect3 <- seq(1,14,2)

which starts at 1 and increases by 2, giving 1, then 3, then 5, and so on up to 13.
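The two sequence-building options above, side by side:

```r
# Option A: the colon operator, which always steps by 1
myvect2 <- 8:14

# Option B: seq(), which lets you pick the step size
odds <- seq(1, 14, 2)   # 1, 3, 5, ..., 13

myvect2
odds
```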

1.1.3 Combining Vectors

Now, what happens if you have two datasets you want to put together that are a bit more complex than the vectors we have created? Remember that c (concatenate) means to link data together, so it is not restricted to one-time use. That is, myvect is a sequence from 1 to 7 and myvect2 is a sequence from 8 to 14, so by typing:

combine <- c(myvect, myvect2)

we create a new vector that takes the data from myvect and, immediately at the end of it, appends the data from myvect2.
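As a self-contained sketch:

```r
myvect  <- 1:7
myvect2 <- 8:14

# c() is not a one-time tool: it links whole vectors end to end
combine <- c(myvect, myvect2)
combine
```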

1.2 Basic Mathematical Functions in R

Because this is a vector, we can perform the most basic mathematical functions on it (adding, subtracting, multiplying, or dividing by a constant), except that the operation is applied to each element in the vector. That is, if we type:

combine + 1

we get 2, 3, 4, 5, ..., 15. Similarly, if we type:

combine*2

we get 2, 4, 6, 8, ..., 28. This is not restricted to operations with constants: you can add, subtract, and multiply vectors together if and only if the vectors are the same length. That is, the number of elements in each vector must be the same, because the operator is applied at matching element positions, i.e. the first element of myvect is added to the first element of myvect2. Don't worry: if your vectors are not the same length, R will give you a warning. To check whether they have the same length, type the command length(myvect); you should get a length of 7. Now see if myvect2 has the same length. Just like adding a constant, to add, subtract, multiply, or divide two or more vectors of the same length, simply type:

myvect + myvect2
myvect - myvect2

myvect * myvect2
myvect / myvect2

We can even perform more complex functions, like:

Taking the logarithm: log(combine)
Squaring (or any other power): combine^2
Square root: sqrt(combine)
Exponential: exp(combine)

Again, though, the function affects each of the individual elements, so if we have 14 elements originally, we will still have 14 after we execute the function. As well, just as we mentioned in the beginning, if you want to reuse the result, and not have to type it in over and over again, you must save it to a new variable. Note: just like on any basic calculator, if you are performing more than one operation on the data, the usual priority order applies: multiplication and division take priority over addition and subtraction, and between multiplication and division, whichever comes first is applied first. If you are wondering whether it is possible to modify/manipulate only certain elements of a vector, the answer is yes. This is done using hard brackets immediately after your variable, which define the position within the variable. For example, combine[2] gives you the data in the second position, and, combining this with what we learned in Option A from Creating Vectors, if we type:

combine[2:5] or combine[c(2:5)]

we get the second through fifth elements.
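Collecting the element-wise operations above into one runnable sketch:

```r
myvect  <- 1:7
myvect2 <- 8:14
combine <- c(myvect, myvect2)

# Operations with a constant apply to every element
plus_one <- combine + 1    # 2, 3, ..., 15
doubled  <- combine * 2    # 2, 4, ..., 28

# Element-wise arithmetic between equal-length vectors
sums <- myvect + myvect2   # 9, 11, 13, ..., 21

# Hard brackets pick out positions
combine[2]      # second element
combine[2:5]    # second through fifth elements
```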

1.3 Basic Statistical Functions in R

Obviously, since this is a statistics course, you are going to want to know the commands for the most basic functions. Since we already have data from the first part of the lab, we'll use that for our calculations.

Total sum: sum(combine)

Mean: mean(combine)
Median: median(combine)
Variance: var(combine)
Standard Deviation: sd(combine)
Maximum: max(combine)
Minimum: min(combine)
Range: range(combine)
Quantiles: quantile(combine) (useful for calculating the IQR)
Summary: summary(combine) (similar to quantile)
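Trying these on the combine vector from earlier (for 1 through 14, the mean and median are both 7.5):

```r
combine <- c(1:7, 8:14)

m   <- mean(combine)      # 7.5
md  <- median(combine)    # 7.5
v   <- var(combine)       # sample variance
s   <- sd(combine)        # square root of the variance
rng <- range(combine)     # smallest and largest values
q   <- quantile(combine)  # 0%, 25%, 50%, 75%, 100%
summary(combine)
```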

1.4 Graphics in R

As the saying goes, a picture is worth a thousand words, and having a visual representation of your data can make analyzing it easier. The three most common graph types are the scatterplot, boxplot, and histogram. Pay special attention to the last two, as they are important for examining distributions and outliers, which we will revisit in Lab 2 with a more appropriate dataset that better displays the usefulness of these plots.

For the scatterplot, you can use the command plot(combine), which plots each value in the combine dataset in its sequential order.

For the boxplot, you use the command boxplot(combine), which shows you the minimum value; the first, second, and third quartiles, representing the 25th, 50th (median), and 75th percentiles; and the maximum value. These are represented by a box with a line through it and two whiskers above and below the box. Note: the whisker ends are based on the IQR (a concept we will discuss later) and so may NOT be the actual maximum and minimum values; this is so that the boxplot can help identify potential outliers.

The last graphical display is the histogram, hist(combine), a display of tabulated frequencies that shows what proportion of the data falls into a particular category or range of values. If we look at our histogram, the range of each bar (x-axis) is 2, so the first bar, which represents 0-2, tells us that there are two values within that range.
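The three plot types in one sketch; the pdf() device is only there so the code can run non-interactively, and in the lab you would simply call the plotting commands directly:

```r
combine <- c(1:7, 8:14)

# pdf() sends the plots to a file so this runs non-interactively;
# in an interactive session each call opens a plot window instead
pdf(out <- tempfile(fileext = ".pdf"))
plot(combine)      # scatterplot, values in sequential order
boxplot(combine)   # box, median line, and whiskers
hist(combine)      # binned frequencies
dev.off()
```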

1.5 Quitting and Saving your Data in R

When quitting/exiting R, you simply type the command q(); R will then ask whether you want to save your workspace image, with the options to save it (y), exit without saving (n), or cancel quitting and continue with what you were doing (c). WARNING: if you do not save your workspace image, you will LOSE ALL YOUR WORK, so be sure to save the image. The advantage of this is that, as you do each lab and assignment, you can keep the data and variable names you used separate from one another.

1.6 Searching Help for R commands

1.6.1 R help files from the web

When in doubt, please feel free to ask any of us TAs, even if you are not in our lab slot, because we are here to help you; alternatively, you can email us anytime. You can find our email addresses on the course website. If you want to try to figure things out on your own, you can find a lot of material online, for example:

http://cran.r-project.org/doc/manuals/R-intro.html
or
http://cran.r-project.org/doc/manuals/R-intro.pdf

1.6.2 R Books

Alternatively, you can find many resources at the library by searching for "R (Computer program language)" in the library search engine. For more advanced work, we currently have the book Modern Applied Statistics with S by W.N. Venables on reserve; S is the commercial counterpart of R, but the commands are pretty much the same. The call number is QA 276.4 V46 2002 and it is at the circulation desk of the Irving K. Barber Learning Centre. This book might be too much for a beginner, though.

1.6.3 R help system

As well, you can use the command help.start() to browse the help files within R, and if you want additional information about a particular function, say the read.table() command, you can type help(read.table) or ?read.table.

Chapter 2 Dealing with External Data and Simple Analysis


2.1 Data Set for Lab

We will start by looking at a dataset, which is provided on the course website, so please visit slate and go to the lab section to download the files rain.txt and gpa.dat. When you download your files, please save them to your Z: drive, which is your personal directory where all your data is saved and backed up daily. Once you have downloaded your files, it is time to run R, which should be located on your desktop.

After starting R, the first thing you need to do is change the directory. The reason is that R does not know where you are, so to open the data file you must tell R where to go. To change the directory, click on [File] at the top, and near the bottom of the menu you should see the option [Change dir]. By default it should already be set to the Z: drive, but if you start creating folders, such as one for Lab 2, you will have to direct R there. Note: you should make a habit of creating subdirectories, because it will help keep you organized and prevent over-writing of files.

2.2 Reading Data

Now that we have R properly set up, it's time to read the data file you downloaded. For our purposes, there are two ways to read a data file here: using the command read.table() or scan(). The first option, as its name implies, reads a table of data, i.e. multiple columns and rows. The second option reads a single column of numerical data, and it will ONLY read that type of dataset. Since we have more than one column of data, with non-numeric components, we want to use the read.table() command. To use the command, we do the following:

rain = read.table("rain.txt", head=T)

So, similar to the last lab, where we saved a vector to a variable, the variable rain will hold the table of data from the file. Please note that the quotes around the file name must always be present. The subcommand head=T (or, equivalently, header=TRUE) indicates that there is a header, i.e. names at the top of the columns of data, in our case Year and Volume. WARNING: if you do not have a header but set the header option to true, you will lose one line of data, which will alter your analysis. Likewise, if you have a header but set the option to false, you will gain one line of data.

Now that you have the data saved, to bring it up all you have to do is type the command rain and it will appear. If by chance you forgot what you named the variable, you can try to figure it out by typing the command ls(), which shows everything saved within the R workspace. When you do, you should see the name of the variable, plus any others, so if this is the same workspace as in the last lab, you should see the vectors of data we created last week.

If the data were a single column of numbers, the scan() command would be optimal, although you could use the read.table() command as well. We note, however, that scan() has no option for a header, and the command to save the data would be:

rain = scan("rain.txt")

2.3 Data Manipulation

We will start by viewing the data by typing rain, which shows 2 columns of data, Year and Volume, and 49 rows. The data in Volume is our focus: the volume of rainfall each year in Sydney, Australia, over 49 years. To focus our attention on Volume, we have three options.

Option A: Using the command attach(), we can attach the names of the headers in rain as pseudo-variables. That is, if we type the command attach(rain), two pseudo-variables are created, Year and Volume. So if you type Volume, the associated data appears, but when you type ls(), you will not see the variable the way you would with rain. Note: you can only attach one dataset at a time; if you end up working on another one, you must type the command detach(), and then you can attach the new one.

Option B: Another way of using the headers is with the $ symbol. Since we know that the header is called Volume, we can use the command rain$Volume to bring up the same data as in Option A. If by chance we do not know the names of the headers, we simply type the command names(rain) and it brings up all the column names.

Option C: You can use the data positions, which is what we did to manipulate the vector we created in Lab 1. Any time you have a dataset, hard brackets immediately after it define positioning. For example, rain[5,2] gives you the data in the fifth row and second column. Therefore, to obtain all the data in the second column, we type the command rain[,2], which gives the data in all rows of the second column.


Example: A visual example might help with extracting these elements, so we are going to create another dataset, which should explain things a bit more easily. Simply follow the code commands and don't worry too much about them for now; if you have questions, feel free to ask:

data = seq(1,25)
mat = matrix(data, nrow=5, ncol=5, byrow=T)
mat

So if we type mat[1,] we get the entire first row, and similarly, if you type mat[,1] you get the first column of data.

For both Option B and Option C we can make things easier by saving the result under a variable: volume = rain$Volume or volume = rain[,2]. As well, if you want only particular rows or columns, you can couple the brackets with a vector, like last week, to specify which ones you want. For example, if you only want the first 10 rows of the volume data, you can use the command volume[c(1:10)], or, using the original dataset, rain[c(1:10),2]. Conversely, if you want to work with everything except the first 10 rows of data, you can use the command volume[-c(1:10)] or rain[-c(1:10),2]. The minus sign tells R not to include that vector of positions.
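The matrix example above, run end to end with the indexing variations:

```r
data <- seq(1, 25)
mat  <- matrix(data, nrow = 5, ncol = 5, byrow = TRUE)

mat[1, ]          # entire first row: 1 2 3 4 5
mat[, 1]          # entire first column: 1 6 11 16 21
mat[5, 2]         # row 5, column 2: 22

# A negative index drops positions instead of selecting them
mat[-c(1:2), 1]   # first column without its first two entries
```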

2.4 Graphical Interpretation and Transformations

First of all, before we talk about data transformations, we will talk about data distributions. For the most part, all of you should be familiar with what a normal distribution looks like; just think of a bell curve. So when you look at the histogram of any dataset, you should see that shape if the data is normally distributed. Let's take a look at the rain data and see whether or not the data is normally distributed. To do this, we type the command hist(volume) (that is, if we used the extra step for Options B and C). When you do this, you should see a rather ugly-looking figure that does not look like a normal bell curve, which is expected, or else there wouldn't be much point to this exercise.

If you know that the data should be normally distributed, which makes analysis easier, then you may need to do what is called a transformation on the data, in which we use a function to link the response (y) with the predictor (x). Do not worry too much about these concepts; they will become clearer later on when we get into linear regression. There are four common functions that we can use to transform data:

(i) squaring the data: volume^2
(ii) taking the square root of the data: sqrt(volume)
(iii) taking the log of the data: log(volume)
(iv) taking the exponential of the data: exp(volume)

Now, before we produce the histograms, we'll show you how to place multiple graphs on one display; this way we can compare the graphs more easily and choose the better function for transforming the data. To do this, you type the command par(mfrow=c(x,y)), which brings up a display with x rows and y columns. Initially you will not see anything, but once you start plotting, the graphs come up on the display. So, let's take a look at our data transformations:

par(mfrow=c(2,2))
hist(sqrt(volume))
hist(volume^2)
hist(log(volume))
hist(exp(volume))

Upon closer inspection, we see that two of the transformations look horrible, but

two are quite nice-looking, apart from a small discrepancy on the right. Nonetheless, these two do show the nice bell-shaped curve that we would expect. We may ask, though: which transformation do we use, and what is that little blip on the right side of the graph? The first question cannot be answered easily, but answering the second can help identify which transformation is best. For now, we're going to use the log transformation, because the data were deliberately generated by transforming a normal distribution with the exponential function. To continue our graphical analysis with the histograms, we'll type:

hist(log(volume))

and we see a definite improvement over the unmodified dataset; and when we type:

hist(log(volume[-1]))

Voila! A very nice, perfectly symmetrical bell curve, indicating that our data is now approximately normally distributed. Please be careful with the order of brackets: getting it wrong will give you something half of the time (which will not be what you are looking for) and error messages the other half.
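Since rain.txt itself is not reproduced here, the following sketch simulates a right-skewed stand-in for volume (normal draws pushed through exp(), mimicking how the text says the data were generated) and draws the comparison panel; pdf() is used only so it runs non-interactively:

```r
set.seed(1)
# Assumed stand-in data: 49 values, roughly centred near 120
volume <- exp(rnorm(49, mean = log(120), sd = 0.4))

pdf(out <- tempfile(fileext = ".pdf"))
par(mfrow = c(2, 2))
hist(volume)          # skewed to the right
hist(sqrt(volume))
hist(volume^2)
hist(log(volume))     # this one recovers the bell shape
dev.off()
```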

2.5 Outliers

In the last lab we commented on outliers and the use of graphics to view them. First, though, what is an outlier? An outlier is an observation that is numerically distant from the rest of the data, and may be due to some clerical or experimental error that afflicts our data. Now, from the histogram it looks like there may be an outlier, but we can do some extra checking using the boxplot. The boxplot computes the IQR (please see the last lab/course notes for details), and any point far enough outside the IQR the boxplot identifies as a potential outlier. To do this, we simply type boxplot(volume). Note: you can use the transformed data as well; it will give similar results. We can see that the boxplot does in fact identify potential outliers. The art of identifying outliers takes practice and sometimes, unfortunately, trial and error. For all intents and purposes, you won't have to worry too much, as we will make the outliers visible and clearly evident. For example, looking at our rainfall data, we see that the mean is approximately 131mm (while the median is approximately 123mm) and


yet the first value of rainfall is approximately 388mm, almost 3 times the average. Therefore, either Sydney had several floods that year, or there was an error inputting that data, and we can consider it an outlier. Now that we've identified a potential outlier, how do we get rid of it? As mentioned in the last lab, by using the minus sign (-) coupled with the hard brackets ([ ]) on any dataset, we can tell R to ignore that data. In our dataset, the outlier is in the first position (how convenient), so we type the command volume[-1] and it gives us our data minus the first entry. Note: please do not think that we can always deal with an outlier by simply removing it; sometimes it isn't that easy and other, more complex steps may be required. For our purposes here, though, you can assume you can delete the outlier.
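The fence rule the boxplot uses can be written out by hand. The volumes below are made-up values echoing the text (ordinary years plus a suspiciously large first value of 388mm), not the real rain.txt data:

```r
volume <- c(388, 110, 125, 96, 142, 131, 118, 150, 104, 137)

q1  <- quantile(volume, 0.25)
q3  <- quantile(volume, 0.75)
iqr <- q3 - q1                      # same as IQR(volume)
fences <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Values beyond the fences are flagged, much as boxplot() does
outliers <- volume[volume < fences[1] | volume > fences[2]]
outliers                            # 388

# Negative indexing drops the first (outlying) observation
cleaned <- volume[-1]
```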

2.6 Graphical Comparisons

We looked at how to make basic graphs; now we'll make some more complex ones, specifically graphs based on groups. We could do something similar to what we did above, but for now we'll use a graphical comparison instead. This looks like a job for a boxplot, but how do we create a boxplot based on groupings, when everything up to now has involved only a single column of data? Just like any other plot we make, we can always change the x and y axes, and to do this we use the formula notation with boxplot:

boxplot(Volume ~ State)

What this does is plot a boxplot of Volume based on the groups (a categorical variable) in State, with the grouping assigned using the ~ symbol. Since there are only two States, 1 and 2, we get only two boxplots. If, however, there were multiple groups, or if you wrote the formula backwards, you'd see quite a mess.
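A runnable sketch of the formula interface; since no State column is reproduced here, the two groups are simulated (pdf() only makes it run non-interactively):

```r
set.seed(2)
# Assumed stand-in data: two groups of 25 volumes each
Volume <- c(rnorm(25, mean = 120, sd = 15), rnorm(25, mean = 160, sd = 15))
State  <- factor(rep(c(1, 2), each = 25))

pdf(out <- tempfile(fileext = ".pdf"))
boxplot(Volume ~ State)   # one box per level of State
dev.off()
```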


Chapter 3 Simple and Multiple Linear Regression


3.1 Required Datasets

This week we will learn about linear regression (simple and multiple), so please go to the course website and download the data files gestation.txt and blood pressure.txt. When you download your files, please save them somewhere on your Z: drive, run R, and change the directory to wherever you saved the datasets.

3.2 Association of Variables

In statistics, an association arises when two variables are related; it is often confused with causation, though association does not imply a causal relationship. In formal statistics, correlation and regression are considered measures of association, and they are very closely related topics. Technically, if the independent variable (X) is fixed, that is, if it includes all of the values of X to which the researcher wants to generalize the results, then the analysis is a regression analysis. If both X and the dependent variable (Y) are random, free to vary (were the research repeated, different values of X and Y would be obtained), then the analysis is a correlation analysis.


3.3 Introduction to Simple Linear Regression

The first question is: what is linear regression? In statistics, it is a regression method that models the relationship between a dependent variable, y, and an independent variable (or variables), x. Note the plural: there can be many independent variables xi, but for now we'll focus on simple linear regression, where only one x is involved. The methods we will learn, though, carry over to multiple linear regression. The simplest example goes back to high-school math, when we learned about linear equations, which is essentially what linear regression is, so the form y = mx + b will look familiar, where m is the slope and b is the y-intercept. In our model, though, we use a similar form with one extra term, ε, which we call the error term. This term is a random variable that accounts for the differences between the predicted values, calculated from the model, and the observed values. Remember, we are developing a model to predict the values of y through x, but the predicted values aren't always going to match the observations exactly, hence the error term. In general, our simple linear regression model looks like the following:

y = β0 + β1*x1 + ε

where β0 is the intercept and β1 is the regression coefficient for the independent variable. The magnitude of the regression coefficient, i.e. the slope β1, is the change in the dependent variable y that results from a one-unit increase in the independent variable x1. This magnitude does not tell you how much x1 changes; x1 always increases by one unit to make y change by β1 units. The sign of β1 tells us whether y increases or decreases by β1 units when x1 increases by one unit. (If you begin your interpretation with a unit decrease in x1, remember to reverse the direction indicated by the sign when you describe the change in y.) A positive coefficient means x1 and y change in the same direction: if x1 increases, then y increases; if x1 decreases, then y decreases. A negative coefficient means x1 and y change in opposite directions: if x1 increases, then y decreases; if x1 decreases, then y increases. The intercept β0 represents the mean of y when x1 = 0. The most interesting parameter in a linear model is usually the slope β1. If the slope is zero, the line is flat, so there is no relationship between the variables.
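The slope interpretation above can be seen numerically by simulating from a model with a known intercept and slope and letting lm() recover them (the values 2 and 3 are illustrative choices, not anything from the course data):

```r
set.seed(3)
# True model: y = 2 + 3 * x + error
x <- seq(0, 10, length.out = 100)
y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 1)

fit <- lm(y ~ x)
coef(fit)   # estimates should land near the true intercept 2 and slope 3
```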

3.3.1 Reading Data for Simple Linear Regression

Now that we have had a brief introduction to what linear regression is, let's read our data and start modeling. Just like last week, we use the read.table() command, so let's type:

> gest = read.table("gestation.txt", head=T)

If we type gest, we should see A LOT of lines of data. What we have here is a random sample of 400 pregnant women. Let's have a sneak peek at the data (first five rows) using the following:

> gest[1:5,]

Also, let's check the column names for use in further analysis, as follows:

> names(gest)

The purpose of our analysis is to see if there is a relationship (hopefully linear) between the abdominal circumference (ac) and the gestation period (gawks). Now, if we plot a scatterplot:

> plot(gest$gawks, gest$ac)

we see the relationship between gawks (x) and ac (y), and lo and behold, there is a definite linear relationship. Please be careful about the order in which you plot your data: the first variable is the independent variable, x, and the second variable is the dependent variable, y. If you switch the order, you'll still see the linear relationship, but when we get into more complex material, that mistake may mislead you.

3.3.2 Data Analysis - I: Simple Linear Regression

So in this dataset we have the gestation period of the woman (gawks) and the abdominal circumference (ac), which have an obvious linear relationship that we want to model. To do this, we use the function lm(), which stands for Linear Model. To use this function, we first need to define the variables, which we can do the usual way, either by attaching the dataset or by creating, in this case, two variables containing the information in gestation.txt.

> attach(gest)

Now ac and gawks become pseudo-variables containing their respective data. Here, gawks is our independent variable, x, and ac is our dependent variable, y. To use the lm() function, we type:

> linreg = lm(ac ~ gawks)
> plot(gawks, ac, main = "Scatter", sub = "Regression line")
> abline(linreg, lwd = 4)


Here we are creating a variable/object called linreg that contains a linear model of ac based on its relationship to gawks. Again, BE CAREFUL about the order: y comes first and x comes second, with the ~ indicating a relationship between the two. Do not mix this up with the argument order for the function plot(), as that will give you completely different values. IMPORTANT: in the example above we have a single independent variable. However, if there were several independent variables xi, then we would use:

lm(y ~ x1 + x2 + ... + xp)

more of which we will learn in multiple linear regression. The linreg object has several components; to check, try the following command:

> names(linreg)

and we can call each of them using the $ sign. For example, we can extract only the coefficients as follows:

> linreg$coefficients

Now if we take a look at the model using the command summary(linreg), it will tell us how well the linear model we created fits the data. The first lines of the output show the formula for the model. The most important pieces of information are (i) the Coefficients and (ii) the Adjusted R-squared value. Under the coefficient information, you see the estimate for our intercept, β̂₀ = 67.57293, and the estimate for the slope, β̂₁ = 10.90542, which corresponds to the variable gawks. Therefore, our estimated linear model is:

ŷ = β̂₀ + β̂₁x = 67.57293 + 10.90542x

where x = gawks and ŷ is the estimated expected value of y, the observed value, under our model. Please remember that we are creating a model to represent the data, and therefore we are talking about estimates, hence the terms ŷ, β̂₀, and β̂₁, which are estimates of y, β₀, and β₁.

Along with the estimates, you will also see the standard errors and p-values. These are VERY important, as the standard error tells us how far our estimate of, say, β₀ could be from the true value. Therefore, the larger the standard error, the further our estimates and model could be from the true values. The p-value is very important because it tells you how important the coefficient is. If the p-value is less than 0.05, then the corresponding regression coefficient is significant and so, for example, we would keep β̂₀ = 67.57293. If, however, the p-value were greater than 0.05, then the coefficient is not significant and we would assume that β₀ = 0, so our linear model would go through the origin.

The Adjusted R-squared value (adj. R²) and the Multiple R-squared value are important because they tell us how much of the variability in the data can be explained by our model. A value close to 1 implies that we have a very good-fitting model, while a smaller value may indicate either that we have a bad model and need to modify it, or that the variable (x) used has no relationship to what we are trying to predict (y). So, since we have Multiple R-squared R² = 0.9764 and adj. R² = 0.9764 (the adjusted version is more useful for multiple regression, where we have more than one variable), we can say that our model above is pretty good. To see this, if we re-plot the data and then use abline(), we can see both the plotted data and our model:

> plot(gawks, ac)
> abline(linreg)

And as we can see, the line cuts right through our plotted data, giving us the best fit. Backtracking for a moment, we can see that some points are not exactly on the line, and some are further from the line than others. The difference between those data points and the line corresponds to the residuals, ε, that we were talking about earlier. This next part is very important, because the residuals have a lot to do with model diagnostics, which is the next section. To calculate the residuals in R, you can compute the predicted values from our model linreg using the command predict() (or extract the fitted values stored in the model) and subtract them from the observed values: e = y − ŷ (where e is the estimated version of ε):

> pred = predict(linreg) # (method 1)
> pred = linreg$fitted.values # (method 2)
> res = ac - pred # (method 1)

or use either the command residuals() or the suboption $residuals on the model:

> res = residuals(linreg) # (method 2)
> res = linreg$residuals # (method 3)

Note: # is used to put comments in R commands.
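As a quick sanity check, the different ways of computing residuals agree, and least-squares residuals sum to (numerically) zero. This is just an illustrative sketch on simulated data; the variables here are made up and are not from the gest dataset:

```r
# Simulated example: y depends linearly on x plus noise
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x)

res1 <- y - predict(fit)   # residuals by hand: e = y - y-hat
res2 <- residuals(fit)     # residuals via the extractor function

all.equal(as.numeric(res1), as.numeric(res2))  # TRUE
abs(sum(res2)) < 1e-10                         # TRUE: residuals sum to ~0
```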

3.3.3 Diagnosis for Simple Linear Regression

Although we have what appears to be a good-fitting model, we need to perform model diagnostics to see if it is, in fact, the right model. Whenever we fit a model like this, we make assumptions about the data, and if these assumptions do not hold, our model cannot be trusted, even if it looks good. The assumptions we make are about the errors, ε, but because they are not observable we can only inspect the residuals, e, to verify the assumptions, which are:

(i) mean of 0 (linear model holds)
(ii) constant variance
(iii) identically normally distributed

Mean 0

If the mean does not equal 0, then we already have a problem and the assumption is violated. However, the mean of the residuals always equals 0 by construction of the least squares estimates β̂ᵢ, but just because it equals 0 does not mean that our model holds its linearity. Therefore we need to graph a scatter plot of e versus x, or e versus ŷ, to make sure that the assumption holds. In general, when we graph the scatter plot, all the points should be spread along a horizontal band, centred about y = 0; if we see a slight curvature or any major deviation from a linear trend, it may mean that the relationship is nonlinear and the first assumption is violated.

Constant Variance: Homoscedasticity

Constant variance is as it sounds: if we plot a scatter plot of the residuals versus x, or versus ŷ (known as a residual plot), we should see that overall the variance is constant. Visually, this means that on the plot we should be able to draw two lines to create a band that all the residuals fall within, and that band should have a relatively constant height from end to end; if it does not, we may not have constant variance. Please note, though, the difference between non-constant height and a slight deviation due to an outlier. In statistics, outliers are very common, so do not misinterpret an outlier as non-constant variance; only if there is an obvious continuous change in the band's height do we have non-constant variance. So, as we have read, when we plot a scatter plot of the residuals (against the predicted values), we should be looking for three distinct characteristics, and with our dataset we get the following:

> plot(pred, res)

And as we can see, the plot is centred approximately around y = 0; overall there is no nonlinearity, as the plot is relatively horizontal across y = 0; and the variance is constant, since there are no major deviations from the band across (−20, 20), i.e. a constant height with no upward or downward trend. The figures below are some examples where the two assumptions are violated:

So the first figure is pretty much the ideal-looking scatter plot, which satisfies the first two assumptions. In the second figure, however, we can see that there is a slight curvature, and so the relationship is nonlinear, in spite of the mean equaling 0. And in the third figure, we can see that the variance changes as the predicted values increase, which indicates non-constant variance.
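If you want to see these patterns for yourself, they are easy to simulate. This is an illustrative sketch with artificial data, not part of the lab dataset:

```r
set.seed(42)
x <- runif(100, 1, 10)

# Good fit: linear model holds, constant variance
y_good    <- 2 + 3 * x + rnorm(100, sd = 1)
# Nonlinear: fitting a line to a quadratic trend leaves curved residuals
y_curved  <- 2 + x^2 + rnorm(100, sd = 1)
# Heteroscedastic: the error spread grows with x
y_fanning <- 2 + 3 * x + rnorm(100, sd = x)

par(mfrow = c(1, 3))
for (y in list(y_good, y_curved, y_fanning)) {
  fit <- lm(y ~ x)
  plot(predict(fit), residuals(fit), xlab = "fitted", ylab = "residuals")
  abline(h = 0, lty = 2)  # residuals always average 0, but the shape differs
}
```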

Independently and identically normally distributed

For the residuals to be identically normally distributed simply means that they come from a normal distribution, which has a mean of 0 (μ = 0) and a variance of σ², hence the need for a constant variance. To figure out whether or not the residuals are normally distributed, there are two methods. The first is to simply look at a histogram of the residuals: as we discussed previously, if the data are normally distributed, the histogram will show a bell-shaped curve. The other, more definitive method is to use the command qqnorm():

> qqnorm(res)

So first, the theory behind how the qqnorm() function works. qqnorm() creates a sample that comes from a normal distribution and then takes the sample data that, in theory, comes from the same distribution. If our data do come from a normal distribution, then there should be a linear relationship between the two, and therefore we should see a straight line. And as we can see in our graph, there is a relatively linear relationship, and therefore our sample data (the residuals) are normally distributed. IMPORTANT: although the tail ends are slightly off, the focus is on the central part of the graph, which is what needs to show the linear trend the most. This can be checked with the command qqline(), which plots a line of best fit through the QQ-plot:

> qqline(res)
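The behaviour of the QQ-plot is easy to demonstrate on simulated samples; this sketch compares a normal sample with a deliberately skewed one (both artificial):

```r
set.seed(7)
normal_sample <- rnorm(200)      # should track the qqline closely
skewed_sample <- rexp(200) - 1   # right-skewed: the tails bend away

par(mfrow = c(1, 2))
for (s in list(normal_sample, skewed_sample)) {
  qqnorm(s)
  qqline(s)  # points hugging this line suggest normality
}
```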

In the picture of the qqnorm and qqline output from two sets of data, the one on the right shows a near-perfect fit, and therefore we can assume that those residuals are normally distributed. The one on the left, though, obviously shows a deviation from the qqline, and therefore we can assume that the residuals for that particular dataset are NOT normally distributed. For our own data, the line goes through the central portion of the plot, showing no major deviations, and therefore we can assume that our residuals are normally distributed. Now, since all three assumptions hold, we can safely assume that our model is appropriate. A final remark, though: how do we deal with our model if one or more of the assumptions are violated? Well, we discard the model and search for a better one. We will discuss more about this in the multiple linear regression sections. Before we finish this section, let's point out a command for advanced users that gives us four useful diagnostic graphs all at once (residuals vs fitted, scale-location, normal Q-Q, residuals vs leverage):

> par(mfrow=c(2,2))
> plot(linreg, las = 1)


3.4 Introduction to Correlation

Correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation or co-relation refers to the departure of two variables from independence. A number of different coefficients are used for different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. For our dataset, we find it as follows:

> cor(gest$gawks, gest$ac)

Notice that if we square it, we get the R-squared value (0.9763925) as reported by the summary of the linear regression (0.9764):

> cor(gest$gawks, gest$ac)^2
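This identity (squared Pearson correlation equals the simple-regression Multiple R-squared) is easy to verify on simulated data; the variables here are made up for illustration:

```r
set.seed(3)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

r  <- cor(x, y)                      # Pearson correlation coefficient
r2 <- summary(lm(y ~ x))$r.squared   # Multiple R-squared from the regression
all.equal(r^2, r2)                   # TRUE
```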

3.4.1 Correlation and Causality

The conventional dictum that correlation does not imply causation means that correlation cannot be validly used to infer a causal relationship between the variables. That is, establishing a correlation between two variables is not a sufficient condition to establish a causal relationship (in either direction). A correlation can be taken as evidence for a possible causal relationship, but cannot indicate what the causal relationship, if any, might be.


3.5 Introduction to Multiple Linear Regression

In the previous sections, we learned about simple linear regression, and so in the following sections we'll learn how to create and select a model for multiple linear regression. This is done the same way as for simple linear regression, except that we are trying to find the relationship between the dependent variable y and several independent variables x₁, …, xₖ. Similar to the model for simple linear regression, the general model for multiple linear regression is:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε

where k is the number of independent variables. Note that β₀ is the intercept and the βᵢ (for i ≥ 1) are the regression coefficients for our independent variables; therefore there are k + 1 regression coefficients. Now we'll talk about multiple linear regression, model selection using the p-value, and model analysis, which is done the same way as for simple linear regression (of course, simple linear regression is the special case of multiple linear regression where k = 1). In multiple regression, interpretation is the same as in simple linear regression, where each coefficient βᵢ measures the impact of the corresponding xᵢ on y, keeping all other independent variables constant.
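A minimal sketch of this model on simulated data (all variable names and true coefficients here are invented for illustration) shows the k + 1 coefficients directly:

```r
set.seed(11)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)  # true model: beta = (1, 2, -3)

fit <- lm(y ~ x1 + x2)
coef(fit)  # k + 1 = 3 coefficients: intercept plus two slopes
```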

3.5.1 Reading Data for Multiple Linear Regression

First of all, when we try to create a model to find the relationship between our independent variables xᵢ and our dependent variable y, there may not always be a relationship. For example, say we wanted to create a model to show the relationship between a person's height and how well they can read; obviously there may not be a relationship, because a person's height shouldn't affect their reading ability. So, keeping this in mind, let's run R and load the file as usual. The dataset blood pressure.txt contains 20 male individuals' systolic blood pressure measurements, heights, and ages, and it is our job to create a model to find a relationship between a person's blood pressure (y) and their height and age (the xᵢ's).

> bp = read.table("blood pressure.txt", head=T)
> attach(bp)
> names(bp)

Similar to how we set up our model for simple linear regression, we use the command lm(). Unfortunately, unlike with simple linear regression, we cannot use a scatter plot to see if there is a relationship between a person's blood pressure, also known as the response variable, and their height and age, also known as the predictor variables. This is because the plot() command can only really deal with two variables at a time, and therefore our analysis is based on how well the model fits and whether or not our model assumptions hold true. So, let's fit our model and start our analysis:

> multireg = lm(bpress ~ height + age)

3.5.2 Data Analysis - II: Multiple Linear Regression

Since we cannot see via graphical methods whether or not there is an obvious relationship between the response and predictor variables, we need to analyze our assumptions first, because if the assumptions do not hold, we need to change our model. So, just like with simple linear regression, we'll look at the residuals versus predicted values plot:

> res = residuals(multireg)
> pred = predict(multireg)
> plot(pred,res)

and as we can see, there are no obvious violations of our assumptions, as there are no nonlinear trends and the variance is relatively constant. To check for normality, we'll use the qqnorm() and qqline() commands:

> qqnorm(res)
> qqline(res)

and as we can see, the central portion of the Q-Q plot does show a linear trend, although it is obviously not as well defined as we saw last week, which is mainly due to the fact that our sample size is so small (20 individuals in this sample versus the 400 individuals in the sample used in lab 4). Therefore, since our assumptions hold true, the model type we selected is valid.

3.5.3 Diagnosis for Multiple Linear Regression and Model Selection

Since we know that there is a linear relationship between an individual's blood pressure and their height and age (the model assumptions hold true), we need to take a look at the model generated by R, which is done using the summary() command:

> summary(multireg)

As we can see, the adjusted R² value is 0.9997, which obviously shows that the model we generated fits very well. If it fits well, though, then why should we bother continuing? To answer that, we'll need to talk about a concept called parsimony, which basically means that the simpler model is the better choice. To figure out which is the better choice, we look at the p-values for each of the regression coefficients βᵢ. For each regression coefficient calculated by R, there is an accompanying p-value that tells you how significant the βᵢ value is. Don't worry about what significance is supposed to mean; just know that if the p-value for a particular βᵢ is greater than 0.05, then we can assume that βᵢ = 0. Therefore, if βᵢ equals 0, then the predictor variable xᵢ has no relationship to the response variable y. Knowing this, though, it is important NOT to automatically disregard every βᵢ that has a p-value greater than 0.05. We select the predictor variable with the highest p-value (the maximum p-value any coefficient can have is 1) and remove it from the model. We then check the summary again to see the effects of removing that variable on the p-values of the other predictor variables and on the fit, which we judge through the adjusted R² value. We continue doing this until all the p-values are under 0.05 or until the adjusted R² value starts to get too small.

Reviewing the summary of our model, in spite of the fact that the adjusted R² value is so high, the p-value for the predictor variable height is very high and definitely greater than 0.05. Therefore, because of parsimony and the high p-value, we can consider the βᵢ value that corresponds to height to be 0, and we can now reduce our model from:

y = β₀ + β₁x₁ + β₂x₂

to

y = β₀ + β₂x₂

where x₁ represents the individual's height and x₂ represents their age. Therefore, we now have to create a second model to analyze:

> multireg2 = lm(bpress ~ age)

With this new model, however, we have to check that our assumptions hold true to ensure that our model type is valid. Again, by plotting the residuals against the predicted values of multireg2 and looking at the Q-Q plot of the residuals, we can see that all of our assumptions do hold true. Then, by using the summary() command again, we can see that our simplified model looks good, with the p-values all below 0.05, and the adjusted R² value is actually even better, although this is a coincidence, as the adjusted R² value doesn't always increase.
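The elimination step can be sketched on simulated data, where one predictor is pure noise by construction. The variable names below mimic the example, but the data are entirely made up:

```r
set.seed(5)
n <- 40
age    <- rnorm(n, 45, 10)
height <- rnorm(n, 175, 8)            # pure noise with respect to y
y      <- 100 + 0.8 * age + rnorm(n)  # the response depends on age only

full  <- lm(y ~ height + age)
pvals <- summary(full)$coefficients[-1, 4]  # p-values, intercept excluded
pvals                                       # height's p-value is the larger one

reduced <- update(full, . ~ . - height)     # drop the weakest predictor, refit
summary(reduced)$adj.r.squared
```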

3.5.4 Assumptions and Troubleshooting

In both examples that we gave you, the assumptions always held true, but what do we do if they don't? In most cases, the easiest way to deal with a violation of our model assumptions is to transform the data via one of the functions we talked about earlier. To do this, we can either create a second y variable to analyze using the same methods as in the two prior examples, or, because this is R, we can combine functions together. For example, if the assumptions for blood pressure.txt did not hold true, we would have to transform the response variable bpress. Say we believed that it required a log transformation; then:

> bpress2 = log(bpress)

and by transforming bpress, we can still generate a model using the lm() command:

> multireg3 = lm(bpress2 ~ height + age)

or alternatively, we could combine functions:

> multireg3 = lm(log(bpress) ~ height + age)

Now, if the assumptions held true, then we can proceed to see if we can simplify our model. If all the p-values for this transformed model are less than 0.05, then our model becomes:

log(y) = β₀ + β₁x₁ + β₂x₂

or, another way of looking at it:

y = e^(β₀ + β₁x₁ + β₂x₂)
Any other cases, which may involve outliers or any number of other reasons (say, non-linearity; try ?nls for more help) that could violate our assumptions, will not be dealt with within the scope of this course, and therefore we will use only data transformations.
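The effect of the log transformation can be sketched with simulated data whose errors are multiplicative on the original scale (all names and true coefficient values below are invented for illustration):

```r
set.seed(9)
n <- 60
x <- runif(n, 1, 5)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.1))  # multiplicative errors in y

logfit <- lm(log(y) ~ x)  # linear once we model log(y)
coef(logfit)              # estimates close to the true values (1, 0.5)
```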


Chapter 4 Confidence Intervals and Hypothesis Testing


Handling a batch of commands in R
Say you are dealing with a sequence of R command lines to analyze your data, and after a while you realize that you made some mistake at the beginning. One way to correct such a mistake is to type all the commands again in R after correcting it in the appropriate places. Certainly, there are more efficient ways to deal with such problems (which is often the case when you are doing the exercise part of the labs). Programmers prefer to use text editors such as:

R Editor (built into R)
RWinEdt
Tinn-R
(X)Emacs with ESS
Kile (for Linux)

to port R commands from these editors, while having more control over the code. In this lab, we will discuss only the R Editor (built into R). To work with it, from the R console go to the menu: File > New script, which will open an R Editor window. Type in (or, more conveniently, copy-paste) the following in the R Editor window:


a = rnorm(10, mean = 0, sd = 1)
b = rnorm(1000, mean = 0, sd = 1)
print(a)
sum(b)
par(mfrow=c(2,1))
plot(density(a))
plot(density(b))

Then select those lines in the R Editor window and press Ctrl+R, or use the menu: Edit > Run line or selection. All the selected command lines will go to the R console, and the graph commands will open up the graphics device. Now, if you would like to set a = rnorm(50, mean = 0, sd = 1) instead of a = rnorm(10, mean = 0, sd = 1), all you need to do is change the number in the first line in the R Editor and run it again. To save the commands, select the R Editor, then go to File > Save as ... and save the file in Z: under the name test.R (please write test.R, not test). To run the saved file, type in the R console:

> source("z:/test.R")

To make additional changes in that file, open the script via the menu: File > Open script, select test.R from Z:, and edit it in the R Editor.

Loading a Package in R (optional)


Sometimes, other than the built-in functions in R, we might need some other user-written functions. These functions are usually stored in packages, which can be downloaded from the R site. This can be done by (1) running R, (2) clicking on Packages, (3) clicking on Install package(s)..., (4) selecting any mirror, like Canada (BC), and then (5) downloading the required package. For example, if our required function is in the TeachingDemos package, we will download that first. Now that we've got the appropriate package, we'll load the library by:

> library(TeachingDemos)

or

> require(TeachingDemos)

before running the function required for our analysis. Note: when you run the package TeachingDemos, it requires you to create a temporary directory on the C: drive. If you do not have permission to create directories on the C: drive (which might be the case on the lab computers), then you might not be able to install it properly, and hence the code related to this package will not work.
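A defensive way to load a package is to check first that it is installed; requireNamespace() returns TRUE or FALSE instead of raising an error. This is a generic sketch (the stats package is used here only because it ships with every R installation):

```r
# Check availability before attaching, so a missing package
# produces a friendly message rather than an error.
if (requireNamespace("stats", quietly = TRUE)) {
  library(stats)
} else {
  message("Package not installed; install it via install.packages()")
}
```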


4.1 Confidence Intervals, Hypothesis Testing

In this week's lab we will talk about confidence intervals and hypothesis testing.

4.1.1 Confidence Intervals

In statistics, we always deal with numbers like the average or mean of some variable, e.g. the average weight a beam can handle. These numbers, however, are based on the data we've collected, and so one sample dataset could differ from another, especially if the sample sizes are small; there is always variability. This is where confidence intervals come in. If we repeat the experiment over and over again (a large number of times) and construct a 95% confidence interval for each run of the experiment, then 95% of all such confidence intervals should contain the true value of the parameter.

Z test

For example, say we have a sample of n observations x₁, x₂, …, xₙ, and we assume that the observations come from a normal distribution with mean μ and variance σ², so X ~ N(μ, σ²). Keep in mind that μ and σ² are the true values and belong to the population, whereas x₁, x₂, …, xₙ are a smaller sample from the population. Now let's say that we are trying to find a 95% C.I. for μ, which we do using the sample mean X̄. First, though, we have to standardize X̄ so that it follows the standard normal distribution, which we can read from a table:

Z = (X̄ − μ) / (σ/√n) ~ N(0, 1)

When we look for a 95% C.I., we are looking at a two-tailed probability of α = 0.05 under the standard normal density curve. Keeping this in mind, and the fact that the normal distribution is symmetric about 0:

P( −z₁₋α/₂ < (X̄ − μ)/(σ/√n) < z₁₋α/₂ ) = 0.95

Looking up the z-value with α/2 = 0.025, i.e. 1 − α/2 = 0.975, we find:

P( −1.96 < (X̄ − μ)/(σ/√n) < 1.96 ) = 0.95

and by rearranging the inequality inside the probability, we see that:

−1.96 σ/√n < X̄ − μ < 1.96 σ/√n
X̄ − 1.96 σ/√n < μ < X̄ + 1.96 σ/√n

Therefore a 95% C.I. for μ is:

( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n )

Now that we know the general formula for a 95% CI, or any 100(1 − α)% CI, let's put it into practice. Say we want to compute a 95% CI for the true average μ, where σ = 2.0, n = 31, and x̄ = 80.0. Then the interval is:

x̄ ± 1.96 σ/√n = 80.0 ± 1.96 × 2.0/√31 = 80.0 ± 0.7 = (79.3, 80.7)

t test

This is good for the case when we know the true population variance, but what happens when we don't? To deal with the case of an unknown σ², we use the sample variance s² and the Student t-distribution. All the steps are the same, except that instead of a z-value we need a t-value, found in the same manner but based on (n − 1) degrees of freedom. This compensates for not knowing σ² by making our CI slightly larger. Therefore, a 95% CI for a sample with unknown σ² is:

( x̄ − t₁₋α/₂,(n−1) s/√n , x̄ + t₁₋α/₂,(n−1) s/√n )
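The worked example above can be reproduced directly in R, where qnorm(0.975) supplies the 1.96 quantile:

```r
sigma <- 2.0
n     <- 31
xbar  <- 80.0

z  <- qnorm(0.975)                       # = 1.96 to two decimals
ci <- xbar + c(-1, 1) * z * sigma / sqrt(n)
round(ci, 1)                             # 79.3 80.7
```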


4.1.2 Hypothesis Testing

A statistical hypothesis is a claim about the value of a parameter, e.g. μ = 0.75 is the true average diameter of a PVC pipe. In any hypothesis-testing problem, there are two contradictory hypotheses to consider. One hypothesis may claim that μ = 0.75 and the other that μ ≠ 0.75. The objective is to decide, based on sample information, which of the two hypotheses is right. The first claim is called the null hypothesis H₀ and the other is the alternative hypothesis Hₐ. When we are testing, we always assume that the null is true, and then based on the sample data, we will either reject the null hypothesis in favour of the alternative, or fail to reject the null hypothesis. We do NOT accept the null hypothesis; we can only fail to reject it. This is just like a court of law: a judge declares a defendant guilty only when it can be proved that he committed the crime of which he was accused. When the crime cannot be proved, the judge does not say he is innocent; he says that, based on the evidence (the sample data, in our case), he is NOT proven guilty. Just because we could not prove that he did the crime does not mean that he did not do it. Think about it: this is the basis of hypothesis testing.

A question you may ask now is: how do we determine whether we reject or fail to reject the null hypothesis? First, we need to determine the significance level of the test, α, which by tradition is 0.10, 0.05, or 0.01 (the most common being 0.05). This should start looking familiar, especially after the last section and the p-values in previous labs. The smaller the α value, the more stringent the test to reject or fail to reject the null hypothesis is. For practical purposes, there are two types of tests that we will deal with. If μ is the true value and μ₀ is the postulated value for which we are testing, then the possible one-sided and two-sided tests are:

One-Sided Test: H₀: μ ≥ μ₀, Hₐ: μ < μ₀  or  H₀: μ ≤ μ₀, Hₐ: μ > μ₀
Two-Sided Test: H₀: μ = μ₀, Hₐ: μ ≠ μ₀

Now that we know what hypothesis testing is, how do we perform the test? This is where the confidence interval comes in. If the corresponding CI excludes μ₀, then we reject the null hypothesis. Alternatively, if the corresponding CI contains μ₀, then we fail to reject the null hypothesis. To calculate the CI, we need to find the z-value or t-value, which is based on the α value. For a two-sided test, the z- and t-values are the same as we determined before; that is, our confidence interval is:

( x̄ − z₁₋α/₂ σ/√n , x̄ + z₁₋α/₂ σ/√n )

Please note that for a CI based on a t-value, we just replace the z-value with the t-value and σ with the sample standard deviation s; there is NO DIFFERENCE IN THE STRUCTURE. On the other hand, for a single or one-sided test, the z-value is z₁₋α, and the confidence interval is no longer symmetric as it was for a two-sided test. For the case where the alternative hypothesis is Hₐ: μ < μ₀, the CI is:

( −∞ , x̄ + z₁₋α σ/√n )

and for the case where the alternative is Hₐ: μ > μ₀, the CI is:

( x̄ − z₁₋α σ/√n , ∞ )

However, this is just one way of reaching the conclusion of a hypothesis test. Other equivalent ways are:

1. to calculate the value of the test statistic and compare it with the tabulated value;
2. to calculate the p-value and compare it with α.
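A quick sketch on simulated data shows that the hand-computed intervals match what t.test() reports (the variable names and parameter values here are illustrative):

```r
set.seed(2)
x <- rnorm(25, mean = 10, sd = 3)

# Two-sided 95% CI: t.test() and the hand formula agree
tt      <- t.test(x, mu = 0, conf.level = 0.95)
by_hand <- mean(x) + c(-1, 1) * qt(0.975, df = 24) * sd(x) / sqrt(25)

# One-sided CI for Ha: mu > mu0 is (xbar - t(1-alpha) * s/sqrt(n), Inf)
lower <- mean(x) - qt(0.95, df = 24) * sd(x) / sqrt(25)
t.test(x, alternative = "greater")$conf.int[1]  # same value as 'lower'
```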


4.2 Required Datasets and Analysis in R

In the previous sections, we learned about confidence intervals and hypothesis testing, and so now we'll look at some practical applications to see how they actually work. As usual, go to the course website, download support.txt, and load it into R, except that this time we'll use the scan() command to import the data. In this dataset, we have the percentage of voters that agree with the President from a sample representing each of the 50 States in the US. The government claims that the average support rating is at least 50%. Is there enough justification to support this claim at, say, a level of α = 0.05? We will discuss this in 4.2.4.

4.2.1 Hypothesis Testing using Z test

We will go through some dummy exercises before addressing the above problem. Instead of downloading a dataset, we are going to create our own to do some tests. Let's now create some test data. We'll create two samples from normal distributions:

> n = 500
> alpha = 0.05
> std = 2
> x1 = rnorm(n, 0, std)
> x2 = rnorm(n, 5, std)

By doing this, x1 and x2 are from normal distributions with means 0 and 5, respectively, and a standard deviation of 2. Note: everyone's results will be different because the data are randomly generated. Now let's say we are testing H₀: μ = 0 versus Hₐ: μ ≠ 0 with x1. To calculate a 95% CI, we type:

> margin.of.error1 = abs(qnorm(alpha/2))*std/sqrt(length(x1))
> ci1 = c(mean(x1)-margin.of.error1, mean(x1)+margin.of.error1)
> ci1

39

and as we can see, μ₀ = 0 is contained within the 95% CI, and therefore we fail to reject the null hypothesis. Doing the same for x2:

> margin.of.error2 = abs(qnorm(alpha/2))*std/sqrt(length(x2))
> ci2 = c(mean(x2)-margin.of.error2, mean(x2)+margin.of.error2)
> ci2

and without looking we should know that we are going to reject the null, because the mean of x2 is approximately 5. Confirming this by looking at the interval, we can see that μ₀ = 0 is NOT contained within the 95% CI, and is considerably off. Thus, we reject the null hypothesis in favour of the alternative.

Z test using Package / Function (Optional)


To do the test, let us download norm.R from the course web to the Z: drive, and source it as:

> source("z:/norm.R")

This file has a function norm.test written in it to do the Z test. You are probably wondering how we can do the test by hand; we will explain this step by step in the next section. Note: we could download and install a package for Z-testing called TeachingDemos, where z.test is the command to do the test, and norm.test is just the extracted version of z.test (instructions were provided before on how to install and load packages), but we wanted to show you how to use the source() command. For curious readers, the extracted function is as follows (saved in the file norm.R):

norm.test <- function (x, mu = 0, stdev,
    alternative = c("two.sided", "less", "greater"),
    sd = stdev, conf.level = 0.95, ...)
{
    if (missing(stdev) && missing(sd))
        stop("You must specify a Standard Deviation of the population")
    alternative <- match.arg(alternative)
    n <- length(x)
    z <- (mean(x) - mu)/(sd/sqrt(n))
    out <- list(statistic = c(z = z))
    class(out) <- "htest"
    out$parameter <- c(n = n, "Std. Dev." = sd,
        "Std. Dev. of the sample mean" = sd/sqrt(n))
    out$p.value <- switch(alternative,
        two.sided = 2 * pnorm(abs(z), lower.tail = FALSE),
        less = pnorm(z),
        greater = 1 - pnorm(z))
    out$conf.int <- switch(alternative,
        two.sided = mean(x) + c(-1, 1) * qnorm(1 - (1 - conf.level)/2) * sd/sqrt(n),
        less = c(-Inf, mean(x) + qnorm(conf.level) * sd/sqrt(n)),
        greater = c(mean(x) - qnorm(conf.level) * sd/sqrt(n), Inf))
    attr(out$conf.int, "conf.level") <- conf.level
    out$estimate <- c("mean of x" = mean(x))
    out$null.value <- c(mean = mu)
    out$alternative <- alternative
    out$method <- "One Sample z-test"
    out$data.name <- deparse(substitute(x))
    names(out$estimate) <- paste("mean of", out$data.name)
    return(out)
}

Now to do our test in an automated way using norm.R, we can use the above function norm.test(). On our first sample:

> norm.test(x1, mu=0, sd=std, alt="two.sided")

where we are testing whether, for x1, H₀: μ = 0 versus Hₐ: μ ≠ 0. Looking at the output, we can see that μ₀ = 0 is contained within the 95% CI, and therefore we fail to reject the null hypothesis. Doing the same for x2:


> norm.test(x2, mu=0, sd=std, alt="two.sided")

Without looking, we should know that we are going to reject the null, because the mean of x2 is approximately 5. Confirming this by looking at the output, we can see that μ₀ = 0 is NOT contained within the 95% CI, and is considerably off. Thus, we reject the null hypothesis in favour of the alternative. Please note that the CIs calculated are different for each student, because rnorm() randomly generates the dataset. Similarly, we can adjust the alternative test via the suboption alt, which can be set to either "two.sided", "less", or "greater", depending on the hypothesis test we are trying to perform.

4.2.2 Hypothesis Testing using t test

Instead of downloading a dataset, we are going to generate our own data to do some tests. We'll randomly create two samples from normal distributions with μ₁ = 0, μ₂ = 5, and both with σ = 2:

> x1 = rnorm(20, 0, 2)
> x2 = rnorm(20, 5, 2)

Now, for the purpose of illustration, let us assume that x1 and x2 are two vectors and we do not know where they came from (in other words, we do not know from which distribution they came, and the parameters are also unknown). In this scenario we actually do know the value of σ², but how do we deal with it in R when we don't know it? Lucky for us, R has this test built in, so we don't have to go searching for the t-value for (n − 1) degrees of freedom. In this case we use the t.test() command, so if we did not know σ² for x1 and were testing H₀: μ = 0 versus Hₐ: μ ≠ 0, then:

> t.test(x1, mu=0, alt="two.sided", conf.level=0.95)

By default it is always set for a two-sided test and a 95% confidence interval. And similar to before, we should see that μ₀ = 0 is contained within the confidence interval, so we do not reject H₀. If we did the same setup for x2, we would again most likely reject the null hypothesis. Instead of using the t.test function, we could do it as follows (in the R Editor):

# (hash) sign is used here for comments n1 = 20 # sample size of x1 n2 = 20 # sample size of x2 alpha = 0.05 # level of significance x1 = rnorm(n1, 0, 2) # sample 1 x2 = rnorm(n2, 5, 2) # sample 2 # Here n1 and n2 are lengths of X1 and x2 respectively margin.of.error1 = abs(qt(alpha/2,(n1-1)))*sd(x1)/sqrt(length(x1)) ci1 = c(mean(x1)-margin.of.error1, mean(x1)+margin.of.error1) ci1 # confidence interval for true mean of x1 m.x1 = mean(x1) # sample estimate of mean of x1 t.test(x1, mu=0, sd=sd(x1), alt="two.sided") t.test(x1, mu=0, sd=sd(x1), alt="less") t.test(x1, mu=0, sd=sd(x1), alt="greater") margin.of.error12 = abs(qt(alpha,(n1-1)))*sd(x1)/sqrt(length(x1)) ci11 = c(-Inf, mean(x1)+margin.of.error12) 43

ci11 # corresponds to C.I. reported with "less"
ci12 = c(mean(x1)-margin.of.error12, Inf)
ci12 # corresponds to C.I. reported with "greater"
margin.of.error2 = abs(qt(alpha/2,(n2-1)))*sd(x2)/sqrt(length(x2))
ci2 = c(mean(x2)-margin.of.error2, mean(x2)+margin.of.error2)
ci2 # confidence interval for true mean of x2
m.x2 = mean(x2) # sample estimate of mean of x2
t.test(x2, mu=0, alt="two.sided")
t.test(x2, mu=0, alt="less")
t.test(x2, mu=0, alt="greater")
margin.of.error22 = abs(qt(alpha,(n2-1)))*sd(x2)/sqrt(length(x2))
ci21 = c(-Inf, mean(x2)+margin.of.error22)
ci21 # corresponds to C.I. reported with "less"
ci22 = c(mean(x2)-margin.of.error22, Inf)
ci22 # corresponds to C.I. reported with "greater"

Note: The above example was created for illustrative purposes, where we generated the samples from a normal distribution ourselves, so the variances of the populations were in fact known. However, in real examples, if the population variance were known (with the sample coming from a normal distribution), we would do a Z test instead of a t test.
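Base R has no built-in z.test() command, but when σ is known the Z interval and two-sided test are easy to compute by hand. A minimal sketch, assuming σ = 2 as in our generated data:

```r
set.seed(1)                       # for reproducibility
sigma <- 2                        # population sd, known here by construction
x1 <- rnorm(20, 0, sigma)
n <- length(x1)
z <- (mean(x1) - 0) / (sigma / sqrt(n))                      # Z statistic for H0: mu = 0
ci <- mean(x1) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)   # 95% Z interval
p.value <- 2 * pnorm(-abs(z))                                # two-sided p-value
ci
p.value
```

Compare this interval with the one t.test() reports: for n = 20 the t interval is slightly wider, since qt(0.975, 19) > qnorm(0.975).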


Understanding C.I. (optional)


To get a better grasp of what we mean by "a 95% CI for µ is (a, b)", copy and run the following function in the R Editor:

sim.t.ci <- function(nsim = 1000, coln = 10, Mu = 0, StDev = 1){
  result = matrix(c(1:coln, rep(NA,coln), rep(nsim*0.95,coln),
                    rep(NA,coln)), ncol=4, byrow=F)
  for(j in 1:coln){
    u = 0
    for(i in 1:nsim){
      x = rnorm(20, Mu, StDev)
      ci = t.test(x, mu = Mu, alternative = "two.sided",
                  conf.level = 0.95)$conf.int[1:2]
      u = u + (ci[1] < Mu & ci[2] > Mu)
      v = nsim*0.95 - u
      result[j,2] <- u
      result[j,4] <- v/nsim*100
    }
  }
  dimnames(result) <- list(NULL, c("[Simulation #]",
    "[Sample Result]", "[Ideal case]", "[Deviation (in %)]"))
  return(result)
}

To call the function, we type as follows:

sim.t.ci(nsim = 1000, coln = 10, Mu = 0, StDev = 2) # for sample size 20

This generates 1000 x's, each a dataset of 20 randomly generated data points from a normal distribution. It then counts the number of times the true mean Mu = 0 is contained within the 95% confidence interval. And as we can see, the number of times this occurs hovers somewhere around 950, and therefore 95% of the time.
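The same coverage idea can also be written more compactly with replicate(); this is just an alternative sketch, not part of the lab function:

```r
set.seed(123)
covered <- replicate(1000, {
  x <- rnorm(20, 0, 2)
  ci <- t.test(x, mu = 0)$conf.int   # 95% CI from each simulated sample
  ci[1] < 0 & ci[2] > 0              # TRUE when the CI captures the true mean 0
})
mean(covered)                        # proportion of intervals covering 0, near 0.95
```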


4.2.3 CIs, Hypothesis Testing, and Two Samples

Suppose that we have two samples, X1 and X2, that came from a normal distribution. As you learned in class, for σ² known, a 95% CI for µ1 − µ2 is:

( (x̄1 − x̄2) − 1.96 σ √((n1 + n2)/(n1 n2)), (x̄1 − x̄2) + 1.96 σ √((n1 + n2)/(n1 n2)) )

which is calculated by (here σ = 2, the value we used to generate the data):

> sigma = 2
> margin.of.error3 = 1.96*sigma*sqrt((length(x1)+length(x2))/
    (length(x1)*length(x2)))
> c(mean(x1)-mean(x2)-margin.of.error3,
    mean(x1)-mean(x2)+margin.of.error3)

And for an unknown σ², a 95% CI is:

( (x̄1 − x̄2) − t_{1−α/2, (n1+n2−2)} s_p √((n1 + n2)/(n1 n2)), (x̄1 − x̄2) + t_{1−α/2, (n1+n2−2)} s_p √((n1 + n2)/(n1 n2)) )

where s_p is the pooled sample standard deviation, which again can be done automatically by R:

> t.test(x1, x2, mu=0, alt="two.sided", var.equal=T)

and so if we were testing H0 : δ = 0 versus Ha : δ ≠ 0, where δ = µ1 − µ2, then similarly, our output should show that δ = 0 is not contained within the confidence interval, and therefore we would reject the null. Again, if we wanted to change the confidence level or test type, we would just adjust the appropriate sub-options.
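One detail worth flagging: var.equal=T gives the pooled (equal-variance) two-sample test used above, while R's default is the Welch test, which does not assume equal variances. A quick sketch comparing the two intervals:

```r
set.seed(2)
x1 <- rnorm(20, 0, 2)
x2 <- rnorm(20, 5, 2)
pooled <- t.test(x1, x2, mu = 0, var.equal = TRUE)$conf.int  # pooled CI, as in the lab
welch  <- t.test(x1, x2, mu = 0)$conf.int                    # Welch CI, R's default
pooled
welch   # with equal sample sizes and equal sds, the two are very close
```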


4.2.4 Hypothesis Testing and Confidence Intervals using an external dataset

Now that we are familiar with the test illustrations, in this section we will use support.txt. The government's claim is that the average support rating is at least 50%, therefore we are testing (based on sample data, that is, a small representative part of the population):

Statement 1: whether or not the true average support rating (from the population) is more than or equal to 50 (as the words "at least" are used), or

Statement 2: if we rephrase the question, we want to test whether or not the true average support rating (from the population) is less than 50.

Three things to notice here:

1. The two statements are opposite to each other.

2. As both statements contradict each other, both cannot be true at the same time. We have to decide which one holds.

3. One intuitive solution to this problem is to see whether the average of the sample is more than or equal to 50 or not. If it is more than 50, then we might want to say that the claim is substantiated. However, this will not work, as the inference is always about the population values (here the average), which we did not observe. We only had a small sample out of the whole population. Therefore, the average we have is for the sample data, not the population data, and hence cannot be directly used to say anything about the population. This is why we do hypothesis testing, which is a probabilistic way to solve this problem.

Statistically speaking, Statement 1 is called the Null hypothesis, H0, and Statement 2 is called the Alternative hypothesis, Ha. For this problem, they are as follows:

H0 : µ ≥ 50, Ha : µ < 50

Note: Notice that the strict inequality is in the Ha. This is a trick to recognize which one should be Ha (that means it can only be one of these: < or >, used when the question says the population value is greater than or less than some particular value), and the opposite of that would be H0 (where we have the equality, with or without any other inequality sign with it: that means it can only be one of these: ≤ or = or ≥, used when the question says the population value is at most, equal to, or at least some particular value).

As we stated previously, knowing the sample mean is not enough because of variability, and so we must create a confidence interval to cover this variability. Now, because we don't know what the population variance is, since we only have a sample, we must use the t-test to perform our hypothesis testing.

Note: In class it was taught that there might be three cases for choosing the test statistic (please refer to Chapter 8 of the course notes on CI and Hypothesis testing on one sample):

1. When the sample size is sufficiently large and the population variance is unknown, we can construct the z-interval instead of the t-interval.

2. The t-interval will be used when the underlying distribution of the variable of interest is normal and the sample size is less than 20.

3. With a large enough sample size, the sample variance is a close estimate of the population variance, and hence the t-curve is close to the z-curve, and hence the t-interval can be constructed as an approximation to the Z-interval (both intervals will be almost exactly the same, therefore it will not make much difference which one we use).

We will discuss more on this issue in the next lab under the topic Central Limit Theorem (CLT). To construct the t-interval, we use the t.test() command and, as we learned in class, we create a two-sided confidence interval of 100(1 − 2α)%, i.e. a 90% C.I. (as we are testing at a level of α = 0.05):

> t.test(support, mu=50, conf=0.90)

and as we can see in the output, the 90% CI is (47.33, 50.79). Therefore, since µ = 50 is contained within the CI, there is not enough evidence to reject the government's claim.

However, our hypothesis was one-sided; and thus for the sake of appropriateness, we need to use the sub-option alt="less" to perform a one-sided test at α = 0.05, which would generate a 100(1 − α)%, i.e., 95% CI. We write:

> t.test(support, mu=50, alt="less", conf=0.95)

which produces a CI of (−∞, 50.76343). (Note that the test is one-sided, and the way we calculated margin.of.error to find one limit before would be modified to use alpha instead of alpha/2, as in

margin.of.error = abs(qt(alpha,(n-1)))*sd(x1)/sqrt(length(x1))

for such tests, while the other limit has to be an infinite value.) The conclusion about this problem can be obtained as follows (all of these methods always agree, therefore doing any one of them will be sufficient):

1. Since µ = 50 is contained within the confidence interval (−∞, 50.76343), we Fail To Reject the null hypothesis. Remember, we do not accept hypotheses, we fail to reject them.

2. The reported test statistic value is t = −0.9511. There were n = 49 data points, making the df = n − 1 = 48. The df can also be obtained as DF = t.test(support, mu=50, alt="less")$parameter. Now, we compare the t statistic value with the t tabulated value for α = 0.05 (which can be obtained from R as well using the command qt(0.05, df=DF), which returns −1.677224. However, during the exam you might need to find this from a table given to you, so it is advised that you learn to look up the table). Now, as −1.677224 (tabulated value) < −0.9511 (calculated value), we again fail to reject the null hypothesis.

3. The reported p-value is 0.1732, whereas α = 0.05. Here p-value > α. Therefore, using the p-value method of hypothesis testing, we fail to reject the null hypothesis and reach the same conclusion.

Now, what happens if the US government got a little overconfident and believed that their popularity was in fact exactly 52% of the voters (now using a two-sided test is appropriate)? Similarly, we'd use the t-test for our hypothesis test except that this time we are testing:
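All three equivalent decision methods can be read off a single t.test() object. A sketch using hypothetical stand-in data (in the lab, use the values read from support.txt, so the numbers will differ):

```r
set.seed(3)
support <- rnorm(49, 49, 7)       # hypothetical stand-in for support.txt
tt <- t.test(support, mu = 50, alternative = "less")
tt$conf.int                                    # method 1: is mu0 = 50 inside the interval?
c(tt$statistic, qt(0.05, df = tt$parameter))   # method 2: t vs. the critical value
tt$p.value                                     # method 3: p-value vs. alpha = 0.05
```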

H0 : µ = 52, Ha : µ ≠ 52

and so adjusting the t.test() command for the new alternative and µ:

> t.test(support, mu=52)

Remember, by default the alternative hypothesis for the t.test() command is two-sided. This time, our confidence interval is (46.88602, 51.11398), and so since µ = 52 is not contained within the 95% CI, we can reject the null hypothesis of their claim that the average support rating is 52%.

Note: As the sample size is large, students could do a Z-test on the same external data to check that the constructed intervals are almost the same as those of t. This is not done here; we leave it as an exercise for the curious students.
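For the Z-interval exercise suggested in the note, a sketch of the comparison (again with hypothetical stand-in data, since the actual support values will differ):

```r
set.seed(6)
support <- rnorm(49, 49, 7.4)     # hypothetical stand-in for support.txt
n <- length(support)
zci <- mean(support) + c(-1, 1) * qnorm(0.975) * sd(support) / sqrt(n)
tci <- t.test(support, conf.level = 0.95)$conf.int
zci
tci    # with n = 49, the z- and t-intervals nearly coincide
```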


Chapter 5

Central Limit Theorem (CLT) and Analysis of Variance (ANOVA)


5.1 Central Limit Theorem (CLT)

Now, recapping what the CLT is: if X1, ..., Xn are a random sample from a distribution with mean µ and variance σ², then if n is sufficiently large, X̄ is approximately normally distributed with µ_x̄ = µ and σ²_x̄ = σ²/n; that is,

X̄ ≈ N(µ, σ²/n)

A question may arise, though: how do we know that n is sufficiently large? In general, the Rule of Thumb is that when the sample size n ≥ 20 we can use the CLT, and so the sample mean can be assumed to approximately follow a normal distribution. For example, when making batches of gold for filament wires in superconductors there are always impurities in the batches, and let's say that on average the amount of impurities in each batch is µ = 4.0g, with a standard deviation of σ = 1.5g. If 50 batches were independently made, what is the probability that the average sample impurity x̄ falls between 3.5 and 3.8g? That is, we want to approximate P(3.5 ≤ X̄ ≤ 3.8). Since n = 50, we can use the CLT and assume that X̄ is normally distributed with µ_x̄ = 4.0 and σ_x̄ = 1.5/√50:

X̄ ≈ N(4, 1.5²/50)


Therefore, standardizing the probability we get:

P(3.5 ≤ X̄ ≤ 3.8) ≈ P( (3.5 − µ_x̄)/σ_x̄ ≤ Z ≤ (3.8 − µ_x̄)/σ_x̄ )
= P( (3.5 − 4.0)/(1.5/√50) ≤ Z ≤ (3.8 − 4.0)/(1.5/√50) )
= Φ(−0.94) − Φ(−2.36) = 0.1645

Now suppose that the true mean and variance were unknown, and that our sample of size n = 50 produces x̄ = 4.0 and s = 1.5. How do we construct a 95% CI for the unknown µ? Under most circumstances, we don't know what σ² is, nor the original distribution of X. However, because n is sufficiently large, we can use the CLT and assume that the mean X̄ follows a normal distribution. Therefore, as we learned before, our 95% CI would be:

( x̄ − z_{1−α/2} s/√n, x̄ + z_{1−α/2} s/√n )
= ( 4.0 − z_{0.975} · 1.5/√50, 4.0 + z_{0.975} · 1.5/√50 )
= (3.584, 4.416)
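Both numbers in this example can be checked in R with pnorm() and qnorm():

```r
se <- 1.5 / sqrt(50)                              # standard error of the sample mean
pnorm((3.8 - 4.0)/se) - pnorm((3.5 - 4.0)/se)     # P(3.5 <= xbar <= 3.8), about 0.164
4.0 + c(-1, 1) * qnorm(0.975) * 1.5 / sqrt(50)    # the 95% CI, about (3.58, 4.42)
```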

Validation of the CLT


To validate the CLT, we will show two examples from different distributions. The first is a uniform distribution (please check Appendix .1 for its properties) and the second a binomial distribution.

Example using Uniform distribution

For the uniform distribution, let T = Σⁿᵢ₌₁ Xᵢ, where the Xᵢ are independent and follow a Unif(0, 1) distribution. Then, by the CLT, for n large enough:

T ≈ N(n/2, n/12)

Figure 5.1: Central Limit Theorem illustration using the Uniform distribution, with sample sizes n = 3, 5, 20, 50


To see this, using R we'll simulate 10,000 replications with n = 3, 5, 20, and 50. First we prepare the graphics device:

> par(mfrow=c(2,2)) # We do not replicate this line

Then we repeat the following code 4 times (changing the value of n each time):

> n = 3 # We change this line each time to 5, 20, and 50
> x = c()
> for(r in 1:10000){ x[r] = sum(runif(n)) }
> hist(x, freq=F)
> curve(dnorm(x, n/2, sqrt(n/12)), add=T)

We repeat the above to create three more x's for n = 5, 20, and 50 and plot them. The reason we have to do it this way is that for the curve() function to work, it requires the expression to be a function or an expression containing x. As we can see from figure 5.1, as n gets larger, the histogram fits the bell-curve shape better, and when we do n = 20 and 50, which meet the requirement for the CLT, the overlaps are near perfect.

Example using Binomial distribution

For the binomial distribution, Bin(n, p), we'll do the same for T = Σⁿᵢ₌₁ Xᵢ, where the Xᵢ's are independent Bin(1, p) random variables. Then, for n large enough:

T ≈ N(np, np(1 − p))

A good question at this point, though, is how large n should be so that the normal approximation is good. In this case, our normal rule of thumb n ≥ 20 doesn't apply the same way because we are dealing with a binomial distribution, and so the size of n depends on the probability p. The Rule of Thumb in this case is min{np, n(1 − p)} ≥ 5. This rule takes into account the fact that when p is close to 0 or 1, the binomial distribution is more asymmetric, and so in these cases the normal approximation requires a larger value of n.
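The rule of thumb is easy to check for the sample sizes we are about to simulate. A small hypothetical helper (rule.ok is our own name, not a base R function):

```r
rule.ok <- function(n, p) min(n*p, n*(1-p)) >= 5   # hypothetical helper
sapply(c(20, 30, 40, 50, 100), rule.ok, p = 0.10)
# FALSE FALSE FALSE TRUE TRUE: only n = 50 and n = 100 satisfy the rule at p = 0.10
```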


Figure 5.2: Central Limit Theorem illustration using the Binomial distribution, with sample sizes n = 20, 30, 40, 50, 100

Now, using R, let's set p = 0.10 and simulate 10,000 replications with n = 20, 30, 40, 50, and 100. Note: For n = 20, min{np, n(1 − p)} = min{2, 18} = 2.

> r = 10000; p = 0.10; q = 1-p
> par(mfrow=c(2,3))
> x = rbinom(r,20,p)
> hist(x,freq=F); curve(dnorm(x,20*p,sqrt(20*p*q)),add=T)
> x = rbinom(r,30,p)
> hist(x,freq=F); curve(dnorm(x,30*p,sqrt(30*p*q)),add=T)
> x = rbinom(r,40,p)
> hist(x,freq=F); curve(dnorm(x,40*p,sqrt(40*p*q)),add=T)
> x = rbinom(r,50,p)
> hist(x,freq=F); curve(dnorm(x,50*p,sqrt(50*p*q)),add=T)
> x = rbinom(r,100,p)
> hist(x,freq=F); curve(dnorm(x,100*p,sqrt(100*p*q)),add=T)

and so, as we can see from figure 5.2, the histogram fits the bell-curve shape of a normal distribution much better as we increase the number of trials n from 20 to 100. The better fit is because as we increase n, min{np, n(1 − p)} goes from 2 to 10, which eventually meets and exceeds our rule of thumb of min{np, n(1 − p)} ≥ 5. Note that in the figures, the subtitles are added for convenience; they are not generated by the given code directly. You have to add the sub sub-command in hist(). And of course, when you plot figures 5.1 and 5.2, do not be surprised if your plots are different from the ones given: as we are dealing with random numbers, this should be the case. Run your own code again, and next time it will produce a different set of plots. But the general features (the impact of the Central Limit Theorem) will be the same in all plots. That is why random numbers are so powerful: we can illustrate a general idea by the use of random numbers.


5.2 Analysis of Variance (ANOVA)

In the previous labs we learned about hypothesis testing, normally along the lines of whether or not some parameter, like µ, is equal to some constant value that the experimenter claims. What happens, though, if we want to test whether or not several groups are the same? For example, what if we wanted to test whether or not two shipments of steel rods have the same breaking points? To test the equality of k ≥ 2 groups, we build an ANOVA table, where ANOVA stands for Analysis of Variance.

Note: Technically speaking, ANOVA can be used to compare 2 or more population means. However, when there are only 2 groups to compare, ANOVA is more computationally intensive than necessary, so we just use the 2-sample z or t tests in such cases.

As you've seen in class, the calculations for ANOVA can be quite tedious, and so we'll show you how to do your analysis via R to make things much simpler. To start, let's go to the course website, download players.txt, and load it into R using the read.table() command. The data file contains the stats on the members of three hockey teams: the Vancouver Canucks, the Calgary Flames, and the Edmonton Oilers. Now suppose you wanted to test the hypothesis that the hockey players on the three teams have the same mean height. Therefore, our hypotheses are

H0 : µ_canucks = µ_oilers = µ_flames, Ha : at least one µ_i ≠ µ_j

To do this test, we'll need to use the lm() command. As we saw previously, the lm() command was used to build linear models, and again it comes into play here because we use it to build a relationship between the three teams and the heights of their players.

> hockey = read.table("players.txt", head=T)
> attach(hockey)
> names(hockey)
> model = lm(height ~ team)


Note: The response variable is height and we are comparing it with respect to the explanatory variable team, which is either the Canucks, the Oilers, or the Flames. If, by chance, we used numerical values like 1, 2, 3 to represent the different teams, R would treat the variable as continuous instead of categorical. To alleviate this, we would have to use the as.factor() command, which tells R that the data is categorical. Therefore, to set up our model:

> model = lm(height ~ as.factor(team))

Now that we have a model built, we can perform an ANOVA on our data. Interestingly enough, although we are performing ANOVA on our data, the command we will use is actually the aov() command. There is, in fact, an anova() command, but interpreting its results is a bit more difficult, and so we'll opt to perform our ANOVA with the former:

> t = aov(model)
> summary(t)

and as we can see in the display output, our familiar ANOVA table.

            Df   Sum Sq   Mean Sq   F value      Pr(>F)
team         2   5.2706    2.6353    569.38   < 2.2e-16 ***
Residuals   72   0.3332    0.0046

Similar to how we did our analysis in linear regression for significance of covariates, we focus on the p-value. The p-value is in the farthest right column, this time signified by Pr(>F); recall that the p-value is the probability of getting a more extreme statistic (in our case an F value, and previously for linear regression a T value) than the current one. Recall, if the p-value were greater than 0.05 then we could assume that the hypothesis was plausible, and so we'd fail to reject the null and be done, but where's the fun in that? So, comparing the p-value in the output to α = 0.05 (or any suitable level), there is overwhelming evidence to reject the null hypothesis in favor of the alternative hypothesis (that is, the p-value is way less than 0.05, which leads to the rejection of the null).

Alternatively, as you learned in class, you can compare the F-value directly since you know the degrees of freedom from the ANOVA. Therefore, if we look up the critical F-value for F(2, 72) at a significance level of α = 0.05 in an F-table (approximately 3.12) and compare it against the F-value calculated by ANOVA, which is 569.38, again we see overwhelming evidence to reject the null.

Note: The *** symbols indicate how strong the significance is: the more *'s, the more evidence against the null.
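Instead of an F-table, the critical value and the p-value can both be obtained from R directly:

```r
qf(0.95, df1 = 2, df2 = 72)                        # critical F at alpha = 0.05
pf(569.38, df1 = 2, df2 = 72, lower.tail = FALSE)  # p-value for the observed F
```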

Pairwise comparison of Means


Now you may question: how do we determine if any of the mean heights are the same? To do this, we have to do pairwise comparisons. The number of comparisons to be made is

K = (k choose 2) = (3 choose 2) = 3

Therefore, we get the following hypotheses to compare one by one (no longer simultaneously, as we were doing in ANOVA):

1. H0 : µ_canucks = µ_oilers vs. Ha : µ_canucks ≠ µ_oilers
2. H0 : µ_oilers = µ_flames vs. Ha : µ_oilers ≠ µ_flames
3. H0 : µ_canucks = µ_flames vs. Ha : µ_canucks ≠ µ_flames

To do the comparisons, we first need to grab some vital information, which includes the number of players on each team and their respective means. It is easy enough to look through the data to figure out which lines correspond to which teams, but let's bypass that and use some R code:

> u1 = mean(height[which(team == "Can")])
> u2 = mean(height[which(team == "Oil")])
> u3 = mean(height[which(team == "Fla")])
> n1 = length(which(team == "Can"))
> n2 = length(which(team == "Oil"))

> n3 = length(which(team == "Fla"))

Therefore, if we were to compare the Canucks against the Oilers,

δ̂₁₂ = µ̂_canucks − µ̂_oilers = u1 − u2 = −0.6492

se(δ̂₁₂) = √( MSE · (n1 + n2)/(n1 · n2) ) = √( 0.0046 · (25 + 25)/(25 · 25) ) = 0.019183

where MSE came from our ANOVA table. Now, recalling that for a 95% CI we need to modify our t-value,
t-value = t_{1−(α/K)/2, (n−k)} = t_{1−(0.05/3)/2, (75−3)} = t_{0.991667, 72} = 2.451

To calculate the t-value, you can normally look it up on a Student t-distribution table, but because we sometimes deal with odd ratios, the following is the code to look up a t-value based on a certain number of degrees of freedom:

> qt((1-0.0166667/2),72)

or

> qt((1-(0.05/length(unique(team)))/2),
    (length(team)-length(unique(team))))

Note: The way it is set up is to keep it similar to the form of t_{1−α/2, (n−1)}. Therefore, our 95% confidence interval is:

( δ̂₁₂ − t_{1−(α/K)/2, (n−k)} se(δ̂₁₂), δ̂₁₂ + t_{1−(α/K)/2, (n−k)} se(δ̂₁₂) )
= ( −0.6492 − 2.451 · 0.019183, −0.6492 + 2.451 · 0.019183 )
= (−0.6962, −0.6022)

To do this in R (here for the lower limit), we do as follows:

> (u1-u2) - qt((1-(0.05/length(unique(team)))/2),
    (length(team)-length(unique(team)))) *
    sqrt(((n1+n2)/(n1*n2)) * (sum(t$residuals^2)/t$df.residual))

And similar to how we analyze any of our previous confidence intervals, since δ₁₂ = 0 is not in (−0.6962, −0.6022), we can conclude that the mean heights of the Canucks and the Oilers are different and that, in general, the Oilers are taller than the Canucks. Similarly, we would compare the Canucks against the Flames and the Oilers against the Flames, recalling that K = 3. Skipping the details, we find that a 95% CI for Canucks vs. Flames is (−0.3598, −0.2658), which implies that again there is a difference and that, in general, the Flames are taller than the Canucks. A 95% CI for Oilers vs. Flames is (0.2894, 0.3834), which implies that there is a difference and that, in general, the Oilers are taller than the Flames. Therefore, we conclude that the mean heights of the three hockey teams are different and that, in general, the Oilers are the tallest, followed by the Flames, followed by the Canucks: µ_Oilers > µ_Flames > µ_Canucks.
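R can also automate all the pairwise comparisons in one step. TukeyHSD() on the aov fit (or pairwise.t.test() with a Bonferroni adjustment) produces intervals in the same spirit as the hand calculation above; here is a sketch with simulated stand-in data (use players.txt in the actual lab, where the numbers will differ):

```r
set.seed(9)
# Hypothetical stand-in for players.txt: 25 players per team
hockey <- data.frame(team   = rep(c("Can", "Oil", "Fla"), each = 25),
                     height = rnorm(75, rep(c(1.80, 1.86, 1.83), each = 25), 0.07))
fit <- aov(height ~ as.factor(team), data = hockey)
TukeyHSD(fit, conf.level = 0.95)                 # all three pairwise CIs at once
pairwise.t.test(hockey$height, hockey$team,
                p.adjust.method = "bonferroni")  # adjusted pairwise p-values
```

Note that Tukey's adjustment differs slightly from the Bonferroni-style one used in the hand calculation, so the intervals will not match exactly.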


.1 Appendix: Uniform Distribution

.1.1 Continuous Uniform Distribution

The probability density function of the continuous uniform distribution is:

f(x) = 1/(b − a) for a ≤ x ≤ b,
f(x) = 0 for x < a or x > b.

Properties of this distribution are:

The cumulative distribution function is:
F(x) = 0 for x < a
F(x) = (x − a)/(b − a) for a ≤ x < b
F(x) = 1 for x ≥ b

Mean: (a + b)/2
Median: (a + b)/2
Mode: any value in [a, b]
Variance: (b − a)²/12
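These properties are easy to verify empirically in R for, say, a = 0 and b = 1:

```r
set.seed(10)
x <- runif(100000, 0, 1)
mean(x)              # close to (a + b)/2 = 0.5
var(x)               # close to (b - a)^2/12, about 0.0833
punif(0.25, 0, 1)    # F(0.25) = (0.25 - 0)/(1 - 0) = 0.25
```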
