
Week 1

Types of Observations

Time Series: data consists of measurements of the same concept at different points of time.
E.g., Sydney-area births per day, for each day in a year.
Cross-sectional: data consist of measurements of one or more concepts at a single point of
time. E.g., age, gender, and marital status of a sample of UNSW staff in a particular year.

Descriptive Statistics
What are the key features of a data set?
Many tools and techniques are available; some are graphical and some numerical, depending on the data.
Frequency Distributions
Summaries of categorical data using counts
Bar Charts and Pie Charts: graphical representations of frequency distributions
Bar Charts: Can display multiple patterns in frequencies enabling quick visual comparisons
Pie Charts: Shows relative frequencies more explicitly
Histogram
Graphical summary for quantitative data: observations are grouped into classes (bins), analogous to the categories used for qualitative data
Cumulative Frequency or Relative Frequency Distributions
Stem-and-leaf Displays
Describing Histograms

Symmetry (or lack thereof)


Skewness
o Long tail to the right: positively skewed
o Long tail to the left: negatively skewed
o May be associated with outliers, and is associated with an asymmetric histogram
Number of Modal Classes/Bins
o Modal class is the class with the highest frequency
o Histograms may be unimodal or multimodal

Bivariate Relationships

How can we characterize relationships between variables?


o Contingency Table (cross-tabulation or cross-tab)
Captures relationship between two qualitative variables
o Scatterplots
Captures the relationship between two quantitative variables
If one of these variables is time, then we get a time series plot

Time Series Plot

Plots a bivariate relationship between some variable and time, E.g., business cycles measured
as GDP growth over time

Week 2
Numerical Summaries of Key Features of Data

Key features of a single variable:


o Location, Spread, Relative Location, Skewness
Key features of two variables
o Measures of (linear) association

Measures of Location
Parameter describes a key feature of a population
Statistic describes a key feature of a sample
Arithmetic Mean is a natural measure of location or central tendency
Median is the middle value of ordered observations
o When n is odd, median is a particular value, when n is
even, the median is the average of the two middle
values.
o Median depends on ranks, not absolute values
Mode is the most frequently occurring value(s)
o Modal Class (the most common class)
The mean, median and mode each provide different notions of representative or typical
central values
o For quantitative data, mode is often not useful
o Mean vs. Median
For symmetric distributions, mean=median
For positively (negatively) skewed data: mean > (<)
median
Median may be preferred when the data contain outliers
Measures of Variability
Range is a simple measure of variability
o Range = maximum − minimum
Variance (most common measure of variability) measures average
squared distance from mean
o Division by n-1 for sample variance relates to properties of
estimators
Standard Deviation is the spread measured in the original units of the
data (NOT squared)
Standardizing Data

Calculating z-scores gives a variable free of units of measurement (one z-score per observation); see the sketch below
o Calculate (observation − mean) / standard deviation
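A minimal sketch of this calculation in Python (the sample values below are invented for illustration):

```python
# Hypothetical sample of observations (any numeric data works)
data = [12.0, 15.5, 9.8, 14.2, 11.1, 13.4]

n = len(data)
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance uses n - 1
std_dev = variance ** 0.5

# One z-score per observation: (observation - mean) / standard deviation
z_scores = [(x - mean) / std_dev for x in data]
print(z_scores)
```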

Coefficient of Variation

Coefficient of variation: cv = s / x̄
o Provides a measure of relative variability
o Comparable across variables

Measures of Relative Location

Median relies on a ranking of observations to measure location; we can generalise this notion
to percentiles
o The Pth percentile is the value for which P percent of observations are less than that
value
o Median = 50th percentile; the 25th and 75th percentiles are the lower and upper quartiles
o Interquartile Range = Upper Quartile − Lower Quartile (another measure of spread)
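A brief sketch of these quantities with NumPy (the data are invented; note that software packages differ slightly in how they interpolate percentiles):

```python
import numpy as np

x = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])  # hypothetical sample

median = np.percentile(x, 50)   # 50th percentile = median
q1 = np.percentile(x, 25)       # lower quartile
q3 = np.percentile(x, 75)       # upper quartile
iqr = q3 - q1                   # interquartile range, another measure of spread
print(median, q1, q3, iqr)
```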

Measures of Association

Covariance is a numerical measure of linear association

o Positive (negative) covariance ⇒ positive (negative) linear association
o Zero covariance ⇒ no linear association
Correlation Coefficient is a standardized, unit-free measure of association
o +1 (−1) ⇒ perfect positive (negative) linear relationship
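A quick numerical illustration with NumPy (hypothetical x and y values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance (n - 1 in the denominator)
corr_xy = np.corrcoef(x, y)[0, 1]     # unit-free correlation, always between -1 and +1
print(cov_xy, corr_xy)
```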

Least Squares
Finds the best-fitting line to describe a bivariate relationship

Least Squares: minimizes the residual sum of squares


o Basis of Regression Analysis

The solution (with one independent variable): b1 = s_xy / s_x² (sample covariance divided by the sample variance of x), and b0 = ȳ − b1·x̄

R-Square: Percentage of Variation explained by the model

The fit of the model to the actual data is described by the R-squared statistic (also termed the
coefficient of determination).
Maximum value of R-sq is 1 (perfect fit) and the minimum is 0 (no fit)
In a simple bivariate regression R-sq = [correlation of x and y]^2
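A short sketch of the simple least-squares fit and the R-squared identity above, using invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Least-squares slope and intercept with one independent variable
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# R-squared: proportion of variation in y explained by the fitted line
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares (what OLS minimizes)
ss_tot = np.sum((y - y.mean()) ** 2)
r_sq = 1 - ss_res / ss_tot

# In simple bivariate regression, R-squared equals the squared correlation of x and y
print(r_sq, np.corrcoef(x, y)[0, 1] ** 2)
```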

Week 3
Probability
Probability is the mathematical means of studying uncertainty. It provides the logical foundation of
statistical inference.

Independence

Covariance and correlation are measures of linear association or linear dependence


o Dependence (and interdependence) captures a more general concept of association
between two variables.

A tease on sampling

Without replacement
With replacement

Probability Trees

Events in draws may be represented by probability trees

Cumulative Distribution Function (cdf)

Defined as F(x) = P(X ≤ x) for all x

Week 4
Random Variables

Random variable: a function that assigns a real number to each possible value in the sample
space of an event.
o E.g., PHI = 1 if one has private health insurance, PHI = 0 if they don't
This is an example of a discrete random variable; it is a binary or
indicator variable
o E.g., Time measured continuously is an example of a continuous random variable
Notation
o Upper-case X denotes the random variable
o Lower-case x denotes a value it might take on (a possible realized value)

Mathematical Expectation

The basis for formulating summary measures for probability distributions


Expected Value: multiply each possible value of the random variable by the probability of its
occurrence and add up those products, i.e., E(X) = Σx x·P(x)
For any random variable, say X, its expected value is denoted by E(X)

Expectations for discrete random variables


The previous definitions of population
mean and variance assumed equally
likely outcomes. These definitions are
weighted averages of outcomes, with
the outcome probabilities as the weights.

Rules of Expectations

E(c) = c
E(X+c) = E(X) + c
E(cX) = cE(X)
Var(c) = 0
Var(X+c) = Var(X)
Var(cX) = c²Var(X)

Remember Z-scores? Use the expectation rules
to determine the mean and variance of Z. Recall
that the original random variable has an
expected value of μ and a variance of σ²
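Applying the rules above to Z = (X − μ)/σ gives the standardization result (a short derivation consistent with the rules listed):
E(Z) = E[(X − μ)/σ] = (1/σ)[E(X) − μ] = (μ − μ)/σ = 0
Var(Z) = Var[(X − μ)/σ] = (1/σ²)·Var(X) = σ²/σ² = 1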

Bivariate Distributions

The set of all values P(x,y), i.e., the probabilities associated with all possible pairs (x,y), is the
joint probability distribution of (X,Y)
The marginal distribution of Y can be found by summing the joint probabilities across all
values of X
o P(y) = Σ(all x) P(x, y)
Population formulas for covariance and correlation (correlation has the advantage of being unit-free)
o Cov(X,Y) = σ_xy = Σ(all x) Σ(all y) (x − μx)(y − μy)·P(x,y)
o Corr(X,Y) = ρ = σ_xy / (σx σy)

More Rules of Expectations


Let X and Y be random variables. Then:

E(X ± Y) = E(X) ± E(Y)


Var(X ± Y) = Var(X) + Var(Y) ± 2Cov(X,Y)
If X and Y are independent (hence zero covariance!), then
o Var(X ± Y) = Var(X) + Var(Y)
Cov(c1X, c2Y) = c1c2 Cov(X,Y)

Portfolio Allocation

Variability is called volatility in finance, and is a measure of risk


A portfolio is a collection of stocks held by an investor
o An investor reduces risk through diversification achieved by holding a portfolio

Binomial Distribution
Requirements for something to qualify as a binomial experiment:

Sequence of fixed number of n trials


Each trial has 2 outcomes, arbitrarily denoted success and failure
There is a fixed probability of success (p) and failure (1-p) over all trials
Trials are independent
Under these assumptions, this is a sequence of Bernoulli trials
The outcome of each trial is recorded in a random variable

Binomial Random Variables


We can represent our sequence of Bernoulli Random Variables as:

X1, X2, …, Xn, where Xi = 1 is called a success and Xi = 0 a failure


Under the assumption made, this is a sequence of independent and identically distributed
(i.i.d) random variables.

The number of successes across the n trials, X = X1 + X2 + … + Xn, is called a binomial random variable


Characterized by 2 parameters: n and p
I.e., once we know the values of these parameters in a given case, we know everything about
the random variable and its probability distribution

In this example


P(X = 5) is a binomial probability


P(X ≤ 1) is a cumulative binomial
probability
P(X > 1) is sometimes called a
survivor probability
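These probabilities can be checked with SciPy; the parameter values n = 10 and p = 0.5 below are only assumed for illustration:

```python
from scipy.stats import binom

n, p = 10, 0.5   # assumed example values

print(binom.pmf(5, n, p))   # P(X = 5): a binomial probability
print(binom.cdf(1, n, p))   # P(X <= 1): a cumulative binomial probability
print(binom.sf(1, n, p))    # P(X > 1): a "survivor" probability (1 - cdf)
```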

Other common discrete distributions


There are many other common discrete distributions that can be used to represent or model
real phenomena
E.g., the discrete uniform distribution (pretty boring!)
The Poisson distribution is a little more interesting
o Binomial: number of successes in a sequence of trials
o Poisson: number of successes in a period of time or region of space
o The Poisson distribution relates to a count variable (0, 1, 2, …), where successes are
relatively rare
E.g., the number of times an individual visited a GP in the last year

Week 5
Continuous Random Variables
For discrete random variables, we assigned positive probabilities to different outcomes.

Consider store deliveries that arrive somewhere between 7 and 8 AM


Suppose the random variable of interest is the number of minutes after 7 AM that deliveries
are made
o Possible outcomes would then be 0, 1, 2, …, 60 (7:00 AM, 7:01 AM, 7:02 AM, …)
o Assume all outcomes are equally likely, so each has probability 1/61

But if we can measure time to any degree of accuracy, then we could construct a random variable that
can take on any value in the interval from 0 to 60.

This would be a continuous random variable

Probability Density Function (pdf)

A continuous version of the familiar probability histogram used for discrete random variables.
Consider a continuous random variable X with range a ≤ x ≤ b. Its probability density
function (pdf), which we refer to as f(x), must satisfy:
o f(x) ≥ 0 for all x between a and b
o Total area under the curve between a and b is unity (i.e., equal to 1)
Probabilities are now represented by areas under pdf

Uniform (discrete) Random Variable


The discrete uniform pdf for our store delivery example (limiting measurement to whole minutes)
would have the form:

The equally likely nature of this random variable is now represented by any
interval of width m having equal probability
The Normal Distribution
For any normally distributed random variable:

P(X = x) = 0**
P(a < X < b) = area under pdf curve between possible values a and b**
A normal distribution is completely characterized by its mean, μ, and variance, σ²
** true for CONTINUOUS random variables

Graphically the normal probability density function is symmetric, unimodal and bell-shaped.

Mean = median = mode

Its basic features include:

Range of support is unlimited: −∞ < x < ∞


Despite unlimited range, there is little probability area in the tails of a normal distribution
o 4.6% outside μ ± 2σ; 0.3% outside μ ± 3σ (confirm from tables); this is where the 68-95-99.7 rule comes from

If X is normally distributed with mean μ and variance σ², then we write X ~ N(μ, σ²)
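A small SciPy check of these tail areas, plus a probability for a generic normal random variable (the μ, σ, a, b values below are assumptions for illustration):

```python
from scipy.stats import norm

# Area outside +/- 2 and +/- 3 standard deviations, for any normal distribution
print(2 * norm.sf(2))   # about 0.046 (4.6%)
print(2 * norm.sf(3))   # about 0.003 (0.3%)

# P(a < X < b) for X ~ N(mu, sigma^2), here with assumed values
mu, sigma, a, b = 50, 10, 45, 60
print(norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma))
```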



The Standard Normal

Z = (X − μ)/σ
As we've shown, this standardization yields a random variable with zero mean and a standard
deviation of one
How are Z and X related?
o Z = a + bX
(where a = −μ/σ and b = 1/σ)
So Z is a linear transformation of X

Uses of the Standard Normal


An important theorem says: Linear combinations of normally distributed random variables are
also normally distributed
When X ~ N(μ, σ²), Z (a linear function of X) is called a standard normal random variable
Calculating Normal Probabilities
Normal random variables are continuous, so probabilities need to be calculated as integrals (areas under the pdf); in practice we use tables or software for the standard normal.
A continuity correction applies when a discrete distribution (such as the binomial) is approximated by a normal distribution.
Data Collection


Secondary Data
Primary Data
Observational Data: measures actual behaviour or outcomes
Experimental Data: imposes a treatment and measures resultant behaviour or outcomes
Simple Random Sampling: A sampling process by which all samples of the same size (n)
are equally likely to be chosen from the population of interest
o Avoids problems of selection bias, where the sampling design systematically favours some members of the population over others

Week 6
Estimation

Parameters describe key features of populations


In practical situations, parameters are unknown
Instead, a sample is drawn from the population to provide basic data
These data are used to calculate various sample statistics
These sample statistics are used as estimators for population parameters

Estimators:

A statistic is any function of data in the sample


An estimator is a statistic whose purpose is to estimate a parameter or some function thereof
A point estimator is simply a formula (rule) for combining sample information to produce a
single number to estimate the parameter
Estimators are random variables because they are functions of random variables

Properties of Estimators
Desirable properties for estimators include:


Unbiasedness: If we constructed it for each of many hypothetical samples of the same size,
will the estimator deliver the correct value (i.e., the value of the parameter) on average?
Consistency: As the sample size gets larger, does the probability that the estimator deviates
from the parameter by more than a small amount become smaller?
Relative efficiency: If there are two competing estimators of a parameter, does the sampling
distribution of one have less expected dispersion than that of the other?

Note that n appears in the denominator of the variance of the sampling distribution (and
hence in the standard error we just
used).
This means that the larger we set the
sample, the tighter the distribution of
the sample proportion will be.
So, as the sample size grows, the
interval we can construct, for any given
level of confidence (e.g., 95%), shrinks.

That is, we are 95% sure that OUR PARTICULAR p̂ is within 2(0.008) of p.
The Margin of Error

Choosing the Sample Size

To get a narrower confidence interval without giving up confidence, we must choose a larger
sample.
Suppose a company wants to offer a new service and wants to estimate, to within 3%, the
proportion of customers who are likely to purchase this new service with 95% confidence.
Q: How large a sample do they need?
What we know: The desired ME and confidence level.
What we don't know: n, p, or p̂.
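A sketch of the usual sample-size calculation, n = z²·p(1 − p)/ME², using the conservative guess p = 0.5 when p is unknown (the values below match the example's 3% margin of error and 95% confidence):

```python
import math

z = 1.96    # critical value for 95% confidence
me = 0.03   # desired margin of error (3%)
p = 0.5     # conservative guess for the unknown p (maximizes p * (1 - p))

n = (z / me) ** 2 * p * (1 - p)
print(math.ceil(n))   # roughly 1068 customers
```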

Hypothesis Testing

The underlying question in hypothesis testing:

Do the data support a contention/belief/hypothesis about this parameter of interest?


We know from our work in estimating population proportions that there can be no definitive
answer to this type of question
o Estimators are random variables!
The process of hypothesis testing is potentially subject to incorrect conclusions.

Week 7
Sampling from a Normal Population

Sampling from a Non-normal Population

Central Limit Theorem (CLT)


The sampling distribution of the mean of a random sample drawn from any population with mean μ
and variance σ² will be approximately normally distributed for a sufficiently large sample size

Based on a limiting argument, it can be shown that the standardized sample mean, (X̄ − μ)/(σ/√n), is approximately standard normal for large n

How are data used to test a null hypothesis?


Proceed by comparing a test statistic with the value specified in H0 and decide whether the
difference is:
o Small enough to be attributable to random sampling errors ⇒ do not reject H0, or
o So large that H0 is more likely not to be correct ⇒ reject H0
Formally define a rejection (or critical) region by your choice of alternative possibilities
o If values of the test statistic are extreme enough i.e., if they fall in the rejection
region then they lead us to reject H0 in favour of H1
Other possible values of the test statistic, that are not so extreme, lie in the non-critical region

One and two-tailed tests

A one-tailed test defines the rejection region as one extreme end of the sampling distribution
o The null and alternative hypotheses here will look something like:
o H0: p = 0.7   H1: p > 0.7
A two-tailed test defines the rejection region as both extreme ends of the sampling
distribution
o Here the null and alternatives might look like this:
o H0: p = 0.7   H1: p ≠ 0.7
Also recall: the alpha level associated with a confidence level (e.g., 95%) is defined as the
maximum p-value that would enable that level of confidence in our judgment (e.g., 0.05).
This is the probability of our statistic falling into the rejection region (for a two-tailed test:
split the probability across the two ends of the sampling distribution!).

Sigma
When our sample size is large

CLT ⇒ the sampling distribution of the sample mean is approximately normal irrespective of the
population distribution
When σ is unknown and is replaced by s, our standardized test statistic remains approximately
(asymptotically) normally distributed
o Why? Because in large samples s will be close to σ with high probability (it is a
consistent estimator of σ)
So, for large n nothing has changed
o This is true for the sampling distribution of both the sample mean and the sample
proportion (where, recall, we had to use the sample proportion instead of the
population proportion in our variance formula for the sampling distribution)

But what about small n?

Let us confine ourselves first to sampling from a population in which the target random
variable is distributed normally
Let our sample be i.i.d. from N(μ, σ²)
o Recall that linear combinations of normal random variables are also normal, which is
why we know the sample mean and its standardized version (using σ) will also be
normally distributed
o But what happens when we standardize using s?

Gosset's Student t Distribution


Properties of the t Distribution

Symmetric and unimodal


o Looks very similar to the normal, but has fatter tails
o Is characterized by degrees of freedom (df)
o For larger and larger ν (df), the distribution becomes more and more like a normal
distribution
Sharpe Table A-63 only provides critical values
For a one-tailed test, P(t > t(α, ν)) = α
For a two-tailed test, P(t > t(α/2, ν)) = α/2
Check out the infinity line in the t-table. Its critical values are identical to the critical values
of the normal distribution!
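A quick SciPy illustration of t critical values and their convergence to the normal ones (α = 0.05 and df = 10 are arbitrary choices):

```python
from scipy.stats import t, norm

alpha, df = 0.05, 10

print(t.ppf(1 - alpha, df))        # one-tailed critical value: P(T > t_crit) = alpha
print(t.ppf(1 - alpha / 2, df))    # two-tailed critical value: P(T > t_crit) = alpha / 2

# For very large df the t critical value is essentially the normal critical value
print(t.ppf(1 - alpha / 2, 10_000), norm.ppf(1 - alpha / 2))
```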

Inference using the t distribution


Week 8
Interval Estimation

Point estimators produce a single estimate of the parameter of interest


In many real-world situations, some notion of the margin of error would be useful
Interval estimators produce an interval (i.e., a range of values) and a degree of confidence
associated with that interval
o Hence the name confidence interval
How often would you expect the true population parameter to be in this
(sample-specific) interval?

Interval estimation for means

The endpoints of the interval are themselves random variables


o We have constructed a random interval
μ is a constant
For a particular sample (and sample mean value), μ is either in the confidence interval, or it is
not
If 100 size-n samples were drawn, we would expect 95 of them (for a 95% confidence level) to include μ

Confidence Intervals
CIs for means and proportions typically have a similar structure


Centred at sample statistics


Endpoints are some multiple of the standard error (if we don't know sigma) or standard
deviation (if we do know sigma) of the sampling distribution
The multiple is determined by the confidence level chosen by the investigator
Remember: If you don't know sigma and have a small sample, use the t-distribution tables to
get your bounds, not the Z!
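A minimal sketch of a t-based confidence interval for a mean (small sample, sigma unknown; the data are invented):

```python
import numpy as np
from scipy.stats import t

x = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.8])   # hypothetical small sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

conf = 0.95
t_crit = t.ppf(1 - (1 - conf) / 2, df=n - 1)   # use t, not Z, when sigma is unknown and n is small
half_width = t_crit * s / np.sqrt(n)           # multiple of the standard error

print(xbar - half_width, xbar + half_width)
```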

Selecting Sample Size

p-values
How do I choose the significance level α?

No rules
Conventional choices are α = 0.1, 0.05, or 0.01

Why do I have to choose a particular α?

You don't (though doing so can helpfully bind your hands)


You can calculate the empirical significance level, or p-value

The p-value associated with a given test statistic is the probability of obtaining a value of the test
statistic as or more extreme than that observed, given that the null hypothesis is true

"More extreme" depends on the form of the alternative hypothesis
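A short sketch of how "more extreme" changes with the alternative, for a standardized (z) test statistic; the observed value 2.1 is made up:

```python
from scipy.stats import norm

z_obs = 2.1   # hypothetical observed test statistic

p_right = norm.sf(z_obs)           # H1: parameter greater than the null value (right tail)
p_left = norm.cdf(z_obs)           # H1: parameter less than the null value (left tail)
p_two = 2 * norm.sf(abs(z_obs))    # H1: parameter not equal to the null value (both tails)

print(p_right, p_left, p_two)
```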

Hypothesis Testing and Errors


Type I errors occur when we reject a true null hypothesis


Only possible to make this error when the null is true
Denote P(Type I error) = α (the significance level)
P(Reject H0 | H0 true) = α
Type II errors occur when we don't reject a false null hypothesis
Only possible to make this error when the null is false
Denote P(Type II error) = β
P(Do not reject H0 | H0 not correct) = β
P(Type II error) depends on what the actual (alternative) parameter value is!

Power of a test
Power (in statistics): The probability of correctly rejecting a false null hypothesis

P(Do not reject H0 | H0 not correct) = β


P(Reject H0 | H0 not correct) = Power = 1 − β

The power of a test increases when we increase the significance level (α) or increase n.


Week 9
Chi-squared Tests

Data often occurs in nominal (categorical) form


Such data feature several possible outcomes or categories for the measured phenomenon
o Categories are mutually exclusive and exhaustive
o If we think of each observation as being a trial, it's like a multinomial extension to
binomial experiments
o One could imagine an expected or hypothesised distribution of outcomes across the
categories, to which we can compare the distribution seen in our sample
To compare observed and expected distributions
o We could simply calculate differences in expected and observed category frequencies
o Our inference problem is to determine whether these differences are statistically large
enough to reject the claim that the expected (probability) distribution is, in fact, what
the sample data were drawn from

The Chi-squared goodness-of-fit test is used to test the null hypothesis that the observed and
expected distributions are the same.
Chi-squared tests

H0 specifies probabilities pi that an observation falls into i = 1, …, c categories or cells


o H0 implies expected frequencies for a sample of size n (ei = pi n), assuming:
Random sampling (independent trials)
Probabilities pi are constant over trials
The test can be unreliable if any ei = pi n are too small (e.g., 3 or 4)
o Solution: Merge categories together, where sensible
The distribution theory underlying the test is not exact
o It is large sample theory (a reason for the above limitation about small expected cell
frequencies)
The test statistic is given by χ² = Σi (oi − ei)² / ei, which under H0 has an approximate χ² distribution with c − 1 degrees of freedom

The 2 Distribution
An asymmetric distribution characterized (like the t distribution!) by degrees of freedom ν. Its support
lies on the interval (0, ∞): that is, it is exclusively non-negative. It is the sum of the squares of ν
independent standard normal variables.
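A minimal goodness-of-fit sketch with SciPy, using invented counts and an equal-probability null hypothesis:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([55, 45, 60, 40])   # hypothetical counts over c = 4 categories, n = 200
expected = np.array([50, 50, 50, 50])   # H0: p_i = 0.25 for each category, so e_i = p_i * n

stat, p_value = chisquare(f_obs=observed, f_exp=expected)   # df = c - 1 = 3
print(stat, p_value)
```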


Contingency Tables

Previously, we used such tables as


descriptive tools
We also were interested in whether the two
events were independent
Now we want to formally test whether
these random variables are independent or not
o Q: Is there a statistically significant relationship between these two categorical
random variables?
The testing strategy here is similar to that used for the goodness-of-fit test
o We compare observed cell frequencies in our sample with those expected under the
null hypothesis of independence
How do you calculate the expected frequencies?
o Previously, these followed readily from the hypothesized probability distribution
o Now H0 simply asserts independence (or homogeneity, if the categories used for
one or both dimensions are not comprehensive) of the event described by one
probability distribution with respect to the other
o Recall what is required for independent events:
P(A ∩ B) = P(A)P(B)
To craft our null hypothesis, we thus set up an imaginary contingency table which assumes
independence between the two aspects under analysis
o We use marginal (row and column) totals from the data to generate expected
frequencies for each cell
o The expected frequency of observations in the cell in row i and column j under independence is:
e_ij = (row i total × column j total) / n
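A sketch of the independence test with SciPy, on an invented 2 x 3 table; the function also returns the expected frequencies computed from the marginal totals as above:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 20, 10],    # hypothetical observed cell frequencies
                     [20, 30, 40]])

stat, p_value, df, expected = chi2_contingency(observed)
print(stat, p_value, df)   # df = (rows - 1) * (columns - 1)
print(expected)            # e_ij = (row i total) * (column j total) / n
```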


Week 10
Simple Regression
Suppose we have (Yi, Xi) pairs for i = 1, …, n

We fit a line to the data by minimizing the residual sum of squares. This is the action
accomplished by ordinary least squares (OLS) regression.
o This line is defined by estimates of the intercept and slope in the linear relationship
Yi = β0 + β1Xi + εi
The slope coefficient has the same sign as the sample covariance (and correlation)
between Y and X

Numerical versus statistical properties

OLS can be viewed as curve fitting


o The curve we fit describes the relationship amongst variables
But we also want to make inferences about the parameters of the population regression
function
o How can we use b1 to make inferences about β1?
o What are the properties of b1 as an estimator of β1?
We can also use regression models to make predictions or forecasts e.g.,:
o If a company increases its advertising expenditure, what is the predicted impact on
sales?
o What is the confidence interval for that prediction?

Some Basics
Terminology:

Yi is the dependent variable


Xi is the independent or explanatory variable; εi is the disturbance or error term
β0 and β1 are the parameters to be estimated

OLS produces:
Estimated parameter values b0, b1
Predictions: Ŷi = b0 + b1Xi


Residuals: ei = Yi − Ŷi
The population regression relationship is Yi = β0 + β1Xi + εi

This equation links the (X,Y) pairs via the unknown parameters and the unobserved errors

OLS produces Yi = b0 + b1Xi + ei

This equation links the (X,Y) pairs via estimated parameters and calculated residuals

The disturbance term εi plays a crucial role in regression

Distinguishes regression models from deterministic functions


Represents factors other than Xi that affect Yi
Regression treats these other factors as unobserved

β1 is the marginal effect of Xi on Yi, holding these other factors constant

Reliable estimates of β1 will require assumptions restricting the relationship between Xi and
εi
Our desire to make ceteris paribus interpretations of β1 ⇒ include more explanatory
variables in the regression
Leads to an extension to multiple regression

Assumptions of Classical Linear Regression Model

Classical Linear Regression Model (CLRM)


Decomposition of Variance

Standard Error of the Estimate


The population variance, σ², measures the spread of the data around the population regression line

The standard error of the estimate (SEE) is an estimator of σ


It measures the fit of the regression model
Low SEE ⇒ good fit

Coefficient of Determination

OLS Inference


What can we say about the properties of the OLS point estimators b0 and b1, if our
assumptions hold?
o They are unbiased: E(bj) = βj
o They are normally distributed, as they are linear functions of Yi , which are assumed
to be drawn from a normal distribution

Even without normality of Yi , we can invoke the CLT and assume that bj will be
asymptotically normal
We need to know Var(b0) and Var(b1) in order to conduct inference
Just as we did when estimating means, we can define the true and estimated variances
The panel below gives:
o True variances on the left, and
o Estimated standard errors on the right, where the unknown σ is replaced by the estimated s
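A sketch of how this inference output looks in practice, using the statsmodels package on invented data (not the course's Excel workflow):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # hypothetical data
y = np.array([2.1, 3.8, 6.2, 7.9, 10.3, 11.8, 14.2, 15.9])

X = sm.add_constant(x)            # adds the intercept column
results = sm.OLS(y, X).fit()      # ordinary least squares

print(results.params)     # b0 and b1
print(results.bse)        # estimated standard errors of b0 and b1
print(results.tvalues)    # t statistics for H0: beta_j = 0
print(results.pvalues)    # corresponding p-values
```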


Week 11
Prediction in Regression
One of the main uses for regression models is in prediction or forecasting

Basic idea: Use the fitted regression line (including b0 and b1) to predict a value of Y from a
given value of X
It is natural to think of forecasting in a time series context: e.g., what will happen to sales (Y)
if advertising (X) increases next year?
Predictions can also be made using cross-sectional regression results: e.g., if the income (X)
of a poor household increased, what is the predicted impact of this on that
household's food expenditures (Y)?
It is often inadvisable to predict too far away from the (Xi,Yi) pairs to which OLS fits a line
(i.e., too far out of sample), or for a different population

Linear Trend Model


A very simple model applied to time series data, where the X variable is time:

Yt = β0 + β1t + εt ; t = 1, …, T
Dates are converted to time: e.g., 1997Q3 becomes t=1

Prediction
Assume: Yt = β0 + β1Xt + εt

Given estimates of β0 and β1, and a particular Xf, we can generate a point prediction


(forecast) for Yf
o Since Yf = E(Yf | Xf) + εf, our prediction is also an estimate of the (conditional) mean
of Yf, E(Yf | Xf)
o That conditional mean has a sampling distribution!
o To construct a confidence interval for E(Yf | Xf), we need to know the standard error
(s.e.) of the estimate of E(Yf | Xf)
o Our prediction is a linear combination of OLS estimates, so we can use a simple trick
to get Excel to calculate the standard error of our prediction of E(Yf | Xf) for us!

Running OLS on this transformed model will
produce the predicted value (as the intercept in the model above) and
its associated standard error!


That will enable us to construct a confidence interval for our prediction of the conditional
mean, E(Yf | Xf )
For the petrol prices example, the transformed model would require regressing prices (Yt) on
(t − 37) [recall Xt = t]

Confidence Intervals for Predictions

There appears to be a difference between the expressions for the variance and standard error of
the prediction given in the lecture and those in Sharpe (p. 519). However, with a correction for a
typo in Sharpe (!), they are indeed the same. The trick is to know the sample variance for b1,
var(b1)

Prediction: Y or Expected Y?
We can also construct a prediction interval for Yf

Here we want to predict actual Y rather than mean Y


The point predictions are the same, but the confidence interval (CI) for Yf will be wider than
the CI for E(Yf | Xf)
Why?
See Sharpe et al.'s box on p. 521-522 for a comparison of the difference between predicting
the conditional mean E(Yf | Xf), and predicting actual Yf for a particular population
member.
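As an alternative to the Excel trick above, a statsmodels sketch on an invented trend series returns both intervals directly: the confidence interval for E(Yf | Xf) and the wider prediction interval for actual Yf:

```python
import numpy as np
import statsmodels.api as sm

t = np.arange(1, 41)                                         # hypothetical time index, T = 40
y = 5 + 0.3 * t + np.random.default_rng(0).normal(size=40)   # invented trend series

res = sm.OLS(y, sm.add_constant(t)).fit()

X_f = np.array([[1.0, 45.0]])      # intercept and the chosen X_f (here t = 45, out of sample)
pred = res.get_prediction(X_f)

frame = pred.summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])   # CI for the conditional mean E(Yf | Xf)
print(frame[["obs_ci_lower", "obs_ci_upper"]])             # wider prediction interval for actual Yf
```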

A note on Errors, Residuals and Related Matters


Motivation for Multiple Regression

A key threat to the appropriate interpretation of simple linear regression results is the problem
of omitted variables
o Our estimates may not accurately reflect the effect of height on income holding all
other omitted factors constant
o This is often called a problem of confounding and is due to omitted variable bias
o This bias occurs when the following facts BOTH hold:
A variable omitted from the regression is correlated with the included
explanatory variable
That omitted variable independently influences the dependent variable
A primary motivation for multiple regression is to avoid this type of confounding

Interpretation in Multiple Regression

Estimation and Inference in Multiple Regression


Conceptually, little has changed relative to simple linear regression


o Our prior assumptions, A1-A7, are essentially the same for multiple regression
o We need only add one additional assumption:
A8: No perfect (multi)collinearity, which precludes exact linear
relationships between explanatory variables
E.g.: In a regression of weight on height and sex, we cannot include both a
dummy for male and a dummy for female, because the two dummy variables
would sum to unity (i.e., 1), which is already captured by the intercept.
Intuition: You can only use a unique piece of information once in a regression
model. Knowing someone is a man means you also know he is not a woman!

Linear Regression


Week 12
The F-Test

The t-tests on individual coefficients tell you about one variable at a time
o E.g., is gender related to income?
But we can also test the overall significance of all variables also referred to as the
significance of the model
o This is essentially a test of the significance of the R-sq
The null hypothesis for this test is that all slope coefficients are equal to zero
The alternative hypothesis is that any, some, or all of them are non-zero

As with the Chi-sq test, reporting the F-test result is not informative unless you interpret
where that result is coming from
You can do this by looking at which variables are significant, and which are insignificant,
using individual t-tests

Dummy Variable Arrays


Recall the example last time with an array of dummies to indicate educational attainment
Suppose we are modelling retail sales and want to capture the month of the observation
We could:
o Create ONE indicator for month, taking the value 1, 2, 3, 4, …, 12
OR Create twelve SEPARATE indicators, one for each month, each taking the value 0 or 1 for
all observations, and leave one indicator out of the model
Which is better?

Modelling Non-linear Relationships

We mentioned last time how nonlinear relationships between X and Y can be modelled via
variable transformation prior to estimation
How do we interpret the results of the following (fairly common) models?

Multicollinearity
Predictor variables exhibit collinearity (or multicollinearity) when one of the predictors can be
predicted well (not perfectly!) from the others.
Consequences of collinearity:


Estimated coefficients can be surprising, taking on an unanticipated sign or being


unexpectedly large or small.
The stronger the association of one variable with the others in the model, the more the
variance of its estimated coefficient increases (variance inflation). This can lead to a smaller
t-statistic and correspondingly larger p-value.

High collinearity leads to the coefficient being poorly estimated and having a large standard
error (and correspondingly low t-statistic). The coefficient may seem to be the wrong size, or
even the wrong sign.
If a multiple regression model has a high R2 and large F, but the individual t statistics are not
significant, you should suspect collinearity.
Collinearity is measured in terms of the association between a predictor and all of the other
predictors in the model not in terms of just the correlation between any two predictors.
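A sketch of measuring this with variance inflation factors (VIFs) in statsmodels, on invented predictors where x3 is strongly (but not perfectly) related to x1 and x2:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 0.9 * x1 + 0.9 * x2 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1 and x2

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF_j reflects how strongly predictor j is associated with ALL the other predictors,
# and hence how much Var(b_j) is inflated relative to the no-collinearity case
for j in range(1, X.shape[1]):
    print(X.columns[j], variance_inflation_factor(X.values, j))
```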

What can you do about multicollinearity?


You can try to simplify the model by removing some of the predictors. Which should you keep?

Variables that are inherently the most important to the problem


Variables that are the most reliably measured

Note that a moderate association between two or more independent variables is not a big deal! In
fact, if all independent variables were orthogonal, there would be no point in estimating a model
including more than one of them

