Types of Observations
Time Series: data consists of measurements of the same concept at different points of time.
E.g., Sydney-area births per day, for each day in a year.
Cross-sectional: data consist of measurements of one or more concepts at a single point of
time. E.g., age, gender, and marital status of a sample of UNSW staff in a particular year.
Descriptive Statistics
What are the key features of a data set?
Many tools and techniques are available; some are graphical and some numerical, depending on the data.
Frequency Distributions
Summaries of categorical data using counts
Bar Charts and Pie Charts: graphical representations of frequency distributions
Bar Charts: Can display multiple patterns in frequencies enabling quick visual comparisons
Pie Charts: Shows relative frequencies more explicitly
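As a minimal sketch (the survey responses below are hypothetical), a frequency distribution of categorical data can be tabulated with Python's `collections.Counter`:

```python
from collections import Counter

# Hypothetical categorical sample: marital status of 10 respondents
data = ["single", "married", "single", "divorced", "married",
        "single", "married", "married", "single", "widowed"]

freq = Counter(data)                            # absolute frequencies (counts)
n = len(data)
rel_freq = {k: v / n for k, v in freq.items()}  # relative frequencies

print(freq["married"])      # 4
print(rel_freq["single"])   # 0.4
```

The counts feed a bar chart directly; the relative frequencies are what a pie chart displays.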
Histogram
For quantitative data: the categories of a frequency distribution become bins
Cumulative Frequency or Relative Frequency Distributions
Stem-and-leaf Displays
Describing Histograms
Bivariate Relationships
Plots a bivariate relationship between some variable and time, E.g., business cycles measured
as GDP growth over time
Week 2
Numerical Summaries of Key Features of Data
Measures of Location
Parameter describes a key feature of a population
Statistic describes a key feature of a sample
Arithmetic Mean: a natural measure of location or central tendency
Median is the middle value of ordered observations
o When n is odd, median is a particular value, when n is
even, the median is the average of the two middle
values.
o Median depends on ranks, not absolute values
Mode is the most frequently occurring value(s)
o Modal Class (the most common class)
The mean, median and mode each provide different notions of representative or typical
central values
o For quantitative data, mode is often not useful
o Mean vs. Median
For symmetric distributions, mean = median
For positively (negatively) skewed data: mean > (<) median
Median may be preferred when the data contain outliers
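The pull of an outlier on the mean, and the median's insensitivity to it, can be illustrated with Python's `statistics` module (the income figures below are hypothetical):

```python
import statistics

# Hypothetical incomes (in $000s); the last value is an outlier
incomes = [45, 50, 52, 55, 58, 60, 400]

mean = statistics.mean(incomes)      # pulled up by the outlier
median = statistics.median(incomes)  # unaffected: depends on ranks only

print(round(mean, 2))   # 102.86
print(median)           # 55
```

With a positively skewed sample like this, mean > median, matching the rule above.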
Outliers
Range is a simple measure of variability
o Range = maximum − minimum
Variance (most common measure of variability) measures average
squared distance from mean
o Division by n-1 for sample variance relates to properties of
estimators
Standard Deviation is the spread measured in the original units of the
data (NOT squared)
Standardizing Data
Calculating Z-scores, a variable free of units of measurement (one z-score per observation)
o Calculate (observation − mean)/standard deviation
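A short sketch tying these pieces together on a hypothetical sample: the sample variance (note the n − 1 divisor), the standard deviation, and one z-score per observation:

```python
import statistics

data = [4, 8, 6, 5, 12]  # hypothetical sample

mean = statistics.mean(data)    # 7
s2 = statistics.variance(data)  # sample variance: divides by n - 1, not n
s = statistics.stdev(data)      # standard deviation, in original units

# One z-score per observation: (x - mean) / s
z = [(x - mean) / s for x in data]

print(s2)                        # 10
print(round(abs(sum(z)), 10))    # 0.0: z-scores of a sample sum to zero
```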
Coefficient of Variation
Coefficient of variation: cv = s / x̄
The median relies on a ranking of observations to measure location; we can generalise this notion to percentiles
o The Pth percentile is the value for which P percent of observations are less than that
value
o Median = 50th percentile; the 25th and 75th percentiles are the lower and upper quartiles
o Interquartile Range = Upper Quartile − Lower Quartile (another measure of spread)
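Quartiles and the interquartile range can be computed with `statistics.quantiles` (the data are hypothetical; note that different quantile conventions can give slightly different values):

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]  # hypothetical sample

# statistics.quantiles with n=4 returns the three quartiles Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1   # interquartile range: spread of the middle 50% of the data

print(q2)    # 6.5: the median (50th percentile)
print(iqr)   # 5.25
```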
Measures of Association
Least Squares
Find best fit to describe bivariate relationship
The fit of the model to the actual data is described by the R-squared statistic (also termed the coefficient of determination).
Maximum value of R-sq is 1 (perfect fit) and the minimum is 0 (no fit)
In a simple bivariate regression R-sq = [correlation of x and y]^2
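A small check of this identity on hypothetical (x, y) pairs, fitting the least-squares line via the textbook formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄:

```python
import statistics

# Hypothetical (x, y) pairs
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

xbar, ybar = statistics.mean(x), statistics.mean(y)

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

b1 = sxy / sxx         # least-squares slope
b0 = ybar - b1 * xbar  # least-squares intercept

r = sxy / (sxx * syy) ** 0.5  # sample correlation of x and y

fitted = [b0 + b1 * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
r_squared = 1 - ss_res / syy  # coefficient of determination

print(round(r_squared, 4))                 # 0.6
print(round(abs(r_squared - r ** 2), 10))  # 0.0: R-sq equals r squared
```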
Week 3
Probability
Probability is the mathematical means of studying uncertainty. It provides the logical foundation of
statistical inference.
Independence
A tease on sampling
Without replacement
With replacement
Probability Trees
Week 4
Random Variables
Random variable: a function that assigns a real number to each possible value in the sample
space of an event.
o E.g., PHI = 1 if one has private health insurance, PHI = 0 if they don't
This is an example of a discrete random variable; it is a binary or indicator variable
o E.g., Time measured continuously is an example of a continuous random variable
Notation
o Upper-case X denotes the random variable
o Lower-case x denotes a value it might take on (a possible realized value)
Mathematical Expectation
Rules of Expectations
E(c) = c
E(X+c) = E(X) + c
E(cX) = cE(X)
Var(c) = 0
Var(X+c) = Var(X)
Var(cX) = c²Var(X)
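These rules can be verified directly for a small (hypothetical) discrete distribution by enumerating its pmf:

```python
# Verify E(cX) = cE(X) and Var(cX) = c^2 Var(X) for a hypothetical pmf
pmf = {0: 0.2, 1: 0.5, 2: 0.3}
c = 3

def expectation(pmf):
    # E(X) = sum of x * P(x) over all values x
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    # Var(X) = expected squared distance from the mean
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Distribution of cX: values are scaled, probabilities are unchanged
pmf_cx = {c * x: p for x, p in pmf.items()}

print(round(abs(expectation(pmf_cx) - c * expectation(pmf)), 10))    # 0.0
print(round(abs(variance(pmf_cx) - c ** 2 * variance(pmf)), 10))     # 0.0
```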
Bivariate Distributions
The set of all values P(x,y), i.e., the probabilities associated with all possible pairs (x,y), is the
joint probability distribution of (X,Y)
The marginal distribution of Y can be found by summing the joint probabilities across all
values of X
o P(y) = Σ_all x P(x, y)
Population formulas for covariance and correlation (correlation has the advantage of being unit-free)
o Cov(X,Y) = σ_xy = Σ_all x Σ_all y (x − μ_x)(y − μ_y) P(x,y)
o Cor(X,Y) = ρ = σ_xy / (σ_x σ_y)
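A sketch of these formulas on a hypothetical joint distribution of two binary variables, computing the marginal of Y, the covariance, and the correlation:

```python
# Hypothetical joint distribution P(x, y) over pairs (x, y)
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal of Y: sum the joint probabilities across all values of X
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p

mu_x = sum(x * p for (x, y), p in joint.items())
mu_y = sum(y * p for (x, y), p in joint.items())

# Cov(X,Y): sum of (x - mu_x)(y - mu_y) P(x,y) over all pairs
cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())

var_x = sum((x - mu_x) ** 2 * p for (x, y), p in joint.items())
var_y = sum((y - mu_y) ** 2 * p for (x, y), p in joint.items())
corr = cov / (var_x * var_y) ** 0.5  # unit-free measure of association

print({k: round(v, 3) for k, v in p_y.items()})  # {0: 0.4, 1: 0.6}
print(round(cov, 4))                             # 0.1
```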
Portfolio Allocation
Binomial Distribution
Requirements for something to qualify as a binomial experiment:
1. A fixed number, n, of trials
2. Each trial has exactly two possible outcomes (success or failure)
3. The probability of success, p, is the same in every trial
4. The trials are independent of one another
There are many other common discrete distributions that can be used to represent or model
real phenomena
E.g., the discrete uniform distribution (pretty boring!)
The Poisson distribution is a little more interesting
o Binomial: number of successes in a sequence of trials
o Poisson: number of successes in a period of time or region of space
o The Poisson distribution relates to a count variable (0, 1, 2, …), where successes are relatively rare
E.g., the number of times an individual visited a GP in the last year
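Using only the standard library, the Poisson pmf P(X = k) = e^(−λ)·λ^k / k! can be evaluated; the rate λ = 2 visits per year below is a hypothetical value:

```python
import math

# Hypothetical rate: lam = 2 GP visits per year on average
lam = 2.0

def poisson_pmf(k, lam):
    # P(X = k) = exp(-lam) * lam**k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Probability of 0, 1, or 2 visits in a year
p_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))

print(round(poisson_pmf(0, lam), 4))   # 0.1353
print(round(p_at_most_2, 4))           # 0.6767
```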
Week 5
Continuous Random Variables
For discrete random variables, we assigned positive probabilities to different outcomes.
But if we can measure time to any degree of accuracy, then we could construct a random variable that
can take on any value in the interval from 0 to 60.
A continuous version of the familiar probability histogram used for discrete random variables.
Consider a continuous random variable X with range a ≤ x ≤ b. Its probability density function (pdf), which we refer to as f(x), must satisfy:
o f(x) ≥ 0 for all x between a and b
o Total area under the curve between a and b is unity (i.e., equal to 1)
Probabilities are now represented by areas under pdf
The equally likely nature of this random variable is now represented by any
interval of width m having equal probability
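For the continuous uniform case this is easy to make concrete: with range [0, 60] (as in the time example above), the probability of any interval is just its width divided by 60:

```python
# Continuous uniform on [a, b]: pdf f(x) = 1/(b - a), probabilities are areas
a, b = 0.0, 60.0      # e.g., a waiting time measured in minutes
height = 1 / (b - a)  # constant pdf height

def prob_interval(lo, hi):
    # Area of the rectangle over [lo, hi]: any interval of width m
    # has the same probability m / (b - a)
    return (hi - lo) * height

print(prob_interval(0, 15))    # 0.25
print(prob_interval(30, 45))   # 0.25: equal widths, equal probabilities
```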
The Normal Distribution
For any normally distributed random variable:
P(X = x) = 0**
P(a < X < b) = area under pdf curve between possible values a and b**
A normal distribution is completely characterized by its mean, μ, and variance, σ²
** true for CONTINUOUS random variables
Graphically the normal probability density function is symmetric, unimodal and bell-shaped.
Z = (X − μ)/σ
As we've shown, this standardization yields a random variable with zero mean and a standard deviation of one
How are Z and X related?
o Z = a + bX (where a = −μ/σ and b = 1/σ)
So Z is a linear transformation of X
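A sketch of standardization in practice, using hypothetical values μ = 100 and σ = 15 and a standard normal cdf built from `math.erf` (no external libraries needed):

```python
import math

def phi(z):
    # Standard normal cdf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical X ~ N(mu = 100, sigma = 15); standardize to Z = (X - mu)/sigma
mu, sigma = 100.0, 15.0
x = 130.0
z = (x - mu) / sigma   # 2.0

# P(X < 130) = P(Z < 2): areas under the two pdfs match after standardizing
print(round(phi(z), 4))   # 0.9772
```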
Secondary Data
Primary Data
Observational Data measures actual behaviour or outcomes
Experimental Data imposes a treatment and measures resultant behaviour or outcomes
Simple Random Sampling: A sampling process by which all samples of the same size (n) are equally likely to be chosen from the population of interest
o Avoids problems of selection bias, where the sampling design systematically over- or under-represents parts of the population
Week 6
Estimation
Estimators:
Properties of Estimators
Desirable properties for estimators include:
Unbiasedness: If we constructed it for each of many hypothetical samples of the same size,
will the estimator deliver the correct value (i.e., the value of the parameter) on average?
Consistency: As the sample size gets larger, does the probability that the estimator deviates
from the parameter by more than a small amount become smaller?
Relative efficiency: If there are two competing estimators of a parameter, does the sampling
distribution of one have less expected dispersion than that of the other?
Note that n appears in the denominator of the variance of the sampling distribution (and hence in the standard error we just used).
This means that the larger we set the sample, the tighter the distribution of the sample proportion will be.
So, as the sample size grows, the interval we can construct, for any given statement of confidence, shrinks.
That is, we are 95% sure that OUR PARTICULAR p̂ is within 2(0.008) of p.
The Margin of Error
To get a narrower confidence interval without giving up confidence, we must choose a larger
sample.
Suppose a company wants to offer a new service and wants to estimate, to within 3%, the
proportion of customers who are likely to purchase this new service with 95% confidence.
Q: How large a sample do they need?
What we know: The desired ME and confidence level.
What we don't know: n, p, or p̂.
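Substituting the conservative choice p = 0.5 (which maximizes p(1 − p)) into ME = z·sqrt(p(1 − p)/n) and solving for n gives the required sample size:

```python
import math

# Sample size for estimating a proportion to within ME at a given confidence:
# n = z^2 * p * (1 - p) / ME^2, rounded up to the next whole person
z = 1.96   # critical value for 95% confidence
me = 0.03  # desired margin of error (3%)
p = 0.5    # conservative guess: p*(1-p) is largest at p = 0.5

n = math.ceil(z ** 2 * p * (1 - p) / me ** 2)

print(n)   # 1068
```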
Hypothesis Testing
We know from our work in estimating population proportions that there can be no definitive
answer to this type of question
o Estimators are random variables!
The process of hypothesis testing is potentially subject to incorrect conclusions.
Week 7
Sampling from a Normal Population
Proceed by comparing a test statistic with the value specified in H0 and decide whether the difference is:
o Small enough to be attributable to random sampling errors → do not reject H0, or
o So large that H0 is more likely not to be correct → reject H0
Formally define a rejection (or critical) region by your choice of alternative possibilities
o If values of the test statistic are extreme enough i.e., if they fall in the rejection
region then they lead us to reject H0 in favour of H1
Other possible values of the test statistic, that are not so extreme, lie in the non-critical region
A one-tailed test defines the rejection region as one extreme end of the sampling distribution
o The null and alternative hypotheses here will look something like:
o H0: p = .7 vs. H1: p > .7
A two-tailed test defines the rejection region as both extreme ends of the sampling
distribution
o Here the null and alternatives might look like this:
o H0: p = .7 vs. H1: p ≠ .7
Also recall: the alpha level associated with a confidence level (e.g., 95%) is defined as the
maximum p-value that would enable that level of confidence in our judgment (e.g., 0.05).
This is the probability of our statistic falling into the rejection region (for a two-tailed test:
split the probability across the two ends of the sampling distribution!).
Sigma
When our sample size is large
CLT sampling distribution of the sample mean is approximately normal irrespective of the
population distribution
When σ is unknown and is replaced by s, our standardized test statistic remains approximately (asymptotically) normally distributed
o Why? Because in large samples s will be close to σ with high probability (it is a consistent estimator of σ)
So, for large n nothing has changed
o This is true for the sampling distribution of both the sample mean and the sample
proportion (where, recall, we had to use the sample proportion instead of the
population proportion in our variance formula for the sampling distribution)
Let us confine ourselves first to sampling from a population in which the target random
variable is distributed normally
Let our sample be i.i.d. from N(μ, σ²)
o Recall that linear combinations of normal random variables are also normal, which is why we know the sample mean and its standardized version (using σ) will also be normally distributed
o But what happens when we standardize using s?
Week 8
Interval Estimation
Confidence Intervals
CIs for means and proportions typically have a similar structure
p-values
How do I choose the significance level α?
No rules
Conventional choices are α = 0.1, 0.05, or 0.01
The p-value associated with a given test statistic is the probability of obtaining a value of the test
statistic as or more extreme than that observed, given that the null hypothesis is true
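A sketch of this calculation for a hypothetical observed z statistic, using a standard normal cdf built from `math.erf`:

```python
import math

def phi(z):
    # Standard normal cdf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical observed test statistic
z_obs = 2.1

# One-tailed: probability of a value at least this extreme in one direction
p_one_tailed = 1 - phi(z_obs)

# Two-tailed: the "as or more extreme" region covers both tails
p_two_tailed = 2 * (1 - phi(abs(z_obs)))

print(round(p_one_tailed, 4))   # 0.0179
print(round(p_two_tailed, 4))   # 0.0357
```

A p-value below the chosen α (e.g., 0.05) means the statistic falls in the rejection region.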
Power of a test
Power (in statistics): The probability of correctly rejecting a false null hypothesis
Week 9
Chi-squared Tests
The Chi-squared goodness-of-fit test is used to test the null hypothesis that the observed and
expected distributions are the same.
Chi-squared tests
The χ² Distribution
An asymmetric distribution characterized (like the t distribution!) by its degrees of freedom, ν. Its support lies on the interval (0, ∞): that is, it is exclusively non-negative. It is the sum of the squares of ν independent standard normal variables.
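The goodness-of-fit statistic Σ(O − E)²/E is easy to compute by hand; here is a sketch for a hypothetical test of die fairness (60 rolls, 10 expected per face under H0):

```python
# Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E over cells
# Hypothetical data: counts of each face in 60 rolls of a die
observed = [8, 9, 12, 11, 6, 14]
expected = [10] * 6   # under H0 (fair die), each face expected 10 times

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # degrees of freedom = number of cells - 1

print(round(chi_sq, 2))   # 4.2
print(df)                 # 5
```

The statistic is then compared against the χ² critical value with 5 degrees of freedom.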
Contingency Tables
Week 10
Simple Regression
Suppose we have (Yi, Xi) pairs for i = 1, …, n
We fit a line to the data by minimizing the residual sum of squares. This is the action accomplished by ordinary least squares (OLS) regression.
o This line is defined by estimates of the intercept and slope in the linear relationship Yi = β0 + β1Xi + εi
The slope coefficient has the same sign as the sample covariance (and correlation) between Y and X
Some Basics
Terminology:
OLS produces:
Estimated parameter values b0, b1
Predictions: Ŷi = b0 + b1Xi
Residuals: ei = Yi − Ŷi
The population regression relationship is Yi = β0 + β1Xi + εi
This equation links the (X,Y) pairs via the unknown parameters and the unobserved errors
The fitted relationship Yi = b0 + b1Xi + ei links the (X,Y) pairs via estimated parameters and calculated residuals
Reliable estimates of β1 will require assumptions restricting the relationship between Xi and εi
Our desire to make ceteris paribus interpretations of β1 leads us to include more explanatory variables in the regression
Leads to an extension to multiple regression
Decomposition of Variance
Coefficient of Determination
OLS Inference
What can we say about the properties of the OLS point estimators b0 and b1, if our assumptions hold?
o They are unbiased: E(bj) = βj
o They are normally distributed, as they are linear functions of Yi , which are assumed
to be drawn from a normal distribution
Even without normality of Yi , we can invoke the CLT and assume that bj will be
asymptotically normal
We need to know Var(b0) and Var(b1) in order to conduct inference
Just as we did when estimating means, we can define the true and estimated variances
The panel below gives:
o True variances on the left, and
o Estimated standard errors on the right, where the unknown σ is replaced by the estimated s.
Week 11
Prediction in Regression
One of the main uses for regression models is in prediction or forecasting
Basic idea: Use the fitted regression line (including b0 and b1) to predict a value of Y from a
given value of X
It is natural to think of forecasting in a time series context: e.g., what will happen to sales (Y)
if advertising (X) increases next year?
Predictions can also be made using cross-sectional regression results: e.g., if the income (X) of a poor household increased, what is the predicted impact on that household's food expenditures (Y)?
It is often inadvisable to predict too far away from the (Xi,Yi) pairs to which OLS fits a line
(i.e., too far out of sample), or for a different population
Yt = β0 + β1t + εt ; t = 1, …, T
Dates are converted to time: e.g., 1997Q3 becomes t=1
Prediction
Assume: Yt = β0 + β1Xt + εt
That will enable us to construct a confidence interval for our prediction of the conditional
mean, E(Yf | Xf )
For the petrol prices example, the transformed model would require regressing prices (Yt) on (t − 37) [recall Xt = t]
There appears to be a difference between the expressions for the variance and standard error of
the prediction given in the lecture and those in Sharpe (p. 519). However, with a correction for a
typo in Sharpe (!), they are indeed the same. The trick is to know the sample variance for b1,
var(b1)
Prediction: Y or Expected Y?
We can also construct a prediction interval for Yf
A key threat to the appropriate interpretation of simple linear regression results is the problem
of omitted variables
o Our estimates may not accurately reflect the effect of height on income holding all
other omitted factors constant
o This is often called a problem of confounding and is due to omitted variable bias
o This bias occurs when the following facts BOTH hold:
A variable omitted from the regression is correlated with the included
explanatory variable
That omitted variable independently influences the dependent variable
A primary motivation for multiple regression is to avoid this type of confounding
Linear Regression
Week 12
The F-Test
The t-tests on individual coefficients tell you about one variable at a time
o E.g., is gender related to income?
But we can also test the overall significance of all variables, also referred to as the significance of the model
o This is essentially a test of the significance of the R-sq
The null hypothesis of this test is that all coefficients are equal to zero
The alternative hypothesis is that at least one of them is non-zero
As with the Chi-sq test, reporting the F-test result is not informative unless you interpret
where that result is coming from
You can do this by looking at which variables are significant, and which are insignificant,
using individual t-tests
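The overall F statistic can be computed directly from R-squared; the values of R², n, and k below are hypothetical:

```python
# Overall F statistic for joint significance of all k slope coefficients:
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
r_sq = 0.4   # hypothetical R-squared of the fitted model
n, k = 53, 2 # hypothetical sample size and number of predictors

f_stat = (r_sq / k) / ((1 - r_sq) / (n - k - 1))

print(round(f_stat, 2))   # 16.67
```

This makes explicit that the F-test is essentially a test of the significance of the R-squared: F grows with R² for given n and k.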
Recall the example last time with an array of dummies to indicate educational attainment
Suppose we are modelling retail sales and want to capture the month of the observation
We could:
o Create ONE indicator for month, taking the value 1, 2, 3, 4, …, 12
OR Create twelve SEPARATE indicators, one for each month, each taking the value 0 or 1 for
all observations, and leave one indicator out of the model
Which is better?
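A sketch of the second approach (separate 0/1 indicators with one, the base category, left out), using hypothetical month labels:

```python
# Month dummies with one category (January) left out as the base.
# Hypothetical data: the month of each observation.
months = ["Jan", "Feb", "Mar", "Jan", "Dec"]
levels = ["Feb", "Mar", "Apr", "May", "Jun", "Jul",
          "Aug", "Sep", "Oct", "Nov", "Dec"]   # "Jan" omitted as base

# One 0/1 indicator per non-base month for each observation
dummies = [[1 if m == level else 0 for level in levels] for m in months]

print(dummies[0])  # Jan observation: all zeros (the base category)
print(dummies[4])  # Dec observation: a single 1, in the Dec position
```

Each coefficient on a dummy is then interpreted relative to the omitted base month.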
We mentioned last time how nonlinear relationships between X and Y can be modelled via
variable transformation prior to estimation
How do we interpret the results of the following (fairly common) models?
Multicollinearity
Predictor variables exhibit collinearity (or multicollinearity) when one of the predictors can be
predicted well (not perfectly!) from the others.
Consequences of collinearity:
High collinearity leads to the coefficient being poorly estimated and having a large standard
error (and correspondingly low t-statistic). The coefficient may seem to be the wrong size, or
even the wrong sign.
If a multiple regression model has a high R2 and large F, but the individual t statistics are not
significant, you should suspect collinearity.
Collinearity is measured in terms of the association between a predictor and all of the other
predictors in the model not in terms of just the correlation between any two predictors.
Note that a moderate association between two or more independent variables is not a big deal! In fact, if all independent variables were orthogonal, there would be little point in estimating a model including more than one of them: each coefficient would be the same as in the corresponding simple regression.