
SAMPLING AND ESTIMATION

PARAMETERS AND STATISTICS


A parameter is a quantity used to describe a population; a statistic is a
quantity computed from a sample, used both to describe the sample and to
estimate a population parameter.
- We typically use statistics to estimate parameters because
1. It isn't possible to examine the entire population, or
2. It isn't feasible to examine the entire population (e.g., too expensive).
- For a calculated statistic to convey information about the related
population parameter, certain conditions must be met; those conditions are
generally satisfied if the sample used to calculate the statistic is random.

RANDOM SAMPLES
A simple random sample is a subset of the population drawn in such a way
that each element of the population has an equal probability of being selected.

The key to random sampling lies in the lack of any patterns in the collection of
the data elements.
- Finite and limited populations can be sampled by assigning random numbers
to all of the elements in the population, and then selecting the sample
elements by using a random number generator and matching the generated
numbers to the assigned numbers.
If you can enumerate the population, why don't you just use it?
- When we cant identify all the members of the population, we often use kth
member sampling, where we select every kth member we observe until we
have the necessary sample size.
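Both schemes can be sketched in a few lines of Python (the function names here are illustrative, not from the slides):

```python
import random

def simple_random_sample(population, n, seed=None):
    """Each element of the population has an equal probability of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def kth_member_sample(stream, k):
    """Select every kth member observed (systematic sampling)."""
    return [x for i, x in enumerate(stream, start=1) if i % k == 0]

population = list(range(1, 101))            # an enumerable population
srs = simple_random_sample(population, 10, seed=7)
systematic = kth_member_sample(population, 20)   # -> [20, 40, 60, 80, 100]
```

Note that kth-member sampling is only as good as the ordering of the stream: any periodicity in the data that lines up with k will bias the sample.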

Survey prediction: Alfred Landon wins over FDR with 57% of the vote to 43%.
- Literary Digest, 1936

SAMPLING ERROR AND SAMPLING DISTRIBUTION


The difference between the observed value of a statistic and the value of
the parameter is known as the sampling error.
Random sampling should reflect the characteristics of the underlying
population in such a way that the sample statistics computed from the sample
are valid estimates of the population parameter.
Because the sample represents only a fraction of the observations in the
population, we can extract more than one sample from any given population.
- The samples drawn, if truly random, may contain common elements.
Sample statistics, calculated from multiple samples from the same population,
will then have a distribution of differing values that is known as the sampling
distribution.

Outcome: FDR received 62% of the vote; the sampling error was 19 percentage points!

SAMPLING DISTRIBUTION
The distribution of possible outcomes of a sample statistic that would
result from repeated sampling from the population.
The samples drawn from the population to derive the distribution should be
the same size and drawn from the same underlying population.
We generally refer to a sampling distribution by naming the statistic to
which the distribution applies:
- the sampling distribution of the sample mean.

STRATIFIED RANDOM SAMPLING


A set of simple random samples drawn from an overall population in such a
way that subpopulations are accurately represented in the overall sample.

In a large population, we may have subpopulations, known as strata, whose
representative inclusion in the sample we want to ensure.
To do so, we can use stratified random sampling, wherein we draw a simple
random sample from each stratum and then combine those samples to form the
overall sample on which we perform our analysis.
- This method guarantees that each stratum is represented in proportion to
its share of the population.
- Stratified random sample statistics have greater precision (less variance)
than simple random sample statistics.
- Stratified random sampling is commonly used with bond indexing.
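A minimal sketch of proportional stratified sampling, assuming each element can be assigned to a stratum by a simple function (all names here are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(elements, stratum_of, fraction, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for e in elements:
        strata[stratum_of(e)].append(e)              # partition the population
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))   # proportional allocation, at least 1
        sample.extend(rng.sample(members, k))        # simple random sample per stratum
    return sample

# Illustrative: 100 items in 4 equal strata, sampled at 10%
items = list(range(100))
picked = stratified_sample(items, lambda x: x % 4, 0.10, seed=3)
```

The `max(1, ...)` floor mirrors the indexing use case: every cell of the index must be represented even if its proportional allocation rounds to zero.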

TIME-SERIES AND CROSS-SECTIONAL DATA


Time-series samples are constructed by collecting the data of interest at
regularly spaced intervals of time; such data are known as time-series data.

Cross-sectional samples are constructed by collecting the data of interest
across observational units (firms, people, precincts) at a single point in
time; such data are known as cross-sectional data.

The combination of the two is known as panel data.

CENTRAL LIMIT THEOREM


The central limit theorem (CLT) allows us to make precise probability statements about
the population mean using the sample mean, regardless of the underlying distribution.

Recall that we can estimate the population mean from a random sample by
calculating the average value of the sample observations, a statistic known as
the sample mean.
Given a population described by any probability distribution having mean μ and
finite variance σ², the sampling distribution of the sample mean, X̄, computed
from samples of size n from this population, will be approximately normal with
mean μ (the population mean) and variance σ²/n (the population variance
divided by n) when the sample size n is large.
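A quick simulation illustrates the theorem: even for a heavily skewed population, the distribution of sample means is centered on μ with variance near σ²/n. A standard-library sketch (names are illustrative):

```python
import random
import statistics

def sample_mean_distribution(draw, n, trials, seed=0):
    """Simulate the sampling distribution of the sample mean for samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]

# A skewed, non-normal population: exponential with mean 1 and variance 1.
means = sample_mean_distribution(lambda r: r.expovariate(1.0), n=50, trials=2000)
center = statistics.fmean(means)        # close to the population mean, 1
spread = statistics.variance(means)     # close to sigma^2 / n = 1/50 = 0.02
```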

THE STANDARD ERROR OF THE SAMPLE MEAN


By combining the CLT, the sample mean, and the standard error of the
sample mean, we can make probability statements about the population mean.
The standard deviation of the distribution of the sample mean is known as the standard
error of the sample mean.
- When the sample size is large (generally n > 30 or so), the distribution of the sample
mean will be approximately normal when the sample is randomly generated (collected).
- This distribution is independent of the distribution from which the sample is drawn.
The standard error of the mean then takes the value

Population variance known:   σ_X̄ = σ/√n
Population variance unknown: s_X̄ = s/√n

where s is the sample standard deviation.
CENTRAL LIMIT THEOREM


Focus On: Importance
With a large, random sample
- The distribution of the sample mean will be approximately normal.
- The mean of the distribution will be equal to the mean of the population from
which the samples are drawn.
- The variance of the distribution will be equal to the population variance
divided by the sample size; when the population variance is unknown, it is
estimated by the sample variance divided by the sample size.


POINT ESTIMATES & CONFIDENCE INTERVALS


Estimators are the generalized mathematical expressions for the calculation of
sample statistics, and an estimate is a specific outcome of one estimation.

Estimates take on a single numerical value and are, therefore, referred to
as point estimates.
- A point estimate is a fixed number specific to that sample.
- It has no sampling distribution.
In contrast, a confidence interval (CI) specifies a range that contains the
parameter in which we are interested (1 − α)% of the time.
- The (1 − α)% is known as the degree of confidence.
- Confidence intervals are generally expressed by their lower and upper
confidence limits.


ESTIMATOR PROPERTIES
There are a variety of estimators for each population parameter;
accordingly, we prefer estimators that exhibit certain valuable properties.
1. Unbiasedness
- Occurs when the estimator's expected value is equal to the value of the
parameter being estimated.
- Examples: the sample mean and the sample variance (computed with n − 1 in
the denominator).
2. Efficiency
- Occurs when no other unbiased estimator has a smaller variance.
- Example: the sample mean.
3. Consistency
- Asymptotic in nature, thereby requiring a large number of observations.
- Occurs when the probability of obtaining estimates close to the value of the
population parameter increases as sample size increases.

CONFIDENCE INTERVALS
Focus On: Constructing Confidence Intervals (CIs)
Point estimate ± Reliability factor × Standard error
1. Point estimate = a point estimate of the parameter (a value of a sample
statistic), such as the sample mean.
2. Reliability factor = a number based on the assumed distribution of the point
estimate and the degree of confidence (1 − α) for the confidence interval.
3. Standard error = the standard error of the sample statistic providing the point
estimate.

Normal population, known σ²:             X̄ ± z(α/2) · σ/√n
Unknown σ², large sample:                X̄ ± z(α/2) · s/√n  or  X̄ ± t(α/2) · s/√n
Unknown σ², small sample (normal pop.):  X̄ ± t(α/2) · s/√n

STUDENTS t-DISTRIBUTION
When the population variance is unknown and the sample is random, the
distribution that correctly describes the sample mean is known as the
t-distribution.
The t-distribution has larger reliability (cutoff) values for a given level of alpha
than the normal distribution, but as the sample size increases, the cutoff values
approach those of the normal distribution.
For small sample sizes, use of the t-distribution
instead of the z-distribution to determine
reliability factors is critical.
The t-distribution is a symmetrical distribution
whose probability density function is defined
by a single parameter known as the
degrees of freedom (df).
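The cutoff values below are taken from a standard t-table (97.5th percentiles, i.e., two-sided 5%); the point is only that they shrink toward the z cutoff of 1.960 as df grows:

```python
# Two-sided 5% cutoffs (97.5th percentiles) from a standard t-table
t_cutoffs = {5: 2.571, 10: 2.228, 30: 2.042, 120: 1.980}
z_cutoff = 1.960

for df in sorted(t_cutoffs):
    gap = t_cutoffs[df] - z_cutoff
    print(f"df={df:>3}: t={t_cutoffs[df]:.3f} (exceeds z={z_cutoff:.3f} by {gap:.3f})")
```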


DEGREES OF FREEDOM
The parameter that completely characterizes a t-distribution.
The degrees of freedom for a given t-distribution are equal to the sample size minus
1.
- For a sample size of 45, the degrees of freedom are 44.
- Consider that our calculation of the sample standard deviation is
  s = √[ Σ (Xᵢ − X̄)² / (n − 1) ]
and that the sample mean, X̄, is measured with error because it is not the true
population mean.
- For our sample of 45, once the sample mean has been estimated, only 44 of the
observations are free to vary; the 45th must be whatever value reproduces the
estimated sample mean.
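The "one observation is not free" argument can be checked directly: given the sample mean and all but one observation, the last observation is fully determined.

```python
import statistics

obs = [4.0, 7.0, 1.0, 8.0]
xbar = statistics.fmean(obs)               # the sample mean, already estimated
# Knowing xbar and any n-1 observations pins down the last one:
implied_last = len(obs) * xbar - sum(obs[:-1])
print(implied_last == obs[-1])             # prints True: only n-1 values were free
```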


CONFIDENCE INTERVALS
Focus On: When to Use What
Sampling from:                                 Statistic, small sample   Statistic, large sample
Normal distribution, known variance           z                         z
Normal distribution, unknown variance         t                         t*
Non-normal distribution, known variance       not available             z
Non-normal distribution, unknown variance     not available             t*

*Use of z is also acceptable.


CONFIDENCE INTERVALS
Focus On: Calculations
Portfolio (normal population;   (1) Known        (2) Unknown variance,   (3) Unknown variance,
target = 0.10)                  variance         small sample            large sample
E(R)                            0.014            0.11                    0.25
Std Dev                         0.020            0.08                    0.27
n                               15               20                      45

You have a client with a target rate of return of 10% who would like to be 90%
certain her realized return will include her target return. Construct a 90%
confidence interval for each of the investments in the table, and determine
whether each contains her target return.
For strategy (1), we use a z-statistic because the population variance (standard
deviation) is known and the population is normally distributed.

For strategy (2), we use a t-statistic because the population variance (standard
deviation) is unknown, the sample is small, and the population is normally distributed.

For strategy (3), we can use a z-statistic or a t-statistic even though the
population variance (standard deviation) is unknown because we have a large
sample.
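As a check, the three intervals can be computed directly. The cutoff values (z = 1.645 and t = 1.729 for 19 df) come from standard tables; the inputs are taken from the table as printed, though the decimal alignment of column (1) may have been damaged in extraction:

```python
import math

def ci(mean, sd, n, factor):
    """90% confidence interval: mean +/- factor * sd/sqrt(n)."""
    half = factor * sd / math.sqrt(n)
    return mean - half, mean + half

z_90 = 1.645            # two-sided 90% z cutoff
t_90_df19 = 1.729       # two-sided 90% t cutoff, df = 20 - 1

intervals = {
    "(1) z, known variance":     ci(0.014, 0.020, 15, z_90),
    "(2) t, small sample":       ci(0.110, 0.080, 20, t_90_df19),
    "(3) z or t, large sample":  ci(0.250, 0.270, 45, z_90),
}
for name, (lo, hi) in intervals.items():
    print(f"{name}: [{lo:.4f}, {hi:.4f}] contains 0.10? {lo <= 0.10 <= hi}")
```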


SAMPLE SIZE SELECTION


There are inherent trade-offs in selecting a sample based on both
statistical and economic factors.
Point estimate ± Reliability factor × Standard error
Benefits of a large sample include increased precision, because
a) large samples enable the use of z-statistics rather than t-statistics, and
b) the estimate of the standard error decreases as the sample size increases.

Drawbacks of a large sample include:
- Increased likelihood of sampling from more than one population.
- Increased cost.
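The precision/cost trade-off can also be turned around: given a desired confidence-interval half-width, solve for the smallest adequate n (a sketch with made-up numbers):

```python
import math

def required_sample_size(reliability_factor, std_dev, half_width):
    """Smallest n such that reliability_factor * std_dev / sqrt(n) <= half_width."""
    return math.ceil((reliability_factor * std_dev / half_width) ** 2)

# Hypothetical: 95% CI no wider than +/- 5% for returns with sigma around 20%
n = required_sample_size(1.96, 0.20, 0.05)   # 62 observations
```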

DATA-MINING BIAS
"If you torture the data long enough, it will confess."
- reportedly said by Ronald Coase, Nobel laureate
Data-mining bias results from repeatedly using the same data to search for
patterns.
- If we were to test 1,000 different variables, about 50 of them would, on
average, appear significant at the 5% level even when no real relationship
exists; the significance is just an artifact of the testing error rate.
- This approach is sometimes called a "kitchen sink" problem.
- Economic and financial decisions made on the basis of such tests will be
inherently flawed: there is no true underlying economic rationale for the
relationship distinct from the testing phenomenon.
To verify the relationship and/or discover data-mining biases, we can conduct
out-of-sample tests.
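The 5%-by-chance arithmetic is easy to simulate: test 1,000 variables of pure noise against a true mean of zero and count how many clear the 5% significance bar (a standard-library sketch):

```python
import math
import random
import statistics

rng = random.Random(1)
n, num_vars = 30, 1000
significant = 0
for _ in range(num_vars):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]   # pure noise: true mean 0, sigma 1
    z = statistics.fmean(xs) * math.sqrt(n)        # z-statistic with known sigma = 1
    if abs(z) > 1.96:                              # "significant" at the 5% level
        significant += 1
print(significant, "of", num_vars, "noise variables look significant")
```

Roughly 50 of the 1,000 variables will look significant despite carrying no signal at all.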

No story? No future.

SAMPLE SELECTION BIAS


Effectively represents a nonrandom sample.
Often caused by data for some portion of the population being unavailable.
- If that portion differs from the rest in systematic ways, the result is a
selection bias.
Survivorship bias is a particular kind of selection bias wherein we observe
only those firms that have succeeded and, therefore, survived.
Selection bias in the Literary Digest example: only wealthy people had phones.

Moral of the Story


Even a large sample size doesn't fix a biased sample.


LOOK-AHEAD BIAS AND TIME-PERIOD BIAS


Look-ahead bias occurs when researchers test a model using data that were
not available at the test date and then use the model for predictions.
- May be particularly pronounced when using accounting data, which are
typically reported with a lag.
Time-period bias occurs when the model uses data from a time period that is
not representative of all possible values of the data across time.
- Too short a time period increases the likelihood of period-specific
results.
- Too long a time period increases the chance of spanning a regime change.


SUMMARY
The quality of the sample is critically important when conducting or
evaluating the results of a study.
To draw valid inferences, the sample must be random in order to avoid
a host of potential, often insidious, biases.
When we have a random sample or samples, we can use the central limit
theorem to conduct tests that compare the sample mean with a hypothesized
underlying population value.
- The appropriate test differs as a function of our knowledge of the
underlying population.


BOND INDEX AND STRATIFIED SAMPLING
Suppose you are the manager of a mutual fund indexed to
the Lehman Brothers Government Index. You are exploring
several approaches to indexing, including a stratified
sampling approach. You first distinguish agency bonds from
US Treasury bonds.
For each of these two groups, you define 10 maturity
intervals: 1–2 years, 2–3 years, 3–4 years, 4–6 years, 6–8
years, 8–10 years, 10–12 years, 12–15 years, 15–20 years,
and 20–30 years. You also separate the bonds with coupons
(annual interest rates) of 6% or less from the bonds with
coupons of greater than 6%.

1. How many cells or strata does this sampling plan entail?
2(10)(2) = 40

2. If you use this sampling plan, what is the minimum
number of issues the indexed portfolio can have?
40, so that no cell is empty.

3. Suppose that in selecting among the securities that
qualify for selection within each cell, you apply a criterion
concerning the liquidity of the security's market. Is the
sample obtained random? Explain your answer.
No. Not every bond now has an equal probability of being
selected.
