RANDOM SAMPLES
A simple random sample is a subset of the population drawn in such a way
that each element of the population has an equal probability of being selected.
The key to random sampling lies in the lack of any patterns in the collection of
the data elements.
- Finite and limited populations can be sampled by assigning random numbers
to all of the elements in the population, and then selecting the sample
elements by using a random number generator and matching the generated
numbers to the assigned numbers.
If you can enumerate the population, why don't you just use it?
- When we can't identify all the members of the population, we often use kth-member
(systematic) sampling, where we select every kth member we observe until we
have the necessary sample size.
Survey prediction (Literary Digest, 1936): Alfred Landon wins over FDR with
57% of the vote to 43%.
Outcome: FDR gets 62% of the vote. The sampling error was 19%!
SAMPLING DISTRIBUTION
The distribution of possible outcomes of a sample statistic that would
result from repeated sampling from the population.
The samples drawn from the
population to derive the
distribution should be the same
size and drawn from the same
underlying population.
We generally refer to a sampling
distribution by indicating the
statistic to which the distribution
applies:
- the sampling distribution of the sample mean, whose standard deviation
(the standard error) is σ/√n, where σ is the population standard deviation
and n is the sample size.
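A quick simulation illustrates the idea (a sketch assuming a uniform population on [0, 1], for which σ = √(1/12)):

```python
import random
import statistics

random.seed(1)
n = 30           # sample size
trials = 5000    # number of repeated samples

# Draw many samples of size n and record each sample mean.
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(trials)]

# The spread of the sample means should match sigma / sqrt(n).
observed_se = statistics.stdev(means)
predicted_se = (1 / 12) ** 0.5 / n ** 0.5
print(round(observed_se, 3), round(predicted_se, 3))
```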
ESTIMATOR PROPERTIES
There are a variety of estimators for each population parameter;
accordingly, we prefer estimators that exhibit certain valuable properties.
1. Unbiasedness
- Occurs when the estimator's expected value is equal to the value of the
parameter being estimated.
- Examples: the sample mean and the sample variance (computed with n − 1 in
the denominator). Note that the sample standard deviation is slightly biased,
even though the sample variance is not.
2. Efficiency
- Occurs when no other unbiased estimator has a smaller variance.
- Example: the sample mean (for a normally distributed population)
3. Consistency
- Asymptotic in nature, thereby requiring a large number of observations.
- Occurs when the probability of obtaining estimates close to the value of the
population parameter increases as sample size increases.
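These properties can be checked by simulation. In the sketch below (a hypothetical normal population with σ = 2, so σ² = 4), the n − 1 sample variance averages to σ² while the sample standard deviation averages slightly below σ:

```python
import random
import statistics

random.seed(7)
sigma = 2.0
n, trials = 5, 20000

var_estimates, sd_estimates = [], []
for _ in range(trials):
    sample = [random.gauss(0, sigma) for _ in range(n)]
    var_estimates.append(statistics.variance(sample))  # divides by n - 1
    sd_estimates.append(statistics.stdev(sample))

# Average of the n-1 sample variance is close to sigma^2 = 4 (unbiased);
# average of the sample standard deviation falls below sigma = 2 (biased low).
print(round(statistics.fmean(var_estimates), 2), round(statistics.fmean(sd_estimates), 2))
```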
CONFIDENCE INTERVALS
Focus On: Constructing Confidence Intervals (CIs)
CI = Point estimate ± Reliability factor × Standard error
1. Point estimate = A point estimate of the parameter (a value of a sample
statistic), such as the sample mean.
2. Reliability factor = A number based on the assumed distribution of the point
estimate and the degree of confidence (1 − α) for the confidence interval.
3. Standard error = The standard error of the sample statistic providing the point
estimate.
Normal population, known σ: x̄ ± z(α/2) · σ/√n
Unknown σ, large sample: x̄ ± z(α/2) · s/√n or x̄ ± t(α/2) · s/√n
Unknown σ, small sample (normal population): x̄ ± t(α/2) · s/√n
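A minimal sketch of the construction in Python, using the standard library only (the sample values are hypothetical, and t critical values must come from a table because the stdlib has no t-distribution):

```python
from statistics import NormalDist
from math import sqrt

def confidence_interval(mean, s, n, confidence=0.90, t_crit=None):
    """Two-sided CI: point estimate +/- reliability factor * standard error."""
    se = s / sqrt(n)
    if t_crit is not None:
        crit = t_crit  # t critical value supplied from a table lookup
    else:
        crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # z critical value
    return mean - crit * se, mean + crit * se

# 90% CI for a hypothetical sample mean of 0.05, s = 0.02, n = 25 (z case)
lo, hi = confidence_interval(0.05, 0.02, 25)
print(round(lo, 4), round(hi, 4))
```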
STUDENT'S t-DISTRIBUTION
When the population variance is unknown and the sample is random, the
distribution that correctly describes the standardized sample mean is known as
the t-distribution.
The t-distribution has larger reliability (cutoff) values for a given level of alpha
than the normal distribution, but as the sample size increases, the cutoff values
approach those of the normal distribution.
For small sample sizes, use of the t-distribution
instead of the z-distribution to determine
reliability factors is critical.
The t-distribution is a symmetrical distribution
whose probability density function is defined
by a single parameter known as the
degrees of freedom (df).
DEGREES OF FREEDOM
The parameter that completely characterizes a t-distribution.
The degrees of freedom for a given t-distribution are equal to the sample size
minus 1 (df = n − 1).
- For a sample size of 45, the degrees of freedom are 44.
- Consider that our calculation of the sample standard deviation is
s = √[ Σ (xᵢ − x̄)² / (n − 1) ]
and that the sample mean, x̄, is measured with error because it is not the true
population mean, μ.
- For our sample of 45, because we have already estimated our sample mean, once
we have enumerated 44 of the sample observations, the 45th must be the value
that reproduces the estimated sample mean. Hence, we are only free to choose 44
of the observations.
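The constraint is easy to see numerically: deviations from the sample mean always sum to zero, so the last deviation is fully determined by the other 44 (a sketch with hypothetical data):

```python
import random
from statistics import fmean

random.seed(3)
sample = [random.gauss(10, 2) for _ in range(45)]
xbar = fmean(sample)

# Deviations from the sample mean sum to zero, so once 44 deviations are
# known, the 45th is determined: only n - 1 = 44 are free to vary.
deviations = [x - xbar for x in sample]
implied_last = -sum(deviations[:-1])
print(round(sum(deviations), 10), round(deviations[-1] - implied_last, 10))
```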
CONFIDENCE INTERVALS
Focus On: When to Use What
Sampling from:                               Small sample     Large sample
Normal distribution, known variance          z                z
Normal distribution, unknown variance        t                t*
Non-normal distribution, known variance      Not available    z
Non-normal distribution, unknown variance    Not available    t*
* By the central limit theorem, the z-statistic is also acceptable for large
samples.
CONFIDENCE INTERVALS
Focus On: Calculations
Portfolio     (1) Normal population,   (2) Unknown variance,   (3) Unknown variance,
                  known variance           small sample            large sample
E(R)          0.014                    0.11                    0.25
Std. dev.     0.020                    0.08                    0.27
n             15                       20                      45
Target return = 0.1
You have a client with a target rate of return of 10% who would like to be 90%
certain her realized return will include her target return. Construct a 90%
confidence interval for each of the investments in the table, and determine
whether each contains her target return.
For strategy (1), we use a z-statistic because the population variance (standard
deviation) is known and the population is normally distributed:
0.014 ± 1.645 × (0.020/√15) = 0.014 ± 0.0085 → [0.0055, 0.0225], which does not
contain the 0.10 target.
For strategy (2), we use a t-statistic because the population variance (standard
deviation) is unknown, the sample is small, and the population is normally
distributed:
0.11 ± 1.729 × (0.08/√20) = 0.11 ± 0.0309 → [0.0791, 0.1409], which contains
the 0.10 target.
For strategy (3), we can use a z-statistic or a t-statistic even though the
population variance (standard deviation) is unknown, because we have a large
sample. Using z:
0.25 ± 1.645 × (0.27/√45) = 0.25 ± 0.0662 → [0.1838, 0.3162], which does not
contain the 0.10 target.
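The three intervals can be verified with a short script (critical values hardcoded: 1.645 for z and 1.729 for t with 19 degrees of freedom, both at 90% confidence):

```python
from math import sqrt

Z_90 = 1.645        # z critical value, 90% two-sided
T_90_DF19 = 1.729   # t critical value, df = 19 (table lookup)

portfolios = [
    # (E(R), std dev, n, critical value)
    (0.014, 0.020, 15, Z_90),       # (1) normal population, known variance -> z
    (0.11,  0.08,  20, T_90_DF19),  # (2) unknown variance, small sample -> t
    (0.25,  0.27,  45, Z_90),       # (3) unknown variance, large sample -> z ok
]

target = 0.10
results = []
for i, (mean, s, n, crit) in enumerate(portfolios, 1):
    half = crit * s / sqrt(n)               # reliability factor * standard error
    lo, hi = mean - half, mean + half
    results.append((lo, hi, lo <= target <= hi))
    print(f"({i}) [{lo:.4f}, {hi:.4f}] contains target: {lo <= target <= hi}")
```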
DATA-MINING BIAS
"If you torture the data long enough, it will confess."
— attributed to Ronald Coase, Nobel laureate in economics
Data-mining bias results from the overuse and/or repeated use of the same
data to repeatedly search for patterns in the data.
- If we were to test 1,000 different variables at the 5% significance level,
about 50 of them would appear significant purely by chance, even if no real
relationships exist.
- This approach is sometimes called a kitchen sink problem.
- Economic and financial decisions made on the basis of these tests will be
inherently flawed.
- There is no true underlying economic rationale for the relationship distinct
from the testing phenomenon.
To verify the relationship and/or discover data-mining biases, we can conduct
out-of-sample tests.
No story? No future. A relationship with no economic story behind it is unlikely
to persist out of sample.
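The 5%-by-chance effect is easy to reproduce: below, 1,000 pure-noise "strategies" are each tested for a nonzero mean, and roughly 50 look significant (a sketch; the sample size and seed are arbitrary):

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(0)
n_vars, n_obs = 1000, 100

significant = 0
for _ in range(n_vars):
    # Pure noise: the true mean is zero, so any "signal" found is spurious.
    x = [random.gauss(0, 1) for _ in range(n_obs)]
    t_stat = fmean(x) / (stdev(x) / sqrt(n_obs))
    if abs(t_stat) > 1.96:   # naive 5% two-sided test, no multiplicity control
        significant += 1

print(significant)  # roughly 5% of 1,000, i.e. about 50 false discoveries
```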
SUMMARY
The quality of the sample is critically important when conducting or
evaluating the results of a study.
To draw valid inferences, the sample must be random in order to avoid
a host of potential, often insidious, biases.
When we have a random sample or samples, we can use the central
limit theorem to conduct tests that compare the sample mean with a
hypothesized underlying population value.
- The appropriate test will differ as a function of our knowledge of the
underlying population.