
Chapter 8: Sampling Methods and the Central Limit Theorem

Why Sample the Population?


1. To contact the whole population would be time consuming.
2. The cost of studying all the items in a population may be prohibitive.
3. The physical impossibility of checking all items in the population.
4. The destructive nature of some tests.
5. The sample results are adequate.
Probability Sampling
A probability sample is a sample selected such that each item or person in the population being
studied has a known likelihood of being included in the sample.
Most Commonly Used Probability Sampling Methods
→ Simple Random Sample
→ Systematic Random Sampling
→ Stratified Random Sampling
→ Cluster Sampling
Simple Random Sample: A sample selected so that each item or person in the population has the
same chance of being included.
Example: A population consists of 845 employees of Nitra Industries. A sample of 52 employees is to
be selected from that population. The name of each employee is written on a small slip of paper, and
all of the slips are deposited in a box. After the slips have been thoroughly mixed, the first selection is
made by drawing a slip out of the box without looking at it. This process is repeated until the sample
of 52 employees is chosen.
Systematic Random Sampling: A random starting point is selected and then every kth member of
the population is selected for the sample.
Example: A population consists of 845 employees of Nitra Industries. A sample of 52 employees is to
be selected from that population.
First, k is calculated as the population size divided by the sample size. For Nitra Industries,
845/52 = 16.25, so we would select every 16th employee on the list. If k is not a whole number, round it
down. Random sampling is used to select the first name. Then every 16th name on the list thereafter is
selected.
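The two selection procedures described so far can be sketched in Python. This is only an illustration under stated assumptions: the employee IDs 1 to 845, the seed, and the function names are hypothetical and do not come from the Nitra Industries example itself.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n distinct items, each equally likely (like slips from a box)."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Pick a random start among the first k items, then take every kth item."""
    k = len(population) // n      # 845 // 52 = 16 (16.25 rounded down)
    start = rng.randrange(k)      # random starting index, 0 .. k-1
    return [population[start + i * k] for i in range(n)]

rng = random.Random(8)                 # arbitrary seed, for repeatability only
employees = list(range(1, 846))        # hypothetical employee IDs 1..845

srs = simple_random_sample(employees, 52, rng)
sys_sample = systematic_sample(employees, 52, rng)
print(sorted(srs)[:5])
print(sys_sample[:5])
```

Because k is rounded down, the last few names on the list can never be reached by the systematic procedure; with 845 names and k = 16 the highest reachable position is 832.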
Risks Associated with Systematic Sampling
One risk that statisticians must take into account when conducting systematic sampling involves how
the list used with the sampling interval is organized. If the population placed on the list is organized in
a cyclical pattern that matches the sampling interval, the selected sample may be biased.
For example, a company's HR department wants to pick a sample of employees and ask how they feel
about company policies. Employees are grouped in teams of 20, with each team headed by a manager.
If the list used to pick the sample is organized with teams clustered together and the sampling interval
matches the team size, the statistician risks picking only managers (or no managers at all), depending
on the starting point.
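The risk can be made concrete with a small sketch. The roster below is hypothetical (10 teams of 20, with the manager listed first in each team), and the interval of 20 is chosen on purpose to match the team size.

```python
# Hypothetical roster: 10 teams of 20 people, manager listed first in each team.
roster = [("manager" if pos == 0 else "staff", team, pos)
          for team in range(10) for pos in range(20)]

def systematic(items, k, start):
    """Take every kth item beginning at index `start`."""
    return items[start::k]

only_managers = systematic(roster, 20, 0)   # start lands on a manager
no_managers = systematic(roster, 20, 7)     # any other start skips every manager

print({role for role, _, _ in only_managers})
print({role for role, _, _ in no_managers})
```

Shuffling the list before sampling, or choosing an interval that does not divide the cycle length, is one way to avoid this pattern.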
Stratified Random Sampling: A population is first divided into subgroups, called strata, and a sample
is selected from each stratum. Useful when a population can be clearly divided in groups based on
some characteristics.
Example: Suppose we want to study the advertising expenditures for the 352 largest companies in
the United States to determine whether firms with high returns on equity (a measure of profitability)
spent more of each sales dollar on advertising than firms with a low return or deficit.
To make sure that the sample is a fair representation of the 352 companies, the companies are
grouped on percent return on equity and a sample proportional to the relative size of the group is
randomly selected.
Cluster Sampling: A population is divided into clusters using naturally occurring geographic or other
boundaries. Then, clusters are randomly selected and a sample is collected by randomly selecting
from each cluster.
Suppose you want to determine the views of residents in Oregon about state and federal
environmental protection policies.
Cluster sampling can be used by subdividing the state into small units—either counties or regions,
select at random say 4 regions, then take samples of the residents in each of these regions and
interview them. (Note that this is a combination of cluster sampling and simple random sampling.)
Methods of Probability Sampling
→ In a nonprobability sample, inclusion in the sample is based on the judgment of the person
selecting the sample.
→ The sampling error is the difference between a sample statistic and its corresponding
population parameter.
Sampling Distribution of the Sample Mean
The sampling distribution of the sample mean is a probability distribution consisting of all possible
sample means of a given sample size selected from a population.
Example: Tartus Industries has seven production employees (considered the population). The hourly
earnings of each employee are given in the table below.
Employee Hourly Earnings
Joe 7
Sam 7
Sue 8
Bob 8
Jan 7
Art 8
Ted 9
1. What is the population mean?
2. What is the sampling distribution of the sample mean for samples of size 2?
3. What is the mean of the sampling distribution?
4. What observations can be made about the population and the sampling distribution?
Solution:
1. Population mean: \( \mu = \frac{7+7+8+8+7+8+9}{7} = \frac{54}{7} = 7.71 \)
2. To arrive at the sampling distribution of the sample mean, we need to select all possible samples of
2 without replacement from the population, and then compute the mean of each sample. There are 21
possible samples, found by
\[ {}_N C_n = {}_7 C_2 = \frac{7!}{2!\,(7-2)!} = 21 \]
Sample Employees Earnings Sum Mean Sample Employees Earnings Sum Mean
1 Joe, Sam 7, 7 14 7.0 12 Sue, Bob 8, 8 16 8.0
2 Joe, Sue 7, 8 15 7.5 13 Sue, Jan 8, 7 15 7.5
3 Joe, Bob 7, 8 15 7.5 14 Sue, Art 8, 8 16 8.0
4 Joe, Jan 7, 7 14 7.0 15 Sue, Ted 8, 9 17 8.5
5 Joe, Art 7, 8 15 7.5 16 Bob, Jan 8, 7 15 7.5
6 Joe, Ted 7, 9 16 8.0 17 Bob, Art 8, 8 16 8.0
7 Sam, Sue 7, 8 15 7.5 18 Bob, Ted 8, 9 17 8.5
8 Sam, Bob 7, 8 15 7.5 19 Jan, Art 7, 8 15 7.5
9 Sam, Jan 7, 7 14 7.0 20 Jan, Ted 7, 9 16 8.0
10 Sam, Art 7, 8 15 7.5 21 Art, Ted 8, 9 17 8.5
11 Sam, Ted 7, 9 16 8.0
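3. The mean of the sampling distribution:
\[ \mu_{\bar{X}} = \frac{\text{Sum of all sample means}}{\text{Total number of samples}} = \frac{7.0 + 7.5 + \cdots + 8.5}{21} = \frac{162}{21} = 7.71 \]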
Population values:
Hourly Earnings   No. of Employees   Probability
7                 3                  0.43
8                 3                  0.43
9                 1                  0.14
Total             7                  1.00
Sample means (samples of size 2):
Sample Mean       No. of Means       Probability
7.0               3                  0.14
7.5               9                  0.43
8.0               6                  0.29
8.5               3                  0.14
Total             21                 1.00

Refer to Chart. It shows the population distribution based on the data in Table 1 and the distribution
of the sample mean based on the data in Table 2. These observations can be made:
a. The mean of the distribution of the sample mean is equal to the mean of the population: \( \mu = \mu_{\bar{X}} \)
b. The spread in the distribution of the sample mean is less than the spread in the population values.
The sample means range from $7.0 to $8.5 while the population values vary from $7 up to $9. If we
continue to increase the sample size, the spread of the distribution of the sample mean becomes
smaller.
c. The shape of the sampling distribution of the sample mean and the shape of the frequency
distribution of the population values are different. The distribution of the sample mean tends to be
more bell-shaped and to approximate the normal probability distribution.
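The Tartus Industries results can be checked by brute-force enumeration. The sketch below uses only the earnings listed in the example; the variable names are illustrative.

```python
from itertools import combinations
from collections import Counter

earnings = {"Joe": 7, "Sam": 7, "Sue": 8, "Bob": 8, "Jan": 7, "Art": 8, "Ted": 9}

# Population mean (about 7.71).
population_mean = sum(earnings.values()) / len(earnings)

# All samples of size 2, drawn without replacement.
samples = list(combinations(earnings, 2))
sample_means = [(earnings[a] + earnings[b]) / 2 for a, b in samples]

# Mean of the sampling distribution and the frequency of each sample mean.
mean_of_means = sum(sample_means) / len(sample_means)
distribution = Counter(sample_means)

print(len(samples), round(population_mean, 2), round(mean_of_means, 2))
for value in sorted(distribution):
    print(value, distribution[value], round(distribution[value] / len(samples), 2))
```

The enumeration reproduces observations a and b: the two means agree, and the sample means span a narrower range (7.0 to 8.5) than the population values (7 to 9).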
Chapter 9: Estimation and Confidence Intervals
Reasons for Sampling
1. To contact the entire population is too time consuming.
2. The cost of studying all the items in the population is often too expensive.
3. The sample results are usually adequate.
4. Certain tests are destructive.
5. Checking all the items is physically impossible.
Point and Interval Estimates
→ A point estimate is a single value (point) derived from a sample and used to estimate a
population value.
→ A confidence interval estimate is a range of values constructed from sample data so that the
population parameter is likely to occur within that range at a specified probability. The
specified probability is called the level of confidence.
Factors Affecting Confidence Interval Estimates
The factors that determine the width of a confidence interval are:
1. The sample size, n.
2. The variability in the population, usually σ estimated by s.
3. The desired level of confidence.
Interval Estimates – Interpretation
For a 95% confidence interval, about 95% of similarly constructed intervals will contain the
parameter being estimated. Also, 95% of the sample means for a specified sample size will lie within
1.96 standard errors of the hypothesized population mean.
How to obtain z value for a Given Confidence Level
The 95 percent confidence refers to the middle 95 percent of the observations. Therefore, the
remaining 5 percent are equally divided between the two tails.
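As a cross-check on the table lookup, the z value can also be computed from the standard normal distribution. A minimal sketch using Python's standard library (`statistics.NormalDist`); the function name is illustrative.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """z value that leaves (1 - level) / 2 in each tail of the standard normal."""
    tail = (1 - level) / 2
    return NormalDist().inv_cdf(1 - tail)

z95 = z_for_confidence(0.95)   # 2.5 percent in each tail
print(round(z95, 2))
```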

Following is a portion of Appendix B.1.

Point Estimates and Confidence Intervals for a Mean – σ Known


Confidence Interval for a Population Mean with σ Known:
\[ \bar{X} \pm z\frac{\sigma}{\sqrt{n}} \]
\( \bar{X} \) = sample mean
\( z \) = z value for a particular confidence level
\( \sigma \) = the population standard deviation
\( n \) = number of observations in the sample
1. The width of the interval is determined by the level of confidence and the size of the standard
error of the mean.
2. The standard error is affected by two values:
- Standard deviation
- Number of observations in the sample
Example: Confidence Interval for a Mean – σ Known
The American Management Association wishes to have information on the mean income of middle
managers in the retail industry. A random sample of 256 managers reveals a sample mean of $45,420.
The standard deviation of this population is $2,050. The association would like answers to the
following questions:
1. What is the population mean?
2. What is a reasonable range of values for the population mean?
3. What do these results mean?
Solution:
1. In this case, we do not know. We do know the sample mean is $45,420. Hence, our best estimate of
the unknown population value is the corresponding sample statistic.
The sample mean of $45,420 is a point estimate of the unknown population mean.
2. Suppose the association decides to use the 95 percent level of confidence:
\[ \bar{X} \pm z\frac{\sigma}{\sqrt{n}} = 45420 \pm 1.96\frac{2050}{\sqrt{256}} = 45420 \pm 251 \]
The confidence limits are $45,169 and $45,671.
The ±$251 is referred to as the margin of error.
3. If we select many samples of 256 managers, and for each sample we compute the mean and then
construct a 95 percent confidence interval, we could expect about 95 percent of these confidence
intervals to contain the population mean. Conversely, about 5 percent of the intervals would not
contain the population mean annual income, µ.
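The arithmetic in the solution can be reproduced in a few lines. The sketch below takes z = 1.96 as given in the text; the function name is illustrative.

```python
from math import sqrt

def z_interval(mean, sigma, n, z):
    """Confidence interval for a mean when sigma is known: mean +/- z * sigma / sqrt(n)."""
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin, margin

lower, upper, margin = z_interval(45420, 2050, 256, 1.96)
print(round(lower), round(upper), round(margin))
```

Quadrupling the sample size halves the margin of error, since n enters only through the square root.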
Characteristics of the t-distribution
1. It is, like the z distribution, a continuous distribution.
2. It is, like the z distribution, bell-shaped and symmetrical.
3. There is not one t distribution, but rather a family of t distributions. All t distributions have a
mean of 0, but their standard deviations differ according to the sample size, n.
4. The t distribution is more spread out and flatter at the center than the standard normal
distribution. As the sample size increases, however, the t distribution approaches the standard
normal distribution.
Confidence Interval Estimates for the Mean
Use the z distribution if the population standard deviation is known or the sample size is at least 30:
\[ \bar{X} \pm z\frac{\sigma}{\sqrt{n}} \]
Use the t distribution if the population standard deviation is unknown and the sample size is less than 30:
\[ \bar{X} \pm t\frac{s}{\sqrt{n}} \]

Confidence Interval for the Mean – Example using the t-distribution


Example: A tire manufacturer wishes to investigate the tread life of its tires. A sample of 10 tires
driven 50,000 miles revealed a sample mean of 0.32 inch of tread remaining with a standard deviation
of 0.09 inch.
Construct a 95 percent confidence interval for the population mean.
Would it be reasonable for the manufacturer to conclude that after 50,000 miles the population mean
amount of tread remaining is 0.30 inches?
Solution:
Given in the problem,
\( n = 10 \)
Degrees of freedom = 10 − 1 = 9
\( \bar{X} = 0.32 \)
\( s = 0.09 \)
From the t table, \( t = 2.262 \)
\[ \bar{X} \pm t\frac{s}{\sqrt{n}} = 0.32 \pm 2.262\frac{0.09}{\sqrt{10}} = 0.32 \pm 0.064 = (0.256,\ 0.384) \]
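The manufacturer can be reasonably sure (95% confident) that the mean remaining tread depth is
between 0.256 and 0.384 inches. Because 0.30 lies within this interval, it is reasonable to conclude
that the population mean amount of tread remaining could be 0.30 inches.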
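The tire example can be verified numerically. The Python standard library has no t distribution, so this sketch simply hard-codes the table value t = 2.262 quoted in the solution; that constant is taken from the text, not computed.

```python
from math import sqrt

T_CRITICAL_95_DF9 = 2.262   # from the t table (95 percent confidence, 9 degrees of freedom)

n, xbar, s = 10, 0.32, 0.09
margin = T_CRITICAL_95_DF9 * s / sqrt(n)
lower, upper = xbar - margin, xbar + margin

# Is the claimed mean of 0.30 inch inside the interval?
contains_030 = lower <= 0.30 <= upper
print(round(lower, 3), round(upper, 3), contains_030)
```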
Chapter 10: One-Sample Tests of Hypothesis
What is a Hypothesis?
A statement about the value of a population parameter developed for the purpose of testing.
Hypothesis testing
→ Based on sample evidence and probability theory
→ Used to determine whether the hypothesis is a reasonable statement and should not be
rejected, or is unreasonable and should be rejected

Step One: State the null and alternate hypotheses


Null Hypothesis 𝑯𝟎: A statement about the value of a population parameter
Alternative Hypothesis 𝑯𝟏 : A statement that is accepted if the sample data provide evidence that the
null hypothesis is false
Three possibilities regarding means
𝐻0 : 𝜇 = 0 𝐻0 : 𝜇 ≤ 0 𝐻0 : 𝜇 ≥ 0
𝐻1 : 𝜇 ≠ 0 𝐻1 : 𝜇 > 0 𝐻1 : 𝜇 < 0
The null hypothesis always contains equality.
Step Two: Select a level of significance
Level of Significance: The probability of rejecting the null hypothesis when it is actually true; the
level of risk in so doing.
Type I Error: Rejecting the null hypothesis when it is actually true (α).
Type II Error: Accepting the null hypothesis when it is actually false (β).
Risk Table
Null Hypothesis   Researcher Accepts 𝐻0   Researcher Rejects 𝐻0
𝐻0 is true        Correct Decision        Type I Error (α)
𝐻0 is false       Type II Error (β)       Correct Decision
Step Three: Select the test statistic.
Test statistic: A value, determined from sample information, used to determine whether or not to
reject the null hypothesis.
Examples: 𝑧, 𝑡, 𝐹, 𝜒²
z Distribution as a test statistic
\[ z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
The z value is based on the sampling distribution of \( \bar{X} \), which is approximately normally
distributed when the sample is reasonably large (recall the Central Limit Theorem).
Step Four: Formulate the decision rule.
Critical value: The dividing point between the region where the null hypothesis is rejected and the
region where it is not rejected.
Sampling Distribution of the Statistic z, a Right-Tailed Test, .05 Level of Significance

Decision Rule
Reject the null hypothesis and accept the alternate hypothesis if
Computed z < −(Critical z) in a left-tailed test, or
Computed z > Critical z in a right-tailed test.
Using the p-Value in Hypothesis Testing
P-Value
The probability, assuming that the null hypothesis is true, of finding a value of the test statistic at least
as extreme as the computed value for the test
Decision Rule
If the p-Value is larger than or equal to the significance level, α, 𝐻0 is not rejected.
If the p-Value is smaller than the significance level, α, 𝐻0 is rejected.
Calculated from the probability distribution function or by computer
Interpreting p-values
.05 < 𝑝 < .10: Some evidence 𝐻0 is not true
.01 < 𝑝 < .05: Strong evidence 𝐻0 is not true
.001 < 𝑝 < .01: Very strong evidence 𝐻0 is not true
Step Five: Make a decision.
Accept or Reject 𝐻0
One-Tailed Tests of Significance
The alternate hypothesis, 𝐻1 states a direction
Examples:
1. 𝐻1 : The mean yearly commissions earned by fulltime realtors are more than 35,000. (µ > 35,000)
2. 𝐻1 : The mean speed of trucks traveling on I-95 in Georgia is less than 60 miles per hour. (µ < 60)
3. 𝐻1 : Less than 20 percent of the customers pay cash for their gasoline purchase. (𝑝 < .20)
Sampling Distribution of the Statistic z, a Right-Tailed Test, .05 Level of Significance

Two-Tailed Tests of Significance


No direction is specified in the alternate hypothesis 𝐻1
Examples:
1. 𝐻1 : The mean amount spent by customers at the Wal-Mart in Georgetown is not equal to $25.
(µ ≠ $25).
2. 𝐻1 : The mean price for a gallon of gasoline is not equal to $1.54. (µ ≠ $1.54).
Regions of No rejection and Rejection for a Two-Tailed Test, .05 Level of Significance

Test for the population mean from a large sample with population standard deviation known
\[ z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
Example 1: The processors of Fries’ Catsup indicate on the label that the bottle contains 16 ounces of
catsup. The standard deviation of the process is 0.5 ounces. A sample of 36 bottles from last hour’s
production revealed a mean weight of 16.12 ounces per bottle. At the .05 significance level is the
process out of control? That is, can we conclude that the mean amount per bottle is different from 16
ounces?
Step 1: State the null and the alternative hypotheses
𝐻0 : 𝜇 = 16
𝐻1 : 𝜇 ≠ 16
Step 2: Select the significance level
The significance level is .05.
Step 3: Identify the test statistic.
Because we know the population standard deviation, the test statistic is z.
Step 4: State the decision rule.
Reject 𝐻0 if 𝑧 > 1.96 or 𝑧 < −1.96, or equivalently if 𝑝 < .05
Step 5: Make a decision and interpret the results.
\[ z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{16.12 - 16}{0.5 / \sqrt{36}} = \frac{0.12}{0.5 / 6} = 1.44 \]
The p-value for a two-tailed test, 2 × P(z > 1.44), is .1499.
o Computed z of 1.44 < Critical z of 1.96
o p of .1499 > α of .05
Do not reject the null hypothesis.
We cannot conclude the mean is different from 16 ounces.
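The five steps for a z test can be condensed into a short function. This is a sketch rather than a prescribed procedure: the function name and the `tail` argument are illustrative, and the normal probabilities come from Python's `statistics.NormalDist` instead of the printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, tail="two"):
    """Return (z, p-value) for a test of a mean with known sigma.

    tail: "two" for H1: mu != mu0, "right" for H1: mu > mu0, "left" for H1: mu < mu0.
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf
    if tail == "two":
        p = 2 * (1 - cdf(abs(z)))
    elif tail == "right":
        p = 1 - cdf(z)
    else:
        p = cdf(z)
    return z, p

# Fries' Catsup: H0: mu = 16, H1: mu != 16, alpha = .05
z, p = z_test(16.12, 16, 0.5, 36, tail="two")
reject = p < 0.05
print(round(z, 2), round(p, 4), reject)
```

Comparing the p-value with α and comparing the computed z with the critical z always lead to the same decision; here both say do not reject.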
Testing for the Population Mean: Large Sample, Population Standard Deviation Unknown
As long as the sample size 𝑛 ≥ 30, z can be approximated using
\[ z = \frac{\bar{X} - \mu}{s / \sqrt{n}} \]
Here, σ is unknown, so we estimate it with the sample standard deviation s.
Example 2: Roger’s Discount Store chain issues its own credit card. Lisa, the credit manager, wants to
find out if the mean monthly unpaid balance is more than $400. The level of significance is set at .05.
A random check of 172 unpaid balances revealed the sample mean to be $407 and the sample
standard deviation to be $38.
Should Lisa conclude that the population mean is greater than $400, or is it reasonable to assume that
the difference of $7 ($407-$400) is due to chance?
Step 1: State the null and the alternative hypotheses
𝐻0 : 𝜇 ≤ 400
𝐻1 : 𝜇 > 400
Step 2: Select the significance level
The significance level is .05.
Step 3: Identify the test statistic.
Because the sample is large we can use the z distribution as the test statistic.
Step 4: State the decision rule.
Reject 𝐻0 if 𝑧 > 1.65 or if 𝑝 < .05
Step 5: Make a decision and interpret the results.
\[ z = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{407 - 400}{38 / \sqrt{172}} = 2.42 \]
The p-value for a one-tailed test, P(z > 2.42), is .0078.
o Computed z of 2.42 > Critical z of 1.65
o p of .0078 < α of .05
Reject the null hypothesis.
Lisa can conclude that the mean unpaid balance is greater than $400.
Testing for a Population Mean: Small Sample, Population Standard Deviation Unknown
The test statistic is the t distribution.
\[ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} \]
The critical value of t is determined by its degrees of freedom equal to n – 1.
Example 3: The current rate for producing 5 amp fuses at Neary Electric Co. is 250 per hour. A new
machine has been purchased and installed that, according to the supplier, will increase the production
rate. The production hours are normally distributed. A sample of 10 randomly selected hours from
last month revealed that the mean hourly production on the new machine was 256 units, with a
sample standard deviation of 6 per hour.
At the .05 significance level can Neary conclude that the new machine is faster?
Step 1: State the null and the alternative hypotheses
𝐻0 : 𝜇 ≤ 250
𝐻1 : 𝜇 > 250
Step 2: Select the significance level. The significance level is .05.
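Step 3: Find a test statistic. Use the t distribution since σ is not known and n < 30.
Step 4: State the decision rule. There are 10 – 1 = 9 degrees of freedom. For a one-tailed test at the
.05 significance level, the critical value of t is 1.833, so reject 𝐻0 if t > 1.833.
Step 5: Make a decision and interpret the results.
\[ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{256 - 250}{6 / \sqrt{10}} = 3.162 \]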
Computed t of 3.162 > Critical t of 1.833
Reject the null hypothesis. The mean number of fuses produced is more than 250 per hour.
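A quick numerical check of the Neary Electric example follows. The critical value 1.833 is hard-coded from the t table cited in the text (one-tailed, .05 level, 9 degrees of freedom), because the Python standard library does not include the t distribution.

```python
from math import sqrt

T_CRITICAL = 1.833   # one-tailed, alpha = .05, df = 9 (from the t table)

n, xbar, s, mu0 = 10, 256, 6, 250
t = (xbar - mu0) / (s / sqrt(n))
reject = t > T_CRITICAL          # right-tailed test: H1 is mu > 250
print(round(t, 3), reject)
```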
Proportion: The fraction or percentage that indicates the part of the population or sample having a
particular trait of interest.
\[ p = \frac{\text{Number of successes in the sample}}{\text{Number sampled}} \]
Test Statistic for Testing a Single Population Proportion
\[ z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \]
The sample proportion is p and π is the population proportion.
Example: In the past, 15% of the mail order solicitations for a certain charity resulted in a financial
contribution. A new solicitation letter that has been drafted is sent to a sample of 200 people and 45
responded with a contribution. At the .05 significance level can it be concluded that the new letter
is more effective?
Solution:
Step 1: State the null and the alternate hypothesis.
𝐻0 : 𝜋 ≤ .15
𝐻1 : 𝜋 > .15
Step 2: Select the level of significance. It is .05.
Step 3: Find a test statistic. The z distribution is the test statistic.
Step 4: State the decision rule.
The null hypothesis is rejected if z is greater than 1.65.
Step 5: Make a decision and interpret the results.
\[ z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} = \frac{\dfrac{45}{200} - 0.15}{\sqrt{\dfrac{0.15(1 - 0.15)}{200}}} = 2.97 \]
Because the computed z of 2.97 > critical z of 1.65, the null hypothesis is rejected. More than 15
percent are responding with a contribution. The new letter is more effective.
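The proportion test can be reproduced as follows. The sketch uses `statistics.NormalDist` for the one-tailed p-value, which the worked solution does not report; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(successes, n, pi0):
    """z statistic for a single proportion, using the hypothesized pi0 in the standard error."""
    p = successes / n
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

z = proportion_z(45, 200, 0.15)
p_value = 1 - NormalDist().cdf(z)   # right-tailed: H1 is pi > .15
reject = z > 1.65
print(round(z, 2), reject)
```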
Chapter 13: Linear Regression and Correlation
Correlation Analysis and Scatter Diagram
Correlation Analysis is the study of the relationship between variables. It is also defined as a group of
techniques to measure the association between two variables.
A Scatter Diagram is a chart that portrays the relationship between the two variables. It is the usual
first step in correlation analysis.
Dependent vs. Independent Variable
Dependent Variable: The variable that is being predicted or estimated. It is scaled on the Y-axis.
Independent Variable: The variable that provides the basis for estimation. It is the predictor
variable. It is scaled on the X-axis.
Regression Example
The sales manager of Copier Sales of America, which has a large sales force throughout the United
States and Canada, wants to determine whether there is a relationship between the number of sales
calls made in a month and the number of copiers sold that month. The manager selects a random
sample of 10 representatives and determines the number of sales calls each representative made last
month and the number of copiers sold.
Sales Representative Number of Sales Calls Number of Copiers Sold
Tom Keller 20 30
Jeff Hall 40 60
Brian Virost 20 40
Greg Fish 30 60
Susan Welch 10 30
Carlos Ramirez 10 40
Rich Niles 20 40
Mike Kiel 20 50
Mark Reynolds 20 30
Soni Jones 30 70
Scatter Diagram:

The Coefficient of Correlation, r


The Coefficient of Correlation (r) is a measure of the strength of the relationship between two
variables. It requires interval or ratio-scaled data.
→ It can range from -1.00 to 1.00.
→ Values of -1.00 or 1.00 indicate perfect and strong correlation.
→ Values close to 0.0 indicate weak correlation.
→ Negative values indicate an inverse relationship and positive values indicate a direct relationship.
Perfect Correlation

Correlation Coefficient – Interpretation

Correlation Coefficient Formula:
\[ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, S_x S_y} \]
Coefficient of Determination:
The coefficient of determination (𝑟²) is the proportion of the total variation in the dependent variable
(Y) that is explained or accounted for by the variation in the independent variable (X).
→ It is the square of the coefficient of correlation.
→ It ranges from 0 to 1.
→ It does not give any information on the direction of the relationship between the variables.
Correlation Coefficient – Example: Using the Copier Sales of America data, for which a scatter plot was
developed earlier, compute the correlation coefficient and coefficient of determination.
Sales Representative Number of Sales Calls Number of Copiers Sold
Tom Keller 20 30
Jeff Hall 40 60
Brian Virost 20 40
Greg Fish 30 60
Susan Welch 10 30
Carlos Ramirez 10 40
Rich Niles 20 40
Mike Kiel 20 50
Mark Reynolds 20 30
Soni Jones 30 70
Solution:
\[ \bar{X} = \frac{20 + 40 + 20 + 30 + 10 + 10 + 20 + 20 + 20 + 30}{10} = \frac{220}{10} = 22 \]
\[ \bar{Y} = \frac{30 + 60 + 40 + 60 + 30 + 40 + 40 + 50 + 30 + 70}{10} = \frac{450}{10} = 45 \]
Representative    Calls, X   Sales, Y   X − X̄   (X − X̄)²   Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)
Tom Keller        20         30         −2       4           −15      225         30
Jeff Hall         40         60         18       324         15       225         270
Brian Virost      20         40         −2       4           −5       25          10
Greg Fish         30         60         8        64          15       225         120
Susan Welch       10         30         −12      144         −15      225         180
Carlos Ramirez    10         40         −12      144         −5       25          60
Rich Niles        20         40         −2       4           −5       25          10
Mike Kiel         20         50         −2       4           5        25          −10
Mark Reynolds     20         30         −2       4           −15      225         30
Soni Jones        30         70         8        64          25       625         200
Total                                            760                  1850        900
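\[ S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{760}{9}} = 9.189 \]
\[ S_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n - 1}} = \sqrt{\frac{1850}{9}} = 14.337 \]
\[ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, S_x S_y} = \frac{900}{(10 - 1) \times 9.189 \times 14.337} = 0.759 \]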
How do we interpret a correlation of 0.759?
→ First, it is positive, so we see there is a direct relationship between the number of sales calls
and the number of copiers sold.
→ The value of 0.759 is fairly close to 1.00, so we conclude that the association is strong.
However, does this mean that more sales calls cause more sales?
No, we have not demonstrated cause and effect here, only that the two variables – sales calls and
copiers sold – are related.
The coefficient of determination, 𝑟², is 0.576, found by (0.759)².
→ This is a proportion or a percent; we can say that 57.6 percent of the variation in the number of
copiers sold is explained, or accounted for, by the variation in the number of sales calls.
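The hand computation above can be checked against the raw data. This sketch applies the same formula, r = Σ(X − X̄)(Y − Ȳ) / ((n − 1) Sx Sy), using `statistics.stdev` for the sample standard deviations; the variable and function names are illustrative.

```python
from statistics import mean, stdev

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X: sales calls
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y: copiers sold

def correlation(x, y):
    """Coefficient of correlation from deviations about the means."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * stdev(x) * stdev(y))

r = correlation(calls, sold)
r_squared = r ** 2
print(round(r, 3), round(r_squared, 3))
```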
Correlation and Cause
→ High correlation does not mean cause and effect
→ For example, it can be shown that the consumption of Georgia peanuts and the consumption of
aspirin have a strong correlation. However, this does not indicate that an increase in the
consumption of peanuts caused the consumption of aspirin to increase.
→ Likewise, the incomes of professors and the number of inmates in mental institutions have
increased proportionately. Further, as the population of donkeys has decreased, there has been an
increase in the number of doctoral degrees granted.
→ Relationships such as these are called spurious correlations.
Regression Analysis
In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
→ The relationship between the variables is linear.
→ Both variables must be at least interval scale.
→ The least squares criterion is used to determine the equation.
Regression Equation: An equation that expresses the linear relationship between two variables.
Least Square Principle: Determining a regression equation by minimizing the sum of the squares of
the vertical distances between the actual Y values and the predicted values of Y.
General form of Regression Equation:
\[ \hat{Y} = a + bX \]
Where,
\( \hat{Y} \), read Y hat, is the estimated value of the Y variable for a selected X value.
‘a’ is the Y-intercept. It is the estimated value of Y when X = 0. Another way to put it is: ‘a’ is the
estimated value of Y where the regression line crosses the Y-axis.
‘b’ is the slope of the line, or the average change in \( \hat{Y} \) for each change of one unit (either
increase or decrease) in the independent variable X.
‘X’ is any value of the independent variable that is selected.
Computing the Slope of the Line
Slope of the regression line:
\[ b = r\frac{S_y}{S_x} \]
Where:
r is the correlation coefficient.
𝑆𝑦 is the standard deviation of y (the dependent variable).
𝑆𝑥 is the standard deviation of x (the independent variable).
Computing the Y-Intercept
\[ a = \bar{Y} - b\bar{X} \]
Where:
\( \bar{Y} \) is the mean of y (the dependent variable).
\( \bar{X} \) is the mean of x (the independent variable).
Example: Recall the example involving Copier Sales of America. The sales manager gathered
information on the number of sales calls made and the number of copiers sold for a random sample of
10 sales representatives. Use the least squares method to determine a linear equation to express the
relationship between the two variables.
What is the expected number of copiers sold by a representative who made 20 calls?
Sales Representative Number of Sales Calls Number of Copiers Sold
Tom Keller 20 30
Jeff Hall 40 60
Brian Virost 20 40
Greg Fish 30 60
Susan Welch 10 30
Carlos Ramirez 10 40
Rich Niles 20 40
Mike Kiel 20 50
Mark Reynolds 20 30
Soni Jones 30 70
Solution:
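Step 1: Find the slope (b) of the line.
\[ b = r\frac{S_y}{S_x} = 0.759 \times \frac{14.337}{9.189} = 1.1842 \]
Step 2: Find the Y intercept (a)
\[ a = \bar{Y} - b\bar{X} = 45 - 1.1842 \times 22 = 18.9476 \]
The regression equation is
\[ \hat{Y} = 18.9476 + 1.1842X \]
For a representative who made 20 calls, the expected number of copiers sold is
\[ \hat{Y} = 18.9476 + 1.1842 \times 20 = 42.6316 \]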
Computing the Estimates of Y
Step 1 – Using the regression equation, substitute the value of each X to solve for the estimated sales
Sales Representatives   Sales Calls (X)   Estimated Sales (Ŷ)
Tom Keller 20 42.6316
Jeff Hall 40 66.3156
Brian Virost 20 42.6316
Greg Fish 30 54.4736
Susan Welch 10 30.7896
Carlos Ramirez 10 30.7896
Rich Niles 20 42.6316
Mike Kiel 20 42.6316
Mark Reynolds 20 42.6316
Soni Jones 30 54.4736
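The slope, intercept, and estimates can also be computed directly from the raw data. The sketch below uses the equivalent form b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)², which is algebraically the same as b = r·Sy/Sx. Because it carries the slope unrounded, the intercept comes out as about 18.9474 rather than the 18.9476 obtained from the rounded slope; the fitted values agree to the precision shown. Function and variable names are illustrative.

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X: sales calls
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y: copiers sold

def least_squares(x, y):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares(calls, sold)
estimate_20 = a + b * 20      # expected copiers sold for 20 calls
print(round(a, 4), round(b, 4), round(estimate_20, 4))
```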
Plotting the Estimated and the Actual Y’s

The Standard Error of Estimate


The standard error of estimate (\( S_{y,x} \)) measures the scatter, or dispersion, of the observed values
around the line of regression.
A formula that can be used to compute the standard error:
\[ S_{y,x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} \]
Standard Error of the Estimate – Example
Recall the example involving Copier Sales of America, for which the sales manager determined the
least squares regression equation Ŷ = 18.9476 + 1.1842X. Determine the standard error of estimate as
a measure of how well the values fit the regression line.
Representatives   Actual Sales (Y)   Estimated Sales (Ŷ)   (Y − Ŷ)   (Y − Ŷ)²
Tom Keller 30 42.6316 - 12.6316 159.557
Jeff Hall 60 66.3156 - 6.3156 39.887
Brian Virost 40 42.6316 - 2.6316 6.925
Greg Fish 60 54.4736 5.5264 30.541
Susan Welch 30 30.7896 - 0.7896 0.623
Carlos Ramirez 40 30.7896 9.2104 84.831
Rich Niles 40 42.6316 - 2.6316 6.925
Mike Kiel 50 42.6316 7.3684 54.293
Mark Reynolds 30 42.6316 - 12.6316 159.557
Soni Jones 70 54.4736 15.5264 241.069
Total 784.211
\[ S_{y,x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{784.211}{10 - 2}} = 9.901 \]
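The residual table and the standard error can be reproduced from the raw data. This sketch refits the line with an unrounded slope, so the sum of squared residuals comes out as about 784.21, matching the table up to rounding. Variable names are illustrative.

```python
from math import sqrt

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X: sales calls
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y: copiers sold

n = len(calls)
x_bar, y_bar = sum(calls) / n, sum(sold) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold))
     / sum((x - x_bar) ** 2 for x in calls))
a = y_bar - b * x_bar

# Residuals Y - Y_hat and their sum of squares.
residuals = [y - (a + b * x) for x, y in zip(calls, sold)]
sse = sum(e ** 2 for e in residuals)

# Divide by n - 2 because two coefficients (a and b) were estimated.
s_yx = sqrt(sse / (n - 2))
print(round(sse, 3), round(s_yx, 3))
```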
Chapter 17: Nonparametric Methods: Chi-Square Applications
Chi-Square Applications
The major characteristics of the chi-square distribution are:
→ It is positively skewed
→ It is non-negative
→ There is a family of chi-square distributions
Goodness-of-fit test will show whether an observed set of frequencies could have come from a
hypothesized population distribution.
Goodness-of-Fit Test: Equal Expected Frequencies
A goodness-of-fit test can also be used to determine whether a sample of observations is from a
normal population.
Let 𝑓𝑜 and 𝑓𝑒 be the observed and expected frequencies respectively.
𝐻𝑜 : There is no difference between the observed and expected frequencies.
𝐻1 : There is a difference between the observed and the expected frequencies.
The test statistic is:
\[ \chi^2 = \sum \left[ \frac{(f_o - f_e)^2}{f_e} \right] \]
The critical value is a 𝜒² value with (k − 1) degrees of freedom, where k is the number of categories.
Example 1: The following information shows the number of employees absent by day of the week at a
large manufacturing plant. At the .01 level of significance, is there a difference in the absence rate
by day of the week?
Day of Week Number Absent
Monday 120
Tuesday 45
Wednesday 60
Thursday 90
Friday 130
Total 445
Solution: Step 1: State the null and alternate hypotheses
𝐻𝑜 : There is no difference between the observed and expected frequencies.
𝐻1 : There is a difference between the observed and the expected frequencies.
Step 2: Select the level of significance.
This is given in the problem as .01.
Step 3: Select the test statistic.
It is the chi-square distribution.
Step 4: Formulate the decision rule.
Assume equal expected frequency as given in the problem
\[ f_e = \frac{120 + 45 + 60 + 90 + 130}{5} = \frac{445}{5} = 89 \]
The degrees of freedom: 𝑘 − 1 = 5 − 1 = 4
The critical value of 𝜒² is 13.28 (from the table).
Reject the null and accept the alternate if computed 𝜒² > 13.28.
Step 5: Compute the value of chi-square and make a decision.
Day of Week   Observed 𝑓𝑜   Expected 𝑓𝑒   (𝑓𝑜 − 𝑓𝑒)²/𝑓𝑒
Monday        120           89            10.80
Tuesday       45            89            21.75
Wednesday     60            89            9.45
Thursday      90            89            0.01
Friday        130           89            18.89
Total         445           445           60.90
Because the computed value of 𝜒², 60.90, is greater than the critical value, 13.28, 𝐻0 is rejected.
We conclude that there is a difference in the number of workers absent by day of the week.
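The absenteeism computation can be checked in a few lines. The critical value 13.28 is hard-coded from the chi-square table cited in the text (.01 level, 4 degrees of freedom), since the Python standard library has no chi-square distribution.

```python
observed = {"Monday": 120, "Tuesday": 45, "Wednesday": 60, "Thursday": 90, "Friday": 130}
CRITICAL_01_DF4 = 13.28   # from the chi-square table (alpha = .01, df = 4)

total = sum(observed.values())
expected = total / len(observed)      # equal expected frequency for every day

chi_square = sum((fo - expected) ** 2 / expected for fo in observed.values())
reject = chi_square > CRITICAL_01_DF4
print(expected, round(chi_square, 2), reject)
```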
Goodness-of-fit Test: Unequal Expected Frequencies
Example 2: The U.S. Bureau of the Census indicated that 63.9% of the population is married, 7.7%
widowed, 6.9% divorced (and not re-married), and 21.5% single (never been married). A sample of
500 adults from the Philadelphia area showed that 310 were married, 40 widowed, 30 divorced, and
120 single. At the .02 significance level can we conclude that the Philadelphia area is different from
the U.S. as a whole?
Solution:
Step 1: 𝐻𝑜 : The distribution has not changed
𝐻1 : The distribution has changed.
Step 2: The significance level given is .02.
Step 3: The test statistic is the chi-square.
Step 4: 𝐻0 is rejected if 𝜒² > 9.837 (.02 significance level and degrees of freedom = 𝑘 − 1 = 4 − 1 = 3).
Calculate the expected frequencies
Married: (.639)500 = 319.5
Widowed: (.077)500 = 38.5
Divorced: (.069)500 = 34.5
Single: (.215)500 = 107.5
Calculate chi-square values.
Status     𝑓𝑜    𝑓𝑒      (𝑓𝑜 − 𝑓𝑒)²/𝑓𝑒
Married    310   319.5   0.2825
Widowed    40    38.5    0.0584
Divorced   30    34.5    0.5870
Single     120   107.5   1.4535
Total      500           2.3814
Step 5: Because the computed value of 𝜒², 2.3814, is less than the critical value, 9.837, the null
hypothesis is not rejected. We cannot conclude that the distribution of marital status in Philadelphia
is different from that of the United States as a whole.
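When the expected frequencies are unequal, they are obtained by multiplying each hypothesized proportion by the sample size. A minimal sketch for the marital status example follows; the function name is illustrative and the critical value 9.837 is taken from the text.

```python
def chi_square_gof(observed, proportions):
    """Goodness-of-fit statistic with expected counts = proportion * sample size."""
    n = sum(observed)
    expected = [p * n for p in proportions]
    statistic = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    return statistic, expected

observed = [310, 40, 30, 120]                 # married, widowed, divorced, single
proportions = [0.639, 0.077, 0.069, 0.215]    # national proportions from the example

chi_square, expected = chi_square_gof(observed, proportions)
reject = chi_square > 9.837                   # alpha = .02, df = 3
print(round(chi_square, 4), reject)
```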
Contingency Table Analysis
Chi-square can be used to test for a relationship between two nominal-scaled variables, that is, to test
whether one variable is independent of the other.
A contingency table is used to investigate whether two traits or characteristics are related.
Each observation is classified according to two criteria.
We use the usual hypothesis testing procedure.
The degrees of freedom = (𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑜𝑤𝑠 − 1)(𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝐶𝑜𝑙𝑢𝑚𝑛𝑠 − 1)
\[ \text{Expected frequency, } f_e = \frac{(\text{row total})(\text{column total})}{\text{grand total}} \]
Example 3: Is there a relationship between the location of an accident and the gender of the person
involved in the accident? A sample of 150 accidents reported to the police was classified by location
and gender. At the .01 level of significance, can we conclude that gender and the location of the
accident are related?
Gender Work Home Other Total
Male 60 20 10 90
Female 20 30 10 60
Total 80 50 20 150
Solution:
Step 1: 𝐻0 : Gender and location are not related.
𝐻1 : Gender and location are related
Step 2: The level of significance is set at .01
Step 3: The test statistic is the chi-square distribution.
Step 4: The degrees of freedom equal (𝑟 − 1)(𝑐 − 1) = (2 − 1)(3 − 1) = 1 × 2 = 2.
The critical 𝜒² at 2 d.f. is 9.21. If computed 𝜒² > 9.21, reject null and accept alternate.
The expected frequencies 𝑓𝑒 are:
Gender   Work                 Home                 Other                Total
Male     (80 × 90)/150 = 48   (50 × 90)/150 = 30   (20 × 90)/150 = 12   90
Female   (80 × 60)/150 = 32   (50 × 60)/150 = 20   (20 × 60)/150 = 8    60
Total    80                   50                   20                   150
\[ \chi^2 = \sum \left[ \frac{(f_o - f_e)^2}{f_e} \right] \]
Gender   Work                  Home                    Other                   Total
Male     (60 − 48)²/48 = 3     (20 − 30)²/30 = 3.33    (10 − 12)²/12 = 0.33    6.66
Female   (20 − 32)²/32 = 4.5   (30 − 20)²/20 = 5       (10 − 8)²/8 = 0.5       10.00
Total                                                                          16.66
Since 𝜒² of 16.66 > 9.21, reject the null and conclude that there is a relationship between the location
of an accident and the gender of the person involved.
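The contingency table analysis can be sketched as a small function that derives the expected frequencies from the row and column totals. Carried without intermediate rounding, the statistic is about 16.67; the 16.66 above comes from summing cells already rounded to two decimals. The function name is illustrative and the critical value 9.21 is taken from the text.

```python
def chi_square_independence(table):
    """Return (chi-square, degrees of freedom, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected frequency = (row total)(column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum((fo - fe) ** 2 / fe
                    for obs_row, exp_row in zip(table, expected)
                    for fo, fe in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

# Rows: male, female. Columns: work, home, other.
accidents = [[60, 20, 10],
             [20, 30, 10]]

chi_square, df, expected = chi_square_independence(accidents)
reject = chi_square > 9.21        # alpha = .01, df = 2
print(round(chi_square, 2), df, reject)
```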

Math
1. A social scientist sampled 140 people and classified them according to income level and
whether or not they played a state lottery in the last month. The sample information is
reported below. Is it reasonable to conclude that playing lottery is related to income level?
Use .05 significance level.
Low Middle High Total
Played 46 28 21 95
Not Played 14 12 19 45
Total 60 40 40 140
a) What is this table called?
b) State the null hypothesis and the alternate hypothesis.
c) What is the decision rule?
d) Make a decision on the null hypothesis and interpret the result.

2. The American Accounting Association classifies accounts receivable as “current”, “late” and
“not collectible”. The industry figures show that 60% of accounts receivable are current, 30%
are late and 10% are not collectible. Massa and Barr, a law firm, has 500 accounts receivable:
320 are current, 120 are late and 60 are not collectible. Are these numbers in agreement
with the industry distribution? Use the .05 significance level.
