
Chapter 1

Introduction

Almost daily we apply statistical concepts in our lives. For example, to start the day you
turn on the shower and let it run for a few moments. Then you put your hand in the
shower to sample the temperature and decide to add more hot water or more cold water,
or conclude that the temperature is just right and enter the shower. As a second example,
you are at the grocery store looking to buy a frozen pizza. One of the pizza makers has a
stand, and they offer a small wedge of their pizza. After sampling the pizza, you decide
whether to purchase the pizza or not. In both the shower and pizza examples, you make a
decision and select a course of action based on a sample.

Definition of Statistics
• In its plural sense, statistics is a set of numerical data (e.g., annual GNP/GDP,
quarterly/monthly sales of a company, weekly/daily peso-dollar exchange rate).
• In its singular sense, statistics is that branch of science which deals with the
collection, presentation, organization, analysis, and interpretation of data.

Definition A population is a collection of all elements under consideration in a
statistical study.
A sample is a part or subset of the population.

Example A manufacturer of kerosene heaters wants to determine if customers are
satisfied with the performance of its heaters. Toward this goal, 5,000 of
its 200,000 customers are contacted and each is asked, “Are you satisfied
with the performance of the kerosene heater you purchased?” Identify the
population and the sample for this situation.

Definition A parameter is a numerical characteristic of the population.
A statistic is a numerical characteristic of the sample.

Example In order to estimate the true proportion of students at a certain college who
smoke cigarettes, the administration polled a sample of 200 students and
determined that the proportion of students from the sample who smoke
cigarettes is 0.12. Identify the parameter and the statistic.
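
As a minimal Python sketch of the distinction, the sample proportion can be computed from the poll counts. The count of 24 smokers is not stated in the example; it is inferred here from 0.12 × 200 and should be read as an assumption made for illustration.

```python
# Statistic: a numerical characteristic computed from the sample.
sample_size = 200         # students polled (from the example)
smokers_in_sample = 24    # assumed count, inferred from 0.12 * 200

sample_proportion = smokers_in_sample / sample_size
print(f"Sample proportion (statistic): {sample_proportion:.2f}")

# Parameter: the true proportion of smokers among ALL students at the
# college. It is unknown here; the statistic above only estimates it.
```

A different sample of 200 students would likely give a slightly different statistic, while the parameter stays fixed.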

In 1662 John Graunt published an article “Natural and Political Observations Made upon
Bills of Mortality.” His “observations” were the result of his study and analysis of a
weekly church publication called “Bill of Mortality,” which listed births, christenings,
and deaths and their causes. This analysis and interpretation of social and political data
are thought to mark the start of statistics.

Fields of Statistics
a. Statistical Theory or Mathematical Statistics - deals with the development and
exposition of theories that serve as bases of statistical methods.

b. Statistical Methods or Applied Statistics - refers to procedures and techniques used
in the collection, presentation, analysis, and interpretation of data.
• Descriptive Statistics - comprises those methods concerned with the collection,
description, and analysis of a sample without drawing conclusions or inferences
about a population
• Inferential Statistics - comprises those methods concerned with making predictions
or inferences about a population using only the information gathered from a
sample
- the main concern is not merely to describe but to predict and make
inferences based on the information gathered
- conclusions are applicable to a population of which the data on hand are only a
sample

Descriptive Statistics                        Inferential Statistics

A bowler wants to find his bowling average    A bowler wants to estimate his chance of
for the past 12 games                         winning a game based on his current
                                              season averages and the averages of his
                                              opponents
A politician wants to know the exact          A politician would like to estimate, based
number of votes he received in the last       on an opinion poll, his chance for winning
election                                      in the upcoming election

Below are some more illustrations of descriptive statistics.


a) Given the daily sales performance for a product for the previous year, we can draw a
line chart or a column chart to emphasize the upward/downward movement of the series.
Likewise, we can use descriptive statistics to calculate a quantity index per quarter to
compare the sales by quarter for the previous year.
b) To compare the total area of Watershed Forest Reserves in Region IV and Region VIII
for a specified period, we cover all the provinces in Regions IV and VIII then measure
the area of each one of the watershed forest reserves in both regions. We can use
descriptive statistics to summarize the collected data by drawing a horizontal bar chart or
by computing ratios.
c) The Philippine Atmospheric, Geophysical and Astronomical Services Administration
(PAGASA) measures the daily amount of rainfall in millimeters. They can use
descriptive statistics to compute the average daily amount of rainfall for each month of the
past year. They can use the results to describe the amount of rainfall for the past year.

Below are some more examples of inferential statistics.
a) To examine the performance of the country’s financial system, we can use inferential
statistics to arrive at conclusions that apply to the entire economy using data gathered
from a sample of companies or businesses in the country.
b) To determine if reforestation is effective, we can take a representative portion of
denuded forests and use inferential statistics to draw conclusions about the effect of
reforestation on all denuded forests.

Variables and Measurement


Definition A variable is a characteristic or attribute of the elements in a collection
which can assume different values for the different elements.

Definition Measurement is the process of determining the value or label of a
variable for a particular experimental unit.

Definition An experimental unit is the individual or object on which a variable is
measured.

Classification of Variables
1. Qualitative (or Categorical) vs. Quantitative

Qualitative variable - a variable that yields categorical responses (e.g.,
political affiliation, occupation, marital status)

Quantitative variable - a variable that takes on numerical values
representing an amount or quantity (e.g., weight,
height, no. of cars)

2. Discrete vs. Continuous

Discrete variable - a variable that can assume a finite or, at most, countably
infinite number of values; usually measured by counting or
enumeration

Continuous variable - a variable that can assume the infinitely many values
corresponding to a line interval

Levels of Measurement
1. Nominal Level (or Classificatory Scale)

The nominal level is the weakest level of measurement, where numbers or
symbols are used simply for categorizing subjects into different groups.

Examples:
Sex M-Male F-Female
Marital status 1-Single 2-Married 3-Widowed 4-Separated

2. Ordinal Level (or Ranking Scale)

The ordinal level of measurement contains the properties of the nominal level,
and in addition, the numbers assigned to categories of any variable may be ranked
or ordered in some low-to-high manner.

Examples:
Teaching ratings 1-poor 2-fair 3-good 4-excellent
Year level 1-1st yr 2-2nd yr 3-3rd yr 4-4th yr

3. Interval Level

The interval level is that which has the properties of the nominal and ordinal
levels, and in addition, the distances between any two numbers on the scale are of
known sizes. An interval scale must have a common and constant unit of
measurement. Furthermore, the unit of measurement is arbitrary and there is no
“true zero” point.

Examples:
IQ
Temperature (in Celsius)

4. Ratio Level

The ratio level of measurement contains all the properties of the interval level,
and in addition, it has a “true zero” point.
Examples:
Age (in years)
No. of correct answers in an exam
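
The practical difference between the interval and ratio levels can be sketched in Python using temperature. The Kelvin scale is not mentioned in the text above; it is brought in here only as an assumed example of a temperature scale with a true zero, and the two temperature values are arbitrary.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (a scale with a true zero)."""
    return c + 273.15

warm_c, cool_c = 20.0, 10.0   # arbitrary illustrative temperatures

# On the Celsius scale the ratio is 2.0, but that number only reflects
# where the arbitrary zero point was placed.
ratio_celsius = warm_c / cool_c

# On a scale with a true zero, the ratio is meaningful, and it is
# nowhere near 2.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences, by contrast, agree on both scales.
diff_celsius = warm_c - cool_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(ratio_celsius, round(ratio_kelvin, 3))
print(diff_celsius, round(diff_kelvin, 2))
```

This is why statements such as "twice as hot" are not justified for Celsius readings, while "10 degrees warmer" is.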
Exercise: Identify the population under study and variable/s of interest.

a) The Office of Admissions is studying the relationship between the score in the
entrance examination during application and the general weighted average upon
graduation among graduates of the university from 2000 to 2005.
b) The research division of a certain pharmaceutical company is investigating the
effectiveness of a new diet pill in reducing weight on female adults.
c) The Department of Health is interested in determining the percentage of children
below 12 years old infected by the Hepatitis B virus in Metro Manila in 2006.

Steps in a Statistical Inquiry


1. Define the problem.
2. Formulate the research design.
3. Collect the data.
4. Code and analyze the collected data.
5. Interpret the results.

Heart disease is the most common cause of death in industrialized nations. In the US and
Canada, nearly 30% of deaths yearly are due to heart disease, mainly heart attacks. Does
regular aspirin intake reduce deaths from heart attacks? Harvard Medical School
conducted a landmark study to investigate. The people participating in the study
regularly took either an aspirin or a placebo (a pill with no active ingredient). Of those
who took aspirin, 0.9% had heart attacks during the study. Of those who took the
placebo, 1.7% had heart attacks, nearly twice as many.
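
The phrase "nearly twice as many" can be checked with a short Python sketch. It uses only the two percentages quoted above; the group sizes are not given in the text, so this says nothing yet about whether the difference could be due to chance.

```python
aspirin_rate = 0.009   # 0.9% of the aspirin group had heart attacks
placebo_rate = 0.017   # 1.7% of the placebo group had heart attacks

# How many times higher the heart attack rate was under placebo.
rate_ratio = placebo_rate / aspirin_rate

# Absolute difference between the two rates.
rate_difference = placebo_rate - aspirin_rate

print(f"Rate ratio: {rate_ratio:.2f}")
print(f"Rate difference: {rate_difference:.3f}")
```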

Can you conclude that it’s beneficial for people to take aspirin regularly? Or could the
observed difference be explained by how it was decided which people would receive aspirin and
which would receive the placebo? For instance, might those who took aspirin have had
better results merely because they were healthier (or had a better diet or exercised more
regularly), on average, than those who took the placebo?

A TV exit poll used to project the election outcome reported that 53.1% of a sample of
3889 voters said they had voted for candidate A. Was this sufficient evidence to project
A as the winner, even though such information was available from such a small portion
of the more than 9.5 million voters?

If candidate A were actually going to lose the election, what’s the chance that he/she
would be supported by 53.1% of the exit poll voters? If the chance were extremely small,
we’d feel comfortable making the inference that A’s election was supported by a majority
of all 9.5 million voters.
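
A rough Python sketch of this reasoning follows. It rests on assumptions the text does not state: that the 3889 voters behave like a simple random sample, that the normal approximation to the sample proportion is adequate, and that the most favorable losing scenario for A is an exact tie (true proportion 0.5).

```python
import math

n = 3889          # exit poll sample size (from the text)
p_hat = 0.531     # observed sample proportion for candidate A
p_null = 0.5      # assumed true proportion if A were not winning

# Standard error of the sample proportion under the assumed p_null.
std_error = math.sqrt(p_null * (1 - p_null) / n)

# Number of standard errors the observed proportion lies above p_null.
z = (p_hat - p_null) / std_error

# Upper-tail probability of a standard normal variable: P(Z >= z).
tail_prob = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.2f}, approximate chance = {tail_prob:.6f}")
```

Under these assumptions the chance is well below one in a thousand, which is the sense in which a sample of a few thousand voters can support a conclusion about millions.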
Chapter 2

Collection and Presentation of Data

2.1 PRELIMINARIES

Classification of Data
1. Primary vs. Secondary
a. Primary source - data measured by the researcher/agency that published it
b. Secondary source - any republication of data by another agency

We now enumerate some agencies where a researcher can avail of primary data.
a) Central Bank is a primary source of data on banking and finance.
b) National Statistics Office is a primary source of data on population, housing,
and establishments.
c) Pulse Asia is a primary source of data on opinions or sentiments of the people
on current issues.
d) Bureau of Agricultural Statistics is a primary source of data on agriculture and
livestock.

We now give examples of secondary data.


a) the United Nations’ compiled data for its yearbook, which were originally
gathered by government statistical agencies of different countries
b) a medical researcher’s documented data for his research paper, which were
originally collected by the Department of Health
c) the documented data of the research team of a congressman for its report,
which were originally collected by the Department of Education and the Commission
on Higher Education
d) the documented data of a student for his thesis, which were originally collected
by the Department of Labor and Employment

2. External vs Internal
a. Internal data - information that relates to the operations and functions of the
organization collecting the data
b. External data - information that relates to some activity outside the organization
collecting the data

Example The sales data of SM is internal data for SM but external data for
any other organization such as Robinson’s.

2.2 DATA COLLECTION METHODS

Data Collection Methods

1. Survey method - questions are asked to obtain information, either through self-
administered questionnaire or personal (or phone) interview

Self-administered questionnaire                 Personal interview

Obtained information is limited to subjects’    Missing information and vague responses
written answers to pre-arranged questions       are minimized with the proper probing of
                                                the interviewer
Lower response rate                             Higher response rate through call-backs
It can be administered to a large number of     It is administered to a person or group one
people simultaneously                           at a time
Respondents may feel freer to express           Respondents may feel more cautious,
views and are less pressured to answer          particularly in answering sensitive
immediately                                     questions, for fear of disapproval
It is more appropriate for obtaining            It is more appropriate for obtaining
objective information                           information about complex, emotionally
                                                laden topics or probing sentiments
                                                underlying an expressed opinion

Some actual surveys in the Philippines are as follows:

a) Pulse Asia conducted a sample survey on voter response to political ads in the
May 2013 election. Its respondents were selected registered voters who intend to
vote in the 2013 election.
b) The Department of Energy regularly conducts the Household Energy
Consumption Survey to measure the level and pattern of energy consumption at
the national and regional levels.
c) The Food and Nutrition Research Institute regularly conducts the National
Nutrition Survey that generates data on malnutrition, prevalence of anemia,
vitamin A and iodine deficiencies, and the nutrient intake/adequacies of the members
of the households.

2. Experimental method - a scientific investigation conducted under controlled
situations where treatments are applied and their effects measured on the response
of interest to the experimenter. This is an excellent method of collecting data for
causation studies. If properly designed and executed, experiments will reveal, with
a good deal of accuracy, the effect of a change in one variable on another
variable.

Below are some examples of experiments.

a) A researcher wishes to study the effect of Minoxidil on male baldness. The
subjects are balding male patients aged 40 to 45 years and weighing
between 135 and 145 pounds. The researcher randomly assigns each patient to one of
two groups. The first group applies Minoxidil on their heads
daily for three months. The second group, the control group, does not receive the
Minoxidil treatment. After three months, the researcher measures the patients’ hair
length and compares it with the length of hair before the application of Minoxidil.
b) The school administration wishes to determine which of two methods is
more effective in training new student leaders. They randomly assigned twenty
student leaders to training method 1 and twenty student leaders to training
method 2. After one month of training, they administered a standardized
achievement test to the two groups and compared their scores.
c) Objective: to determine the effect of sunlight on the height of a mongo plant
Explanatory/Independent Variable: Exposure to sunlight
Factor/Treatment Levels: Exposed or not to sunlight
Response/Dependent Variable: Height
Extraneous Variables: Amount of water and type of soil

3. Observation method - makes possible the recording of behavior but only at the
time of occurrence (e.g., observing reactions to a particular stimulus, traffic count,
behavior of animals in wildlife or newborn babies in nursery).

Advantages:
• Observation is superior to the survey method in collecting data on
nonverbal behavior. In a survey, the researcher may encounter all sorts of
difficulties such as deliberate denial or memory failure. On the other
hand, an observer can make field notes that record the salient features of
the behavior, or may even record behavior in its totality via videotape.
• Observation is superior to experiment in the sense that behavior takes
place in its natural environment. However, the presence of an observer
may alter the true behavior of the subjects.
• The observer is able to conduct the study in the subject’s natural
environment, and is thus usually able to study over a much longer time
period than with either survey or experiment.

Disadvantages over Survey Method:

• Data collected using the observation method are difficult to analyze.
Measurements in observational studies take the form of the observer’s
qualitative perceptions rather than the quantitative measures often used in
survey research or experimentation.
• There are certain characteristics of interest that cannot be observed, such as
opinions and beliefs. Also, there are certain activities that subjects will
refuse to have observed.
• For field studies that are conducted in the natural environment, the
observer might find it difficult to enter such environments as
secret environments or private companies.

4. Use of documented data

Possible sources:

a) The National Statistics Office is a major collector of data for both private and
government needs. It provides the public with basic data on various subject
matters such as household income and expenditure, housing, education, health,
employment, and others.
b) The National Statistical Coordination Board compiles data necessary for the
computation of the gross national product, gross domestic product, consumer
price index, and other indices.
c) The Department of Health is responsible for health statistics like prevalence of
diseases among infants and pregnant women, morbidity rates, family planning
methods, etc.
d) The Social Weather Stations keeps a record of poll results, social issues, and
others.
e) Theses of graduate students contain data used in their statistical inquiry.

In an observational study, researchers simply observe or question the participants about
opinions, behaviors, or outcomes. Participants are not asked to do anything differently.
For example: do people with higher frequency of religious activity have lower blood
pressure?

Observational studies can be classified according to whether they are retrospective, in
which participants are asked to recall past events, or prospective, in which participants
are followed into the future and events are recorded.

In a case-control study, “cases” who have a particular attribute or condition are compared
to “controls” who do not. The idea is to compare the cases and controls to see how they
differ on an explanatory variable of interest. In medical settings, the cases usually are
individuals who have been diagnosed with a particular disease. Researchers then identify
a group of controls who are as similar as possible to cases, except that they don’t have the
disease. For example, samples of male heart attack patients (cases) and other male
hospital patients (controls) were compared with respect to the extent of baldness.

Clinical trials are experiments that study the effectiveness of medical treatments on actual
patients.

A placebo is a dummy treatment.


Definition Census or complete enumeration is the process of gathering information
from every unit in the population.
• not always possible to get timely, accurate and economical data
• costly, if the number of units in the population is too large

The National Statistics Office has a mandate by law to conduct censuses on
population, agriculture, commerce and industry. It conducts four censuses on a
regular basis. These are:

a) Census of Population and Housing – a study done every 5 to 10 years to
determine the number of residents in the different geographic areas in the
Philippines and to provide a basic demographic profile of these residents. It also
counts the total number of housing units in the country and their structural
characteristics and available facilities.
b) Census of Philippine Business and Industry – a study done every 5 years to
obtain the number of establishments in the different industry sectors and to
provide basic economic information on these establishments such as total sales
and number of employees.
c) Census of Agriculture and Fisheries – a study done every 10 years to determine
the total number of households engaged in agricultural and fishing activities and
agricultural and fishing operators in the Philippines, and to provide basic
description of these farms such as farm area, crops planted, livestock and poultry
raised.
d) Census of Buildings – an inventory of buildings in the urban areas, together
with their basic descriptions.

Definition Survey sampling is the process of obtaining information from the units in
the selected sample.

Advantages of Survey Sampling:

• reduced cost
• greater speed
• greater scope
• greater accuracy

We now present situations wherein it is more appropriate to collect data from a
sample than to conduct a census.

a) Suppose a researcher is interested in investigating the effect of a specific diet
on the length of black tiger prawns in fishponds in Pangasinan. It would be more
practical to use sampling than census since it would be difficult to study every
single black tiger prawn in all the fishponds of Pangasinan.
b) The statistician of a manufacturing company of fluorescent bulbs is interested
in knowing the average lifetime of the bulbs in hours. Sampling is the only method possible
because if we do complete enumeration then there would be no fluorescent bulbs
left for the company to sell.
c) A medical researcher is interested in studying the psychological effects of
HIV on persons afflicted with the virus. We cannot study the entire population of
people with HIV since it would be difficult to get a complete listing of all
these people.

2.3 PROBABILITY AND NON-PROBABILITY SAMPLING
Definition A sampling procedure that gives every element of the population a
(known) nonzero chance of being selected in the sample is called
probability sampling. Otherwise, the sampling procedure is called
non-probability sampling.
• Whenever possible, probability sampling is used because there is no objective
way of assessing the reliability of inferences under non-probability sampling.

Definition The target population is the population from which information is
desired.
Definition The sampled population is the collection of elements from which the
sample is actually taken.
Definition The population frame is a listing of all the individual units in the
population.

Examples of Non-probability Sampling

Convenience sampling

A group of social scientists is interested in studying the socioeconomic profile of persons
with Acquired Immune Deficiency Syndrome (AIDS). In most cases, a subject with the
disease will not admit that she or he is a carrier in an ordinary interview. There is also no
complete list of persons with AIDS. We cannot ask hospitals to give us a list of patients
afflicted with the disease since this information is confidential.

Thus, in conducting the survey, the researchers sought the assistance of doctors with
private clinics. When a patient consults one of these doctors and has AIDS, the social
scientists would interview this patient in return for a free-of-charge consultation. With
this method, the sample will include persons who consulted one of the appointed
physicians and volunteered to participate in the study to avail of the free consultation.

Here are some examples of purposive sampling:

a) A researcher may use a particular district, province, or city to be the sample cluster in
representing their population of interest. For instance, the researcher can identify a
specific district of Quezon City whose households have the same profile in terms of the
socio-economic characteristics as the households in the whole Quezon City.
b) For a study that aims to predict the senatorial winners in the national election, a
researcher may include in the sample the provinces that have voted for the actual winners
in a series of past senatorial elections.

We give an example of a government study using purposive sampling.

The Producer’s Price Survey of NSO is a nationwide undertaking intended to provide the
price data needed in the computation of the Producer’s Price Index for manufacturing. To
select the items included in the sample, NSO used purposive sampling by using a set of
criteria to identify the commodities for the market basket. Some of the criteria are: (i) the
commodity has a relatively high market share; (ii) the commodity was available in the
market in the base year; and (iii) the current production and the
market share of the commodity have been stable during the last three years based on the
NSO Annual Survey of Establishment reports.

We now illustrate quota sampling.

A researcher wishes to study the people’s views on birth control. The researcher believes
that a person’s views on birth control and his religion are related. Census results showed
that 70% of the people in the population are Catholics, 20% are Protestants, and 10% are
Muslims. The researcher then selects a sample reflecting the same proportions to
represent the three groupings. If there should be 200 respondents in the sample, then
the quotas set for the groups are as follows: (i) Catholics – 70% of 200 = 140,
(ii) Protestants – 20% of 200 = 40, and (iii) Muslims – 10% of 200 = 20. This is quota
sampling and not stratified sampling if the researcher leaves the selection of the 140
Catholics, 40 Protestants, and 20 Muslims to the discretion of the interviewers.
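
The quota arithmetic above can be sketched in a few lines of Python. The dictionary and variable names are hypothetical; the percentages and the sample size of 200 come from the example.

```python
sample_size = 200

# Population shares from the census results, in whole percent.
population_percent = {"Catholic": 70, "Protestant": 20, "Muslim": 10}

# Quota for each group: the same share of the sample as of the population.
quotas = {
    group: sample_size * percent // 100
    for group, percent in population_percent.items()
}

print(quotas)
print("Total:", sum(quotas.values()))
```

The code only fixes how many respondents each group contributes. Who is actually interviewed within each group is still left to the interviewers, which is what makes this quota sampling rather than a probability design.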

Probability vs. Nonprobability Sampling

Shortly after Bill Clinton became President of the United States, a television station in
Sacramento, California asked viewers to respond to the question, “Do you support the
President’s economic plan?” The next day, the results of a properly conducted study that
asked the same question were published in the newspaper.

                           Television poll    Survey

Yes (support plan)              42%             75%
No (don’t support plan)         58%             18%
Not sure                         0%              7%

Methods of Probability Sampling

Simple Random Sampling


Description of the Design
Simple random sampling is a method of selecting n units out of the N units in the
population in such a way that every distinct sample of size n has an equal chance of being
drawn. The process of selecting the sample must give an equal chance of selection to any
one of the remaining elements in the population at any one of the n draws.

Random sampling may be with replacement (SRSWR) or without replacement
(SRSWOR). In SRSWR, a chosen element is always replaced before the next selection is
made, so that an element may be chosen more than once.

Sample Selection Procedure


Step 1 Make a list of the sampling units and number them from 1 to N.
Step 2 Select n (distinct for SRSWOR, not necessarily distinct for SRSWR) numbers
from 1 to N using some random process, for example, the table of random
numbers.
Step 3 The sample consists of the units corresponding to the selected random numbers.

Suppose we wish to conduct a sample survey. The population consists of N=30
members of an organization and we wish to select a sample of size n=10 members
using simple random sampling without replacement.

To do this, we first list down all the 30 members of the organization and assign a
unique serial number, from 01 to 30, to each one of them.

01 Abad, Melissa      11 Gomez, May         21 Quiambao, Gina
02 Almeda, Joel       12 Joson, Sonia       22 Quidayan, Candy
03 Baluyot, Temy      13 Lanuza, Jon        23 Santos, Emily
04 Corpuz, Joan       14 La Pierre, Amy     24 Surla, Michael
05 Conlin, Juliet     15 Le, Diana          25 Tablante, Rita
06 Cruz, Raks         16 Macaibay, Macky    26 Tolentino, Magda
07 Dayrit, Erlyn      17 Macasaet, Erwin    27 Tuason, Joy
08 Diaz, Aurora       18 Peña, Lito         28 Valdez, Ernie
09 Foz, Vivian        19 Quebral, Joseph    29 Venegas, Anthony
10 Fuentes, Mar       20 Querido, Rose      30 Zamora, Bea

We then generate n=10 distinct numbers from 1 to 30 using a randomization
mechanism.

Table of Random Numbers
00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31 32-35
00 4103 5778 4099 4089 2236 1361 5612 5858 4155
01 7786 4358 4934 9335 3397 3345 1507 0814 0066
02 7654 7803 4234 2322 0129 3253 0275 6836 2185
03 9655 4260 5253 1509 3752 0033 0091 0905 1468
04 5696 1350 9977 7147 8347 7317 9233 8409 3032
05 0803 0281 0159 9634 6566 1766 4195 6427 9168
06 7686 4882 1689 5058 7234 0736 2745 1171 8456
07 4794 1204 6465 4569 3882 2388 2520 6216 0422
08 7037 8610 0584 6101 5070 8476 4118 0783 3639
09 8983 6597 2170 0685 7814 5426 5695 6792 7673
10 8960 3638 7791 1494 2158 0141 3176 2025 4677
11 5931 4049 3766 0345 5865 4833 8357 0211 0240
12 1202 5203 3956 6740 1958 1596 6633 2408 2446
13 6260 3898 8687 7694 1242 7541 8720 4938 9196
14 0364 3201 0251 5461 3231 2830 9935 0924 8650
15 4572 3577 2706 4717 2038 1440 9125 6479 3731
16 9291 4477 1367 6456 7869 0190 8694 6236 6131
17 2377 5010 6496 2096 2648 0015 1567 5608 6394
18 3254 5512 9426 4582 2983 4365 1314 3668 4344
19 4682 2050 9419 3621 3136 3683 3030 5798 8838
20 5057 5249 9688 3653 5955 4694 1707 7437 6956
21 9983 6640 7507 1631 6683 4144 3336 6913 6167
22 2329 1180 0219 5456 8229 0172 7285 6811 0659
23 0370 5889 8506 5009 6501 3894 2396 6676 6389
24 1813 3784 1475 9608 9697 4478 9921 5364 8896
25 0185 3219 8044 5119 5448 5960 4397 4139 9267
26 8811 7537 4068 2362 4012 3407 2482 5714 5588
27 5984 0989 2803 4479 6081 9657 4600 1828 9219
28 2035 8234 3506 3649 3511 1842 6078 7935 7862
29 0677 3199 0161 8660 9495 1640 6736 5648 2017
30 6343 9781 5862 7606 8359 6610 1028 4987 2845
31 8870 0077 6080 2682 4846 9842 4408 4693 6444
32 9373 5887 9700 9074 3647 9086 3264 9367 3325
33 2910 8091 5165 4562 2599 6184 8283 2732 8337
34 9122 4000 1643 5485 1897 9943 0010 2284 8130
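
As an alternative to reading numbers off the table, the selection step can be sketched in Python. This swaps in a software pseudo-random number generator for the table of random numbers; the seed value is arbitrary and is there only so that a run can be repeated.

```python
import random

N = 30   # population size: members numbered 1 to 30
n = 10   # desired sample size

rng = random.Random(2024)   # arbitrary seed, used only for reproducibility

# SRSWOR: draw n distinct serial numbers from 1..N.
srswor = sorted(rng.sample(range(1, N + 1), n))

# SRSWR: each draw is made from all N numbers, so repeats are possible.
srswr = [rng.randint(1, N) for _ in range(n)]

print("Without replacement:", srswor)
print("With replacement:   ", srswr)
```

The members whose serial numbers appear in `srswor` would make up the sample for the organization example.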

Advantages
• The theory involved is much easier to understand than the theory behind other
sampling designs.
• Inferential methods are simple and easy.

Disadvantages
• The sample chosen may be widely spread, thus entailing high transportation costs.
• A population frame, or list, is needed.
• Less precise estimates result if the population is heterogeneous with respect to the
characteristic under study.

Below are some examples of simple random sampling.

a) The Bureau of Internal Revenue auditors can select a sample of establishments in
Metro Manila using simple random sampling to verify the veracity of the tax declared. To
accomplish this, the BIR must avail of a list of all establishments in Metro Manila. For
this purpose, they can use the list of establishments generated by NSO for the Census of
Philippine Business and Industry. They can update this list then select their sample from
the updated list.
b) The Personnel Manager can get a sample of employees using simple random sampling
to get opinions on a new policy regarding tardiness. To do this, the manager uses the list
of employees in his files or from Accounting. From this list, he can already select the
sample.
c) The Chancellor of a university can select a sample of students using simple random
sampling to determine the students’ evaluation of the facilities of the university. The
Office of the Registrar can provide the Chancellor with a list of all registered students in
the university. The Chancellor will select the sample from this list.
d) A researcher can select a sample of elementary schools in Metro Manila using simple
random sampling to study the profile of the faculty of these schools. The researcher can
get a list of schools in Metro Manila from the Department of Education and choose his
sample from this list.

(1-in-k) Systematic Sampling


Description of the Design
Systematic sampling with a “random start” is a method of selecting a sample by taking
every kth unit from an ordered population, the first unit being selected at random. Here k
is called the sampling interval; the reciprocal 1/k is the sampling fraction.

Sample Selection Procedure


Step 1 Number the units of the population consecutively from 1 to N.
Step 2 Let k be the nearest integer to N/n.
Step 3 Select the random start r, where (a) 1 ≤ r ≤ k or (b) 1 ≤ r ≤ N. The unit
corresponding to r is the first unit of the sample.
Step 4 The other units of the sample correspond to r + k, r + 2k, r + 3k, ..., r + (n-1)k.
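
The four steps can be sketched in Python as follows. The function name is hypothetical, and the sketch implements only option (a) of Step 3, where the random start lies between 1 and k. Python's `round` resolves exact halves to the nearest even integer, which is one possible reading of "nearest integer".

```python
import random

def systematic_sample(N, n, rng):
    """Return the unit numbers of a 1-in-k systematic sample, option (a)."""
    k = max(1, round(N / n))   # Step 2: sampling interval
    r = rng.randint(1, k)      # Step 3(a): random start, 1 <= r <= k
    # Step 4: r, r + k, r + 2k, ..., r + (n-1)k; numbers beyond N are
    # dropped, which can happen when N/n was rounded up.
    return [r + i * k for i in range(n) if r + i * k <= N]

rng = random.Random(7)   # arbitrary seed, used only for reproducibility
sample = systematic_sample(N=30, n=10, rng=rng)
print(sample)
```

With N=30 and n=10 the interval is k=3, so the sample is every third unit starting from a random start of 1, 2, or 3.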

Advantages
• It is easier to draw the sample and often easier to execute without mistakes than
simple random sampling.
• It is possible to select a sample in the field without a sampling frame.
• The systematic sample is spread more evenly over the population.

Disadvantages
• If periodic regularities are found in the list, a systematic sample may consist only
of similar types. (Example: Store sales over seven days of the week – estimating
total sales based on a systematic sample every Tuesday would be unwise.)
• Knowledge of the structure of the population is necessary for its most effective
use.

Example Suppose we wish to select a sample of farms to estimate the total farm
production. If we have a list of farms with their corresponding sizes in
square meters, we can arrange the farms first according to size before we
select our systematic sample.

Stratified Sampling

Description of the Design


In stratified random sampling, the population of N units is first divided into
subpopulations called strata. Then a random sample is drawn from each stratum, the
selection being made independently in different strata.

Sample Selection Procedure


Step 1 Divide the population into strata. Ideally, each stratum must consist of
more or less homogeneous units.
Step 2 After the population has been stratified, a random sample is selected from
each stratum.

Let us select a sample from the same population used in the previous Example but
this time we will use stratified sampling. The population has N=30 members of an
organization and the sample size is n=10 members. If the stratification variable is sex
then we would partition the population into two strata: (i) Stratum 1 – Males, and (ii)
Stratum 2 – Females. One way of allocating the 10 units in the sample is to distribute
them equally into the two strata. Thus, we will select n1=5 males and n2=5 females.

MALES                   FEMALES
01 Almeda, Joel         01 Abad, Melissa        12 Querido, Rose
02 Baluyot, Temy        02 Conlin, Juliet       13 Quiambao, Gina
03 Cruz, Raks           03 Corpuz, Joan         14 Quidayan, Candy
04 Fuentes, Mar         04 Dayrit, Erlyn        15 Santos, Emily
05 Lanuza, Jon          05 Diaz, Aurora         16 Tablante, Rita
06 Macasaet, Erwin      06 Foz, Vivian          17 Tolentino, Magda
07 Peña, Lito           07 Gomez, May           18 Tuason, Joy
08 Quebral, Joseph      08 Joson, Sonia         19 Zamora, Bea
09 Surla, Michael       09 La Pierre, Amy
10 Valdez, Ernie        10 Le, Diana
11 Venegas, Anthony     11 Macaibay, Macky
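
As a rough sketch of the equal-allocation selection described above, the following Python snippet draws 5 roster numbers from each stratum independently. The stratum sizes (11 males, 19 females) are read off the roster; the seed and variable names are assumptions made for illustration only.

```python
import random

rng = random.Random(7)   # fixed seed only so that the run is repeatable

# Stratum sizes taken from the roster: 11 males and 19 females (N = 30).
strata = {"Males": 11, "Females": 19}
per_stratum = 5          # equal allocation: n1 = n2 = 5

# Draw a simple random sample (without replacement) of roster numbers
# within each stratum, independently of the other stratum.
sample = {
    name: sorted(rng.sample(range(1, size + 1), per_stratum))
    for name, size in strata.items()
}
for name, numbers in sample.items():
    print(name, numbers)
```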
Advantages
 Stratification may produce a gain in precision in the estimates of characteristics of
the population
 It allows for more comprehensive data analysis since information is provided for
each stratum.
 It is administratively convenient.

Disadvantages
 A listing of the population for each stratum is needed.
 The stratification of the population may require additional prior information about
the population and its strata.

This example illustrates stratified random sampling using proportional allocation.

Suppose we want to get the opinion of business administration college students regarding
premarital sex. A good stratification variable is sex because the views of the males may
be very different from the views of the females. The population consists of N =500
business administration students and the sample size is n=50. Out of the 500, there are
300 female and 200 male students. The list of business administration students, together
with their respective sex, is available at the records section of the college, or at the Office
of the Registrar.

Stratum               Population     Proportion of        Sample
No.     Sex           Size           Students             Size
1       Male          N1 = 200       200/500 = 0.4        n1 = 50 × 0.4 = 20
2       Female        N2 = 300       300/500 = 0.6        n2 = 50 × 0.6 = 30
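
The proportional allocation in the table can be reproduced with a short Python sketch. The function name `proportional_allocation` is hypothetical, and rounding each stratum's share to the nearest integer is an assumption; with other stratum sizes the rounded shares may not add up exactly to n.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n in proportion to stratum sizes."""
    N = sum(stratum_sizes.values())
    # n_h = n * (N_h / N), rounded to the nearest whole unit
    return {name: round(n * size / N) for name, size in stratum_sizes.items()}

allocation = proportional_allocation({"Male": 200, "Female": 300}, n=50)
print(allocation)   # {'Male': 20, 'Female': 30}
```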

The following example shows an actual survey using stratified sampling.

The Business Expectations Survey is a nationwide survey, which the Bangko Sentral ng
Pilipinas conducts every semester. The survey provides information useful to policy
makers and monetary managers for their economic and financial policy planning. It
presents data on the general perceptions of the business sector on the current state of
business and the economic prospects for the succeeding semester, and it computes
indicators of economic activity.

In the 2000 BES, the sampling frame was the Securities and Exchange Commission list
of the Philippines’ Top 2000 Corporations. BSP stratified the firms in the list according
to the nine industry groups of the Philippine Standard Industry Classification. This allows
the representation of each industry group in the sample. BSP selected the sample of firms
from each industry group using systematic sampling.
Cluster Sampling
Description of the Design
Cluster sampling is a method of sampling where a sample of distinct groups, or clusters,
of elements is selected and then a census of every element in the selected clusters is
taken. Similar to strata in stratified sampling, clusters are non-overlapping sub-
populations which together comprise the entire population. For example, a household is
a cluster of individuals living together or a city block might also be considered as a
cluster. Unlike strata, however, clusters are preferably formed with heterogeneous, rather
than homogeneous elements so that each cluster will be typical of the population.

Clusters may be of equal or unequal size. When all of the clusters are of the same size,
the number of elements in a cluster will be denoted by M while the number of clusters in
the population will be denoted by N.

Sample-Selection Procedure
Step 1 Number the clusters from 1 to N.
Step 2 Select n numbers from 1 to N at random. The clusters corresponding to the
selected numbers form the sample of clusters.
Step 3 Observe all the elements in the sample of clusters.

Suppose we wish to conduct an opinion poll survey of households in Mandaluyong
City. We can select a sample of households using simple one-stage cluster sampling
as follows:

Step 1: Decide on how to divide the population into non-overlapping clusters. In this
example, we will use the barangays as the clusters so that the elementary units are the
households but the sampling units are the barangays.

Step 2: Get a list of all barangays in Mandaluyong City. Number the barangays in the
list, consecutively from 1 to 27.

01 Addition Hills       11 Hagdang Bato Libis    21 Pag-asa
02 Bagong Silang        12 Harapin ang Bukas     22 Plainview
03 Barangka Drive       13 Highway Hills         23 Pleasant Hills
04 Barangka Ibaba       14 Hulo                  24 Poblacion
05 Barangka Ilaya       15 Mabini-J. Rizal       25 San Jose
06 Barangka Itaas       16 Malamig               26 Vergara
07 Buayang Bato         17 Mauway                27 Wack-wack
08 Burol                18 Namayan
09 Daang Bakal          19 New Zaniga
10 Hagdang Bato         20 Old Zaniga

Step 3: Suppose we decide to include n=5 clusters in the study. Use the table of
random numbers to obtain 5 distinct numbers less than or equal to 27.
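
Step 3 can be sketched in Python as follows. Here a pseudorandom generator stands in for the table of random numbers mentioned in the text, and the seed and variable names are assumptions made for illustration.

```python
import random

rng = random.Random(11)   # fixed seed only so that the run is repeatable
N_CLUSTERS = 27           # barangays numbered 1 to 27
n_clusters = 5            # number of clusters to include in the study

# Draw 5 distinct cluster numbers between 1 and 27. Every household in the
# selected barangays would then be included in the survey.
chosen = sorted(rng.sample(range(1, N_CLUSTERS + 1), n_clusters))
print("selected barangay numbers:", chosen)
```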

Advantages
 A population list of elements is not needed; only a population list of clusters is
required. Listing cost is reduced.
 Transportation cost is reduced.

Disadvantages
 The costs and problems of statistical analysis are greater.
 Estimation procedures are more difficult.

Here is an actual survey that used cluster sampling.

The National Statistics Office conducts the Census of Agriculture and Fisheries to collect
data from all agricultural and fishing operators, and all households engaged in
agricultural and fishing activities.

However, due to budgetary constraints, NSO was only able to collect sample data for the
1991 CAF. NSO used cluster sampling, where the barangays served as the clusters. For
each city/municipality, NSO prepared a list of barangays arranged in descending order,
according to the total farm area in the whole barangay. From this list, NSO selected a
sample of barangays using systematic sampling. All agricultural and fishing operators
and all households engaged in agricultural and fishing activities in the selected barangays
were included in the study. In the end, NSO included a total of 5,997,427 operators and
households for this study.

Multistage Sampling
Description of the Design
In multistage sampling, the population is divided into a hierarchy of sampling units
corresponding to the different sampling stages. In the first stage of sampling, the
population is divided into primary stage units (PSU) then a sample of PSUs is drawn. In
the second stage of sampling, each selected PSU is subdivided into second-stage units
(SSU) then a sample of SSUs is drawn. The process of subsampling can be carried to a
third stage, fourth stage and so on, by sampling the subunits instead of enumerating them
completely at each stage.

Advantages
 Listing cost is reduced.
 Transportation cost is reduced.

Disadvantages
 Estimation procedure is difficult, especially when the primary stage units are not
of the same size.
 Estimation procedure gets more complicated as the number of sampling stages
increases.
 The sampling procedure entails much planning before selection is done.

We now present an actual survey that used two-stage sampling in selecting the sample of
elements.

The Food and Nutrition Research Institute of the Department of Science and Technology
conducts the National Nutrition Survey every 5 years. This survey aims to determine the
prevalence of malnutrition and specific health problems in the country and to provide
data on food consumption and nutrient intake.

For this study, FNRI used 2-stage sampling to select a sample of individuals from each
province. The primary stage units are the barangays. The second stage units are the
individuals.

In each province, FNRI prepared a list of barangays arranged in ascending order,
according to the total number of households in the barangay. They then selected a sample
of barangays using systematic sampling. In each one of the selected barangays, they
listed down all the individuals. They collected data on variables like sex, age, and
classification of woman (pregnant or lactating mother). They formed strata based on
these variables. From each stratum, they selected a sample of individuals using
systematic sampling.

We now present an actual survey that used three-stage sampling in selecting the sample
of elements.

The Department of Tourism conducts the Visitor Sample Survey every month. This
survey aims to collect data on the demographic profile, travel characteristics and
preferences of foreign visitors and overseas Filipinos who visited the country, for
tourism development planning and policy-making purposes.

DoT selects the sample of visitors using three-stage sampling. The primary stage units are
the weeks of the month. The second-stage units are the weekly flights. The third-stage
units are the visitors.

For this monthly survey, DoT selects the week of the month using simple random
sampling. From the selected week, they select a sample of weekly flights. They perform
this using stratified random sampling. DoT stratified all the regular weekly international
flights leaving the different international airports in the Philippines according to country
market. It then selects a sample of flights from each country market using simple random
sampling. From the selected flights, DoT selects a sample of visitors using simple
random sampling.
Read on The Questionnaire (Optional)
• Strategies in Writing the Questions (Closed- vs. Open-ended questions)
• Pitfalls to Avoid in Wording Questions
• Ways to Avoid Irrelevant Questions
• Question Order
• Cover Letter/ Introduction
• Pretest

Read on Nonsampling Errors (Optional)

A. Error in the Implementation of Design
1. selection error
2. frame error
3. population specification error
B. Measurement Error
1. instrument error
2. response error (response & nonresponse bias)
3. processing error
4. interviewer bias
5. surrogate information bias

2.4 TABULAR AND GRAPHICAL PRESENTATION OF DATA

Textual Presentation
• data incorporated into a paragraph of text

Example
The 2013 Young Adult Fertility Study (YAFS 4) conducted by the Demographic
Research & Development Foundation and the University of the Philippines
Population Institute shows that 32 percent of young Filipinos between the ages 15 to
24 have had sex before marriage. Of these, 78 percent reported that their first sexual
encounter was unprotected: 84 percent among young women and 73 percent
among young men.

The same study also found that 7.3 percent have engaged in casual sex while 3.5
percent have had regular sex without emotional attachment (FUBU). Five
percent of young men disclosed having experienced sex with another man (MSM).
Among individuals who are either formally married or in a live-in arrangement, 3
percent said they ever had an extra-marital affair.

Regional difference in premarital sex prevalence shows the National Capital
Region (NCR) having the highest prevalence at 41 percent and ARMM, the
lowest (7.7 percent).

Findings from the 2013 Young Adult Fertility and Sexuality Study (YAFS 4)
…show that the levels of current drug use, drinking alcohol and smoking among
young people aged 15-24 have dropped considerably. The declining pattern is
found in the practices of both young men and women, as well as in younger and
older youth.

The percentage of young people who are “current smokers” declined from 20.9
percent in 2002 to 19.7 percent in 2013. Eleven years ago, 41 percent of young
Filipinos reported to be “current alcohol drinkers”. Now, 37 percent of young adults
are engaged in this behavior. But the most substantial decline is found in drug use.
Only 4 percent admitted to have ever used drugs in 2013, compared to almost 11
percent in 2002.

The National Capital Region has the highest level of youth smokers (27 percent)
while ARMM registered the lowest. Only 12 percent of young people in ARMM are
smokers.

Advantages
• gives emphasis to significant figures and comparisons
• simplest and most appropriate approach when there are only a few numbers to be
presented

Disadvantages
• when a large mass of quantitative data is included in a text or paragraph, the
presentation becomes almost incomprehensible
• written paragraphs can be tiresome to read, especially if the same words are repeated
many times
Tabular Presentation
• the systematic organization of data in rows and columns

Advantages
• more concise than textual presentation
• easy to understand
• facilitates comparisons & analysis of relationship among different categories
• presents data in greater detail than a graph

Parts of a Formal Statistical Table


1. Heading - consists of a table number, title, and headnote. The title is a brief statement
of the nature, classification and time reference of the information presented and the area
to which the statistics refer. The headnote is a statement enclosed in brackets between the
table title and the top rule of the table that provides additional title information.

2. Box Head - the portion of the table that contains the column heads which describe the
data in each column, together with the needed classifying and qualifying spanner heads.

3. Stub - the portion of the table usually comprising the first column on the left, in which
the stubhead and row captions, together with the needed classifying and qualifying center
head and subheads are located. The stubhead describes the stub listing as a whole in
terms of the classification presented. The row caption is a descriptive title of the data on
the given line.

4. Field - main part of the table; contains the substance or the figures of one’s data

5. Source note - an exact citation of the source of data presented in the table (should
always be placed when the figures are not original)

6. Footnote - any statement or note inserted at the bottom of the table

Guidelines
• The title should be concise, written in telegraphic style, not as a complete sentence.
• Column labels should be precise. Stress differences rather than similarities between
adjacent columns. As much as possible, two or more adjacent columns should not begin
or end with the same phrase. This is frequently a signal that a spanner head is needed.
• The arrangement of lines in the stub depends on the nature of classification, purpose of
presentation or limitations of space.
• Categories should not overlap.
• The units of measure must be clearly stated.
• Show any relevant total, subtotals, percentages, etc.
• Indicate if the data were taken from another publication by including a source note.
• Tables should be self-explanatory, although they may be accompanied by a paragraph
that will provide an interpretation or direct attention to important figures.
Graphical Presentation
• a graph or chart is a device for showing numerical values or relationships in pictorial
form

Advantages
• main features and implications of a body of data can be grasped at a glance
• can attract attention and hold the reader’s interest
• simplifies concepts that would otherwise have been expressed in so many words
• can readily clarify data, frequently bring out hidden facts and relationships

Qualities of a Good Graph


1. Accuracy - A good chart should not be deceptive, distorted, misleading, or in any way
susceptible to wrong interpretations as a result of inaccurate or careless construction.
Also, care should be taken so as not to create any optical illusion.
2. Clarity - An effective chart can be easily read and understood. The graph should focus
on the message it is trying to communicate. There should be an unambiguous
representation of the facts. The graph must be able to aid the reader in the interpretation
of facts.
3. Simplicity - The basic design of a statistical chart should be simple, straightforward,
not loaded with irrelevant, superfluous, or trivial symbols and ornamentation. There
should be no distracting elements in a chart that inhibit effective visual communication.
4. Appearance - A good chart is one that is designed and constructed to attract and hold
attention by holding a neat, dignified, and professional appearance. It must be artistic in
that it embodies harmonious composition, proportion, and balance.
Common Types of Graph
1. Line Chart - graphical presentation of data especially useful for showing trends over a
period of time.

2. Pie Chart - a circular graph that is useful in showing how a total quantity is distributed
among a group of categories. The “pieces of the pie” represent the proportions of the total
that fall into each category.

3. Bar Chart - consists of a series of rectangular bars where the length of the bar
represents the quantity or frequency for each category if the bars are arranged
horizontally. If the bars are arranged vertically, the height of the bar represents the
quantity.

4. Pictorial unit chart – a pictorial chart in which each symbol represents a definite and
uniform value
2.5 THE FREQUENCY DISTRIBUTION TABLE

Definition. The raw data is the set of data in its original form.

Definition. An array is an arrangement of observations according to their magnitude,
either in increasing or decreasing order.

Example: Final grades of Stat 101 Students arranged in an array

50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 94
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96

Suppose we have data on number of children of 50 married women using any modern
contraceptive method.

0 0 1 2 2 2 3 3 4 4
0 0 1 2 2 3 3 3 4 4
0 1 1 2 2 3 3 3 4 4
0 1 1 2 2 3 3 3 4 5
0 1 1 2 2 3 3 3 4 5

Since there are only 6 unique values in the data set, we use
single-value grouping.

No. of Children     Number of Married Women
0                   7
1                   8
2                   11
3                   14
4                   8
5                   2
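
The single-value grouping above can be checked with a short Python sketch that tallies each distinct value. The variable names are assumptions made for illustration; the data are the 50 observations listed earlier.

```python
from collections import Counter

# Number of children of the 50 married women, row by row as listed above.
children = [
    0, 0, 1, 2, 2, 2, 3, 3, 4, 4,
    0, 0, 1, 2, 2, 3, 3, 3, 4, 4,
    0, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    0, 1, 1, 2, 2, 3, 3, 3, 4, 5,
    0, 1, 1, 2, 2, 3, 3, 3, 4, 5,
]

freq = Counter(children)   # maps each distinct value to its frequency
for value in sorted(freq):
    print(value, freq[value])
print("total", sum(freq.values()))
```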
In the construction of a frequency distribution, the various items of a series are classified
into groups. The frequency distribution table shows the number of items falling into
each group.

Class       Freq       OR       Class       Freq
50 – 55     10                  50 – 54     10
56 – 61     6                   55 – 59     3
62 – 67     8                   60 – 64     8
68 – 73     24                  65 – 69     13
74 – 79     22                  70 – 74     17
80 – 85     24                  75 – 79     19
86 – 91     12                  80 – 84     22
92 – 97     4                   85 – 89     13
                                90 – 94     4
                                95 – 99     1

Definition of terms
1. Class interval - the numbers defining the class
2. Class limits - the end numbers of the class
3. Open-end class - a class that has no lower limit or upper limit
4. Class frequency - the number of observations falling in the class
5. Class size - the difference between the upper class limits of the class and the preceding
class; can also be computed as the difference between the lower class limits of the next
class and the class

Steps in Constructing a Frequency Distribution Table

1. Determine the number of classes. There must be an adequate number of classes to
show the essential characteristics of the data; at the same time, there should not be so
many classes that it becomes difficult to grasp the picture of the distribution as a whole.
There are no precise rules concerning the optimal number of classes but Sturges’ formula
can be used as a first approximation.

Sturges’ formula: K = 1 + 3.322 log n

where K = approximate number of classes
n = number of observations
log = logarithm to base 10

2. Determine the approximate class size. Whenever possible, all classes should be of the
same size. The following steps can be used to determine the class size.
• Solve for the range, R = max – min.
• Compute C’ = R ÷ K.
• Round-off C’ to the same number of decimal places as the original dataset, say
C, and use C as the class size.
3. Determine the lowest class limit. The first class must include the smallest value in the
data set and must agree with the number of decimal places in the dataset.
4. Determine all class limits by adding the class size, C, to the limit of the previous class.
5. Tally the frequencies for each class. Sum the frequencies and check against the total
number of observations.
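
The five steps can be traced on the 110 final grades with the Python sketch below. It is a sketch under stated assumptions: K and C are rounded to the nearest integer, the lowest class limit is taken to be the smallest observation, and the data are whole numbers so that each upper limit is the lower limit plus C − 1. The grades are stored as value: count pairs only to keep the listing short.

```python
import math

# The 110 final grades from the array above, stored as {grade: count}.
grade_counts = {
    50: 6, 51: 1, 52: 1, 53: 2, 57: 1, 59: 2, 60: 3, 62: 4, 63: 1, 65: 1,
    66: 2, 68: 5, 69: 5, 70: 1, 71: 4, 72: 5, 73: 4, 74: 3, 75: 5, 76: 4,
    77: 4, 78: 1, 79: 5, 80: 3, 81: 4, 82: 6, 83: 2, 84: 7, 85: 2, 86: 2,
    87: 6, 88: 1, 89: 2, 91: 1, 92: 1, 94: 2, 96: 1,
}
grades = [g for g, count in grade_counts.items() for _ in range(count)]
n = len(grades)

# Step 1: Sturges' formula gives about 7.78, rounded here to 8 classes.
K = round(1 + 3.322 * math.log10(n))

# Step 2: range R = 96 - 50 = 46, so C' = 46 / 8 = 5.75, rounded to 6.
R = max(grades) - min(grades)
C = round(R / K)

# Steps 3 to 5: start at the smallest value, add C repeatedly, then tally.
classes = []
lower = min(grades)
while lower <= max(grades):
    upper = lower + C - 1   # valid because the grades are whole numbers
    frequency = sum(lower <= g <= upper for g in grades)
    classes.append((lower, upper, frequency))
    lower += C

for lower_limit, upper_limit, frequency in classes:
    print(f"{lower_limit} - {upper_limit}: {frequency}")
print("total:", sum(f for _, _, f in classes))
```

Under these assumptions the result matches the first of the two frequency tables shown earlier (classes 50 – 55 through 92 – 97).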

Variations of the Frequency Distribution

1. Class boundaries - the true class limits; the lower class boundary (LCB) is usually
defined as halfway between the lower class limit of the class and the upper class limit of
the preceding class while the upper class boundary (UCB) is usually defined as
halfway between the upper class limit of the class and the lower class limit of the next
class

2. Class mark (CM) - midpoint of a class interval

Classes     Boundaries     Marks
50 – 54     49.5-54.5      52
55 – 59     54.5-59.5      57
60 – 64     59.5-64.5      62
65 – 69     64.5-69.5      67
70 – 74     69.5-74.5      72
75 – 79     74.5-79.5      77
80 – 84     79.5-84.5      82
85 – 89     84.5-89.5      87
90 – 94     89.5-94.5      92
95 – 99     94.5-99.5      97

3. Relative Frequency Distribution and Relative Frequency Percentage
RF = class frequency ÷ no. of observations
RFP = RF × 100%

CI          f      RF      RFP
50 – 54     10     .09     9
55 – 59     3      .03     3
60 – 64     8      .07     7
65 – 69     13     .12     12
70 – 74     17     .15     15
75 – 79     19     .17     17
80 – 84     22     .20     20
85 – 89     13     .12     12
90 – 94     4      .04     4
95 – 99     1      .01     1

4. Cumulative Frequency Distribution - shows the accumulated frequencies of
successive classes, beginning at either end of the distribution
Less than CFD – shows the no. of observations less than the UCB
Greater than CFD – shows the no. of observations greater than the LCB

CB            f      <CF     >CF
49.5-54.5     10     10      110
54.5-59.5     3      13      100
59.5-64.5     8      21      97
64.5-69.5     13     34      89
69.5-74.5     17     51      76
74.5-79.5     19     70      59
79.5-84.5     22     92      40
84.5-89.5     13     105     18
89.5-94.5     4      109     5
94.5-99.5     1      110     1
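
The four variations can be derived from the class limits and frequencies alone, as the Python sketch below suggests. The variable names are assumptions made for illustration, and relative frequencies are rounded to two decimal places to match the tables.

```python
limits = [(50, 54), (55, 59), (60, 64), (65, 69), (70, 74),
          (75, 79), (80, 84), (85, 89), (90, 94), (95, 99)]
freqs = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]
n = sum(freqs)

# Gap between the upper limit of one class and the lower limit of the next.
gap = limits[1][0] - limits[0][1]

rows = []
less_cf = 0
for (lower, upper), f in zip(limits, freqs):
    lcb = lower - gap / 2            # lower class boundary
    ucb = upper + gap / 2            # upper class boundary
    mark = (lower + upper) / 2       # class mark
    rf = round(f / n, 2)             # relative frequency
    less_cf += f                     # observations less than the UCB
    greater_cf = n - (less_cf - f)   # observations greater than the LCB
    rows.append((lcb, ucb, mark, rf, less_cf, greater_cf))

for row in rows:
    print(row)
```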
Graphical Presentation of the Frequency Distribution Table

1. Histogram - a bar graph that displays the classes on the horizontal axis and the
(relative) frequencies (percentage) of the classes on the vertical axis; the vertical lines of
the bars are erected at the class boundaries and the height of the bars correspond to the
class (relative) frequency (percentage)
CB            f
49.5-54.5     10
54.5-59.5     3
59.5-64.5     8
64.5-69.5     13
69.5-74.5     17
74.5-79.5     19
79.5-84.5     22
84.5-89.5     13
89.5-94.5     4
94.5-99.5     1

[Figure: histogram of the final grades; the horizontal axis shows the class boundaries
from 49.5 to 99.5 and the vertical axis shows the frequency from 0 to 25]

2. Polygon – a line chart that is constructed by plotting the (relative) frequencies
(percentage) at the class marks and connecting the plotted points by means of straight
lines; the polygon is closed by considering an additional class at each end and the ends of
the lines are brought down to the horizontal axis at the midpoints of the additional
classes.

CM     RF
52     .09
57     .03
62     .07
67     .12
72     .15
77     .17
82     .20
87     .12
92     .04
97     .01

[Figure: frequency polygon of the final grades; the horizontal axis shows the class marks
from 47 to 102 and the vertical axis shows the relative frequency from 0 to 0.25]
3. Ogives - graphs of the cumulative frequency distribution
a. < ogive - the <CF is plotted against the UCB
b. > ogive - the >CF is plotted against the LCB

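UCB      <CF
54.5     10
59.5     13
64.5     21
69.5     34
74.5     51
79.5     70
84.5     92
89.5     105
94.5     109
99.5     110

[Figure: less than ogive; the horizontal axis shows the class boundaries from 49.5 to 99.5
and the vertical axis shows the cumulative frequency from 0 to 120]

LCB      >CF
49.5     110
54.5     100
59.5     97
64.5     89
69.5     76
74.5     59
79.5     40
84.5     18
89.5     5
94.5     1

[Figure: greater than ogive; the horizontal axis shows the class boundaries from 49.5 to 99.5
and the vertical axis shows the cumulative frequency from 0 to 120]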

2.6 THE STEM-AND-LEAF DISPLAY

The stem-and-leaf display is an alternative method for describing a set of data. It presents
a histogram-like picture of the data, while allowing the experimenter to retain the actual
observed values of each data point. Hence, the stem-and-leaf display is partly tabular and
partly graphical in nature.

In creating a stem-and-leaf display, we divide each observation into two parts, the stem
and the leaf. For example, we could divide the data value 234 as follows:

Stem Leaf
2 | 34
Alternatively, we could choose the point of division between the units and tens, whereby

Stem Leaf
23 | 4

The choice of the stem and leaf coding depends on the nature of the data set.

Steps in Constructing the Stem-and-Leaf Display


1. List the stem values, in order, in a vertical column.
2. Draw a vertical line to the right of the stem value.
3. For each observation, record the leaf portion of that observation in the row
corresponding to the appropriate stem
4. Reorder the leaves from lowest to highest within each stem row. Maintain uniform
spacing for the leaves so that the stem with the most number of observations has the
longest line.
5. If the number of leaves appearing in each row is too large, divide the stem into two
groups, the first corresponding to leaves beginning with digits 0 through 4 and the second
corresponding to leaves beginning with digits 5 through 9. This subdivision can be
increased to five groups if necessary.
6. Provide a key to your stem-and-leaf coding so that the reader can recreate the actual
measurements from your display.
Unit = 0.1 1 | 2 represents 1.2
Unit = 1 1 | 2 represents 12
Unit = 10 1 | 2 represents 120

Example: Typing speeds (net words per minute) for 20 secretarial applicants
68 72 91 47
52 75 63 55
65 35 84 45
58 61 69 22
46 55 66 71

Stem | Leaf (unit = 1)
   2 | 2
   3 | 5
   4 | 567
   5 | 2558
   6 | 135689
   7 | 125
   8 | 4
   9 | 1
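
The display above can be reproduced with a brief Python sketch. It assumes two-digit whole numbers with the tens digit as the stem and the units digit as the leaf (unit = 1); the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

speeds = [68, 72, 91, 47, 52, 75, 63, 55, 65, 35,
          84, 45, 58, 61, 69, 22, 46, 55, 66, 71]

def stem_and_leaf(values):
    """Group the units digits (leaves) under their tens digits (stems)."""
    rows = defaultdict(list)
    for value in sorted(values):      # sorting orders the leaves within a stem
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    return dict(rows)

display = stem_and_leaf(speeds)
for stem in sorted(display):
    leaves = "".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```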
Exercises on FDT and SALD:

1. Americans are becoming increasingly concerned with the incidence of crime, and
voluminous data is being collected to document the magnitude of the problem.
The following table displays data on number of rapes per 100,000 residents for
the 50 states and the District of Columbia.

72.9 27.3 24.4 24.7
32.9 39.4 30.0 53.7
43.3 37.9 34.3 34.3
40.9 40.4 17.8 49.5
42.6 29.0 31.3 51.5
46.2 42.2 29.8 37.8
27.9 33.7 49.7 25.9
49.9 45.7 62.2 31.0
88.1 19.7 29.8 64.0
52.4 77.6 46.8 20.7
53.6 34.0 47.0 23.6
32.5 32.5 46.9 29.5
18.4 44.1 25.8

Florence Nightingale is known as the founder of the nursing profession. However, she
also saved many lives by using statistical analysis. When she encountered an unsanitary
condition or an undersupplied hospital, she improved the conditions and then used
statistical data to document the improvement. Thus, she was able to convince others of
the need for medical reform, particularly in the area of sanitation. She developed original
graphs to demonstrate that, during the Crimean War, more soldiers died from unsanitary
conditions than were killed in combat.
2. The paper “The Acid Rain Controversy: The Limits of Confidence” (Amer.
Statistician (1983): 385-394) presented data on average sulfur dioxide emission
rates for industrial and utility boilers in 47 states (data from Alaska, Hawaii, and
Idaho were not given).

.3 .9 1.5 2.5
.6 .6 1.4 2.7
.4 1.5 1.9 2.9
.5 1.5 1.0 2.1
.2 1.3 1.7 2.9
.7 1.2 1.8 3.8
.2 1.2 1.7 3.6
.7 1.0 1.8 3.4
.7 1.4 1.4 3.7
.5 1.0 2.3 4.2
.1 1.7 2.7 4.5
.6 1.5 2.2

3. The NSCB presented the following figures on the total deposits (in millions of
pesos) in the government banks in the different provinces in the Philippines in
2001.

74 329 839 1932
117 332 850 2071
128 347 906 2174
142 358 933 2174
151 368 1084 2430
154 403 1094 2438
157 434 1104 2530
167 437 1114 2544
170 440 1125 2732
198 448 1131 2775
214 450 1169 2809
215 471 1195 2826
218 520 1280 2846
235 528 1390 2999
241 569 1431 3345
245 604 1551 3502
268 622 1632 4515
285 645 1643 6632
304 654 1771
320 680 1880
CHAPTER 3
Measures of Central Tendency
and
Measures of Location
Definition. A measure of central tendency is any single value that is used to identify the
“center” of the data or the typical value. It is often referred to as the average.

3.1 NOTATIONS AND SYMBOLS


Suppose that X is the variable of interest, and that n measurements are taken. The
notation X1, X2, . . . ,Xn will be used to represent the n observations.

Let the Greek letter Σ (sigma) indicate the “summation of”; thus, we can write the sum of
the observations as

$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$$

The numbers 1 and n are called the lower and the upper limits of summation,
respectively.

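Example:    i     1   2   3   4
            Xi    2   4   6   8
            Yi    1   2   1   2

Calculate:
1. $\sum_{i=2}^{3} X_i$
2. $\sum_{i=2}^{3} (X_i + Y_i)$
3. $\sum_{i=1}^{4} X_i Y_i$
4. $\left(\sum_{i=1}^{4} X_i\right)\left(\sum_{i=1}^{4} Y_i\right)$
5. $\sum_{i=1}^{4} \frac{X_i}{Y_i}$
6. $\frac{\sum_{i=1}^{4} X_i}{\sum_{i=1}^{4} Y_i}$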
Some Results on Summation
1. The summation of the sum (or difference) of variables is the sum (or difference) of
their summations. That is,
$$\sum_{i=1}^{n} (X_i \pm Y_i) = \sum_{i=1}^{n} X_i \pm \sum_{i=1}^{n} Y_i$$

2. If c is a constant, then
$$\sum_{i=1}^{n} cX_i = c \sum_{i=1}^{n} X_i$$

3. If c is a constant, then
$$\sum_{i=1}^{n} c = nc$$

3.2 THE ARITHMETIC MEAN


- the most common average
- the sum of all values of the observations divided by the number of observations
- simply referred to as the mean

The population mean for a population with N elements, denoted by the Greek letter μ
(mū), is computed as

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

The sample mean $\bar{X}$ (X bar) of n observations is computed as

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

The sample mean (a statistic) is an estimate of the unknown population mean (a
parameter).

Examples:
1. The number of employees at 5 different drug stores are 10, 12, 6, 8, and 4. Find the
mean number of employees for the 5 stores.
2. Scores in the Statistics 101 first exam for a sample of 10 students are as follows: 60,
55, 30, 90, 88, 79, 45, 66, 93, and 80. Find the mean.
3. Refer to the example on the final grades of 110 Statistics 101 students. The sample
mean is 74.10909091
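
The first two examples can be worked with Python's standard `statistics` module, as sketched below; the variable names are assumptions made for illustration.

```python
from statistics import mean

employees = [10, 12, 6, 8, 4]                         # Example 1
scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]     # Example 2

print(mean(employees))   # 40 / 5 = 8
print(mean(scores))      # 686 / 10 = 68.6
```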
Definition. The weighted mean is a modification of the usual mean that assigns weights
(or measures of relative importance) to the observations to be averaged. If each
observation Xi is assigned a weight Wi, the weighted mean is given by

$$\bar{X}_W = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$

Examples:
1. Suppose a teacher assigns the following weights to the various course requirements:
Assignment 15%
Project 25%
Midterm Exam 20%
Final Exam 40%
The maximum score a student may obtain for each component is 100. Jeffry obtains
marks of 83 for assignments, 72 for the project, 41 for the midterm exam, and 47 for the
final exam. Find his mean mark (or final grade) for the course.

2. Alex’s grades are as follows:
History       1.0
Humanities    1.0
Math 19       3.0
Math 53       3.0
Philosophy    1.0
Math 53 is a 5-unit course and all others are 3-unit courses. Find Alex’s general
weighted average (GWA).
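
Both examples apply the weighted mean formula directly. In the sketch below the function name `weighted_mean` is hypothetical, the course weights are entered as whole percentages, and the number of units serves as the weight for each of Alex's grades.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Example 1: assignments, project, midterm exam, final exam
jeffry = weighted_mean([83, 72, 41, 47], [15, 25, 20, 40])

# Example 2: History, Humanities, Math 19, Math 53, Philosophy
alex = weighted_mean([1.0, 1.0, 3.0, 3.0, 1.0], [3, 3, 3, 5, 3])

print(jeffry)           # 5745 / 100 = 57.45
print(round(alex, 4))   # 33 / 17, about 1.9412
```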
Characteristics of the Mean
1. It is the most familiar measure used, and it employs all available information.
2. It is affected by the value of every observation. In particular it is strongly influenced by
extreme values.
3. Since the mean is a calculated number, it may not be an actual number in the data set.
4. It varies from sample to sample. Suppose the true average fuel efficiency for all
600,000 cars of a certain type under specified conditions is µ = 27.5 mi/gal. A sample of
n = 5 cars might yield efficiencies 27.3, 26.2, 28.4, 27.9, 26.5, for which we obtain
$\bar{x}$ = 27.26. However, a second sample might give $\bar{x}$ = 28.52, a third
$\bar{x}$ = 26.85, and so on. The value of $\bar{x}$ varies from sample to sample whereas
there is just one value for µ. Later on we shall see how the value of $\bar{x}$ from a
particular sample can be used to draw various conclusions about the value of µ.
5. It possesses two mathematical properties that will prove to be important in subsequent
analyses.
i) The sum of the deviations of the values from the mean is zero.
ii) The sum of the squared deviations is minimum when the deviations are taken
from the mean. Hint: show that
$$\sum_{i=1}^{n} (X_i - c)^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 + n(\bar{X} - c)^2$$
6. a. If a constant c is added (subtracted) to all observations, the mean of the new
observations will increase (decrease) by the same amount c.
b. If all observations are multiplied or divided by a constant, the new observations
will have a mean that is the same constant multiple of the original mean.

Example: Given 5 temperature readings measured in Fahrenheit: 98, 100, 107, 90, 92. If
the mean temperature is $\bar{X}_F$ = 97.4, find the mean temperature in centigrade if
$C = \frac{5}{9}(F - 32)$.
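
Properties 6a and 6b imply that the conversion can be applied to the mean itself instead of to every reading. The sketch below checks this numerically for the temperature example; the function name `to_celsius` is an assumption.

```python
from statistics import fmean

fahrenheit = [98, 100, 107, 90, 92]


def to_celsius(f):
    """C = (5/9)(F - 32)."""
    return 5 / 9 * (f - 32)


mean_f = fmean(fahrenheit)

# Route 1: convert every reading, then average.
mean_c_direct = fmean([to_celsius(f) for f in fahrenheit])
# Route 2: convert the Fahrenheit mean (subtract a constant, then multiply by a constant).
mean_c_from_mean = to_celsius(mean_f)

print(round(mean_f, 1))            # 97.4
print(round(mean_c_direct, 2))     # 36.33
print(round(mean_c_from_mean, 2))  # 36.33
```
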
Approximating the Mean from a Frequency Distribution
- possible only when the class mark can be assumed to be representative of all the values
in that class. If the assumption holds, the following equation may be used to approximate
the mean from a frequency distribution.

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i\, CM_i}{\sum_{i=1}^{k} f_i} = \frac{\sum_{i=1}^{k} f_i\, CM_i}{n}$$
where fi = the frequency of the ith class
CMi = the class mark of the ith class
k = total number of classes
n = total number of observations

Example: Final grade of 110 Statistics 101 students

Class Freq fi CMi fiCMi


50 – 54 10 52 520
55 – 59 3 57 171
60 – 64 8 62 496
65 – 69 13 67 871
70 – 74 17 72 1224
75 – 79 19 77 1463
80 – 84 22 82 1804
85 – 89 13 87 1131
90 – 94 4 92 368
95 – 99 1 97 97
Total 110 8145

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i\, CM_i}{\sum_{i=1}^{k} f_i} = \frac{8145}{110} = 74.045455$$
Remarks:
1. The formula for approximating the mean cannot be used if a frequency distribution has
open-ended intervals, unless there are reasonably accurate estimates of the class marks
for the open intervals.
2. The mean of a frequency distribution is simply a weighted mean of the class marks,
where the fi’s are the weights.
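
Remark 2 suggests a direct computation: treat the class marks as values and the frequencies as weights. A minimal sketch, assuming each class is stored as a (lower limit, upper limit, frequency) tuple, which is a layout chosen here for illustration:

```python
# (lower limit, upper limit, frequency) for each class of the final-grade table.
classes = [
    (50, 54, 10), (55, 59, 3), (60, 64, 8), (65, 69, 13), (70, 74, 17),
    (75, 79, 19), (80, 84, 22), (85, 89, 13), (90, 94, 4), (95, 99, 1),
]


def grouped_mean(classes):
    """Weighted mean of the class marks, with the frequencies as weights."""
    total_f = sum(f for _, _, f in classes)
    total_f_cm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_f_cm / total_f


approx_mean = grouped_mean(classes)
print(round(approx_mean, 6))  # 74.045455
```
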
3.3 THE MEDIAN
- the positional middle of the arrayed data
- in an array, one-half of the values precede the median and one-half follow it

The first step in calculating the median, denoted as Md, is to arrange the data in an array.
Let X(i) be the ith observation in the array, i = 1, 2, . . . , n.
If n is odd, the median position equals (n+1)/2, and the value of the (n+1)/2 th
observation in the array is taken as the median, i.e.,
$$Md = X_{\left(\frac{n+1}{2}\right)}$$
If n is even, the mean of the two middle values in the array is the median, i.e.,
$$Md = \frac{1}{2}\left[X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right]$$

Examples:
1. Given the following heights (in inches): 71, 72, 75, 75, and 67. Find the median
height.
2. Given the following scores: 1, 7, 3, 3, 6, 5, 4, 3, find the median of the scores.
3. Refer to the example on the grades of 110 Statistics 101 students. Find the median.
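
The two positional rules translate into a few lines of code. The sketch below applies them to Examples 1 and 2; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Median via the positional rule applied to the sorted array."""
    array = sorted(values)
    n = len(array)
    if n % 2 == 1:
        # X((n+1)/2), shifted by one because Python lists are 0-based.
        return array[(n + 1) // 2 - 1]
    # Mean of X(n/2) and X(n/2 + 1).
    return (array[n // 2 - 1] + array[n // 2]) / 2


heights = [71, 72, 75, 75, 67]
scores = [1, 7, 3, 3, 6, 5, 4, 3]

print(median(heights))  # 72
print(median(scores))   # 3.5
```
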
Characteristics of the Median:
1. The median is a positional measure.
2. The median is affected by the position of each item in the series but not by the value of
each item. This means that extreme values affect the median less than the arithmetic
mean.

Approximating the Median from the Frequency Distribution


- possible only when the values of the observations falling in the median class can be
assumed to be evenly spaced throughout the class. (The median class is the class
containing the median.)

Step 1. Construct the less than cumulative frequency distribution.


Step 2. Starting from the top, locate the class with less than cumulative frequency greater
than or equal to n/2 for the first time. This class is the median class.
Step 3. Approximate the median using the following formula:

$$Md = LCB_{Md} + c\left(\frac{\frac{n}{2} - {<}cf_{Md-1}}{f_{Md}}\right)$$
where $LCB_{Md}$ = the lower class boundary of the median class
c = class size of the median class
n = the total number of observations in the distribution
${<}cf_{Md-1}$ = less than cumulative frequency of the class preceding the median class
$f_{Md}$ = frequency of the median class
Example:
Refer to the example on the final grades of 110 Statistics 101 students.

Class Freq <cf


50 – 54 10 10
55 – 59 3 13
60 – 64 8 21
65 – 69 13 34
70 – 74 17 51
75 – 79 19 70
80 – 84 22 92
85 – 89 13 105
90 – 94 4 109
95 – 99 1 110

$$Md = LCB_{Md} + c\left(\frac{\frac{n}{2} - {<}cf_{Md-1}}{f_{Md}}\right) = 74.5 + 5\left(\frac{\frac{110}{2} - 51}{19}\right) = 75.552632$$

3.4 THE MODE


- the observed value that occurs most frequently
- locates the point where the observation values occur with the greatest density
- generally a less popular measure than the mean or the median

The mode is determined by counting the frequency of each value and finding the value
with the highest frequency of occurrence.

Examples:
1. 2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2
2. 2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1
3. 1, 2, 3, 3, 2, 1, 2, 3, 1, 4, 4, 5, 5, 1, 2, 3, 4, 5, 4, 5
4. Refer to the example on the final grades of 110 Statistics 101 students. Find the mode.

Characteristics of the Mode:


1. It does not always exist; and if it does, it may not be unique. A data set is said to be
unimodal if there is only one mode, bimodal if there are two modes, trimodal if there are
three modes, and so on.
2. It is not affected by extreme values.
3. The mode can be used for qualitative as well as quantitative data.
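
Counting frequencies is easy to automate. The sketch below handles Examples 1 to 3; it assumes the convention that a data set in which every distinct value occurs equally often has no mode, and the function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when all distinct values tie."""
    counts = Counter(values)
    top = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == top)
    # Assumed convention: if every distinct value shares the top count, report no mode.
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners


ex1 = [2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2]
ex2 = [2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1]
ex3 = [1, 2, 3, 3, 2, 1, 2, 3, 1, 4, 4, 5, 5, 1, 2, 3, 4, 5, 4, 5]

print(modes(ex1))  # [2]      unimodal
print(modes(ex2))  # [2, 5]   bimodal
print(modes(ex3))  # []       no mode
```
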
Approximating the Mode from the Frequency Distribution
Step 1: Locate the class with the highest frequency. This is the modal class.
Step 2: Approximate the mode using the following formula:

$$Mo = LCB_{Mo} + c\left(\frac{f_{Mo} - f_{Mo-1}}{2f_{Mo} - f_{Mo-1} - f_{Mo+1}}\right)$$
where $LCB_{Mo}$ = lower class boundary of the modal class
c = class size of the modal class
$f_{Mo}$ = frequency of the modal class
$f_{Mo-1}$ = frequency of the class preceding the modal class
$f_{Mo+1}$ = frequency of the class following the modal class

Example:

Class Freq
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1

$$Mo = LCB_{Mo} + c\left(\frac{f_{Mo} - f_{Mo-1}}{2f_{Mo} - f_{Mo-1} - f_{Mo+1}}\right) = 79.5 + 5\left(\frac{22 - 19}{2(22) - 19 - 13}\right) = 80.75$$
3.5 MEASURES OF LOCATION
Definition. Measures of location or fractiles or quantiles are values below which a
specified fraction or percentage of the observations in a given set must fall.

Definition. Quartiles are values that divide the array into 4 equal parts. Thus,
Q1, read as first quartile, is the value below which 25% of the values fall.
Q2, read as second quartile, is the value below which 50% of the values fall.
Q3, read as third quartile, is the value below which 75% of the values fall.

Lowest 25% Second lowest 25% Third lowest 25% Highest 25%
Q1 Q2 Q3

Definition. Deciles are values that divide the array into 10 equal parts. Thus,
D1, read as first decile, is the value below which 10% of the values fall.
D2, read as second decile, is the value below which 20% of the values fall.



D9, read as ninth decile, is the value below which 90% of the values fall.

10% 10% 10% 10% 10% 10% 10% 10% 10% 10%
D1 D2 D3 D4 D5 D6 D7 D8 D9

Definition. Percentiles are values that divide the array into 100 equal parts. Thus,
P1, read as first percentile, is the value below which 1% of the values fall.
P2, read as second percentile, is the value below which 2% of the values fall.



P99, read as ninety-ninth percentile, is the value below which 99% of the values fall.

Consider the lowest 10% of your data:

1% 1% 1% 1% 1% 1% 1% 1% 1% 1%
P1 P2 P3 P4 P5 P6 P7 P8 P9

Consider the entire data set:

10% 10% 5% 5% 10% 10% 10% 10% 5% 5% 10% 10%


P10 P20 P25 P30 P40 P50 P60 P70 P75 P80 P90
D1 D2 D3 D4 D5 D6 D7 D8 D9
Q1 Q2 Q3
Md
To compute the ith percentile:
$P_i$ = the value of the $\frac{i(n+1)}{100}$th observation in the array

Examples: Determine P69, D3, Q1 and Q3 from the data on stat 101 final grades

50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 94
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
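
The position rule can be applied to the array above as sketched below. The notes do not say what to do when the position i(n+1)/100 is not a whole number; linear interpolation between the two neighbouring observations is assumed here, and the function name `percentile` is hypothetical.

```python
# The 110 final grades, one list per column of the table above
# (the sorted array reads down each column, from left to right).
columns = [
    [50, 50, 50, 50, 50, 50, 51, 52, 53, 53],
    [57, 59, 59, 60, 60, 60, 62, 62, 62, 62],
    [63, 65, 66, 66, 68, 68, 68, 68, 68, 69],
    [69, 69, 69, 69, 70, 71, 71, 71, 71, 72],
    [72, 72, 72, 72, 73, 73, 73, 73, 74, 74],
    [74, 75, 75, 75, 75, 75, 76, 76, 76, 76],
    [77, 77, 77, 77, 78, 79, 79, 79, 79, 79],
    [80, 80, 80, 81, 81, 81, 81, 82, 82, 82],
    [82, 82, 82, 83, 83, 84, 84, 84, 84, 84],
    [84, 84, 85, 85, 86, 86, 87, 87, 87, 87],
    [87, 87, 88, 89, 89, 91, 92, 94, 94, 96],
]
grades = sorted(g for col in columns for g in col)


def percentile(array, i):
    """Value at position i(n+1)/100 (1-based) of a sorted array."""
    n = len(array)
    pos = i * (n + 1) / 100
    k = int(pos)          # whole part of the position
    frac = pos - k        # fractional part of the position
    if k < 1:
        return array[0]
    if k >= n:
        return array[-1]
    # Assumption: interpolate linearly between the kth and (k+1)th observations.
    return array[k - 1] + frac * (array[k] - array[k - 1])


print(percentile(grades, 69))  # P69 -> 81.0
print(percentile(grades, 30))  # D3 = P30 -> 69.0
print(percentile(grades, 25))  # Q1 = P25 -> 68.0
print(percentile(grades, 75))  # Q3 = P75 -> 82.25
```

Only Q3 depends on the interpolation assumption here: its position is 83.25, between the 83rd observation (82) and the 84th (83).
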
CHAPTER 4
Measures of Dispersion
and
Measures of Skewness
Definition. Measures of dispersion indicate the extent to which individual items in a
series are scattered about an average.

4.1 MEASURES OF ABSOLUTE DISPERSION


Measures of absolute dispersion are expressed in the units of the original observations.
They can not be used to compare variations of two data sets when the averages of these
data sets differ a lot in value or when the observations differ in units of measurement.

• The Range
Definition. The range of a set of measurements is the difference between the largest and
the smallest values.

Range = maximum – minimum = X(n) – X(1)

The range is approximated from a frequency distribution by getting the difference


between the upper class limit of the highest class interval and the lower class limit of the
lowest class interval.

Examples:
1. The IQ’s of 5 members of a certain family are 108, 112, 127, 116, and 113. Find the
range.
2. Refer to the example on the final grade of 110 Statistics 101 students. The range is 96
– 50 = 46.
Approximating the range from the frequency distribution table, we get 99 – 50 = 49.

Characteristics of the Range


1. It uses only the extreme values. It fails to communicate any information about the
clustering or the lack of clustering of the values between the extremes.
2. A weakness of the range is that an outlier can greatly alter its value.
3. It can not be approximated from open-ended frequency distributions.
4. It is unreliable when computed from a frequency distribution table with gaps or zero
frequencies.
• The Standard Deviation and the Variance
Definition. For a population of size N, the population variance is
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
and the population standard deviation is
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

Definition. For a sample of size n, the sample variance is
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
and the sample standard deviation is
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

Remarks:
1. The standard deviation is the most frequently used measure of dispersion and is
interpreted as the average distance of the data values from their mean.
2. The variance is not a measure of absolute dispersion. It is not expressed in the same
units as the original observations.

Examples:
1. The following scores were given by 6 judges for a gymnast’s performance in the vault:
7, 5, 9, 7, 8, and 6. Find the standard deviation.

2. A sample of 5 households showed the following number of household members: 3, 8,
5, 4, and 4. Find the standard deviation.

3. Refer to the example on the final grades of 110 Statistics 101 students. The sample
standard deviation is given by s = 11.25137745

A recent survey showed that customers of the US Postal Service were interested in more
consistency in the time it takes to make a delivery. Under the old conditions, a local
letter might take only one day to deliver, or it might take several. “Just tell me how many
days ahead I need to mail the birthday card to Mom so it gets there on her birthday, not
early, not late,” was a common complaint. The level of consistency is measured by the
standard deviation of the delivery times. A smaller standard deviation indicates more
consistency.
Computational formula:
$$S = \sqrt{\frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n-1)}}$$

Example: For the final grades of 110 Statistics 101 students, $\sum_{i=1}^{110} X_i^2 = 617936$
and $\sum_{i=1}^{110} X_i = 8152$, so
$$s = \sqrt{\frac{110(617936) - 8152^2}{110(109)}} = 11.25137745$$
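
The definitional and computational formulas can be compared in code. The sketch below treats the judges' scores and the household sizes as samples (divisor n - 1), which is an assumption for the first example; the function names are illustrative.

```python
from math import sqrt
from statistics import stdev


def sample_sd(values):
    """Definitional formula: sqrt(sum((X_i - mean)^2) / (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def sample_sd_from_sums(n, sum_x, sum_x2):
    """Computational formula: sqrt((n*sum(X^2) - (sum X)^2) / (n(n - 1)))."""
    return sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


judges = [7, 5, 9, 7, 8, 6]
households = [3, 8, 5, 4, 4]

print(round(sample_sd(judges), 4))      # 1.4142
print(round(sample_sd(households), 4))  # 1.9235
# Final grades of the 110 students, from the two sums given above.
print(round(sample_sd_from_sums(110, 8152, 617936), 2))  # 11.25
# statistics.stdev also uses the n - 1 divisor.
print(round(stdev(judges), 4))          # 1.4142
```
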

Approximating the Standard Deviation from the Frequency Distribution

$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i\,(CM_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{n\sum_{i=1}^{k} f_i\, CM_i^2 - \left(\sum_{i=1}^{k} f_i\, CM_i\right)^2}{n(n-1)}}$$

Example:
Class Freq CMi fiCMi fiCMi2
50 – 54 10 52 520 27040
55 – 59 3 57 171 9747
60 – 64 8 62 496 30752
65 – 69 13 67 871 58357
70 – 74 17 72 1224 88128
75 – 79 19 77 1463 112651
80 – 84 22 82 1804 147928
85 – 89 13 87 1131 98397
90 – 94 4 92 368 33856
95 – 99 1 97 97 9409
Total 110 8145 616265
$$s = \sqrt{\frac{n\sum_{i=1}^{k} f_i\, CM_i^2 - \left(\sum_{i=1}^{k} f_i\, CM_i\right)^2}{n(n-1)}} = \sqrt{\frac{110(616265) - 8145^2}{110(109)}} = 10.989892239821$$

Characteristics of the Standard Deviation


1. It is affected by the value of every observation. It may be distorted by a few extreme
values.
2. It can not be computed from an open-ended distribution.
3. If each observation of a set of data is transformed to a new set by the addition (or
subtraction) of a constant c, the standard deviation of the new set of data is the same as
the standard deviation of the original data set.
4. If a set of data is transformed to a new set by multiplying (or dividing) each
observation by a constant c, the standard deviation of the new data set is equal to the
standard deviation of the original data set multiplied (or divided) by (the absolute value
of) c.

4.2 MEASURES OF RELATIVE DISPERSION

Measures of relative dispersion are unitless and are used when one wishes to compare
the scatter of one distribution with another distribution.

• The Coefficient of Variation


Definition. The coefficient of variation, CV, is the ratio of the standard deviation to the
mean and is usually expressed in percentage. It is computed as
$CV = \frac{\sigma}{\mu} \times 100\%$ and its sample counterpart is
$CV = \frac{S}{\bar{X}} \times 100\%$.
Examples.
1. The foreign exchange rate is an indicator of the stability of the peso and is also an
indicator of economic performance. In 1992 Bangko Sentral ng Pilipinas put the peso on
a floating rate basis. Market forces and not government policy have determined the level
of the peso since. Government intervenes through the BSP, only when there are
speculative elements in the market. Given below are the means and standard deviations of
the quarterly P-$ exchange rate for the periods 1989 to 1991 and 1992 to 1994. Which of
the two periods is more stable?

Mean s.d.
1989-1991 22.4 1.84
1992-1994 26.4 1.15

2. Two of the quality criteria in processing butter cookies are the weight and color
development in the final stages of oven browning. Individual pieces of cookies are
scanned by a spectrophotometer calibrated to reflect yellow-brown light. The readout is
expressed in per cent of a standard yellow-brown reference plate and a value of 41 is
considered optimal (golden-yellow). The cookies were also weighed in grams at this
stage. The means and standard deviations of 30 sample cookies are presented below.
Which of the two quality criteria is more varied?

Mean s.d.
Color 41.1 10
Weight 17.7 3.2
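
For both examples the comparison reduces to computing the CV for each row. A minimal sketch (the function name `cv` is an assumption):

```python
def cv(sd, mean):
    """Coefficient of variation, in percent."""
    return sd / mean * 100


# Example 1: quarterly peso-dollar exchange rate.
cv_1989_1991 = cv(1.84, 22.4)
cv_1992_1994 = cv(1.15, 26.4)

# Example 2: butter cookie quality criteria.
cv_color = cv(10, 41.1)
cv_weight = cv(3.2, 17.7)

print(round(cv_1989_1991, 2), round(cv_1992_1994, 2))  # 8.21 4.36
print(round(cv_color, 2), round(cv_weight, 2))         # 24.33 18.08
```

The smaller CV marks the less variable series: 1992 to 1994 in the first example and weight in the second.
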

• The Standard Score

Definition. The standard score measures how many standard deviations an observation
is above or below the mean. It is computed as $Z = \frac{X - \mu}{\sigma}$ and the sample
counterpart is $Z = \frac{X - \bar{X}}{S}$.

Remarks:
1. The standard score is not a measure of relative dispersion per se but is somewhat
related.
2. It is useful for comparing two values from different series, especially when the two
series differ with respect to the mean or the standard deviation or both, or when they are
expressed in different units.

Examples:
1. Robert got a grade of 75% in Stat 101 and a grade of 90% in Econ 11. The mean grade
in Stat 101 is 70% and the standard deviation is 10%, whereas in Econ 11, the mean
grade is 80% and the standard deviation is 20%. Relative to the other students, where did
he perform better?
2. In problem (1), if the mean grade in Stat 101 is 65%, in which subject did Robert
perform better?

3. Different typing skills are required for secretaries depending on whether one is
working in a law office, an accounting firm, or for a mathematical research group at a
major university. In order to evaluate candidates for these positions, an agency
administers 3 distinct standardized typing samples. A time penalty has been incorporated
into the scoring of each sample based on the number of typing errors. The mean and
standard deviation for each test, together with the scores achieved by Nancy, an
applicant, are given in the following table. Where do you think Nancy should be placed?

Sample Nancy’s Score Mean std. dev.


Law 141 sec 180 sec 30 sec
Accounting 7 min 10 min 2 min
Scientific 33 min 26 min 5 min
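
The three examples can be worked by computing a standard score for each value. In the sketch below, reading Nancy's scores as times (so that a lower standard score is better) is an assumption based on the time penalty mentioned in the problem.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Examples 1 and 2: Robert's grades.
z_stat = z_score(75, 70, 10)
z_econ = z_score(90, 80, 20)
z_stat_alt = z_score(75, 65, 10)   # Stat 101 mean changed to 65%

# Example 3: Nancy's typing samples (units cancel within each row).
nancy = {
    "Law": z_score(141, 180, 30),
    "Accounting": z_score(7, 10, 2),
    "Scientific": z_score(33, 26, 5),
}

print(z_stat, z_econ, z_stat_alt)  # 0.5 0.5 1.0
print({k: round(v, 2) for k, v in nancy.items()})
# {'Law': -1.3, 'Accounting': -1.5, 'Scientific': 1.4}

# Assumption: scores are times, so the most negative z is the best relative result.
best = min(nancy, key=nancy.get)
print(best)  # Accounting
```
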

Stockbrokers have a problem when they are considering two investments where the mean
rate of return is the same. They usually calculate the standard deviation of the rates of
return to assess the risk associated with the two investments. The investment with the
larger standard deviation is considered to have the greater risk.

In the 1941 Major League Baseball season, Ted Williams batted .406, and nobody has hit
over .400 since. The highest batting average in recent times was by Tony Gwynn, .394 in
1994. It is interesting to note that the mean batting average for all players has remained
at about .260 for 100 years. The standard deviation of that average, however, has declined
from .049 to .031. This indicates that there is less dispersion in the batting averages today
and helps explain why there have not been any .400 hitters in recent times.
4.3 MEASURES OF SKEWNESS
Definition. A measure of skewness shows the degree of asymmetry, or departure from
symmetry of a distribution. It indicates not only the amount of skewness but also the
direction.

Two Types of Skewness


1. Positively Skewed or Skewed to the Right
• distribution tapers more to the right than to the left
• longer tail to the right
• more concentration of values below than above the mean
• most skewed curves encountered in the social sciences are skewed to the right

2. Negatively Skewed or Skewed to the Left


• distribution tapers more to the left than to the right
• longer tail to the left
• more concentration of values above than below the mean
• only rarely do we find curves skewed to the left, and even more rarely do we find data
characteristically skewed to the left

Pearson’s First and Second Coefficients of Skewness

1. $Sk = \dfrac{\bar{X} - Mo}{S}$

2. $Sk = \dfrac{3(\bar{X} - Md)}{S}$

Remarks:
1. Since the mode is frequently only an approximation, formula 2 is preferred.
2. Interpretation of the measure of skewness:
Sk > 0: positively skewed since $\bar{x}$ > Md > Mo
Sk < 0: negatively skewed since $\bar{x}$ < Md < Mo
Sk = 0: symmetric since $\bar{x}$ = Md = Mo

Example: Refer to the final grade of 110 Statistics 101 students


$\bar{x}$ = 74.1 Md = 75 Mo = 84 s = 11.25
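
With these four summary values, both coefficients follow by substitution. A short sketch (variable names are illustrative):

```python
mean, md, mo, s = 74.1, 75, 84, 11.25

sk_first = (mean - mo) / s        # Pearson's first coefficient (uses the mode)
sk_second = 3 * (mean - md) / s   # Pearson's second coefficient (uses the median)

print(round(sk_first, 2))   # -0.88
print(round(sk_second, 2))  # -0.24
```

Both coefficients are negative, so the distribution of final grades is negatively skewed.
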
4.4 THE BOXPLOT
Definition. The boxplot is a graph that is very useful for displaying the following
features of the data:
• location
• spread
• symmetry
• extremes
• outliers

Steps in Constructing a Boxplot


1. Construct a rectangle with one end at the first quartile and the other end at the third
quartile.
2. Put a vertical line across the interior of the rectangle at the median.
3. Compute the interquartile range (IQR), lower fence (FL) and upper fence (FU) given by:
IQR = Q3 - Q1
FL = Q1 - 1.5 IQR
FU = Q3 + 1.5 IQR
4. Locate the smallest value contained in the interval [FL , Q1]. Draw a line from this
value to Q1.
5. Locate the largest value contained in the interval [Q3,FU]. Draw a line from this value
to Q3.
6. Values falling outside the fences are considered outliers and are usually denoted by
“x”.

Remarks:
1. The height of the rectangle is usually arbitrary and has no specific meaning. If several
boxplots appear together, however, the height is sometimes made proportional to the
different sample sizes.
2. If the outlying observation is less than Q1 - 3 IQR or greater than Q3 + 3 IQR, it is
identified with a circle at its actual location. Such an observation is called a far outlier.
Examples:

Set A: 1 15 21 22 24
10 18 22 23 25
14 20 22 24 28
Q1 = 15 IQR = 9
Q3 = 24 FL = 1.5
Md = 22 FU = 37.5

[Figure: boxplot of Set A, horizontal axis marked from 0 to 30]

Set B: 3 10 11 12 19
8 10 11 16 19
9 10 12 16 30
Q1 = 10 IQR = 6
Q3 = 16 FL = 1
Md = 11 FU = 25

2. Boxplot of the final grade of 110 Statistics 101 students.

[Figure: boxplot of the final grades, horizontal axis marked from 50 to 100]
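
The construction steps can be sketched numerically for Set A and Set B. The quartiles below use the i(n+1)/100 position rule from Section 3.5, with linear interpolation for fractional positions as an assumption (not needed for these two sets, where the positions are whole numbers). Function names are hypothetical.

```python
def value_at(array, pos):
    """Value at 1-based position pos of a sorted array (linear interpolation assumed)."""
    k = int(pos)
    frac = pos - k
    if k < 1:
        return array[0]
    if k >= len(array):
        return array[-1]
    return array[k - 1] + frac * (array[k] - array[k - 1])


def boxplot_summary(values):
    array = sorted(values)
    n = len(array)
    q1 = value_at(array, 25 * (n + 1) / 100)
    q3 = value_at(array, 75 * (n + 1) / 100)
    iqr = q3 - q1
    fl = q1 - 1.5 * iqr                     # lower fence
    fu = q3 + 1.5 * iqr                     # upper fence
    inside = [x for x in array if fl <= x <= fu]
    return {
        "Q1": q1,
        "Md": value_at(array, 50 * (n + 1) / 100),
        "Q3": q3,
        "IQR": iqr,
        "FL": fl,
        "FU": fu,
        "whiskers": (inside[0], inside[-1]),   # ends of the lines drawn in steps 4 and 5
        "outliers": [x for x in array if x < fl or x > fu],
    }


set_a = [1, 15, 21, 22, 24, 10, 18, 22, 23, 25, 14, 20, 22, 24, 28]
set_b = [3, 10, 11, 12, 19, 8, 10, 11, 16, 19, 9, 10, 12, 16, 30]

summary_a = boxplot_summary(set_a)
summary_b = boxplot_summary(set_b)

print(summary_a["FL"], summary_a["FU"], summary_a["outliers"])  # 1.5 37.5 [1]
print(summary_b["FL"], summary_b["FU"], summary_b["outliers"])  # 1.0 25.0 [30]
```
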
CHAPTER 5
Probability

5.1 RANDOM EXPERIMENTS, SAMPLE SPACES AND EVENTS

Definition of Terms
1. Random experiment – any process of generating a set of data or observations that
can be repeated under basically the same conditions and that leads to well-defined
outcomes
2. Sample space – set of all possible outcomes of an experiment, usually denoted by S
3. Sample point – an element of the sample space; an outcome
4. Event – any subset of the sample space, usually denoted by capital letters
5. Null space/Empty space – a subset of the sample space that contains no elements,
denoted by the symbol ∅
6. Simple event – an event which contains only one element of the sample space
7. Compound event – an event that can be expressed as the union of simple events,
thus containing more than one sample point
8. Mutually exclusive events – two events A and B are mutually exclusive if A ∩ B =
∅; that is, A and B have no elements in common
Remarks:
• An event is said to have occurred if the outcome of the experiment is one of the
sample points in the event.
• The empty space can be viewed as an event that will never happen. It is called the
impossible event.
• The sample space S, as an event, always occurs, and is referred to as the certain or
sure event.

Event Composition and Event Relations


1. A ∩ B – the intersection of events A and B is the event that both A and B occur
2. A ∪ B – the union of events A and B is the event that A or B or both occur
3. A’ or A^c – the complement of an event A with respect to S contains all elements of
S that are not in A and is the event that A does not occur

Some relationships between events can be illustrated by means of a Venn Diagram.


5.2 THE PROBABILITY CONCEPT AND SOME PROPERTIES
Defn The probability of an event A, denoted by P(A), is the sum of the probabilities
of mutually exclusive outcomes that constitute the event. It must satisfy the
following properties:
• 0 ≤ P(A) ≤ 1 for any event A
• P(S) = 1 where S is the sample space
• P(∅) = 0

Approaches to Assigning Probabilities


1. A Priori or Classical Probability – probability is determined even before the
experiment is performed using the following rule: If an experiment can result in
any one of N different equally likely outcomes, and if exactly n of these outcomes
correspond to event A, then the probability of event A is

$$P(A) = \frac{\text{number of sample points in } A}{\text{number of sample points in } S} = \frac{n}{N}$$

2. A Posteriori or Relative Frequency or Empirical Probability – probability is
determined by repeating the experiment a large number of times using the
following rule:

$$P(A) = \frac{\text{number of times } A \text{ occurred}}{\text{number of times experiment was repeated}}$$

The French naturalist Count Buffon (1707-1788) tossed a coin 4040 times.
Result: 2048 heads, or proportion 2048/4040 = .5069 for heads.
Around 1900, the English statistician Karl Pearson heroically tossed a coin 24,000
times. Result: 12,012 heads, a proportion of .5005.
While imprisoned by the Germans during World War II, the South African
statistician John Kerrich tossed a coin 10,000 times. Result: 5067 heads,
proportion of heads .5067.

3. Subjective Probability – probability is determined by the use of intuition, personal
beliefs, and other indirect information.

The late astronomer Carl Sagan believed that the probability of a major asteroid
hitting the Earth soon is high enough to be of concern. “The probability that the
Earth will be hit by a civilization-threatening small world in the next century is a
little less than one in a thousand.” To arrive at that probability, Sagan obviously
could not use the long-run frequency definition of probability. He would have to
use his own knowledge of astronomy, combined with past asteroid behavior.
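
The relative frequency approach (approach 2 above) is easy to imitate on a computer. The sketch below first recomputes the three historical proportions and then simulates tosses of a fair coin with Python's pseudo-random generator; the simulation, its seed, and the function name are illustrative assumptions, not part of the historical experiments.

```python
import random

# (heads, tosses) as reported above.
historical = {
    "Buffon": (2048, 4040),
    "Pearson": (12012, 24000),
    "Kerrich": (5067, 10000),
}
proportions = {name: heads / tosses for name, (heads, tosses) in historical.items()}
print({name: round(p, 4) for name, p in proportions.items()})
# {'Buffon': 0.5069, 'Pearson': 0.5005, 'Kerrich': 0.5067}


def simulate_heads_proportion(tosses, seed=0):
    """Relative frequency of heads in `tosses` simulated tosses of a fair coin."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses


# The proportion tends to settle near 0.5 as the number of tosses grows.
for n in (100, 10_000, 1_000_000):
    print(n, simulate_heads_proportion(n))
```
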
Examples:

1. a. In tossing a fair coin, what is the probability of getting a head? Of either a
head or tail? Of neither a head nor tail?
b. In tossing a fair die, what is the probability of getting a 3? Of getting an
even number? Of getting a number greater than 6?
2. A coin is biased so that a head is twice as likely to occur as a tail. If the coin is
tossed once, what is the probability of getting a head?
Rules of Counting
Theorem If an operation can be performed in n1 ways, and for each of these a second
operation can be performed in n2 ways, then the two operations can be
performed in n1n2 ways.

Example How many sample points are there in the sample space when a pair of
balanced dice is thrown once?

Without considering strategy in a game of chess, there are 400 ways of playing the first
round of moves.

Theorem (Multiplication Rule) If an operation can be performed in n1 ways, if for


each of these a second operation can be performed in n2 ways, if for each
of the first two a third operation can be performed in n3 ways, and so on,
then the sequence of k operations can be performed in n1n2 ... nk ways.

Examples:
1. How many even three-digit numbers can be formed from the digits 1, 2, 5, 6, and
9 if each digit can be used only once?
2. How many ways can a 10-question true-false examination be answered?

Theorem The number of combinations of n distinct objects taken r at a time is
$_nC_r = \frac{n!}{r!\,(n-r)!}$.

Example From 4 Republicans and 3 Democrats find the number of committees of 3


that can be formed with 2 Republicans and 1 Democrat.
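
The counting examples above can be verified by brute-force enumeration or with the standard library. This is a sketch for checking answers, not a substitute for the counting rules themselves.

```python
from itertools import permutations, product
from math import comb

# Sample points when a pair of dice is thrown once: 6 * 6.
dice_outcomes = list(product(range(1, 7), repeat=2))
print(len(dice_outcomes))  # 36

# Even three-digit numbers from the digits 1, 2, 5, 6, 9 without repetition.
digits = [1, 2, 5, 6, 9]
even_numbers = [p for p in permutations(digits, 3) if p[-1] % 2 == 0]
print(len(even_numbers))   # 24  (2 choices for the last digit, then 4 * 3)

# Ways to answer a 10-question true-false examination: 2 ** 10.
true_false_ways = len(list(product("TF", repeat=10)))
print(true_false_ways)     # 1024

# Committees of 3 with 2 of 4 Republicans and 1 of 3 Democrats.
committees = comb(4, 2) * comb(3, 1)
print(committees)          # 18
```
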
Theorems on Probabilities of Events
Thm1 P(A ∩ B^c) = P(A) – P(A ∩ B)
P(B ∩ A^c) = P(B) – P(A ∩ B)

Thm2 (Additive Rule) P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

Corollary If A and B are mutually exclusive, then
P(A ∪ B) = P(A) + P(B)
Corollary If A1, A2, . . . , An are mutually exclusive, then
P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An)
Thm3 If A and A^c are complementary events, then
P(A) + P(A^c) = 1.
Thm4 P[(A ∩ B)^c] = P(A^c ∪ B^c)
P[(A ∪ B)^c] = P(A^c ∩ B^c)

Examples:
1. The probability that a student passes Statistics is 2/3, and the probability that he
passes English is 4/9. If the probability of passing at least one of the two courses
is 4/5, what is the probability that he will pass both courses? fail both courses?
2. What is the probability of getting a total of 7 or 11 when a pair of dice is tossed?
3. In the toss of a fair coin 4 times, what is the probability of no head in the toss? At
least one head?
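
The three examples follow from the additive rule, the complement rule, and classical counting. The sketch below uses exact fractions so that the answers are not blurred by rounding; variable names are illustrative.

```python
from fractions import Fraction as F
from itertools import product

# Example 1: additive rule solved for P(A ∩ B), then the complement rule.
p_stat, p_eng, p_at_least_one_course = F(2, 3), F(4, 9), F(4, 5)
p_both = p_stat + p_eng - p_at_least_one_course
p_fail_both = 1 - p_at_least_one_course
print(p_both, p_fail_both)  # 14/45 1/5

# Example 2: totals of 7 and 11 are mutually exclusive events.
rolls = list(product(range(1, 7), repeat=2))
p_7_or_11 = F(sum(1 for a, b in rolls if a + b in (7, 11)), len(rolls))
print(p_7_or_11)  # 2/9

# Example 3: four tosses of a fair coin.
tosses = list(product("HT", repeat=4))
p_no_head = F(sum(1 for t in tosses if "H" not in t), len(tosses))
p_at_least_one_head = 1 - p_no_head
print(p_no_head, p_at_least_one_head)  # 1/16 15/16
```
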
Exercises: pp. 95-97 of Walpole nos. 1-20
1. Find the errors in each of the following statements:
a. The probability that it will rain tomorrow is 0.40 and the probability that it
will not rain tomorrow is 0.52.
b. The probabilities that a printer will make 0, 1, 2, 3, or 4 or more mistakes
in printing a document are, respectively, 0.19, 0.34, -0.25, 0.43, and 0.29.
c. The probabilities that an automobile salesperson will sell 0, 1, 2, or 3 cars
on any given day in February are, respectively, 0.19, 0.38, 0.29, and 0.15.
d. On a single draw from a deck of playing cards the probability of selecting
a heart is 1/4, the probability of selecting a black card is 1/2, and the
probability of selecting both a heart and a black card is 1/8.
2. An experiment involves tossing a pair of dice. Find the probability of event
a. A = sum is greater than 8
b. C = a number greater than 4 comes up on one die.
c. A ∩ C
3. Three men are seeking public office. Candidates A and B are given about the
same chance of winning, but candidate C is given twice the chance of either A or
B. What is the probability that C wins? A does not win?
4. A box contains 500 envelopes of which 75 contain $100 in cash, 150 contain $25,
and 275 contain $10. An envelope may be purchased for $25. Find the
probability that the first envelope purchased contains less than $100.
5. A 5-sided die with sides numbered 1, 2, 3, 4, and 5 is constructed so that the 1 and
5 occur twice as often as the 2 and 4, which occur three times as often as the 3.
What is the probability that a perfect square occurs when this die is tossed once?
6. If A and B are mutually exclusive events and P(A) = .3 and P(B) = .5, find
a. P(A ∪ B)
b. P(A’)
c. P(A’ ∩ B)
7. If A, B, and C are mutually exclusive events and P(A) = .2, P(B) = .3 and P(C) =
.2, find
a. P(A ∪ B ∪ C)
b. P[A’ ∩ (B ∪ C)]
c. P(B  C’)’
8. If a letter is chosen at random from the English alphabet, find the probability that
the letter
(a) is a vowel
(b) precedes the letter j
(c) follows the letter g.
9. If a permutation (rearrangement of the letters) of the word “white” is selected at
random, find the probability that the permutation
(a) begins with a consonant
(b) ends with a vowel
(c) has the consonants and vowels alternating.
10. If each coded item in a catalog begins with 3 distinct letters followed by 4 distinct
nonzero digits, find the probability of randomly selecting one of these coded
items with the first letter a vowel and the last digit even.
11. A pair of dice is thrown. Find the probability of getting (a) a total of 8; and (b) at
most a total of 5.
12. Two cards are drawn in succession from a deck without replacement. What is the
probability that both cards are greater than 2 and less than 8?
13. If 3 books are picked at random from a shelf containing 5 novels, 3 books of
poems, and a dictionary, what is the probability that (a) the dictionary is selected;
and (b) 2 novels and 1 book of poems are selected?
14. In a poker hand consisting of 5 cards, find the probability of holding (a) 3 aces;
and (b) 4 hearts and 1 club
15. In a game of Yahtzee, where 5 dice are tossed simultaneously, find the probability
of getting four of a kind.
16. In a college graduating class of 100 students, 54 studied mathematics, 69 studied
history, and 35 studied both mathematics and history. If one of these students is
selected at random, find the probability that the student
(a) takes mathematics or history
(b) does not take either of these subjects
(c) takes history but not mathematics.
17. Suppose that in a senior college class of 500 students it is found that 210 smoke,
258 drink alcoholic beverages, 216 eat between meals, 122 smoke and drink
alcoholic beverages, 83 eat between meals and drink alcoholic beverages, 97
smoke and eat between meals, and 52 engage in all three of these bad health
practices. If a member of this senior class is selected at random, find the
probability that the student
(a) smokes but does not drink alcoholic beverages
(b) eats between meals and drinks alcoholic beverages but does not smoke
(c) neither smokes nor eats between meals.
18. The probability that an American industry will locate in Munich is .7, the
probability that it will locate in Brussels is .4, and the probability that it will
locate in either Munich or Brussels or both is .8. What is the probability that the
industry will locate in
(a) both cities
(b) neither city?
19. From past experiences a stockbroker believes that under present economic
conditions a customer will invest in tax-free bonds with a probability of .6, will
invest in mutual funds with a probability of .3, and will invest in both tax-free
bonds and mutual funds with a probability of .15. At this time, find the
probability that a customer will invest in
(a) either tax-free bonds or mutual funds
(b) neither tax-free bonds nor mutual funds.
20. In a certain federal prison it is known that 2/3 of the inmates are under 25 years of
age. It is also known that 3/5 of the inmates are male and that 5/8 of the inmates
are female or 25 years of age or older. What is the probability that a prisoner
selected at random from this prison is female and at least 25 years old?
Defn The probability of an event B occurring when it is known that some event A has
occurred is called a conditional probability. It is defined as

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad \text{if } P(A) > 0$$

P(B|A) is read as “probability of B given A”.

Examples:
1. A random sample of 200 adults is classified below according to sex and the level
of education attained. If a person is picked at random from this group, find the
probability that the person
a. is a male, given that the person has a secondary education.
b. does not have a college degree, given that the person is a female.

Male Female
Elementary 38 45
Secondary 28 50
College 22 17

2. The probability that a regularly scheduled flight departs on time is .83, the
probability that it arrives on time is .92, and the probability that it departs and
arrives on time is .78. Find the probability that a plane (a) arrives on time given
that it departed on time, and (b) departed on time given that it has arrived on time.

3. Suppose there has been a crime and it is known that the criminal is a person
within a population of 6,000,000. Further, suppose it is known that in this
population only about one person in a million has a DNA type that matches the
DNA found at the crime scene, so let’s assume that there are six people in the
population with this DNA type. Someone in custody has this DNA type. We
know the person’s DNA matches, but what is the probability that he is actually
innocent?
Define A = DNA of randomly chosen person matches DNA at the crime scene
B = person selected is innocent of the crime
A ∩ B = event that the selected person is innocent and the DNA matches
So that
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{5/6{,}000{,}000}{6/6{,}000{,}000} = \frac{5}{6} \]
And
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{5/6{,}000{,}000}{5{,}999{,}999/6{,}000{,}000} = \frac{5}{5{,}999{,}999}. \]
If you were the jury, it would be important to realize that without additional
evidence, the probability that this person is innocent is 5/6, even though the DNA
matches. The prosecutor surely would emphasize the other conditional
probability.
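Examples 1 and 2 above can be checked with a few lines of Python. This is only a sketch: the helper name conditional and the variable names are illustrative, and "does not have a college degree" is read here as "elementary or secondary", which is an assumption about the intended meaning.

```python
from fractions import Fraction

# Counts from the table in Example 1 (level of education by sex).
counts = {
    ("Elementary", "Male"): 38, ("Elementary", "Female"): 45,
    ("Secondary", "Male"): 28, ("Secondary", "Female"): 50,
    ("College", "Male"): 22, ("College", "Female"): 17,
}

def conditional(event, given):
    """P(event | given) = n(event and given) / n(given), all persons equally likely."""
    n_given = sum(n for cell, n in counts.items() if given(cell))
    n_both = sum(n for cell, n in counts.items() if given(cell) and event(cell))
    return Fraction(n_both, n_given)

# 1(a): P(male | secondary education)
p_male_given_secondary = conditional(lambda c: c[1] == "Male",
                                     lambda c: c[0] == "Secondary")
# 1(b): P(no college degree | female), assuming "no college degree"
# means elementary or secondary.
p_no_college_given_female = conditional(lambda c: c[0] != "College",
                                        lambda c: c[1] == "Female")

# Example 2: D = departs on time, A = arrives on time.
p_d, p_a, p_d_and_a = 0.83, 0.92, 0.78
p_a_given_d = p_d_and_a / p_d   # P(A | D)
p_d_given_a = p_d_and_a / p_a   # P(D | A)

print(p_male_given_secondary, p_no_college_given_female)
print(round(p_a_given_d, 4), round(p_d_given_a, 4))
```

Using exact fractions for the table keeps the answers in the same form one would write by hand.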
Defn Two events A and B are said to be independent if any one of the following
conditions is satisfied:
(a) P(A|B) = P(A) if P(B)>0
(b) P(B|A) = P(B) if P(A)>0
(c) P(A ∩ B) = P(A) P(B)
Otherwise, the events are said to be dependent.
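
Condition (c) is usually the easiest of the three to check mechanically. The sketch below, in Python with exact fractions, applies it to the weighted die described in Example 2 below (even faces twice as likely as odd faces). The helper names p and independent are illustrative only.

```python
from fractions import Fraction

# Weighted die: each even face is twice as likely as each odd face.
weights = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2}
total = sum(weights.values())
prob = {face: Fraction(w, total) for face, w in weights.items()}

def p(event):
    """Probability of an event given as a set of faces."""
    return sum(prob[face] for face in event)

def independent(a, b):
    """Condition (c): A and B are independent iff P(A ∩ B) = P(A) P(B)."""
    return p(a & b) == p(a) * p(b)

A = {4, 5, 6}   # a number greater than 3
B = {1, 4}      # a perfect square

print(p(A), p(B), p(A & B), independent(A, B))
```

Exact fractions avoid the rounding problems that would arise from comparing floating-point products for equality.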

Examples:
1. Consider an experiment in which 2 cards are drawn in succession from an
ordinary deck, with replacement. Define
A: the first card is an ace
B: the second card is a spade
Are A and B independent events?

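[Tree diagram: the first branch splits into Ace and Ace^c (not an ace); from each of these, the second branch splits into Spade and Spade^c (not a spade).]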

2. Consider the following events in the toss of a single die where even numbers are
twice as likely to occur as the odd numbers:
A: Get a number greater than 3
B: Get a perfect square
Are A and B independent events?
3. Suppose that we have a fuse box containing 20 fuses, of which 5 are defective. If
2 fuses are selected at random and removed from the box in succession without
replacing the first, what is the probability that both are defective?

4. A small town has one fire engine and one ambulance available for emergencies.
The probability that the fire engine is available when needed is .98, and the
probability that the ambulance is available when called is .92. In the event of an
injury resulting from a burning building, find the probability that both the
ambulance and the fire engine will be available.

5. Three cards are drawn in succession, without replacement, from an ordinary deck
of playing cards. Find the probability that the first card is a red ace, the second
card is a ten or jack, and the third card is greater than 3 but less than 7.

6. A coin is biased so that a head is twice as likely to occur as a tail. If the coin is
tossed 3 times, what is the probability of getting 2 tails and 1 head?

7. Assuming birth months (days) are equally likely, what is the probability that the
next two unrelated strangers you meet both share your birth month (day)?
8. Sudden infant death syndrome (SIDS) causes babies to die suddenly (often in
their cribs) with no explanation. Deaths from SIDS have been greatly reduced by
placing babies on their backs, but as yet no cause is known.
When more than one SIDS death occurs in a family, the parents are sometimes
accused. One “expert witness” popular with prosecutors in England told juries
that there is only a 1 in 73 million chance that two children in the same family
could have died naturally. Here’s his calculation: the rate of SIDS in a
nonsmoking middle-class family is 1 in 8500. So the probability of two deaths is
\[ \frac{1}{8500} \times \frac{1}{8500} = \frac{1}{72{,}250{,}000}. \]
Several women were convicted of murder on this basis,
without any direct evidence that they harmed their children.
As the Royal Statistical Society said, this reasoning is nonsense. It assumes that
SIDS deaths in the same family are independent events. The cause of SIDS is
unknown: “There may well be unknown genetic or environmental factors that
predispose families to SIDS, so that a second case in the family becomes much
more likely.” The British government decided to review the cases of 258 parents
convicted of murdering their babies.
9. Many people who come to clinics to be tested for HIV, the virus that causes
AIDS, don’t come back to learn the test results. Clinics now use “rapid HIV
tests” that give a result in a few minutes. The false positive rate for a diagnostic
test is the probability that a person with no disease will have a positive test result.
For the rapid HIV tests, the Food and Drug Administration has established 2% as
the maximum false positive rate. If a clinic uses a test that meets the FDA
standard and tests 50 people who are free of HIV antibodies, what is the
probability that at least one false-positive will occur?
P(at least one positive) = 1 – P(no positives)
= 1 – P(50 negatives)
= 1 – (1 – .02)^50 = .6358
There is approximately a 64% chance that at least one of the 50 people will test
positive for HIV, even though no one has the virus.
Concern about excessive numbers of false positives led the New York City
Department of Health and Mental Hygiene to suspend the use of one particular
rapid HIV test.
10. Only 5% of male high school basketball, baseball, and football players go on to
play at the college level. Of these, only 1.7% enter major league professional
sports. About 40% of the athletes who compete in college and then reach the pros
have a career of more than three years. Define these events: A = {competes in
college}, B = {competes professionally}, C = {pro career longer than 3 years}.
What is the probability that a high school athlete competes in college and then
goes on to have a pro career of more than three years?
We know that P(A) = .05, P(B|A) = .017, P(C|A ∩ B) = .4. The probability we
want is therefore P(A ∩ B ∩ C) = P(A) P(B|A) P(C|A ∩ B)
= .05 × .017 × .4 = .00034
Only about 3 of every 10,000 high school athletes can expect to compete in
college and have a professional career of more than three years. High school
students would be wise to concentrate on studies rather than on unrealistic hopes
of fortune from pro sports.
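The arithmetic in Examples 8 to 10 can be reproduced in a few lines of Python. This is a sketch only; the variable names are illustrative, and the calculation for Example 9 assumes, as the example does, that the 50 test results are independent.

```python
from fractions import Fraction

# Example 8: the (flawed) calculation that treats two SIDS deaths as independent.
p_two_deaths = Fraction(1, 8500) * Fraction(1, 8500)

# Example 9: complement rule for "at least one" among independent tests.
false_positive_rate = 0.02
n_tests = 50
p_no_false_positive = (1 - false_positive_rate) ** n_tests
p_at_least_one = 1 - p_no_false_positive

# Example 10: general multiplication rule P(A)P(B|A)P(C|A and B).
p_a = 0.05
p_b_given_a = 0.017
p_c_given_ab = 0.4
p_abc = p_a * p_b_given_a * p_c_given_ab

print(p_two_deaths)
print(round(p_at_least_one, 4))
print(round(p_abc, 5))
```

Note that Example 10 does not require independence: each factor is conditioned on the events that came before it.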
Exercises: pp. 105-108 of Walpole nos. 1-18
1. If R is the event that a convict committed armed robbery and D is the event that
the convict pushed dope, state in words what probabilities are expressed by
a. P(R|D)
b. P(D’|R)
c. P(R’|D’)
2. A class in advanced physics is comprised of 10 juniors, 30 seniors, and 10
graduate students. The final grades showed that 3 of the juniors, 10 seniors, and 5
graduate students received an A for the course. If a student is chosen at random
from this class and is found to have earned an A, what is the probability that he or
she is a senior?
3. Consider the event B of getting a perfect square when a die is tossed. The die is
constructed so that the even numbers are twice as likely to occur as the odd
numbers. Suppose it is known that the toss of the die resulted in A = a number
greater than 3. Find P(B|A).
4. In the senior year of a high school graduating class of 100 students, 42 studied
mathematics, 68 studied psychology, 54 studied history, 22 studied both
mathematics and history, 25 studied both mathematics and psychology, 7 studied
history but neither mathematics nor psychology, 10 studied all three subjects, and
8 did not take any of the three. If a student is selected at random, find the
probability that a person
(a) enrolled in psychology takes all three subjects
(b) not taking psychology is taking both history and mathematics.
5. A pair of dice is thrown. If it is known that one die shows a 4, what is the
probability that
(a) the other die shows a 5
(b) the total of both dice is greater than 7.
6. A card is drawn from an ordinary deck and we are told that it is red. What is the
probability that the card is greater than 2 but less than 9?
7. The probability that an automobile being filled with gasoline will also need an oil
change is .25, the probability that it needs a new oil filter is .4, and the probability
that both the oil and filter need changing is .14.
(a) If the oil had to be changed, what is the probability that a new oil filter is
needed?
(b) If a new oil filter is needed, what is the probability that the oil has to be
changed?
8. The probability that a married man watches a certain television show is .4 and the
probability that a married woman watches the show is .5. The probability that a
man watches the show, given that his wife does, is .7. Find the probability that
(a) a married couple watches the show
(b) a wife watches the show given that her husband does
(c) at least one person of a married couple will watch the show.
9. The probability that a vehicle entering the Luray Caverns has Canadian license
plates is .12, the probability that it is a camper is .28, and the probability that it is
a camper with Canadian license plates is .09. What is the probability that
(a) a camper entering the Luray Caverns has Canadian license plates?
(b) a vehicle with Canadian license plates entering the Luray Caverns is a
camper?
(c) a vehicle entering the Luray Caverns does not have Canadian license plates
or is not a camper?
10. The probability that the lady of the house is home when the Avon representative
calls is .6. Given that the lady of the house is home, the probability that she
makes a purchase is .4. Find the probability that the lady of the house is home
and makes a purchase when the Avon representative calls.
11. The probability that a doctor correctly diagnoses a particular illness is .7. Given
that the doctor makes an incorrect diagnosis, the probability that the patient enters
a lawsuit is .9. What is the probability that the doctor makes an incorrect
diagnosis and the patient sues?
12. One bag contains 4 white balls and 3 black balls, and a second bag contains 3
white balls and 5 black balls. One ball is drawn at random from the first bag and
placed unseen in the second bag. What is the probability that a ball now drawn
from the second bag is black? (Hint: Let B1, B2, and W1 represent, respectively,
the drawing of a black ball from bag 1, a black ball from bag 2, and a white ball
from bag 1. We are interested in B1 ∩ B2 and W1 ∩ B2.)
13. A real estate agent has 8 master keys to open several new homes. Only 1 master
key will open any given house. If 40% of these homes are usually left unlocked,
what is the probability that the real estate agent can get into a specific home if the
agent selects 3 master keys at random before leaving the office? (hint: Let A =
the house is open and B = the correct key is one of the 3 selected before leaving
the office. One event is A’ ∩ B.)
14. A town has 2 fire engines operating independently. The probability that a specific
fire engine is available when needed is .96. What is the probability that
(a) neither is available when needed
(b) a fire engine is available when needed?
15. If the probability that Tom will be alive in 20 years is .7 and the probability that
Nancy will be alive in 20 years is .9, what is the probability that neither will be
alive in 20 years?
16. The probability that a person visiting his dentist will have an x-ray is .6; the
probability that a person who has an x-ray will also have a cavity filled is .3; and
the probability that the person who has had an x-ray and a cavity filled will also
have a tooth extracted is .1. What is the probability that a person visiting his
dentist will have an x-ray, a cavity filled, and a tooth extracted?
17. Find the probability of randomly selecting 4 good quarts of milk in succession
from a cooler containing 20 quarts of which 5 are spoiled.
18. From a box containing 6 black balls and 4 green balls, 3 balls are drawn in
succession, each ball being replaced in the box before the next draw is made.
What is the probability that all 3 are the same color? Each color is represented?
CHAPTER 6
Probability Distributions

6.1 CONCEPT OF A RANDOM VARIABLE


Defn A function whose value is a real number determined by each element in the
sample space is called a random variable.

Remark We shall use an uppercase letter, say X, to denote a random variable and
its corresponding lowercase letter, x in this case, for one of its values.

Examples:
1. (Experiment No. 1) An experiment consists of tossing a coin 3 times and
observing the result. The possible outcomes and the values of the random
variables X and Y, where X is the number of heads and Y is the number of heads
minus the number of tails are

Sample Points x y
HHH 3 3
HHT 2 1
HTH 2 1
HTT 1 -1
THH 2 1
THT 1 -1
TTH 1 -1
TTT 0 -3

2. (Experiment No. 2) A hatcheck girl returns 3 hats at random to 3 customers who
had previously checked them. If Jason, Charlie, and Ohmar, in that order, receive
one of the hats, list the sample points for the possible orders of returning the hats
and find the values m of the random variable M that represents the number of
correct matches.
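
Both experiments are small enough to enumerate by machine. The following Python sketch lists every sample point together with the value of each random variable; the names coin_table and hat_table are illustrative only.

```python
from itertools import permutations, product

# Experiment No. 1: all 2**3 outcomes of tossing a coin 3 times.
# X = number of heads, Y = number of heads minus number of tails.
coin_table = []
for outcome in product("HT", repeat=3):
    heads = outcome.count("H")
    tails = outcome.count("T")
    coin_table.append(("".join(outcome), heads, heads - tails))

# Experiment No. 2: all 3! orders in which the hats can be returned.
# Position i of `hats` is the hat handed to owners[i];
# M = number of customers who receive their own hat.
owners = ("Jason", "Charlie", "Ohmar")
hat_table = []
for hats in permutations(owners):
    m = sum(owner == hat for owner, hat in zip(owners, hats))
    hat_table.append((hats, m))

for row in coin_table:
    print(*row)
for hats, m in hat_table:
    print(hats, m)
```

The coin outcomes come out in the same order as the table in Example 1, which makes the two easy to compare line by line.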
6.2 DISCRETE PROBABILITY DISTRIBUTIONS
Defn If a sample space contains a finite number of possibilities or an unending
sequence with as many elements as there are whole numbers, it is called a
discrete sample space.

Defn A random variable defined over a discrete sample space is called a discrete
random variable.

Defn A table or formula listing all possible values that a discrete random variable can
take on, along with the associated probabilities, is called a discrete probability
distribution.

Remark The probabilities associated with all possible values of a discrete random
variable must sum to 1.

Examples:
1. For Experiment No. 1, the discrete probability distributions of the random
variables X and Y are

X 0 1 2 3
P(X=x) 1/8 3/8 3/8 1/8

Y -3 -1 1 3
P(Y=y) 1/8 3/8 3/8 1/8

2. Construct the discrete probability distribution for the random variable M defined
in Experiment No. 2.
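
When all sample points are equally likely, a discrete probability distribution can be built by counting how often each value occurs. The Python sketch below does this for X and Y of Experiment No. 1, assuming a fair coin; the function name pmf is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def pmf(values):
    """Distribution of a random variable whose values are listed once per
    equally likely sample point."""
    counts = Counter(values)
    n = len(values)
    return {v: Fraction(c, n) for v, c in sorted(counts.items())}

outcomes = list(product("HT", repeat=3))   # 8 equally likely sample points
pmf_x = pmf([o.count("H") for o in outcomes])
pmf_y = pmf([o.count("H") - o.count("T") for o in outcomes])

print(pmf_x)
print(pmf_y)
print(sum(pmf_x.values()), sum(pmf_y.values()))   # each must equal 1
```

The same function can be applied to the list of m values from Experiment No. 2, since the six orders of returning the hats are equally likely.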
6.3 EXPECTED VALUES
Defn Let X be a discrete random variable with probability distribution

x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)

The mean or expected value of X is

\[ \mu_X = E(X) = \sum_{i=1}^{n} x_i f(x_i) \]

Examples:
1. Find the mean of the random variables X and Y of Experiment No. 1.

X         0     1     2     3
P(X=x)   1/8   3/8   3/8   1/8

\[ \mu_X = E(X) = (0)(1/8) + (1)(3/8) + (2)(3/8) + (3)(1/8) = 12/8 = 1.5 \]

Y        -3    -1     1     3
P(Y=y)   1/8   3/8   3/8   1/8

\[ \mu_Y = E(Y) = (-3)(1/8) + (-1)(3/8) + (1)(3/8) + (3)(1/8) = 0 \]

2. Find the expected number of correct matches in Experiment No. 2.


3. In a gambling game a man is paid P50 if he gets all heads or all tails when 3 coins
are tossed, and he pays out P30 if either 1 or 2 heads show. What is his expected
gain?
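The expected values in Examples 1 and 3 can be checked with a short Python sketch. It assumes the three coins in Example 3 are fair, so the number of heads has the same distribution as X in Experiment No. 1; the function name expected_value is illustrative.

```python
from fractions import Fraction as F

def expected_value(pmf):
    """E(X) = sum of x * f(x) over all values x."""
    return sum(x * p for x, p in pmf.items())

pmf_x = {0: F(1, 8), 1: F(3, 8), 2: F(3, 8), 3: F(1, 8)}
pmf_y = {-3: F(1, 8), -1: F(3, 8), 1: F(3, 8), 3: F(1, 8)}

# Example 3: gain is +50 when 0 or 3 heads show, -30 when 1 or 2 heads show.
pmf_gain = {50: pmf_x[0] + pmf_x[3], -30: pmf_x[1] + pmf_x[2]}

print(expected_value(pmf_x))     # mean of X
print(expected_value(pmf_y))     # mean of Y
print(expected_value(pmf_gain))  # expected gain in pesos
```

A negative expected gain means that, on average, the player loses money per game.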
Thm Let X be a discrete random variable with probability distribution

x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)

The mean or expected value of the random variable g(X) is

\[ E[g(X)] = \sum_{i=1}^{n} g(x_i) f(x_i) \]

Defn Let X be a random variable with mean µ. Then the variance of X is

\[ \sigma_X^2 = V(X) = E[(X - \mu)^2] \]

Defn Let X be a discrete random variable with probability distribution

x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)

The variance of X is

\[ \sigma_X^2 = V(X) = E[(X - \mu)^2] = \sum_{i=1}^{n} (x_i - \mu)^2 f(x_i) \]

Thm Computational Formula for \( \sigma_X^2 \)

\[ \sigma_X^2 = \mathrm{Var}(X) = E(X^2) - [E(X)]^2 \]

Example In Experiment No. 1, find the variance of X.

Using the definition of Var(X),

\[ \mu_X = E(X) = 1.5 \]
\[ \sigma_X^2 = (0 - 1.5)^2 (1/8) + (1 - 1.5)^2 (3/8) + (2 - 1.5)^2 (3/8) + (3 - 1.5)^2 (1/8) = 0.75 \]
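
The variance of X in Experiment No. 1 can also be checked by machine, both from the definition and from the computational formula. The Python sketch below is illustrative only; the function names are not from the notes.

```python
from fractions import Fraction as F

pmf_x = {0: F(1, 8), 1: F(3, 8), 2: F(3, 8), 3: F(1, 8)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * f(x) over all values x."""
    return sum(g(x) * p for x, p in pmf.items())

def variance_by_definition(pmf):
    """Var(X) = E[(X - mu)^2]."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** 2)

def variance_by_formula(pmf):
    """Var(X) = E(X^2) - [E(X)]^2."""
    return expectation(pmf, lambda x: x ** 2) - expectation(pmf) ** 2

print(variance_by_definition(pmf_x), variance_by_formula(pmf_x))
```

Both functions rely on the theorem for E[g(X)], with g(x) = (x - µ)² in the first and g(x) = x² in the second.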

Using the computational formula of Var(X),

\[ E(X^2) = (0)^2 (1/8) + (1)^2 (3/8) + (2)^2 (3/8) + (3)^2 (1/8) = 24/8 = 3 \]
\[ \sigma_X^2 = 3 - (1.5)^2 = 0.75 \]
Binomial Distribution
Defn A binomial experiment is one that possesses the following properties:
• the experiment consists of n identical trials
• each trial results in one of two outcomes, a “success” or a “failure”
• the probability of success on a single trial is equal to p and remains the
same from trial to trial. The probability of a failure is equal to q = 1 - p.
• the trials are independent

The random variable of interest X, the number of successes observed in n trials, is
called a binomial random variable.

Defn The discrete probability distribution of the binomial random variable is

\[ P(X = x) = {}_{n}C_{x} \, p^{x} (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n \text{ and } 0 < p < 1 \]

Notation: If X follows the above distribution, we will write X~Bi(n, p).

Note: If X~Bi(n, p) then E(X) = np and Var(X) = npq.

Examples:
1. Find the probability of obtaining exactly three 2’s if an ordinary die is tossed 5
times.
2. In a certain city district the need for money to buy drugs is given as the reason for
75% of all thefts. What is the probability that exactly 2 of the next 4 theft cases
reported in this district resulted from the need for money to buy drugs?
3. The probability that a patient recovers from a rare blood disease is .4. If 15
people are known to have contracted this disease, what is the probability that
(a) exactly 5 survive; (b) from 3 to 8 survive; and (c) at least 10 survive?
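The binomial formula is easy to evaluate directly. The Python sketch below applies it to the three examples above, using math.comb for nCx. The function names binomial_pmf and binomial_range are illustrative, and "5 survive" in Example 3(a) is read as "exactly 5 survive".

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x) for X ~ Bi(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_range(lo, hi, n, p):
    """P(lo <= X <= hi), obtained by adding the individual probabilities."""
    return sum(binomial_pmf(x, n, p) for x in range(lo, hi + 1))

ex1 = binomial_pmf(3, 5, 1 / 6)          # exactly three 2's in 5 tosses
ex2 = binomial_pmf(2, 4, 0.75)           # exactly 2 of 4 thefts
ex3a = binomial_pmf(5, 15, 0.4)          # exactly 5 of 15 survive
ex3b = binomial_range(3, 8, 15, 0.4)     # from 3 to 8 survive
ex3c = binomial_range(10, 15, 15, 0.4)   # at least 10 survive

# Check of the note E(X) = np for n = 15, p = .4.
mean_15 = sum(x * binomial_pmf(x, 15, 0.4) for x in range(16))

for value in (ex1, ex2, ex3a, ex3b, ex3c, mean_15):
    print(round(value, 4))
```

Summing individual probabilities replaces the cumulative binomial tables that would otherwise be used for parts (b) and (c).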
Exercises: pp. 165-166 of Walpole nos. 4, 6-10, 12, 13
4. A baseball player’s batting average is .250. What is the probability that he gets
exactly 1 hit in his next 5 times at bat?
6. A multiple-choice quiz has 15 questions, each with 4 possible answers of which
only 1 is the correct answer. What is the probability that sheer guesswork yields
a. exactly 10 correct answers
b. at least 1 correct answer
c. 5 to 10 correct answers .
7. The probability that a patient recovers from a delicate heart operation is .9. What
is the probability that exactly 5 of the next 7 patients having this operation
survive?
8. A study conducted at George Washington University and the National Institute of
Health examined national attitudes about tranquilizers. The study revealed that
approximately 70% believe “tranquilizers don’t really cure anything, they just
cover up the real trouble.” According to this study, what is the probability that at
least 3 of the next 5 people selected at random will be of the opinion that
tranquilizers do cure the problem rather than just cover it up?
9. A survey of the residents in a United States city showed that 20% preferred a
white telephone over any other color available. What is the probability that more
than one-half of the next 20 telephones installed in this city will be white?
10. One-fourth of the female freshmen entering a Virginia college are out-of-state
students. If the students are assigned at random to the dormitories, 3 to a room,
what is the probability that in one room at most 2 of the 3 roommates are out-of-
state students?
12. Suppose that airplane engines operate independently in flight and fail with
probability q = .2. Assuming that a plane makes a safe flight if at least one-half of
its engines run, determine whether a 4-engine plane or a 2-engine plane has the
higher probability for a successful flight.
13. Repeat Exercise 12 for q =.5 and q = 1/3.
Near the end of World War II, the Germans developed rocket bombs, which were fired at
the city of London. The Allied military command did not know whether these bombs
were fired at random or whether they had some type of aiming device. To investigate,
the city of London was divided into 576 square regions and the number of hits per region
was counted and compared with the expected number of hits under a special discrete
probability distribution. Because the actual number of hits was close to the expected
number of hits, the military command concluded that the bombs were falling at random.
The Germans had not developed a bomb with an aiming device.

6.4 Continuous Probability Distributions


Defn If a sample space contains an infinite number of possibilities equal to the number
of points on a line segment, it is called a continuous sample space.

Defn A random variable defined over a continuous sample space is called a continuous
random variable.

Defn The function with values f(x) is called a probability density function for the
continuous random variable X, if
• the total area under its curve and above the horizontal axis is equal to
1; and
• the area under the curve between any two ordinates x = a and x = b
gives the probability that X lies between a and b.

Remarks:
1. A continuous random variable has a probability of zero of assuming exactly any
of its values, that is, if X is a continuous random variable, then P(X=x) = 0 for all
real numbers x.
2. The probability density function cannot be represented in tabular form.

Example A continuous random variable X that can assume values between 0 and 2
has a density function given by

\[ f(x) = \begin{cases} .5, & \text{for } 0 \le x \le 2 \\ 0, & \text{otherwise} \end{cases} \]

Find the following probabilities:


a. P(1 < X < 2).
b. P(X > 1.5)
c. P(X < 0.75)
d. P(X = 0.75)
e. P(X ≤
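
Because the density in this example is constant on the interval from 0 to 2, every probability is the area of a rectangle. The Python sketch below computes parts (a) to (d) that way; the names HEIGHT, LOW, HIGH and prob_between are illustrative.

```python
HEIGHT = 0.5            # value of f(x) between LOW and HIGH
LOW, HIGH = 0.0, 2.0    # f(x) = 0 outside this interval

def prob_between(a, b):
    """P(a < X < b): area under f between a and b (a rectangle here)."""
    lo = max(a, LOW)
    hi = min(b, HIGH)
    return HEIGHT * max(hi - lo, 0.0)

p_a = prob_between(1, 2)                    # P(1 < X < 2)
p_b = prob_between(1.5, float("inf"))       # P(X > 1.5)
p_c = prob_between(float("-inf"), 0.75)     # P(X < 0.75)
p_d = prob_between(0.75, 0.75)              # P(X = 0.75): zero width, zero area

print(p_a, p_b, p_c, p_d)
print(prob_between(float("-inf"), float("inf")))   # total area
```

Part (d) illustrates Remark 1: a single point has no width, so its probability is zero.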
Properties of the Mean and Variance
Let X and Y be random variables (discrete or continuous) and let a and b be constants.
1. E(aX ± b) = aE(X) ± b
   • E(aX) = aE(X).
   • E(b) = b.
2. E(aX ± bY) = aE(X) ± bE(Y)
3. E(XY) = E(X)E(Y) if X and Y are independent.
4. E[X - E(X)] = 0.
5. Var(aX ± b) = a²Var(X).
   • Var(aX) = a²Var(X).
   • Var(b) = 0.
6. If X and Y are independent then Var(aX ± bY) = a²Var(X) + b²Var(Y)

Example If X and Y are independent random variables with E(X) = 3, E(Y) = 2,
Var(X) = 2 and Var(Y) = 1, find
a. E(3X + 5)
b. Var(3X + 5)
c. E(XY)
d. Var(3X - 2Y)
e. Var(-2X + 4Y - 3)

Example A used car dealer finds that in any day, the probability of selling no car is
0.4, one car is 0.2, two cars is 0.15, 3 cars is 0.10, 4 cars is 0.08, five cars
is 0.06 and six cars is 0.01. Let g(X) = 500 + 1500X represent the
salesman’s daily earnings, where X is the number of cars sold. Find the
salesman’s expected daily earnings.
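The two examples above follow directly from the listed properties. The Python sketch below applies them, assuming X and Y are independent as stated; the helper names mean_linear and var_linear are illustrative.

```python
# First example: E(X) = 3, E(Y) = 2, Var(X) = 2, Var(Y) = 1, X and Y independent.
e_x, e_y = 3, 2
var_x, var_y = 2, 1

def mean_linear(a, b, c=0):
    """E(aX + bY + c) = aE(X) + bE(Y) + c  (properties 1 and 2)."""
    return a * e_x + b * e_y + c

def var_linear(a, b):
    """Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y)  (properties 5 and 6);
    the constant c does not affect the variance."""
    return a ** 2 * var_x + b ** 2 * var_y

part_a = mean_linear(3, 0, 5)    # E(3X + 5)
part_b = var_linear(3, 0)        # Var(3X + 5)
part_c = e_x * e_y               # E(XY), by property 3
part_d = var_linear(3, -2)       # Var(3X - 2Y)
part_e = var_linear(-2, 4)       # Var(-2X + 4Y - 3)

# Second example: daily car sales and earnings g(X) = 500 + 1500X.
pmf_cars = {0: 0.40, 1: 0.20, 2: 0.15, 3: 0.10, 4: 0.08, 5: 0.06, 6: 0.01}
expected_cars = sum(x * p for x, p in pmf_cars.items())
expected_earnings = 500 + 1500 * expected_cars   # property 1 with a = 1500, b = 500

print(part_a, part_b, part_c, part_d, part_e)
print(round(expected_cars, 2), round(expected_earnings, 2))
```

In part (d) the variances add even though Y is subtracted, because the coefficient of Y is squared.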
THE NORMAL DISTRIBUTION
Defn A continuous random variable X is said to be normally distributed if its density
function is given by:

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}} \]

for -∞ < x < ∞, where -∞ < µ < ∞ and σ > 0, with e ≈ 2.71828 and π ≈ 3.14159.

Notation: If X follows the above distribution, we write X ~ N(µ, σ²).

Note: If X ~ N(µ, σ²), then E(X) = µ and Var(X) = σ².

Johann Carl Friedrich Gauss (1777-1855)

The graph of the normal (or Gaussian) distribution is called the normal curve.

[Figure: normal curve, with the horizontal axis marked at µ-3σ, µ-2σ, µ-σ, µ, µ+σ, µ+2σ, µ+3σ]


Example of two normal curves with µ1 ≠ µ2 and σ1 = σ2

Example of two normal curves with µ1 = µ2 and σ1 ≠ σ2

Properties:
1. The curve is bell-shaped and symmetric about a vertical axis through the mean µ.
2. The normal curve approaches the horizontal axis asymptotically as we proceed in
either direction away from the mean.
3. The total area under the curve and above the horizontal axis is equal to 1.

Defn The distribution of a normal random variable with mean zero and standard
deviation equal to 1 is called a standard normal distribution.

If X ~ N(µ, σ²), then X can be transformed into a standard normal random variable
through the following,

\[ Z = \frac{X - \mu}{\sigma} \]

Hence, whenever X is between the values x1 and x2, the random variable Z will fall
between the corresponding values

\[ z_1 = \frac{x_1 - \mu}{\sigma} \quad \text{and} \quad z_2 = \frac{x_2 - \mu}{\sigma} \]

Thus, P(x1 < X < x2) = P(z1 < Z < z2).


Examples:
1. Given a normal distribution with µ = 300 and σ = 50, find the probability that X
assumes a value greater than 362.
2. Given a normal distribution with µ = 50 and σ = 10, find the probability that X
assumes a value between 45 and 62.
3. Given a normal distribution with µ = 40 and σ = 6, find the value of x that has
(a) 5% of the area above it and (b) 38% of the area below it.
4. A certain type of storage battery lasts on the average 3.0 years, with a standard
deviation of .5 year. Assuming that the battery lives are normally distributed, find
the probability that a given battery will last less than 2.3 years.
5. An electrical firm manufactures light bulbs that have a length of life that is
normally distributed with mean equal to 800 hours and a standard deviation of 40
hours. Find the probability that a bulb burns between 778 and 834 hours.
6. On an examination the average grade was 74 and the standard deviation was 7. If
the grades are curved to follow a normal distribution, find D6.
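As a cross-check on table lookups, the examples above can be evaluated with the NormalDist class from Python's standard statistics module (Python 3.8 or later). This is a sketch only: the helper prob_between is illustrative, and D6 in Example 6 is read here as the sixth decile, that is, the grade with 60% of the area below it, which is an assumption about the notation.

```python
from statistics import NormalDist

def prob_between(mu, sigma, lo, hi):
    """P(lo < X < hi) for X ~ N(mu, sigma^2), as a difference of two areas."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(hi) - dist.cdf(lo)

ex1 = 1 - NormalDist(300, 50).cdf(362)     # P(X > 362)
ex2 = prob_between(50, 10, 45, 62)         # P(45 < X < 62)
ex3a = NormalDist(40, 6).inv_cdf(0.95)     # x with 5% of the area above it
ex3b = NormalDist(40, 6).inv_cdf(0.38)     # x with 38% of the area below it
ex4 = NormalDist(3.0, 0.5).cdf(2.3)        # P(battery lasts < 2.3 years)
ex5 = prob_between(800, 40, 778, 834)      # P(778 < X < 834)
ex6 = NormalDist(74, 7).inv_cdf(0.60)      # assumed meaning of D6

for value in (ex1, ex2, ex3a, ex3b, ex4, ex5, ex6):
    print(round(value, 4))
```

The results may differ slightly in the last decimal place from table-based answers, because the table requires z to be rounded to two decimal places first.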
P(Z > z) where Z ~ N(0, 1)

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.9 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
-3.8 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.7 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
-3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
-3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
-3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
-3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
-3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
-2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
-2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
-2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
-2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
-2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
-2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
-2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
-2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
-2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
-2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
-1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
-1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
-1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
-1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
-1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
-1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
-1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
-1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
-1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
-0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
-0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
-0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
-0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
-0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
-0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
-0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
-0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
-0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
-0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
3.5 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002
3.6 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.7 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.8 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.9 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
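Individual entries of the table above can be regenerated with Python's statistics.NormalDist. The sketch below is illustrative: the function names upper_tail and table_row are not from the notes, and a row is identified by the absolute value of its label plus a flag for the negative half of the table.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def upper_tail(z):
    """P(Z > z) for Z ~ N(0, 1)."""
    return 1 - STANDARD_NORMAL.cdf(z)

def table_row(row_label, negative=False):
    """Ten entries of one table row: columns 0.00 to 0.09 add a second
    decimal place to |z|; negative=True selects the rows with z <= 0."""
    sign = -1 if negative else 1
    return [round(upper_tail(sign * (row_label + col / 100)), 4)
            for col in range(10)]

print(table_row(1.0))                  # row labelled 1.0
print(table_row(1.9, negative=True))   # row labelled -1.9
```

For example, the first entry of the row labelled 1.0 is P(Z > 1.00), and the second entry of the row labelled -1.9 is P(Z > -1.91).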
Exercises: pp. 197-199 of Walpole nos. 1-16
1. Given a normal distribution with µ = 40 and σ = 6, find
a. the area below 32
b. the area above 27
c. the area between 42 and 51
d. the x value that has 45% of the area below it
e. the x value that has 13% of the area above it
2. Given a normal distribution with µ = 200 and σ² = 100, find
a. the area below 214
b. the area above 179
c. the area between 188 and 206
d. the x value that has 80% of the area below it
e. two x values containing the middle 75% of the area
3. Given the normally distributed random variable X with mean 18 and standard
deviation 2.5, find
(a) P(X < 15)
(b) P(17 < X < 21)
(c) the value of k such that P( X < k) = .2578
(d) the value of k such that P(X > k) = .1539.
4. A soft-drink machine is regulated so that it discharges an average of 200 ml. per
cup. If the amount of drink is normally distributed with standard deviation equal
to 15 ml.
(a) what fraction of cups will contain more than 224 ml.?
(b) what is the probability that a cup contains between 191 and 209 ml.?
(c) how many cups will likely overflow if 230-ml. cups are used in the next 1000
drinks?
(d) below what value do we get the smallest 25% of the drinks?
5. The finished inside diameter of a piston ring is normally distributed with a mean
of 10 cm. and a standard deviation of .03 cm.
(a) What proportion of rings will have inside diameters exceeding 10.075 cm.?
(b) What is the probability that a piston ring will have an inside diameter between
9.97 and 10.03 cm.?
(c) Below what value of inside diameter will 15% of piston rings fall?
6. A lawyer commutes daily from his suburban home to his midtown office. On the
average the trip one way takes 24 minutes, with a standard deviation of 3.8
minutes. Assume the distribution of trip times to be normally distributed.
(a) What is the probability that the trip will take at least ½ hour?
(b) If the office opens at 9 AM and he leaves his house at 8:45 AM, what
percentage of the time is he late for work?
(c) If he leaves the house at 8:35 AM and coffee is served at the office from 8:50
AM until 9 AM, what is the probability that he misses coffee?
(d) Find the length of time above which we find the slowest 15% of the trips
7. If a set of grades on a statistics examination are approximately normally
distributed with a mean of 74 and a standard deviation of 7.9, find
(a) the lowest passing grade if the lowest 10% of the students are given F’s
(b) the highest B if the top 5% of the students are given A’s
(c) the lowest B if the top 10% of the students are given A’s and the next 25% are
given B’s.
8. In a mathematics examination the average grade was 82 and the standard
deviation was 5. All students with grades from 88 to 94 received a grade of B. If
the grades are approximately normally distributed and 8 students received a B
grade, how many students took the examination?
9. The heights of 1000 students are normally distributed with a mean of 174.5 cm.
and a standard deviation of 6.9 cm. How many of these students would you
expect to have heights (a) less than 160.0 cm; (b) between 171.5 and 182.0 cm;
(c) equal to 175.0 cm; and (d) greater than or equal to 188.0 cm?
10. A company pays its employees an average wage of $7.25 an hour with a standard
deviation of 60 cents. If the wages are approximately normally distributed
(a) what percentage of the workers receive wages between $6.75 and $7.69 an
hour?
(b) the highest 5% of the employee hourly wages are greater than what amount?
11. The weights of a large number of miniature poodles are approximately normally
distributed with a mean of 8 kg. and a standard deviation of .9 kg. Find the
fraction of these poodles with weights
(a) over 9.5 kg.
(b) at most 8.6 kg.
(c) between 7.3 and 9.1 kg.
12. The tensile strength of a certain metal component is normally distributed with a
mean of 10,000 kg/cm2 and a standard deviation of 100 kg/cm2.
(a) What proportion of these components exceeds 10,150 kg/cm2 in tensile
strength?
(b) If specifications require that all components have tensile strength between
9800 and 10,200 kg/cm2, what proportion of pieces would you expect to scrap?
13. If a set of observations is normally distributed, what percentage of the
observations differs from the mean by
(a) more than 1.3?
(b) less than .52?
14. The IQs of 600 applicants to a certain college are approximately normally
distributed with a mean of 115 and a standard deviation of 12. If the college
requires an IQ of at least 95, how many of these students will be rejected on this
basis regardless of their other qualifications?
15. The average rainfall in Roanoke, Virginia for the month of March is 9.22 cm.
Assuming a normal distribution with a standard deviation of 2.83 cm, find the
probability that next March Roanoke receives (a) less than 1.84 cm of rain; (b)
more than 5 cm but not over 7 cm of rain; and (c) more than 13.8 cm of rain.
16. The average life of a certain type of small motor is 10 years with a standard
deviation of 2 years. The manufacturer replaces free all motors that fail while
under guarantee. If he is willing to replace only 3% of the motors that fail, how
long a guarantee should he offer? Assume that the lives of the motors follow a
normal distribution.
CHAPTER 7
Sampling Distributions
Consider three observations making up the population values 0, 1, and 2 with parameters
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = 1 \quad \text{and} \quad \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{2}{3}. \]
Suppose we list all possible samples of size 2, with replacement, and for each possible
sample compute the value of the sample mean,
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}: \]

Number  Sample  X̄      Number  Sample  X̄      Number  Sample  X̄
1       0,0     0       4       1,0     .5      7       2,0     1
2       0,1     .5      5       1,1     1       8       2,1     1.5
3       0,2     1       6       1,2     1.5     9       2,2     2

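[Figure: histogram of the nine sample means X̄]

\[ \mu_{\bar{X}} = \text{average of the } \bar{X}\text{'s} = 1 = \mu \]
and
\[ \sigma^2_{\bar{X}} = \text{variance of the } \bar{X}\text{'s} = \frac{1}{3} = \frac{2/3}{2} = \frac{\sigma^2}{n} \]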

Thm1 If all possible random samples of size n are drawn with replacement from a finite
population of size N with mean µ and standard deviation σ, then the sample
mean will have mean µ and variance σ²/n.

Suppose we list all possible samples of size 2, without replacement, and for each possible
sample compute the value of the sample mean, X̄:

Number  Sample  X̄      Number  Sample  X̄      Number  Sample  X̄
1       0,1     .5      3       1,0     .5      5       2,0     1
2       0,2     1       4       1,2     1.5     6       2,1     1.5
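[Figure: histogram of the six sample means X̄]

\[ \mu_{\bar{X}} = \text{average of the } \bar{X}\text{'s} = 1 = \mu \]
and
\[ \sigma^2_{\bar{X}} = \text{variance of the } \bar{X}\text{'s} = \frac{1}{6} = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1} \]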

Thm2 If all possible random samples of size n are drawn without replacement from a
finite population of size N with mean µ and standard deviation σ, then the
sample mean will have mean µ and variance
\[ \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}. \]

The factor (N-n)/(N-1) is called the finite population correction factor. For large N relative
to the sample size n, this factor will be close to 1 and the variance is approximately equal
to σ²/n.
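
Thm1 and Thm2 can be checked by brute force on the three-value population used above. The following is a minimal sketch that enumerates every ordered sample of size 2; the variable names are illustrative only and are not part of the notes.

```python
from itertools import permutations, product
from statistics import fmean, pvariance

population = [0, 1, 2]              # population values from the notes
N, n = len(population), 2
mu = fmean(population)              # population mean
sigma2 = pvariance(population)      # population variance (divides by N)

# Sample means over all ordered samples, with and without replacement.
means_with = [fmean(s) for s in product(population, repeat=n)]
means_without = [fmean(s) for s in permutations(population, n)]

fpc = (N - n) / (N - 1)             # finite population correction factor
var_with = pvariance(means_with)        # Thm1 says this equals sigma2 / n
var_without = pvariance(means_without)  # Thm2 says this equals (sigma2 / n) * fpc

print(fmean(means_with), var_with, sigma2 / n)
print(fmean(means_without), var_without, sigma2 / n * fpc)
```

Both lists of sample means average to µ, and only the variance changes between the two sampling schemes.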

Defn The probability distribution function of a statistic is called its sampling
distribution.

[Figure: two panels showing the sampling distribution of X̄, for sampling with replacement (left) and without replacement (right)]

• A statistic (e.g. sample mean, sample standard deviation) is a variable whose
value depends only on the observed sample and may vary from sample to sample.
• The sampling distribution of a statistic will depend on the size of the population,
the size of the sample, and the method of choosing the sample.
• The standard deviation of the sampling distribution is called the standard error
of the statistic. It tells us the extent to which we expect the values of the statistic
to vary from different possible samples.
Exercise: Compute
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2}{n(n-1)} \]
for each possible sample and determine if µ_{S²} = average of the S²'s = σ².

Thm3 (Central Limit Theorem) If X̄ is the mean of a random sample of size n taken
from a population with mean µ and variance σ², then the
sampling distribution of X̄ is approximately normally distributed with mean
E(X̄) = µ and variance Var(X̄) = σ²/n when n is sufficiently large. Hence, the
limiting form of the distribution of
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
as n approaches infinity is the standard normal distribution.
• The normal approximation in the theorem will be good if n ≥ 30, regardless
of the shape of the population.
• If n < 30, the approximation is good only if the population is not too different
from the normal.
• If the distribution of the population is normal then the sampling distribution
of X̄ will also be exactly normal, no matter how small the size of the sample.

Example An electrical firm manufactures electric light bulbs that have a length of
life which is normally distributed with mean and standard deviation equal
to 500 and 50 hours, respectively. Find the probability that a random
sample of 15 bulbs will have an average life of less than 475 hours.
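
One way to set up this example numerically is sketched below. Because the population is stated to be normal, X̄ is normal with mean µ and standard error σ/√n, so the probability is a standard normal area. The sketch uses Python's standard-library `NormalDist` in place of the z-table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 500, 50, 15          # values given in the example
standard_error = sigma / sqrt(n)    # standard deviation of the sample mean
z = (475 - mu) / standard_error     # standardized value of 475 hours
prob = NormalDist().cdf(z)          # P(Z < z), the area to the left of z

print(round(z, 2), round(prob, 4))
```

The resulting z is close to -1.94, so the area can also be read from the z-table at that value.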

Thm4 (The t-distribution) If X̄ and S² are the mean and variance, respectively, of a
random sample of size n taken from a population which is normally distributed
with mean µ and variance σ², then
\[ T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \]
is a random variable having the t-distribution with v = n-1 degrees of freedom.

Notation: T ~ t(v), where v = n-1

• Comparison between the t-distribution and the standard normal distribution

1. Both are symmetric about zero.
2. Both are bell-shaped, but the t-distribution is more variable:
(i) t-values depend on the fluctuation of 2 quantities: X̄ and S²
(ii) z-values depend only on the changes in X̄ from sample to sample
3. When the sample size is large, i.e. n ≥ 30, the t-distribution can be well
approximated by the standard normal distribution.

• Area under the curve

Just like any continuous probability distribution, the probability that a random
sample produces a t-value falling between any two specified values is equal to the
area under the curve of the t-distribution between the two ordinates
corresponding to the specified values.

• Notation: tα is the t-value leaving an area of α in the right tail of the
t-distribution. That is, if T ~ t(v) then tα is such that P(T > tα) = α.

Examples:
1. Find the following values on the t -table:
(a) t0.025 when v = 14.
(b) t0.99 when v=10.
2. Find k such that P(k < T < 2.807) = 0.945 when T ~ t(23)
3. A manufacturing firm claims that the batteries used in their electronic games will
last an average of 30 hours. To maintain this average, 16 batteries are tested each
month. If the computed t-value falls between -t0.025 and t0.025 , the firm is satisfied
with its claim. What conclusion should the firm draw from a sample that has
mean X̄ = 27.5 hours and standard deviation S = 5 hours? Assume the
distribution of battery lives to be approximately normal.
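
The three examples can be traced with a small lookup, sketched below. The critical values are copied from the t-table in these notes; the dictionary `T_TABLE` and the helper `t_crit` are hypothetical names introduced only for this illustration. The symmetry of the t-distribution gives t(1-α) = -tα.

```python
from math import sqrt

# Upper-tail critical values from the t-table, keyed by (v, alpha).
T_TABLE = {(14, 0.025): 2.145, (10, 0.01): 2.764, (23, 0.005): 2.807,
           (23, 0.05): 1.714, (15, 0.025): 2.131}

def t_crit(v, alpha):
    """t-value leaving area alpha to the right; uses symmetry when alpha > 0.5."""
    if alpha > 0.5:
        return -T_TABLE[(v, round(1 - alpha, 4))]
    return T_TABLE[(v, alpha)]

# Example 1: t_0.025 with v = 14 and t_0.99 with v = 10.
print(t_crit(14, 0.025), t_crit(10, 0.99))

# Example 2: 2.807 is t_0.005 for v = 23, so P(T < 2.807) = 0.995 and
# P(T < k) = 0.995 - 0.945 = 0.05, which makes k = -t_0.05.
k = -t_crit(23, 0.05)

# Example 3: observed t-value against the band (-t_0.025, t_0.025), v = 16 - 1.
t_obs = (27.5 - 30) / (5 / sqrt(16))
satisfied = -t_crit(15, 0.025) < t_obs < t_crit(15, 0.025)
print(k, t_obs, satisfied)
```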

Thm5 If independent random samples of size n1 and n2 are drawn from two large or
infinite populations with means µ1 and µ2 and standard deviations σ1 and σ2,
respectively, then the sampling distribution of the difference of means X̄1 - X̄2
is approximately normally distributed with mean and standard deviation given by
\[ \mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2 \quad \text{and} \quad \sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]

Hence
\[ z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} \]
is a value of the standard normal variable Z.

P(tv > tα,v) = α

v\α 0.10 0.05 0.025 0.01 0.005


1 3.078 6.314 12.706 31.821 63.657
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707
7 1.415 1.895 2.365 2.998 3.499
8 1.397 1.860 2.306 2.896 3.355
9 1.383 1.833 2.262 2.821 3.250
10 1.372 1.812 2.228 2.764 3.169
11 1.363 1.796 2.201 2.718 3.106
12 1.356 1.782 2.179 2.681 3.055
13 1.350 1.771 2.160 2.650 3.012
14 1.345 1.761 2.145 2.624 2.977
15 1.341 1.753 2.131 2.602 2.947
16 1.337 1.746 2.120 2.583 2.921
17 1.333 1.740 2.110 2.567 2.898
18 1.330 1.734 2.101 2.552 2.878
19 1.328 1.729 2.093 2.539 2.861
20 1.325 1.725 2.086 2.528 2.845
21 1.323 1.721 2.080 2.518 2.831
22 1.321 1.717 2.074 2.508 2.819
23 1.319 1.714 2.069 2.500 2.807
24 1.318 1.711 2.064 2.492 2.797
25 1.316 1.708 2.060 2.485 2.787
26 1.315 1.706 2.056 2.479 2.779
27 1.314 1.703 2.052 2.473 2.771
28 1.313 1.701 2.048 2.467 2.763
29 1.311 1.699 2.045 2.462 2.756
Inf. 1.282 1.645 1.960 2.326 2.576
CHAPTER 8
Estimation
Defn Statistical inference refers to methods by which one uses sample information to
make inferences or generalizations about a population.

Two Areas of Statistical Inference


1. Estimation
- point estimation
- interval estimation
2. Hypothesis Testing

Example A commonly prescribed drug on the market for relieving nervous tension
is believed to be only 60% effective. Experimental results with a new drug
administered to a random sample of 100 adults who were suffering from
nervous tension showed that 70 received relief.
a. Estimate the population proportion of nervous tension patients who
will receive relief with the experimental drug.
b. Is this sufficient evidence to conclude that the new drug is superior
to the one commonly prescribed?

8.1 BASIC CONCEPTS IN ESTIMATION

Point Estimation
Defn An estimator is any statistic whose value is used to estimate an unknown
parameter. A realized value of an estimator is called an estimate.

For example, the sample mean X̄ is an estimator of the population mean μ.

Remarks:
1. An estimator is said to be unbiased if the average of the estimates it produces
under repeated sampling is equal to the true value of the parameter being
estimated.

Examples:
• Under random sampling, the sample mean is an unbiased estimator of the
population mean, that is, E(X̄) = μ.
• Under random sampling with replacement, S² is an unbiased estimator of σ²,
but S on the other hand is a biased estimator of σ with the bias becoming
insignificant for large samples.

With replacement                 Without replacement
Samples  S²       S              Samples  S²   S
0,0      0        0
0,1      .5       .707107        0,1      .5   .707107
0,2      2        1.414214       0,2      2    1.414214
1,0      .5       .707107        1,0      .5   .707107
1,1      0        0
1,2      .5       .707107        1,2      .5   .707107
2,0      2        1.414214       2,0      2    1.414214
2,1      .5       .707107        2,1      .5   .707107
2,2      0        0
Mean     .666667  .628539        Mean     1    .942809
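
The with-replacement columns of this table can be reproduced by enumeration. The sketch below assumes the same population 0, 1, 2 used in Chapter 7 and relies on the standard-library functions `variance` and `stdev`, which divide by n-1.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pvariance, stdev, variance

population = [0, 1, 2]
samples = list(product(population, repeat=2))    # ordered samples, with replacement

mean_s2 = fmean([variance(s) for s in samples])  # average of the S^2 values
mean_s = fmean([stdev(s) for s in samples])      # average of the S values
sigma2 = pvariance(population)                   # population variance, 2/3

print(mean_s2, sigma2)         # S^2 averages to sigma^2 (unbiased)
print(mean_s, sqrt(sigma2))    # S averages to less than sigma (biased)
```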

2. A parameter can have more than one unbiased estimator. We would naturally
choose the unbiased estimator with the smallest variance.

Possible Samples X̄ Md
0,1,2 1 1
0,1,3 1.33 1
0,2,1 1 1
0,2,3 1.67 2
0,3,1 1.33 1
0,3,2 1.67 2
1,0,2 1 1
1,0,3 1.33 1
1,2,0 1 1
1,2,3 2 2
1,3,0 1.33 1
1,3,2 2 2
2,0,1 1 1
2,0,3 1.67 2
2,1,0 1 1
2,1,3 2 2
2,3,0 1.67 2
2,3,1 2 2
3,0,1 1.33 1
3,0,2 1.67 2
3,1,0 1.33 1
3,1,2 2 2
3,2,0 1.67 2
3,2,1 2 2
Mean 1.5 1.5
Variance .138889 .25
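
The last two rows of this table can be recomputed directly. The sketch below assumes, as the listed samples suggest, that they are all ordered samples of size 3 drawn without replacement from the population 0, 1, 2, 3.

```python
from itertools import permutations
from statistics import fmean, median, pvariance

population = [0, 1, 2, 3]
samples = list(permutations(population, 3))   # 24 ordered samples, no replacement

means = [fmean(s) for s in samples]           # sample mean of each sample
medians = [median(s) for s in samples]        # sample median (Md) of each sample

# Both estimators average to 1.5, but the mean has the smaller variance.
print(fmean(means), pvariance(means))
print(fmean(medians), pvariance(medians))
```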
Interval Estimation

Defn An interval estimator of a population parameter is a rule that tells us how to
calculate two numbers based on sample data, forming an interval within which the
parameter is expected to lie. This pair of numbers, (a,b), is called an interval
estimate or confidence interval.

Example The running times (in minutes) of a sample of films produced by
Star-Regal Theater are as follows: 103 94 110 87 98.

A 95% confidence interval for the mean running time of films produced
by Star-Regal Theater is (87.6, 109.2).

• The number 0.95 in the example is called the confidence coefficient or
the degree of confidence.
• The endpoints 87.6 and 109.2 are called the lower and upper
confidence limits.

Remarks:
1. In general, we construct a (1-α)100% confidence interval. The fraction (1-α) is
called the confidence coefficient, and the endpoints a and b are called the lower
and upper confidence limits, respectively.

2. The confidence coefficient is not “the probability that the true value of the
parameter falls in the interval estimate” since once a sample is drawn and a
confidence interval constructed, the resulting interval estimate either encloses the
true value of the parameter or it doesn’t. Rather, the confidence coefficient is “the
probability that the interval estimator encloses the true value of the parameter.”

3. A good confidence interval is one that is as narrow as possible and has a large
confidence coefficient, near 1. The narrower the interval, the more exactly we
have located the parameter; whereas, the larger the confidence coefficient, the
more confidence we have that a particular interval encloses the true value of the
parameter. However, for a fixed sample size, as the confidence coefficient
increases, the length of the interval also increases.

4. Interpretation of (1-α)100% confidence interval: If we take repeated samples of
size n and if for each one of these samples we compute the (1-α)100% confidence
interval then (1-α)100% of the resulting confidence intervals will contain the
unknown value of the parameter.

Example: Consider four observations making up the population values 0, 1, 2
and 3 with
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = 1.5 \quad \text{and} \quad \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{5}{4}. \]
Suppose we list all possible samples of size 2, with replacement, and
compute the 90% confidence interval for each possible sample.

90% Confidence Interval


Possible Samples of Size n=2 LL UL
0 0 -1.3005 1.3005
0 1 -0.8005 1.8005
0 2 -0.3005 2.3005
0 3 0.1995 2.8005
1 0 -0.8005 1.8005
1 1 -0.3005 2.3005
1 2 0.1995 2.8005
1 3 0.6995 3.3005
2 0 -0.3005 2.3005
2 1 0.1995 2.8005
2 2 0.6995 3.3005
2 3 1.1995 3.8005
3 0 0.1995 2.8005
3 1 0.6995 3.3005
3 2 1.1995 3.8005
3 3 1.6995 4.3005
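
The table can be regenerated, and the intervals that enclose µ counted, with the sketch below. It assumes the limits were computed as X̄ ± z0.05 · σ/√n with σ known, which matches the tabulated values; `NormalDist().inv_cdf` stands in for the z-table.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist, fmean, pstdev

population = [0, 1, 2, 3]
n = 2
mu, sigma = fmean(population), pstdev(population)  # 1.5 and sqrt(5/4)
z = NormalDist().inv_cdf(0.95)                     # z_{0.05}, about 1.645
margin = z * sigma / sqrt(n)                       # half-width, about 1.3005

# One (LL, UL) pair per ordered sample, in the same order as the table.
intervals = [(fmean(s) - margin, fmean(s) + margin)
             for s in product(population, repeat=n)]
covered = sum(lo <= mu <= hi for lo, hi in intervals)

print(f"{covered} of {len(intervals)} intervals contain mu = {mu}")
```

With such a small discrete population the share of intervals that contain µ comes out near, but not exactly at, the nominal 90%.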

8.2 Estimation of µ

A point estimator of the population mean µ is X̄.

(1-α) 100% Confidence Interval for µ

a. when σ is known

\[ \bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
where zα/2 is the z-value leaving an area of α/2 to the right.

b. when σ is unknown, n ≤ 30

\[ \bar{X} \pm t_{\alpha/2,\, v} \frac{S}{\sqrt{n}} \]
where tα/2,v is the t-value with v = n - 1 degrees of freedom
leaving an area of α/2 to the right.

Remarks:
1. The above formulas hold strictly for random samples from a normal distribution.
However, they provide good approximate (1-α)100% confidence intervals when
the distribution is not normal provided the sample size is large, i.e. n > 30.
2. If σ is unknown and n > 30, use
\[ \bar{X} \pm z_{\alpha/2} \frac{S}{\sqrt{n}} \]
where zα/2 is the z-value leaving an
area of α/2 to the right.
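
As a check on the small-sample formula (σ unknown, n ≤ 30), the sketch below recomputes the 95% interval quoted earlier for the Star-Regal running times. The critical value 2.776 is t0.025 with v = 4 from the t-table; the variable names are illustrative.

```python
from math import sqrt
from statistics import fmean, stdev

times = [103, 94, 110, 87, 98]     # running times from the Star-Regal example
n = len(times)
t_crit = 2.776                     # t_{0.025} with v = n - 1 = 4, from the t-table

x_bar, s = fmean(times), stdev(times)   # sample mean and sample standard deviation
margin = t_crit * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(round(lower, 1), round(upper, 1))
```

Rounded to one decimal place, the limits agree with the interval (87.6, 109.2) given in the example.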
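[Diagram: reading critical values from the tables. In the t-table, the column for α/2 and the row for v give tα/2,v; the last row (v = ∞) gives zα/2. If you cannot find α/2 in the first row of the t-table, look up the area α/2 in the z-table to get zα/2.]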

Examples:
1. The mean and standard deviation for the quality grade-point averages of a random
sample of 36 college seniors are calculated to be 2.6 and .3, respectively. Find the
95% and 99% confidence intervals for the mean of the entire senior class.

2. The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0,
10.2, and 9.6 liters. Find a 95% confidence interval for the mean content of all
such containers, assuming an approximate normal distribution.

Sample Size for Estimating µ

In random sampling, if X̄ will be used to estimate µ, we can be (1-α)100% confident
that the error will not exceed a specified amount, e, when the sample size is
\[ n = \left( \frac{z_{\alpha/2} \, \sigma}{e} \right)^2 \]
Example How large a sample is needed in Example 1 if we want to be 95%
confident that our estimate of µ is not off by more than .05?
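
A sketch of this calculation follows. It assumes that the sample standard deviation .3 from Example 1 is used in place of the unknown σ, and it rounds up because a sample size must be a whole number; the function name is hypothetical.

```python
from math import ceil

def sample_size_for_mean(z, sigma, e):
    """Smallest whole n for which z * sigma / sqrt(n) does not exceed e."""
    return ceil((z * sigma / e) ** 2)

# 95% confidence (z_{0.025} = 1.96), sigma approximated by s = 0.3, error e = 0.05.
n_needed = sample_size_for_mean(z=1.96, sigma=0.3, e=0.05)
print(n_needed)
```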
Exercises: pp. 262-264 of Walpole nos. 3-13
3. An electrical firm manufactures light bulbs that have a length of life that is
normally distributed, with a standard deviation of 40 hours. If a random sample of
30 bulbs has a mean life of 780 hours, find a 96% confidence interval for the
population mean of all bulbs produced by this firm.
4. A soft-drink machine is regulated so that the amount of drink dispensed is
approximately normally distributed with a standard deviation of 1.5 dl. Find a
95% confidence interval for the mean of all drinks dispensed by this machine if a
random sample of 36 drinks had an average content of 22.5 dl.
5. The heights of a random sample of 50 college students showed a mean of 174.5
cm. and a standard deviation of 6.9 cm. Construct a 98% confidence interval for
the mean height of all college students.
6. A random sample of 100 automobile owners shows that an automobile is driven
on the average 23,500 kilometers per year, in the state of Virginia, with a standard
deviation of 3900 kilometers. Construct a 99% confidence interval for the average
distance an automobile is driven annually in Virginia.
7. How large a sample is needed in Exercise 3 if we wish to be 96% confident that
our sample mean will be within 10 hours of the true mean?
8. How large a sample is needed in Exercise 4 if we wish to be 95% confident that
our sample mean will be within .3 dl of the true mean?
9. An efficiency expert wishes to determine the average time that it takes to drill 3
holes in a certain metal clamp. How large a sample will he need to be 95%
confident that his sample mean will be within 15 sec. of the true mean? Assume
that σ = 40 sec.
10. Regular consumption of presweetened cereals contributes to tooth decay, heart
disease, and other degenerative diseases, according to a study by Dr. M. Albreight
of the National Institute of Health and Dr. D. Solomon, Professor of Nutrition and
Dietetics at the University of London. In a random sample of 20 similar servings
of Alpha-Bits, the mean sugar content was 11.3 grams with a standard deviation
of 2.45 grams. Assuming that the sugar content is normally distributed, construct
a 95% confidence interval for the mean sugar content for single servings of
Alpha-Bits.
11. The contents of 10 similar containers of a commercial soap are 10.2, 9.7, 10.1,
10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Find a 99% confidence interval for
the mean soap content of all such containers, assuming an approximate normal
distribution.
12. A random sample of 8 cigarettes of a certain brand has an average nicotine
content of 3.6 mg. and a standard deviation of .9 mg. Construct a 99% confidence
interval for the true average nicotine content of this particular brand of cigarettes,
assuming an approximate normal distribution.
13. A random sample of 12 female students in a certain dormitory showed an average
weekly expenditure of $8.00 for snack foods, with a standard deviation of $1.75.
Construct a 90% confidence interval for the average amount spent each week on
snack foods by female students living in this dormitory, assuming the
expenditures to be approximately normally distributed.
8.3 Inferences About µ1 - µ2

In comparing two populations with means µ1 and µ2 and standard deviations σ1 and σ2,
respectively, the analysis of the sample data depends on the sampling design used.

Defn Two samples are called independent samples when the measurements in one
sample are not related to the measurements in the other sample.
• Random samples are taken separately from two populations and the same
response variable is recorded for each individual.
• One random sample is taken and a variable recorded for each individual, but then
units are categorized as belonging to one population or another.
• Participants are randomly assigned to one of two treatment conditions and the
same response variable is recorded for each individual unit.

Defn The term paired (or matched/related/dependent) data means that data have been
observed in natural pairs.
• Each person is measured twice. The two measurements of the same characteristic
or trait are made under different situations.
• Similar individuals are paired prior to the experiment. During the experiment,
each member of a pair receives a different treatment. The same response variable
is measured for all individuals.

Example An independent samples design and a matched samples design are under
consideration for a study to obtain an estimate of the weight loss in a
shipment of bananas during transit.

Independent Samples. A random sample of banana bunches is selected from the lot and
weighed before loading. After shipment, an independent random sample of bunches is
selected and weighed during the unloading. The difference in the two sample mean
weights per bunch is used as the estimate of weight loss per bunch.

Matched Samples. A random sample of banana bunches is selected and weighed before
loading. After shipment, the same bunches selected before loading are weighed again,
and the difference in weight for each bunch is noted. The mean of these differences is
used as an estimate of the weight loss per bunch.
Exercise: For each item, identify whether the samples are independent or not.
1. A police department performs an experiment to assess the effects of an obvious
radar trap on the speeds of cars. Ten cars are randomly selected on a highway,
and their speeds are measured just before a radar trap comes into view and just
after they pass the trap.
2. A tire manufacturer is testing 2 new tread designs in terms of stopping distance.
To do this, the company uses two test cars driving side by side at the same speed.
Both cars have automatic braking systems so that both sets of brakes engage
simultaneously on a signal. Then the stopping distances for both cars are
measured after repeating the experiment 10 times.
3. A company which does a large volume of business by mail decides to test whether
there is a difference in mail delivery between those items brought to a post office
as compared to those put in a corner mailbox. A random sample of 100 customers
from the same city was selected and their letters were mailed from the post office.
Another random sample of 100 customers was then selected but their letters were
sent from the corner mailbox.
4. Two formulations of a new skin-softening lotion are to be compared as to their
softening action. A random sample of 40 potential users of the lotion is selected.
Each person in the sample is independently assigned at random one of the two
formulations to be applied to the left arm and the other formulation to be applied
to the right arm. After a lapse of eight hours, each person is asked to rate the
skin-softening effect of each formulation on a 10-point scale.
If we have two populations with means µ1 and µ2 and standard deviations σ1 and σ2,
respectively, a point estimator of the difference µ1 - µ2 is the statistic X̄1 - X̄2.

(1-α)100% Confidence Interval for µ1-µ2 for Independent Samples


a. σ1² and σ2² are known
\[ (\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]

b. σ1² = σ2² but unknown, n1, n2 ≤ 30
\[ (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\, n_1+n_2-2} \sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \]

c. σ1² ≠ σ2² but unknown, n1, n2 ≤ 30
\[ (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\, v} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \]
where
\[ v = \frac{\left( S_1^2/n_1 + S_2^2/n_2 \right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1-1} + \dfrac{(S_2^2/n_2)^2}{n_2-1}} \]

Remarks:
1. These formulas hold strictly for independent samples selected from Normal
populations. However, they provide good approximate (1-α)100% confidence
intervals when the distributions are not Normal provided both n1 and n2 are greater
than 30.

2. If σ1² and σ2² are unknown but n1 and n2 are greater than 30, use
\[ (\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \]

3. Even if the population variances are considerably different, formula (b) will still
provide a good estimate provided that n1=n2 and both populations are normal.
Therefore, in a planned experiment, one should make every effort to equalize the
size of the samples.

Examples:
1. A statistics test was given to a random sample of 50 girls and another random
sample of 75 boys. The mean score of the girls is 80 with a standard deviation of
4 and the mean score of the boys is 86 with a standard deviation of 6. Find a 95%
confidence interval for the difference µB - µG.
2. A course in mathematics is taught to 12 students by the conventional classroom
procedure. A second group of 10 students was given the same course by means of
programmed materials. At the end of the semester the same examination was
given to each group. The 12 students meeting in the classroom made an average
grade of 85 with a standard deviation of 4, while the 10 students using
programmed materials made an average of 81 with a standard deviation of 5.
Find a 90% confidence interval for the difference between the population means,
assuming the populations are approximately normal with equal variances.
3. Records for the past 15 years have shown the average rainfall in a certain region
to be 4.93 cm., with a standard deviation of 1.14 cm. A second region has had an
average rainfall of 2.64 cm., with a standard deviation of .66 cm. during the past
10 years. Find a 95% confidence interval for the difference of the true average
rainfalls in these regions, assuming that the observations come from normal
populations with different variances.
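
The pooled interval of case (b) and the degrees of freedom of case (c) can be sketched as two small functions. The function names are hypothetical, and the critical value 1.725 (t0.05 with 20 degrees of freedom) is read from the t-table. The notes do not say how a fractional v should be rounded, so the sketch prints it unrounded.

```python
from math import sqrt

def pooled_ci(x1, s1, n1, x2, s2, n2, t_crit):
    """Case (b): equal but unknown variances; t_crit has n1 + n2 - 2 df."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    margin = t_crit * sqrt(sp2 * (1 / n1 + 1 / n2))
    return x1 - x2 - margin, x1 - x2 + margin

def approx_df(s1, n1, s2, n2):
    """Case (c): degrees of freedom v when the variances are unequal."""
    a, b = s1**2 / n1, s2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

# Example 2: classroom group (n = 12) vs programmed materials (n = 10), 90% interval.
low, high = pooled_ci(85, 4, 12, 81, 5, 10, t_crit=1.725)
print(round(low, 2), round(high, 2))

# Example 3: degrees of freedom for the rainfall comparison.
print(approx_df(1.14, 15, 0.66, 10))
```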

(1-α)100% Confidence Interval for µx - µy for Related/Paired Samples

When the population of differences D is normal or does not depart too markedly from
normality, a confidence interval for µD = µx - µy is:

\[ \bar{d} \pm t_{\alpha/2,\, n-1} \frac{S_d}{\sqrt{n}} \]
where
\[ d_i = x_i - y_i, \qquad \bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}, \]
\[ S_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n-1}} = \sqrt{\frac{n \sum_{i=1}^{n} d_i^2 - \left( \sum_{i=1}^{n} d_i \right)^2}{n(n-1)}}, \]
and n is the number of pairs.

Example: Twenty college freshmen were divided into 10 pairs, each member of the pair
having approximately the same IQ. One of each pair was selected at random and
assigned to a mathematics section using programmed materials only. The other
member of each pair was assigned to a section in which the professor lectured. At
the end of the semester each group was given the same examination and the
following results were recorded.

Pair 1 2 3 4 5 6 7 8 9 10
Programmed 76 60 85 58 91 75 82 64 79 88
Materials
Lectures 81 52 87 70 86 77 90 63 85 83

Find a 98% confidence interval for the true difference in the two learning
procedures. Assume normality.
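
A sketch of the paired computation for these scores follows. It takes the differences as programmed materials minus lectures (an assumed ordering; reversing it only flips the signs of the limits) and uses t0.01 with v = 9 from the t-table.

```python
from math import sqrt
from statistics import fmean, stdev

programmed = [76, 60, 85, 58, 91, 75, 82, 64, 79, 88]
lectures = [81, 52, 87, 70, 86, 77, 90, 63, 85, 83]

d = [x - y for x, y in zip(programmed, lectures)]   # d_i = x_i - y_i
n = len(d)                                          # number of pairs
d_bar, s_d = fmean(d), stdev(d)                     # mean and std. dev. of differences
t_crit = 2.821                                      # t_{0.01} with v = n - 1 = 9

margin = t_crit * s_d / sqrt(n)
lower, upper = d_bar - margin, d_bar + margin
print(round(d_bar, 2), round(lower, 2), round(upper, 2))
```

The interval contains 0, so on these data neither teaching procedure is clearly ahead at the 98% level.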
Exercises: pp. 264-266 of Walpole nos. 14-23
14. A random sample of size n1 = 25 taken from a normal population with standard
deviation σ1 = 5 has a mean x̄1 = 80. A second random sample of size n2 = 36
taken from a different normal population with a standard deviation of σ2 = 3, has
a mean x̄2 = 75. Find a 94% confidence interval for µ1-µ2.
15. Two kinds of thread are being compared for strength. Fifty pieces of each type of
thread are tested under similar conditions. Brand A had an average tensile
strength of 78.3 kg. with a standard deviation of 5.6 kg., while Brand B had an
average tensile strength of 87.2 kg. with a standard deviation of 6.3 kg. Construct
a 95% confidence interval for the difference of the population means.
16. A study was made to estimate the difference in salaries of college professors in
the private and state colleges of Virginia. A random sample of 100 professors in
private colleges showed an average 9-month salary of $25,000 with a standard
deviation of $1200. A random sample of 200 professors in state colleges showed
an average salary of $26,000 with a standard deviation of $1400. Find a 98%
confidence interval for the difference between the average salaries of professors
teaching in state and private colleges of Virginia.
17. Given two random samples of size n1 = 9 and n2 = 16, from two independent
normal populations, with x̄1 = 64, x̄2 = 59, s1 = 6, and s2 = 5, find a 95%
confidence interval for µ1-µ2, assuming that σ1 = σ2.
18. Students may choose between a 3-unit course in Physics without lab and a 4-unit
course with lab. The final written examination is the same for each section. The
mean score of a random sample of 12 students in the section with lab is 84 with a
standard deviation of 4, and the mean score of another random sample of 18
students in the section without lab is 77 with a standard deviation of 6. Find a
99% confidence interval for the difference between the mean grades for the two
courses. Assume the populations to be approximately normally distributed with
equal variances.
19. A taxi company is trying to decide whether to purchase brand A or brand B tires
for its fleet of taxis. To estimate the difference in the two brands, an experiment
is conducted using 12 of each brand. The tires are run until they wear out. The
results are x̄A = 36,300 km., sA = 5000 km., x̄B = 38,100 km., and sB = 6100 km.
Construct a 95% confidence interval for µA-µB, assuming the populations to be
approximately normally distributed.
20. The following data represent the running time of a random sample of films
produced by two motion picture companies:

Time (minutes)
Company 1 103 94 110 87 98
Company 2 97 82 123 92 175 88 118

Compute a 90% confidence interval for the difference between the mean running
times of films produced by the two companies. Assume that the running times for
each of the companies are approximately normally distributed with unequal
variances.
21. The government awarded grants to the agricultural departments of nine
universities to test the yield capabilities of two new varieties of wheat. Each
variety was planted on plots of equal area at each university and the yields, in kg.
per plot, recorded as follows:

University
1 2 3 4 5 6 7 8 9
Variety 1 38 23 35 41 44 29 37 31 38
Variety 2 45 25 31 38 50 33 36 40 43

Find a 95% confidence interval for the mean difference between the yields of the
two varieties assuming the distributions of yields to be approximately normal.
22. Referring to Exercise 19, find a 99% confidence interval for µA-µB if a tire from
each company is assigned at random to the rear wheels of 8 taxis and the
following distances in km., recorded:

Taxi Brand A Brand B


1 34,400 36,700
2 45,500 46,800
3 36,700 37,700
4 32,000 31,100
5 48,400 47,800
6 32,800 36,400
7 38,100 38,900
8 30,100 31,500

23. It is claimed that a new diet will reduce a person’s weight by 4.5 kilograms on the
average in a period of 2 weeks. The weights of a random sample of 7 women who
followed this diet were recorded before and after a 2-week period:

Woman
1 2 3 4 5 6 7
Weight Before 58.5 60.3 61.7 69.0 64.0 62.6 56.7
Weight After 60.0 54.9 58.1 62.1 58.5 59.9 54.4

Compute a 95% confidence interval for the mean difference in the weight.
Assume the distribution of weights to be approximately normal.
8.4. ESTIMATING PROPORTIONS

In a binomial experiment a point estimator of the proportion p is p̂ = X/n, where X
represents the number of successes in n trials. With q̂ = 1 - p̂, the standard error is
\[ \sqrt{\frac{\hat{p}\hat{q}}{n}} \]
and the margin of error is
\[ z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}. \]

If the unknown proportion is not expected to be too close to 0 or 1 and n is large, an
approximate (1-α)100% confidence interval for p is given by
\[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} \]

Example In a random sample of 200 students who enrolled in Math 17, 138 passed
on their first take. Construct a 95% confidence interval for the population
proportion of students who passed Math 17 on their first take.
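
As a rough illustration, the interval for this example can be computed with Python's standard library. The function name `proportion_ci` is a hypothetical helper, not part of the notes; it simply applies the large-sample interval for p given above.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample confidence interval for a binomial proportion p."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    # z_{alpha/2}: upper-tail critical value of the standard normal
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin


# Math 17 example: 138 of 200 sampled students passed on their first take
lower, upper = proportion_ci(138, 200, 0.95)
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
```

Under these assumptions the interval comes out at roughly 0.63 to 0.75.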
Sample Size for Estimating p

If $\hat{p}$ will be used to estimate p, then we can be (1-α)100% confident that the error will
not exceed a specified amount, e, when the sample size is

$$n = \frac{z_{\alpha/2}^2\, p(1-p)}{e^2}$$

When the value of p is unknown or cannot be approximated, then using p = 0.5 produces
the maximum value of p(1-p)=0.25. Hence a conservative formula for the sample size is

$$n = \frac{z_{\alpha/2}^2}{4e^2}$$

Example Use the conservative formula to determine the sample size needed if we
want to be 95% confident that our estimate of p is within 0.05 of the true
value.
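
A minimal sketch of the sample-size calculation follows. The helper name `sample_size_for_proportion` is an assumption made for illustration; the result is rounded up because a sample size must be a whole number.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(error, confidence=0.95, p=None):
    """Sample size so that the estimate of p is within `error` of the true value.

    If p is None, the conservative choice p(1 - p) = 0.25 is used.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    pq = 0.25 if p is None else p * (1 - p)
    # Round up: a fractional observation is not possible
    return ceil(z ** 2 * pq / error ** 2)


# Conservative formula, 95% confidence, error at most 0.05
n_conservative = sample_size_for_proportion(0.05, 0.95)
print(n_conservative)
```

With these inputs the conservative formula gives 384.1..., which rounds up to 385.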

The SWS national survey for the fourth quarter of 2013, done on Dec. 11-16, expanded
its Visayas sample to 650 households, from the usual 300, thus reducing the Visayas error
margin to 4 percentage points, from the usual 6 points. This raised the national sample
size to 1,550 households, from the usual 1,200, enhancing the quality of Yolanda-related
items in particular, since the Visayas was the area that suffered the most.

A conservative estimate for the margin of error at the 95% confidence level is
$z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n} \approx 1/\sqrt{n}$, since $z_{.025} \approx 2$ and
$\hat{p}\hat{q} \le 1/4$.

Sample size n     Margin of error 1/√n
100               .10
400               .05
625               .04
1000              .032
1600              .025
2500              .02
10000             .01
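
The table can be reproduced with a few lines of Python. This is only a sketch; `conservative_margin` is a hypothetical name for the 1/√n rule of thumb, and the last two lines apply it to the SWS Visayas sample sizes quoted above.

```python
from math import sqrt


def conservative_margin(n):
    """Rule-of-thumb 95% margin of error for a proportion: 1 / sqrt(n)."""
    return 1 / sqrt(n)


for n in (100, 400, 625, 1000, 1600, 2500, 10000):
    print(f"{n:>6}  {conservative_margin(n):.3f}")

# SWS Visayas sample: the usual 300 households versus the expanded 650
print(f"n = 300: about {conservative_margin(300):.2f}")
print(f"n = 650: about {conservative_margin(650):.2f}")
```

The last two values round to 6 and 4 percentage points, in line with the error margins reported for the survey.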
8.5 ESTIMATING THE DIFFERENCE OF TWO PROPORTIONS
Given 2 independent random samples of size n1 and n2, a point estimator of the difference
between the two proportions p1 and p2 is given by $\hat{p}_1 - \hat{p}_2 = X_1/n_1 - X_2/n_2$,
where X1 is the number of successes in n1 trials (first sample) and X2 is the number of
successes in n2 trials (second sample).

An approximate (1-α)100% confidence interval for p1 - p2 when n1 and n2 are large is

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$

Example In a random sample of 200 students, 78 of the 120 females and 60 of the
80 males passed Math 17 on their first take. Construct a 95% confidence
interval for p1- p2, where p1 and p2 are the true proportions of females and
males, respectively, who passed Math 17 on their first take.
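
The following sketch applies the interval for p1 - p2 to this example. The function name `two_proportion_ci` is an illustrative assumption; population 1 is taken to be the females and population 2 the males, as stated in the example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Large-sample confidence interval for p1 - p2 (independent samples)."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error uses each sample's own p-hat and q-hat
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Females: 78 of 120 passed; males: 60 of 80 passed
lower, upper = two_proportion_ci(78, 120, 60, 80, 0.95)
print(f"95% CI for p1 - p2: ({lower:.4f}, {upper:.4f})")
```

The resulting interval runs from about -0.23 to about 0.03, so it contains 0.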
Exercises: pp. 273-274 of Walpole nos. 1-13
1. A random sample of 200 voters is selected and 120 are found to support an
annexation suit. Find the 96% confidence interval for the fraction of the voting
population favoring the suit.
2. A random sample of 400 cigarette smokers is selected and 86 are found to have a
preference for brand X. Find the 90% confidence interval for the fraction of the
population of cigarette smokers who prefer brand X.
3. In a random sample of 1000 homes in a certain city, it is found that 628 are heated
by natural gas. Find the 98% confidence interval for the fraction of homes in this
city that are heated by natural gas.
4. A random sample of 75 college students is selected and 16 are found to have cars
on campus. Use a 95% confidence interval to estimate the fraction of students
who have cars on campus.
5. A new rocket-launching system is being considered for deployment of small
short-range launches. The existing system has p = .8 as the probability of a
successful launch. A sample of 40 experimental launches is made with the new
system and 34 are successful. Construct a 95% confidence interval for p.
6. How large a sample is needed in Exercise 1 if we wish to be 96% confident that
our sample proportion will be within .02 of the true fraction of the voting
population?
7. How large a sample is needed in Exercise 3 if we wish to be 98% confident that
our sample proportion will be within .05 of the true proportion of homes in this
city that are heated by natural gas?
8. A study is to be made to estimate the percentage of citizens in a town who favor
having their water fluoridated. How large a sample is needed if one wishes to be
at least 95% confident that our estimate is within 1% of the true percentage?
9. According to Dr. Memory Elvin-Lewis, head of the microbiology department at
Washington University School of Dental Medicine in St. Louis, a couple of cups
of either green or oolong tea each day will provide sufficient fluoride to protect
your teeth from decay. People who do not like tea and who live in unfluoridated
areas should ask their local governments to consider having their water
fluoridated. How large a sample is needed to estimate the percentage of citizens
in a certain town who favor having their water fluoridated if one wishes to be at
least 99% confident that the estimate is within 1% of the true percentage?
10. In a study to estimate the proportion of residents in a certain city and its suburbs
who favor the construction of a nuclear power plant, it is found that 52 of 100
urban residents favor the construction while only 34 of 125 suburban residents are
in favor. Find a 96% confidence interval for the difference between the
proportion of urban and suburban residents who favor construction of the nuclear
plant.
11. A cigarette-manufacturing firm claims that its brand A line of cigarettes outsells
its brand B line by 8%. If it is found that 42 of 200 smokers prefer brand A and
18 of 150 smokers prefer brand B, compute a 94% confidence interval for the
difference between the proportions of sales of the two brands.
12. A geneticist is interested in the proportion of males and females in the population
that have a certain minor blood disorder. In a random sample of 100 males, 24
are found to be afflicted, whereas 13 of the 100 females tested appear to have the
disorder. Compute a 99% confidence interval for the difference between the
proportion of males and females that have this blood disorder.
13. A study is made to determine if a cold climate results in more students being
absent from school during a semester than for a warmer climate. Two groups of
students are selected at random, one group from Vermont and the other group
from Georgia. Of the 300 students from Vermont, 64 were absent at least 1 day
during the semester, and of the 400 students from Georgia, 51 were absent 1 or
more days. Find a 95% confidence interval for the difference between the
fractions of the students who are absent in the two states.

During World War II, Allied military planners needed estimates of the number of tanks
Germany was manufacturing. The information provided by traditional spying methods
was not reliable, but statistical sampling methods proved to be valuable. For example,
espionage and reconnaissance led analysts to estimate 1550 tanks were produced during
June 1941. However, using the serial numbers of captured tanks and statistical analysis,
military planners estimated the number of tanks to be 244. This estimate turned out to be
27 less than the actual number manufactured by the Germans in June 1941. A similar
type of analysis was used to estimate the number of Iraqi tanks destroyed during Desert
Storm.
CHAPTER 9
Tests of Hypothesis

9.1 BASIC CONCEPTS OF STATISTICAL HYPOTHESIS TESTING


Principles of Statistical Hypothesis Testing

1. A statistical hypothesis is an assertion or conjecture concerning one or more
populations.
2. The null hypothesis (Ho or NH) is the hypothesis that is being tested.
3. The alternative hypothesis (Ha or AH) is the contradiction of the null
hypothesis.

Note: Ho and Ha must be non-overlapping statements about a population.

Example Consider a test by a light bulb manufacturer to examine the life of
a new long-life bulb it hopes to market. The leading brand of bulb
in the market has a mean burning time of 2000 hours. For
advertising purposes, the manufacturer wishes to prove that the
new bulb has a longer mean burning time. We let µ be the mean
burning time in hours of the new long-life bulb. Then Ho: µ ≤
2000 vs. Ha: µ > 2000.

4. A one-tailed test of hypothesis is a test where the alternative hypothesis specifies
a one-directional difference for the parameter of interest.
Example: Do male students study, on the average, more than female students do?
Ho: µM-µF ≤ 0 vs. Ha: µM-µF > 0 or
Ho: µF-µM ≥ 0 vs. Ha: µF-µM < 0

A two-tailed test of hypothesis is a test where the alternative hypothesis does not
specify a directional difference for the parameter of interest.
Examples:
a. Is there a general preference for Coke or Pepsi?
Ho: p = .5 vs. Ha: p ≠ .5
b. Is the proportion favoring the death penalty the same for teenagers as it is for
adults? Ho: pT - pA = 0 vs. Ha: pT - pA ≠ 0

5. A test statistic is a statistic computed from sample measurements that is
especially sensitive to the differences between Ho and Ha. It tends to take on
certain values when Ho is true and different values when Ho is false.
6. The critical region or rejection region is the set of values of the test statistic for
which the null hypothesis will be rejected. The acceptance region is the set of
values of the test statistic for which the null hypothesis will not be rejected. The
acceptance and rejection regions are separated by a critical value of the test
statistic. The location of the region of rejection depends on Ha.
7. The Type I error is the error made by rejecting the null hypothesis when it is
true. The probability of a Type I error is denoted by α.
The Type II error is the error made by accepting (not rejecting) the null
hypothesis when it is false. The probability of a Type II error is denoted by β.

                 Null Hypothesis
Decision         True                False
Reject Ho        Type I error        Correct decision
Accept Ho        Correct decision    Type II error

Example Consider four observations making up the population values 0, 1, 2
and 3 with µ = 1.5 and σ² = 5/4. Suppose we list all possible
samples of size 2, with replacement, and test Ho: µ = 1.5 vs. Ha:
µ ≠ 1.5 for each possible sample.

Possible Samples    z-test        Decision
of Size n=2         Ho: µ=1.5     α=.10
0 0                 -1.897        Reject Ho
0 1                 -1.265        Accept Ho
0 2                 -0.632        Accept Ho
0 3                  0.000        Accept Ho
1 0                 -1.265        Accept Ho
1 1                 -0.632        Accept Ho
1 2                  0.000        Accept Ho
1 3                  0.632        Accept Ho
2 0                 -0.632        Accept Ho
2 1                  0.000        Accept Ho
2 2                  0.632        Accept Ho
2 3                  1.265        Accept Ho
3 0                  0.000        Accept Ho
3 1                  0.632        Accept Ho
3 2                  1.265        Accept Ho
3 3                  1.897        Reject Ho

8. The level of significance, α, is the maximum probability of Type I error the
researcher is willing to commit. If the test leads to the rejection of Ho, then we
say that there is sufficient evidence supporting Ha at α level of significance.

The Type I error and Type II error are related. For a fixed sample size n, a
decrease in the probability of one will result in an increase in the probability of
the other. However, increasing the sample size will result in the reduction of both
probabilities.

Usually β is unknown because of the difficulty in computing it. The common
solution to this problem is to “withhold judgment” if the test leads to the
acceptance of Ho.

Common Choices    Consequences of
of α              Type I Error       Type II Error
.01               Very serious       Not too serious
.10               Not too serious    Very serious
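
The earlier example that lists all 16 samples of size 2 from the population 0, 1, 2, 3 can be enumerated directly. The sketch below is only an illustration: it recomputes each z value and counts how often Ho: µ = 1.5 is rejected at α = .10. Since Ho is true for this population, every rejection is a Type I error.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist

population = [0, 1, 2, 3]
n = 2
mu = sum(population) / len(population)                            # 1.5
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # 5/4
alpha = 0.10
z_crit = NormalDist().inv_cdf(1 - alpha / 2)                      # z_{alpha/2}

samples = list(product(population, repeat=n))  # sampling with replacement
rejections = 0
for sample in samples:
    x_bar = sum(sample) / n
    z = (x_bar - mu) / sqrt(sigma2 / n)
    reject = abs(z) > z_crit
    rejections += reject
    print(sample, f"{z:7.3f}", "Reject Ho" if reject else "Accept Ho")

# Ho is true here, so the rejection rate is the actual Type I error rate
type_i_rate = rejections / len(samples)
print(f"Type I error rate: {type_i_rate}")
```

Two of the 16 equally likely samples lead to rejection, so the actual Type I error rate is 2/16 = .125. It differs from the nominal .10 because the sample mean can take only a few distinct values in such a small population.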

Example If you are on a jury in the American judicial system, you must presume
that the defendant is innocent unless there is enough evidence to conclude that he
or she is guilty. Therefore the two hypotheses are
Ho: The defendant is innocent
Ha: The defendant is guilty

The prosecution collects evidence in the hope that the jurors will be convinced
that such evidence would be extremely unlikely if the assumption of innocence
were true. Consistent with our thinking in hypothesis testing, in many cases we
would not accept the hypothesis that the defendant is innocent. We would simply
conclude that the evidence was not strong enough to rule out the possibility of
innocence. In fact, in the United States the two conclusions juries are instructed
to choose from are “guilty” and “not guilty.” A jury would never conclude, “the
defendant is innocent.”

For trials in general, here are the possible errors and the consequences that
accompany those errors:
Type I error: A “guilty” verdict for a person who is really innocent.
Consequence: An innocent person is falsely convicted. The guilty party remains
free.

Type II error: A “not guilty” verdict for a person who committed a crime.
Consequence: A criminal is not punished.

In the American court system, a false conviction is generally viewed as the more
serious error. Not only is an innocent person punished but also a guilty one
remains free. Courtroom rules and rules affecting pretrial investigations tend to
reflect society’s concern about incorrectly punishing an innocent person.

Example Imagine that you are tested to determine if you have a disease. The lab
technician or physician who evaluates your results must make a choice between
two hypotheses:
Ho: You do not have the disease.
Ha: You have the disease.

Unfortunately, many laboratory tests for diseases are not 100% accurate. There is
a chance the result is wrong. Consider the two possible errors and their
consequences:
Type I error: You are told you have the disease, but you actually don’t. The test
result was a false positive. Consequence: You will be unnecessarily concerned
about your health and you may receive unnecessary treatment.
Type II error: You are told you do not have the disease, but you actually do. The
test result was a false negative. Consequence: You do not receive treatment for a
disease that you have. If this is contagious, you may infect others.

Which error is more serious? In most medical situations, the second possible
error is more serious but this could depend on the disease and the follow-up
actions that are taken. For instance, in a screening test for cancer, a false negative
could lead to a fatal delay in treatment. Initial test results that are “positive” for
cancer are usually followed up with a retest so a false positive may be discovered
quickly.

Steps in Hypothesis Testing

1. Determine the objectives of the study.
2. State Ho and Ha.
3. Choose the level of significance α.
4. Select the appropriate test statistic and establish the critical region.
5. Collect the data and compute the value of the test statistic from the sample data.
6. Make the decision. Reject Ho if the value of the test statistic belongs in the critical
region. Otherwise, do not reject Ho.

9.2 Tests for µ

Case 1: σ known. Test statistic: $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

Ho          Ha          Region of Rejection
µ ≥ µo      µ < µo      z < -zα
µ ≤ µo      µ > µo      z > zα
µ = µo      µ ≠ µo      |z| > zα/2

Case 2: σ unknown, n ≤ 30. Test statistic: $T = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$

Ho          Ha          Region of Rejection
µ ≥ µo      µ < µo      t < -tα, n-1
µ ≤ µo      µ > µo      t > tα, n-1
µ = µo      µ ≠ µo      |t| > tα/2, n-1

• The above tests are exact α-level tests for samples from a normal distribution.
However, the first test provides a good approximate α-level test when the distribution
is not normal provided that the sample size is large enough, that is, n>30. See
Theorems 4 and 5 of Chapter 7.

• If σ² is unknown and n>30, use the z-test but replace σ by s, that is,

$$Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

• The procedures are the same for testing the following:

Ho: µ ≥ µo vs. Ha: µ < µo as Ho: µ = µo vs. Ha: µ < µo
Ho: µ ≤ µo vs. Ha: µ > µo as Ho: µ = µo vs. Ha: µ > µo

Examples:
1. A manufacturer of sports equipment has developed a new synthetic fishing line
that he claims has a mean breaking strength of 8 kilograms with a standard
deviation of .5 kilogram. Test the hypothesis that µ = 8 kilograms if a random
sample of 50 lines is tested and found to have a mean breaking strength of 7.8
kilograms. Use a .01 level of significance.

2. A random sample of 100 recorded deaths during the past year showed an average
life span of 71.8 years, with a standard deviation of 8.9 years. Does this seem to
indicate that the average life span today is greater than 70 years? Use a .05 level
of significance.

3. The average length of time for students to register at a certain college has been 50
minutes with a standard deviation of 10 minutes. A new registration procedure
using modern computing machines is being tried. If a random sample of 12
students had an average registration time of 42 minutes with a standard deviation
of 11.9 minutes under the new system, test the hypothesis that the population
mean is now less than 50 minutes, using a level of significance of .10, .05 and .01.
Assume the population of times to be normal.
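
Examples 1 and 2 above use the z-test, so they can be sketched with the standard library alone. The helper `z_test` and its `alternative` labels are assumptions made for illustration. Example 3 needs t critical values, which Python's standard library does not provide, so it is left out here.

```python
from math import sqrt
from statistics import NormalDist


def z_test(x_bar, mu0, sd, n, alpha, alternative):
    """Return (z, reject) for a z-test on a single mean.

    alternative: "two-sided" (Ha: mu != mu0), "greater" (Ha: mu > mu0),
    or "less" (Ha: mu < mu0).
    """
    z = (x_bar - mu0) / (sd / sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":
        reject = z < -std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("unknown alternative")
    return z, reject


# Example 1: fishing line, Ho: mu = 8 vs. Ha: mu != 8, sigma = .5, alpha = .01
z1, reject1 = z_test(7.8, 8, 0.5, 50, 0.01, "two-sided")
print(f"Example 1: z = {z1:.3f}, reject Ho: {reject1}")

# Example 2: life span, Ho: mu <= 70 vs. Ha: mu > 70, s = 8.9 (n > 30), alpha = .05
z2, reject2 = z_test(71.8, 70, 8.9, 100, 0.05, "greater")
print(f"Example 2: z = {z2:.3f}, reject Ho: {reject2}")
```

In this sketch z is about -2.83 for Example 1 and about 2.02 for Example 2, and Ho is rejected in both cases.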

For the same data set, as α


Consequently, if Ho is rejected at α-level of significance then Ho will also be
rejected at a higher level of significance using the same data. For example, if Ho
is rejected at α α
Ho. However, Ho will not necessarily be rejected at α = 0.01.
3. An alternative way to report the results of the test is to compute the p-value. The
p-value is the smallest value of α for which Ho will be rejected based on the sample
information. Reporting the p-value will allow the reader of the published research
to evaluate the extent to which the data disagree with Ho. In particular, it enables
each reader to choose their personal value of α.

If the p-value ≤ α, reject Ho; otherwise, do not reject Ho.
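
As a hedged illustration of the p-value approach, the sketch below computes p-values for the two z-test examples of this section (the fishing line and the life span data). The function name `z_p_value` is hypothetical, and the p-value is taken as the tail probability of the standard normal distribution beyond the observed z.

```python
from math import sqrt
from statistics import NormalDist


def z_p_value(z, alternative):
    """p-value of a z statistic for the given alternative hypothesis."""
    cdf = NormalDist().cdf
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(z)))
    if alternative == "greater":
        return 1 - cdf(z)
    if alternative == "less":
        return cdf(z)
    raise ValueError("unknown alternative")


# Fishing line: Ho: mu = 8 vs. Ha: mu != 8
z1 = (7.8 - 8) / (0.5 / sqrt(50))
p1 = z_p_value(z1, "two-sided")

# Life span: Ho: mu <= 70 vs. Ha: mu > 70
z2 = (71.8 - 70) / (8.9 / sqrt(100))
p2 = z_p_value(z2, "greater")

for label, p in (("fishing line", p1), ("life span", p2)):
    for alpha in (0.10, 0.05, 0.01):
        decision = "reject Ho" if p <= alpha else "do not reject Ho"
        print(f"{label}: p = {p:.4f}, alpha = {alpha}: {decision}")
```

The fishing line p-value is below .01, so Ho is rejected at all three levels. The life span p-value lies between .01 and .05, so Ho is rejected at .10 and .05 but not at .01.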
Exercises: pp. 315-316 of Walpole nos.1-8
1. An electrical firm manufactures light bulbs that have a length of life that is
approximately normally distributed with a mean of 800 hours and a standard
deviation of 40 hours. Test the hypothesis that µ = 800 hours against the
alternative µ ≠ 800 hours if a random sample of 30 bulbs has an average life of
788 hours. Use a .04 level of significance.
2. In a research report by Richard H. Weindruch of the UCLA Medical School, it is
claimed that mice with an average lifespan of 32 months will live to be about 40
months old when 40% of the calories in their food are replaced by vitamins and
minerals. Is there any reason to believe that µ < 40 if 64 mice that are placed on
this diet have an average life of 38 months with a standard deviation of 5.8
months? Use a .025 level of significance.
3. The average height of females in the freshman class of a certain college has been
162.5 cm. with a standard deviation of 6.9 cm. Is there reason to believe that
there has been a change in the average height if a random sample of 50 females in
the present freshman class has an average height of 165.2 cm.? Use a .02 level of
significance.
4. It is claimed that an automobile is driven on the average less than 25,000
kilometers per year. To test this claim, a random sample of 100 automobile
owners is asked to keep a record of the kilometers they travel. Would you agree
with this claim if the random sample showed an average of 23,500 kilometers and
a standard deviation of 3,900 kilometers? Use a 0.01 level of significance.
5. Test the hypothesis that the average content of containers of a particular lubricant
is 10 liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1,
10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Use a .01 level of significance and
assume that the distribution of contents is normal.
6. According to Dietary Goals for the United States (1977), high sodium intake may
be related to ulcers, stomach cancer, and migraine headaches. The human
requirement for salt is only 230 milligrams per day, which is surpassed in most
single servings of ready-to-eat cereals. A random sample of 20 similar servings of
Special K had mean sodium content of 244 milligrams of sodium and a standard
deviation of 24.5 milligrams. Is there sufficient evidence to believe that the
average sodium content for single servings of Special K exceeds the human
requirement for salt at α = .05? Assume normality.
7. A random sample of 8 cigarettes of a certain brand has an average nicotine
content of 4.2 mg. and a standard deviation of 1.4 mg. Is this in line with the
manufacturer’s claim that the average nicotine content does not exceed 3.5 mg.?
Use a .01 level of significance and assume the distribution of nicotine contents to
be normal.
8. Last year the employees of a city sanitation department donated an average of
$8.00 to the volunteer rescue squad. Test the hypothesis at the .01 level of
significance that the average contribution this year is still $8.00 if a random
sample of 12 employees showed an average donation of $8.90 with a standard
deviation of $1.75. Assume the donations are approximately normally
distributed.
9.3 TESTING THE DIFFERENCE BETWEEN TWO POPULATION MEANS
• Based on 2 independent samples

a. σ1² and σ2² known. Test statistic:

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}}$$

Ho               Ha               Critical region
µ1 - µ2 ≥ do     µ1 - µ2 < do     z < -zα
µ1 - µ2 ≤ do     µ1 - µ2 > do     z > zα
µ1 - µ2 = do     µ1 - µ2 ≠ do     |z| > zα/2

b. σ1² = σ2² but unknown, n1, n2 ≤ 30. Test statistic:

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{S_p\sqrt{(1/n_1) + (1/n_2)}}, \qquad S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$

Ho               Ha               Critical region
µ1 - µ2 ≥ do     µ1 - µ2 < do     t < -tα,n1+n2-2
µ1 - µ2 ≤ do     µ1 - µ2 > do     t > tα,n1+n2-2
µ1 - µ2 = do     µ1 - µ2 ≠ do     |t| > tα/2,n1+n2-2

c. σ1² ≠ σ2² and unknown, n1, n2 ≤ 30. Test statistic:

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{(S_1^2/n_1) + (S_2^2/n_2)}}, \qquad v = \frac{(S_1^2/n_1 + S_2^2/n_2)^2}{\dfrac{(S_1^2/n_1)^2}{n_1 - 1} + \dfrac{(S_2^2/n_2)^2}{n_2 - 1}}$$

Ho               Ha               Critical region
µ1 - µ2 ≥ do     µ1 - µ2 < do     t < -tα,v
µ1 - µ2 ≤ do     µ1 - µ2 > do     t > tα,v
µ1 - µ2 = do     µ1 - µ2 ≠ do     |t| > tα/2,v

Remark The remarks made on confidence interval estimation for the difference
between means relative to the use of a given statistic apply to the tests
described here. See Theorem 5 of Chapter 7.
If σ1² and σ2² are unknown but n1, n2 > 30, use the z-test described in (a)
but replace the population standard deviations by the sample standard
deviations, that is,

$$z = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{(S_1^2/n_1) + (S_2^2/n_2)}}$$

Example: A course in mathematics is taught to 12 students by the conventional
classroom procedure. A second group of 10 students was given the same course
by means of programmed materials. At the end of the semester the same
examination was given to each group. The 12 students meeting in the classroom
made an average grade of 85 with a standard deviation of 4, while the 10 students
using programmed materials made an average of 81 with a standard deviation of
5. Test the hypothesis that the two methods of learning are equal using a .10 level
of significance. Assume the populations to be approximately normal with equal
variances.
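
A minimal sketch of the pooled-variance t-test for this example is given below. The function name `pooled_t` is hypothetical. Because the standard library has no t distribution, the critical value t.05,20 = 1.725 is an assumption taken from a standard t-table (two-tailed test at α = .10 with 12 + 10 - 2 = 20 degrees of freedom).

```python
from math import sqrt


def pooled_t(x1_bar, s1, n1, x2_bar, s2, n2, d0=0.0):
    """Two-sample t statistic assuming equal population variances."""
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    t = ((x1_bar - x2_bar) - d0) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Classroom group: n = 12, mean 85, s = 4; programmed materials: n = 10, mean 81, s = 5
t_stat, df = pooled_t(85, 4, 12, 81, 5, 10)

T_CRITICAL = 1.725  # t_{.05, 20} from a t-table (assumed value, not computed here)
reject = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.3f}, df = {df}, reject Ho: {reject}")
```

Under these assumptions t is about 2.09 with 20 degrees of freedom, which exceeds 1.725, so Ho: µ1 = µ2 would be rejected at the .10 level.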
Testing the Difference Between Two Population Means Based on Two Related Samples

Test statistic: $t = \dfrac{\bar{d} - d_0}{s_d/\sqrt{n}}$, where $\bar{d}$ and $s_d$ are the mean and
standard deviation of the n paired differences.

Ho               Ha               Critical region
µX - µY ≥ do     µX - µY < do     t < -tα,n-1
µX - µY ≤ do     µX - µY > do     t > tα,n-1
µX - µY = do     µX - µY ≠ do     |t| > tα/2,n-1

Note This test is an exact α-level test if the differences between the paired scores
come from a normal distribution.

Example: A taxi company is trying to decide whether the use of radial tires instead of
regular belted tires improves fuel economy. Twelve cars were equipped with
radial tires and driven over a prescribed test course. Without changing drivers,
the same cars were then equipped with regular belted tires and driven once again
over the test course. The gasoline consumption, in kilometers per liter, was
recorded as follows:

Cars Radial Tires Belted Tires


1 4.2 4.1
2 4.7 4.9
3 6.6 6.2
4 7.0 6.9
5 6.7 6.8
6 4.5 4.4
7 5.7 5.7
8 6.0 5.8
9 7.4 6.9
10 4.9 4.7
11 6.1 6.0
12 5.2 4.9
At the 0.025 level of significance, can we conclude that cars equipped with radial
tires give better fuel economy than those equipped with belted tires? Assume the
populations to be normally distributed.

t-Test: Paired Two Sample for Means

                                Radial         Belted
Mean                            5.75           5.608333333
Variance                        1.108181818    0.988106061
Observations                    12             12
Pearson Correlation             0.983002407
Hypothesized Mean Difference    0
df                              11
t Stat                          2.484515151
P(T<=t) one-tail                0.015164753
t Critical one-tail             2.200985159
P(T<=t) two-tail                0.030329506
t Critical two-tail             2.593092681
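
The t Stat in the output above can be reproduced from the raw data. The sketch below is an illustration only; the critical value is copied from the "t Critical one-tail" row of the output rather than computed, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

radial = [4.2, 4.7, 6.6, 7.0, 6.7, 4.5, 5.7, 6.0, 7.4, 4.9, 6.1, 5.2]
belted = [4.1, 4.9, 6.2, 6.9, 6.8, 4.4, 5.7, 5.8, 6.9, 4.7, 6.0, 4.9]

# Paired differences, one per car
diffs = [r - b for r, b in zip(radial, belted)]
n = len(diffs)
mean_diff = mean(diffs)
sd_diff = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)

d0 = 0  # hypothesized mean difference
t_stat = (mean_diff - d0) / (sd_diff / sqrt(n))

T_CRITICAL = 2.201  # t_{.025, 11}, from the "t Critical one-tail" row above
reject = t_stat > T_CRITICAL  # Ha: radial tires give better fuel economy
print(f"d-bar = {mean_diff:.4f}, s_d = {sd_diff:.4f}, t = {t_stat:.4f}, reject Ho: {reject}")
```

The computed statistic agrees with the t Stat of about 2.4845 in the output, which exceeds 2.201, so Ho is rejected at the 0.025 level.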
Exercises: pp. 317-318 of Walpole nos. 10-19
10. A random sample of size n1 = 25, taken from a normal population with standard
deviation σ1 = 5.2, has a mean x̄1 = 81. A second random sample of size n2 = 36,
taken from a different normal population with a standard deviation σ2 = 3.4, has a
mean x̄2 = 76. Test the hypothesis at the .06 level of significance that µ1 = µ2.
11. A manufacturer claims that the average tensile strength of thread A exceeds the
average tensile strength of thread B by at least 12 kg. To test this claim, 50 pieces
of each type of thread are tested under similar conditions. Type A thread had an
average tensile strength of 86.7 kg. with a standard deviation of 6.28 kg., while
type B thread had an average tensile strength of 77.8 kg. with a standard deviation
of 5.61 kg. Test the manufacturer’s claim using a .05 level of significance.
12. A study was made to estimate the difference in salaries of college professors in
the private and state colleges of Virginia. A random sample of 100 professors in
private colleges showed an average 9-month salary of $26,000 with a standard
deviation of $1300. A random sample of 200 professors in state colleges showed
an average salary of $26,900 with a standard deviation of $1400. Test the
hypothesis that the average salary for professors teaching in state colleges does
not exceed the average salary for professors teaching in private colleges by more
than $500. Use a .02 level of significance.
13. Given two random samples of size n1 = 11 and n2 = 14, from two independent
normal populations, with x̄1 = 75, x̄2 = 60, s1 = 6.1, and s2 = 5.3, test the
hypothesis at the .05 level of significance that µ1 = µ2. Assume the population
variances are equal.
14. A study is made to see if increasing the substrate concentration has an appreciable
effect on the velocity of a chemical reaction. With the substrate concentration of
1.5 moles per liter, the reaction was run 15 times with an average velocity of 7.5
micromoles per 30 min. and a standard deviation of 1.5. With a substrate
concentration of 2.0 moles per liter, 12 runs were made yielding an average
velocity of 8.8 micromoles per 30 min. and a standard deviation of 1.2. Would
you say that the increase in substrate concentration increases the mean velocity by
more than .5 micromole per 30 min.? Use a .01 level of significance and assume
the populations to be approximately normally distributed with equal variances.
15. A study was made to determine if the subject matter in a physics course is better
understood when a lab constitutes part of the course. Students were allowed to
choose between a 3-unit course without lab and a 4-unit course with lab. In the
section with lab, a sample of 11 students had an average grade of 85 with a
standard deviation of 4.7, and in the section without lab, a sample of 17 students
had an average grade of 79 with a standard deviation of 6.1. Would you say that
the laboratory course increases the average grade by more than 5 points? Use a
0.01 level of significance and assume the populations to be approximately
normally distributed with equal variances.
16. A large automobile manufacturing company is trying to decide whether to
purchase brand A or brand B tires for its new models. To help arrive at a
decision, an experiment is conducted using 12 of each brand. The tires are run
until they wear out. The results are x̄A = 37,900 km., sA = 5100 km., x̄B = 39,800
km, and sB = 5900 km. Test the hypothesis at the .05 level of significance that
there is no difference in the two brands of tires. Assume the populations to be
approximately normally distributed.
17. The following data represent the running time of films produced by two motion
picture companies:
Time (minutes)
Company 1 103 94 110 87 98
Company 2 97 82 123 92 175 88 118
Test the hypothesis that the average running time of films produced by company 2
exceeds the average running time of films produced by company 1 by 10 minutes
against the one-sided alternative that the difference is more than 10 minutes. Use
a 0.1 level of significance and assume the distributions of times to be
approximately normal with unequal variances.

t-Test: Two-Sample Assuming Unequal Variances

                                Company 2      Company 1
Mean                            110.7142857    98.4
Variance                        1035.904762    76.3
Observations                    7              5
Hypothesized Mean Difference    10
df                              7
t Stat                          0.181131997
P(T<=t) one-tail                0.430698476
t Critical one-tail             1.414923928
P(T<=t) two-tail                0.861396953
t Critical two-tail             1.894578604

18. In Exercise 21 on p. 265, test the hypothesis, at the .05 level of significance, that
the average yields of the two varieties of wheat are equal.
University
1 2 3 4 5 6 7 8 9
Variety 1 38 23 35 41 44 29 37 31 38
Variety 2 45 25 31 38 50 33 36 40 43

19. In Exercise 22 on p. 265, test the hypothesis, at the .01 level of significance, that
µ1 ≥ µ2.
Taxi Brand A Brand B
1 34,400 36,700
2 45,500 46,800
3 36,700 37,700
4 32,000 31,100
5 48,400 47,800
6 32,800 36,400
7 38,100 38,900
8 30,100 31,500
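
The unequal-variances output shown for Exercise 17 can be checked with a short sketch of case (c) from Section 9.3. The function name `welch_t` is hypothetical, and the critical value 1.415 is copied from the "t Critical one-tail" row of that output rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample1, sample2, d0=0.0):
    """t statistic and approximate degrees of freedom v for unequal variances."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1  # S1^2 / n1
    v2 = variance(sample2) / n2  # S2^2 / n2
    t = ((mean(sample1) - mean(sample2)) - d0) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


company1 = [103, 94, 110, 87, 98]
company2 = [97, 82, 123, 92, 175, 88, 118]

# Ho: mu2 - mu1 = 10 vs. Ha: mu2 - mu1 > 10
t_stat, df = welch_t(company2, company1, d0=10)

T_CRITICAL = 1.415  # t_{.10, 7}, from the "t Critical one-tail" row of the output
reject = t_stat > T_CRITICAL
print(f"t = {t_stat:.4f}, v = {df:.2f}, reject Ho: {reject}")
```

The statistic matches the t Stat of about 0.1811 in the output, and v is about 7.19, which the output reports as 7. Since 0.18 is well below 1.415, Ho is not rejected.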
9.4 TESTING A HYPOTHESIS ON PROPORTIONS
Consider the problem of testing the hypothesis that the proportion of successes in a
binomial experiment equals some specified value.

If the unknown proportion is not expected to be too close to 0 or 1 and n is large, a large
sample approximation is given by:

Test statistic: $Z = \dfrac{x - np_0}{\sqrt{np_0q_0}}$, where $q_0 = 1 - p_0$

Ho          Ha          Critical region
p ≥ po      p < po      z < -zα
p ≤ po      p > po      z > zα
p = po      p ≠ po      |z| > zα/2

Example A commonly prescribed drug on the market for relieving nervous tension
is believed to be only 60% effective. Experimental results with a new drug
administered to a random sample of 100 adults who were suffering from
nervous tension showed that 70 received relief. Is this sufficient evidence
to conclude that the new drug is superior to the one commonly prescribed?
Use a 0.05 level of significance.
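
A small sketch of this test, using only the standard library, is shown below. The function name `proportion_z` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z(x, n, p0):
    """Large-sample z statistic for Ho about a single proportion."""
    q0 = 1 - p0
    return (x - n * p0) / sqrt(n * p0 * q0)


# New drug: 70 of 100 adults received relief; Ho: p <= .6 vs. Ha: p > .6
z = proportion_z(70, 100, 0.6)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value z_alpha
reject = z > z_crit
print(f"z = {z:.4f}, critical value = {z_crit:.3f}, reject Ho: {reject}")
```

Here z is about 2.04, which exceeds the critical value of about 1.645, so the data support the claim that the new drug is superior at the 0.05 level.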

9.5 TESTING THE DIFFERENCE BETWEEN TWO PROPORTIONS


The testing procedure involves selection of independent samples of size n1 and n2 from
two binomial populations. The sample proportions $\hat{p}_1$ and $\hat{p}_2$ are computed and the test is
as follows:

Test statistic:

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}}$$

Ho               Ha               Critical region
p1 - p2 ≥ d0     p1 - p2 < d0     z < -zα
p1 - p2 ≤ d0     p1 - p2 > d0     z > zα
p1 - p2 = d0     p1 - p2 ≠ d0     |z| > zα/2

Example In a survey of 200 students, 78 of the 120 females in the sample passed
Math 17 on their first take while this figure is 60 among the 80 males. Will
you agree that the proportion of males who passed Math 17 on their first
take is higher than the proportion of females who passed the same course
on their first take? Test at α = 0.05.
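
The sketch below works through this example with the test statistic for the difference between two proportions. The function name `two_proportion_z` is hypothetical. Population 1 is taken to be the females and population 2 the males, so the claim that the male proportion is higher corresponds to Ha: p1 - p2 < 0.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2, d0=0.0):
    """Large-sample z statistic for Ho about p1 - p2 (independent samples)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1 - p2) - d0) / se


# Females: 78 of 120 passed; males: 60 of 80 passed
# Ho: p1 - p2 >= 0 vs. Ha: p1 - p2 < 0
z = two_proportion_z(78, 120, 60, 80)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)
reject = z < -z_crit
print(f"z = {z:.4f}, critical value = {-z_crit:.3f}, reject Ho: {reject}")
```

In this sketch z is about -1.54, which does not fall below -1.645, so at α = 0.05 there is not enough evidence that the proportion of males who passed is higher.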
Exercises: p. 331 of Walpole nos. 1-12
1. A manufacturer of cigarettes claims that 20% of the cigarette smokers prefer
brand X. To test this claim a random sample of 20 cigarette smokers is selected
and asked what brand they prefer. If 6 of the 20 smokers prefer brand X, what
conclusion do we draw? Use a .05 level of significance.
2. Suppose that in the past 40% of all adults favored capital punishment. Do we
have reason to believe that the proportion of adults favoring capital punishment
today has increased if, in a random sample of 15 adults, 8 favor capital
punishment? Use a .05 level of significance.
3. A coin is tossed 20 times resulting in 5 heads. Is this sufficient evidence to reject
the hypothesis at the .03 level of significance that the coin is balanced in favor of
the alternative that heads occur less than 50% of the time?
4. It is believed that at least 60% of the residents in a certain area favor an
annexation suit by a neighboring city. What conclusion would you draw if only
110 in a sample of 200 voters favor the suit? Use a .04 level of significance.
5. The gas company claims that two-thirds of the houses in a certain city are heated
by natural gas. Do we have reason to doubt this claim if, in a random sample of
1000 houses in this city, it is found that 618 are heated by natural gas? Use a .02
level of significance.
6. At a certain college it is estimated that fewer than 25% of the students have cars
on campus. Does this seem to be a valid estimate if, in a random sample of 90
college students, 28 are found to have cars? Use a .05 level of significance.
7. In a study to estimate the proportion of residents in a certain city and its suburbs
who favor the construction of a nuclear power plant, it is found that 63 of 100
urban residents favor the construction while only 59 of 125 suburban residents are
in favor. Is there a significant difference between the proportion of urban and
suburban residents who favor construction of the nuclear plant? Use a .04 level of
significance.
8. A cigarette-manufacturing firm distributes two brands of cigarettes. If it is found
that 56 of 200 smokers prefer brand A and 29 of 150 smokers prefer brand B, can
we conclude at the .06 level of significance that brand A outsells brand B?
9. A geneticist is interested in the proportion of males and females in the population
that have a certain minor blood disorder. In a random sample of 100 males, 31
are found to be afflicted, whereas 24 of the 100 females tested appear to have the
disorder. Can we conclude at the .01 level of significance that the proportion of
men in the population afflicted with this blood disorder is significantly higher
than the proportion of women afflicted?
10. A study is made to determine if a cold climate results in more absenteeism
from school during a semester than a warmer climate. Two groups of students are
selected at random, one group from Maine and the other from Alabama. Of the
300 students from Maine, 72 were absent at least 1 day during the semester, and
of the 400 students from Alabama, 70 were absent 1 or more days. Can we
conclude that a colder climate results in a greater number of students being absent
from school at least 1 day during the semester? Use a .05 level of significance.
11. A vote is to be taken among the residents of a town and the surrounding country
to determine whether a civic center will be constructed. The proposed
construction site is within the town limits and for this reason many voters in the
country feel that the proposal will pass because of the large proportion of town
voters who favor the construction. If 120 of 200 town voters favor the proposal
and 240 of 500 country residents favor it, test the hypothesis that the percentage
of town voters favoring the construction of a civic center will not exceed the
percentage of country voters by more than 3%. Use a .025 level of significance.
12. With reference to Exercise 8, test the hypothesis at the .06 level of significance
that brand A outsells brand B by at least 10%.

In 1788, James Madison, John Jay, and Alexander Hamilton anonymously published a
series of essays entitled The Federalist. These Federalist papers were an attempt to
convince the people of New York that they should ratify the Constitution. In the course
of history, the authorship of these papers became known, but 12 remained contested.
Through statistical analysis, particularly of the frequency with which various words are
used, we can now conclude that Madison is the likely author of the 12 papers.
In fact, the statistical evidence that he is the author is overwhelming.
Before the election of 1936, a contest between Democratic incumbent Franklin Roosevelt
and Republican Alf Landon, the magazine Literary Digest had been extremely successful
in predicting the results in the US presidential elections. But 1936 turned out to be its
downfall, when it predicted a victory for Landon. To add insult to injury, young pollster
George Gallup, who had just founded the American Institute of Public Opinion in 1935,
correctly predicted Roosevelt as the winner of the election. He did this before the Digest
even conducted its poll! And Gallup surveyed only 50,000 people, while the Literary Digest
sent questionnaires to 10 million people.

The Literary Digest made two classic mistakes. First, the lists of people to whom they
mailed the 10 million questionnaires were taken from magazine subscribers, car owners,
telephone directories, and lists of registered voters. In 1936, those who owned telephones
or cars, or subscribed to magazines, were more likely to be wealthy individuals who were
not happy with the Democratic incumbent.

Despite what accounts of this famous story conclude, the bias produced by the more
affluent list was not likely to have been as severe as the second problem. The main
problem was volunteer response. They received 2.3 million responses, a response rate of
only 23%. Those who felt strongly about the outcome of the election were more likely to
respond and that included a majority of those who wanted a change, the Landon
supporters. Those who were happy with the incumbent were less likely to bother to
respond.

Gallup, on the other hand, knew the value of random sampling. He was not only able to
predict the election but he also predicted what the results of the Literary Digest poll
would be to within 1%. How did he do this? He just chose 3,000 people at random from
the same lists the Digest was going to use, and mailed them all a postcard asking them
how they planned to vote.
9.6. TEST FOR INDEPENDENCE
The test for independence is used to determine whether two variables are related or not.
For example, we might test whether a person’s music preference is related to his
intelligence quotient. We then take a random sample and for each subject determine their
music preference and classify their IQs into different categories (high, medium, low).
The observed frequencies are presented in what is known as a contingency table shown
below:

Music                      IQ
Preference     High   Medium    Low   Total
Classical        40       26     17      83
Pop              47       59     25     131
Rock             83      104     79     266
Total           170      189    121     480

A contingency table containing r rows and c columns is referred to as an r x c table. The
row and column totals are called marginal frequencies. Note that in a test for
independence, these marginal frequencies are not fixed in advance but depend instead on
the way the sample distributes itself across the various cells in the table.

Procedure:
1. State the null and alternative hypothesis.
Ho: The two variables are independent
Ha: The two variables are not independent.
2. Choose the level of significance.
3. Compute the test statistic, given by
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ = observed number of cases in the ith row of the jth column
$E_{ij}$ = expected number of cases under Ho
$= \frac{(i\text{th row total})(j\text{th column total})}{\text{grand total}}$
4. Decision Rule: Reject Ho if $\chi^2 > \chi^2_{\alpha,(r-1)(c-1)}$

Remarks:
1. The test is valid if at least 80% of the cells have expected frequencies of at least 5
and no cell has an expected frequency ≤ 1.
2. If many expected frequencies are very small, researchers commonly combine
categories of variables to obtain a table having larger cell frequencies. Generally,
one should not pool categories unless there is a natural way to combine them.
3. For a 2x2 contingency table, a correction called Yates’ correction for continuity is
applied. The formula then becomes
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(|O_{ij} - E_{ij}| - .5)^2}{E_{ij}}$$

Example: Ho: Music preference and IQ are independent
Ha: Music preference and IQ are not independent

Observed frequencies, with expected frequencies under Ho in parentheses:

Music                              IQ
Preference     High          Medium         Low          Total
Classical      40 (29.4)     26 (32.7)      17 (20.9)       83
Pop            47 (46.4)     59 (51.6)      25 (33.0)      131
Rock           83 (94.2)     104 (104.7)    79 (67.1)      266
Total          170           189            121            480

The test statistic value is
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 12.38$$
At α = 0.05, the critical value is $\chi^2_{\alpha,(r-1)(c-1)} = \chi^2_{.05,4} = 9.488$.
Decision: Since 12.38 > 9.488, reject Ho. There is sufficient evidence at the 0.05 level of
significance that music preference and IQ are not independent.
Remember, association does not imply causation.
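
The procedure above can be sketched in Python. This is only an illustration, not part of
the original notes: the function name chi_square_independence is made up for this sketch,
and the critical value 9.488 is copied from the chi-square table rather than computed.
Because the expected frequencies are not rounded to one decimal place here, the statistic
comes out slightly different from the hand-computed 12.38.

```python
# Chi-square test for independence on the music preference / IQ table.
observed = [
    [40, 26, 17],    # Classical: High, Medium, Low
    [47, 59, 25],    # Pop
    [83, 104, 79],   # Rock
]


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected frequencies)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # E_ij = (ith row total)(jth column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


statistic, df, expected = chi_square_independence(observed)

# Critical value for alpha = 0.05 and v = 4, read from the chi-square table.
CRITICAL_05_DF4 = 9.488
reject_h0 = statistic > CRITICAL_05_DF4

print("chi-square =", round(statistic, 2), "df =", df, "reject Ho:", reject_h0)
```

The same function applies to any r x c table of counts; only the critical value has to be
looked up again for the new degrees of freedom.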

Row proportions (distribution of IQ within each music preference):

Music                              IQ
Preference     High              Medium            Low
Classical      40/83 = .48       26/83 = .31       17/83 = .20
Pop            47/131 = .36      59/131 = .45      25/131 = .19
Rock           83/266 = .31      104/266 = .39     79/266 = .30

Column proportions (distribution of music preference within each IQ category):

Music                              IQ
Preference     High              Medium            Low
Classical      40/170 = .24      26/189 = .14      17/121 = .14
Pop            47/170 = .28      59/189 = .31      25/121 = .21
Rock           83/170 = .49      104/189 = .55     79/121 = .65
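Critical values of the chi-square distribution. The entry in row v and column α is the
value $\chi^2_{\alpha,v}$ such that $P(\chi^2_v > \chi^2_{\alpha,v}) = \alpha$.

v\α 0.10 0.05 0.025 0.01 0.005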
1 2.706 3.841 5.024 6.635 7.879
2 4.605 5.991 7.378 9.210 10.597
3 6.251 7.815 9.348 11.345 12.838
4 7.779 9.488 11.143 13.277 14.860
5 9.236 11.070 12.833 15.086 16.750
6 10.645 12.592 14.449 16.812 18.548
7 12.017 14.067 16.013 18.475 20.278
8 13.362 15.507 17.535 20.090 21.955
9 14.684 16.919 19.023 21.666 23.589
10 15.987 18.307 20.483 23.209 25.188
11 17.275 19.675 21.920 24.725 26.757
12 18.549 21.026 23.337 26.217 28.300
13 19.812 22.362 24.736 27.688 29.819
14 21.064 23.685 26.119 29.141 31.319
15 22.307 24.996 27.488 30.578 32.801
16 23.542 26.296 28.845 32.000 34.267
17 24.769 27.587 30.191 33.409 35.718
18 25.989 28.869 31.526 34.805 37.156
19 27.204 30.144 32.852 36.191 38.582
20 28.412 31.410 34.170 37.566 39.997
21 29.615 32.671 35.479 38.932 41.401
22 30.813 33.924 36.781 40.289 42.796
23 32.007 35.172 38.076 41.638 44.181
24 33.196 36.415 39.364 42.980 45.559
25 34.382 37.652 40.646 44.314 46.928
26 35.563 38.885 41.923 45.642 48.290
27 36.741 40.113 43.195 46.963 49.645
28 37.916 41.337 44.461 48.278 50.993
29 39.087 42.557 45.722 49.588 52.336
30 40.256 43.773 46.979 50.892 53.672
X. Linear Regression and Correlation
10.1 A Simple Linear Probabilistic Model
Consider the problem of predicting a student’s final grade in a college freshman calculus
course based on his score on a mathematics achievement test administered prior to
college entrance. We wish to
 determine whether X = the achievement test score is actually related to a student’s
Y = grade in calculus and
 obtain an equation that will be useful for predicting Y as a function of X.
The evidence is a sample of the achievement test scores and calculus grades for
ten college freshmen. We will assume that these ten students constitute a random sample
drawn from the population of freshmen who have already entered the university or will
do so in the immediate future.

Student    X     Y
1         39    65
2         43    78
3         21    52
4         64    82
5         57    92
6         47    89
7         28    73
8         75    98
9         34    56
10        52    75

[Figure: scatter plot of calculus grade Y against achievement test score X for the ten students]

The mathematical equation of a straight line is
$$Y = \beta_0 + \beta_1 X$$
where $\beta_0$ is the y-intercept, the value of Y when X = 0, and
$\beta_1$ is the slope of the line, the change in Y for a one-unit change in X.
The linear model $Y = \beta_0 + \beta_1 X$ is said to be a deterministic mathematical model
because, when a value of X is substituted into the equation, the value of Y is determined
and no allowance is made for error.

In contrast to the deterministic model, we might employ a probabilistic mathematical
model, which is a simple modification of the deterministic model. Rather than saying
that Y and X are related by the deterministic model $Y = \beta_0 + \beta_1 X$, we say that the
expected value of Y for a given value of X has a graph that is a straight line. That is, we
let $E(Y \mid X) = \beta_0 + \beta_1 X$. So we write the probabilistic model for any particular
observed value of Y as
$$Y = E(Y \mid X) + \varepsilon = \beta_0 + \beta_1 X + \varepsilon$$
where $\varepsilon$ is a random error, the difference between an observed value of Y and the mean
value of Y for a given X.
Thus, we assume that for any given value of X the observed value of Y varies in a
random manner and possesses a probability distribution with mean $E(Y \mid X)$.

Assumptions for the Probabilistic Model: For any given value of X, Y possesses a
normal distribution with a mean value $E(Y \mid X) = \beta_0 + \beta_1 X$ and with a variance of $\sigma^2$.
Furthermore, any one value of Y is independent of every other value.

10.2 The Method of Least Squares



If we denote the predicted value of Y obtained by the best-fitting straight line as $\hat{Y}$, the
prediction equation will be
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$
where $\hat{\beta}_0$ and $\hat{\beta}_1$ represent estimates of the parameters $\beta_0$ and $\beta_1$.

Least squares criterion: Choose as the “best-fitting” line the line that minimizes the
sum of squares for error
$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.$$
The method for finding the numerical values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize SSE uses
differential calculus and is beyond the scope of this course. The resulting estimates are
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{where } S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

For now, let us use the following EXCEL output to find the least squares prediction line
for the calculus grade-achievement test score data and predict a student’s calculus grade
if the student scored X = 50 on the achievement test.

               Coefficients    Standard Error    t Stat         P-value
Intercept      40.78415521     8.506861379       4.794265875    0.00136551
X Variable 1   0.765561843     0.174984967       4.375014926    0.002364532

 The best-fitting straight line relating the calculus grade to the achievement test
score is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$, or $\hat{Y}$ = 40.78415521 + .765561843X.
 .765561843 is the estimated change in Y for a 1-unit change in X.
 The Y intercept will not be interpreted since X = 0 is not part of the range of X.
 If a student scores X = 50 on the achievement test, his or her predicted calculus
grade would be $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ = 40.78415521 + .765561843(50) = 79.06225.
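
As a cross-check on the EXCEL coefficients, the least squares estimates can be computed
directly from the ten data pairs. The following Python sketch is an illustration only; the
function name least_squares is a hypothetical helper, not something the course or EXCEL
provides.

```python
# Least squares line for the calculus grade (y) vs. achievement test score (x) data.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]


def least_squares(xs, ys):
    """Return (b0, b1), the estimated intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = s_xy / s_xx            # slope estimate S_xy / S_xx
    b0 = y_bar - b1 * x_bar     # intercept estimate
    return b0, b1


b0, b1 = least_squares(x, y)
predicted_at_50 = b0 + b1 * 50  # predicted calculus grade for X = 50

print("intercept:", b0)
print("slope:", b1)
print("prediction at X = 50:", predicted_at_50)
```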
10.3 Inferences

The third parameter in our linear probabilistic model is $\sigma^2$ and its estimator is
$$\hat{\sigma}^2 = MSE = \frac{SSE}{n-2}$$
where MSE stands for mean squared error.
In the following EXCEL output, MSE = 75.75323363 can be found in the second row,
fourth column while SSE = 606.025869 can be found in the same row, third column.

ANOVA
              df    SS             MS             F             Significance F
Regression     1    1449.974131    1449.974131    19.1407556    0.002364532
Residual       8    606.025869     75.75323363
Total          9    2056

Does X contribute information for the prediction of Y; i.e., do the data provide sufficient
evidence to indicate that Y increases (or decreases) linearly as X increases over the region
of observation? We would wish to test Ho: $\beta_1 = 0$ vs. Ha: $\beta_1 \neq 0$. The test statistic is
$$t = \frac{\hat{\beta}_1}{\sqrt{MSE / S_{xx}}} \quad \text{where } S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

We reject Ho if $|t| > t_{\alpha/2,n-2}$. Alternatively, we can construct a (1-α)100% confidence
interval for $\beta_1$ of the form
$$\hat{\beta}_1 \pm t_{\alpha/2,n-2} \sqrt{\frac{MSE}{S_{xx}}}.$$

In the following EXCEL output (see last row), $\hat{\beta}_1$ = .765561843, $\sqrt{MSE / S_{xx}}$ = .174984967,
$t = \hat{\beta}_1 / \sqrt{MSE / S_{xx}}$ = 4.375014926 with p-value of .002364532 for testing Ho: $\beta_1 = 0$ vs.
Ha: $\beta_1 \neq 0$, and 95% CI of (.362045786, 1.169077901).

               Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Intercept      40.78415521    8.506861379      4.794265875   0.00136551    21.16729771   60.40101272
X Variable 1   0.765561843    0.174984967      4.375014926   0.002364532   0.362045786   1.169077901
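
The quantities in the last row of the EXCEL output can be reproduced with a few lines of
Python. This is a sketch under one stated assumption: the critical value
$t_{.025,8}$ = 2.306 is taken from a t table and hard-coded, so the interval agrees with
EXCEL only to about three decimal places.

```python
import math

# Calculus grade (y) vs. achievement test score (x) data.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx                   # slope estimate
sse = s_yy - s_xy ** 2 / s_xx      # sum of squares for error
mse = sse / (n - 2)                # estimate of sigma^2
se_b1 = math.sqrt(mse / s_xx)      # standard error of the slope
t_stat = b1 / se_b1                # test statistic for Ho: beta_1 = 0

# Assumed table value t_{.025, 8}; it is not computed here.
T_CRIT = 2.306
ci = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)
reject_h0 = abs(t_stat) > T_CRIT

print("MSE =", mse, "SE =", se_b1, "t =", t_stat)
print("95% CI for the slope:", ci, "reject Ho:", reject_h0)
```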
10.4 Multiple Regression Models

So far we have restricted our attention to the problem of predicting Y as a linear function
of a single variable X. For example, Y = company’s regional sales of a product can be
predicted by X1 = amount of the company’s television advertising expenditures, or X2 = the
amount of newspaper advertising expenditures, or X3 = number of sales representatives
assigned to the region. The more useful regression models involve several Xs. So instead
of using $Y = \beta_0 + \beta_1 X + \varepsilon$ where X is any of the Xs, we use
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$.

A political scientist may wish to relate Y = success in a political campaign to X1 =
characteristics of the candidate, X2 = nature of the opposition, X3 = various campaign
issues, X4 = campaign expenditures, and X5 = promotional techniques. His model would
be $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon$.

10.5 A Coefficient of Correlation

Sometimes we wish to obtain an indicator of the strength of the linear relationship
existing between two variables Y and X that is independent of their respective scales of
measurement. We call this a measure of the linear correlation between Y and X. The
commonly used measure of the linear correlation is called the Pearson product moment
coefficient of correlation between Y and X,
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \quad \text{where } S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$

In the calculus grade-achievement test score data,
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1894}{\sqrt{2474 \times 2056}} = .839786.$$
The EXCEL output follows.

            Column 1    Column 2
Column 1    1
Column 2    0.839786    1

The denominators used in calculating $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$ and $\hat{\beta}_1 = S_{xy} / S_{xx}$ will always be
positive. Since the numerators are identical, r and $\hat{\beta}_1$ will assume the same sign.

[Figure: scatter plots illustrating r = 1, -1 < r < 0, r ≈ 0, and 0 < r < 1, including a case where the relationship is nonlinear]

The coefficient of determination $r^2$ = .705240336 is the proportion of the variability in the
observed values of Y that can be explained by X and is nothing but the square of the
correlation coefficient between X and Y.

The sample correlation coefficient r is an estimator of a population correlation coefficient
$\rho$, which would be obtained if the correlation coefficient were calculated by using all the
points in the population. A test of Ho: $\rho$ = 0 that no correlation exists between Y and X
has the following test statistic
$$t = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}}$$
with v = n-2 and is identical to $t = \hat{\beta}_1 / \sqrt{MSE / S_{xx}}$, the test statistic for testing Ho: $\beta_1 = 0$.
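
The claim that the two test statistics are identical can be checked numerically. The
Python sketch below is illustrative only; it starts from the sums $S_{xy}$ = 1894,
$S_{xx}$ = 2474 and $S_{yy}$ = 2056 given in these notes and compares the t computed
from r with the t computed from the slope.

```python
import math

# Sums of squares and cross products for the calculus grade data (from the notes).
n = 10
s_xy = 1894.0
s_xx = 2474.0
s_yy = 2056.0

r = s_xy / math.sqrt(s_xx * s_yy)    # Pearson correlation coefficient
r_squared = r ** 2                   # coefficient of determination

# t statistic for Ho: rho = 0, with n - 2 degrees of freedom.
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# t statistic for Ho: beta_1 = 0, computed from the slope and its standard error.
b1 = s_xy / s_xx
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)
t_from_slope = b1 / math.sqrt(mse / s_xx)

print("r =", r, "r^2 =", r_squared)
print("t from r =", t_from_r, "t from slope =", t_from_slope)
```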
Computations

x      y      x-x̄    y-ȳ    (x-x̄)(y-ȳ)   xy      (x-x̄)²   (y-ȳ)²   x²      y²
39     65     -7     -11    77           2535    49       121      1521    4225
43     78     -3     2      -6           3354    9        4        1849    6084
21     52     -25    -24    600          1092    625      576      441     2704
64     82     18     6      108          5248    324      36       4096    6724
57     92     11     16     176          5244    121      256      3249    8464
47     89     1      13     13           4183    1        169      2209    7921
28     73     -18    -3     54           2044    324      9        784     5329
75     98     29     22     638          7350    841      484      5625    9604
34     56     -12    -20    240          1904    144      400      1156    3136
52     75     6      -1     -6           3900    36       1        2704    5625
Total:
460    760                  1894         36854   2474     2056     23634   59816

$$S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 36854 - \frac{(460)(760)}{10} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 1894$$

$$S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 23634 - \frac{460^2}{10} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 2474$$

$$S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 59816 - \frac{760^2}{10} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 2056$$

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{1894}{2474} = 0.765561843$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{760}{10} - .765561843 \left( \frac{460}{10} \right) = 40.78415521$$

$$SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 2056 - \frac{1894^2}{2474} = 606.025869$$

$$\hat{\sigma}^2 = s^2 = MSE = \frac{SSE}{n-2} = \frac{606.025869}{10-2} = 75.75323363$$

$$\text{Standard error of } \hat{\beta}_1 = \sqrt{\frac{s^2}{S_{xx}}} = \sqrt{\frac{75.75323363}{2474}} = 0.174984967$$
XI. The Analysis of Variance
11.1 Introduction

Suppose that you want to compare the mean size of health insurance claims submitted by
five groups of policy holders. Ten claims were randomly selected from among the claims
for each group. Do the data contained in the five samples provide sufficient evidence to
indicate a difference in the mean levels of claims among the five health groups? We look
for a single test of Ho: $\mu_1 = \mu_2 = \cdots = \mu_5$ vs. Ha: at least one pair of means differs. We
assume that the observations within each sampled population are normally distributed with
a common variance $\sigma^2$.

Group 1 Group 2 Group 3 Group 4 Group 5


$763 $1335 $596 $3742 $1632
4365 1262 1448 1833 5078
2144 217 1183 375 3010
1998 4100 3200 2010 671
5412 2948 630 743 2145
957 3210 942 867 4063
1286 867 1285 1233 1232
311 3744 128 1072 1456
863 1635 844 3105 2735
1499 643 1683 1767 767

11.2 The Completely Randomized Design: A One-Way Classification

The analysis of experimental data depends on the design of the experiment, which refers
to the way the data were collected. A very useful and relatively simple design called the
completely randomized design is one in which random samples are independently
selected from each of k populations. This design results in observations that are
classified only according to the population from which they came. For example, in
assessing voter preference concerning the next city/municipality election, we may wish to
select random samples of registered voters in each of k barangays within the
city/municipality.

We want to compare k population means $\mu_1, \mu_2, \ldots, \mu_k$ based on independent random
samples of $n_1, n_2, \ldots, n_k$ observations selected from populations 1, 2, …, k, respectively.
Let $x_{ij}$ be the jth measurement in the ith health group. The sum of squares of deviations
of all $n = n_1 + n_2 + \cdots + n_k$ = 50 values about their overall mean $\bar{x}$,
$$\text{Total SS} = S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2,$$
can be partitioned into sum of squares for treatments/groups (SST, a measure of variation
among sample means) and sum of squares for error (SSE, a measure of variation within
samples). Thus, Total SS = SST + SSE.
For now we will be guided by EXCEL outputs. Total SS = 84153818.88 can be found in
the second column, last row, SSE = 77411264.4 in the same column, second row, SST =
6742554.48 in the same column, first row.

ANOVA
Source of Variation   SS            Df   MS           F             P-value       F crit
Between Groups        6742554.48    4    1685638.62   0.979879847   0.428070522   2.578739184
Within Groups         77411264.4    45   1720250.32
Total                 84153818.88   49

The third column refers to the degrees of freedom of each sum of squares. For Total SS,
49 = n – 1, for SSE, 45 = n – k, and for SST, 4 = k – 1.

The fourth column is the mean squares column calculated by dividing the sum of squares
by its degrees of freedom. So mean square for treatments
$$MST = \frac{SST}{k-1} = \frac{6742554.48}{5-1} = 1685638.62$$
and mean square error
$$MSE = \frac{SSE}{n-k} = \frac{77411264.4}{50-5} = 1720250.32.$$

When Ho is true, both mean squares are independent estimators of $\sigma^2$.

To test Ho: $\mu_1 = \mu_2 = \cdots = \mu_5$ we use the test statistic F = MST/MSE = 0.979879847 (see first
row) with p-value of 0.428070522. We reject Ho if $F > F_{\alpha;\, v_1 = k-1,\, v_2 = n-k}$ = 2.578739184 at α
= .05. Since 0.98 < 2.58, Ho is not rejected here.

A (1-α)100% CI for a single treatment mean µi is of the form
$$\bar{x}_i \pm t_{\alpha/2,n-k} \sqrt{\frac{MSE}{n_i}}.$$
For example, a 95% CI for µ4 is
$$1674.7 \pm 1.96 \sqrt{\frac{1720250.32}{10}} = 1674.7 \pm 812.9276493 = (861.7723507, 2487.627649).$$
The value 1.96 is the large-sample (normal) approximation; the exact $t_{.025,45}$ is about 2.014
and gives a slightly wider interval.

SUMMARY
Groups Count Sum Average Variance
Column 1 10 19598 1959.8 2751539.289
Column 2 10 19961 1996.1 1915389.878
Column 3 10 11939 1193.9 703792.7667
Column 4 10 16747 1674.7 1137055.789
Column 5 10 22789 2278.9 2093473.878

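A (1-α)100% CI for a difference between two treatment means µi - µj is of the form
$$(\bar{x}_i - \bar{x}_j) \pm t_{\alpha/2,n-k} \sqrt{MSE \left( \frac{1}{n_i} + \frac{1}{n_j} \right)}.$$
For example, a 95% CI for µ1 - µ3 (using the large-sample value 1.96 in place of $t_{.025,45}$) is
$$(1959.8 - 1193.9) \pm 1.96 \sqrt{1720250.32 \left( \frac{1}{10} + \frac{1}{10} \right)} = 765.9 \pm 1149.653307 = (-383.753307, 1915.553307).$$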
Computing Formulas

                       Population
Statistic       1                          2                          …    k
Sample size     $n_1$                      $n_2$                      …    $n_k$
Total           $T_1 = \sum_j x_{1j}$      $T_2 = \sum_j x_{2j}$      …    $T_k = \sum_j x_{kj}$
Sample mean     $\bar{x}_1 = T_1 / n_1$    $\bar{x}_2 = T_2 / n_2$    …    $\bar{x}_k = T_k / n_k$

$$CM = \frac{T^2}{n}$$
where T = total of all observations = $\sum_i \sum_j x_{ij} = \sum_i T_i$

$$\text{Total SS} = S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \sum_i \sum_j x_{ij}^2 - CM$$

$$SST = \sum_i \frac{T_i^2}{n_i} - CM \qquad SSE = \text{Total SS} - SST$$
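
The computing formulas can be applied to the health insurance claims data in a short
Python sketch. This is an illustration rather than the course's method of record (the
notes rely on EXCEL); the variable names are chosen for this example only.

```python
# One-way ANOVA (completely randomized design) via the computing formulas.
groups = [
    [763, 4365, 2144, 1998, 5412, 957, 1286, 311, 863, 1499],     # Group 1
    [1335, 1262, 217, 4100, 2948, 3210, 867, 3744, 1635, 643],    # Group 2
    [596, 1448, 1183, 3200, 630, 942, 1285, 128, 844, 1683],      # Group 3
    [3742, 1833, 375, 2010, 743, 867, 1233, 1072, 3105, 1767],    # Group 4
    [1632, 5078, 3010, 671, 2145, 4063, 1232, 1456, 2735, 767],   # Group 5
]

k = len(groups)                          # number of treatments
n = sum(len(g) for g in groups)          # total number of observations
group_totals = [sum(g) for g in groups]  # T_i
grand_total = sum(group_totals)          # T

cm = grand_total ** 2 / n                # correction for the mean
total_ss = sum(v ** 2 for g in groups for v in g) - cm
sst = sum(t ** 2 / len(g) for t, g in zip(group_totals, groups)) - cm
sse = total_ss - sst

mst = sst / (k - 1)
mse = sse / (n - k)
f_stat = mst / mse

print("SST =", sst, "SSE =", sse, "Total SS =", total_ss)
print("MST =", mst, "MSE =", mse, "F =", f_stat)
```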

11.3 Randomized Block Design

Consider the problem of assessing the effects of three different package designs on the
number or amount of sales. We might decide to use a completely randomized design and
select 12 supermarkets and display each of the designs in four different markets. Unless
the markets all had similar characteristics, differences in sales for the three package
designs might also reflect differences in the characteristics of the stores. One way to
avoid this problem is to use, say, four stores and display each of the three designs in all
four stores. This way store-to-store variability has been eliminated.

As another example, suppose the CEO of a large construction company employs three
experienced construction engineers to perform the time-consuming cost analyses,
estimates, and bids for the work on large construction projects. It is important to know
whether these three estimators tend to produce estimates at the same mean level or
whether one or another tends to always submit a high (or low) bid on projects. Each of
the three estimators would be required to produce an analysis, an estimate, and a bid
price for the same set of projects. In this way, differences in bids for the same projects
can be compared, thereby eliminating project-to-project variability.

Project
Estimator 1 2 3 4 5
1 3.52 4.71 3.89 5.21 4.14
2 3.39 4.79 3.82 4.93 3.96
3 3.64 4.92 4.19 5.10 4.20

An analysis of variance for a randomized block design partitions the total sum of squares
into three parts: SST (measures the variation among treatment means), SSB (measures
the variation among block means), and SSE (measures the variation of the differences
among the treatment observations within blocks). That is, Total SS = SST + SSB + SSE.
Using the following EXCEL output (see second column), Total SS = 5.09096, SST =
0.13456, SSB = 4.88896, and SSE = 0.06744.

Source of Variation   SS        df   MS        F             P-value       F crit
Rows                  0.13456   2    0.06728   7.981020166   0.012424095   4.458970108
Columns               4.88896   4    1.22224   144.9869514   1.6952E-07    3.837853355
Error                 0.06744   8    0.00843
Total                 5.09096   14

For the degrees of freedom (third column), SST has 2 = k-1 = 3-1 where k is the number
of treatments (estimators), SSB has 4 = b-1 = 5-1 where b is the number of blocks
(projects), SSE has 8 = n-b-k+1 = 15-5-3+1, and Total SS has 14 = n-1 = 15-1. The
fourth column is the mean square column. As in CRD, MS is SS/df so that
$$MST = \frac{SST}{k-1}, \qquad MSB = \frac{SSB}{b-1}, \qquad MSE = \frac{SSE}{n-b-k+1}.$$
Under the respective null hypotheses, all three MS are independent estimates of $\sigma^2$.

To test Ho: No differences among the k treatment means, we use F = MST/MSE =
7.981020166 (see first row) with p-value of 0.012424095. Ho is rejected if F >
$F_{k-1,\,n-b-k+1}$ = 4.458970108 at α = .05.

To test Ho: No differences among the b block means, we use F = MSB/MSE = 144.9869514
(second row) with p-value of 1.6952E-07. Ho is rejected if F > $F_{b-1,\,n-b-k+1}$ =
3.837853355 at α = .05.

A (1-α)100% CI for a difference between two treatment means µi - µj is of the form
$$(\bar{x}_i - \bar{x}_j) \pm t_{\alpha/2,\,n-b-k+1} \sqrt{MSE \left( \frac{2}{b} \right)}.$$
For example, a 95% CI for µ1 - µ3 is
$$(4.294 - 4.41) \pm 2.306 \sqrt{.00843 \left( \frac{2}{5} \right)} = -.116 \pm 0.13390694 = (-0.24990694, 0.01790694).$$
One can verify that the CI for µ1 - µ2 is (-.01791, .249907) and the CI for µ2 - µ3 is
(-.36591, -.09809).
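
The sums of squares in the EXCEL output can be reproduced with a short Python sketch that
extends the CRD computing formulas by a block term. This is an illustration only; the
block formula used here, SSB = (sum of squared block totals)/k - CM, is the standard
analogue of the SST formula and is not written out in these notes.

```python
# Randomized block design: rows are treatments (estimators), columns are blocks (projects).
bids = [
    [3.52, 4.71, 3.89, 5.21, 4.14],   # Estimator 1
    [3.39, 4.79, 3.82, 4.93, 3.96],   # Estimator 2
    [3.64, 4.92, 4.19, 5.10, 4.20],   # Estimator 3
]

k = len(bids)        # number of treatments
b = len(bids[0])     # number of blocks
n = k * b            # total number of observations

grand_total = sum(sum(row) for row in bids)
cm = grand_total ** 2 / n    # correction for the mean

total_ss = sum(v ** 2 for row in bids for v in row) - cm
sst = sum(sum(row) ** 2 for row in bids) / b - cm          # treatments
ssb = sum(sum(col) ** 2 for col in zip(*bids)) / k - cm    # blocks
sse = total_ss - sst - ssb

mse = sse / (n - b - k + 1)
f_treatments = (sst / (k - 1)) / mse
f_blocks = (ssb / (b - 1)) / mse

print("SST =", sst, "SSB =", ssb, "SSE =", sse)
print("F (treatments) =", f_treatments, "F (blocks) =", f_blocks)
```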

SUMMARY Count Sum Average Variance


Row 1 5 21.47 4.294 0.44953
Row 2 5 20.89 4.178 0.43417
Row 3 5 22.05 4.41 0.3554

Column 1 3 10.55 3.516666667 0.015633333


Column 2 3 14.42 4.806666667 0.011233333
Column 3 3 11.9 3.966666667 0.038633333
Column 4 3 15.24 5.08 0.0199
Column 5 3 12.3 4.1 0.0156
Worked computation for the health insurance claims data (columns are Groups 1 to 5):

763     1335    596     3742    1632
4365    1262    1448    1833    5078
2144    217     1183    375     3010
1998    4100    3200    2010    671
5412    2948    630     743     2145
957     3210    942     867     4063
1286    867     1285    1233    1232
311     3744    128     1072    1456
863     1635    844     3105    2735
1499    643     1683    1767    767

$T_i$ = 19598, 19961, 11939, 16747, 22789
$T = \sum_i T_i$ = 91034
$CM = T^2 / n$ = 165743783
$\sum_i \sum_j x_{ij}^2$ = 249897602
Total SS $= \sum_i \sum_j x_{ij}^2 - CM$ = 84153819

$T_i^2$ = 384081604, 398441521, 142539721, 280462009, 519338521
$T_i^2 / n_i$ = 38408160, 39844152, 14253972, 28046201, 51933852
$\sum_i T_i^2 / n_i$ = 172486337.6
SST $= \sum_i T_i^2 / n_i - CM$ = 6742554
SSE = Total SS - SST = 77411264

ANOVA
Source of Variation SS
Between Groups 6742554.48
Within Groups 77411264.4
Total 84153818.88
The chi-square statistic for testing independence is also applicable when testing Ho: p1 =
p2 = …= pk.

Example: In a shop study, a set of data was collected to determine whether or not the
proportion of defectives produced by workers was the same for the day,
evening, or night shift work. The following data were collected:

Shift
Day Evening Night
Defectives 45 55 70
Nondefectives 905 890 870

Use a .025 level of significance to determine if the proportion of defectives is the same
for all three shifts.

Observed frequencies, with expected frequencies under Ho in parentheses:

                              Shift
                Day            Evening        Night          Total
Defectives      45 (57.0)      55 (56.7)      70 (56.3)      170
Nondefectives   905 (893.0)    890 (888.3)    870 (883.7)    2665
Total           950            945            940            2835

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 6.288$$
v = (r-1)(c-1) = (2-1)(3-1) = 2
$\chi^2_{.025,2}$ = 7.378
Decision: Since 6.288 < 7.378, do not reject Ho. There is insufficient evidence that the
proportion of defectives produced differs among the shifts.

Can verify: 95% CI for p1-p2: (-.03096, .009299)


CI for p1-p3: (-.04864, -.00556)
CI for p2-p3: (-.03873, .006194)
Goodness-of-Fit Test

We consider a test to determine if a population has a specified theoretical distribution.


The test is based upon how good a fit we have between the frequency of occurrence of
observations in an observed sample and the expected frequencies obtained from the
hypothesized distribution.

To illustrate, consider the following frequency distribution table constructed from the
lives of 40 similar car batteries. The batteries are guaranteed to last 3 years. Let us test
the hypothesis that the frequency distribution may be approximated by a normal
distribution with mean $\bar{x}$ = 3.41 and standard deviation s = .703.

Class Boundaries Oi
1.45-1.95 2
1.95-2.45 1
2.45-2.95 4
2.95-3.45 15
3.45-3.95 10
3.95-4.45 5
4.45-4.95 3

If the observed frequencies are close to the corresponding expected frequencies, the
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
value will be small, indicating a good fit. If the observed frequencies differ considerably
from the expected frequencies, the $\chi^2$ value will be large and the fit is poor.

The number of degrees of freedom in a chi-square goodness-of-fit test is equal to the
number of cells minus the number of quantities obtained from the observed data, which
are used in the calculations of the expected frequencies.

We reject Ho: good fit if $\chi^2 > \chi^2_{\alpha,v}$. The expected frequencies should be at least 5. This
restriction may require the combining of adjacent cells resulting in a reduction of the
number of degrees of freedom.
Going back to the example, the expected frequencies for each class/cell are obtained from
the normal curve having the same mean and standard deviation as our sample. These
values will be used for µ and σ in computing z values corresponding to the class
boundaries. For the first interval, we solve P(X < 1.95). For the last interval, we solve
P(X > 4.45). For the 4th interval, we solve P(2.95 < X < 3.45) = P(-.65 < Z < .06) = .2661
so that E4 = .2661(40) = 10.6.

Class Boundaries Oi Ei
1.45-1.95 2 0.8
1.95-2.45 1 2.7
2.45-2.95 4 6.9
2.95-3.45 15 10.6
3.45-3.95 10 10.2
3.95-4.45 5 6.0
4.45-4.95 3 2.8

Combining adjacent classes for expected frequencies less than 5 (the first three classes
are merged, as are the last two),

Class Boundaries   Oi    Ei
1.45-2.95          7     10.4
2.95-3.45          15    10.6
3.45-3.95          10    10.2
3.95-4.95          8     8.8

Thus,
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 3.015.$$
The number of degrees of freedom for this test is 4-3 = 1, since three quantities – the total
frequency, mean and standard deviation – of the observed data were required to find the
expected frequencies. Since $\chi^2_{.05,1}$ = 3.841, we have no reason to reject Ho and conclude
that the normal distribution provides a good fit for the distribution of battery lives.
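
The battery example can be sketched in Python with the standard library's
statistics.NormalDist. This is an illustration only. Because the normal probabilities are
computed exactly here, instead of from z values rounded to two decimals and expected
frequencies rounded to one decimal, the statistic differs somewhat from the hand-computed
3.015, although the decision is the same. The critical value 3.841 is copied from the
chi-square table.

```python
from statistics import NormalDist

# Sample mean, standard deviation and size for the battery lives.
mean, sd, n = 3.41, 0.703, 40
dist = NormalDist(mean, sd)

# Classes after combining: below 2.95, 2.95-3.45, 3.45-3.95, above 3.95.
# The first and last classes are treated as open-ended tails, as in the notes.
boundaries = [2.95, 3.45, 3.95]
observed = [7, 15, 10, 8]

# Cumulative probabilities at the class boundaries, padded with 0 and 1 for the tails.
cdf = [0.0] + [dist.cdf(bound) for bound in boundaries] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Cells minus the three quantities estimated from the data (total, mean, sd).
df = len(observed) - 3

CRITICAL_05_DF1 = 3.841   # from the chi-square table, alpha = 0.05, v = 1
reject_h0 = statistic > CRITICAL_05_DF1

print("expected:", [round(e, 1) for e in expected])
print("chi-square =", round(statistic, 3), "df =", df, "reject Ho:", reject_h0)
```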
Time Series Analysis

Definition: A time series is a sequence of n observations Y1, Y2, …, Yn on a process


at equally spaced points in time.

Some Applications:
 To forecast future values of Y. It is assumed that some of the patterns observed in
the past will continue into the future. Thus, if quantifiable information about the
past can be measured then this can be used to forecast what will happen in the
future. Forecasting is an important aid in effective and efficient planning.
 To facilitate comparisons with data for past years. For example, time series data
can be used to answer the question whether or not the recent increase in
unemployment is normal for this time of the year.
 To identify indicators that coincide with or precede a change in direction of a time
series (called a cyclical turning point) and help in anticipating such changes.

8.1 Components of a Time Series

TREND describes the long-term sweep of the series and is usually modeled by a
smooth curve. There are many types of trends such as linear (a constant
amount of increase/decrease in the trend value from one period to the
next) and exponential (the trend value changes at a constant rate from one
period to the next).

SEASONAL describes the short-term recurring pattern of change in the series and
consists of relatively repetitious cycles of fixed amplitude and duration.

CYCLICAL movements in a time series that, like seasonal variations, are recurrent but
that, unlike seasonal variations, occur in cycles longer than one year. This
pattern exists when the series is influenced by longer-time economic
fluctuations.

IRREGULAR describes the miscellaneous, erratic movements in the series and tends to
have an irregular, saw-toothed pattern

8.2 Smoothing Techniques

Purpose: to eliminate randomness so the underlying pattern that exists in a data


series can be projected into the future and used as the forecast

General Methods:
 Averaging methods wherein past observations are given equal weights in
evaluating the forecast
 Exponential smoothing methods, wherein past observations are given unequal
weights that decay exponentially
Single Moving Average

Let Yt be the variable at time t and Ft be the forecast at time t.

Step 1 Choose the number of periods T to be used in the computation of the forecast.
The larger the value of T, the greater the smoothing effect. The smaller the value
of T, the more the moving averages follow the pattern of the data.

Step 2 Compute the moving averages using the following formula:

\[ F_{T+1} = \frac{\sum_{t=1}^{T} Y_t}{T} \]

\[ F_{T+2} = \frac{\sum_{t=2}^{T+1} Y_t}{T} \]

\[ F_{T+k} = \frac{\sum_{t=k}^{T+k-1} Y_t}{T} \]

Note that the oldest observation is dropped as each new observation becomes
available.
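
The two steps above can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function name `moving_average_forecasts` is hypothetical, and the data are the eleven observed values tabulated in this section.

```python
def moving_average_forecasts(y, window):
    """Return the forecasts F[window+1], F[window+2], ... as a list.

    Entry k (k = 0, 1, ...) is the mean of y[k : k + window], i.e. the
    forecast for the period immediately after that window.
    """
    if window < 1 or window > len(y):
        raise ValueError("window must be between 1 and len(y)")
    return [sum(y[k:k + window]) / window
            for k in range(len(y) - window + 1)]

# Observed series from the table in this section (11 periods).
y = [200, 135, 195, 197.5, 310, 175, 155, 130, 220, 277, 235]

ma3 = moving_average_forecasts(y, 3)  # forecasts for periods 4 to 12
ma5 = moving_average_forecasts(y, 5)  # forecasts for periods 6 to 12
print([round(f, 2) for f in ma3])
print([round(f, 2) for f in ma5])
```

The last entry of each list is the forecast for the first period with no observation yet (DEC in the table).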

t          Yt       Ft based on 3-month MA   Ft based on 5-month MA
1          200
2          135
3          195
4          197.5    176.67
5          310      175.83
6          175      234.17                   207.50
7          155      227.50                   202.50
8          130      213.33                   206.50
9          220      153.33                   193.50
10         277      168.33                   198.00
11         235      209.00                   191.40
12 (DEC)            244.00                   203.40

Single Exponential Smoothing

Step 1 Choose the weight α (between 0 and 1) that will give the smallest forecast error.
A large value of α gives very little smoothing in the forecast, whereas a small
value of α gives considerable smoothing.

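Some measures of forecast error: Let et = Yt - Ft. Then

\[ \mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T} |e_t| \]

\[ \mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T} e_t^2 \]

\[ \mathrm{MAPE} = \frac{1}{T}\sum_{t=1}^{T} \left|\frac{e_t}{Y_t}\right| \times 100\% \]

\[ \mathrm{MSPE} = \frac{1}{T}\sum_{t=1}^{T} \left(\frac{e_t}{Y_t}\right)^2 \times 100\% \]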
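
As a worked illustration, take the fourth observation of the series tabulated in this chapter, Y4 = 197.5, with its 3-month moving-average forecast F4 = 176.67 (176.6667 before rounding). The contribution of that single period to each measure is roughly:

```latex
\begin{aligned}
e_4 &= Y_4 - F_4 = 197.5 - 176.67 = 20.83 \\
e_4^2 &\approx 434.03 \\
\left|\frac{e_4}{Y_4}\right| \times 100\% &= \frac{20.83}{197.5} \times 100\% \approx 10.55\% \\
\left(\frac{e_4}{Y_4}\right)^2 \times 100\% &\approx 1.11\%
\end{aligned}
```

Averaging such terms over all periods that have a forecast gives MAE, MSE, MAPE, and MSPE, respectively.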

Step 2 Compute the forecasts using the following formula:


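\[
\begin{aligned}
F_{t+1} &= \alpha Y_t + (1-\alpha) F_t \\
        &= \alpha Y_t + (1-\alpha)\alpha Y_{t-1} + (1-\alpha)^2 \alpha Y_{t-2} + (1-\alpha)^3 \alpha Y_{t-3} + \cdots
\end{aligned}
\]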
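
The recursion, and the exponentially decaying weights it implies, can be sketched in Python. The function names `ses_forecasts` and `ses_weights` are hypothetical, and the starting forecast is passed in explicitly because the formula itself does not fix F1; here it is set to the first observation as one possible choice.

```python
def ses_forecasts(y, alpha, initial):
    """Return [F_1, ..., F_{n+1}] with F_1 = initial and
    F_{t+1} = alpha * Y_t + (1 - alpha) * F_t."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    forecasts = [initial]
    for observed in y:
        forecasts.append(alpha * observed + (1 - alpha) * forecasts[-1])
    return forecasts

def ses_weights(alpha, n_terms):
    """Weights alpha * (1 - alpha)**j placed on Y_t, Y_{t-1}, ... in F_{t+1}."""
    return [alpha * (1 - alpha) ** j for j in range(n_terms)]

# Observed series used in this chapter's tables (11 periods).
y = [200, 135, 195, 197.5, 310, 175, 155, 130, 220, 277, 235]

f = ses_forecasts(y, alpha=0.5, initial=y[0])  # assumption: F_1 = Y_1
print([round(v, 4) for v in f])
print(ses_weights(0.5, 4))
```

With a larger alpha the weights fall off faster, so the forecast tracks the most recent observations more closely.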

SES with F1 = Y1

t          Yt       Ft, α = .1   Ft, α = .5   Ft, α = .9
1          200      200          200          200
2          135      200          200          200
3          195      193.5        167.5        141.5
4          197.5    193.65       181.25       189.65
5          310      194.035      189.375      196.715
6          175      205.6315     249.6875     298.6715
7          155      202.5684     212.3438     187.3672
8          130      197.8115     183.6719     158.2367
9          220      191.0304     156.8359     132.8237
10         277      193.9273     188.418      211.2824
11         235      202.2346     232.709      270.4282
12 (DEC)            205.5111     233.8545     238.5428

SES with F1 = Ȳ (the mean of the observed Yt)

t          Yt       Ft, α = .1   Ft, α = .5   Ft, α = .9
1          200      202.6818     202.6818     202.6818
2          135      202.4136     201.3409     200.2682
3          195      195.6723     168.1705     141.5268
4          197.5    195.605      181.5852     189.6527
5          310      195.7945     189.5426     196.7153
6          175      207.2151     249.7713     298.6715
7          155      203.9936     212.3857     187.3672
8          130      199.0942     183.6928     158.2367
9          220      192.1848     156.8464     132.8237
10         277      194.9663     188.4232     211.2824
11         235      203.1697     232.7116     270.4282
12 (DEC)            206.3527     233.8558     238.5428
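
Step 1 asks for the weight α that gives the smallest forecast error. One way to compare candidate weights is sketched below, using MSE as the criterion and averaging over all eleven observed periods. Both choices are assumptions made for illustration (the text does not say which error measure or which periods to use), and the function names are hypothetical.

```python
def ses_forecasts(y, alpha, initial):
    """F_1 = initial; F_{t+1} = alpha * Y_t + (1 - alpha) * F_t."""
    forecasts = [initial]
    for observed in y:
        forecasts.append(alpha * observed + (1 - alpha) * forecasts[-1])
    return forecasts

def mean_squared_error(y, forecasts):
    """MSE over the periods that have both an observation and a forecast."""
    errors = [observed - forecast for observed, forecast in zip(y, forecasts)]
    return sum(e * e for e in errors) / len(errors)

y = [200, 135, 195, 197.5, 310, 175, 155, 130, 220, 277, 235]
# The two starting values used in the SES tables of this section.
starts = {"F1 = Y1": y[0], "F1 = mean": sum(y) / len(y)}

mse_by_setting = {}
for label, initial in starts.items():
    for alpha in (0.1, 0.5, 0.9):
        forecasts = ses_forecasts(y, alpha, initial)
        mse_by_setting[(label, alpha)] = mean_squared_error(y, forecasts)

for (label, alpha), value in mse_by_setting.items():
    print(f"{label}, alpha = {alpha}: MSE = {value:.2f}")

best = min(mse_by_setting, key=mse_by_setting.get)
print("smallest MSE:", best)
```

A finer grid of α values, or a different error measure such as MAPE, may lead to a different choice.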

3-month MA

t          Yt       Ft       |et|     et^2       |et/Yt| (%)   (et/Yt)^2 (%)
1          200
2          135
3          195
4          197.5    176.67   20.83    434.03     10.55         1.11
5          310      175.83   134.17   18000.69   43.28         18.73
6          175      234.17   59.17    3500.69    33.81         11.43
7          155      227.50   72.50    5256.25    46.77         21.88
8          130      213.33   83.33    6944.44    64.10         41.09
9          220      153.33   66.67    4444.44    30.30         9.18
10         277      168.33   108.67   11808.44   39.23         15.39
11         235      209.00   26.00    676.00     11.06         1.22
12 (DEC)            244.00
Mean                         71.42    6383.13    34.89         15.01
                             (MAE)    (MSE)      (MAPE)        (MSPE)

5-month MA

t          Yt       Ft       |et|     et^2       |et/Yt| (%)   (et/Yt)^2 (%)
1          200
2          135
3          195
4          197.5
5          310
6          175      207.50   32.50    1056.25    18.57         3.45
7          155      202.50   47.50    2256.25    30.65         9.39
8          130      206.50   76.50    5852.25    58.85         34.63
9          220      193.50   26.50    702.25     12.05         1.45
10         277      198.00   79.00    6241.00    28.52         8.13
11         235      191.40   43.60    1900.96    18.55         3.44
12 (DEC)            203.40
Mean                         50.93    3001.49    27.86         10.08
                             (MAE)    (MSE)      (MAPE)        (MSPE)
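
The two error tables can be recomputed with a short Python sketch, which should agree with the tabulated summary values up to rounding. The helper names are hypothetical, and the averages are taken only over periods that have both an observation and a forecast (8 periods for the 3-month MA, 6 for the 5-month MA), as in the tables.

```python
def moving_average_errors(y, window):
    """Return (Y_t, e_t) pairs, e_t = Y_t - F_t, for periods with a forecast."""
    pairs = []
    for t in range(window, len(y)):
        forecast = sum(y[t - window:t]) / window  # mean of the previous `window` values
        pairs.append((y[t], y[t] - forecast))
    return pairs

def error_measures(pairs):
    """MAE, MSE, MAPE (%), and MSPE (%) from (observed, error) pairs."""
    n = len(pairs)
    return {
        "MAE": sum(abs(e) for _, e in pairs) / n,
        "MSE": sum(e * e for _, e in pairs) / n,
        "MAPE": sum(abs(e / obs) for obs, e in pairs) / n * 100,
        "MSPE": sum((e / obs) ** 2 for obs, e in pairs) / n * 100,
    }

y = [200, 135, 195, 197.5, 310, 175, 155, 130, 220, 277, 235]

results = {}
for window in (3, 5):
    results[window] = error_measures(moving_average_errors(y, window))
    rounded = {name: round(value, 2) for name, value in results[window].items()}
    print(f"{window}-month MA:", rounded)
```

On this series the 5-month moving average has the smaller value on all four measures, which is the kind of comparison Step 1 of the moving-average procedure calls for when choosing T.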
