Introduction
Almost daily we apply statistical concepts in our lives. For example, to start the day you
turn on the shower and let it run for a few moments. Then you put your hand in the
shower to sample the temperature and decide to add more hot water or more cold water,
or conclude that the temperature is just right and enter the shower. As a second example,
you are at the grocery store looking to buy a frozen pizza. One of the pizza makers has a
stand, and they offer a small wedge of their pizza. After sampling the pizza, you decide
whether to purchase the pizza or not. In both the shower and pizza examples, you make a
decision and select a course of action based on a sample.
Definition of Statistics
In its plural sense, statistics is a set of numerical data (e.g., annual GNP/GDP,
quarterly/monthly sales of a company, weekly/daily peso-dollar exchange rate).
In its singular sense, Statistics is the branch of science that deals with the
collection, presentation, organization, analysis, and interpretation of data.
Example In order to estimate the true proportion of students at a certain college who
smoke cigarettes, the administration polled a sample of 200 students and
determined that the proportion of students from the sample who smoke
cigarettes is 0.12. Identify the parameter and the statistic.
In 1662 John Graunt published an article “Natural and Political Observations Made upon
Bills of Mortality.” His “observations” were the result of his study and analysis of a
weekly church publication called “Bill of Mortality,” which listed births, christenings,
and deaths and their causes. This analysis and interpretation of social and political data
are thought to mark the start of statistics.
Fields of Statistics
a. Statistical Theory or Mathematical Statistics - deals with the development and
exposition of theories that serve as bases of statistical methods.
Classification of Variables
1. Qualitative (or Categorical) vs. Quantitative
Qualitative variable - a variable that yields categorical responses (e.g.,
political affiliation, occupation, marital status)
Continuous variable - a variable that can assume any of the infinitely many values
corresponding to a line interval
Levels of Measurement
1. Nominal Level (or Classificatory Scale)
Examples:
Sex M-Male F-Female
Marital status 1-Single 2-Married 3-Widowed 4-Separated
2. Ordinal Level
The ordinal level of measurement contains the properties of the nominal level,
and in addition, the numbers assigned to categories of any variable may be ranked
or ordered in some low-to-high manner.
Examples:
Teaching ratings 1-poor 2-fair 3-good 4-excellent
Year level 1-1st yr 2-2nd yr 3-3rd yr 4-4th yr
3. Interval Level
The interval level is that which has the properties of the nominal and ordinal
levels, and in addition, the distances between any two numbers on the scale are of
known sizes. An interval scale must have a common and constant unit of
measurement. Furthermore, the unit of measurement is arbitrary and there is no
“true zero” point.
Examples:
IQ
Temperature (in Celsius)
4. Ratio Level
The ratio level of measurement contains all the properties of the interval level,
and in addition, it has a “true zero” point.
Examples:
Age (in years)
No. of correct answers in an exam
Exercise: Identify the population under study and variable/s of interest.
a) The Office of Admissions is studying the relationship between the score in the
entrance examination during application and the general weighted average upon
graduation among graduates of the university from 2000 to 2005.
b) The research division of a certain pharmaceutical company is investigating the
effectiveness of a new diet pill in reducing weight on female adults.
c) The Department of Health is interested in determining the percentage of children
below 12 years old infected by the Hepatitis B virus in Metro Manila in 2006.
Heart disease is the most common cause of death in industrialized nations. In the US and
Canada, nearly 30% of deaths yearly are due to heart disease, mainly heart attacks. Does
regular aspirin intake reduce deaths from heart attacks? Harvard Medical School
conducted a landmark study to investigate. The people participating in the study
regularly took either an aspirin or a placebo (a pill with no active ingredient). Of those
who took aspirin, 0.9% had heart attacks during the study. Of those who took the
placebo, 1.7% had heart attacks, nearly twice as many.
Can you conclude that it’s beneficial for people to take aspirin regularly? Or could the
observed difference be explained by how it was decided which people would receive aspirin
and which would receive the placebo? For instance, might those who took aspirin have had
better results merely because they were healthier (or had a better diet or exercised more
regularly), on average, than those who took the placebo?
A TV exit poll used to project the election outcome reported that 53.1% of a sample of
3889 voters said they have voted for candidate A. Was this sufficient evidence to project
A as the winner, even though such information was available from such a small portion
of the more than 9.5 million voters?
If candidate A were actually going to lose the election, what’s the chance that he/she
would be supported by 53.1% of the exit poll voters? If the chance were extremely small,
we’d feel comfortable making the inference that A’s election was supported by a majority
of all 9.5 million voters.
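The chance asked about above can be approximated with a short calculation. The sketch below is only illustrative and rests on assumptions not stated in the text: the exit poll is treated as a simple random sample, a losing candidate is taken to have exactly 50% true support (the most favorable losing case), and the normal approximation to the sample proportion is used.

```python
import math

n = 3889        # exit-poll sample size (from the text)
p_hat = 0.531   # sample proportion who voted for candidate A (from the text)
p0 = 0.50       # ASSUMPTION: true support if A just barely fails to win

# Standard error of a sample proportion when the true proportion is p0.
se = math.sqrt(p0 * (1 - p0) / n)

# How many standard errors the observed proportion lies above p0.
z = (p_hat - p0) / se

# Upper-tail probability of a standard normal variable, P(Z >= z).
chance = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.2f}, approximate chance = {chance:.6f}")
```

Under these assumptions the observed 53.1% lies almost four standard errors above 50%, so the chance is well below one in a thousand.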
Chapter 2
Collection and Presentation of Data
2.1 PRELIMINARIES
Classification of Data
1. Primary vs. Secondary
a. Primary source - data measured by the researcher/agency that published it
b. Secondary source - any republication of data by another agency
We now enumerate some agencies where a researcher can avail of primary data.
a) Central Bank is a primary source of data on banking and finance.
b) National Statistics Office is a primary source of data on population, housing,
and establishments.
c) Pulse Asia is a primary source of data on opinions or sentiments of the people
on current issues.
d) Bureau of Agricultural Statistics is a primary source of data on agriculture and
livestock.
2. External vs Internal
a. Internal data - information that relates to the operations and functions of the
organization collecting the data
b. External data - information that relates to some activity outside the organization
collecting the data
Example The sales data of SM is internal data for SM but external data for
any other organization such as Robinson’s.
2.2 DATA COLLECTION METHODS
1. Survey method - questions are asked to obtain information, either through self-
administered questionnaire or personal (or phone) interview
a) Pulse Asia conducted a sample survey on voter response to political ads in the
May 2013 election. Its respondents were selected registered voters who intend to
vote in the 2013 election.
b) The Department of Energy regularly conducts the Household Energy
Consumption Survey to measure the level and pattern of energy consumption at
the national and regional levels.
c) The Food and Nutrition Research Institute regularly conducts the National
Nutrition Survey that generates data on malnutrition, prevalence of anemia,
Vitamin A and iodine deficiencies, the nutrient intake/adequacies of the members
in the households.
2. Experimental method - a scientific investigation conducted under controlled
situations where treatments are applied and their effects measured on the response
of interest to the experimenter. This is an excellent method of collecting data for
causation studies. If properly designed and executed, experiments will reveal,
with a good deal of accuracy, the effect of a change in one variable on another
variable.
3. Observation method - makes possible the recording of behavior but only at the
time of occurrence (e.g., observing reactions to a particular stimulus, traffic count,
behavior of animals in wildlife or newborn babies in nursery).
Advantages:
Observation is superior to the survey method in collecting data on
nonverbal behavior. In a survey, the researcher may encounter all sorts of
difficulties such as deliberate denial or memory failure. On the other
hand, an observer can make field notes that record the salient features of
the behavior, or may even record behavior in its totality via videotape.
Observation is superior to experiment in the sense that behavior takes
place in its natural environment. However, the presence of an observer
may alter the true behavior of the subjects.
The observer is able to conduct his study in the subject’s natural
environment, and is thus usually able to study over a much longer time
period than with either survey or experiment.
Possible sources:
a) The National Statistics Office is a major collector of data for both private and
government needs. It provides the public with basic data on various subject
matters such as household income and expenditure, housing, education, health,
employment, and others.
b) The National Statistical Coordination Board compiles data necessary for the
computation of the gross national product, gross domestic product, consumer
price index, and other indices.
c) The Department of Health is responsible for health statistics like prevalence of
diseases among infants and pregnant women, morbidity rates, family planning
methods, etc.
d) The Social Weather Stations keeps a record of poll results, social issues, and
others.
e) Theses of graduate students contain data used in their statistical inquiry.
In a case-control study, “cases” who have a particular attribute or condition are compared
to “controls” who do not. The idea is to compare the cases and controls to see how they
differ on an explanatory variable of interest. In medical settings, the cases usually are
individuals who have been diagnosed with a particular disease. Researchers then identify
a group of controls who are as similar as possible to cases, except that they don’t have the
disease. For example, samples of male heart attack patients (cases) and other male
hospital patients (controls) were compared with respect to the extent of baldness.
Clinical trials are experiments that study the effectiveness of medical treatments on actual
patients.
Convenience sampling
Thus, in conducting the survey, the researchers sought the assistance of doctors with
private clinics. When a patient consults one of these doctors and has AIDS, the social
scientists would interview this patient in return for a free-of-charge consultation. With
this method, the sample will include persons who consulted one of the appointed
physicians and volunteered to participate in the study to avail of the free consultation.
a) A researcher may use a particular district, province, or city to be the sample cluster in
representing their population of interest. For instance, the researcher can identify a
specific district of Quezon City whose households have the same profile in terms of the
socio-economic characteristics as the households in the whole Quezon City.
b) For a study that aims to predict the senatorial winners in the national election, a
researcher may include in the sample the provinces that have voted for the actual winners
in a series of past senatorial elections.
We give an example of a government study using purposive sampling.
The Producer’s Price Survey of NSO is a nationwide undertaking intended to provide the
price data needed in the computation of the Producer’s Price Index for manufacturing. To
select the items included in the sample, NSO used purposive sampling by using a set of
criteria to identify the commodities for the market basket. Some of the criteria are: (i) the
commodity has relatively high market share; (ii) the commodity was available in the
market in the base year; and, (iii) the current production of the commodity; and the
market share of the commodity has been stable during the last three years based on the
NSO Annual Survey of Establishment reports.
A researcher wishes to study the people’s views on birth control. The researcher believes
that a person’s views on birth control and his religion are related. Census results showed
that 70% of the people in the population are Catholics, 20% are Protestants, and 10% are
Muslims. The researcher then selects a sample reflecting the same proportions to
represent the three groupings. If there should be 200 respondents in the sample, then
the quotas set for each group are as follows: (i) Catholics: 70% of 200 = 140,
(ii) Protestants: 20% of 200 = 40, and (iii) Muslims: 10% of 200 = 20. This is quota
sampling and not stratified sampling if the researcher leaves the selection of the 140
Catholics, 40 Protestants, and 20 Muslims to the discretion of the interviewers.
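The quota arithmetic can be checked with a few lines of code. This is a minimal sketch; the variable names are illustrative, and only the percentages and the sample size of 200 come from the example.

```python
sample_size = 200

# Population shares from the census results, in percent.
population_share = {"Catholics": 70, "Protestants": 20, "Muslims": 10}

# Each quota is the group's share of the total sample size.
quotas = {group: sample_size * pct // 100
          for group, pct in population_share.items()}

for group, quota in quotas.items():
    print(f"{group}: {quota}")
```

The code only sets the quotas. What makes the design quota sampling is that the choice of respondents within each quota is left to the interviewers rather than to a random mechanism.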
Shortly after Bill Clinton became President of the United States, a television station in
Sacramento, California asked viewers to respond to the question, “Do you support the
President’s economic plan?” The next day, the results of a properly conducted study that
asked the same question were published in the newspaper.
To do this, we first list down all the 30 members of the organization and assign a
unique serial number, from 01 to 30, to each one of them.
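The selection that follows the listing step might be sketched as below, with a pseudo-random number generator standing in for a table of random numbers. The sample size n = 10 is the one quoted for this population in the stratified sampling example; the fixed seed is an assumption made only so the run is repeatable.

```python
import random

N = 30   # population size: members of the organization
n = 10   # sample size

# Serial numbers 01 to 30, one per member.
serial_numbers = [f"{i:02d}" for i in range(1, N + 1)]

# ASSUMPTION: a fixed seed, used only to make the illustration repeatable.
rng = random.Random(2024)

# Draw n distinct serial numbers; every subset of size n is equally likely.
sample = sorted(rng.sample(serial_numbers, n))
print(sample)
```

The members whose serial numbers are drawn make up the simple random sample.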
Advantages
The theory involved is much easier to understand than the theory behind other
sampling designs.
Inferential methods are simple and easy.
Disadvantages
The sample chosen may be widely spread, thus entailing high transportation costs.
A population frame, or list, is needed.
Less precise estimates result if the population is heterogeneous with respect to the
characteristic under study.
Below are some examples of simple random sampling.
Advantages
It is easier to draw the sample and often easier to execute without mistakes than
simple random sampling.
It is possible to select a sample in the field without a sampling frame.
The systematic sample is spread more evenly over the population.
Disadvantages
If periodic regularities are found in the list, a systematic sample may consist only
of similar types. (Example: Store sales over seven days of the week – estimating
total sales based on a systematic sample every Tuesday would be unwise.)
Knowledge of the structure of the population is necessary for its most effective
use.
Example Suppose we wish to select a sample of farms to estimate the total farm
production. If we have a list of farms with their corresponding sizes in
square meters, we can arrange the farms first according to size before we
select our systematic sample.
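A rough sketch of the farm example follows. The farm labels and sizes are hypothetical, and the procedure shown (sort by size, take a random start, then every k-th unit) is one common form of 1-in-k systematic sampling rather than a procedure taken from the text.

```python
import random

# HYPOTHETICAL farm sizes in square meters, for illustration only.
farm_sizes = {
    "F01": 5200, "F02": 800, "F03": 12500, "F04": 3100,
    "F05": 9400, "F06": 1500, "F07": 7600, "F08": 2200,
    "F09": 15000, "F10": 4300, "F11": 600, "F12": 10100,
}

n = 4  # desired sample size (assumed)

# Arrange the farms according to size before sampling.
ordered = sorted(farm_sizes, key=farm_sizes.get)

# Sampling interval k, and a random start among the first k positions.
k = len(ordered) // n
rng = random.Random(7)      # fixed seed only for repeatability
start = rng.randrange(k)    # 0, 1, ..., k-1

# Take every k-th farm beginning at the random start.
sample = ordered[start::k][:n]
print(sample)
```

Because the list is ordered by size, the sample contains small, medium, and large farms, which is the reason for arranging the list first.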
Stratified Sampling
Let us select a sample from the same population used in the previous Example but
this time we will use stratified sampling. The population has N=30 members of an
organization and the sample size is n=10 members. If the stratification variable is sex
then we would partition the population into two strata: (i) Stratum 1 – Males, and (ii)
Stratum 2 – Females. One way of allocating the 10 units in the sample is to distribute
them equally into the two strata. Thus, we will select n1=5 males and n2=5 females.
MALES                  FEMALES
01 Almeda, Joel        01 Abad, Melissa       12 Querido, Rose
02 Baluyot, Temy       02 Conlin, Juliet      13 Quiambao, Gina
03 Cruz, Raks          03 Corpuz, Joan        14 Quidayan, Candy
04 Fuentes, Mar        04 Dayrit, Erlyn       15 Santos, Emily
05 Lanuza, Jon         05 Diaz, Aurora        16 Tablante, Rita
06 Macasaet, Erwin     06 Foz, Vivian         17 Tolentino, Magda
07 Peña, Lito          07 Gomez, May          18 Tuason, Joy
08 Quebral, Joseph     08 Joson, Sonia        19 Zamora, Bea
09 Surla, Michael      09 La Pierre, Amy
10 Valdez, Ernie       10 Le, Diana
11 Venegas, Anthony    11 Macaibay, Macky
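With the two lists numbered separately, the selection might be sketched as follows: a simple random sample is drawn independently within each stratum. The labels M01 to M11 and F01 to F19 are shorthand for the serial numbers in the lists, and the fixed seed is an assumption made only for repeatability.

```python
import random

# Stratum 1: 11 males; Stratum 2: 19 females (serial numbers within stratum).
strata = {
    "Males": [f"M{i:02d}" for i in range(1, 12)],
    "Females": [f"F{i:02d}" for i in range(1, 20)],
}

# Equal allocation of the n = 10 sample units: 5 from each stratum.
allocation = {"Males": 5, "Females": 5}

rng = random.Random(1)  # fixed seed only for repeatability

# Draw an independent simple random sample from each stratum.
sample = {name: sorted(rng.sample(units, allocation[name]))
          for name, units in strata.items()}

for name, chosen in sample.items():
    print(name, chosen)
```

Equal allocation is only one option; allocating in proportion to stratum size would instead give about 4 males and 6 females here.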
Advantages
Stratification may produce a gain in precision in the estimates of characteristics of
the population
It allows for more comprehensive data analysis since information is provided for
each stratum.
It is administratively convenient.
Disadvantages
A listing of the population for each stratum is needed.
The stratification of the population may require additional prior information about
the population and its strata.
Suppose we want to get the opinion of business administration college students regarding
premarital sex. A good stratification variable is sex because the views of the males may
be very different from the views of the females. The population consists of N =500
business administration students and the sample size is n=50. Out of the 500, there are
300 female and 200 male students. The list of business administration students, together
with their respective sex, is available at the records section of the college, or at the Office
of the Registrar.
The Business Expectations Survey is a nationwide survey, which the Bangko Sentral ng
Pilipinas conducts every semester. The survey provides information useful to policy
makers and monetary managers for their economic and financial policy planning. It
presents data on the general perceptions of the business sector on the current state of
business and the economic prospects for the succeeding semester, and it computes
indicators of economic activity.
In the 2000 BES, the sampling frame was the Securities and Exchange Commission list
of the Philippines’ Top 2000 Corporations. The BSP stratified the firms in the list
according to the nine industry groups of the Philippine Standard Industrial Classification.
This allows each industry group to be represented in the sample. The BSP then selected the
sample of firms from each industry group using systematic sampling.
Cluster Sampling
Description of the Design
Cluster sampling is a method of sampling where a sample of distinct groups, or clusters,
of elements is selected and then a census of every element in the selected clusters is
taken. Similar to strata in stratified sampling, clusters are non-overlapping sub-
populations which together comprise the entire population. For example, a household is
a cluster of individuals living together or a city block might also be considered as a
cluster. Unlike strata, however, clusters are preferably formed with heterogeneous, rather
than homogeneous elements so that each cluster will be typical of the population.
Clusters may be of equal or unequal size. When all of the clusters are of the same size,
the number of elements in a cluster will be denoted by M while the number of clusters in
the population will be denoted by N.
Sample-Selection Procedure
Step 1 Number the clusters from 1 to N.
Step 2 Select n numbers from 1 to N at random. The clusters corresponding to the
selected numbers form the sample of clusters.
Step 3 Observe all the elements in the sample of clusters.
Step 1: Decide on how to divide the population into non-overlapping clusters. In this
example, we will use the barangays as the clusters so that the elementary units are the
households but the sampling units are the barangays.
Step 2: Get a list of all barangays in Mandaluyong City. Number the barangays in the
list, consecutively from 1 to 27.
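The cluster selection can be sketched as follows. Only the count of 27 barangays comes from the example; the number of sample clusters (n = 3) and the household lists are hypothetical placeholders.

```python
import random

N = 27  # number of barangays (clusters) in the list
n = 3   # ASSUMPTION: number of clusters to sample

rng = random.Random(11)  # fixed seed only for repeatability

# Select n cluster numbers at random from 1 to N.
selected = sorted(rng.sample(range(1, N + 1), n))

# HYPOTHETICAL household lists: 3 to 5 households per barangay.
households = {
    b: [f"B{b:02d}-H{h:02d}" for h in range(1, 4 + b % 3)]
    for b in range(1, N + 1)
}

# Observe ALL households in the selected barangays (a census within clusters).
sample_households = [h for b in selected for h in households[b]]

print("Selected barangays:", selected)
print("Households observed:", len(sample_households))
```

Note that only the list of barangays is needed to draw the sample; household lists are required only for the selected barangays.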
Advantages
A population list of elements is not needed; only a population list of clusters is
required. Listing cost is reduced.
Transportation cost is reduced.
Disadvantages
The costs and problems of statistical analysis are greater.
Estimation procedures are more difficult.
The National Statistics Office conducts the Census of Agriculture and Fisheries to collect
data from all agricultural and fishing operators, and all households engaged in
agricultural and fishing activities.
However, due to budgetary constraints, NSO was only able to collect sample data for the
1991 CAF. NSO used cluster sampling, where the barangays served as the clusters. For
each city/municipality, NSO prepared a list of barangays arranged in descending order,
according to the total farm area in the whole barangay. From this list, NSO selected a
sample of barangays using systematic sampling. All agricultural and fishing operators
and all households engaged in agricultural and fishing activities in the selected barangays
were included in the study. In the end, NSO included a total of 5,997,427 operators and
households for this study.
Multistage Sampling
Description of the Design
In multistage sampling, the population is divided into a hierarchy of sampling units
corresponding to the different sampling stages. In the first stage of sampling, the
population is divided into primary stage units (PSU) then a sample of PSUs is drawn. In
the second stage of sampling, each selected PSU is subdivided into second-stage units
(SSU) then a sample of SSUs is drawn. The process of subsampling can be carried to a
third stage, fourth stage and so on, by sampling the subunits instead of enumerating them
completely at each stage.
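A two-stage version of this design might look like the sketch below. The population of 8 primary stage units with 10 second-stage units each is hypothetical, and simple random sampling is assumed at both stages.

```python
import random

rng = random.Random(5)  # fixed seed only for repeatability

# HYPOTHETICAL population: 8 PSUs, each containing 10 SSUs.
population = {
    f"PSU{p}": [f"PSU{p}-unit{u:02d}" for u in range(1, 11)]
    for p in range(1, 9)
}

# Stage 1: draw a sample of 3 PSUs.
sampled_psus = rng.sample(sorted(population), 3)

# Stage 2: within each selected PSU, draw a sample of 4 SSUs
# instead of enumerating the PSU completely.
sample = {psu: rng.sample(population[psu], 4) for psu in sampled_psus}

for psu, units in sample.items():
    print(psu, units)
```

The difference from cluster sampling is the second stage: selected PSUs are subsampled rather than completely enumerated.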
Advantages
Listing cost is reduced.
Transportation cost is reduced.
Disadvantages
Estimation procedure is difficult, especially when the primary stage units are not
of the same size.
Estimation procedure gets more complicated as the number of sampling stages
increases.
The sampling procedure entails much planning before selection is done.
We now present an actual survey that used two-stage sampling in selecting the sample of
elements.
The Food and Nutrition Research Institute of the Department of Science and Technology
conducts the National Nutrition Survey every 5 years. This survey aims to determine the
prevalence of malnutrition and specific health problems in the country and to provide
data on food consumption and nutrient intake.
For this study, FNRI used 2-stage sampling to select a sample of individuals from each
province. The primary stage units are the barangays. The second stage units are the
individuals.
We now present an actual survey that used three-stage sampling in selecting the sample
of elements.
The Department of Tourism conducts the Visitor Sample Survey every month. This
survey aims to collect data on the demographic profile, travel characteristics, and
preferences of foreign visitors and overseas Filipinos who visited the country, for tourism
development planning and policy-making purposes.
DoT selects the sample of visitors using three-stage sampling. The primary stage units are
the weeks of the month. The second-stage units are the weekly flights. The third-stage
units are the visitors.
For this monthly survey, DoT selects the week of the month using simple random
sampling. From the selected week, they select a sample of weekly flights. They perform
this using stratified random sampling. DoT stratified all the regular weekly international
flights leaving the different international airports in the Philippines according to country
market. It then selects a sample of flights from each country market using simple random
sampling. From the selected flights, DoT selects a sample of visitors using simple
random sampling.
Read on The Questionnaire (Optional)
• Strategies in Writing the Questions (Closed- vs. Open-ended questions)
• Pitfalls to Avoid in Wording Questions
• Ways to Avoid Irrelevant Questions
• Question Order
• Cover Letter/ Introduction
• Pretest
Textual Presentation
• data incorporated to a paragraph of text
Example
The 2013 Young Adult Fertility Study (YAFS 4) conducted by the Demographic
Research & Development Foundation and the University of the Philippines
Population Institute shows that 32 percent of young Filipinos between the ages 15 to
24 have had sex before marriage. Of these, 78 percent reported that their first sexual
encounter was unprotected: 84 percent among young women and 73 percent
among young men.

The same study also found that 7.3 percent have engaged in casual sex while 3.5
percent have had regular sex without emotional attachment (FUBU). Five
percent of young men disclosed having experienced sex with another man (MSM).
Among individuals who are either formally married or in a live-in arrangement, 3
percent said they ever had an extra-marital affair.

Regional difference in premarital sex prevalence shows the National Capital
Region (NCR) having the highest prevalence at 41 percent and ARMM, the
lowest (7.7 percent).

Example
Findings from the 2013 Young Adult Fertility and Sexuality Study (YAFS 4)
…show that the levels of current drug use, drinking alcohol and smoking among
young people aged 15-24 have dropped considerably. The declining pattern is
found in the practices of both young men and women, as well as in younger and
older youth.

The percentage of young people who are “current smokers” declined from 20.9
percent in 2002 to 19.7 percent in 2013. Eleven years ago, 41 percent of young
Filipinos reported to be “current alcohol drinkers”. Now, 37 percent of young adults
are engaged in this behavior. But the most substantial decline is found in drug use.
Only 4 percent admitted to have ever used drugs in 2013, compared to almost 11
percent in 2002.

The National Capital Region has the highest level of youth smokers (27 percent)
while ARMM registered the lowest. Only 12 percent of young people in ARMM are
smokers.
Advantages
• gives emphasis to significant figures and comparisons
• simplest and most appropriate approach when there are only a few numbers to be
presented
Disadvantages
• when a large mass of quantitative data are included in a text or paragraph, the
presentation becomes almost incomprehensible
• written paragraphs can be tiresome to read especially if the same words are repeated so
many times
Tabular Presentation
• the systematic organization of data in rows and columns
Advantages
• more concise than textual presentation
• easy to understand
• facilitates comparisons & analysis of relationship among different categories
• presents data in greater detail than a graph
2. Box Head - the portion of the table that contains the column heads which describe the
data in each column, together with the needed classifying and qualifying spanner heads.
3. Stub - the portion of the table usually comprising the first column on the left, in which
the stubhead and row captions, together with the needed classifying and qualifying center
head and subheads are located. The stubhead describes the stub listing as a whole in
terms of the classification presented. The row caption is a descriptive title of the data on
the given line.
4. Field - main part of the table; contains the substance or the figures of one’s data
5. Source note - an exact citation of the source of data presented in the table (should
always be placed when the figures are not original)
Guidelines
• The title should be concise, written in telegraphic style, not in a complete sentence.
• Column labels should be precise. Stress differences rather than similarities between
adjacent columns. As much as possible, two or more adjacent columns should not begin
or end with the same phrase. This is frequently a signal that a spanner head is needed.
• The arrangement of lines in the stub depends on the nature of classification, purpose of
presentation or limitations of space.
• Categories should not overlap.
• The units of measure must be clearly stated.
• Show any relevant total, subtotals, percentages, etc.
• Indicate if the data were taken from another publication by including a source note.
• Tables should be self-explanatory, although they may be accompanied by a paragraph
that will provide an interpretation or direct attention to important figures.
Graphical Presentation
• a graph or chart is a device for showing numerical values or relationships in pictorial
form
Advantages
• main features and implications of a body of data can be grasped at a glance
• can attract attention and hold the reader’s interest
• simplifies concepts that would otherwise have been expressed in so many words
• can readily clarify data, frequently bring out hidden facts and relationships
2. Pie Chart - a circular graph that is useful in showing how a total quantity is distributed
among a group of categories. The “pieces of the pie” represent the proportions of the total
that fall into each category.
3. Bar Chart - consists of a series of rectangular bars where the length of the bar
represents the quantity or frequency for each category if the bars are arranged
horizontally. If the bars are arranged vertically, the height of the bar represents the
quantity.
4. Pictorial unit chart – a pictorial chart in which each symbol represents a definite and
uniform value
2.5 THE FREQUENCY DISTRIBUTION TABLE
Definition. The raw data is the set of data in its original form.
50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 94
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
Suppose we have data on number of children of 50 married women using any modern
contraceptive method.
0 0 1 2 2 2 3 3 4 4
0 0 1 2 2 3 3 3 4 4
0 1 1 2 2 3 3 3 4 4
0 1 1 2 2 3 3 3 4 5
0 1 1 2 2 3 3 3 4 5
Since there are only 6 unique values in the data set, we use
single-value grouping.
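Single-value grouping amounts to counting how many times each distinct value occurs. A minimal sketch using the 50 observations above:

```python
from collections import Counter

# Number of children of the 50 married women (rows as listed above).
children = [
    0, 0, 1, 2, 2, 2, 3, 3, 4, 4,
    0, 0, 1, 2, 2, 3, 3, 3, 4, 4,
    0, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    0, 1, 1, 2, 2, 3, 3, 3, 4, 5,
    0, 1, 1, 2, 2, 3, 3, 3, 4, 5,
]

# Each unique value forms its own class; its count is the class frequency.
frequency = Counter(children)

print("No. of children  Frequency")
for value in sorted(frequency):
    print(f"{value:>15}  {frequency[value]:>9}")
print(f"{'Total':>15}  {sum(frequency.values()):>9}")
```

The frequencies should add up to the number of observations, 50.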
Definition of terms
1. Class interval - the numbers defining the class
2. Class limits - the end numbers of the class
3. Open-end class - a class that has no lower limit or upper limit
4. Class frequency - the number of observations falling in the class
5. Class size - the difference between the upper class limits of the class and the preceding
class; can also be computed as the difference between the lower class limits of the next
class and the class
2. Determine the approximate class size. Whenever possible, all classes should be of the
same size. The following steps can be used to determine the class size.
• Solve for the range, R = max – min.
• Compute C’ = R ÷ K.
• Round-off C’ to the same number of decimal places as the original dataset, say
C, and use C as the class size.
3. Determine the lowest class limit. The first class must include the smallest value in the
data set and must agree with the number of decimal places in the dataset.
4. Determine all class limits by adding the class size, C, to the limit of the previous class.
5. Tally the frequencies for each class. Sum the frequencies and check against the total
number of observations.
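The steps above can be sketched in code using the 110 raw observations listed at the start of this section. The number of classes, K = 10, is an assumption made for this illustration, since the rule for choosing K is not shown here.

```python
# Raw data, entered row by row as listed at the start of this section.
data = [
    50, 57, 63, 69, 72, 74, 77, 80, 82, 84, 87,
    50, 59, 65, 69, 72, 75, 77, 80, 82, 84, 87,
    50, 59, 66, 69, 72, 75, 77, 80, 82, 85, 88,
    50, 60, 66, 69, 72, 75, 77, 81, 83, 85, 89,
    50, 60, 68, 70, 73, 75, 78, 81, 83, 86, 89,
    50, 60, 68, 71, 73, 75, 79, 81, 84, 86, 91,
    51, 62, 68, 71, 73, 76, 79, 81, 84, 87, 92,
    52, 62, 68, 71, 73, 76, 79, 82, 84, 87, 94,
    53, 62, 68, 71, 74, 76, 79, 82, 84, 87, 94,
    53, 62, 69, 72, 74, 76, 79, 82, 84, 87, 96,
]

K = 10                        # ASSUMPTION: number of classes
R = max(data) - min(data)     # range: 96 - 50 = 46
C = round(R / K)              # C' = 4.6, rounded to a whole number: C = 5

# The lowest class limit is taken to be the smallest observation.
lowest = min(data)

# Add the class size C repeatedly to get the limits of every class.
classes = [(lowest + j * C, lowest + (j + 1) * C - 1) for j in range(K)]

# Tally the observations falling in each class.
freq = [sum(lo <= x <= hi for x in data) for lo, hi in classes]

for (lo, hi), f in zip(classes, freq):
    print(f"{lo}-{hi}: {f}")

# Check the sum of the frequencies against the number of observations.
print("Total:", sum(freq), "of", len(data))
```

With these choices the classes run from 50-54 up to 95-99, and every observation falls in exactly one class.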
1. Histogram - a bar graph that displays the classes on the horizontal axis and the
frequencies (or relative frequencies or percentages) of the classes on the vertical axis;
the vertical sides of the bars are erected at the class boundaries, and the height of each
bar corresponds to the class frequency (or relative frequency or percentage)
CB           f
49.5-54.5    10
54.5-59.5     3
59.5-64.5     8
64.5-69.5    13
69.5-74.5    17
74.5-79.5    19
79.5-84.5    22
84.5-89.5    13
89.5-94.5     4
94.5-99.5     1
[Figure: histogram of the frequency distribution; horizontal axis: class boundaries from 49.5 to 99.5; vertical axis: frequency]

UCB     <CF
54.5     10
59.5     13
64.5     21
69.5     34
74.5     51
79.5     70
84.5     92
89.5    105
94.5    109
99.5    110
[Figure: graph of the less than cumulative frequencies (<CF) against the upper class boundaries; horizontal axis: 49.5 to 99.5]

LCB     >CF
49.5    110
54.5    100
59.5     97
64.5     89
69.5     76
74.5     59
79.5     40
84.5     18
89.5      5
94.5      1
[Figure: graph of the greater than cumulative frequencies (>CF) against the lower class boundaries; horizontal axis: 49.5 to 99.5]
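The cumulative frequencies follow directly from the class frequencies, which the sketch below illustrates. The names `less_than_cf` and `greater_than_cf` are illustrative; the frequencies are those of the CB table.

```python
from itertools import accumulate

# Class boundaries 49.5, 54.5, ..., 99.5 and the class frequencies.
boundaries = [49.5 + 5 * j for j in range(11)]
freq = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]

# <CF: number of observations below each upper class boundary (UCB).
less_than_cf = list(accumulate(freq))

# >CF: number of observations above each lower class boundary (LCB),
# obtained by accumulating the frequencies from the last class backwards.
greater_than_cf = list(accumulate(reversed(freq)))[::-1]

print("UCB    <CF")
for ucb, cf in zip(boundaries[1:], less_than_cf):
    print(f"{ucb:<6} {cf}")

print("LCB    >CF")
for lcb, cf in zip(boundaries[:-1], greater_than_cf):
    print(f"{lcb:<6} {cf}")
```

The last <CF value and the first >CF value both equal the total number of observations.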
The stem-and-leaf display is an alternative method for describing a set of data. It presents
a histogram-like picture of the data, while allowing the experimenter to retain the actual
observed values of each data point. Hence, the stem-and-leaf display is partly tabular and
partly graphical in nature.
In creating a stem-and-leaf display, we divide each observation into two parts, the stem
and the leaf. For example, we could divide the data value 234 as follows:
Stem Leaf
2 | 34
Alternatively, we could choose the point of division between the units and tens, whereby
Stem Leaf
23 | 4
The choice of the stem and leaf coding depends on the nature of the data set.
Example: Typing speeds (net words per minute) for 20 secretarial applicants
68 72 91 47
52 75 63 55
65 35 84 45
58 61 69 22
46 55 66 71
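For these typing speeds, a natural choice is to use the tens digit as the stem and the units digit as the leaf. A minimal sketch of building the display:

```python
from collections import defaultdict

# Typing speeds (net words per minute) of the 20 applicants.
speeds = [
    68, 72, 91, 47,
    52, 75, 63, 55,
    65, 35, 84, 45,
    58, 61, 69, 22,
    46, 55, 66, 71,
]

# Split each value into a stem (tens digit) and a leaf (units digit).
# Sorting first puts the leaves of each stem in increasing order.
stems = defaultdict(list)
for value in sorted(speeds):
    stems[value // 10].append(value % 10)

lines = [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
         for stem, leaves in sorted(stems.items())]

print("Stem | Leaf")
print("\n".join(lines))
```

Each row reads like a bar of a histogram turned on its side, while the individual observations can still be recovered from the stems and leaves.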
1. Americans are becoming increasingly concerned with the incidence of crime, and
voluminous data is being collected to document the magnitude of the problem.
The following table displays data on number of rapes per 100,000 residents for
the 50 states and the District of Columbia.
Florence Nightingale is known as the founder of the nursing profession. However, she
also saved many lives by using statistical analysis. When she encountered an unsanitary
condition or an undersupplied hospital, she improved the conditions and then used
statistical data to document the improvement. Thus, she was able to convince others of
the need for medical reform, particularly in the area of sanitation. She developed original
graphs to demonstrate that, during the Crimean War, more soldiers died from unsanitary
conditions than were killed in combat.
2. The paper “The Acid Rain Controversy: The Limits of Confidence” (Amer.
Statistician (1983): 385-394) presented data on average sulfur dioxide emission
rates for industrial and utility boilers in 47 states (data from Alaska, Hawaii, and
Idaho were not given).
.3 .9 1.5 2.5
.6 .6 1.4 2.7
.4 1.5 1.9 2.9
.5 1.5 1.0 2.1
.2 1.3 1.7 2.9
.7 1.2 1.8 3.8
.2 1.2 1.7 3.6
.7 1.0 1.8 3.4
.7 1.4 1.4 3.7
.5 1.0 2.3 4.2
.1 1.7 2.7 4.5
.6 1.5 2.2
3. The NSCB presented the following figures on the total deposits (in millions of
pesos) in the government banks in the different provinces in the Philippines in
2001.
Let the Greek letter Σ (sigma) indicate the “summation of”; thus, we can write the sum of
the n observations as $\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$
The numbers 1 and n are called the lower and the upper limits of summation,
respectively.
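Example:   i     1   2   3   4
           Xi    2   4   6   8
           Yi    1   2   1   2
Calculate:
1. $\sum_{i=2}^{3} X_i$
2. $\sum_{i=2}^{3} (X_i + Y_i)$
3. $\sum_{i=1}^{4} X_i Y_i$
4. $\sum_{i=1}^{4} X_i \sum_{i=1}^{4} Y_i$
5. $\sum_{i=1}^{4} \frac{X_i}{Y_i}$
6. $\frac{\sum_{i=1}^{4} X_i}{\sum_{i=1}^{4} Y_i}$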
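Summation notation maps naturally onto loops over the data. The Python sketch below evaluates a few typical sums for the X and Y values in the example: a partial sum, a sum of products, a product of sums, a sum of ratios, and a ratio of sums. The variable names are illustrative only.

```python
X = [2, 4, 6, 8]
Y = [1, 2, 1, 2]

# Python lists are 0-indexed, so i = 2..3 corresponds to the slice [1:3].
partial_sum_x = sum(X[1:3])                          # X2 + X3

sum_of_products = sum(x * y for x, y in zip(X, Y))   # sum of Xi*Yi, i = 1..4
product_of_sums = sum(X) * sum(Y)                    # (sum of Xi) * (sum of Yi)
sum_of_ratios = sum(x / y for x, y in zip(X, Y))     # sum of Xi/Yi
ratio_of_sums = sum(X) / sum(Y)                      # (sum of Xi) / (sum of Yi)

print(partial_sum_x, sum_of_products, product_of_sums)
print(sum_of_ratios, ratio_of_sums)
```

Note that the sum of products (32) differs from the product of sums (120), and the sum of ratios (14) differs from the ratio of sums (about 3.33); the order of operations matters.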
Some Results on Summation
1. The summation of the sum (or difference) of variables is the sum (or difference) of
their summations. That is, $\sum_{i=1}^{n} (X_i \pm Y_i) = \sum_{i=1}^{n} X_i \pm \sum_{i=1}^{n} Y_i$
2. If c is a constant, then $\sum_{i=1}^{n} cX_i = c \sum_{i=1}^{n} X_i$
3. If c is a constant, then $\sum_{i=1}^{n} c = nc$
The population mean for a population with N elements, denoted by the Greek letter μ
(mu), is computed as $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
The sample mean $\bar{X}$ (X bar) of n observations is computed as $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
The sample mean (a statistic) is an estimate of the unknown population mean (a
parameter).
Examples:
1. The number of employees at 5 different drug stores are 10, 12, 6, 8, and 4. Find the
mean number of employees for the 5 stores.
2. Scores in the Statistics 101 first exam for a sample of 10 students are as follows: 60,
55, 30, 90, 88, 79, 45, 66, 93, and 80. Find the mean.
3. Refer to the example on the final grades of 110 Statistics 101 students. The sample
mean is 74.10909091
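The first two examples can be checked with a few lines of Python. This is only a sketch of the definition (sum of the observations divided by their number); the helper name mean is our own.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)

employees = [10, 12, 6, 8, 4]
scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]

mean_employees = mean(employees)
mean_scores = mean(scores)

print(mean_employees)
print(round(mean_scores, 1))
```

For the drug stores the mean is 40/5 = 8 employees, and for the ten exam scores it is 686/10 = 68.6.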
Definition. The weighted mean is a modification of the usual mean that assigns weights
(or measures of relative importance) to the observations to be averaged. If each
observation Xi is assigned a weight Wi the weighted mean is given by
$\bar{X}_w = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$
Examples:
1. Suppose a teacher assigns the following weights to the various course requirements:
Assignment 15%
Project 25%
Midterm Exam 20%
Final Exam 40%
The maximum score a student may obtain for each component is 100. Jeffry obtains
marks of 83 for assignments, 72 for the project, 41 for the midterm exam, and 47 for the
final exam. Find his mean mark (or final grade) for the course.
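A minimal Python sketch of the weighted mean applied to Jeffry's marks follows. The weights are entered as percentages, so their sum (100) is the denominator; the function name weighted_mean is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Order: assignment, project, midterm exam, final exam.
marks = [83, 72, 41, 47]
weights = [15, 25, 20, 40]

final_grade = weighted_mean(marks, weights)
print(round(final_grade, 2))
```

The heavily weighted final exam pulls the grade down to 57.45, below the unweighted average of the four marks (60.75).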
from the mean. Hint: show that $\sum_{i=1}^{n} (X_i - c)^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 + n(\bar{X} - c)^2$
6. a. If a constant c is added (subtracted) to all observations, the mean of the new
observations will increase (decrease) by the same amount c.
b. If all observations are multiplied or divided by a constant, the new observations
will have a mean that is the same constant multiple of the original mean.
Example: Given 5 temperature readings measured in Fahrenheit: 98, 100, 107, 90, 92. If
the mean temperature is $\bar{X}_F$ = 97.4, find the mean temperature in centigrade if
$C = \frac{5}{9}(F - 32)$
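The two properties above imply that the mean can be converted directly, without converting each reading. The Python sketch below checks this for the temperature example, assuming the usual conversion C = (5/9)(F - 32).

```python
fahrenheit = [98, 100, 107, 90, 92]

def to_celsius(f):
    """Convert a Fahrenheit temperature to centigrade."""
    return 5 / 9 * (f - 32)

mean_f = sum(fahrenheit) / len(fahrenheit)

# Route 1: convert the mean directly.
mean_c_direct = to_celsius(mean_f)

# Route 2: convert every reading, then average.
celsius = [to_celsius(f) for f in fahrenheit]
mean_c_from_readings = sum(celsius) / len(celsius)

print(round(mean_f, 1), round(mean_c_direct, 2), round(mean_c_from_readings, 2))
```

Both routes give about 36.33 degrees centigrade, as expected for a linear transformation.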
Approximating the Mean from a Frequency Distribution
- possible only when the class mark can be assumed to be representative of all the values
in that class. If the assumption holds, the following equation may be used to approximate
the mean from a frequency distribution.
$\bar{X} = \frac{\sum_{i=1}^{k} f_i CM_i}{\sum_{i=1}^{k} f_i} = \frac{\sum_{i=1}^{k} f_i CM_i}{n}$
where fi = the frequency of the ith class
CMi = the class mark of the ith class
k = total number of classes
n = total number of observations
$\bar{x} = \frac{\sum_{i=1}^{k} f_i CM_i}{\sum_{i=1}^{k} f_i} = \frac{8145}{110} = 74.045455$
Remarks:
1. The formula for approximating the mean cannot be used if a frequency distribution has
open-ended intervals, unless there are reasonably accurate estimates of the class marks
for the open intervals.
2. The mean of a frequency distribution is simply a weighted mean of the class marks,
where the fi’s are the weights.
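The approximation can be reproduced in Python from the class marks and frequencies of the Statistics 101 grade distribution (classes 50-54 through 95-99). This is a sketch; the variable names are illustrative.

```python
# Class marks (midpoints) and frequencies of the ten classes.
class_marks = [52, 57, 62, 67, 72, 77, 82, 87, 92, 97]
freqs = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]

n = sum(freqs)
weighted_total = sum(f * cm for f, cm in zip(freqs, class_marks))
grouped_mean = weighted_total / n

print(n, weighted_total, round(grouped_mean, 6))
```

The result, 8145/110, is close to but not equal to the mean of the raw data (74.109), because each observation is replaced by its class mark.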
3.3 THE MEDIAN
- the positional middle of the arrayed data
- in an array, one-half of the values precede the median and one-half follow it
The first step in calculating the median, denoted as Md, is to arrange the data in an array.
Let X(i) be the ith observation in the array, i = 1, 2, . . . , n.
If n is odd, the median position equals (n+1)/2, and the value of the (n+1)/2 th
observation in the array is taken as the median, i.e.,
Md = $X_{\left(\frac{n+1}{2}\right)}$
If n is even, the mean of the two middle values in the array is the median, i.e.,
Md = $\frac{1}{2}\left[X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right]$
Examples:
1. Given the following heights (in inches): 71, 72, 75, 75, and 67. Find the median
height.
2. Given the following scores: 1, 7, 3, 3, 6, 5, 4, 3, find the median of the scores.
3. Refer to the example on the grades of 110 Statistics 101 students. Find the median.
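The odd and even cases can be sketched in Python as follows, using the heights and scores from the first two examples. The function name median is our own; note the shift from the 1-based positions in the formulas to 0-based list indices.

```python
def median(values):
    """Middle value of the sorted array (odd n) or mean of the two middle values (even n)."""
    arr = sorted(values)
    n = len(arr)
    mid = n // 2
    if n % 2 == 1:
        return arr[mid]                  # position (n + 1) / 2 in 1-based terms
    return (arr[mid - 1] + arr[mid]) / 2  # positions n / 2 and n / 2 + 1

heights = [71, 72, 75, 75, 67]
scores = [1, 7, 3, 3, 6, 5, 4, 3]

median_height = median(heights)
median_score = median(scores)
print(median_height, median_score)
```

The heights sort to 67, 71, 72, 75, 75, so the median is 72; the eight scores have middle values 3 and 4, so the median is 3.5.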
Characteristics of the Median:
1. The median is a positional measure.
2. The median is affected by the position of each item in the series but not by the value of
each item. This means that extreme values affect the median less than the arithmetic
mean.
$Md = LCB_{Md} + c\left(\frac{\frac{n}{2} - {<}CF_{Md-1}}{f_{Md}}\right)$
where LCBMd = the lower class boundary of the median class
c = class size of the median class
n = the total number of observations in the distribution
<CF Md - 1 = less than cumulative freq. of the class preceding the median class
fMd = frequency of the median class
Example:
Refer to the example on the final grades of 110 Statistics 101 students.
$Md = LCB_{Md} + c\left(\frac{\frac{n}{2} - {<}CF_{Md-1}}{f_{Md}}\right) = 74.5 + 5\left(\frac{\frac{110}{2} - 51}{19}\right) = 75.552632$
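The grouped-data median can be sketched in Python as below. The sketch assumes the median class is the first class whose less-than cumulative frequency reaches n/2, which is the usual convention but is not stated explicitly here; the function name grouped_median is hypothetical.

```python
def grouped_median(lower_boundaries, freqs, class_size):
    """Interpolate the median inside the median class of a frequency distribution."""
    n = sum(freqs)
    cum_before = 0  # <CF of the class preceding the current one
    for lcb, f in zip(lower_boundaries, freqs):
        # Assumption: the median class is the first class with <CF >= n / 2.
        if cum_before + f >= n / 2:
            return lcb + class_size * ((n / 2 - cum_before) / f)
        cum_before += f
    raise ValueError("empty frequency distribution")

lower_boundaries = [49.5, 54.5, 59.5, 64.5, 69.5, 74.5, 79.5, 84.5, 89.5, 94.5]
freqs = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]

md = grouped_median(lower_boundaries, freqs, class_size=5)
print(round(md, 6))
```

For the grade distribution the median class is 74.5-79.5, with 51 observations below it and a class frequency of 19.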
The mode is determined by counting the frequency of each value and finding the value
with the highest frequency of occurrence.
Examples:
1. 2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2
2. 2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1
3. 1, 2, 3, 3, 2, 1, 2, 3, 1, 4, 4, 5, 5, 1, 2, 3, 4, 5, 4, 5
4. Refer to the example on the final grades of 110 Statistics 101 students. Find the mode.
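Counting frequencies is straightforward with Python's collections.Counter. The sketch below returns every value that attains the highest frequency, so it also reports ties; the function name modes is illustrative.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data1 = [2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2]
data2 = [2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1]
data3 = [1, 2, 3, 3, 2, 1, 2, 3, 1, 4, 4, 5, 5, 1, 2, 3, 4, 5, 4, 5]

print(modes(data1))
print(modes(data2))
print(modes(data3))
```

The first data set has a single mode (2), the second has two modes (2 and 5, eight occurrences each), and in the third every value occurs four times, so no value stands out as the mode.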
$Mo = LCB_{Mo} + c\left(\frac{f_{Mo} - f_{Mo-1}}{2f_{Mo} - f_{Mo-1} - f_{Mo+1}}\right)$
where LCBMo = lower class boundary of the modal class
c = class size of the modal class
fMo = frequency of the modal class
fMo-1 = frequency of the class preceding the modal class
fMo+1 = frequency of the class following the modal class
Example:
Class Freq
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1
$Mo = LCB_{Mo} + c\left(\frac{f_{Mo} - f_{Mo-1}}{2f_{Mo} - f_{Mo-1} - f_{Mo+1}}\right) = 79.5 + 5\left(\frac{22 - 19}{2(22) - 19 - 13}\right) = 80.75$
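A Python sketch of the grouped-data mode formula is given below. Treating a missing neighbouring class as having frequency 0 (when the modal class is the first or last class) is our own assumption, and the function name grouped_mode is hypothetical.

```python
def grouped_mode(lower_boundaries, freqs, class_size):
    """Approximate the mode from the modal class and its two neighbouring classes."""
    m = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal class
    f_mo = freqs[m]
    # Assumption: a neighbour that does not exist is treated as frequency 0.
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0
    return lower_boundaries[m] + class_size * (f_mo - f_prev) / (2 * f_mo - f_prev - f_next)

lower_boundaries = [49.5, 54.5, 59.5, 64.5, 69.5, 74.5, 79.5, 84.5, 89.5, 94.5]
freqs = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]

mo = grouped_mode(lower_boundaries, freqs, class_size=5)
print(mo)
```

The modal class is 80-84 (frequency 22), and the formula shifts the mode toward the neighbour with the larger frequency.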
3.5 MEASURES OF LOCATION
Definition. Measures of location or fractiles or quantiles are values below which a
specified fraction or percentage of the observations in a given set must fall.
Definition. Quartiles are values that divide the array into 4 equal parts. Thus,
Q1, read as first quartile, is the value below which 25% of the values fall.
Q2, read as second quartile, is the value below which 50% of the values fall.
Q3, read as third quartile, is the value below which 75% of the values fall.
Lowest 25% Second lowest 25% Third lowest 25% Highest 25%
Q1 Q2 Q3
Definition. Deciles are values that divide the array into 10 equal parts. Thus,
D1, read as first decile, is the value below which 10% of the values fall.
D2, read as second decile, is the value below which 20% of the values fall.
•
•
•
D9, read as ninth decile, is the value below which 90% of the values fall.
10% 10% 10% 10% 10% 10% 10% 10% 10% 10%
D1 D2 D3 D4 D5 D6 D7 D8 D9
Definition. Percentiles are values that divide the array into 100 equal parts. Thus,
P1, read as first percentile, is the value below which 1% of the values fall.
P2, read as second percentile, is the value below which 2% of the values fall.
•
•
•
P99, read as ninety-ninth percentile, is the value below which 99% of the values fall.
1% 1% 1% 1% 1% 1% 1% 1% 1% 1%
P1 P2 P3 P4 P5 P6 P7 P8 P9
Examples: Determine P69, D3, Q1 and Q3 from the data on stat 101 final grades
50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 94
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
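The rule for locating a fractile in the array is not shown in this part of the notes, so the Python sketch below assumes one common convention: compute np/100; if it is an integer j, average the j-th and (j+1)-th ordered values, otherwise take the next higher ordered value. Other conventions give slightly different answers. The function name percentile is our own.

```python
rows = [
    [50, 57, 63, 69, 72, 74, 77, 80, 82, 84, 87],
    [50, 59, 65, 69, 72, 75, 77, 80, 82, 84, 87],
    [50, 59, 66, 69, 72, 75, 77, 80, 82, 85, 88],
    [50, 60, 66, 69, 72, 75, 77, 81, 83, 85, 89],
    [50, 60, 68, 70, 73, 75, 78, 81, 83, 86, 89],
    [50, 60, 68, 71, 73, 75, 79, 81, 84, 86, 91],
    [51, 62, 68, 71, 73, 76, 79, 81, 84, 87, 92],
    [52, 62, 68, 71, 73, 76, 79, 82, 84, 87, 94],
    [53, 62, 68, 71, 74, 76, 79, 82, 84, 87, 94],
    [53, 62, 69, 72, 74, 76, 79, 82, 84, 87, 96],
]
grades = sorted(v for row in rows for v in row)

def percentile(sorted_values, p):
    """p-th percentile (0 < p < 100, p an integer) under the assumed convention."""
    n = len(sorted_values)
    j, remainder = divmod(n * p, 100)      # integer arithmetic avoids float rounding
    if remainder == 0:
        # n*p/100 is an integer j: average the j-th and (j+1)-th values (1-based).
        return (sorted_values[j - 1] + sorted_values[j]) / 2
    # Otherwise take the next higher position, j + 1 (1-based), i.e. index j.
    return sorted_values[j]

p69 = percentile(grades, 69)
d3 = percentile(grades, 30)   # third decile = 30th percentile
q1 = percentile(grades, 25)   # first quartile = 25th percentile
q3 = percentile(grades, 75)   # third quartile = 75th percentile
print(p69, d3, q1, q3)
```

Deciles and quartiles are handled as special percentiles, which keeps a single rule for all fractiles.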
CHAPTER 4
Measures of Dispersion and Measures of Skewness
Definition. Measures of dispersion indicate the extent to which individual items in a
series are scattered about an average.
• The Range
Definition. The range of a set of measurements is the difference between the largest and
the smallest values.
Examples:
1. The IQ’s of 5 members of a certain family are 108, 112, 127, 116, and 113. Find the
range.
2. Refer to the example on the final grade of 110 Statistics 101 students. The range is 96
– 50 = 46.
Approximating the range from the frequency distribution table, we get 99 – 50 = 49.
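Definition. For a population of N elements, the population variance is $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
and the population standard deviation is $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
Definition. For a sample of size n, the sample variance is $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
and the sample standard deviation is $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$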
Remarks:
1. The standard deviation is the most frequently used measure of dispersion and is
interpreted as the average distance of the data values from their mean.
2. The variance is not a measure of absolute dispersion. It is not expressed in the same
units as the original observations.
Examples:
1. The following scores were given by 6 judges for a gymnast’s performance in the vault:
7, 5, 9, 7, 8, and 6. Find the standard deviation.
3. Refer to the example on the final grades of 110 Statistics 101 students. The sample
standard deviation is given by s ≈ 11.2514
A recent survey showed that customers of the US Postal Service were interested in more
consistency in the time it takes to make a delivery. Under the old conditions, a local
letter might take only one day to deliver, or it might take several. “Just tell me how many
days ahead I need to mail the birthday card to Mom so it gets there on her birthday, not
early, not late,” was a common complaint. The level of consistency is measured by the
standard deviation of the delivery times. A smaller standard deviation indicates more
consistency.
Computational formula: $S^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n-1)}$
Example: the final grade of 110 Statistics 101 students, $\sum_{i=1}^{110} X_i^2$ = 617936, $\sum_{i=1}^{110} X_i$ = 8152
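The computational formula needs only n, the sum of the observations, and the sum of their squares. The Python sketch below first checks it against the definitional formula on the six judges' scores, then applies it to the two totals quoted for the 110 final grades. The function names are illustrative.

```python
import math

def variance_definitional(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def variance_computational(n, sum_x, sum_x_sq):
    """Sample variance from n, the sum of X, and the sum of X squared."""
    return (n * sum_x_sq - sum_x ** 2) / (n * (n - 1))

# Check on the judges' scores: both routes should agree.
judges = [7, 5, 9, 7, 8, 6]
var_def = variance_definitional(judges)
var_comp = variance_computational(len(judges), sum(judges), sum(x * x for x in judges))

# Final grades of the 110 students, using the totals quoted in the example.
var_grades = variance_computational(110, 8152, 617936)
sd_grades = math.sqrt(var_grades)

print(var_def, var_comp, round(math.sqrt(var_def), 4))
print(round(var_grades, 4), round(sd_grades, 4))
```

For the judges both formulas give a variance of 2, so the standard deviation is about 1.41; for the grades the standard deviation comes out at roughly 11.25.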
$S^2 = \frac{\sum_{i=1}^{k} f_i (CM_i - \bar{X})^2}{n - 1} = \frac{n\sum_{i=1}^{k} f_i CM_i^2 - \left(\sum_{i=1}^{k} f_i CM_i\right)^2}{n(n-1)}$
Example:
Class Freq CMi fiCMi fiCMi^2
50 – 54 10 52 520 27040
55 – 59 3 57 171 9747
60 – 64 8 62 496 30752
65 – 69 13 67 871 58357
70 – 74 17 72 1224 88128
75 – 79 19 77 1463 112651
80 – 84 22 82 1804 147928
85 – 89 13 87 1131 98397
90 – 94 4 92 368 33856
95 – 99 1 97 97 9409
Total 110 8145 616265
$s = \sqrt{\frac{n\sum_{i=1}^{k} f_i CM_i^2 - \left(\sum_{i=1}^{k} f_i CM_i\right)^2}{n(n-1)}} = \sqrt{\frac{110(616265) - (8145)^2}{110(109)}} = 10.989892239821$
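The column totals and the resulting standard deviation can be reproduced with a short Python sketch that works from the class marks and frequencies. Variable names are illustrative.

```python
import math

class_marks = [52, 57, 62, 67, 72, 77, 82, 87, 92, 97]
freqs = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]

n = sum(freqs)
sum_f_cm = sum(f * cm for f, cm in zip(freqs, class_marks))          # total of f*CM
sum_f_cm_sq = sum(f * cm ** 2 for f, cm in zip(freqs, class_marks))  # total of f*CM^2

grouped_variance = (n * sum_f_cm_sq - sum_f_cm ** 2) / (n * (n - 1))
grouped_sd = math.sqrt(grouped_variance)

print(n, sum_f_cm, sum_f_cm_sq)
print(round(grouped_sd, 6))
```

The totals match the last row of the table (110, 8145, 616265), and the grouped standard deviation is about 10.99.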
Measures of relative dispersion are unitless and are used when one wishes to compare
the scatter of one distribution with another distribution.
Mean s.d.
1989-1991 22.4 1.84
1992-1994 26.4 1.15
2. Two of the quality criteria in processing butter cookies are the weight and color
development in the final stages of oven browning. Individual pieces of cookies are
scanned by a spectrophotometer calibrated to reflect yellow-brown light. The readout is
expressed in per cent of a standard yellow-brown reference plate and a value of 41 is
considered optimal (golden-yellow). The cookies were also weighed in grams at this
stage. The means and standard deviations of 30 sample cookies are presented below.
Which of the two quality criteria is more varied?
Mean s.d.
Color 41.1 10
Weight 17.7 3.2
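The formula for the measure of relative dispersion is not shown in this part of the notes. The Python sketch below assumes the coefficient of variation, CV = (standard deviation / mean) x 100%, which is the usual unitless choice, and applies it to the two cookie quality criteria.

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a percentage of the mean (assumed measure)."""
    return sd / mean * 100

cv_color = coefficient_of_variation(mean=41.1, sd=10)
cv_weight = coefficient_of_variation(mean=17.7, sd=3.2)

print(round(cv_color, 2), round(cv_weight, 2))
more_varied = "color" if cv_color > cv_weight else "weight"
print(more_varied)
```

Under this assumption color (about 24.3%) is relatively more varied than weight (about 18.1%), even though the two are measured in different units.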
Definition. The standard score measures how many standard deviations an observation
is above or below the mean. It is computed as $Z = \frac{X - \mu}{\sigma}$ and the sample counterpart is
$Z = \frac{X - \bar{X}}{S}$
Remarks:
1. The standard score is not a measure of relative dispersion per se but is somewhat
related.
2. It is useful for comparing two values from different series, especially when the two
series differ with respect to the mean or standard deviation or both, or are expressed in
different units.
Examples:
1. Robert got a grade of 75% in Stat 101 and a grade of 90% in Econ 11. The mean grade
in Stat 101 is 70% and the standard deviation is 10%, whereas in Econ 11, the mean
grade is 80% and the standard deviation is 20%. Relative to the other students, where did
he perform better?
2. In problem (1), if the mean grade in Stat 101 is 65%, in which subject did Robert
perform better?
3. Different typing skills are required for secretaries depending on whether one is
working in a law office, an accounting firm, or for a mathematical research group at a
major university. In order to evaluate candidates for these positions, an agency
administers 3 distinct standardized typing samples. A time penalty has been incorporated
into the scoring of each sample based on the number of typing errors. The mean and
standard deviation for each test, together with the scores achieved by Nancy, an
applicant, are given in the following table. Where do you think should Nancy be placed?
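The first two examples reduce to computing and comparing standard scores. A minimal Python sketch follows; the function name z_score is our own.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Example 1: Robert's grades relative to each class.
z_stat = z_score(75, mean=70, sd=10)
z_econ = z_score(90, mean=80, sd=20)

# Example 2: same grades, but the Stat 101 mean is 65.
z_stat_alt = z_score(75, mean=65, sd=10)

print(z_stat, z_econ, z_stat_alt)
```

In Example 1 both standard scores equal 0.5, so Robert performed equally well relative to his classmates in the two subjects; in Example 2 the Stat 101 score rises to 1.0, so he performed better in Stat 101.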
Stockbrokers have a problem when they are considering two investments where the mean
rate of return is the same. They usually calculate the standard deviation of the rates of
return to assess the risk associated with the two investments. The investment with the
larger standard deviation is considered to have the greater risk.
In the 1941 Major League Baseball season, Ted Williams batted .406, and nobody has hit over
.400 since. The highest batting average in recent times was Tony Gwynn's .394 in
1994. It is interesting to note that the mean batting average for all players has stayed at about .260
for 100 years. The standard deviation of that average, however, has declined from .049
to .031. This indicates that there is less dispersion in the batting averages today and helps
explain why there have not been any .400 hitters in recent times.
4.3 MEASURES OF SKEWNESS
Definition. A measure of skewness shows the degree of asymmetry, or departure from
symmetry of a distribution. It indicates not only the amount of skewness but also the
direction.
1. $Sk = \frac{\bar{X} - Mo}{S}$
2. $Sk = \frac{3(\bar{X} - Md)}{S}$
Remarks:
1. Since the mode is frequently only an approximation, formula 2 is preferred.
2. Interpretation of the measure of skewness:
Sk > 0: positively skewed since $\bar{X}$ > Md > Mo
Sk < 0: negatively skewed since $\bar{X}$ < Md < Mo
Sk = 0: symmetric since $\bar{X}$ = Md = Mo
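Both skewness measures can be evaluated with a few lines of Python. As an illustration, the sketch below plugs in summary values reported earlier for the 110 final grades: the sample mean, the grouped median and mode, and a standard deviation rounded to 11.25. Mixing raw-data and grouped-data summaries this way is our own simplification.

```python
def skewness_mode(mean, mode, sd):
    """Formula 1: (mean - mode) / standard deviation."""
    return (mean - mode) / sd

def skewness_median(mean, median, sd):
    """Formula 2: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / sd

mean, median, mode, sd = 74.109, 75.5526, 80.75, 11.25  # rounded summary values

sk1 = skewness_mode(mean, mode, sd)
sk2 = skewness_median(mean, median, sd)
print(round(sk1, 3), round(sk2, 3))
```

Both measures are negative, which points to a negatively skewed distribution: the mean lies below the median, which lies below the mode.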
Remarks:
1. The height of the rectangle is usually arbitrary and has no specific meaning. If several
boxplots appear together, however, the height is sometimes made proportional to the
different sample sizes.
2. If the outlying observation is less than Q1 - 3 IQR or greater than Q3 + 3 IQR, it is
identified with a circle at its actual location. Such an observation is called a far outlier.
Examples:
Set A: 1 15 21 22 24
10 18 22 23 25
14 20 22 24 28
Q1 = 15 IQR = 9
Q3 = 24 FL = 1.5
Md = 22 FU = 37.5
[Figure: boxplot on a horizontal axis from 0 to 30]
Set B: 3 10 11 12 19
8 10 11 16 19
9 10 12 16 30
Q1 = 10 IQR = 6
Q3 = 16 FL = 1
Md = 12 FU = 25
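The fence values above are consistent with FL = Q1 - 1.5 IQR and FU = Q3 + 1.5 IQR. Assuming that rule, together with the 3 IQR rule for far outliers, the Python sketch below recomputes the fences from the stated quartiles and flags the outlying observations. The function names are hypothetical, and the quartiles are taken as given rather than recomputed.

```python
def fences(q1, q3, k):
    """Lower and upper fences at k interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify_outliers(values, q1, q3):
    """Split outlying observations into ordinary outliers and far outliers."""
    inner_lo, inner_hi = fences(q1, q3, 1.5)  # assumed 1.5 IQR rule for the fences
    outer_lo, outer_hi = fences(q1, q3, 3.0)  # 3 IQR rule for far outliers
    far = [v for v in values if v < outer_lo or v > outer_hi]
    mild = [v for v in values
            if (v < inner_lo or v > inner_hi) and v not in far]
    return mild, far

set_a = [1, 15, 21, 22, 24, 10, 18, 22, 23, 25, 14, 20, 22, 24, 28]
set_b = [3, 10, 11, 12, 19, 8, 10, 11, 16, 19, 9, 10, 12, 16, 30]

fences_a = fences(15, 24, 1.5)
fences_b = fences(10, 16, 1.5)
outliers_a = classify_outliers(set_a, q1=15, q3=24)
outliers_b = classify_outliers(set_b, q1=10, q3=16)

print(fences_a, outliers_a)
print(fences_b, outliers_b)
```

In Set A the value 1 falls below the lower fence of 1.5, and in Set B the value 30 exceeds the upper fence of 25; neither is far enough out to count as a far outlier.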
CHAPTER 5
Probability
Definition of Terms
1. Random experiment - any process of generating a set of data or observations that
can be repeated under basically the same conditions and that leads to well-defined
outcomes
2. Sample space - set of all possible outcomes of an experiment, usually denoted by S
3. Sample point - an element of the sample space; an outcome
4. Event - any subset of the sample space, usually denoted by capital letters
5. Null space/Empty space - a subset of the sample space that contains no elements,
denoted by the symbol ∅
6. Simple event - an event which contains only one element of the sample space
7. Compound event - an event that can be expressed as the union of simple events,
thus containing more than one sample point
8. Mutually exclusive events - two events A and B are mutually exclusive if A ∩ B =
∅; that is, A and B have no elements in common
Remarks:
An event is said to have occurred if the outcome of the experiment is one of the
sample points in the event.
The empty space can be viewed as an event that will never happen. It is called the
impossible event.
The sample space S, as an event, always occurs, and is referred to as the certain or
sure event.
The French naturalist Count Buffon (1707-1788) tossed a coin 4040 times.
Result: 2048 heads, or proportion 2048/4040 = .5069 for heads.
Around 1900, the English statistician Karl Pearson heroically tossed a coin 24,000
times. Result: 12,012 heads, a proportion of .5005.
While imprisoned by the Germans during World War II, the South African
statistician John Kerrich tossed a coin 10,000 times. Result: 5067 heads,
proportion of heads .5067.
The late astronomer Carl Sagan believed that the probability of a major asteroid
hitting the Earth soon is high enough to be of concern. “The probability that the
Earth will be hit by a civilization-threatening small world in the next century is a
little less than one in a thousand.” To arrive at that probability, Sagan obviously
could not use the long-run frequency definition of probability. He would have to
use his own knowledge of astronomy, combined with past asteroid behavior.
Examples:
Example: How many sample points are there in the sample space when a pair of
balanced dice is thrown once?
Without considering strategy in a game of chess, there are 400 ways of playing the first
round of moves.
Examples:
1. How many even three-digit numbers can be formed from the digits 1, 2, 5, 6, and
9 if each digit can be used only once?
2. How many ways can a 10-question true-false examination be answered?
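Small counting problems like these can be checked by brute-force enumeration. The Python sketch below lists the outcomes for a pair of dice, the even three-digit numbers formed from 1, 2, 5, 6, and 9 without repeating a digit, and the number of ways to answer a 10-question true-false examination. Enumeration is our own cross-check, not the method used in the notes.

```python
from itertools import permutations, product

# Sample space for a pair of dice: ordered pairs (first die, second die).
dice_outcomes = list(product(range(1, 7), repeat=2))

# Three-digit numbers from distinct digits; keep those ending in an even digit.
digits = [1, 2, 5, 6, 9]
even_numbers = [100 * a + 10 * b + c
                for a, b, c in permutations(digits, 3)
                if c % 2 == 0]

# Each of the 10 questions can be answered in 2 ways.
answer_sheets = list(product("TF", repeat=10))

print(len(dice_outcomes), len(even_numbers), len(answer_sheets))
```

The counts agree with the multiplication rule: 6 x 6 = 36 outcomes, 2 x 4 x 3 = 24 even numbers (choose the last digit first), and 2^10 = 1024 answer sheets.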
P[(A ∪ B)^c] = P(A^c ∩ B^c)
Examples:
1. The probability that a student passes Statistics is 2/3, and the probability that he
passes English is 4/9. If the probability of passing at least one of the two courses
is 4/5, what is the probability that he will pass both courses? fail both courses?
2. What is the probability of getting a total of 7 or 11 when a pair of dice is tossed?
3. In the toss of a fair coin 4 times, what is the probability of no head in the toss? At
least one head?
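These three examples can be worked with exact fractions in Python. The sketch assumes the additive rule P(A ∪ B) = P(A) + P(B) - P(A ∩ B) and the complement rule P(A') = 1 - P(A), and it enumerates equally likely outcomes for the dice and the coin tosses. Variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Example 1: passing Statistics (2/3), English (4/9), at least one (4/5).
p_stat, p_eng, p_either = Fraction(2, 3), Fraction(4, 9), Fraction(4, 5)
p_both = p_stat + p_eng - p_either   # additive rule solved for the intersection
p_fail_both = 1 - p_either           # complement of "at least one"

# Example 2: total of 7 or 11 with a pair of dice.
rolls = list(product(range(1, 7), repeat=2))
favourable = sum(1 for a, b in rolls if a + b in (7, 11))
p_7_or_11 = Fraction(favourable, len(rolls))

# Example 3: four tosses of a fair coin.
tosses = list(product("HT", repeat=4))
p_no_head = Fraction(sum(1 for t in tosses if "H" not in t), len(tosses))
p_at_least_one_head = 1 - p_no_head

print(p_both, p_fail_both, p_7_or_11, p_no_head, p_at_least_one_head)
```

Using Fraction keeps the answers exact: 14/45 and 1/5 for the student, 2/9 for the dice, and 1/16 and 15/16 for the coin.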
Exercises: pp. 95-97 of Walpole nos. 1-20
1. Find the errors in each of the following statements:
a. The probability that it will rain tomorrow is 0.40 and the probability that it
will not rain tomorrow is 0.52.
b. The probabilities that a printer will make 0, 1, 2, 3, or 4 or more mistakes
in printing a document are, respectively, 0.19, 0.34, -0.25, 0.43, and 0.29.
c. The probabilities that an automobile salesperson will sell 0, 1, 2, or 3 cars
on any given day in February are, respectively, 0.19, 0.38, 0.29, and 0.15.
d. On a single draw from a deck of playing cards the probability of selecting
a heart is 1/4, the probability of selecting a black card is 1/2, and the
probability of selecting both a heart and a black card is 1/8.
2. An experiment involves tossing a pair of dice. Find the probability of event
a. A = sum is greater than 8
b. C = a number greater than 4 comes up on one die.
c. A ∩ C
3. Three men are seeking public office. Candidates A and B are given about the
same chance of winning, but candidate C is given twice the chance of either A or
B. What is the probability that C wins? A does not win?
4. A box contains 500 envelopes of which 75 contain $100 in cash, 150 contain $25,
and 275 contain $10. An envelope may be purchased for $25. Find the
probability that the first envelope purchased contains less than $100.
5. A 5-sided die with sides numbered 1, 2, 3, 4, and 5 is constructed so that the 1 and
5 occur twice as often as the 2 and 4, which occur three times as often as the 3.
What is the probability that a perfect square occurs when this die is tossed once?
6. If A and B are mutually exclusive events and P(A) = .3 and P(B) = .5, find
a. P(A ∪ B)
b. P(A’)
c. P(A’ ∩ B)
7. If A, B, and C are mutually exclusive events and P(A) = .2, P(B) = .3 and P(C) =
.2, find
a. P(A ∪ B ∪ C)
b. P[A’ ∩ (B ∪ C)]
c. P(B C’)’
8. If a letter is chosen at random from the English alphabet, find the probability that
the letter
(a) is a vowel
(b) precedes the letter j
(c) follows the letter g.
9. If a permutation (rearrangement of the letters) of the word “white” is selected at
random, find the probability that the permutation
(a) begins with a consonant
(b) ends with a vowel
(c) has the consonants and vowels alternating.
10. If each coded item in a catalog begins with 3 distinct letters followed by 4 distinct
nonzero digits, find the probability of randomly selecting one of these coded
items with the first letter a vowel and the last digit even.
11. A pair of dice is thrown. Find the probability of getting (a) a total of 8; and (b) at
most a total of 5.
12. Two cards are drawn in succession from a deck without replacement. What is the
probability that both cards are greater than 2 and less than 8?
13. If 3 books are picked at random from a shelf containing 5 novels, 3 books of
poems, and a dictionary, what is the probability that (a) the dictionary is selected;
and (b) 2 novels and 1 book of poems are selected?
14. In a poker hand consisting of 5 cards, find the probability of holding (a) 3 aces;
and (b) 4 hearts and 1 club
15. In a game of Yahtzee, where 5 dice are tossed simultaneously, find the probability
of getting four of a kind.
16. In a college graduating class of 100 students, 54 studied mathematics, 69 studied
history, and 35 studied both mathematics and history. If one of these students is
selected at random, find the probability that the student
(a) takes mathematics or history
(b) does not take either of these subjects
(c) takes history but not mathematics.
17. Suppose that in a senior college class of 500 students it is found that 210 smoke,
258 drink alcoholic beverages, 216 eat between meals, 122 smoke and drink
alcoholic beverages, 83 eat between meals and drink alcoholic beverages, 97
smoke and eat between meals, and 52 engage in all three of these bad health
practices. If a member of this senior class is selected at random, find the
probability that the student
(a) smokes but does not drink alcoholic beverages
(b) eats between meals and drinks alcoholic beverages but does not smoke
(c) neither smokes nor eats between meals.
18. The probability that an American industry will locate in Munich is .7, the
probability that it will locate in Brussels is .4, and the probability that it will
locate in either Munich or Brussels or both is .8. What is the probability that the
industry will locate in
(a) both cities
(b) neither city?
19. From past experiences a stockbroker believes that under present economic
conditions a customer will invest in tax-free bonds with a probability of .6, will
invest in mutual funds with a probability of .3, and will invest in both tax-free
bonds and mutual funds with a probability of .15. At this time, find the
probability that a customer will invest in
(a) either tax-free bonds or mutual funds
(b) neither tax-free bonds nor mutual funds.
20. In a certain federal prison it is known that 2/3 of the inmates are under 25 years of
age. It is also known that 3/5 of the inmates are male and that 5/8 of the inmates
are female or 25 years of age or older. What is the probability that a prisoner
selected at random from this prison is female and at least 25 years old?
Defn The probability of an event B occurring when it is known that some event A has
occurred is called a conditional probability. It is defined as
$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$, if P(A) > 0
Examples:
1. A random sample of 200 adults is classified below according to sex and the level
of education attained. If a person is picked at random from this group, find the
probability that the person
a. is a male, given that the person has a secondary education.
b. does not have a college degree, given that the person is a female.
Male Female
Elementary 38 45
Secondary 28 50
College 22 17
2. The probability that a regularly scheduled flight departs on time is .83, the
probability that it arrives on time is .92, and the probability that it departs and
arrives on time is .78. Find the probability that a plane (a) arrives on time given
that it departed on time, and (b) departed on time given that it has arrived on time.
3. Suppose there has been a crime and it is known that the criminal is a person
within a population of 6,000,000. Further, suppose it is known that in this
population only about one person in a million has a DNA type that matches the
DNA found at the crime scene, so let’s assume that there are six people in the
population with this DNA type. Someone in custody has this DNA type. We
know the person’s DNA matches, but what is the probability that he is actually
innocent?
Define A = DNA of randomly chosen person matches DNA at the crime scene
B = person selected is innocent of the crime
A ∩ B = event that the selected person is innocent and the DNA matches
So that $P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{5/6{,}000{,}000}{6/6{,}000{,}000} = \frac{5}{6}$
And $P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{5/6{,}000{,}000}{5{,}999{,}999/6{,}000{,}000} = \frac{5}{5{,}999{,}999}$.
If you were the jury, it would be important to realize that without additional
evidence, the probability that this person is innocent is 5/6, even though the DNA
matches. The prosecutor surely would emphasize the other conditional
probability.
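The three examples all apply the same ratio, P(A ∩ B) divided by the probability of the conditioning event. The Python sketch below works them with exact fractions where counts are available; the helper cond_prob and the variable names are our own.

```python
from fractions import Fraction

# Example 1: counts by (education level, sex) for the 200 adults.
counts = {
    ("Elementary", "Male"): 38, ("Elementary", "Female"): 45,
    ("Secondary", "Male"): 28, ("Secondary", "Female"): 50,
    ("College", "Male"): 22, ("College", "Female"): 17,
}

def cond_prob(event, given):
    """P(event | given) from the table: joint count over the count of the given event."""
    given_total = sum(c for key, c in counts.items() if given(key))
    joint_total = sum(c for key, c in counts.items() if given(key) and event(key))
    return Fraction(joint_total, given_total)

p_male_given_secondary = cond_prob(lambda k: k[1] == "Male",
                                   lambda k: k[0] == "Secondary")
p_no_college_given_female = cond_prob(lambda k: k[0] != "College",
                                      lambda k: k[1] == "Female")

# Example 2: flight departs on time (.83), arrives on time (.92), both (.78).
p_arrive_given_depart = 0.78 / 0.83
p_depart_given_arrive = 0.78 / 0.92

# Example 3: DNA match among 6,000,000 people, 6 of whom match, 5 of them innocent.
population = 6_000_000
p_innocent_given_match = Fraction(5, population) / Fraction(6, population)
p_match_given_innocent = Fraction(5, population) / Fraction(population - 1, population)

print(p_male_given_secondary, p_no_college_given_female)
print(round(p_arrive_given_depart, 4), round(p_depart_given_arrive, 4))
print(p_innocent_given_match, p_match_given_innocent)
```

The DNA example shows how different the two conditional probabilities can be: 5/6 for innocence given a match, against 5/5,999,999 for a match given innocence.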
Defn Two events A and B are said to be independent if any one of the following
conditions is satisfied:
(a) P(A|B) = P(A) if P(B)>0
(b) P(B|A) = P(B) if P(A)>0
(c) P(A ∩ B) = P(A) P(B)
Otherwise, the events are said to be dependent.
Examples:
1. Consider an experiment in which 2 cards are drawn in succession from an
ordinary deck, with replacement. Define
A: the first card is an ace
B: the second card is a spade
Are A and B independent events?
[Figure: tree diagram; the first draw branches to Ace and Ace^C, and each of these branches to Spade and Spade^C on the second draw]
2. Consider the following events in the toss of a single die where even numbers are
twice as likely to occur as the odd numbers:
A: Get a number greater than 3
B: Get a perfect square
Are A and B independent events?
3. Suppose that we have a fuse box containing 20 fuses, of which 5 are defective. If
2 fuses are selected at random and removed from the box in succession without
replacing the first, what is the probability that both are defective?
4. A small town has one fire engine and one ambulance available for emergencies.
The probability that the fire engine is available when needed is .98, and the
probability that the ambulance is available when called is .92. In the event of an
injury resulting from a burning building, find the probability that both the
ambulance and the fire engine will be available.
5. Three cards are drawn in succession, without replacement, from an ordinary deck
of playing cards. Find the probability that the first card is a red ace, the second
card is a ten or jack, and the third card is greater than 3 but less than 7.
6. A coin is biased so that a head is twice as likely to occur as a tail. If the coin is
tossed 3 times, what is the probability of getting 2 tails and 1 head?
7. Assuming birth months (days) are equally likely, what is the probability that the
next two unrelated strangers you meet both share your birth month (day)?
8. Sudden infant death syndrome (SIDS) causes babies to die suddenly (often in
their cribs) with no explanation. Deaths from SIDS have been greatly reduced by
placing babies on their backs, but as yet no cause is known.
When more than one SIDS death occurs in a family, the parents are sometimes
accused. One “expert witness” popular with prosecutors in England told juries
that there is only a 1 in 73 million chance that two children in the same family
could have died naturally. Here’s his calculation: the rate of SIDS in a
nonsmoking middle-class family is 1 in 8500. So the probability of two deaths is
$\frac{1}{8500} \times \frac{1}{8500} = \frac{1}{72{,}250{,}000}$. Several women were convicted of murder on this basis,
without any direct evidence that they harmed their children.
As the Royal Statistical Society said, this reasoning is nonsense. It assumes that
SIDS deaths in the same family are independent events. The cause of SIDS is
unknown: “There may well be unknown genetic or environmental factors that
predispose families to SIDS, so that a second case in the family becomes much
more likely.” The British government decided to review the cases of 258 parents
convicted of murdering their babies.
9. Many people who come to clinics to be tested for HIV, the virus that causes
AIDS, don’t come back to learn the test results. Clinics now use “rapid HIV
tests” that give a result in a few minutes. The false positive rate for a diagnostic
test is the probability that a person with no disease will have a positive test result.
For the rapid HIV tests, the Food and Drug Administration has established 2% as
the maximum false positive rate. If a clinic uses a test that meets the FDA
standard and tests 50 people who are free of HIV antibodies, what is the
probability that at least one false-positive will occur?
P(at least one positive) = 1 – P(no positives)
= 1 – P(50 negatives)
= 1 – (1 – .02)^50 = .6358
There is approximately 64% chance that at least one of the 50 people will test
positive for HIV, even though no one has the virus.
Concern about excessive numbers of false positives led the New York City
Department of Health and Mental Hygiene to suspend the use of one particular
rapid HIV test.
10. Only 5% of male high school basketball, baseball, and football players go on to
play at the college level. Of these, only 1.7% enter major league professional
sports. About 40% of the athletes who compete in college and then reach the pros
have a career of more than three years. Define these events: A = {competes in
college}, B = {competes professionally}, C = {pro career longer than 3 years}.
What is the probability that a high school athlete competes in college and then
goes on to have a pro career of more than three years?
We know that P(A) = .05, P(B|A) = .017, P(C|A ∩ B) = .4. The probability we
want is therefore P(A ∩ B ∩ C) = P(A)P(B|A)P(C|A ∩ B)
= .05 × .017 × .4 = .00034
Only about 3 of every 10,000 high school athletes can expect to compete in
college and have a professional career of more than three years. High school
students would be wise to concentrate on studies rather than on unrealistic hopes
of fortune from pro sports.
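Several of the examples above come down to checking independence or multiplying probabilities along a sequence of events. The Python sketch below works Examples 2, 3, 6, 9, and 10. The sample-point weights for the loaded die and the variable names are our own way of encoding the problem statements.

```python
from fractions import Fraction
from math import comb

# Example 2: die on which even numbers are twice as likely as odd numbers.
weights = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2}
total_weight = sum(weights.values())

def prob(event):
    """Probability of a set of faces under the loaded-die weights."""
    return Fraction(sum(weights[face] for face in event), total_weight)

A = {4, 5, 6}   # number greater than 3
B = {1, 4}      # perfect square
independent = prob(A & B) == prob(A) * prob(B)

# Example 3: two defective fuses in a row, drawn without replacement.
p_both_defective = Fraction(5, 20) * Fraction(4, 19)

# Example 6: biased coin with P(head) = 2/3; 2 tails and 1 head in 3 tosses.
p_head, p_tail = Fraction(2, 3), Fraction(1, 3)
p_two_tails_one_head = comb(3, 2) * p_tail ** 2 * p_head

# Example 9: at least one false positive among 50 independent tests at a 2% rate.
p_at_least_one_false_positive = 1 - (1 - 0.02) ** 50

# Example 10: college, then pro, then a pro career of more than three years.
p_long_pro_career = 0.05 * 0.017 * 0.4

print(prob(A), prob(B), prob(A & B), independent)
print(p_both_defective, p_two_tails_one_head)
print(round(p_at_least_one_false_positive, 4), round(p_long_pro_career, 5))
```

For the loaded die, P(A ∩ B) = 2/9 while P(A)P(B) = 5/27, so A and B are dependent. The fuse and athlete examples use conditional probabilities at each step because the events are not independent.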
Exercises: pp. 105-108 of Walpole nos. 1-18
1. If R is the event that a convict committed armed robbery and D is the event that
the convict pushed dope, state in words what probabilities are expressed by
a. P(R|D)
b. P(D’|R)
c. P(R’|D’)
2. A class in advanced physics is comprised of 10 juniors, 30 seniors, and 10
graduate students. The final grades showed that 3 of the juniors, 10 seniors, and 5
graduate students received an A for the course. If a student is chosen at random
from this class and is found to have earned an A, what is the probability that he or
she is a senior?
3. Consider the event B of getting a perfect square when a die is tossed. The die is
constructed so that the even numbers are twice as likely to occur as the odd
numbers. Suppose it is known that the toss of the die resulted in A = a number
greater than 3. Find P(B|A).
4. In the senior year of a high school graduating class of 100 students, 42 studied
mathematics, 68 studied psychology, 54 studied history, 22 studied both
mathematics and history, 25 studied both mathematics and psychology, 7 studied
history but neither mathematics nor psychology, 10 studied all three subjects, and
8 did not take any of the three. If a student is selected at random, find the
probability that a person
(a) enrolled in psychology takes all three subjects
(b) not taking psychology is taking both history and mathematics.
5. A pair of dice is thrown. If it is known that one die shows a 4, what is the
probability that
(a) the other die shows a 5
(b) the total of both dice is greater than 7.
6. A card is drawn from an ordinary deck and we are told that it is red. What is the
probability that the card is greater than 2 but less than 9?
7. The probability that an automobile being filled with gasoline will also need an oil
change is .25, the probability that it needs a new oil filter is .4, and the probability
that both the oil and filter need changing is .14.
(a) If the oil had to be changed, what is the probability that a new oil filter is
needed?
(b) If a new oil filter is needed, what is the probability that the oil has to be
changed?
8. The probability that a married man watches a certain television show is .4 and the
probability that a married woman watches the show is .5. The probability that a
man watches the show, given that his wife does, is .7. Find the probability that
(a) a married couple watches the show
(b) a wife watches the show given that her husband does
(c) at least one person of a married couple will watch the show.
9. The probability that a vehicle entering the Luray Caverns has Canadian license
plates is .12, the probability that it is a camper is .28, and the probability that it is
a camper with Canadian license plates is .09. What is the probability that
(a) a camper entering the Luray Caverns has Canadian license plates?
(b) a vehicle with Canadian license plates entering the Luray Caverns is a
camper?
(c) a vehicle entering the Luray Caverns does not have Canadian license plates
or is not a camper?
10. The probability that the lady of the house is home when the Avon representative
calls is .6. Given that the lady of the house is home, the probability that she
makes a purchase is .4. Find the probability that the lady of the house is home
and makes a purchase when the Avon representative calls.
11. The probability that a doctor correctly diagnoses a particular illness is .7. Given
that the doctor makes an incorrect diagnosis, the probability that the patient enters
a lawsuit is .9. What is the probability that the doctor makes an incorrect
diagnosis and the patient sues?
12. One bag contains 4 white balls and 3 black balls, and a second bag contains 3
white balls and 5 black balls. One ball is drawn at random from the first bag and
placed unseen in the second bag. What is the probability that a ball now drawn
from the second bag is black? (Hint: Let B1, B2, and W1 represent, respectively,
the drawing of a black ball from bag 1, a black ball from bag 2, and a white ball
from bag 1. We are interested in B1 ∩ B2 and W1 ∩ B2.)
13. A real estate agent has 8 master keys to open several new homes. Only 1 master
key will open any given house. If 40% of these homes are usually left unlocked,
what is the probability that the real estate agent can get into a specific home if the
agent selects 3 master keys at random before leaving the office? (hint: Let A =
the house is open and B = the correct key is one of the 3 selected before leaving
the office. One event is A’ ∩ B.)
14. A town has 2 fire engines operating independently. The probability that a specific
fire engine is available when needed is .96. What is the probability that
(a) neither is available when needed
(b) a fire engine is available when needed?
15. If the probability that Tom will be alive in 20 years is .7 and the probability that
Nancy will be alive in 20 years is .9, what is the probability that neither will be
alive in 20 years?
16. The probability that a person visiting his dentist will have an x-ray is .6; the
probability that a person who has an x-ray will also have a cavity filled is .3; and
the probability that the person who has had an x-ray and a cavity filled will also
have a tooth extracted is .1. What is the probability that a person visiting his
dentist will have an x-ray, a cavity filled, and a tooth extracted?
17. Find the probability of randomly selecting 4 good quarts of milk in succession
from a cooler containing 20 quarts of which 5 are spoiled.
18. From a box containing 6 black balls and 4 green balls, 3 balls are drawn in
succession, each ball being replaced in the box before the next draw is made.
What is the probability that all 3 are the same color? Each color is represented?
CHAPTER 6
Probability Distributions
Remark We shall use an uppercase letter, say X, to denote a random variable and
its corresponding lowercase letter, x in this case, for one of its values.
Examples:
1. (Experiment No. 1) An experiment consists of tossing a coin 3 times and
observing the result. The possible outcomes and the values of the random
variables X and Y, where X is the number of heads and Y is the number of heads
minus the number of tails are
Sample Points x y
HHH 3 3
HHT 2 1
HTH 2 1
HTT 1 -1
THH 2 1
THT 1 -1
TTH 1 -1
TTT 0 -3
Defn A random variable defined over a discrete sample space is called a discrete
random variable.
Defn A table or formula listing all possible values that a discrete random variable can
take on, along with the associated probabilities, is called a discrete probability
distribution.
Remark The probabilities associated with all possible values of a discrete random
variable must sum to 1.
Examples:
1. For Experiment No. 1, the discrete probability distributions of the random
variables X and Y are
X 0 1 2 3
P(X=x) 1/8 3/8 3/8 1/8
Y -3 -1 1 3
P(Y=y) 1/8 3/8 3/8 1/8
2. Construct the discrete probability distribution for the random variable M defined
in Experiment No. 2.
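The distributions of X and Y in Experiment No. 1 can be checked by listing the sample space directly. The sketch below assumes a fair coin (which is what the probabilities 1/8 and 3/8 above imply); the helper name `distribution` is only illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes of tossing a fair coin 3 times (assumption: fair coin).
outcomes = ["".join(t) for t in product("HT", repeat=3)]

def distribution(random_variable):
    """Return {value: probability} for a function defined on the outcomes."""
    counts = Counter(random_variable(o) for o in outcomes)
    return {v: Fraction(c, len(outcomes)) for v, c in sorted(counts.items())}

dist_x = distribution(lambda o: o.count("H"))                 # X = number of heads
dist_y = distribution(lambda o: o.count("H") - o.count("T"))  # Y = heads minus tails

print(dist_x)
print(dist_y)
```

The same helper could be pointed at any other random variable defined on a finite, equally likely sample space.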
6.3 EXPECTED VALUES
Defn Let X be a discrete random variable with probability distribution
x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)
The mean or expected value of X is
$\mu_X = E(X) = \sum_{i=1}^{n} x_i f(x_i)$
Examples:
1. Find the mean of the random variables X and Y of Experiment No. 1.
X 0 1 2 3
P(X=x) 1/8 3/8 3/8 1/8
Y -3 -1 1 3
P(Y=y) 1/8 3/8 3/8 1/8
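Defn Let X be a discrete random variable with probability distribution
x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)
The mean or expected value of the random variable g(X) is
$E[g(X)] = \sum_{i=1}^{n} g(x_i) f(x_i)$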
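Defn The variance of a random variable X with mean µ is
$\sigma_X^2 = V(X) = E[(X - \mu)^2]$
Let X be a discrete random variable with probability distribution
x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)
The variance of X is
$\sigma_X^2 = V(X) = E[(X - \mu)^2] = \sum_{i=1}^{n} (x_i - \mu)^2 f(x_i)$
For example, for X of Experiment No. 1 (µ = 1.5), the variance is
= 0.75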
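As a check on the definitions of mean and variance, the following sketch applies both sums to the distributions of X and Y from Experiment No. 1. The function names are illustrative only; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

def mean(dist):
    """E(X) = sum of x * f(x) over all values x."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """V(X) = sum of (x - mu)^2 * f(x) over all values x."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

eighth = Fraction(1, 8)
dist_x = {0: eighth, 1: 3 * eighth, 2: 3 * eighth, 3: eighth}    # number of heads
dist_y = {-3: eighth, -1: 3 * eighth, 1: 3 * eighth, 3: eighth}  # heads minus tails

print(mean(dist_x), variance(dist_x))
print(mean(dist_y), variance(dist_y))
```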
Examples:
1. Find the probability of obtaining exactly three 2’s if an ordinary die is tossed 5
times.
2. In a certain city district the need for money to buy drugs is given as the reason for
75% of all thefts. What is the probability that exactly 2 of the next 4 theft cases
reported in this district resulted from the need for money to buy drugs?
3. The probability that a patient recovers from a rare blood disease is .4. If 15
people are known to have contracted this disease, what is the probability that (a) 5
survive; (b) 3 to 8 survive; and (c) at least 10 survive?
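The three examples above can be worked numerically if each is modeled as a binomial experiment, that is, n independent trials with a constant success probability p. That modeling choice is an assumption of this sketch, as are the helper names `binom_pmf` and `binom_range`.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for x successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_range(lo, hi, n, p):
    """P(lo <= X <= hi), obtained by summing the individual probabilities."""
    return sum(binom_pmf(x, n, p) for x in range(lo, hi + 1))

ex1 = binom_pmf(3, 5, 1 / 6)          # exactly three 2's in 5 tosses of a die
ex2 = binom_pmf(2, 4, 0.75)           # exactly 2 of the next 4 thefts
ex3a = binom_pmf(5, 15, 0.4)          # exactly 5 of 15 patients survive
ex3b = binom_range(3, 8, 15, 0.4)     # from 3 to 8 survive
ex3c = binom_range(10, 15, 15, 0.4)   # at least 10 survive

for label, value in [("1", ex1), ("2", ex2), ("3a", ex3a), ("3b", ex3b), ("3c", ex3c)]:
    print(label, round(value, 4))
```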
Exercises: pp. 165-166 of Walpole nos. 4, 6-10, 12, 13
4. A baseball player’s batting average is .250. What is the probability that he gets
exactly 1 hit in his next 5 times at bat?
6. A multiple-choice quiz has 15 questions, each with 4 possible answers of which
only 1 is the correct answer. What is the probability that sheer guesswork yields
a. exactly 10 correct answers
b. at least 1 correct answer
c. 5 to 10 correct answers .
7. The probability that a patient recovers from a delicate heart operation is .9. What
is the probability that exactly 5 of the next 7 patients having this operation
survive?
8. A study conducted at George Washington University and the National Institute of
Health examined national attitudes about tranquilizers. The study revealed that
approximately 70% believe “tranquilizers don’t really cure anything, they just
cover up the real trouble.” According to this study, what is the probability that at
least 3 of the next 5 people selected at random will be of the opinion that
tranquilizers do cure the problem rather than just cover it up?
9. A survey of the residents in a United States city showed that 20% preferred a
white telephone over any other color available. What is the probability that more
than one-half of the next 20 telephones installed in this city will be white?
10. One-fourth of the female freshmen entering a Virginia college are out-of-state
students. If the students are assigned at random to the dormitories, 3 to a room,
what is the probability that in one room at most 2 of the 3 roommates are out-of-
state students?
12. Suppose that airplane engines operate independently in flight and fail with
probability q = .2. Assuming that a plane makes a safe flight if at least one-half of
its engines run, determine whether a 4-engine plane or a 2-engine plane has the
higher probability for a successful flight.
13. Repeat Exercise 12 for q =.5 and q = 1/3.
Near the end of World War II, the Germans developed rocket bombs, which were fired at
the city of London. The Allied military command did not know whether these bombs
were fired at random or whether they had some type of aiming device. To investigate,
the city of London was divided into 576 square regions and the number of hits per region
was counted and compared with the expected number of hits under a special discrete
probability distribution. Because the actual number of hits was close to the expected
number of hits, the military command concluded that the bombs were falling at random.
The Germans had not developed a bomb with an aiming device.
Defn A random variable defined over a continuous sample space is called a continuous
random variable.
Defn The function with values f(x) is called a probability density function for the
continuous random variable X, if
the total area under its curve and above the horizontal axis is equal to
1; and
the area under the curve between any two ordinates x = a and x = b
gives the probability that X lies between a and b.
Remarks:
1. A continuous random variable has a probability of zero of assuming exactly any
of its values, that is, if X is a continuous random variable, then P(X=x) = 0 for all
real numbers x.
2. The probability density function cannot be represented in tabular form.
Example A continuous random variable X that can assume values between 0 and 2
has a density function given by
$f(x) = \begin{cases} .5, & \text{for } 0 < x < 2 \\ 0, & \text{otherwise} \end{cases}$
Example A used car dealer finds that in any day, the probability of selling no car is
0.4, one car is 0.2, two cars is 0.15, 3 cars is 0.10, 4 cars is 0.08, five cars
is 0.06 and six cars is 0.01. Let g(X) = 500 + 1500X represent the
salesman’s daily earnings, where X is the number of cars sold. Find the
salesman’s expected daily earnings.
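The used car example is a direct application of E[g(X)] = Σ g(x) f(x). The sketch below uses the probabilities given in the example; the variable names are illustrative.

```python
# Probability distribution of X, the number of cars sold in a day (from the example).
cars_sold = {0: 0.40, 1: 0.20, 2: 0.15, 3: 0.10, 4: 0.08, 5: 0.06, 6: 0.01}

def g(x):
    """Daily earnings as a function of the number of cars sold."""
    return 500 + 1500 * x

expected_cars = sum(x * p for x, p in cars_sold.items())
expected_earnings = sum(g(x) * p for x, p in cars_sold.items())

print(round(expected_cars, 2), round(expected_earnings, 2))
```

Because g is linear, the same answer is obtained from 500 + 1500 E(X), which gives a quick way to check the sum.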
THE NORMAL DISTRIBUTION
Defn A continuous random variable X is said to be normally distributed if its density
function is given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
The graph of the normal (or Gaussian) distribution is called the normal curve.
Properties:
1. The curve is bell-shaped and symmetric about a vertical axis through the mean µ.
2. The normal curve approaches the horizontal axis asymptotically as we proceed in
either direction away from the mean.
3. The total area under the curve and above the horizontal axis is equal to 1.
Defn The distribution of a normal random variable with mean zero and standard
deviation equal to 1 is called a standard normal distribution.
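If X is a normal random variable with mean µ and standard deviation σ, then
Z = (X − µ)/σ is a standard normal random variable.
Hence, whenever X is between the values x1 and x2, the random variable Z will fall
between the corresponding values
$z_1 = \frac{x_1 - \mu}{\sigma}$ and $z_2 = \frac{x_2 - \mu}{\sigma}$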
Table Areas under the standard normal curve to the right of z, that is, P(Z > z)
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.9 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
-3.8 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.7 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
-3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
-3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
-3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
-3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
-3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
-2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
-2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
-2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
-2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
-2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
-2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
-2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
-2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
-2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
-2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
-1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
-1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
-1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
-1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
-1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
-1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
-1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
-1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
-1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
-0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
-0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
-0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
-0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
-0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
-0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
-0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
-0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
-0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
-0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
3.5 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002
3.6 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.7 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.8 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.9 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
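The entries of the table above are right-tail areas P(Z > z). As a rough cross-check, they can be reproduced with the complementary error function from Python's standard library. The function names below are illustrative, not part of the notes.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal random variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def normal_upper_tail(x, mu, sigma):
    """P(X > x) for a normal X, obtained by standardizing x first."""
    return upper_tail((x - mu) / sigma)

print(round(upper_tail(1.0), 4))    # compare with row 1.0, column 0.00
print(round(upper_tail(-1.0), 4))   # compare with row -1.0, column 0.00
print(round(normal_upper_tail(224, 200, 15), 4))  # z = 1.6
```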
Exercises: pp. 197-199 of Walpole nos. 1-16
1. Given a normal distribution with µ = 40 and σ = 6, find
a. the area below 32
b. the area above 27
c. the area between 42 and 51
d. the x value that has 45% of the area below it
e. the x value that has 13% of the area above it
2. Given a normal distribution with µ = 200 and σ2 = 100, find
a. the area below 214
b. the area above 179
c. the area between 188 and 206
d. the x value that has 80% of the area below it
e. two x values containing the middle 75% of the area
3. Given the normally distributed random variable X with mean 18 and standard
deviation 2.5, find
(a) P(X < 15)
(b) P(17 < X < 21)
(c) the value of k such that P( X < k) = .2578
(d) the value of k such that P(X > k) = .1539.
4. A soft-drink machine is regulated so that it discharges an average of 200 ml. per
cup. If the amount of drink is normally distributed with standard deviation equal
to 15 ml.
(a) what fraction of cups will contain more than 224 ml.?
(b) what is the probability that a cup contains between 191 and 209 ml.?
(c) how many cups will likely overflow if 230-ml. cups are used in the next 1000
drinks?
(d) below what value do we get the smallest 25% of the drinks?
5. The finished inside diameter of a piston ring is normally distributed with a mean
of 10 cm. and a standard deviation of .03 cm.
(a) What proportion of rings will have inside diameters exceeding 10.075 cm.?
(b) What is the probability that a piston ring will have an inside diameter between
9.97 and 10.03 cm.?
(c) Below what value of inside diameter will 15% of piston rings fall?
6. A lawyer commutes daily from his suburban home to his midtown office. On the
average the trip one way takes 24 minutes, with a standard deviation of 3.8
minutes. Assume the distribution of trip times to be normally distributed.
(a) What is the probability that the trip will take at least ½ hour?
(b) If the office opens at 9 AM and he leaves his house at 8:45 AM, what
percentage of the time is he late for work?
(c) If he leaves the house at 8:35 AM and coffee is served at the office from 8:50
AM until 9 AM, what is the probability that he misses coffee?
(d) Find the length of time above which we find the slowest 15% of the trips
7. If a set of grades on a statistics examination are approximately normally
distributed with a mean of 74 and a standard deviation of 7.9, find
(a) the lowest passing grade if the lowest 10% of the students are given F’s
(b) the highest B if the top 5% of the students are given A’s
(c) the lowest B if the top 10% of the students are given A’s and the next 25% are
given B’s.
8. In a mathematics examination the average grade was 82 and the standard
deviation was 5. All students with grades from 88 to 94 received a grade of B. If
the grades are approximately normally distributed and 8 students received a B
grade, how many students took the examination?
9. The heights of 1000 students are normally distributed with a mean of 174.5 cm.
and a standard deviation of 6.9 cm. How many of these students would you
expect to have heights (a) less than 160.0 cm; (b) between 171.5 and 182.0 cm;
(c) equal to 175.0 cm; and (d) greater than or equal to 188.0 cm?
10. A company pays its employees an average wage of $7.25 an hour with a standard
deviation of 60 cents. If the wages are approximately normally distributed
(a) what percentage of the workers receive wages between $6.75 and $7.69 an
hour?
(b) the highest 5% of the employee hourly wages are greater than what amount?
11. The weights of a large number of miniature poodles are approximately normally
distributed with a mean of 8 kg. and a standard deviation of .9 kg. Find the
fraction of these poodles with weights
(a) over 9.5 kg.
(b) at most 8.6 kg.
(c) between 7.3 and 9.1 kg.
12. The tensile strength of a certain metal component is normally distributed with a
mean of 10,000 kg/cm2 and a standard deviation of 100 kg/cm2.
(a) What proportion of these components exceeds 10,150 kg/cm2 in tensile
strength?
(b) If specifications require that all components have tensile strength between
9800 and 10,200 kg/cm2, what proportion of pieces would you expect to scrap?
13. If a set of observations is normally distributed, what percentage of the
observations differs from the mean by
(a) more than 1.3σ?
(b) less than .52σ?
14. The IQs of 600 applicants to a certain college are approximately normally
distributed with a mean of 115 and a standard deviation of 12. If the college
requires an IQ of at least 95, how many of these students will be rejected on this
basis regardless of their other qualifications?
15. The average rainfall in Roanoke, Virginia for the month of March is 9.22 cm.
Assuming a normal distribution with a standard deviation of 2.83 cm, find the
probability that next March Roanoke receives (a) less than 1.84 cm of rain; (b)
more than 5 cm but not over 7 cm of rain; and (c) more than 13.8 cm of rain.
16. The average life of a certain type of small motor is 10 years with a standard
deviation of 2 years. The manufacturer replaces free all motors that fail while
under guarantee. If he is willing to replace only 3% of the motors that fail, how
long a guarantee should he offer? Assume that the lives of the motors follow a
normal distribution.
CHAPTER 7
Sampling Distributions
Consider three observations making up the population values 0, 1, and 2 with parameters
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = 1$ and $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{2}{3}$.
Suppose we list all possible samples of size 2, with replacement, and for each possible
sample compute the value of the sample mean, $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$:
[Figure: distribution of the sample means X̄, sampling with replacement]
$\mu_{\bar{X}}$ = average of the X̄’s = 1 = µ
and
$\sigma_{\bar{X}}^2$ = variance of the X̄’s = $\frac{1}{3} = \frac{2/3}{2} = \frac{\sigma^2}{n}$
Thm1 If all possible random samples of size n are drawn with replacement from a finite
population of size N with mean µ and standard deviation σ, then the sample
mean will have mean µ and variance σ2/n.
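Suppose we list all possible samples of size 2, without replacement, and for each possible
sample compute the value of the sample mean, X̄:
[Figure: distribution of the sample means X̄, sampling without replacement]
$\mu_{\bar{X}}$ = average of the X̄’s = 1 = µ
and
$\sigma_{\bar{X}}^2$ = variance of the X̄’s = $\frac{1}{6} = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$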
Thm2 If all possible random samples of size n are drawn without replacement from a
finite population of size N with mean µ and standard deviation σ, then the
sample mean will have mean µ and variance $\frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$.
The factor $\frac{N-n}{N-1}$ is called the finite population correction factor. For large N relative
to the sample size n, this factor will be close to 1 and the variance is approximately equal
to σ2/n.
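Thm1 and Thm2 can be verified for the small population 0, 1, 2 by brute-force enumeration. The sketch below lists every ordered sample of size 2, with and without replacement, and compares the variance of the sample means with σ2/n and with σ2/n times the finite population correction factor. Treating samples as ordered pairs is a choice made here for convenience.

```python
from fractions import Fraction
from itertools import permutations, product

population = [0, 1, 2]
N, n = len(population), 2

def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)

def variance(values):
    """Variance with divisor equal to the number of values (population form)."""
    values = list(values)
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

mu, sigma2 = mean(population), variance(population)

# Sample means over all ordered samples of size n.
means_with = [mean(s) for s in product(population, repeat=n)]     # with replacement
means_without = [mean(s) for s in permutations(population, n)]    # without replacement

print(mean(means_with), variance(means_with), sigma2 / n)
print(mean(means_without), variance(means_without),
      sigma2 / n * Fraction(N - n, N - 1))
```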
Thm3 (Central Limit Theorem) If X̄ is the mean of a random sample of size n taken
from a population with mean µ and variance σ2, then the
sampling distribution of X̄ is approximately normally distributed with mean
E(X̄) = µ and variance V(X̄) = σ2/n when n is sufficiently large. Hence, the
limiting form of the distribution of
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
as n approaches infinity is the standard normal distribution.
The normal approximation in the theorem will be good if n ≥ 30 regardless
of the shape of the population.
If n < 30, the approximation is good only if the population is not too different
from the normal.
If the distribution of the population is normal then the sampling distribution
will also be exactly normal, no matter how small the size of the sample.
Example An electrical firm manufactures electric light bulbs that have a length of
life which is normally distributed with mean and standard deviation equal
to 500 and 50 hours, respectively. Find the probability that a random
sample of 15 bulbs will have an average life of less than 475 hours.
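A possible numerical treatment of the light bulb example: since the population is normal, X̄ is normal with mean µ and standard deviation σ/√n even for n = 15, so the probability follows from standardizing 475. The helper `normal_cdf` is an illustrative name and uses the error function in place of the printed table.

```python
import math

def normal_cdf(z):
    """P(Z < z) for a standard normal random variable Z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

mu, sigma, n = 500, 50, 15
standard_error = sigma / math.sqrt(n)      # standard deviation of the sample mean
z = (475 - mu) / standard_error            # standardized value of 475 hours
probability = normal_cdf(z)                # P(sample mean < 475)

print(round(z, 2), round(probability, 4))
```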
Thm4 (The t-distribution) If X̄ and S2 are the mean and variance, respectively, of a
random sample of size n taken from a population which is normally distributed
with mean µ and variance σ2, then
$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$
is a random variable having the t-distribution with v = n − 1 degrees of freedom.
Notation: $T \sim t_{v = n-1}$
Just like any continuous probability distribution, the probability that a random
sample produces a t-value falling between any two specified values is equal to the
area under the curve of the t-distribution between any two ordinates
corresponding to the specified values.
Examples:
1. Find the following values on the t -table:
(a) t0.025 when v = 14.
(b) t0.99 when v=10.
2. Find k such that P(k < T < 2.807) = 0.945 when T ~ t(23)
3. A manufacturing firm claims that the batteries used in their electronic games will
last an average of 30 hours. To maintain this average, 16 batteries are tested each
month. If the computed t-value falls between -t0.025 and t0.025 , the firm is satisfied
with its claim. What conclusion should the firm draw from a sample that has
mean X̄ = 27.5 hours and standard deviation S = 5 hours? Assume the
distribution of battery lives to be approximately normal.
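For the battery example (Example 3), the t-value can be computed directly from Thm4. In the sketch below the critical value 2.131 is the tabulated t0.025 for v = 15 taken from a standard t-table; it is hard-coded rather than computed, and the variable names are illustrative.

```python
import math

n, claimed_mean = 16, 30
sample_mean, sample_sd = 27.5, 5

# T = (sample mean - mu) / (S / sqrt(n)) with v = n - 1 = 15 degrees of freedom.
t_value = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))

T_CRIT = 2.131  # t_0.025 for v = 15, read from a t-table (not computed here)
within_limits = -T_CRIT < t_value < T_CRIT

print(t_value, within_limits)
```

If `within_limits` is true, the computed t-value lies between -t0.025 and t0.025, which is the firm's own criterion for being satisfied with its claim.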
Thm5 If independent random samples of size n1 and n2 are drawn from two large or
infinite populations with means µ1 and µ2 and standard deviations σ1 and σ2,
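respectively, then the sampling distribution of the difference of means X̄1 − X̄2
is approximately normally distributed with mean and standard deviation given by
$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$ and $\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
Hence
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$
is a value of the standard normal variable Z.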
P(tv > tα,v) = α
Example A commonly prescribed drug on the market for relieving nervous tension
is believed to be only 60% effective. Experimental results with a new drug
administered to a random sample of 100 adults who were suffering from
nervous tension showed that 70 received relief.
a. Estimate the population proportion of nervous tension patients who
will receive relief with the experimental drug.
b. Is this sufficient evidence to conclude that the new drug is superior
to the one commonly prescribed?
Point Estimation
Defn An estimator is any statistic whose value is used to estimate an unknown
parameter. A realized value of an estimator is called an estimate.
Remarks:
1. An estimator is said to be unbiased if the average of the estimates it produces
under repeated sampling is equal to the true value of the parameter being
estimated.
Examples:
Under random sampling, the sample mean is an unbiased estimator of the
population mean, that is, E(X̄) = μ.
Under random sampling with replacement, S2 is an unbiased estimator of σ2,
but S on the other hand is a biased estimator of σ with the bias becoming
insignificant for large samples.
With replacement                   Without replacement
Samples   S2        S              Samples   S2   S
0,0       0         0
0,1       .5        .707107        0,1       .5   .707107
0,2       2         1.414214       0,2       2    1.414214
1,0       .5        .707107        1,0       .5   .707107
1,1       0         0
1,2       .5        .707107        1,2       .5   .707107
2,0       2         1.414214       2,0       2    1.414214
2,1       .5        .707107        2,1       .5   .707107
2,2       0         0
Mean      .666667   .628539        Mean      1    .942809
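The table above can be regenerated by enumeration. The sketch assumes the population 0, 1, 2 used earlier in the chapter (σ2 = 2/3) and samples of size 2; `summarize` is an illustrative helper that averages S2 and S over all samples.

```python
from itertools import permutations, product
from statistics import mean, stdev, variance

population = [0, 1, 2]  # assumed: same population as in the sampling-distribution example

def summarize(samples):
    """Average of the sample variances S2 and of the sample standard deviations S."""
    samples = list(samples)
    return mean(variance(s) for s in samples), mean(stdev(s) for s in samples)

with_repl = summarize(product(population, repeat=2))    # 9 ordered samples
without_repl = summarize(permutations(population, 2))   # 6 ordered samples

print(with_repl)
print(without_repl)
```

With replacement the average of S2 equals σ2 = 2/3, while the average of S falls below σ ≈ .8165, which illustrates the bias of S noted above.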
2. A parameter can have more than one unbiased estimator. We would naturally
choose the unbiased estimator with the smallest variance.
Possible Samples X̄ Md
0,1,2 1 1
0,1,3 1.33 1
0,2,1 1 1
0,2,3 1.67 2
0,3,1 1.33 1
0,3,2 1.67 2
1,0,2 1 1
1,0,3 1.33 1
1,2,0 1 1
1,2,3 2 2
1,3,0 1.33 1
1,3,2 2 2
2,0,1 1 1
2,0,3 1.67 2
2,1,0 1 1
2,1,3 2 2
2,3,0 1.67 2
2,3,1 2 2
3,0,1 1.33 1
3,0,2 1.67 2
3,1,0 1.33 1
3,1,2 2 2
3,2,0 1.67 2
3,2,1 2 2
Mean 1.5 1.5
Variance .138889 .25
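The comparison between the sample mean and the sample median (Md) can be reproduced as follows. The population 0, 1, 2, 3 is inferred from the samples listed in the table and should be read as an assumption; the samples are all ordered selections of size 3 without replacement.

```python
from itertools import permutations
from statistics import mean, median, pvariance

population = [0, 1, 2, 3]  # assumed from the samples listed in the table
samples = list(permutations(population, 3))   # 24 ordered samples, no replacement

sample_means = [sum(s) / len(s) for s in samples]
sample_medians = [median(s) for s in samples]

# Both estimators average to the population mean, but their spreads differ.
print(mean(sample_means), pvariance(sample_means))
print(mean(sample_medians), pvariance(sample_medians))
```

Both averages equal 1.5, so both estimators are unbiased here, and the smaller variance of the sample mean is the reason to prefer it.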
Interval Estimation
Example The running times (in minutes) of a sample of films produced by Star-Regal
Theater are as follows: 103 94 110 87 98.
A 95% confidence interval for the mean running time of films produced
by Star-Regal Theater is (87.6, 109.2).
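The interval quoted in this example is consistent with a t-based interval, X̄ ± t s/√n with v = 4. The sketch below assumes that method and hard-codes the tabulated value t0.025 = 2.776 for v = 4 rather than computing it.

```python
import math
from statistics import mean, stdev

times = [103, 94, 110, 87, 98]   # running times in minutes

T_CRIT = 2.776                   # t_0.025 for v = n - 1 = 4, read from a t-table
x_bar = mean(times)
s = stdev(times)                 # sample standard deviation (divisor n - 1)
margin = T_CRIT * s / math.sqrt(len(times))
interval = (x_bar - margin, x_bar + margin)

print(round(interval[0], 1), round(interval[1], 1))
```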
Remarks:
1. In general, we construct a (1-α)100% confidence interval. The fraction (1-α) is
called the confidence coefficient, and the endpoints a and b are called the lower
and upper confidence limits, respectively.
2. The confidence coefficient is not “the probability that the true value of the
parameter falls in the interval estimate” since once a sample is drawn and a
confidence interval constructed, the resulting interval estimate either encloses the
true value of the parameter or it doesn’t. Rather, the confidence coefficient is “the
probability that the interval estimator encloses the true value of the parameter.”
3. A good confidence interval is one that is as narrow as possible and has a large
confidence coefficient, near 1. The narrower the interval, the more exactly we
have located the parameter; whereas, the larger the confidence coefficient, the
more confidence we have that a particular interval encloses the true value of the
parameter. However, for a fixed sample size, as the confidence coefficient
increases, the length of the interval also increases.
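Example Consider a population consisting of the values 0, 1, 2, and 3 with
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = 1.5$ and $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{5}{4}$. Suppose
we list all possible samples of size 2, with replacement, and
compute the 90% confidence interval for each possible sample.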
8.2 Estimation of µ
a. when σ is known
$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
where zα/2 is the z-value leaving an area of α/2 to the right.
b. when σ is unknown, n ≤ 30
$\bar{X} \pm t_{\alpha/2, v} \frac{s}{\sqrt{n}}$
where tα/2,v is the t-value with v = n - 1 degrees of freedom
leaving an area of α/2 to the right.
Remarks:
1. The above formulas hold strictly for random samples from a normal distribution.
However, they provide good approximate (1-α)100% confidence intervals when
the distribution is not normal provided the sample size is large, i.e. n > 30.
2. If σ is unknown and n > 30, use $\bar{X} \pm z_{\alpha/2} \frac{S}{\sqrt{n}}$ where zα/2 is the z-value leaving an
area of α/2 to the right.
[Figure: reading tα/2,v from the t-table (row v, column α/2) and zα/2 from the z-table (right-tail area α/2)]
Examples:
1. The mean and standard deviation for the quality grade-point averages of a random
sample of 36 college seniors are calculated to be 2.6 and .3, respectively. Find the
95% and 99% confidence intervals for the mean of the entire senior class.
2. The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0,
10.2, and 9.6 liters. Find a 95% confidence interval for the mean content of all
such containers, assuming an approximate normal distribution.
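Both examples can be worked with a few lines of code. In this sketch Example 1 uses the large-sample z interval with S in place of σ (n = 36 > 30), and Example 2 uses the t interval with the tabulated value t0.025 = 2.447 for v = 6, which is hard-coded. The function names are illustrative; `NormalDist.inv_cdf` from the standard library stands in for the z-table.

```python
import math
from statistics import NormalDist, mean, stdev

def z_interval(x_bar, sd, n, confidence):
    """x_bar +/- z_{alpha/2} * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z-value leaving alpha/2 to the right
    margin = z * sd / math.sqrt(n)
    return x_bar - margin, x_bar + margin

def t_interval(data, t_crit):
    """x_bar +/- t_{alpha/2,v} * s / sqrt(n), with t_crit read from a t-table."""
    margin = t_crit * stdev(data) / math.sqrt(len(data))
    return mean(data) - margin, mean(data) + margin

# Example 1: n = 36 seniors, mean 2.6, standard deviation .3
ci95 = z_interval(2.6, 0.3, 36, 0.95)
ci99 = z_interval(2.6, 0.3, 36, 0.99)

# Example 2: 7 containers of sulfuric acid, v = 6
acid = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6]
ci_acid = t_interval(acid, 2.447)

print([round(v, 3) for v in ci95], [round(v, 3) for v in ci99])
print([round(v, 2) for v in ci_acid])
```

The 99% interval is wider than the 95% interval, in line with the remark that a larger confidence coefficient lengthens the interval for a fixed sample size.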
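The sample size needed so that, with (1-α)100% confidence, the error in estimating µ
does not exceed a specified amount e is
$n = \left( \frac{z_{\alpha/2} \, \sigma}{e} \right)^2$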
Example How large a sample is needed in Example 1 if we want to be 95%
confident that our estimate of µ is not off by more than .05?
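A short sketch of the sample size calculation for this example. It assumes that the sample standard deviation .3 from Example 1 is used as the value of σ, and it rounds the result up to the next whole number; `sample_size` is an illustrative name.

```python
import math
from statistics import NormalDist

def sample_size(sigma, max_error, confidence):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / max_error) ** 2)

# sigma is approximated by s = .3 from Example 1 (assumption).
n_needed = sample_size(sigma=0.3, max_error=0.05, confidence=0.95)
print(n_needed)
```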
Exercises: pp. 262-264 of Walpole nos. 3-13
3. An electrical firm manufactures light bulbs that have a length of life that is
normally distributed, with a standard deviation of 40 hours. If a random sample of
30 bulbs has a mean life of 780 hours, find a 96% confidence interval for the
population mean of all bulbs produced by this firm.
4. A soft-drink machine is regulated so that the amount of drink dispensed is
approximately normally distributed with a standard deviation of 1.5 dl. Find a
95% confidence interval for the mean of all drinks dispensed by this machine if a
random sample of 36 drinks had an average content of 22.5 dl.
5. The heights of a random sample of 50 college students showed a mean of 174.5
cm. and a standard deviation of 6.9 cm. Construct a 98% confidence interval for
the mean height of all college students.
6. A random sample of 100 automobile owners shows that an automobile is driven
on the average 23,500 kilometers per year, in the state of Virginia, with a standard
deviation of 3900 kilometers. Construct a 99% confidence interval for the average
distance an automobile is driven annually in Virginia.
7. How large a sample is needed in Exercise 3 if we wish to be 96% confident that
our sample mean will be within 10 hours of the true mean?
8. How large a sample is needed in Exercise 4 if we wish to be 95% confident that
our sample mean will be within .3 ounce of the true mean?
9. An efficiency expert wishes to determine the average time that it takes to drill 3
holes in a certain metal clamp. How large a sample will he need to be 95%
confident that his sample mean will be within 15 sec. of the true mean? Assume
that it is known from previous studies that σ = 40 sec.
10. Regular consumption of presweetened cereals contributes to tooth decay, heart
disease, and other degenerative diseases, according to a study by Dr. M. Albreight
of the National Institute of Health and Dr. D. Solomon, Professor of Nutrition and
Dietetics at the University of London. In a random sample of 20 similar servings
of Alpha-Bits, the mean sugar content was 11.3 grams with a standard deviation
of 2.45 grams. Assuming that the sugar content is normally distributed, construct
a 95% confidence interval for the mean sugar content for single servings of
Alpha-Bits.
11. The contents of 10 similar containers of a commercial soap are 10.2, 9.7, 10.1,
10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Find a 99% confidence interval for
the mean soap content of all such containers, assuming an approximate normal
distribution.
12. A random sample of 8 cigarettes of a certain brand has an average nicotine
content of 3.6 mg. and a standard deviation of .9 mg. Construct a 99% confidence
interval for the true average nicotine content of this particular brand of cigarettes,
assuming an approximate normal distribution.
13. A random sample of 12 female students in a certain dormitory showed an average
weekly expenditure of $8.00 for snack foods, with a standard deviation of $1.75.
Construct a 90% confidence interval for the average amount spent each week on
snack foods by female students living in this dormitory, assuming the
expenditures to be approximately normally distributed.
8.3 Inferences About µ1 - µ2
In comparing two populations with means µ1 and µ2 and standard deviations σ1 and σ2,
respectively, the analysis of the sample data depends on the sampling design used.
Defn Two samples are called independent samples when the measurements in one
sample are not related to the measurements in the other sample.
• Random samples are taken separately from two populations and the same
response variable is recorded for each individual.
• One random sample is taken and a variable recorded for each individual, but then
units are categorized as belonging to one population or another.
• Participants are randomly assigned to one of two treatment conditions and the
same response variable is recorded for each individual unit.
Defn The term paired (or matched/related/dependent) data means that data have been
observed in natural pairs.
• Each person is measured twice. The two measurements of the same characteristic
or trait are made under different situations.
• Similar individuals are paired prior to the experiment. During the experiment,
each member of a pair receives a different treatment. The same response variable
is measured for all individuals.
Example An independent samples design and a matched samples design are under
consideration for a study to obtain an estimate of the weight loss in a
shipment of bananas during transit.
Independent Samples. A random sample of banana bunches is selected from the lot and
weighed before loading. After shipment, an independent random sample of bunches is
selected and weighed during the unloading. The difference in the two sample mean
weights per bunch is used as the estimate of weight loss per bunch.
Matched Samples. A random sample of banana bunches is selected and weighed before
loading. After shipment, the same bunches selected before loading are weighed again,
and the difference in weight for each bunch is noted. The mean of these differences is
used as an estimate of the weight loss per bunch.
Exercise: For each item, identify whether the samples are independent or not.
1. A police department performs an experiment to assess the effects of an obvious
radar trap on the speeds of cars. Ten cars are randomly selected on a highway,
and their speeds are measured just before a radar trap comes into view and just
after they pass the trap.
2. A tire manufacturer is testing 2 new tread designs in terms of stopping distance.
To do this, the company uses two test cars driving side by side at the same speed.
Both cars have automatic braking systems so that both sets of brakes engage
simultaneously on a signal. Then the stopping distances for both cars are
measured after repeating the experiment 10 times.
3. A company which does a large volume of business by mail decides to test whether
there is a difference in mail delivery between those items brought to a post office
as compared to those put in a corner mailbox. A random sample of 100 customers
from the same city was selected and their letters were mailed from the post office.
Another random sample of 100 customers was then selected but their letters were
sent from the corner mailbox.
4. Two formulations of a new skin-softening lotion are to be compared as to their
softening action. A random sample of 40 potential users of the lotion is selected.
Each person in the sample is independently assigned at random one of the two
formulations to be applied to the left arm and the other formulation to be applied
to the right arm. After a lapse of eight hours, each person is asked to rate the
skin-softening effect of each formulation on a 10-point scale.
If we have two populations with means µ1 and µ2 and standard deviations σ1 and σ2,
respectively, a point estimator of the difference µ1 - µ2 is the statistic $\bar{X}_1 - \bar{X}_2$.
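b. σ1² = σ2² but unknown, n1, n2 ≤ 30
$$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,n_1+n_2-2}\,\sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}}\,\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$$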
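c. σ1² ≠ σ2² but unknown, n1, n2 ≤ 30
$$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,v}\,\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}$$
where
$$v = \frac{\left(S_1^2/n_1 + S_2^2/n_2\right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1-1} + \dfrac{(S_2^2/n_2)^2}{n_2-1}}$$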
Remarks:
1. These formulas hold strictly for independent samples selected from Normal
populations. However, they provide good approximate (1-α)100% confidence
intervals when the distributions are not Normal provided both n1 and n2 are greater
than 30.
2. If σ1² and σ2² are unknown but n1 and n2 are greater than 30, use
$$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\,\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}$$
3. Even if the population variances are considerably different, formula (b) will still
provide a good estimate provided that n1=n2 and both populations are normal.
Therefore, in a planned experiment, one should make every effort to equalize the
size of the samples.
Examples:
1. A statistics test was given to a random sample of 50 girls and another random
sample of 75 boys. The mean score of the girls is 80 with a standard deviation of
4 and the mean score of the boys is 86 with a standard deviation of 6. Find a 95%
confidence interval for the difference µB - µG.
2. A course in mathematics is taught to 12 students by the conventional classroom
procedure. A second group of 10 students was given the same course by means of
programmed materials. At the end of the semester the same examination was
given to each group. The 12 students meeting in the classroom made an average
grade of 85 with a standard deviation of 4, while the 10 students using
programmed materials made an average of 81 with a standard deviation of 5.
Find a 90% confidence interval for the difference between the population means,
assuming the populations are approximately normal with equal variances.
3. Records for the past 15 years have shown the average rainfall in a certain region
to be 4.93 cm., with a standard deviation of 1.14 cm. A second region has had an
average rainfall of 2.64 cm., with a standard deviation of .66 cm. during the past
10 years. Find a 95% confidence interval for the difference of the true average
rainfalls in these regions, assuming that the observations come from normal
populations with different variances.
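The three examples above map onto the large-sample z interval, the pooled-variance t interval, and the unequal-variance t interval. The sketch below works each one with the standard library; the function names are illustrative, the t-values are read from a t-table (not computed), and rounding the degrees of freedom down to 22 in Example 3 is an assumption about convention.

```python
from math import sqrt
from statistics import NormalDist


def z_diff_interval(x1, s1, n1, x2, s2, n2, conf):
    """Large samples: (x1 - x2) +/- z * sqrt(s1^2/n1 + s2^2/n2)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) - margin, (x1 - x2) + margin


def pooled_t_interval(x1, s1, n1, x2, s2, n2, t_value):
    """Equal unknown variances: uses the pooled variance, df = n1 + n2 - 2."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    margin = t_value * sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    return (x1 - x2) - margin, (x1 - x2) + margin


def approx_df(s1, n1, s2, n2):
    """Approximate degrees of freedom v for unequal variances."""
    a, b = s1**2 / n1, s2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


def unequal_var_t_interval(x1, s1, n1, x2, s2, n2, t_value):
    """Unequal unknown variances: (x1 - x2) +/- t_{alpha/2, v} * sqrt(...)."""
    margin = t_value * sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) - margin, (x1 - x2) + margin


# Example 1: boys minus girls, both samples larger than 30
ex1 = z_diff_interval(86, 6, 75, 80, 4, 50, conf=0.95)

# Example 2: classroom minus programmed; table value t_{0.05, 20} = 1.725
ex2 = pooled_t_interval(85, 4, 12, 81, 5, 10, t_value=1.725)

# Example 3: rainfall; v is about 22.7, taken here as 22, t_{0.025, 22} = 2.074
v = approx_df(1.14, 15, 0.66, 10)
ex3 = unequal_var_t_interval(4.93, 1.14, 15, 2.64, 0.66, 10, t_value=2.074)

for name, ci in (("Example 1", ex1), ("Example 2", ex2), ("Example 3", ex3)):
    print(name, [round(b, 2) for b in ci])
print("Example 3 approximate df:", round(v, 2))
```

Passing the t-value in as an argument mirrors the table lookup used in these notes and keeps the sketch free of third-party libraries.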
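When the population of differences D is normal or does not depart too markedly from
normality, a confidence interval for µD = µx - µy is:
$$\bar{d} \pm t_{\alpha/2,\,n-1}\,\frac{S_d}{\sqrt{n}}$$
where $d_i = x_i - y_i$,
$$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}, \qquad S_d^2 = \frac{\sum_{i=1}^{n}(d_i-\bar{d})^2}{n-1} = \frac{n\sum_{i=1}^{n} d_i^2 - \left(\sum_{i=1}^{n} d_i\right)^2}{n(n-1)},$$
and n is the number of pairs.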
Example: Twenty college freshmen were divided into 10 pairs, each member of the pair
having approximately the same IQ. One of each pair was selected at random and
assigned to a mathematics section using programmed materials only. The other
member of each pair was assigned to a section in which the professor lectured. At
the end of the semester each group was given the same examination and the
following results were recorded.
Pair                    1   2   3   4   5   6   7   8   9  10
Programmed Materials   76  60  85  58  91  75  82  64  79  88
Lectures               81  52  87  70  86  77  90  63  85  83
Find a 98% confidence interval for the true difference in the two learning
procedures. Assume normality.
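A minimal sketch of the paired-difference interval for this example, taking d = programmed materials minus lectures (the direction of subtraction is a choice made here) and reading t = 2.821 for α/2 = 0.01 and 9 degrees of freedom from a t-table.

```python
from math import sqrt
from statistics import mean, stdev

programmed = [76, 60, 85, 58, 91, 75, 82, 64, 79, 88]
lectures = [81, 52, 87, 70, 86, 77, 90, 63, 85, 83]

# Differences within each matched pair
d = [x - y for x, y in zip(programmed, lectures)]
n = len(d)  # number of pairs

d_bar = mean(d)
s_d = stdev(d)  # definitional form, divisor n - 1

# Computational (shortcut) form of the same standard deviation
s_d_shortcut = sqrt((n * sum(di**2 for di in d) - sum(d) ** 2) / (n * (n - 1)))

t_value = 2.821  # table value t_{0.01, 9} for a 98% interval
margin = t_value * s_d / sqrt(n)
lower, upper = d_bar - margin, d_bar + margin

print("mean difference:", d_bar)
print("S_d:", round(s_d, 3), "shortcut:", round(s_d_shortcut, 3))
print("98% CI:", round(lower, 2), "to", round(upper, 2))
```

With these inputs the interval is roughly -7.29 to 4.09; since it contains 0, the data do not clearly favor either procedure.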
Exercises: pp. 264-266 of Walpole nos. 14-23
14. A random sample of size n1 = 25 taken from a normal population with standard
deviation σ1 = 5 has a mean x̄1 = 80. A second random sample of size n2 = 36
taken from a different normal population with a standard deviation of σ2 = 3, has
a mean x̄2 = 75. Find a 94% confidence interval for µ1-µ2.
15. Two kinds of thread are being compared for strength. Fifty pieces of each type of
thread are tested under similar conditions. Brand A had an average tensile
strength of 78.3 kg. with a standard deviation of 5.6 kg., while Brand B had an
average tensile strength of 87.2 kg. with a standard deviation of 6.3 kg. Construct
a 95% confidence interval for the difference of the population means.
16. A study was made to estimate the difference in salaries of college professors in
the private and state colleges of Virginia. A random sample of 100 professors in
private colleges showed an average 9-month salary of $25,000 with a standard
deviation of $1200. A random sample of 200 professors in state colleges showed
an average salary of $26,000 with a standard deviation of $1400. Find a 98%
confidence interval for the difference between the average salaries of professors
teaching in state and private colleges of Virginia.
17. Given two random samples of size n1 = 9 and n2 = 16, from two independent
normal populations, with x̄1 = 64, x̄2 = 59, s1 = 6, and s2 = 5, find a 95%
confidence interval for µ1-µ2, assuming that σ1 = σ2.
18. Students may choose between a 3-unit course in Physics without lab and a 4-unit
course with lab. The final written examination is the same for each section. The
mean score of a random sample of 12 students in the section with lab is 84 with a
standard deviation of 4, and the mean score of another random sample of 18
students in the section without lab is 77 with a standard deviation of 6. Find a
99% confidence interval for the difference between the mean grades for the two
courses. Assume the populations to be approximately normally distributed with
equal variances.
19. A taxi company is trying to decide whether to purchase brand A or brand B tires
for its fleet of taxis. To estimate the difference in the two brands, an experiment
is conducted using 12 of each brand. The tires are run until they wear out. The
results are x̄A = 36,300 km., sA = 5000 km., x̄B = 38,100 km., and sB = 6100 km.
Construct a 95% confidence interval for µA-µB, assuming the populations to be
approximately normally distributed.
20. The following data represent the running time of a random sample of films
produced by two motion picture companies:
Time (minutes)
Company 1 103 94 110 87 98
Company 2 97 82 123 92 175 88 118
Compute a 90% confidence interval for the difference between the mean running
times of films produced by the two companies. Assume that the running times for
each of the companies are approximately normally distributed with unequal
variances.
21. The government awarded grants to the agricultural departments of nine
universities to test the yield capabilities of two new varieties of wheat. Each
variety was planted on plots of equal area at each university and the yields, in kg.
per plot, recorded as follows:
University
1 2 3 4 5 6 7 8 9
Variety 1 38 23 35 41 44 29 37 31 38
Variety 2 45 25 31 38 50 33 36 40 43
Find a 95% confidence interval for the mean difference between the yields of the
two varieties assuming the distributions of yields to be approximately normal.
22. Referring to Exercise 19, find a 99% confidence interval for µA-µB if a tire from
each company is assigned at random to the rear wheels of 8 taxis and the
following distances in km., recorded:
23. It is claimed that a new diet will reduce a person’s weight by 4.5 kilograms on the
average in a period of 2 weeks. The weights of a random sample of 7 women who
followed this diet were recorded before and after a 2-week period:
Woman
1 2 3 4 5 6 7
Weight Before 58.5 60.3 61.7 69.0 64.0 62.6 56.7
Weight After 60.0 54.9 58.1 62.1 58.5 59.9 54.4
Compute a 95% confidence interval for the mean difference in the weight.
Assume the distribution of weights to be approximately normal.
8.4. ESTIMATING PROPORTIONS
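In a binomial experiment a point estimator of the proportion p is $\hat{p} = X/n$, where X
represents the number of successes in n trials, with standard error of $\sqrt{\hat{p}\hat{q}/n}$ and margin of
error of $z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$, where $\hat{q} = 1 - \hat{p}$. A large-sample (1-α)100% confidence interval for p is
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$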
Example In a random sample of 200 students who enrolled in Math 17, 138 passed
on their first take. Construct a 95% confidence interval for the population
proportion of students who passed Math 17 on their first take.
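The Math 17 example can be worked as below, a minimal sketch of the large-sample interval p̂ ± zα/2 √(p̂q̂/n). The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, conf):
    """Large-sample interval for a binomial proportion."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sqrt(p_hat * q_hat / n)  # margin of error
    return p_hat - margin, p_hat + margin


# 138 of 200 sampled students passed on their first take
lower, upper = proportion_interval(138, 200, conf=0.95)
print(round(lower, 3), round(upper, 3))
```

Here p̂ = 0.69 and the margin of error is about 0.064, giving an interval of roughly 0.626 to 0.754.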
Sample Size for Estimating p
If $\hat{p}$ will be used to estimate p, then we can be (1-α)100% confident that the error will
not exceed a specified amount, e, when the sample size is
$$n = \frac{z_{\alpha/2}^2\, p(1-p)}{e^2}$$
When the value of p is unknown or cannot be approximated, then using p = 0.5 produces
the maximum value of p(1-p)=0.25. Hence a conservative formula for the sample size is
$$n = \frac{z_{\alpha/2}^2}{4e^2}$$
Example Use the conservative formula to determine the sample size needed if we
want to be 95% confident that our estimate of p is within 0.05 of the true
value.
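As a sketch, both sample-size formulas for a proportion can be coded directly; the result is rounded up because n must be a whole number. The function names are illustrative, and the planning value p = 0.69 in the second call is borrowed from the Math 17 example purely for comparison.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p, e, conf):
    """n = z^2 * p * (1 - p) / e^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z**2 * p * (1 - p) / e**2)


def conservative_sample_size(e, conf):
    """n = z^2 / (4 e^2), the worst case p = 0.5, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z**2 / (4 * e**2))


n_conservative = conservative_sample_size(e=0.05, conf=0.95)
n_with_guess = sample_size_for_proportion(p=0.69, e=0.05, conf=0.95)
print(n_conservative, n_with_guess)
```

The conservative formula gives 385 here; a planning value of p away from 0.5 always yields a smaller requirement.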
The SWS national survey for the fourth quarter of 2013, done on Dec. 11-16, expanded
its Visayas sample to 650 households, from the usual 300, thus reducing the Visayas error
margin to 4 percentage points, from the usual 6 points. This raised the national sample
size to 1,550 households, from the usual 1,200, enhancing the quality of Yolanda-related
items in particular, since the Visayas was the area that suffered the most.
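At the 95% confidence level, taking $\hat{p} = \hat{q} = 0.5$, a conservative estimate for the margin of error is $z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n} \approx 1/\sqrt{n}$.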
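The survey figures quoted above can be checked against the 1/√n rule of thumb. The sketch below assumes simple random sampling and a 95% confidence level, which is a simplification of how an actual household survey is designed.

```python
from math import sqrt
from statistics import NormalDist


def conservative_moe(n, conf=0.95):
    """z_{alpha/2} * sqrt(0.25 / n): the margin of error at p = 0.5."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return z * sqrt(0.25 / n)


def rule_of_thumb_moe(n):
    """The 1 / sqrt(n) approximation."""
    return 1 / sqrt(n)


# Sample sizes mentioned in the survey description
for n in (300, 650, 1200, 1550):
    exact = 100 * conservative_moe(n)
    approx = 100 * rule_of_thumb_moe(n)
    print(f"n = {n:5d}: {exact:.1f} vs {approx:.1f} percentage points")
```

For n = 300 and n = 650 the rule of thumb gives about 5.8 and 3.9 percentage points, in line with the reported 6 and 4 points.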
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
Example In a random sample of 200 students, 78 of the 120 females and 60 of the
80 males passed Math 17 on their first take. Construct a 95% confidence
interval for p1- p2, where p1 and p2 are the true proportions of females and
males, respectively, who passed Math 17 on their first take.
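A short sketch of the two-proportion interval for this example, with p1 for females and p2 for males as stated. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_interval(x1, n1, x2, n2, conf):
    """Large-sample interval for p1 - p2 using unpooled standard errors."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - margin, (p1 - p2) + margin


# Females: 78 of 120 passed; males: 60 of 80 passed
lower, upper = two_proportion_interval(78, 120, 60, 80, conf=0.95)
print(round(lower, 3), round(upper, 3))
```

The interval is roughly -0.228 to 0.028. Because it contains 0, these data alone do not establish a difference between the two passing rates at the 95% level.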
Exercises: pp. 273-274 of Walpole nos. 1-13
1. A random sample of 200 voters is selected and 120 are found to support an
annexation suit. Find the 96% confidence interval for the fraction of the voting
population favoring the suit.
2. A random sample of 400 cigarette smokers is selected and 86 are found to have a
preference for brand X. Find the 90% confidence interval for the fraction of the
population of cigarette smokers who prefer brand X.
3. In a random sample of 1000 homes in a certain city, it is found that 628 are heated
by natural gas. Find the 98% confidence interval for the fraction of homes in this
city that are heated by natural gas.
4. A random sample of 75 college students is selected and 16 are found to have cars
on campus. Use a 95% confidence interval to estimate the fraction of students
who have cars on campus.
5. A new rocket-launching system is being considered for deployment of small
short-range launches. The existing system has p = .8 as the probability of a
successful launch. A sample of 40 experimental launches is made with the new
system and 34 are successful. Construct a 95% confidence interval for p.
6. How large a sample is needed in Exercise 1 if we wish to be 96% confident that
our sample proportion will be within .02 of the true fraction of the voting
population?
7. How large a sample is needed in Exercise 3 if we wish to be 98% confident that
our sample proportion will be within .05 of the true proportion of homes in this
city that are heated by natural gas?
8. A study is to be made to estimate the percentage of citizens in a town who favor
having their water fluoridated. How large a sample is needed if one wishes to be
at least 95% confident that our estimate is within 1% of the true percentage?
9. According to Dr. Memory Elvin-Lewis, head of the microbiology department at
Washington University School of Dental Medicine in St. Louis, a couple of cups
of either green or oolong tea each day will provide sufficient fluoride to protect
your teeth from decay. People who do not like tea and who live in unfluoridated
areas should ask their local governments to consider having their water
fluoridated. How large a sample is needed to estimate the percentage of citizens
in a certain town who favor having their water fluoridated if one wishes to be at
least 99% confident that the estimate is within 1% of the true percentage?
10. In a study to estimate the proportion of residents in a certain city and its suburbs
who favor the construction of a nuclear power plant, it is found that 52 of 100
urban residents favor the construction while only 34 of 125 suburban residents are
in favor. Find a 96% confidence interval for the difference between the
proportion of urban and suburban residents who favor construction of the nuclear
plant.
11. A cigarette-manufacturing firm claims that its brand A line of cigarettes outsells
its brand B line by 8%. If it is found that 42 of 200 smokers prefer brand A and
18 of 150 smokers prefer brand B, compute a 94% confidence interval for the
difference between the proportions of sales of the two brands.
12. A geneticist is interested in the proportion of males and females in the population
that have a certain minor blood disorder. In a random sample of 100 males, 24
are found to be afflicted, whereas 13 of the 100 females tested appear to have the
disorder. Compute a 99% confidence interval for the difference between the
proportion of males and females that have this blood disorder.
13. A study is made to determine if a cold climate results in more students being
absent from school during a semester than for a warmer climate. Two groups of
students are selected at random, one group from Vermont and the other group
from Georgia. Of the 300 students from Vermont, 64 were absent at least 1 day
during the semester, and of the 400 students from Georgia, 51 were absent 1 or
more days. Find a 95% confidence interval for the difference between the
fractions of the students who are absent in the two states.
During World War II, Allied military planners needed estimates of the number of tanks
Germany was manufacturing. The information provided by traditional spying methods
was not reliable, but statistical sampling methods proved to be valuable. For example,
espionage and reconnaissance led analysts to estimate 1550 tanks were produced during
June 1941. However, using the serial numbers of captured tanks and statistical analysis,
military planners estimated the number of tanks to be 244. This estimate turned out to be
27 less than the actual number manufactured by the Germans in June 1941. A similar
type of analysis was used to estimate the number of Iraqi tanks destroyed during Desert
Storm.
CHAPTER 9
Tests of Hypothesis
A two-tailed test of hypothesis is a test where the alternative hypothesis does not
specify a directional difference for the parameter of interest.
Examples:
a. Is there a general preference for Coke or Pepsi?
Ho: p = .5 vs. Ha: p ≠ .5
b. Is the proportion favoring the death penalty the same for teenagers as it is for
adults? Ho: pT - pA = 0 vs. Ha: pT - pA ≠ 0
                      Null Hypothesis
Decision       True                False
Reject Ho      Type I error        Correct decision
Accept Ho      Correct decision    Type II error
Possible Samples      z-test           Decision
of Size n=2           Ho: µ=1.5        α=.10
0 0                   -1.897           Reject Ho
0 1                   -1.265           Accept Ho
0 2                   -0.632           Accept Ho
0 3                    0.000           Accept Ho
1 0                   -1.265           Accept Ho
1 1                   -0.632           Accept Ho
1 2                    0.000           Accept Ho
1 3                    0.632           Accept Ho
2 0                   -0.632           Accept Ho
2 1                    0.000           Accept Ho
2 2                    0.632           Accept Ho
2 3                    1.265           Accept Ho
3 0                    0.000           Accept Ho
3 1                    0.632           Accept Ho
3 2                    1.265           Accept Ho
3 3                    1.897           Reject Ho
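The table can be regenerated with a short script. This is a sketch under an assumption not stated in the table itself: the population consists of the four equally likely values 0, 1, 2, 3, sampled with replacement, so that µ = 1.5 and σ² = 1.25.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist, mean, pstdev

population = [0, 1, 2, 3]    # assumed population (equally likely values)
mu0 = mean(population)       # hypothesized mean, 1.5
sigma = pstdev(population)   # population standard deviation
n = 2
alpha = 0.10
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value

results = []
for sample in product(population, repeat=n):
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    decision = "Reject Ho" if abs(z) > z_crit else "Accept Ho"
    results.append((sample, round(z, 3), decision))
    print(*sample, f"{z:7.3f}", decision)

rejections = sum(1 for _, _, decision in results if decision == "Reject Ho")
type_one_rate = rejections / len(results)
print("Proportion of samples rejecting a true Ho:", type_one_rate)
```

Under these assumptions 2 of the 16 equally likely samples fall in the critical region, so the realized Type I error probability is 2/16 = 0.125, slightly above the nominal 0.10 because the sampling distribution here is discrete.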
The Type I error and Type II error are related. For a fixed sample size n, a
decrease in the probability of one will result in an increase in the probability of
the other. However, increasing the sample size will result in the reduction of both
probabilities.
Example If you are on a jury in the American judicial system, you must presume
that the defendant is innocent unless there is enough evidence to conclude that he
or she is guilty. Therefore the two hypotheses are
Ho: The defendant is innocent
Ha: The defendant is guilty
The prosecution collects evidence in the hope that the jurors will be convinced
that such evidence would be extremely unlikely if the assumption of innocence
were true. Consistent with our thinking in hypothesis testing, in many cases we
would not accept the hypothesis that the defendant is innocent. We would simply
conclude that the evidence was not strong enough to rule out the possibility of
innocence. In fact, in the United States the two conclusions juries are instructed
to choose from are “guilty” and “not guilty.” A jury would never conclude, “the
defendant is innocent.”
For trials in general, here are the possible errors and the consequences that
accompany those errors:
Type I error: A “guilty” verdict for a person who is really innocent.
Consequence: An innocent person is falsely convicted. The guilty party remains
free.
Type II error: A “not guilty” verdict for a person who committed a crime.
Consequence: A criminal is not punished.
In the American court system, a false conviction is generally viewed as the more
serious error. Not only is an innocent person punished but also a guilty one
remains free. Courtroom rules and rules affecting pretrial investigations tend to
reflect society’s concern about incorrectly punishing an innocent person.
Example Imagine that you are tested to determine if you have a disease. The lab
technician or physician who evaluates your results must make a choice between
two hypotheses:
Ho: You do not have the disease.
Ha: You have the disease.
Unfortunately, many laboratory tests for diseases are not 100% accurate. There is
a chance the result is wrong. Consider the two possible errors and their
consequences:
Type I error: You are told you have the disease, but you actually don’t. The test
result was a false positive. Consequence: You will be unnecessarily concerned
about your health and you may receive unnecessary treatment.
Type II error: You are told you do not have the disease, but you actually do. The
test result was a false negative. Consequence: You do not receive treatment for a
disease that you have. If this is contagious, you may infect others.
Which error is more serious? In most medical situations, the second possible
error is more serious but this could depend on the disease and the follow-up
actions that are taken. For instance, in a screening test for cancer, a false negative
could lead to a fatal delay in treatment. Initial test results that are “positive” for
cancer are usually followed up with a retest so a false positive may be discovered
quickly.
The above tests are exact α-level tests for samples from a normal distribution.
However, the first test provides a good approximate α-level test when the distribution
is not normal provided that the sample size is large enough, that is, n>30. See
Theorems 4 and 5 of Chapter 7.
If σ² is unknown and n>30, use the z-test but replace σ by s, that is,
$$Z = \frac{\bar{X} - \mu_o}{s/\sqrt{n}}$$
The procedures are the same for testing the following:
Ho: µ ≥ µo vs. Ha: µ < µo as Ho: µ = µo vs. Ha: µ < µo
Ho: µ ≤ µo vs. Ha: µ > µo as Ho: µ = µo vs. Ha: µ > µo
Examples:
1. A manufacturer of sports equipment has developed a new synthetic fishing line
that he claims has a mean breaking strength of 8 kilograms with a standard
deviation of .5 kilogram. Test the hypothesis that µ = 8 kilograms if a random
sample of 50 lines is tested and found to have a mean breaking strength of 7.8
kilograms. Use a .01 level of significance.
2. A random sample of 100 recorded deaths during the past year showed an average
life span of 71.8 years, with a standard deviation of 8.9 years. Does this seem to
indicate that the average life span today is greater than 70 years? Use a .05 level
of significance.
3. The average length of time for students to register at a certain college has been 50
minutes with a standard deviation of 10 minutes. A new registration procedure
using modern computing machines is being tried. If a random sample of 12
students had an average registration time of 42 minutes with a standard deviation
of 11.9 minutes under the new system, test the hypothesis that the population
mean is now less than 50 minutes, using a level of significance of .10, .05 and .01.
Assume the population of times to be normal.
If the p-value ≤ α, reject Ho; otherwise, do not reject Ho.
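The three examples above can be sketched as follows. Examples 1 and 2 use the z-test and report a p-value from the standard normal distribution; Example 3 has n = 12, so it uses the t statistic and compares it with table values for 11 degrees of freedom (1.363, 1.796, and 2.718 for α = .10, .05, and .01). The function name and the "alternative" labels are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test(xbar, mu0, sd, n, alternative):
    """Return (z, p-value); alternative is 'less', 'greater' or 'two-sided'."""
    z = (xbar - mu0) / (sd / sqrt(n))
    if alternative == "less":
        p = STD_NORMAL.cdf(z)
    elif alternative == "greater":
        p = 1 - STD_NORMAL.cdf(z)
    else:
        p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return z, p


# Example 1: Ho: mu = 8 vs. Ha: mu != 8, sigma = 0.5, n = 50, alpha = .01
z1, p1 = z_test(7.8, 8, 0.5, 50, "two-sided")

# Example 2: Ho: mu = 70 vs. Ha: mu > 70, s = 8.9, n = 100, alpha = .05
z2, p2 = z_test(71.8, 70, 8.9, 100, "greater")

# Example 3: Ho: mu = 50 vs. Ha: mu < 50, s = 11.9, n = 12 (t-test, v = 11)
t3 = (42 - 50) / (11.9 / sqrt(12))
t_table = {0.10: 1.363, 0.05: 1.796, 0.01: 2.718}  # t_{alpha, 11} from a t-table

print(f"Example 1: z = {z1:.3f}, p-value = {p1:.4f}, reject at .01: {p1 <= 0.01}")
print(f"Example 2: z = {z2:.3f}, p-value = {p2:.4f}, reject at .05: {p2 <= 0.05}")
for alpha, t_crit in t_table.items():
    print(f"Example 3: t = {t3:.3f}, alpha = {alpha}: reject = {t3 < -t_crit}")
```

Under these assumptions Ho is rejected in Examples 1 and 2, while in Example 3 it is rejected at the .10 and .05 levels but not at the .01 level.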
Exercises: pp. 315-316 of Walpole nos.1-8
1. An electrical firm manufactures light bulbs that have a length of life that is
approximately normally distributed with a mean of 800 hours and a standard
deviation of 40 hours. Test the hypothesis that µ = 800 hours against the
alternative µ ≠ 800 hours if a random sample of 30 bulbs has an average life of
788 hours. Use a .04 level of significance.
2. In a research report by Richard H. Weindruch of the UCLA Medical School, it is
claimed that mice with an average lifespan of 32 months will live to be about 40
months old when 40% of the calories in their food are replaced by vitamins and
minerals. Is there any reason to believe that µ < 40 if 64 mice that are placed on
this diet have an average life of 38 months with a standard deviation of 5.8
months? Use a .025 level of significance.
3. The average height of females in the freshman class of a certain college has been
162.5 cm. with a standard deviation of 6.9 cm. Is there reason to believe that
there has been a change in the average height if a random sample of 50 females in
the present freshman class has an average height of 165.2 cm.? Use a .02 level of
significance.
4. It is claimed that an automobile is driven on the average less than 25,000
kilometers per year. To test this claim, a random sample of 100 automobile
owners is asked to keep a record of the kilometers they travel. Would you agree
with this claim if the random sample showed an average of 23,500 kilometers and
a standard deviation of 3,900 kilometers? Use a 0.01 level of significance.
5. Test the hypothesis that the average content of containers of a particular lubricant
is 10 liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1,
10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Use a .01 level of significance and
assume that the distribution of contents is normal.
6. According to Dietary Goals for the United States (1977), high sodium intake may
be related to ulcers, stomach cancer, and migraine headaches. The human
requirement for salt is only 230 milligrams per day, which is surpassed in most
single servings of ready-to-eat cereals. A random sample of 20 similar servings of
Special K had mean sodium content of 244 milligrams of sodium and a standard
deviation of 24.5 milligrams. Is there sufficient evidence to believe that the
average sodium content for single servings of Special K exceeds the human
requirement for salt at α = .05? Assume normality.
7. A random sample of 8 cigarettes of a certain brand has an average nicotine
content of 4.2 mg. and a standard deviation of 1.4 mg. Is this in line with the
manufacturer’s claim that the average nicotine content does not exceed 3.5 mg.?
Use a .01 level of significance and assume the distribution of nicotine contents to
be normal.
8. Last year the employees of a city sanitation department donated an average of
$8.00 to the volunteer rescue squad. Test the hypothesis at the .01 level of
significance that the average contribution this year is still $8.00 if a random
sample of 12 employees showed an average donation of $8.90 with a standard
deviation of $1.75. Assume the donations are approximately normally
distributed.
9.3. TESTING THE DIFFERENCE BETWEEN TWO POP’N MEANS
• Based on 2 independent samples
a. σ1² and σ2² known
Test statistic:
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - d_o}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}}$$
Ho                 Ha                 Critical region
µ1 - µ2 ≥ do       µ1 - µ2 < do       z < -zα
µ1 - µ2 ≤ do       µ1 - µ2 > do       z > zα
µ1 - µ2 = do       µ1 - µ2 ≠ do       |z| > zα/2
b. σ1² = σ2² but unknown, n1, n2 ≤ 30
Test statistic:
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - d_o}{S_p\sqrt{(1/n_1) + (1/n_2)}}, \qquad S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$$
Ho                 Ha                 Critical region
µ1 - µ2 ≥ do       µ1 - µ2 < do       t < -tα,n1+n2-2
µ1 - µ2 ≤ do       µ1 - µ2 > do       t > tα,n1+n2-2
µ1 - µ2 = do       µ1 - µ2 ≠ do       |t| > tα/2,n1+n2-2
Remark The remarks made on confidence interval estimation for the difference
between means relative to the use of a given statistic apply to the tests
described here. See Theorem 5 of Chapter 7.
If σ1² and σ2² are unknown but n1, n2 > 30, use the z-test described in (a)
but replace the population standard deviations by the sample standard
deviations, that is,
$$z = \frac{(\bar{X}_1 - \bar{X}_2) - d_o}{\sqrt{(S_1^2/n_1) + (S_2^2/n_2)}}$$
Example: A taxi company is trying to decide whether the use of radial tires instead of
regular belted tires improves fuel economy. Twelve cars were equipped with
radial tires and driven over a prescribed test course. Without changing drivers,
the same cars were then equipped with regular belted tires and driven once again
over the test course. The gasoline consumption, in kilometers per liter, was
recorded as follows:
                                Company 2       Company 1
Mean                            110.7142857     98.4
Variance                        1035.904762     76.3
Observations                    7               5
Hypothesized Mean Difference    10
df                              7
t Stat                          0.181131997
P(T<=t) one-tail                0.430698476
t Critical one-tail             1.414923928
P(T<=t) two-tail                0.861396953
t Critical two-tail             1.894578604
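The output above appears to come from a spreadsheet two-sample t-test with unequal variances: its means, variances, and sample sizes match the film running times given for Company 1 and Company 2 in Exercise 20 of the estimation exercises. The sketch below recomputes the mean, variance, t Stat, and df rows from those data, assuming the hypothesized difference of 10 applies to Company 2 minus Company 1.

```python
from math import sqrt
from statistics import mean, variance

company1 = [103, 94, 110, 87, 98]
company2 = [97, 82, 123, 92, 175, 88, 118]
d0 = 10  # hypothesized mean difference (Company 2 minus Company 1)

m1, m2 = mean(company1), mean(company2)
v1, v2 = variance(company1), variance(company2)  # sample variances
n1, n2 = len(company1), len(company2)

# Squared standard errors of the two sample means
a, b = v1 / n1, v2 / n2

t_stat = ((m2 - m1) - d0) / sqrt(a + b)

# Approximate degrees of freedom for unequal variances
df = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

print("means:", m2, m1)
print("variances:", round(v2, 6), round(v1, 6))
print("t Stat:", round(t_stat, 9))
print("df (unrounded):", round(df, 3))
```

The unrounded degrees of freedom are about 7.19, consistent with the df of 7 in the output. Since the t Stat of about 0.181 is far below either critical value, Ho would not be rejected.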
18. In Exercise 21 on p. 265, test the hypothesis, at the .05 level of significance, that
the average yields of the two varieties of wheat are equal.
University
1 2 3 4 5 6 7 8 9
Variety 1 38 23 35 41 44 29 37 31 38
Variety 2 45 25 31 38 50 33 36 40 43
19. In Exercise 22 on p. 265, test the hypothesis, at the .01 level of significance, that
µ1 ≥ µ2.
Taxi Brand A Brand B
1 34,400 36,700
2 45,500 46,800
3 36,700 37,700
4 32,000 31,100
5 48,400 47,800
6 32,800 36,400
7 38,100 38,900
8 30,100 31,500
9.4 TESTING A HYPOTHESIS ON PROPORTIONS
Consider the problem of testing the hypothesis that the proportion of successes in a
binomial experiment equals some specified value.
If the unknown proportion is not expected to be too close to 0 or 1 and n is large, a large
sample approximation is given by:
Example A commonly prescribed drug on the market for relieving nervous tension
is believed to be only 60% effective. Experimental results with a new drug
administered to a random sample of 100 adults who were suffering from
nervous tension showed that 70 received relief. Is this sufficient evidence
to conclude that the new drug is superior to the one commonly prescribed?
Use a 0.05 level of significance.
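A minimal sketch of this test, assuming the standard large-sample statistic z = (p̂ - po)/√(po qo/n), which is not written out in these notes, with Ho: p = 0.6 against Ha: p > 0.6. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(successes, n, p0):
    """Large-sample z statistic for Ho: p = p0."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


alpha = 0.05
z = one_proportion_z(70, 100, p0=0.6)
z_crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
p_value = 1 - NormalDist().cdf(z)         # right-tail area

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject Ho" if z > z_crit else "Do not reject Ho")
```

Here z is about 2.04, which exceeds 1.645, so under these assumptions there is sufficient evidence at the 0.05 level that the new drug is superior.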
Example In a survey of 200 students, 78 of the 120 females in the sample passed
Math 17 on their first take while this figure is 60 among the 80 males. Will
you agree that the proportion of males who passed Math 17 on their first
take is higher than the proportion of females who passed the same course
on their first take? Test at α = 0.05.
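This example can be sketched with the usual pooled two-proportion z statistic, an assumption here since the notes do not show the formula. The hypotheses are taken as Ho: pM = pF against Ha: pM > pF, where M and F denote males and females.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for Ho: p1 = p2 (sample 1 minus sample 2)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled estimate under Ho
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


alpha = 0.05
# Males: 60 of 80 passed; females: 78 of 120 passed
z = two_proportion_z(60, 80, 78, 120)
z_crit = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject Ho" if z > z_crit else "Do not reject Ho")
```

With z of about 1.50 against a critical value of 1.645, Ho is not rejected at α = 0.05 under these assumptions, so the data do not support the claim that males pass at a higher rate.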
Exercises: p. 331 of Walpole nos. 1-12
1. A manufacturer of cigarettes claims that 20% of the cigarette smokers prefer
brand X. To test this claim a random sample of 20 cigarette smokers is selected
and asked what brand they prefer. If 6 of the 20 smokers prefer brand X, what
conclusion do we draw? Use a .05 level of significance.
2. Suppose that in the past 40% of all adults favored capital punishment. Do we
have reason to believe that the proportion of adults favoring capital punishment
today has increased if, in a random sample of 15 adults, 8 favor capital
punishment? Use a .05 level of significance.
3. A coin is tossed 20 times resulting in 5 heads. Is this sufficient evidence to reject
the hypothesis at the .03 level of significance that the coin is balanced in favor of
the alternative that heads occur less than 50% of the time?
4. It is believed that at least 60% of the residents in a certain area favor an
annexation suit by a neighboring city. What conclusion would you draw if only
110 in a sample of 200 voters favor the suit? Use a .04 level of significance.
5. The gas company claims that two-thirds of the houses in a certain city are heated
by natural gas. Do we have reason to doubt this claim if, in a random sample of
1000 houses in this city, it is found that 618 are heated by natural gas? Use a .02
level of significance.
6. At a certain college it is estimated that fewer than 25% of the students have cars
on campus. Does this seem to be a valid estimate if, in a random sample of 90
college students, 28 are found to have cars? Use a .05 level of significance.
7. In a study to estimate the proportion of residents in a certain city and its suburbs
who favor the construction of a nuclear power plant, it is found that 63 of 100
urban residents favor the construction while only 59 of 125 suburban residents are
in favor. Is there a significant difference between the proportion of urban and
suburban residents who favor construction of the nuclear plant? Use a .04 level of
significance.
8. A cigarette-manufacturing firm distributes two brands of cigarettes. If it is found
that 56 of 200 smokers prefer brand A and 29 of 150 smokers prefer brand B, can
we conclude at the .06 level of significance that brand A outsells brand B?
9. A geneticist is interested in the proportion of males and females in the population
that have a certain minor blood disorder. In a random sample of 100 males, 31
are found to be afflicted, whereas 24 of the 100 females tested appear to have the
disorder. Can we conclude at the .01 level of significance that the proportion of
men in the population afflicted with this blood disorder is significantly higher
than the proportion of women afflicted?
10. A study is made to determine if a cold climate results in more absenteeism
from school during a semester than a warmer climate. Two groups of students are
selected at random, one group from Maine and the other from Alabama. Of the
300 students from Maine, 72 were absent at least 1 day during the semester, and
of the 400 students from Alabama, 70 were absent 1 or more days. Can we
conclude that a colder climate results in a greater number of students being absent
from school at least 1 day during the semester? Use a .05 level of significance.
11. A vote is to be taken among the residents of a town and the surrounding country
to determine whether a civic center will be constructed. The proposed
construction site is within the town limits and for this reason many voters in the
country feel that the proposal will pass because of the large proportion of town
voters who favor the construction. If 120 of 200 town voters favor the proposal
and 240 of 500 country residents favor it, test the hypothesis that the percentage
of town voters favoring the construction of a civic center will not exceed the
percentage of country voters by more than 3%. Use a .025 level of significance.
12. With reference to Exercise 8, test the hypothesis at the .06 level of significance
that brand A outsells brand B by at least 10%.
In 1788, James Madison, John Jay, and Alexander Hamilton anonymously published a
series of essays entitled The Federalist. These Federalist papers were an attempt to
convince the people of New York that they should ratify the Constitution. In the course
of history, the authorship of these papers became known, but 12 remained contested.
Through the use of statistical analysis, and particularly the frequency of use of
various words, we can now conclude that Madison is the likely author of the 12 papers.
In fact, the statistical evidence that he is the author is overwhelming.
Before the election of 1936, a contest between Democratic incumbent Franklin Roosevelt
and Republican Alf Landon, the magazine Literary Digest had been extremely successful
in predicting the results in the US presidential elections. But 1936 turned out to be its
downfall, when it predicted a victory for Landon. To add insult to injury, young pollster
George Gallup, who had just founded the American Institute of Public Opinion in 1935,
correctly predicted Roosevelt as the winner of the election. He did this before they even
conducted their poll! And Gallup surveyed only 50,000 people, while the Literary Digest
sent questionnaires to 10 million people.
The Literary Digest made two classic mistakes. First, the lists of people to whom they
mailed the 10 million questionnaires were taken from magazine subscribers, car owners,
telephone directories, and lists of registered voters. In 1936, those who owned telephones
or cars, or subscribed to magazines, were more likely to be wealthy individuals who were
not happy with the Democratic incumbent.
Despite what accounts of this famous story conclude, the bias produced by the more
affluent list was not likely to have been as severe as the second problem. The main
problem was volunteer response. They received 2.3 million responses, a response rate of
only 23%. Those who felt strongly about the outcome of the election were more likely to
respond and that included a majority of those who wanted a change, the Landon
supporters. Those who were happy with the incumbent were less likely to bother to
respond.
Gallup, on the other hand, knew the value of random sampling. He was not only able to
predict the election but he also predicted what the results of the Literary Digest poll
would be to within 1%. How did he do this? He just chose 3,000 people at random from
the same lists the Digest was going to use, and mailed them all a postcard asking them
how they planned to vote.
9.6. TEST FOR INDEPENDENCE
The test for independence is used to determine whether two variables are related or not.
For example, we might test whether a person’s music preference is related to his or her
intelligence quotient. We then take a random sample and for each subject determine their
music preference and classify their IQs into different categories (high, medium, low).
The observed frequencies are presented in what is known as a contingency table shown
below:
Music IQ
Preference High Medium Low Total
Classical 40 26 17 83
Pop 47 59 25 131
Rock 83 104 79 266
Total 170 189 121 480
A contingency table containing r rows and c columns is referred to as an r x c table. The
row and column totals are called marginal frequencies. Note that in a test for
independence, these marginal frequencies are not fixed in advance but depend instead on
the way the sample distributed itself across the various cells in the table.
Procedure:
1. State the null and alternative hypotheses.
Ho: The two variables are independent
Ha: The two variables are not independent.
2. Choose the level of significance.
3. Compute the test statistic, given by
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
where $O_{ij}$ = observed number of cases in the ith row of the jth column
$E_{ij}$ = expected number of cases under Ho
\[ E_{ij} = \frac{(\text{ith row total})(\text{jth column total})}{\text{grand total}} \]
4. Decision Rule: Reject Ho if $\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$.
Remarks:
1. The test is valid if at least 80% of the cells have expected frequencies of at least 5
and no cell has an expected frequency ≤ 1.
2. If many expected frequencies are very small, researchers commonly combine
categories of variables to obtain a table having larger cell frequencies. Generally,
one should not pool categories unless there is a natural way to combine them.
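3. For a 2x2 contingency table, a correction called Yates’ correction for continuity is
applied:
\[ \chi^2(\text{corrected}) = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}} \]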
Music IQ
Preference High Medium Low Total
Classical 40 (29.4) 26 (32.7) 17 (20.9) 83
Pop 47 (46.4) 59 (51.6) 25 (33.0) 131
Rock 83 (94.2) 104 (104.7) 79 (67.1) 266
Total 170 189 121 480
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
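As a check on the music preference table, the following sketch computes the expected frequencies and the chi-square statistic directly from the observed counts. The function name is illustrative. Because it uses unrounded expected frequencies, its result can differ slightly from a hand computation based on the one-decimal values shown in the table.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # E_ij = (ith row total)(jth column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Rows: Classical, Pop, Rock; columns: High, Medium, Low IQ.
observed = [[40, 26, 17], [47, 59, 25], [83, 104, 79]]
chi2, df, expected = chi_square_independence(observed)

print(round(chi2, 2), df)
print([round(e, 1) for e in expected[0]])   # expected counts for Classical
```

The statistic is then compared with the tabled critical value for (r-1)(c-1) = 4 degrees of freedom at the chosen level of significance.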
Music IQ
Preference High Medium Low
Classical 40/83 = .48 26/83 = .31 17/83 = .20
Pop 47/131 = .36 59/131 = .45 25/131 = .19
Rock 83/266 = .31 104/266 = .39 79/266 = .30
Music IQ
Preference High Medium Low
Classical 40/170 = .24 26/189 = .14 17/121 = .14
Pop 47/170 = .28 59/189 = .31 25/121 = .21
Rock 83/170 = .49 104/189 = .55 79/121 = .65
p-value = $P(\chi^2_v > \chi^2)$
Student   X    Y
1         39   65
2         43   78
3         21   52
4         64   82
5         57   92
6         47   89
7         28   73
8         75   98
9         34   56
10        52   75
[Figure: scatter plot of Y against X for the 10 students]
Assumptions for the Probabilistic Model: For any given value of X, Y possesses a
normal distribution with a mean value $E(Y \mid X) = \beta_0 + \beta_1 X$ and with a variance of $\sigma^2$.
Furthermore, any one value of Y is independent of every other value.
Least squares criterion: Choose as the “best-fitting” line the line that minimizes the
sum of squares for error
\[ SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]
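The method for finding the numerical values of $\hat\beta_0$ and $\hat\beta_1$ that minimize SSE uses
differential calculus and is beyond the scope of this course.
\[ \hat\beta_1 = \frac{S_{xy}}{S_{xx}} \quad \text{where } S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
\[ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \]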
For now, let us use the following EXCEL output to find the least squares prediction line
for the calculus grade-achievement test score data and predict a student’s calculus grade
if the student scored X = 50 on the achievement test.
               Coefficients   Standard Error   t Stat        P-value
Intercept      40.78415521    8.506861379      4.794265875   0.00136551
X Variable 1   0.765561843    0.174984967      4.375014926   0.002364532
The best-fitting straight line relating the calculus grade to the achievement test
score is $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ or $\hat{Y} = 40.78415521 + .765561843X$.
.765561843 is the estimated change in Y for a 1-unit change in X.
The Y intercept will not be interpreted since X = 0 is not part of the range of X.
If a student scores X = 50 on the achievement test, his or her predicted calculus
grade would be $\hat{Y} = \hat\beta_0 + \hat\beta_1 X = 40.78415521 + .765561843(50) = 79.06225$.
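The fitted line can also be reproduced from the raw data without EXCEL. The sketch below applies the least squares formulas for the slope and intercept to the ten (X, Y) pairs; variable names are illustrative.

```python
# Achievement test scores (x) and calculus grades (y) for the 10 students.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross products about the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx              # estimated slope
b0 = y_bar - b1 * x_bar       # estimated intercept

def predict(x_new):
    """Predicted calculus grade for an achievement test score x_new."""
    return b0 + b1 * x_new

print(round(b0, 5), round(b1, 5), round(predict(50), 5))
```

The prediction is only meaningful for X within the observed range of scores (21 to 75).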
10.3 Inferences
The third parameter in our linear probabilistic model is $\sigma^2$ and its estimator is
\[ \hat\sigma^2 = MSE = \frac{SSE}{n-2} \]
where MSE stands for mean squared error.
In the following EXCEL output, MSE = 75.75323363 can be found in the second row,
fourth column while SSE = 606.025869 can be found in the same row, third column.
ANOVA
             df   SS            MS            F            Significance F
Regression   1    1449.974131   1449.974131   19.1407556   0.002364532
Residual     8    606.025869    75.75323363
Total        9    2056
Does X contribute information for the prediction of Y; i.e., do the data provide sufficient
evidence to indicate that Y increases (or decreases) linearly as X increases over the region
of observation? We would wish to test Ho: $\beta_1 = 0$ vs. Ha: $\beta_1 \neq 0$. The test statistic is
\[ t = \frac{\hat\beta_1}{\sqrt{MSE / S_{xx}}} \quad \text{where } S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
               Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Intercept      40.78415521    8.506861379      4.794265875   0.00136551    21.16729771   60.40101272
X Variable 1   0.765561843    0.174984967      4.375014926   0.002364532   0.362045786   1.169077901
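The standard error, t statistic, and 95% limits for the slope in the EXCEL output can be reproduced from the summary quantities $S_{xx}$ = 2474, $S_{yy}$ = 2056, and $S_{xy}$ = 1894 for these data. This is a sketch only: the critical value $t_{.025,8}$ = 2.306 is assumed from a t table rather than computed, so the limits agree with EXCEL to about five decimal places.

```python
import math

# Summary quantities for the calculus grade / achievement test data (n = 10).
n = 10
s_xx, s_yy, s_xy = 2474.0, 2056.0, 1894.0

b1 = s_xy / s_xx                    # estimated slope
sse = s_yy - s_xy ** 2 / s_xx       # sum of squares for error
mse = sse / (n - 2)                 # estimate of sigma^2
se_b1 = math.sqrt(mse / s_xx)       # standard error of the slope
t_stat = b1 / se_b1                 # test statistic for Ho: beta_1 = 0

# Assumption: t_{.025, 8} = 2.306 is taken from a t table.
t_crit = 2.306
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(round(mse, 5), round(se_b1, 6), round(t_stat, 4))
print(tuple(round(limit, 4) for limit in ci))
```

Because |t| is well above 2.306 and the interval excludes 0, Ho: $\beta_1 = 0$ is rejected at the .05 level, which matches the small P-value in the output.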
10.4 Multiple Regression Models
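\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \quad \text{where } S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
In the calculus grade-achievement test score data,
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1894}{\sqrt{(2474)(2056)}} = .839786 \]
The EXCEL output follows.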
Column 1 Column 2
Column 1 1
Column 2 0.839786 1
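The correlation coefficient reported by EXCEL can be verified from the raw data with a few lines. The sketch below computes $S_{xy}$, $S_{xx}$, and $S_{yy}$ as sums about the means; the function name is illustrative.

```python
import math

def correlation(x, y):
    """Pearson correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Achievement test scores (x) and calculus grades (y).
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

r = correlation(x, y)
print(round(r, 6))
```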
The denominators used in calculating $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$ and $\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$ will always be
positive. Since the numerators are identical, r and $\hat\beta_1$ will assume the same sign.
[Figure: scatter plots illustrating r = 1 and -1 < r < 0]
        $x$   $y$   $x-\bar{x}$   $y-\bar{y}$   $(x-\bar{x})(y-\bar{y})$   $xy$    $(x-\bar{x})^2$   $(y-\bar{y})^2$   $x^2$   $y^2$
        39    65    -7            -11           77                         2535    49                121               1521    4225
        43    78    -3            2             -6                         3354    9                 4                 1849    6084
        21    52    -25           -24           600                        1092    625               576               441     2704
        64    82    18            6             108                        5248    324               36                4096    6724
        57    92    11            16            176                        5244    121               256               3249    8464
        47    89    1             13            13                         4183    1                 169               2209    7921
        28    73    -18           -3            54                         2044    324               9                 784     5329
        75    98    29            22            638                        7350    841               484               5625    9604
        34    56    -12           -20           240                        1904    144               400               1156    3136
        52    75    6             -1            -6                         3900    36                1                 2704    5625
Total   460   760   0             0             1894                       36854   2474              2056              23634   59816
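\[ S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} = 36854 - \frac{(460)(760)}{10} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 1894 \]
\[ S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 23634 - \frac{(460)^2}{10} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 2474 \]
\[ S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = 59816 - \frac{(760)^2}{10} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 2056 \]
\[ \hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{1894}{2474} = 0.765561843 \]
\[ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \frac{760}{10} - .765561843 \left(\frac{460}{10}\right) = 40.78415521 \]
\[ SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 2056 - \frac{(1894)^2}{2474} = 606.025869 \]
\[ \hat\sigma^2 = s^2 = MSE = \frac{SSE}{n-2} = \frac{606.025869}{10-2} = 75.75323363 \]
\[ \text{Standard error of } \hat\beta_1 = \sqrt{\frac{s^2}{S_{xx}}} = \sqrt{\frac{75.75323363}{2474}} = 0.174984967 \]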
XI. The Analysis of Variance
11.1 Introduction
Suppose that you want to compare the mean size of health insurance claims submitted by
five groups of policy holders. Ten claims were randomly selected from among the claims
for each group. Do the data contained in the five samples provide sufficient evidence to
indicate a difference in the mean levels of claims among the five health groups? We look
for a single test of Ho: $\mu_1 = \mu_2 = \cdots = \mu_5$ vs. Ha: at least one pair of means differs. We
assume that the observations within each sample population are normally distributed with
a common variance $\sigma^2$.
The analysis of experimental data depends on the design of the experiment, which refers
to the way the data were collected. A very useful and relatively simple design called the
completely randomized design is one in which random samples are independently
selected from each of k populations. This design results in observations that are
classified only according to the population from which they came. For example, in
assessing voter preference concerning the next city/municipality election, we may wish to
select random samples of registered voters in each of k barangays within the
city/municipality.
\[ \text{Total SS} = S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 \]
can be partitioned into sum of squares for treatments/groups (SST, a measure of variation
among sample means) and sum of squares for error (SSE, a measure of variation within
samples). Thus, Total SS = SST + SSE.
For now we will be guided by EXCEL outputs. Total SS = 84153818.88 can be found in
the second column, last row, SSE = 77411264.4 in the same column, second row, SST =
6742554.48 in the same column, first row.
ANOVA
Source of Variation   SS            df   MS           F             P-value       F crit
Between Groups        6742554.48    4    1685638.62   0.979879847   0.428070522   2.578739184
Within Groups         77411264.4    45   1720250.32
Total                 84153818.88   49
The third column refers to the degrees of freedom of each sum of squares. With n = 50
observations and k = 5 groups: for Total SS, 49 = n – 1, for SSE, 45 = n – k, and for SST, 4 = k – 1.
The fourth column is the mean squares column calculated by dividing the sum of squares
by its degrees of freedom. So mean square for treatments
$MST = \frac{SST}{k-1} = \frac{6742554.48}{5-1} = 1685638.62$ and mean square error
$MSE = \frac{SSE}{n-k} = \frac{77411264.4}{50-5} = 1720250.32$.
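A (1-α)100% CI for a single treatment mean µi is of the form
\[ \bar{x}_i \pm t_{\alpha/2,\,n-k} \sqrt{\frac{MSE}{n_i}} \]
For example, a 95% CI for µ4 (using 1.96 as a large-sample approximation to $t_{.025,45}$) is
\[ 1674.7 \pm 1.96 \sqrt{\frac{1720250.32}{10}} = 1674.7 \pm 812.9276493 = (861.7723507,\ 2487.627649) \]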
SUMMARY
Groups Count Sum Average Variance
Column 1 10 19598 1959.8 2751539.289
Column 2 10 19961 1996.1 1915389.878
Column 3 10 11939 1193.9 703792.7667
Column 4 10 16747 1674.7 1137055.789
Column 5 10 22789 2278.9 2093473.878
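\[ CM = \frac{T^2}{n} \quad \text{where } T = \text{total of all observations} = \sum_i \sum_j x_{ij} = \sum_i T_i \]
\[ \text{Total SS} = S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \sum_i \sum_j x_{ij}^2 - CM \]
\[ SST = \sum_i \frac{T_i^2}{n_i} - CM \qquad SSE = \text{Total SS} - SST \]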
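The ANOVA quantities for the health insurance claims can be recovered from the SUMMARY table alone (counts, averages, and variances), without the raw observations. The sketch below uses an equivalent route to the computing formulas: SST as the weighted sum of squared deviations of group means from the grand mean, and SSE as the sum of $(n_i - 1)s_i^2$. Variable names are illustrative.

```python
# Group summaries from the EXCEL SUMMARY table (five groups of 10 claims).
counts = [10, 10, 10, 10, 10]
means = [1959.8, 1996.1, 1193.9, 1674.7, 2278.9]
variances = [2751539.289, 1915389.878, 703792.7667, 1137055.789, 2093473.878]

n = sum(counts)
k = len(counts)
grand_mean = sum(c * m for c, m in zip(counts, means)) / n

# Variation among group means (treatments) and within groups (error).
sst = sum(c * (m - grand_mean) ** 2 for c, m in zip(counts, means))
sse = sum((c - 1) * v for c, v in zip(counts, variances))

mst = sst / (k - 1)
mse = sse / (n - k)
f_stat = mst / mse

print(round(sst, 2), round(sse, 1), round(f_stat, 6))
```

The resulting F is below 1 and far smaller than the tabled F crit, so there is no evidence of a difference among the five group means.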
Consider the problem of assessing the effects of three different package designs on the
number or amount of sales. We might decide to use a completely randomized design and
select 12 supermarkets and display each of the designs in four different markets. Unless
the markets all had similar characteristics, differences in sales for the three package
designs might also reflect differences in the characteristics of the stores. One way to
avoid this problem is to use, say, four stores and display each of the three designs in all
four stores. This way store-to-store variability has been eliminated.
As another example, suppose the CEO of a large construction company employs three
experienced construction engineers to perform the time-consuming cost analyses,
estimates, and bids for the work on large construction projects. It is important to know
whether these three estimators tend to produce estimates at the same mean level or
whether one or another tends to always submit a high (or low) bid on projects. Each of
the three estimators would be required to produce an analysis, an estimate, and a bid
price for the same set of projects. In this way, differences in bids for the same projects
can be compared, thereby eliminating project-to-project variability.
Project
Estimator 1 2 3 4 5
1 3.52 4.71 3.89 5.21 4.14
2 3.39 4.79 3.82 4.93 3.96
3 3.64 4.92 4.19 5.10 4.20
An analysis of variance for a randomized block design partitions the total sum of squares
into three parts: SST (measures the variation among treatment means), SSB (measures
the variation among block means), and SSE (measures the variation of the differences
among the treatment observations within blocks). That is, Total SS = SST + SSB + SSE.
Using the following EXCEL output (see second column), Total SS = 5.09096, SST =
0.13456, SSB = 4.88896, and SSE = 0.06744.
Source of Variation   SS        df   MS        F             P-value       F crit
Rows                  0.13456   2    0.06728   7.981020166   0.012424095   4.458970108
Columns               4.88896   4    1.22224   144.9869514   1.6952E-07    3.837853355
Error                 0.06744   8    0.00843
Total                 5.09096   14
For the degrees of freedom (third column), SST has 2 = k-1 = 3-1 where k is the number
of treatments (estimators), SSB has 4 = b-1 = 5-1 where b is the number of blocks
(projects), SSE has 8 = n-b-k+1 = 15-5-3+1, and Total SS has 14 = n-1 = 15-1. The
fourth column is the mean square column. As in CRD, MS is SS/df so that
\[ MST = \frac{SST}{k-1}, \qquad MSB = \frac{SSB}{b-1}, \qquad MSE = \frac{SSE}{n-b-k+1} \]
All three MS are independent estimates of $\sigma^2$.
To test Ho: No differences among the k treatment means, we use $F = \frac{MST}{MSE} = 7.981020166$
(see first row) with p-value of 0.012424095. Ho is rejected if
$F > F_{k-1,\,n-b-k+1} = 4.458970108$ at α = .05.
To test Ho: No differences among the b block means, we use $F = \frac{MSB}{MSE} = 144.9869514$
(second row) with p-value of 1.6952E-07. Ho is rejected if
$F > F_{b-1,\,n-b-k+1} = 3.837853355$ at α = .05.
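A (1-α)100% CI for the difference between two treatment means µi - µj is of the form
\[ (\bar{x}_i - \bar{x}_j) \pm t_{\alpha/2,\,n-b-k+1} \sqrt{MSE \left(\frac{2}{b}\right)} \]
For example, a 95% CI for µ1 - µ3 is
\[ (4.294 - 4.41) \pm 2.306 \sqrt{.00843 \left(\frac{2}{5}\right)} = -.116 \pm 0.13390694 = (-0.24990694,\ 0.01790694) \]
It can be verified that the CI for µ1 - µ2 is (-.01791, .249907) and the CI for µ2 - µ3 is
(-.36591, -.09809).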
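The randomized block analysis for the three estimators can be reproduced from the bid table. The sketch below uses the correction-for-the-mean approach, with treatments as rows (estimators) and blocks as columns (projects); the critical value 2.306 for 8 degrees of freedom is assumed from a t table, and variable names are illustrative.

```python
import math

# Rows: estimators 1-3 (treatments); columns: projects 1-5 (blocks).
bids = [
    [3.52, 4.71, 3.89, 5.21, 4.14],
    [3.39, 4.79, 3.82, 4.93, 3.96],
    [3.64, 4.92, 4.19, 5.10, 4.20],
]
k = len(bids)         # number of treatments
b = len(bids[0])      # number of blocks
n = k * b

total = sum(sum(row) for row in bids)
cm = total ** 2 / n                                   # correction for the mean

total_ss = sum(v * v for row in bids for v in row) - cm
sst = sum(sum(row) ** 2 for row in bids) / b - cm     # among treatments
ssb = sum(sum(col) ** 2 for col in zip(*bids)) / k - cm   # among blocks
sse = total_ss - sst - ssb

mse = sse / (n - b - k + 1)
f_treatments = (sst / (k - 1)) / mse
f_blocks = (ssb / (b - 1)) / mse

# 95% CI for mu_1 - mu_3; t_{.025, 8} = 2.306 is a table value (assumed).
means = [sum(row) / b for row in bids]
half_width = 2.306 * math.sqrt(mse * 2 / b)
ci_1_3 = (means[0] - means[2] - half_width, means[0] - means[2] + half_width)

print(round(sst, 5), round(ssb, 5), round(sse, 5))
print(round(f_treatments, 3), round(f_blocks, 3), ci_1_3)
```

Blocking on projects removes the large project-to-project variation (SSB) from the error term, which is why the relatively small differences among estimators can still be detected.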
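Computations for the health insurance claims data:
\[ T = \sum_i T_i = 91034 \]
\[ CM = \frac{T^2}{n} = 165743783 \]
\[ \sum_i \sum_j x_{ij}^2 = 249897602 \]
\[ \text{Total SS} = \sum_i \sum_j x_{ij}^2 - CM = 84153819 \]
Group          1           2           3           4           5
$T_i^2$        384081604   398441521   142539721   280462009   519338521
$T_i^2/n_i$    38408160    39844152    14253972    28046201    51933852
\[ \sum_i \frac{T_i^2}{n_i} = 172486337.6 \]
\[ SST = \sum_i \frac{T_i^2}{n_i} - CM = 6742554 \]
\[ SSE = \text{Total SS} - SST = 77411264 \]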
ANOVA
Source of Variation SS
Between Groups 6742554.48
Within Groups 77411264.4
Total 84153818.88
The chi-square statistic for testing independence is also applicable when testing Ho: p1 =
p2 = …= pk.
Example: In a shop study, a set of data was collected to determine whether or not the
proportion of defectives produced by workers was the same for the day,
evening, or night shift work. The following data were collected:
Shift
Day Evening Night
Defectives 45 55 70
Nondefectives 905 890 870
Use a .025 level of significance to determine if the proportion of defectives is the same
for all three shifts.
Shift
Day Evening Night Total
Defectives 45 (57.0) 55 (56.7) 70 (56.3) 170
Nondefectives 905 (893.0) 890 (888.3) 870 (883.7) 2665
Total 950 945 940 2835
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 6.288 \]
v = (r-1)(c-1) = (2-1)(3-1) = 2
$\chi^2_{.025,\,2} = 7.378$
Decision: Since 6.288 < 7.378, do not reject Ho. There is insufficient evidence that the
proportion of defectives differs among the three shifts.
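The shift example can be checked with the same chi-square computation applied to a 2 x 3 table. In this sketch the expected frequencies are not rounded, so the statistic comes out near 6.23 rather than the 6.288 obtained by hand from one-decimal expected values; the decision is unchanged. The critical value 7.378 is the one quoted above.

```python
# Rows: defectives, nondefectives; columns: day, evening, night shift.
observed = [[45, 55, 70], [905, 890, 870]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total   # expected count under Ho
        chi2 += (o - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)
reject = chi2 > 7.378          # chi-square critical value for alpha = .025, v = 2

print(round(chi2, 3), df, reject)
```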
To illustrate, consider the following frequency distribution table constructed from the
lives of 40 similar car batteries. The batteries are guaranteed to last 3 years. Let us test
the hypothesis that the frequency distribution may be approximated by a normal
distribution with mean $\bar{x}$ = 3.41 and standard deviation s = .703.
Class Boundaries Oi
1.45-1.95 2
1.95-2.45 1
2.45-2.95 4
2.95-3.45 15
3.45-3.95 10
3.95-4.45 5
4.45-4.95 3
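If the observed frequencies are close to the corresponding expected frequencies, the
\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]
value will be small, indicating a good fit. If the observed
frequencies differ considerably from the expected frequencies, the $\chi^2$ value will be large and
the fit is poor.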
We reject Ho: good fit if $\chi^2 > \chi^2_{\alpha,v}$. The expected frequencies should be at least 5. This
restriction may require the combining of adjacent cells resulting in a reduction of the
number of degrees of freedom.
Going back to the example, the expected frequencies for each class/cell are obtained from
the normal curve having the same mean and standard deviation as our sample. These
values will be used for µ and σ in computing z values corresponding to the class
boundaries. For the first interval, we solve P(X < 1.95). For the last interval, we solve
P(X > 4.45). For the 4th interval, we solve P(2.95 < X < 3.45) = P(-.65 < Z < .06) = .2661
so that E4 = .2661(40) = 10.6.
Class Boundaries Oi Ei
1.45-1.95 2 0.8
1.95-2.45 1 2.7
2.45-2.95 4 6.9
2.95-3.45 15 10.6
3.45-3.95 10 10.2
3.95-4.45 5 6.0
4.45-4.95 3 2.8
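Combining adjacent classes so that every expected frequency is at least 5 gives:
Class Boundaries   Oi   Ei
1.45-2.95          7    10.4
2.95-3.45          15   10.6
3.45-3.95          10   10.2
3.95-4.95          8    8.8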
Thus,
\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 3.015 \]
The number of degrees of freedom for this test is 4-3 = 1, since three quantities – the total
frequency, mean and standard deviation – of the observed data were required to find the
expected frequencies. Since $\chi^2 = 3.015 < \chi^2_{.05,1} = 3.841$, we have no reason to reject Ho and conclude
that the normal distribution provides a good fit for the distribution of battery lives.
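The expected frequencies for the battery example can be approximated in code instead of a normal table. The sketch below evaluates the normal CDF with the error function, so the values differ slightly from the table-based ones, which use z rounded to two decimals. It then computes the chi-square statistic from the pooled observed and expected frequencies listed above. Function and variable names are illustrative.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X < x) for a normal distribution with the given mean and sd."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

n, mean, sd = 40, 3.41, 0.703
interior_boundaries = [1.95, 2.45, 2.95, 3.45, 3.95, 4.45]

# The first class takes everything below 1.95; the last takes everything above 4.45.
cumulative = [0.0] + [normal_cdf(b, mean, sd) for b in interior_boundaries] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]

# Chi-square statistic from the pooled classes tabulated in the notes.
observed_pooled = [7, 15, 10, 8]
expected_pooled = [10.4, 10.6, 10.2, 8.8]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed_pooled, expected_pooled))

print([round(e, 1) for e in expected])
print(round(chi2, 3))
```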
Time Series Analysis
Some Applications:
To forecast future values of Y. It is assumed that some of the patterns observed in
the past will continue into the future. Thus, if quantifiable information about the
past can be measured then this can be used to forecast what will happen in the
future. Forecasting is an important aid in effective and efficient planning.
To facilitate comparisons with data for past years. For example, time series data
can be used to answer the question whether or not the recent increase in
unemployment is normal for this time of the year.
To identify indicators that coincide with or precede a change in direction of a time
series (called a cyclical turning point) and help in anticipating such changes.
TREND describes the long-term sweep of the series and usually modeled by a
smooth curve. There are many types of trends such as linear (a constant
amount of increase/decrease in the trend value from one period to the
next) and exponential (the trend value changes at a constant rate from one
period to the next).
SEASONAL describes the short-term recurring pattern of change in the series and
consists of relatively repetitious cycles of fixed amplitude and duration.
CYCLICAL describes movements in a time series that, like seasonal variations, are recurrent but
that, unlike seasonal variations, occur in cycles longer than one year. This
pattern exists when the series is influenced by longer-time economic
fluctuations.
IRREGULAR describes the miscellaneous, erratic movements in the series and tends to
have an irregular, saw-toothed pattern
General Methods:
Averaging methods wherein past observations are given equal weights in
evaluating the forecast
Exponential smoothing methods, wherein past observations are given unequal
weights that decay exponentially
Single Moving Average
Step 1 Choose the number of periods T to be used in the computation of the forecast.
The larger the value of T, the greater the smoothing effect. The smaller the value
of T, the more the moving averages follow the pattern of the data.
\[ F_{T+1} = \frac{\sum_{t=1}^{T} Y_t}{T} \]
\[ F_{T+2} = \frac{\sum_{t=2}^{T+1} Y_t}{T} \]
…
\[ F_{T+k} = \frac{\sum_{t=k}^{T+k-1} Y_t}{T} \]
Note that the oldest observation is dropped as each new observation becomes
available.
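A single moving average is simple to compute directly. The sketch below uses the monthly values from the Yt column of the smoothing tables in these notes, assumed here to run from January through November, and produces 3-month and 5-month moving average forecasts. The function name is illustrative.

```python
def moving_average_forecasts(series, periods):
    """Return [F_{T+1}, F_{T+2}, ...] with T = periods.

    Each forecast is the mean of the `periods` observations just before it.
    """
    return [
        sum(series[start:start + periods]) / periods
        for start in range(len(series) - periods + 1)
    ]

# Monthly observations (assumed January through November).
sales = [200, 135, 195, 197.5, 310, 175, 155, 130, 220, 277, 235]

ma3 = moving_average_forecasts(sales, 3)   # forecasts for months 4 to 12
ma5 = moving_average_forecasts(sales, 5)   # forecasts for months 6 to 12

print(ma3[-1], ma5[-1])                    # December forecasts
```

The 5-month forecasts change more slowly than the 3-month ones, which illustrates the greater smoothing effect of a larger T.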
Step 1 Choose the weight α (between 0 and 1) that will give the smallest forecast error.
A large value of α gives very little smoothing in the forecast, whereas a small
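value of α gives considerable smoothing.
\[ MAE = \frac{\sum_{t=1}^{T} |e_t|}{T} \]
\[ MSE = \frac{\sum_{t=1}^{T} e_t^2}{T} \]
\[ MAPE = \frac{\sum_{t=1}^{T} \left| \frac{e_t}{Y_t} \right| 100\%}{T} \]
\[ MSPE = \frac{\sum_{t=1}^{T} \left( \frac{e_t}{Y_t} \right)^2 100\%}{T} \]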
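The four accuracy measures can be computed together, as sketched below. The sketch assumes the forecast error is $e_t = Y_t - F_t$, which the notes do not state explicitly, and it reads MSPE as the mean of the squared relative errors times 100%, which is one interpretation of the formula. The function name is illustrative.

```python
def forecast_accuracy(actual, forecast):
    """Return MAE, MSE, MAPE (%) and MSPE (%) for paired actuals and forecasts."""
    errors = [y - f for y, f in zip(actual, forecast)]   # assumed e_t = Y_t - F_t
    t = len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / t,
        "MSE": sum(e * e for e in errors) / t,
        "MAPE": 100.0 * sum(abs(e / y) for e, y in zip(errors, actual)) / t,
        "MSPE": 100.0 * sum((e / y) ** 2 for e, y in zip(errors, actual)) / t,
    }

# First four observations with their alpha = .1 smoothed forecasts (F1 = Y1).
actual = [200, 135, 195, 197.5]
forecast = [200, 200, 193.5, 193.65]

measures = forecast_accuracy(actual, forecast)
print({name: round(value, 4) for name, value in measures.items()})
```

When comparing values of α, the same measure would be computed for each candidate and the α with the smallest error chosen.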
SES with F1 = Y1
Yt Ft ,α = .1 Ft ,α = .5 Ft ,α = .9
200 200 200 200
135 200 200 200
195 193.5 167.5 141.5
197.5 193.65 181.25 189.65
310 194.035 189.375 196.715
175 205.6315 249.6875 298.6715
155 202.5684 212.3438 187.3672
130 197.8115 183.6719 158.2367
220 191.0304 156.8359 132.8237
277 193.9273 188.418 211.2824
235 202.2346 232.709 270.4282
DEC 205.5111 233.8545 238.5428
SES with F1 = $\bar{Y}$ (the mean of the observed series, 202.6818)
Yt Ft, α = .1 Ft, α = .5 Ft, α = .9
200 202.6818 202.6818 202.6818
135 202.4136 201.3409 200.2682
195 195.6723 168.1705 141.5268
197.5 195.605 181.5852 189.6527
310 195.7945 189.5426 196.7153
175 207.2151 249.7713 298.6715
155 203.9936 212.3857 187.3672
130 199.0942 183.6928 158.2367
220 192.1848 156.8464 132.8237
277 194.9663 188.4232 211.2824
235 203.1697 232.7116 270.4282
DEC 206.3527 233.8558 238.5428
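The two tables above are consistent with the single exponential smoothing recursion $F_{t+1} = \alpha Y_t + (1 - \alpha) F_t$; for example, with α = .1 and F1 = 200, F3 = .1(135) + .9(200) = 193.5. The sketch below assumes that recursion and reproduces the December forecasts for both choices of the starting value. The function name is illustrative.

```python
def ses_forecasts(series, alpha, initial):
    """Single exponential smoothing forecasts.

    Returns [F_1, F_2, ..., F_{n+1}] where F_1 = initial and
    F_{t+1} = alpha * Y_t + (1 - alpha) * F_t.
    """
    forecasts = [initial]
    for y in series:
        forecasts.append(alpha * y + (1.0 - alpha) * forecasts[-1])
    return forecasts

sales = [200, 135, 195, 197.5, 310, 175, 155, 130, 220, 277, 235]
mean_sales = sum(sales) / len(sales)

for alpha in (0.1, 0.5, 0.9):
    from_first = ses_forecasts(sales, alpha, initial=sales[0])
    from_mean = ses_forecasts(sales, alpha, initial=mean_sales)
    # Last element is the forecast for the period after the data (DEC).
    print(alpha, round(from_first[-1], 4), round(from_mean[-1], 4))
```

With α = .9 the starting value is forgotten almost immediately, so both tables give the same December forecast; with α = .1 its influence is still visible at the end of the series.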
3-month MA
5-month MA