

Descriptive Statistics

Describing data with tables and graphs
(quantitative or categorical variables)

Numerical descriptions of center,
variability, position (quantitative variables)
1. Tables and Graphs

Frequency distribution: Lists possible values
of variable and number of times each occurs

Example: Student survey (n = 60)


political ideology measured as ordinal
variable with 1 = very liberal, ..., 4 =
moderate, ..., 7 = very conservative
Histogram: Bar graph of frequencies
or percentages
Shapes of histograms
(for quantitative variables)

Bell-shaped (IQ, SAT, political ideology in all U.S.)
Skewed right (annual income, no. times
Skewed left (score on easy exam)
Bimodal (polarized opinions)

Ex. GSS data on sex before marriage in Exercise

3.73: always wrong, almost always wrong,
wrong only sometimes, not wrong at all
Stem-and-leaf plot (John Tukey, 1977)

Example: Exam scores (n = 40 students)

Stem Leaf
3 6
5 37
6 235899
7 011346778999
8 00111233568889
9 02238
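The stem-and-leaf construction above can be reproduced in a few lines of Python. This is an illustrative sketch, not part of the slides; the helper name stem_and_leaf is hypothetical, and the score list is reconstructed from the stems and leaves shown.

```python
# Sketch: build a stem-and-leaf display for two-digit scores.
# Stems are the tens digits, leaves the units digits.
from collections import defaultdict

def stem_and_leaf(scores):
    plot = defaultdict(list)
    for s in sorted(scores):           # sorting orders leaves within each stem
        plot[s // 10].append(s % 10)
    return {stem: "".join(str(d) for d in leaves)
            for stem, leaves in sorted(plot.items())}

# n = 40 exam scores reconstructed from the display above
scores = [36, 53, 57, 62, 63, 65, 68, 69, 69,
          70, 71, 71, 73, 74, 76, 77, 77, 78, 79, 79, 79,
          80, 80, 81, 81, 81, 82, 83, 83, 85, 86, 88, 88, 88, 89,
          90, 92, 92, 93, 98]
for stem, leaves in stem_and_leaf(scores).items():
    print(stem, leaves)
```

Each printed row matches a row of the display, e.g. stem 7 with leaves 011346778999.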
2. Numerical descriptions
Let y denote a quantitative variable, with
observations y1, y2, y3, ..., yn

a. Describing the center

Median: Middle measurement of ordered sample

Mean: ybar = (y1 + y2 + ... + yn)/n = (Σ yi)/n
Example: Annual per capita carbon dioxide
emissions (metric tons) for n = 8 largest nations
in population size

Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2,
Indonesia 1.4, Pakistan 0.7, Russia 9.9, U.S. 20.1

Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1

Median = (1.4 + 1.8)/2 = 1.6

Mean = (0.3 + 0.7 + 1.2 + ... + 20.1)/8 = 4.7
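These two calculations can be checked with Python's standard statistics module (a sketch for illustration; not the software the slides use):

```python
# Sketch: verify the center measures for the CO2 example.
import statistics

emissions = [0.3, 1.8, 2.3, 1.2, 1.4, 0.7, 9.9, 20.1]  # Bangladesh ... U.S.
median = statistics.median(emissions)  # average of middle pair: (1.4 + 1.8)/2 = 1.6
mean = statistics.mean(emissions)      # 37.7/8, pulled upward by the U.S. outlier
```

Note the mean (about 4.7) is far above the median (1.6), exactly the right-skew behavior the next slide describes.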
Properties of mean and median
For symmetric distributions, mean = median
For skewed distributions, mean is drawn in
direction of longer tail, relative to median
Mean valid for interval scales, median for
interval or ordinal scales
Mean sensitive to outliers (median often
preferred for highly skewed distributions)
When distribution symmetric or mildly
skewed or discrete with few values, mean
preferred because uses numerical values
of observations

New York Yankees baseball team, 2006

mean salary = $7.0 million
median salary = $2.9 million

How possible? Direction of skew?

Give an example for which you would expect

mean < median

b. Describing variability

Range: Difference between largest and smallest observations

(but highly sensitive to outliers, insensitive to
values between the extremes)

Standard deviation: A typical distance from the mean

The deviation of observation i from the mean is

yi − ybar

The variance of the n observations is

s² = Σ(yi − ybar)²/(n − 1) = [(y1 − ybar)² + ... + (yn − ybar)²]/(n − 1)

The standard deviation s is the square root of the variance:

s = √s²
Example: Political ideology
For those in the student sample who attend
religious services at least once a week (n = 9
of the 60),
y = 2, 3, 7, 5, 6, 7, 5, 6, 4
ybar = 5.0,

s² = [(2 − 5)² + (3 − 5)² + ... + (4 − 5)²]/(9 − 1) = 24/8 = 3.0

s = √3.0 ≈ 1.7

For entire sample (n = 60), mean = 3.0, standard

deviation = 1.6, tends to have similar variability but be
more liberal
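The n = 9 calculation above can be confirmed with the statistics module (an illustrative sketch, not from the slides):

```python
# Sketch: mean and standard deviation for the n = 9 subsample.
import statistics

y = [2, 3, 7, 5, 6, 7, 5, 6, 4]
ybar = statistics.mean(y)    # 45/9 = 5.0
s2 = statistics.variance(y)  # sum of squared deviations 24, divided by n - 1 = 8
s = statistics.stdev(y)      # sqrt(3.0), about 1.7
```

Note statistics.variance divides by n − 1, matching the slide's definition.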
Properties of the standard deviation:
s 0, and only equals 0 if all observations are equal
s increases with the amount of variation around the mean
Division by n - 1 (not n) is due to technical reasons (later)
s depends on the units of the data (e.g. measure euro vs $)
Like mean, affected by outliers

Empirical rule: If distribution is approx. bell-shaped,

about 68% of data within 1 standard dev. of mean
about 95% of data within 2 standard dev. of mean
all or nearly all data within 3 standard dev. of mean
Example: SAT with mean = 500, s = 100
(sketch picture summarizing data)
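The empirical rule can be checked by simulation: draw many values from a bell-shaped distribution with the SAT's mean 500 and s = 100 and count how many fall within 1 and 2 standard deviations. A sketch (seeded so it is reproducible; the simulation itself is not part of the slides):

```python
# Sketch: simulate bell-shaped SAT-like scores and check the empirical rule.
import random

random.seed(1)  # fixed seed for reproducibility
scores = [random.gauss(500, 100) for _ in range(100_000)]
within1 = sum(abs(x - 500) <= 100 for x in scores) / len(scores)  # about 0.68
within2 = sum(abs(x - 500) <= 200 for x in scores) / len(scores)  # about 0.95
print(within1, within2)
```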

Example: y = number of close friends you have

GSS: The variable frinum has mean 7.4, s =

Probably highly skewed: right or left?

Empirical rule fails; in fact, median = 5,


Example: y = selling price of home in Syracuse,

If mean = $130,000, which is realistic?
c. Measures of position
pth percentile: p percent of
observations below it, (100 - p)%
above it.

p = 50: median
p = 25: lower quartile (LQ)
p = 75: upper quartile (UQ)

Interquartile range IQR = UQ - LQ

Quartiles portrayed graphically by box
(John Tukey)
Example: weekly TV watching for n=60
from student survey data file, 3 outliers
Box plots have box from LQ to UQ, with
median marked. They portray a five-
number summary of the data:
Minimum, LQ, Median, UQ, Maximum,
except for outliers identified separately

Outlier = observation falling

below LQ 1.5(IQR)
or above UQ + 1.5(IQR)

Ex. If LQ = 2, UQ = 10, then IQR = 8, and
observations below 2 − 1.5(8) = −10 or above 10 + 1.5(8) = 22 are outliers
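The 1.5 × IQR rule is easy to code. A short Python sketch using the slide's LQ = 2, UQ = 10 example (the helper name outlier_fences and the demo data list are hypothetical):

```python
# Sketch: outlier fences from the 1.5*IQR rule.
def outlier_fences(lq, uq):
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

lower, upper = outlier_fences(2, 10)   # IQR = 8, fences at -10 and 22
# flag any observation outside the fences
flagged = [y for y in [1, 3, 5, 8, 9, 25] if y < lower or y > upper]
```

Here only 25 falls outside the fences, so only it would be plotted separately on the box plot.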

3. Bivariate description
Usually we want to study associations between
two or more variables (e.g., how does number
of close friends depend on gender, income,
education, age, working status, rural/urban, etc.?)
Response variable: the outcome variable
Explanatory variable(s): defines groups to compare

Ex.: number of close friends is a response
variable, while gender, income, etc. are
explanatory variables
Summarizing associations:
Categorical vars: show data using contingency tables
Quantitative vars: show data using scatterplots
Mixture of categorical var. and quantitative var.
(e.g., number of close friends and gender) can
give numerical summaries (mean, standard
deviation) or side-by-side box plots for the groups
Ex. General Social Survey (GSS) data

Men: mean = 7.0, s = 8.4
Example: Income by highest degree
Contingency Tables

Cross classifications of categorical variables

in which rows (typically) represent categories
of explanatory variable and columns
represent categories of response variable.

Counts in cells of the table give the

numbers of individuals at the corresponding
combination of levels of the two variables
Happiness and Family Income
(GSS 2008 data)
Income        Very   Pretty   Not too   Total
Above Aver.    164      233        26     423
Average        293      473       117     883
Below Aver.    132      383       172     687
Total          589     1089       315    1993
Can summarize by percentages on response
variable (happiness)

Example: Percentage very happy is

39% for above aver. income (164/423 = 0.39)
33% for average income (293/883 = 0.33)
19% for below average income (132/687 = 0.19)

Income    Very        Pretty      Not too
Above     164 (39%)   233 (55%)    26 (6%)
Average   293 (33%)   473 (54%)   117 (13%)
Below     132 (19%)   383 (56%)   172 (25%)
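The conditional (row) percentages above can be computed mechanically. A Python sketch (illustrative; table entered by hand from the slide):

```python
# Sketch: row percentages for the happiness-by-income contingency table.
table = {
    "Above Aver.": [164, 233, 26],    # Very, Pretty, Not too happy
    "Average":     [293, 473, 117],
    "Below Aver.": [132, 383, 172],
}
row_pcts = {
    income: [round(100 * count / sum(counts)) for count in counts]
    for income, counts in table.items()
}
# each row of percentages sums to roughly 100 (up to rounding)
```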

Inference questions for later chapters?

Scatterplots (for quantitative variables)
plot response variable on vertical axis,
explanatory variable on horizontal axis

Example: Table 9.13 (p. 294) shows UN data for

several nations on many variables, including
fertility (births per woman), contraceptive use,
literacy, female economic activity, per capita
gross domestic product (GDP), cell-phone use,
CO2 emissions

Data available at
Example: Survey in Alachua County,
Florida, on predictors of mental health
(data for n = 40 on p. 327 of text and at

y = measure of mental impairment (incorporates

various dimensions of psychiatric symptoms,
including aspects of depression and anxiety)
(min = 17, max = 41, mean = 27, s = 5)

x = life events score (events range from severe

personal disruptions such as death in family,
extramarital affair, to less severe events such as
new job, birth of child, moving)
(min = 3, max = 97, mean = 44, s = 23)
Bivariate data from 2000 Presidential election:
Butterfly ballot, Palm Beach County, FL (see text)
Example: The Massachusetts Lottery
(data for 37 communities)

[Scatterplot: % of income spent on lottery vs. per capita income]
Correlation describes strength of association

Falls between -1 and +1, with sign indicating
direction of association (formula later in
Chapter 9)

The larger the correlation in absolute value, the
stronger the association (in terms of a straight
line trend)

Examples: (positive or negative, how strong?)

Mental impairment and life events, correlation =
GDP and fertility, correlation = -0.56
GDP and percent using Internet, correlation =
Regression analysis gives line
predicting y using x
y = mental impairment, x = life events

Predicted y = 23.3 + 0.09x

e.g., at x = 0, predicted y = 23.3

at x = 100, predicted y = 23.3 + 0.09(100) = 32.3

Inference questions for later chapters?
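The prediction equation from the mental-impairment example can be evaluated directly (a sketch; the function name predict_impairment is hypothetical, the fitted line is from the slide):

```python
# Sketch: evaluate the fitted prediction equation y-hat = 23.3 + 0.09x.
def predict_impairment(x):
    return 23.3 + 0.09 * x

at_0 = predict_impairment(0)      # intercept: 23.3
at_100 = predict_impairment(100)  # 23.3 + 9.0 = 32.3
```

The slope 0.09 means each additional life-events point predicts a 0.09-point rise in impairment.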

Example: student survey
y = college GPA, x = high school GPA

(data at www.stat.ufl.edu/~aa/social/data.html)

What is the correlation?

What is the estimated regression equation?

We'll see later in course the formulas for
finding the correlation and the best-
fitting regression equation (with possibly
several explanatory variables), but for
now, try using software such as SPSS to
find the answers.
Sample statistics /
Population parameters

We distinguish between summaries of

samples (statistics) and summaries of
populations (parameters).

Common to denote statistics by Roman
letters, parameters by Greek letters:

Population mean = μ, standard deviation = σ, and
population proportion = π are parameters.

In practice, parameter values unknown; we
make inferences about their values using
sample statistics.
The sample mean ybar estimates
the population mean μ (quantitative variable)

The sample standard deviation s estimates
the population standard deviation σ
(quantitative variable)

A sample proportion p estimates
a population proportion π (categorical variable)
Chapter 1: Statistics
Chapter Goals
Create an initial image of the field of statistics.
Learn how to obtain sample data.
Example: A recent study examined the math and
verbal SAT scores of high school seniors across
the country. Which of the following statements
are descriptive in nature and which are inferential?
The mean math SAT score was 492.
The mean verbal SAT score was 475.
Students in the Northeast scored higher in
math but lower in verbal.
80% of all students taking the exam were
headed for college.
32% of the students scored above 610 on the
verbal SAT.
The math SAT scores are higher than they were
10 years ago.
1.2 Introduction to Basic Terms
Population: A collection, or set, of
individuals or objects or events whose
properties are to be analyzed.
Two kinds of populations: finite or infinite.

Sample: A subset of the population.

Variable: A characteristic about each individual
element of a population or sample.
Data (singular): The value of the variable
associated with one element of a population or
sample. This value may be a number, a word, or
a symbol.
Data (plural): The set of values collected for the
variable from each of the elements belonging to
the sample.
Experiment: A planned activity whose results
yield a set of data.
Parameter: A numerical value summarizing all
the data of an entire population.
Statistic: A numerical value summarizing the
sample data.
Example: A college dean is interested in learning about the
average age of faculty. Identify the basic terms in this
situation.

The population is the age of all faculty members at the
college.

A sample is any subset of that population. For example,
we might select 10 faculty members and determine their ages.
The variable is the age of each faculty member.
One data would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the
ages forming the sample and determining the actual age
of each faculty member in the sample.
The parameter of interest is the average age of all
faculty at the college.
The statistic is the average age for all faculty in the sample.
Two kinds of variables:
Qualitative, or Attribute, or
Categorical, Variable: A variable that
categorizes or describes an element of a population.
Note: Arithmetic operations, such as
addition and averaging, are not
meaningful for data resulting from a
qualitative variable.
Quantitative, or Numerical, Variable:
A variable that quantifies an element of a population.
Note: Arithmetic operations, such as
addition and averaging, are meaningful for
data resulting from a quantitative variable.
Example: Identify each of the following examples as
attribute (qualitative) or numerical (quantitative)

1. The residence hall for each student in a statistics

class. (Attribute)
2. The amount of gasoline pumped by the next 10
customers at the local Unimart. (Numerical)
3. The amount of radon in the basement of each of 25
homes in a new development. (Numerical)
4. The color of the baseball cap worn by each of 20
students. (Attribute)
5. The length of time to complete a mathematics
homework assignment. (Numerical)
6. The state in which each truck is registered when
stopped and inspected at a weigh station. (Attribute)
Qualitative and quantitative variables may be
further subdivided:

Nominal Variable: A qualitative variable that categorizes
(or describes, or names) an element of a population.

Ordinal Variable: A qualitative variable that incorporates

an ordered position, or ranking.

Discrete Variable: A quantitative variable that can

assume a countable number of values. Intuitively, a
discrete variable can assume values corresponding to
isolated points along a line interval. That is, there is a gap
between any two values.

Continuous Variable: A quantitative variable that can

assume an uncountable number of values. Intuitively, a
continuous variable can assume any value along a line
interval, including every possible value between any two values.
1. In many cases, a discrete and continuous
variable may be distinguished by
determining whether the variables are
related to a count or a measurement.
2. Discrete variables are usually associated
with counting. If the variable cannot be
further subdivided, it is a clue that you are
probably dealing with a discrete variable.
3. Continuous variables are usually associated
with measurements. The values of continuous
variables are only limited by your ability to
measure them.
Example: Identify each of the following as
examples of qualitative or numerical variables:
1. The temperature in Barrow, Alaska at 12:00
pm on any given day.
2. The make of automobile driven by each
faculty member.
3. Whether or not a 6 volt lantern battery is
4. The weight of a lead pencil.
5. The length of time billed for a long distance
telephone call.
6. The brand of cereal children eat for breakfast.
7. The type of book taken out of the library by
an adult.
Example: Identify each of the following as
examples of (1) nominal, (2) ordinal, (3) discrete,
or (4) continuous variables:
1. The length of time until a pain reliever begins
to work.
2. The number of chocolate chips in a cookie.
3. The number of colors used in a statistics textbook.
4. The brand of refrigerator in a home.
5. The overall satisfaction rating of a new car.
6. The number of files on a computer's hard drive.
7. The pH level of the water in a swimming pool.
8. The number of staples in a stapler.
1.3: Measurability and Variability
No matter what the response
variable: there will always be
variability in the data.
One of the primary objectives of
statistics: measuring and
characterizing variability.
Controlling (or reducing) variability in
a manufacturing process: statistical
process control.
Example: A supplier fills cans of soda marked 12
ounces. How much soda does each can really contain?

It is very unlikely any one can contains exactly

12 ounces of soda.
There is variability in any process.
Some cans contain a little more than 12
ounces, and some cans contain a little less.
On the average, there are 12 ounces in each can.
The supplier hopes there is little variability in
the process, that most cans contain close to 12
ounces of soda.
1.4: Data Collection
First problem a statistician faces:
how to obtain the data.
It is important to obtain good, or
representative, data.
Inferences are made based on
statistics obtained from the data.
Inferences can only be as good as
the data.
Biased Sampling Method: A sampling method
that produces data which systematically differs
from the sampled population. An unbiased
sampling method is one that is not biased.

Sampling methods that often result in biased samples:

1. Convenience sample: sample selected from elements of a
population that are easily accessible.
2. Volunteer sample: sample collected from those elements
of the population which chose to contribute the needed
information on their own initiative.
Process of data collection:

1. Define the objectives of the survey or experiment.
Example: Estimate the average life of an
electronic component.
2. Define the variable and population of interest.
Example: Length of time for anesthesia to
wear off after surgery.
3. Define the data-collection and data-
measuring schemes. This includes sampling
procedures, sample size, and the data-
measuring device (questionnaire, scale, ruler, etc.).
4. Determine the appropriate descriptive or
inferential data-analysis techniques.
Methods used to collect data:

Experiment: The investigator controls or

modifies the environment and observes the
effect on the variable under study.

Survey: Data are obtained by sampling some of

the population of interest. The investigator does
not modify the environment.

Census: A 100% survey. Every element of the

population is listed. Seldom used: difficult and
time-consuming to compile, and expensive.
Sampling Frame: A list of the elements
belonging to the population from which the
sample will be drawn.

Note: It is important that the sampling frame be

representative of the population.

Sample Design: The process of selecting

sample elements from the sampling frame.

Note: There are many different types of sample

designs. Usually they all fit into two categories:
judgment samples and probability samples.
Judgment Samples: Samples that are selected
on the basis of being typical.

Items are selected that are representative of the

population. The validity of the results from a
judgment sample reflects the soundness of the
collector's judgment.

Probability Samples: Samples in which the

elements to be selected are drawn on the basis
of probability. Each element in a population has
a certain probability of being selected as part of
the sample.
Random Samples: A sample selected in such a
way that every element in the population has an
equal probability of being chosen. Equivalently,
all samples of size n have an equal chance of
being selected. Random samples are obtained
either by sampling with replacement from a finite
population or by sampling without replacement
from an infinite population.
1. Inherent in the concept of randomness: the next result (or
occurrence) is not predictable.
2. Proper procedure for selecting a random sample: use a random
number generator or a table of random numbers.
Example: An employer is interested in the time it
takes each employee to commute to work each
morning. A random sample of 35 employees will
be selected and their commuting time will be recorded.
There are 2712 employees.

Each employee is numbered: 0001, 0002, 0003,
etc. up to 2712.
Using four-digit random numbers, a sample is
identified: 1315, 0987, 1125, etc.
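The same selection can be done with a random number generator instead of a table. A Python sketch of the employee example (seeded for reproducibility; the specific ids drawn will differ from the slide's 1315, 0987, 1125):

```python
# Sketch: simple random sample of 35 employees out of 2712.
import random

random.seed(0)                            # fixed seed so the run is reproducible
employee_ids = range(1, 2713)             # employees numbered 0001 through 2712
sample = random.sample(employee_ids, 35)  # 35 distinct ids, no repeats
labels = sorted(f"{i:04d}" for i in sample)  # four-digit labels as in the slide
```

random.sample draws without replacement, so every employee appears at most once, and every subset of size 35 is equally likely.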
Systematic Sample: A sample in which every
kth item of the sampling frame is selected,
starting from the first element which is randomly
selected from the first k elements.

Note: The systematic technique is easy to

execute. However, it has some inherent dangers
when the sampling frame is repetitive or cyclical
in nature. In these situations the results may not
approximate a simple random sample.

Stratified Random Sample: A sample obtained

by stratifying the sampling frame and then
selecting a fixed number of items from each of
the strata by means of a simple random
sampling technique.
Proportional Sample (or Quota Sample): A
sample obtained by stratifying the sampling
frame and then selecting a number of items in
proportion to the size of the strata (or by quota)
from each strata by means of a simple random
sampling technique.

Cluster Sample: A sample obtained by

stratifying the sampling frame and then selecting
some or all of the items from some of, but not
all, the strata.
1.5: Comparison of Probability and Statistics
Probability: Properties of the
population are assumed known.
Answer questions about the sample
based on these properties.

Statistics: Use information in the

sample to draw a conclusion about the population.
Example: A jar of M&Ms contains 100 candy
pieces, 15 are red. A handful of 10 is selected.

Probability question: What is the probability that

3 of the 10 selected are red?

Example: A handful of 10 M&Ms is selected from

a jar containing 1000 candy pieces. Three
M&Ms in the handful are red.

Statistics question: What is the proportion of red

M&Ms in the entire jar?
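The probability question above has an exact answer via the standard hypergeometric count: choose 3 of the 15 reds and 7 of the 85 non-reds, out of all handfuls of 10 from 100. This computation is an added sketch, not worked in the slides:

```python
# Sketch: P(exactly 3 red) when drawing a handful of 10 from 100 candies,
# 15 of which are red (hypergeometric probability).
import math

p_three_red = math.comb(15, 3) * math.comb(85, 7) / math.comb(100, 10)
print(round(p_three_red, 3))
```

The statistics question, by contrast, runs the other way: from 3 reds observed in a handful of 10, estimate the jar's proportion (3/10 = 0.3 would be the sample proportion).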
1.6: Statistics and the Computer
Electronic technology has had a
tremendous effect on the field of statistics.
Many statistical techniques are
repetitive in nature: computers and
calculators are good at this.
Lots of statistical software packages:
Statgraphics, SPSS, and calculators.
Remember: Responsible use of statistical
methodology is very important. The
burden is on the user to ensure that the
appropriate methods are correctly applied
and that accurate conclusions are drawn
and communicated to others.

Note: The textbook illustrates statistical

procedures using MINITAB, EXCEL 97, and
the TI-83.
Chapter 1: Introduction to Statistics

A variable is a characteristic or
condition that can change or take on
different values.
Most research begins with a general
question about the relationship
between two variables for a specific
group of individuals.

The entire group of individuals is
called the population.
For example, a researcher may be
interested in the relation between
class size (variable 1) and academic
performance (variable 2) for the
population of third-grade children.

Usually populations are so large that
a researcher cannot examine the
entire group. Therefore, a sample is
selected to represent the population
in a research study. The goal is to
use the results obtained from the
sample to help answer questions
about the population.

Types of Variables
Variables can be classified as
discrete or continuous.
Discrete variables (such as class
size) consist of indivisible categories,
and continuous variables (such as
time or weight) are infinitely divisible
into whatever units a researcher may
choose. For example, time can be
measured to the nearest minute,
second, half-second, etc.
Real Limits
To define the units for a continuous
variable, a researcher must use real
limits which are boundaries located
exactly half-way between adjacent categories.

Measuring Variables
To establish relationships between
variables, researchers must observe
the variables and record their
observations. This requires that the
variables be measured.
The process of measuring a variable
requires a set of categories called a
scale of measurement and a
process that classifies each individual
into one category.
4 Types of Measurement
1. A nominal scale is an unordered
set of categories identified only by
name. Nominal measurements only
permit you to determine whether
two individuals are the same or different.
2. An ordinal scale is an ordered set
of categories. Ordinal
measurements tell you the direction
of difference between two
individuals.
4 Types of Measurement
3. An interval scale is an ordered series of
equal-sized categories. Interval
measurements identify the direction and
magnitude of a difference. The zero point
is located arbitrarily on an interval scale.
4. A ratio scale is an interval scale where a
value of zero indicates none of the
variable. Ratio measurements identify
the direction and magnitude of
differences and allow ratio comparisons
of measurements.
Correlational Studies
The goal of a correlational study is
to determine whether there is a
relationship between two variables
and to describe the relationship.
A correlational study simply
observes the two variables as they
exist naturally.

Experiments
The goal of an experiment is to
demonstrate a cause-and-effect
relationship between two variables;
that is, to show that changing the
value of one variable causes changes
to occur in a second variable.

Experiments (cont.)
In an experiment, one variable is
manipulated to create treatment
conditions. A second variable is observed
and measured to obtain scores for a group
of individuals in each of the treatment
conditions. The measurements are then
compared to see if there are differences
between treatment conditions. All other
variables are controlled to prevent them
from influencing the results.
In an experiment, the manipulated
variable is called the independent
variable and the observed variable is the
dependent variable.
Other Types of Studies
Other types of research studies,
known as non-experimental or
quasi-experimental, are similar to
experiments because they also
compare groups of scores.
These studies do not use a
manipulated variable to differentiate
the groups. Instead, the variable
that differentiates the groups is
usually a pre-existing participant
variable (such as male/female) or a
time variable (such as before/after).
Other Types of Studies
Because these studies do not use the
manipulation and control of true
experiments, they cannot
demonstrate cause and effect
relationships. As a result, they are
similar to correlational research
because they simply demonstrate
and describe relationships.

Data
The measurements obtained in a
research study are called the data.
The goal of statistics is to help
researchers organize and interpret
the data.

Descriptive Statistics
Descriptive statistics are methods
for organizing and summarizing data.

For example, tables or graphs are

used to organize data, and
descriptive values such as the
average score are used to
summarize data.
A descriptive value for a population
is called a parameter and a
descriptive value for a sample is
called a statistic.
Inferential Statistics
Inferential statistics are methods for
using sample data to make general
conclusions (inferences) about populations.
Because a sample is typically only a part
of the whole population, sample data
provide only limited information about the
population. As a result, sample statistics
are generally imperfect representatives of
the corresponding population parameters.
Sampling Error
The discrepancy between a sample
statistic and its population parameter
is called sampling error.
Defining and measuring sampling
error is a large part of inferential

The individual measurements or scores
obtained for a research participant will be
identified by the letter X (or X and Y if
there are multiple scores for each individual).
The number of scores in a data set will be
identified by N for a population or n for a sample.
Summing a set of values is a common
operation in statistics and has its own
notation. The Greek letter sigma, Σ, will
be used to stand for "the sum of." For
example, ΣX identifies the sum of the scores.
Order of Operations
1. All calculations within parentheses are
done first.
2. Squaring or raising to other exponents is
done second.
3. Multiplying, and dividing are done third,
and should be completed in order from
left to right.
4. Summation with the Σ notation is done fourth.
5. Any additional adding and subtracting is
done last and should be completed in
order from left to right.
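The order-of-operations rules matter for summation notation: ΣX² squares each score first and then sums, while (ΣX)² sums first (parentheses) and then squares. A quick Python sketch with a hypothetical data set:

```python
# Sketch: Sigma-X-squared vs (Sigma-X)-squared on a tiny data set.
X = [1, 2, 3]
sum_x = sum(X)                     # ΣX = 1 + 2 + 3 = 6
sum_x_sq = sum(x ** 2 for x in X)  # ΣX² = 1 + 4 + 9 = 14 (square, then sum)
sq_of_sum = sum(X) ** 2            # (ΣX)² = 6² = 36 (parentheses first)
```

The two quantities differ (14 vs 36), which is why step order in the rules above is not optional.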
Basics of Statistics

Definition: Science of collection, presentation, analysis, and reasonable

interpretation of data.

Statistics presents a rigorous scientific method for gaining insight into data. For
example, suppose we measure the weight of 100 patients in a study. With so
many measurements, simply looking at the data fails to provide an informative
account. However, statistics can give an instant overall picture of data based
on graphical presentation or numerical summarization irrespective of the
number of data points. Besides data summarization, another important task of
statistics is to make inference and predict relations of variables.
A Taxonomy of Statistics
Statistical Description of Data
Statistics describes a numeric set of
data by its center, spread, and shape.
Statistics describes a categorical set
of data by the
frequency, percentage or proportion of each category.
Some Definitions
Variable - any characteristic of an individual or entity. A variable can
take different values for different individuals. Variables can be
categorical or quantitative. Per S. S. Stevens
Nominal - Categorical variables with no inherent order or ranking sequence such
as names or classes (e.g., gender). Values may be numerical, but without
numerical meaning (e.g., I, II, III). The only operation that can be applied to Nominal
variables is enumeration.
Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe.
Can be compared for equality, or greater or less, but not how much greater or less.
Interval - Values of the variable are ordered as in Ordinal, and additionally,
differences between values are meaningful; however, the scale is not absolutely
anchored. Calendar dates and temperatures on the Fahrenheit scale are examples.
Addition and subtraction, but not multiplication and division, are meaningful operations.
Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero
point, e.g. age, weight, temperature (Kelvin). Addition, subtraction, multiplication,
and division are all meaningful operations.
Some Definitions
Distribution - (of a variable) tells us what values the variable takes
and how often it takes these values.
Unimodal - having a single peak
Bimodal - having two distinct peaks
Symmetric - left and right half are mirror images.
Frequency Distribution
Consider a data set of 26 children of ages 1-6 years. Then the
frequency distribution of variable age can be tabulated as
Frequency Distribution of Age

Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
Grouped Frequency Distribution of Age:
Age Group 1-2 3-4 5-6

Frequency 8 12 6
Cumulative Frequency
Cumulative frequency of data in previous page

Age 1 2 3 4 5 6

Frequency 5 3 7 5 4 2

Cumulative Frequency 5 8 15 20 24 26

Age Group 1-2 3-4 5-6

Frequency 8 12 6

Cumulative Frequency 8 20 26
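The cumulative rows above are running totals of the frequency rows, which itertools.accumulate computes directly. An illustrative sketch using the age table:

```python
# Sketch: cumulative frequencies for the ages 1-6 frequency table.
from itertools import accumulate

ages = [1, 2, 3, 4, 5, 6]
freq = [5, 3, 7, 5, 4, 2]
cum_freq = list(accumulate(freq))  # running totals: [5, 8, 15, 20, 24, 26]
```

The final cumulative value (26) always equals the total number of children.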
Data Presentation
Two types of statistical presentation of data - graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking
deviations from that pattern. Overall pattern usually described by
shape, center, and spread of the data. An individual value that falls
outside the overall pattern is called an outlier.

Bar diagram and Pie charts are used for categorical variables.

Histogram, stem and leaf and Box-plot are used for numerical variable.
Data Presentation Categorical
Bar Diagram: Lists the categories and presents the percent or count of
individuals who fall in each category.

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.000             100.0
Data Presentation Categorical
Pie Chart: Lists the categories and presents the percent or count of
individuals who fall in each category.

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.000             100.0

Graphical Presentation Numerical
Histogram: Overall pattern can be described by its shape, center,
and spread. The following age distribution is right skewed. The
center lies between 80 and 100. No outliers.

Mean 90.41666667
Standard Error 3.902649518
Median 84
Mode 84
Standard Deviation 30.22979318
Sample Variance 913.8403955
Kurtosis -1.183899591
Skewness 0.389872725
Range 95
Minimum 48
Maximum 143
Sum 5425
Count 60
Graphical Presentation Numerical
Box-Plot: Describes the five-number summary

[Figure 3: Distribution of Age — box plot]
Numerical Presentation
A fundamental concept in summary statistics is that of a central value for a set of
observations and the extent to which the central value characterizes the whole
set of data. Measures of central value such as the mean or median must be
coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.

To understand how well a central value characterizes a set of observations, let

us consider the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50, but the observations in data set A lie
farther from the mean than those in data set B. Thus, the mean of data
set B is a better representation of its data than is the case for set A.
Methods of Center Measurement

Center measurement is a summary measure of the overall level of a data set.

Commonly used methods are the mean, median, mode and geometric mean.

Mean: Sum all the observations and divide by the number of
observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.
Notation: Let x1, x2, ..., xn be n observations of a variable x.
Then the mean of this variable is

x̄ = (x1 + x2 + ... + xn) / n = (1/n) Σ xi,  summing over i = 1, ..., n.
Median: The middle value in an ordered sequence of observations.

That is, to find the median we need to order the data set and then
find the middle value. In case of an even number of observations
the average of the two middle most values is the median. For
example, to find the median of {9, 3, 6, 7, 5}, we first sort the data
giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the
number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the
median is the average of the two middle values from the sorted
sequence, in this case, (5 + 6) / 2 = 5.5.

Mode: The value that is observed most frequently. The mode is

undefined for sequences in which no observation is repeated.
Mean or Median
The median is less sensitive to outliers (extreme scores) than the
mean and thus a better measure than the mean for highly skewed
distributions, e.g. family income. For example, the mean of 20, 30, 40,
and 990 is (20+30+40+990)/4 = 270, while the median of these four
observations is (30+40)/2 = 35. Here 3 observations out of 4 lie
between 20 and 40, so the mean of 270 fails to give a realistic
picture of the major part of the data; it is influenced by the extreme value.
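These calculations can be checked with Python's standard statistics module (a minimal sketch using the numbers from the examples above):

```python
import statistics

values = [20, 30, 40, 990]            # data with one extreme value
mean = statistics.mean(values)        # (20+30+40+990)/4 = 270
median = statistics.median(values)    # average of 30 and 40 = 35

odd = [9, 3, 6, 7, 5]                 # odd number of observations
even = [9, 3, 6, 7, 5, 2]             # even number of observations
m_odd = statistics.median(odd)        # middle value of sorted data: 6
m_even = statistics.median(even)      # (5 + 6) / 2 = 5.5

mode = statistics.mode([124, 151, 124, 132])   # most frequent value: 124
```

The mean (270) is pulled far above the bulk of the data by the extreme value, while the median (35) is not.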
Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods: range, variance, standard deviation,

interquartile range, coefficient of variation etc.

Range: The difference between the largest and the smallest

observations. The range of 10, 5, 2, 100 is (100 − 2) = 98. It's a crude
measure of variability.
Variance: The variance of a set of observations is the average of the

squares of the deviations of the observations from their mean. In
symbols, the variance of the n observations x1, x2, ..., xn is

s² = [(x1 − x̄)² + ... + (xn − x̄)²] / (n − 1)
Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

[(5 − 5)² + (3 − 5)² + (7 − 5)²] / (3 − 1) = (0 + 4 + 4)/2 = 4

Standard Deviation: The square root of the variance. The standard
deviation in the above example is √4 = 2.
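The same numbers with the standard library (statistics.variance and statistics.stdev both use the n − 1 divisor shown above):

```python
import statistics

data = [5, 7, 3]
var = statistics.variance(data)   # sample variance: (0 + 4 + 4) / 2 = 4
sd = statistics.stdev(data)       # square root of the variance: 2
```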
Quartiles: Data can be divided into four regions that cover the total
range of observed values. The cut points for these regions are known as
quartiles. In notation, the q-th quartile of a data set is the
(q(n+1)/4)-th observation of the ordered data, where n is the number of
observations.
The first quartile (Q1) is the value below which 25% of the data lie,
the second quartile (Q2) is the value below which 50% of the data lie
(Q2 is the median), and the third quartile (Q3) is the value below which
75% of the data lie.

Q1 is the median of the first half of the ordered observations and Q3 is
the median of the second half of the ordered observations.
In the following example Q1 is the ((15+1)/4)×1 = 4th observation of the data.
The 4th observation is 11, so Q1 of this data is 11.

An example with 15 numbers:

3 6 7 11 13 22 30 40 44 50 52 61 68 80 94

The first quartile is Q1 = 11. The second quartile is Q2 = 40 (this is
also the median). The third quartile is Q3 = 61.

Inter-quartile Range: The difference between Q3 and Q1. The inter-quartile
range of the previous example is 61 − 11 = 50. The middle half of the
ordered data lie between 11 and 61.
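Python's statistics.quantiles with method='exclusive' uses the same (n+1)-based positional rule, so it reproduces this example directly:

```python
import statistics

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
# 'exclusive' places the q-th quartile at position q*(n+1)/4
q1, q2, q3 = statistics.quantiles(data, n=4, method='exclusive')
iqr = q3 - q1   # the middle half of the data spans q1..q3
```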
Deciles and Percentiles
Deciles: If data is ordered and divided into 10 parts, then cut points
are called Deciles
Percentiles: If data is ordered and divided into 100 parts, then cut
points are called Percentiles. 25th percentile is the Q1, 50th percentile
is the Median (Q2) and the 75th percentile of the data is Q3.

In notation, the p-th percentile of a data set is the ((n+1)p/100)-th
observation of the ordered data, where n is the number of observations.

Coefficient of Variation: The standard deviation of the data divided by
its mean, usually expressed as a percentage:

Coefficient of Variation = (s / x̄) × 100%
Five Number Summary

Five Number Summary: The five number summary of a distribution
consists of the smallest observation (Minimum), the first quartile (Q1),
the median (Q2), the third quartile (Q3), and the largest observation
(Maximum), written in order from smallest to largest.

Box Plot: A box plot is a graph of the five number summary. The
central box spans the quartiles. A line within the box marks the
median. Lines extending above and below the box mark the
smallest and the largest observations (i.e., the range). Outlying
samples may be additionally plotted outside the range.
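A small sketch of the five number summary (the helper name is ours, not a library function; it uses the same quartile rule as the 15-number example above):

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method='exclusive')
    return (min(data), q1, q2, q3, max(data))

summary = five_number_summary(
    [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94])
```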
(Figure: Box plot of the distribution of Age in Months.)
Choosing a Summary
The five number summary is usually better than the mean and standard
deviation for describing a skewed distribution or a distribution with
extreme outliers. The mean and standard deviation are reasonable for
symmetric distributions that are free of outliers.

In real life we can't always expect symmetry of the data. It's common
practice to include the number of observations (n), mean, median, standard
deviation, and range for data summarization purposes. We can
include other summary statistics such as Q1, Q3 and the coefficient of
variation if they are considered important for describing the data.
Shape of Data
The shape of data is measured by skewness and kurtosis.

Skewness measures the asymmetry of the data:
Positive or right skewed: longer right tail
Negative or left skewed: longer left tail

Let x1, x2, ..., xn be n observations. Then (sums over i = 1, ..., n)

Skewness = √n × Σ (xi − x̄)³ / [Σ (xi − x̄)²]^(3/2)

Kurtosis measures the peakedness of the distribution of the data.
The kurtosis of the normal distribution is 0:

Kurtosis = n × Σ (xi − x̄)⁴ / [Σ (xi − x̄)²]² − 3
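The two formulas translate directly into code (these are the sample-moment versions written above; the kurtosis is the excess kurtosis, so a normal distribution gives 0):

```python
import math

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs)   # sum of squared deviations
    s3 = sum((x - m) ** 3 for x in xs)   # sum of cubed deviations
    return math.sqrt(n) * s3 / s2 ** 1.5

def kurtosis(xs):
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs)
    s4 = sum((x - m) ** 4 for x in xs)
    return n * s4 / s2 ** 2 - 3          # excess kurtosis

sym = skewness([1, 2, 3])    # symmetric data: skewness 0
flat = kurtosis([1, 2, 3])   # flatter than normal: negative
```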

Summary of the Variable Age in
the given data set
(Same summary statistics as given earlier. Figure: histogram of Age in
Months; x-axis roughly 40 to 160 months, y-axis number of subjects.)
(Figure: Box plot of Age in Months.)
Class Summary (First Part)
So far we have covered:

Statistics and data presentation/data summarization

Graphical presentation: bar chart, pie chart, histogram, and box plot
Numerical presentation: measuring the central value of data (mean,
median, mode, etc.), measuring dispersion (standard deviation,
variance, coefficient of variation, range, inter-quartile range, etc.),
quartiles, percentiles, and the five number summary

Any questions ?
Brief concept of statistical software

There are many software packages for performing statistical analysis and
visualization of data. Some of them are SAS (Statistical Analysis System),
S-plus, R, Matlab, Minitab, BMDP, Stata, SPSS, StatXact, Statistica,
LISREL, JMP, GLIM, HIL and MS Excel. We will discuss MS Excel and SPSS
in brief.
Microsoft Excel
A spreadsheet application. It features calculation, graphing tools, pivot
tables and a macro programming language called VBA (Visual Basic for
Applications).

There are many versions of MS Excel. Excel XP, Excel 2003 and Excel 2007
are capable of performing a number of statistical analyses.

Starting MS Excel: Double click on the Microsoft Excel icon on the

desktop or Click on Start --> Programs --> Microsoft Excel.

Worksheet: Consists of a grid of cells with numbered rows down the page
and alphabetically-titled columns across the page. Each cell is referenced by
its coordinates. For example, A3 refers to the cell in column A and row 3;
B10:B20 refers to the range of cells in column B, rows 10 through 20.
Opening a document: File → Open (from an existing workbook). Change the
directory or drive to look for files in other locations.
Creating a new workbook: File → New → Blank Document
Saving a file: File → Save

Selecting more than one cell: Click on a cell (e.g. A1), then hold the
Shift key and click on another (e.g. D4) to select the cells between A1
and D4, or click on a cell and drag the mouse across the desired range.

Creating formulas: 1. Click the cell in which you want to enter the formula,
2. Type = (an equal sign), 3. Click the Function button, 4. Select the
formula you want and step through the on-screen instructions.

Entering date and time: Dates are stored as MM/DD/YYYY, but there is no
need to enter them in that format. For example, Excel will recognize jan 9 or
jan-9 as 1/9/2007 and jan 9, 1999 as 1/9/1999. To enter today's date, press
Ctrl and ; together. Use a or p to indicate am or pm; for example, 8:30 p is
interpreted as 8:30 pm. To enter the current time, press Ctrl and : together.

Copy and Paste all cells in a Sheet: Ctrl+A for selecting, Ctrl +C for copying
and Ctrl+V for Pasting.

Sorting: Data → Sort → Sort By

Descriptive statistics and other statistical methods: Tools → Data Analysis →
statistical method. If Data Analysis is not available, click on Tools →
Add-Ins and then select Analysis ToolPak and Analysis ToolPak - VBA.
Statistical and mathematical functions: Start with an = sign and then select
a function from the function wizard (fx).

Inserting a chart: Click on Chart Wizard (or Insert → Chart), select the
chart type, give the input data range, update the chart options, and select
the output range/worksheet.

Importing data into Excel: File → Open → File Type → click on the file →
choose format (Delimited/Fixed Width) → choose delimiters (Tab/Semicolon/
Comma/Space/Other) → Finish.

Limitations: Excel uses algorithms that are vulnerable to rounding and
truncation errors and may produce inaccurate results in extreme cases.
Statistical Package
for the Social Sciences (SPSS)
A general purpose statistical package SPSS is widely used in the social
sciences, particularly in sociology and psychology.
SPSS can import data from almost any type of file to generate tabulated
reports, plots of distributions and trends, descriptive statistics, and
complex statistical analyses.
Starting SPSS: Double click on SPSS on the desktop or Programs → SPSS.

Opening an SPSS file: File → Open


Data Editor
Various pull-down menus appear at the top of the Data Editor window. These
pull-down menus are at the heart of using SPSSWIN. The Data Editor menu
items (with some of the uses of the menu) are:

FILE: used to open and save data files

EDIT: used to copy and paste data values; to find data in a file; to
insert variables and cases. OPTIONS allows the user to set general
preferences as well as the setup for the Navigator, Charts, etc.

VIEW: the user can change toolbars; value labels can be seen in cells
instead of data values

DATA: select, sort or weight cases; merge files

TRANSFORM: compute new variables, recode variables, etc.


ANALYZE: perform various statistical procedures

GRAPHS: create bar and pie charts, etc.

UTILITIES: add comments to accompany a data file (and other,
advanced features)

ADD-ONS: features not currently installed (advanced statistical
procedures)

WINDOW: switch between data, syntax and navigator windows

HELP: access SPSSWIN Help information


Navigator (Output) Menus

When statistical procedures are run or charts are created, the output will appear
in the Navigator window. The Navigator window contains many of the pull-down
menus found in the Data Editor window. Some of the important menus in the
Navigator window include:

INSERT: used to insert page breaks, titles, charts, etc.

FORMAT: for changing the alignment of a particular portion of the output

Formatting Toolbar
When a table has been created by a statistical procedure, the user can edit
the table to create a desired look or add/delete information. Beginning with
version 14.0, the user has a choice of editing the table in the Output or
opening it in a separate Pivot Table window. Various pull-down menus are
activated when the user double clicks on the table. These include:

EDIT: undo and redo a pivot, select a table or table body (e.g., to
change the font)

INSERT: used to insert titles, captions and footnotes

PIVOT: used to perform a pivot of the row and column variables

FORMAT: various modifications can be made to tables and cells

Additional menus
CHART EDITOR: used to edit a graph

SYNTAX EDITOR: used to edit the text in a syntax window

Show or hide a toolbar: click on VIEW → TOOLBARS to show or hide it

Move a toolbar: click on the toolbar (but not on one of the pushbuttons)
and then drag the toolbar to its new location

Customize a toolbar


Importing data from an EXCEL spreadsheet:
Data from an Excel spreadsheet can be imported into SPSSWIN as follows:
1. Click on FILE → OPEN → DATA. The Open File dialog box will appear.
2. Locate the file of interest: use the "Look In" pull-down list to identify
the folder containing the Excel file of interest.
3. From the FILE TYPE pull-down menu select EXCEL (*.xls).
4. Click on the file name of interest and click on OPEN, or simply
double-click on the file name.
5. Keep the box checked that reads "Read variable names from the first row of
data". This presumes that the first row of the Excel data file contains the
variable names. [If the data resided in a different worksheet in the Excel
file, this would need to be entered.]
6. Click on OK. The Excel data file will now appear in the SPSSWIN Data
Editor.
Importing data from an EXCEL spreadsheet:
7. The former Excel spreadsheet can now be saved as an SPSS file (FILE →
SAVE AS) and is ready to be used in analyses. Typically, you would label
variables and values, and define missing values.
Importing an Access table
SPSSWIN does not offer a direct import for Access tables. Therefore, we must follow
these steps:
1. Open the Access file
2. Open the data table
3. Save the data as an Excel file
4. Follow the steps outlined above for importing an Excel spreadsheet into SPSSWIN.
Importing Text Files into SPSSWIN
Text data points typically are separated (or delimited) by tabs or commas.
Sometimes they can be of fixed format.
Importing tab-delimited data
In SPSSWIN click on FILE → OPEN → DATA. Look in the appropriate location for
the text file. Then select Text from Files of type. Click on the file name
and then click on Open. You will see the Text Import Wizard (step 1 of 6)
dialog box.

You will now have an SPSS data file containing the former tab-delimited data.
You simply need to add variable and value labels and define missing values.

Exporting Data to Excel

Click on FILE → SAVE AS. Click on the file name for the file to be exported.
For Save as Type, select Excel (*.xls) from the pull-down menu. You will
notice the checkbox "Write variable names to spreadsheet". Leave this
checked, as you will want the variable names to be in the first row of each
column in the Excel spreadsheet. Finally, click on Save.
Running the FREQUENCIES procedure

1. Open the data file of interest (from the menus, click on FILE → OPEN →
DATA).

2. From the menus, click on ANALYZE → DESCRIPTIVE STATISTICS → FREQUENCIES.

3. The FREQUENCIES Dialog Box will appear. In the left-hand box will be a listing
("source variable list") of all the variables that have been defined in the data file. The
first step is identifying the variable(s) for which you want to run a frequency analysis.
Click on a variable name(s). Then click the [ > ] pushbutton. The variable name(s)
will now appear in the VARIABLE[S]: box ("selected variable list"). Repeat these
steps for each variable of interest.

4. If all that is being requested is a frequency table showing count, percentages

(raw, adjusted and cumulative), then click on OK.
Descriptive and summary STATISTICS can be requested for numeric variables. To
request Statistics:
1. From the FREQUENCIES Dialog Box, click on the STATISTICS... pushbutton.
2. This will bring up the FREQUENCIES: STATISTICS Dialog Box.
3. The STATISTICS Dialog Box offers the user a variety of choices:


The DESCRIPTIVES procedure can be used to generate descriptive statistics.
This procedure offers many of the same statistics as the FREQUENCIES
procedure, but without generating frequency analysis tables.
Requesting CHARTS
One can request a chart (graph) to be created for a variable or variables included in
a FREQUENCIES procedure.

1. In the FREQUENCIES Dialog box click on CHARTS.

2. The FREQUENCIES: CHARTS dialog box will appear. Choose the intended chart
(e.g. bar diagram, pie chart, histogram).

Pasting charts into Word

1. Click on the chart.
2. Click on the pull-down menu EDIT → COPY OBJECTS.
3. Go to the Word document in which the chart is to be embedded. Click on
EDIT → PASTE SPECIAL.
4. Select Formatted Text (RTF) and then click on OK.
5. Enlarge the graph to a desired size by dragging one or more of the black squares

along the perimeter (if the black squares are not visible, click once on the graph).
1. From the ANALYZE pull-down menu, click on DESCRIPTIVE STATISTICS →
CROSSTABS.
2. The CROSSTABS dialog box will then open.

3. From the variable selection box on the left click on a variable you wish to
designate as the Row variable. The values (codes) for the Row variable make up
the rows of the crosstabs table. Click on the arrow (>) button for Row(s). Next,
click on a different variable you wish to designate as the Column variable. The
values (codes) for the Column variable make up the columns of the crosstabs
table. Click on the arrow (>) button for Column(s).

4. You can specify more than one variable in the Row(s) and/or Column(s)
boxes. A crosstab will be generated for each combination of Row and Column
variables.
Limitations: SPSS users have less control over data manipulation and
statistical output than other statistical packages such as SAS, Stata etc.

SPSS is a good first statistical package to perform quantitative research

in social science because it is easy to use and because it can be a good
starting point to learn more advanced statistical packages.
Introduction to Statistics
Colm O'Dushlaine

Neuropsychiatric Genetics, TCD



Descriptive Statistics & Graphical Presentation of Data
Statistical Inference
Hypothesis Tests & Confidence Intervals
t-tests (Paired/Two-sample)
Regression (SLR & Multiple Regression)

Intended as an overview. Slides will be provided after the lectures.
What's in the lectures?

Lecture 1: Descriptive Statistics and Graphical Presentation of Data

1. Terminology
2. Frequency Distributions/Histograms
3. Measures of data location
4. Measures of data spread
5. Box-plots
6. Scatter-plots
7. Clustering (Multivariate Data)

Lecture 2: Statistical Inference

1. Distributions & Densities

2. Normal Distribution
3. Sampling Distribution & Central Limit Theorem
4. Hypothesis Tests
5. P-values
6. Confidence Intervals
7. Two-Sample Inferences
8. Paired Data

Lecture 3: Sample Inferences

1. Two-Sample Inferences
Paired t-test
Two-sample t-test
2. Inferences for more than two samples
One-way ANOVA
Two-way ANOVA
Interactions in Two-way ANOVA
3. DataDesk demo

Lecture 4

1. Regression
2. Correlation
3. Multiple Regression
5. Normality Checks
6. Non-parametrics
7. Sample Size Calculations
8. Useful tools and websites
Explanations of outputs
Videos with commentary
Help with deciding what test
to use with what data

1. Terminology
Populations & Samples
Population: the complete set of
individuals, objects or scores of interest.
Often too large to sample in its entirety
It may be real or hypothetical (e.g. the results
from an experiment repeated ad infinitum)

Sample: A subset of the population.

A sample may be classified as random (each member has an equal chance of
being selected from the population) or convenience (whatever is
conveniently available).
Random selection attempts to ensure the sample is representative of the
population.
Variables are the quantities measured in a sample. They may be classified as:
Quantitative, i.e. numerical:
  Continuous (e.g. pH of a sample, patient cholesterol levels)
  Discrete (e.g. number of bacteria colonies in a culture)
Categorical:
  Nominal (e.g. gender, blood group)
  Ordinal (ranked, e.g. mild, moderate or severe illness). Often ordinal
  variables are re-coded to be quantitative.
Variables can be further classified as:
Dependent/Response: the variable of primary interest (e.g. blood pressure
in an antihypertensive drug trial). Not controlled by the experimenter.
Independent/Predictor: called a Factor when controlled by the experimenter
(it is often nominal, e.g. treatment group), and a Covariate when not
controlled.
If the value of a variable cannot be predicted in advance then the variable
is referred to as a random variable.
Parameters & Statistics
Parameters: quantities that describe a population characteristic. They are
usually unknown and we wish to make statistical inferences about parameters.
Different from statistics.

Descriptive Statistics: quantities and techniques used to describe a sample
characteristic.
2. Frequency Distributions
An (Empirical) Frequency Distribution or Histogram for a continuous
variable presents the counts of observations grouped within pre-specified
classes or intervals.

A Relative Frequency Distribution presents the corresponding proportions
of observations within the classes.

A Barchart presents the frequencies for a categorical variable.
Example Serum CK
Blood samples were taken from 36 male volunteers as part of a study to
determine the natural variation in CK concentration.

The serum CK concentrations, measured in U/l, are as follows.

Serum CK data for 36 male volunteers:
121 82 100 151 68 58

95 145 64 201 101 163
84 57 139 60 78 94
119 104 110 113 118 203
62 83 67 93 92 110
25 123 70 48 95 42
Relative Frequency Table
Serum CK (U/l)   Frequency   Relative Frequency   Cumulative Rel. Frequency
20-39 1 0.028 0.028
40-59 4 0.111 0.139
60-79 7 0.194 0.333
80-99 8 0.222 0.555
100-119 8 0.222 0.777
120-139 3 0.083 0.860
140-159 2 0.056 0.916
160-179 1 0.028 0.944
180-199 0 0.000 0.944
200-219 2 0.056 1.000
Total 36 1.000
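The counts in the table can be reproduced with a few lines of Python (classes of width 20 starting at 20):

```python
# Serum CK data for the 36 male volunteers
ck = [121, 82, 100, 151, 68, 58, 95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94, 119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110, 25, 123, 70, 48, 95, 42]

counts = [0] * 10                  # ten classes: 20-39, 40-59, ..., 200-219
for v in ck:
    counts[(v - 20) // 20] += 1    # index of the width-20 class containing v

rel_freq = [round(c / len(ck), 3) for c in counts]
```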
Frequency Distribution

(Figure: frequency histogram of the serum CK data, annotated with the
minimum, quartiles, median and maximum.)

Relative Frequency

(Figure: relative frequency histogram of the serum CK data. The shaded area
is the percentage of males with CK values between 60 and 100 U/l, i.e. 42%.
The distribution has a long right tail, i.e. it is right-skewed.)

3. Measures of Central
Tendency (Location)
Measures of location indicate where on the
number line the data are to be found. Common
measures of location are:

(i) the Arithmetic Mean,

(ii) the Median, and
(iii) the Mode

The Mean

Let x1, x2, x3, ..., xn be the realised values of a random variable X, from
a sample of size n. The sample arithmetic mean is defined as:

x̄ = (1/n) Σ xi,  summing over i = 1, ..., n.

Example 2: The systolic blood pressures of seven middle-aged men were as
follows: 151, 124, 132, 170, 146, 124 and 113.

The mean is (151 + 124 + 132 + 170 + 146 + 124 + 113)/7 = 960/7 = 137.14.

The Median and Mode
If the sample data are arranged in
increasing order, the median is
(i) the middle value if n is an odd
number, or
(ii) midway between the two middle
values if n is an even number
The mode is the most commonly
occurring value.

Example 1 n is odd
The reordered systolic blood pressure data seen
earlier are:

113, 124, 124, 132, 146, 151, and 170.

The Median is the middle value of the ordered

data, i.e. 132.

Two individuals have systolic blood pressure =

124 mm Hg, so the Mode is 124.

Example 2 n is even

Six men with high cholesterol participated in a study to

investigate the effects of diet on cholesterol level. At the
beginning of the study, their cholesterol levels (mg/dL)
were as follows:
366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows:

230, 274, 274, 292, 327 and 366.

The Median is halfway between the middle two readings,

i.e. (274 + 292)/2 = 283.

Two men have the same cholesterol level, so the Mode is 274.
Mean versus Median

Large sample values tend to inflate the mean. This will

happen if the histogram of the data is right-skewed.

The median is not influenced by large sample values and is

a better measure of centrality if the distribution is skewed.

Note: if mean = median = mode then the data are said to be symmetric.

e.g. In the CK measurement study, the sample mean = 98.28 and the
median = 94.5, i.e. the mean is larger than the median, indicating that the
mean is inflated by the two large data values 201 and 203.

4. Measures of Dispersion

Measures of dispersion characterise how

spread out the distribution is, i.e., how
variable the data are.
Commonly used measures of dispersion are:
1. Range
2. Variance & standard deviation
3. Coefficient of variation (or relative standard deviation)
4. Inter-quartile range

The sample Range is the difference between the largest and smallest
observations in the sample:
- easy to calculate
- useful for best- or worst-case analyses
- sensitive to extreme values
Blood pressure example: min = 113 and max = 170, so the range = 57.
Sample Variance
The sample variance, s², is the arithmetic mean of the squared
deviations from the sample mean:

s² = Σ (xi − x̄)² / (n − 1),  summing over i = 1, ..., n.


Standard Deviation
The sample standard deviation, s, is the square root of the variance:

s = √[ Σ (xi − x̄)² / (n − 1) ]

s has the advantage of being in the same units as the original variable x.
Data   Deviation   Deviation²
151      13.86       192.02
124     -13.14       172.73
132      -5.14        26.45
170      32.86      1079.59
146       8.86        78.45
124     -13.14       172.73
113     -24.14       582.88
Sum = 960.0   Sum = 0.00   Sum = 2304.86
x̄ = 137.14
Example (contd.)

s² = Σ (xi − x̄)² / (n − 1) = 2304.86 / (7 − 1) = 384.14

s = √384.14 = 19.6
Coefficient of Variation
The coefficient of variation (CV) or relative standard deviation (RSD) is
the sample standard deviation expressed as a percentage of the mean, i.e.

CV = (s / x̄) × 100%
The CV is not affected by multiplicative
changes in scale
Consequently, a useful way of comparing the
dispersion of variables measured on different
The CV of the blood pressure data is:

CV = (19.6 / 137.14) × 100% = 14.3%

i.e., the standard deviation is 14.3% as large as the mean.
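All three quantities can be verified with the standard library:

```python
import statistics

bp = [151, 124, 132, 170, 146, 124, 113]   # systolic blood pressures
mean = statistics.mean(bp)                 # 960 / 7 = 137.14...
var = statistics.variance(bp)              # 2304.86 / 6 = 384.14...
sd = statistics.stdev(bp)                  # square root of the variance
cv = sd / mean * 100                       # standard deviation as % of mean
```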

Inter-quartile range
The median divides a distribution into two halves.

The first and third quartiles (denoted Q1 and Q3) are defined as follows:
25% of the data lie below Q1 (and 75% of the data lie above Q1),
25% of the data lie above Q3 (and 75% of the data lie below Q3).

The inter-quartile range (IQR) is the difference between the first and
third quartiles, i.e. IQR = Q3 − Q1.
The ordered blood pressure data are:

113  124  124  132  146  151  170

Q1 = 124 and Q3 = 151, so the inter-quartile range (IQR) = 151 − 124 = 27.

60% of slides complete!

5. Box-plots
A box-plot is a visual description of the distribution based on the
five-number summary. Useful for comparing large sets of data.

Example 1
The pulse rates of 12 individuals arranged in increasing order are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80

Q1 = (68+70)/2 = 69, Q3 = (76+78)/2 = 77

IQR = (77 − 69) = 8
Example 1: Box-plot (figure)

Example 2: Box-plots of intensities from 11 gene expression arrays (figure)

An outlier is an observation which does not appear to belong with the
other data.

Outliers can arise because of a measurement or recording error, or
because of equipment failure during an experiment, etc.

An outlier might be indicative of a sub-population, e.g. an abnormally low
or high value in a medical test could indicate disease.
Outlier Boxplot

Re-define the upper and lower limits of the boxplot (the whisker lines) as:
Lower limit = Q1 − 1.5×IQR, and
Upper limit = Q3 + 1.5×IQR

Note that the lines may not extend all the way to these limits.

If a data point is < lower limit or > upper limit, the data point is
considered to be an outlier.
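A sketch of the outlier fences on the pulse-rate data from Example 1, computing Q1 and Q3 as the medians of the lower and upper halves (the rule used in that example):

```python
import statistics

pulse = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]
half = len(pulse) // 2
q1 = statistics.median(pulse[:half])    # median of lower half: 69
q3 = statistics.median(pulse[-half:])   # median of upper half: 77
iqr = q3 - q1                           # 8
lower = q1 - 1.5 * iqr                  # whisker lower limit: 57
upper = q3 + 1.5 * iqr                  # whisker upper limit: 89
outliers = [x for x in pulse if x < lower or x > upper]
```

No pulse rate falls outside [57, 89], so this sample has no outliers by the 1.5×IQR rule.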
Example CK data


6. Scatter-plot

Displays the relationship between two continuous variables.

Useful in the early stages of analysis when exploring data and determining
whether a linear regression analysis is appropriate.

May show outliers in your data.

Example 1: Age versus Systolic
Blood Pressure in a Clinical Trial

Example 2: Up-regulation/Down-regulation
of gene expression across an array
(Control Cy5 versus Disease Cy3)

Example of a Scatter-plot matrix
(multiple pair-wise plots)

Other graphical representations
- Dot-plots and stem-and-leaf plots: not visually appealing
- Pie charts: visually appealing, but hard to compare two datasets; best for
  3 to 7 categories; a total must be meaningful
- Violin plots (= box-plot + smooth density): nice visual of data shape
Multivariate Data
Clustering is useful for visualising
multivariate data and uncovering patterns,
often reducing its complexity

Clustering is especially useful for high-dimensional data (p >> n):
hundreds or perhaps thousands of variables.

Obvious areas of application are gel electrophoresis and microarray
experiments, where the variables are protein abundances or gene
expression levels.
7. Clustering

Aim: Find groups of samples or variables sharing similarity.

Clustering requires a definition of distance between objects, quantifying a
notion of similarity. Points are grouped on the basis of minimum distance
apart (distance measures).

Once a pair are grouped, they are combined into a single point (using a
linkage method), e.g. take their average. The process is then repeated.
Clustering can be applied to rows or columns of a data
set (matrix) i.e. to the samples or variables

A tree can be constructed with branch length

proportional to distances between linked clusters, called
a Dendrogram

Clustering is an example of unsupervised learning: No

use is made of sample annotations i.e. treatment groups,
diagnosis groups

Unweighted Pair-Group Method with Arithmetic Mean (UPGMA)
The most commonly used clustering method:
1. Each observation forms its own cluster
2. The two clusters with minimum distance are grouped into a single
cluster representing a new observation; take their average
3. Repeat step 2 until all data points form a single cluster
Contrived Example
5 genes of interest on 3 replicate arrays/gels

            Array1   Array2   Array3
p53           9        3        7
mdm2         10        2        9
bcl2          1        9        4
cyclinE       6        5        5
caspase 8     1       10        3

Euclidean distance: d(x, y) = √[(x1 − y1)² + (x2 − y2)² + (x3 − y3)²]

Calculate the distance between each pair of genes, e.g.

d(p53, mdm2) = √[(9 − 10)² + (3 − 2)² + (7 − 9)²] = √6 ≈ 2.5

Construct a distance matrix of all pair-wise distances:

            p53    mdm2   bcl2    cyclinE   caspase 8
p53         0      2.5    10.44   4.12      11.75
mdm2        -      0      12.5    6.4       13.93
bcl2        -      -      0       6.48       1.41
cyclinE     -      -      -       0          7.35
caspase 8   -      -      -       -          0
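A minimal sketch of the distance computation; it also finds the closest pair, i.e. the pair that UPGMA merges first:

```python
import math

genes = {
    'p53':      (9, 3, 7),
    'mdm2':     (10, 2, 9),
    'bcl2':     (1, 9, 4),
    'cyclinE':  (6, 5, 5),
    'caspase8': (1, 10, 3),
}

def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

names = list(genes)
# All pairwise distances, rounded to 2 decimal places
d = {(g, h): round(dist(genes[g], genes[h]), 2)
     for i, g in enumerate(names) for h in names[i + 1:]}

closest = min(d, key=d.get)   # smallest pairwise distance
```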

Cluster the 2 genes with the smallest distance: caspase 8 and bcl-2.

Take their average & re-calculate distances to the other genes:

                       p53    mdm2   cyclinE   {caspase-8 & bcl-2}
p53                    0      2.5    4.12      10.9
mdm2                          0      6.4        9.1
cyclinE                              0          6.9

Next, cluster p53 and mdm2 and repeat:

                       {p53 & mdm2}   cyclinE   {caspase-8 & bcl-2}
{p53 & mdm2}           0              3.7       9.2
cyclinE                               0         6.9
{caspase-8 & bcl-2}                             0

Example (contd.)

...and the final merges produce the dendrogram.

(Figure: example of a gene expression dendrogram.)
Variety of approaches to clustering

Clustering techniques:
- agglomerative: start with every element in its own cluster, and
  iteratively join clusters together
- divisive: start with one cluster and iteratively divide it into smaller
  clusters

Distance metrics:
- Euclidean (as-the-crow-flies)
- Minkowski (a whole class of metrics)
- Correlation (similarity in profiles; called a similarity metric)

Linkage rules:
- average: use the mean distance between cluster members
- single: use the minimum distance (gives loose clusters)
- complete: use the maximum distance (gives tight clusters)
- median: use the median distance
- centroid: use the distance between the cluster averages
Clustering Summary
The clusters & tree topology often depend strongly on the distance measure and linkage method used

Recommended to use two distance metrics,

such as Euclidean and a correlation metric

A clustering algorithm will always yield

clusters, whether the data are organised in
clusters or not!

What is Statistics?
Statistics is a way to get information
from data

Data → Information

Data: Facts, especially numerical facts, collected together for reference or information.

Information: Knowledge communicated concerning some particular fact.

Definitions: Oxford English Dictionary

Interval Data
Interval data
Real numbers, i.e. heights, weights,
prices, etc.
Also referred to as quantitative or numerical.

Arithmetic operations can be performed on interval data, thus it's meaningful to talk about 2*Height, or Price + $1, and so on.

Nominal Data
Nominal Data
The values of nominal data are categories.
E.g. responses to questions about marital status,
coded as:
Single = 1, Married = 2, Divorced = 3, Widowed = 4

Because the numbers are arbitrary, arithmetic operations don't make any sense (e.g. does Widowed ÷ 2 = Married?!)

Nominal data are also called qualitative or categorical.


Ordinal Data
Ordinal Data appear to be categorical in nature, but their
values have an order; a ranking to them:

E.g. College course rating system:

poor = 1, fair = 2, good = 3, very good = 4, excellent = 5

While it's still not meaningful to do arithmetic on these data
(e.g. does 2*fair = very good?!), we can say things like:
excellent > poor or fair < very good
That is, order is maintained no matter what numeric
values are assigned to each category.

Graphical & Tabular Techniques for Nominal Data

The only allowable calculation on nominal data is to

count the frequency of each value of the variable.

We can summarize the data in a table that presents

the categories and their counts, called a frequency distribution.

A relative frequency distribution lists the

categories and the proportion with which each occurs.

Refer to Example 2.1
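The frequency and relative-frequency counts can be sketched in a few lines of Python. The responses below are made up for illustration; only counting is performed, since that is the one allowable calculation on nominal data:

```python
from collections import Counter

# Hypothetical marital-status responses (categories as in the coding example)
responses = ["Single", "Married", "Married", "Divorced", "Single",
             "Married", "Widowed", "Single", "Married", "Married"]

freq = Counter(responses)                                  # frequency distribution
n = len(responses)
rel_freq = {cat: count / n for cat, count in freq.items()} # relative frequencies
```

The relative frequencies always sum to 1, which is what a bar chart of proportions displays.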

Nominal Data (Tabular Summary)

Nominal Data (Frequency)

Bar Charts are often used to display frequencies

Nominal Data
It's all the same information (based on the same data), just different presentation.

Graphical Techniques for Interval Data
There are several graphical methods that are used when the data are interval (i.e. numeric, quantitative).
The most important of these graphical methods

is the histogram.

The histogram is not only a powerful graphical

technique used to summarize interval data,
but it is also used to help explain probabilities.

Building a Histogram
1) Collect the Data
2) Create a frequency distribution for
the data.
3) Draw the Histogram.

Histogram and Stem & Leaf Plot

Ogive
An ogive is a graph of a cumulative frequency distribution.

We create an ogive in three steps:

1) Calculate relative frequencies.
2) Calculate cumulative relative frequencies by adding the current class relative frequency to the previous class cumulative relative frequency.
(For the first class, its cumulative relative frequency is just its relative frequency.)
3) Graph the cumulative relative frequencies.

Cumulative Relative Frequencies
first class: its own relative frequency
next class: previous cumulative + its relative frequency
…
last class: 1.00
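The three ogive steps can be sketched in code. The class frequencies below are hypothetical, chosen only to show the running sum:

```python
# Hypothetical class frequencies (e.g. counts of telephone bills per class)
freqs = [10, 20, 40, 20, 10]
n = sum(freqs)

rel = [f / n for f in freqs]   # step 1: relative frequencies

cum_rel = []                   # step 2: cumulative relative frequencies
running = 0.0
for r in rel:
    running += r               # add current class to previous cumulative
    cum_rel.append(running)
# step 3 would plot cum_rel against the class upper limits
```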

The ogive can be used to answer questions like: what telephone bill value is at the 50th percentile? Answer: around $35.
(Refer also to Fig. 2.13 in your textbook.)
Scatter Diagram
Example 2.9 A real estate agent wanted
to know to what extent the selling price
of a home is related to its size

1) Collect the data

2) Determine the independent variable (X = house size) and the dependent variable (Y = selling price)
3) Use Excel to create a scatter diagram

Scatter Diagram
It appears that in fact there is a relationship; that is, the greater the house size, the greater the selling price.

Patterns of Scatter
Linearity and Direction are two
concepts we are interested in

Positive Linear Relationship Negative Linear Relationship

Weak or Non-Linear Relationship

Time Series Data
Observations measured at the same point in
time are called cross-sectional data.

Observations measured at successive points

in time are called time-series data.

Time-series data graphed on a line chart,

which plots the value of the variable on the
vertical axis against the time periods on the
horizontal axis.

Numerical Descriptive
Measures of Central Location
Mean, Median, Mode

Measures of Variability
Range, Standard Deviation, Variance, Coefficient of Variation
Measures of Relative Standing

Percentiles, Quartiles

Measures of Linear Relationship

Covariance, Correlation, Least Squares Line

Measures of Central Location
The arithmetic mean, a.k.a.
average, shortened to mean, is the
most popular & useful measure of
central location.
It is computed by simply adding up all the observations and dividing by the total number of observations:

Mean = (Sum of the observations) / (Number of observations)

Arithmetic Mean

Sample mean: x̄ = (Σxᵢ)/n
Population mean: μ = (Σxᵢ)/N

Statistics is a pattern language:

        Population   Sample
Size        N           n
Mean        μ           x̄

The Arithmetic Mean
is appropriate for describing
measurement data, e.g. heights of
people, marks of student papers, etc.

is seriously affected by extreme values

called outliers. E.g. as soon as a
billionaire moves into a neighborhood,
the average household income increases
beyond what it was previously!

Measures of Variability
Measures of central location fail to tell the whole story about the distribution; that is, how spread out are the observations around the mean value?

For example, two sets of class grades are shown. The mean (= 50) is the same in each case, but the red class has greater variability than the blue class.

The range is the simplest measure of variability,
calculated as:

Range = Largest observation − Smallest observation

Data: {4, 4, 4, 4, 50} Range = 46
Data: {4, 8, 15, 24, 39, 50} Range = 46
The range is the same in both cases,
but the data sets have very different distributions

Statistics is a pattern language:

          Population   Sample
Size          N           n
Variance      σ²          s²

Variance
The variance of a population (with population mean μ and population size N) is:
σ² = Σ(xᵢ − μ)² / N

The variance of a sample (with sample mean x̄) is:
s² = Σ(xᵢ − x̄)² / (n − 1)

Note! the denominator is sample size (n) minus one!

Example 4.7. The following sample consists of the
number of jobs six randomly selected students
applied for: 17, 15, 23, 7, 9, 13.
Find its mean and variance.

What are we looking to calculate?

The following sample consists of the number of

jobs six randomly selected students applied for:
17, 15, 23, 7, 9, 13.
Find its mean and variance.
We compute the sample statistics x̄ and s², as opposed to μ or σ².

Sample Mean & Variance
Sample mean: x̄ = (Σyᵢ)/n
Sample variance: s² = Σ(yᵢ − x̄)² / (n − 1)
Sample variance (shortcut method): s² = [Σyᵢ² − (Σyᵢ)²/n] / (n − 1)
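A quick check of Example 4.7 in Python, computing the sample variance by both the definitional and the shortcut formula:

```python
# Example 4.7 data: number of jobs applied for by six students
data = [17, 15, 23, 7, 9, 13]
n = len(data)

mean = sum(data) / n                                     # sample mean

# definitional formula: sum of squared deviations over n - 1
var = sum((y - mean) ** 2 for y in data) / (n - 1)

# shortcut formula: (sum of squares - (sum)^2 / n) / (n - 1)
var_shortcut = (sum(y * y for y in data) - sum(data) ** 2 / n) / (n - 1)
```

Both formulas give the same answer; the shortcut avoids computing the mean first.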

Standard Deviation
The standard deviation is simply the
square root of the variance, thus:

Population standard deviation: σ = √σ²

Sample standard deviation: s = √s²

Standard Deviation
Consider Example 4.8 where a golf
club manufacturer has designed a
new club and wants to determine if it
is hit more consistently (i.e. with less
variability) than with an old club.
Using Tools > Data Analysis > Descriptive Statistics in Excel [you may need to add in the Analysis ToolPak], we produce the following tables for distance with the new club. You get more consistent distance with the new club.
The Empirical Rule
If the histogram is bell shaped:
Approximately 68% of all observations fall within one standard deviation of the mean.

Approximately 95% of all observations fall

within two standard deviations of the mean.

Approximately 99.7% of all observations fall

within three standard deviations of the mean.

Chebysheff's Theorem
A more general interpretation of the standard deviation is derived from Chebysheff's Theorem, which applies to all shapes of histograms (not just bell shaped).

The proportion of observations in any sample that lie within k standard deviations of the mean is at least: 1 − 1/k²

For k = 2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a lower bound compared to the Empirical Rule's approximation (95%). The theorem is not often used because the interval is very wide.
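The bound is one line of code; a small sketch comparing it with the Empirical Rule's bell-shaped approximations:

```python
def chebyshev_bound(k):
    """Chebysheff's lower bound on the proportion of observations
    within k standard deviations of the mean (any histogram shape)."""
    return 1 - 1 / k ** 2

# For k = 2 the bound is 3/4, vs. the Empirical Rule's ~95% for bell shapes;
# for k = 3 it is 8/9, vs. ~99.7%.
```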
Box Plots
These box plots are based on data in the example.

Wendy's service time is shortest and least variable.

Hardee's has the greatest variability, while Jack-in-the-Box has the longest service times.

Methods of Collecting Data
There are many methods used to collect or obtain data for statistical analysis. Three of the most popular methods are:
Direct Observation,
Experiments, and
Surveys.
Recall that statistical inference permits us to draw
conclusions about a population based on a sample.

Sampling (i.e. selecting a sub-set of a whole

population) is often done for reasons of cost (it's less
expensive to sample 1,000 television viewers than
100 million TV viewers) and practicality (e.g.
performing a crash test on every automobile
produced is impractical).

In any case, the sampled population and the

target population should be similar to one another.

Sampling Plans
A sampling plan is just a method or
procedure for specifying how a sample will be
taken from a population.

We will focus our attention on these three methods:
Simple Random Sampling,

Stratified Random Sampling, and
Cluster Sampling.
Simple Random Sampling
A simple random sample is a sample
selected in such a way that every possible
sample of the same size is equally likely to
be chosen.

Drawing three names from a hat containing

all the names of the students in the class is
an example of a simple random sample: any
group of three names is as equally likely as
picking any other group of three names.

Stratified Random Sampling
After the population has been
stratified, we can use simple
random sampling to generate the
complete sample:

If we only have sufficient resources to sample 400 people total, we would draw 100 of them from the low income group;

if we are sampling 1000 people, we'd draw 50 of them from the high income group.
Cluster Sampling
A cluster sample is a simple random sample of
groups or clusters of elements (vs. a simple
random sample of individual objects).

This method is useful when it is difficult or costly

to develop a complete list of the population
members or when the population elements are
widely dispersed geographically.

Cluster sampling may increase sampling error

due to similarities among cluster members.

Sampling Error
Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.
Another way to look at this: the differences in results for different samples (of the same size) are due to sampling error.

E.g. Two samples of size 10 of 1,000 households. If we

happened to get the highest income level data points in
our first sample and all the lowest income levels in the
second, this delta is due to sampling error.

Nonsampling Error
Nonsampling errors are more serious and are
due to mistakes made in the acquisition of data or
due to the sample observations being selected
improperly. Three types of nonsampling errors:

Errors in data acquisition,

Nonresponse errors, and
Selection bias.

Note: increasing the sample size will not reduce

this type of error.

Approaches to Assigning Probabilities
There are three ways to assign a probability, P(O i),
to an outcome, Oi, namely:

Classical approach: make certain assumptions (such as equally likely outcomes, independence) about the random experiment.

Relative frequency: assigning probabilities

based on experimentation or historical data.

Subjective approach: assigning probabilities based on the assignor's judgment.

Interpreting Probability
One way to interpret probability is this:

If a random experiment is repeated an infinite

number of times, the relative frequency for any
given outcome is the probability of this outcome.

For example, the probability of heads in a flip of a

balanced coin is .5, determined using the classical
approach. The probability is interpreted as being
the long-term relative frequency of heads if the
coin is flipped an infinite number of times.

Conditional Probability
Conditional probability is used to determine how two events are related; that is, we can determine the probability of one event given the occurrence of another related event.

Conditional probabilities are written as P(A | B), read as "the probability of A given B", and calculated as:
P(A | B) = P(A and B) / P(B)
One of the objectives of calculating conditional probability is to determine whether two events are related.
In particular, we would like to know whether they are

independent, that is, if the probability of one event
is not affected by the occurrence of the other event.

Two events A and B are said to be independent if

P(A|B) = P(A)
P(B|A) = P(B)

Complement Rule
The complement of an event A is the event that occurs
when A does not occur.

The complement rule gives us the probability of an

event NOT occurring. That is:

P(Aᶜ) = 1 − P(A)

For example, in the simple roll of a die, the probability

of the number 1 being rolled is 1/6. The probability that some number other than 1 will be rolled is 1 − 1/6 = 5/6.

Multiplication Rule
The multiplication rule is used to
calculate the joint probability of
two events. It is based on the
formula for conditional probability
defined earlier:
P(A | B) = P(A and B) / P(B)
If we multiply both sides of the equation by P(B) we have:

P(A and B) = P(A | B) P(B)

Likewise, P(A and B) = P(B | A) P(A)

If A and B are independent events, then P(A and B) = P(A)P(B)

Addition Rule
Recall: the addition rule was introduced
earlier to provide a way to compute the
probability of event A or B or both A and B
occurring; i.e. the union of A and B.

P(A or B) = P(A) + P(B) − P(A and B)

Why do we subtract the joint probability P(A and B) from the sum of the probabilities of A and B?

P(A or B) = P(A) + P(B) − P(A and B)

Addition Rule for Mutually Exclusive Events
If A and B are mutually exclusive, the occurrence of one event makes the other one impossible. This means that

P(A and B) = 0

The addition rule for mutually exclusive events is

P(A or B) = P(A) + P(B)

We often use this form when we add some joint probabilities calculated from a probability tree.
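The complement and addition rules can be verified by brute force over a single fair die. The events A and B below are illustrative choices, not from the slides:

```python
from fractions import Fraction

die = range(1, 7)  # equally likely outcomes of one fair die

def P(event):
    """Probability of an event (a predicate) under the classical approach."""
    return Fraction(sum(1 for x in die if event(x)), 6)

# Two illustrative events:
A = lambda x: x % 2 == 0   # the roll is even
B = lambda x: x > 3        # the roll is greater than 3

p_A, p_B = P(A), P(B)
p_A_and_B = P(lambda x: A(x) and B(x))
p_A_or_B = P(lambda x: A(x) or B(x))
```

Exact fractions make the rule checks unambiguous, with no floating-point rounding.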

Two Types of Random Variables
Discrete Random Variable
one that takes on a countable number of values
E.g. values on the roll of two dice: 2, 3, 4, …, 12

Continuous Random Variable

one whose values are not discrete, not countable
E.g. time (30.1 minutes? 30.10000001 minutes?)

Integers are Discrete, while Real Numbers are Continuous.

Laws of Expected Value
1. E(c) = c
The expected value of a constant (c) is just
the value of the constant.

2. E(X + c) = E(X) + c
3. E(cX) = cE(X)
We can pull a constant out of the
expected value expression (either as part of
a sum with a random variable X or as a
coefficient of random variable X).
Laws of Variance
1. V(c) = 0
The variance of a constant (c) is zero.

2. V(X + c) = V(X)
The variance of a random variable and a constant is
just the variance of the random variable (per 1 above).

3. V(cX) = c2V(X)
The variance of a random variable and a constant
coefficient is the coefficient squared times the variance
of the random variable.
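The laws above can be spot-checked numerically on a fair die's distribution; c = 10 is an arbitrary constant for illustration:

```python
# Distribution of a fair die: values and their probabilities
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

def E(xs):
    """Expected value of a transformed die roll."""
    return sum(p * x for p, x in zip(probs, xs))

def V(xs):
    """Variance via E[X^2] - (E[X])^2."""
    return E([x * x for x in xs]) - E(xs) ** 2

c = 10                              # an arbitrary constant
shifted = [x + c for x in values]   # X + c
scaled = [c * x for x in values]    # cX
```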

Binomial Distribution
The binomial distribution is the probability
distribution that results from doing a binomial
experiment. Binomial experiments have the
following properties:

1. Fixed number of trials, represented as n.

2. Each trial has two possible outcomes, a success
and a failure.
3. P(success)=p (and thus: P(failure)=1p), for all trials.
4. The trials are independent, which means that the
outcome of one trial does not affect the outcomes of
any other trials.

Binomial Random Variable
The binomial random variable
counts the number of successes in n
trials of the binomial experiment. It
can take on values from 0, 1, 2, …, n. Thus, it's a discrete random variable.

To calculate the probability associated with each value we use the binomial probability formula:
P(X = x) = [n!/(x!(n − x)!)] pˣ (1 − p)ⁿ⁻ˣ, for x = 0, 1, 2, …, n
Binomial Table
What is the probability that Pat fails the quiz?
i.e. what is P(X ≤ 4), given P(success) = .20 and n = 10?

P(X ≤ 4) = .967
Binomial Table
What is the probability that Pat gets
two answers correct?
i.e. what is P(X = 2), given
P(success) = .20 and n=10 ?

P(X = 2) = P(X ≤ 2) − P(X ≤ 1) = .678 − .376 = .302

remember, the table shows cumulative probabilities
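Pat's quiz probabilities can be recomputed directly from the binomial formula. A sketch in pure Python (math.comb needs Python 3.8+):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable: C(n,x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """Cumulative probability P(X <= x), as tabulated in the binomial table."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

# Pat's quiz: n = 10 questions, P(success) = .20 per question
p_two_correct = binom_pmf(2, 10, 0.20)   # P(X = 2)
p_fail = binom_cdf(4, 10, 0.20)          # P(X <= 4)
```

Differencing consecutive cumulative values recovers the individual probabilities, exactly as with the table.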
There is a binomial distribution function in Excel that can also be used to calculate these probabilities, given the number of successes (x) and the number of trials (n).

For example: What is the probability that Pat gets two answers correct? (i.e. P(X = 2))

And: What is the probability that Pat fails the quiz? (i.e. P(X ≤ 4))

Binomial Distribution
As you might expect, statisticians have developed general formulas for the mean, variance, and standard deviation of a binomial random variable. They are:
μ = np
σ² = np(1 − p)
σ = √(np(1 − p))

Poisson Distribution
Named for Siméon Poisson, the Poisson
distribution is a discrete probability distribution
and refers to the number of events (a.k.a.
successes) within a specific time period or region
of space. For example:
The number of cars arriving at a service station in 1
hour. (The interval of time is 1 hour.)
The number of flaws in a bolt of cloth. (The specific
region is a bolt of cloth.)
The number of accidents in 1 day on a particular
stretch of highway. (The interval is defined by both time,
1 day, and space, the particular stretch of highway.)

The Poisson Experiment
Like a binomial experiment, a Poisson experiment
has four defining characteristic properties:
1. The number of successes that occur in any interval is
independent of the number of successes that occur
in any other interval.
2. The probability of a success in an interval is the
same for all equal-size intervals
3. The probability of a success is proportional to the
size of the interval.
4. The probability of more than one success in an interval approaches 0 as the interval becomes smaller.
Poisson Distribution
The Poisson random variable is the number
of successes that occur in a period of time or
an interval of space in a Poisson experiment.

E.g. On average, 96 trucks arrive at a border crossing every hour. (the time period is 1 hour)

E.g. The number of typographic errors in a new textbook edition averages 1.5 per 100 pages. (the "successes" (?!) are typos; the interval is 100 pages)

Poisson Probability
The probability that a Poisson random variable assumes a value of x is given by:
P(X = x) = e^(−μ) μˣ / x!, for x = 0, 1, 2, …
where μ is the mean number of successes in the interval or region and e is the natural logarithm base.

Example 7.12
The number of typographical errors in new
editions of textbooks varies considerably
from book to book. After some analysis, an instructor concludes that the number of errors is
Poisson distributed with a mean of 1.5 per
100 pages. The instructor randomly selects
100 pages of a new book. What is the
probability that there are no typos?

That is, what is P(X = 0) given that μ = 1.5?

P(X = 0) = e^(−1.5)(1.5)⁰/0! = .2231

There is about a 22% chance of finding zero errors
Poisson Distribution
As mentioned on the Poisson experiment slide:

The probability of a success is

proportional to the size of the interval

Thus, knowing an error rate of 1.5 typos per

100 pages, we can determine a mean value for
a 400 page book as:

μ = 1.5(4) = 6 typos per 400 pages.

Example 7.13
For a 400 page book, what is the
probability that there are
no typos?

P(X = 0) = e^(−6)(6)⁰/0! = .002479
there is a very small chance there are no typos
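Examples 7.12 and 7.13 can be checked with a direct sketch of the Poisson formula:

```python
import math

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson random variable with mean mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

# Example 7.12: mean 1.5 typos per 100 pages
p0_100pages = poisson_pmf(0, 1.5)

# Example 7.13: the mean scales with the interval, 1.5 * 4 = 6 per 400 pages
p0_400pages = poisson_pmf(0, 6.0)
```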

Example 7.13
Excel is an even better alternative:

Probability Density Functions
Unlike a discrete random variable which we studied in Chapter 7, a continuous random variable is one that can assume an uncountable number of values.
We cannot list the possible values because there is an infinite number of them.
Because there is an infinite number of values, the probability of each individual value is virtually 0.
Point Probabilities are Zero
Because there is an infinite number of values, the
probability of each individual value is virtually 0.

Thus, we can determine the probability of a range

of values only.

E.g. with a discrete random variable like tossing a die, it is

meaningful to talk about P(X=5), say.
In a continuous setting (e.g. with time as a random variable), the
probability the random variable of interest, say task length, takes
exactly 5 minutes is infinitesimally small, hence P(X=5) = 0.
It is meaningful to talk about P(X ≤ 5).

Probability Density Functions
A function f(x) is called a probability density function (over the range a ≤ x ≤ b) if it meets the following requirements:

1) f(x) ≥ 0 for all x between a and b, and
2) the total area under the curve between a and b is 1.0.
The Normal Distribution
The normal distribution is the most important of
all probability distributions. The probability density function of a normal random variable is given by:

f(x) = (1/(σ√2π)) e^(−(x−μ)²/(2σ²)),  −∞ < x < ∞

It looks like this:

Bell shaped,
Symmetrical around the mean

The Normal Distribution
Important things to note:
The normal distribution is fully defined by two parameters:
its standard deviation and mean

The normal distribution is bell shaped and

symmetrical about the mean

Unlike the range of the uniform distribution (a ≤ x ≤ b), normal distributions range from minus infinity to plus infinity.
Standard Normal Distribution
A normal distribution whose mean is zero (μ = 0) and whose standard deviation is one (σ = 1) is called the standard normal distribution.

As we shall see shortly, any normal distribution can be

converted to a standard normal distribution with
simple algebra. This makes calculations much easier.

Calculating Normal Probabilities
We can use the following function to convert any normal random variable to a standard normal random variable:

Z = (X − μ)/σ

Some advice: always draw a picture!
Calculating Normal Probabilities
Example: The time required to build a computer is
normally distributed with a mean of 50 minutes
and a standard deviation of 10 minutes:

What is the probability that a computer is

assembled in a time between 45 and 60 minutes?

Algebraically speaking, what is P(45 < X < 60) ?

Calculating Normal Probabilities
mean of 50 minutes and a
standard deviation of 10 minutes
P(45 < X < 60) = P((45 − 50)/10 < Z < (60 − 50)/10) = P(−.5 < Z < 1)

Calculating Normal Probabilities
We can use Table 3 in
Appendix B to look-up
probabilities P(0 < Z < z)

We can break up P(−.5 < Z < 1) into:

P(−.5 < Z < 0) + P(0 < Z < 1)

The distribution is symmetric around zero, so we have:

P(−.5 < Z < 0) = P(0 < Z < .5)
Hence: P(−.5 < Z < 1) = P(0 < Z < .5) + P(0 < Z < 1)

Calculating Normal Probabilities
How to use Table 3
This table gives probabilities P(0 < Z < z)
First column = integer + first decimal
Top row = second decimal place

P(0 < Z < 0.5) = .1915

P(0 < Z < 1) = .3413

P(−.5 < Z < 1) = .1915 + .3413 = .5328
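The same probability can be computed without tables via the error function. A sketch, where phi is the standard normal CDF:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Assembly-time example: mu = 50, sigma = 10; standardize, then use Z
mu, sigma = 50, 10
p = phi((60 - mu) / sigma) - phi((45 - mu) / sigma)   # P(45 < X < 60)
```

This reproduces the table-based answer of .5328 to four decimals.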

Using the Normal Table (Table 3)
What is P(Z > 1.6)?
P(0 < Z < 1.6) = .4452

P(Z > 1.6) = .5 − P(0 < Z < 1.6)
= .5 − .4452
= .0548
Using the Normal Table (Table 3)
What is P(Z < −2.23)?
P(0 < Z < 2.23) = .4871

P(Z < −2.23) = P(Z > 2.23)
= .5 − P(0 < Z < 2.23)
= .0129
Using the Normal Table (Table 3)
What is P(Z < 1.52)?
P(Z < 0) = .5 and P(0 < Z < 1.52) = .4357

P(Z < 1.52) = .5 + P(0 < Z < 1.52)
= .5 + .4357
= .9357
Using the Normal Table (Table 3)
What is P(0.9 < Z < 1.9)?
P(0 < Z < 0.9) = .3159

P(0.9 < Z < 1.9) = P(0 < Z < 1.9) − P(0 < Z < 0.9)
= .4713 − .3159
= .1554
Finding Values of Z
z_A denotes the value of Z such that the area to its right under the standard normal curve is A.

Other Z values are:
Z.05 = 1.645
Z.01 = 2.33

Using the values of Z:
Because z.025 = 1.96 and −z.025 = −1.96, it follows that we can state

P(−1.96 < Z < 1.96) = .95

Similarly, P(−1.645 < Z < 1.645) = .90

Other Continuous Distributions
Three other important continuous
distributions which will be used
extensively in later sections are
introduced here:

Student t Distribution,
Chi-Squared Distribution, and
F Distribution.

Student t Distribution
Here the letter t is used to represent the random variable, hence the name. The density function for the Student t distribution is as follows:

ν (nu) is called the degrees of freedom, and
Γ (the Gamma function) is Γ(k) = (k−1)(k−2)⋯(2)(1)

Student t Distribution
In much the same way that μ and σ define the normal distribution, ν, the degrees of freedom, defines the Student t distribution:

Figure 8.24
As the number of degrees of freedom increases, the t
distribution approaches the standard normal distribution.

Determining Student t Values
The Student t distribution is used extensively in statistical inference. Table 4 in Appendix B lists values of t_{A,ν}.

That is, values of a Student t random variable with ν degrees of freedom such that:
P(t > t_A) = A

The values for A are pre-determined critical values, typically in the 10%, 5%, 2.5%, 1% and ½% range.

Using the t table (Table 4) for t values
For example, if we want the value of t with 10 degrees of freedom such that the area A under the Student t curve to its right is .05:
t.05,10 = 1.812

Degrees of freedom: ROW; values of A: COLUMN

F Distribution
The F density function is given by:

F > 0. Two parameters define this distribution, and like we've already seen these are again degrees of freedom:
ν₁ is the numerator degrees of freedom and
ν₂ is the denominator degrees of freedom.

Determining Values of F
For example, what is the value of F for 5% of the area under the right hand tail of the curve, with a numerator degree of freedom of 3 and a denominator degree of freedom of 7?

F.05,3,7 = 4.35 (use the F look-up, Table 6)

There are different tables for different values of A; make sure you start with the correct one.
Denominator degrees of freedom: ROW
Numerator degrees of freedom: COLUMN
Determining Values of F
For areas under the curve on the left hand side of the curve, we can leverage the following relationship:

F_{1−A,ν1,ν2} = 1 / F_{A,ν2,ν1}

Pay close attention to the order of the terms!

Chapter 9

Sampling Distributions

Sampling Distribution of the Mean
A fair die is thrown infinitely many times,
with the random variable X = # of spots on
any throw.

The probability distribution of X is:

x     1    2    3    4    5    6
P(x) 1/6  1/6  1/6  1/6  1/6  1/6

and the mean and variance are calculated as well:
μ = 3.5, σ² = 2.92
Sampling Distribution of Two Dice
A sampling distribution is created by looking at all samples of size n = 2 (i.e. two dice) and their means x̄.

While there are 36 possible samples of size 2, there are only 11 values for x̄, and some (e.g. x̄ = 3.5) occur more frequently than others (e.g. x̄ = 1).

Sampling Distribution of Two Dice
The sampling distribution of x̄ is shown below:

x̄     P(x̄)
1.0   1/36
1.5   2/36
2.0   3/36
2.5   4/36
3.0   5/36
3.5   6/36
4.0   5/36
4.5   4/36
5.0   3/36
5.5   2/36
6.0   1/36

Compare the distribution of X with the sampling distribution of x̄.

As well, note that: μ_x̄ = μ and σ²_x̄ = σ²/n
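The two-dice sampling distribution can be enumerated exactly. A sketch using fractions so the table values and moments come out exact:

```python
from itertools import product
from fractions import Fraction
from collections import Counter

# All 36 equally likely samples of size n = 2 and their means
means = [Fraction(a + b, 2) for a, b in product(range(1, 7), repeat=2)]
dist = {m: Fraction(c, 36) for m, c in Counter(means).items()}

mu = sum(p * m for m, p in dist.items())                 # mean of x-bar
var = sum(p * (m - mu) ** 2 for m, p in dist.items())    # variance of x-bar
```

The variance comes out to σ²/2, i.e. half the single-die variance of 35/12, matching σ²_x̄ = σ²/n.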

Central Limit Theorem
The sampling distribution of the
mean of a random sample drawn
from any population is
approximately normal for a
sufficiently large sample size.

The larger the sample size, the more closely the sampling distribution of x̄ will resemble a normal distribution.
Central Limit Theorem
If the population is normal, then x̄ is normally distributed for all values of n.

If the population is non-normal, then x̄ is approximately normal only for larger values of n.

In many practical situations, a sample size of 30 may be sufficiently large to allow us to use the normal distribution as an approximation for the sampling distribution of x̄.

Sampling Distribution of the Sample Mean

1. μ_x̄ = μ
2. σ²_x̄ = σ²/n (i.e. σ_x̄ = σ/√n)
3. If X is normal, x̄ is normal. If X is nonnormal, x̄ is approximately normal for sufficiently large sample sizes.
Note: the definition of sufficiently large
depends on the extent of nonnormality of x
(e.g. heavily skewed; multimodal)
Example 9.1(a)
The foreman of a bottling plant has
observed that the amount of soda in each
32-ounce bottle is actually a normally
distributed random variable, with a mean
of 32.2 ounces and a standard deviation of
.3 ounce.

If a customer buys one bottle, what is the

probability that the bottle will contain
more than 32 ounces?
Example 9.1(a)
We want to find P(X > 32), where X is normally distributed with μ = 32.2 and σ = .3

P(X > 32) = P(Z > (32 − 32.2)/.3) = P(Z > −.67) = .7486

there is about a 75% chance that a single bottle of soda contains more than 32 oz.
Example 9.1(b)
The foreman of a bottling plant has observed
that the amount of soda in each 32-ounce
bottle is actually a normally distributed
random variable, with a mean of 32.2 ounces
and a standard deviation of .3 ounce.

If a customer buys a carton of four bottles,

what is the probability that the mean
amount of the four bottles will be greater
than 32 ounces?

Example 9.1(b)
We want to find P(x̄ > 32), where x̄ is normally distributed with μ = 32.2 and σ = .3

Things we know:
1) X is normally distributed, therefore so is x̄.
2) μ_x̄ = μ = 32.2 oz.
3) σ_x̄ = σ/√n = .3/√4 = .15

Example 9.1(b)
If a customer buys a carton of four bottles,
what is the probability that the mean
amount of the four bottles will be greater
than 32 ounces?

P(x̄ > 32) = P(Z > (32 − 32.2)/.15) = P(Z > −1.33) = .9082

There is about a 91% chance the mean of the four bottles will exceed 32 oz.
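Both parts of Example 9.1 can be checked numerically, again using the error function for the standard normal CDF:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 32.2, 0.3

# (a) one bottle: P(X > 32)
p_one = 1 - phi((32 - mu) / sigma)

# (b) mean of four bottles: the standard error is sigma / sqrt(n)
se = sigma / sqrt(4)
p_four = 1 - phi((32 - mu) / se)
```

The smaller standard error for the sample mean is what pushes the probability from ~75% up to ~91%.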
Graphically Speaking
What is the probability that one bottle will contain more than 32 ounces?
What is the probability that the mean of four bottles will exceed 32 oz?

Sampling Distribution: Difference
of two means
The final sampling distribution introduced is that of the difference between two sample means. This requires that independent random samples be drawn from each of two normal populations.

If this condition is met, then the sampling distribution of

the difference between the two sample means, i.e.
will be normally distributed.
(note: if the two populations are not both normally
distributed, but the sample sizes are large (>30), the
distribution of is approximately normal)

Sampling Distribution: Difference
of two means
The expected value and variance of the sampling distribution of x̄₁ − x̄₂ are given by:
mean: μ₁ − μ₂
variance: σ₁²/n₁ + σ₂²/n₂
standard deviation: √(σ₁²/n₁ + σ₂²/n₂)

(also called the standard error of the difference between two means)

There are two types of inference: estimation
and hypothesis testing; estimation is
introduced first.

The objective of estimation is to determine

the approximate value of a population
parameter on the basis of a sample statistic.

E.g., the sample mean (x̄) is employed to estimate the population mean (μ).
The objective of estimation is to determine
the approximate value of a population
parameter on the basis of a sample statistic.

There are two types of estimators:

Point Estimator

Interval Estimator

Point & Interval Estimation
For example, suppose we want to estimate the mean
summer income of a class of business students. For
n = 25 students, x̄ is calculated to be 400 $/week; that is a point estimate.

An interval estimate is an alternative statement:

The mean income is between 380 and 420 $/week.

Estimating μ when σ is Known
We established in Chapter 9 that the sample mean x̄ is in the center of the interval:

x̄ ± z_{α/2} σ/√n

Thus, the probability that this interval contains the population mean μ is 1 − α, the confidence level. This is a confidence interval estimator for μ.
Four commonly used confidence levels and their z values (Table 10.1); memorize & keep handy!

1 − α = .90: z_{α/2} = z.05 = 1.645
1 − α = .95: z_{α/2} = z.025 = 1.96
1 − α = .98: z_{α/2} = z.01 = 2.33
1 − α = .99: z_{α/2} = z.005 = 2.575
Example 10.1
A computer company samples demand during lead time over 25 time periods:

235 374 309 499 253
421 361 514 462 369
394 439 348 344 330
261 374 302 466 535
386 316 296 332 334

It is known that the standard deviation of demand over lead time is 75 computers. We want to estimate the mean demand over lead time with 95% confidence in order to set inventory levels.

Example 10.1
In order to use our confidence interval estimator, we need the following pieces of data:

Calculated from the data: x̄ = 370.16
Given: z.025 = 1.96, σ = 75, n = 25

x̄ ± z.025 σ/√n = 370.16 ± 1.96(75)/√25 = 370.16 ± 29.40

The lower and upper confidence limits are 340.76 and 399.56.

Example 10.1
The estimate of the mean demand during lead time lies between 340.76 and 399.56; we can use this as input in developing an inventory policy.

That is, we estimated that the mean demand during

lead time falls between 340.76 and 399.56, and this
type of estimator is correct 95% of the time. That
also means that 5% of the time the estimator will be

Incidentally, the media often refer to the 95% figure

as 19 times out of 20, which emphasizes the
long-run aspect of the confidence level.
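A check of Example 10.1 in code, using the 25 demand values listed above together with the given σ and z value:

```python
from math import sqrt

# Demand over 25 lead-time periods (Example 10.1)
demand = [235, 374, 309, 499, 253, 421, 361, 514, 462, 369,
          394, 439, 348, 344, 330, 261, 374, 302, 466, 535,
          386, 316, 296, 332, 334]

sigma = 75        # population standard deviation is known
z_025 = 1.96      # for 95% confidence
n = len(demand)

xbar = sum(demand) / n
half_width = z_025 * sigma / sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
```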

Interval Width
A wide interval provides little information.
For example, suppose we estimate with 95% confidence that an accountant's average starting salary is between $15,000 and $100,000.

Contrast this with: a 95% confidence interval estimate of starting salaries between $42,000 and $45,000.
The second estimate is much narrower, providing

accounting students more precise information about
starting salaries.

Interval Width
The width of the confidence interval estimate is a function of the confidence level, the population standard deviation, and the sample size.
Selecting the Sample Size
We can control the width of the interval by
determining the sample size necessary to produce
narrow intervals.

Suppose we want to estimate the mean demand to within 5 units; i.e. we want the interval estimate to be: x̄ ± 5

It follows that z_{α/2} σ/√n = 5

Solve for n to get the requisite sample size!
Selecting the Sample Size
Solving the equation 1.96(75)/√n = 5 gives
n = 864.36, rounded up to 865;

that is, to produce a 95% confidence

interval estimate of the mean (± 5
units), we need to sample 865 lead
time periods (vs. the 25 data points
we have currently).
Sample Size to Estimate a Mean
The general formula for the sample
size needed to estimate a population
mean with an interval estimate of x̄ ± w

requires a sample size of at least
this large:

n = (z(α/2) σ / w)²
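The formula above is easy to turn into code; this sketch (with the illustrative name `sample_size`) checks it against both sample-size examples in this section:

```python
import math

def sample_size(z, sigma, w):
    """Smallest n satisfying z*sigma/sqrt(n) <= w, i.e. n >= (z*sigma/w)^2."""
    return math.ceil((z * sigma / w) ** 2)

print(sample_size(1.96, 75, 5))   # 865 (demand data, 95% confidence)
print(sample_size(2.575, 6, 1))   # 239 (tree diameters, 99% confidence)
```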

Example 10.2
A lumber company must estimate the
mean diameter of trees to determine
whether or not there is sufficient lumber to
harvest an area of forest. They need to
estimate this to within 1 inch at a
confidence level of 99%. The tree
diameters are normally distributed with a
standard deviation of 6 inches.

How many trees need to be sampled?

Example 10.2
Things we know:

Confidence level = 99%, therefore α = .01

We want the estimate to be within ± 1 inch, hence w = 1.

We are given that σ = 6.
Example 10.2
We compute

n = (z(α/2) σ / w)² = (2.575 × 6 / 1)² = 238.7

That is, we will need to sample at
least 239 trees to have a
99% confidence interval of x̄ ± 1.
Nonstatistical Hypothesis Testing

A criminal trial is an example of hypothesis testing

without the statistics.
In a trial a jury must decide between two hypotheses.
The null hypothesis is
H0: The defendant is innocent

The alternative hypothesis or research hypothesis is

H1: The defendant is guilty

The jury does not know which hypothesis is true. They

must make a decision on the basis of evidence

Nonstatistical Hypothesis Testing

There are two possible errors.

A Type I error occurs when we reject
a true null hypothesis. That is, a Type
I error occurs when the jury convicts
an innocent person.

A Type II error occurs when we don't

reject a false null hypothesis. That
occurs when a guilty defendant is
acquitted.
Nonstatistical Hypothesis Testing

The probability of a Type I error is

denoted α (the Greek letter alpha).
The probability of a Type II error is
denoted β (the Greek letter beta).

The two probabilities are inversely

related. Decreasing one increases
the other.

Nonstatistical Hypothesis Testing

The critical concepts are these:

1. There are two hypotheses, the null and the alternative
2. The procedure begins with the assumption that the
null hypothesis is true.
3. The goal is to determine whether there is enough
evidence to infer that the alternative hypothesis is true.
4. There are two possible decisions:
Conclude that there is enough evidence to support the
alternative hypothesis.
Conclude that there is not enough evidence to support
the alternative hypothesis.

Nonstatistical Hypothesis Testing

5. Two possible errors can be made.

Type I error: Reject a true null hypothesis.
Type II error: Do not reject a false
null hypothesis.

P(Type I error) = α
P(Type II error) = β
Concepts of Hypothesis Testing (1)

There are two hypotheses. One is called the null

hypothesis and the other the alternative or research
hypothesis. The usual notation is:
H0: the null hypothesis (pronounced "H nought")

H1: the alternative or research hypothesis

The null hypothesis (H0) will always state that the

parameter equals the value specified in the
alternative hypothesis (H1)

Concepts of Hypothesis Testing
Consider Example 10.1 (mean demand for
computers during assembly lead time) again.
Rather than estimate the mean demand, our
operations manager wants to know whether the
mean is different from 350 units. We can
rephrase this request into a test of the hypothesis:

H0: μ = 350

Thus, our research hypothesis becomes:

H1: μ ≠ 350   (this is what we are
interested in determining)

Concepts of Hypothesis Testing (4)

There are two possible decisions that can be made:

Conclude that there is enough evidence to support

the alternative hypothesis
(also stated as: rejecting the null hypothesis in favor of
the alternative)

Conclude that there is not enough evidence to

support the alternative hypothesis
(also stated as: not rejecting the null hypothesis in favor
of the alternative)
NOTE: we do not say that we accept the null hypothesis.

Concepts of Hypothesis Testing
Once the null and alternative hypotheses are stated, the
next step is to randomly sample the population and
calculate a test statistic (in this example, the sample
mean x̄).
If the test statistic's value is inconsistent with the null

hypothesis we reject the null hypothesis and infer
that the alternative hypothesis is true.
For example, if we're trying to decide whether the mean is
not equal to 350, a large value of x̄ (say, 600) would
provide enough evidence. If x̄ is close to 350 (say, 355), we
could not say that this provides a great deal of evidence to
infer that the population mean is different from 350.

Types of Errors
A Type I error occurs when we reject a true null
hypothesis (i.e. reject H0 when it is TRUE).
A Type II error occurs when we don't reject a false
null hypothesis (i.e. do NOT reject H0 when it is FALSE).

                    H0 is True     H0 is False
Reject H0           Type I error   Correct
Do not reject H0    Correct        Type II error

Recap I
1) Two hypotheses: H0 & H1
2) Begin by assuming that H0 is TRUE
3) GOAL: determine if there is enough
evidence to infer that H1 is TRUE
4) Two possible decisions:
Reject H0 in favor of H1
Do NOT reject H0 in favor of H1
5) Two possible types of errors:
Type I: reject a true H0 [P(Type I) = α]
Type II: do not reject a false H0 [P(Type II) = β]
Example 11.1
A department store manager determines that a
new billing system will be cost-effective only if
the mean monthly account is more than $170.

A random sample of 400 monthly accounts is

drawn, for which the sample mean is $178. The
accounts are approximately normally distributed
with a standard deviation of $65.

Can we conclude that the new system will

be cost-effective?
Example 11.1
The system will be cost effective if the mean account
balance for all customers is greater than $170.

We express this belief as our research hypothesis, that is:

H1: μ > 170 (this is what we want to determine)

Thus, our null hypothesis becomes:

H0: μ = 170 (this specifies a single value for the
parameter of interest)

Example 11.1
What we want to show:
H1: μ > 170
H0: μ = 170 (we'll assume this is true)

We know:
n = 400,
x̄ = 178, and
σ = 65

Hmm. What to do next?!

Example 11.1
To test our hypotheses, we can use two
different approaches:

The rejection region approach (typically used

when computing statistics manually), and

The p-value approach (which is generally used

with a computer and statistical software).

We will explore both in turn

Example 11.1: Rejection Region
The rejection region is a range of
values such that if the test statistic
falls into that range, we decide to
reject the null hypothesis in favor of
the alternative hypothesis.

x̄L is the critical value of x̄ needed to reject H0.

Example 11.1
All that's left to do is calculate x̄L
and compare our sample mean to it.

We can calculate this based on any level of

significance (α) we want.

Example 11.1
At a 5% significance level (i.e. α = 0.05), we get

x̄L = 170 + z.05(65/√400)

Solving, we compute x̄L = 175.34

Since our sample mean (178) is greater than the
critical value we calculated (175.34), we reject the null
hypothesis in favor of H1, i.e. that: > 170 and that
it is cost effective to install the new billing system

Example 11.1: The Big Picture

H1: μ > 170    x̄L = 175.34
H0: μ = 170    x̄ = 178

Reject H0 in favor of H1
Standardized Test Statistic
An easier method is to use the standardized test statistic

z = (x̄ - μ)/(σ/√n)

and compare its result to z(α) (rejection region: z > z(α)).

Since z = (178 - 170)/(65/√400) = 2.46 > 1.645 (z.05), we reject H0 in favor of H1.


The p-value
The p-value of a test is the probability of
observing a test statistic at least as extreme
as the one computed given that the null
hypothesis is true.

In the case of our department store example,

what is the probability of observing a
sample mean at least as extreme as the
one already observed (i.e. x̄ = 178), given
that the null hypothesis (H0: μ = 170) is true?

Interpreting the p-value
The smaller the p-value, the more statistical evidence
exists to support the alternative hypothesis.
If the p-value is less than 1%, there is overwhelming
evidence that supports the alternative hypothesis.
If the p-value is between 1% and 5%, there is
strong evidence that supports the alternative
hypothesis.
If the p-value is between 5% and 10%, there is weak
evidence that supports the alternative hypothesis.
If the p-value exceeds 10%, there is no evidence that
supports the alternative hypothesis.
We observe a p-value of .0069, hence there is
overwhelming evidence to support H1: > 170.

Interpreting the p-value
Compare the p-value with the selected value of the
significance level:

If the p-value is less than α, we judge the p-value

to be small enough to reject the null hypothesis.

If the p-value is greater than α, we do not reject

the null hypothesis.

Since p-value = .0069 < α = .05, we reject H0

in favor of H1
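The z statistic and right-tail p-value for this example can be cross-checked with Python's standard library (a sketch, not part of the original slides):

```python
from statistics import NormalDist

# Example 11.1: H0: mu = 170, H1: mu > 170
xbar, mu0, sigma, n = 178, 170, 65, 400
z = (xbar - mu0) / (sigma / n ** 0.5)   # standardized test statistic
p_value = 1 - NormalDist().cdf(z)       # right-tail p-value
print(round(z, 2), round(p_value, 4))   # 2.46 0.0069
```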

Chapter-Opening Example

The objective of the study is to draw a conclusion

about the mean payment period. Thus, the parameter
to be tested is the population mean. We want to know
whether there is enough statistical evidence to show
that the population mean is less than 22 days. Thus,
the alternative hypothesis is

H1: μ < 22

The null hypothesis is

H0: μ = 22

Chapter-Opening Example
The z test statistic is

z = (x̄ - μ)/(σ/√n)

We wish to reject the null hypothesis in favor of

the alternative only if the sample mean and
hence the value of the test statistic is small
enough. As a result we locate the rejection
region in the left tail of the sampling distribution.
We set the significance level at 10%.

Chapter-Opening Example
Rejection region: z < -z(α) = -z.10 = -1.28

From the data in SSA we compute x̄ = 21.63
(with n = 220 and σ = 6):

z = (x̄ - μ)/(σ/√n) = (21.63 - 22)/(6/√220) = -.91

p-value = P(Z < -.91) = .5 - .3186 = .1814
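As a cross-check, the same calculation in Python (the slides round z to -.91 before using the table, so the exact p-value comes out near .18):

```python
from statistics import NormalDist

# Chapter-opening example: H0: mu = 22, H1: mu < 22
xbar, mu0, sigma, n = 21.63, 22, 6, 220
z = (xbar - mu0) / (sigma / n ** 0.5)   # about -0.91
p_value = NormalDist().cdf(z)           # left-tail p-value, about .18
print(round(z, 2), round(p_value, 2))
```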

Chapter-Opening Example

Conclusion: There is not enough evidence

to infer that the mean is less than 22.

There is not enough evidence to infer

that the plan will be profitable.

Since z = -.91 > -z.10 = -1.28,

we fail to reject H0: μ = 22
at a 10% level of significance.

Right-Tail Testing
Calculate the critical value of the
mean (x̄L) and compare it against the
observed value of the sample mean (x̄).

Left-Tail Testing
Calculate the critical value of the
mean (x̄L) and compare it against the
observed value of the sample mean (x̄).

Two-Tail Testing
Two-tail testing is used when we want
to test a research hypothesis that a
parameter is not equal (≠) to some value.
Example 11.2
AT&T argues that its rates are such that customers won't
see a difference in their phone bills between them and
their competitors. They calculate the mean and standard
deviation for all their customers at $17.09 and $3.87, respectively.

They then sample 100 customers at random and

recalculate a monthly phone bill based on competitors' rates.
What we want to show is whether or not:

H1: μ ≠ 17.09. We do this by assuming that:
H0: μ = 17.09

Example 11.2
The rejection region is set up so we can reject the null
hypothesis when the test statistic is large or when it is
small.

That is, we set up a two-tail rejection region. The total

area in the rejection region must sum to α, so we
divide this probability by 2.

Example 11.2
At a 5% significance level (i.e. α = .05), we have
α/2 = .025. Thus, z.025 = 1.96 and
our rejection region is:

z < -1.96  or  z > 1.96

Example 11.2
From the data, we calculate x̄ = 17.55.

Using our standardized test statistic, we find that:

z = (x̄ - μ)/(σ/√n) = (17.55 - 17.09)/(3.87/√100) = 1.19

Since z = 1.19 is not greater than 1.96, nor less than

-1.96, we cannot reject the null hypothesis in favor of H1.
That is, there is insufficient evidence to infer that
there is a difference between the bills of AT&T
and the competitor.
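A quick check of Example 11.2's two-tail test in Python (values from the example):

```python
from statistics import NormalDist

# Example 11.2: H0: mu = 17.09, H1: mu != 17.09
xbar, mu0, sigma, n = 17.55, 17.09, 3.87, 100
z = (xbar - mu0) / (sigma / n ** 0.5)           # about 1.19
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-tail p-value
print(round(z, 2), round(p_value, 2))           # 1.19 0.23
```

Since the two-tail p-value exceeds α = .05, this agrees with the conclusion above that H0 cannot be rejected.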


Summary of One- and Two-Tail Tests

One-Tail Test (left tail) | Two-Tail Test | One-Tail Test (right tail)

Inference About a Population

We will develop techniques to estimate and

test three population parameters:
Population Mean μ
Population Variance σ²
Population Proportion p
Inference With Variance Unknown

Previously, we looked at estimating and testing

the population mean when the population
standard deviation (σ) was known or given:

z = (x̄ - μ)/(σ/√n)

But how often do we know the actual

population variance?

Instead, we use the Student t-statistic, given by:

t = (x̄ - μ)/(s/√n)

Testing μ When σ is Unknown
When the population standard
deviation is unknown and the
population is normal, the test
statistic for testing hypotheses about μ is

t = (x̄ - μ)/(s/√n)

which is Student t distributed with
ν = n - 1 degrees of freedom. The
confidence interval estimator of μ is x̄ ± t(α/2) s/√n.
Example 12.1
Will new workers achieve 90% of the level of
experienced workers within one week of
being hired and trained?

Experienced workers can process 500

packages/hour, thus if our conjecture is
correct, we expect new workers to be able to
process .90(500) = 450 packages per hour.

Given the data, is this the case?


Example 12.1
Our objective is to describe the population of the
numbers of packages processed in 1 hour by new
workers, that is, we want to know whether the new
workers' productivity is more than 90% of that of
experienced workers. Thus we have:

H1: μ > 450

Therefore we set our usual null hypothesis to:

H0: μ = 450


Example 12.1
Our test statistic is:

t = (x̄ - 450)/(s/√n)
With n=50 data points, we have n1=49 degrees of

freedom. Our hypothesis under question is:
H1: μ > 450
Our rejection region becomes:

t > t(α, n-1) = t(.05, 49)

Thus we will reject the null hypothesis in favor of the

alternative if our calculated test statistic falls in this
rejection region.

Example 12.1
From the data, we calculate x̄ = 460.38, s = 38.83, and thus:

t = (460.38 - 450)/(38.83/√50) = 1.89

Since this falls in the rejection region,

we reject H0 in favor of H1, that is, there is

sufficient evidence to conclude that the new
workers are producing at more than 90% of
the average of experienced workers.
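Example 12.1's t statistic is easy to verify in Python (the critical value t(.05, 49) ≈ 1.676 is an assumed table value, not part of the slides):

```python
import math

# Example 12.1: H0: mu = 450, H1: mu > 450 (sigma unknown, use s)
xbar, mu0, s, n = 460.38, 450, 38.83, 50
t = (xbar - mu0) / (s / math.sqrt(n))
# t(.05, 49) ~ 1.676 is an assumed t-table value for the right-tail region
print(round(t, 2), t > 1.676)  # 1.89 True
```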

Example 12.2
Can we estimate the return on
investment for companies that won
quality awards?

We are given a random sample of n

= 83 such companies. We want to
construct a 95% confidence interval
for the mean return μ, i.e. x̄ ± t(α/2) s/√n

Example 12.2
From the data, we calculate x̄ and s;

for the t(α/2) term we use t(.025, 82),

and so the interval follows.

Check Requisite Conditions
The Student t distribution is robust, which means
that if the population is nonnormal, the results of
the t-test and confidence interval estimate are still
valid provided that the population is not
extremely nonnormal.

To check this requirement, draw a histogram of

the data and see how bell shaped the resulting
figure is. If a histogram is extremely skewed (say in
the case of an exponential distribution), that could
be considered extremely nonnormal and hence t-
statistics would not be valid in this case.

Inference About a Population Variance
If we are interested in drawing inferences about a
population's variability, the parameter we need to
investigate is the population variance σ².

The sample variance (s²) is an unbiased, consistent and

efficient point estimator for σ². Moreover,

the statistic χ² = (n - 1)s²/σ² has a chi-squared distribution

with n - 1 degrees of freedom.

Testing & Estimating a Population Variance
Combining this statistic with the probability statement

P( χ²(1-α/2) < (n - 1)s²/σ² < χ²(α/2) ) = 1 - α

yields the confidence interval for σ²:

lower confidence limit = (n - 1)s²/χ²(α/2)
upper confidence limit = (n - 1)s²/χ²(1-α/2)


Example 12.3
Consider a container filling machine.
Management wants a machine to fill 1 liter
(1,000 ccs) so that the variance of the fills is
less than 1 cc². A random sample of n = 25 one-liter
fills was taken. Does the machine perform as it
should at the 5% significance level?
Variance is less than 1 cc2

We want to show that:

H1: σ² < 1
(so our null hypothesis becomes: H0: σ² = 1). We
will use this test statistic:

χ² = (n - 1)s²/σ²

Example 12.3
Since our alternative hypothesis is phrased as:
H1: σ² < 1

we will reject H0 in favor of H1 if our test statistic

falls into this rejection region:

χ² < χ²(1-α, n-1) = χ²(.95, 24) = 13.85

We compute the sample variance to be s² = .8088,

and thus our test statistic takes on this value:

χ² = (n - 1)s²/σ² = 24(.8088)/1 = 19.41
Example 12.4
As we saw, we cannot reject the null hypothesis
in favor of the alternative. That is, there is not
enough evidence to infer that the claim is true.
Note: the result does not say that the variance
is greater than 1, rather it merely states that
we are unable to show that the variance is
less than 1.

We could estimate (at 99% confidence say) the

variance of the fills


Example 12.4
In order to create a confidence interval
estimate of the variance, we need these values:

lower confidence limit = (n - 1)s²/χ²(α/2)
upper confidence limit = (n - 1)s²/χ²(1-α/2)

We know (n - 1)s² = 19.41 from our previous

calculation, and we have from Table 5 in
Appendix B: χ²(.005, 24) = 45.56 and χ²(.995, 24) = 9.89
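A sketch of this 99% interval in Python, using the chi-squared table values just cited (assumed table lookups, χ²(.005, 24) ≈ 45.56 and χ²(.995, 24) ≈ 9.89):

```python
# 99% CI for sigma^2, given (n-1)s^2 = 19.41 from Example 12.3
chi2_upper, chi2_lower = 45.56, 9.89   # chi2(.005, 24), chi2(.995, 24)
lcl = 19.41 / chi2_upper               # lower confidence limit
ucl = 19.41 / chi2_lower               # upper confidence limit
print(round(lcl, 3), round(ucl, 3))    # 0.426 1.963
```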
Comparing Two Populations
Previously we looked at techniques to
estimate and test parameters for one population:
Population Mean μ, Population Variance σ²
We will still consider these parameters when
we are looking at two populations,
however our interest will now be:
The difference between two means.
The ratio of two variances.

Difference of Two Means
In order to test and estimate the difference
between two population means, we
draw random samples from each of two
populations. Initially, we will consider
independent samples, that is, samples that
are completely unrelated to one another.

Because we are comparing two population

means, we use the statistic: x̄1 - x̄2

Sampling Distribution of x̄1 - x̄2
1. x̄1 - x̄2 is normally distributed if the original
populations are normal, or approximately normal
if the populations are nonnormal and the sample
sizes are large (n1, n2 > 30)

2. The expected value of x̄1 - x̄2 is μ1 - μ2

3. The variance of x̄1 - x̄2 is σ1²/n1 + σ2²/n2

and the standard error is √(σ1²/n1 + σ2²/n2)

Making Inferences About μ1 - μ2
Since x̄1 - x̄2 is normally distributed if the
original populations are normal, or approximately
normal if the populations are nonnormal and the
sample sizes are large (n1, n2 > 30), then:

z = ((x̄1 - x̄2) - (μ1 - μ2)) / √(σ1²/n1 + σ2²/n2)

is a standard normal (or approximately normal)

random variable. We could use this to build test
statistics or confidence interval estimators for μ1 - μ2
Making Inferences About μ1 - μ2
except that, in practice, the z statistic is rarely
used since the population variances are unknown.


Instead we use a t-statistic. We consider two cases

for the unknown population variances: when we
believe they are equal and conversely when they
are not equal.

When are variances equal?
How do we know when the population
variances are equal?

Since the population variances are

unknown, we can't know for certain
whether they're equal, but we can
examine the sample variances and
informally judge their relative values to
determine whether we can assume that
the population variances are equal or not.
Test Statistic for μ1 - μ2 (equal variances)
1) Calculate the pooled variance
estimator as

sp² = ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)

2) and use it here:

t = ((x̄1 - x̄2) - (μ1 - μ2)) / √( sp²(1/n1 + 1/n2) )

degrees of freedom: ν = n1 + n2 - 2
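The two steps above can be sketched as a small Python function (the summary statistics in the call are hypothetical, not the textbook's chair-assembly data):

```python
import math

def pooled_t(x1, x2, s1sq, s2sq, n1, n2):
    """Equal-variance t statistic for H0: mu1 - mu2 = 0."""
    # Step 1: pooled variance estimator
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
    # Step 2: t statistic with nu = n1 + n2 - 2 degrees of freedom
    return (x1 - x2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical sample means and variances, 25 observations each:
t = pooled_t(6.3, 6.0, 0.85, 1.30, 25, 25)
print(round(t, 2))  # 1.02
```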
CI Estimator for μ1 - μ2 (equal variances)
The confidence interval estimator
for μ1 - μ2 when the population
variances are equal is given by:

(x̄1 - x̄2) ± t(α/2) √( sp²(1/n1 + 1/n2) )

(sp² is the pooled variance estimator; ν = n1 + n2 - 2 degrees of freedom)

Test Statistic for μ1 - μ2 (unequal variances)
The test statistic for μ1 - μ2 when
the population variances are
unequal is given by:

t = ((x̄1 - x̄2) - (μ1 - μ2)) / √( s1²/n1 + s2²/n2 )

degrees of freedom:
ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ]

Likewise, the confidence interval estimator is:

(x̄1 - x̄2) ± t(α/2) √( s1²/n1 + s2²/n2 )
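The unequal-variance degrees-of-freedom formula (the Welch-Satterthwaite approximation) can be sketched as follows; the inputs in the call are hypothetical:

```python
def welch_df(s1sq, n1, s2sq, n2):
    """Approximate degrees of freedom for the unequal-variance t statistic."""
    a, b = s1sq / n1, s2sq / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Hypothetical inputs; the result (about 28) lies between
# min(n1, n2) - 1 and n1 + n2 - 2, as it always should.
print(round(welch_df(4, 10, 16, 20), 1))
```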

Example 13.2
Two methods are being tested for assembling
office chairs. Assembly times are recorded (25
times for each method). At a 5% significance
level, do the assembly times for the two
methods differ?

That is, H1: μ1 - μ2 ≠ 0

Hence, our null hypothesis becomes: H0: μ1 - μ2 = 0

Reminder: This is a two-tailed test.


Example 13.2
The assembly times for each of the
two methods are recorded and
preliminary data is prepared

The sample variances are similar, hence we will assume that

the population variances are equal

Example 13.2
Recall, we are doing a two-tailed test,
hence the rejection region will be:

The number of degrees of freedom is
ν = n1 + n2 - 2 = 25 + 25 - 2 = 48

Hence our critical values of t (and

our rejection region) become:

t < -t(.025, 48)  or  t > t(.025, 48)

Example 13.2
In order to calculate our t-statistic,
we need to first calculate the pooled
variance estimator, followed by
the t-statistic


Example 13.2

Since our calculated t-statistic does not fall into

the rejection region, we cannot reject H0 in favor
of H1; that is, there is not sufficient evidence to
infer that the mean assembly times differ.


Example 13.2
Excel, of course, also provides us
with the information


or look at p-value

Confidence Interval
We can compute a 95% confidence interval estimate
for the difference in mean assembly times as:

(x̄1 - x̄2) ± t(α/2) √( sp²(1/n1 + 1/n2) )

That is, we estimate the mean difference between the
two assembly methods to be between -.36 and .96 minutes.
Note: zero is included in this confidence interval.

Matched Pairs Experiment
Previously when comparing two populations,
we examined independent samples.

If, however, an observation in one sample is

matched with an observation in a second
sample, this is called a matched pairs
experiment.
To help understand this concept, let's

consider Example 13.4.
Identifying Factors
Factors that identify the t-test and
estimator of μD:

Inference about the ratio of two variances
So far we've looked at comparing measures of central
location, namely the mean of two populations.

When looking at two population variances, we consider

the ratio of the variances, i.e. the parameter of interest to
us is σ1²/σ2².

The sampling statistic F = s1²/s2² is F distributed with

ν1 = n1 - 1 and ν2 = n2 - 1 degrees of freedom.

Inference about the ratio of two variances
Our null hypothesis is always:

H0: σ1²/σ2² = 1

(i.e. the variances of the two populations will be

equal, hence their ratio will be one)

Therefore, our statistic simplifies to: F = s1²/s2²

df1 = n1 - 1
df2 = n2 - 1


Example 13.6
In example 13.1, we looked at the variances of
the samples of people who consumed high fiber
cereal and those who did not and assumed
they were not equal. We can use the ideas just
developed to test if this is in fact the case.

We want to show: H1: σ1²/σ2² ≠ 1

(the variances are not equal to each other)

Hence we have our null hypothesis: H0: σ1²/σ2² = 1


Example 13.6
Since our research hypothesis is: H1: σ1²/σ2² ≠ 1,
we are doing a two-tailed test, and
our rejection region is: F > F(α/2, ν1, ν2) or F < F(1-α/2, ν1, ν2)


Example 13.6
Our test statistic is F = s1²/s2², and it falls
outside the critical values .58 and 1.61.

Hence there is sufficient evidence to reject the null
hypothesis in favor of the alternative; that is, there is a
difference in the variance between the two populations.


Example 13.6
We may need to work with the Excel
output before drawing conclusions.
Our research hypothesis
requires two-tail testing,
but Excel only gives us values
for one-tail testing.

If we double the one-tail p-value Excel gives us, we have
the p-value of the test we're conducting (i.e. 2 × 0.0004 = 0.0008).
Refer to the text and CD Appendices for more detail.
Show of Hands
Who is doing a study
that involves
statistical analysis of data?
What type of
(quantitative) data are
you collecting?
Will there be enough
data to achieve significance
(adequate power vs.
pilot)? If pilot:
Descriptive statistics
9/14/2010 406
Types of data

Ordinal: in order but not equal (Likert)

Continuous Data
If comparing 2 groups
t-test
If comparing > 2 groups
ANOVA (F-test)
If measuring association between 2 variables
Pearson r correlation
If trying to predict an outcome
(crystal ball)
Regression or multiple regression
Ordinal Data
Beyond the capability of Excel (just FYI)
If comparing 2 groups
Mann Whitney U (treatment vs. control)
Wilcoxon (matched pre vs. post)
If comparing > 2 groups
Kruskal-Wallis (median test)
If measuring association between 2 variables
Spearman rho ()
Likert-type scales are ordinal data
Categorical Data
Called a test of frequency: how
often something is observed (AKA:
Goodness of Fit Test, Test of Independence)
Chi-Square (χ²)
Examples of burning research questions:
Do negative ads change how people vote?
Is there a relationship between marital
status and health insurance coverage?
Words we use to describe data
Mean (μ)
The arithmetic
average (add all of
the scores
together, then
divide by the
number of scores)
μ = Σx / n

Median
The middle number
(just like the
median strip that
divides a highway
down the middle;
Used when data is
not normally distributed
Often hear about
the median price of homes
Mode
The most frequently
occurring number
(value, cost)
On a frequency
distribution, it's the
highest point (like the "à la mode" on top of a pie)
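All three measures of center above are in Python's standard library; a tiny illustration on made-up scores:

```python
from statistics import mean, median, mode

scores = [2, 3, 3, 5, 7]   # small illustrative data set
print(mean(scores))        # 4  (sum 20 divided by 5 scores)
print(median(scores))      # 3  (middle value of the ordered list)
print(mode(scores))        # 3  (most frequently occurring value)
```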
Standard Deviation (σ)


We Make Mistakes!

Alpha level (α)
Set BEFORE we collect data, run statistics
Defines how much of an error we are willing
to make to say we made a difference
If we're wrong, it's an alpha error or Type 1 error
AKA: level of significance

p value
Calculated AFTER we gather the data
The calculated probability of a mistake
by saying it works
Describes the percent of the population/area
under the curve (in the tail) that is beyond our statistic

2-tailed Test
The critical value is
the number that
separates the blue
zone from the middle
(± 1.96 in this example)
In a t-test, in order to
be statistically
significant the t score
needs to be in the
blue zone
If α = .05, then 2.5%
of the area is in each tail.
1-tailed Test

The critical value is

either + or -, but
not both.
In this case, you
would have
significance (p < .05)
if t ≥ 1.645.

Chi-Square (χ²)

Any number squared

is a positive number
Therefore, area under
the curve starts at 0
and goes to infinity
To be statistically
significant, χ² needs to
be in the upper 5% (α = .05)
Compares observed
frequency to what we
would expect