The term inference refers to a key concept in statistics in which we draw a
conclusion from available evidence.

The purpose of descriptive statistics is to summarize or display data so we can
quickly obtain an overview. Inferential statistics allows us to make claims or
conclusions about a population based on a sample of data from that population. A
population represents all possible outcomes or measurements of interest. A sample is
a subset of a population.
We use the term population in statistics to represent all possible measurements or
outcomes that are of interest to us in a particular study. The term sample refers to a
portion of the population that is representative of the population from which it was
drawn.
Data is simply defined as the value assigned to a specific observation or
measurement.
Data that is used to describe something of interest about a population is called a
parameter.
For instance, let's say that the population of interest is my wife's three-year-old
preschool class and my measurement of interest is how many times the little urchins
use the bathroom in a day.
If we average the number of trips per child, this figure would be considered a
parameter because the entire population was measured. However, if we want to
make a statement about the average number of bathroom trips per day per
three-year-old in the country, then Debbie's class could be our sample. We can consider the
average that we observe from her class a statistic if we assume it could be used to
estimate the average for all three-year-olds in the country.
Data that describes a characteristic about a population is known as a parameter.
Data that describes a characteristic about a sample is known as a statistic.
Information is data that is transformed into useful facts that can be used for a
specific purpose, such as making a decision.
We classify the sources of data into two broad categories: primary and secondary.
You can obtain primary data in many ways, such as direct observation, surveys, and
experiments.
Direct observation: Focus groups are a direct observational technique where the
subjects are aware that data is being collected. Businesses use focus groups to
gather information in a group setting controlled by a moderator. The subjects are
usually paid for their time and are asked to comment on specific topics.
Experiments: This method is more direct than observation because the subjects will
participate in an experiment designed to determine the effectiveness of a treatment.
An example of a treatment could be the use of a new medical drug. Two groups would
be established. The first is the experimental group, who receive the new drug, and the
second is the control group, who think they are getting the new drug but are in fact
getting no medication. The reactions from each group are measured and compared to
determine whether the new drug was effective.
The benefit of experiments is that they allow the statistician to control factors that
could influence the results, such as gender, age, and education of the participants.
The concern about collecting data through experiments is that the response of the
subjects might be influenced by the fact that they are participating in a study. The
design of experiments for a statistical study is a very complex topic and goes beyond
the scope of this book.
Surveys: This technique of data collection involves directly asking the subject a
series of questions.
The questionnaire needs to be carefully designed to avoid any bias or confusion for
those participating. Concerns also exist about the influence the survey will have on
the participant's responses. Research has shown that the manner in which the
questions are asked can affect the responses a person provides on a questionnaire. A
question posed in a positive tone will tend to invoke a more positive response and
vice versa. A good strategy is to test your questionnaire with a small group of people
before releasing it to the general public.
Another way to classify data is by one of two types: quantitative or qualitative.
Types of measurement scales:
A nominal level of measurement deals strictly with qualitative data. Observations are
simply assigned to predetermined categories. One example is gender of the
respondent, with the categories being male and female. This data type does not allow
us to perform any mathematical operations, such as adding or multiplying. We also
cannot rank-order this list in any way from highest to lowest. This type is considered
the lowest level of data and, as a result, is the most restrictive when choosing a
statistical technique to use for the analysis.
You can use numbers at the nominal level of measurement. Even in this case, the
rules of the nominal scale still apply. An example would be zip codes or telephone
numbers, which can't be added or placed in a meaningful order of greater than or
less than. Even though the data appears to be numbers, it's handled just like
qualitative data.
On the food chain of data, ordinal is the next level up. It has all the properties of
nominal data with the added feature that we can rank-order the values from highest
to lowest. An example is if you were to have a lawnmower race. Let's say the finishing
order was Scott, Tom, and Bob. We still can't perform mathematical operations on
this data, but we can say that Scott's lawnmower was faster than Bob's. However, we
cannot say how much faster. Ordinal data does not allow us to make measurements
between the categories and to say, for instance, that Scott's lawnmower is twice as
good as Bob's (it's not).
Ordinal data can be either qualitative or quantitative. An example of quantitative data
is rating movies with 1, 2, 3, or 4 stars. However, we still may not claim that a 4-star
movie is 4 times as good as a 1-star movie.
Moving up the scale of data, we find ourselves at the interval level, which is strictly
quantitative data. Now we can get to work with the mathematical operations of
addition and subtraction when comparing values. For this data, we can measure the
difference between the different categories with actual numbers and also provide
meaningful information. Temperature measurement in degrees Fahrenheit is a
common example here. For instance, 70 degrees is 5 degrees warmer than 65 degrees.
However, multiplication and division can't be performed on this data. Why not?
Simply because we cannot argue that 100 degrees is twice as warm as 50 degrees.
The king of data types is the ratio level. Now we can perform all four mathematical
operations to compare values with absolutely no feelings of guilt. Examples of this
type of data are age, weight, height, and salary. Ratio data has all the features of
interval data with the added benefit of a true 0 point. The term true zero point
means that a 0 data value indicates the absence of the object being measured. For
instance, 0 salary indicates the absence of any salary.
The distinction between interval and ratio data is a fine line.
To help identify the proper scale, use the "twice as much" rule. If the phrase "twice as
much" accurately describes the relationship between two values that differ by a
multiple of 2, then the data can be considered ratio level.
Interval data does not have a true 0 point. For example, 0 degrees Fahrenheit does
not represent the absence of temperature, even though it may feel like it.
A frequency distribution is simply a table that organizes the number of data values
into intervals.
The intervals in a frequency distribution are officially known as classes, and the
number of observations in each class is known as the class frequency.
Constructing a frequency distribution:
- form classes of equal size.
- make classes mutually exclusive, or in other words, prevent classes from
overlapping.
- try to have no fewer than 5 classes and no more than 15 classes.
- avoid open-ended classes, if possible (for instance, a highest class of 15-over).
- include all data values from the original table in a class. In other words, the classes
should be exhaustive.
Relative Frequency Distribution
Rather than display the number of observations in each class, this method calculates
the percentage of observations in each class by dividing the frequency of each class
by the total number of observations.
Cumulative Frequency Distribution
Cumulative frequency distributions indicate the percentage of observations that are
less than or equal to the upper limit of the current class. They total the percentages
of each class as you move down the column. For example, John used his phone 8 times
or less on 84 percent of the days in the month.
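The three tables described above can be sketched in a few lines of Python. The call counts and class limits below are hypothetical stand-ins, since the original data table is not reproduced in these notes; only the method follows the text.

```python
# Hypothetical daily call counts (the original data table is not shown in these notes).
calls = [3, 1, 2, 7, 4, 3, 5, 2, 0, 9, 4, 6, 3, 8, 5, 4, 2, 6, 3, 1]

# Classes of equal size, given as inclusive (lower, upper) limits.
# They are mutually exclusive and exhaustive for this data set.
classes = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]


def frequency_table(values, classes):
    """Return rows of (class, frequency, relative freq, cumulative relative freq)."""
    n = len(values)
    rows = []
    running_total = 0
    for lower, upper in classes:
        freq = sum(1 for v in values if lower <= v <= upper)
        running_total += freq
        rows.append(((lower, upper), freq, freq / n, running_total / n))
    return rows


table = frequency_table(calls, classes)
for (lower, upper), freq, rel, cum in table:
    print(f"{lower}-{upper}: freq={freq:2d}  relative={rel:.2f}  cumulative={cum:.2f}")
```

The last cumulative value is always 1.00 when the classes are exhaustive, which makes a quick sanity check.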
Graphing a Frequency Distribution: the Histogram
A histogram is simply a bar graph showing the number of observations in each class
as the height of each bar.
- the first thing we need to do is open Excel to a blank sheet and enter our data in
Column A starting in Cell A1.
- next enter the upper limits to each class in Column B starting in Cell B1.
- go to the Tools menu at the top of the Excel window and select Data Analysis.
- The Chart Wizard allows me more control over the final appearance.
Statistical Flower Power: the Stem and Leaf Display
The major benefit of this approach is that all the original data points are visible on
the display.

The stem in the display is the first column of numbers, which represents the first
digit of the golf scores. The leaf in the display is the second digit of the golf scores,
with 1 digit for each score. Because there were 5 scores in the 70s, there are 5 digits
to the right of 7.
Here, the stem labeled 7 (5) stores all the scores between 75 and 79. The stem 8 (0)
stores all the scores between 80 and 84.
Charting a Frequency Distribution
Bar Charts
Bar charts are a useful graphical tool when you are plotting individual data values
next to each other.
The histogram that we visited earlier in the chapter is actually a special type of bar
chart that plots frequencies rather than actual data values.
How do I choose between a pie chart and a bar chart? If your objective is to
compare the relative size of each class to one another, use a pie chart. Bar charts
are more useful when you want to highlight the actual data values.
Line charts are used to help identify patterns between two sets of data.
Line charts prove very useful when you are interested in exploring patterns between
two different types of data. They are also helpful when you have many data points and
want to show all of them on one graph.
Because the line connecting the data points seems to have an overall upward trend,
my suspicions hold true. It seems the more showers our waterlogged darlings take,
the higher the utility bill.
Measures of Central Tendency
There exist two broad categories of descriptive statistics that are commonly used.
The first, measures of central tendency, describes the center point of our data set
with a single value. It's a valuable tool to help us summarize many pieces of data with
one number. The second category, measures of dispersion, describes how far
individual data values have strayed from the mean.
The mean or average is the most common measure of central tendency and is
calculated by adding all the values in our data set and then dividing this result
by the number of observations.
A weighted mean allows you to assign more weight to certain values and less
weight to others.
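As a minimal sketch of the weighted mean, each value is multiplied by its weight and the total is divided by the sum of the weights. The scores and weights below are made up for illustration and do not come from the text.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical example: three exam scores weighted 5, 3, and 2.
scores = [90, 80, 70]
weights = [5, 3, 2]

print(weighted_mean(scores, weights))  # (450 + 240 + 140) / 10 = 83.0
print(sum(scores) / len(scores))       # unweighted mean for comparison: 80.0
```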
Mean of Grouped Data from a Frequency Distribution - example:

The mean of a frequency distribution where data is grouped into classes is only an
approximation to the mean of the original data set from which it was derived.
This is true because we make the assumption that the original data values are at the
midpoint of each class, which is not necessarily the case. The true mean of the 30
original data values in the cell phone example is 4.5 calls per day rather than 4.6.
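The midpoint assumption can be made concrete with a short sketch. The classes and frequencies here are hypothetical (the cell phone table is not reproduced in these notes); the calculation is the one described above: multiply each class midpoint by its frequency, add the products, and divide by the total number of observations.

```python
def grouped_mean(classes, frequencies):
    """Approximate the mean by treating every value as its class midpoint."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    total_observations = sum(frequencies)
    weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
    return weighted_sum / total_observations


# Hypothetical frequency distribution of 30 observations.
classes = [(0, 2), (3, 5), (6, 8), (9, 11)]
frequencies = [5, 12, 9, 4]

approx_mean = grouped_mean(classes, frequencies)
print(round(approx_mean, 2))  # (1*5 + 4*12 + 7*9 + 10*4) / 30 = 5.2
```

Because the raw values are replaced by midpoints, the result generally differs a little from the mean of the original data.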
According to the mean of this frequency distribution, John averages 4.6 calls per day
on his cell phone.
The median is the value in the data set for which half the observations are
higher and half the observations are lower. We find the median by arranging the
data values in ascending order and identifying the halfway point.
When there is an even number of data points, the median will be the average of the
two center points.
Using our example with the video games, we rearrange our data set in ascending
order: 3 4 4 4 5 6 7 7 9 17
Because we have an even number of data points (10), the median is the average of
the two center points. In this case, that will be the values 5 and 6, resulting in a
median of 5.5 hours of video games per week. Notice
that there are four data values to the left (3, 4, 4, and 4) of these center points and
four data values to the right (7, 7, 9, and 17).
The mode is simply the observation in the data set that occurs the most
frequently.
If you think all the data in your data set is relevant, then the mean is your best
choice. This measurement is affected by both the number and magnitude of your
values. However, very small or very large values can have a significant impact on the
mean, especially if the size of the sample is small. If this is a concern, perhaps you
should consider using the median. The median is not as sensitive to a very large or
small value.
Consider the following data set from the original video game example:
3 4 4 4 5 6 7 7 9 17
The number 17 is rather large when compared to the rest of the data. The mean of
this sample was 6.6, whereas the median was 5.5. If you think 17 is not a typical
value that you would expect in this data set, the median would be your best choice for
central tendency.
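The three measures for the video game data can be checked with Python's standard statistics module. This is only a quick verification sketch; the second half drops the value 17 to show how much more the mean moves than the median.

```python
import statistics

# Hours of video games per week, from the example in the text.
hours = [3, 4, 4, 4, 5, 6, 7, 7, 9, 17]

mean_hours = statistics.mean(hours)      # 66 / 10 = 6.6
median_hours = statistics.median(hours)  # average of 5 and 6 = 5.5
mode_hours = statistics.mode(hours)      # 4 occurs three times

print(mean_hours, median_hours, mode_hours)

# Dropping the extreme value 17 pulls the mean down by more than a full hour,
# while the median only moves from 5.5 to 5.
without_extreme = hours[:-1]
print(round(statistics.mean(without_extreme), 2), statistics.median(without_extreme))
```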
The poor lonely mode has limited applications. It is primarily used to describe data at
the nominal scale, that is, data that is grouped in descriptive categories such as
gender. If 60 percent of our survey respondents were male, then the mode of our data
would be male.
In Excel, Data Analysis - Descriptive Statistics reports the mean, median, and mode.
Measures of Dispersion
Range is the simplest measure of dispersion and is calculated by finding the
difference between the highest value and the lowest value in the data set. For the
data set 7 9 8 11 4, range = 11 − 4 = 7.
However, the limitation is that it only relies on two data points to describe the
variation in the sample. No other values between the highest and lowest points are
part of the range calculation.
Variance summarizes the squared deviation of each data value from the mean.
The variance is a measure of dispersion that describes the relative distance between
the data points in the set and the mean of the data set. This measure is widely used in
inferential statistics.

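The first step in calculating the variance is to determine the mean of the data set. The
rest of the calculations can be facilitated by the following table, worked here for the
data set 7 9 8 11 4, whose mean is 7.8.

x      x − mean    (x − mean)²
7      −0.8         0.64
9       1.2         1.44
8       0.2         0.04
11      3.2        10.24
4      −3.8        14.44
Sum                26.80

The final sample variance calculation becomes this: s² = 26.8 / (5 − 1) = 6.7.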
Using the Raw Score Method is a more efficient way to calculate the variance of a
data set.
s² = (the sum of each data value after it has been squared − the square of the sum of
all the data values divided by n) / (n − 1), that is, s² = [Σx² − (Σx)²/n] / (n − 1)
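A short sketch can confirm that the definition and the raw score method give the same sample variance. It uses the five-value data set 7 9 8 11 4 from the range example; the variable names are illustrative only.

```python
import statistics

data = [7, 9, 8, 11, 4]
n = len(data)

# Definition: sum of squared deviations from the mean, divided by n - 1.
mean = sum(data) / n                                 # 7.8
sum_sq_dev = sum((x - mean) ** 2 for x in data)      # 26.8
var_definition = sum_sq_dev / (n - 1)

# Raw score method: [sum(x^2) - (sum(x))^2 / n] / (n - 1).
sum_x = sum(data)                                    # 39
sum_x2 = sum(x * x for x in data)                    # 331
var_raw = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(round(var_definition, 4), round(var_raw, 4))   # both round to 6.7

# The standard library's sample variance agrees.
print(statistics.variance(data))
```

The raw score form avoids computing each deviation, which is why it is quicker by hand.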
The Variance of a Population
Standard deviation is simply the square root of the variance. Just as with the
variance, there is a standard deviation for both the sample and population. To
calculate the standard deviation, you must first calculate the variance and then take
the square root of the variance.
The standard deviation is actually a more useful measure than the variance because
the standard deviation is in the units of the original data set.
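The sample and population versions differ only in the denominator (n − 1 versus N). The sketch below, again using the data set 7 9 8 11 4 as an assumed example, shows both with Python's statistics module.

```python
import math
import statistics

data = [7, 9, 8, 11, 4]

# Sample statistics: denominator n - 1.
sample_var = statistics.variance(data)   # 26.8 / 4 = 6.7
sample_sd = math.sqrt(sample_var)

# Population statistics: denominator N.
pop_var = statistics.pvariance(data)     # 26.8 / 5 = 5.36
pop_sd = statistics.pstdev(data)

print(round(sample_var, 2), round(sample_sd, 3))
print(round(pop_var, 2), round(pop_sd, 3))
```

The sample figures are always at least as large as the population figures for the same data, because the sum of squared deviations is divided by a smaller number.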
Calculating the Standard Deviation of Grouped Data
The Empirical Rule: Working with Standard Deviation
The values of many large data sets tend to cluster around the mean or median so that
the data distribution in the histogram resembles a bell-shaped, symmetrical curve.
When this is the case, the empirical rule tells us that approximately 68 percent of the
data values will be within one standard deviation from the mean.
For example, suppose that the average exam score for my large statistics class is 88
points and the standard deviation is 4.0 points and that the distribution of grades is
bell-shaped around the mean. Because one standard deviation above the mean would
be 92 (88 + 4) and one standard deviation below the mean would be 84
(88 − 4), the empirical rule tells me that approximately 68 percent of the exam scores
will fall between 84 and 92 points.
According to the empirical rule, if a distribution follows a bell shape (a symmetrical
curve centered around the mean), we would expect approximately 68, 95, and 99.7
percent of the values to fall within one, two, and three standard deviations around the
mean, respectively.
In general, we can use the following equation to express the range of values within k
standard deviations around the mean: μ ± kσ.
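The interval μ ± kσ and the 68, 95, and 99.7 percent figures can be reproduced with a small sketch. It assumes the data follow a normal (bell-shaped) distribution and uses statistics.NormalDist from the Python standard library (Python 3.8 or later); the function names are illustrative.

```python
from statistics import NormalDist


def k_sigma_interval(mean, sd, k):
    """Return the (lower, upper) limits of mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd


def normal_coverage(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


# Exam example from the text: mean 88 points, standard deviation 4 points.
for k in (1, 2, 3):
    low, high = k_sigma_interval(88, 4, k)
    print(f"k={k}: {low} to {high} points, about {normal_coverage(k):.1%} of scores")
```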
Chebyshev's Theorem
Chebyshev's theorem is a mathematical rule similar to the empirical rule except that
it applies to any distribution rather than just bell-shaped, symmetrical distributions.
Chebyshev's theorem states that for any number k greater than 1, at least
(1 − 1/k²) × 100 percent of the values will fall within k standard deviations from the
mean. Using this equation, we can state the following:
- at least 75 percent of the data values will fall within two standard deviations from
the mean by setting k = 2 in Chebyshev's equation.
- at least 88.9 percent of the data values will fall within three standard deviations
from the mean by setting k = 3 in the equation.
- at least 93.75 percent of the data values will fall within four standard deviations
from the mean by setting k = 4 in the equation.
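The percentages in the list above come straight from 1 − 1/k². The sketch below computes the bound and compares it with a deliberately skewed, made-up data set to illustrate that the guarantee does not depend on a bell shape. The data values and function names are assumptions for illustration.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k greater than 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed share of values within k population standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return len(inside) / len(data)


# Hypothetical, strongly skewed data set (not bell-shaped).
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 8, 40]

for k in (2, 3, 4):
    bound = chebyshev_lower_bound(k)
    observed = fraction_within(skewed, k)
    print(f"k={k}: at least {bound:.2%} guaranteed, {observed:.0%} observed")
```

The observed share is usually well above the bound, because Chebyshev's theorem is a worst-case guarantee.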
This table supports Chebyshev's theorem, which predicts that at least 75 percent of
the values will fall within two standard deviations from the mean. From the data set,
we can observe that 95 percent actually fall between 20.3 and 49.1 home runs (38 out
of 40). The same explanation holds true for three and four standard deviations around
the mean.
Measures of Relative Position describe the percentage of the data below a
certain point.
Quartiles divide the data set into four equal segments after it has been
arranged in ascending order.
Approximately 25 percent of the data points will fall below the first quartile, Q1.
Approximately 50 percent of the data points will fall below the second quartile, Q2.
And, you guessed it, 75 percent should fall below the third quartile, Q3.
Step 1: Arrange your data in ascending order.
Step 2: Find the median of the data set. This is Q2.
Step 3: Find the median of the lower half of the data set. This is Q1.
Step 4: Find the median of the upper half of the data set. This is Q3.
Interquartile range - the IQR measures the spread of the center half of our
data set. It is simply the difference between the third and first quartiles, as follows:
IQR = Q3 − Q1. The interquartile range is used to identify outliers, which are the
black sheep of our data set. These are extreme values whose accuracy is questioned
and can cause unwanted distortions in statistical results. Any values that are more
than Q3 + 1.5 IQR or less than Q1 − 1.5 IQR should be discarded.
Example: 10 42 45 46 51 52 58 73
Since there are eight data values, Q1 will be the median of the first four values (the
midpoint between the second and third values). Q1 = (42 + 45)/2 = 43.5
Likewise, Q3 will be the median of the last four values (the midpoint between the
sixth and seventh values).
Q3 = (52 + 58)/2 = 55. IQR = Q3 − Q1 = 55 − 43.5 = 11.5
Any value greater than Q3 + 1.5 IQR = 72.25 or less than Q1 − 1.5 IQR = 26.25 should
be considered an outlier; therefore the values 10 and 73 would be outliers in this data
set.
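The four steps and the 1.5 IQR rule can be sketched as follows, recomputing the eight-value example. The helper names are illustrative, and leaving the middle value out of both halves when the number of values is odd is one common convention, assumed here because the text only shows an even-sized data set.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)                      # Step 1: ascending order
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Assumption: for an odd n, the middle value is left out of both halves.
    upper_half = data[half:] if n % 2 == 0 else data[half + 1:]
    q2 = statistics.median(data)               # Step 2
    q1 = statistics.median(lower_half)         # Step 3
    q3 = statistics.median(upper_half)         # Step 4
    return q1, q2, q3


def find_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


scores = [10, 42, 45, 46, 51, 52, 58, 73]
print(quartiles(scores))      # (43.5, 48.5, 55.0)
print(find_outliers(scores))  # [10, 73]; fences are 26.25 and 72.25
```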
The values for variance and standard deviation reported by Excel are for a sample. If
your data set represents a population, you need to recalculate the results using N in
the denominator rather than n − 1.
Probability topics
Experiment. The process of measuring or observing an activity for the purpose of
collecting data. An example is rolling a pair of dice.
Outcome. A particular result of an experiment. An example is rolling a pair of threes
with the dice.
Sample space. All the possible outcomes of the experiment. The sample space for
our experiment is the numbers {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12}. Statistics people
like to put {} around the sample space values.
Event. One or more outcomes that are of interest for the experiment and which is/are
a subset of the sample space. An example is rolling a total of 2, 3, 4, or 5 with two
dice.
Classical Probability refers to a situation when we know the number of possible
outcomes of the event of interest and can calculate the probability of that event with
the following equation:
P[A] = Number of possible outcomes in which Event A occurs / Total number of
possible outcomes in the sample space.
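As a sketch of the counting behind this formula, the event "rolling a total of 2, 3, 4, or 5 with two dice" can be enumerated directly. One assumption is made explicit here: the totals 2 through 12 are not equally likely, so the code counts the 36 equally likely ordered pairs from two fair dice rather than the 11 totals.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) results for two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Event A: the total is 2, 3, 4, or 5.
event_a = [roll for roll in outcomes if sum(roll) in {2, 3, 4, 5}]

p_a = Fraction(len(event_a), len(outcomes))
print(len(event_a), len(outcomes))  # 10 36
print(p_a, float(p_a))              # 5/18, roughly 0.278
```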
Empirical Probability - used when we don't know enough about the underlying
process to determine the number of outcomes associated with an event. This type of
probability observes the number of occurrences of an event through an experiment
and calculates the probability from a relative frequency distribution.
P[A] = Frequency in which Event A occurs / Total number of observations.
One example of empirical probability is to answer the age-old question "What is the
probability that John will get out of bed in the morning for school after his first
wake-up call?"
Based on these observations, if Event A = John getting out of bed on the first wake-up
call, then P[A] = 0.15
Using the previous table, we can also examine the probability of other events. Let's
say Event B = John requiring more than 2 wake-up calls to get out of bed; then P[B]
= 0.40 + 0.25 = 0.65.
If I choose to run another 20-day experiment of John's waking behavior, I would most
likely see different results than those in the previous table. However, if I were to
observe 100 days of this data, the relative frequencies would approach the true or
classical probabilities of the underlying process. This pattern is known as the law of
large numbers.
The law of large numbers states that when an experiment is conducted a large
number of times, the empirical probabilities of the process will converge to the
classical probabilities.
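A small simulation can illustrate the law of large numbers. This sketch swaps in the two-dice event from earlier (a total of 5 or less, classical probability 10/36) because its true probability is known; the seed, trial counts, and function name are arbitrary choices for illustration.

```python
import random


def relative_frequency(trials, rng):
    """Empirical probability of rolling a total of 5 or less with two fair dice."""
    hits = 0
    for _ in range(trials):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total <= 5:
            hits += 1
    return hits / trials


classical = 10 / 36          # classical probability of the event
rng = random.Random(42)      # fixed seed so the run is repeatable

for trials in (20, 100, 10_000, 100_000):
    estimate = relative_frequency(trials, rng)
    print(f"{trials:>7} rolls: empirical={estimate:.4f}  classical={classical:.4f}")
```

Short runs can land far from the classical value, while long runs tend to settle close to it.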
Subjective probability
We use subjective probability when classical and empirical probabilities are not
available. Under these circumstances, we rely on experience and intuition to estimate
the probabilities.
Basic Properties of Probability - one event
If P[A] = 1, then Event A must occur with certainty.
If P[A] = 0, then Event A will certainly not occur.
The probability of Event A must be between 0 and 1.
The sum of all the probabilities for the events in the sample space must be equal to 1.
The complement to Event A is defined as all the outcomes in the sample space that
are not part of Event A and is denoted as A'. Using this definition, we can state the
following: P[A] + P[A'] = 1 or P[A] = 1 − P[A'].
The Intersection of Events
Now that my children are older and living away from home, I cherish those moments
when the phone rings and I see one of their numbers appear on my caller ID.
Experience has taught me that I can categorize these calls as either crisis, involving
such things as a computer, a car, an ATM card, or a cell phone; or noncrisis, when
they call just to see if I'm alive and well enough to help with their next crisis.
The following table, called a contingency table, categorizes the last 50 phone calls by
child and type of call.
Contingency tables show the actual or relative frequency of two types of data at the
same time. In this case, the data types are child and type of call.
Event A = the next phone call will come from Christin.
Event B = the next phone call will involve a crisis.
P[A] = 20/50 = 0.40
What about the probability that the next phone call will come from Christin and will
involve a crisis?
This event is known as the intersection of Events A and B and is described by A ∩ B.
The number of phone calls from our contingency table that meet both criteria is 14,
so: P[A and B] = P[A ∩ B] = 14/50 = 0.28
A contingency table indicates the number of observations that are classified
according to two variables. The intersection of Events A and B represents the number
of instances where Events A and B occur at the same time (that is, the same phone
call is both from Christin and a crisis). The probability of the intersection of two
events is known as a joint probability.
The union of Events A and B represents the number of instances where either
Event A or B occur (that is, the number of calls that were either from Christin or were
a crisis).
P[A or B] = P[A ∪ B] = 34/50 = 0.68
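The intersection and union can be computed from cell counts, as sketched below. The contingency table itself is not reproduced in these notes, so the four cells are inferred from the figures that are given (50 calls, 20 from Christin, 14 from Christin involving a crisis, 34 in the union), with all other children collapsed into one hypothetical "other" row.

```python
from fractions import Fraction

# Cell counts inferred from the totals quoted in the text (an assumption, see lead-in).
counts = {
    ("Christin", "crisis"): 14,
    ("Christin", "noncrisis"): 6,   # 20 - 14
    ("other", "crisis"): 14,        # 34 - 20
    ("other", "noncrisis"): 16,     # 50 - 34
}
total = sum(counts.values())        # 50 calls


def probability(condition):
    """Share of calls whose (child, call type) pair satisfies the condition."""
    matching = sum(n for (child, kind), n in counts.items() if condition(child, kind))
    return Fraction(matching, total)


p_a = probability(lambda child, kind: child == "Christin")                       # 20/50
p_b = probability(lambda child, kind: kind == "crisis")                          # 28/50
p_a_and_b = probability(lambda child, kind: child == "Christin" and kind == "crisis")
p_a_or_b = probability(lambda child, kind: child == "Christin" or kind == "crisis")

print(float(p_a), float(p_a_and_b), float(p_a_or_b))  # 0.4 0.28 0.68

# The union can also be obtained from the parts: P[A] + P[B] - P[A and B].
print(p_a + p_b - p_a_and_b == p_a_or_b)              # True
```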
Classical probability requires knowledge of the underlying process in order to count
the number of possible outcomes of the event of interest.
Empirical probability relies on historical data from a frequency distribution to
calculate the likelihood that an event will occur.
The law of large numbers states that when an experiment is conducted a large
number of times, the empirical probabilities of the process will converge to the
classical probabilities.
The intersection of Events A and B represents the number of instances where Events
A and B occur at the same time.
The union of Events A and B represents the number of instances where either Event A
or B occur.
Conditional Probability
We define conditional probability as the probability of Event A knowing that Event B
has already occurred.
Example: the following table shows the outcomes of our last 20 matches, along with
the type of warm-up before we started keeping score.
Without any additional information, the simple probability of each of these events is
as follows: P[A] = 9/20 = 0.45, P[B] = 13/20 = 0.65, P[A'] = 11/20 = 0.55,
P[B'] = 7/20 = 0.35
Simple or prior probabilities are always based on the total number of observations. In
the previous example, it is 20 matches.
Knowing this piece of information, what is the probability that Debbie will win the
match? This is the conditional probability of Event A given that Event B has occurred.
Looking at the previous table, we can see that Event B has occurred 13 times.
Because Debbie has won 4 of those matches (A), the probability of A given B is
calculated as follows: P[A|B] = 4/13 = 0.31
We can also calculate the probability that Debbie will win given that Event B has not
occurred: P[A|B'] = 5/7 = 0.71
Conditional probabilities are also known as posterior probabilities. Conditional
probabilities are very useful for determining the probabilities of compound events as
you will see in the following sections.
Independent versus Dependent Events
Events A and B are said to be independent of each other if the occurrence of Event B
has no effect on the probability of Event A. Using conditional probability, Events A
and B are independent of one another if: P[A|B] = P[A]
If Events A and B are not independent of one another, then they are said to be
dependent events.
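The independence test can be applied to the match example as a short sketch. The table of 20 matches is not reproduced in these notes, so the four cell counts below are inferred from the quoted figures (A occurred 9 times, B occurred 13 times, A and B together 4 times, A without B 5 times); A is Debbie winning, and B is the warm-up condition referred to in the text.

```python
from fractions import Fraction

# Cell counts inferred from the probabilities quoted in the text (an assumption).
counts = {
    ("A", "B"): 4,      # Debbie wins, Event B occurred
    ("A", "B'"): 5,     # Debbie wins, Event B did not occur
    ("A'", "B"): 9,     # 13 - 4
    ("A'", "B'"): 2,    # 7 - 5
}
total = sum(counts.values())                        # 20 matches

n_a = counts[("A", "B")] + counts[("A", "B'")]      # 9
n_b = counts[("A", "B")] + counts[("A'", "B")]      # 13

p_a = Fraction(n_a, total)                          # prior probability, 9/20
p_a_given_b = Fraction(counts[("A", "B")], n_b)     # 4/13
p_a_given_not_b = Fraction(counts[("A", "B'")], total - n_b)  # 5/7

print(float(p_a), round(float(p_a_given_b), 2), round(float(p_a_given_not_b), 2))

# Independence would require P[A|B] to equal P[A].
independent = p_a_given_b == p_a
print("independent" if independent else "dependent")  # dependent
```

Because knowing B changes the probability of A from 0.45 to about 0.31, the two events in this example are dependent.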