
Lecture 3

Data Description

“Presenting and summarizing data”


Objectives

• Summarize data using the measures of central tendency, such as the mean, median, mode, and midrange.

• Describe data using the measures of variation, such as the range, variance, and standard deviation.
Summarizing data
• Tables
– Simplest way to summarize data
– Data are presented as absolute numbers or
percentages
• Charts and graphs
– Visual representation of data
– Data are presented as absolute numbers or
percentages
Tables: Frequency distribution

Set of categories with numerical counts

Year Number of births


1900 61
1901 58
1902 75
Tables: Relative frequency
\[ \text{relative frequency (\%)} = \frac{\text{number of values within an interval}}{\text{total number of values in the table}} \times 100 \]

Year         # births (n)   Relative frequency (%)
1900–1909    35             27
1910–1919    46             34
1920–1929    51             39
Total        132            100.0
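The relative-frequency calculation can be sketched in a few lines of Python. This is a minimal illustration using the decade counts from the table; the names `births` and `relative_frequencies` are chosen here and are not part of the lecture. Rounded to whole numbers the percentages come to 27, 35, and 39, which add up to 101, so the table's 34 for 1910–1919 appears to have been adjusted so that the column totals 100.

```python
# Decade counts copied from the table above.
births = {"1900-1909": 35, "1910-1919": 46, "1920-1929": 51}

def relative_frequencies(counts):
    """Return each count as a percentage of the total."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

rel = relative_frequencies(births)
for decade, pct in rel.items():
    print(f"{decade}: {pct:.1f}%")
# 1900-1909: 26.5%
# 1910-1919: 34.8%
# 1920-1929: 38.6%
```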
Tables
Percentage of births by decade between 1900 and 1929

Year         Number of births (n)   Relative frequency (%)
1900–1909    35                     27
1910–1919    46                     34
1920–1929    51                     39
Total        132                    100.0

Source: U.S. Census data, 1900–1929.


Charts and graphs
• Charts and graphs are used to portray:
– Trends, relationships, and comparisons
• The most informative are simple and self-
explanatory
Use the right type of graphic
• Charts and graphs
– Bar chart: comparisons, categories of data
– Line graph: display trends over time
– Pie chart: show percentages or proportional
share
[Bar chart: Percentage of new enrollees tested for HIV at each site, by quarter. Y-axis: % of new enrollees tested for HIV (0 to 6). X-axis: Quarter 1 to Quarter 4. Series: Site 1, Site 2, Site 3.]
Months: Q1 Jan–Mar, Q2 Apr–June, Q3 July–Sept, Q4 Oct–Dec
Stacked bar chart
Represent components of whole & compare wholes
[Stacked bar chart: Number of Months Female and Male Patients Have Been Enrolled in HIV Care, by Age Group. Bar segments: Females 4 and 10; Males 3 and 6. Legend: 0-14 years, 15+ years. X-axis: number of months patients have been enrolled in HIV care (0 to 15).]
Line graph
Displays trends over time
[Line graph: Number of Clinicians Working in Each Clinic During Years 1–4*. Y-axis: number of clinicians (0 to 5). X-axis: Year 1 to Year 4. Series: Clinic 1, Clinic 2, Clinic 3.]
Pie chart
Contribution to the total = 100%
[Pie chart: Percentage of All Patients Enrolled by Quarter, N=150. Slices: 59%, 23%, 10%, 8%. Legend: 1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr.]
Summary Measures in Descriptive Statistics
Summary Measures
• Central Tendency
– Mean
– Median
– Mode
• Variation
– Range
– Variance
– Standard Deviation
Measures of Central Tendency

Do you remember:
• A statistic is a characteristic or measure
obtained by using the data values from a
sample.
• A parameter is a characteristic or
measure obtained by using the data
values from a specific population.
Central Tendency

Mean, Median, Mode

Central tendency is most commonly referred to as the numerical center of the data set: it is a single number that is used to represent a group of numbers.
There are three common representations of central tendency: the mean, median, and mode.
The Mean (arithmetic average)

• The mean is defined to be the sum of the data values divided by the total number of values.
• We will compute two means: one for the
sample and one for a finite population of
values.
• One disadvantage of the mean is that it
can be influenced by extreme scores.
The Sample Mean \(\bar{X}\)

The symbol \(\bar{X}\) represents the sample mean. \(\bar{X}\) is read as "X-bar". The Greek symbol \(\Sigma\) is read as "sigma" and it means "to sum".

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n}. \]
The Sample Mean - Example

The ages in weeks of a random sample of six kittens at an animal shelter are 3, 8, 5, 12, 14, and 12. Find the average age of this sample.
The sample mean is

\[ \bar{X} = \frac{\sum X}{n} = \frac{3 + 8 + 5 + 12 + 14 + 12}{6} = \frac{54}{6} = 9 \text{ weeks.} \]
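As a quick check of the kitten example, the same calculation can be sketched in Python. The variable names are illustrative only; `statistics.mean` is the standard-library function and is shown alongside the manual sum-and-divide version.

```python
from statistics import mean

kitten_ages_weeks = [3, 8, 5, 12, 14, 12]

# Sum of the data values divided by the number of values.
manual_mean = sum(kitten_ages_weeks) / len(kitten_ages_weeks)

print(manual_mean)                             # 9.0
print(manual_mean == mean(kitten_ages_weeks))  # True
```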
The Population Mean μ

The Greek symbol \(\mu\) represents the population mean. The symbol \(\mu\) is read as "mu". N is the size of the finite population.

\[ \mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N}. \]
The Population Mean - Example

A small company consists of the owner, the manager, the salesperson, and two technicians. Their salaries are $50,000, $20,000, $12,000, $9,000, and $9,000, respectively. (Assume this is the population.)
Then the population mean will be

\[ \mu = \frac{\sum X}{N} = \frac{50{,}000 + 20{,}000 + 12{,}000 + 9{,}000 + 9{,}000}{5} = \$20{,}000. \]
The Median
• When a data set is ordered, it is called a data array.
• The median is defined to be the midpoint of the
data array.
• The symbol used to denote the median is MD.
• The median is relatively unaffected by extreme
scores at either end of the distribution
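The difference in sensitivity to extreme scores can be illustrated with the company salaries from the population mean example. This is a small sketch; the second data set, in which the owner's salary is multiplied by ten, is a hypothetical change introduced here only to show the effect.

```python
from statistics import mean, median

salaries = [50_000, 20_000, 12_000, 9_000, 9_000]
print(f"mean = {mean(salaries):,.0f}, median = {median(salaries):,.0f}")
# mean = 20,000, median = 12,000

# Hypothetical: the owner's salary rises tenfold; only the mean moves.
salaries_with_outlier = [500_000, 20_000, 12_000, 9_000, 9_000]
print(f"mean = {mean(salaries_with_outlier):,.0f}, "
      f"median = {median(salaries_with_outlier):,.0f}")
# mean = 110,000, median = 12,000
```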
The Median
How to find the median:
We find the position of the middle score in the ordered data by counting the number of scores we have collected (n), adding 1 to this value, and then dividing by 2.
With 11 scores, this gives us (n + 1)/2 = (11 + 1)/2 = 12/2 = 6. Then, we find the score that is positioned at the location we have just calculated.
The Median – Example 1
• The weights (in pounds) of seven army
recruits are 180, 201, 220, 191, 219, 209,
and 186. Find the median.
• Arrange the data in order and select the
middle point.
• Data array: 180, 186, 191, 201, 209, 219,
220.
• The median, MD = 201.
The Median
• In the previous example, there was an odd
number of values in the data set. In this case
it is easy to select the middle number in the
data array.

• When there is an even number of values in the data set, the median is obtained by taking the average of the two middle numbers.
The Median – Example 2
• Six customers purchased the following number of
magazines: 1, 7, 3, 2, 3, 4. Find the median.
• Arrange the data in order and compute the middle
point.
• Data array: 1, 2, 3, 3, 4, 7.
• The median, MD = (3 + 3)/2 = 3.
The Median – Example 3
• The ages of 10 college students are: 18, 24,
20, 35, 19, 23, 26, 23, 19, 20. Find the
median.
• Arrange the data in order and compute the
middle point.
• Data array: 18, 19, 19, 20, 20, 23, 23, 24,
26, 35.
• The median, MD = (20 + 23)/2 = 21.5.
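The procedure used in the three median examples (order the data, take the middle value, or average the two middle values when n is even) can be sketched as follows. The function name `median_by_position` is an illustrative choice, not something defined in the lecture.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the ordered data."""
    data = sorted(values)  # the "data array"
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, which is zero-based index n // 2.
        return data[mid]
    # Even n: average the two middle values.
    return (data[mid - 1] + data[mid]) / 2

print(median_by_position([180, 201, 220, 191, 219, 209, 186]))       # 201
print(median_by_position([1, 7, 3, 2, 3, 4]))                        # 3.0
print(median_by_position([18, 24, 20, 35, 19, 23, 26, 23, 19, 20]))  # 21.5
```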
The Mode
• The mode is defined to be the value that occurs most often in a data set.
• This is easy to spot in a frequency distribution because it will be the tallest bar!
• A data set can have more than one mode.
• A data set is said to have no mode if all values occur with equal frequency.
The Mode - Examples

• The following data represent the duration (in days) of U.S. space shuttle voyages for the years 1992-94. Find the mode.
• Data set: 8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14,
11, 8, 14, 11.
• Ordered set: 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10,
11, 11, 14, 14, 14. Mode = 8.
The Mode - Examples

Find the mode.


• Data set: 2, 3, 5, 7, 8, 10.
• There is no mode since each data value occurs
equally with a frequency of one.
The Mode - Examples

• Eleven different automobiles were tested at a speed of 15 mph for stopping distances. The distance, in feet, is given below. Find the mode.
• Data set: 15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26.
• There are two modes (bimodal). The values are 18
and 24. Why?
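The three mode examples (one mode, no mode, two modes) can be reproduced by counting how often each value occurs. The sketch below uses `collections.Counter`; the function name `modes` and the convention of returning an empty list for "no mode" are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or [] when every value is equally frequent."""
    counts = Counter(values)
    top = max(counts.values())
    # Several distinct values, all tied: the data set has no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

shuttle_days = [8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11]
print(modes(shuttle_days))                                  # [8]
print(modes([2, 3, 5, 7, 8, 10]))                           # []
print(modes([15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]))  # [18, 24]
```

In the stopping-distance data, 18 and 24 each occur three times and no other value occurs as often, which is why the set is bimodal.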
Advantages/Disadvantages
          Advantages                            Disadvantages
Mean      • very common                         • sensitive to outliers
          • nice mathematical properties
          • always exists
Mode      • appropriate for qualitative data    • does not always exist
                                                • sometimes more than one
Median    • always exists                       • does not take all data into account
          • easily calculated
Relations Among the Measures
of Central Tendency
• The mode, median, and mean are all approximately the same value for interval or ratio scale data that are normally (or even just symmetrically) distributed.

When the data distribution is skewed, these values are not the same.
Positively Skewed
For positively skewed data, the mean has a
higher value than the median, and the median
has a higher value than the mode.
Negatively Skewed
For negatively skewed data, the mean has a
lower value than the median, and the median
has a lower value than the mode.
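The ordering of the three measures under skew can be checked on a small data set. Both data sets below are hypothetical and were made up for this sketch; the second is simply the mirror image of the first.

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]  # long tail toward high values
left_skewed = [16 - x for x in right_skewed]    # mirror image: tail toward low values

for name, data in (("positive skew", right_skewed), ("negative skew", left_skewed)):
    print(name, mean(data), median(data), mode(data))
# positive skew: mean 4.6 > median 3 > mode 2
# negative skew: mean 11.4 < median 13 < mode 14
```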
How to choose the appropriate central tendency statistic

Level of measurement:
• nominal: mode
• ordinal: median
• scale: is the distribution badly skewed?
– No: mean
– Yes: median

Measures of Dispersion (Variation)

Variation
• Range
• Variance
– Population Variance
– Sample Variance
• Standard Deviation
– Population Standard Deviation
– Sample Standard Deviation
Understanding Variation
• The more spread out or dispersed the data, the larger the measures of variation
• The more concentrated or homogeneous the data, the smaller the measures of variation
• If all observations are equal, the measures of variation = zero
• All measures of variation are nonnegative
Variability Measures

• Data in a set is spread over a range of values.
• The standard deviation and variance are
related and involve how much the individual
data differs from the data set's mean.
• There are three variability measures of a
data set: range, standard deviation, and
variance.
Measures of Variation - Range

• The range is defined to be the highest value minus the lowest value. The symbol R is used for the range.
• R = highest value – lowest value.
• Extremely large or extremely small data values
can drastically affect the range.
Measures of Variation - Population Variance

The variance is the average of the squares of the distance each value is from the mean.
The symbol for the population variance is \(\sigma^2\) (\(\sigma\) is the Greek lowercase letter sigma).

\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N}, \]
where
X = individual value
\(\mu\) = population mean
N = population size
Measures of Variation - Population Standard
Deviation

The standard deviation is the square root of the variance.

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}. \]
Measures of Variation - Example

• Consider the following data to constitute the population: 10, 60, 50, 30, 40, 20. Find the mean and variance.
• The mean \(\mu\) = (10 + 60 + 50 + 30 + 40 + 20)/6 = 210/6 = 35.
• The variance \(\sigma^2\) = 1750/6 = 291.67. See next slide for computations.
Measures of Variation - Example

X        X - μ     (X - μ)²
10       -25       625
60       +25       625
50       +15       225
30       -5        25
40       +5        25
20       -15       225
Total 210          1750
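The deviation table can be reproduced step by step in Python. This is a minimal sketch with illustrative variable names; `statistics.pvariance` is the standard-library population variance and is used only as a cross-check.

```python
from statistics import pvariance

population = [10, 60, 50, 30, 40, 20]
mu = sum(population) / len(population)              # 35.0
squared_devs = [(x - mu) ** 2 for x in population]  # 625, 625, 225, 25, 25, 225
sigma_sq = sum(squared_devs) / len(population)      # 1750 / 6

print(round(sigma_sq, 2))         # 291.67
print(round(sigma_sq ** 0.5, 2))  # 17.08, the population standard deviation
print(abs(sigma_sq - pvariance(population)) < 1e-9)  # True
```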
Measures of Variation - Sample Variance

The sample variance is an unbiased estimator of the population variance: it is a statistic whose expected value equals the population variance. It is denoted by \(s^2\), where

\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}, \]
and
\(\bar{X}\) = sample mean
n = sample size
Measures of Variation - Sample Standard
Deviation

The sample standard deviation is the square root of the sample variance.

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}. \]
Shortcut Formula for the Sample Variance and the Standard Deviation

\[ s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1} \]

\[ s = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}} \]
Sample Variance - Example

• Find the variance and standard deviation for the following sample: 16, 19, 15, 15, 14.
• \(\sum X\) = 16 + 19 + 15 + 15 + 14 = 79.
• \(\sum X^2\) = 16² + 19² + 15² + 15² + 14² = 1263.
Sample Variance - Example

\[ s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1} = \frac{1263 - (79)^2 / 5}{4} = 3.7 \]

\[ s = \sqrt{3.7} \approx 1.9. \]
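The shortcut formula can be checked against the definitional formula for this sample. The sketch below uses illustrative variable names and compares the shortcut result with `statistics.variance`, which computes the sample variance with the n - 1 divisor.

```python
from math import sqrt
from statistics import variance

sample = [16, 19, 15, 15, 14]
n = len(sample)
sum_x = sum(sample)                    # 79
sum_x_sq = sum(x * x for x in sample)  # 1263

# Shortcut formula: no need to compute the mean or the deviations first.
s_sq = (sum_x_sq - sum_x ** 2 / n) / (n - 1)
s = sqrt(s_sq)

print(round(s_sq, 2))  # 3.7
print(round(s, 2))     # 1.92
print(abs(s_sq - variance(sample)) < 1e-9)  # True
```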
Properties that can help in interpreting a standard deviation
• The standard deviation can never be a negative
number.
• The smallest possible value for the standard
deviation is 0 (no deviation).
• The standard deviation is affected by outliers
(extremely low or extremely high numbers in the
data set). That’s because the standard deviation
is based on the distance from the mean.
How to choose the appropriate measure of variation

Level of measurement:
• ordinal: range
• scale: is the distribution badly skewed?
– No: standard deviation
– Yes: range

Writing up your results

• To write up your results, you need to identify your data analysis technique, report your test statistic, and provide some interpretation of the results. Each analysis you run should be related to your hypotheses.
EXAMPLES
• Mean and Standard Deviation are most clearly presented in parentheses:
The sample as a whole was relatively young (M = 19.22, SD = 3.45).
The average age of students was 19.22 years (SD = 3.45).
• Percentages are also most clearly displayed in parentheses with no decimal places:
Nearly half (49%) of the sample was married.
Assignment 1

• Open the survey.sav file
• Run a descriptive analysis for 3 variables (one nominal, one ordinal, and one scale)
• Interpret the results
The end
Greek Alphabet
Basic guidance when
summarizing data
• Ensure graphic has a title
• Label the components of your graphic
• Indicate source of data with date
• Provide number of observations (n=xx) as
a reference point
• Add footnote if more information is needed
How to Interpret Standard Deviation
• Consider a histogram that represents a set of 100 measurements. The mean of the data is approximately 23, and the standard deviation is approximately 7.
In this picture, we estimate that
• about 70% of the data are within 1 standard deviation of the mean (between 16 and 30),
• about 95% are within 2 standard deviations of the mean (between 9 and 37), and
• all, or almost all, of the data are within 3 standard deviations of the mean (between 2 and 44).
These estimates are consistent with the empirical rule.
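The empirical rule can be explored with simulated data. The sketch below does not use the 100 measurements from the histogram, which are not available; instead it assumes 10,000 hypothetical draws from a normal distribution with mean 23 and standard deviation 7, and counts the share of values within 1, 2, and 3 standard deviations of the sample mean.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical data: 10,000 draws from a normal distribution (mean 23, SD 7).
data = [random.gauss(23, 7) for _ in range(10_000)]
m, s = mean(data), stdev(data)

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(m - k * s <= x <= m + k * s for x in data) / len(data)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

For roughly bell-shaped data the three shares should land near 68%, 95%, and 99.7%; for strongly skewed data they can differ noticeably.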
The Sample Mean for an Ungrouped Frequency
Distribution

The mean for an ungrouped frequency distribution is given by

\[ \bar{X} = \frac{\sum (f \cdot X)}{n}. \]
The Sample Mean for an Ungrouped Frequency
Distribution - Example

The scores for 25 students on a 4-point quiz are given in the table. Find the mean score.

Score, X    Frequency, f
0           2
1           4
2           12
3           4
4           3
The Sample Mean for an Ungrouped Frequency
Distribution - Example

Score, X    Frequency, f    f·X
0           2               0
1           4               4
2           12              24
3           4               12
4           3               12

\[ \bar{X} = \frac{\sum f \cdot X}{n} = \frac{52}{25} = 2.08. \]
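The same calculation can be sketched in Python by storing the frequency table as a mapping from score to frequency. The dictionary name `quiz` is an illustrative choice.

```python
# Score -> frequency, copied from the quiz table.
quiz = {0: 2, 1: 4, 2: 12, 3: 4, 4: 3}

n = sum(quiz.values())                        # 25
sum_fx = sum(f * x for x, f in quiz.items())  # 52
mean_score = sum_fx / n

print(mean_score)  # 2.08
```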
The Sample Mean for a Grouped Frequency
Distribution

The mean for a grouped frequency distribution is given by

\[ \bar{X} = \frac{\sum (f \cdot X_m)}{n}. \]
Here \(X_m\) is the corresponding class midpoint.
The Sample Mean for a Grouped Frequency
Distribution - Example

Given the table below, find the mean.

Class          Frequency, f
15.5 - 20.5    3
20.5 - 25.5    5
25.5 - 30.5    4
30.5 - 35.5    3
35.5 - 40.5    2
The Sample Mean for a Grouped Frequency
Distribution - Example

Table with class midpoints, \(X_m\).

Class          Frequency, f    X_m    f·X_m
15.5 - 20.5    3               18     54
20.5 - 25.5    5               23     115
25.5 - 30.5    4               28     112
30.5 - 35.5    3               33     99
35.5 - 40.5    2               38     76
The Sample Mean for a Grouped Frequency Distribution - Example

\[ \sum f \cdot X_m = 54 + 115 + 112 + 99 + 76 = 456 \]
and n = 17. So

\[ \bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{456}{17} \approx 26.82. \]
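The grouped-mean calculation can be sketched by storing each class as its two boundaries and its frequency, then using the midpoint as the representative value. The tuple layout is an assumption made for this illustration.

```python
# (lower boundary, upper boundary, frequency) for each class in the table.
classes = [(15.5, 20.5, 3), (20.5, 25.5, 5), (25.5, 30.5, 4),
           (30.5, 35.5, 3), (35.5, 40.5, 2)]

n = sum(f for _, _, f in classes)                           # 17
# Each class contributes frequency times class midpoint.
sum_f_xm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)  # 456.0
grouped_mean = sum_f_xm / n

print(round(grouped_mean, 2))  # 26.82
```

Because every value in a class is replaced by the class midpoint, the result is an approximation of the mean of the underlying raw data.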
The Median-Ungrouped Frequency Distribution

• For an ungrouped frequency distribution, find the median by examining the cumulative frequencies to locate the middle value.
• If n is the sample size, compute n/2. Locate the data point where n/2 values fall below and n/2 values fall above.
The Median-Ungrouped Frequency Distribution -
Example

• LRJ Appliance recorded the number of VCRs sold per week over a one-year period. The data are given below.

No. Sets Sold    Frequency
1                4
2                9
3                6
4                2
5                3
The Median-Ungrouped Frequency Distribution -
Example

• To locate the middle point, divide n by 2; 24/2 = 12.
• Locate the point where 12 values would fall below and 12 values will fall above.
• Consider the cumulative distribution.
• The 12th and 13th values fall in class 2. Hence MD = 2.
The Median-Ungrouped Frequency Distribution -
Example

No. Sets Sold    Frequency    Cumulative Frequency
1                4            4
2                9            13
3                6            19
4                2            21
5                3            24

The class for 2 sets sold contains the 5th through the 13th values.

The Median for a Grouped Frequency Distribution

The median can be computed from:

\[ MD = \frac{(n/2) - cf}{f}\,(w) + L_m \]
where
n = sum of the frequencies
cf = cumulative frequency of the class immediately preceding the median class
f = frequency of the median class
w = width of the median class
\(L_m\) = lower boundary of the median class
The Median for a Grouped Frequency
Distribution - Example

Given the table below, find the median.

Class          Frequency, f
15.5 - 20.5    3
20.5 - 25.5    5
25.5 - 30.5    4
30.5 - 35.5    3
35.5 - 40.5    2
The Median for a Grouped Frequency
Distribution - Example

Table with cumulative frequencies.

Class          Frequency, f    Cumulative Frequency
15.5 - 20.5    3               3
20.5 - 25.5    5               8
25.5 - 30.5    4               12
30.5 - 35.5    3               15
35.5 - 40.5    2               17
The Median for a Grouped Frequency
Distribution - Example

• To locate the halfway point, divide n by 2; 17/2 = 8.5 ≈ 9.
• Find the class that contains the 9th value. This will be the median class.
• Consider the cumulative distribution.
• The median class will then be 25.5 - 30.5.
The Median for a Grouped Frequency Distribution

n = 17
cf = 8
f = 4
w = 30.5 - 25.5 = 5
\(L_m\) = 25.5

\[ MD = \frac{(n/2) - cf}{f}\,(w) + L_m = \frac{(17/2) - 8}{4}\,(5) + 25.5 = 26.125. \]
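The grouped-median formula can be sketched as a short function that walks through the classes, tracks the cumulative frequency, and applies the formula to the first class whose cumulative frequency reaches n/2. The function name and the tuple layout are assumptions made for this illustration.

```python
# (lower boundary, upper boundary, frequency) for each class in the table.
classes = [(15.5, 20.5, 3), (20.5, 25.5, 5), (25.5, 30.5, 4),
           (30.5, 35.5, 3), (35.5, 40.5, 2)]

def grouped_median(classes):
    """MD = ((n/2 - cf) / f) * w + L_m for a grouped frequency distribution."""
    n = sum(f for _, _, f in classes)
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cf + f >= n / 2:
            # First class whose cumulative frequency reaches n/2: the median class.
            return (n / 2 - cf) / f * (upper - lower) + lower
        cf += f
    raise ValueError("no classes given")

print(grouped_median(classes))  # 26.125
```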
The Mode for an Ungrouped Frequency
Distribution - Example

Given the table below, find the mode.

Values    Frequency, f
15        3
20        5
25        8    (Mode)
30        3
35        2
The Mode - Grouped Frequency
Distribution
• The mode for grouped data is the modal
class.
• The modal class is the class with the
largest frequency.
• Sometimes the midpoint of the class is
used rather than the boundaries.
The Mode for a Grouped Frequency
Distribution - Example

Given the table below, find the mode.

Class          Frequency, f
15.5 - 20.5    3
20.5 - 25.5    5
25.5 - 30.5    7    (Modal Class)
30.5 - 35.5    3
35.5 - 40.5    2
The Midrange

• The midrange is found by adding the lowest and highest values in the data set and dividing by 2.
• The midrange is a rough estimate of the
middle value of the data.
• The symbol that is used to represent the
midrange is MR.
The Midrange - Example

• Last winter, the city of Brownsville, Minnesota, reported the following number of water-line breaks per month: 2, 3, 6, 8, 4, 1. Find the midrange. MR = (1 + 8)/2 = 4.5.
• Note: Extreme values influence the midrange, so it may not be a typical description of the middle.
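The midrange, and its sensitivity to extreme values, can be sketched as follows. The extra month with 40 breaks is a hypothetical value added here only to show how a single extreme observation moves the result.

```python
def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2

breaks_per_month = [2, 3, 6, 8, 4, 1]
print(midrange(breaks_per_month))  # 4.5

# Hypothetical: one extreme month with 40 breaks pulls the midrange up sharply.
print(midrange(breaks_per_month + [40]))  # 20.5
```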
The Weighted Mean

• The weighted mean is used when the values in a data set are not all equally represented.
• The weighted mean of a variable X is found by multiplying each value by its corresponding weight and dividing the sum of the products by the sum of the weights.
The Weighted Mean

The weighted mean

\[ \bar{X} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w} \]
where \(w_1, w_2, \ldots, w_n\) are the weights for the values \(X_1, X_2, \ldots, X_n\).
Basic example

• Given two school classes, one with 20 students, and one with 30 students, the grades in each class on a test were:
• Morning class = 62, 67, 71, 74, 76, 77, 78, 79, 79, 80, 80, 81, 81, 82, 83, 84, 86, 89, 93, 98
• Afternoon class = 81, 82, 83, 84, 85, 86, 87, 87, 88, 88, 89, 89, 89, 90, 90, 90, 90, 91, 91, 91, 92, 92, 93, 93, 94, 95, 96, 97, 98, 99
• The straight average for the morning class is 80 and the straight average of the afternoon class is 90. The straight average of 80 and 90 is 85, the mean of the two class means. However, this does not account for the difference in number of students in each class, hence the value of 85 does not reflect the average student grade (independent of class).
• The average student grade can be obtained by averaging all the grades, without regard to classes (add all the grades up and divide by the total number of students):

\[ \bar{X} = \frac{1600 + 2700}{20 + 30} = \frac{4300}{50} = 86. \]
• Or, this can be accomplished by weighting the class means by the number of students in each class (using a weighted mean of the class means):

\[ \bar{X} = \frac{(20 \times 80) + (30 \times 90)}{20 + 30} = 86. \]
• Thus, the weighted mean makes it possible to find the average student grade in the case where only the class means and the number of students in each class are available.
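The two-class example can be sketched with only the class means and class sizes as inputs. The function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

class_means = [80, 90]  # morning, afternoon
class_sizes = [20, 30]  # number of students in each class

print(weighted_mean(class_means, class_sizes))  # 86.0
print(sum(class_means) / len(class_means))      # 85.0, the unweighted mean of the two means
```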
Coefficient of Variation

• The coefficient of variation is defined to be the standard deviation divided by the mean. The result is expressed as a percentage.

\[ CVar = \frac{s}{\bar{X}} \cdot 100\% \quad \text{or} \quad CVar = \frac{\sigma}{\mu} \cdot 100\%. \]
Coefficient of Variation
Example:
The mean of the number of sales of cars over a 3-month period is 87 and the standard deviation is 5. The mean of the commissions is $5225 and the standard deviation is $773. Compare the variations of the two.
• The coefficients of variation are

Sales:
\[ CVar = \frac{s}{\bar{X}} \cdot 100\% = \frac{5}{87} \cdot 100\% \approx 5.7\% \]
Commissions:
\[ CVar = \frac{773}{5225} \cdot 100\% \approx 14.8\% \]
• Since the coefficient of variation is larger for commissions, the commissions are more variable than sales.
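The comparison can be sketched with a one-line function. The function name is illustrative; the inputs are the means and standard deviations given in the example.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean_value * 100

cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(round(cv_sales, 1))        # 5.7
print(round(cv_commissions, 1))  # 14.8
```

Because the coefficient of variation is unit-free, it allows the spread of car counts to be compared with the spread of dollar amounts.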
Sample Variance for Grouped and Ungrouped
Data

• For grouped data, use the class midpoints for the observed value in the different classes.
• For ungrouped data, use the same
formula (see next slide) with the class
midpoints, Xm, replaced with the actual
observed X value.
Sample Variance for Grouped and Ungrouped
Data

The sample variance for grouped data:

\[ s^2 = \frac{\sum f \cdot X_m^2 - \left[ (\sum f \cdot X_m)^2 / n \right]}{n - 1}. \]
For ungrouped data, replace \(X_m\) with the observed X value.
Sample Variance for Ungrouped Data - Example

XX ff ffX
X ffX 2
X 2
55 22 1010 5050
66 33 18
18 108
108
77 88 56
56 392
392
88 11 88 64
64
99 66 54
54 486
486
10
10 44 40
40 400
400
24 ffX ffX 2
nn==24 X==186
186 X 2==1500
1500
Sample Variance for Ungrouped Data - Example

The sample variance and standard deviation:

\[ s^2 = \frac{\sum f \cdot X^2 - \left[ (\sum f \cdot X)^2 / n \right]}{n - 1} = \frac{1500 - \left[ (186)^2 / 24 \right]}{23} \approx 2.54 \]

\[ s = \sqrt{2.54} \approx 1.6. \]
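The frequency-table version of the shortcut formula can be sketched as below. The dictionary name `freq` is illustrative, and the cross-check expands the table back into 24 raw values and applies `statistics.variance`.

```python
from math import sqrt
from statistics import variance

# Value X -> frequency f, copied from the table.
freq = {5: 2, 6: 3, 7: 8, 8: 1, 9: 6, 10: 4}

n = sum(freq.values())                               # 24
sum_fx = sum(f * x for x, f in freq.items())         # 186
sum_fx_sq = sum(f * x * x for x, f in freq.items())  # 1500

s_sq = (sum_fx_sq - sum_fx ** 2 / n) / (n - 1)
s = sqrt(s_sq)

print(round(s_sq, 2))  # 2.54
print(round(s, 1))     # 1.6

# Cross-check: expand the table into raw observations.
raw = [x for x, f in freq.items() for _ in range(f)]
print(abs(s_sq - variance(raw)) < 1e-9)  # True
```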
