
Chapter 2

Summation Notation & Central Tendency


(Sections 2.3-2.5, 2.7 and 2.8)

mean

mode

median


Summation Notation (2.3)

Individual observations in a data set are denoted

$x_1, x_2, x_3, x_4, \ldots, x_n$

We often use a summation symbol:

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n \]


Summation Notation

The summation symbol adds all the values of variable x from the first ($x_1$) to the last ($x_n$).

So if $x_1 = 1$, $x_2 = 2$, $x_3 = 3$ and $x_4 = 4$, then

\[ \sum_{i=1}^{4} x_i = 1 + 2 + 3 + 4 = 10 \]


Notation (continued)

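Sometimes we will have to square the values before we add them:

\[ \sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2 \]

\[ \sum_{i=1}^{4} x_i^2 = 1^2 + 2^2 + 3^2 + 4^2 = 1 + 4 + 9 + 16 = 30 \]

Other times we will add them and then square the sum:

\[ \left( \sum_{i=1}^{n} x_i \right)^2 = (x_1 + x_2 + x_3 + \cdots + x_n)^2 \]

\[ \left( \sum_{i=1}^{4} x_i \right)^2 = (1 + 2 + 3 + 4)^2 = 10^2 = 100 \]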
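
The difference between the sum of squares and the square of the sum is easy to check numerically. The following is a minimal Python sketch using the four values from the slide; the variable names are chosen here for illustration only.

```python
# Values from the slide: x1 = 1, x2 = 2, x3 = 3, x4 = 4
x = [1, 2, 3, 4]

total = sum(x)                            # add the values
sum_of_squares = sum(v ** 2 for v in x)   # square each value, then add
square_of_sum = sum(x) ** 2               # add the values, then square the sum

print(total, sum_of_squares, square_of_sum)  # 10 30 100
```

The two quantities differ (30 versus 100), so the order of squaring and summing matters.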


Numerical Measures of Center and Spread

Central tendency (center) is the value or values around which the data tend to cluster.

Variability (spread) shows how strongly the data cluster around that value or values.


Describing the Center of a Data Set (2.4)

Mean

Median

Mode


Describing the Center of a Data Set


Mean
The mean of a set of quantitative data is the sum of the observed values divided by the number of values.

The sample mean for a sample dataset $x_1, x_2, \ldots, x_n$ is denoted by $\bar{x}$ (x-bar), and is calculated by

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

Note that the population mean is denoted by $\mu$ (mu), and is calculated by

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]


Example of Calculating a Mean


House Price in Fancytown
231,000
313,000
299,000
312,000
285,000
317,000
294,000
297,000
315,000
287,000
Sum = 2,950,000

\[ \bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{2{,}950{,}000}{10} = 295{,}000 \]

The average or mean price for this sample of 10 houses in Fancytown is $295,000.


Example of Calculating a Mean


House Price in Lowtown
97,000
93,000
110,000
121,000
113,000
95,000
100,000
122,000
99,000
2,000,000 (outlier)
Sum = 2,950,000

\[ \bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{2{,}950{,}000}{10} = 295{,}000 \]

The average or mean price for this sample of 10 houses in Lowtown is also $295,000.


Comparing the two examples

The mean for both Fancytown and Lowtown is $295,000. This accurately represents the center of the data for Fancytown, but not Lowtown.

[Figure: dotplots for Fancytown and Lowtown on a price axis running to 2,000,000, with the mean of $295,000 marked and the Lowtown outlier labeled]

The mean can be very sensitive to a few extreme values.
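
To see how much one extreme value moves the mean, the two samples can be compared directly. This is a small sketch using Python's standard `statistics` module; leaving out the 2,000,000 house is done here only as an illustration and is not a step from the slides.

```python
from statistics import mean

# House prices from the two example slides
fancytown = [231_000, 313_000, 299_000, 312_000, 285_000,
             317_000, 294_000, 297_000, 315_000, 287_000]
lowtown = [97_000, 93_000, 110_000, 121_000, 113_000,
           95_000, 100_000, 122_000, 99_000, 2_000_000]

mean_fancytown = mean(fancytown)
mean_lowtown = mean(lowtown)

# Illustration only: the Lowtown mean when the 2,000,000 house is left out
lowtown_without_outlier = [p for p in lowtown if p != 2_000_000]
mean_lowtown_without_outlier = mean(lowtown_without_outlier)

print(mean_fancytown, mean_lowtown)          # both are 295000
print(round(mean_lowtown_without_outlier))   # roughly 105556
```

A single house accounts for the whole gap between the Lowtown mean and the prices of the other nine houses.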


Describing the Center of a Data Set


Median
The median of a set of quantitative data is the value
which is located in the middle of the data, arranged
from lowest to highest values (or vice versa), with 50%
of the observations above and 50% below.

Finding the Median, M:

Arrange the n measurements from smallest to largest.
If n is odd, M is the middle number.
If n is even, M is the average of the middle two numbers.

[Figure: ordered data running from the lowest value to the highest value, with the median separating the lower 50% from the upper 50%]


Example of Calculating a Median


House Price in Fancytown (ordered)
231,000
285,000
287,000
294,000
297,000
299,000
312,000
313,000
315,000
317,000

The median is between the two middle values:

\[ M = \frac{297{,}000 + 299{,}000}{2} = 298{,}000 \]

Median, M = $298,000


Example of Calculating a Median


House Price in Lowtown (ordered)
93,000
95,000
97,000
99,000
100,000
110,000
113,000
121,000
122,000
2,000,000

The median is between the two middle values:

\[ M = \frac{100{,}000 + 110{,}000}{2} = 105{,}000 \]

Median, M = $105,000
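
The rule for finding the median (sort, then take the middle value or the average of the two middle values) can be written out as a short function. The sketch below is one possible Python version; the function name `median_by_rule` is hypothetical and chosen for this illustration.

```python
def median_by_rule(values):
    """Median via the slide's rule: sort, then pick or average the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n is odd: the middle number
        return ordered[mid]
    # n is even: the average of the middle two numbers
    return (ordered[mid - 1] + ordered[mid]) / 2


fancytown = [231_000, 313_000, 299_000, 312_000, 285_000,
             317_000, 294_000, 297_000, 315_000, 287_000]
lowtown = [97_000, 93_000, 110_000, 121_000, 113_000,
           95_000, 100_000, 122_000, 99_000, 2_000_000]

print(median_by_rule(fancytown))  # 298000.0
print(median_by_rule(lowtown))    # 105000.0
```

Unlike the mean, the Lowtown median is not pulled upward by the 2,000,000 house.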


Describing the Center of a Data Set


Mode
The mode is the most frequently observed value.
The modal class is the class with the highest relative frequency; for grouped data, the mode is taken as the midpoint of that class.

Finding the Mode


Arrange the n measurements from smallest to largest.
Count the times each number occurs.
1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 7, 7, 8, 9, 10

Mode = 3 (it occurs four times, more often than any other value)
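
The counting step can be sketched in Python with `collections.Counter`. Returning a list of modes is a choice made for this example, so that ties (more than one most frequent value) are visible rather than hidden.

```python
from collections import Counter

# Data from the slide
data = [1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 7, 7, 8, 9, 10]

counts = Counter(data)                    # how many times each number occurs
highest_count = max(counts.values())      # the largest frequency
modes = [value for value, count in counts.items() if count == highest_count]

print(modes, highest_count)  # [3] 4
```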


The Center of a Data Set & Distribution


A data set is symmetric if the left half of the
distribution is exactly (or at least approximately) a
mirror image of the right half.

Mean = Median = Mode


The Center of a Data Set & Distribution


Negative (left) Skew

Mean < Median < Mode

Positive (right) Skew

Mean > Median > Mode


Numerical Measures of Variability

Range

Variance

Standard Deviation


Describing the Variability of a Data Set


Range
Equal to the largest measurement minus the smallest
measurement.
Easy to compute, but not very informative
Considers only two observations (smallest / largest)
Monthly Salaries
3,000
2,000
5,000
8,000
5,000
4,000
9,000

Range (R) = 9,000 - 2,000 = 7,000
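
As a quick check of the salary example, the range needs only the largest and smallest values. A minimal Python sketch:

```python
# Monthly salaries from the slide
salaries = [3_000, 2_000, 5_000, 8_000, 5_000, 4_000, 9_000]

# Range = largest measurement minus smallest measurement
salary_range = max(salaries) - min(salaries)

print(salary_range)  # 7000
```

The five salaries between the extremes play no part in the result, which is why the range is not very informative.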

Describing the Variability of a Data Set


Sample Variance ($s^2$)

For a sample of n measurements, the sample variance is equal to the sum of the squared distances from the mean, divided by (n - 1):

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \]

The population variance is denoted by $\sigma^2$.

Describing the Variability of a Data Set


Sample Standard Deviation (s)

For a sample of n measurements, the sample standard deviation is equal to the square root of the sample variance.
It tells us, roughly on average, how far the data points deviate from the mean.

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

The population standard deviation is denoted by $\sigma$.


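Example of calculating a variance and standard deviation

Sample Dataset: 1, 2, 3

Sample mean, $\bar{x} = 2$

Sample variance,

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2}{n - 1} = \frac{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2}{3 - 1} = 1 \]

Sample standard deviation,

\[ s = \sqrt{s^2} = \sqrt{1} = 1 \]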
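
The same calculation can be sketched in Python, once by hand from the definition and once with the standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor for samples.

```python
from math import sqrt
from statistics import stdev, variance

# Sample dataset from the slide
data = [1, 2, 3]

n = len(data)
x_bar = sum(data) / n                                      # sample mean
squared_distances = [(x - x_bar) ** 2 for x in data]       # (x_i - x_bar)^2
manual_variance = sum(squared_distances) / (n - 1)         # divide by n - 1
manual_std_dev = sqrt(manual_variance)

print(x_bar, manual_variance, manual_std_dev)  # 2.0 1.0 1.0
print(variance(data), stdev(data))             # library results agree
```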


Example of calculating a variance and standard deviation

Calculate the variance and standard deviation for the information provided:

$n = 17$, $\sum x = 12$, $\sum x^2 = 13$


Sample variance:

\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{13 - \frac{(12)^2}{17}}{17 - 1} = .283 \]

Sample standard deviation:

\[ s = \sqrt{s^2} = \sqrt{.283} = .532 \]
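
When only the summary totals are available, the variance can be computed from the sum of the values and the sum of the squared values. The sketch below assumes the totals are read as n = 17, sum of x = 12 and sum of x squared = 13, which is the reading that reproduces the .283 in the worked solution.

```python
from math import sqrt

# Summary totals from the example (assumed reading: sum x = 12, sum x^2 = 13)
n = 17
sum_x = 12
sum_x_squared = 13

# Shortcut formula: s^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)
sample_variance = (sum_x_squared - sum_x ** 2 / n) / (n - 1)
sample_std_dev = sqrt(sample_variance)

print(round(sample_variance, 3), round(sample_std_dev, 3))  # 0.283 0.532
```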

Describing Relative Standing


Percentile
For any set of n measurements (arranged in ascending or descending order), the pth percentile is a number such that p% of the measurements fall below that number and (100 - p)% fall above it.
Upper Quartile (QU or Q3) = 75th percentile
Median (Q2) = 50th percentile
Lower Quartile (QL or Q1) = 25th percentile

Example of Calculating a Percentile


Sample Dataset: 3, 5, 0, 1, 9, 2, 7

Order the dataset: 0, 1, 2, 3, 5, 7, 9

To calculate the 50th percentile:

Step 1: n x p = 7 x (.5) = 3.5, which rounds up to the next integer, 4

Step 2: The value in the 4th location of the ordered list is the 50th percentile = 3

Example of Calculating a Percentile


Sample Dataset: 3, 5, 0, 1, 9, 2, 7, 10

Order the dataset: 0, 1, 2, 3, 5, 7, 9, 10

To calculate the 50th percentile:

Step 1: n x p = 8 x (.5) = 4. Because this is a whole number, take the mean of the 4th and 5th values.

Step 2: The mean of the 4th location value and the 5th location value of the ordered list is the 50th percentile = (3 + 5)/2 = 4
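
The two-step percentile rule from these examples can be sketched as a small Python function. The name `percentile_by_rule` is hypothetical, and passing the percentile as a whole number between 1 and 99 is an assumption made here so that n x p can be checked with integer arithmetic. Statistical software often uses other interpolation rules, so results may differ from library functions.

```python
def percentile_by_rule(values, pct):
    """pct is a whole-number percentile strictly between 0 and 100 (e.g. 50)."""
    if not 0 < pct < 100:
        raise ValueError("pct must be strictly between 0 and 100")
    ordered = sorted(values)
    # whole = integer part of n x p, remainder tells us if n x p is fractional
    whole, remainder = divmod(len(ordered) * pct, 100)
    if remainder:
        # n x p is not a whole number: round up and take that location.
        # Location whole + 1 is list index whole (lists are 0-based).
        return ordered[whole]
    # n x p is a whole number: mean of that location and the next one
    return (ordered[whole - 1] + ordered[whole]) / 2


print(percentile_by_rule([3, 5, 0, 1, 9, 2, 7], 50))      # 3
print(percentile_by_rule([3, 5, 0, 1, 9, 2, 7, 10], 50))  # 4.0
```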

Describing the Relative Standing


Sample z-score
Tells the distance between a measurement x and the mean ($\bar{x}$), expressed in terms of standard deviations.
The sample z-score for a measurement x is

\[ z = \frac{x - \bar{x}}{s} \]

Example of Calculating a z-score


Given verbal SAT scores for 2,000 high school seniors:
Mean ($\bar{x}$) = 550
Standard Deviation (s) = 75

Joe Smith's score = 475

\[ z = \frac{x - \bar{x}}{s} = \frac{475 - 550}{75} = -1 \]
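
The z-score calculation is a one-line formula. A minimal Python sketch, with a hypothetical helper name `z_score`:

```python
def z_score(x, mean, std_dev):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - mean) / std_dev


# SAT example from the slide: mean 550, standard deviation 75, score 475
joe_z = z_score(475, 550, 75)
print(joe_z)  # -1.0
```

A z-score of -1 means the score lies one standard deviation below the mean.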

Methods for Determining Outliers


Outlier
A measurement that is unusually large or
small relative to the other values.
Possible causes:
1. Observation, recording or data entry error
2. Item is from a different population
3. A rare, chance event


Using a boxplot to identify outliers


The box plot is a graph representing information
about certain percentiles for a data set and can be
used to identify outliers. Boxplots

plot the five-number summary

show the spread of the data

detect outliers


The Five-number summary

Minimum Value
Lower Quartile (QL)
Median
Upper Quartile (QU)
Maximum Value

[Figure: boxplot on an axis running from 30 to 55, with the five values above labeled]


Quartiles and the Interquartile Range

Lower Quartile (QL) = median of the lower half of data set.

Upper Quartile (QU) = median of the upper half of data set.

Interquartile Range (IQR) = upper quartile - lower quartile

IQR = QU - QL

[Figure: boxplot on an axis running from 30 to 55]


Outliers
An observation is an outlier if it is

< Lower inner fence = QL - 1.5 x IQR

> Upper inner fence = QU + 1.5 x IQR

An outlier is extreme if it is

< Lower outer fence = QL - 3 x IQR

> Upper outer fence = QU + 3 x IQR


Outliers Example

Student Ages (79 values in ascending order, reading down each column)

17  19  19  20  21  22  22  25
18  19  19  20  21  22  23  26
18  19  19  20  21  22  23  28
18  19  19  20  21  22  23  28
18  19  19  20  21  22  23  30
18  19  19  20  21  22  23  37
19  19  20  21  21  22  23  38
19  19  20  21  21  22  24  44
19  19  20  21  21  22  24  47
19  19  20  21  21  22  24

Lower Quartile = 19
Median = 21
Upper Quartile = 22

IQR = 22 - 19 = 3


Outliers Example

Outliers

Lower inner fence = 19 - (1.5 x 3) = 14.5

Upper inner fence = 22 + (1.5 x 3) = 26.5


Outliers Example

Lower outer fence = 19 - (3 x 3) = 10

Upper outer fence = 22 + (3 x 3) = 31

Moderate Outliers (beyond an inner fence but not beyond an outer fence): 28, 28, 30

Extreme Outliers (beyond an outer fence): 37, 38, 44, 47
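
The fence rules can be checked against the student ages with a short Python sketch. The ages are entered as counts per age, taken from the table in the example. Quartiles are computed as medians of the lower and upper halves, as defined earlier; leaving the overall median out of both halves when n is odd is an assumption made for this sketch.

```python
from statistics import median

# Student ages from the example, as {age: number of students} (79 in total)
age_counts = {17: 1, 18: 5, 19: 20, 20: 10, 21: 14, 22: 11, 23: 6, 24: 3,
              25: 1, 26: 1, 28: 2, 30: 1, 37: 1, 38: 1, 44: 1, 47: 1}
ages = sorted(age for age, count in age_counts.items() for _ in range(count))

n = len(ages)
half = n // 2
lower_half = ages[:half]
# Assumption: with an odd n, the middle value belongs to neither half
upper_half = ages[half + 1:] if n % 2 else ages[half:]

q_lower = median(lower_half)
q_upper = median(upper_half)
iqr = q_upper - q_lower

inner_low, inner_high = q_lower - 1.5 * iqr, q_upper + 1.5 * iqr
outer_low, outer_high = q_lower - 3 * iqr, q_upper + 3 * iqr

# Moderate: beyond an inner fence but not beyond an outer fence
moderate = [a for a in ages
            if outer_low <= a < inner_low or inner_high < a <= outer_high]
# Extreme: beyond an outer fence
extreme = [a for a in ages if a < outer_low or a > outer_high]

print(q_lower, q_upper, iqr)     # 19 22 3
print(inner_low, inner_high)     # 14.5 26.5
print(outer_low, outer_high)     # 10 31
print(moderate, extreme)         # [28, 28, 30] [37, 38, 44, 47]
```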


Outliers Example

[Figure: student ages on a boxplot, labeling the smallest and largest data values that are not outliers, the mild outliers, and the extreme outliers]


Comparative Boxplot Example

By putting boxplots of two separate groups or subgroups side by side, we can compare their distributional behaviors.
