ECON1203/ECON2292
Business and Economic
Statistics
Week 2
Week 2 topics
Measures of central tendency or location
Measures of dispersion or spread
Measures of association
Introduction to linear regression
Key references
Keller Chapter 4, especially 4.1-4.4
Numerical summaries of key
features
Previously used graphical methods to
summarize data
Can also use numerical summaries of sample or
population data
Key features of a single variable
Location, spread, relative standing, skewness
Key feature of two variables
Measures of (linear) association
Measures of location
A parameter describes key feature of a population
A statistic describes key feature of a sample
A natural measure of location is the arithmetic mean
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Other variants
Weighted mean (WAM)
Geometric mean
Measures of location
Median is middle value of ordered observations
When n is odd median will be unique
When n is even average middle two values
Median depends on ranks of observations
Doubling largest observation will not change median
Sample of n = 5 observations: 0, 0, 1, 3, 3a; $a \geq 1$
Median = 1
$\bar{x} = \frac{4 + 3a}{5} \geq \frac{7}{5}$
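A minimal sketch of the claim that doubling the largest observation leaves the median unchanged, using the n = 5 sample 0, 0, 1, 3, 6 that appears later in these slides. Only the Python standard library is assumed; the variable names are illustrative.

```python
from statistics import mean, median

sample = [0, 0, 1, 3, 6]                      # ordered n = 5 sample from the slides
largest = max(sample)
doubled = [2 * x if x == largest else x for x in sample]  # double only the largest value

# The median depends on ranks, so it stays put; the mean moves.
print("original:", mean(sample), median(sample))
print("doubled: ", mean(doubled), median(doubled))
```

The mean is pulled upward by the inflated observation while the median is not, which is why the median may be preferred when outliers are present.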
Measures of location
Mode is most frequently occurring value(s)
Modal class previously defined in context of unimodal
histograms
Mean, median & mode all provide different notions
of representative or typical central values
For quantitative data mode usually not a very useful
measure of location
What about mean vs. median? Depends
For symmetric distributions mean=median
When positively (negatively) skewed mean >(<) median
Median may be preferred when outliers appear in data
Outlier...s
A French woman was amazed
when she received a €11,721
trillion phone bill (5000 times
the GDP of France). She
spoke to the telecoms
company and was told nothing
could be done (calculated
automatically) but she could
pay in instalments! Later they
admitted the bill was a mistake
(should have been €117.21)
and waived it...
http://www.guardian.co.uk/business/2012/oct/11/french-phone-bill
[Figure: histogram titled "Cricket batting averages"; x-axis: Average (30 to 100), y-axis: Frequency (0 to 60)]
Measures of variability
Keller Ex 3.2 compares
rates of returns on 2
investments
Mean returns (%):
A = 10.95, B = 12.76
Should we choose to
invest in B?
If not why not?
Measures of variability
Range is a simple measure of variability
Range = maximum - minimum
For investment example ranges are:
A: 63.00 - (-21.95) = 84.95
B: 68.00 - (-38.47) = 106.47
Range is simple but potentially misleading
1 1 1 50 50: range = 49
1 10 20 40 50: range = 49
Is variability the same here?
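One way to probe the question is to compare the two samples on a second measure of spread. The sketch below uses the sample variance (defined a little later in the slides) from Python's standard library; the helper name `value_range` is illustrative.

```python
from statistics import variance

def value_range(data):
    """Range = maximum - minimum."""
    return max(data) - min(data)

sample_1 = [1, 1, 1, 50, 50]
sample_2 = [1, 10, 20, 40, 50]

for s in (sample_1, sample_2):
    # Same range for both samples, but the sample variances differ.
    print(s, "range =", value_range(s), "sample variance =", round(variance(s), 1))
```

Both ranges equal 49, yet the first sample, whose observations sit at the two extremes, has the larger variance. The range only looks at the two end points.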
Measures of variability
Why not measure spread
around location?
Variance is most common
measure of variability
Measures average squared
distance from the mean
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Division by n-1 for sample
variance relates to
properties of estimators
Look ahead to Keller Ch
10 for justification
Measures of variability
Recall simple n = 5 sample
with a = 2 (median example, Slide 5)
0 0 1 3 6
sample mean = ?
sample variance = ?
Variance is in squared units
Standard deviation is
spread measured in original
units
$\sigma = \sqrt{\sigma^2}$
$s = \sqrt{s^2}$
Other measures of
variability?
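The two questions above can be worked through directly from the definitions. This is a small sketch in plain Python; the variable names are illustrative and the population variance is included only for contrast with the n-1 divisor.

```python
from math import sqrt

sample = [0, 0, 1, 3, 6]
n = len(sample)

x_bar = sum(sample) / n                                # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in sample)     # sum of squared deviations

s2 = sum_sq_dev / (n - 1)        # sample variance (divide by n - 1)
sigma2 = sum_sq_dev / n          # population-style variance (divide by n)
s = sqrt(s2)                     # standard deviation, back in original units

print(x_bar, s2, sigma2, round(s, 3))   # 2.0 6.5 5.2 2.55
```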
Measures of variability
Again should we invest in A or B?
Mean returns:
A: 10.95, B: 12.76
Variance of returns
A: 479.35, B: 786.62
Standard deviation of returns
A: $s_A^2 = 479.35$, $s_A = (479.35)^{1/2} = 21.89$
B: $s_B^2 = 786.62$, $s_B = (786.62)^{1/2} = 28.05$
Calculations in investment data when n = 50?
Efficient methods exist for variance or use EXCEL
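The slide does not name the efficient method. One standard candidate (an assumption here, not confirmed by the source) is the shortcut form of the sample variance, which needs only the totals of $x_i$ and $x_i^2$ rather than each deviation from the mean. Checked against the n = 5 sample 0 0 1 3 6:

```latex
\[
\begin{aligned}
s^2 &= \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] \\
    &= \frac{1}{4}\left[46 - \frac{10^2}{5}\right] = \frac{26}{4} = 6.5
\end{aligned}
\]
```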
Measures of variability
Using mean & standard deviation in combination we
can standardize data
Create transformed variable with zero mean, unit variance
& hence free of units of measurement
Calculate Z scores
(observation - mean) divided by standard deviation
For investment A maximum return is 63% which has a Z
score of (63 - 10.95)/21.89 = 2.38
or
63% is 2.38 standard deviation units above mean return
Helpful because mean & standard deviation depend on
units of measurement e.g. proportions or percentages
If returns in proportions (0.63 not 63%) how would Z score
change?
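A quick way to explore the last question is to compute the Z score in both units. The sketch below uses the investment A figures from the slides; the function name `z_score` is illustrative, and the proportion values are simply the percentage values divided by 100.

```python
def z_score(observation, mean, std_dev):
    """(observation - mean) divided by standard deviation."""
    return (observation - mean) / std_dev

# Investment A, maximum return, in percentage points
z_percent = z_score(63.0, 10.95, 21.89)

# Same quantities expressed as proportions (each divided by 100)
z_proportion = z_score(0.63, 0.1095, 0.2189)

print(round(z_percent, 2), round(z_proportion, 2))
```

Rescaling multiplies the numerator and the denominator by the same factor, so the Z score is free of units.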
Coefficient of variation
Sometimes measure variation relative to location
Case 1: observations all in millions & standard
deviation is 20 ⇒ relatively little variability
Case 2: observations all positive but less than 100 &
s = 20 ⇒ may be a lot of variability
(Sample) coefficient of variation, $cv = s / \bar{x}$
Provides measure of relative variability
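As an illustration, the coefficient of variation can be applied to the two investments using the means and standard deviations quoted earlier in the slides. This comparison is not made in the source; it is a sketch, and the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Sample coefficient of variation: s divided by the sample mean."""
    return std_dev / mean

cv_a = coefficient_of_variation(21.89, 10.95)   # investment A
cv_b = coefficient_of_variation(28.05, 12.76)   # investment B

print(round(cv_a, 2), round(cv_b, 2))
```

On these figures B is more variable than A even relative to its higher mean return.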
Measures of relative standing
Median relies on ranking of data to measure
location
Can generalize this notion to percentiles
P-th percentile is the value for which P percent of
observations are less than that value
Median is the 50th percentile
25th & 75th percentiles called lower & upper quartiles
Difference between upper & lower quartiles called the
inter-quartile range - another measure of spread
Measures of relative
standing
Simple example
Denote 25th, 50th & 75th percentiles by $Q_1$, $Q_2$ & $Q_3$
Suppose n = 8, & ordered data are
$x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8$
Need to divide data into quarters
Thus $Q_1 = (x_2 + x_3)/2$, $Q_2 = (x_4 + x_5)/2$, $Q_3 = (x_6 + x_7)/2$
IQR = $Q_3 - Q_1$
See Keller Ex 4.11 for another example
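The n = 8 rule above matches taking the median of each half of the ordered data. The sketch below implements that "median of halves" convention; it is one of several quartile conventions, so Excel or other software may return slightly different values. The data and the function name are hypothetical.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half.

    For odd n the middle observation is left out of both halves
    (an assumption; conventions differ).
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(ordered), median(upper)

data = [3, 5, 6, 8, 10, 13, 14, 20]   # hypothetical ordered sample, n = 8
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)   # 5.5 9.0 13.5 8.0
```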
Measures of association
Do large values of x tend to
be associated with large
values of y?
Graphical answer from
scatter plots
Covariance is a numerical
measure
Positive (negative)
covariance ⇒ positive
(negative) linear
association
Zero covariance ⇒ no
linear association
Population covariance: $\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
Sample covariance: $s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
Measures of association
Covariance is not scale
free
Is covariance of 500 big?
Covariance between
height & weight depends
on units used
Correlation coefficient
is standardized measure
of association that is unit-
free
Population correlation: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Sample correlation: $r = \frac{s_{xy}}{s_x s_y}$
$-1 \leq \rho \leq 1$ and $-1 \leq r \leq 1$
1 (-1) ⇒ perfect positive
(negative) linear
relationship
Correlations in each of these
scatter plots? (Keller Fig. 3.13)
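To make the unit dependence concrete, the sketch below computes the sample covariance and correlation for height and weight with height recorded in centimetres and then in metres. The data are hypothetical and the function names are illustrative; only the formulas come from the slides.

```python
from math import sqrt

def sample_covariance(x, y):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r = s_xy / (s_x * s_y); the variance of x is its covariance with itself."""
    s_x = sqrt(sample_covariance(x, x))
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical data
weight_kg = [52.0, 58.0, 69.0, 77.0, 90.0]        # hypothetical data
height_m = [h / 100 for h in height_cm]

cov_cm = sample_covariance(height_cm, weight_kg)
cov_m = sample_covariance(height_m, weight_kg)
r_cm = sample_correlation(height_cm, weight_kg)
r_m = sample_correlation(height_m, weight_kg)

print(cov_cm, cov_m)   # covariance shrinks by a factor of 100
print(r_cm, r_m)       # correlation is unchanged
```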
Least squares: The problem
Have $(y_i, x_i)$ pairs for i = 1, ..., n
Portrayed graphically in scatter plot
Interested in linear relationship between y & x
How do you determine the intercept & slope in this
relationship?
Choose values that give the best fit
What do you mean by best fit?
One approach is minimize residual sum of squares
Method called least squares
Basis of regression analysis (see Keller Ch 16)
Least squares: The diagram
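[Figure: scatter plot of y against x with the fitted line $\hat{y} = b_0 + b_1 x$, marking the intercept $b_0$ and a residual $e$]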
Least squares: The
optimization problem
Assume $\hat{y}_i = b_0 + b_1 x_i$, where $b_0$ & $b_1$ are chosen to minimize
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Thus solution will be intercept and slope estimates that
minimize the residual sum of squares
Least squares: The solution
$b_1 = \frac{s_{xy}}{s_x^2}$, $b_0 = \bar{y} - b_1 \bar{x}$
Note:
Point of the means will lie on line of best fit
$b_1$ will have same sign as covariance (correlation) between
y & x
Zero covariance (correlation) ⇒ $b_1 = 0$ ?
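The solution formulas can be checked numerically. The sketch below, on hypothetical data, computes $b_1 = s_{xy}/s_x^2$ and $b_0 = \bar{y} - b_1 \bar{x}$, then confirms that nudging either estimate increases the residual sum of squares and that the point of the means lies on the fitted line. Function and variable names are illustrative.

```python
def least_squares(x, y):
    """Return (b0, b1) using b1 = s_xy / s_x^2 and b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residual_sum_of_squares(b0, b1, x, y):
    """Sum over i of (y_i - (b0 + b1 * x_i))^2."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1]       # hypothetical data

b0, b1 = least_squares(x, y)
best = residual_sum_of_squares(b0, b1, x, y)

print("b0 =", round(b0, 3), "b1 =", round(b1, 3), "RSS =", round(best, 4))
# Moving away from the least squares estimates makes the fit worse.
print(residual_sum_of_squares(b0 + 0.1, b1, x, y) > best)
print(residual_sum_of_squares(b0, b1 + 0.1, x, y) > best)
# The point of the means (x_bar, y_bar) lies on the fitted line.
print(abs((b0 + b1 * 3.0) - sum(y) / len(y)) < 1e-9)
```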
Internet use: Keller exercise
3.52 and extension
Problem
Interested in internet usage
Particularly relationship between education & internet use
Data
Random sample of 15 adults
Two variables
Education (years)
Internet use (hours in previous week)
What are the key features of these variables & their
relationship?
Internet use: Excel summary
statistics
                      Education    Internet use
Mean                  12.667       10.000
Standard Error        0.779        1.857
Median                11           10
Mode                  11           0
Standard Deviation    3.016        7.191
Sample Variance       9.095        51.714
Kurtosis              -0.114       -0.432
Skewness              0.586        0.181
Range                 11           24
Minimum               8            0
Maximum               19           24
Sum                   190          150
Count                 15           15
Internet use: Scatter diagram &
fitted regression line
[Figure: scatter plot titled "Internet use" with fitted regression line; x-axis: Education (0 to 20), y-axis: Hours of use (0 to 30)]
Internet use: Regression line
Covariance
                Education    Internet use
Education       8.489
Internet use    14.267       48.267

Correlation
                Education    Internet use
Education       1
Internet use    0.705        1

$b_1 = 15.286/9.095 = 1.681$
$b_0 = 10 - 1.681 \times 12.667 = -11.29$
Be careful: EXCEL uses population formulae in calculating
covariances, so the sample covariance is 15.286 = 14.267*(15/14)
See Keller p. 138
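The regression line can be rebuilt from the Excel output alone. The sketch below takes the means, the sample variance of education and the population covariance matrix as printed in the slides, rescales the covariance by n/(n-1), and applies the least squares formulas. Variable names are illustrative; the raw data are not used.

```python
from math import sqrt

n = 15
mean_education = 12.667          # Excel summary statistics
mean_internet = 10.000
sample_var_education = 9.095     # Excel "Sample Variance" (n - 1 divisor)

pop_cov = 14.267                 # Excel covariance output (n divisor)
pop_var_education = 8.489
pop_var_internet = 48.267

# Convert the population covariance to the sample covariance.
sample_cov = pop_cov * n / (n - 1)

b1 = sample_cov / sample_var_education
b0 = mean_internet - b1 * mean_education

# Correlation is the same whichever divisor is used, provided it is used consistently.
r = pop_cov / sqrt(pop_var_education * pop_var_internet)

print(round(sample_cov, 3), round(b1, 3), round(b0, 2), round(r, 3))
# 15.286 1.681 -11.29 0.705
```

On these figures each extra year of education is associated with roughly 1.7 more hours of internet use per week.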
Internet use: Summary
Progress report #1
Descriptive statistics (Emphasis of course so far)
What are the key features of data?
How can we best describe these features so that analysis is
informative?
Inferential statistics (Emphasis of course to come)
Extracting information about population parameters on basis of
sample statistics
What does a sample mean tell us about a population mean?
Typically only alternative because difficult or impossible to determine
population mean
Need more foundations before covering later in course