Week 1.2 - 2019T1

ECON1203
Business and Economic Statistics
Week 1
Administrivia
⚫ You should have done your Week 2 Tutorial
questions already.
⚫ Project milestone 1
⚫ You should be thinking about this project and the
requirements for the first milestone of your project
☺.
⚫ This is due in week 4.
Numerical summaries
⚫ Graphs aren’t the only way to summarize data.
⚫ Key features of a single variable:
⚫ Location, spread, relative location, skewness
⚫ Key features of two variables:
⚫ Measures of (linear) association
Numerical summaries
⚫ We are interested in numerical summaries
because the tools in statistics we have can only
deal with quantitative data.
⚫ Think about using statistics for M/F for gender. They need to be recoded as 0 or 1, etc.
⚫ This is simply another way to present the same summary ☺.
⚫ Always remember: We are interested in a
population, but typically have data only on a
sample. This is why we’re doing this course ☺.
Measures of location
⚫ A parameter describes a key feature of a population
⚫ A statistic describes a key feature of a sample
⚫ A natural measure of “location” or “central tendency” (a key feature) is the arithmetic mean
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
⚫ Other variants:
⚫ Weighted mean (WAM)
⚫ Geometric mean
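The three kinds of mean listed above can be compared in a short sketch. The marks, units of credit and growth factors below are hypothetical values chosen only for illustration; they do not come from the lecture.

```python
from statistics import mean, geometric_mean

# Hypothetical course marks and units of credit (the weights for a WAM).
marks = [65, 80, 72]
credits = [6, 6, 12]

# Arithmetic mean: every mark counts equally.
arithmetic = mean(marks)

# Weighted mean: each mark counts in proportion to its units of credit.
weighted = sum(m * w for m, w in zip(marks, credits)) / sum(credits)

# Geometric mean: suited to multiplicative quantities such as growth factors.
growth = [1.10, 0.95, 1.20]  # hypothetical gross growth factors
geometric = geometric_mean(growth)

print(arithmetic, weighted, geometric)
```

The weighted mean differs from the arithmetic mean here only because the third course carries twice the weight of the others.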
Measures of location…
⚫ The median is the middle value of ordered observations
⚫ When n is odd, the median will be a particular value
⚫ When n is even, the median is the average of the middle two values
⚫ The median depends on ranks of observations, not absolute values
⚫ Doubling the largest observation will not change the median!
Sample of n = 5 observations: 0 0 1 3 3a, with $a \ge 1$
Median = 1
$\bar{x} = \frac{4 + 3a}{5} \ge \frac{7}{5}$
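A minimal sketch of the slide's example, using Python's standard `statistics` module: the largest observation 3a is scaled up, and the median stays put while the mean moves. The helper name `sample` is an assumption made for this illustration.

```python
from statistics import mean, median

def sample(a):
    """The slide's sample of five observations, with largest value 3a."""
    return [0, 0, 1, 3, 3 * a]

# Increase the largest observation and watch which summary reacts.
for a in (1, 2, 10):
    data = sample(a)
    print(a, median(data), mean(data))
```

The median depends only on which observation sits in the middle rank, so it is 1 for every a of at least 1, whereas the mean equals (4 + 3a)/5.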
Measures of location…
⚫ Mode: The most frequently occurring value(s)
⚫ “Modal class” (the most common class) was previously
defined in the context of unimodal histograms
⚫ The mean, median and mode each provide different
notions of “representative” or “typical” central values
⚫ We don’t use the mode all that much
⚫ What about mean vs. median?
⚫ For symmetric distributions, mean=median
⚫ For positively (negatively) skewed data: mean > (<) median
⚫ Median often preferred when the data contain outliers
⚫ As per your text and homework: Which is better depends on what you are going to DO with the information
⚫ Do you know where the mean, median and mode are for this histogram?
[Figure: histogram; y-axis: Frequency (0 to 40), x-axis: Bin (-8 to 28, then "More")]
Outliers
⚫ A Malaysian man was speechless when he received a $218 trillion phone bill and was ordered to pay up within 10 days or face prosecution... It wasn't clear whether the bill was a mistake...
Associated Press
11 April 2006
[Figure: histogram titled "Cricket batting averages"; y-axis: Frequency (0 to 60), x-axis: Average (30 to 100)]
Measures of variability
⚫ Range is a simple measure of variability
⚫ Range = maximum – minimum
⚫ Range is simple, but potentially misleading
1 1 1 50 50 ➔ range = 49
1 10 20 40 50 ➔ range = 49
⚫ Is the “variability” of these two data sets the same?
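One way to explore the question above is to compute the range alongside a second measure of spread for the two data sets. The sketch below uses the sample variance from Python's `statistics` module as that second measure; the helper name `value_range` is an assumption.

```python
from statistics import variance

def value_range(data):
    """Range = maximum - minimum."""
    return max(data) - min(data)

set_a = [1, 1, 1, 50, 50]
set_b = [1, 10, 20, 40, 50]

for name, data in (("A", set_a), ("B", set_b)):
    # Same range for both sets, but a different sample variance.
    print(name, value_range(data), round(variance(data), 1))
```

The range looks only at the two extreme observations, so it cannot distinguish values piled up at the extremes from values spread evenly in between.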
Measures of variability…
⚫ Why not measure the spread of values around the “location” or “central tendency” of the data?
⚫ Variance is the most common measure of variability
⚫ Measures average squared distance from the mean
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
⚫ Division by n–1 for sample variance relates to properties of estimators
⚫ More explanation is provided in this week’s Homework
⚫ Suppose we have a sample of 5 observations on some variable:
0 0 1 3 6
sample mean = ?
sample variance = ?
$\bar{x} = \frac{0 + 0 + 1 + 3 + 6}{5} = 2$
$s^2 = \frac{(0-2)^2 + (0-2)^2 + (1-2)^2 + (3-2)^2 + (6-2)^2}{5-1} = \frac{26}{4} = 6.5$
⚫ Variance is in squared units
⚫ The standard deviation is the spread measured in the original units of the data (NOT squared)
⚫ $\sigma = \sqrt{\sigma^2}$
⚫ $s = \sqrt{s^2}$
$M_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$
$M_2 = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
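The worked example with observations 0, 0, 1, 3, 6 can be reproduced as a quick check. This is a sketch that computes the sample variance by hand (dividing by n - 1) and compares it with the standard library; the variable names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [0, 0, 1, 3, 6]
n = len(data)
xbar = mean(data)

# Sample variance: sum of squared deviations divided by n - 1.
s2_by_hand = sum((x - xbar) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance, back in original units.
s = sqrt(s2_by_hand)

# Average of the raw (unsquared) deviations, which always cancels to zero.
m1 = sum(x - xbar for x in data) / n

print(xbar, s2_by_hand, s, m1)
```

The zero value of the raw average deviation is one reason the deviations are squared before averaging.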
⚫ Suppose we can invest in Mutual Fund “A” or Mutual
Fund “B”:
⚫ Mean returns:
A: 10.95 B: 12.76
⚫ Variance of returns:
A: 479.35 B: 786.62
⚫ Standard deviation of returns:
A: $s_A^2 = 479.35$, so $s_A = (479.35)^{1/2} = 21.89$
B: $s_B^2 = 786.62$, so $s_B = (786.62)^{1/2} = 28.05$
⚫ Calculations when n is large are tedious

⚫ Using Excel makes it much easier
⚫ What is the impact of variance? Have a look below.
[Figure: panels of histograms; y-axis: Frequency, x-axis: Bin]
Standardizing data
⚫ We can create a transformed variable with zero mean and
“unit” (i.e., 1) variance from any original quantitative variable
⚫ This transformed variable is free of units of measurement
⚫ This is called calculating Z-scores (one Z-score per
observation)
⚫ Calculate [observation – mean] and then divide this difference by the
standard deviation.
⚫ Suppose that for Mutual Fund A, the maximum return is 63%. This point
has a Z-score of (63–10.95)/21.89 = 2.38
...which implies that...
63% is 2.38 standard deviations above the mean return
⚫ Z-scores allow direct comparisons across original variables
measured in different units
⚫ If returns were reported as proportions (e.g., 0.63 rather than 63%), how would the Z-score change?
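The Z-score calculation for Mutual Fund A can be sketched as below, including the same return expressed as a proportion. The function name `z_score` is an assumption; the numbers are the ones quoted on the slide.

```python
def z_score(value, mean, std_dev):
    """[observation - mean] divided by the standard deviation."""
    return (value - mean) / std_dev

# Returns measured in percent, as on the slide.
z_percent = z_score(63, 10.95, 21.89)

# The same figures measured as proportions (everything divided by 100).
z_proportion = z_score(0.63, 0.1095, 0.2189)

print(round(z_percent, 2), round(z_proportion, 2))
```

Rescaling the data rescales the deviation and the standard deviation by the same factor, which is why a Z-score carries no units.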
Coefficient of variation
⚫ Sometimes we wish to measure variation
relative to location
⚫ Case 1: Observations all measured in millions, and
standard deviation is 20 ➔ relatively little variability
⚫ Case 2: Observations all positive but less than 100 ➔
s=20 may indicate a lot of variability
⚫ (Sample) coefficient of variation, $cv = s / \bar{x}$
⚫ Provides a measure of relative variability
⚫ Again, comparable across variables
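As an illustration, the coefficient of variation can be applied to the Mutual Fund A and B figures quoted earlier in the lecture. This is only a sketch; the function name is an assumption, and the measure is most meaningful when the mean is positive.

```python
def coefficient_of_variation(std_dev, mean):
    """Sample coefficient of variation: s divided by the sample mean."""
    return std_dev / mean

cv_a = coefficient_of_variation(21.89, 10.95)  # Mutual Fund A
cv_b = coefficient_of_variation(28.05, 12.76)  # Mutual Fund B

print(round(cv_a, 2), round(cv_b, 2))
```

On these figures Fund B has the larger standard deviation relative to its mean return as well as in absolute terms.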
Measures of relative location
⚫ The median relies on a ranking of
observations to measure location
⚫ This idea generalizes to percentiles
⚫ The Pth percentile is the value for which P percent
of observations are less than that value
⚫ The median is the 50th percentile
⚫ The 25th and 75th percentiles are called, respectively,
the lower and upper quartiles
⚫ The difference between the upper and lower quartiles is called the interquartile range - another measure of spread
Measures of relative location…
⚫ Simple example
⚫ Denote the 25th, 50th and 75th percentiles by Q1,
Q2 and Q3
⚫ Suppose n = 8, and ordered data are:
x1, x2, x3, x4, x5, x6, x7, x8
⚫ We need to divide data into quarters
⚫ Thus: Q1= (x2+x3)/2, Q2= (x4+x5)/2, Q3= (x6+x7)/2
⚫ IQR = Q3 – Q1
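The quartering rule in this example can be sketched by taking the median of the lower half and of the upper half of the ordered data. The data values are hypothetical, and the helper `quartiles` is an assumption; statistical packages use several slightly different quartile conventions, so their answers may differ from this rule.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 as medians of the lower half, whole sample, upper half."""
    xs = sorted(values)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[-half:]
    return median(lower), median(xs), median(upper)

data = [2, 4, 5, 7, 8, 10, 13, 20]  # hypothetical ordered sample, n = 8
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

For n = 8 this reproduces Q1 = (x2+x3)/2, Q2 = (x4+x5)/2 and Q3 = (x6+x7)/2.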
⚫ Think about where these percentiles are here.
[Figure: histogram; y-axis: Frequency (0 to 40), x-axis: Bin (-8 to 28, then "More")]
Measures of relative location…
Say that I am interested in the company “Helloworld” (ASX Code HLO) with a market capitalisation of $147.5m.
Q: In terms of the AllOrds, is this a big company?
Data: 489 companies in All Ordinaries (March 7 2015)

Summary Statistics on variable MarketCap ($m)
Obs    Mean       St. Dev     Min     Max
489    3452.98    12760.75    2.79    147686.00

Percentile    $m
10            71.04
20            124.10
30            185.12
40            284.96
50            468.43
60            691.81
70            1303.32
80            2485.47
90            6493.60

Helloworld ranks 371 out of 489 by market capitalisation.
From the percentile table, we see that Helloworld is between the 20th and 30th percentile by market capitalisation.
In fact it is exactly at percentile 100*((489-371)/489) = 24.1.
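The two steps in the Helloworld example, locating a value between tabulated percentiles and converting a rank into a percentile, can be sketched as follows. The function names are assumptions; the figures are those in the slide's table.

```python
from bisect import bisect_left

percentiles = [10, 20, 30, 40, 50, 60, 70, 80, 90]
market_cap = [71.04, 124.10, 185.12, 284.96, 468.43,
              691.81, 1303.32, 2485.47, 6493.60]  # $m at each percentile

def bracket(value):
    """Return the tabulated percentiles lying just below and above value."""
    i = bisect_left(market_cap, value)
    lower = percentiles[i - 1] if i > 0 else 0
    upper = percentiles[i] if i < len(percentiles) else 100
    return lower, upper

def percentile_from_rank(rank, n):
    """Percent of companies ranked below, where rank 1 is the largest."""
    return 100 * (n - rank) / n

print(bracket(147.5))
print(round(percentile_from_rank(371, 489), 1))
```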
Measures of association
⚫ “Do large values of x tend to be associated with large values of y?”
⚫ Graphical answer from scatter plots
⚫ Covariance is a numerical measure
Population covariance: $\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
Sample covariance: $s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
⚫ Positive (negative) covariance ➔ positive (negative) linear association
⚫ Zero covariance ➔ no linear association
Measures of association…
⚫ Covariance is not scale free
⚫ Is a covariance of 500 “big”?
⚫ Covariance between height and weight depends upon the units of measure of each variable
⚫ The correlation coefficient is a standardized, unit-free measure of association
Population correlation: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Sample correlation: $r = \frac{s_{xy}}{s_x s_y}$
$-1 \le \rho \le 1$ and $-1 \le r \le 1$
⚫ 1 (-1) ➔ perfect positive (negative) linear relationship
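The height and weight point can be illustrated with a small sketch: the same heights are recorded in centimetres and in metres, and the sample covariance and correlation are computed for each. The data are hypothetical and the function names are assumptions.

```python
from statistics import mean, stdev

def sample_cov(x, y):
    """Sample covariance: cross-products of deviations divided by n - 1."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)

def sample_corr(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    return sample_cov(x, y) / (stdev(x) * stdev(y))

height_cm = [160, 165, 170, 175, 180]    # hypothetical heights
weight_kg = [55, 62, 66, 70, 80]         # hypothetical weights
height_m = [h / 100 for h in height_cm]  # same heights, different units

cov_cm = sample_cov(height_cm, weight_kg)
cov_m = sample_cov(height_m, weight_kg)
r_cm = sample_corr(height_cm, weight_kg)
r_m = sample_corr(height_m, weight_kg)

print(cov_cm, cov_m, r_cm, r_m)
```

Changing the units changes the covariance by a factor of 100 but leaves the correlation untouched, which is what "unit-free" means here.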
Correlations depicted in each
of these scatter plots?
Least squares: The problem
⚫ Whilst correlation is interesting, it would be better if we could quantify this relationship.
⚫ You can imagine why this is useful: for a one-unit change in the independent variable, you could potentially say that the impact on the dependent variable is b1.
⚫ Take a simple example. Suppose you had a relationship between income and height that said:
Income = b0 + b1*height + e
⚫ You could potentially predict income based on height! How cool is that? But of course there are a lot of assumptions and theory involved. We will revisit this in due course. Think of this as a first taste ☺.
Least squares: The problem
⚫ Suppose we have (yi, xi) pairs for i = 1, … , n
⚫ We can depict the data graphically in a scatter plot
⚫ But what if we want to determine more specifically the linear
relationship between y and x?
⚫ How do we pin down the intercept and slope that
describe this bivariate relationship?
⚫ We choose the intercept and slope values that give the best fit
⚫ What do we mean by “best fit”?
⚫ The most common approach is to minimize the residual sum
of squares
⚫ This method is called least squares
⚫ This is the basis of regression analysis
Least squares: A diagram
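[Figure: scatter plot of y against x with the fitted line $\hat{y} = b_0 + b_1 x$, showing the intercept $b_0$, an observation $y_1$ at $x_1$, its fitted value $\hat{y}_1$, and the residual $e_1$]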
Least squares: The optimization problem
Assume
$\hat{y}_i = b_0 + b_1 x_i$
where b0 and b1 are chosen to minimize
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Thus, the "solution" to this minimization problem consists of one intercept estimate and one slope estimate that together minimize the residual sum of squares.
Least squares: The solution
(with one independent variable)
$b_1 = \frac{s_{xy}}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
⚫ Notes:
⚫ The point consisting of the means of x and y will lie on the
line of best fit
⚫ b1 will have same sign as the covariance (correlation)
between y and x
⚫ Zero covariance (correlation) ➔ b1 = 0 ➔ ?
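The closed-form solution can be sketched in a few lines: the slope is the sample covariance divided by the sample variance of x, and the intercept then follows from the two means. The data are hypothetical and the function names are assumptions.

```python
from statistics import mean, variance

def least_squares(x, y):
    """Return (b0, b1) for the line of best fit with one independent variable."""
    xbar, ybar = mean(x), mean(y)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)
    b1 = s_xy / variance(x)   # slope = sample covariance / sample variance of x
    b0 = ybar - b1 * xbar     # intercept puts the line through (xbar, ybar)
    return b0, b1

def rss(b0, b1, x, y):
    """Residual sum of squares for a candidate intercept and slope."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)

print(b0, b1, rss(b0, b1, x, y))
# Nudging either coefficient away from the solution increases the RSS.
print(rss(b0 + 0.1, b1, x, y), rss(b0, b1 + 0.1, x, y))
```

Because the intercept is defined as the mean of y minus b1 times the mean of x, the fitted line necessarily passes through the point of means.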
R-sq: Percentage of variation
explained by the model
⚫ How “good” is our model?
⚫ The fit of the “model” ( = the fitted line) to the actual data ( =
the y value(s) observed at each x value) is described by the R-
squared statistic
⚫ A fit statistic considers how much variation is in the residuals –
i.e., variation in the y variable that is not explained by the
model – compared to how much variation is in the whole y
variable. The larger this “unexplained” variation, the less well
the model “fits”.
⚫ The maximum value of R-sq is 1 (perfect fit) and the minimum
is 0 (no fit)
⚫ In a simple bivariate regression,
R-sq = [correlation of x and y]^2
⚫ R-sq is also termed the coefficient of determination
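A sketch of the R-squared idea, assuming the usual definition R-sq = 1 - (residual sum of squares)/(total sum of squares of y), which matches the description of unexplained variation relative to total variation. It also checks the bivariate identity with the squared correlation. The data are hypothetical.

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = mean(x), mean(y)

# Least squares fit with one independent variable.
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
b1 = s_xy / stdev(x) ** 2
b0 = ybar - b1 * xbar

# Unexplained variation (residuals) versus total variation in y.
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
tss = sum((yi - ybar) ** 2 for yi in y)
r_squared = 1 - rss / tss

# Sample correlation between x and y.
r = s_xy / (stdev(x) * stdev(y))

print(r_squared, r ** 2)
```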
Extended example:
Internet use
⚫ The problem
⚫ Topic: Patterns of internet usage
⚫ Our specific question is about the relationship between
education and internet use
⚫ Data
⚫ Random sample of 15 Australian adults
⚫ Two variables:
⚫ Education (years)
⚫ Internet use (hours in previous week)
⚫ What are the key features of these variables and their relationship with each other?
Internet use: Excel summary
statistics
                      Education    Internet use
Mean                  12.667       10.000
Standard Error        0.779        1.857
Median                11           10
Mode                  11           0
Standard Deviation    3.016        7.191
Sample Variance       9.095        51.714
Kurtosis              -0.114       -0.432
Skewness              0.586        0.181
Range                 11           24
Minimum               8            0
Maximum               19           24
Sum                   190          150
Count                 15           15
Internet use: Scatter diagram and
fitted regression line
[Figure: scatter plot titled "Internet use"; y-axis: Hours of use (0 to 30), x-axis: Education (0 to 20), with fitted regression line]
Internet use: Fitted regression
line
Covariance
                Education    Internet use
Education       8.489
Internet use    14.267       48.267

Correlation
                Education    Internet use
Education       1
Internet use    0.705        1

⚫ b1 = 15.286/(9.095) = 1.681
b0 = 10 – 1.681*12.667 = –11.29
⚫ Be careful: Excel uses population formulae in calculating the covariances it reports (i.e., it assumes it has the whole population), so to recover the sample covariances, we have to multiply by n/(n-1) ➔ 14.267*(15/14) = 15.286 is actually the sample covariance. Verify that you understand why this correction works!
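The n/(n-1) correction can be checked numerically with the Excel figures from the tables. This is a sketch using only the reported summary values (n = 15), so its results carry the rounding of those inputs; the variable names are assumptions.

```python
n = 15
pop_cov = 14.267      # covariance reported by Excel (population formula)
pop_var_x = 8.489     # Education variance reported by Excel (population formula)
sample_var_x = 9.095  # Education sample variance from the summary statistics
mean_x, mean_y = 12.667, 10.0

# The population formula divides by n, the sample formula by n - 1,
# so multiplying by n / (n - 1) converts one into the other.
sample_cov = pop_cov * n / (n - 1)

# Slope and intercept of the least squares line.
b1 = sample_cov / sample_var_x
b0 = mean_y - b1 * mean_x

# The same correction applies to numerator and denominator of the slope,
# so the ratio of the two population figures gives the same b1.
b1_from_population = pop_cov / pop_var_x

print(round(sample_cov, 3), round(b1, 3), round(b0, 2), round(b1_from_population, 3))
```

The last line of the calculation shows why the choice of formula does not matter for the slope, as long as covariance and variance use the same divisor.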
Internet use: Summary of
findings
➢ Education varies from 8 to 19 years, with a mean of 12.7 and a median of 11
➢ Internet use varies from 0 to 24 weekly hours, with a mean and
median of 10
➢ The distribution of internet use is symmetric, while that of education
is positively skewed
➢ The scatterplot indicates a positive relationship
➢ This is confirmed by the correlation coefficient of 0.7
➢ The least squares line of best fit indicates that internet use increases
by 1.7 hours for every additional year of education
➢ R-squared is .705*.705 = .497: almost half of the variation in
internet use is explained by the linear function of years of education
Summary
⚫ Descriptive statistics (emphasis of the course so far)
⚫ How do we describe the location and spread of quantitative data?
⚫ How can we describe the relationship between two variables?
⚫ What is the basic intuition behind the least-squares line-fitting
technique – arguably the most popular quantitative analysis
technique in all of the social sciences?
⚫ Next up: Probability, random variables, and sampling
⚫ Inferential statistics (later on)
⚫ Extracting information about population parameters on the basis
of sample statistics
⚫ What does a sample mean (which we observe) tell us about a
population mean (which we can never observe)?
Administrative reminders
⚫ To do ASAP:
⚫ Get the text (hardcopy or ecopy)
⚫ Complete Week 2 Tutorial problems
⚫ Start thinking about the project and the first milestone.
⚫ You now have sufficient knowledge to deal with this ☺.