Week 1
Administrivia
⚫ You should have done your Week 2 Tutorial
questions already.
⚫ Project milestone 1
⚫ You should be thinking about this project and the
requirements for the first milestone of your project
☺.
⚫ This is due in week 4.
Numerical summaries
⚫ Graphs aren’t the only way to summarize data.
⚫ Key features of a single variable:
⚫ Location, spread, relative location, skewness
⚫ Key features of two variables:
⚫ Measures of (linear) association
Numerical summaries
⚫ We are interested in numerical summaries
because the tools in statistics we have can only
deal with quantitative data.
⚫ Think about using statistics on gender recorded as M/F. The values
need to be recoded as 0 or 1, etc.
⚫ This is simply another way of presenting the same
summary ☺.
⚫ Always remember: We are interested in a
population, but typically have data only on a
sample. This is why we’re doing this course ☺.
Measures of location
⚫ A parameter describes a key feature of a population
⚫ A statistic describes a key feature of a sample
⚫ A natural measure of “location” or “central tendency” is the mean
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
⚫ Other variants:
⚫ Weighted mean (WAM)
⚫ Geometric mean
Measures of location…
⚫ The median is the middle value of ordered observations
⚫ When n is odd, the median will be a particular value
⚫ When n is even, the median is the average of the middle two values
⚫ The median depends on ranks of observations, not absolute values
⚫ Doubling the largest observation will not change the median!
Sample of n = 5 observations: 0 0 1 3 3a, with $a \ge 1$
Median = 1
$\bar{x} = \frac{4 + 3a}{5} \ge \frac{7}{5}$
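The contrast between the mean and the median can be checked with a short sketch. It uses the slide's sample 0, 0, 1, 3, 3a with the assumed value a = 2 (any a of at least 1 would do) and Python's standard `statistics` module.

```python
from statistics import mean, median

a = 2  # assumed value for illustration; the slide only requires a >= 1
sample = [0, 0, 1, 3, 3 * a]

# Double the largest observation and leave the rest unchanged.
doubled = sample[:-1] + [2 * sample[-1]]

median_before, median_after = median(sample), median(doubled)
mean_before, mean_after = mean(sample), mean(doubled)

print(median_before, median_after)  # the median stays at 1
print(mean_before, mean_after)      # the mean rises from 2 to 3.2
```

The median depends only on the rank of the middle observation, so it is unaffected, while the mean is pulled upward by the larger value.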
Measures of location…
⚫ Mode: The most frequently occurring value(s)
⚫ “Modal class” (the most common class) was previously
defined in the context of unimodal histograms
⚫ The mean, median and mode each provide different
notions of “representative” or “typical” central values
⚫ We don’t use the mode all that much
⚫ What about mean vs. median?
⚫ For symmetric distributions, mean=median
[Figure: histogram; x-axis: Bin (-8 to 28, then “More”), y-axis: Frequency]
Outliers
⚫ A Malaysian man was speechless when he ... ordered to pay up ... prosecution... It wasn't ... was a mistake...
Associated Press
11 April 2006
[Figure: histogram titled “Cricket batting averages”; x-axis: 30 to 100, y-axis: Frequency]
Measures of variability
⚫ Range is a simple measure of variability
⚫ Range = maximum – minimum
⚫ Range is simple, but potentially misleading
1 1 1 50 50 ➔ range = 49
1 10 20 40 50 ➔ range = 49
⚫ Is the “variability” of these two data sets the same?
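A quick check of the two data sets above, as a sketch. The helper name `value_range` is hypothetical, and the sample standard deviation (divisor n - 1) from the standard `statistics` module is included only as a contrast to the range.

```python
from statistics import stdev

data_a = [1, 1, 1, 50, 50]
data_b = [1, 10, 20, 40, 50]

def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

range_a = value_range(data_a)
range_b = value_range(data_b)
print(range_a, range_b)  # both ranges are 49

# Sample standard deviation (divisor n - 1), shown only as a contrast.
sd_a = stdev(data_a)
sd_b = stdev(data_b)
print(sd_a, sd_b)  # the first data set is more spread out around its mean
```

The range uses only the two extreme observations, so it cannot distinguish between these data sets even though their values are arranged very differently.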
Measures of variability…
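⚫ Why not measure the spread of values around the “location” or “central tendency” of the data?
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
⚫ The sample variance divides by n − 1 rather than n; this relates to the properties of estimators
⚫ More explanation is provided in this week’s Homework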
Measures of variability…
⚫ Suppose we have a sample of 5 observations on some variable:
0 0 1 3 6
sample mean = ? $\bar{x} = \frac{0 + 0 + 1 + 3 + 6}{5} = 2$
sample variance = ? $s^2 = \frac{(0-2)^2 + (0-2)^2 + (1-2)^2 + (3-2)^2 + (6-2)^2}{5-1} = \frac{26}{4} = 6.5$
⚫ Variance is in squared units
⚫ The standard deviation is the square root of the variance
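The worked example can be reproduced step by step. This is a minimal sketch in plain Python; the variable names are illustrative, and the divisor n - 1 is written out explicitly rather than hidden in a library call.

```python
from math import sqrt

sample = [0, 0, 1, 3, 6]
n = len(sample)

# Sample mean: sum of the observations divided by n.
x_bar = sum(sample) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
s_squared = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Sample standard deviation: square root of the variance, in the original units.
s = sqrt(s_squared)

print(x_bar, s_squared)  # 2.0 6.5
print(round(s, 2))
```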
[Figures: several histogram panels under the slide title “Measures of variability…”; x-axis: Bin, y-axis: Frequency]
Standardizing data
⚫ We can create a transformed variable with zero mean and
“unit” (i.e., 1) variance from any original quantitative variable
⚫ This transformed variable is free of units of measurement
⚫ This is called calculating Z-scores (one Z-score per
observation)
⚫ Calculate [observation – mean] and then divide this difference by the
standard deviation.
⚫ Suppose that for Mutual Fund A, the maximum return is 63%. This point
has a Z-score of (63–10.95)/21.89 = 2.38
...which implies that...
63% is 2.38 standard deviations above the mean return
⚫ Z-scores allow direct comparisons across original variables
measured in different units
⚫ If returns were reported as proportions (e.g., 0.63 rather than 63%), how
would the Z-score change?
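The Z-score calculation for Mutual Fund A can be sketched as follows. The function name `z_score` is hypothetical; the mean (10.95) and standard deviation (21.89) are the figures quoted on the slide. The second call expresses the same figures as proportions, which is one way to explore the question above.

```python
def z_score(observation, mean, std_dev):
    """Return (observation - mean) / standard deviation."""
    return (observation - mean) / std_dev

# Figures quoted on the slide for Mutual Fund A, in percent.
z_percent = z_score(63, 10.95, 21.89)

# The same figures expressed as proportions (each divided by 100).
z_proportion = z_score(0.63, 0.1095, 0.2189)

print(round(z_percent, 2), round(z_proportion, 2))
```

Rescaling divides the numerator and the denominator by the same factor, so the two Z-scores agree: the Z-score does not depend on the units of measurement.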
Coefficient of variation
⚫ Sometimes we wish to measure variation
relative to location
⚫ Case 1: Observations all measured in millions, and
standard deviation is 20 ➔ relatively little variability
⚫ Case 2: Observations all positive but less than 100 ➔
s=20 may indicate a lot of variability
⚫ (Sample) coefficient of variation, $cv = s / \bar{x}$
⚫ Provides a measure of relative variability
⚫ Again, comparable across variables
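A small sketch of the two cases, using hypothetical data built from the same deviations around two very different means. The function name `coefficient_of_variation` and the numbers are assumptions chosen for illustration, not values from the slides.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample coefficient of variation: s divided by the sample mean."""
    return stdev(values) / mean(values)

# Hypothetical data: identical deviations around very different means.
deviations = (-25, -10, 0, 10, 25)
large = [1_000_000 + d for d in deviations]
small = [50 + d for d in deviations]

sd_large, sd_small = stdev(large), stdev(small)
cv_large = coefficient_of_variation(large)
cv_small = coefficient_of_variation(small)

print(sd_large, sd_small)  # the same standard deviation (about 19)
print(cv_large, cv_small)  # very different relative variability
```

Note that the coefficient of variation is only meaningful when the mean is positive and not close to zero.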
Measures of relative location
⚫ The median relies on a ranking of
observations to measure location
⚫ This idea generalizes to percentiles
⚫ The Pth percentile is the value for which P percent
of observations are less than that value
⚫ The median is the 50th percentile
⚫ The 25th and 75th percentiles are called, respectively,
the lower and upper quartiles
⚫ The difference between the upper and lower quartiles
is called the interquartile range - another measure of
spread
Measures of relative location…
⚫ Simple example
⚫ Denote the 25th, 50th and 75th percentiles by Q1,
Q2 and Q3
⚫ Suppose n = 8, and ordered data are:
x1, x2, x3, x4, x5, x6, x7, x8
⚫ We need to divide data into quarters
⚫ Thus: Q1= (x2+x3)/2, Q2= (x4+x5)/2, Q3= (x6+x7)/2
⚫ IQR = Q3 – Q1
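The quartile rule for n = 8 can be written out directly. This is a sketch of the slide's rule only; the function name `quartiles_n8` and the data are hypothetical. Software packages use several different percentile conventions, so their quartiles may differ slightly from this rule.

```python
def quartiles_n8(ordered):
    """Quartiles for exactly 8 ordered values, following the slide's rule."""
    if len(ordered) != 8 or list(ordered) != sorted(ordered):
        raise ValueError("expects 8 values in ascending order")
    x = ordered
    q1 = (x[1] + x[2]) / 2  # (x2 + x3) / 2 in the slide's 1-based notation
    q2 = (x[3] + x[4]) / 2  # (x4 + x5) / 2, the median
    q3 = (x[5] + x[6]) / 2  # (x6 + x7) / 2
    return q1, q2, q3

data = [2, 4, 5, 7, 8, 10, 13, 20]  # hypothetical ordered sample
q1, q2, q3 = quartiles_n8(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 4.5 7.5 11.5 7.0
```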
Measures of location
⚫ Think about where these percentiles are
here.
[Figure: histogram; x-axis: Bin (-8 to 28, then “More”), y-axis: Frequency]
Measures of relative location…
Say that I am interested in the company “Helloworld” (ASX Code HLO)
with a market capitalisation of $147.5m.
Q: In terms of the AllOrds, is this a big company?
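Measures of association
⚫ … scatter plots
⚫ Covariance is a numerical measure of (linear) association
⚫ Positive (negative) covariance ➔ positive (negative) linear association
⚫ Zero covariance ➔ no linear association
Population covariance: $\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$
Sample covariance: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$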
Measures of association…
⚫ Covariance is not scale free
⚫ Is a covariance of 500 “big”?
⚫ Covariance between height and weight depends upon the units of measure of each variable
⚫ The correlation coefficient is a standardized, unit-free measure of association
⚫ 1 (-1) ➔ perfect positive (negative) linear relationship
Population correlation: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Sample correlation: $r = \frac{s_{xy}}{s_x s_y}$
$-1 \le \rho \le 1$ and $-1 \le r \le 1$
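The scale dependence of covariance, and the scale invariance of correlation, can be illustrated with a short sketch. The height and weight values are hypothetical, and the function names `sample_cov` and `sample_corr` are chosen for illustration; both use the sample (n - 1) definitions.

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

# Hypothetical heights and weights for five people.
height_m = [1.60, 1.68, 1.75, 1.80, 1.90]
weight_kg = [55, 64, 70, 82, 88]
height_cm = [100 * h for h in height_m]  # same heights, different units

cov_m = sample_cov(height_m, weight_kg)
cov_cm = sample_cov(height_cm, weight_kg)
r_m = sample_corr(height_m, weight_kg)
r_cm = sample_corr(height_cm, weight_kg)

print(cov_m, cov_cm)  # covariance is 100 times larger in centimetres
print(r_m, r_cm)      # correlation is the same in both units
```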
Correlations depicted in each
of these scatter plots?
Least squares: The problem
⚫ Correlation is interesting, but it would be better if we could
quantify the relationship.
⚫ This is useful because you could then potentially say that a
one-unit change in the independent variable has an impact of b1
on the dependent variable.
⚫ Take a simple example: suppose you had a relationship between
income and height that said:
Income = b0 + b1 × height + e
⚫ You could potentially predict income based on height! How
cool is that? But of course there are a lot of assumptions and
theory involved. We will revisit this in due course. Think of this
as a first taste ☺.
Least squares: The problem
⚫ Suppose we have (yi, xi) pairs for i = 1, … , n
⚫ We can depict the data graphically in a scatter plot
⚫ But what if we want to determine more specifically the linear
relationship between y and x?
⚫ How do we pin down the intercept and slope that
describe this bivariate relationship?
⚫ We choose the intercept and slope values that give the best fit
⚫ What do we mean by “best fit”?
⚫ The most common approach is to minimize the residual sum
of squares
⚫ This method is called least squares
⚫ This is the basis of regression analysis
Least squares: A diagram
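[Figure: scatter of data points in the (x, y) plane with the fitted line $\hat{y} = b_0 + b_1 x$; the line meets the y-axis at b0, and the residual e1 is the vertical gap at x1 between the observed y1 and the fitted ŷ1]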
Least squares: The
optimization problem
Assume
$\hat{y}_i = b_0 + b_1 x_i$
where b0 and b1 are chosen to minimize
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Least squares: The solution
(with one independent variable)
$b_1 = \frac{s_{xy}}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
⚫ Notes:
⚫ The point consisting of the means of x and y will lie on the
line of best fit
⚫ b1 will have same sign as the covariance (correlation)
between y and x
⚫ Zero covariance (correlation) ➔ b1 = 0 ➔ ?
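The closed-form solution can be sketched in a few lines. The function name `least_squares` and the data are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def least_squares(x, y):
    """Return (b0, b1) for the fitted line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_xx = sum((a - x_bar) ** 2 for a in x) / (n - 1)
    b1 = s_xy / s_xx         # slope = sample covariance / sample variance of x
    b0 = y_bar - b1 * x_bar  # intercept, so the line passes through the means
    return b0, b1

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
fitted_at_mean = b0 + b1 * x_bar  # should equal the mean of y

print(b0, b1, fitted_at_mean)
```

For these data the slope is about 0.6 and the intercept about 2.2, and the fitted value at the mean of x equals the mean of y, as the first note states.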
R-sq: Percentage of variation
explained by the model
⚫ How “good” is our model?
⚫ The fit of the “model” ( = the fitted line) to the actual data ( =
the y value(s) observed at each x value) is described by the R-
squared statistic
⚫ A fit statistic considers how much variation is in the residuals –
i.e., variation in the y variable that is not explained by the
model – compared to how much variation is in the whole y
variable. The larger this “unexplained” variation, the less well
the model “fits”.
⚫ The maximum value of R-sq is 1 (perfect fit) and the minimum
is 0 (no fit)
⚫ In a simple bivariate regression,
R-sq = [correlation of x and y]^2
⚫ R-sq is also termed the coefficient of determination
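The link between R-sq and the correlation can be checked numerically. This sketch assumes the usual definition R-sq = 1 - (residual sum of squares) / (total sum of squares), which matches the verbal description above, and uses hypothetical data. The n - 1 divisors cancel in every ratio, so plain sums are used.

```python
from math import sqrt

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squares and cross-products around the means.
ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ss_xx = sum((a - x_bar) ** 2 for a in x)
ss_yy = sum((b - y_bar) ** 2 for b in y)  # total variation in y

# Least squares line.
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Variation left in the residuals after fitting the line.
residual_ss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

r_squared = 1 - residual_ss / ss_yy
r = ss_xy / sqrt(ss_xx * ss_yy)

print(r_squared, r ** 2)  # the two values agree (about 0.6)
```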
Extended example:
Internet use
⚫ The problem
⚫ Topic: Patterns of internet usage
⚫ Our specific question is about the relationship between
education and internet use
⚫ Data
⚫ Random sample of 15 Australian adults
⚫ Two variables:
⚫ Education (years)
⚫ Internet use (hours of use)
[Figure: scatter plot of Hours of use (y-axis, 0 to 25) against Education (x-axis, 0 to 20)]
Internet use: Fitted regression
line
Covariance
              Education   Internet use
Education     8.489
Internet use  14.267      48.267

Correlation
              Education   Internet use
Education     1
Internet use  0.705       1
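⚫ $b_1 = s_{xy}/s_x^2$ = 15.296/9.095 = 1.682
$b_0 = \bar{y} - b_1 \bar{x}$ = 10 – 1.682 × 12.677 = –11.323
⚫ The inputs 15.296 and 9.095 are larger than the table entries 14.267 and 8.489. This looks consistent with the table using the divisor n = 15 and the slope using the divisor n − 1 = 14 (for example, 8.489 × 15/14 ≈ 9.095).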
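The reported figures can be cross-checked against each other. This sketch only re-uses numbers printed on the slide; the function name `predicted_hours` is hypothetical, and the means of 12.677 (education) and 10 (internet use) are read off the intercept calculation.

```python
from math import sqrt

# Values copied from the slide's covariance table.
var_education = 8.489
cov_education_use = 14.267
var_use = 48.267

# Correlation = covariance / (product of standard deviations).
r = cov_education_use / sqrt(var_education * var_use)
r_squared = r ** 2  # R-sq for a simple bivariate regression
print(round(r, 3))  # 0.705, matching the correlation table

# Fitted line from the slide: hours = b0 + b1 * education.
b0, b1 = -11.323, 1.682

def predicted_hours(education_years):
    """Predicted internet use for a given number of years of education."""
    return b0 + b1 * education_years

at_mean = predicted_hours(12.677)
print(round(at_mean, 2))  # close to the mean of 10 used on the slide
```

Evaluating the line at the mean of education returns the mean of internet use, and squaring the correlation suggests that roughly half of the variation in internet use is explained by education in this sample.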
Administrative reminders
⚫ To do ASAP:
⚫ Get the text (hardcopy or ecopy)
⚫ Complete Week 2 Tutorial problems
⚫ Start thinking about the project and the first milestone.
⚫ You now have sufficient knowledge to deal with this ☺.