Sept. 3, 2014
Agenda
Stats Lecture
1) Univariate analysis (looking at one variable)
… central tendencies, and variability (dispersion)
2) Bivariate analysis (comparing two variables)
… correlation, t-test, chi-square association
3) Additional context for stats assignment in group project
Applications in SPSS (handout)
Discussion
Where do we start?
Univariate Analyses
Need to make sure all our variables (e.g. scores on a
scale, income figures, gender, ethnicity) are behaving
appropriately for statistical testing
Each must have some variability (e.g., if the sample is all women, gender
has no variability, so we cannot test outcomes based on gender)
Need to check out how much variability and typical
values for each
For example, a typical value may be its average or mean value
These analyses called univariate analyses.
Univariate analysis involves the examination across cases
of one variable at a time.
Summarizing Univariate Distributions
Any set of measurements that summarizes a variable
should have two important properties: a measure of
central tendency and a measure of variability (dispersion)
More on standard
deviations, later…
Measures of Central Tendency
An estimate of the center of a distribution of
values; how much our data are similar
The means to determine what is most typical,
common, and routine
Central tendency is usually summarized with
one of three statistics:
1) Mode
2) Median
3) Mean
Measures of Central Tendency 1
The Mode
The mode, the most frequent value in a distribution, is the
least often used of the three because it can easily give a misleading
impression: mnemonic - mode = most.
If two values tie for the most frequent, the distribution is
called bimodal.
Can be used for all four levels of measurement (for nominal data,
it is simply the most common response: e.g., the count of females
vs. males in a study)
May not be effective in describing what is typical in the
distribution of a variable
Measures of Central Tendency 1
The Mode: example
28, 31, 38, 39, 42, 42, 42, 42, 43, 47, 51, 51, 54, 55,
56, 56, 58, 59, 59, 59
Mode = 42 (it occurs four times, more than any other value)
Measures of Central Tendency 2
The Median
The median is the middle value of the ordered cases. With an even
number of cases, average the two middle values, e.g., middle values
of 9 and 10 give (9 + 10) / 2 = 9.5. For the 20 values above, the
median is (47 + 51) / 2 = 49.
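The mode and median of the scores above can be checked with Python's `statistics` module (the data are the slide's 20 values):

```python
from statistics import mode, median

scores = [28, 31, 38, 39, 42, 42, 42, 42, 43, 47, 51, 51, 54, 55,
          56, 56, 58, 59, 59, 59]

print(mode(scores))    # 42: the most frequent value (appears four times)
print(median(scores))  # 49.0: average of the 10th and 11th ordered values
```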
Measures of Central Tendency 3
Mean
The mean, or arithmetic average, takes into account
the value of every case in the distribution
It is the sum of all of the values divided by the total
number of values.
Requires interval- or ratio-level measurement (e.g.,
weight, age, miles driven).
Should not be computed for ordinal level – why?
The mean can promote accuracy or distortion depending
on whether the distribution is symmetrical or
skewed.
Measures of Central Tendency 3
The Mean Example
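A quick worked mean, using the 20 scores from the mode example (illustrative; the slide's own figures may differ):

```python
scores = [28, 31, 38, 39, 42, 42, 42, 42, 43, 47, 51, 51, 54, 55,
          56, 56, 58, 59, 59, 59]

mean = sum(scores) / len(scores)  # sum of all values divided by the number of values
print(mean)  # 47.6
```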
Normal Distribution (aka, Bell Curve) – where
is the mean, median, and mode?
In a perfect normal distribution, mean,
median and mode are equal!
[Figure: bell curve with the mode, median, and mean coinciding at the center.]
Means and variances are the best measures
for symmetric (normal) distributions
Describe by using:
arithmetic MEAN
VARIANCE (standard deviation)
Secondarily:
Range
Mode (most common value)
Skew (left or right)
Kurtosis (thickness of tails)
Normal Distribution - Skewness
• Skewness is used to describe asymmetric (non-normal) distributions.
• In a normal curve, the right and left halves of the curve
are mirror images of each other.
• If this is not the case, the curve is said to be skewed, either
positively (to the right) or negatively (to the left).
• If the scores tend to be concentrated toward the high
end of the score scale, the curve is negatively skewed.
• If they are concentrated toward the low end of the score
scale, they are positively skewed
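The direction of skew can be computed directly; the formula below (average cubed deviation divided by sd³) is the standard population skewness, not something taken from the slides:

```python
def skewness(values):
    # population skewness: mean cubed deviation divided by sd cubed
    n = len(values)
    mean = sum(values) / n
    sd = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

print(skewness([1, 1, 1, 2, 10]) > 0)    # True: scores bunched at the low end -> positive skew
print(skewness([1, 9, 10, 10, 10]) < 0)  # True: scores bunched at the high end -> negative skew
```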
Example. Means and standard deviations for all study variables

Variable                            Mean    Std. Deviation   N
SF-36 Scale                         80.47   20.37            257
Number of people in household       2.81    1.35             257
Number of hours housework (sqrt)    29.68   10.31            257
Financial stress scale              5.07    1.95             257

[Figure: two example distributions, one with mean = 50 and one with mean = 80.]
The Outlier Effect
Outlier: a result that is far different from most of the results for
the group; extreme value(s) that can skew the overall results
Median and mode are not sensitive to outliers; they
tend not to change when outliers are present
Mean is sensitive to outliers and can change greatly
with them

Data              Mean   Median   Mode
1, 1, 1, 1, 50    10.8   1        1
1, 1, 1, 1, 100   20.8   1        1
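The two rows above can be reproduced with the `statistics` module, showing the mean shifting while the median and mode stay put:

```python
from statistics import mean, median, mode

no_outlier = [1, 1, 1, 1, 50]
big_outlier = [1, 1, 1, 1, 100]

print(mean(no_outlier), median(no_outlier), mode(no_outlier))     # 10.8 1 1
print(mean(big_outlier), median(big_outlier), mode(big_outlier))  # 20.8 1 1
```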
To Address Outliers in Mean Calculations…
2 40 45 46 52 52 55 59 60 61 61 63 64 66 66 66 67
69 70 259
(Note the outliers 2 and 259; a common remedy is to report the
median, or to trim the extreme values before averaging.)
Example:
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20
Range is 20 – 2 + 1 = 19
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 100
Range is 100 – 2 + 1 = 99 (outlier effect)
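The slide's inclusive range (max − min + 1; note that many texts define range simply as max − min) can be sketched as:

```python
def inclusive_range(values):
    # inclusive range, as defined on the slide: max - min + 1
    return max(values) - min(values) + 1

data = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]
print(inclusive_range(data))               # 19
print(inclusive_range(data[:-1] + [100]))  # 99 (one outlier inflates the range)
```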
Measures of Variability 2
Variance
Variance
The variance is the average of the squared differences
from the mean.
It takes into account all the scores to determine the
spread.
To calculate the variance, follow these steps:
1) Work out the mean (the simple average of the
numbers)
2) For each number: subtract the mean and then
square the result (the squared difference)
3) Work out the average of those squared differences.
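The three steps map directly to code; divide by n − 1 for a sample (as in the SD formula on the slides) or by n for a whole population. The data here are the dog heights from the variance example:

```python
def variance(values, sample=True):
    mean = sum(values) / len(values)                    # step 1: the mean
    squared_diffs = [(x - mean) ** 2 for x in values]   # step 2: squared differences
    denom = len(values) - 1 if sample else len(values)  # n - 1 (sample) or n (population)
    return sum(squared_diffs) / denom                   # step 3: average them

print(variance([600, 470, 170, 430, 300], sample=False))  # 21704.0
```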
Example of central tendency and variation
Assume mean = 5.0. Each point varies around the mean,
and this variation contributes to the overall standard
deviation (SD).
First, calculate the mean.
Find the difference at each point.
Square each difference and sum.

Variance = Σ(DIFF²) / (n − 1)
Variance = Σ(each value − mean)² / (n − 1)
SD = √[ Σ(each value − mean)² / (n − 1) ]
Calculations in Excel table
Variance – Example
You and your friends have just measured the heights of your
dogs (in millimeters): 600, 470, 170, 430, 300.
1. Calculate the mean: (600 + 470 + 170 + 430 + 300) / 5 = 394.
2. Find each dog's difference from the mean: 206, 76, −224, 36, −94.
3. To calculate the variance, take each difference, square it, and then average the results:
Variance: σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704
(This example divides by n = 5, the population variance; a sample
variance would divide by n − 1.)
Measures of Variability 3
Standard Deviation
Standard Deviation
Standard deviation is the square root of the variance: √(variance)
SD tells us to what degree the values cluster around the mean.
Standard Deviation:
σ = √21,704 = 147.32... = 147
Now we can show which heights
are within one Standard Deviation
(147mm) of the Mean:
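The dog-height numbers check out in a couple of lines:

```python
import math

variance = 108520 / 5     # population variance from the example
sd = math.sqrt(variance)  # standard deviation
print(round(sd, 2))       # 147.32
```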
                               Nominal            Ordinal   Interval/Ratio
Central tendency               Mode               Median    Mean; Median; Mode
(best represents all cases)
Variability                    Percent of cases   Range     Variance; Standard
(spread; dispersion)           in categories                deviation; Range
Bivariate Statistics
Now that we know a bit about each of our variables,
we can start comparing them to each other
We can also look at differences among groups
When comparing two variables or groups, use
bivariate statistics
Multivariate statistics look at the relationships among
many variables or groups at one time; this is beyond the
scope of our class
Comparing variables and groups…
Parametric Statistics
Parametric statistics require certain assumptions/qualities in
data/variables:
Normal distributions
Dependent variable is interval/ratio
Good sample size (at least 30)
Correlation
To determine whether a linear relationship exists between
two variables, and the direction of that relationship
“What is the actual strength and direction of the relationship
between variables within the sample?”
Correlation
Strength indicated by a correlation coefficient (Pearson’s r)
Correlation Coefficient = provides the numerical value that
indicates both the strength and direction of the relationship
(r):
(–) 1.0 = perfect negative relationship
(+) 1.0 = perfect positive relationship
The closer the coefficient is to either +1.0 or –1.0, the
stronger the linear relationship
Middle = moderate / weaker relationship
Close to 0 = no relationship
Range of Correlation Coefficients (r)
–1.0 (perfect negative) … 0.0 (no correlation) … +1.0 (perfect positive)
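Pearson's r can be computed from its definition (sum of cross-deviations over the product of the root sums of squared deviations); a small sketch with made-up data:

```python
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(round(pearson_r(x, [2, 4, 6, 8, 10]), 2))  # 1.0  (perfect positive)
print(round(pearson_r(x, [10, 8, 6, 4, 2]), 2))  # -1.0 (perfect negative)
```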
Correlation Matrix
Example. Correlations among study variables
[Table: correlation matrix of study variables 1–7, shown on the slide.]
Independent samples t-tests
Compares independent groups (e.g., men and women, or
control and experimental groups) in terms of
outcomes
Independent t-test is useful in studies that employ
experimental designs.
Compares the means of two samples, but the
samples must be independently drawn from a
population – random selection for experiments
Tells us if the difference between the groups is
statistically significant
Example. Independent samples t-tests
t = 18.343; p < 0.001; the two groups differ significantly in the amount of
exercise, indicating that males exercised more hours per week than females.
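A sketch of the pooled-variance independent-samples t statistic; the exercise numbers here are hypothetical, not the slide's data:

```python
from statistics import mean, variance  # variance() uses n - 1 (sample variance)

def t_independent(g1, g2):
    # pooled-variance t statistic, assuming equal group variances
    n1, n2 = len(g1), len(g2)
    sp2 = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

males = [5, 6, 7, 6, 5, 7]    # hypothetical hours of exercise per week
females = [3, 4, 3, 5, 4, 3]
t = t_independent(males, females)
```

A large positive t (here roughly 4.7) with a small p-value indicates the group means differ by more than sampling error would explain.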
Nonparametric Statistics
Nonparametric statistics:
Do not depend on the shape of the distribution
Do not use means, variances, or standard
deviations
Use frequencies and percentages to describe the data.
Sometimes, medians, percentiles, and the difference
between the 75th and 25th percentiles (Interquartile range,
or IQR) are used when the data can be sorted and ranked in
order and displayed.
Nonparametric Statistics
Nonparametric statistics used for:
samples too small for parametric statistics,
Ordinal- and nominal-level dependent variables
Chi square, Mann-Whitney U, Kruskal-Wallis, etc.
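For a 2×2 table, the Pearson chi-square statistic has a closed form; a sketch with hypothetical counts:

```python
def chi_square_2x2(a, b, c, d):
    # Pearson chi-square for the table [[a, b], [c, d]], no continuity correction
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(round(chi_square_2x2(30, 10, 15, 25), 2))  # 11.43 for these made-up counts
```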
ODDS ratio (cross product)

                Disease Present   Disease Absent   Row totals
                (cases)           (controls)
Exposed         a                 b                a + b
Not Exposed     c                 d                c + d
Sample sums                                        a + b + c + d

Odds of disease given exposure = a / b
Odds of disease given no exposure = c / d
Odds ratio (OR) = (a/b) / (c/d) = ad / bc
Note: range [0, ∞)
(If b or c = 0, add 0.5 to all cells.)
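The odds-ratio formula, including the add-0.5 correction for zero cells, as a sketch with hypothetical counts:

```python
def odds_ratio(a, b, c, d):
    # OR = ad / bc; add 0.5 to every cell when b or c is zero
    if b == 0 or c == 0:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)

print(odds_ratio(30, 10, 15, 25))  # 5.0: the exposed have five times the odds of disease
print(odds_ratio(10, 0, 5, 5))     # 21.0, after the 0.5 correction
```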
Measures for the strength of the
association
Odds ratio or relative risk
The stronger the association (i.e., the greater the
magnitude of the increased or decreased risk observed)
between two characteristics (e.g., exposure &
disease), the less likely it is that the
relationship is due merely to the effect of
some unsuspected confounding variable.
An odds ratio equal to 1 (one) means no
association at all.
Causation is not inferred from association.
A stronger odds ratio generally implies a stronger
association: an OR greater than 1 indicates a positive
association (increased risk), and an OR between 0 and 1
indicates a negative association (decreased risk).
Let’s take time to summarize these slides
and ask some questions…