
Module 10

Sept. 3, 2014
Agenda
• Stats Lecture
  1) Univariate analysis (looking at one variable)
     … central tendencies and variability (dispersion)
  2) Bivariate analysis (comparing two variables)
     … correlation, t-test, chi-square association
  3) Additional context for the stats assignment in the group project
• Applications in SPSS (handout)
• Discussion
Where do we start?
Univariate Analyses
• Need to make sure all our variables (e.g., scores on a scale, income figures, gender, ethnicity) are behaving appropriately for statistical testing
• Each must have some variability (e.g., if all participants are women, there is no variability, and we cannot compare outcomes by gender)
• Need to check how much variability and what typical values each variable has
  • For example, a typical value may be its average or mean value
• These analyses are called univariate analyses
• Univariate analysis involves the examination across cases of one variable at a time
Summarizing Univariate Distributions
• Any set of measurements that summarizes a variable should report two important properties:
  1. The Central Tendency (or typical value): mode, median, mean
  2. The Spread (variability or dispersion) about that value: range, variance, standard deviation
     (That is, how does each data value differ from the mean or median value?)
Example of central tendency and variation
(Figure: scatter of data points; assume mean = 5.0.)
Each point varies around the mean.
This variation contributes to the overall standard deviation (SD).
More on standard deviations, later…
Measures of Central Tendency
• An estimate of the center of a distribution of values; how much our data are similar
• A means of determining what is most typical, common, and routine
• Central tendency is usually summarized with one of three statistics:
  1) Mode
  2) Median
  3) Mean
Measures of Central Tendency 1
The Mode
• The mode, the most frequent value in a distribution, is the least often used, as it can easily give a misleading impression (mnemonic: mode = most)
• If two values tie for the most frequent, the distribution is called bimodal
• Can be used for all four levels of measurement (for nominal data, it is simply the most common response, e.g., whether females or males are more numerous in a study)
• May not be effective in describing what is typical in the distribution of a variable
Measures of Central Tendency 1
The Mode example

What is the most frequent value?

28, 31, 38, 39, 42, 42, 42, 42, 43, 47, 51, 51, 54, 55, 56, 56, 58, 59, 59, 59

(This listing of the data set is called an array.)
(Answer: the mode is 42, which appears four times.)

Where is the mode in each of these distributions? (Figure: example distributions, not reproduced here.)
Measures of Central Tendency 2
The Median
• The median is the point that divides the distribution in half; the midpoint of a set of numbers
• To find the median value of a data set, arrange the data in order from smallest to largest
• Must be used for at least ordinal level of measurement – why?
• Unlike the mode, the median does not always coincide with an actual value in the set (unless the set has an odd number of values)
Measures of Central Tendency 2
The Median Example

2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20

19 points, so the 10th one is the Median
Median = 9

• If the number of points is even, average the two values around the middle (n = 18):

2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20

Median = (9 + 10) / 2 = 9.5
Measures of Central Tendency 3
The Mean
• The mean, or statistical average, takes into account the values of each case in the distribution
• It is the sum of all of the values divided by the total number of values
• Requires interval or ratio level measurements (e.g., weight, age, miles driven)
• Should not be computed for ordinal level – why?
• The mean can promote accuracy or distortion depending on whether the distribution is symmetrical or skewed
Measures of Central Tendency 3
The Mean Example

2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20

ANSWER:
Mean = SUM of all values / N
     = (2+2+3+3+4+5+5+7+8+9+10+11+11+14+14+15+16+18+20) / 19
     = 177 / 19 = 9.32
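
As a quick check, Python's standard library can reproduce all three measures of central tendency; a minimal sketch for the array above (statistics.multimode needs Python 3.8+):

import statistics

data = [2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]

print(statistics.multimode(data))  # [2, 3, 5, 11, 14] – five values tie at two occurrences each
print(statistics.median(data))     # 9 – the 10th of the 19 sorted values
print(statistics.mean(data))       # 9.3157… ≈ 9.32 = 177 / 19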
What is the Normal Distribution?
• It looks like a bell with one "hump" in the middle, centered around the population mean, with the number of cases (data) tapering off to both sides of the mean
• It is the symmetrical distribution of scores around the mean
Normal Distribution (aka, Bell Curve) – where is the mean, median, and mode?
In a perfect normal distribution, the mean, median, and mode are equal!
(Figure: bell curve with the mode, median, and mean all at the central peak.)
Means and variances are best measures for symmetric or normal distributions
Describe by using:
• arithmetic MEAN
• VARIANCE (standard deviation)

Secondarily:
• Range
• Mode (most common value)
• Skew (left or right)
• Kurtosis (thickness of tails)
Normal Distribution – Skewness
• Skewness is used in describing asymmetric (non-normal) distributions.
• In a normal curve, the right and left halves of the curve are mirror images of each other.
• If this is not the case, the curve is said to be skewed, either positively (to the right) or negatively (to the left).
• If the scores tend to be concentrated toward the high end of the score scale, the curve is negatively skewed.
• If they are concentrated toward the low end of the score scale, they are positively skewed.

Skewness is measured from −3.0 to +3.0.
A skew score of 0 = symmetrical distribution.
Normal Distribution – Skewness
(Figure: examples of positively and negatively skewed distributions.)
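
If you want to compute a skewness statistic yourself, here is a minimal sketch using SciPy (assumes numpy and scipy are installed; the data are simulated, not from any study here):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
symmetric = rng.normal(50, 10, 1000)       # roughly bell-shaped scores
right_skewed = rng.exponential(10, 1000)   # long tail toward high values

print(round(skew(symmetric), 2))      # near 0: symmetrical
print(round(skew(right_skewed), 2))   # clearly positive: skewed right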
Example. Means and standard deviations for all study variables

Variable                            Mean    Std. Deviation   N
SF-36 Scale                         80.47   20.37            257
Number of people in household        2.81    1.35            257
Number of hours housework (sqrt)    29.68   10.31            257
Financial stress scale               5.07    1.95            257

(Figure: two example distributions, one with mean = 50 and one with mean = 80.)
The Outlier Effect
• Outlier: a result that is far different from most of the results for the group; extreme value(s) that can skew the overall results
• The median and mode are not sensitive to outliers; that is, they tend not to change with outliers
• The mean is sensitive to outliers; it can change greatly with outliers

Array             Mean   Median   Mode
1, 1, 1, 1, 50    10.8   1        1
1, 1, 1, 1, 100   20.8   1        1
To Address Outliers in Mean Calculations…

• Trimmed mean: do not use the top and bottom five percent of scores
• In this example, we have 20 values. The lowest and highest values reflect the lowest 5% and highest 5% of values in this list:

2 40 45 46 52 52 55 59 60 61 61 63 64 66 66 66 67 69 70 259

• Mean for n = 20 is 66.2
• Trimmed mean for n = 18 (dropping the 2 and the 259) is 59.0
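
SciPy offers this calculation directly; a minimal sketch (assumes numpy and scipy are installed):

import numpy as np
from scipy.stats import trim_mean

scores = [2, 40, 45, 46, 52, 52, 55, 59, 60, 61, 61, 63,
          64, 66, 66, 66, 67, 69, 70, 259]

print(np.mean(scores))          # 66.15 – pulled up by the outlier 259
print(trim_mean(scores, 0.05))  # 59.0 – after cutting 5% (one value) from each end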
Which measure of central tendency should we use?
• Both the median and mean are used to summarize the central tendency of quantitative variables.
• To decide which to use, consider these issues:
1. Level of measurement:
  • the median can be used with ordinal level data (often used in scales); but
  • the mean requires interval or ratio level data
  • the mode should be used for nominal level data (think of Yes = 1 and No = 0 data: what would a mean of 0.36 mean? And 0.72?)
Which measure of central tendency should we use?
• Both the median and mean are used to summarize the central tendency of quantitative variables.
• To decide which to use, consider these issues:
2. The shape of the distribution:
  • the median should be used when the data are skewed or have many outliers
  • the mean should be used when the data are fairly "bell shaped" or normal
• Tip: Use the mean when the mean and median are very similar.
Mean or Median?
• Shape of the variable's distribution:
  • The mean and median will be the same when the distribution is perfectly symmetric.
  • When the distribution is not symmetric, the mean is pulled in the direction of extreme values, but the median is not affected in any way by extreme values.
• Purpose of the statistical summary:
  • If the purpose is to report the middle position, then the median is the appropriate statistic.
  • If the purpose is to report a mathematical average, the mean is the appropriate statistic.
Normal distributions: means and medians are very close
The arithmetic MEAN (average value) is nearly the same as the MEDIAN (50th percentile, the value where half of the ranked data points lie above and half below).
Measures of Variability (Variation/Dispersion)
• How different the data values are from each other, reported by how the scores fall around the mean
• For nominal data, simply look at how many cases fall in each category; for the rest…
• Captures how widely and densely spread a variable's distribution is
Measures of Variability
• Variability is usually summarized with one of four statistics:
  1) The Percent of responses in each category (nominal data)
  2) The Range (ordinal and higher)
  3) The Variance (interval and ratio)
  4) The Standard Deviation (interval and ratio)
Measures of Variability 1
Percentage & Range
• For nominal data, simply report the percentage in each category (51% female, 22% social workers)
• For ordinal, interval & ratio data, the range is calculated as the difference between the highest value in a distribution and the lowest value
• It can be drastically altered by an extreme value (an outlier)
• "Maximum value minus the minimum value + 1"

Example:
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20
Range is 20 − 2 + 1 = 19
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 100
Range is 100 − 2 + 1 = 99 (outlier effect)
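
In code this is a one-liner; a minimal sketch (the +1 follows the slide's inclusive convention):

data = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]
print(max(data) - min(data) + 1)   # 19

data[-1] = 100                     # swap the top value for an outlier
print(max(data) - min(data) + 1)   # 99 – one extreme value drastically alters the range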
Measures of Variability 2
Variance
• The variance is the average of the squared differences from the mean.
• It takes into account all the scores to determine the spread.
• To calculate the variance, follow these steps (see the sketch below):
  1) Work out the mean (the simple average of the numbers)
  2) For each number: subtract the mean and then square the result (the squared difference)
  3) Work out the average of those squared differences
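
A minimal sketch translating these three steps directly into Python (note this is the population variance, dividing by n; the formula on the next slide divides by n − 1 for a sample):

def variance(values):
    mean = sum(values) / len(values)                   # step 1: the mean
    squared_diffs = [(v - mean) ** 2 for v in values]  # step 2: squared differences
    return sum(squared_diffs) / len(squared_diffs)     # step 3: average them

print(variance([600, 470, 170, 430, 300]))  # 21704.0 – matches the dog example below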
Example of central tendency and variation
(Figure: scatter of data points; assume mean = 5.0. Each point varies around the mean, and this variation contributes to the overall standard deviation (SD).)

First, calculate the mean.
Find the difference at each point.
Square each difference and sum.

Variance = Σ(DIFF)² / (n − 1)

Variance = Σ(each value − mean)² / (n − 1)

SD = √[ Σ(each value − mean)² / (n − 1) ]

(Calculations shown in an Excel table.)
Variance – Example
You and your friends have just measured the heights of your dogs.
The heights are: 600 mm, 470 mm, 170 mm, 430 mm, and 300 mm.

1. Find the Mean:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

2. Calculate each dog's difference from the Mean:
(600 − 394 = 206), (470 − 394 = 76), (170 − 394 = −224), (430 − 394 = 36), (300 − 394 = −94)

3. To calculate the Variance, take each difference, square it, and then average the result:
Variance: σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704

(Note: this example treats the five dogs as the whole population, so it divides by n = 5; a sample variance would divide by n − 1.)
Measures of Variability 3
Standard Deviation
• Standard deviation is the square root of the variance: √(variance)
• SD tells us to what degree the values cluster around the mean.

Standard Deviation:
σ = √21,704 = 147.32… ≈ 147

Now we can show which heights are within one Standard Deviation (147 mm) of the Mean.

So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small. Rottweilers are tall dogs, and Dachshunds are a bit short.

The variance and standard deviation are calculated via software programs like SPSS, Excel, SAS and others, even on hand calculators. Thank goodness for modern technology!
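
A minimal sketch of this step using Python's standard library (pstdev is the population SD, matching the divide-by-n choice above):

import statistics

heights = [600, 470, 170, 430, 300]   # mm
mean = statistics.mean(heights)       # 394
sd = statistics.pstdev(heights)       # √21,704 ≈ 147.32

within_one_sd = [h for h in heights if abs(h - mean) <= sd]
print(round(sd), within_one_sd)       # 147 [470, 430, 300]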
Overview

                              Nominal            Ordinal   Interval or Ratio
Central Tendency              Mode               Median    Mean; Median; Mode
(best represents all cases)
Variability                   Percent of cases   Range     Variance; Standard
(spread; dispersion)          in categories                deviation; Range
Bivariate Statistics
• Now that we know a bit about each of our variables, we can start comparing them to each other
• We can also look at differences among groups
• When comparing two variables or groups, use bivariate statistics
• Multivariate statistics look at the relationships among many variables or groups at one time; they are beyond the scope of our class
Comparing variables and groups…
Parametric Statistics
• Parametric statistics require certain assumptions/qualities in the data/variables:
  • Normal distributions
  • Dependent variable is interval/ratio
  • Good sample size (at least 30)
• Examples of parametric statistics:
  1. Correlation: Is there a relationship between variables?
  2. T-Tests: Are there mean differences in outcomes between two groups?
  3. Analysis of Variance (ANOVA): Are there mean differences in outcomes among groups? (two or more groups; will not do in this class)
Probability Value
• A report of how likely it is that the relationship we found is real rather than something that happened by chance
• In other words, how sure are we that what we found was not just a fluke?
• Most researchers set the level for statistical significance at 0.05 or smaller (or 0.01, 0.001)
• Indicated by the P value; e.g., P < .05 means there is less than a 1 in 20 chance that the results are due to sampling error
• P < .01: less than a 1 in 100 chance; P < .001: less than a 1 in 1,000 chance
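
In practice the rule is a simple comparison; a minimal sketch (the p-value here is a made-up number standing in for a test result):

alpha = 0.05   # the significance level chosen before testing
p = 0.032      # hypothetical p-value returned by some statistical test

if p < alpha:
    print("statistically significant: unlikely to be sampling error alone")
else:
    print("not significant at the chosen level")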
Correlation
• To determine if a linear relationship exists between two variables and the direction of the relationship
  "What is the actual strength and direction of the relationship between variables within the sample?"
• To determine the degree to which the variables are related and the probability that this relationship occurred by chance
  "What is the probability that the relationship between variables within the sample is due to sampling error?"
• These variables must be measured at the interval or ratio level.
Correlation
• Strength is indicated by a correlation coefficient (Pearson's r)
• Correlation Coefficient = the numerical value that indicates both the strength and direction of the relationship (r):
  (–) 1.0 = perfect negative relationship
  (+) 1.0 = perfect positive relationship
• The closer the coefficient is to either +1.0 or –1.0, the stronger the linear relationship
• Middle = moderate / weaker relationship
• Close to 0 = no relationship
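
A minimal sketch of computing r with SciPy (assumes scipy is installed; the two variables are hypothetical):

from scipy.stats import pearsonr

tv_hours = [1, 2, 2, 3, 4, 5, 6, 7]   # hypothetical weekly TV hours
exercise = [7, 6, 6, 5, 4, 3, 2, 1]   # hypothetical weekly exercise hours

r, p = pearsonr(tv_hours, exercise)
print(r, p)   # r = -1.0 here: a perfect negative linear relationship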
Range of Correlation Coefficients (r)

−1.0 ............ 0.0 ............ +1.0
Perfect           No               Perfect
negative          correlation      positive
Correlation Matrix
• All variables are listed in the left side column and repeated in a row across the top.
• Find the direction and strength of the correlation between variables by noting the correlation coefficient and probability that appear in the matrix.
• The row in which the first variable appears intersects with the column headed by the second variable:
Example. Correlations among study variables

                                             1     2      3      4      5      6      7
1. Life satisfaction scale                   -   .21**   .05    .11  -.85**  -.13*  .18**
2. Dummy coded marital status                      -    .32**  .21** -.24**   .06   .89**
3. Number of people in household                          -    .26**   .04   -.09   .32**
4. Number of hours of housework per year
   (transformed using square root)                               -     .00    .04   .21**
5. Centered financial stress scale                                      -    -.11    .07
6. Friend/relative negative support/burden                                     -     .018

** Correlation is significant at the 0.01 level (2-tailed).
*  Correlation is significant at the 0.05 level (2-tailed).
t-tests
• A statistical procedure that tests the means of two groups to determine if they are statistically different.
• Two common types:
  1) Independent sample t-test
  2) Paired sample t-test (use when you are comparing means on the same subjects over time, each subject having two measures)
     • E.g., useful when comparing a linear measure over two test events. These are dependent samples, not independent samples.
Independent samples t-tests
• Compare independent groups (e.g., men and women, control and experimental group) in terms of outcomes
• The independent t-test is useful in studies that employ experimental designs.
• Compares the means of two samples, but the samples must be independently drawn from a population – random selection for experiments
• Tells us if the difference between the groups is statistically significant
Example. Independent samples t-tests
(Figure: SPSS output comparing weekly exercise by gender.)
t = 18.343; p < 0.001; the two groups differ significantly in the amount of exercise, indicating that males exercise more per week than females.
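
A minimal sketch of the same kind of test with SciPy (assumes scipy is installed; the data are hypothetical, not from the study above):

from scipy.stats import ttest_ind

male_hours   = [5, 6, 7, 5, 8, 6, 7, 6]   # hypothetical weekly exercise hours
female_hours = [3, 4, 2, 3, 4, 3, 2, 4]

t, p = ttest_ind(male_hours, female_hours)
print(round(t, 2), round(p, 4))   # a large t with a small p: the group means differ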
Nonparametric Statistics
• Nonparametric statistics:
  • Do not depend on a distribution shape
  • Do not use means, variances, or standard deviations
  • Use frequencies and percentages to describe the data
  • Sometimes medians, percentiles, and the difference between the 75th and 25th percentiles (the interquartile range, or IQR) are used when the data can be sorted, ranked in order, and displayed
Nonparametric Statistics
• Nonparametric statistics are used for:
  • samples too small for parametric statistics
  • ordinal and nominal level dependent variables
  • Chi square, Mann-Whitney U, Kruskal-Wallis, etc.
• Chi Square Tests (also called cross tabulation):
  • Not a cause and effect relationship.
  • A test of association between nominal variables (similar to the correlation for interval level variables).
  • This test compares expected frequencies with observed frequencies, easily seen in contingency tables.
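
A minimal sketch of a chi-square test of association with SciPy (assumes scipy is installed; the counts are hypothetical):

from scipy.stats import chi2_contingency

#        disease present, disease absent
table = [[30, 70],   # exposed
         [10, 90]]   # not exposed

chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), round(p, 4))   # a small p suggests the two variables are associated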
Tests of Association
Are very useful when comparing two nominal variables:
• Yes/No: Owning a car or having ready access to a car
• Yes/No: Routine access to healthy foods
• Yes/No: Regular exposure to cigarette smoke (exposure)
• Yes/No: Occurrence of influenza (disease)
• High/Mod/Low: Satisfaction with job
• Yes/No: Less than college degree vs. Bachelor's or higher
Tests of Association
1. The effect of one variable on another, assuming that there are no other variables affecting the association.
   E.g., is the development of lung cancer (Y/N) associated with at least 20 years of cigarette smoking (≥ 20 years vs. < 20 years)?
   The strength of the association can be measured by comparing the odds of developing lung cancer given long-term smoking with the odds of developing the disease given no long-term smoking.
2. The statistical dependence between two variables.
   • In particular, the presence of an association generally implies that two characteristics occur in one individual more often than expected by chance alone.
ODDS
Odds of an event = # events / # non-events

              Disease Present   Disease Absent   Row totals
              (cases)           (controls)
Exposed       a                 b                a + b
Not exposed   c                 d                c + d
Column sums   a + c             b + d            a + b + c + d

Odds of disease given exposure = a / b
  = (# dx present given exposure) / (# dx absent given exposure)

Odds of disease given no exposure = c / d
  = (# dx present given no exposure) / (# dx absent given no exposure)
ODDS ratio or cross product
(Using the same 2 × 2 table: a, b = exposed with/without disease; c, d = not exposed with/without disease.)

Odds ratio (OR) = (odds of disease given exposure) / (odds of disease given no exposure)
  = (a / b) / (c / d)
  = (a × d) / (b × c)
  = ad / bc

Note: range [0, ∞)
(If b or c = 0, then add 0.5 to all cells.)
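
A minimal sketch of the cross-product calculation (the cell counts are hypothetical):

# 2 x 2 table: a, b = exposed with/without disease; c, d = not exposed with/without disease
a, b, c, d = 30, 70, 10, 90   # hypothetical counts

if b == 0 or c == 0:          # the slide's correction for an empty cell
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5

odds_ratio = (a * d) / (b * c)
print(round(odds_ratio, 2))   # (30*90)/(70*10) ≈ 3.86 – a positive association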
Measures for the strength of the association
Odds ratio or relative risk
• The stronger the association (i.e., the greater the magnitude of the increased or decreased risk observed) between two characteristics (e.g., exposure & disease), the less likely it is that the relationship is due merely to the effect of some unsuspected confounding variable.
• An odds ratio equal to 1 (one) means no association at all.
• Causation is not inferred from association.
Stronger odds ratio generally implies stronger positive association
(Using the same 2 × 2 table.)

Odds ratio (OR) = ad / cb

• Larger values of a and d versus smaller values of b and c will yield very large odds ratios and strong positive associations between Exposure and Disease Present (and between Non-Exposure and Disease Absent).
Stronger odds ratio generally implies stronger positive association
(Using the same 2 × 2 table.)

Odds ratio (OR) = ad / cb

• OR = 1: No association (ad = bc)
• OR > 1 when ad > bc (positive association); OR in (1, ∞)
• OR < 1 when bc > ad (negative association); OR in (0, 1)
Let’s take time to summarize these slides
and ask some questions…