
Subject – the smallest unit yielding information in the study; variable – a characteristic/attribute that varies among subjects.

Inferential statistics – what the data set (from a limited sample) tells us about a broader population.
Types of variables – TWO KEY DISTINCTIONS:

LEVEL OF MEASUREMENT (SCALE):
NOMINAL SCALE: numbers are just labels; the coding decision is arbitrary, they are just random codes, and we cannot do math with them. The only possible comparison is to identify whether two subjects have the same or a different value of the variable. Ex: marital status: single – 1, divorced – 2… (only MODE)
ORDINAL SCALE: natural ordering/ranking, but the values are still just categories; they have a meaningful ordering but nothing beyond that, and we can subdivide or add new categories. We can determine not only whether two subjects have the same value, but also whether one or the other has a higher value. Ex: self-reported health: 1-good, 2-better, 3-excellent (MODE and MEDIAN)
INTERVAL SCALE: comparisons of both order and magnitude are meaningful; the values are proper numbers, so we can compute an average. However, we cannot compare ratios of interval-level variables, because they have no true zero. Ex: temperature – we can say that 20°C is 15 degrees more than 5°C, but we cannot say it is 4 times warmer (MODE, MEDIAN, MEAN)
RATIO SCALE: a variable that has all the properties of an interval-level variable and also has a true zero; distances between values have meaning. Ex: number of cigarettes per day. *The same statistical methods can be used for interval and ratio variables (MODE, MEDIAN, MEAN)

CONTINUOUS OR DISCRETE:
CONTINUOUS: can be subdivided without limit into smaller measurements (infinitely varied fractional values). Ex: age (divided into days, minutes…), income, GDP, distance, weight…
DISCRETE: the basic unit of measurement cannot be subdivided. Ex: marital status, region (no in-between values)
CATEGORICAL: a discrete variable which has only a finite (in practice quite small) number of possible values known in advance. Ex: sex coded as Male or Female, with no other values; income divided into categories: 0-100, 100-200, etc.; how often: never, rarely, often, always
BINARY/DICHOTOMOUS: the simplest kind of nominal-scale variable; only two possible values (0 or 1, yes or no)
Essentially all nominal/ordinal-level variables are discrete, and almost all continuous variables are interval-level variables. This leaves one further possibility, namely a discrete interval-level
variable; the most common example of this is a count, such as the number of children in a family or the population of a country.

MEASUREMENT LEVEL vs. DISCRETE/CONTINUOUS:

              NOMINAL/ORDINAL                              INTERVAL/RATIO
DISCRETE      Many – always categorical, i.e. having a     Counts – if many different observed values,
              fixed set of possible values (categories);   often treated as effectively continuous
              if only two categories, the variable is
              binary (dichotomous)
CONTINUOUS    None                                         Many

Descriptive statistics - uses the data to provide descriptions of the population, either through numerical calculations or graphs or tables.
Inferential statistics makes inferences and predictions about a population based on a sample of data taken from the population in question (checks whether we can make predictions on the
entire population based on a sample).
DESCRIPTIVE STATISTICS FOR VARIABLES WITH FEW VALUES (e.g. categorical variables):
CONTINGENCY TABLE (CROSSTAB): displays the number of observations for each combination of outcomes over the categories of each variable (a one-way table for 1 variable; if there are 2
variables it's a two-way contingency table, a three-way contingency table for 3 variables, etc.)
Joint distribution: frequencies in the internal cells (without the row, column and total margins) - They show how many units have each possible combination of the row and column variables
ANALYSING A CONTINGENCY TABLE: Association: when we analyse two variables and the relationship between them (bivariate analysis), we compute conditional percentages (or conditional
proportions) of the response variable given the explanatory variable. Conditional distribution: to interpret/understand the association between two variables in a contingency table, focus on these
conditional distributions.
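To make the conditional-distribution idea concrete, here is a minimal sketch in Python (pandas assumed; the variable names and data are invented for illustration):

```python
# Build a two-way crosstab and the conditional distribution of the
# response ("internet_use") given the explanatory variable ("age_group").
import pandas as pd

df = pd.DataFrame({
    "age_group":    ["35-59", "60+", "60+", "35-59", "60+", "35-59"],
    "internet_use": ["most days", "never", "never", "most days", "rarely", "rarely"],
})

# Joint distribution: counts in the internal cells.
counts = pd.crosstab(df["age_group"], df["internet_use"])

# Conditional distribution: percentages within each row (each age group sums to 100%).
conditional = pd.crosstab(df["age_group"], df["internet_use"], normalize="index") * 100

print(counts)
print(conditional.round(1))
```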
THREE VARIABLES:
Proportions and percentages: juxtapose some examples: e.g. 32% of age 60+ respondents say they never use the internet, versus 6% of age 35-59 respondents (proportions and percentages are more
useful than counts when analysing and reporting, because there are different numbers of respondents within the different age groups in the data set). Therefore it is more useful to compare
differences in percentages: e.g. the proportion of people who never use the internet is 26 percentage points higher among those over 60 than among those aged 35-59 (32 – 6 = 26).
Simplifying: 'collapse' together the 'only occasionally', 'a few times a week' and 'most days' categories into 'some days'. Ex: the proportion of old people who never use the internet is 15 times the
proportion of young people who never use the internet.

DESCRIPTIVE STATISTICS FOR CONTINUOUS VARIABLES


SUMMARY STATISTICS: CENTRAL TENDENCY:
Mode (for any kind of categorical data) – the value of the variable that occurs most often, the most common response. Most useful for discrete variables with small numbers of categories, less useful for
variables with many different values.
Median: the central value (50th percentile – 50% of the observations lie above it and 50% below) – cannot be used with nominal variables. THE VALUE OF THE CENTRAL CASE: ODD
NUMBER of cases: the value of the central case; EVEN NUMBER: take the mean of the two middlemost values. If there are many items in the data, use (n+1)/2 to find the middle position. For grouped variables
(0-2 hours/2-4 hours), find the median by looking where the cumulative percentage (in the rightmost table column) passes 50%.
The mean: $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$, where Y is the variable of interest; $\bar{Y}$ (pronounced "Y-bar") is the mean of Y; the subscript i is the case number ($Y_1$ is the first case, $Y_2$ is the second case, etc.); $\sum$, Greek uppercase sigma, means "add together everything listed after the sign" – here, add together all the $Y_i$'s over cases i; n is the sample size, thus i goes from 1 to n.

[Figure residue: diagrams showing the relative positions of the median and the mean in positively and negatively skewed distributions]

If the mean is larger than the median, the distribution is positively skewed (skewed to the right, where the tail is).
If the median is larger than the mean, the distribution is negatively skewed (skewed to the left).
If the median and the mean are equal, the distribution is symmetric.
Stem and leaf plot: like a histogram made up of digits (most useful for smaller data sets).
Stem width: the unit each stem digit represents.
Each leaf: represents a given number of observations (cases).

Types of populations: finite (real groups of people or other units) – from which we take convenience, random, stratified, cluster or multistage samples; OR conceptual (a superpopulation, a
population of potential outcomes, …) of some kind as the target of inference, which we use for
experiments; one-off historical events, such as votes; country-level data or other similar 'complete enumerations'.


CALCULATION OF DISPERSION:
Range – the difference between the largest and smallest observed values (can be misleading if there are extreme scores; therefore exclude outliers when calculating).
Interquartile range – divides the distribution into quarters (as we would find the median, but now into four instead of two segments). The 1st quartile separates the lowest 25% from the upper 75%, the 2nd quartile
(median) divides the distribution in half, and the 3rd quartile separates the lowest 75% from the upper 25%. To calculate the interquartile range (IQR), subtract the value of the 1st quartile from the value of the 3rd quartile (it isn't
sensitive to extreme values; it describes only the bulk of the data).
Box plot – for single variables: the lower-end whisker is the lowest non-outlier value, the 2nd quartile (median) is the central line of the box, the height of the box is the interquartile range, and the top of the upper whisker is the
highest non-outlier value observed. Values more than 1.5 IQRs above the 75th percentile or below the 25th percentile are considered outliers.
Deviation from the mean: the difference between an observed value and the mean of a variable. If we add up all the deviations in the data, the sum will always
be ZERO – practically, that is a definition of the mean. But we can normalise the measure by dividing by the number of values in the distribution (sample size n) – this is the Mean average
deviation (MAD), or Mean absolute deviation: take all differences from the mean, get rid of the minus signs (take absolute values), sum them up, and divide by the number
of values in the distribution. Another way of turning negative numbers positive is to square them: sum the squared deviations and normalise again by dividing this sum by the number of
scores – the result is called the variance.
Most commonly used measure of deviation: SAMPLE STANDARD DEVIATION (which is in fact the square root of the variance): $s = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}$ (the sum of the squared
differences between the observed values and the mean of the variable, divided by n−1, then square-rooted). It is a number telling us how spread out the measurements for a group are from the mean (a low standard deviation
means that most values are close to the mean).
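A minimal sketch in Python/NumPy of the summary measures above (the data values are invented for illustration):

```python
import numpy as np

y = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mean = y.mean()
median = np.median(y)
rng = y.max() - y.min()                             # range
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1                                       # interquartile range
mad = np.mean(np.abs(y - mean))                     # mean absolute deviation
variance = np.sum((y - mean) ** 2) / (len(y) - 1)   # sample variance (n-1 denominator)
sd = np.sqrt(variance)                              # sample standard deviation

print(mean, median, rng, iqr, mad, variance, sd)
```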

INFERENTIAL STATISTICS – inferential statistical techniques (different tests for different levels of measurement and different questions) involve using sample data to make an inference to the
population from which they are drawn.
First we state a NULL HYPOTHESIS: a specific claim about some population characteristic (strictly and narrowly stated).
Then we use a procedure called a statistical significance test – a formal procedure (there are different significance tests, but all have the same 7 steps) to assess whether or not the sample data are consistent
with the hypothesis (we look for evidence against the null). The end product is the p-value:
1. The type of data to be analysed determines the appropriate test to use.
2. Assumptions about the data need to be fulfilled for the results to be credible.
3. State the statistical hypotheses for the test.
4. Calculate the test statistic.
5. Use the sampling distribution of the test statistic to find its…
6. …corresponding p-value.
7. Draw a conclusion from the results and give a substantive interpretation.

Differences between observed (fo) and expected (fe) frequencies show the extent to which data are consistent with the null hypothesis of no association (and these differences are summarized
in the test statistic).
A test statistic is a number calculated from the sample which is used to test the null hypothesis (a single number summarising how big the difference is between the data you observed, and the
data you would expect to observe if the null hypothesis were true) - Larger values indicate stronger evidence against the null hypothesis; Smaller values for test statistic indicate weaker evidence
against the null hypothesis
General formula for calculating expected frequencies under the null: $f_e = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$

CHI-SQUARE TEST (of independence), $\chi^2$ – significance test for two categorical variables when we want to analyse the association between them (ARE THE 2 VARIABLES
ASSOCIATED WITH EACH OTHER?)
Sample size must be large enough (common rule of thumb is that the test can be safely used if all expected frequencies are at least 5)
Null hypothesis H0 – in the population, there is no association between the two variables (internet use and age group), OR: in the population the variables are statistically independent.
Alternative hypothesis Ha: in the population the variables are statistically dependent – hypotheses are always about the population, never about the sample.
The basic idea of the test statistic is the following: 1. See what the cross-tab for your sample looks like (the observed table) 2.Calculate what the table would look like if it agreed exactly with
the null hypothesis (the expected table) 3.Calculate a number which summarises how different the observed table is from the expected table.
Calculating the test statistic: the chi-square test statistic is the sum, over all cells, of the squared differences between observed and expected frequencies, each divided by the expected frequency:
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$. It quantifies how different what we observed is from what we would expect if the null hypothesis were true; the bigger this number is, the larger the difference between the observed and
expected data.

The test statistic increases when there are more cells in the table, so we need to take the number of cells into account when deciding how 'large' our test statistic is. We define the size of the table in terms of
how many degrees of freedom it has. Degrees of freedom: the number of cells in a contingency table that are free to vary given the marginal frequencies: df = (r−1)(c−1), i.e. (rows minus 1) times
(columns minus 1). The chi-square distribution has a different shape for different degrees of freedom.
Sampling distribution: the distribution of a statistic over repeated samples (the many, many samples that we could take from the same population).
Then we ask: what is the probability of getting a test statistic (the chi-square value calculated before, e.g. 428.5) that large or larger if the null hypothesis were true? The answer is that it
would almost never happen – the probability is almost zero (0.000…). That tells us there is (effectively) no way we would ever observe data like this
if the null hypothesis were true in the population.
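A minimal sketch of the whole chi-square procedure in Python (SciPy assumed; the observed counts are invented for illustration, not the internet-use data from the example):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed 2x3 table: rows = age groups, columns = internet-use categories.
observed = np.array([
    [30, 40, 30],   # age 35-59
    [60, 25, 15],   # age 60+
])

chi2, p_value, df, expected = chi2_contingency(observed)
# expected[i, j] = row_total * column_total / grand_total (the null-hypothesis table)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
print(expected)
```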

The p-value ('Asymptotic Significance (2-sided)' in SPSS) is the end result of a significance test. It is the probability, if the null hypothesis were true in the population, of obtaining a value of the test statistic
which provides evidence against the null hypothesis (and in the direction of the alternative hypothesis) as strong as or stronger than the value of the test statistic actually observed in the sample. It is a summary
of the strength of evidence against the null hypothesis: the smaller the p-value, the stronger the evidence against the null. A p-value measures how likely it is that we
would observe a data set as different from the null hypothesis as the sample we observed, or more so. For example, a p-value of 0.66 says: 66% of the time, if the null hypothesis were true and we went out to collect
samples, we would observe a dataset that was this far or further from the null hypothesis. So this deviation from the null hypothesis is the sort of deviation
you would expect to see frequently (66% is more than half the time).

IN ALL TEST STATISTICS: larger test statistic values correspond to smaller p-values, and indicate stronger evidence against the null hypothesis; smaller test statistic values correspond to
larger p-values, and indicate weaker evidence against the null hypothesis

PHRASE: If the null hypothesis were true – i.e. IN THE POPULATION there was no association between internet use and age group – then the probability of obtaining a value at least as big
as 427.4 is far less than 0.001 (one in a thousand). Thus, we have good grounds to reject the null hypothesis, and infer that the alternative hypothesis is true – i.e. that in the population there is
an association between internet use and age group.
We can reject the null hypothesis that there is no association between age and internet usage: we have very strong evidence that younger people are more likely to use the internet frequently
than older people.
Three conventional levels of significance for p-value: 0.10 (10%), 0.05 (5%), 0.01 (1%)

Connection between the chi-square test statistic and the p-value: is the chi-square value inside the bulk of the sampling distribution, or how far outside of it is it? (The p-value quantifies that probability.)
Tests are sensitive to sample size: the more data we collect, the stronger the claims we can make (and the smaller the differences we can detect).
T-TEST STATISTIC – test statistic for DIFFERENCE BETWEEN TWO MEANS (response variable – continuous, explanatory variable – dichotomous, i.e. compare two groups) – same 7
steps
Inference for the comparison of groups, for the MEAN OF A CONTINUOUS VARIABLE (rather than a crosstab), test of the hypothesis that there is no difference between groups in the
population

Null hypothesis: in the population, there is no difference in mean TV/video hours between men and women: $H_0: \mu_1 = \mu_2$, or equivalently $H_0: \mu_1 - \mu_2 = 0$ (there is no difference between the
population means) – THIS IS ALWAYS THE NULL FOR A T-TEST.
Alternative hypothesis – two-sided alternative: in the population, mean TV/video hours are different for men compared to women, i.e. $H_a: \mu_1 \neq \mu_2$.
The other possibility is a one-sided alternative hypothesis: we only look for evidence of a difference in one direction in the population, i.e. $H_a: \mu_1 > \mu_2$ or $H_a: \mu_1 < \mu_2$, e.g. men do more hours than women.
The t-test statistic is also a number calculated from the sample that we use to test the null hypothesis (just like chi-square), summarising how big the difference is between the data you observe and the
data you would expect to observe if the null hypothesis were true. FOR THIS TEST, FOCUS ON THE DIFFERENCE BETWEEN MEANS (ex: in the sample, men watch 0.55 hours more per day, on
average: 2.84 − 2.29 = 0.55). If mean TV/video hours are different for men compared to women, it implies there is an association between TV/video hours and gender. In the sample data there is an
association between TV hours and gender, and we want to know whether such an association exists in the population from which the sample is drawn.

Our variable of interest also has a mean in the populations: $\mu_1$: mean TV/video hours among the population of men; $\mu_2$: mean TV/video hours among the population of women.
The means in the sample ('sample means') serve as estimates of these population means: $\bar{Y}_1$: mean TV/video hours among the sample of men; $\bar{Y}_2$: mean TV/video hours among the sample of women.
Although the focus is on the means, the t-test method also involves a second parameter, the standard deviation: $\sigma_1$ and $\sigma_2$, the population standard deviations for men and women, estimated by the sample
standard deviations $s_1$ and $s_2$. MEAN DIFFERENCE IN THE POPULATION: $\Delta = \mu_1 - \mu_2$; MEAN DIFFERENCE IN THE SAMPLE: $\hat{\Delta} = \bar{Y}_1 - \bar{Y}_2$.

• First step of the test: 1) the sample difference – the difference between the means of men and women (2 samples); in the example it is $\hat{\Delta} = 0.55$ (it is not zero). How likely is it that we'd obtain a sample mean difference
as big as 0.55 if the true population mean difference were really 0?
• 2) Converting the sample difference into a standardised scale, so that we can conveniently find the probability of how rare it would be under the null (i.e. the p-value) – this is done by dividing
the difference of sample means by its estimated standard error, giving the t-test statistic formula: $t = \frac{\hat{\Delta}}{\hat{\sigma}_{\hat{\Delta}}}$, where
$\hat{\sigma}_{\hat{\Delta}} = \sqrt{\hat{\sigma}_{\bar{Y}_2}^2 + \hat{\sigma}_{\bar{Y}_1}^2}$ (the square root of the squared standard error of the mean of group 2 plus the squared standard error of the mean of group 1).
WE ARE MOVING FROM SIGMAS THAT ARE STANDARD ERRORS TO SIGMAS THAT ARE STANDARD DEVIATIONS: $\hat{\sigma}_{\hat{\Delta}} = \sqrt{\frac{s_2^2}{n_2} + \frac{s_1^2}{n_1}}$, with the standard deviations and
sample sizes of groups 2 and 1.
This is further simplified under the assumption that the population variances are equal (both men and women have the same standard deviation in the specific variable), which translates into
this formula: $\hat{\sigma}_{\hat{\Delta}} = \hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$.
FINALLY – THE ESTIMATE OF THE POPULATION STANDARD DEVIATION FROM THE SAMPLE DATA: $\hat{\sigma} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$ ($n_1$ is the sample size of the first group, $n_2$ the
sample size of the second group, $s_1^2$ the sample variance of the first group, $s_2^2$ the sample variance of the second group).
SAMPLE STANDARD DEVIATION (Week 3): $s = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}$ (the sum of the squared differences between the observed values and the mean of the variable, divided by n−1, then square-rooted).

(YOU NEED SAMPLE SIZES, SAMPLE STANDARD DEVIATIONS TO START THIS CALCULATION!!) - WORK FROM BOTTOM UP IF NEEDED!
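A minimal bottom-up sketch of this calculation in Python (the sample sizes and standard deviations are assumed values for illustration; only the two sample means echo the example):

```python
import math

n1, mean1, s1 = 600, 2.84, 1.1   # men   (n and s assumed for illustration)
n2, mean2, s2 = 650, 2.29, 1.0   # women (n and s assumed for illustration)

# Pooled estimate of the common population standard deviation.
sigma_hat = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Estimated standard error of the difference in sample means.
se_diff = sigma_hat * math.sqrt(1 / n1 + 1 / n2)

t = (mean1 - mean2) / se_diff   # sample difference / standard error
print(f"difference = {mean1 - mean2:.2f}, se = {se_diff:.3f}, t = {t:.2f}")

# With raw data instead of summaries, scipy.stats.ttest_ind(men, women) gives
# the same equal-variance t statistic plus its p-value.
```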

The t-test statistic tells us how far our data (the data we observed) are from the null hypothesis. We compare the test statistic to the t-distribution (similar to a normal distribution): if the null hypothesis
were true in the population and there were no difference in the means, this is the distribution of t-statistics we would expect across different samples; for example, we would expect to see 95%
of the t-statistics between −1.96 and +1.96.
What is the probability, if H0 is true, of obtaining a test statistic as far or further from 0 as our t = 10.4, i.e. ≥ 10.4 or ≤ −10.4?
Because it is a two-sided test, we ask how likely it is that t is greater than 10.4 or less than −10.4 (what we got as a result from the t-statistic in the example).
Conclusion: it is highly unlikely that we would see a value of t this far away from the (null) hypothesised value of 0 if the samples were drawn from populations where the average TV/video
hours were equal: p < 0.001. Therefore, we have strong evidence against the null hypothesis that the populations (men and women) have the same mean (watch the same hours of TV), and
the null hypothesis is rejected, e.g. at the 5% level (p < 0.05).
Men seem to watch more TV on average than women in the population from which the sample was drawn.
Just knowing that the t-test statistic (10.4) is bigger than 1.96 (the critical value, any time you do a t-test with reasonably sized data) tells you that the p-value must be less than 0.05.
*p-value in SPSS output for t-statistic is sig. (two-tailed)
CENTRAL LIMIT THEOREM: as long as the samples are large enough (n > 20), the sampling distribution of the two-sample t statistic is approximately normal.
When the sample is smaller than that, the test statistic follows the t-distribution, which has degrees of freedom associated with it: $n_1 + n_2 - 2$ (a different calculation for df than
in the chi-square test) – the df define the shape of the t sampling distribution. According to the central limit theorem, if we keep collecting samples of 100, the average of the sample means will be the
same as the population mean, and the standard error will be the standard deviation of the population divided by the square root of n (the sample size). The more data we collect (bigger n), the smaller the
standard error, and the more confident we can be about the value of the population mean.
Standard error: how spread out the means (or other statistics) are that we would calculate if we repeatedly drew samples in a certain way.
Standard deviation of the population: how spread out the data are in the population.
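A minimal simulation sketch of the central limit theorem in Python (NumPy assumed; the skewed population is invented for illustration): sample means cluster around the population mean, with a spread of (population sd)/√n that shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # a skewed population

for n in (20, 100, 500):
    # Draw 2000 samples of size n and record each sample's mean.
    sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    print(f"n={n:4d}  mean of sample means={np.mean(sample_means):.3f}  "
          f"observed se={np.std(sample_means):.3f}  "
          f"theoretical se={population.std() / np.sqrt(n):.3f}")
```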
Normal distribution (bell curve) – mean = median = mode; it is symmetric; it is defined by just 2 parameters: the mean and the standard deviation (the square root of the variance). One standard deviation above
and below the mean always contains about 0.68 of the probability (roughly two-thirds). The chance of getting a value between ±1.96 standard deviations is 0.95 (this is how it links to the 95% confidence interval – the critical value).
95% CONFIDENCE INTERVAL – gives us a sense of how much precision we have in our estimates; often most useful for the substantive interpretation of the statistic in question (because there is
variation in sample statistics, due to sampling error, so every sample will be slightly different).
FORMULA FOR 95% CONFIDENCE INTERVALS: sample difference ± 1.96 × estimated standard error (of the difference in means), i.e. $\hat{\Delta} \pm 1.96 \times \hat{\sigma}_{\hat{\Delta}}$.
Interpretation: "We can be 95% confident that in the population, men report doing on average between 0.45 hours and 0.65 hours more TV/video per day than women."
Is the null hypothesis (in this case of zero) within the confidence interval? If not, at the 95% level, we can reject the null. – THIS IS ONE MORE WAY TO TEST AND REJECT THE NULL
HYPOTHESIS.
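A minimal sketch of this confidence-interval calculation in Python (the standard error is an assumed value, chosen to reproduce the 0.45-0.65 interval from the interpretation above):

```python
diff, se = 0.55, 0.053          # sample difference; se assumed for illustration

lower = diff - 1.96 * se
upper = diff + 1.96 * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
# If 0 (the null value) lies outside this interval, reject H0 at the 5% level.
```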
Two-sided alternative hypothesis: the means are not equal, with no special claim about which one is higher: $H_a: \Delta \neq 0$; p-value = 2 × 0.0005 = 0.001.
One-sided alternative hypothesis: the means are not equal and we are only interested in deviations in one direction, e.g. $H_a: \Delta > 0$; the p-value for this test = 0.0005.

Z-TEST STATISTIC – INFERENCE FOR ONE PROPORTION (CATEGORICAL, DICHOTOMOUS OR BINARY VARIABLES WITH JUST TWO POSSIBLE VALUES) – DIFFERENCE IN
PROPORTIONS.
It doesn't compare two groups; it compares one group to a particular value of interest, for example a benchmark (50%) of who is going to win an election – is 52% in a poll strong evidence that
Obama will win?
Sample size should be n>30;
$\pi$ – the population proportion (that we want to make inferences about, from the sample proportion that we already have)
$\pi_0$ – the particular value to which we want to compare the population proportion (in this case, 0.5 or 50% of votes)
NULL HYPOTHESIS: in the population of voters in Ohio, a proportion of 0.5 (or 50%) intends to vote for Obama: $H_0: \pi = \pi_0 = 0.5$ – OR the difference
between them is zero: $H_0: \pi - \pi_0 = 0$.
Alternative hypothesis: the population proportion intending to vote for Obama is not 0.5: $H_a: \pi \neq 0.5$.
The z-statistic (again a single number summarising how big the difference is between the data you observed and the data you would expect to observe if the null hypothesis were true) – larger values indicate
stronger evidence against the null hypothesis.
First, we calculate the sample difference between the proportion we observe ($\hat{\pi}$) and the proportion which would exactly agree with the null hypothesis ($\pi_0$): $\hat{\pi} - \pi_0$.
Second, we calculate the standard error (to see the extent to which we would have variability from one sample to another): $\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\pi_0(1-\pi_0)}{n}}$, where $\pi_0$ is the null-hypothesis proportion and n is the
sample size.
Third, we compute the z-test statistic (same form as the t-test): the sample difference divided by the standard error, $z = \frac{\hat{\pi} - \pi_0}{\hat{\sigma}_{\hat{\pi}}}$.
Fourth, finding the p-value (SPSS doesn't give it for the z-test): if the absolute value of the z-test statistic (as calculated with the above procedure) is greater than a critical value, then the p-value is smaller than
the significance level corresponding to that critical value. For example, if z is greater than 1.96 or smaller than −1.96, then p < 0.05. In our example, z = 1.69: this is greater than 1.65, so p < 0.10,
but it is smaller than 1.96, so p > 0.05; thus p is between 0.10 and 0.05 (which of course agrees with the exact value of 0.091) – THIS IS ONLY THE ONE-TAIL PROBABILITY (WE NEED TO
MULTIPLY BY TWO FOR THE TWO-SIDED P-VALUE).

Don’t forget to multiply by two when you need to get the two-sided p-value for ‘larger than z’ (in either direction)
Significance level   0.10 (10%)   0.05 (5%)   0.01 (1%)   0.001 (0.1%)
Critical value       1.65         1.96        2.58        3.29
Here z was 1.69, the tail probability is 0.0455, and the p-value is 2*0.0455 = 0.091
If your z-value is negative (e.g. if it had been z=-1.69 here), remove the negative sign (–) first
• 95% confidence interval: sample proportion (from the sample we have) ± 1.96 times the square root of (sample proportion times one minus sample proportion, divided by sample size): $\hat{\pi} \pm 1.96\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$.
Interpretation: we can be 95% confident that the population proportion voting for Obama in Ohio is between 0.496 and 0.556, i.e. 49.6% and 55.6%. The margin of error in opinion polls is in fact the
95% confidence interval (here 3%, as 52.6% ± 3%).
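A minimal sketch of the one-proportion z test and its confidence interval in Python (the sample size is an assumed value for illustration; only the 52.6% sample proportion echoes the example):

```python
import math

n, pi_hat, pi_0 = 1000, 0.526, 0.5   # n assumed for illustration

se_null = math.sqrt(pi_0 * (1 - pi_0) / n)     # standard error under H0
z = (pi_hat - pi_0) / se_null
print(f"z = {z:.2f}")                          # compare to 1.65 / 1.96 / 2.58

se_ci = math.sqrt(pi_hat * (1 - pi_hat) / n)   # standard error for the CI
lower, upper = pi_hat - 1.96 * se_ci, pi_hat + 1.96 * se_ci
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```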
One-sided vs. two-sided: we are equally interested in whether Obama or Romney would lead, which is why it's a two-sided test. In a one-sided test, the alternative hypothesis would be, for
example, $H_a: \pi > 0.5$: the population proportion voting for Obama is more than 50%.
Difference in calculation of the p-value: use only one tail of the distribution; the one-sided p-value is half of the two-sided.
Confidence intervals at different levels:
Confidence level   α      α/2     $z_{\alpha/2}$
90%                0.10   0.050   1.64
95%                0.05   0.025   1.96
99%                0.01   0.005   2.58

General formula for confidence intervals: estimate ± $z_{\alpha/2}$ × estimated standard error.

Z-TEST OF TWO PROPORTIONS: null hypothesis: in the population, there is no difference in polio infection rate between the treatment and control groups: $H_0: \pi_1 = \pi_2$, or in other words $H_0: \Delta = 0$,
where $\Delta = \pi_1 - \pi_2$.
Alternative hypothesis: in the population, polio infection rates between treated and untreated children would be different: $H_a: \pi_1 \neq \pi_2$, or in other words $H_a: \Delta \neq 0$. (Note: the two-sided alternative
hypothesis here does not discount the possibility that the vaccine could increase the polio rate; it considers changes in both directions. But we could also consider a one-sided alternative, i.e.
paying attention only to evidence in one direction.)

Formula for the z-test for two proportions: $z = \frac{\hat{\Delta}}{\hat{\sigma}_{\hat{\Delta}}}$, where $\hat{\sigma}_{\hat{\Delta}}$ is the estimated standard error of the difference.
Formula for the standard error of the estimated difference under the null (using the pooled proportion $\hat{\pi}$): $\hat{\sigma}_{\hat{\Delta}} = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$. In our example the p-value is approximately 0.000000001 (strong evidence against the null).
95% confidence interval for a difference in proportions (calculated with a different standard-error formula than in the significance test): $(\hat{\pi}_1 - \hat{\pi}_2) \pm 1.96\sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$,
where $n_1$ and $n_2$ are the sample sizes for groups 1 and 2, $\hat{\pi}_1$ is the proportion of polio cases in sample 1 and $\hat{\pi}_2$ the proportion in sample 2.
EXAMPLE: We can be 95% confident that there is a reduction of between 284 and 560 polio cases per million children in the vaccinated group compared to the placebo group. We can be 95%
confident that the proportion of prospective Lib Dem voters is higher for those who watched some of the leaders’ debates compared to those who did not, by between 0.027 and 0.091, i.e. 2.7-9.1
percentage points.
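A minimal sketch of the two-proportion z test in Python (the case counts and group sizes are assumed values for illustration, in the spirit of the vaccine example):

```python
import math

x1, n1 = 57, 200_000    # polio cases, treatment group (assumed counts)
x2, n2 = 142, 200_000   # polio cases, placebo group  (assumed counts)

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Pooled proportion and standard error under the null of equal proportions.
p_pool = (x1 + x2) / (n1 + n2)
se_null = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = diff / se_null

# Standard error and 95% CI for the estimated difference itself.
se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
print(f"z = {z:.2f}, 95% CI for the difference: ({ci[0]:.6f}, {ci[1]:.6f})")
```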

CORRELATION AND SIMPLE LINEAR REGRESSION: FITTING LINES TO SCATTERPLOTS (BOTH RESPONSE AND EXPLANATORY VARIABLES ARE CONTINUOUS)
DEFINING ASSOCIATION: Two variables Y and X are associated if the conditional distribution of Y given X is different for different values of X. Linear association: association between X and Y in
terms of how values of Y (summarised by conditional means of Y) change when X increases
But we need to summarise key features of the linear association: is it positive or negative, what is the strength of the association (how tightly are the points concentrated on the line).
Two forms of measures of (linear) association (single numbers that summarise the type and strength of association):

1. Sample covariance: a measure of linear association between X and Y (how the two variables vary together): $\text{Cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$, where $X_i$ is an observed value of X.
Cov > 0 indicates a positive linear association: scores on X which are positive deviations from the mean of X tend to be paired with scores on Y which are positive deviations from the mean of Y.
Cov < 0 indicates a negative linear association: scores on X which are positive deviations from the mean of X tend to be paired with scores on Y which are negative deviations from the mean of Y.
Cov = 0 indicates no linear association between X and Y.

2. Covariance converted to correlation: the magnitude of the covariance depends on the units in which X and Y are measured, so it is difficult to judge what is a small or large covariance. Therefore we need
the correlation coefficient (the standardised sample covariance) – also known as 'Pearson's correlation coefficient' (in SPSS), r, or just 'correlation': $r = \frac{\text{Cov}(X,Y)}{s_X s_Y}$.
r = +1 indicates a perfect positive linear association.
r = −1 indicates a perfect negative linear association.
r = 0 indicates no linear association.
*Broadly speaking, the closer to +1 or −1 the coefficient is, the stronger the linear statistical association.
*‘High’ or ‘low’ correlations need to be interpreted in context
* Covariance and correlation are symmetric: They do not really make a distinction between explanatory and response variables, although we may use this distinction when we talk about
the results
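A minimal sketch of sample covariance and Pearson correlation in Python/NumPy (the paired data are invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])

cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)  # sample covariance
r = cov / (x.std(ddof=1) * y.std(ddof=1))                     # Pearson correlation

print(f"cov = {cov:.3f}, r = {r:.3f}")
print(np.corrcoef(x, y)[0, 1])   # same r via NumPy; note r is symmetric in x and y
```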

SIMPLE LINEAR REGRESSION MODEL: one explanatory variable (continuous) and one response variable (continuous) – for this method it makes a difference which is which: $Y = \alpha + \beta X + \epsilon$.
Parameters of the simple linear model: $\alpha$: intercept or constant – the expected value of Y when X = 0; $\beta$: slope or coefficient of X – the change in the expected value of Y when X increases by one unit (how quickly it
goes up or down); and $\sigma^2$: error variance or residual variance (the variance of the conditional distribution of Y given X), where $\epsilon$ is normally distributed with mean 0 and variance $\sigma^2$.
The fitted model uses estimated regression coefficients $\hat{\alpha}$ and $\hat{\beta}$.

• $\hat{\alpha} = 2.096$ (estimated intercept or constant): when sleep hours = 0, expected hours of TV/video = 2.096.
• $\hat{\beta} = 0.055$ (estimated slope or coefficient of X): when sleep hours increase by one unit of X (hour), expected average TV/video hours increase by 0.055 units (hours).

Fitted value of Y – we need this to estimate the change in Y if X changes: $\hat{Y} = \hat{\alpha} + \hat{\beta}X$.

• Example: how many hours of TV/video does our model predict for someone who sleeps 8 hours/day? $\hat{Y} = 2.096 + 0.055 \times 8 = 2.536$. Extra: how many hours can we expect a 25-year-old person to watch per day?

The sum of squared residuals should be minimised in this model (take for each case the difference between its observed and fitted Y, square the difference, and add up all the squared differences.
The result is the sum of squared residuals, or residual sum of squares, or Sum of Squares of Errors (SSE): $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$, which is in fact the sum of squared distances between the observed data values and their
corresponding estimated values under the model).
*If we didn't know anything about the relationship between these two variables and I asked you to predict Y given X, the mean value of Y would be a good guess. We could calculate the difference
between each value and the mean, square them, and add them up to produce a measure of total variation in Y. Total Sum of Squares: the sum of squared distances of the observed values from the mean
of Y: $TSS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$.
*If we do know something about the relationship between X and Y and I ask you to predict Y for a unit drawn at random, the regression line will give the best estimate. We can calculate the difference
between each observed value and its corresponding estimated value, square them, and sum them: the Sum of Squares of Errors, $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$.
Coefficient of determination, $R^2 = \frac{TSS - SSE}{TSS}$: the improvement of our model over simply using the average value of Y for predicting Y; how much better we are at predicting Y by using this explanatory variable than
we would have been without it.
IN SPSS it can be found under Model Summary (R Square). In the example it is 0.002, or 0.2%: less than 0.2% of the variation in TV/video hours is explained by the fitted regression
model (with sleep hours as the explanatory variable). In simple linear regression there is a close link between R-square and the correlation coefficient ($R^2 = r^2$), but that is not the case in multiple linear
regression.

Significance testing for simple linear regression (slope coefficient) – null hypothesis: in the population, there is no linear association between the variables X and Y ($H_0: \beta = 0$; for the model this would mean
that the regression line is flat).
Alternative hypothesis (two-sided): in the population, the slope is not 0 ($H_a: \beta \neq 0$); there is a linear association between the variables X and Y.

Standard error of the estimated slope coefficient: $\hat{\sigma}_{\hat{\beta}} = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

Test statistic for the regression coefficient: we divide the slope estimate by its estimated standard error: $t = \frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}}$

• Conclusion: it is highly unlikely that we would see a value of t this far away from the (null) hypothesised value of 0 if the sample were drawn from a population where the null were true.
Strong evidence against the null hypothesis. The null hypothesis stated that in the population there was no linear association between sleep hours and TV/video hours – therefore we infer that
there is a linear association between sleep hours and TV/video hours.
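A minimal end-to-end sketch of simple linear regression in Python (SciPy assumed; the (x, y) data are invented for illustration, not the sleep/TV data):

```python
import numpy as np
from scipy.stats import linregress

x = np.array([5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0])   # explanatory variable
y = np.array([3.1, 2.7, 2.9, 2.4, 2.6, 2.3, 2.5, 2.0])   # response variable

res = linregress(x, y)
print(f"intercept = {res.intercept:.3f}, slope = {res.slope:.3f}")
print(f"R^2 = {res.rvalue**2:.3f}")                  # coefficient of determination
print(f"p-value for the slope: {res.pvalue:.4f}")    # tests H0: slope = 0

y_hat = res.intercept + res.slope * 8                # fitted value at x = 8
print(f"predicted y at x = 8: {y_hat:.2f}")
```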
