
Statistical analysis using SPSS

Biostatistics and Computing Nov/Dec 2005

Outline
- Categorical variables
- Continuous variables
- Regression
- Analysis of variance
- Repeated measures analysis


Measurement scales
The possible outcomes of a qualitative variable form a finite set of mutually exclusive categories. The outcomes of a quantitative variable are measurements taken on an interval scale (outcomes mutually exclusive, logically ordered, differences meaningful). If, in addition, zero on the scale means the absence of the characteristic, the scale is a ratio scale.

Measurement scales
Hierarchy of scales (measurement, variable, outcome)
- Qualitative variable (categorical variable)
  - Dichotomous outcome: binary variable (two categories)
  - Polytomous outcome (more than two categories)
    - Nominal variable (no ranking)
    - Ordinal variable (ranking)
- Quantitative variable (continuous variable, interval scale)
  - Ratio scale
  - Interval but not ratio scale

Course conventions
Course structured according to type of data
Sections split into
- Data description and presentation (data exploration)
- Statistical inference
- Interpretation

Statistical inference consists of
- Estimation
- Construction of confidence intervals
- Hypothesis testing

Categorical data: One-way table


The outcomes from a single categorical variable are completely described by a frequency table. The data itself is considered a one-way table. Absolute and relative frequencies (percentages) of each category may be of interest.

Creating a one-way table in SPSS


[female.sav] Analyze - Descriptive Statistics - Frequencies... - Variable(s)=depressi - Display frequency tables - OK

DEPRESSI
                          Frequency  Percent  Valid Percent  Cumulative Percent
Valid    none                    26     22.0           23.6                23.6
         mild                    67     56.8           60.9                84.5
         moderate                17     14.4           15.5               100.0
         Total                  110     93.2          100.0
Missing  System Missing           8      6.8
         Total                    8      6.8
Total                           118    100.0
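For readers working outside SPSS, the columns of the frequency table can be reproduced with a short calculation. The following is a minimal Python sketch, not SPSS output: the list `responses` is a hypothetical reconstruction of the raw data from the reported counts, with `None` standing in for a system-missing value, and the function name `frequency_table` is an illustrative choice.

```python
from collections import Counter

# Hypothetical raw data rebuilt from the reported counts;
# None stands for a system-missing value.
responses = ["none"] * 26 + ["mild"] * 67 + ["moderate"] * 17 + [None] * 8


def frequency_table(values, order):
    """Return rows of (category, count, percent, valid percent, cumulative percent)."""
    counts = Counter(v for v in values if v is not None)
    n_total = len(values)            # all cases, including missing
    n_valid = sum(counts.values())   # cases with an observed category
    rows, cumulative = [], 0.0
    for category in order:
        count = counts[category]
        valid_pct = 100.0 * count / n_valid
        cumulative += valid_pct
        rows.append((category, count, 100.0 * count / n_total, valid_pct, cumulative))
    return rows, n_valid, n_total


rows, n_valid, n_total = frequency_table(responses, ["none", "mild", "moderate"])
for category, count, pct, valid_pct, cum_pct in rows:
    print(f"{category:<10}{count:>5}{pct:>8.1f}{valid_pct:>8.1f}{cum_pct:>8.1f}")
print(f"valid n = {n_valid}, missing = {n_total - n_valid}, total = {n_total}")
```

Percent uses all 118 cases as the denominator, whereas Valid Percent uses only the 110 cases with an observed category.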

Graphical display of a one-way table


In SPSS a graphical representation of the category frequencies can be achieved by
- bar charts (bar area proportional to frequency)
- line charts (value on y-axis represents frequency)
- pie charts (circle section area proportional to frequency)

Displaying a one-way table in SPSS


[female.sav] Graphs - Bar... - Simple - Summaries for groups of cases - Define - Bars Represent=N of cases - Category Axis:=depressi - OK
[female.sav] Graphs - Line... - Simple - Summaries for groups of cases - Define - Line Represents=N of cases - Category Axis:=depressi - OK
[female.sav] Graphs - Pie... - Summaries for groups of cases - Define - Slices Represent=N of cases - Define Slices by:=depressi - OK
[Figure: bar chart of Count (y-axis, 0 to 70) by depressi category (none, mild, moderate)]

Estimates and confidence intervals from a one-way table


In a sample representing a one-way table the population parameter of interest is the relative frequency with which each category occurs. For a given category this parameter is simply estimated by the observed relative frequency. To quantify the uncertainty, the standard error of the estimator or, preferably, a 95% confidence interval for the parameter should be presented.

Estimates and CIs in SPSS


The estimate can simply be read from the table.
Currently SPSS does not supply a confidence interval (CI) for a proportion.
An approximate 95% CI for p is given by (if np is large)

\[ \left[\, \hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\;\; \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \,\right] \]

- Type the observed proportions of the four possible categories into an SPSS spreadsheet column
- Name the column estimate
- Transform - Compute... - Target Variable=lower - Numeric Expression=estimate-1.96*sqrt(estimate*(1-estimate)/110)
- Repeat the previous step with Target Variable=upper - Numeric Expression=estimate+1.96*sqrt(estimate*(1-estimate)/110)

Estimates and CIs in SPSS


Proportion           Estimate   95% CI
no symptoms          0.236      [0.157, 0.315]
mild symptoms        0.609      [0.518, 0.700]
moderate symptoms    0.155      [0.087, 0.223]
severe symptoms      0          n.a. since 110p likely to be small

Example statement: The proportion of women suffering from mild depression was estimated to be 60.9% (95% CI from 51.8% to 70.0%).
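The Compute steps can be mirrored outside SPSS. The sketch below is a minimal Python illustration of the normal-approximation interval, assuming the rounded proportions from the table as input and n=110 valid cases; the function name `wald_ci` is an illustrative choice, not an SPSS term.

```python
import math


def wald_ci(p_hat, n, z=1.96):
    """Approximate 95% CI for a proportion; only reasonable when n*p is large."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


n = 110  # number of valid cases
# Rounded estimates as typed into the SPSS column; the severe category is
# left out because its observed proportion is 0.
estimates = {"no symptoms": 0.236, "mild symptoms": 0.609, "moderate symptoms": 0.155}
intervals = {name: wald_ci(p, n) for name, p in estimates.items()}

for name, (lower, upper) in intervals.items():
    print(f"{name:<18} {estimates[name]:.3f}  [{lower:.3f}, {upper:.3f}]")
```

Because the rounded proportions are used as input, the limits agree with the table to three decimals. Starting from the unrounded counts (for example 26/110) can shift the last digit.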

Hypothesis testing from a one-way table


In a one-way table a hypothesis of interest may be the equality of the proportions of each category, i.e. H0: p1 = p2 = ... = pk. This can be tested using a chi-squared test, which compares observed frequencies with those expected under the null hypothesis. This test is an asymptotic test, i.e. it can be treated as a level-α test only for large sample sizes. As a rule of thumb, the number of observations expected to occur within a category under the null hypothesis should not fall below 5. This can be a problem when comparing many categories in small samples. As an alternative in these situations an exact test, which uses Monte Carlo simulation to generate the p-value, can be used.

Tests from a one-way table in SPSS


[card.sav] Analyze - Nonparametric Tests - Chi square... - Test variable list=card - Expected Values - All categories equal - Expected Range=Get from data - Exact - Asymptotic only - Continue - OK

choice of card
         Observed N   Expected N   Residual
A                37         50.0      -13.0
B                73         50.0       23.0
C                40         50.0      -10.0
Total           150

Test Statistics
                 choice of card
Chi-Square (a)           15.960
df                            2
Asymp. Sig.                .000
a. 0 cells (.0%) have expected frequencies less than 5.

The relative frequencies of cards A, B and C differed significantly at the 5% level (χ²=15.96, d.f.=2, p<0.001). The subject had a preference for card B.
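The test statistic for the card example can be checked by hand. The sketch below is a minimal Python illustration using only the standard library. The helper `chi2_sf_even_df` is an assumption of this sketch: it uses a closed form for the upper tail of the chi-squared distribution that holds only for even degrees of freedom, which is enough here since d.f.=2.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-squared variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


observed = {"A": 37, "B": 73, "C": 40}
n = sum(observed.values())
expected = n / len(observed)                       # all categories equal under H0
residuals = {card: count - expected for card, count in observed.items()}
chi2 = sum(r * r / expected for r in residuals.values())
df = len(observed) - 1
p_value = chi2_sf_even_df(chi2, df)

print(f"chi-squared = {chi2:.3f}, df = {df}, p = {p_value:.6f}")
```

The residuals show where the departure from equal proportions comes from: card B is chosen 23 times more often than expected.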

Hypothesis testing from a one-way table with two categories


When the proportions of a binary variable are compared the null hypothesis is simply H0: p1=p2=0.5 and a test based on the binomial distribution can be used. This test is valid for finite samples (exact test). However, SPSS uses a normal approximation when the sample size exceeds 25. In the binary case for sample sizes less than 26 the binomial test is preferable.

Tests from a one-way table with two categories in SPSS


[coin.sav] Analyze - Nonparametric Tests - Binomial... - Test Variable List=coin - Test proportion:=0.50 - Define Dichotomy=Get from data - OK

Binomial Test
                 Category     N   Observed Prop.   Test Prop.   Asymp. Sig. (2-tailed)
COIN   Group 1   head        91              .46          .50                 .229 (a)
       Group 2   tail       109              .54
       Total                200             1.00
a. Based on Z Approximation.

At the 5% level the proportion of heads did not differ significantly from the proportion of tails (binomial test, p=0.23). There was no evidence for a biased coin.
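To see how the exact binomial test and the normal approximation relate, both can be computed for the coin data. This is a minimal Python sketch under stated assumptions: the function names are illustrative, the exact p-value doubles the smaller tail (valid because the null distribution is symmetric at p=0.5), and the approximation applies a continuity correction, which appears to be what reproduces the reported value of .229.

```python
import math


def binomial_two_sided_exact(k, n):
    """Exact two-sided p-value for H0: p = 0.5 (double the smaller tail)."""
    k_low = min(k, n - k)
    tail = sum(math.comb(n, i) for i in range(k_low + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def binomial_two_sided_normal(k, n):
    """Normal approximation with continuity correction for H0: p = 0.5."""
    mean, sd = n / 2.0, math.sqrt(n) / 2.0
    z = max(abs(k - mean) - 0.5, 0.0) / sd
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))


heads, tosses = 91, 200
p_exact = binomial_two_sided_exact(heads, tosses)
p_normal = binomial_two_sided_normal(heads, tosses)
print(f"exact p = {p_exact:.3f}, normal approximation p = {p_normal:.3f}")
```

With 200 tosses the two p-values are close, which is why the approximation is acceptable for large samples.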

Several one-way tables


When the outcomes from a single categorical variable have been observed in several populations, the respective one-way tables can be presented as a single two-way table. Clustered or stacked bar charts can be used to visualise the cell frequencies. The samples are displayed on the horizontal axis. The categories of the single variable are displayed in different colours (patterns). In a clustered bar chart, the bars corresponding to the same sample are displayed next to each other. In a stacked bar chart, the bars corresponding to the same sample are displayed on top of each other.

Contingency tables
When two variables have been measured on the same units the result can also be displayed as a two-way table, now called a contingency table. This is similar to displaying several one-way tables in an aggregated two-way table except that now neither the column totals nor the row totals have been fixed. For two categorical variables each cell in a contingency table contains the frequency at which the combination of its row and column categories occurred. The row and column totals represent the frequencies of the two variables singly. The concept can be extended into multi-way contingency tables.

Display of a two-way table in SPSS


[pillness.sav] Data - Weight Cases... - Weight cases by - Frequency Variable=number
Analyze - Descriptive Statistics - Crosstabs... - Row(s):=pillness - Column(s):=thought - Display clustered bar charts - Cells... - Counts=Observed - Percentages=Row - Continue - OK

PILLNESS * THOUGHT Crosstabulation
                                                     THOUGHT
                                definitely   I don't    has crossed   definitely
                                not          think so   my mind       has          Total
normal   Count                          90          5             3            1      99
         % within PILLNESS           90.9%       5.1%          3.0%         1.0%  100.0%
mild     Count                          43         18            21           15      97
         % within PILLNESS           44.3%      18.6%         21.6%        15.5%  100.0%
severe   Count                          34          8            21           36      99
         % within PILLNESS           34.3%       8.1%         21.2%        36.4%  100.0%
Total    Count                         167         31            45           52     295
         % within PILLNESS           56.6%      10.5%         15.3%        17.6%  100.0%

[Figure: clustered bar chart of Count (y-axis, 0 to 100) by pillness (normal, mild, severe), with one bar per thought category (definitely not, I don't think so, has crossed my mind, definitely has)]

Estimates of interest
In a 2×2 table the parameter of interest is the relative risk (RR) or odds ratio (OR) of a category (A1) of outcome A between two samples (or two categories of the second variable, B) B1 and B2.

                      Category A1   Category not A1   Total
Category/sample B1    a             b                 a+b
Category/sample B2    c             d                 c+d
Total                 a+c           b+d               n=a+b+c+d

\[ \text{RR of } A_1 \text{ comparing } B_1 \text{ and } B_2 = \frac{a/(a+b)}{c/(c+d)} \]

\[ \text{OR of } A_1 \ldots = \frac{a/b}{c/d} \]

Estimates from a 2×2 table


- If neither row nor column totals have been fixed by the design of the experiment, risks within groups, the RR and the OR can be estimated (cross-sectional study).
- If the row totals have been fixed by the design of the experiment, risks within groups, the RR and the OR can be estimated (cohort study).
- If the column totals have been fixed by the design of the experiment, only the OR can be estimated (case control study).
- When meaningful, the RR and OR are simply estimated by substituting the theoretical frequencies by their observed values.


CIs for relative risks or odds ratios in SPSS


In SPSS CIs for RRs and ORs can be obtained. But the data need to be reduced into a 2×2 table of the category of interest and its opposite event (defining columns) and the two samples or categories to be compared (defining rows) before they can be calculated.
[pillness.sav]
- Transform - Compute... - Target Variable=nthought - Numeric expression=1 - If - Include if case satisfies condition=thought=1 - Continue - OK
- Repeat this step with Numeric expression=2 - If - Include if case satisfies condition=thought>1 - OK
- Label the new variable nthought: 1=Has not thought about suicide, 2=Has thought about suicide
- Data - Select Cases... - If condition is satisfied - If - pillness~=2 - Continue - OK
- Analyze - Descriptive Statistics - Crosstabs... - Row(s):=pillness - Column(s):=nthought - Cells... - Counts=Observed - Percentage=Row - Continue - Statistics... - Risk - Continue - OK
- Data - Select Cases... - All cases - OK

CIs for relative risks or odds ratios in SPSS


PILLNESS * NTHOUGHT Crosstabulation
                                              NTHOUGHT
                                Has not thought   Has thought
                                about suicide     about suicide   Total
normal   Count                               90               9      99
         % within PILLNESS                90.9%            9.1%  100.0%
severe   Count                               34              65      99
         % within PILLNESS                34.3%           65.7%  100.0%
Total    Count                              124              74     198
         % within PILLNESS                62.6%           37.4%  100.0%

Risk Estimate
                                                                 95% Confidence Interval
                                                        Value    Lower     Upper
Odds Ratio for PILLNESS (normal / severe)              19.118    8.582    42.590
For cohort NTHOUGHT = Has not thought about suicide     2.647    2.002     3.500
For cohort NTHOUGHT = Has thought about suicide          .138     .073      .262
N of Valid Cases                                          198

The prevalence of thinking about suicide was lower in the normal group than in the severely psychiatrically ill group (RR=0.14, 95% CI from 0.07 to 0.26).
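The risk estimates and their confidence limits can be approximated by hand from the four cell counts. The sketch below is a minimal Python illustration that assumes the common large-sample method of building the interval on the log scale. The source does not say which method SPSS uses, so the standard-error formulas and the function names are assumptions of this sketch, and small differences in the last digit are to be expected.

```python
import math


def ratio_ci(estimate, se_log, z=1.96):
    """Interval built on the log scale and transformed back (assumed method)."""
    return estimate * math.exp(-z * se_log), estimate * math.exp(z * se_log)


def risk_estimates(a, b, c, d):
    """RR and OR of the first column, comparing row 1 with row 2 of a 2x2 table."""
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a / b) / (c / d)
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return {
        "RR": (rr, *ratio_ci(rr, se_log_rr)),
        "OR": (odds_ratio, *ratio_ci(odds_ratio, se_log_or)),
    }


# Rows: normal, severe. First column: has not thought about suicide.
not_thought = risk_estimates(90, 9, 34, 65)
# Same table with the columns swapped, so the first column is "has thought".
thought = risk_estimates(9, 90, 65, 34)

for label, (value, lower, upper) in [
    ("OR (normal / severe)", not_thought["OR"]),
    ("RR, has not thought", not_thought["RR"]),
    ("RR, has thought", thought["RR"]),
]:
    print(f"{label:<22} {value:7.3f}  [{lower:.3f}, {upper:.3f}]")
```

Swapping the columns changes which event the RR refers to, which is why the output reports one RR per column of the 2×2 table.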


Hypothesis tests for two-way tables


The general homogeneity hypothesis, that the proportions of m categories of outcome A do not differ between k samples (or categories of outcome B), can be tested:
H0: p(Ai|sample 1) = p(Ai|sample 2) = ... = p(Ai|sample k) for all i = 1, ..., m
A chi-squared test can be employed to test this hypothesis. This is an asymptotic test; the expected number of counts in each cell should be at least 5. Exact tests that do not rely on a large-sample approximation are available as an alternative for small sample sizes.

Test for homogeneity in SPSS


Analyze - Descriptive Statistics - Crosstabs... - Row(s):=pillness - Column(s):=thought - Statistics... - Chi-square - Continue - OK

Chi-Square Tests
                                   Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square                 91.253 (a)    6   .000
Likelihood Ratio                  100.535        6   .000
Linear-by-Linear Association       73.501        1   .000
N of Valid Cases                      295
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 10.19.

At the 5% level the proportions of the thought categories differed significantly between the normal, the mildly and the severely psychiatrically ill groups (χ²=91.25, d.f.=6, p<0.001).
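The Pearson statistic and the minimum expected count can be recomputed from the observed cell counts of the PILLNESS by THOUGHT crosstabulation. The following minimal Python sketch covers only the Pearson statistic, not the likelihood ratio or linear-by-linear tests. The function names are illustrative, and the p-value helper relies on a closed form that holds only for even degrees of freedom (here d.f.=6).

```python
import math


def pearson_chi2(table):
    """Pearson chi-squared statistic, degrees of freedom and smallest expected count."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2, min_expected = 0.0, float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under homogeneity
            min_expected = min(min_expected, expected)
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, min_expected


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-squared variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Rows: normal, mild, severe. Columns: definitely not, I don't think so,
# has crossed my mind, definitely has.
thought_table = [
    [90, 5, 3, 1],
    [43, 18, 21, 15],
    [34, 8, 21, 36],
]
chi2, df, min_expected = pearson_chi2(thought_table)
p_value = chi2_sf_even_df(chi2, df)
print(f"chi-squared = {chi2:.3f}, df = {df}, min expected = {min_expected:.2f}, p = {p_value:.3g}")
```

Checking `min_expected` against the rule of thumb of 5 is the programmatic counterpart of footnote a in the SPSS output.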


Chi-squared test for independence


In a contingency table the independence hypothesis H0: $p_{ij} = p_{i\cdot}\,p_{\cdot j}$ vs H1: $p_{ij} \neq p_{i\cdot}\,p_{\cdot j}$ can be tested using the chi-squared statistic, which compares observed cell frequencies with those expected under the null hypothesis. The independence hypothesis is equivalent to the general homogeneity hypothesis. As before this is an asymptotic test: the approximation is reasonable when the number of observations expected in each cell under the independence hypothesis exceeds 4. An exact test can be used as an alternative for small sample sizes.

Hypothesis tests for two-way tables


We might only be interested in whether the proportions of a certain category differ between the samples (or categories of the second variable), rather than any category. For a given category Ai we want to test the null hypothesis H0: p(Ai|sample 1) = p(Ai|sample 2) = ... = p(Ai|sample k). The chi-squared test can again be employed, but the categories have to be aggregated into Ai and not Ai before proceeding with the cross tabulation.


Test for differences in a proportion in SPSS

Use the variable nthought generated previously to consider the category "has thought about suicide".

Analyze - Descriptive Statistics - Crosstabs... - Row(s):=pillness - Column(s):=nthought - Cells... - Counts=Observed - Percentages=Row - Continue - Statistics... - Chi-square - Continue - OK

PILLNESS * NTHOUGHT Crosstabulation
                                              NTHOUGHT
                                Has not thought   Has thought
                                about suicide     about suicide   Total
normal   Count                               90               9      99
         % within PILLNESS                90.9%            9.1%  100.0%
mild     Count                               43              54      97
         % within PILLNESS                44.3%           55.7%  100.0%
severe   Count                               34              65      99
         % within PILLNESS                34.3%           65.7%  100.0%
Total    Count                              167             128     295
         % within PILLNESS                56.6%           43.4%  100.0%

Chi-Square Tests
                                   Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square                 73.353 (a)    2   .000
Likelihood Ratio                   82.877        2   .000
Linear-by-Linear Association       64.262        1   .000
N of Valid Cases                      295
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 42.09.

At the 5% level the proportions of subjects that had thought about suicide differed significantly between the normal, the mildly and the severely psychiatrically ill groups (χ²=73.35, d.f.=2, p<0.001).
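The aggregation step (thought=1 versus thought>1) is the part that is easy to get wrong, so it may help to see it spelled out. The sketch below is a minimal Python illustration that collapses the four thought categories into two and then recomputes the Pearson statistic. The dictionary layout and the function name `collapse` are assumptions made for this example, and the p-value uses the fact that the upper tail of a chi-squared distribution with 2 degrees of freedom is exp(-x/2).

```python
import math

# Counts per pillness group for the four thought categories, in the order
# definitely not, I don't think so, has crossed my mind, definitely has.
thought_counts = {
    "normal": [90, 5, 3, 1],
    "mild": [43, 18, 21, 15],
    "severe": [34, 8, 21, 36],
}


def collapse(row):
    """Aggregate into [has not thought, has thought]; mirrors thought=1 versus thought>1."""
    return [row[0], sum(row[1:])]


nthought_counts = {group: collapse(row) for group, row in thought_counts.items()}

row_totals = {group: sum(row) for group, row in nthought_counts.items()}
col_totals = [sum(col) for col in zip(*nthought_counts.values())]
n = sum(col_totals)

chi2 = 0.0
for group, row in nthought_counts.items():
    for j, observed in enumerate(row):
        expected = row_totals[group] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

df = (len(nthought_counts) - 1) * (len(col_totals) - 1)
p_value = math.exp(-chi2 / 2.0)  # exact upper tail of chi-squared when df == 2

for group, (no, yes) in nthought_counts.items():
    print(f"{group:<8}{no:>5}{yes:>5}  {100.0 * yes / row_totals[group]:5.1f}% has thought")
print(f"chi-squared = {chi2:.3f}, df = {df}, p = {p_value:.3g}")
```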

Hypothesis tests for two-way tables


It is possible to test whether the risk of a category Ai differs between two samples (or two categories of B) B1 and B2. This null hypothesis H0: RR=1 is equivalent to H0: p(Ai|B1)=p(Ai|B2) or H0: OR=1. Again, to do this technically the table needs to be reduced into the relevant two-by-two table (recode categories and select samples for comparison). Then the chi-squared test from the two-by-two table is a test for the null hypothesis above. Fisher's exact test is automatically provided for 2×2 tables (use the p-value for the two-sided test).


Test for a difference in a risk in SPSS


[pillness.sav] Use the previously defined new categories nthought, and use the previously defined filter filter_$ to select cases from samples 1 and 3.
Analyze - Descriptive Statistics - Crosstabs... - Row(s):=pillness - Column(s):=nthought - Cells... - Counts=Observed - Percentage=Row - Continue - Statistics... - Chi-square - Continue - OK

Chi-Square Tests
                                  Value        df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                    (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square                67.669 (b)    1   .000
Continuity Correction (a)         65.274        1   .000
Likelihood Ratio                  74.033        1   .000
Fisher's Exact Test                                               .000         .000
Linear-by-Linear Association      67.327        1   .000
N of Valid Cases                     198
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 37.00.

There was a significant difference in the prevalence of thinking about suicide between the normal and the severely psychiatrically ill group (χ²=67.7, d.f.=1, p<0.001).
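For a 2×2 table the Pearson statistic has a shortcut formula, and the continuity-corrected version differs only in the numerator. The sketch below is a minimal Python illustration using the cell counts for the normal and severe groups. It covers the Pearson and continuity-corrected statistics only, not the likelihood ratio or Fisher's exact test, and the function names are illustrative.

```python
import math


def chi2_2x2(a, b, c, d, continuity=False):
    """Chi-squared statistic for a 2x2 table, optionally with continuity correction."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        diff = max(diff - n / 2.0, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi2_sf_df1(x):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Rows: normal, severe. Columns: has not thought, has thought about suicide.
a, b, c, d = 90, 9, 34, 65
pearson = chi2_2x2(a, b, c, d)
corrected = chi2_2x2(a, b, c, d, continuity=True)
print(f"Pearson chi-squared    = {pearson:.3f}, p = {chi2_sf_df1(pearson):.3g}")
print(f"Continuity correction  = {corrected:.3f}, p = {chi2_sf_df1(corrected):.3g}")
```

The continuity correction always gives a smaller statistic, so it is the more conservative of the two asymptotic tests.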

Multi-way contingency tables


In a multi-way contingency table the concept of independence becomes more complicated, e.g. in a three-way table the following hypotheses may be of interest:
- mutual independence of the three variables, i.e. none are related
- partial independence, i.e. association exists between two of the variables, both of which are independent of the third
- conditional independence, i.e. two of the variables are independent in each level of the third, but each may be associated with the third variable

A model-based approach is helpful for investigating associations in multi-way tables (log-linear modelling). The model-based approach supplies the likelihood ratio test.

