Assumption of Normality

SW388R7
Data Analysis & Computers II
Assumption of normality
Slide 1
Assumption of normality
Transformations
Assumption of normality script
Practice problems
Assumption of Normality
Slide 2
Many of the statistical methods that we will apply require the assumption that a variable or variables are normally distributed.
With multivariate statistics, the assumption is that the combination of variables follows a multivariate normal distribution.
Since there is not a direct test for multivariate normality, we generally test each variable individually and assume that they are multivariate normal if they are individually normal, though this is not necessarily the case.
Evaluating normality
Slide 3
There are both graphical and statistical methods for evaluating normality.
Graphical methods include the histogram and normality plot.
Statistical methods include diagnostic hypothesis tests for normality, and a rule of thumb that says a variable is reasonably close to normal if its skewness and kurtosis have values between -1.0 and +1.0.
None of the methods is absolutely definitive.
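
The rule of thumb can be sketched in a few lines of Python. This is only an illustration, not the course's method: the function names are hypothetical, and the formulas are the usual bias-adjusted sample skewness and excess kurtosis, which we assume are the ones behind the values SPSS reports.

```python
import math

def skewness_kurtosis(values):
    """Return (skewness, excess kurtosis) using bias-adjusted sample formulas."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator).
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    z3 = sum(((x - mean) / sd) ** 3 for x in values)
    z4 = sum(((x - mean) / sd) ** 4 for x in values)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)) * z4 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return skew, kurt

def within_rule_of_thumb(skew, kurt, limit=1.0):
    """True when both statistics fall between -limit and +limit."""
    return -limit <= skew <= limit and -limit <= kurt <= limit

skew, kurt = skewness_kurtosis([1, 2, 3, 4, 5])
print(within_rule_of_thumb(skew, kurt))
```

Because the rule is only a screening device, a variable that passes it should still be checked against the plots and the hypothesis test.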
Transformations
Slide 4
When a variable is not normally distributed, we can create a transformed variable and test it for normality. If the transformed variable is normally distributed, we can substitute it in our analysis.
Three common transformations are: the logarithmic transformation, the square root transformation, and the inverse transformation.
All of these change the measuring scale on the horizontal axis of a histogram to produce a transformed variable that is mathematically equivalent to the original variable.
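
As a rough sketch, the three transformations can be computed as follows. The function name `transform` is hypothetical, and the check for positive values is our assumption: the logarithm and the inverse are undefined at zero, so a variable containing zeros or negative values would need to be shifted first.

```python
import math

def transform(values):
    """Return the log10, square root, and inverse of each value."""
    # Assumption: all values are positive, as log10 and 1/x require.
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return {
        "log10": [math.log10(v) for v in values],
        "sqrt": [math.sqrt(v) for v in values],
        "inverse": [1.0 / v for v in values],
    }

result = transform([1.0, 100.0])
print(result["log10"], result["sqrt"], result["inverse"])
```

Each transformed list would then be tested for normality in the same way as the original variable.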
When transformations do not work
Slide 5
When none of the transformations induces normality in a variable, including that variable in the analysis will reduce our effectiveness at identifying statistical relationships, i.e. we lose power.
We do have the option of changing the way the information in the variable is represented, e.g. substitute several dichotomous variables for a single metric variable.
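
One way to substitute dichotomous variables for a metric variable is to group the values into ranges and code each range, other than a reference range, as a 0/1 indicator. The sketch below assumes this approach; the function name and the cut points (5 and 20 hours) are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

def dichotomize(values, cut_points):
    """Return one 0/1 indicator list per category above the lowest (reference) one."""
    cuts = sorted(cut_points)
    # Category 0 is below the first cut point, category 1 is between
    # the first and second cut points, and so on.
    categories = [bisect_right(cuts, v) for v in values]
    return [
        [1 if c == k else 0 for c in categories]
        for k in range(1, len(cuts) + 1)
    ]

# Hypothetical cut points: under 5 hours (reference), 5 to under 20, 20 or more.
print(dichotomize([0.2, 5.5, 102.0], [5, 20]))
```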
Problem 1
Slide 6
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Use 0.01 as the level of significance.
Based on a diagnostic hypothesis test of normality, total hours spent on the Internet is normally distributed.
1. True
2. True with caution
3. False
4. Incorrect application of a statistic
Slide 7
Computing Explore descriptive statistics
To compute the statistics needed for evaluating the normality of a variable, select the Explore command from the Descriptive Statistics menu.
Slide 8
Adding the variable to be evaluated
First, click on the variable to be included in the analysis to highlight it.
Second, click on the right arrow button to move the highlighted variable to the Dependent List.
Slide 9
Selecting statistics to be computed
To select the statistics for the output, click on the Statistics command button.
Slide 10
Including descriptive statistics
First, click on the Descriptives checkbox to select it. Clear the other checkboxes.
Second, click on the Continue button to complete the request for statistics.
Slide 11
Selecting charts for the output
To select the diagnostic charts for the output, click on the Plots command button.
Slide 12
Including diagnostic plots and statistics

First, click on the None option button on the Boxplots panel since boxplots are not as helpful as other charts in assessing normality.
Second, click on the Normality plots with tests checkbox to include normality plots and the hypothesis tests for normality.
Third, click on the Histogram checkbox to include a histogram in the output. You may want to examine the stem-and-leaf plot as well, though I find it less useful.
Finally, click on the Continue button to complete the request.
Slide 13
Completing the specifications for the analysis
Click on the OK button to complete the specifications for the analysis and request SPSS to produce the output.
Slide 14
The histogram
An initial impression of the normality of the distribution can be gained by examining the histogram.
In this example, the histogram shows a substantial violation of normality caused by an extremely large value in the distribution.
[Figure: Histogram of TOTAL TIME SPENT ON THE INTERNET (horizontal axis, 0.0 to 100.0) against Frequency (vertical axis). Std. Dev = 15.35, Mean = 10.7, N = 93.00]
Slide 15
The normality plot

The problem with the normality of this variable's distribution is reinforced by the normality plot.
If the variable were normally distributed, the red dots would fit the green line very closely. In this case, the red points in the upper right of the chart indicate the severe skewing caused by the extremely large data values.
[Figure: Normal Q-Q Plot of TOTAL TIME SPENT ON THE INTERNET. Horizontal axis: Observed Value. Vertical axis: Expected Normal.]
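
The points in a normality plot pair each sorted observation with the value we would expect at that rank if the data were normal. The sketch below shows one way to compute such pairs. The function name is hypothetical, and the plotting position (rank - 3/8) / (n + 1/4), known as Blom's formula, is an assumption on our part; SPSS may be configured to use a different one.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each sorted observation with its expected standard normal quantile."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    observed = sorted(values)
    # Assumption: Blom's plotting position for the proportion at each rank.
    expected = [
        std_normal.inv_cdf((rank - 0.375) / (n + 0.25))
        for rank in range(1, n + 1)
    ]
    return list(zip(observed, expected))

for obs, exp in normal_qq_points([3.0, 1.0, 2.0]):
    print(obs, round(exp, 3))
```

When one or two observed values are far larger than the rest, their points sit well away from the straight line formed by the others, which is the pattern described above.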
Slide 16
The test of normality

Tests of Normality

                                    Kolmogorov-Smirnov(a)     Shapiro-Wilk
                                    Statistic  df  Sig.       Statistic  df  Sig.
TOTAL TIME SPENT ON THE INTERNET    .246       93  .000       .606       93  .000

a. Lilliefors Significance Correction

Problem 1 asks about the results of the test of normality. Since the sample size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample size were 50 or less, we would use the Shapiro-Wilk statistic instead.
The null hypothesis for the test of normality states that the actual distribution of the variable is equal to the expected distribution, i.e., the variable is normally distributed. Since the probability associated with the test of normality (<0.001) is less than or equal to the level of significance (0.01), we reject the null hypothesis and conclude that total hours spent on the Internet is not normally distributed. (Note: we report the probability as <0.001 instead of .000 to be clear that the probability is not really zero.)
The answer to problem 1 is false.
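
The reasoning for problem 1 follows three small rules: pick the test by sample size, compare the probability to the level of significance, and report very small probabilities as <0.001. A minimal sketch, with hypothetical function names, and with 0.0004 standing in for the value SPSS displays as .000 (the true probability is not shown in the output):

```python
def choose_test(n):
    """Kolmogorov-Smirnov for samples larger than 50, otherwise Shapiro-Wilk."""
    return "Kolmogorov-Smirnov" if n > 50 else "Shapiro-Wilk"

def format_p(p):
    """Report probabilities below 0.001 as '<0.001' rather than '.000'."""
    return "<0.001" if p < 0.001 else f"{p:.3f}"

def normality_decision(p, alpha):
    """Reject the null hypothesis of normality when p is at or below alpha."""
    return "reject normality" if p <= alpha else "fail to reject normality"

# 0.0004 is a hypothetical stand-in for the probability displayed as .000.
print(choose_test(93), format_p(0.0004), normality_decision(0.0004, 0.01))
```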
Slide 17
The assumption of normality script
An SPSS script to produce all of the output that we have produced manually is available on the course web site.
After downloading the script, run it to test the assumption of normality.
Select Run Script from the Utilities menu.
Slide 18
Selecting the assumption of normality script

First, navigate to the folder containing your scripts and highlight the NormalityAssumptionAndTransformations.SBS script.
Second, click on the Run button to activate the script.
Slide 19
Specifications for normality script
First, move variables from the list of variables in the data set to the Variables to Test list box.
The default output is to do all of the transformations of the variable. To exclude some transformations from the calculations, clear the checkboxes.
Third, click on the OK button to run the script.
Slide 20
Tests of Normality

                                    Kolmogorov-Smirnov(a)     Shapiro-Wilk
                                    Statistic  df  Sig.       Statistic  df  Sig.
TOTAL TIME SPENT ON THE INTERNET    .246       93  .000       .606       93  .000

a. Lilliefors Significance Correction

The script produces the same output that we computed manually; in this example, the tests of normality.
Slide 21
Problem 2
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic?
Based on the rule of thumb for the allowable magnitude of skewness and kurtosis, total hours spent on the Internet is normally distributed.
1. True
2. True with caution
3. False
4. Incorrect application of a statistic
Slide 22
Table of descriptive statistics
Descriptives: TOTAL TIME SPENT ON THE INTERNET

                                                 Statistic   Std. Error
Mean                                             10.731      1.5918
95% Confidence Interval for Mean   Lower Bound   7.570
                                   Upper Bound   13.893
5% Trimmed Mean                                  8.295
Median                                           5.500
Variance                                         235.655
Std. Deviation                                   15.3511
Minimum                                          .2
Maximum                                          102.0
Range                                            101.8
Interquartile Range                              10.200
Skewness                                         3.532       .250
Kurtosis                                         15.614      .495

To answer problem 2, we look at the values for skewness and kurtosis in the Descriptives table.
The skewness and kurtosis for the variable both exceed the rule of thumb criteria of 1.0. The variable is not normally distributed.
The answer to problem 2 is false.
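
The Std. Error column of the Descriptives table depends only on the sample size and, for the mean, the standard deviation. The sketch below reproduces the three values from N = 93 and Std. Deviation = 15.3511. The function name is hypothetical, and the skewness and kurtosis formulas are the standard textbook ones; that they match the table is a check we made here, not something stated in the SPSS output.

```python
import math

def standard_errors(n, sd):
    """Return the standard errors of the mean, skewness, and kurtosis."""
    se_mean = sd / math.sqrt(n)
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return se_mean, se_skew, se_kurt

se_mean, se_skew, se_kurt = standard_errors(93, 15.3511)
print(round(se_mean, 4), round(se_skew, 3), round(se_kurt, 3))
```

Note that the standard errors of skewness and kurtosis are the same for any variable with 93 valid cases, whatever its distribution.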
Slide 23
Problem 3
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Use 0.01 as the level of significance.
Based on a diagnostic hypothesis test of normality, "total hours spent on the Internet" is not normally distributed. A logarithmic transformation of "total hours spent on the Internet" results in a variable that is normally distributed.
1. True
2. True with caution
3. False
4. Incorrect application of a statistic
Slide 24

Tests of Normality

                                        Kolmogorov-Smirnov(a)     Shapiro-Wilk
                                        Statistic  df  Sig.       Statistic  df  Sig.
Logarithm of NETIME [LG10(NETIME)]      .047       93  .200*      .994       93  .951
Square Root of NETIME [SQRT(NETIME)]    .118       93  .003       .868       93  .000
Inverse of NETIME [1/(NETIME)]          .288       93  .000       .495       93  .000

*. This is a lower bound of the true significance.

Problem 3 specifically asks about the results of the test of normality for the logarithmic transformation. Since our sample size is larger than 50, we use the Kolmogorov-Smirnov test.
The null hypothesis for the Kolmogorov-Smirnov test of normality states that the actual distribution of the transformed variable is equal to the expected distribution, i.e., the transformed variable is normally distributed. Since the probability associated with the test of normality (0.200) is greater than the level of significance, we fail to reject the null hypothesis and conclude that the logarithmic transformation of total hours spent on the Internet is normally distributed.
The answer to problem 3 is true.
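
The same comparison can be applied to all three transformations at once. The sketch below copies the Kolmogorov-Smirnov values from the table on this slide; the dictionary layout and function name are hypothetical. The .200 for the logarithm is flagged in the output as a lower bound, so the true probability may be larger.

```python
# Kolmogorov-Smirnov results from the table: (statistic, df, sig).
KS_RESULTS = {
    "log10": (0.047, 93, 0.200),    # .200 is a lower bound of the true significance
    "sqrt": (0.118, 93, 0.003),
    "inverse": (0.288, 93, 0.000),  # displayed as .000, i.e. less than 0.001
}

def normal_transformations(results, alpha):
    """Return the transformations for which we fail to reject normality."""
    return [name for name, (_stat, _df, sig) in results.items() if sig > alpha]

print(normal_transformations(KS_RESULTS, 0.01))
```

At the 0.01 level, only the logarithmic transformation passes, which agrees with the answer to problem 3.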
Slide 25
Other problems on assumption of normality
A problem may ask about the assumption of normality for a nominal level variable. The answer will be Incorrect application of a statistic since there is no expectation that a nominal variable be normal.
A problem may ask about the assumption of normality for an ordinal level variable. If the variable or transformed variable is normal, the correct answer to the question is True with caution since we may be required to defend treating an ordinal variable as metric.
Questions will specify a level of significance to use and the statistical evidence upon which you should base your answer.
Slide 26
Steps in answering questions about the assumption of normality: question 1
The following is a guide to the decision process for answering problems about the normality of a variable:
1. Is the variable to be evaluated metric?
   No: Incorrect application of a statistic
   Yes: go to step 2
2. Does the statistical evidence support the normality assumption?
   No: False
   Yes: go to step 3
3. Are any of the metric variables ordinal level?
   Yes: True with caution
   No: True
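
The decision process above can be written as a short function. This is a sketch under our own naming: the function and its parameters are hypothetical, and we assume the level of measurement is given as one of "nominal", "ordinal", "interval", or "ratio", with ordinal variables treated as metric with caution, as the previous slide describes.

```python
def answer_normality_question(level, evidence_supports_normality):
    """Return the answer to a problem about the normality of a variable."""
    # Nominal variables are not metric, so the test does not apply.
    if level == "nominal":
        return "Incorrect application of a statistic"
    if not evidence_supports_normality:
        return "False"
    # Ordinal variables may be treated as metric, but with caution.
    if level == "ordinal":
        return "True with caution"
    return "True"

# Problem 1: a ratio-level variable whose test rejects normality.
print(answer_normality_question("ratio", False))
```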
Slide 27
Steps in answering questions about the assumption of normality: question 2
The following is a guide to the decision process for answering problems about the normality of a transformation:
1. Is the variable to be evaluated metric?
   No: Incorrect application of a statistic
   Yes: go to step 2
2. Does the statistical evidence support normality?
   No: go to step 3
3. Does the statistical evidence for the transformation support normality?
   No: False
   Yes: go to step 4
4. Is either variable ordinal level?
   Yes: True with caution
   No: True
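
A sketch of the decision process for transformation problems follows. The function and parameter names are hypothetical. One branch is our assumption rather than something the guide spells out: when the original variable already tests as normal, we answer False, because problems of this kind state that the original variable is not normally distributed.

```python
def answer_transformation_question(level, original_normal, transformed_normal):
    """Return the answer to a problem about the normality of a transformation."""
    if level == "nominal":
        return "Incorrect application of a statistic"
    # Assumption: the problem statement claims the original variable is
    # not normal, so evidence of normality makes the statement false.
    if original_normal:
        return "False"
    if not transformed_normal:
        return "False"
    if level == "ordinal":
        return "True with caution"
    return "True"

# Problem 3: the original variable is not normal, the logarithm is.
print(answer_transformation_question("ratio", False, True))
```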
