
University of Natural Resources and Applied Life Sciences

Vienna

Designing of Experiments for ANOVA models

Marie Šimečková

Supervisor: Prof. Dr. Dr. h.c. Dieter Rasch

2008

Abstract
Statistical design is a very important part of applied empirical studies. In this work the evaluation of the required size of an experiment is proposed for two specific cases of ANOVA models.
In the first part, the one-way layout with one fixed factor and an ordinal categorical response variable is considered. The Kruskal–Wallis test is used to test the equality of main effects and its properties are compared with those of the F-test. The distribution of the response variables is characterized by the relative effect. A formula for evaluating the sample size for the Kruskal–Wallis test was derived by simulation. The formula was then compared with the two Noether formulas for the Wilcoxon test in the case of two factor levels.
In the second part, the two-way ANOVA mixed model with one observation for each row-column combination is considered and tests of interaction in this model are studied. Five tests of additivity are covered, all developed for models with two fixed factors: the Tukey test, the Mandel test, the Johnson–Graybill test, the Tusell test and the locally best invariant (LBI) test. We confirmed by simulation that these tests hold the type-I-risk even in the mixed model case. Then their power was studied. The power of the Johnson–Graybill, LBI and Tusell tests is sufficient, but the power of the Tukey and Mandel tests is low for a general type of interaction. A modification of the Tukey test is proposed to solve this problem. Finally, a formula for the determination of the size of an experiment for the Johnson–Graybill test is derived.

Zusammenfassung
Experimental design is an important part of applied empirical research. In this work the required experimental size is derived for some special cases of the analysis of variance (ANOVA).
In the first part the one-way layout with one fixed factor and a categorical response variable is treated. The Kruskal–Wallis test is used to test the equality of the main effects and its properties are compared with those of the F-test. The distribution of the categorical response variable is characterized by the relative effect. A formula for determining the sample size for the Kruskal–Wallis test was derived with the help of simulations. This formula was then compared with two formulas of Noether for the Wilcoxon test, i.e. for the special case of one factor with two levels.
In the second part the two-way layout with single subclass numbers and a mixed model is considered. Five tests for the absence of interactions were examined, all of which were developed for the model with two fixed factors. These are: the Tukey test, the Mandel test, the Johnson–Graybill test, the Tusell test and a locally best invariant (LBI) test. As our simulation studies show, all of these tests keep the type-I-risk also in the case of a mixed model. Regarding the power, we found that the power of the Johnson–Graybill test, the LBI test and the Tusell test is satisfactory, whereas the Tukey test and the Mandel test have an unsatisfactorily low power. A modification of the Tukey test was undertaken, but it did not yield a sufficient improvement of the power. Finally, a formula for the determination of the experimental size for the Johnson–Graybill test was developed.

Contents

1 Introduction

2 One-Way Layout with One Fixed Factor for Ordered Categorical Data
  2.1 Description of the model
    2.1.1 The parametric F-test
    2.1.2 The nonparametric Kruskal–Wallis test
    2.1.3 Relative effect and its properties
  2.2 Design of experiment for the Kruskal–Wallis test
    2.2.1 Comparison of the F-test and the Kruskal–Wallis test for the one-way layout with ordinal categorical response
    2.2.2 Size of experiment for the Kruskal–Wallis test
    2.2.3 Size of experiment for the Wilcoxon test

3 Tests of Additivity in Mixed Two-way ANOVA Model with Single Subclass Numbers
  3.1 Preliminaries
    3.1.1 Description of the problem
    3.1.2 Tests of additivity
  3.2 Properties of the additivity tests
    3.2.1 Type-I-risk of the additivity tests
    3.2.2 Simulation study of the power of the additivity tests
    3.2.3 Size of experiment for the Johnson–Graybill test
  3.3 Modified Tukey test

Chapter 1

Introduction

In biology, agriculture, psychology and many other research fields experiments are a very important method to acquire knowledge. To obtain credible conclusions on the one hand and not to waste resources on the other hand, experiments must be carefully designed. An essential part of planning an experiment is the determination of the number of units included. This work is focused on two specific cases: estimation of power and sample size for the Kruskal–Wallis test and for additivity tests in the two-way ANOVA model without replication.
The thesis is divided into two parts. In Chapter 2 the interest lies in the one-way layout with ordinal categorical response. The Kruskal–Wallis analysis of variance by ranks is applied.

As demonstrated in the papers Rasch and Šimečková (2007) and Šimečková and Rasch (2008), the required sample size can be approximated by a formula comprising the number of response categories and the relative effect. The results are compared to analogous formulas for the F-test and the Wilcoxon test.
In Chapter 3 tests of the presence or absence of an interaction in the two-way ANOVA mixed model with only one observation in each subclass (additivity tests) are studied. Five tests originally developed for the fixed effects ANOVA are considered (the Tukey test, Mandel test, Johnson–Graybill test, Tusell test and locally best invariant test) and their type-I-risk and type-II-risk in the mixed effects ANOVA are evaluated. The minimal sample size is estimated by means of simulation for the Johnson–Graybill test and an empirical formula for determining the sample size is presented. Finally, a modification of the Tukey test is proposed to increase the power for a generalized interaction scheme. All these results have already been submitted for publication, see Šimeček and Šimečková (submitted), Rusch et al. (submitted) and Šimečková and Rasch (submitted).
All presented simulations were performed using the statistical environment R, R Development Core Team (2008), on a grid of 48 Intel machines at the Supercomputing Centre Brno. I would like to thank the METACentrum project (http://meta.cesnet.cz) for the allocation of computing time.
I cannot forget to commemorate my former supervisor Prof. Harald Strelec. I am extremely grateful to Prof. Dieter Rasch, my colleagues from the Universität für Bodenkultur Wien and my friends from the Universität Wien; their suggestions improved this work substantially. A friendly environment and support were provided by the Institute of Animal Science in Prague.

Chapter 2

One-Way Layout with One Fixed Factor for Ordered Categorical Data
The aim of my work was to determine the size of experiments for a special case of the Kruskal–Wallis test (Kruskal and Wallis (1952)): for the test of equality of main effects in the one-way layout with an ordinal categorical response variable.
Because the exact distribution of the test statistic of the Kruskal–Wallis test under the alternative hypothesis is not known, results about its power and sample size determination are scarce in the literature. In Mahoney and Magel (1996) a bootstrapping method for the estimation of the power is presented; the simulation was performed for comparing distributions in 3 or 5 groups. A special case of the Kruskal–Wallis test, comparing distributions in two groups, is the Wilcoxon test (Wilcoxon (1945)). An approximation of the power of the Wilcoxon test is known (Lehmann (1975)) and in Noether (1987) two formulas for the determination of the size of experiment for the Wilcoxon test are introduced.
In this chapter we will first summarize the F-test and the Kruskal–Wallis test, and then the properties of the relative effect will be discussed. This knowledge will be used in Section 2.2.2, where the power of the Kruskal–Wallis test is investigated by simulation and a formula for the determination of the size of experiment is derived. Last, outputs of this formula for the Wilcoxon test are compared to the outputs of the two Noether formulas.
Let us conclude with the following definition: We say that a distribution function F is lower than another distribution function G, i.e. F < G, if F(x) ≤ G(x) for all x ∈ ℝ and there exists a set A (with positive measure with respect to both distribution functions F and G) such that F(x) < G(x) for all x ∈ A.
Throughout the thesis, random variables are printed in bold.

2.1 Description of the model

Let us consider a one-way layout y_1, ..., y_a with distribution functions F_1, ..., F_a, respectively. The random variable y_i corresponds to the i-th level of the factor A. For each i (i = 1, ..., a) a random sample y_i1, ..., y_in_i is observed.


The aim is to design the experiment for testing an effect of the factor A. The null hypothesis

H0: F1 = F2 = ... = Fa    (2.1)

is tested against the alternative

HA: ∃ i, j: Fi < Fj or Fi > Fj.    (2.2)

For F_1, ..., F_a Gaussian distribution functions the F-test can be used. For non-Gaussian distributions the Kruskal–Wallis test has to be used.
Our interest lies in the case when the response variables y_1, ..., y_a are ordinal categorical, particularly in the case when y_1, ..., y_a were derived from some continuous variables x_1, ..., x_a by discretization. The formal definition follows.
Definition 1. Consider a continuous random variable x. A new ordered categorical random variable y with r categories is derived from x using a decomposition of the real line based on a set {γ_1, γ_2, ..., γ_{r-1}}, with -∞ = γ_0 < γ_1 < γ_2 < ... < γ_{r-1} < γ_r = +∞.
Then y = i if and only if x lies in the interval (γ_{i-1}, γ_i], i = 1, ..., r.
The set {γ_1, γ_2, ..., γ_{r-1}} is called the support of the decomposition.
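For illustration, a minimal R sketch of this discretization (the helper name is ours; the example support {30, 50, 70} is the one used later in Section 2.2.1):

```r
# Sketch of Definition 1: derive the ordinal y from a continuous x,
# category i corresponding to the interval (gamma_{i-1}, gamma_i].
discretize <- function(x, support) {
  cut(x, breaks = c(-Inf, sort(support), Inf), labels = FALSE)
}
discretize(rnorm(5, mean = 50, sd = 50/3), support = c(30, 50, 70))  # values in 1:4
```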
Because we compare the sample size for the Kruskal–Wallis test with the sample size for the F-test, we first recall the F-test and the Kruskal–Wallis test and their properties. Then the definition of the relative effect will be introduced.

2.1.1 The parametric F-test

In this section properties of the F-test are shortly summarized. More information about them can be found e.g. in (Rasch et al., 2007, section 4.1.1), Scheffé (1959) or Lehmann (2005).
Let us consider a continuous random variable x. The model equation is written in the form (ANOVA model I):

x_ij = E(x_ij) + e_ij = μ + α_i + e_ij    (i = 1, ..., a; j = 1, ..., n_i),    (2.3)
Pa
where
, i are real numbers (i.e. non-random); it should hold either
i=1 i = 0 or
Pa
i=1 ni i = 0 (which are equivalent if n1 = = na ). The eij are mutually independent
normally distributed random variables with E(eij ) = 0 and var(eij ) = 2 .
P
P
The ni s and a are known real constants. Let us denote N = ai=1 ni and
= i /a.
In Rasch and Guiard (1990) it was shown that the F -test is quite robust and the assumption
of normality of the error terms eij can be relaxed.
We want to design the experiment for the test of the hypothesis

H0: α_1 = ... = α_a,    (2.4)

in other words that the factor A has no effect on the response variable, against the alternative

HA: ∃ i, j: α_i ≠ α_j.

If the errors e_ij are normally distributed, the exact test statistic for testing the null hypothesis (2.4) is equal to

F = MS_A / MS_R,    (2.5)

where

MS_A = Σ_i n_i (x̄_i − x̄)² / (a − 1)   and   MS_R = Σ_i Σ_j (x_ij − x̄_i)² / (N − a)

are the mean squares of the factor A and the residual mean squares (x̄_i the row means, x̄ the overall mean).
Under the null hypothesis F follows the central F-distribution with f1 = a − 1 and f2 = N − a degrees of freedom. Otherwise, it follows the non-central F-distribution with f1 = a − 1 and f2 = N − a degrees of freedom and the non-centrality parameter equal to

λ = Σ_{i=1}^{a} n_i (α_i − ᾱ)² / σ².    (2.6)

Let F(f1, f2; λ; p) denote the p-quantile of the non-central F-distribution with f1 and f2 degrees of freedom and non-centrality parameter λ. If λ = 0 we shorten F(f1, f2; 0; p) to F(f1, f2; p).
If the realization F of F in (2.5) exceeds F(a − 1, N − a; 1 − α), the null hypothesis is rejected at the level α, otherwise it is not rejected. For a = 2 the F-test coincides with the t-test for two independent samples.
A design of an experiment must involve specification of the required type-I-risk α and the required type-II-risk β. For the determination of the sample size of the F-test two more parameters should be specified: the variance σ² of the random errors e_ij and δ = α_max − α_min (α_max = max(α_i) the greatest and α_min = min(α_i) the lowest of the effects α_1, ..., α_a). We will determine the maxi-min sample size, which assures the type-II-risk β for any α_1, ..., α_a fulfilling α_max − α_min = δ. The maxi-min sample size will be denoted n_max.
The type-II-risk of the F-test depends on the non-centrality parameter λ. The type-II-risk decreases as λ increases. For given δ the term (2.6) is minimized if the remaining a − 2 effects are equal to (α_min + α_max)/2. For any other values λ would be higher and the type-II-risk lower.
If N = Σ_{i=1}^{a} n_i is fixed, then the type-II-risk of the F-test is minimized when the subclass numbers n_i are as equal as possible. If a is an integer divisor of N, then the type-II-risk is minimized for n_1 = ... = n_a = n = N/a. Therefore we choose n_1 = ... = n_a = n.
To calculate the required sample size n_max we have to solve the quantile equation

F(a − 1, na − a; 1 − α) = F(a − 1, na − a; λ; β).    (2.7)

If α_max = α_min + δ and the other a − 2 effects α_i are equal to (α_min + α_max)/2 = α_min + δ/2 = ᾱ, then Σ_{i=1}^{a} (α_i − ᾱ)² = δ²/2 and so λ = n δ² / (2σ²). The λ depends on δ and σ only through their ratio, therefore only this ratio δ/σ, called the relative effect size, is necessary to know for the calculation of the maxi-min sample size.
To conclude, for the evaluation of the maxi-min sample size of the F-test (for a given number of factor levels a) the type-I-risk α, the type-II-risk β and the relative effect size δ/σ must be fixed.
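A minimal R sketch of this computation (the function name is ours; it searches for the smallest n satisfying the quantile equation (2.7) with λ = n (δ/σ)² / 2):

```r
# Sketch: maxi-min sample size of the F-test by increasing n until the
# type-II-risk at the 1 - alpha critical value drops below beta.
maximin_n_F <- function(a, alpha, beta, delta_sigma) {
  n <- 2
  repeat {
    lambda <- n * delta_sigma^2 / 2
    crit   <- qf(1 - alpha, a - 1, n * a - a)
    if (pf(crit, a - 1, n * a - a, ncp = lambda) <= beta) return(n)
    n <- n + 1
  }
}
maximin_n_F(a = 6, alpha = 0.05, beta = 0.05, delta_sigma = 1)  # compare with Table 2.1
```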
In Table 2.1 we report some values of n_max in dependence on δ/σ and β for α = 0.05 and a = 6.

Table 2.1: Maxi-min sample sizes for α = 0.05, a = 6 and different values of δ/σ and β

  δ/σ    β = 0.05   β = 0.1   β = 0.2
  1         41         34        27
  1.2       29         24        19
  1.5       19         16        13

2.1.2 The nonparametric Kruskal–Wallis test

The F-test for testing the hypothesis about the equality of means in the one-way ANOVA model I discussed in Section 2.1.1 is based on the assumption that the observed variables are normally distributed and their distributions in different groups differ only in their expected values. The Kruskal–Wallis test considered in this chapter can be used in cases where normality is questionable. The principle of this test will be explained in brief, for details see e.g. (Lehmann, 1975, chapter 5, section 2).
Let y_1, ..., y_a be random variables with distribution functions F_1, ..., F_a, where y_i corresponds to the observed variable in the i-th level of the factor A. We will test the hypothesis

H0: F1 = F2 = ... = Fa,    (2.8)

against the alternative

HA: ∃ i, j: Fi < Fj or Fi > Fj.    (2.9)

For each i (i = 1, ..., a) there are n_i realizations of the random variable y_i, denoted by y_i1, y_i2, ..., y_in_i, N = Σ n_i. All the observed values y_ij are pooled into one vector and ordered. Let r_ij be the rank of the value y_ij in this sequence. Then the sum of ranks in each factor level is evaluated, i.e. T_i = Σ_{j=1}^{n_i} r_ij, i = 1, ..., a.
The test statistic of the Kruskal–Wallis test is equal to

Q = 12 / (N(N + 1)) · Σ_{i=1}^{a} T_i²/n_i − 3(N + 1).    (2.10)

This basic version of the Kruskal–Wallis test assumes that the distribution functions F_i are continuous and therefore there are no ties (almost surely). In our case of categorical variables y_i this is not true and a Kruskal–Wallis test corrected for ties should be used (Kruskal and Wallis (1952)). Let s be the number of distinct values of observations and t_l be the number of tied values of the l-th smallest observed value, l = 1, ..., s. Then the corrected test statistic is equal to

Q_K = Q / ( 1 − Σ_{l=1}^{s} (t_l³ − t_l) / (N³ − N) ).    (2.11)

Let us note that Q is a special case of Q_K, where t_l = 1 for all l = 1, ..., s = N.
The test statistics Q and Q_K are under H0 asymptotically χ²_{a−1} distributed. For small sample sizes the critical values are tabulated in software or tables.
For a = 2 (comparing distributions in only two groups) the test statistics simplify to the Wilcoxon (Mann–Whitney) test statistic.
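A small R sketch of the statistics Q and Q_K (the example data are ours; with mid-ranks and the tie correction the result should agree with kruskal.test()):

```r
set.seed(1)
y <- sample(1:4, 60, replace = TRUE)      # ordinal responses with ties
g <- factor(rep(1:3, each = 20))          # a = 3 groups, n_i = 20

N  <- length(y)
rk <- rank(y)                             # mid-ranks
Ti <- tapply(rk, g, sum)                  # rank sums T_i
ni <- tapply(rk, g, length)

Q  <- 12 / (N * (N + 1)) * sum(Ti^2 / ni) - 3 * (N + 1)     # (2.10)
tl <- table(y)                            # tie sizes t_l
QK <- Q / (1 - sum(tl^3 - tl) / (N^3 - N))                  # (2.11)

c(QK, unname(kruskal.test(y ~ g)$statistic))
```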

2.1.3 Relative effect and its properties

The aim of this work is to design an experiment for the Kruskal–Wallis test, i.e. to find the maxi-min sample size for a given type-I-risk and type-II-risk. Because the exact distribution of the Kruskal–Wallis test statistic under the alternative hypothesis is not known, we derived a formula for determining the sample size by simulation.
We assume that the ordinal response variables y_1, ..., y_a are discretized from underlying continuous random variables x_1, ..., x_a (see Definition 1 on page 5). The loss of information caused by the discretization is measured by the so-called relative effect.
In more detail, the ordered categorical random variable y takes realizations belonging to r ordered categories C_1 ≺ C_2 ≺ ... ≺ C_r with r ≥ 1 (we use the symbol ≺ to denote the order relation). We need a measure for the distance between two distributions. For this we use the approach of (Brunner and Munzel, 2002, section 1.4, formula (1.4.1)).
First, let us recall the definition of the distribution function. The concept of Brunner and Munzel (2002) is used to treat the problem of discontinuity of the distribution function.

Definition 2. Let y be a random variable. Then we define the (standardized) distribution function of y as

F(y) = ½ ( P(y < y) + P(y ≤ y) ).

The relative effect is defined as follows.

Definition 3. For two independent random variables y_1 and y_2 with distribution functions F_1(y) and F_2(y) respectively, the probability

p(y_1; y_2) = P(y_1 < y_2) + ½ P(y_1 = y_2) = ∫ F_1 dF_2

is called the relative effect of y_2 with respect to y_1.

Note that the relative effect p(y_1; y_2) is equal to 1 − p(y_2; y_1), and for continuous random variables p(y_1; y_2) = P(y_1 < y_2).
Our aim is to find a connection between the non-centrality parameter of the continuous model and the relative effect of the model with ordinal variables. Properties of the relative effect are discussed in this section in detail.
For the simulation experiments we generate ordered categorical variables by decomposition of the real line, as described in Definition 1 (page 5). The relative effect for this case is computed in the following example.
Example 1. Let x be a random variable with distribution function F. The random variable y is derived from it using the decomposition with support {γ_1, γ_2, ..., γ_{r-1}}. Then it holds

P(y = i) = F⁺(γ_i) − F⁺(γ_{i-1}).

If x_1 and x_2 are two variables with distribution functions F_1 and F_2, respectively, and the variables y_1 and y_2 are derived from them using the same support {γ_1, γ_2, ..., γ_{r-1}}, the relative effect of y_2 with respect to y_1 can be evaluated as follows:

p(y_1; y_2) = P(y_1 < y_2) + ½ P(y_1 = y_2)
            = Σ_{j=2}^{r} ( Σ_{i=1}^{j-1} P(y_1 = i) ) P(y_2 = j) + ½ Σ_{j=1}^{r} P(y_1 = j) P(y_2 = j)
            = Σ_{j=2}^{r} Σ_{i=1}^{j-1} ( F_2⁺(γ_j) − F_2⁺(γ_{j-1}) ) ( F_1⁺(γ_i) − F_1⁺(γ_{i-1}) )
              + ½ Σ_{j=1}^{r} ( F_2⁺(γ_j) − F_2⁺(γ_{j-1}) ) ( F_1⁺(γ_j) − F_1⁺(γ_{j-1}) ).    (2.12)
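A short R sketch of formula (2.12) for two normal variables discretized on a common support (the function name and the illustrative parameters are ours); it also illustrates part (ii) of the theorem below:

```r
rel_effect_discretized <- function(mu1, mu2, support) {
  breaks <- c(-Inf, sort(support), Inf)
  p1 <- diff(pnorm(breaks, mean = mu1))       # P(y1 = i)
  p2 <- diff(pnorm(breaks, mean = mu2))       # P(y2 = j)
  M  <- outer(p1, p2)                         # M[i, j] = P(y1 = i) P(y2 = j)
  sum(M[upper.tri(M)]) + 0.5 * sum(diag(M))   # P(y1 < y2) + 0.5 P(y1 = y2)
}
pnorm(0.4 / sqrt(2))                                               # continuous p(x1; x2)
rel_effect_discretized(-0.2, 0.2, seq(-3, 3, length.out = 4))      # coarse support
rel_effect_discretized(-0.2, 0.2, seq(-3, 3, length.out = 49))     # fine support, closer
```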

The following theorem shows the relation between the relative effects of two continuous variables and two ordinal categorical variables derived from the continuous ones by decomposition.

Theorem 1. Let x_1 and x_2 be two independent continuous random variables and y_1^r, y_2^r two ordinal categorical variables with r categories, derived from x_1, x_2 by decomposition with support {γ_1, γ_2, ..., γ_{r-1}}. Then:
(i) If r = 1 (i.e. y_1^1 = y_2^1 = 1 with probability 1) then p(y_1; y_2) = ½.
(ii) If r → ∞ and max_i |γ_{i+1} − γ_i| → 0 then

p(y_1^r; y_2^r) → p(x_1; x_2).    (2.13)

(iii) If p(x_1; x_2) > ½ and

∀ ψ_1, ψ_2, ψ_3 ∈ ℝ, ψ_1 < ψ_2 < ψ_3:
P(ψ_1 < x_1 < ψ_2) P(ψ_2 < x_2 < ψ_3) ≥ P(ψ_1 < x_2 < ψ_2) P(ψ_2 < x_1 < ψ_3),    (2.14)

then the convergence in (2.13) is monotone and therefore for each r

p(x_1; x_2) ≥ p(y_1^{r+1}; y_2^{r+1}) ≥ p(y_1^r; y_2^r) ≥ ½.
Proof.
(i) Obviously:

p(y_1; y_2) = P(y_1 < y_2) + ½ P(y_1 = y_2) = 0 + ½ · 1 = ½.

(ii) It is valid that

p(x_1; x_2) − p(y_1^r; y_2^r) = ½ ( P(x_1 < x_2, y_1^r = y_2^r) − P(x_1 > x_2, y_1^r = y_2^r) ).    (2.15)

It follows that

|p(x_1; x_2) − p(y_1^r; y_2^r)| < P(y_1^r = y_2^r) → 0.

(iii) Let us proceed by induction on r. Take any y_1^r, y_2^r and refine their support of decomposition by adding one point γ somewhere between some γ_i and γ_{i+1}. Denote by z_1, z_2 the categorical variables based on this new decomposition.
Analogously as in (2.15),

p(z_1; z_2) − p(y_1^r; y_2^r) = ½ ( P(z_1 < z_2, y_1^r = y_2^r) − P(z_1 > z_2, y_1^r = y_2^r) ).

Then, using that and the independence of the variables x_1, x_2,

p(z_1; z_2) − p(y_1^r; y_2^r) = ½ ( P(γ_i < x_1 ≤ γ) P(γ < x_2 ≤ γ_{i+1}) − P(γ < x_1 ≤ γ_{i+1}) P(γ_i < x_2 ≤ γ) ).

This expression should be nonnegative. Using the condition (2.14) for ψ_1 = γ_i, ψ_2 = γ and ψ_3 = γ_{i+1} finishes the proof.

Part (ii) of Theorem 1 tells us that random variables with countable support (i.e. variables y_1^r, y_2^r, which take the values 1, 2, ..., r, r → ∞) can, for a sufficiently fine support, have almost the same relative effect as variables with uncountable support (i.e. variables x_1, x_2, which take values from some interval of real numbers).
Part (iii) gives the condition necessary for the relative effect to increase by adding a new support point. If we take ψ_1 = γ_i, ψ_3 = γ_{i+1} elements of the original support and ψ_2 a new element added between γ_i and γ_{i+1}, the condition (2.14) is necessary for the relative effect to increase in this particular case.
The condition is fulfilled e.g. for x_1, x_2 uniformly distributed. It is usually fulfilled for normally distributed variables but it does not hold generally, as is shown in Example 2. There, the condition is not valid because the difference between the expected values of x_1 and x_2 is very small (relative to their standard deviation).
Example 2. Consider normally distributed random variables x_1 and x_2 with common variance equal to 1 and expected values equal to −0.1 and 0.1, respectively. Two random variables y_1 and y_2 are derived from them using the support {−0.4, −0.2}. Then p(y_1; y_2) = 0.5020 (this is easy to see using formula (2.12)).
If a new point −0.3 is added to the support and random variables z_1, z_2 are derived from x_1 and x_2 using the support {−0.4, −0.3, −0.2}, the relative effect is p(z_1; z_2) = 0.5004. This value is lower than the value of p(y_1; y_2).
The distribution functions of y_1, y_2 and z_1, z_2 are plotted in Figure 2.1.

Figure 2.1: The distribution functions of y_1, y_2 and z_1, z_2 for Example 2 (panels Fy(y) and Fz(z)); y_1 and z_1 solid lines, y_2 and z_2 dashed lines.

2.2 Design of experiment for the Kruskal–Wallis test

2.2.1 Comparison of the F-test and the Kruskal–Wallis test for the one-way layout with ordinal categorical response

Some researchers use the F-test for the comparison of means of ordered categorical variables although it is not correct. In this section we compare the nominal and the actual risks to investigate the impact of this mistake.
Let us consider three normally distributed random variables x_1, x_2, x_3, with the same standard deviation σ = 50/3 = 16.67 and different expected values μ_1 = 50 − σ/2 = 41.67, μ_2 = 50, μ_3 = 50 + σ/2 = 58.33. Three categorical random variables y_1, y_2, y_3 are derived from the variables x_1, x_2, x_3 using the support {30, 50, 70} (see Definition 1 on page 5). We are comparing a = 3 variables, each attaining r = 4 values.
We are interested in the properties of testing the hypothesis

H0: μ_1 = μ_2 = μ_3,    (2.16)

against the alternative

HA: ∃ i, j: μ_i ≠ μ_j.
We test this hypothesis using either the original normally distributed variables x_1, x_2, x_3 or the ordinal categorical variables y_1, y_2, y_3. The F-test and the Kruskal–Wallis test will be performed in both cases.
Let α_nom denote the nominal type-I-risk (which should not be violated) and α_act the actual type-I-risk (which is attained) of the performed test (F-test or Kruskal–Wallis test) and given variables (continuous or categorical). The α_act is estimated by simulation. Analogously, β_nom and β_act are used for the type-II-risk.

Table 2.2: The actual type-II-risk. The bold printed values are the 20 % robust results.

                          Normal              Categorical         Normal              Categorical
  Nominal values          F-test              Kruskal–Wallis      Kruskal–Wallis      F-test
α_nom  β_nom    n       β_act    sd(β_act)  β_act    sd(β_act)  β_act    sd(β_act)  β_act    sd(β_act)
0.10   0.20    17       0.1808   0.0027     0.2400   0.0033     0.2018   0.0038     0.2309   0.0024
0.10   0.15    19       0.1410   0.0024     0.1975   0.0039     0.1589   0.0030     0.1880   0.0030
0.10   0.10    22       0.0962   0.0020     0.1440   0.0036     0.1117   0.0031     0.1369   0.0036
0.10   0.05    27       0.0484   0.0017     0.0827   0.0023     0.0584   0.0020     0.0776   0.0026
0.05   0.20    21       0.1850   0.0037     0.2582   0.0032     0.2108   0.0038     0.2425   0.0047
0.05   0.15    23       0.1454   0.0023     0.2128   0.0044     0.1685   0.0026     0.2004   0.0045
0.05   0.10    27       0.0931   0.0021     0.1452   0.0020     0.1096   0.0026     0.1341   0.0030
0.05   0.05    32       0.0483   0.0020     0.0881   0.0025     0.0602   0.0019     0.0799   0.0030
0.01   0.20    30       0.1889   0.0032     0.2820   0.0027     0.2234   0.0023     0.2573   0.0027
0.01   0.15    33       0.1405   0.0026     0.2254   0.0030     0.1696   0.0024     0.2037   0.0035
0.01   0.10    37       0.0943   0.0031     0.1628   0.0033     0.1165   0.0034     0.1451   0.0036
0.01   0.05    43       0.0485   0.0017     0.0974   0.0027     0.0630   0.0029     0.0856   0.0019

The main question is whether the difference between the actual and the nominal type-II-risks is substantial. To assess the difference the concept of 20 % robustness as defined in Rasch and Guiard (1990) is used: a test is called 20 % robust if |β_nom − β_act| ≤ 0.2 β_nom.

Simulation

Three levels of the nominal type-I-risk (0.10, 0.05, 0.01) and four levels of the nominal type-II-risk (0.20, 0.15, 0.10, 0.05) were considered. For each combination of these nominal risks the maxi-min sample size n for the F-test was evaluated (see Section 2.1.1). A sample of size n was generated for each of the three normally distributed random variables x_1, x_2, x_3 and from them the categorical random variables y_1, y_2, y_3 were derived by decomposition with the support {30, 50, 70}.
For both the normal and the categorical variables the F-test and the Kruskal–Wallis test were performed. This was repeated 100 000 times and the ratio of non-significant tests was recorded; it is the (estimated) actual type-II-risk. The repetitions were divided into 10 blocks of 10 000 and β_act was recorded for each of them; these values were used to estimate the standard deviation of β_act (denoted sd(β_act)).
For the actual type-I-risk, the simulation was made in an analogous way, just the means of the variables x_1, x_2, x_3 were all equal (μ_1 = μ_2 = μ_3 = 50) and the ratio of significant tests was recorded as the estimate of α_act.
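One replication of this simulation might look as follows in R (a sketch; the function name is ours and the sample size n = 21 is only an example taken from Table 2.2):

```r
one_rep <- function(n, mu = c(41.67, 50, 58.33), sigma = 50/3,
                    support = c(30, 50, 70), alpha = 0.05) {
  g <- factor(rep(1:3, each = n))
  x <- rnorm(3 * n, mean = rep(mu, each = n), sd = sigma)
  y <- cut(x, breaks = c(-Inf, support, Inf), labels = FALSE)
  c(F_normal       = anova(lm(x ~ g))$"Pr(>F)"[1] < alpha,
    KW_categorical = kruskal.test(y ~ g)$p.value  < alpha,
    KW_normal      = kruskal.test(x ~ g)$p.value  < alpha,
    F_categorical  = anova(lm(y ~ g))$"Pr(>F)"[1] < alpha)
}
1 - rowMeans(replicate(1000, one_rep(n = 21)))   # estimated actual type-II-risks
```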
Results

Tables 2.2 and 2.3 report the actual type-II-risks and type-I-risks of the F-test and the Kruskal–Wallis test. The sample size n is the maxi-min sample size of the F-test for the given α_nom and β_nom. For each test the standard deviation sd of the estimate of the risk is reported in the second column.
It is not surprising that the actual type-I-risk is in all cases lower than the nominal risk (the few opposite cases are caused by the error of the estimate, as is seen from the standard deviations). The tests are constructed to keep the given level of the type-I-risk.

Table 2.3: The actual type-I-risk.

                          Normal              Categorical         Normal              Categorical
  Nominal values          F-test              Kruskal–Wallis      Kruskal–Wallis      F-test
α_nom  β_nom    n       α_act    sd(α_act)  α_act    sd(α_act)  α_act    sd(α_act)  α_act    sd(α_act)
0.10   0.20    17       0.1013   0.0030     0.0995   0.0029     0.0999   0.0020     0.1007   0.0035
0.10   0.15    19       0.1004   0.0028     0.0994   0.0046     0.0997   0.0033     0.1010   0.0039
0.10   0.10    22       0.0996   0.0031     0.0994   0.0040     0.0987   0.0036     0.0998   0.0040
0.10   0.05    27       0.0984   0.0022     0.0981   0.0018     0.0982   0.0020     0.0985   0.0025
0.05   0.20    21       0.0496   0.0029     0.0484   0.0028     0.0474   0.0029     0.0509   0.0032
0.05   0.15    23       0.0494   0.0016     0.0492   0.0015     0.0485   0.0013     0.0508   0.0020
0.05   0.10    27       0.0505   0.0026     0.0487   0.0027     0.0495   0.0032     0.0516   0.0024
0.05   0.05    32       0.0497   0.0012     0.0490   0.0015     0.0493   0.0018     0.0509   0.0010
0.01   0.20    30       0.0101   0.0010     0.0090   0.0011     0.0090   0.0011     0.0104   0.0013
0.01   0.15    33       0.0098   0.0011     0.0091   0.0004     0.0089   0.0010     0.0100   0.0008
0.01   0.10    37       0.0099   0.0012     0.0097   0.0010     0.0092   0.0010     0.0107   0.0014
0.01   0.05    43       0.0096   0.0008     0.0094   0.0010     0.0091   0.0007     0.0101   0.0010

The results for the type-II-risk are more interesting. In Table 2.2 the 20 % robust results are printed in bold. Naturally, for normally distributed random variables the actual type-II-risk of the F-test is close to the nominal one. The type-II-risk of the Kruskal–Wallis test is a bit higher; it is not 20 % robust in two cases.
For the ordinal categorical variables it seems that the actual type-II-risk of the F-test is closer to the nominal one than the risk of the Kruskal–Wallis test. However, the difference between the nominal risks and the actual ones is greater than the 20 % required by the concept of 20 % robustness. The β_act lies neither in the interval [0.08, 0.12] for β_nom = 0.10, nor in the interval [0.12, 0.18] for β_nom = 0.15, nor in the interval [0.16, 0.24] for β_nom = 0.20, nor in the interval [0.04, 0.06] for β_nom = 0.05. Note that the increase of the type-II-risk is partially caused by the discretization, which decreases the relative effect size δ/σ.
It follows that the maxi-min sample size computed for the F-test and a continuous response variable is lower than the required maxi-min sample size for the Kruskal–Wallis test and ordinal categorical response variables. Some method for the Kruskal–Wallis test is necessary and will be discussed in the following part.

2.2.2 Size of experiment for the Kruskal–Wallis test

The Kruskal–Wallis test is used to test the hypothesis of equal distributions (2.8). For keeping the appropriate type-II-risk it is necessary to determine the maxi-min sample size.
Because the exact distribution of the Kruskal–Wallis test statistics (2.10) or (2.11) under the alternative hypothesis is not known, evaluation of the power of the test and the subsequent determination of the sample size is very problematic in the case of ordinal categorical response variables. In this section a formula for computing the sample size is found by simulation.
If only two groups are compared, the Kruskal–Wallis test coincides with the Wilcoxon test. The asymptotic power of the Wilcoxon test statistic under the alternative hypothesis is derived in Lehmann (1975). In Noether (1987) two formulas for determining the size of experiments were provided. Properties of Noether's formulas were described in detail in Chakraborti et al. (2006). In Section 2.2.3 our results for the Kruskal–Wallis test are compared to Noether's formulas.

Table 2.4: The parameters and properties of the used distributions of the Fleishman system. All the distributions have zero mean and standard deviation 1.

No. of distr.   Skewness   Kurtosis   u = −s            t                 v
1               0          3.75       0                 0.748020807992    0.077872716101
2               0          7          0                 0.630446727840    0.110696742040
3               1          1.5        0.163194276264    0.953076897706    0.006597369744
4               2          7          0.260022598940    0.761585274860    0.053072273491
5 (Normal)      0          0          0                 1                 0
6 (Uniform)     0          −1.2       —                 —                 —
Let us consider a (a ≥ 2) continuously distributed random variables x_1, ..., x_a. We want to test whether their means are all equal or whether there is at least one pair of these variables with different means.
Instead of these continuous variables, only the ordinal categorical variables y_1, ..., y_a are observed. They are derived from the variables x_1, ..., x_a using the decomposition based on a support {γ_1, γ_2, ..., γ_{r-1}}, as described in Definition 1 (page 5).
In this section, the simulation to determine the type-II-risk for a given sample size is described.

Simulation

For the simulation experiment it is important to choose the mechanism of generating categorical random variables. We used six different types of distribution of the underlying continuous variables.
All the considered distributions have zero mean and standard deviation 1. The distributions differ in the skewness and kurtosis. The first distribution is the normal distribution, i.e. both the skewness and the kurtosis are equal to 0. The second distribution is the uniform distribution on the interval (−√3, √3); its skewness is equal to 0 and its kurtosis to −1.2.
The other distributions come from the Fleishman system, described in Rasch and Guiard (1990). The random variable has the form s + t·x + u·x² + v·x³, where x is a standard normally distributed random variable and s, t, u, v are given parameters. Values of these parameters and properties of the distributions can be found in Table 2.4. The densities of these distributions are plotted in Figure 2.2.
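A one-line R sketch of this transformation (the function name is ours; the coefficients are taken from Table 2.4):

```r
# Sketch: generating a Fleishman-system variable with given coefficients.
r_fleishman <- function(n, s, t, u, v) {
  z <- rnorm(n)
  s + t * z + u * z^2 + v * z^3
}
r_fleishman(5, s = 0, t = 0.748020807992, u = 0, v = 0.077872716101)  # distribution 1
```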
For each of these distributions two different decompositions are used (a sketch of both constructions follows below):
Equidistant: the support points are equidistant in the area where 99 % of the observations lie.
Equal mass: the measure of all categories is equal with respect to the given distribution, i.e. an observation falls into each category with the same probability.
Note that the support points were fixed for the distributions of zero mean.
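A sketch of the two constructions for the standard normal case (this is our reading of the description above; r and the 99 % range are assumptions chosen for illustration):

```r
r <- 5
equal_mass  <- qnorm((1:(r - 1)) / r)                          # every category has mass 1/r
rng         <- qnorm(c(0.005, 0.995))                          # range holding 99 % of the mass
equidistant <- seq(rng[1], rng[2], length.out = r + 1)[2:r]    # r - 1 equidistant points
```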
Figure 2.2: Densities of the distributions from the Fleishman system (panels Distribution 1 to Distribution 6).

The parameters of the simulation are as follows:

The number of groups a equals 2, 3, ..., 10 for the normal distribution and 2, 3, 4, 6, 8, 10 for the others.
The difference δ between the minimal and the maximal expected values equals 1.67, 1.25, 1.11 or 1.
The number of categories r of the ordinal variables equals 3, 4, 5, 10 or 50.
One of the six distributions from Table 2.4 and a decomposition with equidistant support points or support points of equal mass (see the previous paragraph) is used.
The standard deviation of the error terms e_ij equals σ = 1 (implying δ/σ = δ).
The type-I-risk equals α = 0.05.
Let us fix a, δ, r, a distribution and a support of decomposition. The expected values of x_1, ..., x_a are considered as μ_1 = −δ/2, μ_2 = +δ/2, μ_3 = ... = μ_a = 0.
To choose the sample size n the following procedure is used. Our focus is on the type-II-risk between 0.40 and 0.05. The first n is the maxi-min sample size of the F-test for the type-II-risk equal to 0.40 (formula (2.7) in Section 2.1.1). For this n the type-II-risk of the Kruskal–Wallis test is estimated (see below) and n is increased by 1 until the estimated type-II-risk is lower than 0.05.
For a fixed sample size n one step of the simulation is as follows:
1. Continuous random samples of size n are generated for each group with the appropriate expected value. Then they are transformed to ordinal categorical variables, using the given support of decomposition.
2. The Kruskal–Wallis test is performed and the result is recorded.
These two steps were repeated 10 000 times. The (estimated) actual type-II-risk for the given sample size n is equal to the proportion of non-significant tests in these repetitions. For the further analysis only the results with an actual type-II-risk smaller than 0.40 were used.
Formula

By inspection of the results of the simulation it was found that there is an almost linear dependence of the required sample size on the maxi-min sample size for the ANOVA F-test and normally distributed variables (for given a, β, the relative effect p, and the number of categories r). Many linear models were tried. The model below was chosen as the most appropriate (good fit and not too many factors).
Given the type-I-risk α = 0.05, the maxi-min sample size for the Kruskal–Wallis test can be computed as follows:

n(β) = 3.054 n_0(β) − 47.737 (δ/σ) + 51.288 p² + 82.050 (1/r)
     + 2.336 n_0(β) (δ/σ) − 7.428 n_0(β) p² − 0.535 n_0(β) (1/r)
     + 29.708 p² (δ/σ) + 56.102 (δ/σ) (1/r) − 223.770 p² (1/r),    (2.17)

where n_0(β) = n_0(α, a, β, δ/σ) is the maxi-min sample size for the F-test.
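Formula (2.17), as printed above, can be evaluated directly; a short R sketch (the function name is ours):

```r
n_KW <- function(n0, delta_sigma, p, r) {
  n <- 3.054 * n0 - 47.737 * delta_sigma + 51.288 * p^2 + 82.050 / r +
       2.336 * n0 * delta_sigma - 7.428 * n0 * p^2 - 0.535 * n0 / r +
       29.708 * p^2 * delta_sigma + 56.102 * delta_sigma / r - 223.770 * p^2 / r
  ceiling(n)
}
n_KW(n0 = 16.71, delta_sigma = 1, p = 0.66, r = 3)   # 35, cf. the first row of Table 2.5
```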


The formula (2.17) estimates the sample size very well; only 4.8 % of the residuals are larger than 20 % of the fitted value. Further, 9.0 % of the residuals are larger than 15 % of the fitted value, 16.6 % are larger than 10 % and 30.8 % are larger than 5 % of the fitted value.
The negative residuals are not so dangerous because they mean that the actual type-II-risk would be even lower than required. Using the formula (2.17), 48 % of the residuals are negative. Of course, there is a loss of resources in this case.
In Table 2.5 the sample sizes estimated by simulation and the sample sizes estimated using relation (2.17) are reported for β = 0.20 and various choices of parameters. Figure 2.3 visualizes the residuals of model (2.17). The absolute values of the residuals increase with increasing sample size. The ratio of the residuals to the estimated sample size is almost constant for sample sizes over 25.
Discussion

The formula (2.17) for the determination of the required sample size, given in the previous paragraph, was derived for some specific cases. The six used continuous distributions with different shapes and two decompositions provide eleven different distributions of categorical variables. The question about the legitimacy of its generalization arises.

Table 2.5: Comparison of required sample sizes attained by simulation and calculated by formula (2.17) for β = 0.2. The columns contain the number of groups a, the relative effect size δ/σ, the identification of the underlying distribution (as in Table 2.4), the relative effect of the distribution of the categorical variables and their number of categories, the maxi-min sample size of the F-test for normal variables, and the maxi-min sample sizes for the Kruskal–Wallis test based on the simulation and calculated by formula (2.17).

Groups   δ/σ    Distribution   Rel. effect   Categories   n_0(β)   n_SIM   n_FIT
2        1      1              0.66          3            16.71    31      35
2        1      1              0.78          5            16.71    14      15
2        1.67   1              0.77          3            6.76     11      11
2        1.67   1              0.89          5            6.76     7       7
6        1      1              0.66          3            26.59    47      55
6        1      1              0.78          5            26.59    22      22
6        1.67   1              0.77          3            10.2     16      19
6        1.67   1              0.89          5            10.2     10      10
2        1      3              0.69          3            16.71    28      30
2        1      3              0.77          5            16.71    17      17
2        1.67   3              0.8           3            6.76     11      10
2        1.67   3              0.88          5            6.76     7       7
6        1      3              0.69          3            26.59    44      46
6        1      3              0.77          5            26.59    27      26
6        1.67   3              0.8           3            10.2     16      16
6        1.67   3              0.88          5            10.2     11      10

Figure 2.3: Relation between the residuals of model (2.17) and the required sample size estimated from this model. The ratio of the residuals to the estimated sample sizes is plotted on the y-axis, the estimated sample sizes on the x-axis.
Table 2.6: Sample sizes computed by formula (2.17) for α = 0.05, a = 6, r = 5, p = 0.71 and different values of β and δ/σ.

  δ/σ    β = 0.05   β = 0.1   β = 0.2
  1         61         51        40
  1.2       51         42        32
  1.5       38         30        21

It should be noted that the formula was checked for four values of δ/σ between 1 and 1.7. It can be interpolated for all values in this interval. Similarly, it is assumed that the number of categories r can be interpolated for all integer values between 3 and 50. With increasing r the influence of r on the required size of an experiment decreases (the distribution tends to a continuous one, which is reflected by the fact that the formula contains 1/r; see Theorem 1 in Section 2.1.3). Therefore, if r > 50, use formula (2.17) as for r = 50.
For a comparison of the sample sizes needed for quantitative and for ordered categorical variables we calculated by formula (2.17) the values analogous to those of Table 2.1; the results are in Table 2.6. The required sample size is always higher in the case of the Kruskal–Wallis test and ordinal categorical variables (the relative effect is lower in that case) than for the F-test and continuous normally distributed variables, and the difference is substantial.
To summarize, the required size of an experiment with ordinal categorical variables for given type-I-risk α = 0.05, type-II-risk β from the interval [0.05, 0.4], δ/σ in the interval [1, 1.7] and number of compared groups a between 2 and 10 can be calculated by formula (2.17). There are no restrictions on the other parameters.

2.2.3 Size of experiment for the Wilcoxon test

The Wilcoxon test is a special case of the Kruskal–Wallis test for the number of groups equal to a = 2. In Chakraborti et al. (2006) two formulas for evaluating the sample size for the Wilcoxon test are mentioned.
Noether's formula is derived for local alternatives and the required sample size is computed as

n_NF = CEIL( (Φ⁻¹(1 − α/2) + Φ⁻¹(1 − β))² / (6 (p − 0.5)²) ),    (2.18)

where Φ⁻¹ is the quantile function of the standard normal distribution, α and β are the risks of the first and the second kind and p is the relative effect. This formula is quite simple, which is useful in practice.
More accurate, but demanding more inputs, is the second formula (it was derived for a one-tailed alternative with α instead of α/2):

n_F2 = CEIL( ( Φ⁻¹(1 − α/2)/√6 + Φ⁻¹(1 − β) √((p_3 − p²) + (p_2 − p²)) )² / (p − 0.5)² ),    (2.19)

where p_2 = P(x_1 < x_2 and x_1' < x_2) and p_3 = P(x_1 < x_2 and x_1 < x_2'), while x_1, x_1' are independent random variables distributed as in the group with the lower expected value, and x_2, x_2' are independent random variables distributed as in the group with the greater expected value.
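Both formulas are easy to evaluate; a minimal R sketch (the function names are ours):

```r
n_noether_simple <- function(alpha, beta, p) {                       # formula (2.18)
  ceiling((qnorm(1 - alpha / 2) + qnorm(1 - beta))^2 / (6 * (p - 0.5)^2))
}
n_noether_full <- function(alpha, beta, p, p2, p3) {                 # formula (2.19)
  num <- qnorm(1 - alpha / 2) / sqrt(6) +
         qnorm(1 - beta) * sqrt((p3 - p^2) + (p2 - p^2))
  ceiling(num^2 / (p - 0.5)^2)
}
n_noether_simple(0.05, 0.2, p = 0.89)                                # cf. Table 2.7
n_noether_full(0.05, 0.2, p = 0.89, p2 = 0.82, p3 = 0.85)            # cf. Table 2.7
```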

Note that both of Noether's formulas were derived assuming continuous distributions of the response variables.
Values of the sample size computed using the formula (2.17) derived in Section 2.2.2 and the formulas (2.18) and (2.19) from Chakraborti et al. (2006) can be found for some special cases in Table 2.7.
In the left part of Table 2.8 the differences between the simulated and the calculated sample sizes are given as percentages of the calculated sample size. In the ideal case most of the observations should be in the row of 0 %, as happens for formula (2.17). The two Noether formulas tend to overestimate the required sample size. In the right part of Table 2.8 there are analogous data for the type-II-risk between 0.1 and 0.3 only.

Table 2.7: Comparison of required sample sizes for the Wilcoxon test evaluated using simulation and formulas (2.17), (2.18) and (2.19) for β = 0.2 and some values of the other parameters.

δ/σ    p      p_2    p_3    Categories   n_0(β)   Simulated   (2.17)   (2.18)   (2.19)
1      0.66   0.47   0.54   3            16.71    31          35       55       53
1      0.78   0.66   0.71   4            16.71    14          15       17       17
1      0.78   0.67   0.71   5            16.71    14          15       17       16
1.67   0.77   0.63   0.7    3            6.76     11          11       18       17
1.67   0.89   0.82   0.85   4            6.76     7           7        9        8
1.67   0.89   0.82   0.85   5            6.76     7           7        9        8
1      0.69   0.52   0.58   3            16.71    28          30       37       36
1      0.76   0.65   0.67   4            16.71    17          18       20       20
1      0.77   0.66   0.67   5            16.71    17          17       19       19
1.67   0.8    0.68   0.73   3            6.76     11          10       15       13
1.67   0.88   0.82   0.83   4            6.76     7           7        10       9
1.67   0.88   0.83   0.83   5            6.76     7           7        9        8
1      0.71   0.56   0.63   3            16.71    22          26       30       30
1      0.74   0.61   0.66   4            16.71    20          21       23       23
1      0.75   0.62   0.66   5            16.71    19          20       22       21
1.67   0.83   0.71   0.78   3            6.76     9           9        13       12
1.67   0.86   0.77   0.81   4            6.76     8           7        11       9
1.67   0.87   0.78   0.81   5            6.76     8           7        10       9
1      0.69   0.53   0.59   3            16.71    31          29       36       36
1      0.72   0.57   0.62   4            16.71    25          24       28       28
1      0.73   0.59   0.62   5            16.71    22          22       25       24
1.67   0.82   0.72   0.77   3            6.76     10          9        13       13
1.67   0.86   0.77   0.81   4            6.76     8           7        11       9
1.67   0.86   0.76   0.8    5            6.76     8           7        11       9

Table 2.8: The differences between the simulated sample size and the sample sizes evaluated using the formulas (2.17), (2.18) and (2.19), given as percentages of the evaluated sample sizes. The cells contain the percentages of the cases in the whole data set.

                                 β ∈ [0.05, 0.4]              β ∈ [0.1, 0.3]
Percentage                    (2.17)   (2.18)   (2.19)     (2.17)   (2.18)   (2.19)
positive diff.   > 40 %         0.1      0        0          0        0        0
                 20 %–40 %      1.6      0        0          1.1      0        0
                 10 %–20 %      9.1      0        0.2        9.3      0        0
                 5 %–10 %       8.0      0        1.0        8.2      0        0.1
                 0 %–5 %        3.8      0        0.5        2.8      0        0
0 %                            26.4      0       12.7       27.1      0       13.9
negative diff.   0 %–5 %        5.8      0        1.5        3.6      0        1.1
                 5 %–10 %      20.4      2.1     12.3       22.0      1.1     12.0
                 10 %–20 %     21.4     43.0     33.9       23.6     43.9     33.8
                 20 %–40 %      3.1     43.3     30.9        2.3     43.0     32.4
                 > 40 %         0.2     11.6      6.9        0       12.1      6.7

Chapter 3

Tests of Additivity in Mixed Two-way ANOVA Model with Single Subclass Numbers

Two-way ANOVA models are a well known class of linear models that allow estimation and testing of two main effects and one interaction effect. Usually, the number of replications per factor level combination (cell) is greater than one, which enables estimation of the main effects and the interaction effect simultaneously. If the number of replications in each cell is equal to one, the classic way of estimating or testing the interaction effect is not applicable anymore. For example, some patients may react differently to the same drug treatment and it is infeasible to test more drugs on one patient. Testing for an interaction in such a model will be referred to here as testing the additivity hypothesis.
The first test of additivity in the case of single subclass numbers was proposed by John Tukey in Tukey (1949). This test was derived for a very special type of interaction (see Ghosh and Sharma (1963) and Hegemann and Johnson (1976)). Other tests for some more general interactions include Mandel (1961), Johnson and Graybill (1972), Tusell (1990) and Boik (1993b); the tests are nicely summarized in Alin and Kurt (2006).
Unfortunately, all of these tests were developed for the case of the fixed effects model. Since many possible applications correspond to a mixed effects model, it is necessary to find out whether the proposed additivity tests can be used with mixed effects as well and, if so, how powerful they are.
In this chapter we will summarize the known additivity tests. Then the type-I-risk and the power of these tests in the mixed effects model are studied and a formula for the size of experiment is provided. Finally, a modification of the Tukey test is derived.

3.1 Preliminaries

3.1.1 Description of the problem

In this section we will discuss the two-way ANOVA models. First, the model with both factors and the interaction fixed is considered. The response in the i-th row and the j-th column is modeled as follows:

y_ij = μ + α_i + β_j + (αβ)_ij + e_ij,    i = 1, ..., a, j = 1, ..., b,    (3.1)

where μ, α_i, β_j and (αβ)_ij are real constants such that

Σ_i α_i = Σ_j β_j = Σ_i (αβ)_ij = Σ_j (αβ)_ij = 0

and the e_ij are normally distributed independent random variables with zero mean and variance σ².
Second, the model with one factor fixed and one factor random is considered; the interaction is a random variable. The response variable is modeled as

y_ij = μ + α_i + b_j + (ab)_ij + e_ij,    i = 1, ..., a, j = 1, ..., b,    (3.2)

where μ, α_i are real constants and b_j, (ab)_ij and e_ij are normally distributed random variables, all with zero mean and variances σ²_B, σ²_AB and σ², respectively.
We want to test the hypothesis that there is no interaction in the model. In the fixed effects model (3.1) the additivity hypothesis can be written as

H0: (αβ)_ij = 0,    i = 1, ..., a, j = 1, ..., b,    (3.3)

and it is tested against the alternative

HA: (αβ)_ij ≠ 0 for at least one pair (i, j).

In the mixed model (3.2) the additivity hypothesis can be written as

H0: σ²_AB = 0    (3.4)

and it is tested against the alternative

HA: σ²_AB > 0.

A lot of tests were designed for the test of hypothesis (3.3) in the fixed effects model (3.1). We want to confirm whether these tests also work in the case of the mixed model. The power of these tests in the mixed model is investigated and an empirical determination of the size of experiment for the Johnson–Graybill test is derived. Finally, a modification of the Tukey test with improved power is developed.

3.1.2 Tests of additivity

Several tests of additivity in the fixed effects model (3.1) have been developed over the years. Five of them will be discussed here, namely the tests by Tukey (1949), Mandel (1961), Johnson and Graybill (1972), Boik (1993b) and Tusell (1990). More details can be found e.g. in Boik (1993a) or Alin and Kurt (2006).
Subsequently the following notation will be used: let ȳ·· = Σ_i Σ_j y_ij / (ab) denote the overall mean of the response, ȳ_i· = Σ_j y_ij / b the i-th row mean and ȳ_·j = Σ_i y_ij / a the j-th column mean. The matrix R will stand for the residual matrix with respect to the main effects,

r_ij = y_ij − ȳ_i· − ȳ_·j + ȳ··.    (3.5)

The decreasingly ordered list of eigenvalues of the matrix RRᵀ will be denoted by θ_1 ≥ θ_2 ≥ ..., and its scaled version by

λ_i = θ_i / Σ_k θ_k,    i = 1, 2, ...

If the interaction is present we may expect that some of the λ_i coefficients will be substantially higher than the others.
Tukey test: Introduced in Tukey (1949). Tukey suggested to first estimate the row and column effects by the row and column means, α̂_i = ȳ_i·, β̂_j = ȳ_·j, and then test for an interaction of the type (αβ)_ij = k · α_i · β_j, where k is a real constant (k = 0 implies no interaction). The Tukey test statistic S_T equals

S_T = MS_int / MS_error,

where

MS_int = ( Σ_i Σ_j y_ij (ȳ_i· − ȳ··)(ȳ_·j − ȳ··) )² / ( Σ_i (ȳ_i· − ȳ··)² Σ_j (ȳ_·j − ȳ··)² )

and

MS_error = ( Σ_i Σ_j (y_ij − ȳ··)² − a Σ_j (ȳ_·j − ȳ··)² − b Σ_i (ȳ_i· − ȳ··)² − MS_int ) / ( (a − 1)(b − 1) − 1 ).

Under the additivity hypothesis S_T is F-distributed with 1 and (a − 1)(b − 1) − 1 degrees of freedom.
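A compact R sketch of S_T for an a × b table y with one observation per cell (the function name is ours):

```r
tukey_additivity <- function(y) {
  a <- nrow(y); b <- ncol(y)
  ai <- rowMeans(y) - mean(y); bj <- colMeans(y) - mean(y)
  SS_int   <- sum(y * outer(ai, bj))^2 / (sum(ai^2) * sum(bj^2))
  R        <- sweep(sweep(y, 1, rowMeans(y)), 2, colMeans(y)) + mean(y)   # residual matrix
  MS_error <- (sum(R^2) - SS_int) / ((a - 1) * (b - 1) - 1)
  ST <- SS_int / MS_error
  c(ST = ST, p.value = pf(ST, 1, (a - 1) * (b - 1) - 1, lower.tail = FALSE))
}
```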
Mandel test: Introduced in Mandel (1961). Mandel generalized the approach of Tukey and derived a test for the interaction (αβ)_ij = c_i β_j with c_i being a certain row constant. He defined the test statistic S_M to test for c_i = 0 as

S_M = ( Σ_i (z_i − 1)² Σ_j (ȳ_·j − ȳ··)² / (a − 1) ) / ( Σ_i Σ_j (y_ij − ȳ_i· − z_i (ȳ_·j − ȳ··))² / ((a − 1)(b − 2)) ),

where

z_i = Σ_j y_ij (ȳ_·j − ȳ··) / Σ_j (ȳ_·j − ȳ··)².

Under the additivity hypothesis S_M is F-distributed with a − 1 and (a − 1)(b − 2) degrees of freedom.

Johnson–Graybill test: Introduced in Johnson and Graybill (1972). These authors chose a different approach and derived a test for (αβ)_ij = k · c_i · d_j with c_i and d_j being certain row and column constants and k an overall constant. They suggested the test statistic

S_J = λ_1 = eig_1(RRᵀ) / tr(RRᵀ).

The hypothesis is rejected if S_J is high.


Tusell test: See Tusell (1990). Tusell chose a similar approach to the Johnson–Graybill test. Without loss of generality assume a ≤ b. The suggested test statistic is

S_U = (a − 1)^{(a−1)(b−1)/2} ( Π_{i=1}^{a−1} λ_i )^{(b−1)/2}.

The additivity hypothesis is rejected if S_U is low. Critical values for this test statistic are given e.g. in Kres (1972). Note that these tables should be used with (a − 1) = p and b = N.
Locally best invariant (LBI) test: See Boik (1993b). This test was designed to have locally more power than the Tusell test. The LBI test statistic equals

S_L = 1 / ( (a − 1) Σ_{i=1}^{a−1} λ_i² ).

The additivity hypothesis is rejected if S_L is low.
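The three eigenvalue-based statistics can be computed in a few lines of R (a sketch; the function name is ours, and critical values still have to be obtained by simulation as noted below):

```r
additivity_stats <- function(y) {
  a <- nrow(y); b <- ncol(y)                                      # assume a <= b
  R <- sweep(sweep(y, 1, rowMeans(y)), 2, colMeans(y)) + mean(y)  # r_ij of (3.5)
  theta  <- eigen(R %*% t(R), symmetric = TRUE, only.values = TRUE)$values
  lambda <- theta / sum(theta)                                    # scaled eigenvalues
  list(SJ = lambda[1],                                            # Johnson-Graybill
       SL = 1 / ((a - 1) * sum(lambda[1:(a - 1)]^2)),             # LBI
       SU = (a - 1)^((a - 1) * (b - 1) / 2) *
            prod(lambda[1:(a - 1)])^((b - 1) / 2))                # Tusell
}
```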

The critical values of these tests can be found by simulation for given a and b.
All these tests (together with the procedure for finding the critical values) were implemented in the R environment (R Development Core Team (2008)) in the package AdditivityTests. It may be downloaded from
http://5r.matfyz.cz/skola/AdditivityTests/additivityTests_0.3.zip.
To the best of our knowledge, this is the first R implementation of additivity tests with the exception of the Tukey test.
All these tests were developed for the fixed effects model (3.1). In the next section their usage for the model with mixed effects (3.2) is examined.

3.2 Properties of the additivity tests

3.2.1 Type-I-risk of the additivity tests

The main interest of our work lies in model (3.2) with one fixed and one random factor. The question arises whether the tests presented in the previous section, developed for the fixed effects model (3.1), can be used also in this situation.
We considered the common 5 % type-I-risk level and performed a simulation to estimate the actual type-I-risk. In the simulation the parameters were set to the following values:
The number of levels of the fixed factor was equal to a = 3, 4, ..., 10.

Table 3.1: Number and percentage of the simulated cases where the actual type-I-risk is lower or greater than the nominal level 5 %.

Test                      ≤ 0.05          > 0.05
Tukey test                349 (96.94)     11 (3.06)
Mandel test               348 (96.67)     12 (3.33)
Johnson–Graybill test     339 (94.17)     21 (5.83)
Tusell test               337 (93.61)     23 (6.39)
LBI test                  336 (93.33)     24 (6.67)

The number of levels of the random factor b was chosen between 4 and 50 (by 2 between 4 and 20, by 5 between 20 and 50).
The variance of the random factor was equal to σ²_B = 2, 5, 10.
The variance of the random error was σ² = 1. For other values of σ² the model can be scaled (see Example 3 on page 29).
In one step of the simulation a dataset was generated based on the model without interaction (σ²_AB = 0). Then one of the Tukey, Mandel, Johnson–Graybill, LBI or Tusell tests was performed. The percentage of significant results among the steps is taken as the actual type-I-risk of the test.
The 10 000 simulation steps were repeated 10 times and the standard error of the estimate was computed based on these 10 repetitions. Then for each test and each combination of parameters the one-sided hypothesis
H0: the actual type-I-risk is lower than or equal to 0.05
was tested by a one-sample t-test on the 5 % level against the alternative
HA: the actual type-I-risk is greater than 0.05.
The results of these t-tests for each additivity test are summarized in Table 3.1.
For the Tukey and Mandel tests the vast majority (> 95 %) of cases is not significantly above the 0.05 level. For the other tests the estimated type-I-risk is higher than 0.05 in slightly more cases (6–7 %). However, these may also be false positives caused by multiple testing.

We would like to remark that although the tests were derived for the fixed effects model (Σ_{j=1}^{b} β_j = 0), we used them in the mixed effects model, where E(b_j) = 0 but Σ_{j=1}^{b} b_j ≠ 0 almost surely. However, for a high number of levels b of the random factor, the sum converges to zero (law of large numbers, e.g. Grimmett and Stirzaker (1992)).
For the 5 % type-I-risk we can say that all five additivity tests seem not to violate the type-I-risk and therefore they can be used for the mixed effects model as well.

3.2.2 Simulation study of the power of the additivity tests

The power of the Tukey, Mandel, Johnson–Graybill, LBI and Tusell tests is studied in this section. The powers of these tests were compared by simulation. It is shown that while the Tukey and Mandel tests have good power when the interaction is a product of the main effects, i.e. when (ab)_ij = k · α_i · b_j (k is a real constant, α_i and b_j the row and column effects in model (3.2)), their power for a more general interaction is very poor. The other three tests work a bit worse in this special case but they have appropriately good power in more general cases too.
Let us consider the mixed effects model (3.2). Two possible interaction schemes were under inspection:
Type (A): (ab)_ij = k · α_i · b_j,
Type (B): (ab)_ij = k · α_i · c_j,
where c_j is a normally distributed random variable with zero mean and variance σ²_B, mutually independent of b_j and e_ij, k is a real constant, α_i the row effect and b_j the column effect.

Two possibilities are considered for the value of b, either b = 10 or b = 50, and 10 different values of the interaction parameter k between 0 and 12 are considered. The other parameters are μ = 0, σ²_B = 2, σ² = 1, a = 10,
(α_1, ..., α_10) = (−2.03, −1.92, −1.27, −0.70, 0.46, 0.61, 0.84, 0.94, 1.07, 2.00).
For each combination of parameters a dataset was generated from model (3.2), the Tukey, Mandel, Johnson–Graybill, LBI and Tusell tests were performed and the results were noted down. This step was repeated 10 000 times. The estimated power of the test is the percentage of positive results.
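One step of this simulation for interaction type (B) could be sketched in R as follows (it reuses the additivity_stats sketch above; the signs of the α_i are our reconstruction):

```r
alpha_i <- c(-2.03, -1.92, -1.27, -0.70, 0.46, 0.61, 0.84, 0.94, 1.07, 2.00)
simulate_typeB <- function(k, b = 50, sigma2_B = 2) {
  bj <- rnorm(b, sd = sqrt(sigma2_B))                 # random column effects
  cj <- rnorm(b, sd = sqrt(sigma2_B))                 # independent of bj: type (B)
  y  <- outer(alpha_i, bj, "+") + k * outer(alpha_i, cj) + rnorm(length(alpha_i) * b)
  additivity_stats(y)$SJ                              # Johnson-Graybill statistic
}
```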
All tests were done on the α = 5 % level. The dependence of the power on the constant k is visualized in Figure 3.1. As we can see, while the Tukey and Mandel tests outperformed the other three tests for interaction type (A), they completely fail to detect interaction type (B) even for a large value of k. Therefore, it is desirable to develop a test which is able to detect a spectrum of practically relevant alternatives while still having power comparable to the Tukey and Mandel tests for the most common interaction type (A).
Because in practice the type of interaction is usually not known, it should be recommended to use the Johnson–Graybill, LBI or Tusell test for the hypothesis of additivity (3.3) or (3.4). Another possibility is to use the modified Tukey test proposed in Section 3.3.

3.2.3 Size of experiment for the Johnson Graybill test

In this section we will propose an empirical formula for the required size of experiment for
the Johnson Graybill test.
We consider an interaction term in model (3.2) of the form

    (ab)_{ij} = k\,\alpha_i c_j ,        (3.6)

where the α_i are the row effects in (3.2) and the c_j are normally distributed random variables with zero expected value and variance σ_B², mutually independent of the random effects b_j and of the random errors e_ij. The k is a real constant.
The interaction (3.6) is a random variable with zero mean and variance

    \operatorname{var}(ab)_{ij} = k^2 \alpha_i^2 \sigma_B^2 .
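Both properties follow in one step from the independence assumptions, k and α_i being fixed constants:

    E\,(ab)_{ij} = k\,\alpha_i\, E(c_j) = 0, \qquad
    \operatorname{var}(ab)_{ij} = k^2 \alpha_i^2 \operatorname{var}(c_j) = k^2 \alpha_i^2 \sigma_B^2 .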


Figure 3.1: Power dependence on k, b and interaction type; b = 10 left, b = 50 right, interaction type (A) in the top row, type (B) in the bottom row. Tukey test solid line, Mandel test dashed line, Johnson Graybill test dotted line, Tussel test long-dash line, LBI test dot-dash line. (In every panel the horizontal axis shows k from 0 to 3 and the vertical axis the power from 0 to 1.)

We performed a simulation to investigate the power of the Johnson Graybill test in dependence on the parameters of model (3.2) with interaction (3.6): a, b, α_1, ..., α_a, k and σ_B. We consider the number of levels of the fixed factor a equal to 10, 20, 30, 40 or 50, and five different shapes of the distribution generating the α_i:
- equidistant values,
- a random sample from the normal distribution,
- a random sample from the t_3-distribution,
- half of the α_i concentrated in one point and the other half in another point,
- two of the α_i at the same distance from zero and the remaining a - 2 equal to zero.
In all cases α_1, ..., α_a are scaled to have zero mean. It was observed that the power does not depend on the shape and depends on the α_i only through the sum of their squares Σ_{i=1}^a α_i². Three values of this sum were considered: 296, 665 and 1496.

The variance of the random effect b_j was considered to be equal to 1, √2 or 2. The parameter k (which controls the variance of the random interaction (ab)_ij) takes the values 0.03, 0.05, 0.07 and 0.1. The variance of the random noise e_ij is considered to be equal to 1; for any other value the model should be scaled (see Example 3 below).
The power of a test increases with the distance of its alternative from the null hypothesis. Based on the simulation, the power of the Johnson Graybill test for a type-I-risk equal to 5 % can be approximated by

    \pi(b) = 1 - \frac{1}{a\, b\, k^4\, \sigma_B^4 \sum_{i=1}^{a} \alpha_i^2} .        (3.7)

The formula was derived only for powers in the interval [0.10, 0.95].

Let us emphasize that in practice the number of rows a is fixed and we can influence the size of the experiment only through the number of columns b. Simple manipulation of formula (3.7) gives

    b(\beta) = \left\lceil \frac{1}{\beta\, a\, k^4\, \sigma_B^4 \sum_{i=1}^{a} \alpha_i^2} \right\rceil ,        (3.8)

where ⌈x⌉ denotes the smallest integer equal to or greater than x and β is the type-II-risk. In the case of σ² ≠ 1 the model should be scaled, see Example 3.
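Formula (3.8) translates directly into a small R helper; the argument names are ours, and σ² = 1 is assumed (otherwise the model has to be scaled as in Example 3):

    ## Required number of levels b of the random factor according to (3.8);
    ## beta is the type-II-risk, sum_alpha_sq the sum of the squared alpha_i.
    b_required <- function(beta, a, k, sigma_B, sum_alpha_sq) {
      ceiling(1 / (beta * a * k^4 * sigma_B^4 * sum_alpha_sq))
    }

    ## Illustrative call with values from the simulation settings above:
    b_required(beta = 0.2, a = 10, k = 0.07, sigma_B = sqrt(2), sum_alpha_sq = 665)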

In Figure 3.2 the difference between the number of levels of the random factor obtained by simulation and the number computed by formula (3.8) is plotted. Notice that the formula gives quite satisfactory results, although there are a few outliers.
Example 3. Scaling the model when σ² ≠ 1

Consider that we want to plan an experiment and use formula (3.8) to determine its size. This formula assumes that the variance of the errors e_ij in model (3.2) is equal to 1. This example shows the solution when this assumption is violated.


Figure 3.2: Dependency of the difference b_SIM - b_EST between the real number of levels of the random factor b_SIM (realized from simulation) and the number b_EST estimated by formula (3.8) on the type-II-risk (the horizontal axis runs from 0.1 to 0.9, the vertical axis from -100 to 500; the line marks the zero level).

Let the variance of e_ij in model (3.2) be equal to σ² > 0. We divide equation (3.2) with interaction (3.6) by σ and for i = 1, ..., a, j = 1, ..., b define

    \tilde y_{ij} = \frac{y_{ij}}{\sigma}, \quad \tilde\mu = \frac{\mu}{\sigma}, \quad \tilde\alpha_i = \frac{\alpha_i}{\sigma}, \quad \tilde b_j = \frac{b_j}{\sigma}, \quad \tilde c_j = \frac{c_j}{\sigma}, \quad \tilde k = k\sigma \quad \text{and} \quad \tilde e_{ij} = \frac{e_{ij}}{\sigma}.

The modified model equation is

    \tilde y_{ij} = \tilde\mu + \tilde\alpha_i + \tilde b_j + \tilde k\, \tilde\alpha_i \tilde c_j + \tilde e_{ij}, \qquad i = 1, \dots, a, \; j = 1, \dots, b.        (3.9)

Conditions analogous to those of model (3.2) hold, i.e. Σ_i α̃_i = 0, and the b̃_j, c̃_j and ẽ_ij are normally distributed random variables with zero mean. The variances of b̃_j and c̃_j equal σ̃_B² = (σ_B/σ)², and the variance of ẽ_ij equals σ̃² = var ẽ_ij = 1.
The size of experiment computed by formula (3.8) for model (3.9) is appropriate for the original model.
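As a purely illustrative numerical check (the input values are invented), suppose σ² = 4 in the situation of the previous sketch. Dividing the model by σ = 2 rescales k, σ_B and the α_i as described above, and the helper b_required from the sketch after formula (3.8) is then called with the scaled values:

    sigma   <- 2                                         # residual standard deviation, sigma^2 = 4
    sigma_B <- sqrt(2); k <- 0.07; sum_alpha_sq <- 665   # parameters of the original model
    b_required(beta = 0.2, a = 10,
               k            = k * sigma,                 # k tilde       = k * sigma
               sigma_B      = sigma_B / sigma,           # sigma_B tilde = sigma_B / sigma
               sum_alpha_sq = sum_alpha_sq / sigma^2)    # alpha_i tilde = alpha_i / sigma
    ## The scaled call requires about sigma^2 times more columns than the unscaled one.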

3.3 Modified Tukey test

To increase the power of the Tukey test, a modification of it is proposed in this section.
In the classical Tukey test the model

    y_{ij} = \mu + \alpha_i + \beta_j + k\, \alpha_i \beta_j + e_{ij}        (3.10)

is tested against the submodel

    y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}.

The estimators of the row effects, α̂_i = ȳ_i. - ȳ.., and of the column effects, β̂_j = ȳ_.j - ȳ.., are calculated in the same way in both models, although the dependence of y_ij on these parameters is not linear in the full model.
The main idea behind the presented modification is that the full model (3.10) is fitted by nonlinear regression and tested against the submodel

    y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}

by the likelihood ratio test. The estimates of the row and column effects therefore differ between the classical and the modified test.
Non-adjusted test

Under the additivity hypothesis the maximum likelihood estimates of the parameters in model (3.10) are calculated as μ̂^(0) = ȳ.., α̂_i^(0) = ȳ_i. - ȳ.. and β̂_j^(0) = ȳ_.j - ȳ.., and the residual sum of squares equals

    RSS^{(0)} = \sum_i \sum_j \left( y_{ij} - \hat\mu^{(0)} - \hat\alpha_i^{(0)} - \hat\beta_j^{(0)} \right)^2
              = \sum_i \sum_j \left( y_{ij} - \bar y_{i\cdot} - \bar y_{\cdot j} + \bar y_{\cdot\cdot} \right)^2 .

In the full model (3.10) the first estimate of k is taken as

    \hat k^{(0)} = \frac{\sum_i \sum_j \left( y_{ij} - \hat\mu^{(0)} - \hat\alpha_i^{(0)} - \hat\beta_j^{(0)} \right) \hat\alpha_i^{(0)} \hat\beta_j^{(0)}}
                        {\sum_i \sum_j \left( \hat\alpha_i^{(0)} \right)^2 \left( \hat\beta_j^{(0)} \right)^2} ,

i.e. the same as in the Tukey test, and then we continue by the iteration procedure, updating the estimates from the previous step's versions:

    \hat\alpha_i^{(n)} = \frac{\sum_j \left( y_{ij} - \hat\mu^{(0)} - \hat\beta_j^{(n-1)} \right) \left( 1 + \hat k^{(n-1)} \hat\beta_j^{(n-1)} \right)}
                              {\sum_j \left( 1 + \hat k^{(n-1)} \hat\beta_j^{(n-1)} \right)^2} ,

    \hat\beta_j^{(n)} = \frac{\sum_i \left( y_{ij} - \hat\mu^{(0)} - \hat\alpha_i^{(n-1)} \right) \left( 1 + \hat k^{(n-1)} \hat\alpha_i^{(n-1)} \right)}
                             {\sum_i \left( 1 + \hat k^{(n-1)} \hat\alpha_i^{(n-1)} \right)^2} ,

    \hat k^{(n)} = \frac{\sum_i \sum_j \left( y_{ij} - \hat\mu^{(0)} - \hat\alpha_i^{(n-1)} - \hat\beta_j^{(n-1)} \right) \hat\alpha_i^{(n-1)} \hat\beta_j^{(n-1)}}
                        {\sum_i \sum_j \left( \hat\alpha_i^{(n-1)} \right)^2 \left( \hat\beta_j^{(n-1)} \right)^2} .
Surprisingly, it seems that one iteration is enough in the vast majority of cases. Therefore, for simplicity, let us define

    RSS = \sum_i \sum_j \left( y_{ij} - \hat\mu^{(0)} - \hat\alpha_i^{(1)} - \hat\beta_j^{(1)} - \hat k^{(1)} \hat\alpha_i^{(1)} \hat\beta_j^{(1)} \right)^2 .

The likelihood ratio statistic, i.e. the difference of twice the log-likelihoods, equals

    \frac{RSS^{(0)} - RSS}{\sigma^2}

and it is asymptotically χ²-distributed with 1 degree of freedom.

Figure 3.3: Power dependence on k and b for the interaction type (B); b = 10 left, b = 50 right. Tukey test solid line, Mandel test dashed line, Johnson Graybill test dotted line, Tussel test long-dash line, LBI test dot-dash line, modified Tukey test two-dash line. (In both panels the horizontal axis shows k from 0 to 3 and the vertical axis the power from 0 to 1.)
The consistent estimate of the residual variance σ² is s² = RSS/(ab - a - b), and RSS/σ² is approximately χ²-distributed with ab - a - b degrees of freedom. Thus, using a linear approximation of the nonlinear model (3.10), the statistic

    \frac{RSS^{(0)} - RSS}{RSS / (ab - a - b)}        (3.11)

is F-distributed with 1 and ab - a - b degrees of freedom. Easy manipulation of (3.11) gives the modified Tukey test, which rejects the additivity hypothesis if and only if

    RSS^{(0)} > RSS \left( 1 + \frac{1}{ab - a - b}\, F(1, ab - a - b; 1 - \alpha) \right),        (3.12)

where F(1, ab - a - b; 1 - α) stands for the (1 - α)-quantile of the F-distribution with 1 and ab - a - b degrees of freedom.
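A compact R sketch of the non-adjusted modified test as described above (one Jacobi-style iteration started from the additive fit, then rule (3.12)); it is an illustration of the procedure, not the authors' reference implementation. Under this reading, the first iteration leaves k at the Tukey starting value and mainly re-estimates the row and column effects.

    ## One-step fit of the nonlinear model (3.10); every update uses the values
    ## from the previous step, and mu is kept at the grand mean throughout.
    modified_tukey_fit <- function(y, n_iter = 1) {
      a <- nrow(y); b <- ncol(y)
      mu    <- mean(y)
      alpha <- rowMeans(y) - mu                 # additive-model (step 0) estimates
      beta  <- colMeans(y) - mu
      rss0  <- sum((y - mu - outer(alpha, beta, "+"))^2)
      k     <- sum((y - mu - outer(alpha, beta, "+")) * outer(alpha, beta)) /
               (sum(alpha^2) * sum(beta^2))     # Tukey's starting value k^(0)
      for (n in seq_len(n_iter)) {
        alpha_new <- sapply(1:a, function(i)
          sum((y[i, ] - mu - beta) * (1 + k * beta)) / sum((1 + k * beta)^2))
        beta_new  <- sapply(1:b, function(j)
          sum((y[, j] - mu - alpha) * (1 + k * alpha)) / sum((1 + k * alpha)^2))
        k_new     <- sum((y - mu - outer(alpha, beta, "+")) * outer(alpha, beta)) /
                     (sum(alpha^2) * sum(beta^2))
        alpha <- alpha_new; beta <- beta_new; k <- k_new
      }
      rss1 <- sum((y - mu - outer(alpha, beta, "+") - k * outer(alpha, beta))^2)
      list(rss0 = rss0, rss1 = rss1, S = rss0 - rss1, k1 = k,
           mu0 = mu, alpha0 = rowMeans(y) - mu, beta0 = colMeans(y) - mu)
    }

    ## Rejection rule (3.12) at level 'level'.
    modified_tukey_reject <- function(y, level = 0.05) {
      a <- nrow(y); b <- ncol(y); fit <- modified_tukey_fit(y)
      fit$rss0 > fit$rss1 * (1 + qf(1 - level, 1, a * b - a - b) / (a * b - a - b))
    }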
For type (A) interaction the power of the modified test is almost equal to the power of the Tukey test. For type (B) interaction the power of all the tests can be seen in Figure 3.3. The power of the modified Tukey test is much higher than the power of the Tukey test for this more general interaction.
Theoretically, we might expect the modified test to be conservative, because a single iteration does not find precisely the maximum of the likelihood of model (3.10). However, as we will see in the following part, the situation for a small number of rows or columns is quite the opposite.
Small sample adjustment

If the left part of Figure 3.3 (b = 10) were magnified enough, it could be observed that the modified Tukey test does not work properly (actual type-I-risk ≈ 6 %). The reason is that the likelihood ratio test statistic converges to the χ² distribution rather slowly (see Bartlett (1937)), and a correction for small sample sizes is needed. We present two possibilities, which are recommended if the number of rows or columns is below 20 (an empirical threshold based on our simulations).

One possibility to overcome this obstacle is to bootstrap without replacement. Consider the test statistic S = RSS^(0) - RSS. Then generate N^(boot) datasets from the model

    y_{ij}^{(boot)} = \hat\mu^{(0)} + \hat\alpha_i^{(0)} + \hat\beta_j^{(0)} + r_{\pi(ij)} ,

where π is a random permutation of the indexes of the matrix R in (3.5). For each dataset the statistic of interest S^(boot) = RSS^(0)(boot) - RSS^(boot) is computed. The critical value of the modified Tukey test is then the (1 - α) · 100 % quantile of the generated S^(boot). The number of generated samples N^(boot) = 1000 seems to be sufficient in most cases.
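A hedged R sketch of this permutation calibration, reusing the modified_tukey_fit helper from the previous sketch and taking R to be the matrix of additive-model residuals (its exact definition is given in (3.5)):

    ## Permutation ("bootstrap without replacement") calibration of S = RSS^(0) - RSS.
    permutation_critical_value <- function(y, n_boot = 1000, level = 0.05) {
      fit     <- modified_tukey_fit(y)                           # helper defined above
      add_fit <- fit$mu0 + outer(fit$alpha0, fit$beta0, "+")     # fitted additive part
      r       <- y - add_fit                                     # residual matrix R
      s_boot  <- replicate(n_boot, {
        y_boot <- add_fit + matrix(sample(r), nrow(y), ncol(y))  # permuted residuals
        modified_tukey_fit(y_boot)$S
      })
      quantile(s_boot, 1 - level)                                # critical value for S
    }
    ## Reject additivity if modified_tukey_fit(y)$S exceeds this critical value.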
The second possibility is to estimate the residual variance σ² of the random errors e_ij by s² = RSS/(ab - a - b) and then generate N^(sample) datasets using the model

    y_{ij}^{(sample)} = \hat\mu^{(0)} + \hat\alpha_i^{(0)} + \hat\beta_j^{(0)} + e_{ij}^{(NEW)} ,

where the e_ij^(NEW) are independent identically distributed random variables generated from a normal distribution with zero mean and variance s². Because under the additivity hypothesis the parameter k is equal to zero, the proposed test statistic is the absolute value of its estimator, |k̂^(1)|. As in the bootstrap, for each of the N^(sample) datasets the value of the statistic is computed, and the additivity hypothesis is rejected if more than (1 - α) · 100 % of the sampled statistics lie below the statistic |k̂^(1)| based on the real data. The number of generated samples N^(sample) = 1000 seems to be sufficient in most cases.
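The second correction can be sketched analogously, again reusing the modified_tukey_fit helper; this is an illustration of the described procedure, not the original code:

    ## Parametric resampling calibration with |k^(1)| as the test statistic.
    sampling_test <- function(y, n_sample = 1000, level = 0.05) {
      a <- nrow(y); b <- ncol(y)
      fit   <- modified_tukey_fit(y)                           # helper defined above
      s2    <- fit$rss1 / (a * b - a - b)                      # estimate of sigma^2
      k_obs <- abs(fit$k1)
      k_sim <- replicate(n_sample, {
        y_new <- fit$mu0 + outer(fit$alpha0, fit$beta0, "+") +
                 matrix(rnorm(a * b, sd = sqrt(s2)), a, b)     # data under additivity
        abs(modified_tukey_fit(y_new)$k1)
      })
      mean(k_sim < k_obs) > 1 - level     # reject if more than (1 - alpha) * 100 %
    }                                     # of the sampled statistics lie below |k^(1)|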
To conclude, we have proposed a modification of the Tukey additivity test. The modified Tukey test has power almost as good as the Tukey test when the interaction is a product of the main effects, and it should be recommended if reasonable power is also required for more general interaction schemes.


Bibliography
A. Alin and S. Kurt. Testing non-additivity (interaction) in two-way ANOVA tables with no replication. Statistical Methods in Medical Research, 15:63–85, 2006.
M.S. Bartlett. Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London, Series A, 160:268–282, 1937.
R.J. Boik. A comparison of three invariant tests of additivity in two-way classifications with no replications. Computational Statistics and Data Analysis, 15:411–424, 1993a.
R.J. Boik. Testing additivity in two-way classifications with no replications: the locally best invariant test. Journal of Applied Statistics, 20:41–55, 1993b.
E. Brunner and U. Munzel. Nichtparametrische Datenanalyse - unverbundene Stichproben. Springer, Berlin, 2002.
S. Chakraborti, B. Hong, and M.A. van de Wiel. A note on sample size determination for a nonparametric test of location. Technometrics, 48:88–94, 2006.
M.N. Ghosh and D. Sharma. Power of Tukey's test for non-additivity. Journal of the Royal Statistical Society, Series B, 25:213–219, 1963.
G.R. Grimmett and D.R. Stirzaker. Probability and Random Processes, 2nd Edition. Clarendon Press, Oxford, 1992.
V. Hegeman and D.E. Johnson. The power of two tests for nonadditivity. Journal of the American Statistical Association, 71:945–948, 1976.
D.E. Johnson and F.A. Graybill. An analysis of a two-way model with interaction and no replication. Journal of the American Statistical Association, 67:862–868, 1972.
H. Kres. Statistical Tables for Multivariate Analysis. Springer, New York, 1972.
W.H. Kruskal and W.A. Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47:583–621, 1952.
E.L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Inc., San Francisco, 1975.
E.L. Lehmann. Testing Statistical Hypotheses. Springer-Verlag, New York, 2005.
M. Mahoney and R. Magel. Estimation of the power of the Kruskal-Wallis test. Biometrical Journal, 38:613–630, 1996.
J. Mandel. Non-additivity in two-way analysis of variance. Journal of the American Statistical Association, 56:878–888, 1961.
G.E. Noether. Sample size determination for some common nonparametric tests. Journal of the American Statistical Association, 82:645–647, 1987.
R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. URL http://www.R-project.org.
D. Rasch and V. Guiard. The robustness of parametric statistical methods. Computational Statistics and Data Analysis, 10:29–45, 1990.
D. Rasch and M. Šimečková. Determining the size of experiments for the one-way ANOVA model I for ordered categorical data. In Proceedings of the 8th International Workshop in Model-Oriented Design and Analysis, Almagro, Spain, June 4-8, 2007, 2007. Physica-Verlag, Series: Contributions to Statistics.
D. Rasch, L.R. Verdooren, and J.I. Gowers. The Design and Analysis of Experiments and Surveys, Second Edition. R. Oldenbourg Verlag, Muenchen Wien, 2007.
T. Rusch, M. Šimečková, K.D. Kubinger, K. Moder, P. Šimeček, and D. Rasch. Test of additivity in mixed and fixed effects two-way ANOVA models with single subclass numbers. In Proceedings of the International Conference on Trends and Perspectives in Linear Statistical Inference LINSTAT 2008, Bedlewo, Poland, April 21-25, 2008, submitted. Springer, special issue of Statistical Papers.
H. Scheffé. The Analysis of Variance. John Wiley & Sons, Inc., New York, 1959.
P. Šimeček and M. Šimečková. Modification of Tukey's additivity test. Journal of Statistical Planning and Inference, submitted.
M. Šimečková and D. Rasch. Additivity hypothesis in the mixed two-way ANOVA model with single subclass numbers. In Proceedings of the 15th Summer School of JČMF ROBUST 2008, Račkova dolina, Pribylina, Slovakia, September 8-12, 2008, submitted.
M. Šimečková and D. Rasch. Sample size for the one-way layout with one fixed factor for ordered categorical data. Journal of Statistical Theory and Practice, 2:109–123, 2008.
J.W. Tukey. One degree of freedom for non-additivity. Biometrics, 5:232–242, 1949.
F. Tusell. Testing for interaction in two-way ANOVA tables with no replication. Computational Statistics and Data Analysis, 10:29–45, 1990.
F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1:80–83, 1945.
