Académique Documents
Professionnel Documents
Culture Documents
SERIES EDITORS
Hans Amman, University ofAmsterdam, Amsterdam, The Netherlands
Anna Nagurney, University of Massachusetts at Amherst, USA
EDITORIAL BOARD
Anantha K. Duraiappah, European University Institute
John Geweke, University of Minnesota
Manfred Gilli, University of Geneva
Kenneth L. Judd, Stanford University
David Kendrick, University of Texas at Austin
Daniel McFadden, University of California at Berkeley
Ellen McGrattan, Duke University
Reinhard Neck, University of Klagenfurt
Adrian R. Pagan, Australian National University
John Rust, University of Wisconsin
Berc Rustem, University of London
Hal R. Varian, University ofMichigan
The titles published in this series are listed at the end of this volume.
....
"
99-056040
Contents
List of Figures
List of Tables
List of Algorithms
Preface
ix
xi
xiii
xv
1
1
1
2
7
10
11
13
16
17
17
19
21
22
23
23
24
25
27
29
29
34
39
39
40
41
43
viii
4
5
3.1
The Householder method
3.2
The Givens method
Computing the orthogonal matrices
Discussion
43
46
49
54
57
57
58
60
67
75
82
87
90
92
94
99
105
105
108
111
5. SUREMODELS
1
Introduction
2
The generalized linear least squares method
3
Triangular SURE models
3.1
Implementation aspects
4
Covariance restrictions
4;1
The QLD of the block bi-diagonal matrix
4.2
Parallel strategies
4.3
Common exogenous variables
117
117
121
123
127
129
133
138
140
147
149
151
152
153
154
157
158
160
References
Author Index
Subject Index
163
177
179
List of Figures
1.1
1.2
1.3
1.4
1.5
1.6
1.7
2.1
2.2
2.3
2.4
3.1
3.2
3.3
3.4
3.5
4
15
15
18
22
34
36
44
47
49
53
59
63
71
72
73
3.6
3.7
3.8
3.9
3.10
3.11
3.12
3.13
3.14
3.15
3.16
4.1
4.2
4.3
4.4
4.5
5.1
5.2
5.3
5.4
5.5
5.6
6.1
76
77
78
80
81
83
85
86
96
102
104
107
109
109
109
110
131
136
137
138
139
144
161
List of Tables
1.1
1.2
1.3
1.4
1.5
2.1
2.2
2.3
2.4
3.1
3.2
3.3
3.4
3.5
3.6
23
27
28
33
38
44
47
50
54
60
63
65
66
74
74
xu
3.7
3.8
3.9
4.1
4.2
5.1
5.2
87
91
98
114
115
128
129
List of Algorithms
XIV
5.2
5.3
129
135
Preface
xvi
Chapter 3 is devoted to methods for up- and down-dating the OLM. It provides the necessary computational tools and techniques that are often required
in econometrics and optimization. The efficient parallel strategies for modifying the OLM can be used as primitives for designing fast econometric algorithms. For example, the Givens and Householder algorithms used to compute the QR decomposition after rows have been added or columns have been
deleted from the original matrix have been efficiently employed to the solution
of the SURE and simultaneous equations models. The updating methods are
also employed to solve the recursive ordinary linear model with linear equality constraints. The numerical methods based on the basis of the null space
and direct elimination methods are in turn adopted for the solution of linearly
constrained simultaneous equations models.
The fourth chapter investigates parallel algorithms for solving the general
linear model - the parent model of econometrics - when it is considered as
a generalized linear least-squares problem. This approach has subsequently
been efficiently used to compute solutions of SURE and simultaneous equations models without having as prerequisite the non-singularity of the variancecovariance matrix of the disturbances. Chapter 5 presents a parallel algorithm
for solving triangular SURE models. The problem of computing estimates of
parameters in SURE models with variance inequalities and positivity of correlations constraints is also considered. Finally, chapter 6 presents algorithms for
computing the three-stage least squares estimator of simultaneous equations
models (SEMs). Numerical and computational methods for solving SEMs with
separable linear equalities constraints and when the SEM has been modified
by deleting or adding new observations or variables are discussed. Expressions revealing linear combinations between the observations which become
redundant are also presented.
These novel computational methods for solving SURE and simultaneous
equations models provide new insights that can be useful to econometric modelling. Furthermore, the computational and numerical efficient treatment of
these models, which are regarded as the core of econometric theory, can be
considered as the basis for future research. The algorithms can be extended or
modified to deal with models that occur in particular econometric applications
and have specific characteristics that need to be taken into account.
The practical issues of the parallel algorithms and the theoretical aspects
of the numerical methods will be of interest to a broad range of researchers
working in the areas of numerical and computational methods in statistics and
econometrics, parallel numerical algorithms, parallel computing and numerical linear algebra. The aim of this monograph is to promote research in the
interface of econometrics, computational statistics, numerical linear algebra
and parallelism.
Preface
xvii
The research described in this monograph is based on the work that I have
pursued in the last ten years. During this period I was privileged to have the
opportunity to discuss various issues related to my work with Maurice Clint.
His numerous suggestions and constructive comments have been both inspiring
and invaluable. I am grateful to Dennis Parkinson for his valuable information
that he has provided on many occasions on various aspects related to SIMD
systems, David A. Belsley for his constructive comments and advice on the
solution of SURE and simultaneous equations models, Hans-Heinrich Nageli
for his comments and constructive criticism on performance issues of parallel
algorithms and the late Mike R.B. Clarke for his suggestions on Givens sequences and matrix computations. I am indebted to Paolo Foschi and Manfred
Gilli for their comments on this monograph and to Sharon Silverne for proof
reading the manuscript. The author accepts full responsibility for any errors
that may be found in this work.
Some of the results of this monograph were originally published in various
papers [69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 84, 85, 86, 87, 88] and
reproduced by kind permission of Elsevier Science Publishers B.Y. 1993,
1994, 1995, 1999; Gordon and Breach Publishers 1993, 1995; John Wiley
& Sons Limited 1996, 1999; IEEE 1993; Kluwer Academic Publishers
1997, 1999; Principia Scientia 1996, 1997; SAP-Slovak Academic Press
Ltd. 1995; and Springer-Verlag 1993, 1996, 1999.
Chapter 1
LINEAR MODELS AND QR DECOMPOSITION
INTRODUCTION
A linear model is one in which y, or some transformation of y, can be expressed as a linear function of ai, or some transformation of ai (i = 1, ... ,n).
Here only linear models where endogenous and exogenous variables do not
require any transformations will be considered. In this case, the relationship
or
Ym
ami
am2
amn
+ (::) .
Xn
(1.4)
y=Ax+,
(1.5)
2.1
(1.7)
The OLM assumptions are that each Ei has the same variance and all disturbances are pairwise uncorrelated. That is, Var(Ei) = cr2 and \:Ii =I- j: E(ET Ej) =
o. The first assumption is known as homoscedasticity (homogeneous variances).
The most frequently used estimating technique for the OLM (1.7) is least
squares. Least-squares (LS) estimation involves minimizing the sum of squares
of residuals: that is, finding an n element vector x which minimizes
eTe = (y-Axf(y-Ax).
(1.8)
ax
AT Ax = AT y.
(1.9)
Assuming that A is of full column rank, that is, (AT A) -I exists, the leastsquares estimator can be computed as
(1.10)
and the variance-covariance matrix of the estimator x is given by
Var(x) = cr2 (AT A)-I.
(1.11)
Figure 1.1.
InL(x, cr ) =
m
m
2
1
T
In (21t) - 2 1n(cr ) - 2cr2 (y - Ax) (y - Ax).
-2
Setting In Ljax =
XML=(ATA)-IATy
(1.12)
and
(1.13)
The ML estimator XML, is identical to the least-squares estimator x which is
the best linear unbiased estimator (BLUE) of x in (1.7). If x is the BLUE of
x, it follows that E(x) = x and \fq E 9\n, Var(qT x) ::; Var(qT x), where x is any
linear unbiased estimator of x.
However, the ML estimator 0-~1L differs from the unbiased estimator of cr 2
which is given by
0-2 = (y-Ax)T(y-Ax)j(m-n).
(1.14)
Numerous methods exist for solving the least-squares problem. Some of the
best known methods are Gaussian elimination, Gauss-Jordan elimination, LU
decomposition, Cholesky factorization, Singular Value Decomposition (SVD)
and QR decomposition (QRD). When the coefficient matrix is large and sparse,
these methods, which are called direct methods, suffer from fill-in and they
can be impracticable. In such cases, iterative methods are more efficient, even
though there are intelligent adaptations of the direct methods which minimize
the fill-in. Iterative methods (e.g. Conjugate Gradient) have the advantage
that minimal storage space is required for implementation since no fill-in of
the zero positions of the coefficient matrix occurs during computation. Furthermore, if a good initial guess is known and preconditioning can be used, the iterative methods can converge in an acceptable number of steps. Full details of the
direct and iterative methods are given in textbooks such as [7,40,51,93,98].
(1.15)
where Q E SRmxm is orthogonal, i.e. it satisfies QT Q = QQT = 1m, and R E SRnxn
is upper triangular. Substituting (1.15) in the normal equations (1.9), gives
RTRx=RTYI,
where
QTy= (YI) n
.
Y2 m-n
Under the assumption that A is of full column rank, which implies that R is
non-singular, the least-squares estimator of the OLM (1.7) is computed by
solving the upper triangular system of equations
RX=YI.
(1.16)
x = argmin II ell 2
x
= argminllQTel1 2
x
= argminllQ T Y_ QTAxll2
x
=
= arg~in (IIYI -
Rxll2 + IIY211 2)
= argminllYl -Rx112
x
= R-1YI.
The quantity
lIy - AxII2 =
2.1.1
(1.17)
(1.18)
Cx=d
The assumptions for the matrix C are rank (C) = k and k < n. The first assumption implies that there are no linear dependencies among the restrictions and
the second that the RLS cannot be derived by solving the system (1.17).
Differentiating the Lagrangian function
L = (y-Ax)T (y-Ax) +2"T (Cx- d)
with respect to x and ').. and equating the results to zero, gives
aL
ax =
-2yTA+2xTATA+2')..TC=O
and
and, under the assumption that the matrix is not singular, a unique solution for
x* can be obtained. Since ATA is assumed to be non-singular it follows that
x*
(1.19)
= (C(AT A)-ICT)-l(Cx-d).
(1.20)
Observe that the RLS estimator x*, differs from the unrestricted least-squares
estimator x by a function involving the extent to which x fails to satisfy the
restrictions [119]. For consistent restrictions, the variance-covariance matrix
of the RLS estimator x* may be shown to be
2.2
The difference between the General Linear Model (GLM) and the OLM is
that there is a correlation between the disturbances Ei (i = 1, ... ,m). The GLM
is given by (1.6), i.e.
(1.21)
where n is a known non-negative definite matrix. The particular GLM, where
n is a diagonal matrix, but not the identity matrix, is called the Weighted Linear
Model (WLM). If n is positive definite, then there exists an m x m non-singular
matrix B such that
(1.22)
Premultiplying the GLM by B- 1 gives the transformed OLM
z = Cx+~, ~ '" N(0,(J2/m)'
= (ATn-IA)-IATn-ly.
(1.23)
(1.24)
= (z-Cx)T(z-Cx)/(m-n)
= (y-Ax)Tn- l (y-Ax)/(m-n)
= (yTQ-l y _ xTATn- ly)/(m-n).
(1.25)
Using the assumption rv N(0,cr 2n), it can be shown that the GLS estimator
= (21tcr2)-m/2IQI-l/2e-(y-Ax)Tn-l(y-AX)/2cr2.
2.2.1
A GENERALIZED LINEAR LEAST SQUARES APPROACH
The derivation of x using (1.23) is computationally expensive and numerically unstable when n is ill-conditioned. Further, if n is singular then the numerical solution of the GLM will fail completely and the replacement of B- 1
by the Moore-Penrose generalized inverse B+ will not always give the BLUE
of x [91]. Numerically efficient methods have been designed to overcome the
ill-conditioning problems of the method [91, 110, 140]. Soderkvist has proposed an algorithm based on constrained and weighted least squares which
uses as a main tool the weighted QRD [54, 55]. Earlier, Paige considered the
GLM as a Generalized Linear Least Squares Problem (GLLSP) and employed
the generalized QRD (GQRD) to solve it [15, 51, 100, 113]. A summary of
the results in [56, 57, 91, 109, 110, 111, 112] is given, in which the GLM is
solved by treating it as a GLLSP problem. With this approach the difficulties
caused by singularity and those caused by ill-conditioning are avoided.
The estimator x in (1.23) is derived by solving
x=
argmin IIB- I 1I 2
x
(1.26)
Although the above formulation allows for singular n reconsider, without loss
of generality, the case when n is non-singular, that is B E 9\mxm has full col-
_)
QT A = ( ~
==
n
1
m-n-l
(R0 11Y)
0
(1.27a)
and
m-n-l
(QTB)P=L T ==
nC'
1
m-n-l
0
0
j
<>
0
Lf2
dT
LI2
).
(1.27b)
where R E 9\nxn and LT E 9\mxm are upper triangular and non-singular, and
Q,P E 9\mxm are orthogonal. The GLLSP (1.26) can be equivalently written
as
argmin IlpT uII2
u,x
or
argmin (IIUI 112 +y2 + IIu2112)
Ul,y,U2,X
Y=RX+LflUI
subject to { 11 = <>y + dT U2
+ jy+Lf2u2
(1.28)
0= LI2u2
where uTP is conformably partitioned as (uf y uI). The third and second constraints in (1.28) give, respectively, U2 = 0 and y = 11/<>. In the first
constraint the arbitrary subvector UI is set to zero in order to minimize the objective function. Thus, the estimator of x derives from the solution of the upper
triangular system
(1.29)
Notice that, if 0 = 0 i- 11, then the GLM is inconsistent. Hammarling et aI. in
[57] solve the GLM using the SVD and present a modified procedure when the
GLM is found to be inconsistent [56].
An expression for the variance-covariance matrix of x is
(1.30)
10
Different methods have been proposed for forming the QRD (1.15), which
is rewritten as
(1.32a)
or
(1.32b)
where A E 5)tmxn, Q = (QI Q2) E 5)tmxm, R E 5)tnxn and m > n. It is assumed
that A has full column rank. Emphasis is given to the two transformation methods known as Householder and Givens rotation methods. The Classical and
Modified Gram-Schmidt orthogonalization methods will also be briefly considered.
A general notation based on a simplification of the triplet subscript expression
will be used to specify sections of matrices and vectors [65, 101]. This notation
has been called Colon Notation in [51]. The kth column and row of A E 5)tmxn
are denoted by A:,k and Ak,: respectively. The submatrix Ai:k,j:s has dimension
(k - i + 1) x (s - j + 1) and its first element is given by ai,j. Similarly, Vi:k is a
(k - i + 1)--element subvector of v E 5)tm starting with element Vi. That is,
ai,j
Ai:k,j:s
ai,s )
ai,j+ I
ak,j+ I
...
ak,s
and
Vi:k= (
L.
Vi+l
Vi)
If the lower or upper index in the subscript notation is omitted, then the default
values are one and the upper bound of this subscript for the matrix or vector,
respectively. A zero dimension denotes a null matrix or vector and all vectors
are considered to be column vectors unless transposed. For example, Ak,: is
a column vector and AI.: == (ak,l ... ak,n) is a row vector, Ai:,:s is equivalent
to Ai:m,l:s and Ai:k,j:s is a null matrix if k < i or s < j. Notice that A[k,j:s is
equivalent to (Ai:k,j:s) T and not (AT) i:k,j:s which denotes the (k - i + 1) x (sj + 1) submatrix of AT.
3.1
11
An m x m Householder transformation (or Householder matrix or Householder reflector) has the form
where h E
(1.33)
where b = IlhI1 2 /2. Householder matrices are symmetric and orthogonal, i.e.
H = HT and H2 = 1m. They are useful because they can be used to annihilate
specified elements of a vector or a matrix [18, 53, 126].
Let x E m be non-zero and H be a Householder matrix such that y = H x
has zero elements in positions k to n. If H is defined as in (1.33) and Xj is any
element of x other than those to be annihilated, then
sn
if i = k, ... ,n,
ifi=jandj=l-k, ... ,n,
otherwise,
Xi
hi =
{ XjS
i=x]+L~
p=k
and
such that
Yi = {
To avoid a large relative error, the sign of S is chosen to be the same as the
sign of Xj. Notice that, except for the annihilated elements of x, the only other
element affected by the transformation is Xj.
Consider now the computation of the QRD (l.32a) using Householder transformations [15,51, 123]. The orthogonal matrix QT is defined as the product
of the n Householder transformations
12
where iIi = Im-i+1 - hhT jb, b = hThj2 and a zero dimension denotes a null
matrix. It can be verified that the symmetric Hi is orthogonal, that is Hr = Hi
and
= 1m. If A (0) == A and
H1
n-i
R{i))
12
A{i) m-i'
(1
~i
< n),
(1.34)
where R~il is upper triangular, then Hi+ I is applied from the left of A (i) to annihilate the last m - i-I elements of the first column of A(i). The transformation
Hi+IA(i) affects only A(i) and it follows that
A(n)
(~).
(1.35)
A summary of the Householder method for computing the QRD is given by Algorithm 1.1. The square root function is denoted by sqrt and h E 9tm- i + l . Notice that no account is taken of whether the matrix is singular or ill-conditioned,
i.e. when the division by b can be performed. Steps 7 and 8 may be combined
into a single step, but for clarity a working vector z E 9tn - i +1 has been used.
13
-T
. =
Q A .,l.k
(RI:k,I:k)
0
Q A:,k+l:
= A:,k+l: -
T
YWY A:,k+l:.
h)/Vb
and
Y;,i =
h/Vb,
where Hi = 1m - hhT /b and, initially, Y;,l = h/v'b and W = h [15, 51]. The
same procedure is repeatedly applied to the submatrix Ak+l:m,k+l:n until the
QRD of A is computed.
3.2
1
i--+
Gi,j=
j--+
-s
(1.36)
c
1
where c = cos(~) and s = sin(~) for some~. Apart from s and -s all offdiagonal elements are zero. Givens rotations are orthogonal, thus GT.iG;,j =
Gi,jGT.i = 1m. The rotation Gi,j when applied from the left of a matrix, annihilates a specific element in the jth row of the matrix and only the ith and jth
rows of the matrix are affected [51]. While Householder transformations are
useful for introducing zero elements on the grand scale, Givens rotations are
important because they annihilate elements of a matrix more selectively.
Let G~1 have the structural form (1.36) such that the rotation
(1.37)
14
results in iij,k being zero, whereA,A E 9l mxn and 1 :::; k:::; n. The rotation (1.37)
which affects only the ith and jth rows of A can be written as
Ap ,: =
if p = i,
CAi': +sAj,:
{ cA j,: - SAi,:
A p ,:
if P = j,
ifp=l, ... ,mandpi=i,j.
(1.38)
If ii j,k is zero, then ca j,k - sai,k = o. If ai,k i= 0 and a j,k i= 0, then using the
trigonometric relation c2 + s2 = 1 it follows that
and
c=ak/t
I,
,
If ai,k and aj,k are both zero, then
(1.38) - is reduced to
(1.39)
(ai,kAi,: +aj,kAj,:}/t
{
Ap ,:= (ai,kAj,:-aj,kA;,:)/t
A p ,:
if p = i,
~fp:j,
..
Ifp-1, ... ,mandpi=l,j.
(1.40)
A E 9lmxn .
1: for i = 1,2, ... , n do
2:
for j = m, m - 1, ... , i + 1 do
3:
A := G)~I,jA
4:
end for
5: end for
The total number of Givens rotations applied by Algorithm 1.2 is given by:
n
I)m-i) = n(2m-n-1)/2.
i=1
(1.41)
15
G(1)
3,4
Figure 1.2.
d2,32)
... -
G(1)
2,3
G(1)
\,2
I a
G(3)
o.
3,4
.. -
d3,42)
_
_
.0
Two examples of Givens rotation sequences are also shown in Figure 1.3,
where m = 10 and n = 6. A number i (1 ::; i ::; 39) at position (j, k) indicates where zeros are created by the ith Givens rotation (1 ::; k ::; n and
k < j ::; m). The first sequence is based on Algorithm 1.2, while the second,
called diagonally-based, annihilates successively su~iagonals of the matrix
starting with the lower sub--diagonal.
8 17
7 16 24
6 15 23 30
5 14 22 29 35
413 21 28 34 39
3 12 20 27 33 38
211 19 26 32 37
1 10 18 25 31 36
(a) Column-based
Figure 1.3.
28 ~5
22 29 36
16 23 30 37
11 17 24 31 38
7 12 18 25 32 39
34
4 8 13 19 26 33
2 5 9 14 20 27
1 3 6 10 15 21
(b) Diagonally-based
The column and diagonally based Givens sequences for computing the QRD.
Algorithm 1.3 gives the order of the rotations applied when the diagonallybased sequence is used. Notice that the rotations are between adjacent planes
and both column-based and diagonally-based algorithms apply the same number of Givens rotations, given by (1.41).
Gentleman proposed a square root free algorithm for computing the QRD
(1.32a) [48]. His method removes the need for the calculation of any square
roots in computing the Givens rotations. This resulted in improving the efficiency of computing the QRD (1.32a) on a serial computer.
16
Algorithm 1.3 The diagonally-based Givens sequence for computing the QRD
of A E 9t mxn .
1: for i = 1,2, ... , m - 1 do
2:
for j = 1, ... , min(i,n) do
3:
A:= G~21,pA,
4:
end for
5: end for
where
p = m- i+ j
3.3
The CGS method has poor numerical properties which can result in loss of
orthogonality among the computed columns of Q. The numerical stability of
the CGS method can be improved if a modified version, called the Modified
Gram-Schmidt (MGS) method, is used. The MGS method rearranges the computations of the CGS algorithm, such that at each stage a column of Q and a
17
row of R are determined [15, 51]. The MGS method for computing the QRD
(1.32b) is given by Algorithm 1.5.
Algorithm 1.5 The Modified Gram-Schmidt method for computing the QRD.
1: for i = 1, ... , n do
2:
Ri:i := ifA:,dl
3:
A,i := A:,i/Ri,i
4:
for j=i+l, ... ,ndo
5:
R,')', :=A!'A),
.,1
"
6:
Data parallel algorithms for computing the QRD are described. The algorithms are based on the Householder, Givens rotations and Gram-Schmidt
methods. Using regression, accurate timing models are constructed for measuring the performance of the algorithms on a massively parallel SIMD (Single
Instruction, Multiple Data) system. The massively parallel computer used is
the MasPar MP-1208 with 8192 processing elements. Although the algorithms
were implemented on the MasPar, the implementation principles should be in
general applicable to any massively parallel SIMD computer [83]. The timing
models will be of the same order when similar SIMD architectures are used,
but coefficient parameters will be different due to the differences in software
and hardware designs that exist among different parallel computers.
4.1
18
in a data parallel mode. These systems are found to be useful for specific
applications such as database searching, image reconstruction, computational
fluid dynamics, signal processing and econometrics. The effectiveness of a
SIMD array processor depends on the interconnection network, the memory
allocation schemes, the parallelization of programs, the languages features and
the compiling techniques [26, 27, 67, 117].
The MasPar SIMD system is composed of afront-end (a DEC station 5000)
and a Data Parallel Unit (DPU). The parallel computations are executed by
the Processing Element (PE) array in the DPU, while serial operations are
performed on the front-end. The 8192 PEs of the MP-1208 are arranged
in a eSl x eS2 array, where eSl = 128 and eS2 = 64. The default mapping
distribution in the MasPar is cyclic. In a cyclic distribution, an n element vector
and an m x n element matrix are mapped onto n / eSl eS21 and m / eSll n / eS21
layers of memory respectively. Figure 1.4 shows the mapping of a 160 x 100
matrix A and a 16384--element vector v on the MP-1208. Other processor
mappings are available for efficiently mapping arrays on the PE array, when
the default cyclic distribution is not the best choice [99].
VI
V65
~1r
__-trii==l-AI29,100
ii:;;~~---lt- A 160,100
Layer
#2
Layer
#3
A 129,65
Figure 104.
V8129
V8193
V8257
VI6321
---~...,,--- V64
V8192
=~;:::::=== V8256
V16384
The main languages for programming the MasPar are the MasPar Fortran
(hereafter MF) and MasPar Programming Language. The language chosen
for implementing the algorithms was MF, which is based on Fortran 77 supplemented with array processing extensions from standard Fortran 90. These
array processing extensions map naturally on the DPU of the MasPar. MF also
supports the forall statement of HPF, which resembles a parallel do loop. For
example given the vectors h E SRm and z E SRn, the product A = hzT can be
19
computed in parallel by
forall (i = 1 : m, j
= 1 : n) Ai,} = hi * z}
(1.43a)
or
A = spread(h,2,n) *spread(z, I,m).
(1.43b)
(1.44)
where Ci (i = 0, ... ,3) are constants which can be found by experiment. The
above model can describe adequately the execution time of (1.43a) and (1.43b).
If m or n is greater than eSleS2, then the timing model (1.44) should also
include combinations of the factors rm/esl eS21 and rn/esl eS21, which correspond to the number of layers required to map a column and a row of the matrix
on the DPU. In order to simplify the performance analysis of the parallel algorithms, it is assumed that m, n ~ eSI eS2 and the dimensions of the data matrix
are multiples of eSI and eS2, respectively.
4.2
The data parallel version of the serial Householder QRD method is given
by Algorithm 1.6. The application of HjA(i-l) in (1.34) is activated by line 3
and the time required to compute this transformation is given by <1>1 (m - i +
1, n - i + 1). Thus, the total time spent on computing all of the Householder
transformations is
n
20
It can be observed that the application of the ith and jth transformation have
the same execution time if
r(m-i+l)/ell = r(m-j+l)/ell
and
4.3
21
As in the case of the Householder algorithm, the performance of the straightforward implementation of the MGS method will be significantly impaired by
the overheads. Therefore, the n = Nes2 steps of the MGS method are used in
N stages. At the ith stage, eS2 steps are used to orthogonalize the (i - 1)es2 + 1
to ies2 columns of A and also to construct the corresponding rows of R. Each
step of the ith (i = 1, ... , N) stage has the same execution time, namely
Thus, the execution time to apply all Nes2 steps of the MGS method is given
by
N
<l>3(Mesl,Nes2) = eS2
L <1>1 (Mesl, (N -
i=l
i + l)es2).
22
4.4
A Givens rotation, when applied from the left of a matrix, affects only two of
its rows: thus a number of them can be applied simultaneously. This particular
feature underpins the development of parallel Givens algorithms for solving a
range of matrix factorization problems [29, 30, 69, 94, 95, 102, 103, 129].
The orthogonal matrix QT in (1.32a) is the product of a sequence of Compound Disjoint Givens Rotations (CDGRs), with each compound rotation reducing to zero elements of A below the main diagonal while preserving previously annihilated elements. Figure 1.5 shows two sequences of CDGRs for
computing the QRD of a 12 x 6 matrix, where a numerical entry denotes an element annihilated by the corresponding CDGR. The first Givens sequence was
developed by Sameh and Kuck [129]. This sequence - the SK sequence - applies a total ofm+n-2 CDGRs to triangularize an m x n matrix (m > n), compared to n(2m - n -1) Givens rotations needed when the serial Algorithm 1.2
is used. The elements are annihilated by rotating adjacent rows. The second
Givens sequence - the Greedy sequence - applies fewer CDGRs than the SK
sequence but, when it comes to implementation, the advantage of the Greedy
sequence is offset by the communication overheads arising from the construction and application of the compound rotations [30,67, 102, 103]. For m n,
the Greedy sequence applies approximately log m + (n - 1) log log m CDGRs .
11
10 12
9 11 13
8 10 12 14
7 911 13 15
6 8 10 12 14 16
5 7 911 13 15
4 6 8 10 12 14
3 5 7 911 13
2 4 6 8 10 12
1 3 5 7 9 11
(a) SK sequence
Figure 1.5.
3 6
5 8
2 4 7 10
2 4 6 9 12
1 3 6 811 14
2
1
1
1
1
1
3 5 7 10 13
3 5 7 9 12
2 4 6 8 11
2 4 6 8 10
2 3 5 7 9
The adaptation, implementation and performance evaluation of the SK sequence to compute various forms of orthogonal factorizations on SIMD systems will be discussed in the subsequent chapters. On the MP-1208, the ex-
23
ecution time of computing the QRD of an Mesl x Nes2 matrix using the SK
sequence, is found to be
T4 (M,N) = N(25.64+5.51N -7.94N2 + 11.1M + 15.99MN) +41.96M.
4.5
COMPUTATIONAL RESULTS
Algor. 1.6
Improved Algor. 1.6
Exec. Tl(M,N) Exec. T2(M,N)
X 10- 2
X 10- 2
Time
Time
Algor. 1.7
Exec. T.l(M,N)
x 10- 2
Time
Algor. SK
Exec. T4(M,N)
X 10- 2
Time
10
10
10
14
14
14
18
18
18
22
22
22
3
7
9
5
9
13
5
9
17
7
15
19
5.48
22.15
33.80
17.48
47.86
90.30
22.34
61.90
189.16
48.61
188.79
287.23
3.21
12.07
18.49
9.30
24.52
46.64
11.60
30.68
94.41
23.98
89.86
138.59
21.16
67.73
92.65
62.27
150.05
242.63
82.03
207.47
503.44
175.85
585.56
805.45
5.55
22.34
34.09
17.54
48.03
90.56
22.35
61.98
189.03
48.71
188.49
286.33
2.58
9.28
13.80
7.36
18.75
34.38
9.14
23.77
69.42
19.10
68.88
103.31
2.59
9.34
13.92
7.35
18.86
34.54
9.15
23.79
69.31
18.90
68.57
102.81
3.22
12.08
18.48
9.30
24.50
46.64
11.59
30.52
94.26
23.93
89.82
138.27
21.04
67.59
92.62
62.36
150.11
242.68
82.25
207.61
503.51
175.96
585.64
805.71
24
the Cambridge Parallel Processing (CPP) linear algebra library (LALIB) are
investigated [19]. The LALIB QRD algorithm is a data-parallel version of
the serial Householder algorithm proposed by Bowgen and Modi [17]. The
performances of Algorithm 1.6 and the QRD LALIB routine are compared. A
second Householder algorithm which is efficient for skinny matrices is also
proposed.
5.1
nr
(1.45)
25
which computes the inner-product Ui = hT A,i for all i, where Ii has value true.
In Fortran-90 the F-PLUS functions sumr(A), matc(h,n) and matr(u,m) can
be expressed as sum(A, 1), spread(h,2,n) and spread(u, I,m), respectively.
The main difference, however, between F-PLUS and HPF, is that the F-PLUS
statement computes all the inner-products hT A and then assigns simultaneously the results to the elements of u, where the corresponding elements of L
have value true. This difference may cause degradation of the performance
with respect to execution speed, if the logical vector L has a significant number of false values. Consider, for example, the three cases, where (i) all elements of L have a true value, (ii) the first n/2 elements of L have a true value
and (iii) only the first element of L has a true value. For m = 1000 and n =
500, the execution time in msec for computing (l.45) on the 1024-processor
GAMMA-I (hereafter abbreviated to GAMMA-I) for all three cases is 249.7,
while, without masking the time required to compute all inner-products is
given by 247.79. Explicitly performing operations only on the affected elements of u, the execution times (including overheads) in cases (ii) and (iii) are
found to be 147.84 and 13.21, respectively. This example shows the degradation in performance that might occur when implementing an algorithm without
taking into consideration the systems software of the particular parallel computer.
5.2
26
The Fortran-90 sub array expressions are not supported by F-PLUS. However functions and subroutines are available for extracting and replacing subarrays. Hence, working arrays need to be used in place of Ui:n and Ai:m,i:n.
The computational cost of extracting the affected subarrays and re-assigning
them to the original arrays can be higher than the savings in time that might be
achieved by working with subarrays of smaller dimensions. This has been considered previously in detail within the context of improving the performance
of Algorithm 1.6.
The Block-Parallel version of Algorithm 1.8 (hereafter called BPHA) is divided into blocks, where each block comprises transformations that have the
same time complexity. The first block comprises the kJ (1 ::; kJ ::; min(n, es))
Householder transformations HI, ... ,Hkl' where kJ is the maximum value satisfying rm/esHn/es1 = r(m+ l-kI}/esH(n+ l-kd/esl The transformations are then applied using Algorithm 1.8. The same procedure is applied
recursively to the smaller (m - kl) x (n - kJ) submatrix Akl+J:m,kl+J:n, until A
is triangularized. Generally, let mo = m, no = n, mi = mi-J - ki (i> 0) and let
the function f(A,m,n) be defined as
f(A,m,n)
(1.46)
where A, m and n are integers and 1 ::; A ::; n. The ith block consists of
ki transformations which are applied using Algorithm 1.8 to the submatrix
27
Ak(i)+l:m,k(i)+l:n'
Table 1.2.
BPHA.
m
Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and the
QR_FACTOR
BPHA
QR_FACTOR I BPHA
300
600
900
1200
1500
1800
25
25
25
25
25
25
1.05
1.60
2.20
2.79
3.33
3.94
1.09
1.71
2.30
3.01
3.50
4.20
0.96
0.93
0.96
0.93
0.95
0.94
2000
2000
2000
2000
2000
2000
2000
2000
25
100
175
250
325
400
475
550
4.30
32.48
74.65
132.08
221.31
313.10
420.16
570.44
4.68
25.56
58.76
99.57
152.68
205.66
285.45
368.93
0.92
1.27
1.27
1.33
1.45
1.52
1.47
1.55
200
400
600
800
200
400
600
800
18.76
88.07
239.61
501.57
10.41
45.14
115.77
229.52
1.80
1.95
2.07
2.19
5.3
The cyclic mapping distribution might not be efficient when the number of
columns of A is small. Parallelism in the first dimension is more efficient for
skinny matrices, that is, when m/es Hnjes1
mjes 21 [25]. Parallelism
in the second dimension is inefficient since n ~ m. Parallel computations are
performed only on single columns or rows of A when parallelism in the first
or second dimension, respectively, is used. Algorithm 1.9 is the equivalent of
nr
28
m
500
500
500
2000
2000
2000
4000
4000
4000
10000
10000
10000
Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and Algo-
n
8
16
32
8
16
32
8
16
32
8
16
32
QR_FACTOR
Algorithm 1.9
0.45
0.90
1.77
1.38
2.76
5.35
2.61
5.21
10.09
6.35
12.67
24.51
0.36
1.19
4.27
0.42
1.39
4.97
0.53
1.77
6.38
0.89
2.98
10.74
1.26
0.76
0.41
3.31
1.99
1.07
4.91
2.94
1.58
7.14
4.26
2.28
29
A block-version of Algorithm 1.9 can also be used but, under the assumption of small n (n es 2 ), the number of blocks will be at most two. Thus, any
savings in computational time in the block-version algorithm will be offset by
the overheads. If m n, then at some stage i of the Householder algorithm the
affected submatrix Ai:m,i:n of A could be considered skinny. This suggests that
in theory an efficient algorithm could exist that initially employs BPHA and
which switches to Algorithm 1.9 in the final stages [82].
(1.47)
where Ai E SRmxnj (m> ni) is the exogenous full column rank matrix in the
ith regression equation, Qi is an m x m orthogonal matrix and Ri is an upper
triangular matrix of order ni [78]. The fast simultaneous computation of the
QRDs (1.47) is considered.
6.1
Consider, initially, the case where the matrices AI, ... ,AG have the same
dimension, that is, n} = ... = nG = n. The equal-size matrices suggests that a
3-D array could be employed. The m x n data matrices A I , ... ,AG and the upper
triangular factors R I, ... ,RG can be arranged in an m x n x G array A and the
n x n x G array R, respectively. Using a 2-D mapping, computations performed
on scalars, I-D and 2-D arrays correspond to computations on I-D, 2-D and
3-D arrays when a 3-D mapping is used. Thus, in theory, the advantage over a
2-D mapping is that a 3-D arrangement will increase the level of parallelism.
The algorithms have been implemented on the 8192-processor MasPar MP1208, using the high level language MasPar-Fortran. On the MasPar, the 3-D
arrangement of the equal-size matrices is mapped on the 2-D array of PEs plus
memory, with computations over the third dimension being performed serially.
This indicates that under a 3-D arrangement the increase in parallelism will not
be as large as is theoretically expected.
The indexing expressions of 2-D matrices and the replication and reduction functions can be used in a 3-D framework. That is, the function spread
which replicates an array by adding a dimension and the function sum which
adds all of the elements of an array along a specified direction can be used.
For example, if B and C are m x n and m x n x G arrays respectively, then
C:= spread(B,3,G) implies that forall k, C:,:,k = B:,: and B:= sum(C,3) is
equivalent to B(i,j) = Lf=1 Ci,j,k whereas sum(C) has a scalar value equal to
the sum of all of the elements of C.
30
31
i = 1, ... , m + n - 2,
G(i,}) =
Jr,
where
( " ")
Gkt,]
(c(i,})
k
(i,})
s(i,}))
k
(i,})'
-sk
({i,})
ci
ck
c(i,})
1
...
1
=,
... ,p,
Sdepend on i.
c(i,})
P
cp(i,}))
and
ST" = ((i,})
:,]
sl
_ (i,})
Sl
.. .
(i,})
sp
where j = 1, ... , G. Also, letA =AS+I:S+2p,:,: - that is, A:,:,i corresponds to the
2p x n submatrix of Ai starting at row ~ + 1 (i = 1, ... , G). The simultaneous
application of the ith CDGRs in a data parallel mode may be realized by:
A*I"2p"2"
. . '0'"" :=A2:2p:2::
" ,
(1.48a)
32
A *2..
'2p'2",.,' := AI:2p:2::
"
(1.48b)
(1.48c)
and
where (1.48a) and (1.48b) construct the 3-D array A * by pairwise interchanging
the rows in the first dimension of A.
The algorithms have been implemented on the MasPar MP-1208 using the
default cyclic distribution for mapping arrays on the Data Parallel Unit (DPU).
In order to simplify the complexity of the timing models the dimension of the
data matrix Ai (i = 1, ... , G) is assumed to be a multiple of the size of the array
processor and G::; eS2. That is, m = Mesl and n = Nes2, where M ~ N ~ G,
eSI = 128 and eS2 = 64. Furthermore, the algorithms have been slightly modified in order to reduce the overheads arising in their straightforward implementation. These overheads mainly comprised the remapping of the affected
subarrays into the DPU and were overcome by referencing a subarray only if
it was using fewer memory layers than a previous extracted subarray.
The dimensions of the data matrices suggest that the time required to execute
the procedure transform in line 6 of Algorithm 1.10 is given by
(l.49)
where m = m/esll, ii = n/es21 and Co,.. , Cs are constants. That is, the total
time spent in applying all of the CHTs, is given by
Nes2
=L
i=1
=N(co+cIN +C2M
=TH(M,N,G) + ,(M,N),
(1.50)
where Co, ... , C7 are combinations of the constants in (1.49) and ,(M,N) is a
negligible function of M and N. A sample of more than 500 execution times
(in msec) were generated for various values of M and N. The least-squares
estimators of the coefficient parameters in the model TH(M,N, G) are found
to be Co = 13.23, CI = -0.49, C2 = 3.30, C3 = 1.89, C4 = 1.60, Cs = -0.12,
C6 = 1.56 and C7 = 0.72.
The timing model TH(M,N, G) can be used as a basis for constructing a
timing model of the MGS algorithm in Algorithm 1.11. Using backwards stepwise regression with the initial model given by TH(M,N, G), the execution time
model of the 3-D MGS algorithm is found to be
TMGS(M,N,G) =N(1O.75+2.69M
+ G(2.21 + 2.34N + 2.70M + 0.75MN)).
(1.51)
33
It can be observed that, unlike TH{M,N, G), this model does not include the N 2
and GN3 factors. This is because at the ith step of the 3-D MGS algorithm the
affected subarray of A has dimension m x (n - i + 1) x G which implies that
the timing model is given by L~~2 <1>1 (MeS1,NeS2 - i + 1, G,es1,es2).
Similarly, the timing model of the SK sequence is found to be
TG{M,N, G) =N{74.08 - 26.05N + 2. 18N2 + 37.16M - 1.68MN
+ G{9.11N - 5.03N2 + 6.50M + 12.46MN))
(1.52)
+4.03GM.
From Table 1.4 and analysis of the timing models it may be observed that
the SK sequence algorithm has the worst performance. Furthermore, the MGS
algorithm is outperformed by the Householder algorithm when M > N. For
G = 1 the timing models of the 3-D QRDs algorithms have the same order of
complexity as their corresponding performance models for the single-matrix
2-D QRDs algorithms [83]. The analysis of the timing models shows that,
in general, the 3-D algorithms perform better than their corresponding 2-D
algorithms with the improvement getting larger with G. The only exception is
the 3-D Givens algorithm which performs worse than the 2-D Givens algorithm
when the number of CDGRs is large. Figures 1.4 shows the ratio between the
2-D and 3-D algorithms for computing the QRDs, where G = 16.
Table 1.4.
M
4
4
4
4
8
8
8
10
10
10
10
10
10
1
1
3
3
5
5
5
1
1
1
5
5
5
5
10
3
8
3
6
10
5
8
10
5
8
10
Householder
Exec.
TH(M,N,x)
x 10- 3
Time
0.87
1.48
2.60
5.65
9.07
16.26
25.71
1.74
2.51
3.02
16.76
25.55
31.22
0.88
1.51
2.59
5.65
9.06
16.26
25.86
1.76
2.55
3.07
16.76
25.51
31.34
Modified Gram-Schmidt
Exec.
TMGS(M,N,x)
X 10- 3
Time
1.12
2.06
3.28
7.66
11.46
21.35
34.29
2.34
3.54
4.31
21.51
33.44
41.08
1.13
2.05
3.26
7.63
11.47
21.33
34.48
2.33
3.51
4.29
21.55
33.35
41.22
Givens Rotations
Exec.
TG(M,N,G)
x 10- 3
Time
6.80
ll.4l
18.96
43.41
82.92
154.34
249.77
16.08
22.69
27.05
168.12
260.70
322.29
6.72
11.52
18.98
43.42
82.89
154.37
249.68
15.75
22.76
27.44
168.21
260.57
322.14
34
Modified Gram-Schmidt
311
Givens Rotations
Figure 1.6. Execution time ratio between 2-D and 3-D algorithms for computing the QRDs,
where G = 16.
6.2
Consider the simultaneous computation of the QRDs (1.47) using Householder transformations, where the matrices are not restricted to having the same
number of columns. However, it is assumed that the data matrices are arranged
so that their dimensions are in increasing order, that is, nl ~ n2 ~ ... ~ nG. The
QRDs can be computed by applying a total of nG CHTs in G stages. At the
end of the kth stage (i = 1, ... , G) the QRDs of AI, ... , Ak are computed and the
first nk rows of Rk+ 1, ... , RG are constructed. In general, if Vi: A (i,O) = Ai and
no = 0, then the kth stage computes simultaneously the G - k + 1 factorizations
QTi,kA(i,k-l) -_
R(i,k)
1
(1.53)
by applying nk - nk-l CHTs, where R~i,k) is upper triangular and a CHT comprises G - k + 1 single Householder transformations. Thus,
0) (I 0
. ...
Q1,2
ni _ 1
35
and
nl
n2 -nl
n3 -n2
ni-ni-I
R(i,l)
k(i,l)
k(i,I)
ft.(i,I)
R(i,2)
I
3
ft.(i,2)
2
ft.(i,2)
i-I
R(i,3)
Ri=
ft.(i,3)
i-2
i= 1, ... ,G,
R(i,i)
I
where
np+1 -np
R(i,P) 2
ft.(i,p)
2
ni-nj-I
ft.(i,p)
i-p+1
'
p= 1, ... ,i-l.
Figure 1.7 shows the stages of this method when G = 4. The orthogonal
matrix Q;k in (1.53) is the product of nk - nk-I Householder transformations,
(i,k)
f h k
say H (i,k)'
... HI
. At the pth (p = 1, ... ,nk - nk-I) step 0 t e th stage
nk
nH
36
----.. "1""'-
--.... ftl40--
-r~
______
- - - - - I
QT) ,1A(.ol
Stage I
,
n - nl
m-tll
1
m-f)L-_~L-____~
or
Stage 2
Figure 1.7.
...... - ny+
11.(0.11
Stage 3
An alternative approach (hereafter called scattering ) that does not necessitate inter-processor communication is to distribute the matrices evenly over
the processors. Each processor then factorizes its allocated matrices one at a
time using the SPMD paradigm. The main difficulty in this approach is determining how to distribute the matrices over the processors in order to achieve
load balancing. In the case where all the data matrices have the same dimen-
37
Algorithm 1.12 The task-farming approach for computing the QRDs (1.47) on
p (p
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
sion then each processor will hold G / p matrices, under the assumption that
G is a mUltiple of p; otherwise, the matrices Ai] , ... ,Ai,," are allocated to the
processor Pi (i = 1, ... , p), so that
I
(1.54)
38
The task-farming and scattering methods have been implemented on an 8processor distributed memory system - the IBM SP2 - using MPI (Message
Passing Interface), Fortran77 and the LAPACK QRD subroutines [3]. It was
assumed that the matrices were local to each processor, i.e. the task-farming
method is reduced to sending the index of a matrix to be factorized rather
than the matrix itself to the processors. The experiments were performed with
G = 50, m = 5000 and 50 ~ ni ~ 300 (i = 1, ... , G). Table 1.5 shows the
time in seconds, speedup and efficiency achieved by each method. The main
difference between the task-farming and scattering methods is that the latter
uses all processors for computation while the former uses one processor only
for communication.
Table 1.5.
The task-farming and scattering methods for computing the QRDs (1.47).
Method
Time
Idle time
Speedup
Efficiency
296.5
43.4
39.3
0%
12%
0%
1.00
6.83
7.54
1.00
0.85
0.94
Chapter 2
OLM NOT OF FULL RANK
INTRODUCTION
Consider the Ordinary Linear Model (OLM)
(2.1)
y=AX+E,
where A E Sltmxn (m > n) is the exogenous data matrix, y E Sltm is the response
vector and E E Sltm is the noise vector with zero mean and dispersion matrix
(J2/m. The least squares estimator of the parameter vector x E Sltn
argminET E = argmin IIAx- y112,
x
x
(2.2)
has an infinite number of solutions when A does not have full rank. However,
a unique minimum 2-norm estimator of x say, X, can be computed.
Let the rank of A be given by k (k :::; n). The solution of (2.2) is computed
in two stages. In the first stage the coefficient matrix A is reduced to a lower
trapezoidal form; in the second stage the lower trapezoid is triangularized. The
orthogonal decompositions of the first and second stages are given, respectively, by:
k
o)m-k
Lz
(2.3)
and
n-k k
(LI L2)P=
(0
L)'
(2.4)
40
n-k
QTAnp~ (~
o)m-k
L k
(2.5)
x= PL-1QT
2
2Y,
A
where
and
Numerous methods have been proposed for computing the orthogonal factorizations (2.3) and (2.4), on both serial computers and MIMD parallel systems
[16, 18,51,93].
Algorithms are designed, implemented and analyzed for computing the complete QLD on the CPP DAP 510 massively parallel SIMD computer (abbreviated to DAP) [70]. The algorithms employ Householder reflections and Givens
plane rotations. Algorithms are also proposed for reconstructing the orthogonal matrices involved in the decompositions when the data which define the
orthogonal transformations are stored in the annihilated parts of the coefficient
matrix A. The implementation and execution time models of all algorithms
on the DAP are considered in detail. All of the algorithms were implemented
on the 1024-processor DAP using double precision arithmetic. The timing
models are expressed in msec.
The computation of the QLD (2.3) using Householder reflections with column pivoting is considered. This method is also used when A is of full column
rank but ill-conditioned. Let the elementary permutation matrix I~i,Jl) denote
the identity n x n matrix In with columns n - i + 1 and Jl interchanged and let
Qf =Im-
h(i)h(i)T
hi
(2.6)
41
QT =
and
n=
i=1
To describe briefly the process of computing the QLD (2.3) let, at the ith (0 :$
i:$ k) step,
A (i) =
Vii
)m-i
i
'
LW
A permutation I~i'lIi) is equivalent to swapping first the elements ~n-i+1 and ~lIi
of the ~ vector and then swapping the elements n - i + 1 and Jli of the ~ vector
where, initially, ~i = ~i = i (i = 1, ... ,n).
2.1
SIMD IMPLEMENTATION
The QLD (2.3) has been computed on the DAP under the assumption that
m = Mes and 1 < m :$ es 2 . That is, the dimension of the matrix
A is an exact multiple of the edge size of the array processor and m/es 21=
M / es1 = 1. A sample of execution times has been generated from the application of a single Householder reflection on matrices with various dimensions.
A regression model has been fitted to this sample with the execution time denoting the response variable. The predetermined factors of the timing model
= Nes,
42
are derived from the number of es x es layers involved in the arithmetic computations and the number of times layers are replicated or reduced. The time
required to construct and apply the ith Householder reflection is found to be:
TI (M,N)
QT
Q~-J)es+j'
tK =
k- (K -1)es and
tl = ... = tK-I
= es. If mj = (M - i+ l)es
nj
same number of es x es layers of memory and so they have the same execution
time when applied on the left of A(i).
Algorithm 2.1 effects the computation of the QLD (2.3). The procedure
column_swap performs the appropriate column and element interchanges on
its matrix and vector arguments if nj - j + 1 f= Ilj. Initially, the squared Euclidean norms of the columns of A(i) (denoted by A) are stored in VI: ni For
greater accuracy VI :ni is recomputed prior to the applications of the Householder reflections in S(i). The Householder vectors are stored in the annihilated
positions of A and in the last k elements of V. Within the context of OLM estimation, Algorithm 2.1 can be applied to the augmented matrix (Y A) except
that the permutations are performed only on the columns of A.
The time spent in applying all the Householder reflections is I,f:, J tj TJ (mj, nj).
If A has full column rank - that is, k = n and tj = es (i = 1, ... ,N), then the
total time is:
In this case, the estimated execution time of Algorithm 2.1 is found to be:
TQdM,N) = N(105.92 + 31.79M +5.86N + 27.81MN -9.18N2).
This estimate is derived using the backward stepwise regression method, where
the explained variable is the execution time and the initial explanatory variables
are determined after evaluating T2(M,N) [83, 137]. Table 2.1 shows the high
accuracy of the timing models T2(M,N) and TQdM,N) when used to predict
the execution time of Algorithm 2.1. Notice that the time spent in applying the
Householder reflections is approximately 90% of the total execution time of
Algorithm 2.1, that is, T2(M,N)/TQdM,N) ~ 0.90.
43
1: let ~i = ~i = i (i = 1, ... , n)
2: fori:=1,2, ... ,Ndo
3:
let mi == (M - i + l)es, ni
4:
Vl: n; := sumr(A *A)
5:
forj:=1,2, ... ,esdo
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
column_swap(A, V, ~, ~, ni - j + 1,11 j)
h:=O
hl:m;-j+l := Ai:m;-j+l,n;-j+l
s:= sqrt(Vn;_j+J)
if hm;-j+l < 0 then s:= -s
hm;-j+l := hm;-j+l +s
Vn;-j+l := hm;_j+l
b:=s*hm;_j+l
Am;-j+l,n;-j+l := -s
W = matc(h,ni - j)
X = sumr(W *Al:m;-j+l,l:n;-j)/b
l,l:n;-j := l,l:n;-j - W * matr(X, mi)
.
2
Vl:n;-j'= Vl:n;-j- (Am;-j+l,l:n;-j)
21:
22:
23:
end if
24:
end for
25: end for
Two methods for computing the factorization (2.4) are presented. The first
uses Householder reflections and annihilates one row at a time of the Ll matrix;
the second uses Givens rotations for simultaneously annihilating elements in
certain rows of Ll.
3.1
44
Table 2.1.
Matrix dimension
m
n
256
256
256
320
320
320
480
480
480
480
480
Predictions x 1()'
T2(M,N)
TQdM,N)
Execution time
of Algorithm 2.1
32
64
224
32
96
288
32
96
288
352
416
0.58
1.56
10.56
0.70
3.58
20.12
0.99
5.31
32.82
45.38
58.90
0.56
1.56
10.56
0.70
3.58
20.12
1.00
5.31
32.82
45.38
58.90
Ratio
T2(M,N)/TQdM ,N)
0.51
1.40
9.61
0.62
3.26
18.50
0.89
4.89
30.64
42.38
54.98
0.91
0.90
0.91
0.89
0.91
0.92
0.89
0.92
0.93
0.93
0.93
92
.
+ Li,i'
Ci = y;s and Ik = (el ... ek). The apph-
L) gives
(-L-xL-T,: L-Yixei
A
T) , where
i
-9
x=(LLi,:+YiL:,i)/Ci.
(2.9)
The reflection (2.9) annihilates the ith row of L by modifying all of L but only
the ith column of L. After the reflections PI, ... , Pi (1 ~ i ~ k) have been
applied on the right of (L L), the first i rows of L are zero and L remains
lower-triangular. The orthogonal matrix P in (2.4) is defined as P = PI P2 Pk.
Figure 2.1 shows the annihilation pattern for L when using Householder reflections.
-n----k!
After Stage 1
After Stage 2
After Stage 3
o 0 om
mmmm e
mmmm e e
mmmm e e e
o 0 o e m
m m m em e
m m m em e e
e
e e
o 0 o e em
m m m e em e
@!l Modified
o Unchanged
@] Annihilated
Figure 2.1.
After Stage 4
e
e e
e e e
e e e m
D Zero Element.
=0 =k-
. eqUlv
. alent to Pj+t(i) , were
h
t (i) =
1S
~i-I
"-q=1 tq .
The reflection
45
.L*
4:
6:
7:
8:
9:
5:
10:
11:
12:
:=Yj*S
46
The timing model TCQdK,11) can be used to predict the total execution time
of Algorithm 2.2 which has negligible organizational overheads.
Notice that, if tl = ... = tK-I = es, tK = 0, K> 1 and 0 =I es, then the
execution time of Algorithm 2.2 will increase by (es - o)(K -1)(a2 +a311)
msec. Observe also that if D = (L* L~j) and fT = ((Lj)T Yj), then lines
8-10 of Algorithm 2.2 may be expressed as:
X:=
Dj+I:k,I:ii+1
j).
This formulation uses fewer es x es layers of memory than Algorithm 2.2 when
r(n+ 1)/esl However, on the DAP, the slight improvement in speed of
using fewer memory layers is lost to the overheads incurred in placing L~j in
D and then storing it in L.
11 =
3.2
A parallel Givens sequence (hereafter called PGS) requiring n - 1 compound disjoint Givens rotations (CDGRs) to compute the orthogonal decomposition (2.4) is proposed. The PGS does not create any non-zero elements above
the main diagonal of t and previously created zeroes of L are preserved. Let
Gi,j E 9tnxn denote a Gives rotation that annihilates Li,j when applied from the
right of (L t), and which affects only the jth and ith columns of L and t,
respectively. At most min (n,k) disjoint Givens rotations can be applied simultaneously: here it is assumed that n :S k. The elements annihilated after the
application of the ith CDGR, denoted by G(i), lie on a diagonal and their total
number is given by
ei= {
if 1 :S i
< n,
if n :S i:S k,
n+k-i ifk<i<n+k.
Table 2.2 shows the Givens rotations comprising G(i) of the PGS, the diagonals and the total number of elements annihilated after the application of G(i) ,
where di,j denotes the diagonal of L starting at element Li,j. In Fig. 2.2(a), i
(1 :S i < n) corresponds to the elements of L annihilated from the application of
G(i), where k = 16 and n = 22 (n = 6). Figure 2.2(b) illustrates an alternative
annihilation scheme equivalent to the PGS.
The application of G(i) affects ei consecutive columns of each L and L. Let
the submatrices comprising these columns be denoted by Vi) and t(i), respectively. Furthermore, let x = (XI, . .. ,xej ) denote the elements of Vi) annihilated
after the application of G(i) and the corresponding elements of t(i) used in
constructing the rotations be given by Y = (Yl, .. . ,Yej). The application of G{i)
47
Diagonals annihilated
ei
Gl,n
Gl,n-1 G2,n
Gl,n-2 G2,n-l G3,n
dl,n
dl,n-l
dl,ii-2
I
2
3
dl,1
d2,1
ii
ii
k+I
dk-ii+I,1
dk-ii+2,1
ii-I
k+ii-2
k+ii-I
Gk-l,I Gk,2
Gk,1
dk-1,1
dk,l
2
I
2
3
ii
ii+I
6
7
8
9
5
6
7
8
10 9
4 3 2 1
5 4 3 2
6 5 4 3
7 6 5 4
8 7 6 5
11 10 9 8 7 6
12 11 10 9 8 7
13 12 11 10 9 8
14 13 12 11 10 9
15 14 13 12 11 10
16 15 14 13 12 11
17 16 15 14 13 12
18 17 16 15 14 13
19 18 17 16 15 14
20 19 18 17 16 15
21 20 19 18 17 16
1
2
3
4
5
6
7
8
9
(i
= 1, ... ,n -
3
4
5
6
7
8
9
4
5
6
7
8
9
5
6
7
8
9
6
7
8
9
10
10 11
10 11 12
10 11 12 13
10 11 12 13 14
10 11 12 13 14 15
11 12 13 14 15 16
12 13 14 15 16 17
13 14 15 16 17 18
14 15 16 17 18 19
15 16 17 18 19 20
16 17 18 19 20 21
(b) Alternative PGS.
(a) PGS.
Figure 2.2.
2
3
4
5
6
7
8
9
ii
1) on the right of
(L L)
can be written as
48
where
G (i)
CI
( -SI
'.
.. .
SI
Ceo
I
-Se;
CI
Se
.. .
CI
C2
C2
CI
C2
CI
C = matr(c,k) =
I,
Ce;
C")
Ce;
...
C ei
and
S = matr(s,k) =
SI
SI
SI
S2
SI
S2
::
...
...
sel)
S
~'.
Se;
In the implementation of the PGS on the DAP, let n = 11es and k = Kes.
The application of the CDGRs is divided into the three phases <1>1, <1>2 and <1>3.
Phase <l>i applies the CDGRs in S(i) ( i = 1, 2, 3 ), where
\
S)i} =
G(j)
{
G(ii+~-I)
G(k+J-I)
if i = 1 and j = 1, ... ,n - 1
if i = 2 and j = 1, ... ,k - n,
if i = 3 and j = 1, ... ,no
Phase <1>2 is divided into the K -11 sub-phases <PI,, <PK-1l' where, at subphase <Pi, the diagonals dl,l, d2 ,1, ... ,des,1 of the (K + 1 - i)es x 11es submatrix
L(i-I)es+l:k,l:ii are annihilated. In phases <1>2 and <1>3 previously annihilated
es x es submatrices of L and L are excluded from the computations. Figure 2.3
shows the phases and the sub-phases of PGS, where es = 4. At the beginning
of sub-phase <P3, the top 8 x 8 submatrix is zero and is excluded from the
computation (2.10).
A timing model for the PGS needs to be derived in order to compare the performance of the Givens and Householder algorithms on the DAP. This model
can be obtained by adding the (parameterized) estimated execution times of the
different phases of the algorithm. Then the factors of the total execution time
model, say TROT(K, 11), are used in the stepwise regression to construct a timing
model of the Givens algorithm, say TpGs(K, 11), which also takes into account
the various organizational aspects of the implementations that do not occur in
the different phases. Furthermore, the analysis and comparison of TROT(K, 11)
and TpGs(K,11) will indicate any inefficiencies in the implementation of the
algorithm.
From experiments, the estimated execution time for constructing and applying the CDGR G(i) (i = 1, ... , n - 1) in (2.10) is found to be:
t(K,ei)
Phase CPt
Phase CP2
49
Phase CP3
7 1615
1716
7
.I~
1
2[1
32 1
Thus, the time required to apply all the CDGRs of the PGS is:
11
i=1
K-11
i=1
11
+ es LT(11 + 1 - i, 11 + 1 - i)
i=1
After expansion and using backward stepwise regression, the execution time
model of the PGS (including the overheads) on the DAP is given by:
In some cases, such as in the deletion of data from the OLM and in recursive constrained OLM, the orthogonal matrices must be formed [75, 85].
The algorithms used in computing the orthogonal factorizations (2.3) and (2.4)
50
Matrix dimension
n-k
k
32
192
192
192
352
352
352
352
512
512
512
512
512
Execution time
of Algorithm 2.2
TCQdK,TJ)
0.08
0.83
1.58
2.34
2.20
4.55
11.59
13.94
4.21
9.06
23.44
33.09
37.92
0.08
0.82
1.58
2.33
2.18
4.52
11.55
13.89
4.14
8.94
23.34
32.94
37.74
32
32
96
160
32
96
288
352
32
96
288
416
480
0.17
1.28
3.81
6.85
3.34
9.89
35.67
45.84
6.34
18.69
65.50
103.43
124.00
TROT(K,TJ)
0.15
1.11
3.19
5.67
2.83
8.38
30.65
39.07
5.33
15.99
58.23
92.44
110.49
TQ(M,k)
= es
(K -1)
2
OLMnotoffullrank
51
o.
-T
6:
== QI:m;,I:m' H == HI:m;,k;:k;+t;
and ~
== ~k;:k;+t;.
7:
8:
QT := QT - matc(sumc(QT * W) /~ j, mj) * W
9:
end for
10: end for
diagonal element of L == Am-k+l:m,n-k+l:n. The value of Cj in (2.8) can be
computed as -1iAm-k+i-I,n-k+i-l. It can be proved that the matrix p('J...) =
PI ... P'J... has a special structure which facilitates the development of an efficient
algorithm for computing P = p(k).
THEOREM
ii
pi') =
k-A
(D~')
o
I
)ii+A
(2.11)
k-A'
11 ef)
11 -
--LI"
C}
1-
ri
CI
'"
o
o
h-I
and
W(1)
== (
--Ll"
11 - )
C}
ri'" .
1-CI
52
Inductive step: Assume that p(A) has the structure defined in (2.11). It must
be shown that p(A+ 1) also has this structure.
Now,
P (A+I) -_ p(A)p1..+1
D(A)
-T
_D(A)LA+ 1,:LA+ 1,:
W(A)
CA+l
_ YI..+I IT
CA+I 1..+1,:
0
0
0
_ YA+l D(A)I
1..+1,:
CA+I
1- i+1
CA+I
0
0
0
h-A-l
and
W(A)
W(A+l) = (
CA
Notice that, given W(A), W(A+I) can be derived by computing only its ith
column W~~: ~). Thus, given p(A), the calculations of D(A+ I) and W~~: ~) can
be summarized as follows:
let x = -1- (D(A)I1..+1,: ) in
cl..+I
YA+l
and
where efiH+l is the last column of the identity matrix I fiH +1. Algorithm 2.4
shows, in data-parallel mode, the reconstruction of the orthogonal matrix P.
On the DAP the implementation of Algorithm 2.4 comprises /l phases <1>1 , ... ,
<1>1/' where /l = rn/es1-11 + 1 and 11 = r(ii + I) /es 1- During <1>1, .. , <1>1/-1 the
number of rows of the affected submatrix of PI :n, l:fi is multiple of es. Phase <1>i
(i = I, ... , /l) has Si steps with the jth step being equivalent to step (j + L~~\ Sq)
of Algorithm 2.4, where
<>= 11es-ii
{
Si= es
k - <> - (/l- 2)es,
ifi= I,
ifi=2, ... ,/l-I,
if i = /l and /l > 2.
53
4:
5:
6:
7:
8:
9: end for
matc(x,li) * matr(Li,:, Ii + i)
i *x
In Fig. 2.4 a grid and a shaded box denote the submatrices Pl:n,l:ii and Pl: ii+ Si ,l:ii
(i = 1, ... ,p), where k = 17, Ii = 6 and es = 4.
SI
Figure 2.4.
=2
S2
=4
S3
=4
S4
= 4
S5
= 3
where a = 0.78, al = 0.21 and a3 = 0.55. Thus, for ni = Ii + L~=l Sj, the estimated time of executing Algorithm 2.4 on the DAP, excluding the overheads,
is given by:
II
54
If ii
Si
= es (i = 1, ... ,11),
i=J
also given. As in the previous cases the timing models are highly accurate.
Table 2.4.
Times (in seconds) of reconstructing the orthogonal matrices QT and P on the DAP.
Algorithm 2.3
T{2(M,k) x 103
Mes = x and k = y
160
160
288
288
288
416
416
416
416
544
544
544
544
32
160
32
160
288
32
160
288
416
32
160
288
416
DISCUSSION
0.48
1.53
1.49
5.86
7.70
3.05
12.96
19.22
21.81
5.15
22.75
35.83
43.90
0.48
1.54
1.48
5.85
7.70
3.05
12.96
19.25
21.90
5.17
22.87
35.82
44.02
Algorithm 2.4
Tp(k,n - k) x 103
k=xandfi=n-k=y
0.60
3.90
1.51
8.71
20.96
2.82
14.99
34.46
61.25
4.51
22.76
50.58
87.81
0.61
3.88
1.53
8.68
20.87
2.83
14.98
34.40
61.08
4.52
22.79
50.56
87.83
55
Chapter 3
UPDATING AND DOWNDATING THE OLM
INTRODUCTION
In many applications, it is desirable to re-estimate the coefficient parameters of the OLM after it has been updated or downdated by observations or
variables. For example, in real time applications updated solutions of a model
should be obtained where observations are repeatedly added or deleted. In
computationally intensive applications such as model selection, regression diagnostics and cross-validation, efficient and numerically stable algorithms are
needed to solve models that have been modified by adding or deleting variables
or observations [10, 22, 24, 52, 138, 139].
Consider the OLM
Y =AX+E,
(3.1)
where y E 9\m is the response variable, A is the full column rank exogenous
m x (n - 1) matrix (m 2: n), x is the unknown vector of n - 1 parameters and
E E 9\m is the error vector with zero mean and covariance matrix (J2/m. Given
the QRD of the augmented matrix A = (A y)
QTA =
(R) m-n'
n
with
(3.2)
58
ADDING OBSERVATIONS
The updated OLM problem is the estimation of the BLUE of x in
where the BLUE of (3.1) has already been derived. Here the information added
to the original OLM (3.1) is denoted by
z) == b
(3.4)
Q~ (~) = (~),
with
Sn
'
(3.5)
the least-squares solution of the updated OLM is given by Rnxn = Un. where
Qn E ~(m+k)x(m+k) is orthogonal and Rn is an upper triangular matrix of order
n - 1. Thus. the (observations) updating problem can also be regarded as the
computation of the QRD (3.5) after (3.2) has been computed. This is equivalent
to computing the orthogonal factorization
(3.6a)
or
(3.6b)
where Qis an (n+k) x (n+k) orthogonal matrix. Notice that when (3.6a) and
(3.6b) are computed the orthogonal matrix Q~ in (3.5) is defined. respectively.
by
and
59
Yi =
7 9
6 8 10
5 7 911
4
3
2
1
6 8 10
5 7 9
4 6 8
3 5 7
2 4 6
3 5
4
(a) UGS-l.
10 11
9 10
8 9
7 8
6 7
3 4 5 6
2 3 4 5
1 2 3 4
1 2 3 4
8
7
6
5
(b) UGS-2.
(c) UGS-3.
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9
8
7
6
4 5
Figure 3.1. Updating Givens sequences for computing the orthogonal factorizations (3.6),
where k = 8 and n = 4.
The Householder method and UGS-l have been implemented on the DAP
(CPP DAP 510), using single precision arithmetic. Without loss of generality it
has been assumed that n = Nes and k = Kes, and K = 1 in the case of UGS-1.
60
Highly accurate timing models of the Householder and Givens methods are
given, respectively, by:
THup(N,K)
TGup(K)
2.1
126
363
8
199
720
16
347
1563
20
420
2047
32
40
64
641
789
3754
5103
1231
10164
kr
61
out, however, increasing the communication overheads between the PEs. Since
R is overwritten by It
",)'
"_
AT
","
(3.8)
The total time spent in applying the Householder reflections is thus given by:
n
<I>2(k,n) = L<I>I(k,n,i)
i=1
N
(3.9)
where N = fn/es21 and K = fk/esll However, the overheads of the implementation which arise mainly from the passing of arguments (subarrays) to
various routines are not included in this model. Evaluating <1>2(k,n), and using backward stepwise regression on a sample of more than SOOO execution
times, gives the following highly accurate timing model for the cyclic-layout
62
Householder implementation:
Calculations show that the residuals are normally distributed. Thus, the hypothesis tests made during the selection ofthis model are justified [137]. The
adequacy of the latter model, measured by the coefficient of determination, is
found to be 99.99%.
Using cyclic and column layouts to map respectively the matrices R and D
on to the PEs, a model for estimating the execution time of the ith Householder
reflection is:
+ C31n/es21,
(3.11)
where co, . .. ,C3 are constants. Evaluating I~I <1>3(k, n, i) and using regression
analysis, the execution time of the column-layout implementation is found to
be:
T2(k,n) = n( 13.51
(3.12)
From Fig. 3.2 it will be observed that neither of the implementations is superior in all cases. The efficiency of the cyclic-layout implementation improves
in relation to that of the column-layout implementation, for fixed k and increasing n. Table 3.2 shows that the column-layout is superior for very large k
and relatively small n.
The results above suggest that the application of the n required Householder
reflections be divided into two parts. In the first part, nl reflections are applied
to annihilate the first n 1 columns of A using cyclic-layout; in the second stage
the remaining n2 = n - nl reflections reduce to zero the submatrix D:,nl+l:
using column-layout, where D:,nl+l: comprises the last n2 columns of D. Let
tl (k, n, n t) be the time required to complete the first stage. Then, the total
execution time of the hybrid implementation is given by:
(3.13)
where, on the MasPar:
63
13
10.5
8
5.5
0.5
~~~~~m<'SOQS2">C
8192
12288
16384
20480
24576
k
28672
32768
36864
40960
Figure 3.2. Ratio of the execution times' produced by the models of the cyclic-layout and
column-layout implementations.
Table 3.2.
Cyclic
TI (k,n)
Column
T2{k,n)
22
30
38
46
54
94
62
70
78
86
14.06
20.39
27.43
33.52
38.91
75.94
43.60
50.62
58.13
66.56
14.68
20.39
26.53
32.45
38.38
75.88
44.30
50.17
58.74
67.31
3.04
5.63
8.68
12.18
16.65
49.46
21.79
27.90
34.46
41.71
3.02
5.40
8.47
12.23
16.67
49.46
21.80
27.82
34.35
41.56
The value ofnl, which may not be unique, is chosen to minimize T4{k,n,nl).
That is, nl is the solution of
argminT4{k,n,nt}
n1
subject to
{o ::; nl
::; n,
n 1 is integer
(3.14)
64
(3.15)
where 0 ~ n I ~ n. This estimation method does not, however, take into account the cost of remapping D:,nl+):' Better estimates of ni might possibly
be obtained by constructing more accurate timing models than that given by
(3.14), or by introducing a weighting function, say W(nI,k,n,es),es2), into
(3.15) which takes account of the implementation overheads.
Table 3.3 shows the execution times for the three implementations on the
MasPar, where the negligible time required to compute the estimates of n} is
not included and T1(k,n,n.) has been evaluated as IZ;) <P)(k,n,i). The estimates for ni obtained using (3.14) and (3.15) are denoted by 'h and nj, respectively. In most cases, the two estimation methods yield different values for
n I. The execution time for the hybrid implementation, using nj, is found to be
more accurate than those using iiI in approximately half of the cases. In some
instances, for very small n and relatively large K, the hybrid implementation
reduces to the column-layout implementation: that is, the estimated value of
n) is zero (see, for example, the case n = 32 and k = 12800). With the exception of a few cases, the hybrid implementation on the MasPar is more efficient
than both the cyclic-layout and column-layout implementations.
On the GAMMA a hybrid algorithm has also been investigated with ii} denoting an estimator of n} which is the solution of
nl
i=}
+ ~(n -
n})(n - n}
(3.16)
65
Execution times (in seconds) of the RLS Householder algorithm on the MasPar.
k/128
nl
Hybrid
with nl
n*I
Hybrid
with nj
Cyclic
Column
80
80
80
80
80
80
90
90
90
90
100
100
100
100
100
100
110
110
110
110
120
120
120
120
120
120
32
64
96
128
160
192
32
64
96
128
32
64
96
128
160
192
32
64
96
128
32
64
96
128
160
192
0
5
20
52
84
116
0
0
8
40
0
0
0
26
58
90
0
0
0
11
0
0
0
0
0
31
10.31
30.00
59.77
92.10
132.42
170.62
11.01
32.57
63.29
99.14
11.48
33.52
66.79
102.65
153.04
200.62
11.95
34.46
68.20
108.52
12.66
35.85
70.07
115.08
170.85
228.29
0
25
57
89
121
153
0
20
7
39
0
15
0
29
61
93
0
10
0
19
0
5
0
9
0
13
10.31
28.83
59.30
88.60
129.61
167.35
10.78
30.46
63.51
98.68
11.48
31.87
66.80
103.12
153.52
201.33
12.19
33.51
68.68
107.35
12.42
34.93
69.85
112.03
170.62
230.39
20.15
40.78
71.01
100.54
141.57
179.30
22.73
45.71
79.69
112.73
25.08
50.15
88.36
124.92
175.78
222.65
27.65
55.55
96.79
136.88
30.00
60.46
105.47
149.Q7
210.00
265.54
10.54
31.17
63.29
105.93
159.85
223.83
10.78
32.34
65.15
108.05
11.48
33.04
66.56
109.93
165.01
230.39
11.71
34.45
67.97
112.26
12.18
35.40
69.61
114.14
170.15
236.26
where w is the weight constant. The value of w which produces a more efficient
hybrid algorithm under iiI was found by experiment to be 0.2.
Table 3.4 shows the performances of the various algorithms on the GAMMA.
In most cases the hybrid algorithms perform better than the column-layout and
cyclic-layout algorithms. However, the hybrid algorithm based on the minimization of the estimated execution time, that is, on nj, is found to have the
least deviation from the best execution time.
For other SIMD systems, the value of n I may best be derived using the
straightforward minimization of the total number of memory layers used, rather
than minimizing the estimated time given by the performance model in (3.13)
which requires the time consuming re-determination of the coefficients of the
various timing models. However, if the remapping overheads are significant (as
in the case of GAMMA), then the minimization of (3.16) - the third method-
66
Table 3.4.
k/32
iii
Hybrid
with iii
n*I
Hybrid
with nj
nl
Hybrid
withnl
Cyclic
Column
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
128
128
128
128
128
128
128
128
128
128
128
128
128
128
128
128
8
16
24
32
40
48
56
64
72
80
88
96
104
112
120
128
8
16
24
32
40
48
56
64
72
80
88
96
104
112
120
128
0
0
0
7
0
1
11
48
58
68
78
89
99
109
119
127
0
0
0
0
0
0
0
3
0
0
0
84
95
112
120
128
0.62
2.09
4.43
8.79
11.73
19.18
23.41
26.66
18.95
25.70
31.98
38.10
47.52
56.31
63.84
69.69
0.62
2.09
4.40
7.58
11.66
16.55
22.28
32.71
36.42
44.74
53.88
45.63
57.93
72.84
81.25
88.65
0
0
0
8
0
0
7
15
23
31
39
47
55
63
71
79
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
0.62
2.10
4.43
8.75
11.72
16.64
24.00
28.29
34.83
40.98
46.61
51.68
39.22
46.89
53.81
60.11
0.62
2.09
4.41
9.80
11.66
16.55
22.28
33.16
36.44
44.73
53.89
70.24
74.90
86.62
99.19
121.04
0
0
5
13
26
34
42
50
63
71
79
87
100
108
116
124
0
0
0
7
0
0
6
14
60
68
76
84
98
106
114
122
0.62
2.10
5.88
8.79
14.04
18.68
23.01
26.84
20.53
26.69
32.32
37.37
47.93
55.89
62.54
68.38
0.62
2.08
4.41
9.82
11.66
16.55
25.74
31.18
24.35
32.13
39.25
45.63
59.43
69.49
77.91
85.30
3.15
6.23
9.31
11.97
16.85
21.49
25.82
29.65
24.21
30.38
36.00
41.06
49.69
57.65
64.29
70.14
3.97
7.84
11.72
15.07
21.23
27.09
32.55
37.37
30.57
38.36
45.47
51.86
62.78
72.84
81.25
88.65
0.62
2.10
4.43
7.62
11.72
16.64
22.41
29.03
36.63
44.96
54.17
64.24
75.31
87.09
99.72
113.22
0.62
2.08
4.40
7.58
11.66
16.55
22.28
28.87
36.43
44.74
53.88
63.89
74.90
86.61
99.19
112.61
2.2
67
Parallel Givens strategies that compute the factorization (3.6) in fewer steps
by using UGSs are considered. The strategies are modifications of the annihilation schemes reported in [30, 94, 103]. The first block-parallel strategy
uses the recursive doubling approach while the second strategy is based on the
Greedy annihilation scheme.
Let the matrix b be partitioned as
V' _ (R~O)) n
Qi,O
I -
Pi - n '
with
QiO =
(Oro)
oro nPi - n '
(3.18)
Qo=
it follows that
R(O)
R(o)
1..+1
68
QJ., (~~~\)
(Rn:,
j = 1, ... ,g"
(3.19)
gi= l(A+l-
~gj)/2J,
J=I
i-I
R(i-I) - R(i)
2g;+1 g;+1
J=I
R)i)
QT =
("
QfA+l)2-J
and
nT =
In
0
0
0
0
In
0
0
0
0
0
0
0
0
0
0
In
0
0
0
0
0
0
In
In
0
0
0
0
0
In
such that
R(i)
R(i)
n! Q! R(i-I) = n!
1
(i)
R(A,+1)2- i
(i)
R(A,+1)2- i
n(A+ 1)2-1
-T
-T
69
Ik+n-n(A+I)/2i-l'
Notice that atthe ith (i = 1, ... , log2 (A. + 1) stage the gi orthogonal matrices
can be applied to annihilate the last gi upper-triangular factors of R(i-I). In
order to simplify the description of this method let k = (2g - 1)n and V be
partitioned into sub-blocks VI, ... V2K -I, where Vi E ~nxn (i = 1, ... , 2g 1). This algorithm, hereafter called the bitonic algorithm, initially computes
simultaneously, for j = 1, ... , 2g - 1, the QRDs
T
(0)
(3.20)
QoDj=R
..
j,
j
h
At stage i (i = 1, ... ,g) the bitonic algorithm computes in parallel the UQRDs
R(i-l))
T
j
Qj,i ( R(i-I).
j+2{g-,)
((i))6
R
n'
. - 1, ... , 2(g-i) ,
J-
where R == R~~) and Ry) is upper triangular. After the gth stage
computed. The orthogonal matrix Qn (3.5) is defined as
where
Qf.o
T _ (
Q0
-
d l ,2)
I,i
dl,l)
I,i
dI,I)
Qi=
d 2,1)
l,i
2g -',i
d 2,I)
2g -',i
dl,i2,2)
d I ,2)
2g -',i
,2)
d22-',i
g
(3.21)
R == R~g)
is
70
and
Algorithm 3.2 The bitonic algorithm for updating the QRD, where R
1: let b T =
=R~~l'
: fo:~m:tel:~ .~~:j;;P'j:a(I~;O))
,
n
Pi- n
:::
fO:~m:~tel:~~:~o~;;p(l~\)
(R~p):
RJ+p+p
=
16:
end for all
17:
blocks := J1 + P
18: end for
Figure 3.3 shows the process of computing (3.21) using CDGRs, for the
case where n = 6. An integer I (I = 1, ... ,n), a blank and a denote, respectively, the elements annihilated by the CDGR G(l,j), a zero element and a
non-zero element. An arc indicates the rotations required to annihilate individual elements. An element of R~~;(~_j) at position (r,q) is annihilated by a
Givens rotation that affects the rth row of R~~-;~l-j) and the qth row of RY-I)
71
(r,q = 1, ... ,n and r::; q). In general n CDGRs are applied using this Givens
I~
I~
I~
1'1
.!.
I~
Il- l.
1l
1'1
1
C(4,j)
,.
,.
4~
~I
Figure 3.3.
C(2,j)
~
I~
I~
21. I.
2l- I-
1'2 l-
1'2
1'2
c(3,j)
,.
I~
3
Ie(
3.
I~
C(5,j)
.. .
'5(
1'3
c(6,j)
..
'6
The total number of CDGRs applied to compute (3.6a) using the bitonic
algorithm is given by
TI (n,k,'A.,p)
= max (To(pi,n))
= 1, ... ,'A.,
(3.22)
72
5
4 6
3 5 7
2 4 6 8
113 5 7 19
5
4 6
3 5 7
2 4 6 8
1 3 5 7 19
5
4 6
3 5 7
2 4 6 8
1 3 5 7 9
Figure 3.4.
lU 11 12 13 14 15
10 11 12 13 14
10 11 12 13
10 11 12
10 11
10
10 11 12 13 14 15
1U 11 12 13 14
10 11 12 13
10 11 12
10 11
10
16 17 18 19 2U 21
16 17 18 19 20
16 17 18 19
16 17 18
16 17
16
= P2 = P3 = 6.
The number of CDGRs applied to update the QRD using the UGSs is given
by
This indicates the efficiency of the bitonic algorithm for computing (3.6a) for
A > 2, compared with that when using the UGSs.
The second parallel strategy for solving the updating problem is a slight
modification of the Greedy annihilation scheme in [30, 103]. Taking as before n = 6 and k = 18, Fig. 3.5 indicates the order in which the elements are
annihilated. Observing that the elements in the diagonal of R are annihilated
73
4 7
3 6 9
3 6 811
2 5 8 10 13
2 5 7 10 12 15
2
2
2
1
1
1
1
1
1
1
1
1
Figure 3.5.
4
4
4
3
3
3
3
3
2
2
2
2
2
7
6
6
6
5
5
5
4
4
4
4
3
3
3
9
9
8
8
7
7
7
6
6
6
5
5
5
4
4
12 14
11 14
11 13
10 13
10 12
912
911
811
8 10
810
7 9
7 9
7 9
6 8
6 8
5 7
6
74
k=M
UGSs
bitonic
Greedy
15
15
15
15
30
30
30
30
60
60
60
60
5
10
20
40
5
10
20
40
5
10
20
40
75
150
300
600
150
300
600
1200
300
600
1200
2400
89
164
314
614
179
329
629
1229
359
659
1259
2459
72
87
102
117
147
177
207
237
297
357
417
477
43
47
50
54
89
96
102
107
187
198
208
217
sidered within the context of the SURE model estimation [84]. In this case
the performance of the Householder algorithm was found to be superior to
that of the Givens algorithm (see Chapter 1). The simultaneous factorizations
(3.21) have been implemented on the MasPar within a 3-D framework, using Givens rotations and Householder reflections. The Householder algorithm
applies the reflections H(I,j), ... ,H(n,j), where H(l,j) annihilates the non-zero
elements of the lth column of R;~;(~_i) using the lth row of RY-I) as a pivot
row (I = 1, ... ,n).
Table 3.6.
bitonic Householder
Householder
bitonic Givens
UGS-J
64
64
64
2
3
5
2
3
5
2
3
5
0.84
1.55
4.83
4.15
7.78
27.12
10.15
19.69
72.12
0.23
0.35
0.82
1.78
2.98
10.27
5.51
10.05
37.48
1.29
2.34
7.97
8.98
18.21
69.96
27.96
58.78
236.13
1.45
2.46
9.04*
9.77
19.45
76.96*
32.41
67.51*
278.20*
192
192
192
320
320
320
* Estimated times.
Table 3.6 shows the execution times for the various algorithms for computing (3.6a) on the 8I92-processor MasPar using single precision arithmetic.
Clearly the bitonic algorithm based on Householder transformations performs
better than the bitonic algorithm based on CDGRs. However, the straightfor-
75
2.3
Computational and numerical methods for deriving the estimators of structural equations models require the updating of a lower-triangular matrix with
a matrix having a block lower-triangular structure. Within this context the
updating problem can be expressed 'as the computation of the orthogonal factorization
-T
(0)t E-
(A(I))
;\(1) =
el
(G-l)K-E+eG'
(3.23)
where
K -el
A(1) =
K -e2
K -eG-I
-(I)
-(I)
e2
A2,1
e3
A31,
eG
AG,I
-(I)
A 3,2
-(I)
K-el
K-el
A(1) =
K-e2
K -eG-I
A
A(I).
-(I)
-(I)
A GG
, -1
A G,2
A(I)
LI
A( I)
A21,
A(I)
K-e2
K -eG-I
A(I)
L2
A(I)
LA(I)
G_ 1
A G- 12
,
AG_1,1
.
76
= 1, ... , G -
Y+
t~G)
A(G-I)
2,1
t(G-I)
2
0
0
TD(e,K,G) =
j= 1, ... ,G-i,
(3.25)
1=1
where e = (el, ... ,eG). Figure 3.6 shows the annihilation process for computing the factorizations (3.23), where G = 5 and iii denotes a submatrix eliminated at stage i (i = 1, .. . , G - 1).
Stage 1
Stage 2
Stage 3
Stage 4
",
I\.
Figure 3.6.
G=5.
Figure 3.7 illustrates various annihilation schemes for computing the factorization (3.24) by showing only the zeroed matrix A~2 j,j and the lowertriangular
77
are equivalent to those of block-updating the QRD the only difference being that an upper-triangular matrix is replaced by a lower-triangular matrix
[69, 75, 76, 81]. These annihilation schemes can be employed to annihilate
different submatrices of A(1), that is, at step i (i = 1, ... , G - 1) of the factoriza. (3 23) the sub matrices
.
A-(i)
. the
hon.
i+l,I"'" A-(i)
G,G-i can b e zeroedth
WI out usmg
same annihilation scheme. Assuming that only UGS-2 or UGS-3 schemes are
employed to annihilate each submatrix, then the number of CDGRs given by
(3.25) is
T~2v(e,K,G) = ~ max(K-ej+ei+j-l)
1=1
G-l
=(G-l)(K-l)+ L,max(ei+j-ej),
i=1
3 2 1
4 3 2
5 4 3
6 5 4
7 6 5
8 7 6
H 987
U 1() 9 8
112 1 1~ 9
4
5
6
7
8
9
Il~ I
114 1 1
15 l,n.
UGS-2
Figure 3.7.
15 14 13 12
14 13 12 11
13 12 11 1~
1" 11 111 9
11 1() 9 8
1~
9
8
7
6
5
4
9
8
7
6
5
4
3
8 7
7 6
6 5
5 4
4 3
32
2 1
UGS-3
6
7
8
9
6
7
8
9
3 1
42
6 3
7 6
3 1
4 2
6 3
7 6
3 1
1~ 42
1111(1 3
5
6
7
8
5
6
7
8
5
Bitonic
4
5
5
6
6
7
7
8
8
9
9
3
3
4
4
5
5
5
6
6
7
7
10 8
2 1
2 1
2 1
3 1
3 1
3 1
4 2
42
42
5 3
5 3
6 4
Greedy
4
4
4
4
4
4
4
4
4
4
4
4
3
3
3
3
3
3
3
3
3
3
3
3
2
2
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
HOUSEHOLDER
The factorization (3.23) is illustrated in Fig. 3.8 without showing the lower
triangular matrix J(1), where each submatrix of A:(I) is annihilated using only
the UGS-2 or Greedy schemes, K = 10, G = 4 and e = (2,3,6,8). This particular example shows that both the schemes require the application of the same
78
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
2
2
3
3
4
4
5
6
1
1
1
1
2
2
3
4
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 1
10 987 6 5 4 2
20 19 18 17 16 15 14 13 7 6 5 4 3 2 1
21 20 19 18 17 16 14 13 8 7 6 5 4 2 1
22 21 20 19 18 16 15 13 9 8 7 6 4 3 1
23 22 21 20 19 17 15 14 10 9 8 7 5 3 2
24 23 22 21 20 18 16 14 11 10 9 8 6 4 2
25 24 23 22 21 19 17 15 12 11 10 9 7 5 3
34 33 32 31 30 29 28 27 19 18 17 16 15 14 13 4
35 34 33 32 31 30 28 27 20 19 18 17 16 14 13 S
36 35 34 33 32 30 29 27 21 20 19 18 17 15 13 6
37 36 35 34 32 31 29 27 22 21 20 19 17 16 13 6
38 37 36 35 33 31 30 128 23 22 21 19 18 16 14 7
39 38 37 36 34 32 30 28 24 23 22 20 18 17 14 8
40 39 38 37 35 33 31 29 25 24 23 21 19 17 15 9
41 40 39 38 36 34 32 30 26 25 24 22 20 18 16 10
3
4
4
5
5
6
7
8
Figure 3.B.
The intrinsically independent annihilation of the submatrices in a blocksubdiagonal of A(1) makes this factorization strategy well suited for distributed
memory systems since it does not involve any inter-processor communication.
79
However, the diagonally-based method has the drawback that the computational complexity at stage i (i = 1, ... , G - 1) is dominated by the maximum
. ed to annihl
th
b
. A-(i)
A-(i)
number 0 fCDGRs requlf
1 ate e su matnces Hl,l'' HG-i,G-i.
(3.27)
Consider the case of using the UGS-2 scheme. Initially UGS-2 is applied to
annihilate the matrix ..1(1) under the assumption that it is dense. As a result the
steps within the zero submatrices are eliminated and the remaining steps are
adjusted so that the sequence starts from step 1. Figure 3.9 shows the derivation
of this sequence using the same problem dimensions as in Fig. 3.8. Generally,
for PI = 1, Pj = Pj-l + 2ej - K (l < j < G) and 11 = min(Pl, ... ,PG-d, the
annihilation of the submatrix Ai starts at step
si=Pi-Il+1,
(3.28)
-(i)
scheme on the Ai+l,l' ... ,Ai+G-i,G-i submatrices. Let the columns of each
submatrix be numbered from right to left, that is, in reverse order. The number
of elements annihilated by the qth (q > 0) CDGR in the jth (j = 0, ... , K - ei)
column of the ith submatrix Ai is given by
o
ei+1
(i,q-l) + (i,q-l) _ (i,q-l) + (i-l,q-l)
aj
rj _l
rj
aj
r j_ l
- rj
(i,q-l) + (i,q-l)
(i,q-l)
rK-ei_1
if j > q and j
if q = j = 1,
> k-
if j = 1 and q
> 1,
otherwise.
ei,
80
19 18 17 16 IS 14 13 1" 11 10 9 8 7 6 5 4 3 2 1
20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3
22 21 20 19 18 17 16 IS 14 13 12 11 10 9 8 7 6 5 4
23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5
24 23 22 21 20 19 18 17 16 15 14 13 12 11 Ie 9 8 7 6
25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7
26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8
27 26 25 24 23 22 21 20 19 18 17 16 IS 14 13 12 11 10 9
28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10
29 28 27 26 25 24 23 2..! 21 20 19 18 17 16 15 14 13 12 11
30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13
32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 IS 14
33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15
34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17
UGS-2
12 11 10 987 6 5
13 12 11 10 9 8 7 6
14 13 12 11 10 9 8 7
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
17 16 15 14 13 12 11 10 9 8 7 6 5 4 3
18 17 16 15 14 13 12 11 10 9 8 7 6 5 4
19 18 17 16 IS 14 13 12 11 10 9 8 7 6 5
20 19 18 17 16 15 14 13 12 11 10 9 8 7 6
21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3
22 21 20 19 18 17 16 IS 14 13 12 11 10 9 8 7 6 5 4
23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5
24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6
25 24 23 22 21 20 19 18 17 16 IS 14 13 12 11 10 9 8 7
26 25 24 23 22 21 20 19 18 17 16 IS 14 13 1" 11 10 9 8
27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 II 10 9
28 27 26 25 24 23 22 21 ~O 19 18 17 16 15 14 13 12 11 10
Modified UGS-2
Figure 3.9.
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 1
10 9 8 7 6 5 4 2
IS 14 13 12 11 10 9 8 7 6 5 4 3
16 15 14 13 12 11 10 9 8 7 6 5 4
17 16 IS 14 13 12 11 10 9 8 7 6 4
18 17 16 IS 14 13 12 11 10 9 8 7 5
19 18 17 16 IS 14 13 12 11 10 9 8 6
20 19 18 17 16 15 14 13 12 11 10 9 7
21 20 19 18 17 16 15 14 13 12 11 10 8
22 21 20 19 18 17 16 IS 14 13 12 11 9
23 22 21 20 19 18 17 16 15 14 13 12 10
24 23 22 21 20 19 18 17 16 15 14 13 11
2S 24 23 22 21 20 19 18 17 16 15 14 12
26 25 24 23 22 21 20 19 18 17 16 15 13
27 26 25 24 23 22 21 20 19 18 17 16 14
28 27 26 2S 24 23 22 21 20 19 18 17 IS
2
2
3
3
4
5
6
7
8
9
10
81
1
1
1
2
2
3
5 4
6 5
7 6
8 6
9 7
11 10 8
12 11 9
13 12 10
3
4
4
5
5
6
7
8
2
2
3
3
4
4
5
6
1
1
1
1
2
2
3
4
18 17 16 15 14 13 12 11
19 18 17 16 IS 14 13 12
20 19 18 17 16 IS 14 13
21 20 19 18 17 16 15 14 13 12 11 10 9 7 6
22 21 20 19 18 17 16 15 14 13 12 11 9 8 6
23 22 21 20 19 18 17 16 15 14 13 11 10 8 7
24 23 22 21 20 19 18 17 16 15 13 12 10 9 7
25 24 23 22 21 20 19 18 17 15 14 12 11 9 7
26 25 24 23 22 21 20 19 17 16 14 13 11 10 8
27 26 25 24 23 22 21 19 18 16 15 13 12 10 8 6
28 27 26 2S 24 23 22 20 18 17 15 14 12 11 9 7
29 28 27 26 2S 24 23 21 19 17 16 14 13 11 9 7
30 29 28 27 26 25 24 22 20 18 16 15 13 12 10 8
31 30 29 28 27 26 25 23 21 19 17 15 14 12 10 8
32 31 30 29 28 27 26 24 22 20 18 16 14 13 11 9
33 32 31 30 29 28 27 2S 23 21 19 17 15 13 11 9
34 33 32 31 30 29 28 26 24 22 20 18 16 14 12 10
5
5
5
6
6
7
7
8
3
3
3
4
4
5
5
6
1
1
Greedy
1
1
2
2
3
4
Modified Greedy
Figure 3.10.
to the submatrices of Ai (i = 1, ... , G - 1). The annihilation pattern of the (efficient) Greedy scheme is gradually reduced to that of the UGS-2 schemes.
Generally, the direct employment of the Greedy scheme is expected to perform
at least as efficiently as the UGS-2 scheme. In the example in Fig. 3.10 the
column-based method is the same using both the UGS-2 and Greedy schemes.
82
2.4
mi =
L mj -1'}i + l.
j=I
si=21'}i-1-Lmj,
i=l, ... ,x
j=2
and the total number of CDGRs required to compute the QRD (3.29) is given
by
TI (x,b,m,e,s) = mi +b+1'}x - 2 - min(sl' ... ,sx)'
(3.30)
where m, e and S are x element vectors with ith elements mi, 1'}i and Si respectively. Two examples are given in Fig. 3.11 for m = 17, n = b + 1'}x - 1 = 9,
b = 4 and x = 4. The framed boxes denote the submatrices AI, ... ,Ax, respectively.
For large structured banded matrices, the partition and updating techniques
employed in the bitonic algorithm can be used to reduce the number of CDGRs
applied. To simplify the description of the second parallel method (method-2),
let
i = 1, ... ,2P .
I.
5
4
3
2
I
6
5 7
4 6 S
3 5 7
.1. ..
I. I
I
i.
9.
4 0
3 5
24
1 3
o2
I}
7
6
5
4
10
9
8
7
6
i: 1 2 3 4
mj: 2 4 6 5
t'jj:
1 2 4 6
,nj: 2 5 9 12
s;: 1 -1 -3 4
I -1 3 0
o Initial non-zero
F ill-in
Figure 3.11.
~.
-3- 1 1 3
4- 202
1 2 5 6
o Zero
6.
1 3 5 7
o2 4 6
I 1 3 5
2 0 2 4
-3- 1 1 3
u 1-
S.
7 9
6 8
S7
4 6
1 1 3 S
2 0 2 4
titj: 6 9 812
5j:
I3 5
214
i: 1 2 3 4
mj: 6 4 2 5
t'jj:
4 . . 1
1 3
o2
1. I.
S lO
1 3 5 7 ~11
o 2 4 6 Sl( 11
-1 1 3 5 7 911
14 6 8 IG
3 5 7 9 1
k 4
83
eF))
R(j-I)
2;-1
O(j-I)
O(j-I)
R(j-I)
2;
h+ (2i -l)~'
b-~*
j,
where
(3.31)
QL is orthogonal,
O(j-I)
(2 j -
1-
(j_I)
( R ;-1
2
(j_I)) =
O(j-I))
in (3.31) be partitioned as
2j-I~*
b-~*
R(j-l)
1,2/-1
R(j-l)
2,2;-1
R(j-l)
3,2;-1
84
= (RU-I)
3,2i-1
OU-I))
2
'
where O~j-l) becomes non-zero during the annihilation process. Starting from
the main diagonal, the Ith CDGRs annihilates the Ith superdiagonal of R~j;~1
I = 1, ... ,b + (2 j - 1 - 1) t}*. In general, an element of R~j;~l at position q)
(1,
becomes zero after the application of the (q - I + 1)th CDGRs which rotates
and R~-l), respectively (q ~ I). In addition the
the Ith and qth row of R~j;~1
,
banded upper triangular structure of R~-l) is preserved and
( RU-1)
1,2i-l
RU-I)
2,2i-l
OU-l))
1
QL
QL is the product of
where
Figure 3.12 shows the process of applying the CDGRs for b = 8, t}* = 3 and
j = 2. The matrices are partitioned as
(
RU-.l)
IOU-I)
)
3,21-1
2
After the application of the first CDGRs, the 4th and 5th rows of O~j-l) become
oV-
b-~'
12 3 4
1 2 3
1 2
1
5
4
3
2
6
5
4
3
7
6
5
4
8
7
6
5
1 213 ,4
85
91C 11
8 9H
7 8 9
6 7 8
5 6 7
b+(2j-1 - l ) ~'
~.
= 3 and j = 2.
if 2t'}*
>
- m* ,
otherwise.
(3.32)
05g 5 p,
subject to { 2t'}* < m* < b + t'}* ,
86
Matrix A
After Stage 4
Figure 3.13.
= 4 and g = 1.
87
Table 3.7 gives examples of g and T4(m* ,b, 1'}* ,p,g) for some m*, b,
Table 3.7.
m*
10
15
30
40
50
100
1'}*
and p.
'i}*
8
12
10
20
30
80
6
10
5
5
20
3
4
5
6
7
8
3
4
0
0
oor 1
lor 2
58
175
218
463
2688
10678
40
2.5
Recursive least squares with linear equality constraints (LSE) problem can
be described as the solution of
argmin IIA (i)x - y(i) 112
subjectto
ex =
d,
for
i = 1,2, ... ,
(3.33)
where
e E ~kxn, X E ~n, d E ~k and (A(O) y(O)) is a null matrix. It is assumed that there are no linear dependencies among the restrictions
and also that the LSE problem has a unique solution which cannot be derived
by solving the constraints system of equations. That is, the rank: of e is k < n
and (A (i)T eT ) T is of full column rank:.
The two main methods which have been used to solve (3.33) are Direct
Elimination (DE) and Basis of the Null Space (BNS) [14, 93]. The performances of these two methods for solving the recursive LSE problem on an
array processor are discussed.
Yi E ~mi, Ai E ~mixn,
2.5.1
The BNS method uses an orthogonal basis for the null space of the constraints matrix to generate an unconstrained least-squares problem of n - k
parameters [93]. Let the QRD of eT be given by
k
with
Q= (QI
(3.34)
88
(3.35)
Y
where Z(i)
pT (Z(i) q(i)) = ( ~
Pi)
(3.36)
~'
h+ Ql'li) ,
where'1- i ) = Rjl Pi. To compute x(i+1), the updated solution of (3.33) at stage
i+ 1, let
ZHI =Ai+lQ2,
(3.37)
(3.38)
and
T
PHI
(Ri
0
Zi+1
Pi )
l1i
qHl
(Ri+l
0
0
PHI) n -. k
l1i+1
0
(3.39)
mi+l
where Ri+ 1 is upper triangular and Pi+ 1 is orthogonal. Then, solving the triangular system of equations
(3.40)
allows x(i+1) to be computed as
x(i+l) =
h+ Q2y(i+l)
(3.41)
89
2.5.2
THE DE METHOD
Let Q E 9lkxk be orthogonal, IT E 9lnxn be a permutation matrix and L be
(3.42)
and x = nT x, where
X=
(~~) ~-k
== J-CXI'
(3.43)
where
If
and
then
A(i)X-y(i) =A(i)nnT x-y(i)
= (..1(i)
(A (i) -
= Z(i)XI - q(i).
(3.44)
90
For
n-k
Ai+l II =
(Ai+l
"'i+l)
and
(3.45)
the factorization (3.39), and (3.43) can be used to compute x(i+l) where, now,
Ri+lXl
= Pi+l
2.5.3
IMPLEMENTATION AND PERFORMANCE
The main difference between the matrix computations required by the BNS
and DE methods is that the mi+l x n matrix Ai+l and the n x (n - k) submatrix
Q2 used in the former correspond to the mi+ 1 x k and k x (n - k) "'i+ 1 and t
matrices, respectively, used in the latter. Since k < n, the DE method is always
faster than the BNS method.
The BNS algorithm has been implemented on the SIMD array processor
AMT DAP 510 [67]. Householder transformations have been employed to
compute the orthogonal factorizations (3.34) and (3.39). For the matrix-matrix
multiplications and solution of triangular systems the outer product and column sweep methods have been used, respectively [67]. The analysis of the
relative performances of the BNS and DE methods has been made by using accurate timing models. The assumption that the data matrices have dimensions
which are multiples of the edge size of the array processor has been made. As
expected, the analysis of the timing models shows that for all cases the DE
method outperforms the BNS method, with the difference in performance between the two methods decreasing as k increases for fixed n. This is illustrated
in Table (2.5.3).
QnAn =
TA
(R)k+n
0 m-k-n'
with
R_
-
Sn
'
(3.47)
91
= 96, n =
BNS
DE
10
10
10
10
10
18
18
18
18
18
18
18
18
1
3
5
7
9
3
5
7
9
11
13
15
17
6.13
4.31
2.76
1.48
0.46
16.80
13.66
10.82
8.26
5.98
3.97
2.22
0.74
3.76
2.87
2.02
1.21
0.43
10.23
8.71
7.26
5.86
4.52
3.21
1.95
0.70
the least-squares solution of the modified OLM is given by Rnxn = Un, where
Qn E 9tmxm is orthogonal and Rn is an upper triangular matrix of order k+n-l.
If (3.2) has already been computed, then (3.47) can be derived by computing
the smaller QRD
Q-T(B2
set ) = (R)k+1
0 m-n-k'
(3.48)
with
QTB = (Bt) n - 1 .
B2 m-n+ 1
(3.49)
The orthogonal matrix Qn and upper triangular factor R in the QRD (3.47) are
given, respectively, by
(3.50)
and
R=Note that A~ An
(~
t :n) ~ 0
(3.51)
~~~) =- (~~~
BfB~?k~Rn
(~~yT ~ ~~:
T
yT B yTY
u R uT + Rn
A
Sn
Bt
u~
uR:A~un
Bf
). (3.52)
T
u U+ u~ Un + ~
92
whereB= (B
y)
u) [67].
andBl = (Bl
DELETING OBSERVATIONS
A== (A y) ==
(~l)
== (AI
A2
A2
Yl)m-k
Y2 k
and
(1) m-k
k
.
2
The downdating problem can also be expressed as the computation of the QRD
with
R ==
(Rno un) n1 Sn
(3.54)
when (3.2) is known. The restriction n ::; k < m is imposed to guarantee that at
least one observation is deleted and that the resulting downdated OLM is overdetermined. Furthermore, it is assumed thatA2 has full column rank. Note that
if the deleted observations do not occupy the first m - k rows of the OLM, then
A and Q in (3.2) need to be replaced by OTA and OT Q, respectively, where OT
is a permutation matrix, whose effect is to bring the deleted rows of A to the
top [15,67].
If
m-k
Qf2)n
Qf2 m-k
T
k-n
Q32
93
in (3.2) is known, then the downdating problem can be solved in two stages
[67]. In the first stage the orthogonal factorizations
HT
(Q~I
Q~2)
Q
Q
31
32
and
GT (Qfl
q}2
Z
Q
22
(Z
0
A)
= (D
0
0
q~2) m - k
Q 32 k-n
A~
Q 22
E)
m-k
B n
(3.55a)
(3.55b)
are computed, where H E sx(m-n) x (m-n) and G E sx(m-k+n) x (m-k+n) are orthogonal, Z is upper triangular and IDI = Im-k. In the second stage the QRD
(3.56)
is computed, where R correspond,s to the upper triangular factor of the QRD
(3.54). Observe that
0) (1
(G
o h-n <>
T
it follows that
(3.58)
and since (Qll
ZT
= Im-k- QllQfl
(3.59)
Thus, if Q in (3.2) is not available, then (3.58) and (3.59) can be used to derive
Qll and Z [39, 67, 75, 81].
94
where l denotes the imaginary unit [15, 20, 67, 75, 93,122]. Thus, the problem
can be seen as the updating of the QRD (3.2) by tAl:
QT (tAl)
R
(i?)0 m-k
n
(3.61a)
or
(3.61b)
The QRD updating (3.61) can be obtained using hyperbolic Givens rotations
or Householder transformations. Hyperbolic Householder reflections have the
form
i= 1, ... ,n,
(3.62)
A(l) IS
" the lth column of AI,
A Yi = J{i,i
A + S, S = SIgn
. (A
)(1'.2
II A(I)112) '
where A:,i
J{i,i J(i,i - A:,i
Ci = sri and ei is the ith column of In. Application of hyperbolic Householder
transformations does not involve complex arithmetic. However, it should be
noted that this downdating method is not numerically stable.
4.1
PARALLEL STRATEGIES
On the DAP the factorizations (3.55) and (3.56) have been computed under
the assumptions that n = N es, m = Mes, m - k = es (es observations have been
deleted) and that, except for R, none of the other matrices is required. That is,
the orthogonal matrix Qn is not computed.
In the first stage (3.55a) is computed using Householder transformations
with total execution time (in msec)
for X
=M -
N.
UGS-l can be used, as shown in the updated OLM, to compute (3.55b). That
is, the orthogonal matrix GT is the product of the CDGRs of UGS-l which
computes the updating
95
UGS-3 can also be employed to compute the equivalent form of the orthogonal
factorization (3.55b)
GT (Ql1
q12
QI2
(3.63)
Ql1
Z
IA ) n
10 m-k
withanumberi(i= 1, ... ,18)inthematrices (Ql1 ZT) T and (AT OT) T denoting, respectively, the elements annihilated and filled in when the ith COGR
is applied. On the OAP the execution time model for computing (3.55b) using
UGS-l is found to be
(3.64)
(Zi,i+O"i)ef)
(3.65)
which annihilates the ith column of (Ql1 Q12) , denoted by q, and which also
reduces Zi,i+l: to zero and makes IZi,d = 1 (i = 1, ... ,m - k). Here, ei is the ith
column of Im-k and O"i = 1 if Zi,i > 0, and O"i = -1 otherwise. Let
ni ( D(i)
0
m-k-i
m-i
Q(i)
~(~))
EO(I)
(Q(i)
R(i))
== Z(i) E(i)
Z{i)
Q(i-l)
=Hi ( Z{i-l)
R(i-l)) n
E(i-l) m-k'
(3.66)
where Q(O) == (Ql1 Q12)' Z(O) == (Z QI2)' R(O) == R, E(O) == 0, ID(i) I = Ii,
and in (3.63), B == R(m-k) and E == E(m-k). From (3.66) it follows that
(3.67a)
96
00000
U.OOOOl~.
11~.000~~1
~~.oo~lnl
~ll.0~119rI5
141:
13 11'
.~2~I:U~4
3 5 7 9 lit
1 11 9 7 5 3
246 1 8l0~
13 1 57~1
11 1 ~7 1 531
1211086 142.
2468[(
3579
468
108642
9753
864
5 7
7 5
lSI!
l'
~O
161
l~
1~
,8
1'l
1~
9
8
7
6
5
4
3
11
13
9 10 I
8 9 III .1
7 8 9 10
6 789
5 6 7 8 9
4 5 678
23 4 5 6 7
112 3 4 15 6
.0 0 0 0 0 1
000 0 1~
.00 0 1~
.00 11211
.0 !2r21
!3122 tn
1 1 11 11] 19 8 7 16 5 14
1'1 I 11 9 8 7 6 5
1 10 9 8 7 6
1:
1
1 11111] 9 8 7
1 l'
1 12 III III 9 8
In3 11 1110 9
1 1:
o Zero element
321
43 2
543
6 5 4
7 6 5
8 7 6
[I] Fill-in
[!] non-zero element
Figure 3.14.
97
(3.67b)
(3.67c)
and
(3.67d)
Algorithm 3.3 computes (3.63) using Householder transformations. For convenience, the superscripts of the matrices Q(i), R(i), Z(i) and E(i) have been
dropped. Notice that the observations that have been deleted from the OLM
are generated by the 10th step of the algorithm. This step together with the
steps that modify Z, i.e. steps 11 and 12, can be deleted.
4:
5:
6:
7:
8:
9:
10:
11:
12:
y:=Zi,i+cr
~ := 1 + IZi,il
W := sum(spread(Q:,i,2,n) *R, 1)
forall(j = 1, ... ,n) Rj,: := Rj,: - Qj,i *W /~
W:= cr*Zi,i: +el
forall(j = 1, ... ,n) Qj,i: := Qj,i: - Qj,i *W /~
E i ,: := -cr* W
Zi,i+l: := 0
Zi,i := -cr
13: end for
The total execution time required to implement Algorithm 3.3 on the DAP,
without modifying Z and without regenerating the deleted data E, is given by
98
is the time needed to compute (3.61b) using hyperbolic Householder transformations. The method which uses Algorithm 3.3 - called the Householder
method - is inefficient for large N due to the triangularization of Bin (3.63).
However, it performs better than UGS-l for N = 1,2,3. The disadvantages of
UGS-l, which necessarily computes the transformation of Z to D and regenerates E, indicates that the efficiency of the Householder method might increase
with respect to UGS-l as the number of observations to be deleted increases
- that is, for larger m - k. Use of hyperbolic Householder transformations
to solve the downdating problem produces the fastest method. However, the
execution efficiency of this method is offset by its poor numerical properties.
Table 3.9.
T.vl (N)
1's2g(N)
1's2h (N)
T.v2L(N)
T3(N)
1
2
3
4
8
16
32
0.15
0.21
0.27
0.32
0.55
1.01
1.92
0.50
1.01
1.64
2.40
6.75
22.62
90.l5
0.l2
0.26
0.48
0.77
2.65
9.87
38.17
0.l6
0.50
1.13
2.l7
12.62
86.81
647.46
0.27
0.76
1.61
2.94
15.27
96.69
685.63
0.19
0.45
0.79
1.19
3.55
12.l4
47.32
T _
Qll-
C'WIT) .,
:
n2
:
WJ
ng
and
nl
n2
ng
All
,
Al ,2
Al,g
A22
,
A2,g n2
A=
Ag,g ng
Gi
(lti
Zi-l
.. .
Ai,i ...
nl
Ai,g) _
(i-I) Eg
(0
Zi
'b) T.
For
(3.68)
== (E~g) ., . E~g))
99
B=
ni
n2
RII,
RI2,
R2,2
ng
RI,g nl
R2,g n2
Rg,g ng
The updating (3.68) can be computed using the UGS-2, UGS-3, or Householder methods. For the QRD of B the orthogonal factorizations
ni
Qf(Ri,i ... Ri,g)= (Ri,i
ng
...
Ri,g),
QT =
(Of .~)
Rl,1 ~1,2
and
R2,2
R= (
~I,g)
R2,g
Rg,g
Downdating of the regression model (3.1) after removing a block of regressors is a straightforward operation. Partitioning the data matrix A as
nl
A= (AI
n2
n3
A2 A3)'
o m-n*'
with
R=
(3.69)
100
o? (i?2.3)
=
i?3.3
(R02.3) n3n2 + 1
(3.70)
R == (i?l.l ~1.3).
R2.3
Hence, the algorithms for solving the updating (3.6) can be employed to compute (3.70).
The re-triangularization procedure above can be extended to deal with the
case where more than one block is deleted. Here the more general problem of
deleting arbitrary regressors from the OLM will be considered. Furthermore,
in order for the proposed strategies to be used in other related problems and
to have a consistent notation, it will be assumed that the matrix i? in (3.2) is
a K x (K + G) upper trapezoid, i.e. m = K and n = K + G. Thus, within this
context, the regressors-downdating problem is redefined as the computation of
the orthogonal factorization
K
R = (R(l)
with
G
R(2))K,
(3.71)
S=
( so SO)KG'
let
(3.72)
where the jth column of the permutation matrices S and S are given, respectively, by the ).jth column of lK and the ).jth colu~ of lG, ). = ().l ... ).k) and
101
).1 < ... < ).k. The QRD of the RS matrix can be derived by computing the
QRDs
with
(3.73)
and
QT(~n=mk-k-g'
(3.74)
such that the orthogonal matrix (),T and the upper triangular factor
QRD (3.71) are given respectively by
R in
the
_- (ROQ-TR
R(2)
R-
(3.75)
Tu().,k)=(k-k)-,u+l,
(3.76)
= 2j+k-).j+k + 1 for
j = 1, ... ,k-k.
The CDGRs comprise rotations between adjacent planes and the non-zero elements in the (j + k)th column of RP) are annihilated from bottom to top by
the O"jth CDGR, where O"j = Pj -,u+ 1 (j = 1, ... ,k-k).
The computation of the QRD (3.74) can be obtained using any known parallel Givens annihilation scheme such as those reported in [30, 75, 103, 129].
However, the factorization (3.74) can start simultaneously or before the completion of the QRD of R~I) in (3.73). Thus, if the SK (Sameh and Kuck) annihilation scheme in [129] is used for the factorization (3.74), then the total
number of rotations applied to compute the QRD of RS is given by
(1)
TRS ('A.,k,g,K) =
{Tu().,k)
K -k+g-2+max(0,2-K +).k)
ifg=O,
if g> 0.
(3.77)
102
-- k
lele Ie ele Ie e
Ie Ie ele Ie e
30 Ie ele Ie e
' .OJ
-Ie Ie e
lele e
~J!ol,) Ie e
ele
ele
ele
ele
ele
ele
Ie
-Ie
Ie
r:I~
e
e
e
e
e
e
e
e
e
Ie e ele e
~9 e ele e
2D~ ele e
2H~ 311e e
ele Ie
ele Ie
ele Ie
ele Ie
Ie Ie
-Ie Ie
Ie Ie
Ie
~ ~:l
tlli
'11
19~1
~1
~,
I":l
'1, '1'7
..,
!olD
J.,
13:
121 2.1
25~1
"'''
131151'1
ele
ele
eie
ele
ele
ele
ele
ele
Ie
Ie
?Q 'l~
t\2
InB
:12
?"
?~
B
,10 l'
i911
6 18 I~
517 9
416 8
9
31$
214
8
1 i3
7
I"
."
Figure 3.15.
,?'
u
21 21
2~ 22
...
U 20
I ' III
SK scheme
"'II
.,,,
~o
~5 11 119
Ehmmate numbers
ee ee ee ee ee
15 e ee ee ee ee
1416 ee ee ee ee
1315 17 e ee ee ee
416 18 e ee ee e
315 1719 ee ee e
1214 1618 ~O ee ee
113 1517 1921 ee e
012 1416 1820 22 ee
Subtract m = 14
This method for computing the QRD of RS is called the SK-based scheme
and its derivation from the SK Givens sequence is illustrated in Fig. 3.15,
where K = 30, k = 7, g = 3 and ~ = (4,10,11 , 12,15, 21,22) or, equivalently,
k = 8, g = 2 and ~ = (4,10,11 , 12, 15,21,22, 30). Initially the SK scheme is
employed to triangularize a K x e full dense matrix, where a number i denotes
the elements annihilated by the ith (i = 1, . .. , K + e - 2) CDGR and a denotes a non-zero element. Then the numbers at positions ~ j + 1 to K of the jth
(j = 1, . . . , k) column of the matrix - that is, the numbers in the zero part of
the matrix - are eliminated. The minimum number, e.g. m, from the remaining
positions in the matrix is computed. Finally the SK-based scheme is derived
by subtracting m -1 from each number. Observe that for g = 0 the SK-based
scheme is identical to the parallel algorithm reported in [74], while, for k = 0,
the SK-based scheme is identical to the SK Givens sequence for computing
the QRD of a K x g matrix. However, for g = 0 the complexity analysis of
103
the SK-based parallel strategy will hold for all cases, in contrast to that in [74]
which fails if 3j : 1..j = j.
The number of CDGRs applied to compute the QRD of RS in (3.71) can be
reduced if the maximum number of elements is annihilated by a single CDGR
[30, 69, 103]. The rotations are not restricted to being between adjacent planes
and previously annihilated elements are preserved. Let r)q) = La)q) /2J denote
the maximum number of elements in column j (j = 0, 1, ... ,k) of RS that can
be annihilated by the qth (q> 0) CDGR, where a dummy column at position
zero has been inserted with 5.0 = 0 and Vq: a~q) = o. At step q an element in
position (I, j) of RS is annihilated by a Givens rotation in planes I and (I where I
r)q) ) ,
TA;) (1..,k,g,K)
IOgK + (g-I)loglogK
{
= logk+ (e-l) loglogk -log(k- 1..kj-l)
logk + (e-l) loglogk -Iog(k - Ak+l )
if k = 0,
if g = 0,
if k,g
> 0,
(3.79)
where 1..k+ 1 = 0 and k = 1..k - k. This result derives from the complexity of the
Greedy algorithm when applied to a dense matrix [103]. Clearly, the Greedybased scheme requires no more CDGRs than the SK-based scheme.
104
5
2
2
1 5
1 4 4 8 .
1 3 6 4 7 10
1 4 7
3 6 9 .
3 5 8 . 3 6 9 1
3 6 811
2 4 7 lU 3 6 911 14
2 5 8 10 13
2 4 6 9 12 3 5 811 1316
1 3 5 8 1114 3 5 8 IU 1315 18
2 5 7 10 1 15
2 5 7 10 1215 17121J
2 4 7 9 1 14 17
135 710 1316
124 6 9 1215 18.
2 4 6 8 1114 17~
3 5 7 10 13 Hi 19
4 6 9 1 1518
2 5 811 1417
1 4 7 10 1316
3 6 9 1 15
2 5 811 14
2 4 7lU 13
1 3 6 9 12
1 3 5 811
1 2 4 7 10
2 4 6 9
3 5 8
3 5 7
2 4 6
246
135
1 3 5
1 2 4
1 2 3
Greedy-based scheme
Figure 3.16.
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
5
4
4
4
4
4
4
3
3
3
3
3
3
3
2
2
2
2
2
2
2
Chapter 4
THE GENERAL LINEAR MODEL
INTRODUCTION
Consider the General Linear Model (GLM)
(4.1)
X,u
subject to y = Ax + Bu,
(4.2)
QT (y
A) =
(~T ) == (~
Y2
n-l
)m-n
RT
n-l
(4.3a)
106
and
1 n-l
)m-n
0
pOl
Lz2
(4.3b)
n-l
where RT, Lll and L22 are lower triangular non-singular matrices. The BLUE
of x, say X, is obtained by solving the lower triangular system
QT
mt
1
Q'A~ (~)
~
Ai - m - mi'
I
Yi
n-l
~i
(4.4a)
and
mi
(QfB)P, ~ L, "
m, (~II
m-mi Lzl
m-mi
0
Li
),
(4.4b)
107
Aj
Lj
j)
Uj
(4.5)
It follows that, since
GLLSP
L aj
= 0, aj =
(4.6)
Uj,X
Thus, after the application of the (n(n + 1) /2)th left and right Givens rotation
Paige's sequence solves reduced GLLSPs.
Similarly, the column-based Givens sequence shown in Fig. 4.1 (b) can be
used to compute (4.3). However, this sequence annihilates complete rows of
A more slowly than Paige's sequence. As a result the column-based sequence
performs twice as many operations when m n [111].
36 28 21
44 35 27
52 43 34
60 51 42
68 59 50
76 67 58
84 75 66
92 83 74
lOCI 91 82
3
5
8
12
24 17
40 31 23
48 39 30
56 47 38
64 55 46
72 63 54
80 71 62
HIE 97 88 79 70
0596 87 78
~04 95 86
15
20
26
33
41
49
57
65
73
10~ C)C) 90 81
10'7 98 89
10
14
19
25
32
6
9
13
18
1
2
4
7
11
16
22
29
37
45
53
61
69
77
~o~ 94 85
~02 93
01
Figure 4.1.
99 88
lOCI 89
101 90
0291
103 92
104 93
105 94
ll1E 95
10'7 96
10~
76 63 49 34 18
77 64 50 35 19
78 65 51 36 20
79 66 52 37 21
80 67 53 38 22
81 68 54 39 23
82 69 55 40 24
83 70 56 41 25
84 71 57 42 26
97 85 72 58 43 27
98 86 73 59 44 28
87 74 60 45 29
75 61 46 30
62 47 31
48 32
33
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
(b) Column-based
108
PARALLEL ALGORITHMS
The Paige's sequence shown in Fig. 4.1(a) annihilates, diagonal by diagonal, the elements of A and the elements of each diagonal are annihilated by
successive Givens rotations from the bottom to the top. The annihilation of
a diagonal can start before the previous diagonal has been completely zeroed.
This leads to the SK sequence, a parallel Givens sequence developed by Sameh
and Kuck [129]. The SK sequence computes the QLD of an m x n matrix by
applying m + n - 2 CDGRs for m > n and 2n - 3 CDGRs when m = n. Figure 4.2 shows the SK sequence for m = 18 and n = 8. Each CDGR preserves
the previously created zeroes and a single element is annihilated by rotating
two adjacent planes. The total number of elements annihilated by the application of the ith CDGR is given by e(i, m, n), where, for m 2: 2n
e(i,m,n)
= {min(l(i ~ 1)/2J +
m+n-l-l
1, n)
if 1::; i < m,
if m ::; i ::; m + n - 2,
m+n-i-l
if 1 ::; i < m,
if m ::; i
< 2n,
if 2n ::; i ::; m + n - 2.
Figure 4.2.
The SK sequence.
109
Figure 4.3.
= 8.
n = 8 and G(I6) is applied to the left of (y A B). The zero (shaded) layers
of (y A) and B are not affected by the application of G(I6). Similarly, the
zero layers of G(16)B are not affected by the right application of p(16) which
annihilates the fill-in of G(I6) B.
0
0
0
0
0
(y A)
[Q) Annihilated [!] Fill in [!] Non zero
Figure 4.4. The application of the SK sequence to compute (4.3) on a 2-D SIMD computer.
110
Modi and Bowgen modified the SK sequence, hereafter called the MSK
sequence, to facilitate its implementation on the DAP using systems software
in which operations can be performed only on es x es matrices and es element
vectors [104]. The MSK sequence requires a greater number of compound
rotations to compute the QLD (4.3a) than does the SK sequence, but it has the
advantage that the CDGRs are not applied to any previously annihilated es x es
layers. The MSK sequence can be generalized so that MSK(p) partitions A
into n/ p blocks (each consisting of p columns) and applies the SK sequence to
these blocks one at a time (where it is assumed that n is a multiple of p). Thus,
the total number of CDGRs needed to compute the QLD (4.3a) using MSK(p)
is given by:
nip
Note that the SK sequence and the MSK sequence in [104] are equivalent to
MSK(n) and MSK(es), respectively. Figure (4.5) shows MSK(4) and MSK(8),
where m = 24 and n = 16. The shaded frames denote the submatrices to which
the SK sequence is applied to transform the matrix into lower triangular form.
73 ~1 69 67 5553 51 49 3331 29 27 7 5 3 1
74 ~2 70 68 5654 52 50 3432 30 28 8 6 4 2
75 73 71 69 5755 53 51 3533 31 29 9 7 5 3
76 74 72 70 5856 54 52 3634 32 30 ~o 864
77 ~5 73 71 5957 55 53 3735 33 31 ~1 975
78 76 74 72 64J 58 56 54 3836 34 32 ~2 10 8 6
79 ~7 75 73 6159 57 55 3937 35 33 ~3 1197
80 ~8 76 74 6260 58 56 4038 36 34 ~4 12108
r79 77 7S 6361 59 57 4139 37 35 ~5 1311 9
78 '76 64 62 60 58 4240 38 36 ~6 1412 10
.77 6563 61 59 4341 39 37 ~7 1513 11
6Cl64 62 60 4442 40 38 ~8 1614 12
.6S 63 61 4S43 41 39 ~9 171S 13
64 62 4644 42 40 ~ 1816 14
.63 4745 43 41 119 1715
484(; 44 42 220 18 16
.47 45 43 ~21 1917
46 44 2422 20 18
.45 ~5 23 21 19
lUi 24 22 20
.25 23 21
24 22
45 43 41 39 37 35 33 31 ~S 13 11 9 7 531
46 44 42 40 38 36 34 32 ~6 14 12 10 8 6 4 2
47 45 43 41 39 37 35 33 ~7 15 13 11 9 7 5 3
48 46 44 42 40 38 36 34 ~8 16 14 12 10 864
49 47 45 43 41 39 37 35 917 15 13 11 975
50 48 46 44 42 40 38 36 W18 16 14 12 10 8 6
51 49 47 45 43 41 39 37 119 17 15 1311 9 7
52 50 48 46 44 42 40 38 220 18 16 1412 108
.51 49 47 45 43 41 39 2..121 19 17 1513 119
50 48 46 44 42 40 ~22 20 18 1614 12 10
.49 47 45 43 41 ~23 21 19 17 15 13 11
48 46 44 42 Ui24 22 20 18 16 14 12
.47 45 43 725 23 21 1917 15 13
46 44 ,..,,'" 22 20 18 16 14
.45 927 2S 23 21 19 17 15
3028 26 24 22 20 18 16
.29 27 2S 23 21 19 17
28 26 24 22 20 18
27 25 23 21 19
26 24 22 20
25 23 21
24 22
(a) MSK(4)
(b) MSK(8)
.23
Figure 4.5.
.23
111
-A1:m-(i-l)p,n-jp+l:n-(j-l)p
SP) as follows:
for 1 ~ J' ~ a(lj)
== 2nj -1,
S~2)
= da\i)+i)
I,}
== mj -
S~3) = da\i)+aY)+j)
== n l"-1.
s~1)
=
I,}
G(j)
(k =
2nj,
(k)
e;,j
I~
where
(k)
L ap,mj,nj),
,k-l
ej,j = e(J+
p=l
and
~+~+2e~j =mj,
Confonnably partitioning the top mj x (nj + (2N - iA)es/2) submatrix of A,
say A*, as
-(j))
A* =
(1ti)
-(i)
A2
112
then A * :=
S\k!
A (i)
I,]
is computed as
A(i) :=
CI
CI
CI
CI
CI
CI
C2
C2
C2
C2
C2
C2
Ce Ce
Ce Ce
Ce
Ce
DI,:
D(2e-I),:
Du,:
-SI
-SI
-SI
SI
Sl
SI
-S2
-S2
-S2
S2
S2
S2
-Se
Se
-Se
Se
-Se
Se
D2,:
D3,:
D4,:
D2,:
DI,:
D4,:
D3,:
(4.7)
Du,:
D(U-I),:
where e = e;,1, D == A(i), U x (ni + (2N - iA)es/2) are the dimensions of all
matrices and * denotes element by element multiplication. Samples of execution times have been generated by constructing and applying a single CDGR
to matrices with various dimensions. Fitting a model to these samples, the
estimated execution times of constructing and applying (4.7) were found, respectively, to be
al
+ b l rU/es1 rN - (i -1)A/21
and
S;,1
r~,1 x C~)k) and r~,1 x ci,)k) , respectively. Similarly let the right compound rota-
si1
rt1
r},1
113
rotations are not applied to some columns of B when i = 2N /1.., the values of
C~)k), ct)k) and ci,)k) are different. If PA, PB=PBL+PBR, PQ and Pp are all of the
es x es submatrices of A, B, Qr and P, respectively, which are involved in the
compound rotations when the MSK(Aes/2) sequence is applied and where PBL
and PBR correspond to the left and right compound rotations, then
el)
2NI').. 3
PA(M,N, A) = L LtJrt~/eslrc~)k)/esl,
i=l k=lj=l
2NI').. 3 aU)
PB(M,N,A) = L L
(rrt~ /esl
i=l k=lj=l
2NI').. 3
a(i)
k
(k)
Pp(M,N,A) = PQ(M,N,A).
More details of these are given in [67]. The total time spent on constructing
and applying the CDGRs of the MSK(Aes/2) sequence is
114
Pr
1
2
140
94
1059
1160
0.92
0.84
0.90
0.84
0.25
0.17
1.17
1.04
1
2
204
126
1962
2113
1.55
1.41
1.53
1.41
0.53
0.39
2.08
1.82
1
2
588
318
12084
11863
7.67
6.79
7.60
6.83
5.03
3.47
12.69
10.33
1
2
4
8
16
6496
3440
1912
1148
766
281808
289770
307472
338630
382821
159.63
155.22
159.98
173.70
195.14
158.01
154.24
159.92
174.68
197.57
148.75
95.59
68.42
52.19
35.68
307.41
250.22
228.59
227.08
233.53
1
2
1292
670
51585
47690
29.34
25.53
29.16
25.68
32.46
19.4
61.75
45.16
16
20
Rotations
Overh.
MSK(A.es/2)
20
1
2
3
6
3684
1914
1324
734
166841
167481
169887
174312
93.88
89.23
88.77
89.17
93.08
88.95
89.23
90.17
105.52
66.18
53.41
40.60
198.61
155.14
142.78
130.91
20
1
2
3
4
6
12
6792
3540
2456
1914
1372
830
333560
340065
349507
355739
370849
412475
186.31
180.57
182.37
183.96
190.07
209.57
184.60
179.40
182.57
184.05
190.89
212.10
208.13
131.08
105.95
92.65
79.38
64.07
392.79
310.71
288.54
276.72
270.28
276.18
The employment of Householder transformations to compute (4.3) is straightforward [13, 17, 18, 28, 37,40, 56, 93]. The orthogonal matrices QT and P
in (4.3) are the products of Householder transformations. Unlike the Givens
strategies, the Householder algorithm cannot exploit the triangular structure of
B. The timing model (in seconds) of the Householder method is found to be
115
Th(M,N)
Householder
MSK(Aes/2)
2
3
9
9
9
12
12
12
16
16
16
20
20
20
20
1
1
1
2
4
1
2
3
1
2
3
1
2
3
6
0.43
0.77
7.25
8.88
12.03
14.92
17.70
20.44
31.86
36.63
41.37
58.47
65.80
73.09
94.59
0.38
0.73
7.46
9.l6
12.44
15.40
18.27
21.11
32.85
37.78
42.68
60.19
67.72
75.23
97.38
0.71
1.17
5.60
11.69
24.57
8.88
18.56
29.59
14.36
29.84
47.73
21.09
43.55
69.45
158.12
Chapter 5
SEEMINGLY UNRELATED REGRESSION
EQUATIONS MODELS
INTRODUCTION
i= 1, ... ,G,
(5.1)
where the Yi E ~r are the response vectors, the Xi E ~rXki are the exogenous matrices with full column ranks, the ~i E ~ki are the coefficients, and
the Ui E ~r are the disturbance terms. The basic assumptions underlying the
SURE model (5.1) are E(Ui) = 0, E(Ui,U)) = CJiJr and limr-too(XrXj/T) exists (i, j = 1, ... , G). In compact form the SURE model can be written as
or
G
vec(Y) =
(5.2)
118
where Y = (YI ... YG) and U = (UI ... uG). The direct sum of matrices EEl~IXj
defines the GT x K block-diagonal matrix
~X'=XlEllX'EIl"'EIlXG=
(Xl X, ".
x,,)'
(5.3)
where K = L~l kj [125]. The matrices used in the direct sum are not necessarily of the same dimension. It should be noted however, that some properties
of the direct sum given in the literature are limited to square matrices [134,
pages 260-261]. The set of vectors ~}, ~2' ... ' ~G is denoted by {~j}G. The
vec() operator stacks the columns of its matrix or set of vectors argument in a
column vector, that is,
vec(Y)
= '(YI)
Y~
and
Hereafter, the subscript G in the set operator {.} is dropped and the direct sum
of matrices EEl~1 will be abbreviated to EElj for notational convenience.
The disturbance vector vec(U) in (5.2) has zero mean and variance-covariance
matrix !:.Ir, where!:. = [OJ,j] is a G x G positive definite matrix and denotes
the usual Kronecker product [5]. That is,
E(vec(U)vec(U)T) = !:.Ir
OI,IIr
_ ( 02,IIr
OG,IIr
Notice that
(EEljXj)vec( {~j}) = vec( {Xj~j})
and
OI,GIr)
02,GIr
oG,GIr
(5.4)
SURE models
119
Various least squares estimators have been proposed for estimating the parameters of the SURE model [142]. The most popular are the Ordinary Least
Squares (OLS) and Generalized Least Squares (GLS) approach. The OLS approach implicitly assumes that the regression equations in (5.1) are independent of each other and estimates the parameters of the model equation by equation. The GLS approach utilizes the additional prior information about correlation between the contemporaneous disturbances that is fundamental to the
SURE specification and estimates the parameters of all the equations jointly.
The OLS estimator of vec( {~i} ) in (5.2) and its variance-covariance matrix
are given respectively, by
(5.5)
and
Similarly, the GLS estimator of vec( {~i}) and its variance-covariance matrix
are given, respectively, by
and
Both vec({b_i^O}) and vec({b_i^G}) are unbiased estimators of vec({β_i}). For Σ positive definite it follows that

Var(vec({b_i^O})) = Var(vec({b_i^G})) + G(Σ^{-1} ⊗ I_T)G^T

for a suitable matrix G, which implies that vec({b_i^G}) is the BLUE of vec({β_i}) in the SURE model [142]. In some cases the OLS and GLS estimators are identical: for example, when X_i = X_j for all i, j, or when σ_{i,j} = 0 for all i ≠ j. More details can be found in [141, 142].
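As an illustration, the following minimal numpy sketch evaluates (5.5) and (5.7) on simulated data (all names and dimensions are hypothetical, and the Kronecker weight matrix is formed explicitly, which is viable only for small T):

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(1)
    T, ks = 50, [2, 3, 2]
    G = len(ks)
    X = [rng.standard_normal((T, k)) for k in ks]
    Sigma = np.array([[1.0, 0.5, 0.3],
                      [0.5, 1.0, 0.4],
                      [0.3, 0.4, 1.0]])
    U = rng.multivariate_normal(np.zeros(G), Sigma, size=T)   # contemporaneous correlation
    y = np.column_stack([X[i] @ np.ones(ks[i]) + U[:, i] for i in range(G)])

    D = block_diag(*X)
    vec_y = y.flatten(order="F")             # vec(Y) stacks the columns of Y

    # OLS (5.5): equation-by-equation least squares
    b_ols = np.concatenate([np.linalg.lstsq(X[i], y[:, i], rcond=None)[0] for i in range(G)])

    # GLS (5.7): joint estimation weighted by (Sigma^{-1} (x) I_T)
    W = np.kron(np.linalg.inv(Sigma), np.eye(T))
    b_gls = np.linalg.solve(D.T @ W @ D, D.T @ W @ vec_y)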
In general Σ is unknown and thus the GLS estimator of vec({β_i}) is unobservable. Zellner has proposed an estimator of vec({β_i}) based on (5.7), where Σ is replaced by an estimator matrix S ∈ ℜ^{G×G} [150]. Thus

vec({b_i^F}) = ((⊕_i X_i)^T (S^{-1} ⊗ I_T)(⊕_i X_i))^{-1} (⊕_i X_i)^T (S^{-1} ⊗ I_T) vec(Y)   (5.9)

is a feasible GLS (FGLS) estimator of vec({β_i}). The main methods for constructing S are based on residuals obtained by the application of the OLS.
The first method constructs S by ignoring the restrictions on the coefficients of the SURE model which distinguish it from the multivariate model. Let X ∈ ℜ^{T×k} be the observation matrix corresponding to the distinct regressors in the SURE model (5.2), where k is the total number of these regressors. Regressing y_i (i = 1, ..., G) on the columns of X gives the unrestricted residual vector

û_i = P_X y_i,   where P_X = I_T − X(X^T X)^{-1} X^T,

and

s_{i,j} = û_i^T û_j / (T − k) = y_i^T P_X y_j / (T − k),   i, j = 1, ..., G.   (5.10)
The second method constructs S by taking into account the restrictions on the coefficients of the SURE model. By applying the OLS to each equation in (5.1), the restricted residuals

ũ_i = P_i y_i

are derived, where P_i = I_T − X_i(X_i^T X_i)^{-1} X_i^T, and S is obtained from

s_{i,j} = ũ_i^T ũ_j / T = y_i^T P_i P_j y_j / T,   i, j = 1, ..., G.   (5.11)

The FGLS estimator may also be computed iteratively, where at the pth iteration S is recomputed from the residuals of the previous iteration and û_i^{(p−1)} is the residual of the ith (i = 1, ..., G) equation, i.e.

û_i^{(p−1)} = y_i − X_i β̂_i^{(p−1)}.
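The second method translates directly into a few lines of code; the sketch below (an illustration with hypothetical names, numpy assumed) takes a list X of the T × k_i regressor matrices and the T × G response matrix y:

    import numpy as np

    def s_restricted(X, y):
        """Estimator S of (5.11): s_ij = u_i' u_j / T, where u_i is the
        OLS residual of the ith equation (restricted residuals)."""
        G, T = len(X), y.shape[0]
        U = np.empty((T, G))
        for i, Xi in enumerate(X):
            b, *_ = np.linalg.lstsq(Xi, y[:, i], rcond=None)
            U[:, i] = y[:, i] - Xi @ b
        return U.T @ U / T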
2 THE GENERALIZED LINEAR LEAST SQUARES METHOD

Writing the disturbances as U = VC^T, where Σ = CC^T with C ∈ ℜ^{G×g} of full column rank g and V a random T × g matrix such that vec(V) has zero mean and variance-covariance matrix I_{gT}, the estimation of the SURE model can be formulated as the generalized linear least squares problem (GLLSP)

argmin_{V, {β_i}} ‖V‖_F²   subject to   vec(Y) = (⊕_i X_i) vec({β_i}) + (C ⊗ I_T) vec(V),   (5.13)

where ‖·‖_F denotes the Frobenius norm, i.e. ‖V‖_F² = Σ_i Σ_t V_{t,i}². Consider the factorizations

Q^T (⊕_i X_i) = ( ⊕_i R_i )  }K
                (    0    )  }GT−K   (5.14a)

and

Q^T (C ⊗ I_T) P = ( W_11  W_12 )  }K
                  (   0   W_22 )  }q
                  (   0     0  )  }GT−K−q,   (5.14b)

with column-block widths gT−q and q. Here, Q ∈ ℜ^{GT×GT} and P ∈ ℜ^{gT×gT} are orthogonal, the R_i ∈ ℜ^{k_i×k_i} and W_22 are upper triangular matrices, (K + q) is the column rank of the augmented GT × (K + gT) matrix (⊕_i X_i   (C ⊗ I_T)) and K = Σ_{i=1}^G k_i. The orthogonal matrix Q^T is defined as

Q^T = ( I_K   0   ) ( Q̃_A^T )  =  ( Q̃_A^T      )  }K
      (  0   Q̂^T ) ( Q̃_B^T )     ( Q̂^T Q̃_B^T )  }GT−K,   (5.15)
where Q̃_A = ⊕_i Q_{A,i}, Q̃_B = ⊕_i Q_{B,i}, the QRD of X_i gives

Q_i^T X_i = ( R_i )  }k_i
            (  0  )  }T−k_i,   with Q_i = (Q_{A,i}   Q_{B,i}),   (5.16)

and the complete factorization

Q̂^T (Q̃_B^T (C ⊗ I_T)) P = ( 0   W_22 )  }q
                            ( 0     0  )  }GT−K−q.   (5.17)
If

Q^T vec(Y) = ( vec({ỹ_i}) )  }K
             (     ỹ      )  }q
             (     ŷ      )  }GT−K−q

and

P^T vec(V) = ( ṽ )  }gT−q
             ( v̂ )  }q,

then the constraints of the GLLSP (5.13) become

( vec({ỹ_i}) ) = ( ⊕_i R_i ) vec({β_i}) + ( W_11  W_12 ) ( ṽ )
(     ỹ      )   (    0    )              (   0   W_22 ) ( v̂ ),   (5.18)

together with ŷ = 0. The minimum of ‖V‖_F² is attained for ṽ = 0 and v̂ = W_22^{-1} ỹ, so that the estimators β̂_i come from the triangular systems

R_i β̂_i = h_i,   i = 1, ..., G,   (5.19)
where

vec({h_i}) = vec({ỹ_i}) − W_12 W_22^{-1} ỹ,   with h_i ∈ ℜ^{k_i}.   (5.20)
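For non-singular Σ the GLLSP solution coincides with the GLS estimator, and the idea can be sketched compactly by whitening the model and solving by a QR decomposition. The code below (a minimal sketch with hypothetical names; numpy/scipy assumed) illustrates (5.13) for a square non-singular C; it does not implement the GQRD route (5.14), which also covers singular Σ:

    import numpy as np
    from scipy.linalg import block_diag, cholesky

    def sure_gllsp(X, Y, Sigma):
        """GLLSP sketch for Sigma = C C^T non-singular: whiten by
        (C^{-1} (x) I_T), then solve the resulting OLS problem via QR."""
        T, G = Y.shape
        C = cholesky(Sigma, lower=True)
        Cinv = np.linalg.inv(C)
        # (C^{-1} (x) I_T) vec(Y) = vec(Y C^{-T})   (Kronecker-vec identity)
        z = (Y @ Cinv.T).flatten(order="F")
        W = np.kron(Cinv, np.eye(T)) @ block_diag(*X)   # whitened regressors
        Q, R = np.linalg.qr(W)
        return np.linalg.solve(R, Q.T @ z)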
3 TRIANGULAR SURE MODELS

Consider now the triangular SURE (tSURE) model, in which the exogenous matrices are nested so that X_i comprises the first k + i columns of a common T × (k + G) matrix X_e of full column rank. Let the QRD of X_e be

Q^T X_e = ( R )  }k+G
          ( 0 )  }T−k−G,   (5.22)

where, for each i, R can be partitioned as

R = ( R_{1,i}   *     )  }k+i
    (    0    R_{2,i} )  }G−i,

so that R_{1,i} is the triangular factor of X_i. Furthermore,

Q^T (y_i   V_i) = ( ỹ_i   Ṽ_i )  }k+i
                  ( ŷ_i   V̂_i )  }T−k−i,

where V_i ≡ V_{:,i} is the ith column of V (i = 1, ..., G). For notational convenience a zero dimension denotes a null matrix or vector. Premultiplying the constraints in (5.13) by (I_G ⊗ Q^T) gives
the equality constraints in a block-permuted form: (C ⊗ I_T) becomes a block lower-triangular matrix whose (i, j)th blocks are c_{i,j} I_{k+j} and c_{i,j} I_{T−k−j}, where C = [c_{i,j}] is the lower-triangular Cholesky factor of Σ. The GLLSP then reads

argmin Σ_{i=1}^G (‖ṽ_i‖² + ‖v̂_i‖²)   subject to
ỹ_i = R_{1,i} β_i + f_i + c_{i,i} ṽ_i   and   ŷ_i = g_i + c_{i,i} v̂_i   (i = 1, ..., G),   (5.23)

where v_j = Q^T V_j is partitioned at row k + j into (ṽ_j^T   v̂_j^T)^T, and

f_i = Σ_{j=1}^{i−1} c_{i,j} (leading k + i elements of v_j),
g_i = Σ_{j=1}^{i−1} c_{i,j} (trailing T − k − i elements of v_j).

Since the ṽ_j are set to zero in the minimization, only the bottom non-zero subvector f̄_i ∈ ℜ^{i−1} of f_i and the vectors g_i need to be accumulated. Employing the column sweep algorithm, f_i and g_i can be determined in G − 1 steps [67]. The following
update is applied at step i:

( f_j )  :=  ( f_j )  +  c_{j,i} (  0  )
( g_j )      ( g_j )             ( v̂_i ),   j = i + 1, ..., G.   (5.24)
For c_{i,i} ≠ 0 the second constraint in (5.23) gives

v̂_i = (ŷ_i − g_i)/c_{i,i}.

In order to minimize the objective function, the arbitrary ṽ_j (j = 1, ..., G) and, for c_{i,i} = 0, the arbitrary v̂_j are set to zero. Thus, if ŷ_i = g_i whenever c_{i,i} = 0, then the tSURE model is consistent and its solution is given by

β̂_i = R_{1,i}^{-1} (ỹ_i − f_i).   (5.25)

For δ_i ∈ ℜ^{k+G} defined as δ_i^T = (ỹ_i^T − f_i^T   0), the solution of the upper triangular system R γ_i = δ_i gives γ_i^T = (β̂_i^T   0); the β̂_i can therefore be obtained by solving R Γ = Δ for all equations simultaneously with the common factor R. Now partition

( ĝ_i   ŷ_i ) = ( g_i^(1)   ŷ_i^(1) )
                ( g_i^(2)   ŷ_i^(2) )

and, for simplicity, assume that ŷ_i^(1) = g_i^(1) and ŷ_i^(2) − g_i^(2) = h_i ≠ 0. That is, for c_{i,i} = 0, the tSURE model is inconsistent because of the non-zero λ_i-element vector h_i. If Q_{2,i} denotes the last λ_i columns of Q, then the premultiplication of the ith inconsistent regression equation

y_i = X_i β_i + Σ_{j=1}^{i} c_{i,j} V_j

by Q^T gives

Q^T y_i = ( ỹ_i )
          ( ŷ_i )

and identifies the corresponding part of Σ_{j=1}^{i−1} c_{i,j} V_j with g_i^(2).
If modifications to the data are acceptable, then the solution of the modified consistent tSURE model proceeds as in the original case, with the difference that ŷ_i^(2) is replaced by g_i^(2); this is equivalent to replacing y_i by a modified ȳ_i. The iterative Algorithm 5.1 summarizes the steps for computing the feasible estimator of vec({β_i}).
Algorithm 5.1 The iterative feasible estimation of the tSURE model.
1: Compute Q^T X_e = (R^T  0)^T and Q^T y_i = (ỹ_i^T  ŷ_i^T)^T for i = 1, ..., G
2: repeat
3:   Estimate Σ by Σ̂
4:   Compute the Cholesky decomposition Σ̂ = C C^T
5:   ∀i: let (f_i^T  g_i^T)^T = 0, where f_i ∈ ℜ^{i−1} and g_i ∈ ℜ^{T−k−i}
6:   for i = 1, 2, ..., G − 1 do
7:     if c_{i,i} ≠ 0 then
8:       v̂_i := (ŷ_i − g_i)/c_{i,i}
9:       for all j = i + 1, ..., G do-in-parallel
10:        update (f_j^T  g_j^T)^T by (5.24)
11:      end for
12:    end if
13:  end for
14:  for i = 1, ..., G do
15:    if c_{i,i} = 0 then replace ŷ_i^(2) by g_i^(2)
16:    let d_i := ỹ_i − f_i
17:  end for
18:  Solve the upper triangular system R Γ = D
19: until Convergence
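The heart of the algorithm is the column sweep of steps 5-13; a compact sketch (hypothetical helper, using equal-length blocks for simplicity; numpy assumed) is:

    import numpy as np

    def column_sweep(C, yhat):
        """Steps 5-13 of Algorithm 5.1: compute the v_i and accumulate the
        g_j via the update (5.24).  The inner loop over j is independent
        across j and corresponds to the do-in-parallel step."""
        G = C.shape[0]
        g = [np.zeros_like(yi) for yi in yhat]
        v = [np.zeros_like(yi) for yi in yhat]
        for i in range(G - 1):
            if C[i, i] != 0.0:
                v[i] = (yhat[i] - g[i]) / C[i, i]        # step 8
                for j in range(i + 1, G):                # steps 9-11
                    g[j] = g[j] + C[j, i] * v[i]         # update (5.24)
        return v, g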
To compute the covariance matrix of β̂_i and β̂_j (j ≤ i), note that the estimator can be written as

β̂_i = β_i + R_{1,i}^{-1} Σ_{l=1}^{i} c_{i,l} (I_{k+i}   0) v_l,

so that, since E(v_l v_m^T) = δ_{l,m} I_T,

Cov(β̂_i, β̂_j) = (Σ_{l=1}^{j} c_{i,l} c_{j,l}) R_{1,i}^{-1} ( I_{k+j} ) R_{1,j}^{-T},   (5.26)
                                                            (    0    )

which is simpler than the expressions given in [91, 127, 142].
3.1 IMPLEMENTATION ASPECTS
Let φ1(τ, μ, i) denote the execution time of the ith step of the algorithm, where τ and μ parameterize the problem size and the number of processors, respectively. The total execution time is

φ2(τ, μ) = Σ_{i=1}^{G−1} φ1(τ, μ, i),

and using backward stepwise regression the following execution time model is derived:

φ3(τ, μ) = μ(8.997 + 15.23τ + 13.95τμ − 4.622μ²).
Table 5.1.  Execution times and the timing models φ2 and φ3.

τ    μ   Execution Time   φ2(τ,μ)   φ3(τ,μ)
3    2       239.6          240.5     239.8
4    2       325.7          326.6     326.1
5    3       758.3          760.9     758.4
6    3       929.1          931.6     929.6
7    4      1730.1         1732.6    1729.0
8    4      2012.4         2015.6    2013.1
9    5      3292.1         3294.0    3291.3
10   5      3716.9         3717.0    3716.2
11   6      5584.7         5583.5    5585.0
12   6      6180.1         6174.1    6178.6
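The reconstructed model φ3 can be checked directly against the tabulated values (a quick numerical sanity check, not from the original text):

    def phi3(tau, mu):
        """Execution time model obtained by backward stepwise regression."""
        return mu * (8.997 + 15.23 * tau + 13.95 * tau * mu - 4.622 * mu ** 2)

    assert abs(phi3(5, 3) - 758.4) < 0.1      # third row of Table 5.1
    assert abs(phi3(12, 6) - 6178.6) < 0.1    # last row of Table 5.1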
An alternative implementation solves the triangular systems R Γ = Δ while computing the variance estimates from the diagonal elements of R. In this case, the execution time model corresponding to φ4(κ, μ) is derived in the same way; Table 5.2 reports the measured execution times of the two implementations together with the models φ4(κ, μ) and φ5(κ, μ).
Table 5.2.  Execution times of the two implementations and the models φ4 and φ5.

κ   μ   Execution Time-1   φ4(κ,μ)   Execution Time-2   φ5(κ,μ)   Ratio
3   1         446             450           447            450      1.00
3   4        3581            3583          2466           2468      1.45
3   8       11696           11700          6589           6594      1.86
4   2        1474            1476          1298           1300      1.14
4   4        4560            4559          3273           3272      1.39
4   7       13989           13978          8281           8271      1.69
5   3        3499            3498          2828           2826      1.24
5   5        8477            8472          5854           5850      1.45
5   7       16461           16459         10151          10151      1.62
6   2        2437            2436          2204           2203      1.11
6   5       10116           10115          7207           7207      1.40
6   7       19149           19144         12238          12235      1.56
7   2        3006            3004          2744           2742      1.10
7   7       22032           22033         14518          14522      1.52
4 COVARIANCE RESTRICTIONS

The SURE model with covariance restrictions (SURE-CC) arises in the estimation of simultaneous equations models, and it can be found in modified form in applications involving the estimation of vector autoregressive models [146]. The proposed
estimation procedure applies the generalized least squares (GLS) method to a
transformed model in which the information matrix has a block bi-diagonal
recursive structure and the disturbances have a diagonal variance-covariance
matrix. The GLS estimators come from the solution of normal equations involving Kronecker products.
A computationally efficient and numerically reliable solution of the normal equations exploits the banded structure of the information matrix, avoids forming the Kronecker products explicitly, and avoids computing matrix inverses [41, 42]. Otherwise, estimating SURE-CC models is computationally expensive even for modestly sized problems, and the solution will be meaningless for models with ill-conditioned matrices.
Within the context of the SURE-CC model, the elements of Σ are constrained to satisfy

σ_{1,1} ≤ σ_{2,2} ≤ ... ≤ σ_{G,G}   (5.27a)

and

ρ_{i,j} > 0,   i, j = 1, ..., G and i ≠ j,   (5.27b)

where ρ_{i,j} = σ_{i,j}/√(σ_{i,i} σ_{j,j}). These conditions are obtained when the disturbance terms are defined as u_i = Σ_{p=1}^{i} ε_p, where

E(ε_i ε_j^T) = 0 for i ≠ j,   E(ε_i) = 0,   E(ε_i ε_i^T) = θ_i² I_T,   (5.28)

and θ_i is an unknown scalar (i = 1, ..., G). Then σ_{i,j} = Σ_{p=1}^{i} θ_p² for i ≤ j, which imposes the constraints (5.27). Notice that ρ_{i,j} > ρ_{i,k} for k > j > i and ρ_{i,j} > ρ_{k,j} for j > i > k. That is, for i < j the correlation ρ_{i,j} decreases for fixed i and increasing j, or for fixed j and decreasing i. Figure 5.1 plots the correlations ρ_{i,j} for θ_i = i and θ_i = 1/i.
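The restrictions (5.27) follow mechanically from the cumulative-sum structure of Σ, which a few lines of numpy can verify (an illustrative sketch; the θ values are hypothetical):

    import numpy as np

    def sigma_from_theta(theta):
        """Sigma under the SURE-CC restrictions: sigma_ij = sum of theta_p^2
        for p = 1, ..., min(i, j)."""
        t2 = np.cumsum(np.asarray(theta, dtype=float) ** 2)
        return np.minimum.outer(t2, t2)

    S = sigma_from_theta([1.0, 2.0, 3.0])
    d = np.sqrt(np.diag(S))
    rho = S / np.outer(d, d)
    assert np.all(np.diff(np.diag(S)) >= 0)   # variance inequalities (5.27a)
    assert np.all(rho > 0)                    # positive correlations (5.27b)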
Writing u_{i+1} as

u_{i+1} = u_i + ε_{i+1}   (5.29)

and differencing successive equations of (5.1), the SURE-CC model can be written as

vec(Z) = W vec({β_i}) + vec(E),   (5.30)

Figure 5.1.  The correlations ρ_{i,j} for θ_i = i (left) and θ_i = 1/i (right).

where Z = (z_1  z_2  ...  z_G) with z_1 = y_1 and z_{i+1} = y_{i+1} − y_i, E = (ε_1  ε_2  ...  ε_G) and

W = (   X_1
       −X_1    X_2
               −X_2    X_3
                       ...
                      −X_{G−1}   X_G ).   (5.31)
The error term vec(E) has zero mean and variance-covariance matrix Θ ⊗ I_T = ⊕_i θ_i² I_T. The GLS estimator of the SURE-CC model comes from the application of the GLS method to (5.30). In general, Θ is unknown and is replaced by a consistent estimator, say Θ̂. Thus, the FGLS estimator of
vec({β_i}) is given by

vec({b_i^F}) = (W^T (Θ̂^{-1} ⊗ I_T) W)^{-1} W^T (Θ̂^{-1} ⊗ I_T) vec(Z).   (5.32)

Notice, however, that due to the diagonal structure of Θ̂, the FGLS estimator of vec({β_i}) could come from the solution of the weighted least squares (WLS) problem

argmin_{{β_i}} ‖(Θ̂^{−1/2} ⊗ I_T)(vec(Z) − W vec({β_i}))‖²,   (5.33)

where W̃ = (Θ̂^{−1/2} ⊗ I_T) W has the structure
W̃ = (   η_1 X_1
        −η_2 X_1    η_2 X_2
                    −η_3 X_2    η_3 X_3
                                ...
                               −η_G X_{G−1}   η_G X_G ),   with η_i = 1/θ̂_i.   (5.34)
Consider now the QLD

Q^T W̃ = ( 0 )  }GT−K
         ( L )  }K.   (5.35)

Defining

Q^T vec(Z̃) = ( z̄ )  }GT−K
              ( d )  }K,

where vec(Z̃) = (Θ̂^{−1/2} ⊗ I_T) vec(Z), the FWLS estimator of vec({β_i}) is the solution of the triangular system

L vec({β_i}) = d.   (5.36)

The matrix L is lower block bi-diagonal,

L = ( L_{1,1}
      L_{2,1}   L_{2,2}
                L_{3,2}   L_{3,3}
                          ...
                         L_{G,G−1}   L_{G,G} ),   (5.37)

so that (5.36) is solved by forward block-substitution:

L_{1,1} β̂_1 = d_1   (5.38a)

and

L_{i,i} β̂_i = d_i − L_{i,i−1} β̂_{i−1},   i = 2, ..., G,   (5.38b)

where the diagonal blocks L_{i,i} are non-singular.
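The forward block-substitution (5.38) costs only G triangular solves and G − 1 block multiplications; a direct sketch (hypothetical helper with lower-triangular diagonal blocks; numpy/scipy assumed) is:

    from scipy.linalg import solve_triangular

    def solve_block_bidiagonal(L_diag, L_sub, d):
        """Solve (5.38): L_diag[0] b_0 = d[0] and, for i >= 1,
        L_diag[i] b_i = d[i] - L_sub[i-1] b_{i-1}."""
        betas = [solve_triangular(L_diag[0], d[0], lower=True)]
        for i in range(1, len(L_diag)):
            rhs = d[i] - L_sub[i - 1] @ betas[i - 1]
            betas.append(solve_triangular(L_diag[i], rhs, lower=True))
        return betas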
4.1 THE QLD OF THE BLOCK BI-DIAGONAL MATRIX

The QLD (5.35) is computed in two stages. The first stage computes

Q_0^T W̃ = ( A^(0) )  }GT−K
           ( L^(0) )  }K,   (5.39)

where L^(0) ∈ ℜ^{K×K} is non-singular and has the same structure as L in (5.37). In the second stage an orthogonal matrix, say Q_*, is applied on the left of (5.39) to annihilate A^(0) while preserving the block bi-diagonal structure of L^(0). That is, the second stage computes

Q_*^T ( A^(0) ) = ( 0 )  }GT−K
      ( L^(0) )   ( L )  }K.   (5.40)

The first stage comprises the G independent QLDs

Q_{i,0}^T (η_i X_i) = (    0    )  }T−k_i
                      ( L_i^(0) )  }k_i,   i = 1, ..., G,   (5.41)

and the orthogonal factorizations

Q_{i,0}^T (−η_i X_{i−1}) = ( Ā_i^(0) )  }T−k_i
                           ( Â_i^(0) )  }k_i.   (5.42)

Partitioning Q_{i,0} as (Q̄_{i,0}   Q̂_{i,0}), let Q̄_0 = ⊕_i Q̄_{i,0} and Q̂_0 = ⊕_i Q̂_{i,0}, where Q̄_{i,0} ∈ ℜ^{T×(T−k_i)} and Q̂_{i,0} ∈ ℜ^{T×k_i}. The orthogonal matrix Q_0 and the matrices A^(0) and L^(0) in (5.39) are given by

Q_0 = ( Q̄_0   Q̂_0 )   (5.43)
and

A^(0) = (    0                                 )  }T−k_1
        ( Ā_2^(0)     0                        )  }T−k_2
        (    0     Ā_3^(0)    0                )  }T−k_3
        (                  ...                 )
        (    0      ...    Ā_G^(0)     0       )  }T−k_G,

L^(0) = ( L_1^(0)
          Â_2^(0)   L_2^(0)
                    Â_3^(0)   L_3^(0)
                              ...
                             Â_G^(0)   L_G^(0) ).   (5.44)
The second stage, the annihilation of A^(0), is divided into G − 1 sub-stages. The ith (i = 1, ..., G − 1) sub-stage computes the G − i orthogonal factorizations

Q_{i,j}^T ( Ā_{i+j}^(i−1)   A_j^(i−1) ) = ( Ā_{i+j}^(i)   A_j^(i) )
          (      0          L_j^(i−1) )   (      0        L_j^(i) ),   j = 1, ..., G − i,   (5.45)

which annihilate the blocks of the current sub-diagonal and generate fill-ins one block sub-diagonal lower. Partitioning each Q_{i,j} conformably as a 2 × 2 block matrix with blocks Q_{i,j}^(k,p), let

Q_i^(k,p) = ⊕_{j=1}^{G−i} Q_{i,j}^(k,p),   k, p = 1, 2.   (5.46)

The compound disjoint orthogonal matrix (CDOM) Q_i then has the structure

Q_i = ( I_{λ_i}
                 Q_i^(1,1)   Q_i^(1,2)
                 Q_i^(2,1)   Q_i^(2,2)
                                        I_{ρ_i} ),   (5.47)

where λ_i = iT − Σ_{j=1}^{i} k_j and ρ_i = Σ_{j=1}^{i} k_{G−j+1}. If A^(i) and L^(i) are defined as

( Ā^(i) ) = Q_i^T ... Q_1^T ( Ā^(0) )
( L^(i) )                   ( L^(0) ),   (5.48)

then L^(i) has the same structure as L^(0), and A^(i) has non-zero elements only in the (i + 1)th block sub-diagonal (i < G − 1). The application of Q_{i+1}^T (i = 1, ..., G − 2) on the left of (5.48) will annihilate the non-zero blocks of A^(i), but will fill in the (i + 2)th block sub-diagonal if i < G − 2. It follows that the complete annihilation of A^(0) requires G − 1 sub-stages.
136
0
0
-(0)
A3
0
0
-(0)
A2
-(0)
A4
L(O)
................
1
A~O)
0
0
0
0
-(1)
A3
0
0
0
0
0
0
0
0
-(1)
A4
o
o
0
............... .
L (1) 0
0 0
1
A(I) L(I)
0 0
2
2
A(1)
L(I)
0
0
3
3
A(O) L(O)
0
0
4
4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
L(3)
1
0
0
0
0
QT
-4
-(2)
A4
0)
A(O) L(O) 0
3
3
0 A(O) L(O)
4
0
0
0
0
0
0
0
0
0
................
L(2)
1
QT
-4
QT
-2t
A(2) L(2)
0
0
2
2
A(I)
L
(1)
0
0
3
3
A(O)
L(O)
0 0
4
4
................
A~2)
0
0
Figure 5.2.
2)
A(I) L(I)
0
3
3
A(O)
L(O)
0
4
4
Factorization process for computing the QLD (5.35) using Algorithm 5.3.
In the sequential schemes the block at fill-in level j of the ith block-row is annihilated at step

j + 1 + (i − 1)(i − 2)/2,

where i = 2, ..., G and j = 0, ..., i − 2. However, the annihilation of a block-row can commence before the complete annihilation of the preceding block-rows. The annihilation of Ā_i^(0) can begin at step i, with subsequent steps annihilating the fill-ins. Figure 5.3 illustrates the sequential and parallel versions of the two annihilation schemes. A number denotes the submatrices that are annihilated by the corresponding step. The first scheme, called the Parallel Diagonal Scheme (PDS), uses the method described in Algorithm 5.3; its sequential version is called SDS. The Sequential and Parallel Row Schemes are denoted by SRS and PRS, respectively.
Figure 5.3.  The annihilation schemes SDS, SRS, PDS and PRS, together with their limited-parallelism variants (limited PDS and limited PRS).
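The sequential step formula above is easy to tabulate; the sketch below (a hypothetical helper, assuming the formula indexes the row-oriented sequential ordering) prints the step at which each block of a G = 5 problem is annihilated:

    def seq_step(i, j):
        """Annihilation step of the block at fill-in level j of block-row i:
        j + 1 + (i - 1)(i - 2)/2, for i = 2..G and j = 0..i-2."""
        return j + 1 + (i - 1) * (i - 2) // 2

    G = 5
    for i in range(2, G + 1):
        print([seq_step(i, j) for j in range(i - 1)])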
Given the assumption that the independent factorizations (5.45) can be computed simultaneously for all j = 1, ..., G − i (full parallelism), PDS and PRS apply G − 1 and 2G − 3 CDOMs, respectively. The superiority of PDS diminishes when the number of factorizations (5.45) that can be computed simultaneously is reduced (limited parallelism). In the extreme case in which only one factorization can be computed at a time, the parallel schemes reduce to their corresponding sequential schemes, requiring the application of the same number of CDOMs.
4.2 PARALLEL STRATEGIES
The orthogonal factorization (5.45) can be computed using Householder reflections or Givens rotations. Sequences of compound disjoint Givens rotations (CDGRs) for computing factorizations similar to (5.45) have been considered in the context of updating linear models [75, 69]. Two of these sequences are illustrated in Fig. 5.4, where the zero matrix and the matrices A_j^(i−1), Ā_{i+j}^(i) and A_j^(i) in (5.45) are not shown.
Figure 5.4.  Givens sequences (UGS-1, UGS-3 and Greedy) for computing the orthogonal factorization (5.45).
Figure 5.5 shows the number of CDGRs required to compute the orthogonal factorization (5.40), where G = 4, (k_1, k_2, k_3, k_4) = (80, 70, 10, 30) and T = 1000. The entries of the matrices show s_{i,j}, the number of CDGRs required to annihilate the blocks of A^(0). The bold numbers denote t_1, ..., t_{G−1}, that is, the total number of CDGRs applied in each sub-stage that annihilates a sub-diagonal. Clearly, the Greedy sequences perform better than the UGSs in terms of the total number of CDGRs applied.
Figure 5.5.  Number of CDGRs for computing the orthogonal factorization (5.40) using the UGS and Greedy sequences.
4.3 COMMON EXOGENOUS VARIABLES
The regression equations in a SURE model frequently have common exogenous variables. The computational burden of solving SURE models can be reduced significantly if the algorithms exploit this possibility. Let X_d denote the matrix consisting of the K_d distinct regressors in the SURE-CC model, where K_d ≤ K = Σ_{i=1}^G k_i. The exogenous matrix X_i (i = 1, ..., G) can be defined as X_d S_i, where S_i ∈ ℜ^{K_d×k_i} is a selection matrix that comprises relevant columns of the K_d × K_d identity matrix.
Assuming for simplicity that T > K_d, let the QLD of X_d be given by

Q_d^T X_d = (  0  )  }T−K_d
            ( L_d )  }K_d,   with Q_d = (Q_{d,1}   Q_{d,2}),

where Q_{d,1} ∈ ℜ^{T×(T−K_d)} and Q_{d,2} ∈ ℜ^{T×K_d}. Let

z̃_{i,1} = Q_{d,1}^T z_i,   z̃_{i,2} = Q_{d,2}^T z_i,   ε̃_{i,1} = Q_{d,1}^T ε_i,   ε̃_{i,2} = Q_{d,2}^T ε_i,   (5.49)

and define the orthogonal matrix Q_D as Q_D = (I_G ⊗ Q_{d,1}   I_G ⊗ Q_{d,2}). Applying Q_D^T from the left of the SURE-CC model (5.30) gives

( vec({z̃_{i,1}}) ) = (  0  ) vec({β_i}) + ( vec({ε̃_{i,1}}) )
( vec({z̃_{i,2}}) )   ( L_D )              ( vec({ε̃_{i,2}}) ).

Thus, the SURE-CC model estimators vec({β̂_i}) arise from the solution of the reduced-sized model

vec({z̃_{i,2}}) = L_D vec({β_i}) + vec({ε̃_{i,2}}).   (5.50)

The variance-covariance matrix of the disturbance vector vec({ε̃_{i,2}}) is given by Θ ⊗ I_{K_d}, and the matrix L_D has the structure

L_D = (   L_d S_1
         −L_d S_1    L_d S_2
                    −L_d S_2    L_d S_3
                                ...
                               −L_d S_{G−1}   L_d S_G ).   (5.51)
The equivalents of the QLD (5.41) and the orthogonal factorization (5.42) are now given, respectively, by

Q_{i,0}^T (η_i L_d S_i) = (    0    )  }K_d−k_i
                          ( L_i^(0) )  }k_i,   i = 1, ..., G   (5.52)

and

Q_{i,0}^T (−η_i L_d S_{i−1}) = ( Ā_i^(0) )  }K_d−k_i
                               ( Â_i^(0) )  }k_i.   (5.53)
The QLD (5.52) may be considered as the re-triangularization of a lower-triangular matrix after deleting columns. Sequential and parallel Givens sequences to compute this column-downdating problem have previously been considered [71, 74, 87]. A Givens sequence employed to compute (5.52) may result in matrices Ā_i^(0) and Â_i^(0) in (5.53) that are not full. Generally, the order of fill-in of the matrices Ā_i^(0) and Â_i^(0) will not be the same if different Givens sequences are used in (5.52). Thus, in addition to the number of steps, the order of the fill-in of Ā_i^(0) and Â_i^(0) is also important when selecting the Givens sequence to triangularize (5.52), since in the second stage the matrices Ā^(j−1) and A^(j−1) have fewer zero entries than Ā^(j) and A^(j), respectively.
Consider the special case of proper subset regressors of a SURE-CC model [78, 86]. The exogenous matrices are defined as

X_i = ( X_{i+1}   X̄_i ),   with X̄_i ∈ ℜ^{T×(k_i−k_{i+1})},   (5.54)

so that the regressors of the (i + 1)th equation form a proper subset of those of the ith equation. With the columns ordered in this way, X_d = X_1 and L_d S_i comprises the last k_i columns of L_d. Partitioning L_d with block widths k_1 − k_2, k_2 − k_3, ..., k_G,

L_d = ( L_{1,1}
        L_{2,1}   L_{2,2}
                  ...
        L_{G,1}   L_{G,2}   ...   L_{G,G} )

and

L_d S_i = (    0                                    )  }k_1−k_i
          ( L_{i,i}                                 )  }k_i−k_{i+1}
          ( L_{i+1,i}    L_{i+1,i+1}                )  }k_{i+1}−k_{i+2}
          (           ...                           )
          ( L_{G,i}      L_{G,i+1}   ...   L_{G,G}  )  }k_G.
142
Thus, after computing the QLD of Xl, the next main step is the triangularization of the augmented matrix in (5.44), which now has a sparse structure with
repeated blocks. This simplifies the computation of factorization (5.40). For
G = 4, the matrix (5.44) is given by
( Ā^(0) )     (   −L^(2)_{1,1}
( L^(0) )  =      −L^(2)_{2,1}   −L^(2)_{2,2}
                                 −L^(3)_{2,2}
                                 −L^(3)_{3,2}   −L^(3)_{3,3}
                                                −L^(4)_{3,3}
              ...................................................
                   L^(1)_{1,1}
                   L^(1)_{2,1}    L^(1)_{2,2}
                   L^(1)_{3,1}    L^(1)_{3,2}    L^(1)_{3,3}
                   L^(1)_{4,1}    L^(1)_{4,2}    L^(1)_{4,3}    L^(1)_{4,4} ),   (5.55)

where the superscripted blocks are scalar multiples of the blocks of L_d, as defined through (5.57) and (5.58) below. The annihilation of a block is effected by a CDOM Q_{i,j} of the form (5.56), whose factorization step (5.57) annihilates the target blocks while rescaling the remaining blocks in the affected rows,
where ν = i + j − 1 and L^(m,n)_{i,j} denotes the scalar multiple m n L_{i,j}. For μ < λ (μ > 0 and λ = 1, ..., G) the values of ρ_{μ,λ} and τ_{μ,λ} are defined by the recurrences (5.58a) and (5.58b), with the base case applying when μ = λ. Note that Q_{i,j}^T in (5.56) is an extension of a 2 × 2 Givens rotation, in which the sine and cosine elements are multiplied by an identity matrix. Furthermore, for i > 1 the factorization (5.57) annihilates multiple blocks. These blocks are the fill-ins which have been generated by the application of Q_{i−1,j+1}^T. The factorization (5.57) affects the (i + j)th and ((j − 1)(2G + 2 − j)/2 + i)th block of rows of Ā^(0) and L^(0), respectively, where L^(0) is partitioned as in (5.55). A block of rows of L^(0) is modified at most once.
The triangularization pattern of (5.55) is illustrated in Fig. 5.6. An arc shows
the blocks of rows that are affected by the application of a CDOM. The letters
A and F, a shaded box and a blank box denote, respectively, an annihilated,
a filled-in, a non-zero and a zero block. A filled-in block that also becomes
zero during the annihilation of the block A is indicated by Z.
Let L_{i:,j:} denote the block submatrix

L_{i:,j:} = ( L_{i,j}
              ...        ...
              L_{G,j}    ...    L_{G,G} ).
144
I FF
I,
f'-.
IA
FF
l.
rt .....
"
\..
'"
~
i\.
~
~
~
I'\..
I
\
~
~
ZZ IA
i'-
i',
i\..
r;..
~
"- ~
-,
~
"
r"-
Sla ge J
Sla ge 2
Figure 5.6.
I\..
Sla ge I
"- -\..,
:-
'\..
rio..
ZIA
F F F
~ "-
'I".
~ ~
Sla geO
ZA
IA
"
I\.
1-
"-
Let k_{G+1} = 0 and τ_{i,G+1} = 1 for i = 1, ..., G. Straightforward algebraic manipulations show that, in (5.38), L_{i,i} and L_{i,i−1} are given by

L_{i,i} = T_i L_{i:,i:}   (5.59)

and

L_{i,i−1} = P_i L_{i:,i−1:},   (5.60)

where T_i = diag(τ_{i,i+1} I, τ_{i,i+2} I, ..., τ_{i,G+1} I) and P_i = diag(ρ_{i,i+1} I, ρ_{i,i+2} I, ..., ρ_{i,G+1} I) are conformable block-diagonal scaling matrices.
Partitioning d_i conformably as d_i^T = (d_{i,i}^T   ...   d_{G,i}^T) and letting d_{i:,i} denote the stacked vector,   (5.61)

the solution (5.38) yields the componentwise relations

T_1 L_{1:,1:} β̂_1 = d_{1:,1}   (5.62a)

and

T_i L_{i:,i:} β̂_i = d_{i:,i} − P_i L_{i:,i−1:} β̂_{i−1}   (5.62b)

for i = 2, ..., G. The estimators β̂_1, ..., β̂_G can be derived simultaneously by solving for B the triangular system of equations L_{1:,1:} B = D, where

B = ( β̂_{1,1}
      β̂_{2,1}   β̂_{2,2}
               ...
      β̂_{G,1}   β̂_{G,2}   ...   β̂_{G,G} )

and

D = ( d_{1,1}
      d_{2,1}   d_{2,2}
               ...
      d_{G,1}   d_{G,2}   ...   d_{G,G} ).

Thus, the main stages in solving the SURE-CC model with proper subset regressors are: (1) compute the QLD of X_1; (2) evaluate the simple recurrence for τ_{m,n} in (5.58) for m = 1, ..., G and n = 2, ..., G + 1; and (3) solve a lower-triangular system, where the unknown and right-hand-side matrices have lower block-triangular structures.
Chapter 6

SIMULTANEOUS EQUATIONS MODELS
The estimation of simultaneous equations models (SEMs) is of great importance in econometrics [34, 35, 62, 64, 96, 124, 130, 132, 149]. The most commonly used estimation procedures are the Three Stage Least-Squares (3SLS) procedure and the computationally expensive maximum likelihood procedure [33, 60, 97, 106, 107, 119, 120, 143, 153]. Here the methods used for solving SURE models will be extended to 3SLS estimation of SEMs. The ith structural equation of the SEM can be written as

y_i = X_i β_i + Y_i γ_i + u_i,   i = 1, ..., G,   (6.1)
where, for the ith structural equation, y_i ∈ ℜ^T is the dependent vector, X_i is the T × k_i matrix of full column rank of exogenous variables, Y_i is the T × g_i matrix of other included endogenous variables, β_i and γ_i are the structural parameters to be estimated, and u_i ∈ ℜ^T are the disturbance terms. For W_i ≡ (X_i   Y_i), δ_i^T ≡ (β_i^T   γ_i^T) and U = (u_1 ... u_G) the stacked system of the structural equations can be written as

vec(Y) = (⊕_{i=1}^G W_i) vec({δ_i}) + vec(U)   (6.2)

or as

vec(Y) = (⊕_{i=1}^G W S_i) vec({δ_i}) + vec(U),   (6.3)

where W ≡ (X   Y), S_i is a selection matrix such that W S_i = W_i (i = 1, ..., G), and vec(U) ≡ (u_1^T ... u_G^T)^T. The disturbance vector vec(U) has zero mean and variance-covariance matrix Σ ⊗ I_T, where Σ is a G × G non-negative definite matrix. It is assumed that e_i = k_i + g_i ≤ K, that is, all structural equations are identifiable. The notation used here is consistent with that employed in the previous chapter and, similarly, the direct sum ⊕_{i=1}^G and set operator {·}_G will be abbreviated by ⊕_i and {·}, respectively.
The 2SLS and Generalized LS (GLS) estimators of (6.3) are defined, respectively, from the application of Ordinary LS (OLS) and GLS to the transformed SEM (hereafter TSEM)

vec(X^T Y) = (⊕_i X^T W S_i) vec({δ_i}) + vec(X^T U).   (6.4)

That is,

δ̂_i = (S_i^T W^T X X^T W S_i)^{-1} S_i^T W^T X X^T y_i,   i = 1, ..., G,

and

vec({δ̃_i}) = ((⊕_i S_i^T)(Σ^{-1} ⊗ W^T P W)(⊕_i S_i))^{-1} (⊕_i S_i^T) vec(W^T P Y Σ^{-1}),

where P = X(X^T X)^{-1} X^T. The computational burden in deriving these estimators can be reduced if the TSEM (6.4) is premultiplied by I_G ⊗ (R^(1))^{-T}, where R^(1) is the upper triangular factor in the QRD of X. That is, the TSEM can be written as

vec(R^(2)) = (⊕_i R S_i) vec({δ_i}) + vec(Ũ),   (6.5)

where Ũ = Q_A^T U, and the Qs and Rs come from the incomplete QRD of the augmented matrix W ≡ (X   Y) given by

Q^T W = ( R^(1)   R^(2) )  }K
        (   0     R^(3) )  }T−K,   with Q^T = ( Q_A^T )  }K
                                               ( Q_B^T )  }T−K   (6.6)

and

R ≡ ( R^(1)   R^(2) ).   (6.7)

Note that vec(Ũ) has zero mean and variance-covariance matrix Σ ⊗ I_K.
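A compact way to sketch the 2SLS computation is through the thin QR decomposition of X, since Q_A^T = (R^(1))^{-T} X^T (an illustrative sketch with hypothetical names; numpy assumed):

    import numpy as np

    def two_sls(X, W_list, y_list):
        """2SLS via the QRD of X: regress Q_A^T y_i on Q_A^T W_i, which is
        the OLS problem of the scaled TSEM (6.5)."""
        QA, _ = np.linalg.qr(X)              # thin QR: K orthonormal columns
        deltas = []
        for Wi, yi in zip(W_list, y_list):
            A, b = QA.T @ Wi, QA.T @ yi
            deltas.append(np.linalg.lstsq(A, b, rcond=None)[0])
        return deltas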
The 3SLS estimator, denoted by vec({δ̂_i}), is the GLS estimator with Σ replaced by its consistent estimator Σ̂ based on the 2SLS residuals [9, 33, 60]. Computing the Cholesky decomposition

Σ̂ = Ĉ Ĉ^T,   (6.8)

the vec({δ̂_i}) estimator is the solution of the normal equations

A vec({δ̂_i}) = b,   (6.9)

where

A = (⊕_i S_i^T R^T)(Σ̂^{-1} ⊗ I_K)(⊕_i R S_i)   (6.10a)

and

b = (⊕_i S_i^T R^T)(Σ̂^{-1} ⊗ I_K) vec(R^(2)).   (6.10b)
It is not always the case that the disturbance covariance matrix of a simultaneous equations model (SEM) is non-singular. In allocation models, for example, or models with precise observations that imply linear constraints on the parameters, or models in which the number of structural equations exceeds the number of observations, the disturbance covariance matrix is singular [31, 62, 149]. In such models the estimation procedure above fails, since Σ̂ is singular, i.e. Σ̂^{-1} does not exist. The Generalized Linear Least Squares approach can be used to compute the 3SLS estimator of SEMs when Σ̂ is singular or badly ill-conditioned.
The methods described in the previous chapter for solving SURE models can be extended to 3SLS estimation of SEMs [68, 78, 86, 91, 110, 111]. Let the TSEM (6.5) be rewritten in the equivalent form

vec(R^(2)) = (⊕_i R S_i) vec({δ_i}) + vec(V Ĉ^T),   (6.11)

where the rank of Σ̂ = Ĉ Ĉ^T is g ≤ G, Ĉ ∈ ℜ^{G×g} has full column rank, and V is a random K × g matrix, defined as V Ĉ^T = Q_A^T U. That is, vec(V) has zero mean and variance-covariance matrix I_{gK}. With this formulation, the 3SLS estimator of vec({δ_i}) comes from the solution to the generalized linear least squares problem (GLLSP)

argmin_{{δ_i}, V} ‖V‖_F²   subject to   vec(R^(2)) = (⊕_i R S_i) vec({δ_i}) + vec(V Ĉ^T).   (6.12)
For its solution consider the factorizations

Q̃^T (⊕_i R S_i) = ( ⊕_i R̃^(i) )  }E
                   (     0     )  }GK−E,   with Q̃^T vec(R^(2)) = ( y^(1) )  }E
                                                                   ( y^(2) )  }q
                                                                   ( y^(3) )  }GK−E−q   (6.13)

and

Q̃^T (Ĉ ⊗ I_K) P̃ = ( L_11   L_12 )  }E
                    (  0     L_22 )  }q
                    (  0       0  )  }GK−E−q,   (6.14)

with column-block widths gK−q and q. Here E = Σ_{i=1}^G e_i, R̃ ≡ ⊕_i R̃^(i), and the R̃^(i) ∈ ℜ^{e_i×e_i} and L_22 are upper triangular non-singular matrices; Q̃ and P̃ are GK × GK and gK × gK orthogonal matrices, respectively; and E + q is the column rank of ((⊕_i R S_i)   (Ĉ ⊗ I_K)) [4, 51, 100, 113]. The orthogonal matrix Q̃ is defined as

Q̃^T = ( I_E   0   ) ( Q̃_A^T ) = ( Q̃_A^T      )  }E
       (  0   Q̂^T ) ( Q̃_B^T )   ( Q̂^T Q̃_B^T )  }GK−E,   (6.15)

where Q̃_A = ⊕_i Q_{A,i}, Q̃_B = ⊕_i Q_{B,i}, and the QRD of R S_i (i = 1, ..., G) and the complete QRD of Q̃_B^T (Ĉ ⊗ I_K) are given, respectively, by

Q_i^T (R S_i) = ( R̃^(i) )  }e_i
                (   0   )  }K−e_i,   with Q_i = (Q_{A,i}   Q_{B,i}),   (6.16)

and

Q̂^T (Q̃_B^T (Ĉ ⊗ I_K)) P̃ = ( 0   L_22 )  }q
                             ( 0     0  )  }GK−E−q.   (6.17)

The transformed data vector in (6.13) and the blocks (L_11   L_12) then follow, as in the previous chapter, from (6.18) and (6.19), respectively.
Conformally partitioning

v^T = vec(V)^T P̃   as   v^T = (v_1^T   v_2^T),

the SEM is consistent iff

y^(3) = 0,   (6.20)

in which case

v_2 = L_22^{-1} y^(2)   (6.21)

and the arbitrary vector v_1 is chosen to be zero. The 3SLS estimator is the solution of the block-triangular system

R̃^(i) δ̂_i = y_i^(1) − h_i,   i = 1, ..., G,   (6.22)

where

L_12 v_2 = ( h_1 )  }e_1
           ( ... )
           ( h_G )  }e_G.   (6.23)

The estimator satisfies

δ̂_i = δ_i + (R̃^(i))^{-1} Λ_i v_1

for a suitable matrix Λ_i, implying that E(δ̂_i) = δ_i and that the covariance matrix between δ̂_i and δ̂_j (i, j = 1, ..., G) is given by (6.24).
1.1

The estimator Σ̂ is computed as

Σ̂ = Û^T Û / T,

where Û = (û_1 ... û_G) denotes the residuals of the structural equations. Initially, Û is formed from the residuals of the 2SLS estimators

δ̂_{2SLS}^(i) = (R̃^(i))^{-1} y_i^(1),   i = 1, ..., G,
that is,

û_i = y_i − W S_i δ̂_{2SLS}^(i),   i = 1, ..., G.   (6.25)

Since Û^T Û is needed only through its Cholesky factor, compute the QRD

Q_U^T Û = ( L_U )  }G
          (  0  )  }T−G,   (6.27)

from which it follows that Û^T Û = L_U^T L_U and Ĉ = L_U^T. Note from (6.27) that, if the number of structural equations exceeds the number of observations in each variable (the so-called undersized sample problem), then Û, and consequently Σ̂, will be singular with rank g ≤ T < G. If Û is not of full column rank then Ĉ may be derived from the complete QLD of Û [87].
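The undersized-sample effect is immediate to reproduce numerically (random numbers standing in for the residuals; a two-line check, not from the original text):

    import numpy as np

    rng = np.random.default_rng(2)
    T, G = 5, 8                               # more equations than observations
    U = rng.standard_normal((T, G))           # stand-in for the residual matrix
    Sigma_hat = U.T @ U / T
    print(np.linalg.matrix_rank(Sigma_hat))   # at most T = 5 < G = 8: singular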
1.2 REDUNDANCIES
Under the assumption that the consistency condition (6.20) is satisfied, factorizations (6.13) and (6.14) show that GK − E − q rows of the TSEM (6.11) become redundant due to linear dependence [57]. Let Q̃_C comprise the last GK − E − q columns of Q̃ and write

N = Q̃_C^T = ( N^(1)   ...   N^(G) ),

where N^(i) is a (GK − E − q) × K matrix (i = 1, ..., G). The elements of the pth row of N, denoted by N_{p,:}, can reveal a linear dependency among the equations of the TSEM (p = 1, ..., GK − E − q). Premultiplication of the TSEM by N_{p,:} gives

Σ_{i=1}^G Σ_{t=1}^K N_{p,t}^(i) R_{t,i}^(2) = Σ_{i=1}^G Σ_{t=1}^K N_{p,t}^(i) (R_{t,:} S_i δ_i + ṽ_{t,i}),
with   Σ_{i=1}^G Σ_{t=1}^K N_{p,t}^(i) ṽ_{t,i} = 0,   (6.28)

where the elements of Ṽ = V Ĉ^T satisfy

ṽ_{t,i} = Σ_{j=1}^{g} c_{i,j} v_{t,j},   i = 1, ..., G.
Assume that the μth equation of the λth transformed structural equation

R_{μ,λ}^(2) = R_{μ,:} S_λ δ_λ + ṽ_{μ,λ}   (6.29)

occurs in the linear dependency (6.28), that is, N_{p,μ}^(λ) ≠ 0. Equation (6.28) can then be rearranged to express the μth equation of the λth block in terms of the remaining equations occurring in the dependency. Observe that N^(i) is the product of the corresponding block of Q̂^T with Q_{B,i}^T. Furthermore, if q̂_{p,i} and q̃_{i,t} denote the pth row of that block and the tth column of Q_{B,i}, respectively, then

N_{p,t}^(i) / N_{p,μ}^(λ) = (q̂_{p,i}^T q̃_{i,t}) / (q̂_{p,λ}^T q̃_{λ,μ}).
1.3 INCONSISTENCIES
If y^(3) ≠ 0, the TSEM is inconsistent and the data can be modified to restore consistency:

Q̃^T (vec(Q_A^T Y) − N^T y^(3)) = ( vec({ȳ_i^(1)}) )
                                   (     y^(2)      )
                                   (       0        ).

If vec(Q_A^T Y)* denotes the modified vector in the TSEM such that

vec(Q_A^T Y)* = vec(Q_A^T Y) − N^T y^(3),

then

vec(Ỹ) = D̃ vec(Y) + vec(Q_B Γ),

where Γ is a random (T − K) × G matrix and D̃ = (I_G ⊗ Q_A) D (I_G ⊗ Q_A^T) for the corresponding modification matrix D. Thus premultiplication of (6.3) by D̃ gives a consistent modified SEM. However, the modified model is incompatible with the specification of the original SEM, since the replacement of Y by Q_A Q_A^T Ỹ contradicts the replacement of Y by Y − Q_B R^(3) in W. Further research is needed to determine the modification of the endogenous matrix Y that yields a consistent and correctly specified model.
Consider now the updating problem, in which new sample information, denoted by (6.32), is added to the SEM. The updated TSEM (6.33) is built from the already computed factors R^(1) and R^(2) together with the QRD of the new rows, where H^(1) is upper triangular, H = (H^(1)   H^(2)), and Σ̂ = Ĉ Ĉ^T is a new estimator of Σ; the updated 3SLS estimator is the solution of a GLLSP over {δ_i} and V of the form (6.34). The only computational advantage in not solving the updated SEM afresh is the use of the already computed matrices R^(1) and R^(2) to construct the updated TSEM. The solution of (6.12) cannot be used to reduce the computational burden of solving (6.34).
Similarly, the downdating problem can be described as solving the SEM
(6.3) after the sample information denoted by (6.32) has been deleted. If the
original matrix W is available, then the downdated SEM can be solved afresh
or the matrix that corresponds to R in (6.7) can be derived from downdating the
incomplete QRD of W [39, 50, 51, 75, 108, 109]. However, as in the updating
problem, the solution of the downdated TSEM will have to be recomputed
from scratch.
Assume that the additional variables denoted by W̄ S̄_i ∈ ℜ^{T×ē_i} have been introduced to the ith structural equation. After computing the QRD of the corresponding columns, the matrix computations corresponding to (6.16) and (6.18) become

Q_i^{*T} (R (S_i   S̄_i)) = ( R̃^{*(i)} )  }e_i+ē_i
                            (     0    )  }K−e_i−ē_i

and

( y_i^{(1)*} )  }e_i+ē_i
(   ŷ_i^*   )  }K−e_i−ē_i,

where Q_{B,i}^* = Q_{B,i} Q̈_{B,i} and Q_{A,i}^* = (Q_{A,i}   Q_{B,i} Q̈_{A,i}). Computing the complete QRD of Q_B^{*T}(Ĉ ⊗ I_K) as in (6.17) and the equivalent of (6.19), the 3SLS solution of the modified SEM can be found using (6.22), where Q_B^* = ⊕_i Q_{B,i}^* and, as in the updating problem, Ĉ Ĉ^T is a new estimator of Σ.

Deleting the W S̄_i data matrix from the ith structural equation is equivalent to re-triangularizing R̃^(i) by orthogonal transformations after deleting the columns R̃^(i) S̄_i (i = 1, ..., G). Thus, if the new selector matrix of the ith equation is denoted by S̃_i, and the QRD of R̃^(i) S̃_i is given by

Q̈_i^T (R̃^(i) S̃_i) = ( R̈^(i) )  }e_i−ē_i
                       (   0   )  }ē_i,   with Q̈_i = (Q̈_{A,i}   Q̈_{B,i}),   (6.35)

then (6.17)-(6.19) need to be recomputed with Q_{A,i} and Q_{B,i} replaced by Q_{A,i} Q̈_{A,i} and (Q_{A,i} Q̈_{B,i}   Q_{B,i}), respectively.
Now consider the case where new predetermined variables, denoted by the T × k̄ matrix X̄, are added to the SEM. The modified SEM can be written as

vec(Y) = (⊕_i W̄ S̄_i) vec({δ̄_i}) + vec(U),

where W̄ ≡ (X   X̄   Y) and S̄_i is a (K + G + k̄) × (e_i + k̄_i) selection matrix. Updating the incomplete QRD (6.6) with the new columns yields the modified TSEM (6.36), where now V and vec({δ̄_i}) are a (K + k̄) × g matrix and an (E + Σ_{i=1}^G k̄_i)-element vector, respectively, and

R̄^(1) = ( R^(1)   Q_A^T X̄ )
         (   0        R̈    ),

with R̈ the triangular factor of Q_B^T X̄. The solution of (6.36) can be obtained as in the original case. However, the computation of the QRDs of R̄ S̄_i (i = 1, ..., G) can be reduced significantly if both sides of (6.36) are premultiplied by the orthogonal matrix (Q̄_A   Q̄_B)^T, where

Q̄_A ≡ ⊕_i Q̄_{A,i},   Q̄_B ≡ ⊕_i Q̄_{B,i},   Q̄_{A,i} = ( Q_{A,i}    0      )
                                                        (    0     I_{k̄_i} ),   Q̄_{B,i} ≡ (Q_{B,i}   0)

for i = 1, ..., G. In this case the upper triangular factor in the QRD of R̄ S̄_i is given in part by the already computed R̃^(i).
Consider now the case where the structural parameters are subject to the separable linear equality constraints

H_i δ_i = η_i,   i = 1, ..., G,   (6.37)

where H_i ∈ ℜ^{d_i×e_i} has full row rank, η_i ∈ ℜ^{d_i}, d ≡ Σ_{i=1}^G d_i, and d_i < e_i (i = 1, ..., G). The constrained 3SLS estimator can be found from the solution of

argmin_{{δ_i}, V} ‖V‖_F²   subject to   vec(R^(2)) = (⊕_i R S_i) vec({δ_i}) + vec(V Ĉ^T)
                                        and   H_i δ_i = η_i   (i = 1, ..., G),   (6.38)

which, under the assumption that the consistency rule (6.20) is satisfied, can be written as a reduced problem of the form (6.39), after the constraint matrices have been factorized and partitioned as in (6.40).
158
Q~ (~l
t) P
and
AT
Qc
3.1
gK-q
( *)
Y2
y(2)
~2)q
o d+q-q
Y
(A(2))
:9(3) .
The basis of the null space (BNS) method and the direct elimination (DE) method are alternative means for solving the constrained 3SLS problem. Both methods reparameterize the constraints and solve a reduced unconstrained SEM of E − d parameters [8, 14, 77, 93, 130, 131, 145]. Consider the case of separable constraints (6.37). In the BNS method the coefficient vector δ_i is expressed as

δ_i = Q̈_{A,i} η̃_i + Q̈_{B,i} β̃_i,   (6.41)

where the QRD of H_i^T is given by

Q̈_i^T H_i^T = ( R̈_i )  }d_i
               (  0  )  }e_i−d_i,   with Q̈_i = (Q̈_{A,i}   Q̈_{B,i}),   (6.42)

and η̃_i solves R̈_i^T η̃_i = η_i. Substituting (6.41) into the TSEM and premultiplying by Q̃^T gives

vec({y_i^(1) − h̃_i}) = (⊕_i R̃_i) vec({β̃_i}) + L_11 v_1,

where h̃_i = R̃^(i) Q̈_{A,i} η̃_i and R̃_i = R̃^(i) Q̈_{B,i} (i = 1, ..., G). Once the estimator of β̃_i, say β̂_i, is derived from the solution of the GLLSP

argmin_{{β̃_i}, v_1} ‖v_1‖²   subject to   vec({y_i^(1) − h̃_i}) = (⊕_i R̃_i) vec({β̃_i}) + L_11 v_1,   (6.46)

then the constrained 3SLS estimator of δ_i can be found from (6.41) with β̃_i replaced by β̂_i.
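The null-space reparameterization at the core of the BNS method can be illustrated generically (a sketch under the stated assumptions, with hypothetical names; it is not the book's exact factorization (6.42), but the same idea for a single constraint set H δ = η):

    import numpy as np

    def bns_parameterization(H, eta):
        """QR of H^T splits delta into a particular solution plus the range
        of Q_B, on which H vanishes: delta = QA w + QB beta_free."""
        Q, R = np.linalg.qr(H.T, mode="complete")
        d = H.shape[0]
        QA, QB = Q[:, :d], Q[:, d:]
        w = np.linalg.solve(R[:d, :].T, eta)   # H QA w = R1^T w = eta
        return QA @ w, QB

    H = np.array([[1.0, 1.0, 0.0]])            # one constraint on a 3-vector
    eta = np.array([2.0])
    part, QB = bns_parameterization(H, eta)
    assert np.allclose(H @ part, eta)
    assert np.allclose(H @ QB, 0.0)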
In the direct elimination method, the factorization

H_i Π_i = ( L̈_i   H̄_i )   (6.47)

is computed, where Π_i is a permutation matrix and L̈_i ∈ ℜ^{d_i×d_i} is a non-singular lower-triangular matrix. If

Π_i^T δ_i = ( δ̈_i )  }d_i
            ( δ̃_i )  }e_i−d_i,   (6.48)

then the constraints give

δ̈_i = L̈_i^{-1} (η_i − H̄_i δ̃_i),   (6.49)

and substitution into the TSEM yields

vec(R^(2)) − vec({R S̈_i δ̈_i}) = (⊕_i R (S̃_i − S̈_i L̈_i^{-1} H̄_i)) vec({δ̃_i}) + vec(V Ĉ^T),

where (S̈_i   S̃_i) is the conformable partition of the selection matrix. As in the BNS method, the premultiplication of the latter by Q̃^T gives, together with y^(2) = L_22 v_2,   (6.50)

vec({y_i^(1) − h_i}) = (⊕_i R̃_i) vec({δ̃_i}) + L_11 v_1.   (6.51)
160
{Bi }, VI
IIVi 112 subject to vec( {Yi - hi}) = (EBiRi)vec( {Bi}) + L11 Vi, (6.52)
will give an estimator for Bi, which is then used in (6.49) to compute ~i (i =
1, ... ,G). Finally, the constrained 3SLS estimator of Oi is computed by
IIi
( Bi)
~i
The above methods may be trivially extended for the case of cross-section
constraints.
COMPUTATIONAL STRATEGIES

The QRD and its modifications are the main components of the approach to computing the 3SLS estimator given here. Strategies for computing the factorizations (6.16) and (6.17), when Σ̂ is non-singular, are now investigated. The QRDs of R S_i (i = 1, ..., G) in (6.16) are mutually independent and can be computed simultaneously. In the first chapter various strategies have been discussed for the parallel computation of the QRDs. However, the particular structure of R S_i can be exploited to reduce the computational burden of the QRD (6.16). The QRDs (6.16) can be regarded as being equivalent to re-triangularizing an upper trapezoidal matrix after deleting columns, where the resulting matrix has at least as many rows as columns. Let S_i ≡ (e_{λ_{i,1}} ... e_{λ_{i,e_i}}) and define the e_i-element integer vector σ_i = (λ_{i,1} ... λ_{i,k_i}   K ... K), where e_{λ_{i,j}} (j = 1, ..., e_i) is the λ_{i,j}th column of I_{K+G} and λ_{i,1} < ... < λ_{i,e_i}. Figure 6.1 shows a Givens annihilation scheme for computing the QRD (6.16), where σ_i = (3, 6, 7, 9, 12, 15, 15, 15) and g_i = 3. Generally, the total number of rotations applied to compute the QRD (6.16) for i = 1, ..., G is given by

η({σ_i}, {k_i}, {g_i}, G) = Σ_{i=1}^G Σ_{j=1}^{g_i+k_i} (σ_{i,j} − j).   (6.53)

Note that (6.53) gives the maximum number of rotations for computing the QRDs (6.16). This can possibly be reduced by exploiting the structure of the matrices R S_i (i = 1, ..., G), which depends on the specific characteristics of the SEM. To illustrate this, consider the case where R S_i has the structure shown in Figure 6.1.
Figure 6.1.  A Givens annihilation scheme for computing the QRD (6.16), with σ_i = (3, 6, 7, 9, 12, 15, 15, 15) and g_i = 3.

Thus, the number of rotations needed to compute the QRD of R S_i is determined by the structure of R S_i.
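Formula (6.53) is straightforward to evaluate; the sketch below (hypothetical helper) computes the rotation count for the sequence used in Figure 6.1:

    def rotation_count(sigmas):
        """Total Givens rotations (6.53): for each equation, the sum of
        (sigma_{i,j} - j) over its columns, with 1-based index j."""
        return sum(s - j for sig in sigmas
                         for j, s in enumerate(sig, start=1))

    print(rotation_count([(3, 6, 7, 9, 12, 15, 15, 15)]))   # 46 rotations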
Finally, let P̃ be defined so that

P̃^T ((Ĉ^T ⊗ I_K) Q̃_B) = ( 0   L_22^T ),   (6.54)

with column-block widths E and GK − E. The matrix (Ĉ^T ⊗ I_K) Q̃_B has the block lower-triangular structure

( L̃_1^(1)
  Ã_{2,1}^(1)   L̃_2^(1)
  Ã_{3,1}^(1)   Ã_{3,2}^(1)   L̃_3^(1)
                           ...
  Ã_{G,1}^(1)   Ã_{G,2}^(1)   ...   L̃_G^(1) ),   (6.55)

with diagonal blocks L̃_i^(1) of order K − e_i, so that the factorization (6.54) can be computed by annihilating the sub-diagonal blocks Ã_{i,j}^(1).
References
[10] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Observations and Sources of Collinearity. John Wiley and Sons, 1980.
[11] C. Bendtsen, C. Hansen, K. Madsen, H. B. Nielsen, and M. Pinar. Implementation of QR up- and downdating on a massively parallel computer.
Parallel Computing, 21:49-61, 1995.
[12] M. W. Berry, J. J. Dongarra, and Y. Kim. A parallel algorithm for the
reduction of a nonsymmetric matrix to block upper-Hessenberg form.
Parallel Computing, 21:1189-1211, 1995.
[13] C. Bischof and C. F. Van Loan. The WY representation for products of Householder matrices. SIAM Journal on Scientific and Statistical Computing, 8(1):2-13, 1987.
[48] W. M. Gentleman. Least squares computations by Givens transformations without square roots. Journal of IMA, 12:329-336, 1973.
[49] W. M. Gentleman. Some complexity results for matrix computations on
parallel processors. Journal of the ACM, 25(1):112-115, 1978.
[50] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders. Methods for modifying matrix factorizations. Mathematics of Computation,
28(126):505-535, 1974.
[51] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, 3rd edition, 1996.
[52] J. H. Goodnight. A tutorial on the SWEEP operator. The American
Statistician, 33(3):116-135, 1979.
[53] A. R. Gourlay. Generalisation of elementary Hermitian matrices. The
Computer Journal, 13(4):411-412, 1970.
[54] M. Gulliksson. Iterative refinement for constrained and weighted linear
least squares. BIT, 34:239-253, 1994.
[55] M. Gulliksson and P.-Å. Wedin. Modifying the QR decomposition to constrained and weighted linear least squares. SIAM Journal on Matrix Analysis and Applications, 13(4):1298-1313, 1992.
[56] S. J. Hammarling. The numerical solution of the general Gauss-Markov linear model. In T. Durrani, J. Abbiss, J. Hudson, R. Mordam, J. McWhirter, and T. Moore, editors, Mathematics of Signal Processing, pages 441-456. Oxford University Press, 1987.
[57] S. J. Hammarling, E. M. R. Long, and P. W. Martin. A generalized linear
least squares algorithm for correlated observations, with special reference to degenerate data. DITC 33/83, National Physical Laboratory,
1983.
[58] J. A. Hausman, W. K. Newey, and W. E. Taylor. Efficient estimation
and identification of simultaneous equation models with covariance restrictions. Econometrica, 55:849-874, 1987.
[59] C. S. Henkel and R. J. Plemmons. Recursive least squares on a hypercube multiprocessor using the covariance factorization. SIAM Journal on Scientific and Statistical Computing, 12(1):95-106, 1991.
[60] L. S. Jennings. Simultaneous equations estimation (computational aspects). Journal of Econometrics, 12:23-39, 1980.
[73] E. J. Kontoghiorghes and M. R. B. Clarke. Computing the complete orthogonal decomposition using a SIMD array processor. In Lecture Notes
in Computer Science, volume 604, pages 660-663. Springer-Verlag,
1993.
[74] E. J. Kontoghiorghes and M. R. B. Clarke. Parallel reorthogonalization
of the QR decomposition after deleting columns. Parallel Computing,
19(6):703-707, 1993.
[75] E. J. Kontoghiorghes and M. R. B. Clarke. Solving the updated and downdated ordinary linear model on massively parallel SIMD systems. Parallel Algorithms and Applications, 1(2):243-252, 1993.
[76] E. J. Kontoghiorghes and M. R. B. Clarke. Stable parallel algorithms
for computing and updating the QR decomposition. In Proceedings
of the IEEE TENCON'93, pages 656--659, Beijing, 1993. International
Academic Publishers.
[77] E. J. Kontoghiorghes and M. R. B. Clarke. A parallel algorithm for repeated processing estimation of linear models with equality constraints.
In G. R. Joubert, D. Trystram, and F. J. Peters, editors, Parallel Computing: Trends and Applications, pages 525-528. Elsevier Science B.V.,
1994.
[78] E. J. Kontoghiorghes and M. R. B. Clarke. An alternative approach
for the numerical solution of seemingly unrelated regression equations
models. Computational Statistics & Data Analysis, 19(4):369-377,
1995.
[79] E. J. Kontoghiorghes and M. R. B. Clarke. Solving the general linear
model on a SIMD array processor. Computers and Artificial Intelligence, 14(4):353-370, 1995.
[80] E. J. Kontoghiorghes, M. R. B. Clarke, and A. Balou. Improving the
performance of optimum parallel algorithms on SIMD array processors: programming techniques and methods. In Proceedings ofthe IEEE
TENCON'93, pages 1203-1206, Beijing, 1993. International Academic
Publishers.
[81] E. J. Kontoghiorghes, M. Clint, and E. Dinenis. Parallel strategies for estimating the parameters of a modified regression model on a SIMD array processor. In A. Prat, editor, COMPSTAT, Proceedings in Computational Statistics, pages 319-324. Physica-Verlag, 1996.
[82] E. J. Kontoghiorghes, M. Clint, and H.-H. Nageli. Recursive least-squares using Householder transformations on massively parallel SIMD systems. Parallel Computing, 25(8), 1999. (Forthcoming).
170
[83] E. J. Kontoghiorghes and E. Dinenis. Data parallel algorithms for solving least-squares problems by QR decomposition. In F. Faulbaum, editor, SoftStat'95: 8th conference on the scientific use of statistical software, volume 5 of Advances in Statistical Software, pages 561-568.
Stuttgart: Lucius & Lucius, 1996.
[84] E. J. Kontoghiorghes and E. Dinenis. Data parallel QR decompositions of a set of equal size matrices used in SURE model estimation. Journal of Mathematical Modelling and Scientific Computing, 6:421-427, 1996.
[85] E. J. Kontoghiorghes and E. Dinenis. Solving the sequential accumulation least squares with linear equality constraints problem on a SIMD array processor. Zeitschrift für Angewandte Mathematik und Mechanik (ZAMM), 76(S1):447-448, 1996.
[86] E. J. Kontoghiorghes and E. Dinenis. Solving triangular seemingly unrelated regression equations models on massively parallel systems. In
M. Gilli, editor, Computational Economic Systems: Models, Methods
& Econometrics, volume 5 of Advances in Computational Economics,
pages 191-201. Kluwer Academic Publishers, 1996.
[87] E. J. Kontoghiorghes and E. Dinenis. Computing 3SLS solutions
of simultaneous equation models with a possible singular variancecovariance matrix. Computational Economics, 10:231-250, 1997.
[88] E. J. Kontoghiorghes and E. Dinenis. Towards the parallel implementation of the SURE model estimation algorithm. Journal of Mathematical
Modelling and Scientific Computing, 8:335-341, 1997.
[89] E. J. Kontoghiorghes and D. Parkinson. Parallel strategies for rank-k updating of the QR decomposition. Technical Report TR-728, Department of Computer Science, Queen Mary and Westfield College, University of London, 1996.
[90] E. J. Kontoghiorghes, D. Parkinson, and H.-H. Nageli. QR decomposition of dense matrices on massively parallel SIMD systems. In Proceedings of the 15th IMACS Congress on Scientific Computation, Modelling and Applied Mathematics. Wissenschaft und Technik Verlag, 1997.
[91] S. Kourouklis and C. C. Paige. A constrained least squares approach
to the general Gauss-Markov linear model. Journal of the American
Statistical Association, 76(375):620-625, 1981.
[92] K. Lahiri and P. Schmidt. On the estimation of triangular structural
systems. Econometrica, 1978.
[107] W. Oberhofer and J. Kmenta. A general procedure for obtaining maximum likelihood estimates in generalized regression models. Econometrica, 42(3):579-590, 1974.
[108] S. J. Olszanskyj, J. M. Lebak, and A. W. Bojanczyk. Rank-k modification methods for recursive least squares problems. Numerical Algorithms, 7:325-354, 1994.
[109] C. C. Paige. Numerically stable computations for general univariate
linear models. Communications on Statistical and Simulation Computation, 7(5):437-453, 1978.
[110] C. C. Paige. Computer solution and perturbation analysis of generalized linear least squares problems. Mathematics of Computation,
33(145):171-183, 1979.
[111] C. C. Paige. Fast numerically stable computations for generalized
linear least squares problems. SIAM Journal on Numerical Analysis,
16(1):165-171, 1979.
[112] C. C. Paige. The general linear model and the generalized singular value decomposition. Linear Algebra and its Applications, 70:269-284, 1985.