Vous êtes sur la page 1sur 195

PARALLEL ALGORITHMS FOR LINEAR MODELS

Advances in Computational Economics


VOLUME 15

SERIES EDITORS
Hans Amman, University ofAmsterdam, Amsterdam, The Netherlands
Anna Nagurney, University of Massachusetts at Amherst, USA

EDITORIAL BOARD
Anantha K. Duraiappah, European University Institute
John Geweke, University of Minnesota
Manfred Gilli, University of Geneva
Kenneth L. Judd, Stanford University
David Kendrick, University of Texas at Austin
Daniel McFadden, University of California at Berkeley
Ellen McGrattan, Duke University
Reinhard Neck, University of Klagenfurt
Adrian R. Pagan, Australian National University
John Rust, University of Wisconsin
Berc Rustem, University of London
Hal R. Varian, University ofMichigan

The titles published in this series are listed at the end of this volume.

Parallel Algorithms for


Linear Models
Numerical Methods and
Estimation Problems
by

Erricos John Kontoghiorghes


Universite de Neuchtel, Switzerland

....

"

Springer Science+Business Media, LLC

Library of Congress Cataloging-in-Publication Data


Kontoghiorghes, Erricos John.
Parallel algorithms for linear models : numerical methods and estimation problems / by
Erricos John Kontoghiorghes.
p. cm. -- (Advances in computational economics; v. 15)
lncludes bibliographical references and indexes.
ISBN 978-1-4613-7064-2
ISBN 978-1-4615-4571-2 (eBook)
DOI 10.1007/978-1-4615-4571-2
1. Linear models (Statistics)--Data processing. 2. Parallel algorithms. 1. Title. II.
Series.
QA276 .K645 2000
519.5'35--dc21

99-056040

Copyright 2000 by Springer Science+Business Media New York


Originally published by Kluwer Academic Publishers, New York in 1992
Softcover reprint ofthe hardcover lst edition 1992
AII rights reserved. No part of this publication may be reproduced, stored in a retrieval
system or transmitted in any form or by any means, mechanical, photo-copying, recording,
or otherwise, without the prior written permission of the publisher, Springer Science +
Business Media, LLC
Printed on acid-free paper.

To Laurence and Louisa

Contents

List of Figures
List of Tables
List of Algorithms
Preface

ix
xi
xiii
xv

1. LINEAR MODELS AND QR DECOMPOSmON


1
Introduction
2
Linear model specification
2.1
The ordinary linear model
2.2
The general linear model
3
Forming the QR decomposition
3.1
The Householder method
3.2
The Givens rotation method
3.3
The Gram-Schmidt orthogonalization method
4
Data parallel algorithms for computing the QR decomposition
4.1
Data: parallelism and the MasPar SIMD system
4.2
The Householder method
4.3
The Gram-Schmidt method
4.4
The Givens rotation method
4.5
Computational results
5
QRD of large and skinny matrices
5.1
The CPP GAMMA SIMD system
5.2
The Householder QRD algorithm
5.3
QRD of skinny matrices
6
QRD of a set of matrices
6.1
Equal size matrices
6.2
Mattices with different number of columns

1
1
1
2
7
10
11
13
16
17
17
19
21
22
23
23
24
25
27
29
29
34

2. OLM Nor OF FULL RANK


1
Introduction
2
The QLD of the coefficient matrix
2.1
SIMD implementation
3
Triangularizing the lower trapezoid

39
39
40
41
43

viii

PARALLEL ALGORITHMS FOR UNEAR MODELS

4
5

3.1
The Householder method
3.2
The Givens method
Computing the orthogonal matrices
Discussion

3. UPDATING AND DOWNDATING THE OLM


1
Introduction
2
Adding observations
2.1
The hybrid Householder algorithm
2.2
The Bitonic and Greedy Givens sequences
2.3
Updating with a block lower-triangular matrix
2.4
QRD of structured banded matrices
2.5
Recursive and linearly constrained least-squares
3
Adding exogenous variables
4
Deleting observations
4.1
Parallel strategies
5
Deleting exogenous variables

43

46

49
54
57
57
58
60
67
75
82
87
90
92
94
99

4. THE GENERAL LINEAR MODEL


1
Introduction
2
Parallel algorithms
3
Implementation and performance analysis

105
105
108
111

5. SUREMODELS
1
Introduction
2
The generalized linear least squares method
3
Triangular SURE models
3.1
Implementation aspects
4
Covariance restrictions
4;1
The QLD of the block bi-diagonal matrix
4.2
Parallel strategies
4.3
Common exogenous variables

117
117
121
123
127
129
133
138
140

6. SIMULTANEOUS EQUATIONS MODELS


1
Generalized linear least squares
1.1
Estimating the disturbance covariance matrix
1.2
Redundancies
1.3
Inconsistencies
2
Modifying the SEM
3
Linear Equality Constraints
3.1
Basis of the null space and direct elimination methods
4
Computational Strategies

147
149
151
152
153
154
157
158
160

References
Author Index
Subject Index

163
177
179

List of Figures

1.1
1.2
1.3
1.4
1.5
1.6
1.7
2.1
2.2
2.3
2.4
3.1
3.2
3.3
3.4
3.5

Geometric interpretation of least-squares for the OLM


problem.
Illustration of Algorithm 1.2, where m = 4 and n = 3.
The column and diagonally based Givens sequences for
computing the QRD.
Cyclic mapping of a matrix and a vector on the MasPar
MP-1208.
Examples of Givens rotations schemes for computing
theQRD.
Execution time ratio between 2-D and 3-D algorithms
for computing the QRDs, where G = 16.
Stages of computing the QRDs (1.47).
Annihilation pattern of (2.4) using Householder reflections.
Givens sequences for computing the orthogonal factorization (2.4).
Illustration of the implementation phases ofPGS, where
es=4.
Thefill-in of the submatrix PI:n,I:n at each phase of Algorithm 2.4.
Updating Givens sequences for computing the orthogonal factorizations (3.6), where k = 8 and n = 4.
Ratio of the execution times produced by the models of
the cyclic-layout and column-layout implementations.
Computing (3.21) using Givens rotations.
The bitonic algorithm, where n = 6, k = 18 and PI =
P2 = P3 = 6.
The Greedy sequence for computing (3.6a), where n =
6andk= 18.

4
15
15
18
22
34
36
44
47
49
53
59
63
71

72
73

PARAILELALGORITHMS FOR liNEAR MODELS

3.6
3.7
3.8
3.9
3.10
3.11
3.12
3.13
3.14
3.15
3.16
4.1
4.2
4.3
4.4
4.5
5.1
5.2
5.3
5.4
5.5
5.6
6.1

Computing the factorization (3.23) using the diagonallybased method, where G = 5.


Parallel strategies for computing the factorization (3.24)
Computing factorization (3.23).
The column-based method using the UGS-2 scheme.
The column-based method using the Greedy scheme.
Illustration of the annihilation patterns of method-I.
Computing (3.31) for b = 8, l'}* = 3 and j = 2. Only
the affected matrices are shown.
Illustration of method-3, where p = 4 and g = 1.
Givens parallel strategies for downdating the QRD.
Illustration of the SK-based scheme for computing the
QRDofRS.
Greedy-based schemes for computing the QRD of RS.
Sequential Givens sequences for computing the QLD (4.3a).
The SK sequence.
G(16)B with e{16, 18,8) = 8.
The application of the SK sequence to compute (4.3)
on a 2-D SIMD computer.
Examples of the MSK(p) sequence for computing the QLD.
The correlations Pj,j in the SURE--CC model for l'}j = i
and l'}j = Iii.
Factorization process for computing the QLD (5.35)
using Algorithm 5.3.
Annihilation sequences of computing the factorization (5.40).
Givens sequences of computing the factorization (5.45).
Number of CDGRs for computing the orthogonal factorization (5.40) using the PDS.
Annihilation sequence of triangularizing (5.55).
Givens sequence for computing the QRD of RSj.

76
77
78
80
81
83
85
86
96
102
104
107
109
109
109
110
131
136
137
138
139
144
161

List of Tables

1.1
1.2
1.3
1.4
1.5
2.1
2.2
2.3
2.4
3.1
3.2
3.3
3.4
3.5
3.6

Times (in seconds) of computing the QRD of a 128M x


64N matrix.
Execution times (in seconds) of the CPP LALIB QR_FACTOR
subroutine and the BPHA.
Execution times (in seconds) ofthe CPP LA LIB QR-FACTOR
subroutine and Algorithm 1.9.
Times (in seconds) of simultaneously computing the
QRDs (1.47).
The task-farming and scattering methods for computing the QRDs (1.47).
Computing the QLD (2.3) (in seconds), where m = Mes
and n = Nes.
The CDGRs of the PGS for computing the factorization (2.4).
Computing (2.4) (in seconds), where k = Kes and nk = Tles.
Times (in seconds) of reconstructing the orthogonal matrices QT and P on the DAP.
Execution times (msec) of the Householder and Givens
methods for updating the QRD on the DAP.
Execution times (in seconds) for k = 11264.
Execution times (in seconds) of the RLS Householder
algorithm on the MasPar.
Execution times (in seconds) of the RLS Householder
algorithm on the GAMMA.
Number of CDGRs required to compute the factorization (3.6a).
Times (in seconds) for computing the orthogonal factorization (3.6a).

23
27
28
33
38
44
47
50
54
60
63
65
66
74
74

xu

PARALLEL ALGORITHMS FOR LINEAR MODELS

3.7
3.8
3.9
4.1
4.2
5.1
5.2

Computing the QRD of a structured banded matrix using method-3.


Estimated time (msec) required to compute x(i) (i =
2,3, ... ), where mi = 96, n = 32N and k = 32K.
Execution time (in seconds) for downdating the OLM.
Execution times (in seconds) of the MSK(Aes/2).
Computing (4.3) (in seconds) without explicitly constructing QT and P.
Computing (5.24), where T - k - 1 = -res, G - 1 = Jles
and es = 32.
Execution times of Algorithm 5.2 for solving Rr = A.

87
91
98
114
115
128
129

List of Algorithms

Computing the QRD of A E 9\mxn using Householder transformations.


12
1.2 The column-based Givens sequence for computing the
QRD of A E S)\mxn.
14
1.3 The diagonally-based Givens sequence for computing the
QRD of A E S)\mxn.
16
1.4 The Classical Gram-Schmidt method for computing the
QRD of A E S)\mxn.
16
1.5 The Modified Gram-Schmidt method for computing the
QRD.
17
1.6 QR factorization by Householder transformations on SIMD
systems.
20
1.7 The MGS method for computing the QRD on SIMD systems.
21
1.8 The CPP LALIB method for computing the QR Decomposition. 26
1.9 Householder with parallelism in the first dimension.
28
1.10 The Householder algorithm.
30
1.11 The Modified Gram-Schmidt algorithm.
31
1.12 The task-farming approach for computing the QRDs (1.47)
on p (p G) processors using a SPMD paradigm.
37
2.1 The QL decomposition of A.
43
2.2 Triangularizing the lower trapezoid using Householder reflections.
45
2.3 The reconstruction of the orthogonal matrix Q in (2.3).
51
2.4 The reconstruction of the orthogonal matrix P in (2.4).
53
3.1 The data-parallel Householder algorithm.
61
3.2 The bitonic algorithm for updating the QRD, where R == Ri~ I.
70
3.3 The computation of (3.63) using Householder transformations.
97
5.1 An iterative algorithm for solving tSURE models.
126
1.1

PARALLEL ALGORITHMS FOR LINEAR MODELS

XIV

5.2
5.3

The parallel solution of the triangular system Rr = L\.


Computing the QLD (5.35).

129
135

Preface

The monograph provides a complete and detailed account of the design,


analysis and implementation of parallel algorithms for solving large-scale linear models. It investigates and presents efficient, numerically stable algorithms
for computing the least-squares estimators and other quantities of interest on
massively parallel systems.
The least-squares computations are based on orthogonal transformations,
in particular the QR and QL decompositions. Parallel algorithms employing Givens rotations and Householder transformations have been designed for
various linear model estimation problems. Some of the algorithms presented
are parallel versions of serial methods while others are original designs. The
implementation of the major parallel algorithms is described. The necessary
techniques and insights needed for implementing efficient parallel algorithms
on multiprocessor systems are illustrated in detail. Although most of the algorithms have been implemented on SIMD systems the data parallel computations of these algorithms should, in general, be applicable to any massively
parallel computer.
The monograph is in two parts. The first part consists of four chapters
and deals with the computational aspects for solving linear models that have
applicability in diverse areas. The remaining two chapters form the second
part which concentrates on numerical and computational methods for solving various problems associated with seemingly unrelated regression equations
(SURE) and simultaneous equations models.
Chapter 1 provides a brief introduction to linear models and considers various forms for solving the QR decomposition on serial and parallel systems.
Emphasis is given to the design and efficient implementation of the parallel
algorithms. The second chapter investigates the performance and practical issues for solving the ordinary linear model (OLM), with the exogenous matrix
being ill-conditioned or having deficient rank, on a SIMD system.

xvi

PARALLEL ALGORITHMS FOR LINEAR MODELS

Chapter 3 is devoted to methods for up- and down-dating the OLM. It provides the necessary computational tools and techniques that are often required
in econometrics and optimization. The efficient parallel strategies for modifying the OLM can be used as primitives for designing fast econometric algorithms. For example, the Givens and Householder algorithms used to compute the QR decomposition after rows have been added or columns have been
deleted from the original matrix have been efficiently employed to the solution
of the SURE and simultaneous equations models. The updating methods are
also employed to solve the recursive ordinary linear model with linear equality constraints. The numerical methods based on the basis of the null space
and direct elimination methods are in turn adopted for the solution of linearly
constrained simultaneous equations models.
The fourth chapter investigates parallel algorithms for solving the general
linear model - the parent model of econometrics - when it is considered as
a generalized linear least-squares problem. This approach has subsequently
been efficiently used to compute solutions of SURE and simultaneous equations models without having as prerequisite the non-singularity of the variancecovariance matrix of the disturbances. Chapter 5 presents a parallel algorithm
for solving triangular SURE models. The problem of computing estimates of
parameters in SURE models with variance inequalities and positivity of correlations constraints is also considered. Finally, chapter 6 presents algorithms for
computing the three-stage least squares estimator of simultaneous equations
models (SEMs). Numerical and computational methods for solving SEMs with
separable linear equalities constraints and when the SEM has been modified
by deleting or adding new observations or variables are discussed. Expressions revealing linear combinations between the observations which become
redundant are also presented.
These novel computational methods for solving SURE and simultaneous
equations models provide new insights that can be useful to econometric modelling. Furthermore, the computational and numerical efficient treatment of
these models, which are regarded as the core of econometric theory, can be
considered as the basis for future research. The algorithms can be extended or
modified to deal with models that occur in particular econometric applications
and have specific characteristics that need to be taken into account.
The practical issues of the parallel algorithms and the theoretical aspects
of the numerical methods will be of interest to a broad range of researchers
working in the areas of numerical and computational methods in statistics and
econometrics, parallel numerical algorithms, parallel computing and numerical linear algebra. The aim of this monograph is to promote research in the
interface of econometrics, computational statistics, numerical linear algebra
and parallelism.

Preface

xvii

The research described in this monograph is based on the work that I have
pursued in the last ten years. During this period I was privileged to have the
opportunity to discuss various issues related to my work with Maurice Clint.
His numerous suggestions and constructive comments have been both inspiring
and invaluable. I am grateful to Dennis Parkinson for his valuable information
that he has provided on many occasions on various aspects related to SIMD
systems, David A. Belsley for his constructive comments and advice on the
solution of SURE and simultaneous equations models, Hans-Heinrich Nageli
for his comments and constructive criticism on performance issues of parallel
algorithms and the late Mike R.B. Clarke for his suggestions on Givens sequences and matrix computations. I am indebted to Paolo Foschi and Manfred
Gilli for their comments on this monograph and to Sharon Silverne for proof
reading the manuscript. The author accepts full responsibility for any errors
that may be found in this work.
Some of the results of this monograph were originally published in various
papers [69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 84, 85, 86, 87, 88] and
reproduced by kind permission of Elsevier Science Publishers B.Y. 1993,
1994, 1995, 1999; Gordon and Breach Publishers 1993, 1995; John Wiley
& Sons Limited 1996, 1999; IEEE 1993; Kluwer Academic Publishers
1997, 1999; Principia Scientia 1996, 1997; SAP-Slovak Academic Press
Ltd. 1995; and Springer-Verlag 1993, 1996, 1999.

Chapter 1
LINEAR MODELS AND QR DECOMPOSITION

INTRODUCTION

A common problem in statistics is that of estimating parameters of some


assumed relationship between one or more variables. One such relationship is
(1.1)
where y is the dependent (endogenous, explained) variable and al,.. ,an are
the independent (exogenous, explanatory) variables. Regression analysis estimates the form of the relationship (1.1) by using the observed values of the
variables. This attempt at describing how these variables are related to each
other is known as model building.
Exact functional relationships such as (1.1) are inadequate descriptions of
statistical behavior. Thus, the specification of the relationship (1.1) is explained
as
(1.2)
where is the disturbance term or error, whose specific value in any single
observation cannot be predicted. The purpose of is to characterize the discrepancies that emerge between the actual observed value of y and the values
that would be assigned by an exact functional relationship. The difference
between the observed and predicted value of y is called the residual.

LINEAR MODEL SPECIFICATION

A linear model is one in which y, or some transformation of y, can be expressed as a linear function of ai, or some transformation of ai (i = 1, ... ,n).
Here only linear models where endogenous and exogenous variables do not
require any transformations will be considered. In this case, the relationship

PARALLEL ALGORITHMS FOR LINEAR MODELS

(1.2) can be written as


(1.3)
where Xi (i = 1, ... , n) are unknown constants.
If there are m (m > n) sample observations, the linear model (1.3) gives rise
to the following set of m equations
YI = al1x I +a12x2 + ... +alnXn +1

Y2 = a21XI + a22X2 + ... + a2nXn + 2

or

(~) (::::~ :~) (~:)


=

Ym

ami

am2

amn

+ (::) .

Xn

(1.4)

In compact form the latter can be written as

y=Ax+,

(1.5)

where y, E SRm, A E SRmxn and x E SRn.


To complete the description of the linear model (1.5), characteristics of the
error term and the matrix A must be specified. The first assumption is that
the expected value of is zero, that is, E( ) = O. The second assumption is
that the various values of are normally distributed. The final assumption is
that A is a non-stochastic matrix, which implies E(A T ) = O. In summary,
the complete mathematical specification of the (general) linear model which is
being considered is
(1.6)
The notation ,...., N(O,a 2n) indicates that the error vector is assumed to
come from a normal distribution with mean zero and variance-covariance (or
dispersion) matrix a 2 n, where n is a symmetric non-negative definite matrix
and a is an unknown scalar [124].

2.1

THE ORDINARY LINEAR MODEL

Consider the Ordinary Linear Model (OLM):


y=Ax+, ,....,N(O,a2/m).

(1.7)

Linear models and QR decomposition

The OLM assumptions are that each Ei has the same variance and all disturbances are pairwise uncorrelated. That is, Var(Ei) = cr2 and \:Ii =I- j: E(ET Ej) =
o. The first assumption is known as homoscedasticity (homogeneous variances).
The most frequently used estimating technique for the OLM (1.7) is least
squares. Least-squares (LS) estimation involves minimizing the sum of squares
of residuals: that is, finding an n element vector x which minimizes

eTe = (y-Axf(y-Ax).

(1.8)

For the minimization of (1.8) eT e is differentiated with respect to x which is


treated as a variable vector, and the differentials are equated to zero. Thus,

a(eTe) = _2yT A+2xTATA

ax

and, setting a( e T e) lax = 0, gives the least-squares normal equations

AT Ax = AT y.

(1.9)

Assuming that A is of full column rank, that is, (AT A) -I exists, the leastsquares estimator can be computed as
(1.10)
and the variance-covariance matrix of the estimator x is given by
Var(x) = cr2 (AT A)-I.

(1.11)

The terminology of normal equations is expressed in terms of the following


geometric interpretation of least-squares. The columns of A span a subspace
in SRm which is referred to as a manifold of A and it is denoted by M (A). The
dimension of M (A) cannot exceed n and can only be equal to n if A is of
full column rank. The vector Ax resides in :M (A) but the vector y lies outside
M(A) where it is assumed that E =I- 0 in (1.7). For each different vector x
there is a corresponding vector of residuals e, so that y is the sum of the two
vectors Ax and e. The length of e needs to be minimized and this is achieved
by making the residual vector e perpendicular to M(A) (see Fig. 1.1). This
implies that e = y - Ax must be orthogonal to any linear combination of the
columns of A. If Ac is any such linear combination, where c is non-zero,
then the orthogonality condition gives c TAT (y-Ax) = 0 from which the leastsquares normal equations (1.9) are derived.
Among econometricians, Maximum Likelihood (ML) is another popular
technique for deriving estimators of linear models. The likelihood function of
the observed sample is the probability density function of y, namely,

L(x,cr2 ) = (21tcr 2 )-(m/2)e-(y-Ax f(Y-Ax)/2<i,

PARALLEL ALGORITHMS FOR LINEAR MODELS

Figure 1.1.

Geometric interpretation of least-squares for the OLM problem.

where e denotes the exponential function. Maximizing L(x, cr2 ) is equivalent


to maximizing InL(x,cr 2 ) where,

InL(x, cr ) =

m
m
2
1
T
In (21t) - 2 1n(cr ) - 2cr2 (y - Ax) (y - Ax).
-2

and In Ljacr2 = 0, the ML estimators are obtained as

Setting In Ljax =

XML=(ATA)-IATy

(1.12)

and
(1.13)
The ML estimator XML, is identical to the least-squares estimator x which is
the best linear unbiased estimator (BLUE) of x in (1.7). If x is the BLUE of
x, it follows that E(x) = x and \fq E 9\n, Var(qT x) ::; Var(qT x), where x is any
linear unbiased estimator of x.
However, the ML estimator 0-~1L differs from the unbiased estimator of cr 2
which is given by

0-2 = (y-Ax)T(y-Ax)j(m-n).

(1.14)

Numerous methods exist for solving the least-squares problem. Some of the
best known methods are Gaussian elimination, Gauss-Jordan elimination, LU
decomposition, Cholesky factorization, Singular Value Decomposition (SVD)
and QR decomposition (QRD). When the coefficient matrix is large and sparse,
these methods, which are called direct methods, suffer from fill-in and they
can be impracticable. In such cases, iterative methods are more efficient, even
though there are intelligent adaptations of the direct methods which minimize
the fill-in. Iterative methods (e.g. Conjugate Gradient) have the advantage
that minimal storage space is required for implementation since no fill-in of
the zero positions of the coefficient matrix occurs during computation. Furthermore, if a good initial guess is known and preconditioning can be used, the iterative methods can converge in an acceptable number of steps. Full details of the
direct and iterative methods are given in textbooks such as [7,40,51,93,98].

Linear models and QR decomposition

Here the numerically reliable direct method of QR decomposition (QRD) is


used under the assumption that the coefficient matrix is dense. The QRD of
the explanatory data matrix A is given by

(1.15)
where Q E SRmxm is orthogonal, i.e. it satisfies QT Q = QQT = 1m, and R E SRnxn
is upper triangular. Substituting (1.15) in the normal equations (1.9), gives
RTRx=RTYI,

where
QTy= (YI) n
.
Y2 m-n

Under the assumption that A is of full column rank, which implies that R is
non-singular, the least-squares estimator of the OLM (1.7) is computed by
solving the upper triangular system of equations
RX=YI.

(1.16)

Another approach to deriving (1.16) is to use the property of orthogonal


transformation matrices which leave the Euclidean length of a vector invariant.
The Euclidean length or 2-norm of z E SRm is given by Ilzll = (ZT z) 1/2 and so,
IIQzl1 2 = ZT QT Qz = ZT Z = Ilz112. Hereafter, the Euclidean norm will be denoted
by II . II. In the context of minimizing lie 11 2, it follows that

x = argmin II ell 2

x
= argminllQTel1 2
x
= argminllQ T Y_ QTAxll2
x
=

arg~inll (Yl ~RX) 112

= arg~in (IIYI -

Rxll2 + IIY211 2)

= argminllYl -Rx112
x
= R-1YI.

The quantity

lIy - AxII2 =

IIY211 2 is termed the residual sum of squares (RSS).

2.1.1

PARALLEL ALGORITHMS FOR UNEAR MODELS

OLM WITH LINEAR EQUALITY RESTRICTIONS

Many regression models incorporate additional information in the form of


restrictions (constraints) on the parameters of the model. In these cases the
models are called restricted or constrained models. If the vector x of the OLM
(1.7) is subject to k consistent restrictions expressed as
Cx=d,

(1.17)

where C is a k x n matrix and d is of order k, then the restricted least squares


(RLS) solution is an n element vector x* satisfying
x* = argmin Ily - Ax112.

(1.18)

Cx=d

The assumptions for the matrix C are rank (C) = k and k < n. The first assumption implies that there are no linear dependencies among the restrictions and
the second that the RLS cannot be derived by solving the system (1.17).
Differentiating the Lagrangian function
L = (y-Ax)T (y-Ax) +2"T (Cx- d)

with respect to x and ').. and equating the results to zero, gives

aL
ax =

-2yTA+2xTATA+2')..TC=O

and

respectively. These equations may be expressed in matrix form as

and, under the assumption that the matrix is not singular, a unique solution for
x* can be obtained. Since ATA is assumed to be non-singular it follows that
x*

= (AT A)-IA Ty- (AT A)-ICT')..


=x- (AT A)-ICT')..,

(1.19)

where x is the unrestricted least-squares estimator. Premultiplying (1.19) by C


and rearranging, gives
').. = (C(AT A)-ICT)-l(Cx-Cx*)

= (C(AT A)-ICT)-l(Cx-d).

Linear models and QR decomposition

Substituting for Ain (1.19), it then gives


x* = x- (AT A)-lCT (C(AT A)-lCT)-1 (Cx-d).

(1.20)

Observe that the RLS estimator x*, differs from the unrestricted least-squares
estimator x by a function involving the extent to which x fails to satisfy the
restrictions [119]. For consistent restrictions, the variance-covariance matrix
of the RLS estimator x* may be shown to be

An unbiased estimator of (J2 is given bye; e* / (m + k - n), where e* = y - Ax*


[61, 119]. The efficient computation of RLS will be discussed in detail within
the context of recursive least-squares.

2.2

THE GENERAL LINEAR MODEL

The difference between the General Linear Model (GLM) and the OLM is
that there is a correlation between the disturbances Ei (i = 1, ... ,m). The GLM
is given by (1.6), i.e.
(1.21)
where n is a known non-negative definite matrix. The particular GLM, where
n is a diagonal matrix, but not the identity matrix, is called the Weighted Linear
Model (WLM). If n is positive definite, then there exists an m x m non-singular
matrix B such that
(1.22)
Premultiplying the GLM by B- 1 gives the transformed OLM
z = Cx+~, ~ '" N(0,(J2/m)'

where z = B-ly, C = B-IA and ~ = B-IE. Applying the least-squares method


to the latter gives
x = (C TC)-ICTZ

= (ATn-IA)-IATn-ly.

(1.23)

The least-squares estimator x in (1.23) is the BLUE of x in the GLM (1.21)


and its variance-covariance matrix is given by

Var(x) = (J2(CT C)-I


= (J2(ATn-IA)-1.

(1.24)

PARALLEL ALGORITHMS FOR LINEAR MODELS

As in the case of the OLM, an unbiased estimator of cr 2 is


(j2

= (z-Cx)T(z-Cx)/(m-n)
= (y-Ax)Tn- l (y-Ax)/(m-n)
= (yTQ-l y _ xTATn- ly)/(m-n).

(1.25)

Using the assumption rv N(0,cr 2n), it can be shown that the GLS estimator

x is also a ML estimator, where the likelihood function is defined as


L(x,cr2)

= (21tcr2)-m/2IQI-l/2e-(y-Ax)Tn-l(y-AX)/2cr2.

For more details see [61, 136, 137].

2.2.1
A GENERALIZED LINEAR LEAST SQUARES APPROACH
The derivation of x using (1.23) is computationally expensive and numerically unstable when n is ill-conditioned. Further, if n is singular then the numerical solution of the GLM will fail completely and the replacement of B- 1
by the Moore-Penrose generalized inverse B+ will not always give the BLUE
of x [91]. Numerically efficient methods have been designed to overcome the
ill-conditioning problems of the method [91, 110, 140]. Soderkvist has proposed an algorithm based on constrained and weighted least squares which
uses as a main tool the weighted QRD [54, 55]. Earlier, Paige considered the
GLM as a Generalized Linear Least Squares Problem (GLLSP) and employed
the generalized QRD (GQRD) to solve it [15, 51, 100, 113]. A summary of
the results in [56, 57, 91, 109, 110, 111, 112] is given, in which the GLM is
solved by treating it as a GLLSP problem. With this approach the difficulties
caused by singularity and those caused by ill-conditioning are avoided.
The estimator x in (1.23) is derived by solving

x=

argmin IIB- I 1I 2
x

which can equivalently be expressed as

x = argmin lIull 2 subject to B- l y = B-IAx+ u,


x,u
where u rv N(0,cr 2/m ) is a random m element vector such that u = B- l . Premultiplying the restrictions by B gives the GLLSP

x=argminllull 2 subjectto y=Ax+Bu.


X,u

(1.26)

Although the above formulation allows for singular n reconsider, without loss
of generality, the case when n is non-singular, that is B E 9\mxm has full col-

Linear models and QR decomposition

umn rank. The GQRD of A == (A

y) and B is given by:


n

_)
QT A = ( ~

==

n
1
m-n-l

(R0 11Y)
0

(1.27a)

and

m-n-l

(QTB)P=L T ==

nC'

1
m-n-l

0
0

j
<>
0

Lf2
dT
LI2

).

(1.27b)

where R E 9\nxn and LT E 9\mxm are upper triangular and non-singular, and
Q,P E 9\mxm are orthogonal. The GLLSP (1.26) can be equivalently written
as
argmin IlpT uII2
u,x

subject to QT y = QT Ax+ QTBPpT U

or
argmin (IIUI 112 +y2 + IIu2112)
Ul,y,U2,X

Y=RX+LflUI
subject to { 11 = <>y + dT U2

+ jy+Lf2u2

(1.28)

0= LI2u2
where uTP is conformably partitioned as (uf y uI). The third and second constraints in (1.28) give, respectively, U2 = 0 and y = 11/<>. In the first
constraint the arbitrary subvector UI is set to zero in order to minimize the objective function. Thus, the estimator of x derives from the solution of the upper
triangular system
(1.29)
Notice that, if 0 = 0 i- 11, then the GLM is inconsistent. Hammarling et aI. in
[57] solve the GLM using the SVD and present a modified procedure when the
GLM is found to be inconsistent [56].
An expression for the variance-covariance matrix of x is
(1.30)

10

PARALLEL ALGORITHMS FOR LINEAR MODELS

where the estimator of (32 is computed by


(1.31)
The expressions (1.29)-(1.31) can be derived directly using (1.23), (1.24), (1.25)
and (1.27).

FORMING THE QR DECOMPOSITION

Different methods have been proposed for forming the QRD (1.15), which
is rewritten as
(1.32a)
or
(1.32b)
where A E 5)tmxn, Q = (QI Q2) E 5)tmxm, R E 5)tnxn and m > n. It is assumed
that A has full column rank. Emphasis is given to the two transformation methods known as Householder and Givens rotation methods. The Classical and
Modified Gram-Schmidt orthogonalization methods will also be briefly considered.
A general notation based on a simplification of the triplet subscript expression
will be used to specify sections of matrices and vectors [65, 101]. This notation
has been called Colon Notation in [51]. The kth column and row of A E 5)tmxn
are denoted by A:,k and Ak,: respectively. The submatrix Ai:k,j:s has dimension
(k - i + 1) x (s - j + 1) and its first element is given by ai,j. Similarly, Vi:k is a
(k - i + 1)--element subvector of v E 5)tm starting with element Vi. That is,

ai,j
Ai:k,j:s

ai,s )

ai,j+ I

= ( ~~~l.,~ .. ~i.~I.'~~l. ......... ~~~l.,~


ak,j

ak,j+ I

...

ak,s

and

Vi:k= (

L.

Vi+l
Vi)

If the lower or upper index in the subscript notation is omitted, then the default
values are one and the upper bound of this subscript for the matrix or vector,
respectively. A zero dimension denotes a null matrix or vector and all vectors
are considered to be column vectors unless transposed. For example, Ak,: is
a column vector and AI.: == (ak,l ... ak,n) is a row vector, Ai:,:s is equivalent
to Ai:m,l:s and Ai:k,j:s is a null matrix if k < i or s < j. Notice that A[k,j:s is
equivalent to (Ai:k,j:s) T and not (AT) i:k,j:s which denotes the (k - i + 1) x (sj + 1) submatrix of AT.

Linear models and QR decomposition

3.1

11

THE HOUSEHOLDER METHOD

An m x m Householder transformation (or Householder matrix or Householder reflector) has the form

where h E

snm satisfies IIhll 2 =I- o. Often H is expressed as


H=l _ hhT
m
b'

(1.33)

where b = IlhI1 2 /2. Householder matrices are symmetric and orthogonal, i.e.
H = HT and H2 = 1m. They are useful because they can be used to annihilate
specified elements of a vector or a matrix [18, 53, 126].
Let x E m be non-zero and H be a Householder matrix such that y = H x
has zero elements in positions k to n. If H is defined as in (1.33) and Xj is any
element of x other than those to be annihilated, then

sn

if i = k, ... ,n,
ifi=jandj=l-k, ... ,n,
otherwise,

Xi

hi =

{ XjS

i=x]+L~
p=k

and

such that
Yi = {

ifi=k, ... ,n,


=fS if i = j,
Xi
otherwise.

To avoid a large relative error, the sign of S is chosen to be the same as the
sign of Xj. Notice that, except for the annihilated elements of x, the only other
element affected by the transformation is Xj.
Consider now the computation of the QRD (l.32a) using Householder transformations [15,51, 123]. The orthogonal matrix QT is defined as the product
of the n Householder transformations

12

PARAUELALGORITHMS FOR liNEAR MODELS

The m x m Householder transformation Hi is of the form

where iIi = Im-i+1 - hhT jb, b = hThj2 and a zero dimension denotes a null
matrix. It can be verified that the symmetric Hi is orthogonal, that is Hr = Hi
and
= 1m. If A (0) == A and

H1

n-i
R{i))
12

A{i) m-i'

(1

~i

< n),

(1.34)

where R~il is upper triangular, then Hi+ I is applied from the left of A (i) to annihilate the last m - i-I elements of the first column of A(i). The transformation
Hi+IA(i) affects only A(i) and it follows that
A(n)

(~).

(1.35)

A summary of the Householder method for computing the QRD is given by Algorithm 1.1. The square root function is denoted by sqrt and h E 9tm- i + l . Notice that no account is taken of whether the matrix is singular or ill-conditioned,
i.e. when the division by b can be performed. Steps 7 and 8 may be combined
into a single step, but for clarity a working vector z E 9tn - i +1 has been used.

Algorithm 1.1 Computing the QRD of A E 9tmxn using Householder transformations.


1: for i = 1,2, ... ,n do
2:
h:=Aio i
3:
s:= sqrt(hTh)
4:
if hi < 0 then s := -s
5:
hi :=hl +s
6:
b:= hiS
7:
z:= (h TAi:,djb
8:
Ai:,i: := Ai:,i: - hzT
9: end for
For efficient execution on most conventional computers a block version of
the Householder algorithm is used [13, 135]. The product of Householder
reflections Hk ... HI (l ~ k ~ n) can be written in the block format
-T
Q
=Hk ... H1 =Im-YWY T ,

Linear models and QR decomposition

13

where Y E 9{mxk and W E 9{kxk is upper triangular. The first k Householder


reflections compute the QRD

-T

. =
Q A .,l.k

(RI:k,I:k)
0

and then the updating


-T

Q A:,k+l:

= A:,k+l: -

T
YWY A:,k+l:.

For i = 2, ... ,k the matrices Y and W can be derived recursively as


Wl:i-l,i = -Wl:i-l,l:i-l (Y5-1

h)/Vb

and

Y;,i =

h/Vb,

where Hi = 1m - hhT /b and, initially, Y;,l = h/v'b and W = h [15, 51]. The
same procedure is repeatedly applied to the submatrix Ak+l:m,k+l:n until the
QRD of A is computed.

3.2

THE GIVENS ROTATION METHOD

An m x m Givens rotation has the structural form


i

1
i--+

Gi,j=
j--+

-s

(1.36)
c

1
where c = cos(~) and s = sin(~) for some~. Apart from s and -s all offdiagonal elements are zero. Givens rotations are orthogonal, thus GT.iG;,j =
Gi,jGT.i = 1m. The rotation Gi,j when applied from the left of a matrix, annihilates a specific element in the jth row of the matrix and only the ith and jth
rows of the matrix are affected [51]. While Householder transformations are
useful for introducing zero elements on the grand scale, Givens rotations are
important because they annihilate elements of a matrix more selectively.
Let G~1 have the structural form (1.36) such that the rotation
(1.37)

14

PARALLEL ALGORITHMS FOR LINEAR MODELS

results in iij,k being zero, whereA,A E 9l mxn and 1 :::; k:::; n. The rotation (1.37)
which affects only the ith and jth rows of A can be written as

Ap ,: =

if p = i,

CAi': +sAj,:
{ cA j,: - SAi,:
A p ,:

if P = j,
ifp=l, ... ,mandpi=i,j.

(1.38)

If ii j,k is zero, then ca j,k - sai,k = o. If ai,k i= 0 and a j,k i= 0, then using the
trigonometric relation c2 + s2 = 1 it follows that
and

c=ak/t
I,
,
If ai,k and aj,k are both zero, then
(1.38) - is reduced to

(1.39)

G~1 = 1m. Hence, (1.37) - or its equivalent

(ai,kAi,: +aj,kAj,:}/t
{
Ap ,:= (ai,kAj,:-aj,kA;,:)/t
A p ,:

if p = i,

~fp:j,

..
Ifp-1, ... ,mandpi=l,j.

(1.40)

A sequence of Givens rotations can be applied to compute the QRD (1.32a).


One such sequence, referred to as column-based, is given by Algorithm 1.2,
where A is overwritten by (~), the orthogonal matrix Q is not formed and
annihilated elements are preserved throughout the annihilation process. The
elements of A are annihilated from bottom to top starting with the first column.
The Givens rotations are performed between adjacent planes. The rotation
in the 3rd step can also be written as A := GtJA. In this case the element
aj,i is annihilated by a rotation between the ith and jth planes (1 :::; i :::; n and
i < j:::; m).
Algorithm 1.2 The column-based Givens sequence for computing the QRD of

A E 9lmxn .
1: for i = 1,2, ... , n do
2:
for j = m, m - 1, ... , i + 1 do
3:
A := G)~I,jA
4:
end for
5: end for

The total number of Givens rotations applied by Algorithm 1.2 is given by:
n

I)m-i) = n(2m-n-1)/2.
i=1

(1.41)

Linear models and QR decomposition

For m = 4 and n = 3, Algorithm 1.2 is equivalent


to A :=
.

G~ljG~llA, see Fig. 1.2. A.,

15

d3,43)d2,32 )d3,42 )d1,21) x

and a blank space denotes a possible non-zero


ele'men't, an element annihilated by the Givens rotation and a zero element,
respectively.
0

G(1)
3,4

Figure 1.2.

d2,32)

... -

G(1)
2,3

G(1)
\,2

I a

G(3)

o.

3,4

.. -

d3,42)

_
_

.0

Illustration of Algorithm 1.2, where m = 4 and n = 3.

Two examples of Givens rotation sequences are also shown in Figure 1.3,
where m = 10 and n = 6. A number i (1 ::; i ::; 39) at position (j, k) indicates where zeros are created by the ith Givens rotation (1 ::; k ::; n and
k < j ::; m). The first sequence is based on Algorithm 1.2, while the second,
called diagonally-based, annihilates successively su~iagonals of the matrix
starting with the lower sub--diagonal.

8 17
7 16 24
6 15 23 30
5 14 22 29 35
413 21 28 34 39
3 12 20 27 33 38
211 19 26 32 37
1 10 18 25 31 36

(a) Column-based

Figure 1.3.

28 ~5
22 29 36
16 23 30 37
11 17 24 31 38
7 12 18 25 32 39

34

4 8 13 19 26 33
2 5 9 14 20 27
1 3 6 10 15 21

(b) Diagonally-based

The column and diagonally based Givens sequences for computing the QRD.

Algorithm 1.3 gives the order of the rotations applied when the diagonallybased sequence is used. Notice that the rotations are between adjacent planes
and both column-based and diagonally-based algorithms apply the same number of Givens rotations, given by (1.41).
Gentleman proposed a square root free algorithm for computing the QRD
(1.32a) [48]. His method removes the need for the calculation of any square
roots in computing the Givens rotations. This resulted in improving the efficiency of computing the QRD (1.32a) on a serial computer.

16

PARALLEL ALGORITHMS FOR LINEAR MODELS

Algorithm 1.3 The diagonally-based Givens sequence for computing the QRD
of A E 9t mxn .
1: for i = 1,2, ... , m - 1 do
2:
for j = 1, ... , min(i,n) do

3:
A:= G~21,pA,
4:
end for
5: end for

where

p = m- i+ j

3.3

THE GRAM-SCHMIDT ORTHOGONALIZATION


METHOD
Consider the QRD (1.32b) and let Ql == Q such that
A.0' r' = Q-R.0, r' = RrrQ-
,0, r' +Q-'I'r'-IRl'r'-1
"
. ,r'

From this it follows that


(1.42)
where b = A:,i - Q:,I:i-lRl:i-l,i. Premultiplying (1.42) by CP:i gives Ri,i = Q;'ib
and, by substituting for Q:,i from (1.42) it follows that Ri,i = Ilbll. The remaining non-zero elements of R:,i can be computed from Rl:i-l,i = Q;'I:i-lA:,i.
The derivation of the QRD using this method, which at each stage computes a
column of Qand R, is known as Classical Gram-Schmidt (CGS) method. Algorithm 1.4 gives the steps of the CGS orthogonalization method for forming
the QRD (1.32b), where A is overwritten by Ql == Q.
Algorithm 1.4 The Classical Gram-Schmidt method for computing the QRD
of A E 9tmxn .
1: Rl,l := Ii A :,l11
2: A:,l := A:,I/Rl,l
3: fori=2, ... ,ndo
4:
Rl:i-l,i := A;'I:i_lA:,i
5:
b := A:,i - A:,I:i-lRl:i-l,i
6:
Ri,i:= Ilbll
7:
A:,i:= b/Ri,i
8: end for

The CGS method has poor numerical properties which can result in loss of
orthogonality among the computed columns of Q. The numerical stability of
the CGS method can be improved if a modified version, called the Modified
Gram-Schmidt (MGS) method, is used. The MGS method rearranges the computations of the CGS algorithm, such that at each stage a column of Q and a

Linear models and QR decomposition

17

row of R are determined [15, 51]. The MGS method for computing the QRD
(1.32b) is given by Algorithm 1.5.
Algorithm 1.5 The Modified Gram-Schmidt method for computing the QRD.
1: for i = 1, ... , n do
2:
Ri:i := ifA:,dl
3:
A,i := A:,i/Ri,i
4:
for j=i+l, ... ,ndo

5:

R,')', :=A!'A),
.,1
"

A:,j := A:,j - Ri,jA:,i


7:
end for
8: end for

6:

DATA PARALLEL ALGORITHMS FOR


COMPUTING THE QR DECOMPOSITION

Data parallel algorithms for computing the QRD are described. The algorithms are based on the Householder, Givens rotations and Gram-Schmidt
methods. Using regression, accurate timing models are constructed for measuring the performance of the algorithms on a massively parallel SIMD (Single
Instruction, Multiple Data) system. The massively parallel computer used is
the MasPar MP-1208 with 8192 processing elements. Although the algorithms
were implemented on the MasPar, the implementation principles should be in
general applicable to any massively parallel SIMD computer [83]. The timing
models will be of the same order when similar SIMD architectures are used,
but coefficient parameters will be different due to the differences in software
and hardware designs that exist among different parallel computers.

4.1

DATA PARALLELISM AND THE MASPAR SIMD


SYSTEM

In the data parallel programming paradigm, the program instructions are


executed serially, but instructions operate (optionally) on many elements of a
large data structure simultaneously. The data parallel paradigm is not restricted
to a particular parallel architecture and provides a natural way of programming
parallel computers. The programmer does not explicitly manage processes,
communication or synchronization. However, it is possible to describe how
data structures such as arrays are partitioned and distributed among processors,
since mapping of the data can affect performance significantly [43,44,45,46,
47, 49, 117]. Examples of languages that support data parallelism are Fortran
90 and High Performance Fortran (HPF) [65].
A Single Instruction Stream - Multiple Instruction Stream (SIMD) system
involves multiple processors simultaneously executing an operation on an array

18

PARALLEL ALGORITHMS FOR liNEAR MODELS

in a data parallel mode. These systems are found to be useful for specific
applications such as database searching, image reconstruction, computational
fluid dynamics, signal processing and econometrics. The effectiveness of a
SIMD array processor depends on the interconnection network, the memory
allocation schemes, the parallelization of programs, the languages features and
the compiling techniques [26, 27, 67, 117].
The MasPar SIMD system is composed of afront-end (a DEC station 5000)
and a Data Parallel Unit (DPU). The parallel computations are executed by
the Processing Element (PE) array in the DPU, while serial operations are
performed on the front-end. The 8192 PEs of the MP-1208 are arranged
in a eSl x eS2 array, where eSl = 128 and eS2 = 64. The default mapping
distribution in the MasPar is cyclic. In a cyclic distribution, an n element vector
and an m x n element matrix are mapped onto n / eSl eS21 and m / eSll n / eS21
layers of memory respectively. Figure 1.4 shows the mapping of a 160 x 100
matrix A and a 16384--element vector v on the MP-1208. Other processor
mappings are available for efficiently mapping arrays on the PE array, when
the default cyclic distribution is not the best choice [99].

VI

V65

~1r

__-trii==l-AI29,100

ii:;;~~---lt- A 160,100

Layer
#2

Layer
#3

A 129,65

Figure 104.

V8129
V8193

V8257

VI6321

---~...,,--- V64

V8192

=~;:::::=== V8256

V16384

Cyclic mapping of a matrix and a vector on the MasPar MP-1208.

The main languages for programming the MasPar are the MasPar Fortran
(hereafter MF) and MasPar Programming Language. The language chosen
for implementing the algorithms was MF, which is based on Fortran 77 supplemented with array processing extensions from standard Fortran 90. These
array processing extensions map naturally on the DPU of the MasPar. MF also
supports the forall statement of HPF, which resembles a parallel do loop. For
example given the vectors h E SRm and z E SRn, the product A = hzT can be

Linear models and QR decomposition

19

computed in parallel by
forall (i = 1 : m, j

= 1 : n) Ai,} = hi * z}

(1.43a)

or
A = spread(h,2,n) *spread(z, I,m).

(1.43b)

The computations on the rhs of (1.43a) are executed simultaneously forall i


and j. In (1.43b) the spread commands construct two m x n matrices. Each
column of the first matrix is a copy of h and each row of the second matrix
is a copy of z. The matrix A is computed by multiplying the two matrices
element by element. In both cases, the assignment on A's elements can be
made conditional using a conformable logical matrix which masks with true
values the elements participating in the assignment [80].
The time to execute a single arithmetic operation such as *, + or sqrt on
an m x n matrix (m, n ~ eSI eS2), depends on the number of memory layers
required to map the matrix on the DPU, that is m/esl 1 n/es21. If, however,
a replication, reduction or a permutation function such as spread or sum is
applied to the m x n matrix, then the execution time also depends on rm/esll
and rn/es21 [67]. This implies that the execution time model of a sequence
of arithmetic operations and (standard) array transformation functions on an
m x n matrix, is given by

(1.44)
where Ci (i = 0, ... ,3) are constants which can be found by experiment. The
above model can describe adequately the execution time of (1.43a) and (1.43b).
If m or n is greater than eSleS2, then the timing model (1.44) should also
include combinations of the factors rm/esl eS21 and rn/esl eS21, which correspond to the number of layers required to map a column and a row of the matrix
on the DPU. In order to simplify the performance analysis of the parallel algorithms, it is assumed that m, n ~ eSI eS2 and the dimensions of the data matrix
are multiples of eSI and eS2, respectively.

4.2

THE HOUSEHOLDER METHOD

The data parallel version of the serial Householder QRD method is given
by Algorithm 1.6. The application of HjA(i-l) in (1.34) is activated by line 3
and the time required to compute this transformation is given by <1>1 (m - i +
1, n - i + 1). Thus, the total time spent on computing all of the Householder
transformations is
n

<1>2(m,n) = L<1>I(m-i+ l,n-i+ 1).


i=1

20

PARALLEL ALGORITHMS FOR UNEAR MODELS

It can be observed that the application of the ith and jth transformation have
the same execution time if

r(m-i+l)/ell = r(m-j+l)/ell
and

Algorithm 1.6 QR factorization by Householder transformations on SIMD systems.


1: defHouseh_QRD(A,m,n) =
2:
for i = 1, ... , n do
3:
apply transform(Ai:,i:,m- i+ l,n- i + 1)
4:
end for
5: end def
6: deftransform(A,m,n) =
7:
h:=A:,1
8:
s:= sqrt(sum(h * h))
9:
If (hi < 0) then s := -s
10:
hi := hi +s
11:
b:=hl*S
12:
z:= sum(spread(h,2,n) *A, 1)/b
13:
forall(i = 1 : m, j = 1 : n) Ai,j := Ai,j - hi * Zj
14: end def
Algorithm 1.6 has been implemented on the MP-1208 and a sample of approximately 400 execution times has been generated for various values of M
and N, where m = Mesl and n = Nes2. Evaluating <!>2(Mesl,Nes2) and us~
ing regression analysis, the estimated execution time (seconds x 102 ) of Algorithm 1.6 is found to be

TI(M,N) = N(14.15+3.09N -0.62N2 +5.71M +3.67MN).


The above timing model includes the overheads which arise mainly from the
reference to the submatrix Ai:,i: in line 3. This matrix reference results in the
assignment of an array section of A to a temporary array and then, when the
procedure transform in line 6 has been completed, the reassignment of the temporary array to A. The overheads can be reduced by referencing a submatrix of
A only if it uses fewer memory layers than a previous extracted submatrix (see
for example the Modified Gram-Schmidt algorithm). This slight modification
improves significantly the execution time of the algorithm which now becomes

T2(M,N) = N(14.99+2.09N -0.20N2 +3.19M + 1. 17MN).


The accuracy of the timing models is illustrated in Table 1.1.

Linear models and QR decomposition

4.3

21

THE GRAM-SCHMIDT METHOD

As in the case of the Householder algorithm, the performance of the straightforward implementation of the MGS method will be significantly impaired by
the overheads. Therefore, the n = Nes2 steps of the MGS method are used in
N stages. At the ith stage, eS2 steps are used to orthogonalize the (i - 1)es2 + 1
to ies2 columns of A and also to construct the corresponding rows of R. Each
step of the ith (i = 1, ... , N) stage has the same execution time, namely

Thus, the execution time to apply all Nes2 steps of the MGS method is given
by
N

<l>3(Mesl,Nes2) = eS2

L <1>1 (Mesl, (N -

i=l

i + l)es2).

The data parallel MGS orthogonalization method is given in Algorithm 1.7,


where A is overwritten by Ql == Q - the orthogonal basis of A. The total execution time of Algorithm 1.7 is given by

T3(M,N) =N(9.15+3.12N -O.OIN2 +4.95M + 1.31MN).


Algorithm 1.7 The MGS method for computing the QRD on SIMD systems.
1: defMGS_QRD(A,Mesl,Nes2) =
2:
for i = 1, ... , Nes2 with steps eS2 do
3:
apply orthogonal(A:,i: ,Ri:,i:,Mesl, (N - i + l)es2)
4:
end for
5: enddef
6: def orthogonal(A,R,m,n) =
7:
for i = 1, ... , eS2 do
8:
Ri,i := sqrt(sum(A:,i *A:,i))
9:
A:,i :=A:,i/Ri,i
10:
forall(j = i + 1 : n) W,j := A:,i *A:,j
11:
Ri,i+l: := sum(W,i+l:, 1)
12:
forall(j = i + 1: n) A:,j :=A:,j - Ri;j *A:,i
13:
end for
14: end def
It can be seen from Table 1.1 that the (improved) Householder method performs better than the MGS method. The difference in the performance of the
two methods arises mainly because, at the ith step, the MGS and Householder
methods work with m x (n - i + 1) and (m - i + 1) x (n - i + 1) matrices, respectively. An analysis of T2(M,N) and T3(M,N) reveals that for M > N, the

22

PARALLEL ALGORITHMS FOR UNEAR MODELS

MGS algorithm is expected to perform better than the Householder algorithm


only when N = 1 and M = 2.

4.4

THE GIVENS ROTATION METHOD

A Givens rotation, when applied from the left of a matrix, affects only two of
its rows: thus a number of them can be applied simultaneously. This particular
feature underpins the development of parallel Givens algorithms for solving a
range of matrix factorization problems [29, 30, 69, 94, 95, 102, 103, 129].
The orthogonal matrix QT in (1.32a) is the product of a sequence of Compound Disjoint Givens Rotations (CDGRs), with each compound rotation reducing to zero elements of A below the main diagonal while preserving previously annihilated elements. Figure 1.5 shows two sequences of CDGRs for
computing the QRD of a 12 x 6 matrix, where a numerical entry denotes an element annihilated by the corresponding CDGR. The first Givens sequence was
developed by Sameh and Kuck [129]. This sequence - the SK sequence - applies a total ofm+n-2 CDGRs to triangularize an m x n matrix (m > n), compared to n(2m - n -1) Givens rotations needed when the serial Algorithm 1.2
is used. The elements are annihilated by rotating adjacent rows. The second
Givens sequence - the Greedy sequence - applies fewer CDGRs than the SK
sequence but, when it comes to implementation, the advantage of the Greedy
sequence is offset by the communication overheads arising from the construction and application of the compound rotations [30,67, 102, 103]. For m n,
the Greedy sequence applies approximately log m + (n - 1) log log m CDGRs .

11
10 12
9 11 13
8 10 12 14
7 911 13 15
6 8 10 12 14 16
5 7 911 13 15
4 6 8 10 12 14
3 5 7 911 13
2 4 6 8 10 12
1 3 5 7 9 11
(a) SK sequence

Figure 1.5.

3 6
5 8

2 4 7 10
2 4 6 9 12
1 3 6 811 14
2

1
1
1
1
1

3 5 7 10 13
3 5 7 9 12
2 4 6 8 11
2 4 6 8 10
2 3 5 7 9

(b) Greedy sequence

Examples of Givens rotations schemes for computing the QRD.

The adaptation, implementation and performance evaluation of the SK sequence to compute various forms of orthogonal factorizations on SIMD systems will be discussed in the subsequent chapters. On the MP-1208, the ex-

Linear models and QR decomposition

23

ecution time of computing the QRD of an Mesl x Nes2 matrix using the SK
sequence, is found to be
T4 (M,N) = N(25.64+5.51N -7.94N2 + 11.1M + 15.99MN) +41.96M.

4.5

COMPUTATIONAL RESULTS

The Householder factorization method is found to be the most efficient in


terms of speed, followed by the MGS algorithm which is only slightly slower
than the data parallel Householder algorithm. Use of the SK sequence produces
by far the worst performance.
The comparison of the performances of the data parallel implementations
was made using accurate timing models. These models provide an effective
tool for measuring the computational speed of algorithms and they can also
be used to reveal inefficiencies of parallel implementations [80]. Comparisons
with performance models of various algorithms implemented on other similar
SIMD systems, demonstrate the scalability of the execution time models [67,
75]. If the dimensions of the data matrix A do not satisfy the assumption that
they are multiples of the size of the physical array processor, then the timing
models can be used to give a range of the expected execution times of the
algorithms.
Table 1.1.

Times (in seconds) of computing the QRD of a 128M x 64N matrix.

Algor. 1.6
Improved Algor. 1.6
Exec. Tl(M,N) Exec. T2(M,N)
X 10- 2
X 10- 2
Time
Time

Algor. 1.7
Exec. T.l(M,N)
x 10- 2
Time

Algor. SK
Exec. T4(M,N)
X 10- 2
Time

10
10
10
14
14
14
18
18
18
22
22
22

3
7
9
5
9
13
5
9
17
7
15
19

5.48
22.15
33.80
17.48
47.86
90.30
22.34
61.90
189.16
48.61
188.79
287.23

3.21
12.07
18.49
9.30
24.52
46.64
11.60
30.68
94.41
23.98
89.86
138.59

21.16
67.73
92.65
62.27
150.05
242.63
82.03
207.47
503.44
175.85
585.56
805.45

5.55
22.34
34.09
17.54
48.03
90.56
22.35
61.98
189.03
48.71
188.49
286.33

2.58
9.28
13.80
7.36
18.75
34.38
9.14
23.77
69.42
19.10
68.88
103.31

2.59
9.34
13.92
7.35
18.86
34.54
9.15
23.79
69.31
18.90
68.57
102.81

3.22
12.08
18.48
9.30
24.50
46.64
11.59
30.52
94.26
23.93
89.82
138.27

21.04
67.59
92.62
62.36
150.11
242.68
82.25
207.61
503.51
175.96
585.64
805.71

QRD OF LARGE AND SKINNY MATRICES

The development of SIMD algorithms to compute the QRD when matrices


do not have dimensions which are multiples of the physical array processor
size are considered [90]. Implementation aspects of the QRD algorithm from

24

PARALLEL ALGORITHMS FOR LINEAR MODELS

the Cambridge Parallel Processing (CPP) linear algebra library (LALIB) are
investigated [19]. The LALIB QRD algorithm is a data-parallel version of
the serial Householder algorithm proposed by Bowgen and Modi [17]. The
performances of Algorithm 1.6 and the QRD LALIB routine are compared. A
second Householder algorithm which is efficient for skinny matrices is also
proposed.

5.1

THE CPP GAMMA SIMn SYSTEM

The Cambridge Parallel Processing (CPP) GAMMA series has a Master


Control Unit (MCU) and 1024 or 4096 Processing Elements (PEs) arranged
in a 2-D square array. It has an interconnection network for PE-to-PE communication and for broadcast between the MCU and the PEs. The GAMMA
SIMD systems are based on fine grain massively parallel computer systems
known as the AMT DAP (Distributed Array of Processors) [116, 118].
A macro assembler called APAL (Array of Processors Assembly Language)
is available to support low-level programming the GAMMA-I. Two high level
language systems are also available for the GAMMA-I. These are extended
versions of Fortran (called Fortran-Plus enhanced or for short, F-PLUS) and
C++. These languages interact with the language that the user selects to run
on the host machine, typically Fortran or C [1, 2] Both high level languages
allow the programmer to assume the availability of a virtual processor array of
arbitrary size. As in the MasPar, using the default cyclic distribution, an m x n
matrix is mapped on the PEs using rm / es1 rn / es1 layers of memory, while
an m-element vector is mapped on the PEs using rm/es 21layers of memory,
where es x es (es = 32 or es = 64) is the dimension of the SIMD array processor. An m x n matrix can also be considered as an array of n m-element
column vectors (parallelism in the first dimension) or m n-element row vectors (parallelism in the second dimension), requiring respectively, m/es 21
and mrn/es 21layers of memory to map the matrices onto the PEs [25].
In most non-trivial cases, the complexity of performing a computation on
an array is not reduced if some of the PEs are disabled, since the disabled
PEs will become idle only during the assignment process. In such cases the
programmer is responsible for avoiding computations on unaffected submatrices. To illustrate this, let h = (hI, ... ,hm) and u = (UI, ... , un) be real vectors,
L == (11,--"--"-. ,In) a logical vector and A an m x n real matrix. The F-PLUS statement

nr

u(L) = sumr(matc(h,n) *A)


is equivalent to the HPF statement

forall(i = 1 : n,li = true) Ui = sum(h*A:,i)

(1.45)

Linear models and QR decomposition

25

which computes the inner-product Ui = hT A,i for all i, where Ii has value true.
In Fortran-90 the F-PLUS functions sumr(A), matc(h,n) and matr(u,m) can
be expressed as sum(A, 1), spread(h,2,n) and spread(u, I,m), respectively.
The main difference, however, between F-PLUS and HPF, is that the F-PLUS
statement computes all the inner-products hT A and then assigns simultaneously the results to the elements of u, where the corresponding elements of L
have value true. This difference may cause degradation of the performance
with respect to execution speed, if the logical vector L has a significant number of false values. Consider, for example, the three cases, where (i) all elements of L have a true value, (ii) the first n/2 elements of L have a true value
and (iii) only the first element of L has a true value. For m = 1000 and n =
500, the execution time in msec for computing (l.45) on the 1024-processor
GAMMA-I (hereafter abbreviated to GAMMA-I) for all three cases is 249.7,
while, without masking the time required to compute all inner-products is
given by 247.79. Explicitly performing operations only on the affected elements of u, the execution times (including overheads) in cases (ii) and (iii) are
found to be 147.84 and 13.21, respectively. This example shows the degradation in performance that might occur when implementing an algorithm without
taking into consideration the systems software of the particular parallel computer.

5.2

THE HOUSEHOLDER QRD ALGORITHM

The CPP LAUB implementation of the QRD Householder algorithm could


be considered a straightforward one. Initially the algorithm was implemented
on the AMT DAP using an earlier version of F-PLUS which required the data
matrix to be partitioned into submatrices having the same dimensions as the
array processor [17]. The re-implementation of the algorithm using the new
F-PLUS has removed this constraint. Algorithm 1.8 shows broadly how the
Householder method has been implemented in this library routine for computing the QRD. The information needed for generating the orthogonal matrix Q
is stored in the annihilated parts of A and in two n-element vectors. For simplicity Algorithm 1.8 ignores this, neither does it emphasize other details of
the LAUB QRD subroutine QR_FACTOR, such as those dealing with overflow,
that do not play an important role in the performance of the algorithm [19].
Clearly the performance of Algorithm 1.8 is dominated by the computations in the 10th and 11 th lines, while computations on logical arrays and
scalars are less significant. The computation of the Euclidean norm of the
m-element vector h in line 5 is a function of rm/es 2 l and is therefore important only for large matrices, where m es 2 Notice that the first i - I
elements of h are zero and the corresponding rows and columns of A remain
unchanged. Thus the computations in lines 5, 10 and 11 can be written as
follows:

26

PARALLEL ALGORITHMS FOR LINEAR MODELS

Ui:n := sumr(mate(hi:m,n - i + 1) *Ai:m,i:n)/Pi


Ai:m,i:n := Ai:m,i:n - mate (hi:m, n - i + 1) * matr( Ui:n, m - i + 1)
Algorithm 1.8 The CPP LA LIB method for computing the QR Decomposition.
1: L:= true; M:= true
2: for i = 1,2, ... , n do
3:
h:=O
4:
h(L) := A:,i
5:
cr:= sqrt(sum(h*h))
6:
if hi < 0 then cr:= -cr
7:
hi := hi + cr
8:
Pi:=cr*hi
9:
if Pi i- 0 then
10:
U:= sumr(mate(h,n) *A)/Pi
11:
A(M) := A - mate(h,n) *matr(u,m)
12:
end if
13:
Li := false; M:,i := false
14: end for

The Fortran-90 sub array expressions are not supported by F-PLUS. However functions and subroutines are available for extracting and replacing subarrays. Hence, working arrays need to be used in place of Ui:n and Ai:m,i:n.
The computational cost of extracting the affected subarrays and re-assigning
them to the original arrays can be higher than the savings in time that might be
achieved by working with subarrays of smaller dimensions. This has been considered previously in detail within the context of improving the performance
of Algorithm 1.6.
The Block-Parallel version of Algorithm 1.8 (hereafter called BPHA) is divided into blocks, where each block comprises transformations that have the
same time complexity. The first block comprises the kJ (1 ::; kJ ::; min(n, es))
Householder transformations HI, ... ,Hkl' where kJ is the maximum value satisfying rm/esHn/es1 = r(m+ l-kI}/esH(n+ l-kd/esl The transformations are then applied using Algorithm 1.8. The same procedure is applied
recursively to the smaller (m - kl) x (n - kJ) submatrix Akl+J:m,kl+J:n, until A
is triangularized. Generally, let mo = m, no = n, mi = mi-J - ki (i> 0) and let
the function f(A,m,n) be defined as

f(A,m,n)

rm/es Hn/es1- r(m+ 1- A)/esH(n+ 1 - A)/es1,

(1.46)

where A, m and n are integers and 1 ::; A ::; n. The ith block consists of
ki transformations which are applied using Algorithm 1.8 to the submatrix

Linear models and QR decomposition

27

where ki is the maximum value of Asatisfying f(A,mi-l ,ni-t} =


o and k(i) = L~~ll kj. The numerical stability of BPHA is the same as that of
the CPP LALIB subroutine QR_FACTOR.
Table 1.2 shows the execution time of the CPP LALIB subroutine QR_FACTOR
and BPHA for various values of m and n. Due to fewer organizational overheads the LALIB subroutine performs better when the number of columns of
the matrices do not exceed the edge size (es) of the array processor - that is,
when BPHA consists of at most two blocks. The difference, however, is very
small compared with the improvement in speed offered by BPHA for large
matrices. Notice that the improvement in speed is much higher (factor of two)
for square matrices. The main disadvantage of the BPHA is the use of working
arrays which results in increased memory requirements.

Ak(i)+l:m,k(i)+l:n'

Table 1.2.
BPHA.
m

Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and the

QR_FACTOR

BPHA

QR_FACTOR I BPHA

300
600
900
1200
1500
1800

25
25
25
25
25
25

1.05
1.60
2.20
2.79
3.33
3.94

1.09
1.71
2.30
3.01
3.50
4.20

0.96
0.93
0.96
0.93
0.95
0.94

2000
2000
2000
2000
2000
2000
2000
2000

25
100
175
250
325
400
475
550

4.30
32.48
74.65
132.08
221.31
313.10
420.16
570.44

4.68
25.56
58.76
99.57
152.68
205.66
285.45
368.93

0.92
1.27
1.27
1.33
1.45
1.52
1.47
1.55

200
400
600
800

200
400
600
800

18.76
88.07
239.61
501.57

10.41
45.14
115.77
229.52

1.80
1.95
2.07
2.19

5.3

QRD OF SKINNY MATRICES

The cyclic mapping distribution might not be efficient when the number of
columns of A is small. Parallelism in the first dimension is more efficient for
skinny matrices, that is, when m/es Hnjes1
mjes 21 [25]. Parallelism
in the second dimension is inefficient since n ~ m. Parallel computations are
performed only on single columns or rows of A when parallelism in the first
or second dimension, respectively, is used. Algorithm 1.9 is the equivalent of

nr

28

PARALLEL ALGORITHMS FOR LINEAR MODELS

Algorithm 1.8 with parallelism only in the first dimension (columns) of A.


Notice that the sequential loop in line 10 is equivalent to the lines 10 and 11 of
Algorithm 1.8 and that the logical vector L is used in place of M.

Algorithm 1.9 Householder with parallelism in the first dimension.


1: L:= true
2: for i = 1,2, ... ,n do
3:
h:=O
4:
h{L) := A:,i
5:
cr:= sqrt{sum{h*h))
6:
if hi < 0 then cr:= -cr
7:
hi := hj + cr
8:
Pi:= cr*hi
9:
if Pi i= 0 then
10:
for j = i,i+ 1, ... ,n do
11:
Uj := sum(h*A:,j)
12:
A:,j{L) := A:,j - {h *Uj)/Pi
13:
end for
14:
end if
15:
Li := false
16: end for
Table 1.3 shows the execution time of the LALIB subroutine QR_FACTOR
and Algorithm 1.9 for large m and small n. The performance of Algorithm 1.9
improves as m increases with constant n. Notice that for large skinny matrices
the LALlB subroutine uses more memory compared to Algorithm 1.9.
Table 1.3.
rithm 1.9.

m
500
500
500
2000
2000
2000
4000
4000
4000
10000
10000
10000

Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and Algo-

n
8
16
32
8
16
32
8
16
32
8
16
32

QR_FACTOR

Algorithm 1.9

0.45
0.90
1.77
1.38
2.76
5.35
2.61
5.21
10.09
6.35
12.67
24.51

0.36
1.19
4.27
0.42
1.39
4.97
0.53
1.77
6.38
0.89
2.98
10.74

QR_FACTORI Algorithm 1.9

1.26
0.76
0.41
3.31
1.99
1.07
4.91
2.94
1.58
7.14
4.26
2.28

Linear models and QR decomposition

29

A block-version of Algorithm 1.9 can also be used but, under the assumption of small n (n es 2 ), the number of blocks will be at most two. Thus, any
savings in computational time in the block-version algorithm will be offset by
the overheads. If m n, then at some stage i of the Householder algorithm the
affected submatrix Ai:m,i:n of A could be considered skinny. This suggests that
in theory an efficient algorithm could exist that initially employs BPHA and
which switches to Algorithm 1.9 in the final stages [82].

QRD OF A SET OF MATRICES

The computation of the estimators in a set of regression equations requires


the QRDs

i= 1,2, ... ,G,

(1.47)

where Ai E SRmxnj (m> ni) is the exogenous full column rank matrix in the
ith regression equation, Qi is an m x m orthogonal matrix and Ri is an upper
triangular matrix of order ni [78]. The fast simultaneous computation of the
QRDs (1.47) is considered.

6.1

EQUAL SIZE MATRICES

Consider, initially, the case where the matrices AI, ... ,AG have the same
dimension, that is, n} = ... = nG = n. The equal-size matrices suggests that a
3-D array could be employed. The m x n data matrices A I , ... ,AG and the upper
triangular factors R I, ... ,RG can be arranged in an m x n x G array A and the
n x n x G array R, respectively. Using a 2-D mapping, computations performed
on scalars, I-D and 2-D arrays correspond to computations on I-D, 2-D and
3-D arrays when a 3-D mapping is used. Thus, in theory, the advantage over a
2-D mapping is that a 3-D arrangement will increase the level of parallelism.
The algorithms have been implemented on the 8192-processor MasPar MP1208, using the high level language MasPar-Fortran. On the MasPar, the 3-D
arrangement of the equal-size matrices is mapped on the 2-D array of PEs plus
memory, with computations over the third dimension being performed serially.
This indicates that under a 3-D arrangement the increase in parallelism will not
be as large as is theoretically expected.
The indexing expressions of 2-D matrices and the replication and reduction functions can be used in a 3-D framework. That is, the function spread
which replicates an array by adding a dimension and the function sum which
adds all of the elements of an array along a specified direction can be used.
For example, if B and C are m x n and m x n x G arrays respectively, then
C:= spread(B,3,G) implies that forall k, C:,:,k = B:,: and B:= sum(C,3) is
equivalent to B(i,j) = Lf=1 Ci,j,k whereas sum(C) has a scalar value equal to
the sum of all of the elements of C.

30

PARALLEL ALGORITHMS FOR LINEAR MODELS

The first method for computing the QR factorization of a matrix employs a


sequence of Householder reflections H = 1- hhT jb, where b = h T hj2. The
application of H to the data matrix Ai involves the vector-matrix computation
ZT = hT Ad b and a rank-one update Ai - hz T . Both of these operations can be
efficiently computed on an SIMD array processor using the replication function spread and the reduction function sum. The SIMD implementation of
the Householder algorithm for computing the QRDs (1.47) simultaneously is
illustrated in Algorithm 1.10. A total of n Compound Householder Transformations (CHTs) are applied. The ith CHT produces the ith rows of RI, ... ,RG
without effecting the first i-I columns and rows of A I, ... ,AG. The simultaneous data parallel vector-matrix computations and rank-one updates are shown
respectively in lines 12-14 of Algorithm 1.10.

Algorithm 1.10 The Householder algorithm.


1: defHouseh_QRD(A,m,n,G) =
2:
for i = 1, ... , n do
3:
apply transform(Ai:,i:,: , m - i + 1, n - i + 1, G)
4:
end for
5: end def
6: def transform(A, m, n, G) =
7:
H := A,I,:
8:
S:= sqrt(sum(H *H, 1))
9:
where (HI,: < 0) then S := -S
10:
HI,: := HI,: +S
11:
B:= HI,: *S
12:
W := spread(H,2,n)
13:
Z:= sum(W *A, l)jspread(B, l,n)
14:
A:=A-W*spread(Z,I,m)
15: end def
As in the Householder factorization algorithm, the Modified Gram-Schmidt
(MGS) method generates the upper triangular factor Ri row by row, with the
difference that it explicitly constructs the orthogonal matrices Qi, where Ai =
QiRi (i = 1, ... , G). Algorithm 1.11 shows the data parallel implementation of
the MGS method for computing simultaneously the QRDs in (1.47), where Ai
is overwritten by Qi. Recall that the computations over the third dimension of
a 3-D array are performed serially. Thus, in order to increase the performance
of the MGS algorithm, at the ith step (i = 1, ... , n) the subarrays Ri,i:,: and A,i,:
are stored in the 2-D (n - i + 1) x G array Rand m x G array A, respectively.
The implementation ofthe SK sequence to compute the QRDs (1.47) will be
briefly considered [129]. The ith CDGR applied to the matrixA j (j = 1, ... , G),

Linear models and QR decomposition

31

Algorithm 1.11 The Modified Gram-Schmidt algorithm.


1: defMGS_QRD(A,R,m,n,G) =
2:
for i = 1, 2, ... , n do
3:
apply orthogonal(A:,i:,: , Ri:,i:,:, m, n - i + 1, G))
4:
end for
5: end def
6: def orthogonal(A,R,m,n, G) =
7:
R := RI,:,: and A := A:,I,:
8:
RI,: := sqrt(sum(A *A, 1))
9:
A :=A/spread(RI,:, I,m)
10:
A.I" :=A
11:
W' :'~ spread(A,2,n-I)
12:
R2:,: := sum(W *A:,2:,:, 1)
13:
RI,:,: := R
14:
A:,2:,: := A:,2:,: - spread(R2:,:, I,m) * W
15: end def
has the block-diagonal structural form

i = 1, ... , m + n - 2,

G(i,}) =

Jr,
where
( " ")

Gkt,]

(c(i,})
k
(i,})

s(i,}))
k
(i,})'

-sk

2p + ~ + = m, and the values of p, ~ and


2p x G matrices such that
T
C :,]

({i,})

ci

ck

c(i,})
1

...

1
=,
... ,p,

Sdepend on i.
c(i,})
P

Let C and S be two

cp(i,}))

and
ST" = ((i,})
:,]

sl

_ (i,})
Sl

.. .

(i,})

sp

where j = 1, ... , G. Also, letA =AS+I:S+2p,:,: - that is, A:,:,i corresponds to the
2p x n submatrix of Ai starting at row ~ + 1 (i = 1, ... , G). The simultaneous
application of the ith CDGRs in a data parallel mode may be realized by:

A*I"2p"2"
. . '0'"" :=A2:2p:2::
" ,

(1.48a)

32

PARALLEL ALGORITHMS FOR LINEAR MODELS

A *2..
'2p'2",.,' := AI:2p:2::
"

(1.48b)

A := spread(C,2,n) *A + spread(S,2,n) *A*,

(1.48c)

and
where (1.48a) and (1.48b) construct the 3-D array A * by pairwise interchanging
the rows in the first dimension of A.
The algorithms have been implemented on the MasPar MP-1208 using the
default cyclic distribution for mapping arrays on the Data Parallel Unit (DPU).
In order to simplify the complexity of the timing models the dimension of the
data matrix Ai (i = 1, ... , G) is assumed to be a multiple of the size of the array
processor and G::; eS2. That is, m = Mesl and n = Nes2, where M ~ N ~ G,
eSI = 128 and eS2 = 64. Furthermore, the algorithms have been slightly modified in order to reduce the overheads arising in their straightforward implementation. These overheads mainly comprised the remapping of the affected
subarrays into the DPU and were overcome by referencing a subarray only if
it was using fewer memory layers than a previous extracted subarray.
The dimensions of the data matrices suggest that the time required to execute
the procedure transform in line 6 of Algorithm 1.10 is given by

<1>1 (m,n, G,esl ,es2) = Co +Clm+ G(C2 + C3ii+ qm+ csmii),

(l.49)

where m = m/esll, ii = n/es21 and Co,.. , Cs are constants. That is, the total
time spent in applying all of the CHTs, is given by
Nes2

<l>2(M,N, G,esl ,es2)

=L

<1>1 (Mesl - i + 1,Nes2 - i + 1, G,esl ,es2)

i=1

=N(co+cIN +C2M

+ G( C3 + C4N + csN2 + C6M + c7MN)) + '(M, N)

=TH(M,N,G) + ,(M,N),

(1.50)

where Co, ... , C7 are combinations of the constants in (1.49) and ,(M,N) is a
negligible function of M and N. A sample of more than 500 execution times
(in msec) were generated for various values of M and N. The least-squares
estimators of the coefficient parameters in the model TH(M,N, G) are found
to be Co = 13.23, CI = -0.49, C2 = 3.30, C3 = 1.89, C4 = 1.60, Cs = -0.12,
C6 = 1.56 and C7 = 0.72.
The timing model TH(M,N, G) can be used as a basis for constructing a
timing model of the MGS algorithm in Algorithm 1.11. Using backwards stepwise regression with the initial model given by TH(M,N, G), the execution time
model of the 3-D MGS algorithm is found to be

TMGS(M,N,G) =N(1O.75+2.69M
+ G(2.21 + 2.34N + 2.70M + 0.75MN)).

(1.51)

Linear models and QR decomposition

33

It can be observed that, unlike TH{M,N, G), this model does not include the N 2
and GN3 factors. This is because at the ith step of the 3-D MGS algorithm the
affected subarray of A has dimension m x (n - i + 1) x G which implies that
the timing model is given by L~~2 <1>1 (MeS1,NeS2 - i + 1, G,es1,es2).
Similarly, the timing model of the SK sequence is found to be
TG{M,N, G) =N{74.08 - 26.05N + 2. 18N2 + 37.16M - 1.68MN
+ G{9.11N - 5.03N2 + 6.50M + 12.46MN))

(1.52)

+4.03GM.

From Table 1.4 and analysis of the timing models it may be observed that
the SK sequence algorithm has the worst performance. Furthermore, the MGS
algorithm is outperformed by the Householder algorithm when M > N. For
G = 1 the timing models of the 3-D QRDs algorithms have the same order of
complexity as their corresponding performance models for the single-matrix
2-D QRDs algorithms [83]. The analysis of the timing models shows that,
in general, the 3-D algorithms perform better than their corresponding 2-D
algorithms with the improvement getting larger with G. The only exception is
the 3-D Givens algorithm which performs worse than the 2-D Givens algorithm
when the number of CDGRs is large. Figures 1.4 shows the ratio between the
2-D and 3-D algorithms for computing the QRDs, where G = 16.

Table 1.4.
M

4
4
4
4
8
8
8
10
10
10
10
10
10

1
1
3
3
5
5
5
1
1
1
5
5
5

5
10
3
8
3
6
10
5
8
10
5
8
10

Times (in seconds) of simultaneously computing the QRDs (1.47).

Householder
Exec.
TH(M,N,x)
x 10- 3
Time
0.87
1.48
2.60
5.65
9.07
16.26
25.71
1.74
2.51
3.02
16.76
25.55
31.22

0.88
1.51
2.59
5.65
9.06
16.26
25.86
1.76
2.55
3.07
16.76
25.51
31.34

Modified Gram-Schmidt
Exec.
TMGS(M,N,x)
X 10- 3
Time
1.12
2.06
3.28
7.66
11.46
21.35
34.29
2.34
3.54
4.31
21.51
33.44
41.08

1.13
2.05
3.26
7.63
11.47
21.33
34.48
2.33
3.51
4.29
21.55
33.35
41.22

Givens Rotations
Exec.
TG(M,N,G)
x 10- 3
Time
6.80
ll.4l
18.96
43.41
82.92
154.34
249.77
16.08
22.69
27.05
168.12
260.70
322.29

6.72
11.52
18.98
43.42
82.89
154.37
249.68
15.75
22.76
27.44
168.21
260.57
322.14

34

PARALLEL ALGORITHMS FOR UNEAR MODELS


Householder Transfonnations

Modified Gram-Schmidt

311

Givens Rotations

Figure 1.6. Execution time ratio between 2-D and 3-D algorithms for computing the QRDs,
where G = 16.

6.2

MATRICES WITH DIFFERENT NUMBER OF


COLUMNS

Consider the simultaneous computation of the QRDs (1.47) using Householder transformations, where the matrices are not restricted to having the same
number of columns. However, it is assumed that the data matrices are arranged
so that their dimensions are in increasing order, that is, nl ~ n2 ~ ... ~ nG. The
QRDs can be computed by applying a total of nG CHTs in G stages. At the
end of the kth stage (i = 1, ... , G) the QRDs of AI, ... , Ak are computed and the
first nk rows of Rk+ 1, ... , RG are constructed. In general, if Vi: A (i,O) = Ai and
no = 0, then the kth stage computes simultaneously the G - k + 1 factorizations

QTi,kA(i,k-l) -_

R(i,k)
1

i=k, ... ,G,

(1.53)

by applying nk - nk-l CHTs, where R~i,k) is upper triangular and a CHT comprises G - k + 1 single Householder transformations. Thus,

0) (I 0

. ...
Q1,2

ni _ 1

Linear models and QR decomposition

35

and
nl

n2 -nl

n3 -n2

ni-ni-I

R(i,l)

k(i,l)

k(i,I)

ft.(i,I)

R(i,2)
I

3
ft.(i,2)
2

ft.(i,2)
i-I

R(i,3)

Ri=

ft.(i,3)
i-2

i= 1, ... ,G,

R(i,i)
I

where
np+1 -np

R(i,P) 2

ft.(i,p)
2

ni-nj-I
ft.(i,p)

i-p+1

'

p= 1, ... ,i-l.

Figure 1.7 shows the stages of this method when G = 4. The orthogonal
matrix Q;k in (1.53) is the product of nk - nk-I Householder transformations,
(i,k)
f h k
say H (i,k)'
... HI
. At the pth (p = 1, ... ,nk - nk-I) step 0 t e th stage
nk

nH

the Householder transformations H~k,k), ... ,H~G,k) are applied simultaneously


to the augmented matrix (A~~:~~I) ., .A~~;:-I)) using complicated replication,
reduction and permutation functions that are not fully supported by the MasPar
software. As a result this theoretically efficient algorithm is outperformed by
an algorithm that triangularizes the data matrices AI, .. . ,AG one at a time with
each triangularization being computed in parallel using Householder transformations [88].
On a MIMD system, a straightforward method is to use a library routine to
compute one factorization at a time in parallel [16,28, 37, 121]. However, this
method will result in unnecessary communication between processors since the
QRDs (1.47) can be computed simultaneously with each processor computing
locally one QRD at a time. The task-farming approach can be used to achieve
the required locality with very low inter-processor communication. Let Po
denote the master processor from a given set of p processors Po, PI, ... , Pp_ l ,
where p G. Initially Po sends Ai to processor Pi (i = 1, ... , p - 1) for factorization. When processor Pi has completed the factorization of Ai it sends it back
to the master processor, which then sends to Pi another matrix for factorization
unless all of the matrices have been factorized. A SPMD (Single-Program,
Multiple-Data) pseud<H:.:ode for the task-farming approach is given in Algorithm 1.12. The algorithm (program) is executed by all processors on their
local data. For simplicity, only the send and receive messages relating to matrices are included and for load balancing the matrices A I, ... ,AG are ordered
such that nl :2 n2 :2 ... :2 nG

36

PARALLEL ALGORITHMS FOR LINEAR MODELS


' - " " "'1 """'---+-1I 1 ~

----.. "1""'-

11).'- - - - - - - -..., ... ..,- - - -

--.... ftl40--

"t ' "

-r~

______

- - - - - I

QT) ,1A(.ol

Stage I

,
n - nl

m-tll

1
m-f)L-_~L-____~

or

Stage 2

Figure 1.7.

...... - ny+

11.(0.11

Stage 3

Stages of computing the QRDs (1.47).

An alternative approach (hereafter called scattering ) that does not necessitate inter-processor communication is to distribute the matrices evenly over
the processors. Each processor then factorizes its allocated matrices one at a
time using the SPMD paradigm. The main difficulty in this approach is determining how to distribute the matrices over the processors in order to achieve
load balancing. In the case where all the data matrices have the same dimen-

Linear models and QR decomposition

37

Algorithm 1.12 The task-farming approach for computing the QRDs (1.47) on
p (p

G) processors using a SPMD paradigm.

1: if (Processor Po) then

2:
3:
4:
5:
6:
7:
8:
9:
10:
11:

12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:

send the matrix Ai to processor Pi (i = 1, ... , p - 1)


else
receive a matrix from processor Po
send the QRD of the matrix to processor Po
end if
Processor Pj (j = 0, 1, ... , p - 1) has active status
while (Processor is active) do
if (Processor Po) then
receive the QRD from some processor Pj (1 ::; j < p)
if (Number of QRDs received = G) then
The status of processors Pj and Po become idle
else if (Number of matrices sent = G) then
The status of processor Pj becomes idle
else
send next matrix for factorization to processor Pj
end if
else
receive a matrix from processor Po
send the QRD of the matrix to processor Po
end if
end while

sion then each processor will hold G / p matrices, under the assumption that
G is a mUltiple of p; otherwise, the matrices Ai] , ... ,Ai,," are allocated to the
processor Pi (i = 1, ... , p), so that
I

(1.54)

where f(m,ni) is the execution time required to compute the QRD of Ai in a


single processor and 1 ::; Ai ::; G. The Load in (1.54) can be derived from the
task-farming method using the timing models f(m,ni) or their corresponding
complexity functions. Initially, the matrix Ai is allocated to processor Pi (i =
1, ... ,p) which computes its complexity. The next matrix in the list is allocated
to the processor with the minimum accumulated complexity.

38

PARALLEL ALGORITHMS FOR LINEAR MODELS

The task-farming and scattering methods have been implemented on an 8processor distributed memory system - the IBM SP2 - using MPI (Message
Passing Interface), Fortran77 and the LAPACK QRD subroutines [3]. It was
assumed that the matrices were local to each processor, i.e. the task-farming
method is reduced to sending the index of a matrix to be factorized rather
than the matrix itself to the processors. The experiments were performed with
G = 50, m = 5000 and 50 ~ ni ~ 300 (i = 1, ... , G). Table 1.5 shows the
time in seconds, speedup and efficiency achieved by each method. The main
difference between the task-farming and scattering methods is that the latter
uses all processors for computation while the former uses one processor only
for communication.
Table 1.5.

The task-farming and scattering methods for computing the QRDs (1.47).

Method

Time

Idle time

Speedup

Efficiency

Serial (One processor)


Task-Farming
Scattering

296.5
43.4
39.3

0%
12%
0%

1.00
6.83
7.54

1.00
0.85
0.94

Chapter 2
OLM NOT OF FULL RANK

INTRODUCTION
Consider the Ordinary Linear Model (OLM)
(2.1)

y=AX+E,

where A E Sltmxn (m > n) is the exogenous data matrix, y E Sltm is the response
vector and E E Sltm is the noise vector with zero mean and dispersion matrix
(J2/m. The least squares estimator of the parameter vector x E Sltn
argminET E = argmin IIAx- y112,
x
x

(2.2)

has an infinite number of solutions when A does not have full rank. However,
a unique minimum 2-norm estimator of x say, X, can be computed.
Let the rank of A be given by k (k :::; n). The solution of (2.2) is computed
in two stages. In the first stage the coefficient matrix A is reduced to a lower
trapezoidal form; in the second stage the lower trapezoid is triangularized. The
orthogonal decompositions of the first and second stages are given, respectively, by:
k

o)m-k

Lz

(2.3)

and

n-k k

(LI L2)P=

(0

L)'

(2.4)

40

PARALLEL ALGORITHMS FOR LINEAR MODELS

where Q E 9tmxm and P E 9tnxn are orthogonal, Land L2 are lower-triangular


and non-singular, and II E 9tnxn is a permutation matrix. That is,

n-k

QTAnp~ (~

o)m-k
L k

(2.5)

The orthogonal decompositions (2.3) and (2.5) are called QL decomposition


(QLD) and complete QLD , respectively.
The minimum 2-norm best linear unbiased estimator of x is given by

x= PL-1QT
2
2Y,
A

where
and
Numerous methods have been proposed for computing the orthogonal factorizations (2.3) and (2.4), on both serial computers and MIMD parallel systems
[16, 18,51,93].
Algorithms are designed, implemented and analyzed for computing the complete QLD on the CPP DAP 510 massively parallel SIMD computer (abbreviated to DAP) [70]. The algorithms employ Householder reflections and Givens
plane rotations. Algorithms are also proposed for reconstructing the orthogonal matrices involved in the decompositions when the data which define the
orthogonal transformations are stored in the annihilated parts of the coefficient
matrix A. The implementation and execution time models of all algorithms
on the DAP are considered in detail. All of the algorithms were implemented
on the 1024-processor DAP using double precision arithmetic. The timing
models are expressed in msec.

THE QLD OF THE COEFFICIENT MATRIX

The computation of the QLD (2.3) using Householder reflections with column pivoting is considered. This method is also used when A is of full column
rank but ill-conditioned. Let the elementary permutation matrix I~i,Jl) denote
the identity n x n matrix In with columns n - i + 1 and Jl interchanged and let

Qf =Im-

h(i)h(i)T
hi

(2.6)

denote an m x m Householder matrix which annihilates the first m - i elements


of A:,n-i+l (pivot column), when it is multiplied by A on the right. The matrices

OLM not offull rank

41

Q and n in (2.3) are defined by


k

QT =

TIQLi+l = QrQLI ... Qf


i=1

and

n=

TI /(i,lIi) = /(I,JJJ) /(2,/-12) ... /(k,lIk).


k

i=1

To describe briefly the process of computing the QLD (2.3) let, at the ith (0 :$
i:$ k) step,
A (i) =

QT. QfA/~I'IIJ) ... I~i'lIj)


n-i

Vii

)m-i
i

'

where L~i is non-singular and lower-triangular with its diagonal elements


in increasing order of magnitude. The value of Jli+ I is the index of the column of
with maximum Euclidean norm. The criterion used to decide that

LW

. II (k) " 112 < 't, where A (


k).
(k)
rank ()
A = k IS A I'm-k
I'm-k
"
IS the JlH I column of Lit
,
'rk+l
'
,rk+l
and 't is an absolute tolerance parameter whose value depends on the scaling
of A [51,63,93]. The value of't is assumed to be given.
The permutation matrix n can be stored and computed using one of the two
n element integer vectors ~ and ~, where

A permutation I~i'lIi) is equivalent to swapping first the elements ~n-i+1 and ~lIi
of the ~ vector and then swapping the elements n - i + 1 and Jli of the ~ vector
where, initially, ~i = ~i = i (i = 1, ... ,n).

2.1

SIMD IMPLEMENTATION

The QLD (2.3) has been computed on the DAP under the assumption that
m = Mes and 1 < m :$ es 2 . That is, the dimension of the matrix
A is an exact multiple of the edge size of the array processor and m/es 21=
M / es1 = 1. A sample of execution times has been generated from the application of a single Householder reflection on matrices with various dimensions.
A regression model has been fitted to this sample with the execution time denoting the response variable. The predetermined factors of the timing model

= Nes,

42

PARALLEL ALGORITHMS FOR UNEAR MODELS

are derived from the number of es x es layers involved in the arithmetic computations and the number of times layers are replicated or reduced. The time
required to construct and apply the ith Householder reflection is found to be:

TI (M,N)

= ao + al rm/es21+ a2 rm/es1+ a3 rm/es1rn/es1


(2.7)

where d J = ao + aJ = 2.43, a2 = 0.014 and a3 = 1.683.


Notice that
A affects only the leading (m - i + 1) x (n - i + 1) submatrix
of A. In order to reduce the execution time, the Householder reflections are
divided into K sets S(I), ... ,S(K), where K = rk/es1. The set s(i) (i = 1, ... ,K)
. 0 fth e refl ectlOns
.
S(J)
h S(i)(J = 1, ... , tj ) .
. Ient to
1 .. , S(K)
ti ,were j
IS eqUlva
conSIsts

QT

Q~-J)es+j'

tK =

k- (K -1)es and

tl = ... = tK-I

= es. If mj = (M - i+ l)es

= (N - i + l)es, then the Householder reflection

sjil is applied only to


the submatrix A(i) =AJ:m;,l:nj' The Householder reflections si j ), . ,S~il use the
and

nj

same number of es x es layers of memory and so they have the same execution
time when applied on the left of A(i).
Algorithm 2.1 effects the computation of the QLD (2.3). The procedure
column_swap performs the appropriate column and element interchanges on
its matrix and vector arguments if nj - j + 1 f= Ilj. Initially, the squared Euclidean norms of the columns of A(i) (denoted by A) are stored in VI: ni For
greater accuracy VI :ni is recomputed prior to the applications of the Householder reflections in S(i). The Householder vectors are stored in the annihilated
positions of A and in the last k elements of V. Within the context of OLM estimation, Algorithm 2.1 can be applied to the augmented matrix (Y A) except
that the permutations are performed only on the columns of A.
The time spent in applying all the Householder reflections is I,f:, J tj TJ (mj, nj).
If A has full column rank - that is, k = n and tj = es (i = 1, ... ,N), then the
total time is:

In this case, the estimated execution time of Algorithm 2.1 is found to be:
TQdM,N) = N(105.92 + 31.79M +5.86N + 27.81MN -9.18N2).

This estimate is derived using the backward stepwise regression method, where
the explained variable is the execution time and the initial explanatory variables
are determined after evaluating T2(M,N) [83, 137]. Table 2.1 shows the high
accuracy of the timing models T2(M,N) and TQdM,N) when used to predict
the execution time of Algorithm 2.1. Notice that the time spent in applying the
Householder reflections is approximately 90% of the total execution time of
Algorithm 2.1, that is, T2(M,N)/TQdM,N) ~ 0.90.

OLM notoffull rank

43

Algorithm 2.1 The QL decomposition of A.

1: let ~i = ~i = i (i = 1, ... , n)
2: fori:=1,2, ... ,Ndo
3:
let mi == (M - i + l)es, ni
4:
Vl: n; := sumr(A *A)
5:
forj:=1,2, ... ,esdo
6:

7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:

== (N - i + l)es and A == Al:m;,l:n;

Ilj :=max(Vl, ... ,vn;-j+J)

if VPj < 't then


STOP
else

column_swap(A, V, ~, ~, ni - j + 1,11 j)
h:=O
hl:m;-j+l := Ai:m;-j+l,n;-j+l
s:= sqrt(Vn;_j+J)
if hm;-j+l < 0 then s:= -s
hm;-j+l := hm;-j+l +s
Vn;-j+l := hm;_j+l
b:=s*hm;_j+l
Am;-j+l,n;-j+l := -s
W = matc(h,ni - j)
X = sumr(W *Al:m;-j+l,l:n;-j)/b
l,l:n;-j := l,l:n;-j - W * matr(X, mi)
.
2
Vl:n;-j'= Vl:n;-j- (Am;-j+l,l:n;-j)

21:
22:
23:
end if
24:
end for
25: end for

TRIANGULARIZING THE LOWER TRAPEZOID

Two methods for computing the factorization (2.4) are presented. The first
uses Householder reflections and annihilates one row at a time of the Ll matrix;
the second uses Givens rotations for simultaneously annihilating elements in
certain rows of Ll.

3.1

THE HOUSEHOLDER METHOD

A Householder algorithm is considered for computing the factorization (2.4).


For notational convenience let fi == n - k, Ll == L and L2 == t. The ith Householder reflection has the form
(2.8)

44

PARALLEL ALGORITHMS FOR LINEAR MODELS


Computing the QLD (2.3) (in seconds), where m = Mes and n = Nes.

Table 2.1.
Matrix dimension
m
n

256
256
256
320
320
320
480
480
480
480
480

Predictions x 1()'
T2(M,N)
TQdM,N)

Execution time
of Algorithm 2.1

32
64
224
32
96
288
32
96
288
352
416

0.58
1.56
10.56
0.70
3.58
20.12
0.99
5.31
32.82
45.38
58.90

where Yi = Li,i S, s = I\L;,: \I


cation of Pi on the right of (L
2

(L- LA) Pi=

0.56
1.56
10.56
0.70
3.58
20.12
1.00
5.31
32.82
45.38
58.90

Ratio
T2(M,N)/TQdM ,N)

0.51
1.40
9.61
0.62
3.26
18.50
0.89
4.89
30.64
42.38
54.98

0.91
0.90
0.91
0.89
0.91
0.92
0.89
0.92
0.93
0.93
0.93

92
.
+ Li,i'
Ci = y;s and Ik = (el ... ek). The apph-

L) gives
(-L-xL-T,: L-Yixei
A
T) , where
i

-9
x=(LLi,:+YiL:,i)/Ci.

(2.9)

The reflection (2.9) annihilates the ith row of L by modifying all of L but only
the ith column of L. After the reflections PI, ... , Pi (1 ~ i ~ k) have been
applied on the right of (L L), the first i rows of L are zero and L remains
lower-triangular. The orthogonal matrix P in (2.4) is defined as P = PI P2 Pk.
Figure 2.1 shows the annihilation pattern for L when using Householder reflections.

-n----k!

After Stage 1

After Stage 2

After Stage 3

o 0 om
mmmm e
mmmm e e
mmmm e e e

o 0 o e m
m m m em e
m m m em e e

e
e e
o 0 o e em
m m m e em e

@!l Modified

o Unchanged

@] Annihilated

Figure 2.1.

After Stage 4

e
e e
e e e
e e e m

D Zero Element.

Annihilation pattern of (2.4) using Householder reflections.

The Householder reflections are divided into sets S( I) .. S(K), where K =


rk/es1- The set S(i) (i = 1, ... ,K) comprises the reflections S~i), ... ,S~/), where
tl

=0 =k-

(K - 1)es and t2 = ... = tK

. eqUlv
. alent to Pj+t(i) , were
h
t (i) =

1S

~i-I

= es. The reflection S)i) (j = 1, ... ,ti)

"-q=1 tq .

OLM not offull rank

The reflection

45

SY) is applied to the k; xii submatrix Vi) == Lt(i)+I:k,l:n and

k;-element subvector Vi) == .Lt(i)+I:k' where k; = k - t(i). Algorithm 2.2 effects


the computation of the orthogonal factorization (2.4) using Householder reflections, where the function sign returns the sign of its argument, L* E 9t kixn ,
.L* E 9t k;xt; and yE 9tt;. The vectors Li,: and scalars Yi of the Householder reflections PI, ... , Pk are stored in the annihilated positions of L and in the k-element
array y.

Algorithm 2.2 Triangularizing the lower trapezoid using Householder reflections.


1: let ii

= n - k, K = rk/es1, tl = k - (K - 1)es and t2 = ... = tK = es

2: for i:= 1,2, ... ,K do


let t(i) -= ~;-I
t k = k-t{;) ,L-*= L-t(')+I:k,l:n'
.
3 '.
~q=1 q' I -

.L*

== .Lt(i)+I:k,t(i)+l:t(i+I) and Y== Y1+t(i):t(i+I)

for j:= 1,2, ... ,ti do

4:
6:

s:= sign(.Lj)sqrt(sum(Lj,: *Lj) + (.Lj)2)


L *j,j+s
Yj:=

7:

8:

x:= (sumc(L* *matr(Lj,;,kj )) +Yj *.L~j)/c

9:

Lj+l:k,l:n:= Lj+l:k,l:n -matc(xj+l:k,ii) *matr(Lj,;,k; - j)

5:

10:
11:
12:

:=Yj*S

.Lj+l:k,j := .Lj+l:k,j - Yj *Xj+I:k


.Lj,j := -s
end for

13: end for


The time required to apply S)i) is found to be:

T(k;,ll) =ao+alll+a2rk;jesl +a311rk;jesl,


where II = rii/esl, ao = 1.57, al = 0.06, a2 = 0.22 and a3 = 0.54. Thus, the
total time spent in applying the Householder reflections in Algorithm 2.2 is
given by:
K

TCQd K , ll) = Lt;T(k;, ll)


i=1

= O(ao +alll +a2 K +a311 K )


(K -1)
+es 2 (2ao+2alll+a2K+a311K).

46

PARALLEL ALGORITHMS FOR LINEAR MODELS

The timing model TCQdK,11) can be used to predict the total execution time
of Algorithm 2.2 which has negligible organizational overheads.
Notice that, if tl = ... = tK-I = es, tK = 0, K> 1 and 0 =I es, then the
execution time of Algorithm 2.2 will increase by (es - o)(K -1)(a2 +a311)
msec. Observe also that if D = (L* L~j) and fT = ((Lj)T Yj), then lines
8-10 of Algorithm 2.2 may be expressed as:

sumc(D *matr(f,ki))/C and


:= Dj+I:k,I:ii+l -matc(xj+l:k,n+ 1) *matr(f,ki

X:=
Dj+I:k,I:ii+1

j).

This formulation uses fewer es x es layers of memory than Algorithm 2.2 when
r(n+ 1)/esl However, on the DAP, the slight improvement in speed of
using fewer memory layers is lost to the overheads incurred in placing L~j in
D and then storing it in L.

11 =

3.2

THE GIVENS METHOD

A parallel Givens sequence (hereafter called PGS) requiring n - 1 compound disjoint Givens rotations (CDGRs) to compute the orthogonal decomposition (2.4) is proposed. The PGS does not create any non-zero elements above
the main diagonal of t and previously created zeroes of L are preserved. Let
Gi,j E 9tnxn denote a Gives rotation that annihilates Li,j when applied from the
right of (L t), and which affects only the jth and ith columns of L and t,
respectively. At most min (n,k) disjoint Givens rotations can be applied simultaneously: here it is assumed that n :S k. The elements annihilated after the
application of the ith CDGR, denoted by G(i), lie on a diagonal and their total
number is given by

ei= {

if 1 :S i

< n,

if n :S i:S k,

n+k-i ifk<i<n+k.
Table 2.2 shows the Givens rotations comprising G(i) of the PGS, the diagonals and the total number of elements annihilated after the application of G(i) ,
where di,j denotes the diagonal of L starting at element Li,j. In Fig. 2.2(a), i
(1 :S i < n) corresponds to the elements of L annihilated from the application of
G(i), where k = 16 and n = 22 (n = 6). Figure 2.2(b) illustrates an alternative
annihilation scheme equivalent to the PGS.
The application of G(i) affects ei consecutive columns of each L and L. Let
the submatrices comprising these columns be denoted by Vi) and t(i), respectively. Furthermore, let x = (XI, . .. ,xej ) denote the elements of Vi) annihilated
after the application of G(i) and the corresponding elements of t(i) used in
constructing the rotations be given by Y = (Yl, .. . ,Yej). The application of G{i)

OLM not o/full rank


Table 2.2.

47

The CDGRs of the PGS for computing the factorization (2.4).


G(i)=

Diagonals annihilated

ei

Gl,n
Gl,n-1 G2,n
Gl,n-2 G2,n-l G3,n

dl,n
dl,n-l
dl,ii-2

I
2
3

Gl,1 G2,2 ... Gii,ii


G2,IG3,2 ... Gii+l,n

dl,1
d2,1

ii
ii

k+I

Gk-ii+I,1 Gk-ii+2,2'" Gk,ii


Gk-ii+2, I Gk-n+3,2 ... Gk,ii-l

dk-ii+I,1
dk-ii+2,1

ii-I

k+ii-2
k+ii-I

Gk-l,I Gk,2
Gk,1

dk-1,1
dk,l

2
I

2
3

ii
ii+I

6
7
8
9

5
6
7
8
10 9

4 3 2 1
5 4 3 2
6 5 4 3
7 6 5 4
8 7 6 5
11 10 9 8 7 6
12 11 10 9 8 7
13 12 11 10 9 8
14 13 12 11 10 9
15 14 13 12 11 10
16 15 14 13 12 11
17 16 15 14 13 12
18 17 16 15 14 13
19 18 17 16 15 14
20 19 18 17 16 15
21 20 19 18 17 16

1
2
3
4
5
6
7
8
9

(i

= 1, ... ,n -

3
4
5
6
7
8
9

4
5
6
7
8
9

5
6
7
8
9

6
7
8
9

10
10 11

10 11 12
10 11 12 13

10 11 12 13 14
10 11 12 13 14 15

11 12 13 14 15 16

12 13 14 15 16 17
13 14 15 16 17 18
14 15 16 17 18 19
15 16 17 18 19 20
16 17 18 19 20 21
(b) Alternative PGS.

(a) PGS.

Figure 2.2.

2
3
4
5
6
7
8
9

ii

Givens sequences for computing the orthogonal factorization (2.4).

1) on the right of

(L L)

can be written as

48

PARAUELALGORITHMS FOR LINEAR MODELS

where

G (i)

CI
( -SI

'.
.. .

SI
Ceo
I

-Se;

CI

Se

.. .

CI

C2
C2

CI

C2

CI

C = matr(c,k) =

I,

Ce;

C")
Ce;

...

C ei

and

S = matr(s,k) =

SI
SI

SI
S2

SI

S2

::

...
...

sel)
S

~'.

Se;

In the implementation of the PGS on the DAP, let n = 11es and k = Kes.
The application of the CDGRs is divided into the three phases <1>1, <1>2 and <1>3.
Phase <l>i applies the CDGRs in S(i) ( i = 1, 2, 3 ), where
\

S)i} =

G(j)
{

G(ii+~-I)
G(k+J-I)

if i = 1 and j = 1, ... ,n - 1
if i = 2 and j = 1, ... ,k - n,
if i = 3 and j = 1, ... ,no

Phase <1>2 is divided into the K -11 sub-phases <PI,, <PK-1l' where, at subphase <Pi, the diagonals dl,l, d2 ,1, ... ,des,1 of the (K + 1 - i)es x 11es submatrix
L(i-I)es+l:k,l:ii are annihilated. In phases <1>2 and <1>3 previously annihilated
es x es submatrices of L and L are excluded from the computations. Figure 2.3
shows the phases and the sub-phases of PGS, where es = 4. At the beginning
of sub-phase <P3, the top 8 x 8 submatrix is zero and is excluded from the
computation (2.10).
A timing model for the PGS needs to be derived in order to compare the performance of the Givens and Householder algorithms on the DAP. This model
can be obtained by adding the (parameterized) estimated execution times of the
different phases of the algorithm. Then the factors of the total execution time
model, say TROT(K, 11), are used in the stepwise regression to construct a timing
model of the Givens algorithm, say TpGs(K, 11), which also takes into account
the various organizational aspects of the implementations that do not occur in
the different phases. Furthermore, the analysis and comparison of TROT(K, 11)
and TpGs(K,11) will indicate any inefficiencies in the implementation of the
algorithm.
From experiments, the estimated execution time for constructing and applying the CDGR G(i) (i = 1, ... , n - 1) in (2.10) is found to be:

t(K,ei)

= 1.5 + K re;jesl(O. 887 +0.004K -0.006re;jesl

OLM not offull rank

Phase CPt

Phase CP2

49

Phase CP3

7 1615

1716
7

.I~

o Annihilated element. 0 Zero element.


Figure 2.3.

1
2[1

32 1

[!] Non-zero element.

Illustration of the implementation phases of PGS, where es = 4.

Thus, the time required to apply all the CDGRs of the PGS is:
11

TROT(K,11) =es Il(K, i) - T(K, 11)

(from Phase CPI)

i=1

K-11

+es L T(K + 1-i,11)

(from Phase <1>2)

i=1

11

+ es LT(11 + 1 - i, 11 + 1 - i)
i=1

After expansion and using backward stepwise regression, the execution time
model of the PGS (including the overheads) on the DAP is given by:

TpGs(K,11) = K(ao +al11 + a2K + a311K + a4112) +11(as - a6112) ,


where ao = 35.70, al = 38.77, a2 = 1.10, a3 = 17.82, a4 = 15.07, as = 65.93
and a6 = 2.92. To two significant digits, TpGs(K,11) and the execution time of
PGS on the DAP are the same. From Table 2.3 it is obvious that the Householder reflections method is superior to the Givens rotation method.

COMPUTING THE ORTHOGONAL MATRICES

In some cases, such as in the deletion of data from the OLM and in recursive constrained OLM, the orthogonal matrices must be formed [75, 85].
The algorithms used in computing the orthogonal factorizations (2.3) and (2.4)

50

PARALLEL ALGORITHMS FOR UNEAR MODELS


Table 2.3.

Computing (2.4) (in seconds), where k

Matrix dimension
n-k
k

32
192
192
192
352
352
352
352
512
512
512
512
512

Execution time
of Algorithm 2.2

TCQdK,TJ)

0.08
0.83
1.58
2.34
2.20
4.55
11.59
13.94
4.21
9.06
23.44
33.09
37.92

0.08
0.82
1.58
2.33
2.18
4.52
11.55
13.89
4.14
8.94
23.34
32.94
37.74

32
32
96
160
32
96
288
352
32
96
288
416
480

= Kes and n - k = TJes.


Predictions x 103
TpGs(K,TJ)

0.17
1.28
3.81
6.85
3.34
9.89
35.67
45.84
6.34
18.69
65.50
103.43
124.00

TROT(K,TJ)

0.15
1.11
3.19
5.67
2.83
8.38
30.65
39.07
5.33
15.99
58.23
92.44
110.49

hold the information necessary for regenerating these matrices. Algorithms


for the explicit formation of the orthogonal matrices Q and P are given for the
case where the factorizations (2.3) and (2.4) are computed using Householder
reflections.
The reconstruction of the orthogonal matrix QT in (2.3) can be considered to be similar to that of Algorithm 2.1 when the Householder vectors h{i)
(i = 1, ... ,k) in (2.6) are known. After the execution of Algorithm 2.1 the
value of b i (i = 1, ... ,k) in (2.6) can be computed as bi = -Vn-i+lAm-i+l,n-i+l.
However, executing Algorithm 2.2 after Algorithm 2.1 results in the entries of
Am-i+l,n-i+l being overwritten by Am-i+l,n-i+l -'Yk+l-i. Thus, bi is computed
as -(Am-i+l,n-i+l + 'Yk+l-i)Vk-i+l, where 'Yi (i = 1, ... ,k) is defined in (2.9).
For simplicity let H = (h:,l h:,2 ... h:,k) and ~ = (~l ~2 ... ~k)' where h:,i and
~i correspond to the Householder vector h(k+l-i) and bk+l-i in (2.6). Algorithm 2.3 defines the steps in reconstructing the orthogonal matrix QT.
The execution time of Algorithm 2.3 is given by:
K

TQ(M,k)

= Ltr(ao+alM +a2M(M + 1- i))


i=l

= es

(K -1)
2

(2aO+ 2alM +a2M (2M +2-K))

+ O(ao +alM +a2M(M + 1-K)),


where m = Mes, K = fk/esl, ao = 0.45, al = 0.19 and a2 = 0.55.
The data defining the ith Householder matrix Pi (i = 1, ... ,k) in (2.8) occupy
the ith row of L == L}, the ith position of the k-element array 'Y and the ith

OLMnotoffullrank

51

Algorithm 2.3 The reconstruction of the orthogonal matrix Q in (2.3).


1: letK= fk/esl and<>=k-(K-1)es.
2: QT :=Im
3: let tl = ... ,tK-I = es and tK =
4: fori= 1, ... ,Kdo
5:
let mj == (M - i + 1)es, nj == (N - i + 1)es, kj == k + 1 - L~= I tj,

o.

-T

6:

== QI:m;,I:m' H == HI:m;,k;:k;+t;

for j:= ti,ti - 1, ... ,1 do


W:= matr(h:,j,m)

and ~

== ~k;:k;+t;.

7:
8:
QT := QT - matc(sumc(QT * W) /~ j, mj) * W
9:
end for
10: end for
diagonal element of L == Am-k+l:m,n-k+l:n. The value of Cj in (2.8) can be
computed as -1iAm-k+i-I,n-k+i-l. It can be proved that the matrix p('J...) =
PI ... P'J... has a special structure which facilitates the development of an efficient
algorithm for computing P = p(k).
THEOREM

2.1 The structure of the matrix p('J...) = PI ... P'J... is given by

ii
pi') =

k-A

(D~')

o
I

)ii+A

(2.11)

k-A'

where the bottom A x A submatrix ofW(A.) is upper-triangular.


PROOF

2.1 Base step: For A = 1 the matrix p('J...) is given by

11 ef)
11 -

--LI"
C}

1-

ri
CI

'"

o
o
h-I

which satisfies (2.11) with

and

W(1)

== (

--Ll"
11 - )
C}

ri'" .

1-CI

52

PARALLEL ALGORITHMS FOR LINEAR MODELS

Inductive step: Assume that p(A) has the structure defined in (2.11). It must
be shown that p(A+ 1) also has this structure.
Now,
P (A+I) -_ p(A)p1..+1

D(A)

-T
_D(A)LA+ 1,:LA+ 1,:

W(A)

CA+l
_ YI..+I IT
CA+I 1..+1,:
0

0
0

_ YA+l D(A)I
1..+1,:
CA+I
1- i+1
CA+I
0

0
0

h-A-l

and
W(A)
W(A+l) = (

YA+ 1D(A) I 1..+1,..)


+l
1- i+1
.
cA+I

CA

Notice that, given W(A), W(A+I) can be derived by computing only its ith
column W~~: ~). Thus, given p(A), the calculations of D(A+ I) and W~~: ~) can
be summarized as follows:
let x = -1- (D(A)I1..+1,: ) in
cl..+I
YA+l
and
where efiH+l is the last column of the identity matrix I fiH +1. Algorithm 2.4
shows, in data-parallel mode, the reconstruction of the orthogonal matrix P.
On the DAP the implementation of Algorithm 2.4 comprises /l phases <1>1 , ... ,
<1>1/' where /l = rn/es1-11 + 1 and 11 = r(ii + I) /es 1- During <1>1, .. , <1>1/-1 the
number of rows of the affected submatrix of PI :n, l:fi is multiple of es. Phase <1>i
(i = I, ... , /l) has Si steps with the jth step being equivalent to step (j + L~~\ Sq)
of Algorithm 2.4, where
<>= 11es-ii
{
Si= es
k - <> - (/l- 2)es,

ifi= I,
ifi=2, ... ,/l-I,
if i = /l and /l > 2.

OLM not offull rank

53

Algorithm 2.4 The reconstruction of the orthogonal matrix P in (2.4).


1: P:= In
2: for i:= 1,2, ... ,k do
3:
Ci := -iAm-k+i-l,n-k+i-l

4:
5:
6:
7:

8:

Xl:iHi-l := sumc(Pl:iHi-l,l:ii *matr(Li,:,Ii+i -1))


XiHi := i
x :=X/Ci

Pl:iHi,l:ii := Pl:iHi,l:ii Pl:ii+i,ii+i := Pl:ii+i,iHi -

9: end for

matc(x,li) * matr(Li,:, Ii + i)

i *x

In Fig. 2.4 a grid and a shaded box denote the submatrices Pl:n,l:ii and Pl: ii+ Si ,l:ii
(i = 1, ... ,p), where k = 17, Ii = 6 and es = 4.

SI

Figure 2.4.

=2

S2

=4

S3

=4

S4

= 4

S5

= 3

Thefill-in ofthe submatrix PI:n,l:ii at each phase of Algorithm 2.4.

The execution time of the ith step of Algorithm 2.4 is found to be

where a = 0.78, al = 0.21 and a3 = 0.55. Thus, for ni = Ii + L~=l Sj, the estimated time of executing Algorithm 2.4 on the DAP, excluding the overheads,
is given by:
II

Tp(k,li) = ~>i(ao+(al +a2 fli/esl Hni/esl)


i=1
II

= ISi(ao+ (al +a2fli/esl)(n + 1- i)).


i=l

54

PARALLEL ALGORITHMS FOR LINEAR MODELS

= fies and k = Kes, which implies that 11 = K and


then Tp(k,ii) may be written as

If ii

Tp(Kes,fies) = es I,(ao + (aJ

Si

= es (i = 1, ... ,11),

+ a2fi )(fi + i))

i=J

= ~ K(2ao + (aJ + a2fi)(2fi + K + 1)).


Table 2.4 shows (in seconds) TQ(x/es,y) and Tp(x,y) for some values x and
y. The corresponding execution times of Algorithm 2.3 and Algorithm 2.4 are

also given. As in the previous cases the timing models are highly accurate.
Table 2.4.

Times (in seconds) of reconstructing the orthogonal matrices QT and P on the DAP.
Algorithm 2.3
T{2(M,k) x 103
Mes = x and k = y

160
160
288
288
288
416
416
416
416
544
544
544
544

32
160
32
160
288
32
160
288
416
32
160
288
416

DISCUSSION

0.48
1.53
1.49
5.86
7.70
3.05
12.96
19.22
21.81
5.15
22.75
35.83
43.90

0.48
1.54
1.48
5.85
7.70
3.05
12.96
19.25
21.90
5.17
22.87
35.82
44.02

Algorithm 2.4
Tp(k,n - k) x 103
k=xandfi=n-k=y

0.60
3.90
1.51
8.71
20.96
2.82
14.99
34.46
61.25
4.51
22.76
50.58
87.81

0.61
3.88
1.53
8.68
20.87
2.83
14.98
34.40
61.08
4.52
22.79
50.56
87.83

Aspects of the implementation of methods for solving the non-full column


rank OLM on a massively parallel SIMD computer have been considered.
From the experimental results and analysis of the timing models it has been
established that the Householder method is superior to the Givens method for
computing the orthogonal factorization (2.4). The performance of the algorithms has not been considered when the rank of the A matrix is close to n
(number of exogenous variables). In this case, a large number of processing
elements will remain idle during the computation if the default mapping is applied to distribute L and t over the processing elements. This will result in a
degradation of the performance of the algorithms. The implementation of the
algorithms using different mapping layouts for distributing the matrices over

OLM not o/full rank

55

the processing elements in order to achieve maximum performance remains to


be investigated [25].
In the extreme case where n - k = 1 in the factorization (2.4), the Householder algorithm is equivalent to PGS which is, in tum, equivalent to a simple sequential Givens algorithm. It may be that PGS performs better than
the Householder algorithm when k is very close to n. The design of a hybrid algorithm similar to the one in [82] also merits investigation. The Givens
and Householder algorithms, based on different mapping strategies, should be
combined to achieve the best performance.

Chapter 3
UPDATING AND DOWNDATING THE OLM

INTRODUCTION

In many applications, it is desirable to re-estimate the coefficient parameters of the OLM after it has been updated or downdated by observations or
variables. For example, in real time applications updated solutions of a model
should be obtained where observations are repeatedly added or deleted. In
computationally intensive applications such as model selection, regression diagnostics and cross-validation, efficient and numerically stable algorithms are
needed to solve models that have been modified by adding or deleting variables
or observations [10, 22, 24, 52, 138, 139].
Consider the OLM
Y =AX+E,

(3.1)

where y E 9\m is the response variable, A is the full column rank exogenous
m x (n - 1) matrix (m 2: n), x is the unknown vector of n - 1 parameters and
E E 9\m is the error vector with zero mean and covariance matrix (J2/m. Given
the QRD of the augmented matrix A = (A y)
QTA =

(R) m-n'
n

with

(3.2)

the least-squares estimator of x is determined from the solution of Rx = u,


where R is an upper triangular matrix of order n - 1. Several methods have
been proposed to solve the up- and down-dating least-squares problem [11,
21,27,32,39,50,51,59,93,98, 105,108, 114, 115, 144]. Parallel strategies
for solving the up- and down-dating OLM problem will be considered. The
strategies will have as a basic component the recalculation of the QRD which
will be based on Householder transformations and Givens rotations.

58

PARALLEL ALGORITHMS FOR LINEAR MODELS

ADDING OBSERVATIONS
The updated OLM problem is the estimation of the BLUE of x in

where the BLUE of (3.1) has already been derived. Here the information added
to the original OLM (3.1) is denoted by

z = Dx+~, ~ '" N(O, cr2h) ,


where (D

z) == b

(3.4)

E ~kxn. Computing the QRD

Q~ (~) = (~),

with

- (Rn un) n-1


R=

Sn

'

(3.5)

the least-squares solution of the updated OLM is given by Rnxn = Un. where
Qn E ~(m+k)x(m+k) is orthogonal and Rn is an upper triangular matrix of order
n - 1. Thus. the (observations) updating problem can also be regarded as the
computation of the QRD (3.5) after (3.2) has been computed. This is equivalent
to computing the orthogonal factorization
(3.6a)
or
(3.6b)
where Qis an (n+k) x (n+k) orthogonal matrix. Notice that when (3.6a) and
(3.6b) are computed the orthogonal matrix Q~ in (3.5) is defined. respectively.
by

and

It will be assumed that the orthogonal matrix Qn is not stored.


The updating QRD (3.6b) is the transpose of the orthogonal factorization
(2.4) which triangularizes a lower trapezoid from the left. Therefore. the
Householder and Givens methods used in the computation of the second stage

Updating and downdating the OLM

59

of the complete QLD can also be employed to compute (3.6b). Specifically,


the orthogonal matrix QT in (3.6b) can be defined as the product of the Householder transformations QT = Hn' .. H2H1, where now
(3.7)
A2 i' Ci = SYi and ei IS
" the lth column of the n x n
Ri,i S, s2 = II D:,i 112 + J(i
unit matrix In. The applicati~n of Hi, annihilates the ith column of D, and
affects the ith row of R and the last (n - i + 1) columns of D.
Three sequences of CDGRs (compound disjoint Givens rotations), called
Updating Givens Sequences (abbreviated as UGSs ) are shown in Fig. 3.1.
The UGS-l and UGS-2 which compute the factorizations (3.6a) and (3.6b),
respectively, are equivalent to the PGSs employed to compute (2.4). Each
of the three UGSs applies a total of k + n - 1 CDGRs. UGS-l annihilates
the non-zero elements of the columns from bottom to top using successive
Givens rotations that affect adjacent rows. The ith (i = 1, ... ,n) CDGR starts
to annihilate the ith column of (b T RTV in this way. UGS-2 annihilates
the elements of the columns of b from the top to the bottom using successive
Givens rotations. An element of b in position (p, q) is annihilated by a Givens
rotation that affects the pth (p = 1, ... , k) and qth (q = 1, ... ,n) rows of b and
R, respectively. Columns of D start to be annihilated by successive CDGRs.
UGS-3 is equivalent to UGS-2, with the difference that the elements of the
columns of b are annihilated from bottom to top.

Yi =

7 9
6 8 10
5 7 911
4
3
2
1

6 8 10
5 7 9
4 6 8
3 5 7
2 4 6
3 5
4

(a) UGS-l.

10 11
9 10
8 9
7 8
6 7
3 4 5 6
2 3 4 5
1 2 3 4

1 2 3 4

8
7
6
5

(b) UGS-2.

(c) UGS-3.

2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11

9
8
7
6
4 5

Figure 3.1. Updating Givens sequences for computing the orthogonal factorizations (3.6),
where k = 8 and n = 4.

The Householder method and UGS-l have been implemented on the DAP
(CPP DAP 510), using single precision arithmetic. Without loss of generality it
has been assumed that n = Nes and k = Kes, and K = 1 in the case of UGS-1.

60

PARALLEL ALGORITHMS FOR LINEAR MODELS

Highly accurate timing models of the Householder and Givens methods are
given, respectively, by:

THup(N,K) = N(48.04 + 3.89N +9.70K +0.12N2+ 8.72KN)


and

Tcup(K) = 47.45 + 73.59K + 1.32K2.


From the timing models and Table 3.1 which shows THup (1,K) and TGup(K)
for some values of K, it can be observed that the Householder method performs
better than the Givens method. The same behaviour is expected when n > es.
In the case of k es or n es, the efficiency of both methods can be increased
if different mapping strategies for distributing b and R over the processing
elements of the parallel computer are used [82].
Table 3.1. Execution times (msec) of the Householder and Givens methods for updating the
QRD on the DAP.

THup(N,K)
TGup(K)

2.1

126
363

8
199
720

16
347
1563

20
420
2047

32

40

64

641

789

3754

5103

1231
10164

THE HYBRID HOUSEHOLDER ALGORITHM

The implementation of the Householder algorithm to compute (3.6b) on


the parallel SIMD machines 8192-processor MP-1208 MasPar (abbreviated
to MasPar) and 1024-processor CPP GAMMA-I (abbreviated to GAMMA)
is considered. The computational details of the implementations will not be
shown. Similar implementations have previously been considered for computing the QRD under the assumption that the dimensions of the matrices are exact multiples of the corresponding dimensions of the array processor, eSI x eS2,
and that none of their dimensions exceeds eSI eS2 [17, 79]. Here the only constraint imposed on the dimensions of the matrices is that k > n. It will also be
assumed that the orthogonal matrix Qin (3.6b) is not explicitly constructed.
The Processing Elements (PEs) of the MasPar and Gamma SIMD systems
are arranged in a 2-D array of size eSI x eS2, where eSI = 128 and eS2 = 64
in the case of the MasPar, and esl = eS2 = 32 in the case of the GAMMA.
The main mapping layouts for distributing the k x n matrix b over the PEs
are the (default) cyclic, column and row layouts, which use rk/esll rn/es21,
nrk/esles21 and n/esl eS21 layers of memory, respectively [11].
The memory layers have dimension eSI x eS2. A mapping layout is chosen
so that the maximum number of PEs remain active during computation with-

kr

61

Updating and downdating the OLM

out, however, increasing the communication overheads between the PEs. Since

k > n, the row-layout will be inefficient compared with the column-layout.


Consequently, the performances of the Householder algorithm using cyclic and
column layouts are considered. The data-parallel Householder algorithm for
computing (3.6b) is shown in Algorithm 3.1, where

R is overwritten by It

Algorithm 3.1 The data-parallel Householder algorithm.


1: for i = 1, 2, ... ,n do
2:
s:= sqrt(RT,i + IID:,dI 2)
3:
if Ri i < 0 then s := -s
4:
y:='Ri,i+S
S:
c:= s*y
6:
Z.= y*Ri,i: +D:,P:,i: /c
7:
Rii :=Rii'-y*Z
T
8:
D:,i: .- D:,i: - D:,iZ
9: end for

",)'

"_

AT

","

Using a cyclic-layout to map the matrices fj E 9\kxn and R E 9\nxn on to the


PE array, the time (msec) required to apply the ith Householder transformation
is found to be

<1>1 (k, n, i) =1O.6Sf(n + 1 - i)/es21 + 1.66f(n + 1 - i)/es2 Hk/esll


+ 4.72fk/esll + O.03(n + 1 - iHk/esll

(3.8)

The total time spent in applying the Householder reflections is thus given by:
n

<I>2(k,n) = L<I>I(k,n,i)
i=1
N

=es2 L(1O.6Si + 1. 66Ki +4.72K) + 0.03n(n+ 1)K/2


i=1

- (Nes2 - n)(1O.6SN + 1. 66NK + 4.72K),

(3.9)

where N = fn/es21 and K = fk/esll However, the overheads of the implementation which arise mainly from the passing of arguments (subarrays) to
various routines are not included in this model. Evaluating <1>2(k,n), and using backward stepwise regression on a sample of more than SOOO execution
times, gives the following highly accurate timing model for the cyclic-layout

62

PARALLEL ALGORITHMS FOR LINEAR MODELS

Householder implementation:

TI (k,n) =N(1391.0+ 66.8N2 + 388.2K + l1S.SNK)


- (Nes2 - n)(17.42+4.S3K + 2.79N2 + 3.66NK)
=N(276.12+98.28K -118.74NK -111.76N 2)
+n(17.42+4.S3K + 2.79N 2 + 3.66NK)

(for eS2 = 64). (3.10)

Calculations show that the residuals are normally distributed. Thus, the hypothesis tests made during the selection ofthis model are justified [137]. The
adequacy of the latter model, measured by the coefficient of determination, is
found to be 99.99%.
Using cyclic and column layouts to map respectively the matrices R and D
on to the PEs, a model for estimating the execution time of the ith Householder
reflection is:

<1>3 (k, n, i) =co + ci (n + 1 - i)


+ c2(n+ 1- i)ik/esles21

+ C31n/es21,

(3.11)

where co, . .. ,C3 are constants. Evaluating I~I <1>3(k, n, i) and using regression
analysis, the execution time of the column-layout implementation is found to
be:

T2(k,n) = n( 13.51

+ 2.69n + 1.33(n + l)ik/esles21 + 2.981n/es21).

(3.12)

From Fig. 3.2 it will be observed that neither of the implementations is superior in all cases. The efficiency of the cyclic-layout implementation improves
in relation to that of the column-layout implementation, for fixed k and increasing n. Table 3.2 shows that the column-layout is superior for very large k
and relatively small n.
The results above suggest that the application of the n required Householder
reflections be divided into two parts. In the first part, nl reflections are applied
to annihilate the first n 1 columns of A using cyclic-layout; in the second stage
the remaining n2 = n - nl reflections reduce to zero the submatrix D:,nl+l:
using column-layout, where D:,nl+l: comprises the last n2 columns of D. Let
tl (k, n, n t) be the time required to complete the first stage. Then, the total
execution time of the hybrid implementation is given by:
(3.13)
where, on the MasPar:

'TJ(k,n-nt) =lk/esll(6S.4SI(n-nd/es21 +0.62(n-nt))


is the time (msec x 10- 3 ) required to remap the submatrix D:,nl+l: from cycliclayout to column-layout.

Updating and downdating the aLM

63

13
10.5
8
5.5

0.5

~~~~~m<'SOQS2">C
8192
12288
16384
20480
24576
k
28672
32768
36864
40960

Figure 3.2. Ratio of the execution times' produced by the models of the cyclic-layout and
column-layout implementations.
Table 3.2.

Execution times (in seconds) for k = 11264.

Cyclic

TI (k,n)

Column

T2{k,n)

22
30
38
46
54
94
62
70
78
86

14.06
20.39
27.43
33.52
38.91
75.94
43.60
50.62
58.13
66.56

14.68
20.39
26.53
32.45
38.38
75.88
44.30
50.17
58.74
67.31

3.04
5.63
8.68
12.18
16.65
49.46
21.79
27.90
34.46
41.71

3.02
5.40
8.47
12.23
16.67
49.46
21.80
27.82
34.35
41.56

All timing model results have been multiplied by 103 .

The value ofnl, which may not be unique, is chosen to minimize T4{k,n,nl).
That is, nl is the solution of

argminT4{k,n,nt}
n1

subject to

{o ::; nl

::; n,
n 1 is integer

(3.14)

which may easily be determined by (simultaneously) computing T4{k,n,nt} for


nl = O, ... ,n and selecting the value(s) which minimizes T4{k,n,nt}. Note that
T4{k,n,nJ) is a statistical timing model and includes an unpredicted random
error component. Consequently, the solution of (3.14) may not yield the true
value of n 1 which minimizes the execution time of the hybrid algorithm.

64

PARALLEL ALGORITHMS FOR liNEAR MODELS

An alternative method for estimating n I is to compare the total number of


memory layers used for different values of n I. In this case n I is the solution of
n)

argmin ( r fk/esil f(n + 1 - i) / es21


ni
i=I

+ ~(n -n.)(n -ni + l)fk/eS)es21),

(3.15)

where 0 ~ n I ~ n. This estimation method does not, however, take into account the cost of remapping D:,nl+):' Better estimates of ni might possibly
be obtained by constructing more accurate timing models than that given by
(3.14), or by introducing a weighting function, say W(nI,k,n,es),es2), into
(3.15) which takes account of the implementation overheads.
Table 3.3 shows the execution times for the three implementations on the
MasPar, where the negligible time required to compute the estimates of n} is
not included and T1(k,n,n.) has been evaluated as IZ;) <P)(k,n,i). The estimates for ni obtained using (3.14) and (3.15) are denoted by 'h and nj, respectively. In most cases, the two estimation methods yield different values for
n I. The execution time for the hybrid implementation, using nj, is found to be
more accurate than those using iiI in approximately half of the cases. In some
instances, for very small n and relatively large K, the hybrid implementation
reduces to the column-layout implementation: that is, the estimated value of
n) is zero (see, for example, the case n = 32 and k = 12800). With the exception of a few cases, the hybrid implementation on the MasPar is more efficient
than both the cyclic-layout and column-layout implementations.
On the GAMMA a hybrid algorithm has also been investigated with ii} denoting an estimator of n} which is the solution of
nl

argmin(r fk/esil f(n + 1 - i)/ es 21


n}

i=}

+ ~(n -

n})(n - n}

+ l)fk/es}es21 + W(n) ,n,k,es) ,es2)),

(3.16)

where W(n},n,k,es},es2) is the overheads function. The function W has been


used to take account of the computational cost of remapping D:,nl +1: from
cyclic-layout to column-layout. On this machine, a less computationally expensive method for remapping the matrix was found to be a loop that copies
n - nl times a column from a fk/esll x fn/es21 grid of layers to a (n - nl) x
fk/esl eS21 grid. However, the remapping is not achieved as efficiently as in the
case of the MasPar and accounts for a large proportion of the total execution
time of the hybrid algorithm. An ad-hoc definition of W is given by
(3.17)

65

Updating and down dating the OLM


Table 3.3.

Execution times (in seconds) of the RLS Householder algorithm on the MasPar.

k/128

nl

Hybrid
with nl

n*I

Hybrid
with nj

Cyclic

Column

80
80
80
80
80
80
90
90
90
90
100
100
100
100
100
100
110
110
110
110
120
120
120
120
120
120

32
64
96
128
160
192
32
64
96
128
32
64
96
128
160
192
32
64
96
128
32
64
96
128
160
192

0
5
20
52
84
116
0
0
8
40
0
0
0
26
58
90
0
0
0
11
0
0
0
0
0
31

10.31
30.00
59.77
92.10
132.42
170.62
11.01
32.57
63.29
99.14
11.48
33.52
66.79
102.65
153.04
200.62
11.95
34.46
68.20
108.52
12.66
35.85
70.07
115.08
170.85
228.29

0
25
57
89
121
153
0
20
7
39
0
15
0
29
61
93
0
10
0
19
0
5
0
9
0
13

10.31
28.83
59.30
88.60
129.61
167.35
10.78
30.46
63.51
98.68
11.48
31.87
66.80
103.12
153.52
201.33
12.19
33.51
68.68
107.35
12.42
34.93
69.85
112.03
170.62
230.39

20.15
40.78
71.01
100.54
141.57
179.30
22.73
45.71
79.69
112.73
25.08
50.15
88.36
124.92
175.78
222.65
27.65
55.55
96.79
136.88
30.00
60.46
105.47
149.Q7
210.00
265.54

10.54
31.17
63.29
105.93
159.85
223.83
10.78
32.34
65.15
108.05
11.48
33.04
66.56
109.93
165.01
230.39
11.71
34.45
67.97
112.26
12.18
35.40
69.61
114.14
170.15
236.26

where w is the weight constant. The value of w which produces a more efficient
hybrid algorithm under iiI was found by experiment to be 0.2.
Table 3.4 shows the performances of the various algorithms on the GAMMA.
In most cases the hybrid algorithms perform better than the column-layout and
cyclic-layout algorithms. However, the hybrid algorithm based on the minimization of the estimated execution time, that is, on nj, is found to have the
least deviation from the best execution time.
For other SIMD systems, the value of n I may best be derived using the
straightforward minimization of the total number of memory layers used, rather
than minimizing the estimated time given by the performance model in (3.13)
which requires the time consuming re-determination of the coefficients of the
various timing models. However, if the remapping overheads are significant (as
in the case of GAMMA), then the minimization of (3.16) - the third method-

66

PARALLEL ALGORITHMS FOR LINEAR MODELS


Execution times (in seconds) of the RLS Householder algorithm on the GAMMA.

Table 3.4.

k/32

iii

Hybrid
with iii

n*I

Hybrid
with nj

nl

Hybrid
withnl

Cyclic

Column

100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
128
128
128
128
128
128
128
128
128
128
128
128
128
128
128
128

8
16
24
32
40
48
56
64
72
80
88
96
104
112
120
128
8
16
24
32
40
48
56
64
72
80
88
96
104
112
120
128

0
0
0
7
0
1
11
48
58
68
78
89
99
109
119
127
0
0
0
0
0
0
0
3
0
0
0
84
95
112
120
128

0.62
2.09
4.43
8.79
11.73
19.18
23.41
26.66
18.95
25.70
31.98
38.10
47.52
56.31
63.84
69.69
0.62
2.09
4.40
7.58
11.66
16.55
22.28
32.71
36.42
44.74
53.88
45.63
57.93
72.84
81.25
88.65

0
0
0
8
0
0
7
15
23
31
39
47
55
63
71
79
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0

0.62
2.10
4.43
8.75
11.72
16.64
24.00
28.29
34.83
40.98
46.61
51.68
39.22
46.89
53.81
60.11
0.62
2.09
4.41
9.80
11.66
16.55
22.28
33.16
36.44
44.73
53.89
70.24
74.90
86.62
99.19
121.04

0
0
5
13
26
34
42
50
63
71
79
87
100
108
116
124
0
0
0
7
0
0
6
14
60
68
76
84
98
106
114
122

0.62
2.10
5.88
8.79
14.04
18.68
23.01
26.84
20.53
26.69
32.32
37.37
47.93
55.89
62.54
68.38
0.62
2.08
4.41
9.82
11.66
16.55
25.74
31.18
24.35
32.13
39.25
45.63
59.43
69.49
77.91
85.30

3.15
6.23
9.31
11.97
16.85
21.49
25.82
29.65
24.21
30.38
36.00
41.06
49.69
57.65
64.29
70.14
3.97
7.84
11.72
15.07
21.23
27.09
32.55
37.37
30.57
38.36
45.47
51.86
62.78
72.84
81.25
88.65

0.62
2.10
4.43
7.62
11.72
16.64
22.41
29.03
36.63
44.96
54.17
64.24
75.31
87.09
99.72
113.22
0.62
2.08
4.40
7.58
11.66
16.55
22.28
28.87
36.43
44.74
53.88
63.89
74.90
86.61
99.19
112.61

is the most appropriate. Further investigation of the overheads function W in


(3.16) is needed.
Similar hybrid algorithms may be used to compute the orthogonal factorization (3.6b), based on Givens rotations and Householder reflections, when k < n.
In this case, the efficiency of the row-layout mapping strategy should be investigated as a (possibly) more efficient alternative. The hybrid approach could
also be employed to improve the efficiency of the SIMD algorithms proposed
in [27, 69, 75, 79] and might also be considered for MIMD systems.

Updating and downdating the OLM

2.2

67

THE BITONIC AND GREEDY GIVENS


SEQUENCES

Parallel Givens strategies that compute the factorization (3.6) in fewer steps
by using UGSs are considered. The strategies are modifications of the annihilation schemes reported in [30, 94, 103]. The first block-parallel strategy
uses the recursive doubling approach while the second strategy is based on the
Greedy annihilation scheme.
Let the matrix b be partitioned as

and the QRD of Vi E 5R Pixn (i = 1, ... ,A) be given by


T

V' _ (R~O)) n

Qi,O

I -

Pi - n '

with

QiO =

(Oro)
oro nPi - n '

(3.18)

where Lt=l Pi = k and Vi: Pi ~ n. Constructing the orthogonal matrix


01,0

Qo=

it follows that

Thus, the computation of (3.6a) is equivalent to triangularizing


R(O)
1

R(O)

R(o)

1..+1

by orthogonal transformations, where R== R~~l' Using recursive doubling this


triangularization is achieved in rlog2 (A + 1)1stages. At the ith stage (1 ~ i ~

68

PARAUELALGORITHMS FOR UNEAR MODELS

rlog2 (A+ 1)1) the Updating QRDs (UQRDs)

QJ., (~~~\)

(Rn:,

j = 1, ... ,g"

(3.19)

are computed simultaneously, where


i-I

gi= l(A+l-

~gj)/2J,

J=I

i-I

if (A+ 1- ~ gj) is an odd number,

R(i-I) - R(i)
2g;+1 g;+1

J=I

R)i)

and, in (3.6a), R = R~rIOg2(A+I)l). Note that


(i = 0, ... , rlog2 (A+ 1)1 -1)
might be singular.
Without loss of generality it can be assumed that 10g2 (A + 1) is an integer.
Let the orthogonal and permutation matrices QT and nT be defined, respectively, as

QT =

("

QfA+l)2-J

and

nT =

In
0

0
0

0
In

0
0

0
0

0
0

0
0
0

0
In
0

0
0
0

0
0
In

In
0
0

0
0
0

In

such that
R(i)

R(i)

n! Q! R(i-I) = n!
1

(i)
R(A,+1)2- i

(i)

R(A,+1)2- i

= (R(i)) n(A+ 1)2-~


-

n(A+ 1)2-1

Updating and downdating the OLM


.

-T

-T

69

Thus, the orthogonal matnx Qn = Qlog2 (1..+1) ... Q I Qo' where


-T = (II!Q!
Q.
I
I
I
0

i = 1, ... ,log2 (A. + 1).

Ik+n-n(A+I)/2i-l'

Notice that atthe ith (i = 1, ... , log2 (A. + 1) stage the gi orthogonal matrices
can be applied to annihilate the last gi upper-triangular factors of R(i-I). In
order to simplify the description of this method let k = (2g - 1)n and V be
partitioned into sub-blocks VI, ... V2K -I, where Vi E ~nxn (i = 1, ... , 2g 1). This algorithm, hereafter called the bitonic algorithm, initially computes
simultaneously, for j = 1, ... , 2g - 1, the QRDs
T

(0)

(3.20)

QoDj=R
..
j,
j
h

At stage i (i = 1, ... ,g) the bitonic algorithm computes in parallel the UQRDs

R(i-l))

T
j
Qj,i ( R(i-I).

j+2{g-,)

((i))6
R

n'

. - 1, ... , 2(g-i) ,
J-

where R == R~~) and Ry) is upper triangular. After the gth stage
computed. The orthogonal matrix Qn (3.5) is defined as

where

Qf.o
T _ (
Q0
-

d l ,2)
I,i

dl,l)
I,i
dI,I)

Qi=

d 2,1)
l,i

2g -',i

d 2,I)

2g -',i

dl,i2,2)

d I ,2)

2g -',i

,2)
d22-',i
g

(3.21)

R == R~g)

is

70

PARALLEL ALGORITHMS FOR LINEAR MODELS

and

Algorithm 3.2 effects the computation of the factorization (3.6a). At stage


i (i = 1, ... , flog2 (A + 1)1) R~i-l) is not used in the UQRDs if the number of
triangular factors (blocks) is odd and R~i)
R is given by R R~rlOg2 (A,+1)1).

= R~i-l). The upper triangular matrix

Algorithm 3.2 The bitonic algorithm for updating the QRD, where R
1: let b T =

=R~~l'

(Vr ... bD, where Vi E 9tPjxn and Pi ~ n (i = 1, ... ,A).

: fo:~m:tel:~ .~~:j;;P'j:a(I~;O))
,

n
Pi- n

4: end for all


5: blocks:= A+ 1
6: for i = 1,2, ... , flog2 (A+ 1)1 do
7:
if blocks is odd then
8:
p:= 1
9:
R~i) := R~i-l)
10:
else
11:
p:= 0
12:
end if
13:
J1:= Lblocks/2J

:::

fO:~m:~tel:~~:~o~;;p(l~\)
(R~p):
RJ+p+p
=

16:
end for all
17:
blocks := J1 + P
18: end for

Figure 3.3 shows the process of computing (3.21) using CDGRs, for the
case where n = 6. An integer I (I = 1, ... ,n), a blank and a denote, respectively, the elements annihilated by the CDGR G(l,j), a zero element and a
non-zero element. An arc indicates the rotations required to annihilate individual elements. An element of R~~;(~_j) at position (r,q) is annihilated by a
Givens rotation that affects the rth row of R~~-;~l-j) and the qth row of RY-I)

Updating and downdating the OLM

71

(r,q = 1, ... ,n and r::; q). In general n CDGRs are applied using this Givens

sequence which, it will be observed, is an intermediate stage of UGS-l.


c(i,j)

I~




I~

I~

1'1

.!.

I~

Il- l.
1l
1'1

1
C(4,j)



,.

,.

4~
~I

Figure 3.3.

C(2,j)


~
I~
I~

21. I.
2l- I-
1'2 l-
1'2
1'2

c(3,j)




,.
I~

3
Ie(
3.

I~

C(5,j)

.. .
'5(

1'3

c(6,j)

..
'6

Computing (3.21) using Givens rotations.

The total number of CDGRs applied to compute (3.6a) using the bitonic
algorithm is given by

TI (n,k,'A.,p)

= max (To(pi,n))

+nflog2 ('A. + 1)1,

= 1, ... ,'A.,

(3.22)

where To(pi,n) is the number of CDGRs required to compute (3.18) and P =


(PI, P2, .. ,Pt..). Employing the SK annihilation scheme in [129] (see Fig. 1.5(a))
to compute (3.18)-thatis To(pi,n) = Pi+n-2if Pi> n and To(n,n) = 2n-3(3.22) is found empirically to be minimized for Pi having values closer to n.
For simplicity let 'A. be an integer such that k = An and fIog2 'A.1 = flog2 ('A. + 1) 1.
This implies that (3.22) is minimized if\fi: Pi = nand TI (n, k, 'A., p) is simplified
to

T2(n,'A.) = 2n - 3 +nflog2 ('A. + 1)l.


Figure 3.4 illustrates the computation of (3.6a) using the bitonic algorithm,
where n = 6, k = 18 and Pi = 6 (i = 1,2,3). The bold frames show the partition

72

PARALLEL ALGORITHMS FOR LINEAR MODELS

(DT RT{ = (br Dr Dr RTV. Initially the QRDs of Dj , D2 and D3


are computed simultaneously and, at stages i = 1,2, the updating is completed
by computing (3.21), where g = 2.
Compute (3.20) Compute (3.21) Compute (3.21)
for i = 1
for i = 2


5
4 6
3 5 7

2 4 6 8
113 5 7 19

5
4 6
3 5 7
2 4 6 8
1 3 5 7 19


5
4 6
3 5 7
2 4 6 8
1 3 5 7 9



Figure 3.4.

lU 11 12 13 14 15
10 11 12 13 14
10 11 12 13
10 11 12
10 11
10
10 11 12 13 14 15
1U 11 12 13 14
10 11 12 13
10 11 12
10 11
10

16 17 18 19 2U 21
16 17 18 19 20
16 17 18 19
16 17 18
16 17
16

The bitonic algorithm, where n = 6, k = 18 and PI

= P2 = P3 = 6.

The number of CDGRs applied to update the QRD using the UGSs is given
by

Ignoring additive constants, it follows that

This indicates the efficiency of the bitonic algorithm for computing (3.6a) for
A > 2, compared with that when using the UGSs.
The second parallel strategy for solving the updating problem is a slight
modification of the Greedy annihilation scheme in [30, 103]. Taking as before n = 6 and k = 18, Fig. 3.5 indicates the order in which the elements are
annihilated. Observing that the elements in the diagonal of R are annihilated

Updating and downdating the OLM

73

by successive rotations, it follows that at most k + n - 1 CDGRs are required


to compute (3.6a). An approximation to the number of CDGRs required to
compute (3.6a) when n is fixed and k approaches to infinity, is given by

T4(n,k) = log2k+ (n-l)log210g2k.


The derivation of this approximation has been given in the context of computing the QRD and, it is also found empirically to be valid for computing (3.6a)
[103]. In general, for k n, the Greedy sequence requires fewer CDGRs than
the bitonic method, while for small k (compared with n) the UGSs and Greedy
sequence require the same number of CDGRs. Table 3.5 shows the number
of CDGRs required to compute (3.6a) using the UGSs, bitonic method and
Greedy sequence for some n and A(k = An and k n) .

4 7
3 6 9
3 6 811
2 5 8 10 13
2 5 7 10 12 15
2
2
2
1
1
1
1
1
1
1
1
1

Figure 3.5.

4
4
4
3
3
3
3
3
2
2
2
2
2

7
6
6
6
5
5
5
4
4
4
4
3
3
3

9
9
8
8
7
7
7
6
6
6
5
5
5
4
4

12 14
11 14
11 13
10 13
10 12
912
911
811
8 10
810
7 9
7 9
7 9
6 8
6 8
5 7
6

The Greedy sequence for computing (3.6a), where n = 6 and k = 18.

As regards implementation, the efficiency of the Greedy method is expected


to be reduced significantly by the organizational overheads so that the bitonic
method is to be preferred [30, 103]. The simultaneous computations performed
at each stage of the bitonic method make it suitable for distributing memory
architectures [36]. Each processing unit will perform the same matrix computations without requiring any inter-processor communications. The simultaneous QRD of the matrices VI, ... D28 _Ion a SIMD system has been con-

74

PARALLEL ALGORITHMS FOR LINEAR MODELS


Table 3.5.

Number of CDGRs required to compute the factorization (3.6a).

k=M

UGSs

bitonic

Greedy

15
15
15
15
30
30
30
30
60
60
60
60

5
10
20
40
5
10
20
40
5
10
20
40

75
150
300
600
150
300
600
1200
300
600
1200
2400

89
164
314
614
179
329
629
1229
359
659
1259
2459

72
87
102
117
147
177
207
237
297
357
417
477

43
47
50
54
89
96
102
107
187
198
208
217

sidered within the context of the SURE model estimation [84]. In this case
the performance of the Householder algorithm was found to be superior to
that of the Givens algorithm (see Chapter 1). The simultaneous factorizations
(3.21) have been implemented on the MasPar within a 3-D framework, using Givens rotations and Householder reflections. The Householder algorithm
applies the reflections H(I,j), ... ,H(n,j), where H(l,j) annihilates the non-zero
elements of the lth column of R;~;(~_i) using the lth row of RY-I) as a pivot
row (I = 1, ... ,n).
Table 3.6.

Times (in seconds) for computing the orthogonal factorization (3.6a).

bitonic Householder

Householder

bitonic Givens

UGS-J

64
64
64

2
3
5
2
3
5
2
3
5

0.84
1.55
4.83
4.15
7.78
27.12
10.15
19.69
72.12

0.23
0.35
0.82
1.78
2.98
10.27
5.51
10.05
37.48

1.29
2.34
7.97
8.98
18.21
69.96
27.96
58.78
236.13

1.45
2.46
9.04*
9.77
19.45
76.96*
32.41
67.51*
278.20*

192
192
192
320
320
320

* Estimated times.

Table 3.6 shows the execution times for the various algorithms for computing (3.6a) on the 8I92-processor MasPar using single precision arithmetic.
Clearly the bitonic algorithm based on Householder transformations performs
better than the bitonic algorithm based on CDGRs. However, the straightfor-

Updating and downdating the aLM

75

ward data-parallel implementation of the Householder algorithm is found to be


the fastest of all. The degradation in the performance of the bitonic algorithm
is due mainly to the large number of simultaneous matrix computations which
are performed serially in the 2-D array MasPar processor [84]. The bitonic
algorithm based on CDGRs performs better than the direct implementation of
UGS-l because of the initial triangularization of the submatrices DI,' .. ,D2g-1
using Householder transformations.

2.3

UPDATING WITH A MATRIX HAVING A BLOCK


LOWER-TRIANGULAR STRUCTURE

Computational and numerical methods for deriving the estimators of structural equations models require the updating of a lower-triangular matrix with
a matrix having a block lower-triangular structure. Within this context the
updating problem can be expressed 'as the computation of the orthogonal factorization
-T

(0)t E-

(A(I))
;\(1) =

el
(G-l)K-E+eG'

(3.23)

where

K -el

A(1) =

K -e2

K -eG-I

-(I)

-(I)

e2

A2,1

e3

A31,

eG

AG,I

-(I)

A 3,2

-(I)

K-el
K-el
A(1) =

K-e2

K -eG-I
A

A(I).

-(I)

-(I)

A GG
, -1

A G,2

A(I)

LI

A( I)

A21,
A(I)

K-e2

K -eG-I

A(I)

L2

A(I)

LA(I)
G_ 1

A G- 12
,

AG_1,1
.

L and Li (I = 1, ... G - 1) are lower tnangular and E = Li= 1 ei


The factorization (3.23) can be computed in G - 1 stages, where each stage
annihilates a block-subdiagonal with the first stage annihilating the main block-

76

PARALLEL ALGORITHMS FOR LINEAR MODELS

diagonal. At the ith (i

= 1, ... , G -

1) stage the orthogonal factorizations

Y+

are computed simultaneously for j = 1, ... , G - i, where the t I) matrix is


lower triangular and Pi,) is a (K - ej + ei+j) x (K - ej + ei+j) orthogonal matrix. It follows that the triangular matrix t in (3.23) is given by

t~G)

A(G-I)
2,1

t(G-I)
2

0
0

Therefore, if TDI (e, K, i, j) denotes the number of CDGRs required to compute


the factorization (3.24) using this method (hereafter called diagonally-based
method), then the total number of CDGRs needed to compute (3.23) is given
by
G-I

TD(e,K,G) =

L max ( TDI (e,K,i,j)),

j= 1, ... ,G-i,

(3.25)

1=1

where e = (el, ... ,eG). Figure 3.6 shows the annihilation process for computing the factorizations (3.23), where G = 5 and iii denotes a submatrix eliminated at stage i (i = 1, .. . , G - 1).

Stage 1

Stage 2

Stage 3

Stage 4

",
I\.

Figure 3.6.

Computing the factorization (3.23) using the diagonally-based method, where

G=5.

Figure 3.7 illustrates various annihilation schemes for computing the factorization (3.24) by showing only the zeroed matrix A~2 j,j and the lowertriangular

tY) matrix, where ei+j =

12, K - ej = 4. The annihilation schemes

77

Updating and downdating the OLM

are equivalent to those of block-updating the QRD the only difference being that an upper-triangular matrix is replaced by a lower-triangular matrix
[69, 75, 76, 81]. These annihilation schemes can be employed to annihilate
different submatrices of A(1), that is, at step i (i = 1, ... , G - 1) of the factoriza. (3 23) the sub matrices
.
A-(i)
. the
hon.
i+l,I"'" A-(i)
G,G-i can b e zeroedth
WI out usmg
same annihilation scheme. Assuming that only UGS-2 or UGS-3 schemes are
employed to annihilate each submatrix, then the number of CDGRs given by
(3.25) is

TDI (e,K,i,j) = K - ej +ei+j -1.


Hence, the total total number of CDGRs applied to compute the factorization
(3.23) is given by
G-l

T~2v(e,K,G) = ~ max(K-ej+ei+j-l)
1=1

G-l

=(G-l)(K-l)+ L,max(ei+j-ej),

j=I, ... ,G-i. (3.26)

i=1

3 2 1
4 3 2
5 4 3
6 5 4
7 6 5
8 7 6
H 987
U 1() 9 8
112 1 1~ 9
4
5
6
7
8
9

Il~ I

114 1 1
15 l,n.

UGS-2

Figure 3.7.

15 14 13 12
14 13 12 11
13 12 11 1~

1" 11 111 9

11 1() 9 8
1~

9
8
7
6
5
4

9
8
7
6
5
4
3

8 7
7 6
6 5
5 4
4 3
32
2 1

UGS-3

6
7
8
9
6
7
8
9

3 1
42
6 3
7 6
3 1
4 2
6 3
7 6
3 1
1~ 42
1111(1 3
5
6
7
8
5
6
7
8
5

1~1" III 1()

Bitonic

4
5
5
6
6
7
7
8
8
9
9

3
3
4
4
5
5
5
6
6
7
7
10 8

2 1
2 1
2 1
3 1
3 1
3 1
4 2
42
42
5 3
5 3
6 4

Greedy

4
4
4
4
4
4
4
4
4
4
4
4

3
3
3
3
3
3
3
3
3
3
3
3

2
2
2
2
2
2
2
2
2
2
2
2

1
1
1
1
1
1
1
1
1
1
1
1

HOUSEHOLDER

Parallel strategies for computing the factorization (3.24)

The factorization (3.23) is illustrated in Fig. 3.8 without showing the lower
triangular matrix J(1), where each submatrix of A:(I) is annihilated using only
the UGS-2 or Greedy schemes, K = 10, G = 4 and e = (2,3,6,8). This particular example shows that both the schemes require the application of the same

78

PARALLEL ALGORITHMS FOR LINEAR MODELS

number of CDGRs to compute the factorization. However, for problems where


the number of rows far exceeds the number of columns in each submatrix, the
Greedy method will require fewer steps than the other schemes.
8 7 6 S 4 3 2 1
9 8 7 6 S 4 3 2
10 9 8 7 6 5 4 3
20 19 18 17 16 15 14 13 7 6 5 4 3 2 1
21 20 19 18 17 16 15 14 8 7 6 5 4 3 2
22 21 20 19 18 17 16 15 9 8 7 6 5 4 3
23 22 21 20 19 18 17 16 10 9 8 7 6 5 4
24 23 22 21 20 19 18 17 11 10 9 8 7 6 5
25 24 23 22 21 20 19 18 12 11 10 9 8 7 6
34 33 32 31 30 29 28 27 19 18 17 16 IS 14 13 4 3
35 34 33 32 31 30 29 28 ~O 19 18 17 16 15 14 5 4
36 35 34 33 32 31 30 29 21 20 19 18 17 16 15 6 5
37 36 35 34 33 32 31 30 ~2 21 20 19 18 17 16 7 6
38 37 36 35 34 33 32 31 23 22 21 20 19 18 17 8 7
39 38 37 36 35 34 33 32 24 23 22 21 20 19 18 9 8
40 39 38 37 36 35 34 33 25 24 23 22 21 20 19 10 9
41 40 39 38 37 36 35 34 26 2S 24 23 22 21 20 11 10

3
4
5
6
7
8
9

1
2
3
4
5
6
7
8

2
2
3
3
4
4
5
6

1
1
1
1
2
2
3
4

U ing only the UGS2 scheme

8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 1
10 987 6 5 4 2
20 19 18 17 16 15 14 13 7 6 5 4 3 2 1
21 20 19 18 17 16 14 13 8 7 6 5 4 2 1
22 21 20 19 18 16 15 13 9 8 7 6 4 3 1
23 22 21 20 19 17 15 14 10 9 8 7 5 3 2
24 23 22 21 20 18 16 14 11 10 9 8 6 4 2
25 24 23 22 21 19 17 15 12 11 10 9 7 5 3
34 33 32 31 30 29 28 27 19 18 17 16 15 14 13 4
35 34 33 32 31 30 28 27 20 19 18 17 16 14 13 S
36 35 34 33 32 30 29 27 21 20 19 18 17 15 13 6
37 36 35 34 32 31 29 27 22 21 20 19 17 16 13 6
38 37 36 35 33 31 30 128 23 22 21 19 18 16 14 7
39 38 37 36 34 32 30 28 24 23 22 20 18 17 14 8
40 39 38 37 35 33 31 29 25 24 23 21 19 17 15 9
41 40 39 38 36 34 32 30 26 25 24 22 20 18 16 10

3
4
4
5
5
6
7
8

Using only the Greedy cherne

Figure 3.B.

Computing factorization (3.23).

The intrinsically independent annihilation of the submatrices in a blocksubdiagonal of A(1) makes this factorization strategy well suited for distributed
memory systems since it does not involve any inter-processor communication.

Updating and downdating the OLM

79

However, the diagonally-based method has the drawback that the computational complexity at stage i (i = 1, ... , G - 1) is dominated by the maximum
. ed to annihl
th
b
. A-(i)
A-(i)
number 0 fCDGRs requlf
1 ate e su matnces Hl,l'' HG-i,G-i.

An alternative approach (called column-based method) which removes this


drawback is to start annihilating simultaneously the submatrices AI, ... ,AG-l'
where
j=I, ... ,G-1.

(3.27)

Consider the case of using the UGS-2 scheme. Initially UGS-2 is applied to
annihilate the matrix ..1(1) under the assumption that it is dense. As a result the
steps within the zero submatrices are eliminated and the remaining steps are
adjusted so that the sequence starts from step 1. Figure 3.9 shows the derivation
of this sequence using the same problem dimensions as in Fig. 3.8. Generally,
for PI = 1, Pj = Pj-l + 2ej - K (l < j < G) and 11 = min(Pl, ... ,PG-d, the
annihilation of the submatrix Ai starts at step
si=Pi-Il+1,

i=I, ... ,G-1.

The number of CDGRs needed to compute the factorization (3.23) is given by

T~~v(e,K,G'Il) =E+K -2el-ll.

(3.28)

Comparison ofT~~v(e,K,G) and T~~v(e,K,G'Il) shows that, when the UGSs


are used, the diagonally-based method never performs better than the columnbased method. Both methods need the same number of steps in the exceptional
case where G = 2.
The column-based method employing the Greedy scheme is illustrated in
Fig. 3.10. The first sequence is the result of directly applying the Greedy
-(i)

-(i)

scheme on the Ai+l,l' ... ,Ai+G-i,G-i submatrices. Let the columns of each
submatrix be numbered from right to left, that is, in reverse order. The number
of elements annihilated by the qth (q > 0) CDGR in the jth (j = 0, ... , K - ei)
column of the ith submatrix Ai is given by

ry,q) = l(ay,q) + 1)/2J,


where a(i,q) is defined as
J

o
ei+1
(i,q-l) + (i,q-l) _ (i,q-l) + (i-l,q-l)

aj

rj _l

rj

aj

r j_ l

- rj

(i,q-l) + (i,q-l)

(i,q-l)

rK-ei_1

if j > q and j
if q = j = 1,

> k-

if j = 1 and q

> 1,

otherwise.

ei,

80

PARALLEL ALGORITHMS FOR LINEAR MODELS

19 18 17 16 IS 14 13 1" 11 10 9 8 7 6 5 4 3 2 1
20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3
22 21 20 19 18 17 16 IS 14 13 12 11 10 9 8 7 6 5 4
23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5
24 23 22 21 20 19 18 17 16 15 14 13 12 11 Ie 9 8 7 6
25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7
26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8
27 26 25 24 23 22 21 20 19 18 17 16 IS 14 13 12 11 10 9
28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10
29 28 27 26 25 24 23 2..! 21 20 19 18 17 16 15 14 13 12 11
30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13
32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 IS 14
33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15
34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17
UGS-2

12 11 10 987 6 5
13 12 11 10 9 8 7 6
14 13 12 11 10 9 8 7
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
17 16 15 14 13 12 11 10 9 8 7 6 5 4 3
18 17 16 15 14 13 12 11 10 9 8 7 6 5 4
19 18 17 16 IS 14 13 12 11 10 9 8 7 6 5
20 19 18 17 16 15 14 13 12 11 10 9 8 7 6
21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3
22 21 20 19 18 17 16 IS 14 13 12 11 10 9 8 7 6 5 4
23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5
24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6
25 24 23 22 21 20 19 18 17 16 IS 14 13 12 11 10 9 8 7
26 25 24 23 22 21 20 19 18 17 16 IS 14 13 1" 11 10 9 8
27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 II 10 9
28 27 26 25 24 23 22 21 ~O 19 18 17 16 15 14 13 12 11 10
Modified UGS-2
Figure 3.9.

The column-based method using the UGS-2 scheme.

The sequence terminates at step q if Vi, j: ry,q) = O.


The second sequence in Fig. 3.10, called Modified Greedy, is generated from
the application of the Greedy algorithm in [69] by employing the same technique for deriving the column-based sequence using the UGS-2 scheme. Notice however that the second Greedy sequence does not correspond to and is not
as efficient as the former sequence which applies the Greedy method directly

Updating and downdating the OLM

8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 1
10 9 8 7 6 5 4 2
IS 14 13 12 11 10 9 8 7 6 5 4 3
16 15 14 13 12 11 10 9 8 7 6 5 4
17 16 IS 14 13 12 11 10 9 8 7 6 4
18 17 16 IS 14 13 12 11 10 9 8 7 5
19 18 17 16 IS 14 13 12 11 10 9 8 6
20 19 18 17 16 15 14 13 12 11 10 9 7
21 20 19 18 17 16 15 14 13 12 11 10 8
22 21 20 19 18 17 16 IS 14 13 12 11 9
23 22 21 20 19 18 17 16 15 14 13 12 10
24 23 22 21 20 19 18 17 16 15 14 13 11
2S 24 23 22 21 20 19 18 17 16 15 14 12
26 25 24 23 22 21 20 19 18 17 16 15 13
27 26 25 24 23 22 21 20 19 18 17 16 14
28 27 26 2S 24 23 22 21 20 19 18 17 IS

2
2
3
3
4
5
6
7
8
9
10

81

1
1

1
2
2
3
5 4
6 5
7 6
8 6
9 7
11 10 8
12 11 9
13 12 10

3
4
4
5
5
6
7
8

2
2
3
3
4
4
5
6

1
1
1
1
2
2
3
4

18 17 16 15 14 13 12 11
19 18 17 16 IS 14 13 12
20 19 18 17 16 IS 14 13
21 20 19 18 17 16 15 14 13 12 11 10 9 7 6
22 21 20 19 18 17 16 15 14 13 12 11 9 8 6
23 22 21 20 19 18 17 16 15 14 13 11 10 8 7
24 23 22 21 20 19 18 17 16 15 13 12 10 9 7
25 24 23 22 21 20 19 18 17 15 14 12 11 9 7
26 25 24 23 22 21 20 19 17 16 14 13 11 10 8
27 26 25 24 23 22 21 19 18 16 15 13 12 10 8 6
28 27 26 2S 24 23 22 20 18 17 15 14 12 11 9 7
29 28 27 26 2S 24 23 21 19 17 16 14 13 11 9 7
30 29 28 27 26 25 24 22 20 18 16 15 13 12 10 8
31 30 29 28 27 26 25 23 21 19 17 15 14 12 10 8
32 31 30 29 28 27 26 24 22 20 18 16 14 13 11 9
33 32 31 30 29 28 27 2S 23 21 19 17 15 13 11 9
34 33 32 31 30 29 28 26 24 22 20 18 16 14 12 10

5
5
5
6
6
7
7
8

3
3
3
4
4
5
5
6

1
1

Greedy

1
1
2
2
3
4

Modified Greedy
Figure 3.10.

The column-based method using the Greedy scheme.

to the submatrices of Ai (i = 1, ... , G - 1). The annihilation pattern of the (efficient) Greedy scheme is gradually reduced to that of the UGS-2 schemes.
Generally, the direct employment of the Greedy scheme is expected to perform
at least as efficiently as the UGS-2 scheme. In the example in Fig. 3.10 the
column-based method is the same using both the UGS-2 and Greedy schemes.

82

PARALLEL ALGORITHMS FOR LINEAR MODELS

2.4

QRD OF STRUCTURED BANDED MATRICES

The adaptation of the bitonic method to compute the QRD of a structured


banded matrix may be considered as follows. Suppose that A is a band matrix
with bandwidth b - that is, if ai,j =1= 0 and ai,k =1= 0 then Ij - kl ~ b and if ai,j
and ak I are the first non-zero elements of the ith and kth rows of A with i < k,
then ~ I [38, 126]. Let Ai E 9{m;xb (i = 1, ... ,x) be the block consisting of
rows in which the first non-zero element is in the 1'}i column and Ai E 9{m;xb
denotes the submatrix of A starting from element a~;;~i' where

mi =

L mj -1'}i + l.

j=I

The first method (method-I) to compute the QRD


(3.29)
is based on the parallel algorithm reported in [74]. The SK annihilation scheme
is applied to the submatrices Ai at stage Si (i = 1, ... ,x), so that the diagonal
elements of A are annihilated by successive CDGRs. For SI = 1, this implies
that
i

si=21'}i-1-Lmj,

i=l, ... ,x

j=2

and the total number of CDGRs required to compute the QRD (3.29) is given
by
TI (x,b,m,e,s) = mi +b+1'}x - 2 - min(sl' ... ,sx)'

(3.30)

where m, e and S are x element vectors with ith elements mi, 1'}i and Si respectively. Two examples are given in Fig. 3.11 for m = 17, n = b + 1'}x - 1 = 9,
b = 4 and x = 4. The framed boxes denote the submatrices AI, ... ,Ax, respectively.
For large structured banded matrices, the partition and updating techniques
employed in the bitonic algorithm can be used to reduce the number of CDGRs
applied. To simplify the description of the second parallel method (method-2),
let

The initial stage of method-2, stage 0, computes simultaneously the QRDs

i = 1, ... ,2P .

Updating and downdating the OLM

I.

5
4
3
2
I

6
5 7
4 6 S
3 5 7

.1. ..
I. I

I
i.

9.

4 0

3 5
24
1 3
o2

I}

7
6
5
4

10

9
8
7
6

i: 1 2 3 4
mj: 2 4 6 5
t'jj:

1 2 4 6

,nj: 2 5 9 12
s;: 1 -1 -3 4

I -1 3 0

Total number of CDGRs : 15

Total number o f CDGRs: 14

o Initial non-zero

F ill-in

Figure 3.11.

~.

-3- 1 1 3
4- 202

1 2 5 6

o Zero

6.
1 3 5 7
o2 4 6
I 1 3 5
2 0 2 4
-3- 1 1 3
u 1-

S.
7 9
6 8
S7
4 6
1 1 3 S
2 0 2 4

titj: 6 9 812
5j:

I3 5
214

i: 1 2 3 4
mj: 6 4 2 5
t'jj:

4 . . 1

1 3

o2

1. I.

S lO
1 3 5 7 ~11
o 2 4 6 Sl( 11
-1 1 3 5 7 911
14 6 8 IG
3 5 7 9 1
k 4

83

Illustration of the annihilation patterns of method-I .

eF))

At the jth stage (j = 1, ... , p), the orthogonal decompositions


QT .
I,j

R(j-I)
2;-1
O(j-I)

O(j-I)

R(j-I)
2;

are computed simultaneously for i = 1, .. . , 2P-

h+ (2i -l)~'
b-~*
j,

where

(3.31)

QL is orthogonal,

is a (b+ (2 j - 1 -1)~*) x 2j-I~* zero matrix, R~j) is a structured banded


upper triangular matrix of order (b+ (2 j -1)~*) and R in (3.29) is given by
R~P). Note that if the first (2 j - 1)~. rows of RF) are divided into blocks of ~*
rows, then the last (2j - k)~* columns of the kth block are zero (1 ::; k < 2 j ).
The orthogonal decomposition (3.31) can be computed by applying b +

O(j-I)

(2 j -

1-

1)~* CDGRs. Let (R~1::}

(j_I)

( R ;-1
2

(j_I)) =

O(j-I))

in (3.31) be partitioned as

2j-I~*

b-~*

R(j-l)
1,2/-1

R(j-l)
2,2;-1

R(j-l)
3,2;-1

84

PARALLEL ALGORITHMS FOR LINEAR MODELS

As for the UQRD problem, CDGRs are employed to annihilate


U-. 1)
R3,21-1

= (RU-I)

3,2i-1

OU-I))
2

'

where O~j-l) becomes non-zero during the annihilation process. Starting from
the main diagonal, the Ith CDGRs annihilates the Ith superdiagonal of R~j;~1
I = 1, ... ,b + (2 j - 1 - 1) t}*. In general, an element of R~j;~l at position q)

(1,

becomes zero after the application of the (q - I + 1)th CDGRs which rotates
and R~-l), respectively (q ~ I). In addition the
the Ith and qth row of R~j;~1
,
banded upper triangular structure of R~-l) is preserved and
( RU-1)
1,2i-l

RU-I)

2,2i-l

OU-l))
1

remains unaffected by the application of the CDGRs. If


the CDGRs, then
in (3.31) is defined as

QL

QL is the product of

where

Figure 3.12 shows the process of applying the CDGRs for b = 8, t}* = 3 and
j = 2. The matrices are partitioned as
(

RU-.l)
IOU-I)
)
3,21-1
2

After the application of the first CDGRs, the 4th and 5th rows of O~j-l) become

oV-

1) are filled-in by successive CDGRs.


non-zero. The remaining rows of
An alternative method (method-3) for computing the QRD of A is to combine method-l and method-2. Let A be divided into blocks of 2g m* rows
(0 ~ g ~ p). The ith block can be partitioned as (0 A (i) 0), where A(i) is
a 2g m* x (b + (2g - 1)t}*) structured banded matrix (l ~ i ~ 2P- g ). The initial stage of method-3 employs simultaneously method-Ion A (1), ... ,A (2P- g )
to derive the upper triangular factors R~g) (i = 1, ... ,2P- g ). Then method-2 is
used starting from stage g + 1. It can be observed that for g = 0 and g = p,
method-3 is equivalent to method-2 and method-I, respectively, provided that

Updating and downdating the OLM

b-~'

12 3 4
1 2 3
1 2
1

5
4
3
2

6
5
4
3

7
6
5
4

8
7
6
5
1 213 ,4

85

91C 11
8 9H
7 8 9
6 7 8
5 6 7

b+(2j-1 - l ) ~'

D Initial non- zero


D Fill-in
Figure 3.12.
shown.

Computing (3.31) for b = 8,

~.

= 3 and j = 2.

Only the affected matrices are

the SK annihilation scheme is applied to compute the initial stage of method-2.


That is, method-3 may be regarded as a generalization of the two previous parallel methods. Figure 3.13 illustrates the process of computing the QRD of A
for p = 4 and g = 1.
The number of CDGRs applied using method-3 is given by

T4(m* ,b,'~* ,p,g) =


{

m* + b - 2 + pb + (2 P - P - l)t'}* - g(b - t'}*)


b- 2+ pb+ (2 P - p+ 1)t'}* + (2g(m* - 2t'}*) - g(b- t'}*))

if 2t'}*

>
- m* ,

otherwise.
(3.32)

With g denoting the value of g which minimizes (3.32), it follows that g = p


(method-i) if 2t'}* 2: m* and g = (method-2) if 2t'}* < m* and m* 2: b + t'}*.
In the case where 2t'}* < m* < b + t'}* , g is chosen to be the solution of

argmin (2g(m* -2t'}*) -g(b-t'}*))

05g 5 p,
subject to { 2t'}* < m* < b + t'}* ,

m*, b, t'}* ,p, g are non-negative integers.

86

PARALLEL ALGORITHMS FOR LINEAR MODELS

Matrix A

After Stage 4

Figure 3.13.

Illustration of method-3, where p

= 4 and g = 1.

87

Updating and downdating the OLM

Table 3.7 gives examples of g and T4(m* ,b, 1'}* ,p,g) for some m*, b,
Table 3.7.
m*

10
15
30
40

50
100

1'}*

and p.

Computing the QRD of a structured banded matrix using method-3.


b

'i}*

14(m* ,b, 'i}* ,p,g)

8
12
10
20
30
80

6
10
5
5
20

3
4
5
6
7
8

3
4
0
0
oor 1
lor 2

58
175
218
463
2688
10678

40

RECURSIVE LEAST~SQUARES WITH LINEAR


EQUALITY CONSTRAINTS

2.5

Recursive least squares with linear equality constraints (LSE) problem can
be described as the solution of
argmin IIA (i)x - y(i) 112

subjectto

ex =

d,

for

i = 1,2, ... ,

(3.33)

where

e E ~kxn, X E ~n, d E ~k and (A(O) y(O)) is a null matrix. It is assumed that there are no linear dependencies among the restrictions
and also that the LSE problem has a unique solution which cannot be derived
by solving the constraints system of equations. That is, the rank: of e is k < n
and (A (i)T eT ) T is of full column rank:.
The two main methods which have been used to solve (3.33) are Direct
Elimination (DE) and Basis of the Null Space (BNS) [14, 93]. The performances of these two methods for solving the recursive LSE problem on an
array processor are discussed.

Yi E ~mi, Ai E ~mixn,

2.5.1

THE BNS METHOD

The BNS method uses an orthogonal basis for the null space of the constraints matrix to generate an unconstrained least-squares problem of n - k
parameters [93]. Let the QRD of eT be given by
k

with

Q= (QI

(3.34)

88

PARALLEL ALGORITHMS FOR liNEAR MODELS

Lh = d and h = Qlh. From the constraints system of equations it follows that


x is given by

where is a ('! - k) random element vector. Substituting this expression for x


into A (I) X - y(l) gives the unconstrained least-squares problem
argmin IIZ(i)y- q(i)1I 2,

(3.35)

Y
where Z(i)

= A (i) Q2 and q(i) = y(i) -

A (i) h. After computing the QRD


Ri

pT (Z(i) q(i)) = ( ~

Pi)

(3.36)

~'

the solution of (3.33) can be expressed as


x(i) =

h+ Ql'li) ,

where'1- i ) = Rjl Pi. To compute x(i+1), the updated solution of (3.33) at stage
i+ 1, let
ZHI =Ai+lQ2,

(3.37)

qi+1 = Yi+ 1 - Ai+1 h

(3.38)

and
T

PHI

(Ri

0
Zi+1

Pi )

l1i

qHl

(Ri+l
0
0

PHI) n -. k

l1i+1
0

(3.39)

mi+l

where Ri+ 1 is upper triangular and Pi+ 1 is orthogonal. Then, solving the triangular system of equations
(3.40)
allows x(i+1) to be computed as
x(i+l) =

h+ Q2y(i+l)

(3.41)

== x(i) + Q2(y(i+1) _ y(i)).


Observing that (Z(I) q(1)) == (ZI ql), the computations (3.37)-(3.41) can
be used to find the solution of (3.33) for i = 2,3, ... without requiring the whole
data matrices to be stored.

Updating and downdating the OLM

89

2.5.2

THE DE METHOD
Let Q E 9lkxk be orthogonal, IT E 9lnxn be a permutation matrix and L be

lower triangular and non-singular, such that

(3.42)
and x = nT x, where
X=

(~~) ~-k

The constraints in (3.33) can be written as

QT cnnTx = CXI + Li2


==QTd
which gives
X2 = L -1 (QT d - CXI)

== J-CXI'

(3.43)

where

If

and

then
A(i)X-y(i) =A(i)nnT x-y(i)
= (..1(i)

(A (i) -

W(i)) ( _ XI__ ) _ y(i)


d-CXI
W(i)C)XI - (y(i) - W(i)dj

= Z(i)XI - q(i).

After using (3.36) to solve the unconstrained least-squares problem


argmin IIZ(i) Xl - q(i) II,
Xl

(3.44)

90

PARAUELALGORITHMS FOR UNEAR MODELS

X2 is derived from (3.43) and finally x(i) = ill is computed.

For

n-k
Ai+l II =

(Ai+l

"'i+l)

and
(3.45)
the factorization (3.39), and (3.43) can be used to compute x(i+l) where, now,
Ri+lXl

= Pi+l

2.5.3
IMPLEMENTATION AND PERFORMANCE
The main difference between the matrix computations required by the BNS
and DE methods is that the mi+l x n matrix Ai+l and the n x (n - k) submatrix
Q2 used in the former correspond to the mi+ 1 x k and k x (n - k) "'i+ 1 and t
matrices, respectively, used in the latter. Since k < n, the DE method is always
faster than the BNS method.
The BNS algorithm has been implemented on the SIMD array processor
AMT DAP 510 [67]. Householder transformations have been employed to
compute the orthogonal factorizations (3.34) and (3.39). For the matrix-matrix
multiplications and solution of triangular systems the outer product and column sweep methods have been used, respectively [67]. The analysis of the
relative performances of the BNS and DE methods has been made by using accurate timing models. The assumption that the data matrices have dimensions
which are multiples of the edge size of the array processor has been made. As
expected, the analysis of the timing models shows that for all cases the DE
method outperforms the BNS method, with the difference in performance between the two methods decreasing as k increases for fixed n. This is illustrated
in Table (2.5.3).

ADDING EXOGENOUS VARIABLES


The OLM with added exogenous variables can be written as
(3.46)

whereB E ~mxk (m ~ n+k) is a data matrix corresponding to k new exogenous


variables. Computing the QRD of An = (A By)

QnAn =
TA

(R)k+n
0 m-k-n'

with

R_
-

(Rn un) k+n-l


0

Sn

'

(3.47)

Updating and downdating the OLM


Table 3.B. Estimated time (msec) required to compute xli) (i = 2,3, ... ), where mi
32N and k = 32K.

91

= 96, n =

BNS

DE

10
10
10
10
10
18
18
18
18
18
18
18
18

1
3
5
7
9
3
5
7
9
11
13
15
17

6.13
4.31
2.76
1.48
0.46
16.80
13.66
10.82
8.26
5.98
3.97
2.22
0.74

3.76
2.87
2.02
1.21
0.43
10.23
8.71
7.26
5.86
4.52
3.21
1.95
0.70

the least-squares solution of the modified OLM is given by Rnxn = Un, where
Qn E 9tmxm is orthogonal and Rn is an upper triangular matrix of order k+n-l.
If (3.2) has already been computed, then (3.47) can be derived by computing
the smaller QRD

Q-T(B2

set ) = (R)k+1
0 m-n-k'

(3.48)

with

where et denotes the first column of the m - n + 1 unit matrix and

QTB = (Bt) n - 1 .
B2 m-n+ 1

(3.49)

The orthogonal matrix Qn and upper triangular factor R in the QRD (3.47) are
given, respectively, by
(3.50)
and

R=Note that A~ An

(~

t :n) ~ 0

(3.51)

= RT R from which it follows that

~~~) =- (~~~
BfB~?k~Rn
(~~yT ~ ~~:
T
yT B yTY
u R uT + Rn
A

Sn

Bt

u~

uR:A~un

Bf
). (3.52)
T
u U+ u~ Un + ~

92

PARALLEL ALGORITHMS FOR LINEAR MODELS

Thus, if the orthogonal matrix Q in (3.2) is unavailable, the matrices Bl and R


can be computed, respectively, by solving the triangular system
RTBI =ATB

and then computing the Cholesky decomposition

whereB= (B

y)

u) [67].

andBl = (Bl

DELETING OBSERVATIONS

The solution of the OLM is considered when m - k observations are deleted.


This is the reverse operation to that of updating the OLM. If the first m - k observations are deleted from the already solved OLM (3.1), then the downdating
OLM problem is the re-calculation of the BLUE of x in
(3.53)
where
n-l

A== (A y) ==

(~l)
== (AI
A2
A2

Yl)m-k
Y2 k

and

(1) m-k
k
.
2

The downdating problem can also be expressed as the computation of the QRD
with

R ==

(Rno un) n1 Sn

(3.54)

when (3.2) is known. The restriction n ::; k < m is imposed to guarantee that at
least one observation is deleted and that the resulting downdated OLM is overdetermined. Furthermore, it is assumed thatA2 has full column rank. Note that
if the deleted observations do not occupy the first m - k rows of the OLM, then
A and Q in (3.2) need to be replaced by OTA and OT Q, respectively, where OT
is a permutation matrix, whose effect is to bring the deleted rows of A to the
top [15,67].
If

m-k

Qf2)n
Qf2 m-k
T
k-n
Q32

Updating and downdating the OLM

93

in (3.2) is known, then the downdating problem can be solved in two stages
[67]. In the first stage the orthogonal factorizations
HT

(Q~I
Q~2)
Q
Q
31

32

and

GT (Qfl
q}2
Z
Q
22

(Z
0

A)
= (D
0
0

q~2) m - k
Q 32 k-n
A~

Q 22

E)
m-k
B n

(3.55a)

(3.55b)

are computed, where H E sx(m-n) x (m-n) and G E sx(m-k+n) x (m-k+n) are orthogonal, Z is upper triangular and IDI = Im-k. In the second stage the QRD
(3.56)
is computed, where R correspond,s to the upper triangular factor of the QRD
(3.54). Observe that

0) (1
(G
o h-n <>
T

from which it follows that Al = DE and the orthogonal matrix Q~ is given by


(3.57)
Furthermore, from

it follows that
(3.58)
and since (Qll

ZT

0) has orthogonal rows,


ZTZ

= Im-k- QllQfl

(3.59)

Thus, if Q in (3.2) is not available, then (3.58) and (3.59) can be used to derive
Qll and Z [39, 67, 75, 81].

94

PARALLEL ALGORITHMS FOR liNEAR MODELS

Another approach to the solution of the downdating problem is to solve the


updatedOLM

where l denotes the imaginary unit [15, 20, 67, 75, 93,122]. Thus, the problem
can be seen as the updating of the QRD (3.2) by tAl:

QT (tAl)
R

(i?)0 m-k
n

(3.61a)

or
(3.61b)
The QRD updating (3.61) can be obtained using hyperbolic Givens rotations
or Householder transformations. Hyperbolic Householder reflections have the
form
i= 1, ... ,n,

(3.62)

A(l) IS
" the lth column of AI,
A Yi = J{i,i
A + S, S = SIgn
. (A
)(1'.2
II A(I)112) '
where A:,i
J{i,i J(i,i - A:,i
Ci = sri and ei is the ith column of In. Application of hyperbolic Householder
transformations does not involve complex arithmetic. However, it should be
noted that this downdating method is not numerically stable.

4.1

PARALLEL STRATEGIES

On the DAP the factorizations (3.55) and (3.56) have been computed under
the assumptions that n = N es, m = Mes, m - k = es (es observations have been
deleted) and that, except for R, none of the other matrices is required. That is,
the orthogonal matrix Qn is not computed.
In the first stage (3.55a) is computed using Householder transformations
with total execution time (in msec)

1's, (X) = 94.97 + 56.89(X)

for X

=M -

N.

UGS-l can be used, as shown in the updated OLM, to compute (3.55b). That
is, the orthogonal matrix GT is the product of the CDGRs of UGS-l which
computes the updating

Updating and downdating the OLM

95

UGS-3 can also be employed to compute the equivalent form of the orthogonal
factorization (3.55b)

GT (Ql1

q12
QI2

A)0 = (0D QI2


B) n
O E m-k'

(3.63)

These Givens sequences have the advantage of deriving B in upper triangular


form and thus, the second stage, i.e. the QRO (3.56), is not required. However,
this approach has the disadvantage of computing explicitly the observations to
be removed, i.e. the matrix E is regenerated. Figure 3.14 shows how the two
methods operate when m - k = 6 and n = 10. The bold frames partition the
initial matrix as

Ql1
Z

IA ) n
10 m-k

withanumberi(i= 1, ... ,18)inthematrices (Ql1 ZT) T and (AT OT) T denoting, respectively, the elements annihilated and filled in when the ith COGR
is applied. On the OAP the execution time model for computing (3.55b) using
UGS-l is found to be

1's2g(N) = 10.11 + 33.76N + 56.32N2 +6.58N3

(3.64)

Consider now the computation of (3.63) using Householder transformations.


The orthogonal matrix G is defined as G = Hm-k . .. HI, where Hi denotes the
ith Householder transformation

Hi=lm-k+n-l+~Zi'il ((Zi,i:O"i)e) (qT

(Zi,i+O"i)ef)

(3.65)

which annihilates the ith column of (Ql1 Q12) , denoted by q, and which also
reduces Zi,i+l: to zero and makes IZi,d = 1 (i = 1, ... ,m - k). Here, ei is the ith
column of Im-k and O"i = 1 if Zi,i > 0, and O"i = -1 otherwise. Let

ni ( D(i)
0
m-k-i

m-i
Q(i)

~(~))

EO(I)

(Q(i)

R(i))

== Z(i) E(i)

Z{i)

Q(i-l)
=Hi ( Z{i-l)

R(i-l)) n
E(i-l) m-k'

(3.66)

where Q(O) == (Ql1 Q12)' Z(O) == (Z QI2)' R(O) == R, E(O) == 0, ID(i) I = Ii,
and in (3.63), B == R(m-k) and E == E(m-k). From (3.66) it follows that

Q(i) = Q(i-l) -q(ei+O"iZi~~-I){ /(1 + IZi~~-I)I),

(3.67a)

96

PARALLEL ALGORITHMS FOR LINEAR MODELS


.

00000

U.OOOOl~.
11~.000~~1
~~.oo~lnl

lIZ:: IZ3I2U~ 151:


1
1
121
161411
11 l
l~ 12.
121l~ 11 15~3 11
10 11~
1~
H
9 I1J l5 1
l~ 1
1 11 9
81
1~U~412H8
7911113151
1151311197
68~012l411~
161411086
5 7 9 1 l
1513119 7 5
4 6 8 10 I ,
141' U 8 6 4
1:

~ll.0~119rI5

141:
13 11'

.~2~I:U~4

3 5 7 9 lit

1 11 9 7 5 3

246 1 8l0~
13 1 57~1

11 1 ~7 1 531

1211086 142.

2468[(
3579
468

108642
9753
864

5 7

7 5

(a) The UGS-I for computing (3.55b).

lSI!
l'

~O

161
l~
1~

,8

1'l

1~

9
8
7
6
5
4
3

11

13

9 10 I
8 9 III .1
7 8 9 10
6 789
5 6 7 8 9
4 5 678
23 4 5 6 7
112 3 4 15 6
.0 0 0 0 0 1
000 0 1~
.00 0 1~
.00 11211
.0 !2r21
!3122 tn

1 1 11 11] 19 8 7 16 5 14
1'1 I 11 9 8 7 6 5
1 10 9 8 7 6
1:
1
1 11111] 9 8 7
1 l'
1 12 III III 9 8
In3 11 1110 9
1 1:

o Initial non-zero element

o Zero element

321
43 2
543
6 5 4
7 6 5
8 7 6

[I] Fill-in
[!] non-zero element

(b) The UGS-3 for computing (3.63).

Figure 3.14.

Givens parallel strategies for downdating the QRD.

Updating and downdating the OLM

97

(3.67b)
(3.67c)
and
(3.67d)
Algorithm 3.3 computes (3.63) using Householder transformations. For convenience, the superscripts of the matrices Q(i), R(i), Z(i) and E(i) have been
dropped. Notice that the observations that have been deleted from the OLM
are generated by the 10th step of the algorithm. This step together with the
steps that modify Z, i.e. steps 11 and 12, can be deleted.

Algorithm 3.3 The computation of (3.63) using Householder transformations.


1: for i = 1, ... , m - k do
2:
cr:= 1
3:
H (Zi,i < 0) then cr := -cr

4:
5:
6:
7:
8:

9:
10:
11:
12:

y:=Zi,i+cr
~ := 1 + IZi,il
W := sum(spread(Q:,i,2,n) *R, 1)
forall(j = 1, ... ,n) Rj,: := Rj,: - Qj,i *W /~
W:= cr*Zi,i: +el
forall(j = 1, ... ,n) Qj,i: := Qj,i: - Qj,i *W /~
E i ,: := -cr* W
Zi,i+l: := 0

Zi,i := -cr
13: end for
The total execution time required to implement Algorithm 3.3 on the DAP,
without modifying Z and without regenerating the deleted data E, is given by

I's2h (N) = 44.14 + 37.40N + 36.06N2


The time required to compute the QRD of Bin (3.63) using Householder transformations, is

I's2L(N) = N(10.06+ 36.50N + 18.52N2 ).


Table 3.9 shows for some N, the values of I'sl (N), I's2g (N), I's2h (N), I'su (N)
and T3(N), where

98

PARALLEL ALGORITHMS FOR UNEAR MODELS

is the time needed to compute (3.61b) using hyperbolic Householder transformations. The method which uses Algorithm 3.3 - called the Householder
method - is inefficient for large N due to the triangularization of Bin (3.63).
However, it performs better than UGS-l for N = 1,2,3. The disadvantages of
UGS-l, which necessarily computes the transformation of Z to D and regenerates E, indicates that the efficiency of the Householder method might increase
with respect to UGS-l as the number of observations to be deleted increases
- that is, for larger m - k. Use of hyperbolic Householder transformations
to solve the downdating problem produces the fastest method. However, the
execution efficiency of this method is offset by its poor numerical properties.
Table 3.9.

Execution time (in seconds) for downdating the OLM.

T.vl (N)

1's2g(N)

1's2h (N)

T.v2L(N)

T.V2h (N) + 1's2L (N)

T3(N)

1
2
3
4
8
16
32

0.15
0.21
0.27
0.32
0.55
1.01
1.92

0.50
1.01
1.64
2.40
6.75
22.62
90.l5

0.l2
0.26
0.48
0.77
2.65
9.87
38.17

0.l6
0.50
1.13
2.l7
12.62
86.81
647.46

0.27
0.76
1.61
2.94
15.27
96.69
685.63

0.19
0.45
0.79
1.19
3.55
12.l4
47.32

Another possible alternative method to compute (3.63), which may also


be efficient on MIMD systems, is a block generalization of the serial Givens
algorithm in which m - k = 1 [12, 23, 89]. Let Qfl and A be partitioned, respectively, as

T _

Qll-

C'WIT) .,
:

n2
:

WJ

ng

and

nl

n2

ng

All
,

Al ,2

Al,g

A22
,

A2,g n2

A=

Ag,g ng

and assume that the computations are not performed on (Q12


Z

= Zo and E~O) = 0, at the ith (i = 1, ... ,g) step the factorization


T

Gi

(lti

Zi-l

.. .

Ai,i ...

nl

Ai,g) _
(i-I) Eg

(0

Zi

Ri,i ... Ri,g)


(i)
(i)
Ei
.. . Eg

'b) T.

For

(3.68)

is computed, where Gj is orthogonal and Zj is an upper triangular matrix of


order (m - k). That is, IZgl = 1m -to E

== (E~g) ., . E~g))

and Bin (3.63) has

99

Updating and downdating the OLM

the block-triangular form

B=

ni

n2

RII,

RI2,
R2,2

ng
RI,g nl
R2,g n2
Rg,g ng

The updating (3.68) can be computed using the UGS-2, UGS-3, or Householder methods. For the QRD of B the orthogonal factorizations

ni
Qf(Ri,i ... Ri,g)= (Ri,i

ng
...

Ri,g),

i=1, ... ,g,

are computed simultaneously, where Ri,i is upper triangular. The matrix QT


and R in (3.56) are given, respectively,
by
\

QT =

(Of .~)

Rl,1 ~1,2
and

R2,2

R= (

~I,g)

R2,g

Rg,g

DELETING EXOGENOUS VARIABLES

Downdating of the regression model (3.1) after removing a block of regressors is a straightforward operation. Partitioning the data matrix A as
nl

A= (AI

n2

n3

A2 A3)'

the regressors-downdating problem can be described as computing the QRD


of the matrixA* = (AI A3 Y) E 9tmxn*
Q-TA* = (R) n*

o m-n*'

with

R=

(Rno un) n*1 - 1 '


Sn

(3.69)

where the deleted regressors are denoted by the submatrix A2, n* = ni + n3 +


1 == n - n2 and the new least-squares solution is given by x = R;; I Un. Partitioning R in (3.2) as

100

PARALLEL ALGORITHMS FOR UNEAR MODELS

the QRD of A * can be seen as being equivalent to the updating problem

o? (i?2.3)
=
i?3.3

(R02.3) n3n2 + 1

(3.70)

The matrix Rin (3.69) is given by

R == (i?l.l ~1.3).

R2.3

Hence, the algorithms for solving the updating (3.6) can be employed to compute (3.70).
The re-triangularization procedure above can be extended to deal with the
case where more than one block is deleted. Here the more general problem of
deleting arbitrary regressors from the OLM will be considered. Furthermore,
in order for the proposed strategies to be used in other related problems and
to have a consistent notation, it will be assumed that the matrix i? in (3.2) is
a K x (K + G) upper trapezoid, i.e. m = K and n = K + G. Thus, within this
context, the regressors-downdating problem is redefined as the computation of
the orthogonal factorization
K
R = (R(l)

with

G
R(2))K,

(3.71)

where R(l) and R are upper triangular, e = k + g ~ K and S E 9t(K+G)xe is a


matrix which when multiplied on the left by R selects the k and g columns of
R(l) and R(2), respectively. The QRD (3.71) is equivalent to triangularizing the
upper trapezoid R by orthogonal transformations after deleting columns.
Partitioning the permutation matrix Si as
k

S=

( so SO)KG'

let

(3.72)

where the jth column of the permutation matrices S and S are given, respectively, by the ).jth column of lK and the ).jth colu~ of lG, ). = ().l ... ).k) and

101

Updating and downdating the OLM

).1 < ... < ).k. The QRD of the RS matrix can be derived by computing the
QRDs
with

(3.73)

and

QT(~n=mk-k-g'

(3.74)

such that the orthogonal matrix (),T and the upper triangular factor
QRD (3.71) are given respectively by

R in

the

(JT = (~ JT) (~ IK~~J


and

_- (ROQ-TR
R(2)

R-

(3.75)

The computation of the QRD (3.73) is equivalent to re-triangularizing an


upper triangular factor after deleting columns. Let the leading k x k submatrix
of RS be already in upper triangular form, that is, ).q = q for q = 1, ... , k. Using
the parallel algorithm in [74] the total number of CDGRs required to compute
the QRD (3.73) is given by

Tu().,k)=(k-k)-,u+l,

(3.76)

where,u = min (PI , ... , Pk-k) and


Pj

= 2j+k-).j+k + 1 for

j = 1, ... ,k-k.

The CDGRs comprise rotations between adjacent planes and the non-zero elements in the (j + k)th column of RP) are annihilated from bottom to top by
the O"jth CDGR, where O"j = Pj -,u+ 1 (j = 1, ... ,k-k).
The computation of the QRD (3.74) can be obtained using any known parallel Givens annihilation scheme such as those reported in [30, 75, 103, 129].
However, the factorization (3.74) can start simultaneously or before the completion of the QRD of R~I) in (3.73). Thus, if the SK (Sameh and Kuck) annihilation scheme in [129] is used for the factorization (3.74), then the total
number of rotations applied to compute the QRD of RS is given by
(1)

TRS ('A.,k,g,K) =

{Tu().,k)
K -k+g-2+max(0,2-K +).k)

ifg=O,
if g> 0.

(3.77)

102

--g- -- --g- -- --g-

PARALLEL ALGORITHMS FOR LINEAR MODELS

-- k

lele Ie ele Ie e
Ie Ie ele Ie e
30 Ie ele Ie e
' .OJ
-Ie Ie e
lele e
~J!ol,) Ie e

ele
ele
ele
ele
ele
ele
Ie
-Ie
Ie
r:I~

e
e
e
e
e
e
e
e
e

Ie e ele e
~9 e ele e
2D~ ele e
2H~ 311e e

I!, 21l tll '11

ele Ie
ele Ie
ele Ie
ele Ie
Ie Ie
-Ie Ie
Ie Ie
Ie

~ ~:l

tlli

'11

19~1
~1

~,

I":l

'1, '1'7

..,

!olD

J.,

13:

121 2.1

25~1

"'''

131151'1

ele
ele
eie
ele
ele
ele
ele
ele
Ie
Ie

?Q 'l~

t\2

InB

:12

?"

?~

B
,10 l'

i911
6 18 I~
517 9
416 8
9
31$
214
8
1 i3
7

I"
."

Figure 3.15.

,?'

u
21 21
2~ 22

...

U 20

I ' III

SK scheme

"'II

.,,,

llil 12., "../1


~Il 121 21
~'1 i19 121
~~

~o

~5 11 119

Ehmmate numbers

D Initial zero element

ee ee ee ee ee
15 e ee ee ee ee
1416 ee ee ee ee
1315 17 e ee ee ee
416 18 e ee ee e
315 1719 ee ee e
1214 1618 ~O ee ee
113 1517 1921 ee e
012 1416 1820 22 ee

911 1315 1719 21 123 e


11!! 1214 1618 20 ~2 ~
113 1517 19~1 ~
1416 1820 122
113 1517 19121
UJ 12 1416 18~0
113 1517 19
012 1416 18
911 1315 17
8 10 1214 16
7 9 1113 15
6 8 1012 14
7 9 1113
8 10 12
7 9 11
6 8 10
579
468
357
246
135

Subtract m = 14

Illustration of the SK-based scheme for computing the QRD of RS.

This method for computing the QRD of RS is called the SK-based scheme
and its derivation from the SK Givens sequence is illustrated in Fig. 3.15,
where K = 30, k = 7, g = 3 and ~ = (4,10,11 , 12,15, 21,22) or, equivalently,
k = 8, g = 2 and ~ = (4,10,11 , 12, 15,21,22, 30). Initially the SK scheme is
employed to triangularize a K x e full dense matrix, where a number i denotes
the elements annihilated by the ith (i = 1, . .. , K + e - 2) CDGR and a denotes a non-zero element. Then the numbers at positions ~ j + 1 to K of the jth
(j = 1, . . . , k) column of the matrix - that is, the numbers in the zero part of
the matrix - are eliminated. The minimum number, e.g. m, from the remaining
positions in the matrix is computed. Finally the SK-based scheme is derived
by subtracting m -1 from each number. Observe that for g = 0 the SK-based
scheme is identical to the parallel algorithm reported in [74], while, for k = 0,
the SK-based scheme is identical to the SK Givens sequence for computing
the QRD of a K x g matrix. However, for g = 0 the complexity analysis of

103

Updating and downdating the OLM

the SK-based parallel strategy will hold for all cases, in contrast to that in [74]
which fails if 3j : 1..j = j.
The number of CDGRs applied to compute the QRD of RS in (3.71) can be
reduced if the maximum number of elements is annihilated by a single CDGR
[30, 69, 103]. The rotations are not restricted to being between adjacent planes
and previously annihilated elements are preserved. Let r)q) = La)q) /2J denote
the maximum number of elements in column j (j = 0, 1, ... ,k) of RS that can
be annihilated by the qth (q> 0) CDGR, where a dummy column at position
zero has been inserted with 5.0 = 0 and Vq: a~q) = o. At step q an element in
position (I, j) of RS is annihilated by a Givens rotation in planes I and (I where I

r)q) ) ,

2': j and r)q) > O. If


for j = I, ... ,k,
for j=k+ 1,
for j = k+2, ... ,e,

then a)q) is defined as


(3.78)
where q > 1. This annihilation process is called Greedy-based and terminates
at step q (q 2': 0) if Vj: r)q) = O.
Figure 3.16 shows on the left the annihilation pattern of the Greedy-based
algorithm for the problem considered for the SK-based scheme in Fig. 3.15.
On the right an alternative sequence corresponding to the triangularization of
a 30 x 10 dense matrix (shown in the center) using the Greedy algorithms in
[30, 103] is shown. Notice that in this particular example this scheme uses one
CDGR more than the previously described Greedy-based annihilation scheme.
Generally, if 1..p = p (p = 0, ... ,k), k = K- k, e= e - k and k e, then the
total number of CDGRs applied to compute the QRD of RS using the Greedybased scheme is given approximately by

TA;) (1..,k,g,K)

IOgK + (g-I)loglogK
{
= logk+ (e-l) loglogk -log(k- 1..kj-l)
logk + (e-l) loglogk -Iog(k - Ak+l )

if k = 0,

if g = 0,
if k,g

> 0,
(3.79)

where 1..k+ 1 = 0 and k = 1..k - k. This result derives from the complexity of the
Greedy algorithm when applied to a dense matrix [103]. Clearly, the Greedybased scheme requires no more CDGRs than the SK-based scheme.

104

PARALLEL ALGORITHMS FOR LINEAR MODELS

--g- - --g- - --gk



5
2
2
1 5
1 4 4 8 .
1 3 6 4 7 10
1 4 7
3 6 9 .
3 5 8 . 3 6 9 1
3 6 811
2 4 7 lU 3 6 911 14
2 5 8 10 13
2 4 6 9 12 3 5 811 1316
1 3 5 8 1114 3 5 8 IU 1315 18
2 5 7 10 1 15
2 5 7 10 1215 17121J
2 4 7 9 1 14 17
135 710 1316
124 6 9 1215 18.
2 4 6 8 1114 17~
3 5 7 10 13 Hi 19
4 6 9 1 1518
2 5 811 1417
1 4 7 10 1316
3 6 9 1 15
2 5 811 14
2 4 7lU 13
1 3 6 9 12
1 3 5 811
1 2 4 7 10
2 4 6 9
3 5 8
3 5 7
2 4 6
246
135
1 3 5
1 2 4
1 2 3
Greedy-based scheme

Figure 3.16.

2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

5
4
4
4
4
4
4
3
3
3
3
3
3
3
2
2
2
2
2
2
2

7 9 1214 1719 22.


2 4 6 9 1114 1619.
7 9 1114 lU9 21 24
4 6 811 1316 18 21
6 9 1113 lU8 21 23
6 8 lU 13 15 18 20
6 8 113 1518 20 3
8U 12 15 1720
6 8 lU 13 1517 20 22
7 lU 1214 1719
6 8 012 151 19 22
11~ 9 12 14 1619
6 8 lU~: 1416 19 21
911 13 1618
5 7 10~ 1416 1821
911 13 15 18
5 7 9 III 14l(j 18 20
811 13 15 17
5 7 9111 1315 17 20
8 lU 12 1417
5 7 911 1315 1719
8 lU 1214 16
5 7 9ilJ 1315 17 19
8 lU 1 1416
5 6 8 10 214 1618
II. 1113 15
4 6 8 101 416 18
113 15
4 6 8 101 416 18
113 15
[012 14
4 6 8 9 1 315 17
4 6 7 91 315 17
01 14
[012 14
4 5 7 9 1 315 17
3 5 7 8 lU 2 1416
911 13
3 5 6 8 lU 2 14 16
911 13
3 4 6 7 9 1 1315
8 10 12
(Dense) Greedy
Alternative Greedy-based

o Initial zero element

Greedy-based schemes for computing the QRD of RS.

Chapter 4
THE GENERAL LINEAR MODEL

INTRODUCTION
Consider the General Linear Model (GLM)
(4.1)

where y E SRm is the response vector, A E SRmx(n-l) (m ~ n) is the exogenous


data matrix with full rank, x E SR(n-l) is the vector of parameters to be estimated and E E SRm is the noise vector normally distributed with zero mean and
variance-covariance matrix 0'20. Without loss of generality it will be assumed
that 0 E SRmxm is symmetric positive definite. The BLUE of x is obtained by
solving the generalized linear least squares problem (GLLSP)
argmin II u If

X,u

subject to y = Ax + Bu,

(4.2)

where 0 = BBT and u '" N(O,0'21m ). If B is the Cholesky lower triangular


factor of 0, then the solution of the GLLSP (4.2) employs as a main computational tool the Generalized QL Decomposition (GQLD):

QT (y

A) =

(~T ) == (~
Y2

n-l

)m-n

RT

n-l

(4.3a)

106

PARALLEL ALGORITHMS FOR LINEAR MODELS

and

1 n-l

)m-n

0
pOl

Lz2

(4.3b)

n-l

where RT, Lll and L22 are lower triangular non-singular matrices. The BLUE
of x, say X, is obtained by solving the lower triangular system

The variance-covariance matrix of is given by

Var(x) = (J2(4.2R-l)T (42R-1),


where 112 j(p2(m- n+ 1)) is an unbiased estimator of (J2 [91, 109, 111].
Paige has suggested a serial algorithm based on Givens rotations for computing (4.3), which takes into account the initial lower triangular form of the B
matrix and that the first m - n columns of QT BP contribute nothing to the solution of the GLM [109]. Figure 4.1(a) shows the annihilation pattern for Paige's
algorithm when computing the QLD (4.3a), where m = 18 and n = 8. This
sequence is equivalent to the diagonally-based annihilation scheme shown in
Fig. 1.3(b). Let Gi,j denote the Givens rotation between the adjacent planes i
and i + 1 that annihilates Ai,j' where A (y A). The application of Gi,j from
the left of (A B) will annihilate Ai,j, but will also fill-in Bi,i+!. This fill in
can be eliminated by a Givens rotation, say Pi,j, which affects columns i and
i + 1 of Gi,jB when applied from its right. Thus, Gi,jBPi,j is lower triangular,
and QT and P in (4.3) are the products of the left Gi,j and right Pi,j Givens
rotations, respectively.
Now, let
and Pi denote, respectively, the products of the left and right
Givens rotations such that

QT

mt
1

Q'A~ (~)
~
Ai - m - mi'
I

Yi

n-l

~i

(4.4a)

and

mi

(QfB)P, ~ L, "

m, (~II
m-mi Lzl

m-mi
0
Li

),

(4.4b)

107

The General linear model

where Lj is lower triangular and non-singular and 1 ::; mj < n. Conformably


partitioning uT Pj as (aT uf), the GLLSP (4.2) can be written as
argmin(lIadl2 + lIujll2) subject to
Uj,aj,X

(0) (0) + (LllLz. 0) (a


Yj

Aj

Lj

j)

Uj

(4.5)
It follows that, since
GLLSP

L aj

= 0, aj =

and (4.5) is equivalent to the reduced

argminllujll2 subject to Yj=Ajx+Ljuj.

(4.6)

Uj,X

Thus, after the application of the (n(n + 1) /2)th left and right Givens rotation
Paige's sequence solves reduced GLLSPs.
Similarly, the column-based Givens sequence shown in Fig. 4.1 (b) can be
used to compute (4.3). However, this sequence annihilates complete rows of
A more slowly than Paige's sequence. As a result the column-based sequence
performs twice as many operations when m n [111].
36 28 21
44 35 27
52 43 34
60 51 42
68 59 50
76 67 58
84 75 66
92 83 74
lOCI 91 82

3
5
8
12
24 17
40 31 23
48 39 30
56 47 38
64 55 46
72 63 54
80 71 62
HIE 97 88 79 70
0596 87 78
~04 95 86

15
20
26
33
41
49
57
65
73
10~ C)C) 90 81
10'7 98 89

10
14
19
25
32

6
9
13
18

1
2
4
7
11

16
22
29
37
45
53
61
69

77

~o~ 94 85

~02 93
01

(a) Paige's sequence

Figure 4.1.

99 88
lOCI 89
101 90
0291
103 92
104 93
105 94
ll1E 95
10'7 96
10~

76 63 49 34 18
77 64 50 35 19
78 65 51 36 20
79 66 52 37 21
80 67 53 38 22
81 68 54 39 23
82 69 55 40 24
83 70 56 41 25
84 71 57 42 26
97 85 72 58 43 27
98 86 73 59 44 28
87 74 60 45 29
75 61 46 30
62 47 31
48 32
33

1
2
3
4
5
6
7
8
9
10

11

12

13
14
15
16
17

(b) Column-based

Sequential Givens sequences for computing the QLD (4.3a).

108

PARALLEL ALGORITHMS FOR LINEAR MODELS

PARALLEL ALGORITHMS

The Paige's sequence shown in Fig. 4.1(a) annihilates, diagonal by diagonal, the elements of A and the elements of each diagonal are annihilated by
successive Givens rotations from the bottom to the top. The annihilation of
a diagonal can start before the previous diagonal has been completely zeroed.
This leads to the SK sequence, a parallel Givens sequence developed by Sameh
and Kuck [129]. The SK sequence computes the QLD of an m x n matrix by
applying m + n - 2 CDGRs for m > n and 2n - 3 CDGRs when m = n. Figure 4.2 shows the SK sequence for m = 18 and n = 8. Each CDGR preserves
the previously created zeroes and a single element is annihilated by rotating
two adjacent planes. The total number of elements annihilated by the application of the ith CDGR is given by e(i, m, n), where, for m 2: 2n

e(i,m,n)

= {min(l(i ~ 1)/2J +
m+n-l-l

1, n)

if 1::; i < m,

if m ::; i ::; m + n - 2,

while, for m < 2n


l(i-1)/2J +1
{
e(i,m,n) = lm/2J + 1- e(i - m+ 1)

m+n-i-l

if 1 ::; i < m,
if m ::; i

< 2n,

if 2n ::; i ::; m + n - 2.

Let QT in (4.3b) be defined as QT = G(m+n-2). .. G(2)G(1), where G(i) is


the ith CDGR of the SK sequence. The m + n - 2 (m 2: n) CDGRs of the SK
sequence are divided into two sets, with the first 2n - 1 CDGRs comprising
set-1 and the remaining m - n - 1 comprising set-2. When G(i) is applied to
the left of (y A B), e(i,m,n) zero elements in the leading superdiagonal of
B will become non-zero. In Fig. 4.3 the pattern results from applying G(16) to
B, where m = 18, n = 8 and . . denotes the fill in. A CDGR, say p(i), can
be applied to the right of G(i) B to restore the lower triangular form of B. The
compound rotation p(i) is the product of e(i,m,n) single disjoint Givens rotations. Each Givens rotation affects two adjacent columns when annihilating a
non-zero element in the leading superdiagonal of B. The orthogonal matrix P
in (4.3b) is then defined as the product of the right compound rotations, that is,
p = p( 1) p(2) ... p(m+n-2). If G(i) is in set-2 then, prior to the application of G(i)
and p( i) , the first i - 2n + 1 rows of A = (y A) are zero, which implies that the
first i - 2n + 1 columns of B are ignored for the application of G(i) [67, 111].
The direct application of the SK sequence to compute the orthogonal decomposition (4.3) on a SIMD array processor might appear to be inefficient
since at some stage CDGRs are applied redundantly on zero layers. This is
illustrated in Fig. 4.4, where es = 4 is the edge size of the array processor,
the layers have dimension es x es and are denoted by bold frames, m = 18,

The General linear model


1513 119 7 5 3 1
U 14 12UJ 8 6 4 2
1715 1 11 9 7 5 3
1816 1411 10 8 6 4
1917 1513 119 7 5
120 18U 1412 UJ 8 6
2119 1 15 1311 9 7
122 20 U 16 141 10 8
~ 21 1917 1513 119
~ 22 2() 18 1614 12 HJ
.23 2119 1715 1311
22~ 1816 141
.21 191 lSI
20 1816 14
19 1715
18U
1

Figure 4.2.

The SK sequence.

109









Figure 4.3.

G(l6) B with e(16, 18,8)

= 8.

n = 8 and G(I6) is applied to the left of (y A B). The zero (shaded) layers
of (y A) and B are not affected by the application of G(I6). Similarly, the
zero layers of G(16)B are not affected by the right application of p(16) which
annihilates the fill-in of G(I6) B.

0
0

0
0
0

(y A)
[Q) Annihilated [!] Fill in [!] Non zero

D and D Zero elements

Figure 4.4. The application of the SK sequence to compute (4.3) on a 2-D SIMD computer.

110

PARALLEL ALGORITHMS FOR LINEAR MODELS

Modi and Bowgen modified the SK sequence, hereafter called the MSK
sequence, to facilitate its implementation on the DAP using systems software
in which operations can be performed only on es x es matrices and es element
vectors [104]. The MSK sequence requires a greater number of compound
rotations to compute the QLD (4.3a) than does the SK sequence, but it has the
advantage that the CDGRs are not applied to any previously annihilated es x es
layers. The MSK sequence can be generalized so that MSK(p) partitions A
into n/ p blocks (each consisting of p columns) and applies the SK sequence to
these blocks one at a time (where it is assumed that n is a multiple of p). Thus,
the total number of CDGRs needed to compute the QLD (4.3a) using MSK(p)
is given by:
nip

Z(m,n,p) = I(m+ (2-i)p- 2) == n(2m+ 3p- n-4)/2p.


i=l

Note that the SK sequence and the MSK sequence in [104] are equivalent to
MSK(n) and MSK(es), respectively. Figure (4.5) shows MSK(4) and MSK(8),
where m = 24 and n = 16. The shaded frames denote the submatrices to which
the SK sequence is applied to transform the matrix into lower triangular form.
73 ~1 69 67 5553 51 49 3331 29 27 7 5 3 1
74 ~2 70 68 5654 52 50 3432 30 28 8 6 4 2
75 73 71 69 5755 53 51 3533 31 29 9 7 5 3
76 74 72 70 5856 54 52 3634 32 30 ~o 864
77 ~5 73 71 5957 55 53 3735 33 31 ~1 975
78 76 74 72 64J 58 56 54 3836 34 32 ~2 10 8 6
79 ~7 75 73 6159 57 55 3937 35 33 ~3 1197
80 ~8 76 74 6260 58 56 4038 36 34 ~4 12108
r79 77 7S 6361 59 57 4139 37 35 ~5 1311 9
78 '76 64 62 60 58 4240 38 36 ~6 1412 10
.77 6563 61 59 4341 39 37 ~7 1513 11
6Cl64 62 60 4442 40 38 ~8 1614 12
.6S 63 61 4S43 41 39 ~9 171S 13
64 62 4644 42 40 ~ 1816 14
.63 4745 43 41 119 1715
484(; 44 42 220 18 16
.47 45 43 ~21 1917
46 44 2422 20 18
.45 ~5 23 21 19
lUi 24 22 20
.25 23 21
24 22

45 43 41 39 37 35 33 31 ~S 13 11 9 7 531
46 44 42 40 38 36 34 32 ~6 14 12 10 8 6 4 2
47 45 43 41 39 37 35 33 ~7 15 13 11 9 7 5 3
48 46 44 42 40 38 36 34 ~8 16 14 12 10 864
49 47 45 43 41 39 37 35 917 15 13 11 975
50 48 46 44 42 40 38 36 W18 16 14 12 10 8 6
51 49 47 45 43 41 39 37 119 17 15 1311 9 7
52 50 48 46 44 42 40 38 220 18 16 1412 108
.51 49 47 45 43 41 39 2..121 19 17 1513 119
50 48 46 44 42 40 ~22 20 18 1614 12 10
.49 47 45 43 41 ~23 21 19 17 15 13 11
48 46 44 42 Ui24 22 20 18 16 14 12
.47 45 43 725 23 21 1917 15 13
46 44 ,..,,'" 22 20 18 16 14
.45 927 2S 23 21 19 17 15
3028 26 24 22 20 18 16
.29 27 2S 23 21 19 17
28 26 24 22 20 18
27 25 23 21 19
26 24 22 20
25 23 21
24 22

(a) MSK(4)

(b) MSK(8)












.23

Figure 4.5.








.23


Examples of the MSK(p) sequence for computing the QLD.

The General linear model

111

IMPLEMENTATION AND PERFORMANCE


ANALYSIS

The implementation and perfonnance of MSK(A.es/2) for solving the GLM


on a SIMD array processor will be considered, under the assumption that m ~
2n, m = Mes and n = Nes, Let A = (y A) and let MSK(p) have reached the
stage of executing the ith SK sequence to transfonn the mj x nj submatrix
-(i) -

-A1:m-(i-l)p,n-jp+l:n-(j-l)p

into lower triangular fonn, where p = A.es/2, mj = m - (i -l)p and nj = p,


The CDGRs of the ith SK sequence are divided into the three sets SP), S?) and

SP) as follows:
for 1 ~ J' ~ a(lj)

== 2nj -1,

S~2)
= da\i)+i)
I,}

for 1 ~ J' ~ a 2(j)

== mj -

S~3) = da\i)+aY)+j)

for 1 < J' < a 3(j)

== n l"-1.

s~1)
=
I,}

G(j)

The CDGR S}j

(k =

2nj,

1,2,3) can be partitioned as


I~

(k)
e;,j

I~

where
(k)

L ap,mj,nj),

,k-l

ej,j = e(J+

p=l

and

~+~+2e~j =mj,
Confonnably partitioning the top mj x (nj + (2N - iA)es/2) submatrix of A,
say A*, as

-(j))

A* =

(1ti)
-(i)

A2

112

PARALLEL ALGORITHMS FOR LINEAR MODELS

then A * :=

S\k!
A (i)
I,]

S;,1 A * is equivalent to A(i) := S;,1 A(i). In data-parallel mode A(i) :=

is computed as

A(i) :=

CI
CI

CI
CI

CI
CI

C2
C2

C2
C2

C2
C2

Ce Ce
Ce Ce

Ce
Ce

DI,:

D(2e-I),:
Du,:

-SI

-SI

-SI

SI

Sl

SI

-S2

-S2

-S2

S2

S2

S2

-Se
Se

-Se
Se

-Se
Se

D2,:
D3,:
D4,:

D2,:
DI,:

D4,:
D3,:

(4.7)

Du,:
D(U-I),:

where e = e;,1, D == A(i), U x (ni + (2N - iA)es/2) are the dimensions of all
matrices and * denotes element by element multiplication. Samples of execution times have been generated by constructing and applying a single CDGR
to matrices with various dimensions. Fitting a model to these samples, the
estimated execution times of constructing and applying (4.7) were found, respectively, to be
al

+ b l rU/es1 rN - (i -1)A/21

and

where al = 9.02e-04, b l = O.33e-04, a2 = 1.90e-04 and b2 = 4.94e-04 (in


seconds).
To evaluate the performance of the MSK(Aes/2) sequence, let the application of
affect the submatrices of A, B and QT with dimensions r~,1 x C~)k) ,

S;,1

r~,1 x C~)k) and r~,1 x ci,)k) , respectively. Similarly let the right compound rota-

si1

rt1

r},1

affect the E:,)k) x


and ci,)k) x
submatrices of
tion corresponding to
B and P. Due to the sparsity of B, QT, P and the special case where compound

The General linear model

113

rotations are not applied to some columns of B when i = 2N /1.., the values of
C~)k), ct)k) and ci,)k) are different. If PA, PB=PBL+PBR, PQ and Pp are all of the
es x es submatrices of A, B, Qr and P, respectively, which are involved in the
compound rotations when the MSK(Aes/2) sequence is applied and where PBL
and PBR correspond to the left and right compound rotations, then
el)

2NI').. 3

PA(M,N, A) = L LtJrt~/eslrc~)k)/esl,
i=l k=lj=l

2NI').. 3 aU)

PB(M,N,A) = L L
(rrt~ /esl
i=l k=lj=l
2NI').. 3

a(i)
k

(k)

rct)k) /esl + rc':,)k) /esl rrt~ /esl) ,


(k)

PQ(M,N, A) = L LLrri,j/eslrci,j /esl


i=l k=lj=l
and

Pp(M,N,A) = PQ(M,N,A).
More details of these are given in [67]. The total time spent on constructing
and applying the CDGRs of the MSK(Aes/2) sequence is

T(M,N,A) = 2alZ(M,N,Aes/2) +bl (PA(M,N,A) + PBdM,N, A))


+ 5a2Z(M,N, 'Aes/2) + b2Pr(M,N, A),
where

Pr(M,N,A) = PA(M,N,A) +PB(M,N,A) + 2PQ(M,N, A).


Another important factor in the execution time of the MSK(Aes/2) sequence,
are the overheads in extracting and replacing the affected submatrices of..4, B,
Qr and P [80]. Table 4.1 shows for some M, N and all possible A, the value of
Z == Z(M,N,Aes/2), Pr == Pr(M,N,A), the estimated time T == T(M,N,A), the
execution time Rotations which corresponds to T (M, N, A), the overheads time
Overh. and the total execution time when applying the MSK(Aes/2) sequence
on theDAP.
From Table 4.1 it will be observed that the best timings for the execution of
the MSK('Aes/2) sequence are for A= 2N (i.e. p = n), which is equivalent to
the SK sequence, when M N and A= N (i.e. p = n/2), otherwise. In most
cases Pr(M,N,A) increases as A increases, but the efficiency of applying the
compound rotations to a smaller number of es x es matrices is degraded due
to the large value of Z(M,N,'Aes/2), which results in greater overheads and
larger total time in constructing the compound rotations.

114

PARAUELALGORITHMS FOR LINEAR MODELS


Table 4.1.

Execution times (in seconds) of the MSK(A.es/2).

Pr

1
2

140
94

1059
1160

0.92
0.84

0.90
0.84

0.25
0.17

1.17
1.04

1
2

204
126

1962
2113

1.55
1.41

1.53
1.41

0.53
0.39

2.08
1.82

1
2

588
318

12084
11863

7.67
6.79

7.60
6.83

5.03
3.47

12.69
10.33

1
2
4
8
16

6496
3440
1912
1148
766

281808
289770
307472
338630
382821

159.63
155.22
159.98
173.70
195.14

158.01
154.24
159.92
174.68
197.57

148.75
95.59
68.42
52.19
35.68

307.41
250.22
228.59
227.08
233.53

1
2

1292
670

51585
47690

29.34
25.53

29.16
25.68

32.46
19.4

61.75
45.16

16

20

Rotations

Overh.

MSK(A.es/2)

20

1
2
3
6

3684
1914
1324
734

166841
167481
169887
174312

93.88
89.23
88.77
89.17

93.08
88.95
89.23
90.17

105.52
66.18
53.41
40.60

198.61
155.14
142.78
130.91

20

1
2
3
4
6
12

6792
3540
2456
1914
1372
830

333560
340065
349507
355739
370849
412475

186.31
180.57
182.37
183.96
190.07
209.57

184.60
179.40
182.57
184.05
190.89
212.10

208.13
131.08
105.95
92.65
79.38
64.07

392.79
310.71
288.54
276.72
270.28
276.18

The employment of Householder transformations to compute (4.3) is straightforward [13, 17, 18, 28, 37,40, 56, 93]. The orthogonal matrices QT and P
in (4.3) are the products of Householder transformations. Unlike the Givens
strategies, the Householder algorithm cannot exploit the triangular structure of
B. The timing model (in seconds) of the Householder method is found to be

Th(M,N) = n(c+2a) +m(c+a)


+b ~(6M2N + 2M3 _N 3 +6MN +3M2 +M +N),
where a = 0.68e-03 and b = 0.54e-03 and c = 0.9ge-03. Table 4.2 shows for
some M and N, the estimated time Th(M,N), the execution time of the Householder algorithm and the best execution time of the MSK(A.es/2) sequence
when QT and P are not computed. The performance of the MSK(Aes/2)

The General linear model

115

sequence is found to surpass that of the Householder algorithm only when


MN.
Table 4.2.

Computing (4.3) (in seconds) without explicitly constructing QT and P.

Th(M,N)

Householder

MSK(Aes/2)

2
3
9
9
9
12
12
12
16
16
16
20
20
20
20

1
1
1
2
4
1
2
3
1
2
3
1
2
3
6

0.43
0.77
7.25
8.88
12.03
14.92
17.70
20.44
31.86
36.63
41.37
58.47
65.80
73.09
94.59

0.38
0.73
7.46
9.l6
12.44
15.40
18.27
21.11
32.85
37.78
42.68
60.19
67.72
75.23
97.38

0.71
1.17
5.60
11.69
24.57
8.88
18.56
29.59
14.36
29.84
47.73
21.09
43.55
69.45
158.12

Chapter 5
SEEMINGLY UNRELATED REGRESSION
EQUATIONS MODELS

INTRODUCTION

The problem of estimation for a system of regression equations where the


random disturbances are correlated with each other is investigated. That is,
the regression equations are linked statistically, even though not structurally,
through the non-diagonality of the associated variance-covariance matrix. The
expression Seemingly Unrelated Regression Equations (SURE) is used to reflect the fact that the individual equations are in fact related to one another
even though, superficially, they may not seem to be. The SURE model that
comprises G regression equations can be written as
Yi=Xi~i+Ui'

i= 1, ... ,G,

(5.1)

where the Yi E ~r are the response vectors, the Xi E ~rXki are the exogenous matrices with full column ranks, the ~i E ~ki are the coefficients, and
the Ui E ~r are the disturbance terms. The basic assumptions underlying the
SURE model (5.1) are E(Ui) = 0, E(Ui,U)) = CJiJr and limr-too(XrXj/T) exists (i, j = 1, ... , G). In compact form the SURE model can be written as

or
G

vec(Y) =

(E!1Xi )vec( {~i}G) + vec(U),


i=l

(5.2)

118

PARALLEL ALGORITHMS FOR liNEAR MODELS

where Y = (YI ... YG) and U = (UI ... uG). The direct sum of matrices EEl~IXj
defines the GT x K block-diagonal matrix

~X'=XlEllX'EIl"'EIlXG=

(Xl X, ".

x,,)'

(5.3)

where K = L~l kj [125]. The matrices used in the direct sum are not necessarily of the same dimension. It should be noted however, that some properties
of the direct sum given in the literature are limited to square matrices [134,
pages 260-261]. The set of vectors ~}, ~2' ... ' ~G is denoted by {~j}G. The
vec() operator stacks the columns of its matrix or set of vectors argument in a
column vector, that is,

vec(Y)

= '(YI)
Y~

and

Hereafter, the subscript G in the set operator {.} is dropped and the direct sum
of matrices EEl~1 will be abbreviated to EElj for notational convenience.
The disturbance vector vec(U) in (5.2) has zero mean and variance-covariance
matrix !:.Ir, where!:. = [OJ,j] is a G x G positive definite matrix and denotes
the usual Kronecker product [5]. That is,
E(vec(U)vec(U)T) = !:.Ir
OI,IIr
_ ( 02,IIr

OG,IIr
Notice that
(EEljXj)vec( {~j}) = vec( {Xj~j})
and

OI,GIr)
02,GIr

oG,GIr

(5.4)

SURE models

119

Various least squares estimators have been proposed for estimating the parameters of the SURE model [142]. The most popular are the Ordinary Least
Squares (OLS) and Generalized Least Squares (GLS) approach. The OLS approach implicitly assumes that the regression equations in (5.1) are independent of each other and estimates the parameters of the model equation by equation. The GLS approach utilizes the additional prior information about correlation between the contemporaneous disturbances that is fundamental to the
SURE specification and estimates the parameters of all the equations jointly.
The OLS estimator of vec( {~i} ) in (5.2) and its variance-covariance matrix
are given respectively, by
(5.5)
and

Similarly, the GLS estimator of vec( {~i}) and its variance-covariance matrix
are given, respectively, by

and

Both vec( {bP}) and vec( {by}) are unbiased estimators of vec( {~i}). For L
positive definite and

it follows that
Var(vec( {bP})) = Var(vec( {by}))

+ G(L- 1h )GT

which implies that vec( {by}) is the BLUE of vec( {~i}) in the SURE model
[142]. In some cases the OLS and GLS estimators are identical- for example,
when Vi, j: Xi = Xj, or when Vi i= j: (Ji,j = O. More details can be found in
[141, 142].
In general L is unknown and thus the estimator of vec( {~i}) is unobservable. Zellner has proposed an estimator of vec( {~i} ) based on (5.7) where L is
replaced by an estimator matrix S E ~GxG [150]. Thus

vec({~i}) = ((EBiX{)(S-l h)(EBiXi)r\EBiX{)(S-1 h)vec(Y)

(5.9)

120

PARALLEL ALGORITHMS FOR UNEAR MODELS

is a feasible GLS (FGLS) estimator of vec( {~i})' The main methods for constructing S are based on residuals obtained by the application of the OLS.
The first method constructs 1: by ignoring the restrictions on the coefficients
of the SURE model which distinguishes it from the multivariate model. Let
X E 9tTxk be the observation matrix corresponding to the distinct regressors
in the SURE model (5.2), where k is the total number of these regressors.
Regressing Yi (i = 1, ... ,G) on the columns of X gives the unrestricted residual
vector
iii = PXYi,

wherePx

=h _X(XTX)-IXT. An unbiased estimatorofCJi,j is obtained from

Si,j

= iiT iij/(T -

k)

= yT PxYj/(T -

k),

i,j

= 1, ... , G.

(5.10)

The second method constructs S by taking into account the restriction on the
coefficients of the SURE model. By applying the OLS to each equation in
(5.1), the restricted residuals

11; =PiYi
are derived, where Pi =
obtained from

Si,j

h - Xi (x{ Xi) -I

Xr A consistent estimator of CJi,j is

= aT aj/T = y{ PiPjYj/T,

i,j

= 1, ... , G.

(5.11)

The unbiased estimator of CJi,j is equivalent to (5.11) when T is replaced by


trace(PiPj), the sum of the diagonal elements of PiPj. The feasible estimators
of the SURE model using S and Swere proposed by Zellner and called Seemingly Unrelated Unrestricted Residuals (SUUR) and Seemingly Unrelated Restricted Residuals (SURR), estimators respectively [150, 151].
An iterative procedure is used to obtain the FGLS estimator when 1: is unknown. Initially, S is substituted by Ia in (5.9) so that vec( {by}) == vec( {bP}).
Then, based on the residuals of vec( {bP}), 1: is estimated by S(I). The estimator S{l) of 1: can be used in place of Sin (5.9) to yield another estimate of
vec( {~i}) and 1:. Repeating this process leads to an Iterative FGLS (IFGLS)
estimator of vec( {~i}) [150]. In general the pth IFGLS estimator of vec( {~i})
is given by
vec( {Mp)}) = ((E9iX{)((S(p))-1 h )(E9iXi) r\E9iX{)((S(p))-1 h)vec(Y),
(5.12)
where S(O)

= la , S(p) = OTO/T is the estimator of 1:, 0 = (uA(P-I)


uA(P-I))
I'" a

and a~p-I) is the residual of the ith (i = 1, ... ,G) equation, i.e.
A(p-I) _

ui

,-X,t\(p-I)
'Pi
.

-Y,

SURE models

121

THE GENERALIZED LINEAR LEAST SQUARES


METHOD

A large number of methods have been proposed to solve SURE models


when the variance-covariance matrix is non-singular [132, 141, 142, 148,
150]. These methods require the expensive computation of a matrix inverse
which often leads to loss of accuracy. Here the generalized linear least squares
approach that has been used to solve the general linear model will be investigated. With this approach the coefficients in all equations of the SURE model
are estimated simultaneously by considering the SURE model as a generalized linear least squares problem (GLLSP) [109, 110, 111]. This approach
avoids the difficulty in directly computing the inverse of :t (or its estimate S).
It can also be used to give the BLUE of vec( {Pi}) in (5.2) when :t is singular
[91, 147].
The BLUE of vec( {Pi}) is derived from the solution of the GLLSP
argminllVlI} subject to vec(Y) = (EBiXi)vec( {Pi}) + vec(VeT ),

v, {Pi}
where II IIF denotes the Frobenius norm, i.e.
T

(5.13)

IIVII} == L L V;~j == trace{VTV) == IIvec(V) 11 2 ,


i=1 j=1

:t = eeT, e E 9tGxg has full column rank , rank(:t) = g and V E 9tTxg is


a random matrix defined as veT = U. That is, vec(V) has zero mean and
variance covariance matrix IgT [78].
For the solution of (5.13) consider the GQRD

(5.14a)
and
q

W12)K

(5.14b)

W22 q

GT-K-q

Here, Q E 9tGTxGT and P E 9tgTxgT are orthogonal, Ri E 9tk;xk; and W22 are
upper triangular matrices, (K + q) is the column rank of the augmented GT x
(K + gT) matrix (EBXi (eh)) and K = L~1 ki. The orthogonal matrix QT
is defined as
T _

Q -

(h Q~0) (QI)
_(Q~Q~
QI )
Q~
0

K
GT - K '

(5.15)

122

PARAUELALGORITHMS FOR liNEAR MODELS

where the QRD of Xi (i = 1, ... , G) is given by

T
QiXi=

ki

(Ri)
0 '

with

Qi= (QA,i

T-ki
QB,i ),

(5.16)

and the complete QRD of Q~ (C h) is given by

Q~(Q~(Ch))P=

(g

~2).

(5.17)

If
vec( {Yi})) K
y
q
Y
GT-K-q

QT vec(Y) = (
and

pT vec(V) =

(~)!T

q ,

then the GLLSP can be written as


argmin IWII 2 + IWII 2
V,~, {~i}
subject to (

vec( {Yi}))
~

(tBiRi)
g vec(

{~i}) +

(Wllg W12)
~2 (-)
~.

(5.18)

From this it follows that, if y =I 0, then the SURE model is inconsistent, ~ =


W22 1y, the arbitrary V is set to zero in order to minimize the objective function
and thus,

RiPi=h

i,

i= l, ... ,G,

(5.19)

SURE models

123

where
vec( {hi}) = vec( {Yi}) - W12W221y,

with

hi E 9tki

The variance~ovariance matrix of vec( {Pi}) is given by


Var(vec({pi})) = (EBiRilWll)(EBiRilWll)T.

(5.20)

Algorithms for simultaneously computing the QRDs (5.16) for i = 1, ... , G


have been discussed in the first chapter. The computation of the complete
QRD (5.17) is investigated in the next chapter within the context of solving
simultaneous equations models.

TRIANGULAR SURE MODELS

The SURE model (5.2) is triangular if the exogenous matrix Xi (i = 1, ... , G)


is defined as
(5.21)
where Xo E 9tTxk and Xi E 9tT , that is, ~i E 9tk+i, Xi E 9tTx (k+i), K = G(2k+
G + 1) /2 and the bottom G x G submatrix Of(~l ~2 ... ~e) is upper triangular
[92]. Triangular SURE (tSURE) models can be considered as special cases of
SURE models with subset regressors [68, 78, 133, 142]. For the solution of
tSURE models let C by the lower-triangular Cholesky factor of 1:. Note that,
in the case of singular 1:, some columns of the lower-triangular matrix C will
be zero [51]. Furthermore, let the QRD of Xc be given by

QTXe=

(~),

(5.22)

the upper triangular non-singular (k + G) x (k + G) matrix R be partitioned as

G-i
Rl,i ) k+i
R2',1 G-i
and
T

Q (Yi Vi) =

(y-,
yAII'

Vi) k+ i
Vi T-k-i'

where Vi == V,i is the ith column of V (i = 1, ... , G). For notational convenience a zero dimension denotes a null matrix or vector. Premultiplying the
constraints in (5.13) by (Ie QT) gives

vec(QTY) = EBi(QT Xi) vec ( {~d) + vec(QTVCT )

124

PARAUELALGORITHMS FOR UNEAR MODELS

and after partitioning the lower-triangular (C Ir) matrix as

(CIr) =
CI,I/HI

C2, I h+ I

o
o

CI,IIr-k-1

CG,lh+1

C2,2h+2

0
0
C2,lh
0
0 C2,IIr-k-2

0 C2,2Ir-k-2

0
0

CG,2h+2

0
CG,IIr-k-G 0

CG,IIG-I

0
0

CO,2/G-2
0

CG,oIk+G

CG,2Ir-k-G 0

CG,GIr-k-G

it follows that (5.13) is equivalent to


G

argmin

L (Iivi112 + IIvill2)

{~i}' {Vi}, {Vi} i=1

subject to

{Yi

~ Ri~' +Li~ C',j


I

Yi = gi + Ci,iVi
where

Ii =

e';j) j

v +!;

(i

~ I, ... ,G),

(5.23)

(0 0)

i-I
Ci,j I. . .0 Vj
j=1
I-J

and

i-I
gi =

L C;,j (0

j=1

Ir-k-i) Vj.

Clearly. the first k + 1 elements of Ii E 9tHi are zero and

( Ii)
gi =

(0)

i-I
Itci,j Vj

where Ii E 9ti - 1 is the bottom non-zero subvector of f;. Employing the column
sweep algorithm Ii and gi can be determined in G-l steps [67]. The following

SURE models

125

code illustrates this computation, where, initially,]; = 0 and gi = 0:


1: for i = 1, 2, ... , G - 1 do
2:
for all j = i + 1, ... , G do-in-parallel
3:
4:

({~) := ({~) +Cj,i (~)

(5.24)

end for all

5: end for
For Ci,i i- 0 the second constraint in (5.23) gives
Vi = (Yi - gi)/q,i.
In order to minimize the objective function the arbitrary vj (j = 1, ... , G) and,
for q,i = 0, the arbitrary Vj are set to zero. Thus, if Vi: Cj,i = 0, Yj = gj, then
the tSURE model is consistent and its solution given by

Pi=Ril(Yi-.fi),

i=I, ... ,G.

(5.25)

For 3i E 9lk +a defined as 3T = (y? - .fiT 0), the solution of the upper triangular system R1i = 3i gives Y[ = (P? 0). This implies the solution for r of Rr =

PI' ... '

d produces simultaneously the vectors


Pa, where r = (11 12 ... 1a)
andd= (3 1 ~ 3a ).
In the case of inconsistency, a consistent model can be generated by modifying some of the original observations [57]. Let (gj Yi) be partitioned as

(gj Yi) = (

gP) y~I)) T-k-i-Aj

(2)

gj

A(2)

Yj

'\.

""I

and, for simplicity, assume that yP) = gP) and y~2) - g~2) = hi i- O. That is, for
Cj,j = 0, the tSURE model is inconsistent because of the non-zero Ai element
vector hj. If Q2,i denotes the last Aj columns of Q, then the premultiplication
of the ith inconsistent regression equation
i-I

Yj = Xi Pi + Lq,jVj
j=1

by the idempotent matrix 1- Q2,jQI,j' yields the consistent equation


j-I

Yj=XjPj+ LCj,jvj,
j=1

where Yi = Yi - Q2,jhi. Notice that QI,iXi = 0,

QTYj = (;:)

126

PARAUELALGORITHMS FOR UNEAR MODELS

and

QI,i

i-I

i-I

j=1

j=1

L Ci,jVj = L q,j (0

hi) Vj == g~2).

If modifications to the data are acceptable, then the solution of the modified
consistent tSURE model proceeds as in the original case with the difference
that y~2) is replaced by g~2), which is equivalent to replacing Yi with Yi. The
iterative Algorithm 5.1 summarizes the steps for computing the feasible estimator of vec( {~i}).

Algorithm 5.1 An iterative algorithm for solving tSURE models.


1: Compute the QRD QTXa

= (~)

and Vi: QTYi

= (~:).

2: repeat
3:
Estimate 1: by t
4:
Compute the Cholesky decomposition t = CCT
5: Vi: let (f{ gf) T = 0, where E 9li - 1 and gi E 9l T- k - i
6:
for i = 1,2, ... , G - 1 do
7:
if q,i f:. 0 then
8:
Vi:= (Yi - gi)/q,i
9:
for all j = i + 1, ... , G do-in-parallel

Ii

(Ij)
.= (Ij)
+ C.. (0)
gj.
gj
Vi

10:
11:
12:
13:
14:
15:
16:

17:

],I

end for all


else if Yi = gi then
Vi:= 0
else
The SURE model is inconsistent - STOP or
let Yi := gi and continue
end if

end for

18:
Solve the upper triangular system Rr = d
19: until Convergence
To compute the covariance matrix for ~i and ~ j (j :::; i)

let QT in (5.22) be partitioned as Q = (Ql,i


first k + i columns of Q. That is,

QL (Yi

Q2,i) , where Ql,i comprises the

Xi) = (Yi Ri)

SURE models

and

QTl,iV),_
-

(IHi)
-, + (0!;-i 0)0
0
v)

127

v).
A,

Observing from (5.13) that Yi = Xi~i + L~=I C;,iVi' ~i


written as

=Ril(Yi - Ii) can be

~i = Ri l ( QL(Xi~i + L~=I Ci,iVi) - L~-=" C;,i Ci~i ~) Vi)


=

A
I-'i

+ R-i I ~C
~ i,i (h+i)0 vi,
)=1

from which it follows that E(~i) = ~i. Thus,

Vi,i = Ri l (L~=I L~=I C;,qCi,p (h;q) E(vqv/) (h+ p


=

Ri l (L~=I Ci,pCi,p (h;p

~)) Kt,

(5.26)

which is simplier than with the expressions given in [91, 127, 142].

3.1

IMPLEMENTATION ASPECTS

To demonstrate the practicality of the tSURE algorithm, the performance


of (5.24) and the solution of the upper triangular system Rr = L\ on the 1024processor AMT DAP 510 are considered. To simplify the complexity analysis,
the dimensions of the matrices are assumed to be multiples of es and have at
most es2 rows, where es = 32 is the edge size of the SIMD array processor.
Similar computational details of the implementations have been described in
[67, 69, 73, 75, 77, 79]. All times are given in msec and double precision
arithmetic is used.
The execution time model for the ith step of (5.24) is given by

=0.059(p+ 1- ri/esl) +0.049('t+ 1- ri/esl)'


+0.S65('t+ 1- fi/esl) (p+ 1- ri/esl),
1 = 'tes and G - 1 = pes. Evaluating

<PI ('t,p, i)
where T - k -

G-I

<P2('t,p) =

L <PI ('t,p, i)

i=1

and using backward stepwise regression, the following execution time model
is derived
<P3('t,p) = p(S.997 + 15.23't + 13.95'tp- 4.622~).

128

PARALLEL ALGORITHMS FOR UNEAR MODELS


Computing (5.24), where T - k - 1 = 'tes, G - 1 = /1es and es = 32.

Table 5.1.
't

/1

Execution Time

(1)2('t,/1)

<P3('t,/1)

3
4
5
6
7
8
9
10
11
12

2
2
3
3
4
4
5
5
6
6

239.6
325.7
758.3
929.1
1730.1
2012.4
3292.1
3716.9
5584.7
6180.1

240.5
326.6
760.9
931.6
1732.6
2015.6
3294.0
3717.0
5583.5
6174.1

239.8
326.1
758.4
929.6
1729.0
2013.1
3291.3
3716.2
5585.0
6178.6

The accuracy of <!>2('t,,u) and <!>3('t,,u) is demonstrated in Table 5.1.


Algorithm 5.2 was implemented on the DAP to solve the triangular system
= L\ [67]. An accurate model of the execution time of Algorithm 5.2 is

Rr

<!>4(K,,u) =,u(29.72 + 19.97,u+ 14.291)


+ K(37.79 + 18.81,u+ 28.491 + 14.54,uK),
where k = Kes and G = ,ues. However, the direct implementation of Algorithm 5.2 is inefficient since it fails to take into account that the bottom G x G
submatrix of L\ (and consequently of
is upper triangular. This inefficiency
can be easily avoided by having the computations over the second dimension
of the rand L\ matrices performed in the range (Jj, . , G, where (Jj = 1 if i ~ k,
and (Jj = i, otherwise. That is, steps 2 and 3 of Algorithm 5.2 can be written
respectively, as

r',OJ.. '=L\
. . /R"
.
',Oi.
1,1
and

In this case, the time execution model corresponding to <!>4 (v,,u) is given by

<!>S(K,,u) =,u(41.59 + 17.71,u+4.741)


+ K(37.9+ 33.16,u+ 14.181 + 14.53,uK).
In Table 5.2 Execution Time 1 and Execution Time 2 correspond to <!>4(K,,u)
and <!>s (K,,u), respectively. The improvement in performance, achieved by excluding from the computations zero es x es submatrices of r and L\ (see Chapters 1 and 2), is indicated by the ratio <!>4(K,,u)/<!>S(K,,u).

SURE models

Algorithm 5.2 The parallel solution of the triangular system Rr =


1: for i = k+G,k+G-l, ... , 1 do
2:
r i,: := ~i':/ Ri,i
3:
~l:i-l,: := ~l:i-l,: - Rl:i-I,irT,:
4: end for
Table 5.2.

129

~.

Execution times of Algorithm 5.2 for solving Rr = ~.

Jl

Execution
Time-J

<P4(lC,Jl)

Execution
Time-2

<P5(lC,Jl)

<P4 (lC, Jl) / <P5 (lC, Jl)

3
3
3
4
4
4
5
5
5
6
6
6
7
7

1
4
8
2
4
7
3
5
7
2
5
7
2
7

446
3581
11696
1474
4560
13989
3499
8477
16461
2437
10116
19149
3006
22032

450
3583
11700
1476
4559
13978
3498
8472
16459
2436
10115
19144
3004
22033

447
2466
6589
1298
3273
8281
2828
5854
10151
2204
7207
12238
2744
14518

450
2468
6594
1300
3272
8271
2826
5850
10151
2203
7207
12235
2742
14522

1.00
1.45
1.86
1.14
1.39
1.69
1.24
1.45
1.62
1.11
1.40
1.56
1.10
1.52

COVARIANCE RESTRICTIONS

lC

Efficient estimation methods of SURE and simultaneous equations models


with various fonns of covariance restrictions have been proposed [6, 58, 128].
These methods exploit prior infonnation about the disturbance variances that
reduce the number of parameters in the disturbance covariance matrix. Among
the problems considered is that of estimating a SURE model with variance
inequalities and correlation constraints. These include the case where the
variance--covariance matrix of the disturbances is constrained so that the variance of one disturbance tenn is smaller than that of its successor, and the correlation coefficients between the disturbance tenns lie in the range zero to one.
The SURE model with these covariance constraints will be abbreviated as the
SURE-CC model.
The estimation of SURE-CC was initially investigated by Zellner within
the context of an error-components procedure for introducing prior infonnation on the elements of covariance matrices and was later discussed by Srivastava and Giles [142, 152]. The estimation procedure of the SURE-CC model
can be extended to other multivariate regression models such as simultaneous

130

PARALLEL ALGORITHMS FOR LINEAR MODELS

equations models, and it can be found in modified form in applications involving the estimation of vector autoregressive models [146]. The proposed
estimation procedure applies the generalized least squares (GLS) method to a
transformed model in which the information matrix has a block bi-diagonal
recursive structure and the disturbances have a diagonal variance-covariance
matrix. The GLS estimators come from the solution of normal equations involving Kronecker products.
A computationally efficient and numerically reliable solution to the normal
equations exploits the banded structure of the information matrix, enacts nonliteral computation of the Kronecker products, and avoids computing matrix
inverses [41,42]. Otherwise, estimating SURE-CC models will be computationally expensive even for modestly sized problems, and the solution will be
meaningless for models with ill-conditioned matrices.
Within the context of the SURE-CC model, the elements of 1: are constrained to satisfy
al,1

< a2,2 < ... < aa,G

(S.27a)

and

o< Pi,j < 1,

i, j = 1, ... , G and i # j,

(S.27b)

where Pi,j = ai,j/ Jai,iaj,j. These conditions are obtained when the disturbance terms are defined as Ui = L~=I Ej, where

E(EiE}) = 0 for i # j,

E(Ei) = 0,

(S.28)

and t1i is an unknown scalar (i,j = 1, ... , G). Then ai,j = L~=I t1p for i ::; j,
which imposes the constraints (5.27). Notice that Pi,j > Pi,k for k > j > i and
Pi,j > Pk,j for j > i > k. That is, for i < j the correlation Pi,j decreases for
fixed i and increasing j, or for fixed j and decreasing i. Figure 5.1 plots the
correlations Pi,j, where t1i = i and t1i = l/i.
Writing Ui+! as

Ui+ I = Ui + Ei+ I = Yi - Xi~i + Ei+ I,


the ith regression equation in (5.1) becomes
Zi=~'Yi+Ei'

i=I, ... ,G,

(S.29)

where ZI = YI, Zi = Yi - Yi-I, WI = X.. ~ = (-Xi-I Xi), 'YI = ~I and if =


(~T-I ~n for i = 2, ... ,G [142, 152]. Thus, the SURE-CC model can be
expressed as
vec(Z) = Wvec( {~i}) + vec(1:) ,

(5.30)

131

SURE models

lJi = i

Correlation

45

40

35

30

25

20

15

\0

Correlation

45

Figure 5.1.

where Z = (Zl

40

35

30

25

20

15

\0

The correlations Pi,j in the SURE-CC model for tJi

Z2

W=

zo), 'E = (El

Xl
-Xl
0
0

E2

0
X2

0
0
-X2 X3
0

...

= i and tJi = Iii.

Eo) and
0
0
0

(5.31)

-XO-I Xo

The error tenn vec ('E) has zero mean and variance-covariance matrix e h =
tBilJh.
The GLS estimator of the SURE-CC model (5.30) comes from the application of the GLS method to the model (5.30). In general, e is unknown and
is replaced by a consistent estimator, say E>. Thus, the FGLS estimator of

132

PARALLEL ALGORITHMS FOR LINEAR MODELS

vec( {~i}) in the SURE-CC model is given by


vec( {~i}) = (WT (0- 1 /r)W) -1 W T(0- 1 /r )vec(Z).

(5.32)

Notice, however, that due to the diagonal structure of 0, the FGLS estimator of
vec( {~i}) could come from the solution of the weighted least squares (WLS)
problem
argminll(0-! /r)(vec(Z) - Wvec({~i}))112

{M

= argminllvec(Z) - Wvec( {~i}) 11 2,

(5.33)

{~i}

where Z = Z0-1 and, for ~i = l/Vfti (i = 1, ... ,G),


_

W = (8- 2 /r)W

~IXl

( -~2XI

0
~2X2
v

0...
0
...

- .. -6- .... ~.t}J~~" ~f~"

0
0

0 )
0

:.:.:. '~~~i;~;' 'b~i~' .

(5.34)

That is, the FWLS estimators vec( {~i}) = (W TW)-1 WT vec(Z).


Computing the QL decomposition (QLD) of W
T -

(0)

GT-K

(5.35)

Q W= L K
and defining
QT vec(Z)

==

(~) ~T -

K ,

the FWLS estimator of vec ( {~i}) is the solution of the triangular system
Lvec( {~i}) =

J.

(5.36)

The GT x GT matrix Q is orthogonal and the K x K matrix L is lower-triangular


and non-singular. Furthermore, the matrix L has the block bi-diagonal structural form

ki
ki
k2
L= k3
ko

k2

k3

0
0
LII,
L2,I ~,2 0
0
L32, L33,

kO-1

ko

0
0
0

0
0
0

................................
... Loo
0
0
0
, - I Loo
,

(5.37)

SURE models

133

From (5.36) and (5.37) it follows that

LI,I~I

dl

(5.38a)

and

Li,i~i = tL - Li,i-ItL-I,
where tIr
singular.

== (di tit; ... tiE)

i = 2, ... , G,

(5.38b)

and Li,i (i = 1, ... , G) is lower-triangular and non-

4.1

THE QLD OF THE BLOCK BI-DIAGONAL


MATRIX
The QLD of the block bi-diagonal matrix W can be divided in two stages.
In the first stage an orthogonal matrix, say Q6, is applied on the left of W so
that
T - _

QoW -

(A(O)) GT-K
DO) K
'

(5.39)

where DO) E SRKxK is non-singular and has the same structure as L in (5.37).
In the second stage an orthogonal matrix, say Q;, is applied on the left of
(5.39) to annihilate A(O) while preserving the block bi-diagonal structure of
DO). That is, the second stage computes
T

Q*

(A(O)) _
DO) -

(0)L

GT - K
K
.

(5.40)

Thus, in (5.35), the orthogonal matrix Q is defined as Q = QoQ*.


To compute the orthogonal factorization of the first stage, let the QLD of
ftiXi be given by

i= 1, ... ,G
and
T

Qi,O( -t}iXi-J) =

-(0)) T -k
(A.A~O)
ki

i=2, ... ,G.

'

(5.41)

(5.42)

Partitioning Qi,O as (Qi,O Qi,O) , let Qo = EBiQi,O and Qo = EBiQi,O, where Qi,O E
SRTx(T-ki) and Qi,O E SRTxki. The orthogonal matrix Qo and the matrices A(O)
and DO) in (5.39) are given by
(5.43)

QG,O

134

PARALLEL ALGORITHMS FOR LINEAR MODELS

and
T-kl

(.{(O))
(O)

kl

k2

k3

kG-l

kG

0
0

0
0

0
0

-(0)

0
0

-(0)

-(0)

T-k2

A2

T- k3

A3

T-k4

A4

T-kG

-(0)

L(O)

0)
"(0)

L(O)

"(0)

L(O)

kl

0
AG
................................
1

"(0)

k2

A2

k3

A3

k4

A4

kG

3
"(0)

AG

(5.44)

The second stage, the annihilation of the block-diagonal matrix A(O), is divided in G -1 sub-stages. The ith (i = 1, ... , G -1) sub-stage computes the
G - i orthogonal factorizations
kj-l

QL

kj

kj-l

A-(i_l)) (A-(i)i+j
-

Hj

AY-l) LY-l)

(5.45)

Ay)

Ly)

where j = 1, ... , G - i, leo = 0 and


is lower-triangular and non-singular.
The order of computing the factorizations in the ith sub-stage is inconsequential as they are performed on different submatrices. Thus the factorizations
(5.45) can be computed simultaneously for j = 1, ... , G - i.
Partitioning the orthogonal matrix Qi,j as
T -ki+j
Qi,j=

kj

( dQ~Y) Qt?))T-k +

i j ,

2.'1)

I,}

d2.'2)
I,}

kj

let
G-i

ffiQ~k.'P)
QI~k,p) = W
I,}'

j=l

k ,p=,.
12

(5.46)

SURE models

135

The product of the Compound Disjoint Orthogonal Matrices (CDOMs), Qi, is


given by

Qi=

hi

Q~1,1)

Q?,2)

Q~2,1)

Q~2,2)

hi
0

di,l1,1)

IJli
0
0

Q~,~~i

di,12,1)

0
0

0
0

(5.47)

di,11,2)

Q;,~~i

Q~,~~i

di,12,2)

Q;~~i
,
0

0
0

IJli

where Ai = iT - L~=l kj and Pi = L~=l kG-HI. If A(i) and Vi) are defined as

(-(i))
1{i)

= QT .. . Qf

(-(0))
1(0) ,

i=1, ... ,G-1,

(5.48)

then Vi) has the same structure as VO), and A{i) has non-zero elements only
in the (i + l)th block sub-diagonal (i < G -1). The application of Q41 (i =
1, ... , G - 2) on the left of (5.48) will annihilate the non-zero blocks of A{i),
but will fill in the (i + 2)th block sub-diagonal of A{i) if i < G - 2. It follows
.

(G-i).

~(G-i)

that, m (5.38), Li,i = Li


(I = 1, ... , G) and Li,i-l = Ai
(I = 2, ... , G), and
in (5.40), Q; = Q~-1 ... QI Qf. Algorithm 5.3 effects the computation of the
QLD (5.35). Figure 5.2 shows the application ofthe CDOMs in Algorithm 5.3,
where G=4.

Algorithm 5.3 Computing the QLD (5.35).


1: Compute the QLD (5.41) and (5.42)
2: for i = 1, 2 ... , G - 1 do
3:
for all j = 1,2 ... , G - i do-in-parallel
4:
Compute the orthogonal factorization (5.45)
5:
end for all
6: end for

136

PARAUEL ALGORITHMS FOR liNEAR MODELS

0
0

-(0)
A3

0
0

-(0)
A2

-(0)
A4

L(O)

................
1

A~O)

0
0
0
0
-(1)
A3

0
0

0
0

0
0

0
0

-(1)
A4

o
o
0

............... .
L (1) 0
0 0
1
A(I) L(I)
0 0
2
2
A(1)
L(I)
0
0
3
3
A(O) L(O)
0
0
4
4

0
0
0
0

0
0
0
0

0
0
0
0

0
0
0
0

L(3)
1

0
0

0
0

QT

-4

-(2)
A4

0)

A(O) L(O) 0
3
3
0 A(O) L(O)
4

0
0
0

0
0
0

0
0
0

................

L(2)
1

QT

-4

QT

-2t

A(2) L(2)
0
0
2
2
A(I)
L
(1)
0
0
3
3
A(O)
L(O)
0 0
4
4

................
A~2)

0
0
Figure 5.2.

2)

A(I) L(I)
0
3
3
A(O)
L(O)
0
4
4

Factorization process for computing the QLD (5.35) using Algorithm 5.3.

Alternative annihilation schemes can be applied to compute the factorization


(5.40). One such scheme annihilates the fill-in blocks of ;itO) row by row. In
this case the submatrix;iV) at position (i, i - j -1) of ;itO) is annihilated at step
Si,j

= j

+ 1 + (i-1)(i- 2)/2,

SURE models

137

where i = 2, .. . , G and j = 0, ... , i - 2. However, the annihialtion of a blockrow can commence before the complete annihilation of the preceding blockrows. The annihilation of A~O) can begin at step i, with subsequent steps annihilating the fill-ins. Figure 5.3 illustrates the sequential and parallel versions
of the two annihilation schemes. A number denotes the submatrices that are
annihilated by the corresponding step. The first scheme called the Parallel
Diagonal Scheme (PDS) uses the method described in Algorithm 5.3; its sequential version is called SDS . The Sequential and Parallel Row Schemes are,
respectively, denoted by SRS and PRS.

1
6 2
10 7 J
13 11 8 4
1514 1 9 5
\..
'\..

"

i\..

1
2
J
4
5
~

I'\.

f\..
~

SOS

I\,

'" "-

"'r'\.

"

,\ .
~

......

SRS

J 2
5 4 J
7 6 5 4

5 4 2
7 6 4 2
8 7 6 5 J

10 9 8 7 6

"

rr....

I:~; i'..

Figure 5.3.

654
10 987
15 14 t:l 1 11

J 1

1--

"- ~
PRS

'"

J 2

POS

J 2
5 4 J
7 6 5 4
9 8 7 65
~

1
2 1
J 2 1
4 J 2 1

-,

I\.

'\.

'"

Jjmited POS

"

'\.

r\.

Jjmited PRS

Annihilation sequences of computing the factorization (5.40).

Given the assumption that the independent factorizations (5.45) can be computed simultaneously for all j = 1, ... , G - i (full parallelism), PDS and PRS
apply G - 1 and 2G - 3 CDOMs ,respectively. The superiority of PDS decreases since the number of factorizations (5.45) that can be computed simultaneously is reduced (limited parallelism). In the extreme case in which only
one factorization can be computed at a time, the parallel schemes reduce to
their corresponding sequential schemes, requiring the application of the same

138

PARALLEL ALGORITHMS FOR UNEAR MODELS

number, G( G - 1) /2, of CDOMs. Figure 5.3 also shows an example of limited


parallelism in which at most two factorizations can be performed simultaneously. Note that, for the application of the minimum number of CDOMs, PDS
and PRS require, respectively, G - 1 and LG /2J factorizations to be computed
in parallel. PDS never applies more CDOMs than PRS.

4.2

PARALLEL STRATEGIES

The orthogonal factorization (5.45) can be computed using Householder reflections or Givens rotations. Sequences of CDGRs for computing factorizations similar to that of (5.45) have been considered in the context of updating
linear models [75,69]. Two of these sequences are illustrated in Fig. 5.4, where
the zero matrix, and the matrices Ay-Il, A~2j and Ayl in (5.45) are not shown.
4 3 2 1

8
7
6
5
4

5 4 3 2

. 6 543
7 6 5 4
8 7 6 5

7
6
5
4
3

6
5
4
3
2

5
4
3
2
1

4
5
6
7
8

UGS-l

2
2
3
4
5

1
1
1
2
3



Greedy-l

UGS-3

Figure 5.4.

3
4
5
6
7

8 7 5 3
7 6 4 2
6 5 3 1
5 4 2 1
4 3 2 1


Greedy-l

Givens sequences of computing the factorization (5.45).

The UGSs compute the factorization (5.45) using


Si,j = T -ki+j+kj-l

CDGRs, while the Greedy sequences apply approximately


Si,j

= log2(T -

ki+j ) + (kj - 1) log2log2(T - ki+j )

CDGRs when T kj + ki+j. Thus, the ith CDOM Qi (i = 1, ... , G -1) in


(5.47) is the product of ti = max(Si,I, ... ,Si,G-i) CDGRs and the total number
of CDGRs applied to compute the factorization (5.40) is given by
G-I

T(tI, ... ,tG-d= Lti'


i=I
In the case of UGS-2 and UGS-3 the complexity function T (tI, ... , tG-I) is
given by
G-I

T(tI, ... ,tG-I) = (G-l)(T-l)+ L max(kI-k1+i, ... ,kG-i- kG).


i=I

SURE models

139

Figure 5.5 shows the number of CDGRs required to compute the orthogonal
factorization (5.40), where G = 4, (k I ,k2,k3,k4) = (80,70,10,30) and T =
1000. The entries of the matrices show Si,j - the number of CDGRs required
to annihilate the blocks of A(O). The bold numbers denote tI, ... ,tG-I, that is,
the total number of CDGRs applied in each sub-stage that annihilates a subdiagonal. Clearly, the Greedy sequences perform better than the UGSs in terms
of the total number of CDGRs applied.
80

70
10
0
0
9W ( 0
930 1009
0
0
990 1069 1059 0
970 1049 1039 979

80

30

~)

9W
930

990
970

10
0
0
0
40

30

~)

Greedy sequences: 814 CDGRs

UGSs: 3177 CDGRs


Figure 5.5.
PDS.

70
0
( 270
"
0
272 239
272 239

Number of CDGRs for computing the orthogonal factorization (5.40) using the

Partitioning A(O) and DO) as


A(O)

(1

~)

and L-(0) =

(LL LG0)
-

-(0)

the annihilation of A(0) may be considered as updating the n x n lower-triangular


matrix L by the block-diagonal m x n matrix Ii using orthogonal factorizations, where m = (G - 1) T - rg,2 k i and n = r~l' k i . The employment of
the UGSs and Greedy sequences to perform this updating without taking into
consideration the block-diagonal structures of Ii requires the application of
(G - 1) T + kI - kG - 1 and log2 m + (n - 1) log210g2 m CDGRs, respectively.
That is, the direct application of the UGS and Greedy sequence to compute the
updating require fewer CDGRs than PDS.
Eliminating these entries of the UGSs which correspond to the upper triangular, zero, part of Ii and adjusting the remaining sequences to start from
step one produces the Modified UGS (MUGS) [71]. MUGS applies a total of
(G-l)T - K +2kI - /1 CDGRs, where /1 = min(PI,P2, ... ,PG-I), PI = 1 and
Pi = Pi-I + T - 2ki for i = 2, ... , G - 1. For the example in Fig. 5.5 MUGS
applies 2969 CDGRs. The annihilation of the submatrix A~O) starts at step
Pi-I - /1+ 1 (i = 2, ... , G) when MUGS-2 is used.
The difference between the MUGS and PDS based on CDGRs is that MUGS
starts to apply the CDGRs for annihilating a block subdiagonal before the complete annihilation of the previous subdiagonals. The same method can be used
to derive modified Greedy sequences that will perform at least as well as the
MUGSs. Notice that the theoretical measures of complexities of the parallel

140

PARALLEL ALGORITHMS FOR UNEAR MODELS

strategies used to compute (5.40) hold even if .4(0) is block lower-triangular


[87].

4.3

COMMON EXOGENOUS VARIABLES

The regression equations in a SURE model frequently have common exogenous variables. The computational burden of solving SURE models can be
reduced significantly if the algorithms exploit this possibility. Let X d denote
the matrix consisting of the Kd distinct regressors in the SURE-CC model,
where Kd ::; K = I,g.1 kj. The exogenous matrix Xj (i = 1. ... , G) can be defined as XdSj, where Sj E 5RKdxki is a selection matrix that comprises relevant
columns of the Kd x Kd identity matrix.
Assuming for simplicity that T > K d , let the QLD of X d be given by

QTXd =
d

(0)

T - Kd
Ld Kd

Zj,l = Q~,IZj,

with

Zj,2 = Q~,2Zj,

QT =
d

(Q~
Tdf I) Kd

Kd

d,2

j,l = Q~,lj,

(5.49)

j,2 = Q~,2j,

and define the orthogonal matrix QD as QD = (IaQd,1 IaQd,2). Applying the orthogonal matrix Qb from the left of the SURE-CC model (5.30)
gives
{Zj,l})) = ( 0 ) vec( {~-}) + (vec( {j,I})) .
( vec(
vec({Zj,2})
LD
I
vec({j,2})
Thus, the SURE-CC model estimators vec( {~j}) arise from the solution of the
reduced sized model

(5.50)
The variance-covariance matrix of the disturbance vector vec( {j,2}) is given
by eIKd, and the matrix LD has the structure

LdSI
-LdSI

LdS2
0
-LdS2 LdS3

o
o
o

(5.51)

The equivalent of the QLD (5.41) and the orthogonal factorization (5.42)
are now given, respectively, by

T (V d)
Qj,O 'f}jL Sj =

(0)
L~O)

Kd -kj
kj
,

i= 1, ... ,G

(5.52)

SURE models

and
T

Qi,O (-t}i L Si-l) =

-(O))
(AA~O)
.

K -ki
ki
'

i=2, ... ,G.

141

(5.53)

The QLD (5.52) may be considered as the re-triangularization of a lowertriangular matrix after deleting columns. Sequential and parallel Givens sequences to compute this column-downdating problem have previously been
considered [71, 74,87]. A Givens sequence employed to compute (5.52) may
result in matrices A~O) and A~O) in (5.53) that are non-full. Generally, the order
of fill-in of the matrices A~O) and A~O) will not be the same if different Givens
sequences are used in (5.52). Thus, in addition to the number of steps, the
order of the fill-in for the matrices A~O) and A~O) will also be important when
selecting the Givens sequence to triangularize (5.52). In the second stage, the
and AV-l)
have fewer zero entries than A(j)
and AV)
respecmatrices A(j-l)
1
1
1+1
I'
tively.
Consider the special case of proper subset regressors of a SURE-CC model
[78,86]. The exogenous matrices are defined as
ki -ki+l

Xi= (Xi

(5.54)

which implies that X d = Xl, Kd = kl and Sf =


(5.49) is partitioned as

Ld =

(0 hi). Furthermore, if Ld in

kl -k2

k2- k3

Ll,l

kG

kl -k2
k2- k3

~,l

~,2

kG

LG,l

LG,2

LG,G

then LdSi (i = 1, ... , G) is already lower-triangular and is given by

kl-ki
ki-ki+l
LdSi = ki+l -ki+2
kG

ki-ki+l

ki+l - ki+2

kG

D
1,1

Li+l,i

Li+l,i+l

LG,i

LG,i+l

LG,G

142

PARALLEL ALGORITHMS FOR liNEAR MODELS

Thus, after computing the QLD of Xl, the next main step is the triangularization of the augmented matrix in (5.44), which now has a sparse structure with
repeated blocks. This simplifies the computation of factorization (5.40). For
G = 4, the matrix (5.44) is given by

(A(O)
DO)

_L(2)
1,1

_L(3)
2,2

_L(4)
3,3

L(I)
1,1

L(I)
2,1

L (1)
2,2

()

L(I)
3,1

L(I)
3,2

L (1)
3,3

L(I)
4,1

L(I)
4,2

L(I)
4,3

L(I)
4,4

42,2)

L(2)
3,2

L(2)
3,3

_L(2) _L(2) _L(2) _L(2) L(2)


4, I
4,2
4,3
4,4
4,2

L(2)
4,3

L(2)
4,4

L(3)
3,3

L(3)
4,4

......................................................... ,
_L(2) _L(2)
2,1
2,2

_L(2) _L(2) _L(2)


3,1
3,2
3,3

(5.55)

.........................................................
0

_L(3) _L(3)
3,2
3,3

_L(3) _L(3) _L(3) L(3)


4,2
4,3
4,4
4,3

.........................................................
0

_L(4) _L(4) L(4)


4,3
4,4 4,4

where L~1 = f}kLi,j, k = 1, ... ,G, i = k, ... ,G and j = 1, ... ,i.


As in the case of Algorithm 5.3, G-l CDOMs are applied to annihilateA(O),
where the ith (i = 1, ... ,G - 1) CDOM Qi is the product of G - i orthogonal
matrices. The orthogonal matrix Q[j (i = 1, ... , G - 1 and j = 1, ... , G - i) in
(5.46) is defined as
Q!.
I,J

= _1_ (
'tj,v+I

f}i

-~i+I,V+11

~i+J,v+1I)
t}1
J

'

(5.56)

SURE models

143

and the factorization in (5.45) is simplified to

-IJP,v

_dj-t:I,V+l)

(j)

T(j),

"'v,J

0 .,'

.. ,_IJJ.~V+I)
...

_IJJ.:I,V+I)) _

,J

pj,V+lLv,v

'tj,v+ILv,j

.,'

L'v,v

0)

'tj,v+ILv,v

'

(5.57)

where

v=i+j-l,

(m,n)
L"I,}

and

= mnLiJ'
"

For J.l < A CJ.l > 0 and A = 1, ... , G) the values of ~1l'A. and
the recurrence

't1l,A.

are defined by
(5.58a)

and
if J.l = A,
otherwise.

QL

(5.58b)

Note that
is an extension of a 2 x 2 Givens rotation, in which the sine and
cosine elements are multiplied by an identity matrix. Furthermore, for i > 1 the
factorization (5.57) annihilates multiple blocks. These blocks are the fill-ins
which have been generated by the application of QT-I,j+l. The factorization
(5.57) affects the (i + j)th and ((j - 1)(2G + 2 - j) /2 + i)th block ofrows of
..4:(0) and DO), respectively, where DO) is partitioned as in (5.55). A block of
rows of DO) is modified at most once.
The triangularization pattern of (5.55) is illustrated in Fig. 5.6. An arc shows
the blocks of rows that are affected by the application of a CDOM. The letters
A and F, a shaded box and a blank box denote, respectively, an annihilated,
a filled-in, a non-zero and a zero block. A filled-in block that also becomes
zero during the annihilation of the block A is indicated by Z.
Let Li:,j: denote the block submatrix
(
L"'J"
"' . =

and define

L'',J,
:.

..

LG,j .. :

o.. ) ,

L~,G

144

PARALLEL ALGORITHMS FOR LINEAR MODELS

I FF

I,

f'-.

IA
FF

l.
rt .....

"

\..

'"
~

i\.
~

~
~

I'\..

I
\

~
~

ZZ IA

i'-

i',

i\..

r;..
~

"- ~

-,
~

"

r"-

l"'i rt' I\.

Sla ge J

Sla ge 2

Figure 5.6.

I\..

Sla ge I

"- -\..,

:-

'\..

rio..

ZIA

F F F

~ "-

'I".

~ ~

Sla geO

ZA

IA

"

I\.
1-

"-

Annihilation sequence of triangularizing (5.55).

and

where kG+ 1 = 0 and 'ti,G+ 1 = 1 for i = 1, . .. ,G. Straightforward algebraic manipulations show that, in (5.38), Li,i and Li,i-I are given by T;Li:,i: and PiLi: ,i-b
respectively. That is,

o
'ti,i+2 L i+l,i+1

(5.59)

SURE models

and
Pi,i+lLi,i-1
Li,i-I

= ( Pi,i+2~i+I'i-1

Pi,i+ILi,i

Pi,i+2 L i+l,i Pi,i+2 L i+l,i+1 :::

145

0)
~

(5.60)

Pi,G+ILG,i-1 Pi,G+ILG,i Pi,G+ILG,i+1 ... Pi,G+ILG,G

Confonnally partitioning d; in (5.38) as

d; =

d1,1.. )

d")
= (d~'i
],1

d~'i

and letting

dj:,i

'

(5.61)

it follows that
(5.62a)
and
(5.62b)
for i = 2, ... ,G. The estimators ~I' ... ' ~G can be derived simultaneously
=
by solving for B the triangular system of equations LIB = lJ, where

(~i,i ... ~G,i)'

B~

~l,l
~2,1

~2,2

~G,l ~G,2
and

D~

dl,1

d2,1

d2,2

dG ,I

dG2
,

~D
JD

Thus, the main stages in solving the SURE-CC model with proper subset
regressors are (1) compute the QLD of Xl (2) evaluate the simple recurrence
for 't~ n in (5.58) for m = 1, ... , G and n = 2, ... , G + 1, and (3) solve a lowertrian~lar system, where the unknown and right-hand-side matrices have lower
block-triangular structures.

Chapter 6

SIMULTANEOUS EQUATIONS MODELS

The estimation of simultaneous equations models (SEMs) is of great importance in econometrics [34,35,62,64,96, 124, 130, 132, 149]. The most commonly used estimation procedures are the Three Stage Least-Squares (3SLS)
procedure and the computationally expensive maximum likelihood procedure
[33,60, 97, 106, 107, 119, 120, 143, 153]. Here the methods used for solving
SURE models will be extended to 3SLS estimation of SEMs. The ith structural
equation of the SEM can be written as
Yi=Xi~i+}j'Yi+Ui'

i= 1, ... ,G,

(6.1)

where, for the ith structural equation, Yi E S)tT is the dependent vector, Xi is the
T x ki matrix of full column rank of exogenous variables, 1'1 is the T x gi matrix
of other included endogenous variables, ~i and 'Yi are the structural parameters
to be estimated, and Ui E S)tT are the disturbance terms. For ~ == (Xi 1'1), aT ==
(~T if) and U = (Ul ... uG) the stacked system of the structural equations
can be written as

or as
G

vec(Y) =

(EB WSi) vec( {ai}G) +vec(U),

(6.3)

i=1

where W == (X Y) E S)tTx(K+G), X is a T x K matrix of all predetermined


variables, Y == (Yl ... YG), Si is a (K + G) x (ki + gi) selector matrix such that

148

PARALLEL ALGORITHMS FOR UNEAR MODELS

= ~ (i = 1, ... ,G), and vec(U) == (uf ... ub) T. The disturbance vector
vec(U) has zero mean and variance-covariance matrix r. IT, where r. is a
G x G non-negative definite matrix. It is assumed that ei = ki + gi ~ K, that
is, all structural equations are identifiable. The notation used here is consistent
with that employed in the previous chapter and, similarly, the direct sum EB~I
and set operator {'}G will be abbreviated by EBi and {. }, respectively.
The 2SLS and Generalized LS (GLS) estimators of (6.3) are defined, respectively, from the application of Ordinary LS (OLS) and GLS to the transformed
SEM (hereafter TSEM)

WSi

vec(XTy) = (EBiXTWSi)vec( {Oi}) + vec(XTU),

(6.4)

where vec(XTU) has zero mean and variance-covariance matrix r.XTX.


That is, the 2SLS and GLS estimators are given, respectively, by
~(i)
u2SLS=

(T
Si WTXX TWSi )-1 SiTW TXX TYi,

l=

1, ... ,G

and
vec( {3 i }) = ( (Ef7iSf) (r.- I WTPW) (EBiSi) ) -I (EBiSf)vec(WT pyr.- I ),
where P = X(XTX)-IXT. The computational burden in deriving these estimators can be reduced if the TSEM (6.4) is premultiplied by IG (R(I) - T, where
R(l) is the upper triangular factor in the QRD of X. That is, the TSEM can be
written as

or
(6.5)

where 0 = QfU, and the Qs and Rs come from the incomplete QRD of the
augmented matrix W == (X Y) given by

K
T

Q W-

(R(l)

G
R(2) K
(3)
,
R
T-K

.
T _
WIth Q -

(QT)K
Qr T-K

(6.6)

where Q E ~TxT is orthogonal, R(l) is upper triangular and non-singular, and

(6.7)
Note that vec(O) has zero mean and variance-covariance matrix r.IK.

Simultaneous equations models

149

The 3SLS estimator, denoted by vec( {B}), is the GLS estimator with L replaced by its consistent estimator t based on the 2SLS residuals [9, 33, 60].
Computing the Cholesky decomposition
(6.8)
the vec( {B}) estimator is the solution of the normal equations
(6.9)
where

e is a G x G non-singular upper triangular matrix,


(6.1Oa)

and
(6.1Ob)
It is not always the case that the disturbance covariance matrix of a simultaneous equations model (SEM) is non-singular. In allocation models, for example, or models with precise observations that imply linear constraints on the
parameters, or models in which the number of structural equations exceeds
the number of observations, the disturbance covariance matrix is singular
[31, 62, 149]. In such models the estimation procedure above fails, since i:
l does not exist. The Generalized Linear Least Squares apis singular, i.e.
proach can be used to compute the 3SLS estimator of SEMs when t is singular
or badly ill-conditioned.

e-

GENERALIZED LINEAR LEAST SQUARES

The methods described in the previous chapter for solving SURE models
can be extended to 3SLS estimation of SEMs [68, 78, 86, 91, 110, 111]. Let
the TSEM(6.5) be rewritten in the equivalent form
(6.11)

where the rank of t = eeT is g ~ G, E 9\Gxg has full column rank, and V
is a random K x g matrix, defined as veT = Qfu. That is, vec(V) has zero
mean and variance-covariance matrix IgK. With this formulation, the 3SLS
estimator of vec( {Oi}) comes from the solution to the generalized linear least
squares problem (GLLSP)
argmin IIVII~ subject to vec(R(2)) = (EBiRSi)vec({oi})+vec(VeT), (6.12)
{Oi},V

150

PARALLEL ALGORITHMS FOR LINEAR MODELS

which does not require that the variance~ovariance matrix be non-singular.


For the solution of (6.12) consider the following QRDs involving (Ef)jRSj)
and (C(l)h):

y(1))E
y(2)

y(3)

GK-E-q

(6.13)

and
q

L12)E
q

~2

(6.14)

GK-E-q

Here E = L.~1 ej, R == Ef)jR(i), and the R(i) E 9\ejxej and ~2 are upper triangular non-singular matrices; Q and P are GK x GK and gK x gK orthogonal
matrices, respectively; and E + q is the column rank of (( Ef)jRSj) (C (l) h) )
[4,51,100,113]. The orthogonal matrix Qis defined as
-T _

Q -

(Ie0

Q~

Q
E
) (QQ~-T) - (-T)
Q~Q~ GK - E '
A

(6.15)

where the QRD of RSj (i = 1, ... ,G) and the complete QRD of Q~(C(l)h) are
given, respectively, by
(6.16)
and
(6.17)

(6.18)

(6.19)

Simultaneous equations models

Conformally partitioning
SEM is consistent iff

VT

= vec(V) TP as
y(3)

151

(V[ VI), it follows that the

= 0,

(6.20)

V2 is the solution of the triangular system


~2V2 =

y(2)

(6.21)

and the arbitrary vector VI is chosen to be zero. The 3SLS estimator is the
solution of the block-triangular system

Rvec({o;})

=y(l)

-L12V2,

which can be equivalently written as

R(i) 8; = yP) - h;,

i = 1, ... , G,

(6.22)

where 8; E ~ei corresponds to the 3SLS estimator of 0;, and

L12V2 =

(~I) ~l

(6.23)

hG kG

Elementary algebraic manipulation produces

8; =

0;+ (R(i)r1A;VI,

implying that E(8;) = 0; and that the covariance matrix between 8; and
given by
i,j=l, ... ,G,

8j

is

(6.24)

where rp (p = i,j) is the solution of the triangular system R(p)rp = A p , L[I =


(A[. .. Ab) and A~ E ~epx(gK-q) [78].

1.1

ESTIMATING THE DISTURBANCE COVARIANCE


MATRIX

A consistent estimator of L, say

t, is computed as

where 0 = (al' .. aG) denotes the residuals of the structural equations. Initially, 0 is formed from the residuals of the 2SLS estimators
~(;)

u2SLS

(R-(i)-I_(I).
1
G
Y; ,
= , ... , ,
l

152

PARALLEL ALGORITHMS FOR LINEAR MODELS

that is,
A

Ui

Since OTO

= Yi -

WS iV2SLS'
~(i)

= 1, ... , G .

(6.25)

= OT QQTO, premultiplication of both sides of (6.25) by QT gives


(6.26)

where, in (6.6), QTW = R == (R(1) R(2)). Then, residuals iteratively based


on 3SLS estimators are used to recompute 0, until convergence has been
achieved.
1ft is computed explicitly, then C in (6.12) could be obtained by removing
the G - g zero columns of the Cholesky factor C in (6.8) [93]. An alternative
numerically stable method is to compute the QLD of O. That is,

Q~O= (~J~-G

(6.27)

from which it follows that OTO = L~Lu and C = LUT. Note from (6.27)
that, if the number of structural equations exceeds the number of observations
in each variable - the so-called undersized sample problem - then 0, and
consequently t, will be singular with rank g ::; T < G. If 0 is not of full
column rank then C may be derived from the complete QLD of 0 [87].

1.2

REDUNDANCIES

Under the assumption that the consistency condition (6.20) is satisfied, factorizations (6.13) and (6.14) show that GK - E - q rows of the TSEM (6.11)
become redundant due to linear dependence [57]. Let Q(; comprise the last
GK - E - q columns of Qc and
N = Q~Q~

== (N(l) ... N(G)) ,

where N(i) is a (GK - E - q) x K matrix (i = 1, ... ,G). The elements of the pth
row of N, denoted by Np ,:, can reveal a linear dependency among the equations
of the TSEM (p = 1, ... ,GK - E - q). Premultiplication of the TSEM by N p ,:
gives

or
G K
G K
'"
..i '"
..i N(i)
p,t R(~)
t,1 = '"
..i '"
..i N(i)
p,t R(i~
t,.

i=lt=l

i=lt=l

o + '"
..i '"
..i N(i)\i.
p,t t,1. =
I

i=lt=l

0,

(6.28)

Simultaneous equations models


where V

153

= (V,I'" V,g), V,j E 9tK (j = 1, ... ,g) and


g

Vt ,i = L Ci,j vt ,j,

= 1, ... , G.

j=1

Assume that the ,uth equation of the Ath transformed structural equation
(6.29)
occurs in the linear dependency (6.28) - that is, N~~

=I O. Writing (6.28) as

G K
G K
~ ~ N(i)R(~)+NCA.)R(2) = ~ ~ N(i)(R(i)B+V, ')+N(A)(R(~)B +v. )
..i ..i p,t t,1
P,11 11,1..
..i ..i p,t t,. I
t,1
P,11 11, A
11,1..,

i=1 t=1
i#/..tf-11

i=1 t=1
if-At=ll1

it follows that (6.29) may be equivalently expressed as

_ (A)

N pl1

, if-A tf-11

Observe that, if

L LN~:~R~~) =
i=lt=1

_ (A)

N pl1

L LN~:~(R~:lBi + V"i).
i=lt=1

' if-A t=ll1

Q~ = (Qg) ... Q~G) ),

then N(i)

= Q~) Q~,i'

Furthermore, if

fj~,i and iJi,p denote the pth row and column of Q~) and Q~,i' respectively, then
N(i)/N(A)
AT - / AT p,t
P,11 = q p,iqi,t q p,A qA,/1'

1.3

INCONSISTENCIES

The SEM is inconsistent if y(3) in (6.13) is non-zero. Unlike the case of


inconsistent SURE models, setting y(3) to zero will result in incompatibilities in
the specification of the SEM [66]. To illustrate this, assume for simplicity that
y(3) = Nvec(R(2)) is uniformly non-zero, where N = Q~Q~. Premultiplying
the TSEM (6.11) by the idempotent matrix D = (IGK-E-q -NTN) gives
vec(QfY) - N Ty(3) = (ffiiRSi)vec( {Bi}) + vec(VCT )
from which it can be observed that

QT ( vec(QfY) _ NT y(3)) =

(0)
(vec({y\I)}))
y(2)
_ _0
==
y(2)
.
(vec({:y~1)}))
y(3)
y(3)
0

If vec(QfY) denotes the modified vector vec(QfY) in the TSEM such that
vec(QfY) = vec(QfY) - NT y(3),

154

PARALLEL ALGORITHMS FOR LINEAR MODELS

then
vec(f) = Dvec(Y) + vec(Q2r),
where ris a random (T -K) x Gmatrix andD = (IGQl)D(IGQf). Thus
premultiplication of (6.3) by D gives the consistent modified SEM

or, equivalently,

where

Thus, the above modified model is incompatible with the specification of the
original SEM, since the replacement of Y by QIQ[f contradicts the replacement of Y by Y - Q2R(3) in W. Further research is needed to determine the
modification in the endogenous matrix Y that yields a consistent and correctly
specified model.

MODIFYING THE SEM

It is often desirable to modify the SEM by adding or deleting observations


or variables. This might be necessary if new data become available, old or
incorrect data are deleted from the SEM, or variables are added or deleted
from structural equations. First consider the case of updating the SEM with
new data. Let the additional sample information be given by
(6.32)
where W; = WSi == (Xi Yi) E 9t txei ; X E 9ttxK is the matrix of all predetermined variables in (6.32); E(vec(O)) = 0 and E(vec(O)vec(Of) = r.It.
Computing the updated incomplete QRD
G

R(2))K
R(3)

t'

(6.33)

the 3SLS estimator of the updated SEM is the solution of


argmin IIVII} subject to vec(R(2)) = (E9iRSi)vec( {ai }) + vec(VCT ), (6.34)

{ai},V

Simultaneous equations models

155

where H(J) is upper triangular, H= (H(1) H(2)), and t = CCT is a new estimator of L. The only computational advantage in not solving the updated
SEM afresh is the use of the already computed matrices R(l) and R(2) to construct the updated TSEM. The solution of (6.12) cannot be used to reduce the
computational burden of solving (6.34).
Similarly, the downdating problem can be described as solving the SEM
(6.3) after the sample information denoted by (6.32) has been deleted. If the
original matrix W is available, then the downdated SEM can be solved afresh
or the matrix that corresponds to R in (6.7) can be derived from downdating the
incomplete QRD of W [39, 50, 51, 75, 108, 109]. However, as in the updating
problem, the solution of the downdated TSEM will have to be recomputed
from scratch.
Assume that the additional variables denoted by WSi E SRTxei have been
introduced to the ith structural equation. After computing the QRD
with
the matrix computations corresponding to (6.16) and (6.18) are given respectively by

and

e.+e.
.
(Yv(l))
A*
K
I

Yi

A'
-ei-ei

where QB,i = QB,iQB,i and QA,i = (QA,i QB,iQA,i)' Computing the complete
QRD of Q~ (C h) as in (6.17) and the equivalent of (6.19), the 3SLS solution
of the modified SEM can be found using (6.22), where Q~ = EBiQ~ i and, as in
'
the updating problem, CCT is a new estimator of L.
Deleting the W Si data matrix from the ith structural equation is equivalent to
re-triangularizing k{i) by orthogonal transformations after deleting the columns
k{i) Si (i = 1, ... , G). Thus, if the new selector matrix of the ith equation is
denoted by Si, and the QRD of k(i) Si is given by

ei-ei
with

Qi = (

QAA,i

(6.35)

156

PARALLEL ALGORITHMS FOR UNEAR MODELS

then (6. 17)-{6. 19) need to be recomputed with QA,; and QB,; replaced by QA,;QA,;
and (QA,jQB,; QB,i) , respectively.
Now consider the case where new predetermined variables, denoted by the
T x k matrix g, are added to the SEM. The modified SEM can be written as
vec(Y) = (EEljws;)vec( {~j}) + vec(U),

g Y), Sj is a (K + G + k) x (ej + kj )

where W == (X
as

selector matrix defined

and

Computing the incomplete QRD


AT

(Tg
Q2

(3)) _

(!?(l)
0

QfR(3)) k
QI R(3) T - K -

k '

with

it follows that the modified TSEM can be written in the form


vec(R(2)) = ( EElRSj)vec( {~j}) + vec(V(Y) ,

(6.36)

where now V and vec( {~j}) are a (K + k) x g matrix and an (E + L~l kj)element vector, respectively, and
V(l) _ (R(l)

g)
Qf
!?(l) ,

The solution of (6.36) can be obtained as in the original case. However, the
computation of the QRDs of RSj (i = 1, ... , G) can be reduced significantly if
both sides of (6.36) are premultiplied by the orthogonal matrix (QA QB)T,
where

QA == EEljQA,j, QB == EEl;QB,j,
Qv A,I =
-

0)

(QA,;
0 /.

ki

QB,j == (QB,j 0) for i = 1, ... ,G. In this case the upper triangular factor
in the QRD of RSP) is given by the already computed RSP).

and

Simultaneous equations models

157

LINEAR EQUALITY CONSTRAINTS


Consider the solution of the SEM (6.3) with the separable constraints
(6.37)

where Hi E 9td;xe; has full row rank, ~i E 9td;, d == L~1 di, and di < ei (i =
1, ... , G). The constrained 3SLS estimator can be found from the solution of
argmin IIVII} subject to {

{Oi}'V

vec(R(2)) = (EBiRSi)vec( {Oi}) + vec(VCY)


vec( gi}) = (EB#i)vec( {od)
(6.38)

which, under the assumption that the consistency rule (6.20) is satisfied, can
be written as

L12)
)
~2 (-~~.

(6.39)

Computing the QRD

i= 1, ... ,G,
with

di
Q,.,(i))e.
12
I
Q"'(i)
22
let

and

d.'
I

(6.40)

158

PARAUELALGORITHMS FOR UNEAR MODELS

The constrained 3SLS solution can be derived analogously to the solution


of the original problem after computing the complete QRD

Q~ (~l

t) P

and
AT

Qc

3.1

gK-q

( *)
Y2

y(2)

~2)q

o d+q-q

Y
(A(2))
:9(3) .

BASIS OF THE NULL SPACE AND DIRECT


ELIMINATION METHODS

The basis of the null space (BNS) method and the direct elimination (DE)
method are alternative means for solving the constrained 3SLS problem. Both
methods reparameterize the constraints and solve a reduced unconstrained SEM
of E - d parameters [8, 14, 77, 93, 130, 131, 145]. Consider the case of separable constraints (6.37). In the BNS method the coefficient vector ai is expressed
as
(6.41)
where the QRD of

Hr is given by
di
with

Qi= (QA,i

(6.42)

Li E 9tdjxdj is anon-singular lower-triangular matrix, L(f}i = ~i (from (6.37,


and ~i is an unknown non-zero (ei - di)-element vector. Substituting (6.41)
into (6.11) gives the reduced TSEM

Premultiplying both sides of (6.43) by

or

QT gives

159

Simultaneous equations models

where it is assumed that (6.20) holds, V2 is defined as in (6.21), i.e. V2 =

L2ly(2) , hi is defined in (6.23),


PT vec(V) =

(VV21) ' Yi=Yi


v

-(1)

... - -(i) .
-R-(i) QA,i1'}i
and J{i=R QB,i (z=1, ... ,G).

Once the estimator of ~i' say ~i' is derived from the solution of the GLLSP
argmin IWI!l2 subject to vec( {Yi - hi}) = (EBiRi)vec( gi}) + Lll VI, (6.46)
gi},VI
then the constrained 3SLS estimator of Oi can be found from (6.41) with ~i
replaced by ~i'
In the direct elimination method, the QRD
(6.47)
is computed, where ITi is a permutation matrix and Li E 9td;xd; is a nonsingular lower-triangular matrix. If
IT!!:' , u,-

(~i)
ei-di
Oi di

(6.48)

'

Li1'}i = Qf~i and LiLi = Li, then Bi can be written as


(6.49)
Furthermore, if SiITi = (Si
vec(R

Si) then (6.11) can be written equivalently as

(2))_(
A
-)( veC({~i}))
AT)
- EBiRSi EBiRSi vec({8i-Li~i}) +vec(VC

or
vec(R(2)) - vec( {RSi8i}) = (EBiR(Si - SiLi)vec( {~i}) + vec(VCT ).
As in the BNS method, the premultiplication of the latter by

{Yi}))
( vec(
y(2)

QT gives

= (EBiRi) vec( {~.}) + (Lll L12) (~I)


0

'

Lz2

V2'

(6.50)

or

vec({Yi-hi}) = (EBiRi)vec({~i})+LllVI'

(6.51)

160

PARALLEL ALGORITHMS FOR LINEAR MODELS

where now Yi = yP) - R),~i' R(i) (Si Si) =


for i = 1, ... ,G. The solution of the GLLSP
argmin

{Bi }, VI

(R~i) R)) and Ri = R~i) - R) Li

IIVi 112 subject to vec( {Yi - hi}) = (EBiRi)vec( {Bi}) + L11 Vi, (6.52)

will give an estimator for Bi, which is then used in (6.49) to compute ~i (i =
1, ... ,G). Finally, the constrained 3SLS estimator of Oi is computed by
IIi

( Bi)
~i

The above methods may be trivially extended for the case of cross-section
constraints.

COMPUTATIONAL STRATEGIES

The QRD and its modifications are the main components of the approach to
computing the 3SLS estimator given here. Strategies for computing the factorizations (6.16) and (6.17), when :t is non-singular, are investigated. The
QRDs of RSi (i = 1, ... , G) in (6.16) are mutually independent and can be
computed simultaneously. In the first chapter various strategies have been
discussed for the parallel computation of the QRDs. However, the particular structure of RSi can be exploited to reduce the computational burden of
the QRD (6.16). The QRDs (6.16) can be regarded as being equivalent to retriangularizing an upper trapezoidal matrix after deleting columns, where the
resulting matrix has at least as many rows as columns. Let Si == (eA..I, I ... eA..I,e,.)
and define the ei--element integer vector cri = (Ail ... Ai k. K ... K), where e1"'i..,}
(j = 1, ... , ei) is the Ajth column of iK+G and Ai, I < ... < Ai,e;' Figure 6.1
shows a Givens annihilation scheme for computing the QRD (6.16), where
cri = (3,6,7,9,12,15,15,15) and gi = 3. Generally, the total number of rotations applied to compute the QRD (6.16) for i = 1, ... , G is given by
,

G g;+k;

TI(cri,ki,gi, G) =

L L (cri,j- j)

i=I j=I

~~

((t

( J i r ki(k;

+ 1)/2) + Ki(2K -ZIG -

Ki -

1)/2).

(6.53)

Note that (6.53) gives the maximum number of rotations for computing the
QRDs (6.16). This can possibly be reduced by exploiting the structure of the
matrices RSi (i = 1, ... ,G), which depends on the specific characteristics of
the SEM. To illustrate this, consider the case where RSi =

(R~i) R)) and

Simultaneous equations models

1 6
5 10
4 915
3 814 22
713 21
12 20 39
111
~1

~O

18 ~8 37 ~5
17 ~7 36 ~
16
~5 34 ~2
r2.:I ~1 kil
~3 32 ~O

Figure 6.1.

RS j =

Givens sequence for computing the QRD of RSi.

(R~i) R~j)) for some i '=I j. Conformally partitioning R(i) as

illil = (ill'l

iln '" erl

it follows that R(J) can be derived from the QRD

where

~~D '

161

162

PARALLEL ALGORITHMS FOR UNEAR MODELS

Thus, the number of rotations to compute the QRD of RSj is determined by the

RVi .

triangularization of the smaller submatrix


Parallel strategies for computing the QRD of RSi (i = 1, ... ,G) have been
described within the context of up- and down-dating the ordinary linear model
(Chapter 3). However, these strategies are efficient for a single matrix factorization. Their extension to compute simultaneously a set of factorizations by
taking into account the fact that the matrices might have common variables
needs to be investigated.
Consider the computation of the orthogonal factorization (6.17) when (; is
non-singular; that is, Qc = IGK-E, g = G, q = GK -E, and

-)
PT ( (CAT h )QB
Let P be defined as P = (QA

= (0
LT )
22

E
.
GK-E

(6.54)

QB) P such that

T AT
-T Q
AT
P (C h)QB = P (-T)
Qt (C h)QB

-(I)

K-el

K-e2

K- e3

K-eG

el

e2

-(I)
A 2,1

e3

A 3,1

-(I)

eG

AGI,

AG2
,

AG3
,

-(I)

K-el

lY)
I

K-e2

A2,1

K- e3

A 3,1

A32
,

lY)
3

K-eG

..1(1)
G,I

AG2
,

-T -

-(I)

-(I)

A(I)
A(I)

A(1)

A32
,
-(I)

1)

A(l)

-T

AG3
,
-

(6.55)

lY)
G

A(I)

A(I)

..

9(1)

whereAi,j = Cj,iQA,iQB,j and Ai,j = Cj,iQB,iQB,j for l > j, and Li = Ci,ih-ei


(i = 1, ... ,G). The matrix pT is the product of orthogonal matrices that reduce
(6.55) to lower-triangular form. Parallel strategies have been previously described within the context of updating a lower-triangular with a block-lower
triangular matrix (Chapter 3) and for solving SURE models with variance inequalities and positivity of correlations constraints (Chapter 5).

References

[1] Active Memory Technology (AMT) Ltd. Fortran-Plus enhanced, 1990.


[2] Active Memory Technology (AMT) Ltd. AMT General Support Library, 1992.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and
D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 1992.
[4] E. Anderson, Z. Bai, and J. J. Dongarra. Generalized QR factorization
and its applications. Linear Algebra and its Applications, 162:243-271,
1992.
[5] H. C. Andrews and J. Kane. Kronecker matrices, computer implementation, and generalized spectra. Journal of the ACM, 17(2):260-268,
1970.
[6] M. Arellano. On the efficient estimation of simultaneous equations with
covariance restrictions. Journal of Econometrics, 42:247-265, 1989.
[7] O. Axelsson. Iterative Solution Methods. Cambridge University Press,
1996.
[8] J. L. Barlow and S. L. Handy. The direct solution of weighted and
equality constrained least-squares problems. Siam Journal on Scientific
and Statistical Computing, 9(4):704-716, 1988.
[9] D. A. Belsley. Paring 3SLS calculations down to manageable proportions. Computer Science in Economics and Management, 5:157-169,
1992.

164

PARALLEL ALGORITHMS FOR UNEAR MODELS

[10] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Observations and Sources ofCollinearity. John Wiley
and Sons, 1980.
[11] C. Bendtsen, C. Hansen, K. Madsen, H. B. Nielsen, and M. Pinar. Implementation of QR up- and downdating on a massively parallel computer.
Parallel Computing, 21:49-61, 1995.
[12] M. W. Berry, J. J. Dongarra, and Y. Kim. A parallel algorithm for the
reduction of a nonsymmetric matrix to block upper-Hessenberg form.
Parallel Computing, 21:1189-1211, 1995.
[13] C. Bischof and C. F. Van Loan. The WY representation for products of
Householder matrices. Siam Journal on Scientific and Statistical Computing, 8(1):2-13, 1987.
[14]

A.

[15]

A. Bjorck. Numerical Methods for Least Squares Problems. SIAM,


Philadelphia, 1996.

Bjorck. A general updating algorithm for constrained linear least


squares problems. Siam Journal on Scientific and Statistical Computing,
5(2):394-402, 1984.

[16] L. S. Blackford. J. Choi, A. Cleary, E. D' Azevedo, J. Demmel,


I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK Users' Guide. SIAM,
Philadelphia, 1997.
[17] G. S. J. Bowgen and J. J. Modi. Implementation of QR factorization on
the DAP using Householder transformations. Computer Physics Communications, 37:167-170, 1985.
[18] P. Businger and G. H. Golub. Linear least squares solutions by Householder transformations. Numerische Mathematik, 7:269-276, 1965.
[19] Cambridge Parallel Processing. AP Linear Algebra Library (Manual
242),1996.
[20] J. M. Chambers. Regression updating. Journal of the American Statistical Association, 66:744-748, 1971.
[21] J. M. Chambers. Computational methods for data analysis. John Wiley
and Sons, Inc., 1977.
[22] J.-P. Chavas. Recursive estimation of simultaneous equation models.
Journal of Econometrics, 18:207-217, 1982.

References

165

[23] J. Choi, 1. J. Dongarra, and D. W. Walker. The design of a parallel dense


linear algebra software library: Reduction to Hessenberg, tridiagonal,
and bidiagonal fonn. Numerical Algorithms, 10:379-399, 1995.
[24] M. R. B. Clarke. Algorithm AS163. A Givens algorithm for moving
from one linear model to another without going back to the data. Applied
Statistics, 30(2):198-203, 1981.
[25] M. Clint, E. J. Kontoghiorghes, and J. S. Weston. Parallel GramSchmidt orthogonalisation and QR factorisation on an array processor. Zeitschrift for Angewandte Mathematik und Mechanik (ZAMM) ,
76(SI):377-378, 1996.
[26] M. Clint, R. Perrott, C. Holt, and A. Stewart. The influence of hardware and software considerations on the design of synchronous parallel
algorithms. Software Practice and Experience, 13:961-974, 1983.
[27] M. Clint, J. S. Weston, and J. B. Flannagan. Efficient Gram-Schmidt
orthogonalisation on an array processor. In B. Buchberger and J. Volkert, editors, Parallel Processing: CONPAR 94-VAPP VI, volume 854 of
LNCS, pages 218-228. Springer-Verlag, 1994.
[28] M. Cosnard and E. M. Daoudi. Householder factorization on distributed
architectures. In D. J. Evans and C. Sutti, editors, Parallel Computing:
Methods, Algorithms and Applications. Proceedings of the International
Meeting on Parallel Computing, pages 91-102. Adam Hilger, 1988.
[29] M. Cosnard and M. Daoudi. Optimal algorithms for parallel Givens factorization on a coarse-grained PRAM. Journal of the ACM, 41(2):399421, 1994.
[30] M. Cosnard, J.-M. Muller, and Y. Robert. Parallel QR decomposition of
a rectangular matrix. Numerische Mathematik, 48:239-249, 1986.
[31] R. H. Court. Three stage least squares and some extensions where the
structural disturbance covariance matrix may be singular. Econometrica, 42(3):547-558, 1974.
[32] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart. Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR
factorization. Mathematics of Computation, 30(136):772-795, 1976.
[33] W. Dent. Infonnation and computation in simultaneous equations estimation. Journal of Econometrics, 4:89-95, 1976.
[34] P. J. Dhrymes. Econometrics, Statistical Foundations and Applications.
Harper & Row, New York, 1970.

166

PARAUELALGORITHMS FOR liNEAR MODELS

[35] P. J. Dhrymes. Topics in Advanced Econometrics, volume Vo1.2: Linear


and Nonlinear Simultaneous Equations. Springer-Verlag, New York,
1994.
[36] J. J. Dongarra and A. H. Sameh. On some parallel banded system
solvers. Parallel Computing, 1(3-4):223-235, 1984.
[37] J. J. Dongarra, A. H. Sameh, and D. C. Sorensen. Implementation of
some concurrent algorithms for matrix factorization. Parallel Computing, 3:25-34, 1986.
[38] I. S. Duff, M. Erisman, and J. K. Reid. Direct methods for sparse matrices. Oxford Science Publications, 1986.
[39] L. Elden and H. Park. Block downdating of least squares solutions.
SIAM Journal on Matrix Analysis and Applications, 15(3):1018-1034,
1994.
[40] R. W. Farebrother. Linear Least Squares Computations (Statistics: Textbooks and Monographs), volume 91. Marcel Dekker, Inc., 1988.
[41] D. W. Fausett and C. T. Fulton. Large least squares problems involving
Kronecker products. SIAM Journal on Matrix Analysis and Applications, 15:219-227, 1994.
[42] D. W. Fausett, C. T. Fulton, and H. Hashish. Improved parallel QR
method for large least squares problems involving Kronecker products.
Journal of Computational and Applied Mathematics, 78:63-78, 1997.
[43] P. M. Flanders. Musical bits - a generalized method for a class of data
movements on the DAP. Technical Report CM70, ICL RADC, 1980.
[44] P. M. Flanders. A unified approach to a class of data movements on an
array processor. IEEE Transactions on Computers, C-31(9):809-819,
1982.
[45] P. M. Flanders and D. Parkinson. Data mapping and routing for highly
parallel processor arrays. Future Computing Systems (Oxford University
Press), 2(2), 1987.
[46] I. Foster. Designing and Building Parallel Programs. Addison-Wesley,
1995.
[47] T. L. Freeman and C. Phillips. Parallel Numerical Algorithms. Series in
Computer Science (Editor C. A. R. Hoare). Prentice Hall International,
1992.

References

167

[48] W. M. Gentleman. Least squares computations by Givens transformations without square roots. Journal of IMA, 12:329-336, 1973.
[49] W. M. Gentleman. Some complexity results for matrix computations on
parallel processors. Journal of the ACM, 25(1):112-115, 1978.
[50] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders. Methods for modifying matrix factorizations. Mathematics of Computation,
28(126):505-535, 1974.
[51] G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins
University Press, Baltimore, Maryland, 3ed edition, 1996.
[52] J. H. Goodnight. A tutorial on the SWEEP operator. The American
Statistician, 33(3):116-135, 1979.
[53] A. R. Gourlay. Generalisation of elementary Hermitian matrices. The
Computer Journal, 13(4):411-412, 1970.
[54] M. Gulliksson. Iterative refinement for constrained and weighted linear
least squares. BIT, 34:239-253, 1994.
[55] M. Gulliksson and p.-A. Wedin. Modifying the QR decomposition to
constrained and weighted linear least squares. SIAM Journal on Matrix
Analysis and Applications, 13:4:1298-1313, 1992.
[56] S. J. Hammarling. The numerical solution of the general GaussMarkov linear model. In T. Durrani, J. Abbiss, J. Hudson, R. Mordam,
J. McWhirter, and T. Moore, editors, Mathematics of Signal Processing,
pages 441--456. Oxford University Press, 1987.
[57] S. J. Hammarling, E. M. R. Long, and P. W. Martin. A generalized linear
least squares algorithm for correlated observations, with special reference to degenerate data. DITC 33/83, National Physical Laboratory,
1983.
[58] J. A. Hausman, W. K. Newey, and W. E. Taylor. Efficient estimation
and identification of simultaneous equation models with covariance restrictions. Econometrica, 55:849-874, 1987.
[59] C. S. Henkel and R. J. Plemmons. Recursive least squares on a hypercube multiprocessor using the covariance factorization. Siam Journal
on Scientific and Statistical Computing, 12(1):95-106, 1991.
[60] L. S. Jennings. Simultaneous equations estimation (computational aspects). Journal of Econometrics, 12:23-39, 1980.

168

PARALLEL ALGORITHMS FOR LINEAR MODELS

[61] J. Johnston. Econometric Methods. McGraw-Hill International, third


edition, 1987.
[62] G. G. Judge, W. E. Griffiths, R. C. Hill, H. Liitkepohl, and T. C. Lee. The
Theory and Practice of Econometrics. Wiley series in Probability and
Mathematical Statistics. John Wiley and Sons, second edition, 1985.
[63] I. Karasalo. A criterion for truncation of the QR decomposition algorithm for the singular linear least squares problem. BIT, 14:156-166,
1974.
[64] J. Kmenta and R. F. Gilbert. Small sample properties of alternative
estimators of seemingly unrelated regressions. Journal of the American
Statistical Association, 63:1180-1200, 1968.
[65] C. H. Koelbel, D. B. Lovemac, R. S. Schreiber, R. S. Steele, and M. E.
Zosel. The High Performance Fortran Handbook. The MIT Press, 1994.
[66] E.1. Kontoghiorghes. Inconsistencies and redundancies in SURE models: computational aspects. Computational Economics. (Forthcoming).
[67] E. J. Kontoghiorghes. Algorithms for linear model estimation on massively parallel systems. PhD Thesis, University of London, 1993. (Also
Technical report TR-655, Department of Computer Science, Queen
Mary and Westfield College, University of London).
[68] E. J. Kontoghiorghes. Solving seemingly unrelated regression equations
models using orthogonal decompositions. Technical Report TR-631,
Department of Computer Science, Queen Mary and Westfield College,
University of London, 1993.
[69] E. J. Kontoghiorghes. New parallel strategies for block updating the QR
decomposition. Parallel Algorithms and Applications, 5(1+2):229-239,
1995.
[70] E. J. Kontoghiorghes. Ordinary linear model estimation on a massively parallel SIMD computer. Concurrency: Practice and Experience,
11(7):323--341, 1999.
[71] E. J. Kontoghiorghes. Parallel strategies for computing the orthogonal
factorizations used in the estimation of econometric models. Algorithmica, 25:58-74, 1999.
[72] E. J. Kontoghiorghes. Parallel strategies for solving SURE models with
variance inequalities and positivity of correlations constraints. Computational Economics, 2000. (In press).

References

169

[73] E. J. Kontoghiorghes and M. R. B. Clarke. Computing the complete orthogonal decomposition using a SIMD array processor. In Lecture Notes
in Computer Science, volume 604, pages 660-663. Springer-Verlag,
1993.
[74] E. J. Kontoghiorghes and M. R. B. Clarke. Parallel reorthogonalization
of the QR decomposition after deleting columns. Parallel Computing,
19(6):703-707, 1993.
[75] E. J. Kontoghiorghes and M. R. B. Clarke. Solving the updated and
downdated ordinary linear model on massively parallel SIMO systems.
Parallel Algorithms and Applications, 1(2):243-252, 1993.
[76] E. J. Kontoghiorghes and M. R. B. Clarke. Stable parallel algorithms
for computing and updating the QR decomposition. In Proceedings
of the IEEE TENCON'93, pages 656--659, Beijing, 1993. International
Academic Publishers.
[77] E. J. Kontoghiorghes and M. R. B. Clarke. A parallel algorithm for repeated processing estimation of linear models with equality constraints.
In G. R. Joubert, D. Trystram, and F. J. Peters, editors, Parallel Computing: Trends and Applications, pages 525-528. Elsevier Science B.V.,
1994.
[78] E. J. Kontoghiorghes and M. R. B. Clarke. An alternative approach
for the numerical solution of seemingly unrelated regression equations
models. Computational Statistics & Data Analysis, 19(4):369-377,
1995.
[79] E. J. Kontoghiorghes and M. R. B. Clarke. Solving the general linear
model on a SIMD array processor. Computers and Artificial Intelligence, 14(4):353-370, 1995.
[80] E. J. Kontoghiorghes, M. R. B. Clarke, and A. Balou. Improving the
performance of optimum parallel algorithms on SIMD array processors: programming techniques and methods. In Proceedings ofthe IEEE
TENCON'93, pages 1203-1206, Beijing, 1993. International Academic
Publishers.
[81] E. J. Kontoghiorghes, M. Clint, and E. Dinenis. Parallel strategies for
estimating the parameters of a modified regression model on a SIMO
array processor. In A. Prat, editor, COMPSTAT, Proceedings in Computational Statistics, pages 319-324. Physical Verlag, 1996.
[82] E. J. Kontoghiorghes, M. Clint, and H.-H. Nageli. Recursive leastsquares using Householder transformations on massively parallel SIMO
systems. Parallel Computing, 25(8), 1999. (Forthcoming).

170

PARALLEL ALGORITHMS FOR liNEAR MODELS

[83] E. J. Kontoghiorghes and E. Dinenis. Data parallel algorithms for solving least-squares problems by QR decomposition. In F. Faulbaum, editor, SoftStat'95: 8th conference on the scientific use of statistical software, volume 5 of Advances in Statistical Software, pages 561-568.
Stuttgart: Lucius & Lucius, 1996.
[84] E. J. Kontoghiorghes and E. Dinenis. Data parallel QR decompositions
of a set of equal size matrices used in SURE model estimation. Journal
ofMathematical Modelling and Scientific Computing, 6:421-427, 1996.
[85] E. J. Kontoghiorghes and E. Dinenis. Solving the sequential accumulation least squares with linear equality constraints problem on a SIMD
array processor. Zeitschrift for Angewandte Mathematik und Mechanik
(ZAMM), 76(SI):447-448, 1996.
[86] E. J. Kontoghiorghes and E. Dinenis. Solving triangular seemingly unrelated regression equations models on massively parallel systems. In
M. Gilli, editor, Computational Economic Systems: Models, Methods
& Econometrics, volume 5 of Advances in Computational Economics,
pages 191-201. Kluwer Academic Publishers, 1996.
[87] E. J. Kontoghiorghes and E. Dinenis. Computing 3SLS solutions
of simultaneous equation models with a possible singular variancecovariance matrix. Computational Economics, 10:231-250, 1997.
[88] E. J. Kontoghiorghes and E. Dinenis. Towards the parallel implementation of the SURE model estimation algorithm. Journal of Mathematical
Modelling and Scientific Computing, 8:335-341, 1997.
[89] E. J. Kontoghiorghes and D. Parkinson. Parallel Strategies for rank.-k
updating of the QR decomposition. Technical Report TR-728, Department of Computer Science, Queen Mary and Westfield College, University of London, 1996.
[90] E. J. Kontoghiorghes, D. Parkinson, and H.-H. Nageli. QR decomposition of dense matrices on massively parallel SIMD systems. In Proceedings of the 15th 1MACS Congress on Scientific Computation, Modelling
and Applied Mathematics. Wissenschaft und Technik verlag, 1997.
[91] S. Kourouklis and C. C. Paige. A constrained least squares approach
to the general Gauss-Markov linear model. Journal of the American
Statistical Association, 76(375):620-625, 1981.
[92] K. Lahiri and P. Schmidt. On the estimation of triangular structural
systems. Econometrica, 1978.

References

[93]

171

c. L. Lawson and R. J. Hanson. Solving Least Squares Problems.


Prentice-Hall Englewood Cliffs, 1974.

[94] F. T. Luk. A rotation method for computing the QR decomposition.


SIAM Journal on Scientific and Statistical Computing, 7(2):452-459,
1986.
[95] F. T. Luk and H. Park. On parallel Jacobi orderings. SIAM Journal on
Scientific and Statistical Computing, 10(1):18-26, 1989.
[96] H. Liitkepohl. Introduction to Multiple Time Series Analysis. SpringerVerlag, 1993.
[97] J. R. Magnus. Maximum likelihood estimation of the GLS model with
unknown parameters in the disturbance covariance matrix. Journal of
Econometrics, 7:281-312, 1978.
[98] J. H. Maindonald. Statistical Computing. John Wiley and Sons Inc.,
1984.
[99] MasPar computer corporation. MasPar System Overview, 1992.
[100] B. De Moor and P. Van Dooren. Generalizations of the singular value
and QR decompositions. SIAM Journal on Matrix Analysis and Applications, 13(4):993-1014, 1992.
[101] M. Metcalf and J. Reid. Fortran 90 Explained. Oxford University Press,
1990.
[102] J. J. Modi. Parallel Algorithms and Matrix Computation (Oxford Applied Mathematics and Computing Science series). Oxford University
Press, 1988.
[103] J. J. Modi and M. R. B. Clarke. An alternative Givens ordering. Numerische Mathematik, 43:83-90, 1984.
[104] J.J. Modi and G.S.J. Bowgen. QU factorization and singular value decomposition on the DAP. In D. Paddon, editor, Super-computers and
Parallel Computation, pages 209-228. Oxford University Press, 1984.
[105] M. Moonen and P. Van Dooren. On the QR algorithm and updating
the SVD and URV decompositions in parallel. Linear Algebra and its
Applications, 188/189:549-568, 1993.
[106] R. Narayanan. Computation of Zellner-Theil's three stage least squares
estimates. Econometrica, 37(2):298-306, 1969.

172

PARAUELALGORITHMS FOR UNEAR MODELS

[107] W. Oberhofer and J. Kmenta. A general procedure for obtaining maximum likelihood estimates in generalized regression models. Journal
Econometrica, 42(3):579-590, 1974.
[108] S. J. Olszanskyj, J. M. Lebak, and A. W. Bojanczyk. Rank-k modification methods for recursive least squares problems. Numerical Algorithms, 7:325-354, 1994.
[109] C. C. Paige. Numerically stable computations for general univariate
linear models. Communications on Statistical and Simulation Computation, 7(5):437-453, 1978.
[110] C. C. Paige. Computer solution and perturbation analysis of generalized linear least squares problems. Mathematics of Computation,
33(145):171-183, 1979.
[111] C. C. Paige. Fast numerically stable computations for generalized
linear least squares problems. SIAM Journal on Numerical Analysis,
16(1):165-171, 1979.
[112]

c. C. Paige. The general linear model and the generalized singular value
decomposition. Linear Algebra and its Applications, 70:269-284, 1985.

[113] C. C. Paige. Some aspects of generalized QR factorizations. In M. G.


Cox and S. J. Hammarling, editors, Reliable Numerical Computation,
pages 71-91. Clarendon Press, Oxford, UK, 1990.
[114] C. T. Pan and R. J. Plemmons. Least squares modifications with inverse
factorizations: parallel implications. Journal of Computational and Applied Mathematics, 27:109-127, 1989.
[115] H. Park and L. Elden. Downdating the rank-revealing URV decomposition. SIAM Journal on Matrix Analysis and Applications, 16(1):138155, 1995.
[116] D. Parkinson. The distributed array processor (DAP). Computer Physics
Communications, 28:325-336, 1983.
[117] D. Parkinson. Organisational aspects of using parallel computers. Parallel Computing, 5:75-83, 1987.
[118] D. Parkinson, D. J. Hunt, and K. S. MacQueen. The AMT DAP 500. In
33rd IEEE Computer Society International Conference, pages 196-199,
San Francisco, 1988.
[119] D. S. G. Pollock. The Algebra of Econometrics (Wiley series in Probability and Mathematical Statistics). John Wiley and Sons, 1979.

References

173

[120] D. S. G. Pollock. 2 reduced-form approaches to the derivation of the


maximum-likelihood estimators for simultaneous-equation systems.
Journal of Econometrics, 1984.
[121] A. Pothen and P. Raghavan. Distributed orthogonal factorization:
Givens and Householder algorithms. SIAM Journal on Scientific and
Statistical Computing, 10(6):1113-1134, 1989.
[122] C. M. Rader and A. O. Steinhardt. Hyperbolic Householder transforms.
SIAM Journal on Matrix Analysis and Applications, 9:269-290, 1988.
[123] C. R. Rao. Computational Statistics, volume 9 of Handbook of Statistics. North-Holland, 1993.
[124] C. R. Rao and H. Toutenburg. Linear Models: Least Squares and Alternatives. Springer series in Statistics. Springer, 1995.
[125] P. A. Regalia and S. K. Mitra. Kronecker products, unitary matrices and
signal processing applications. SIAM Review, 31(4):586-613, 1989.
[126] J. K. Reid. A note on the least squares solution of a band system of
linear equations by Householder reductions. The Computer Journal,
10:188-189, 1967.
[127] N. S. Revankar. Some finite samples results in the context of two seemingly unrelated regression equations. Journal of the American Statistical
Association, 69:187-190, 1974.
[128] T. J. Rothenberg and P. A. Ruud. Simultaneous equations with covariance restrictions. Journal of Econometrics, 44:25-39, 1990.
[129] A. H. Sameh and D. J. Kuck. On stable parallel linear system solvers.
Journal of the ACM, 25(1):81-91, 1978.
[130] D. Sargan. Lectures on Advanced Econometric Theory. Basil Blackwell
Inc., 1988.
[131] K. Schittkowski and J. Stoer. A factorization method for the solution
of constrained linear least squares problems allowing subsequent data
changes. Numerische Mathematik, 31:431-463, 1979.
[132] P. Schmidt. Econometrics (Statistics: Textbooks and Monographs), volume 18. Marcel Dekker, Inc, 1976.
[133] P. Schmidt. A note on the estimation of seemingly unrelated regression
systems. Journal of Econometrics, 7:259-261, 1978.

174

PARALLEL ALGORITHMS FOR LINEAR MODELS

[134] J. R. Schott. Matrix Analysis for Statistics (Wiley series in Probability


and Statistics). John Wiley and Sons, Inc., 1997.
[135] R. Schreiber and C. F. Van Loan. A storage efficient WY representation
for products of Householder transformations. SIAM Journal on Scientific and Statistical Computing, 10:53-57, 1989.
[136] S. R. Searle. Linear Models. John Wiley and Sons, Inc., 1971.
[137] G. A. F. Seber. Linear Regression Analysis. John Wiley and Sons Inc.,
1977.
[138] D. M. Smith. Regression using QR decomposition methods. PhD thesis,
University of Kent, 1991.
[139] D. M. Smith and J. M. Bremner. All possible subset regressions using
the QR decomposition. Computational Statistics and Data Analysis,
7:217-235, 1989.
[140] I. SOderkvist. On algorithms for generalized least-squares problems
with ill-conditioned covariance matrices. Computational Statistics,
11(3):303-313, 1996.
[141] V. K. Srivastava and T. D. Dwivedi. Estimation of seemingly unrelated
regression equations Models: a brief survey. Journal of Econometrics,
10:15-32, 1979.
[142] V. K. Srivastava and D. E. A. Giles. Seemingly Unrelated Regression
Equations Models: Estimation and Inference (Statistics: Textbooks and
Monographs), volume 80. Marcel Dekker, Inc., 1987.
[143] V. K. Srivastava and R. Tiwari. Efficiency oftwo-stage and three-stage
least squares estimators. Econometrica, 46(6):1495-1498, 1978.
[144] G. W. Stewart. Updating URV decompositions in parallel. Parallel
Computing, 20(2): 151-172, February 1994.
[145] J. Stoer. On the numerical solution of constrained least-squares problems. SIAM Journal on Numerical Analysis, 8(2):382-411, 1971.
[146] N. R. Swanson and C. W. J. Grange. Impulse response functions based
on a causal approach to residual orthogonalization in vector autoregressions. Journal of the American Statistical Association, 92(437):357367,1997.
[147] H. Takada, A. Ullah, and Y. M. Chen. Estimation of seemingly unrelated
regression-model when the error cova,riance-matrix is singular. Journal
of Applied Statistics, 1995.

References

175

[148] L. G. Telser. Iterative estimation of a set of linear regression equations.


Journal of the American Statistical Association, 59:845-862, 1964.
[149] H. Theil. Principles of Econometrics. John Wiley & Sons, Inc, 1971.
[150] A. Zellner. An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 57:348-368, 1962.
[151] A. Zellner. Estimators for seemingly unrelated regression equations:
some exact finite sample results. Journal of the American Statistical
Association, 58, 1963.
[152] A. Zellner. An error-components procedure (ECP) for introducing prior
information about covariance matrices and analysis of multivariate regression models. International Economic Review, 20(3):679-692, 1979.
[153] A. Zellner and H. Theil. Three-stage least squares: simultaneous estimation of simultaneous equations. Econometrica, 30(1):54-78, 1962.

Author Index

Anderson, E., 38, 150


Andrews, H.C., 118
Arellano, M., 129
Axelsson, 0., 4
Bai, Z., 38, 150
Balou, A., 19,23,113
Barlow, J.L., 158
Belsley, D.A., 57,149
Bendtsen, C., 57, 60
Berry, M.W., 98
Bischof, C.H., 12,38, 114, 150
Björck, Å., 8, 11, 13, 17, 87, 92, 94, 158
Blackford, L.S., 35, 40
Bojanczyk, A.W., 57, 155
Bowgen, G.S.J., 24-25, 60, 110, 114
Bremner, J.M., 57
Businger, P., 11, 40, 114
Chambers, J.M., 57, 94
Chavas, J.-P., 57
Chen, Y.M., 121
Choi, J., 35, 40, 98
Clarke, M.R.B., 19, 22-23, 29, 49, 57, 60, 66-67, 72-73, 77, 82, 93-94, 101-103, 113, 121, 123, 127, 138, 141, 149, 151, 155, 158
Cleary, A., 35, 40
Clint, M., 18,24,27,29,55,57,60,66,77,93
Cosnard, M., 22, 35, 67, 72-73, 101, 103, 114
Court, R.H., 149
D'Azevedo, E., 35, 40
Daniel, J.W., 57
Daoudi, E.M., 22, 35, 114
Demmel, J., 35, 38, 40, 150
Dent, W., 147, 149
De Moor, B., 8
Dhillon, I., 35, 40
Dhrymes, P.J., 147
Dinenis, E., 17, 33, 35, 42, 49, 74-75, 77, 93,
140-141, 149, 152
Dongarra, J.J., 35, 38, 40, 73, 98, 114, 150
Duff, I.S., 82

Du Croz, J., 38, 150


Dwivedi, T.D., 119
Eldén, L., 57, 93, 155
Erisman, M., 82
Farebrother, R.W., 4, 114
Fausett, D.W., 130
Flanders, P.M., 17-18
Flannagan, J.B., 18,57,66
Foster, I., 17
Freeman, T.L., 17
Fulton, C.T., 130
Gentleman, W.M., 15, 17
Gilbert, R.F., 147
Giles, D.E.A., 119, 121, 123, 127, 129-130
Gill, P.E., 57, 155
Goldberger, A.S., 147
Golub, G.H., 4, 8, 10-11, 13, 17, 40-41, 57, 114,
123, 150, 155
Goodnight, J.H., 57
Gourlay, A.R., 11
Gragg, W.B., 57
Granger, C.W.J., 130
Greenbaum, A., 38, 150
Griffiths, W.E., 147, 149
Gulliksson, M., 8
Hammarling, S.J., 8-9, 35, 38, 40, 114, 125, 150, 152
Handy, S.L., 158
Hansen, C., 57, 60
Hanson, R.J., 4, 40-41, 57, 87, 94, 114, 152, 158
Hashish, H., 130
Hausman, J.A., 129
Henry, G., 35, 40
Hill, R.C., 147, 149
Holt, C., 18
Hunt, D.J., 24
Jennings, L.S., 147, 149
Johnston, J., 7-8
Judge, G.G., 147, 149

Kane, J., 118
Karasalo, I., 41
Kaufman, L., 57
Kim, Y., 98
Kmenta, J., 147
Koelbel, C.H., 10, 17
Kontoghiorghes, E.J., 17-19, 22-24, 27, 29, 33, 35, 40, 42, 49, 55, 60, 66, 74-75, 77, 80, 82, 90, 92-94, 98, 101-103, 108, 113, 121, 123-124, 127-128, 138-141, 149, 151-153, 155, 158
Kourouklis, S., 8, 106, 121, 127, 149
Kuck, D.J., 22,30,71, 101, 108
Kuh, E., 57
Lahiri, K., 123
Lawson, C.L., 4, 40-41, 57, 87, 94, 114, 152, 158
Lebak, J.M., 57, 155
Lee, T.C., 147, 149
Long, E.M.R., 8-9, 125, 152
Loveman, D.B., 10, 17
Luk, F.T., 22, 67
Lütkepohl, H., 147, 149
MacQueen, K.S., 24
Madsen, K., 57, 60
Magnus, J.R., 147
Maindonald, J.H., 4, 57
Martin, P.W., 8-9, 125, 152
McKenney, A., 38, 150
Metcalf, M., 10
Mitra, S.K., 118
Modi, J.J., 22, 24-25, 60, 67, 72-73, 101, 103, 110,
114
Moonen, M., 57
Muller, J.-M., 22, 67, 72-73, 101, 103
Murray, W., 57, 155
Narayanan, R., 147
Newey, W.K., 129
Nielsen, H.B., 57, 60
Nägeli, H.-H., 23, 29, 55, 60
Oberhofer, W., 147
Olszanskyj, S.J., 57, 155
Ostrouchov, S., 38, 150
Paige, C.C., 8, 106-108, 121, 127,149-150,155
Pan, C.T., 57
Parkinson, D., 17-18,23-24,98
Park, H., 22, 57, 93, 155
Perrott, R., 18
Petitet, A., 35, 40
Phillips, C., 17
Pinar, M., 57, 60

Plemmons, R.J., 57
Pollock, D.S.G., 7, 147
Pothen, A., 35
Rader, C.M., 94
Raghavan, P., 35
Rao, C.R., 2, 11, 147
Regalia, P.A., 118
Reid, J.K., 10-11, 82
Revankar, N.S., 127
Robert, Y., 22, 67, 72-73, 101, 103
Rothenberg, T.J., 129
Ruud, P.A., 129
Sameh, A.H., 22, 30, 35, 71, 101, 108, 114
Sargan, D., 147, 158
Saunders, M.A., 57, 155
Schittkowski, K., 158
Schmidt, P., 121, 123, 147
Schott, J.R, 118
Schreiber, R.S., 10, 17
Searle, S.R., 8
Seber, G.A.F., 8, 42, 62
Shroff, G.M., 12, 114
Smith, D.M., 57
Sorensen, D.C., 35, 38, 114, 150
Srivastava, V.K., 119, 121, 123, 127, 129-130, 147
Stanley, K., 35, 40
Steele, G.L., 10, 17
Steinhardt, A.O., 94
Stewart, A., 18
Stewart, G.W., 57
Stoer, J., 158
Swanson, N.R., 130
Söderkvist, I., 8
Takada, H., 121
Taylor, W.E., 129
Telser, L.G., 121
Theil, H., 147, 149
Tiwari, R., 147
Toutenburg, H., 2,147
Ullah, A., 121
Van Loan, C.F., 4, 8, 10-11, 13, 17, 40-41, 57, 123,
150, 155
Van Dooren, P., 8, 57
Walker, D.W., 35, 40, 98
Wedin, P.-Å., 8
Welsch, R.E., 57
Weston, J.S., 18, 24, 27, 55, 57, 66
Whaley, R.C., 35, 40
Zellner, A., 119-121, 129-130, 147
Zosel, M.E., 10, 17

Subject Index

Algorithm, 8
3-D
Givens, 30, 74
Gram-Schmidt, 30
Householder, 30, 74
performance, 33, 74
bitonic, 69-70, 74-75, 82
complexity, 71
example, 71
orthogonal matrix, 69
column sweep, 124
data parallel, 17
downdating, 97
hybrid,55
SIMD,45
MIMD,35
QLD,135
QRD, 10
Givens, 14-15,22
Gram-Schmidt, 16-17,21
Householder, 12,19
hybrid,60,62,66
performance, 23, 25
set of equal size matrices, 29
skinny, 28
updating, 61
reconstructing the orthogonal matrix, 50, 52
SIMD, 23, 66
block parallel, 26
Givens, 40
Householder, 25, 40
performance, 28
QLD,41
QRD updating, 59
QRD,23
triangular (tSURE) model, 126
triangular systems, 128
Array processing, 18
Array processor, 41

BLUE, 4, 7-8, 58, 105-106,119


minimum 2-norm, 40
Cholesky,4, 105, 123, 149, 152
Colon notation, 10
Column sweep, 90
Column-based, 14-15
Conjugate Gradient, 4
Correlation, 119, 130
Data parallel, 17
CDGR,46,112
forall, 18
logical matrix, 19
permutation, 19
programming paradigm, 17
reduction, 19
replication, 19
spread, 19,29
sum, 19,29
Diagonally-based, 15
Direct methods, 4
Direct sum, 118
Error-components procedure, 129
Euclidean norm, 5
Fill-in, 4, 136
Frobenius norm, 121
Full parallelism, 137
Gauss-Jordan elimination, 4
Gaussian elimination, 4
Givens, 10, 13,43,46
bitonic,69
block generalization, 98
block-parallel strategies, 67
CDGR, 22, 30, 46, 48,59,70,73-79,82,84-85,
94-95, 101, 103, 108, 110-112, 138-139
column-based method, 79, 107
comparison, 79
Greedy sequence, 79
illustration, 79
modified Greedy sequence, 80

diagonally-based method, 76, 106
drawback, 79
illustration, 76-77
downdating, 94
illustration, 95
UGS, 94-95, 98
Greedy sequence, 22, 67, 77, 79, 81, 139
example, 73
organizational overheads, 73
updating, 72
Greedy-based method, 103
hyperbolic, 94
modified UGS (MUGS), 139
MSK sequence, 110, 113
SIMD implementation, 111
parallel strategies, 67
PGS, 46, 48, 59
QLD updating, 77
recursive doubling, 67
rotation, 13
SK sequence, 22, 30, 71,101-102, 108, 110, 113
structured banded matrices, 82
SK-based method, 102
structural form, 13
UGS, 59, 67, 71, 73, 75, 77, 79,81,99, 138-139
updating, 67
Gram-Schmidt, 10
Classical, 10, 16
Modified, 10, 16
High Performance Fortran (HPF), 17,25
Householder, 10
compound transformations, 30, 34
hyperbolic, 94, 98
matrix, 11
reflection, 41-45, 50
reflector, 11
transformation, 11, 13, 59, 94, 114
vectors, 42, 50
Idempotent, 125, 153
Ill-conditioned, 8, 12,40, 149
Inconsistent, 9
Inner-product, 25
Iterative methods, 4
Kronecker product, 118
LAPACK,38
Least squares, 3-4, 7
3SLS, 147
constrained, 8
basis of the null space (BNS), 87
direct elimination (DE), 87, 89
estimator, 3-5
minimum 2-norm, 39
generalized, 8
variance-covariance matrix, 7, 9,106
GLLSP, 8-9, 105, 107
constraints, 9
objective function, 9

reduced, 107
restrictions, 8
variance-covariance matrix, 3
recursive, 7, 87
BNS, 88
constraints, 87
DE,90
performance, 90
restricted, 6-7
unrestricted, 6-7
weighted,8
Likelihood function, 8
Limited parallelism, 137
Linear model, 1-3
constrained, 6
variance-covariance, 8
general (GLM), 2, 7-9, 105
ordinary (OLM), 2, 7-8, 39, 57
adding variables, 90
deleting variables, 99
downdated, 57, 92
modified,57
non full column rank, 54
updated, 57-58
variance-covariance, 2
singular, 7
weighted (WLM), 7
Lower trapezoid, 39, 58
Lower triangular, 41, 44
LU decomposition, 4
Manifold,3
Massively parallel, 17,60
Maximum likelihood, 3, 147
estimator, 4
MIMD, 35, 66, 98
efficiency, 38
IBM SP2, 38
inter-processor communication, 35
load balancing, 35
locality, 35
scattering, 36
speedup, 38
SPMD,35-36
task-farming, 35, 38
Moore-Penrose, 8
MPI,38
Multivariate model, 120
Non-negative definite, 2,7
Non-singular, 5, 7, 9, 41
Normal equations, 3, 149
Normally distributed, 2
Numerically unstable, 8
Orthogonal, 9, 11, 13-14, 22, 40, 44, 46, 49, 58, 67-68, 83, 94, 100, 108, 121, 139-140, 162
Compound Disjoint Orthogonal Matrices
(CDOMs), 135, 137-138
application, 142
reconstruction, 50
special structure, 51
Outer product, 90
Permutation matrix, 41, 68, 100
Preconditioning, 4
QLD,40, 106, 108, 110, 135, 140-141, 145, 152
complete, 40
generalized (GQLD), 105
column pivoting, 40
PDS, 137-139
PRS, 137-138
SDS, 137
SRS, 137
updating with lower triangular matrix, 75
QRD, 4-5, 10, 123, 148, 155, 162
adding columns, 90
complete, 122, 155
deleting columns from trapezoid, 100
Greedy-based method, 103
illustration, 102
SK-based method, 102
deleting columns from triangular matrix, 101
deleting columns, 100
downdating, 93
Householder transformations, 95
parallel strategies, 94
performance, 98
generalized (GQRD), 8-9, 121
Givens, 14
examples, 15
Gram-Schmidt, 16-17
Householder, 11-12
block format, 12
incomplete, 148,156
downdated, 155
updated, 154
retriangularizing a trapezoidal matrix, 160
Givens sequence, 160
set of matrices, 29, 34
structured banded matrices, 82
bitonic method, 82
illustration, 82, 85
updating, 58,67,72,84
weighted,8
Rank,3,5-6,9-10,87
criterion, 41
full column, 40, 57, 121, 147, 149
not full, 39
Recurrence, 145
Recursive doubling, 67
Regression, 1, 117
model,41
stepwise, 42, 48, 61, 127
Residuals, 1, 3,5,62, 120
Set of vectors notation, 118
SIMD, 17-18, 73, 108
arrays, 17
communication, 17
CPP, 24
Fortran, 25
GAMMA,24
lalib,24
languages, 24
qr factor, 25
DAP, 24, 40-41, 48,52,59,90,94-95,110,113,
127-128
GAMMA, 60, 64
layers, 18-19,24,42,60,65,108-109
mapping, 17-18,24,54,60
column, 60-62, 64
cyclic, 18,27,60-62,64
row, 60-61, 66
strategies, 55
MasPar, 17-18,24,29,32,60,62,64,74-75
DPU, 18-19
Fortran, 18
front-end, 18
overheads,20,27,113
communication, 61
function, 64
implementation, 64
performance, 17, 65
remapping, 62, 64
overheads, 65
synchronization, 17
Simultaneous equations models (SEMs), 129, 147
2SLS, 148
residuals, 149, 151
3SLS, 147, 149
computational strategies, 160
constrained estimator, 159
constrained, 157
convergence, 152
estimator, 149, 151
estimator's covariance matrix, 151
QRD,160
consistent modified, 154
cross-section constraints, 160
disturbance variance-covariance matrix, 148
consistent estimator, 151
non singular, 150
singular, 149, 152
GLLSP, 149
consistency condition, 152
QRD,150
GLS, 148
inconsistent, 153
modified, 154
adding new data, 154
adding new predetermined variables, 156
adding new variables, 155
deleting data, 155
deleting variables, 155
OLS, 148

redundancies, 152
separable constraints, 157
BNS method, 158
DE method, 158
specification incompatibilities, 153-154
structural equation, 147, 149, 152
identifiable, 148
transformed, 153
transformed (TSEM), 148-149, 152-153
downdated, 155
linear dependency, 152-153
modified, 156
undersized sample problem, 152
Skinny matrices, 27
Sparse, 4
Subscript notation, 10
SURE,74, 117, 119-121,147,149,162
common exogenous variables, 140
covariance restrictions, 129
distinct regressors, 120
disturbances, 119
singular covariance matrix, 121
FGLS, 120
variance-covariance, 123
GLLSP, 121-122
objective function, 122
GLS, 119
variance-covariance, 119
inconsistent, 122, 153
iterative FGLS, 120
OLS, 119-120
variance-covariance, 119
restricted residual, 120
subset regressors, 123
SURE-CC, 129-130
definition, 130
disturbance variance-covariance matrix, 140
FWLS, 132
proper subset regressors, 141, 145
reduced size, 140
SURR,120
SUUR,120
triangular (tSURE), 123

algorithm, 126
consistent regression equation, 125
consistent, 125
estimator, 125
estimator's covariance matrix, 126
FGLS estimator, 126
implementation, 127
inconsistent regression equation, 125
inconsistent, 125
modified consistent, 126
performance, 128
singular covariance matrix, 123
unrestricted residual, 120
variance inequalities and correlation constraints,
129
SVD, 4, 9
Timing model, 17, 19,23,40,42,46,65
constrained least squares, 90
downdating, 94
Givens, 95
Householder transformations, 97
Givens, 23, 33
GLM, 113
updating, 61
Householder, 20, 32
MIMD,37
Modified Gram-Schmidt, 21, 32
PGS, 48
QLD,54
QRD updating, 60
column, 62
cyclic, 61
reconstructing the orthogonal matrix, 50, 53
remapping, 62
triangular (tSURE), 127
Trace, 120
Triangular factors, 70
Triplet subscript, 10
Unbiased estimator, 8
Upper trapezoid, 100
Upper triangular, 5, 9, 13, 69
Vector operator, 118
Virtual processor, 24

Advances in Computational Economics

1. A. Nagurney: Network Economics. A Variational Inequality Approach. 1993. ISBN 0-7923-9293-0
2. A.K. Duraiappah: Global Warming and Economic Development. A Holistic Approach to International Policy Co-operation and Co-ordination. 1993. ISBN 0-7923-2149-9
3. D.A. Belsley (ed.): Computational Techniques for Econometrics and Economic Analysis. 1993. ISBN 0-7923-2356-4
4. W.W. Cooper and A.B. Whinston (eds.): New Directions in Computational Economics. 1994. ISBN 0-7923-2539-7
5. M. Gilli (ed.): Computational Economic Systems. Models, Methods & Econometrics. 1996. ISBN 0-7923-3869-3
6. H. Amman, B. Rustem, A. Whinston (eds.): Computational Approaches to Economic Problems. 1997. ISBN 0-7923-4397-2
7. G. Pauletto: Computational Solutions of Large-Scale Macroeconometric Models. 1997. ISBN 0-7923-4656-4
8. R.D. Herbert: Observers and Macroeconomic Systems. Computation of Policy Trajectories with Separate Model Based Control. 1998. ISBN 0-7923-8239-0
9. D. Ho and T. Schneeweis (eds.): Applications in Finance, Investments, and Banking. 1999. ISBN 0-7923-8294-3
10. A. Nagurney: Network Economics: A Variational Inequality Approach. Revised second edition. 1999. ISBN 0-7923-8350-8
11. T. Brenner: Computational Techniques for Modelling Learning in Economics. 1999. ISBN 0-7923-8503-9
12. A. Hughes Hallett and P. McAdam (eds.): Analysis in Macroeconomic Modelling. 1999. ISBN 0-7923-8598-5
13. R.A. McCain: Agent-Based Computer Simulation of Dichotomous Economic Growth. 1999. ISBN 0-7923-8688-4
14. F. Luna and B. Stefansson (eds.): Economic Simulations in Swarm. Agent-Based Modelling and Object Oriented Programming. 1999. ISBN 0-7923-8665-5
15. E.J. Kontoghiorghes: Parallel Algorithms for Linear Models. Numerical Methods and Estimation Problems. 1999. ISBN 0-7923-7720-6

KLUWER ACADEMIC PUBLISHERS - DORDRECHT / BOSTON / LONDON
