Parallel Sparse Direct Solver for Integrated Circuit Simulation

Xiaoming Chen
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
and Tsinghua University, Beijing, China

Yu Wang
Tsinghua University, Beijing, China

Huazhong Yang
Tsinghua University, Beijing, China
With the advances in the scale and complexity of modern integrated circuits (ICs),
Simulation Program with Integrated Circuit Emphasis (SPICE) based circuit sim-
ulators are facing performance challenges, especially for post-layout simulations.
Advances in semiconductor technologies have greatly promoted the development of
parallel computers, and, hence, parallelization has become a promising approach to
accelerate circuit simulations. Parallel circuit simulation has been a popular research
topic for a few decades since the invention of SPICE. The sparse direct solver
implemented by sparse lower–upper (LU) factorization is the biggest bottleneck in
modern full SPICE-accurate IC simulations, since it is extremely difficult to par-
allelize. This is a practical challenge that both academia and industry are facing.
This book describes algorithmic methods and parallelization techniques that aim
to realize a parallel sparse direct solver named NICSLU (NICS is short for
Nano-Scale Integrated Circuits and Systems, the name of our laboratory at
Tsinghua University), which is specially targeted at SPICE-like circuit simulation
problems. We propose innovative numerical algorithms and a parallelization frame-
work for designing NICSLU. We describe a complete flow and detailed parallel
algorithms of NICSLU. We also show how to improve the performance of NICSLU
by developing novel numerical techniques. NICSLU can be applied to any
SPICE-like circuit simulator and has been proven to deliver high performance in
actual circuit simulation applications.
There are eight chapters in this book. Chapter 1 gives a general introduction to
SPICE-like circuit simulation and also describes the challenges of parallel circuit
simulation. Chapter 2 comprehensively reviews existing work on parallel circuit
simulation techniques, including various software algorithms and hardware accel-
eration techniques. Chapter 3 covers the overall flow and all the core steps of
NICSLU.
Starting from Chap. 4, we present the proposed algorithmic methods and par-
allelization techniques of NICSLU in detail. We will describe two parallel factor-
ization algorithms, a full factorization with partial pivoting and a re-factorization
without partial pivoting, based on an innovative parallelization framework. The two
algorithms are both compatible with SPICE-like circuit simulation applications.
Contents

1 Introduction
  1.1 Circuit Simulation
    1.1.1 Mathematical Formulation
    1.1.2 LU Factorization
    1.1.3 Simulation Flow
  1.2 Challenges of Parallel Circuit Simulation
    1.2.1 Device Model Evaluation
    1.2.2 Sparse Direct Solver
    1.2.3 Theoretical Speedup
  1.3 Focus of This Book
  References
2 Related Work
  2.1 Direct Parallel Methods
    2.1.1 Parallel Direct Matrix Solutions
    2.1.2 Parallel Iterative Matrix Solutions
  2.2 Domain Decomposition
    2.2.1 Parallel BBD-Form Matrix Solutions
    2.2.2 Parallel Multilevel Newton Methods
    2.2.3 Parallel Schwarz Methods
    2.2.4 Parallel Relaxation Methods
  2.3 Parallel Time-Domain Simulation
    2.3.1 Parallel Numerical Integration Algorithms
    2.3.2 Parallel Multi-Algorithm Simulation
    2.3.3 Time-Domain Partitioning
    2.3.4 Matrix Exponential Methods
  2.4 Hardware Acceleration Techniques
    2.4.1 GPU Acceleration
    2.4.2 FPGA Acceleration
  References
Chapter 1
Introduction

With the rapid development of IC and computer technologies, EDA techniques
have become an important subject in the electronics area. The appearance and
development of EDA techniques have greatly promoted the development of the
semiconductor industry. The development trend of modern very-large-scale integration
(VLSI) circuits is to integrate more functionalities into a single chip. To facilitate this,
the scale of modern ICs is extremely large and electronic systems are also becoming
more complex generation by generation. In addition, electronic devices are upgraded
frequently, and IC vendors face a huge time-to-market challenge.
Such a rapidly developing electronic world has, on the other hand, brought a challenge
to EDA techniques: the performance of modern EDA tools must keep pace with
the development of modern ICs, such that IC vendors can ceaselessly develop new
products for the world.

(© Springer International Publishing AG 2017. X. Chen et al., Parallel Sparse Direct Solver for Integrated Circuit Simulation, DOI 10.1007/978-3-319-53429-9_1)
1.1 Circuit Simulation

As one of the core components of EDA techniques, SPICE [1]-like transistor-level
circuit simulation is an essential step in the design and verification process
of a very broad range of ICs and electronic systems such as processors, memories,
analog and mixed-signal circuits, etc. It serves as a critical and cheap way of
predicting circuit performance and identifying possible faults before the expensive
chip fabrication. As a fundamental step in the IC design and verification process,
SPICE simulation techniques, including fundamental algorithms and parallel simula-
tion approaches, have been widely studied and under long-term active development
in the last few decades since the invention of SPICE. Today, there are a number of
circuit simulators from both academia and industry which are developed based on the
original SPICE code. SPICE has already become the de facto standard transistor-level
simulation tool. SPICE-like circuit simulators are widely adopted by universities and
IC vendors all over the world.
Modern SPICE-like circuit simulators usually integrate a large number of device
models, including resistors, capacitors, inductors, independent and dependent
sources, various semiconductor devices including diodes, metal–oxide–
semiconductor field-effect transistors (MOSFETs), junction field-effect transistors
(JFETs), etc., as well as many macro-models representing complicated IC components.
Modern circuit simulators also support a wide variety of circuit analyses, including
direct current (DC) analysis, alternating current (AC) analysis, transient analysis,
noise analysis, sensitivity analysis, pole-zero analysis, etc. DC analysis, which cal-
culates a quiescent operating point, serves as a basic starting point for almost all of
the other simulations. Transient analysis, also called time-domain simulation, which
calculates the transient response in a given time interval, is the most widely used
function in analog and mixed-signal circuit design and verification among all the
simulation functions offered by SPICE. The device models and simulation
functions provided by modern circuit simulators give strong support to the
transistor-level design and verification of modern complementary metal–oxide–semiconductor
(CMOS) circuits.
Figure 1.1 shows a typical framework of SPICE-like circuit simulators, in which
the blue boxes are essential components and the dotted boxes are supplementary
functionalities provided by some software packages. The SPICE kernel accepts text
netlist files as input. Although some software packages have a graphical interface
such that users can draw circuit schematics using built-in symbols, the schematics
are automatically compiled into netlist files by front-end tools before simulation.
Netlist files describe everything about the circuit to be simulated, including the
circuit structure, parameters of devices and models, simulation type and control
parameters, etc. The SPICE kernel reads the netlist files and builds internal data
structures. Then device models are initialized according to the model parameters
specified in the input files. Based on Kirchhoff's laws [2], a circuit equation is
created, which is then solved by numerical engines to get the response of the circuit.
Finally, back-end tools like waveform viewer can be used to show and analyze the
response.
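The netlist-parsing step described above can be sketched with a tiny illustrative parser. The element grammar, the suffix table, and the functions `parse_value`/`parse_netlist` are simplified assumptions for illustration, not the actual SPICE syntax.

```python
# Illustrative sketch: turning a few SPICE-style element lines into Python
# records. Handles only two-terminal R/C/L/V elements with suffixed values.

SUFFIX = {"k": 1e3, "meg": 1e6, "m": 1e-3, "u": 1e-6, "n": 1e-9, "p": 1e-12}

def parse_value(token):
    """Convert a SPICE-style value token such as '1k' or '10u' to a float."""
    t = token.lower()
    for suf in sorted(SUFFIX, key=len, reverse=True):  # try 'meg' before 'm'
        if t.endswith(suf):
            return float(t[: -len(suf)]) * SUFFIX[suf]
    return float(t)

def parse_netlist(text):
    """Return a list of (type, name, nodes, value) tuples for element lines."""
    elements = []
    for line in text.splitlines():
        line = line.split("*")[0].strip()      # drop comments and blanks
        if not line or line.startswith("."):   # skip control cards (.tran etc.)
            continue
        parts = line.split()
        name, nodes, value = parts[0], parts[1:3], parse_value(parts[3])
        elements.append((name[0].upper(), name, tuple(nodes), value))
    return elements

netlist = """\
* simple RC low-pass
V1 in 0 5
R1 in out 1k
C1 out 0 10u
.tran 1u 1m
"""
print(parse_netlist(netlist))
```

A front-end tool would hand records like these to the MNA matrix-creation step described next.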
Fig. 1.1 A typical framework of SPICE-like circuit simulators (schematic entry with a symbolic library → netlist file plus model parameters → SPICE simulation with a device model library → output files → waveform post-processing)
1.1.1 Mathematical Formulation

SPICE employs the modified nodal analysis (MNA) [3] method to create the circuit
equation. In this subsection, we will use transient simulation as an example to explain
the principle and simulation flow of SPICE. In transient simulation, the equation cre-
ated by MNA has a unified form which can be expressed by the following differential
algebraic equation (DAE):

f(x(t)) + d q(x(t)) / dt = u(t),  (1.1)
where t is the time, x(t) is the unknown vector containing node voltages and branch
currents, f (x(t)) is a nonlinear vector function denoting the effect of static devices
in the circuit, q(x(t)) is a nonlinear vector function denoting the effect of dynamic
devices in the circuit, and u(t) is the known stimulus of the circuit.
In most practical cases, Eq. (1.1) does not have an analytical solution, so the
only way to solve it is to use numerical methods. The implicit backward Euler and
trapezoid methods [4] are usually adopted to solve the DAE in SPICE-like circuit
simulators. If we adopt the backward Euler method to discretize Eq. (1.1) in the time
domain, we get

f(x(t_{n+1})) + [q(x(t_{n+1})) − q(x(t_n))] / (t_{n+1} − t_n) = u(t_{n+1}),  (1.2)

where t_n and t_{n+1} are the discrete time nodes. If the solutions at and before time node
t_n (i.e., x(t_0), x(t_1), ..., x(t_n)) are all known, then the solution at time node t_{n+1} (i.e.,
x(t_{n+1})) can be solved from Eq. (1.2).
Equation (1.2) is nonlinear and can be abstracted into the following implicit equation:

F_{n+1}(x(t_{n+1})) = 0,  (1.3)

where F_{n+1} denotes the implicit nonlinear function at time node t_{n+1}, which can
be solved by the Newton–Raphson method [4]. Namely, Eq. (1.3) is solved by the
following iteration form:

J(x(t_{n+1})^{(k)}) · x(t_{n+1})^{(k+1)} = −F_{n+1}(x(t_{n+1})^{(k)}) + J(x(t_{n+1})^{(k)}) · x(t_{n+1})^{(k)},  (1.4)

where J is the Jacobian matrix and the superscript (k) is the iteration number.
Equation (1.4) can be further abstracted into a linear system form
Ax = b, (1.5)
where the matrix A and the right-hand-side (RHS) vector b only depend on the
intermediate results of the kth iteration. So far, we have shown that the core
operation in solving the circuit equation Eq. (1.1) in SPICE-like circuit simulation is
solving the linear system Eq. (1.5), which is obtained by discretizing and linearizing
the DAE using numerical integration methods (e.g., the backward Euler method) and
the Newton–Raphson method.
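The discretize-then-linearize loop of Eqs. (1.1)–(1.4) can be sketched for a one-node circuit: a current source driving a diode, a resistor, and a capacitor in parallel. All component values and tolerances below are illustrative assumptions, and the "linear system" degenerates to a scalar, so each Newton step solves a 1×1 version of Eq. (1.5).

```python
import math

# Per node, the DAE f(x) + d q(x)/dt = u becomes
#   (v/R + Is*(exp(v/Vt) - 1)) + C*dv/dt = I,
# discretized with backward Euler (Eq. 1.2) and solved by Newton-Raphson
# (Eq. 1.4) at every time node.

R, C, I = 1e3, 1e-6, 5e-3          # illustrative component values
IS, VT = 1e-14, 0.02585            # diode saturation current, thermal voltage

def f(v):        # static devices: resistor + diode current
    return v / R + IS * (math.exp(v / VT) - 1.0)

def df(v):       # derivative of f (the 1x1 Jacobian contribution)
    return 1.0 / R + (IS / VT) * math.exp(v / VT)

def transient(h=1e-6, steps=2000):
    v = 0.0                                   # initial (DC) condition
    for _ in range(steps):                    # outer loop: time nodes
        v_prev, x = v, v
        for _ in range(50):                   # inner loop: Newton-Raphson
            F = C * (x - v_prev) / h + f(x) - I   # Eq. (1.3) residual
            J = C / h + df(x)                     # Eq. (1.4) Jacobian
            dx = -F / J                           # the 1x1 "Ax = b" solve
            x += dx
            if abs(dx) < 1e-12:               # convergence check
                break
        v = x                                 # advance to the next time node
    return v

print(f"node voltage after 2 ms: {transient():.4f} V")
```

The two nested loops here are exactly the transient-iteration / SPICE-iteration structure discussed in Sect. 1.1.3; a real simulator replaces the scalar solve with a sparse LU factorization.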
Although the above equations are all derived from transient simulation, the core
method is similar for other simulation functions. Basically, for ordinary differential
equations (ODEs), implicit integration methods are adopted to discretize the equation
in the time domain, and then the Newton–Raphson method is adopted to linearize the
nonlinear equation at a particular time point. Consequently, for any type of SPICE-
like simulations, the core operation is always solving linear equations associated
with the circuit and the simulation function. The major difference is in the format
of the equation. For example, in frequency-domain simulation, we need to solve
complex linear systems instead of real linear systems. Therefore, the linear solver is
an extremely important component in any SPICE-like circuit simulator.
1.1.2 LU Factorization

LU factorization decomposes the coefficient matrix A into the product of a lower
triangular matrix L and an upper triangular matrix U, i.e., A = LU. The factors can
be computed row by row as

U_{ij} = A_{ij} − Σ_{k=1}^{i−1} L_{ik} U_{kj},   i = 1, 2, ..., N;  j = i, i+1, ..., N  (1.7)

L_{ij} = (A_{ij} − Σ_{k=1}^{j−1} L_{ik} U_{kj}) / U_{jj},   i = 1, 2, ..., N;  j = 1, 2, ..., i−1.  (1.8)
To solve a linear system using LU factorization, at least the following two steps are
required: triangular factorization (i.e., A = LU) and forward/backward substitutions
(solving y from Ly = b and solving x from Ux = y). In practice, due to the
numerical instability problem caused by round-off errors, one needs to perform
pivoting during LU factorization. In most cases, a proper permutation of rows (or
columns) is sufficient to ensure the numerical stability of LU factorization. Such
an approach is called partial pivoting. Row permutation-based LU factorization
with partial pivoting can be expressed as follows:

PA = LU,  (1.9)
where P is the row permutation matrix indicating the row pivoting order. LU factor-
ization with full pivoting involves both row and column permutations, i.e.,

PAQ = LU,  (1.10)

where P and Q are the row and column permutation matrices indicating the row and
column pivoting orders, respectively.
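Equation (1.9) and the two triangular solves can be sketched compactly in dense form. This pure-Python, O(N³) version ignores sparsity entirely; it only illustrates the numerical steps (pivot selection, elimination, Ly = Pb, Ux = y) that a sparse solver such as NICSLU performs on the nonzero entries only.

```python
def lu_solve(A, b):
    """Solve Ax = b via PA = LU with partial pivoting (dense, illustrative)."""
    n = len(A)
    A = [row[:] for row in A]          # copy; stores L (strictly below) and U
    perm = list(range(n))              # row permutation encoding P
    for k in range(n):
        # partial pivoting: move the largest |entry| of column k to row k
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                    # multiplier L[i][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]      # trailing-block update
    y = [0.0] * n                      # forward substitution: Ly = Pb
    for i in range(n):
        y[i] = b[perm[i]] - sum(A[i][k] * y[k] for k in range(i))
    x = [0.0] * n                      # backward substitution: Ux = y
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(A[i][k] * x[k] for k in range(i + 1, n))) / A[i][i]
    return x

A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
b = [4.0, 10.0, 24.0]
print(lu_solve(A, b))    # x should be [1, 1, 1] up to rounding
```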
The time complexity of LU factorization is O(N 3 ) for dense matrices, so it can
be very time-consuming when solving large linear systems. However, for sparse
matrices, the time complexity is greatly reduced, so efficiently solving a large sparse
linear system by LU factorization is possible. In order to enhance the performance of
solving sparse linear systems by LU factorization, an additional pre-analysis step to
reorder the row and column permutations to minimize fill-ins [6] is required before
factorization, which will be explained in detail in Chap. 3.
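The effect of ordering on fill-in can be seen with a pattern-only (symbolic) elimination of a toy "arrow" matrix: eliminating the dense row/column first fills in the whole trailing block, while the reverse ordering creates no fill-in at all. Both the matrix and the brute-force simulation below are illustrative assumptions, not an actual reordering algorithm such as minimum degree.

```python
# Symbolic Gaussian elimination on a boolean nonzero pattern, counting how
# many structural zeros become nonzero (fill-ins) during factorization.

def count_fill_ins(pattern):
    n = len(pattern)
    M = [row[:] for row in pattern]
    fill = 0
    for k in range(n):
        for i in range(k + 1, n):
            if not M[i][k]:
                continue
            for j in range(k + 1, n):
                if M[k][j] and not M[i][j]:
                    M[i][j] = True       # structural update creates a fill-in
                    fill += 1
    return fill

n = 6
# arrow matrix: dense first row and column, plus the diagonal
arrow = [[i == 0 or j == 0 or i == j for j in range(n)] for i in range(n)]
# same circuit, reordered so the dense row/column come last
reversed_arrow = [[i == n - 1 or j == n - 1 or i == j for j in range(n)]
                  for i in range(n)]

print(count_fill_ins(arrow), count_fill_ins(reversed_arrow))  # → 20 0
```

The same nonzeros, reordered, change the fill-in count from 20 to 0, which is exactly why the pre-analysis step matters for extremely sparse circuit matrices.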
1.1.3 Simulation Flow

Fig. 1.2 A typical flow of SPICE-like transient simulation (netlist parsing → matrix creation by MNA and pre-analysis → DC analysis → transient iteration, whose inner SPICE iteration performs device model evaluation, matrix/RHS load, sparse LU factorization (A = LU), and forward/backward substitutions (Ly = b, Ux = y) until the iteration converges, updating the time node until the simulated interval ends → waveform)
Figure 1.2 shows a typical flow of SPICE-like transient simulation, which can be
derived from the mathematical formulation presented in Sect. 1.1.1. The SPICE ker-
nel first reads a circuit netlist written in a pure text format, and then parses the netlist
file to build internal data structures. A complete SPICE flow also includes many
auxiliary and practical functionalities, e.g., netlist check and circuit topology check.
After internal data structures are built, the SPICE kernel calculates the symbolic
pattern of the circuit matrix by MNA, followed by a pre-analysis step on the sym-
bolic pattern. Typically, the pre-analysis step reorders the matrix to minimize fill-ins
during sparse LU factorization. We will discuss the pre-analysis step in Sect. 3.2 in
detail. After a DC analysis to obtain the quiescent operating point, the SPICE kernel
enters the main body of transient simulation, taking the quiescent operating point as
the initial condition.
The main body of transient simulation is marked in blue in Fig. 1.2. Accord-
ing to the mathematical formulation presented in Sect. 1.1.1, SPICE-like transient
simulation has two nested levels of loops. The outer level is the transient iteration
and the inner level is the nonlinear Newton–Raphson iteration. The outer level loop
discretizes the DAE Eq. (1.1) into Eq. (1.2) (i.e., Eq. (1.3)) in the time domain
by some numerical integration method. The inner level loop solves the nonlinear
equation Eq. (1.3) using the Newton–Raphson method (i.e., Eq. (1.4)) at a particular
time node. Once the Newton–Raphson method converges, the time node is advanced
by estimating the local truncation error (LTE) of the adopted numerical integration
method, and then the inner level loop runs again at the new time node. Typically, a
SPICE-like transient simulation performs thousands of iterations.
Each iteration in the inner level loop is called a SPICE iteration. In the SPICE iter-
ation, a device model evaluation is first performed, which is followed by matrix/RHS
load. Device model evaluation uses the solution obtained in the previous SPICE iter-
ation. The purpose of the two steps is to calculate the Jacobian matrix and the RHS of
Eq. (1.4), i.e., the coefficient matrix A and the RHS vector b. After the linear system
is constructed, a sparse solver is invoked to solve it, and then we get the solution
of the current SPICE iteration. Typically, SPICE-like circuit simulators adopt sparse
LU factorization to solve the linear system. Matrices created by SPICE-like circuit
simulators have a unique feature: although the values change during SPICE
iterations, the symbolic pattern of the matrix remains unchanged. This is also one of
the reasons that SPICE-like circuit simulators usually adopt sparse LU factorization,
since some symbolic computations need to be executed only once.
It is well known that there are two types of methods to solve linear systems:
direct methods [6] and iterative methods [7]. SPICE-like circuit simulators usually
adopt sparse LU factorization, which belongs to direct methods. The main reasons
for using direct methods are the high numerical stability of direct methods and
the poor convergence of iterative methods. Iterative methods usually require good
preconditioners to make the matrix diagonally dominant such that they can converge
quickly. However, circuit matrices created by MNA are typically quite irregular
and nearly singular, so they are difficult to precondition. In addition, during SPICE
iterations, the matrix values always change, so the preconditioner must be rebuilt
in every iteration, which leads to a high performance penalty. On the contrary,
direct methods do not have this limitation. By careful pivoting during sparse LU
factorization, we can always get accurate solutions unless the matrix is
ill-conditioned. Another advantage of using direct methods in SPICE-like circuit
simulation is that, if a fixed time step is used in transient simulation of linear circuits,
the coefficient matrix A stays the same over all time nodes, so the LU factors also
stay the same and only forward/backward substitutions are required to solve the
linear system, which significantly saves the runtime of sparse LU factorization.
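The factor-once, substitute-many argument can be sketched in a few lines. The dense, unpivoted Doolittle code below is purely illustrative; the point is the split between the expensive factorization (done once) and the cheap substitutions (done once per time node).

```python
def lu_factor(A):
    """Doolittle factorization; returns combined L (unit diagonal) and U."""
    n = len(A)
    LU = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

def lu_backsolve(LU, b):
    """Forward/backward substitutions using precomputed factors."""
    n = len(LU)
    y = list(b)
    for i in range(n):                       # Ly = b
        y[i] -= sum(LU[i][k] * y[k] for k in range(i))
    for i in range(n - 1, -1, -1):           # Ux = y
        y[i] = (y[i] - sum(LU[i][k] * y[k] for k in range(i + 1, n))) / LU[i][i]
    return y

A = [[4.0, 1.0], [1.0, 3.0]]
LU = lu_factor(A)                            # factor once...
for b in ([5.0, 4.0], [9.0, 7.0], [1.0, 1.0]):
    print(lu_backsolve(LU, b))               # ...then substitute per time node
```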
1.2 Challenges of Parallel Circuit Simulation

With the advances in the scale and complexity of modern ICs, SPICE-like circuit
simulators are facing performance challenges. For modern analog and mixed-signal
circuits, pre-layout simulations can usually take a few days [8] and post-layout sim-
ulations can even take a few weeks [9]. The extremely long simulation time may
significantly affect the design efficiency and the time-to-market. In recent years, the
rapid evolution of parallel computers has greatly promoted the development of par-
allel SPICE simulation techniques. Accelerating SPICE-like circuit simulators by
parallel processing simulation tasks has become a popular research topic for a few
decades.
Generally speaking, parallelism can be achieved at two different granularities:
multi-core parallelism and multi-machine parallelism. In this book, we will focus on
multi-core parallelism, as it is easier to implement and its communication cost is
much smaller. Typically, multi-core parallelism is implemented by multi-threading
on shared-memory machines. Parallelism can be integrated into every step of the
SPICE-like simulation flow shown in Fig. 1.2. Considering the runtime of each step,
there are two major bottlenecks in SPICE-like transient simulation: device model
evaluation and the sparse direct solver. The two steps consume most of the simulation
time. To parallelize and accelerate SPICE-like circuit simulators, the primary task
is to parallelize the two steps. In this section, we will explain the challenges of
parallelizing SPICE-like circuit simulators.
1.2.1 Device Model Evaluation

Device model evaluation dominates the total simulation time for pre-layout circuits.
It may take up to 75% of the total simulation time and scales linearly with the
circuit size [10]. Parallelizing device model evaluation is straightforward, as one
only needs to distribute all the device models on multiple cores, achieving a simple
task-level parallelism. The inter-thread communication cost is almost zero, and load
balance is very easy to achieve by evenly distributing all the devices on multiple
cores. Such a method demonstrates good scalability for the device model eval-
uation step. However, even if the parallel efficiency of device model evaluation
reaches 100%, the overall parallel efficiency is still low due to many non-negligible
sequential simulation tasks. Another challenge comes from the pure computational
cost. As modern MOSFET models become more complex, the computational cost
also increases rapidly. To reduce the computational cost of device model evalua-
tion, people have proposed some acceleration techniques, such as piecewise linear
approximation of device models [11, 12] and hardware acceleration approaches [13].
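The task-level parallelism described above can be sketched with a thread pool that evenly distributes independent device evaluations across workers. The square-law "model" and the thread-based setup are illustrative assumptions; a real simulator evaluates far heavier model code (e.g., BSIM) in native threads.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_device(voltage):
    """Toy stand-in for a MOSFET model: (current, conductance) of a
    square-law device. No shared state, so devices are fully independent."""
    k = 1e-3
    return k * voltage * voltage, 2.0 * k * voltage

def evaluate_all(voltages, workers=4):
    # evenly distributing the devices across workers keeps the load balanced;
    # there is essentially no inter-thread communication.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_device, voltages))

voltages = [0.1 * i for i in range(1000)]
results = evaluate_all(voltages)
print(results[10])    # stamp values for the 11th device
```

Note that CPython threads illustrate only the partitioning scheme; the near-linear speedup quoted in the text assumes compute-bound native code without an interpreter lock.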
1.2.2 Sparse Direct Solver

The sparse direct solver dominates the total simulation time for post-layout circuits. It
may consume 50–90% of the total simulation time for large post-layout circuits [10].
Parallelizing the sparse direct solver is quite difficult. It is a big challenge that has
not been well solved for several decades. Although there are many popular software
packages that implement parallel sparse direct solvers, they are not suitable for circuit
matrices created by MNA. The following three features of circuit matrices make it
difficult to parallelize the sparse direct solver for circuit matrices.
Circuit matrices created by MNA are extremely sparse. The average number of
nonzero elements per row is typically less than 10. Such matrices are much sparser
than matrices from other areas, such as finite element analysis. This feature
leads to a strong requirement for a high-efficiency scheduling algorithm. If the
scheduling efficiency is not high enough, the scheduling overhead may dominate
the solver time, as the computational cost of each task is relatively small.
Data dependence in sparse LU factorization is quite strong. To realize a high-
efficiency parallel sparse direct solver, one should carefully investigate the data
dependence and explore parallelism as much as possible. Due to the sparse nature
of circuit matrices, data-level parallelism is not suitable for circuit matrices.
Instead, task-level parallelism should be adopted.
The symbolic pattern of circuit matrices is irregular. This feature affects load
balance of parallel LU factorization. In addition to the irregular symbolic pattern
of the matrix, dynamic numerical pivoting also changes the symbolic pattern of
the LU factors at runtime, making it difficult to achieve load balance, especially
when tasks are assigned offline.
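The dependence analysis and task-level parallelism discussed above can be illustrated by a simple levelization: columns with no mutual dependence form a level and may be factorized concurrently, a simplified view of cluster-mode style scheduling. The dependence rule (column j waits for every column k it uses) and the dependence sets below are made-up illustrative assumptions, not a real circuit matrix.

```python
def levelize(deps):
    """deps[j] = set of columns that column j depends on.
    Returns columns grouped into levels of mutually independent tasks."""
    level = {}
    for j in sorted(deps):                    # columns in ascending order
        level[j] = 1 + max((level[k] for k in deps[j]), default=-1)
    nlevels = max(level.values()) + 1
    levels = [[] for _ in range(nlevels)]
    for j, l in sorted(level.items()):
        levels[l].append(j)
    return levels

# hypothetical column dependences derived from an L pattern
deps = {0: set(), 1: set(), 2: {0}, 3: {1, 2}, 4: {0}, 5: {3, 4}}
print(levelize(deps))    # → [[0, 1], [2, 4], [3], [5]]
```

For the extremely sparse, irregular matrices described above, levels are wide but individual tasks are tiny, which is exactly why scheduling overhead (and, for long dependence chains, pipeline-style overlap) dominates the design of a parallel circuit solver.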
These features mean that the parallel efficiency of the sparse direct solver cannot
be high. Unlike device model evaluation, which can achieve nearly 100% parallel
efficiency, one can only expect a 4–6× speedup using eight cores for the sparse
direct solver. The scalability becomes even poorer as the number of cores grows,
and in some cases the performance may even drop when using more cores.
1.2.3 Theoretical Speedup

The famous Amdahl's law [14] says that the theoretical speedup of a parallel program
is mainly determined by the percentage of sequential tasks, as shown in the following
equation:

speedup = 1 / (r_s + r_p / P),  (1.11)

where r_s and r_p (r_s + r_p = 1) are the portions of sequential and parallel tasks,
respectively, and P is the number of used cores. In SPICE-like circuit simulation,
many tasks must be executed sequentially; otherwise the parallel overhead can be very
high. For example, matrix/RHS load after device model evaluation is also difficult to
parallelize, mainly due to memory conflicts. Namely, different devices may
fill the same position of the matrix/RHS, so a lock must be used for every position
of the matrix/RHS, leading to high cost due to numerous races. The cost grows
even higher as the number of used cores increases. These sequential tasks
significantly affect the efficiency and scalability of parallel SPICE simulations.
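Eq. (1.11) is easy to explore numerically; this small sketch tabulates how the sequential fraction caps the achievable speedup regardless of core count.

```python
def amdahl_speedup(r_s, cores):
    """Theoretical speedup per Eq. (1.11): r_s sequential, r_p = 1 - r_s parallel."""
    r_p = 1.0 - r_s
    return 1.0 / (r_s + r_p / cores)

for r_s in (0.05, 0.10):
    print(f"r_s = {r_s:.2f}: "
          + ", ".join(f"{p} cores -> {amdahl_speedup(r_s, p):.2f}x"
                      for p in (2, 4, 8, 16)))
```

With a 10% sequential fraction, 16 cores yield only a 6.4× speedup, and even infinitely many cores could never exceed 10×, which matches the discussion of Fig. 1.3 below.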
Fig. 1.3 Predicted theoretical speedups of parallel SPICE simulation for 2–16 cores
According to Eq. (1.11), Fig. 1.3 plots some predicted theoretical speedups of
parallel SPICE simulation. In this illustration, the parallel efficiencies of device model
evaluation and the sparse direct solver are assumed to be 100% and 70%, respectively.
As can be seen, even if the percentage of sequential tasks is only 5%, the speedup
can only be about 8 when using 16 cores. If the percentage of sequential tasks is
10%, the speedup drops to 6 when using 16 cores, corresponding to an overall
parallel efficiency of only 37.5%. To achieve highly scalable parallel simulations,
the parallel efficiency of all tasks must be very close to 100%, which also means
that the percentage of sequential tasks must be very close to zero. However, this is
impossible in practical SPICE-like circuit simulators. Consequently, for a practical
simulator, linear scalability cannot be achieved by simply parallelizing every task in
the simulation flow.
1.3 Focus of This Book
As explained in Sect. 1.2, device model evaluation is easy to parallelize and there are
many techniques to accelerate it, but the sparse direct solver is difficult to parallelize
or accelerate due to three challenges. In this book, we will describe a parallel sparse
direct solver named NICSLU (NICS is short for Nano-Scale Integrated
Circuits and Systems, the name of our laboratory at Tsinghua University). NICSLU
is specially designed for SPICE-like circuit simulation applications. In particular,
NICSLU is well suited for DC and transient simulations in SPICE-like simulators.
The following technical features make NICSLU a high-performance solver in
circuit simulation applications:
Three numerical techniques are integrated in NICSLU to achieve high numerical
stability: an efficient static pivoting algorithm in the pre-analysis step, a partial
pivoting algorithm in the factorization step, and an iterative refinement algorithm
in the right-hand-side solving step.
We propose an innovative framework to parallelize sparse LU factorization. It is
based on a detailed dependence analysis and contains two different scheduling
strategies, cluster mode and pipeline mode, to fit different data dependences and
sparsities of the matrix, making the scheduling efficient on multi-core central
processing units (CPUs).
Novel parallel sparse LU factorization algorithms are developed. Sufficient paral-
lelism is explored among highly dependent tasks by a novel pipeline factorization
algorithm.
In addition to the standard sparse LU factorization algorithm, we also propose
a map algorithm and a lightweight supernodal algorithm to accelerate the factor-
ization of extremely sparse matrices and slightly dense matrices, respectively. To integrate the three
numerical kernels together, we propose a simple but effective method to automat-
ically select the best algorithm according to the sparsity of the matrix.
A numerically stable pivoting reduction technique is proposed to reuse previous
information as much as possible during successive factorizations in circuit simu-
lation.
We have published five papers about NICSLU [15–19]. Most techniques presented
in this book are based on these publications. However, this book adds more introductory
content and updates the technical descriptions and experimental results.
References

1. Nagel, L.W.: SPICE 2: A computer program to simulate semiconductor circuits. Ph.D. thesis, University of California, Berkeley (1975)
2. Paul, C.: Fundamentals of Electric Circuit Analysis, 1st edn. Wiley, Manhattan, US (2001)
3. Ho, C.W., Ruehli, A.E., Brennan, P.A.: The modified nodal approach to network analysis. IEEE Trans. Circuits Syst. 22(6), 504–509 (1975)
4. Süli, E., Mayers, D.F.: An Introduction to Numerical Analysis, 2nd edn. Cambridge University Press, England (2003)
5. Turing, A.M.: Rounding-off errors in matrix processes. Q. J. Mech. Appl. Math. 1(1), 287–308 (1948)
6. Davis, T.A.: Direct Methods for Sparse Linear Systems, 1st edn. Society for Industrial and Applied Mathematics, US (2006)
7. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Boston, US (2004)
8. Ye, Z., Wu, B., Han, S., Li, Y.: Time-domain segmentation based massively parallel simulation for ADCs. In: Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE, pp. 1–6 (2013)
9. Cadence Corporation: Accelerating analog simulation with full SPICE accuracy. Technical report (2008)
10. Daniels, R., Sosen, H.V., Elhak, H.: Accelerating analog simulation with HSPICE precision parallel technology. Synopsys Corporation, Technical report (2010)
11. Li, Z., Shi, C.J.R.: A quasi-Newton preconditioned Newton–Krylov method for robust and efficient time-domain simulation of integrated circuits with strong parasitic couplings. In: Asia and South Pacific Conference on Design Automation 2006, pp. 402–407 (2006)
12. Li, Z., Shi, C.J.R.: A quasi-Newton preconditioned Newton–Krylov method for robust and efficient time-domain simulation of integrated circuits with strong parasitic couplings. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 25(12), 2868–2881 (2006)
13. Kapre, N., DeHon, A.: Performance comparison of single-precision SPICE model-evaluation on FPGA, GPU, Cell, and multi-core processors. In: 2009 International Conference on Field Programmable Logic and Applications, pp. 65–72 (2009)
14. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, pp. 483–485 (1967)
15. Chen, X., Wu, W., Wang, Y., Yu, H., Yang, H.: An EScheduler-based data dependence analysis and task scheduling for parallel circuit simulation. IEEE Trans. Circuits Syst. II: Express Briefs 58(10), 702–706 (2011)
16. Chen, X., Wang, Y., Yang, H.: An adaptive LU factorization algorithm for parallel circuit simulation. In: Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pp. 359–364 (2012)
17. Chen, X., Wang, Y., Yang, H.: NICSLU: an adaptive sparse matrix solver for parallel circuit simulation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 32(2), 261–274 (2013)
18. Chen, X., Wang, Y., Yang, H.: A fast parallel sparse solver for SPICE-based circuit simulators. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015, pp. 205–210 (2015)
19. Chen, X., Xia, L., Wang, Y., Yang, H.: Sparsity-oriented sparse solver design for circuit simulation. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016, pp. 1580–1585 (2016)
Chapter 2
Related Work
Parallel circuit simulation has been a popular research topic for several decades since
the invention of SPICE. Researchers have proposed a large number of parallelization
techniques for SPICE-like circuit simulation [1]. In this chapter, we comprehensively
review state-of-the-art studies on parallel circuit simulation techniques. Before that,
we briefly introduce classifications of these parallel techniques, which differ
depending on the point of view taken. From the implementation-platform point of view,
parallel circuit simulation techniques can be classified into software techniques and
hardware techniques. Hardware techniques include field-programmable gate array
(FPGA)- and graphics processing unit (GPU)-based acceleration approaches. Software
techniques, viewed from the domain of parallel processing, can be further classified
into direct parallel methods, parallel circuit-domain techniques, and parallel
time-domain techniques. From the algorithm level of parallel processing, there are
intra-algorithm and inter-algorithm parallel techniques.
According to the simulation flow shown in Fig. 1.2, the most straightforward way
to parallelize SPICE-like circuit simulators is to parallelize every step in the SPICE
simulation flow. Basically, the following major steps in the SPICE simulation flow
can be parallelized: netlist parsing and simulation setup, matrix pre-analysis, device
model evaluation, sparse direct solver, matrix/RHS load, and time node control.
However, as explained in Sect. 1.2, some steps are quite sequential and difficult to
parallelize. In addition, steps before entering SPICE iterations (i.e., netlist parsing,
simulation setup, and matrix pre-analysis) are executed only once, so the overall
performance is insensitive to their cost. Given their share of the total runtime,
one may focus only on parallelizing device model evaluation and the sparse direct
solver, which are the two most time-consuming components in the SPICE flow.

© Springer International Publishing AG 2017
X. Chen et al., Parallel Sparse Direct Solver for Integrated Circuit Simulation,
DOI 10.1007/978-3-319-53429-9_2

Such simulation techniques can be called direct parallel methods as
they are straightforward to implement in existing SPICE-like simulation tools. This
is also the conventional parallelization method adopted by many commercial prod-
ucts. As explained in Sect. 1.2, the parallel efficiency of device model evaluation can
be close to 100% but the parallel efficiency of other steps, especially the sparse direct
solver, cannot be as high as expected. This means that the overall parallel efficiency
is mainly limited by the poor scalability of those steps that cannot be efficiently par-
allelized. A detailed description of direct parallel methods is presented in an early
publication [2]. It gives several methods to improve the parallel efficiency for the
matrix/RHS load step using multiple locks or barriers.
In fact, for direct parallel methods, people pay more attention to the parallelization
of the sparse direct solver, due to its high runtime percentage and high difficulty of
parallelization. In what follows, we will review existing techniques for parallel direct
and iterative matrix solutions.
into regular dense matrix operations, such that the basic linear algebra subprograms
(BLAS) [17] and/or the linear algebra package (LAPACK) [18] can be invoked to deal
with dense submatrices. These solvers can be further classified into two categories:
supernodal methods and multifrontal methods.
[Figure: (a) a sparse matrix A with its L and U factor patterns; (b) the corresponding elimination tree.]
The main purpose of the multifrontal [22] technique is somewhat similar to that of the
supernodal technique, but the basic theory and implementation are quite different.
The multifrontal technique factorizes a sparse matrix with a sequence of dense frontal
matrices, each of which corresponds to one or more steps of the LU factorization. We
use the example shown in Fig. 2.3 to demonstrate the basic idea of the multifrontal
method. The first pivot, say element (1, 1), is selected, and then the first frontal
matrix is constructed by collecting all the nonzero elements that will contribute to
the elimination of the first pivot row and column by the right-looking algorithm, as
shown in Fig. 2.3b. The frontal matrix is then factorized by a dense right-looking-like
pivoting operation, resulting in the factorized frontal matrix shown in Fig. 2.3c. As
can be seen, the computations of the frontal matrix can be done by dense kernels
such as BLAS so the performance can be enhanced. After eliminating the first pivot,
the second pivot, say element (3, 2), is selected. A new frontal matrix is constructed
by collecting all the contributing elements that are from the original matrix and
the previous frontal matrix, as shown in Fig. 2.3d. It is then also factorized and the
[Fig. 2.3 (continued): (d) the second frontal matrix (pivot (3, 2)) before factorization; (e) the same frontal matrix after factorization.]
resulting frontal matrix is shown in Fig. 2.3e. The same procedure will be continued
until the LU factors are complete. The multifrontal technique can also be combined
with the supernodal technique to further improve the performance by simultaneously
processing multiple frontal matrices with identical patterns.
There are several levels of parallelism in the multifrontal algorithm [14]. First, one
can use the ET to schedule the computational tasks, such that independent frontal
matrices are processed concurrently; this is task-level parallelism. Second, if a
frontal matrix is large, it can be factorized by a parallel BLAS; this is data-level
parallelism. Third, the factorization of the dense node at the root of the ET
can be performed by a parallel LAPACK.
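As an illustration of the task-level parallelism above, the elimination tree that drives the scheduling can be computed directly from the matrix's symbolic structure. The following is a minimal Python/SciPy sketch of the classic ET construction with path compression; the function name and the toy matrix are our own, and a production solver would typically work on the pattern of A + Aᵀ for unsymmetric matrices:

```python
import numpy as np
from scipy.sparse import csc_matrix

def elimination_tree(A):
    """Elimination tree of a sparse matrix with symmetric nonzero pattern.

    parent[j] is the column that next touches the fill originating in
    column j; computed by the classic walk with path compression.
    """
    A = csc_matrix(A)
    n = A.shape[0]
    parent = np.full(n, -1)    # -1 marks a root of the tree/forest
    ancestor = np.full(n, -1)  # path-compressed "current root" pointers
    for j in range(n):
        for i in A.indices[A.indptr[j]:A.indptr[j + 1]]:
            # walk from each above-diagonal entry up to its current root
            while i != -1 and i < j:
                nxt = ancestor[i]
                ancestor[i] = j          # path compression
                if nxt == -1:
                    parent[i] = j        # found a new tree edge
                i = nxt
    return parent

# Example: the entry A[3, 0] forces fill that chains 0 -> 1 -> 2 -> 3.
A = csc_matrix(np.array([[1, 1, 0, 1],
                         [1, 1, 1, 0],
                         [0, 1, 1, 0],
                         [1, 0, 0, 1]], dtype=float))
assert list(elimination_tree(A)) == [1, 2, 3, -1]
```

Disjoint subtrees of the returned parent array can be assigned to different threads, since their frontal matrices do not interact.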
Many software packages are based on the multifrontal technique. UMFPACK [8]
is an implementation of the multifrontal method to solve sparse linear systems.
Although the solver itself is purely sequential, its parallelism can be simply explored
by invoking a parallel BLAS. MUMPS [13–15] is a multifrontal-based distributed
sparse direct solver. WSMP [16] is a collection of various algorithms to solve sparse
linear systems that can be executed in both sequential and parallel modes. For sparse
unsymmetric matrices, it adopts the multifrontal algorithm.
S̃ ≈ S = G − RD⁻¹C

3: Solve Dz = b₁
4: Solve Sx₂ = b₂ − Rz using iterative methods, where S̃ is used as the
pre-conditioner. S is the exact Schur complement, but it does not need to be
explicitly formed.
5: Solve Dx₁ = b₁ − Cx₂
S = G − RD⁻¹C. (2.2)
ShyLU solves it using the algorithm shown in Algorithm 1. The second level of the
hybrid comes from combining multi-machine and multi-core parallelism. ShyLU has also
been tested in SPICE-like circuit simulation. According to very limited results [28],
the performance of ShyLU in circuit simulation, especially the speedup over KLU,
is not so remarkable (the speedup over KLU is only about 20× using 256 cores for a
particular circuit).
To date, very few sparse linear solvers have been specially designed for circuit simu-
lation applications, and very few public results of sparse linear solvers have been
reported for circuit matrices. We believe that a comprehensive comparison and investigation
between various algorithms of sparse linear solvers on circuit matrices from different
applications can provide lots of new insights and guidelines to the development of
sparse linear solvers for circuit simulation.
Compared with direct methods, iterative methods can significantly reduce the mem-
ory requirement as they are executed almost in place. Iterative methods are also quite
easy to parallelize, as the core operation is just sparse matrix–vector multiplication
(SpMV). There are a great number of parallel SpMV implementations on modern
multi-core CPUs, many-core GPUs, and reconfigurable FPGAs [29–34]. However,
few studies have investigated iterative methods for solving the linear systems in
SPICE-like circuit simulation applications. Commercial general-purpose circuit
simulators rarely use iterative methods, mainly due to the convergence and
robustness issues of iterative methods. To improve the convergence,
iterative methods require good pre-conditioners, which should have the following
two properties. First, the pre-conditioner should approximate the matrix very well
to ensure good convergence. Second, the inverse of the pre-conditioner should be
cheap to compute to reduce the runtime of the linear solver. In most cases, we do
not need to explicitly calculate the inverse but the equivalent implicit computations
should also be cheap. For parallel iterative methods, many research efforts have been
carried out on how to build robust pre-conditioners, as iterative methods themselves
are straightforward to parallelize.
An example of a pre-conditioned linear system can be simply expressed as follows:
M⁻¹Ax = M⁻¹b, (2.4)
where M is the pre-conditioner. M is selected such that solving the linear system
of Eq. (2.4) by iterative methods converges much faster than solving the original
linear system Ax = b. If M is exactly A, then the coefficient matrix on the left side
of Eq. (2.4) is the identity matrix, so the system is trivially solved. However,
obtaining the exact A⁻¹ is equivalent to having already solved the original linear
system. In other words, it is unnecessary to compute the exact inverse. Instead, the
pre-conditioner should be selected such that it approximates the matrix as closely
as possible while being very cheap to compute.
In mathematics, pre-conditioner techniques can be classified into two main
categories: incomplete factorization pre-conditioners and approximate inverse pre-
conditioners [35]. Incomplete factorization tries to find an approximate factorization
of the matrix, i.e.,

A ≈ L̃Ũ. (2.5)
Researchers have proposed some iterative algorithms that can efficiently calculate
the sparse approximate inverse matrix M [35].
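To make the two requirements concrete, the sketch below uses an incomplete LU factorization as M and solves the preconditioned system of Eq. (2.4) with GMRES via SciPy. The tridiagonal test matrix is a generic stand-in rather than a real circuit matrix, and the drop tolerance and fill factor are illustrative values only:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# A generic sparse test system standing in for a circuit matrix.
n = 100
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Property 2: the pre-conditioner is cheap -- an incomplete LU keeps
# only large entries / limited fill, so A ~= L~ U~ at low cost.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)

# Property 1: M approximates A well; M^{-1}v is applied implicitly
# through the triangular solves of the incomplete factors.
M = spla.LinearOperator((n, n), matvec=ilu.solve)

# Solve M^{-1} A x = M^{-1} b by GMRES (Eq. 2.4).
x, info = spla.gmres(A, b, M=M)
assert info == 0                                     # converged
assert np.linalg.norm(A @ x - b) < 1e-3 * np.linalg.norm(b)
```

Running the same call without `M` typically needs many more iterations, which is exactly the trade-off discussed above.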
Based on the theory of these pre-conditioners, a few parallel pre-conditioners have
been developed for circuit simulation problems [36–38]. A common feature of these
early works is that they treat the pre-conditioner and the iterative solver as a black
box and do not utilize any information from circuit simulation.
In SPICE-like circuit simulation, there is another opportunity to apply pre-
conditioners for iterative solvers. Due to the quadratic convergence of the
Newton–Raphson method, the matrix values change slowly during SPICE iterations,
especially when the Newton–Raphson iterations are converging. This property
provides an opportunity to use the LU factors from a certain iteration as a
pre-conditioner for subsequent iterations, which are solved by sequential or
parallel generalized minimal residual (GMRES) methods [39–41]. Compared with the
previous approaches that apply additional pre-conditioners, the computational cost
of the pre-conditioner is almost negligible in these methods, as computing the
pre-conditioner, i.e., the complete LU factorization, is an inherent step in
circuit simulation. Another advantage is that the pre-conditioner can be reused
over multiple iterations if the matrix values change very slowly. However, due to
the sensitivity of iterative methods to matrix values, it is difficult to judge
when the pre-conditioner becomes invalid. To overcome this problem, the nonlinear
device models are piecewise linearized, and once nonlinear devices change their
operating regions, the pre-conditioner is updated [39–41].
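A minimal sketch of this reuse idea: the complete LU factors computed at one Newton–Raphson iteration serve as the GMRES pre-conditioner for the following iterations, whose matrices keep the same pattern while the values drift slowly. The drift model, matrix sizes, and values below are artificial:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 50
d = np.random.default_rng(0).standard_normal(n)

def jacobian(t):
    """Stand-in for the circuit matrix at Newton iteration t: the values
    drift slowly (quadratic NR convergence), the pattern stays fixed."""
    return sp.diags([-1.0, 4.0 + 0.01 * t * d, -1.0], [-1, 0, 1],
                    shape=(n, n), format="csc")

b = np.ones(n)

# Complete LU factorization at iteration 0 -- an inherent step of the
# simulation anyway -- kept around as the pre-conditioner M.
lu0 = spla.splu(jacobian(0))
M = spla.LinearOperator((n, n), matvec=lu0.solve)

for t in range(1, 4):          # subsequent Newton-Raphson iterations
    A = jacobian(t)            # slightly changed values, same pattern
    x, info = spla.gmres(A, b, M=M)
    assert info == 0           # the stale LU still pre-conditions well
```

Once the matrix values drift too far (e.g., devices change operating regions), `lu0` would be recomputed, which is exactly the update rule described above.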
[Fig. 2.4 Creation of a support-circuit pre-conditioner for an example MOSFET circuit (elements R1–R5, Cgs, Cgd, Cds, gds, gmVgs): (c) original weighted graph; (d) sparsified weighted graph.]
The above pre-conditioners are purely based on matrix information, com-
pletely ignoring circuit-level information. In other words, they are pure matrix-
based methods. Another type of pre-conditioner, named the support-circuit
pre-conditioner [42–44], utilizes circuit-level information and is based on support-graph
and graph sparsification theories [45]. The basic idea is to extract a highly sparsified
circuit network, called a support circuit, that is very close to the original cir-
cuit, so that matrix factorization for the support circuit can be done almost in
linear time and can serve as the pre-conditioner for GMRES. Figure 2.4 shows
an example of the creation of the support-circuit pre-conditioner.
Sandia National Laboratories has proposed another type of pre-conditioner for
SPICE-like circuit simulation [46]. It first partitions the circuit into several blocks
and then uses the block Jacobi pre-conditioner for the GMRES solver. This approach
fails on some circuits so its applicability in real SPICE-like circuit simulation needs
further investigation.
A common problem with pre-conditioned iterative methods in SPICE-like circuit
simulation is universality. Although existing studies have shown that the
proposed approaches work well for the circuits they have tested, unlike direct
methods, there is no guarantee that these approaches will also work well for
arbitrary circuits. All of the existing iterative methods in circuit simulation are
more or less ad hoc approaches, and, hence, more universality should be pursued.
The concept of domain decomposition has different meanings under various con-
texts. Generally speaking, domain decomposition can be described as a method that
solves a large problem by partitioning the problem into multiple small subprob-
lems and then solving these subproblems separately. From the circuit point of view,
to realize parallel simulation, a natural idea is to partition the circuit into multiple
subcircuits such that each subcircuit can be solved independently, if the boundary
condition is properly formulated at either circuit level or matrix level. Actually,
domain decomposition is widely used in modern parallel circuit simulation tools,
especially in fast SPICE simulation techniques. There are basically several types
of methods in domain decomposition-based parallel simulation techniques: parallel
bordered block-diagonal (BBD)-form matrix solutions, parallel multilevel Newton
methods, parallel Schwarz methods, and parallel waveform relaxation methods.
This type of method is more like a matrix-level technique than a domain
decomposition technique. However, building a BBD-form matrix requires partitioning
the circuit, and the performance of solving the BBD-form matrix strongly depends on
the quality of the partition, so we place this type of method under domain decomposition
instead of direct parallel methods.
Figure 2.5 illustrates how to create the BBD form by circuit partitioning. The
circuit is partitioned into K non-overlapping subdomains, in which one subdomain
contains all the interface nodes and the other subdomains are subcircuits. After such a
partitioning, the matrix created by MNA naturally has a BBD form, with
K − 1 diagonal block matrices D_1, . . . , D_{K−1}, K − 1 bottom-border block matrices
R_1, . . . , R_{K−1}, K − 1 right-border block matrices C_1, . . . , C_{K−1}, and a right-bottom
block matrix G. The diagonal blocks correspond to the internal equations of all the
subcircuits. The border blocks correspond to all the connections between subcircuits
and interface nodes. The right-bottom block corresponds to the internal equations
of the interface nodes. LU factorization of a BBD-form matrix is based on the Schur complement:
Fig. 2.5 Illustration of how to create the BBD form by circuit partitioning
D_k = L_k U_k,   k = 1, . . . , K − 1
R̃_k = R_k U_k⁻¹,   k = 1, . . . , K − 1
C̃_k = L_k⁻¹ C_k,   k = 1, . . . , K − 1
G̃ = G − Σ_{k=1}^{K−1} R̃_k C̃_k
G̃ = L_K U_K
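The factorization above can be sketched in a few lines of NumPy on a dense toy BBD system with two diagonal blocks; the per-block solves inside the loop are the part that would run in parallel. Block sizes and values are arbitrary:

```python
import numpy as np

# Toy BBD system: two diagonal blocks D1, D2 plus borders R, C and G.
rng = np.random.default_rng(1)
nb, ni = 3, 2                              # block size, #interface nodes
D = [4 * np.eye(nb) + rng.random((nb, nb)) for _ in range(2)]
C = [rng.random((nb, ni)) for _ in range(2)]
R = [rng.random((ni, nb)) for _ in range(2)]
G = 4 * np.eye(ni)
A = np.block([[D[0], np.zeros((nb, nb)), C[0]],
              [np.zeros((nb, nb)), D[1], C[1]],
              [R[0], R[1], G]])
b = np.ones(2 * nb + ni)

# Schur-complement elimination: each diagonal block is handled
# independently (the parallel part); the interface block comes last.
G_t = G.copy()
y = []
for k in range(2):
    G_t -= R[k] @ np.linalg.solve(D[k], C[k])  # G~ = G - sum Rk Dk^-1 Ck
    y.append(np.linalg.solve(D[k], b[k * nb:(k + 1) * nb]))
x_if = np.linalg.solve(G_t, b[2 * nb:] - sum(R[k] @ y[k] for k in range(2)))
x_blk = [np.linalg.solve(D[k], b[k * nb:(k + 1) * nb] - C[k] @ x_if)
         for k in range(2)]
x = np.concatenate(x_blk + [x_if])
assert np.allclose(A @ x, b)               # matches the monolithic solve
```

The quality of the circuit partition shows up here directly: small borders R_k, C_k keep the interface block G̃ small, which is the sequential bottleneck of this scheme.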
The above BBD-form matrix solutions are still matrix-level approaches rather than
real circuit-level approaches. The idea can be extended to solving nonlinear equa-
tions via the concept of the multilevel Newton technique [54–57]. Multilevel Newton
methods are actually algorithm-level methods, but they operate at the circuit
level.
The basic idea of multilevel Newton methods can be described as follows. Each
subdomain is first solved separately using the NewtonRaphson method with a given
boundary condition, and then the top-level nonlinear equation is solved by integrating
the updated solutions from all the subdomains. The two levels of NewtonRaphson
iterations are repeated until all the boundary conditions are converged. Multilevel
Newton methods can be formulated as follows. After the circuit is partitioned into
K subdomains, in which one subdomain contains the interface nodes, we have K
equations to describe the whole system:

f_i(x_i, u) = 0,   i = 1, 2, . . . , K − 1,   (2.7)
g(x_1, x_2, . . . , x_{K−1}, u) = 0.   (2.8)
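A minimal numeric sketch of the two-level iteration, on a toy system with one subdomain equation f(x, u) = x³ + x − u and one interface equation g(x, u) = u + x − 2 (both equations are invented for illustration): the inner Newton loop solves the subdomain for a fixed boundary value u, and the outer Newton loop updates u from the interface residual.

```python
def inner_solve(u, x=0.0, tol=1e-12):
    """Inner Newton loop: solve the subdomain equation
    f(x, u) = x**3 + x - u = 0 for x with the boundary value u fixed.
    (Each subdomain would run its own copy of this loop in parallel.)"""
    for _ in range(60):
        f = x ** 3 + x - u
        if abs(f) < tol:
            break
        x -= f / (3 * x ** 2 + 1)     # f'(x) = 3x^2 + 1 >= 1
    return x

def outer_residual(u):
    """Top-level interface equation g(x(u), u) = u + x(u) - 2."""
    return u + inner_solve(u) - 2.0

# Outer Newton loop on the boundary variable u; the derivative of the
# outer residual is approximated by a finite difference.
u, h = 1.0, 1e-7
for _ in range(30):
    g = outer_residual(u)
    if abs(g) < 1e-10:
        break
    u -= g * h / (outer_residual(u + h) - g)

x = inner_solve(u)
assert abs(x ** 3 + x - u) < 1e-8     # subdomain equation satisfied
assert abs(u + x - 2.0) < 1e-8        # interface equation satisfied
```

The two nested loops repeating until both residuals converge is the essence of the multilevel Newton scheme; a real simulator would of course use analytic sensitivities rather than finite differences.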
The above parallel BBD-form matrix solutions and parallel multilevel Newton meth-
ods are both master–slave approaches, in which the master may become a severe
bottleneck. To resolve this bottleneck, Schwarz methods can be adopted [58].
Different from the above non-overlapping partitioning methods, in Schwarz methods
the circuit is partitioned into multiple overlapped subdomains.
A parallel simulation approach using the Schwarz alternating procedure has been
proposed in [59, 60]. A circuit can be partitioned into K − 1 nonlinear subdomains
Ω_1, Ω_2, . . . , Ω_{K−1} and a linear subdomain Ω_K. This is equivalent to partitioning the
matrix A into K − 1 overlapped submatrices A_1, A_2, . . . , A_{K−1} corresponding to all
the nonlinear subdomains, and a background matrix A_K corresponding to the overlaps
of the subdomains Ω_1, Ω_2, . . . , Ω_{K−1} and the linear subdomain Ω_K, as illustrated in
Fig. 2.6. After partitioning, the linear systems arising during SPICE simulation are
solved by the Schwarz alternating procedure, as shown in Algorithm 3, in which all
the subdomains are solved in parallel.
Compared with parallel BBD-form matrix solutions and parallel multilevel New-
ton methods, the main advantage of parallel Schwarz methods is that they do not
follow the master–slave parallelization framework but involve only point-to-point
communications, potentially resulting in better
Fig. 2.6 Illustration of overlapped circuit partitioning and its corresponding matrix partitioning
3: repeat
4:   for k = 1 to K in parallel do
5:     Solve A_k δ_k = r_k
6:     Update the solution x_k = x_k + δ_k
parallel scalability, as the bottleneck of the master is avoided. Since Schwarz meth-
ods belong to the category of iterative methods, they suffer from convergence
problems. A general conclusion is that the convergence speed can be significantly
improved by enlarging the overlapping areas. However, increasing the overlaps leads
to higher computational cost.
C(x(t)) dx(t)/dt + b(u(t), x(t)) = 0. (2.10)
If we use the Gauss–Seidel method to solve Eq. (2.10), it results in the following
iteration:

  Σ_{j=1}^{i} C_{ij}(x_1^{(k+1)}, . . . , x_i^{(k+1)}, x_{i+1}^{(k)}, . . . , x_N^{(k)}) dx_j^{(k+1)}/dt
+ Σ_{j=i+1}^{N} C_{ij}(x_1^{(k+1)}, . . . , x_i^{(k+1)}, x_{i+1}^{(k)}, . . . , x_N^{(k)}) dx_j^{(k)}/dt
+ b_i(x_1^{(k+1)}, . . . , x_i^{(k+1)}, x_{i+1}^{(k)}, . . . , x_N^{(k)}) = 0,   i = 1, 2, . . . , N,   (2.11)
where the superscript is the iteration count. Waveform relaxation solves the circuit
DAE Eq. (1.1) in a given time interval by iterating Eq. (2.11) until the solution is
converged.
To enable parallel waveform relaxation, one also needs to partition the circuit into
subcircuits, while the interactions between subcircuits are approximated by proper
devices, e.g., artificial sources. A DAE is built for each subcircuit and then solved
by waveform relaxation based on Eq. (2.11). When solving a subcircuit, interactions
from other subcircuits are considered, and the latest solutions of the interacting
subcircuits are always used. As can be seen, parallel waveform relaxation combines
both domain decomposition and time-domain parallelism.
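The following sketch applies Gauss–Seidel-style waveform relaxation, the kind of iteration Eq. (2.11) describes, to a two-node RC line: each node is integrated over the whole time window using the other node's latest waveform, and the result is checked against backward Euler applied to the fully coupled system. All circuit values are arbitrary:

```python
import numpy as np

# Two-node RC line:  Cc*x1' = -(g1+gc)*x1 + gc*x2 + u(t)
#                    Cc*x2' =  gc*x1 - (g2+gc)*x2
Cc, g1, g2, gc = 1.0, 1.0, 1.0, 0.5
T, steps = 5.0, 500
h = T / steps
u = np.ones(steps + 1)                       # unit-step input

def integrate(a, forcing):
    """Backward Euler for Cc*x' = -a*x + forcing, x(0) = 0."""
    x = np.zeros(steps + 1)
    for n in range(steps):
        x[n + 1] = (Cc * x[n] + h * forcing[n + 1]) / (Cc + h * a)
    return x

# Waveform relaxation: each node is integrated over the whole window
# using the other node's most recent waveform (Gauss-Seidel flavor).
x1 = np.zeros(steps + 1)
x2 = np.zeros(steps + 1)
for _ in range(100):
    x1n = integrate(g1 + gc, gc * x2 + u)    # old waveform of node 2
    x2n = integrate(g2 + gc, gc * x1n)       # fresh waveform of node 1
    delta = max(abs(x1n - x1).max(), abs(x2n - x2).max())
    x1, x2 = x1n, x2n
    if delta < 1e-12:
        break

# Reference: backward Euler applied to the coupled 2x2 system.
Amat = np.array([[g1 + gc, -gc], [-gc, g2 + gc]])
X = np.zeros((steps + 1, 2))
for n in range(steps):
    rhs = Cc * X[n] + h * np.array([u[n + 1], 0.0])
    X[n + 1] = np.linalg.solve(Cc * np.eye(2) + h * Amat, rhs)
assert np.allclose(X[:, 0], x1, atol=1e-8)
assert np.allclose(X[:, 1], x2, atol=1e-8)
```

In a parallel implementation, the two `integrate` calls would run on different threads (Gauss–Jacobi flavor), each subcircuit seeing the other's waveform from the previous sweep.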
Although waveform relaxation has been widely studied since the 1980s, it is
actually not widely used in practical circuit simulators today. The reasons mainly
lie in the convergence conditions and limitations of waveform relaxation. As wave-
form relaxation is an iterative method, convergence is always a concern. A necessary
condition for Eq. (2.10) to have a unique solution is that the inverse matrix C(x(t))⁻¹
exists. This also implies that there must be a grounded capacitor at each node. Such
a requirement cannot always be satisfied for actual circuits, especially for pre-layout
circuits. In addition, waveform relaxation also requires that one node of each inde-
pendent voltage source or inductor be grounded, which further restricts the
applicability of waveform relaxation.
Except for the relaxation methods, most of the above-mentioned methods share a
common feature: the parallelism is explored at each time node. If we instead consider
the whole time axis in transient simulation, parallelism can also be explored in the
time domain by many other techniques; namely, different time nodes may be computed
concurrently, either by parallel integration algorithms or by multiple algorithms
calculating different time nodes. As mentioned in Sect. 1.1.3, the DAE associated
with transient simulation is usually solved by numerical integration algorithms.
Numerical integration algorithms are typically completely sequential at the
time-node level, as a node can be computed only after one or more previous nodes are
finished. To explore parallelism in the time domain, one needs to carefully
resolve this problem.
[Figure: backward pipelining along the time axis, showing time nodes t1, t2, t3, t3′, t4, t4′ and the backward steps.]
a new time node using the solutions at t3 and t3′ as the initial conditions, and can move
forward by a larger time step to t4, compared with a sequential integration method
that uses the solutions at t3 and t2 as the initial conditions, due to the reduced LTE.
At the same time, the second thread calculates the solution at t4′, which is smaller than
t4. The calculations of t3′ and t4′ are called backward steps. As can be seen, backward
pipelining results in larger time steps and thus accelerates transient simulation along
the time axis. The basic principle behind backward pipelining is that it provides better
initial conditions so that the integration time step can be larger.
[Figure: shared solution vector spanning from t_tail to t_head, protected by a lock.]
independently in parallel to process the same simulation task. Each algorithm main-
tains a complete SPICE context including the sparse direct solver, device model
evaluation, numerical integration method, Newton–Raphson iterations, etc. Due to
the different characteristics of these algorithms, their speeds along the time axis also
differ. High performance is achieved by an algorithm selection strategy. To
synchronize the solutions of these algorithms, a solution vector containing the K latest
time nodes is maintained. Let t_head and t_tail be the first and last time nodes of the
solution vector. As the vector is global and can be accessed by all the algorithms, a
lock is required when an algorithm attempts to access it. The update strategy for the
solution vector can be described as follows. Once an algorithm finishes solving one
time node, it accesses the solution vector by acquiring the lock. If the current time
node, say t_alg, is beyond t_head, then t_head is updated to t_alg and t_tail is moved forward
by one node. If t_alg is between t_tail and t_head, then t_alg is inserted and t_tail is still moved
forward by one node. However, if t_alg is behind t_tail, this algorithm
is too slow, so the current solution is discarded, and the algorithm then picks
up the latest time node in the solution vector, i.e., t_head, to calculate the next time
node. Additionally, before each algorithm starts to calculate the next new time node,
it also checks the solution vector to load the latest time node. Such a scheduling and
update policy implies an algorithm selection strategy that always selects the fastest
algorithms at any time, so that the speedups over a single-algorithm simulation
can be far beyond the number of used cores.
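The update policy can be sketched as a small lock-protected data structure. All names here are ours, and a real simulator would store full solution vectors plus time-step metadata rather than toy payloads:

```python
import threading

class SolutionVector:
    """Shared buffer of the K latest solved time nodes."""

    def __init__(self, K):
        self.K = K
        self.nodes = []                  # sorted list of (time, solution)
        self.lock = threading.Lock()

    def t_head(self):
        return self.nodes[-1][0]         # newest time node

    def t_tail(self):
        return self.nodes[0][0]          # oldest kept time node

    def submit(self, t_alg, solution):
        """Apply the update policy; returns the node to continue from."""
        with self.lock:
            if self.nodes and t_alg <= self.t_tail():
                # This algorithm fell behind t_tail: discard its result
                # and restart it from the latest node t_head.
                return self.nodes[-1]
            # t_alg is beyond t_head or between t_tail and t_head:
            # insert it, and move t_tail forward once K is exceeded.
            self.nodes.append((t_alg, solution))
            self.nodes.sort(key=lambda node: node[0])
            if len(self.nodes) > self.K:
                self.nodes.pop(0)
            return self.nodes[-1]

sv = SolutionVector(K=3)
sv.submit(1.0, "a")
sv.submit(2.0, "b")                       # beyond t_head
sv.submit(1.5, "c")                       # between t_tail and t_head
assert (sv.t_tail(), sv.t_head()) == (1.0, 2.0)
sv.submit(3.0, "d")                       # t_tail moves forward
assert (sv.t_tail(), sv.t_head()) == (1.5, 3.0)
assert sv.submit(1.2, "e") == (3.0, "d")  # too slow: result discarded
```

Since every `submit` returns the current head, each algorithm automatically continues from the fastest algorithm's progress, which is the implicit selection strategy described above.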
Different from the above two approaches, a cruder method to implement parallel time-
domain simulation is to directly partition the time domain, such that each segment of
the time domain can be computed in parallel [75]. The major problem is that the initial
solution of each segment, which is necessary in any numerical integration method,
is unknown. However, considering the fact that many actual circuits have a stationary
operating status, with different initial solutions the circuit will eventually reach a
stationary status, so the response will finally converge. This fact enables us to simulate
the time-domain response in parallel by partitioning the time domain into multiple
segments. The initial solution of each segment is selected as the DC operating point.
Of course, the waveform obtained by this method has errors. However, if we only need
to calculate some high-level or frequency-domain factors of analog circuits, such as
the signal to noise-plus-distortion ratio, this method can be applied, because a small
error in the waveform does not affect the frequency-domain response. Experimental
results show that this method can accelerate analog circuit simulations by more than
50× using 100 cores. Nevertheless, such a method is not a unified approach
and can only be applied to special simulations of special circuits.
The matrix exponential method [76] is another approach to solve the circuit DAE
expressed as Eq. (1.1). Unlike conventional numerical integration methods such as
the backward Euler method or the trapezoid method [77] which are implicit, the
matrix exponential method is explicit but also A-stable [78].
For the circuit DAE expressed as Eq. (1.1), the matrix exponential method says
that its solution within the time interval [tn , tn+1 ] can be written as the following
form [79]:
x(t_{n+1}) = −((t_{n+1} − t_n)/2) C⁻¹(x(t_{n+1})) f(x(t_{n+1}))
  + e^{(t_{n+1} − t_n)J(x(t_n))} [ x(t_n) + ((t_{n+1} − t_n)/2) C⁻¹(x(t_n)) f(x(t_n)) ]
  + ( e^{(t_{n+1} − t_n)J(x(t_n))} − I ) J⁻¹(x(t_n)) C⁻¹(x(t_n)) u(t_n)
  + ( e^{(t_{n+1} − t_n)J(x(t_n))} − (t_{n+1} − t_n)J(x(t_n)) − I ) J⁻²(x(t_n))
    × [ C⁻¹(x(t_{n+1})) u(t_{n+1}) − C⁻¹(x(t_n)) u(t_n) ] / (t_{n+1} − t_n).   (2.13)
The cost of computing the matrix exponential e^{(t_{n+1} − t_n)J(x(t_n))} (applied to a
vector) can be reduced using Krylov subspace methods [80, 81]. Parallelism can be
trivially exploited in Krylov subspace methods, as their major operation is just SpMV.
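A compact sketch of the Krylov idea: project A onto an m-dimensional Krylov subspace with Arnoldi, whose only matrix operation is SpMV, then take the exponential of the small Hessenberg matrix. The function and the test matrix below are illustrative; SciPy's `expm_multiply` serves as the reference:

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import expm
from scipy.sparse.linalg import expm_multiply

def krylov_expm(A, v, m=30):
    """Approximate exp(A) @ v from an m-dimensional Krylov subspace:
    Arnoldi gives A @ V_m ~= V_m @ H_m, so
    exp(A) @ v ~= ||v|| * V_m @ expm(H_m) @ e1."""
    n = len(v)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(v)
    V[:, 0] = v / beta
    for j in range(m):
        w = A @ V[:, j]                      # the only SpMV
        for i in range(j + 1):               # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:              # happy breakdown
            m = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m)
    e1[0] = 1.0
    return beta * V[:, :m] @ (expm(H[:m, :m]) @ e1)

# Compare against SciPy's reference on a 1-D Laplacian-like Jacobian.
A = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(200, 200), format="csr")
v = np.ones(200)
assert np.allclose(krylov_expm(A, v), expm_multiply(A, v), atol=1e-8)
```

The dense exponential is taken only of the small m × m matrix H_m, while all work on the large sparse A is SpMV, which is why the method parallelizes well.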
Generally speaking, compared with conventional numerical integration methods,
the matrix exponential method has advantages in performance, accuracy, and
scalability. It has been studied for both nonlinear [82–84] and linear [85, 86] cir-
cuit simulation. However, as a new technique in SPICE-like circuit simulation, its
applicability to general nonlinear circuits, especially highly stiff systems, still
requires further investigation.
2.4 Hardware Acceleration Techniques
In recent years, with the rapid development of various accelerators such as GPUs and
FPGAs, hardware acceleration techniques have been widely used in many areas to
accelerate scientific computing. State-of-the-art accelerators provide far more
computing and memory resources than general-purpose CPUs, offering much
higher computing capability and memory bandwidth. However, despite their
claimed generality in computing, there are architectural limitations that must
be dealt with when developing general-purpose applications such as circuit simu-
lation. GPUs and FPGAs have recently been investigated to accelerate SPICE-like
circuit simulation. Existing research mainly focuses on accelerating device
model evaluation and the sparse direct solver.
References
1. Li, P.: Parallel circuit simulation: a historical perspective and recent developments. Found.
Trends Electron. Des. Autom. 5(4), 211–318 (2012)
2. Saleh, R.A., Gallivan, K.A., Chang, M.C., Hajj, I.N., Smart, D., Trick, T.N.: Parallel circuit
simulation on supercomputers. Proc. IEEE 77(12), 1915–1931 (1989)
3. Li, X.S.: Sparse Gaussian elimination on high performance computers. Ph.D. thesis, Computer
Science Division, UC Berkeley, California, US (1996)
4. Li, X.S., Demmel, J.W.: SuperLU_DIST: a scalable distributed-memory sparse direct solver
for unsymmetric linear systems. ACM Trans. Math. Softw. 29(2), 110–140 (2003)
5. Li, X.S.: An overview of SuperLU: algorithms, implementation, and user interface. ACM
Trans. Math. Softw. 31(3), 302–325 (2005)
6. Demmel, J.W., Eisenstat, S.C., Gilbert, J.R., Li, X.S., Liu, J.W.H.: A supernodal approach to
sparse partial pivoting. SIAM J. Matrix Anal. Appl. 20(3), 720–755 (1999)
7. Demmel, J.W., Gilbert, J.R., Li, X.S.: An asynchronous parallel supernodal algorithm for
sparse Gaussian elimination. SIAM J. Matrix Anal. Appl. 20(4), 915–952 (1999)
31. Tang, W.T., Tan, W.J., Ray, R., Wong, Y.W., Chen, W., Kuo, S.H., Goh, R.S.M., Turner,
S.J., Wong, W.F.: Accelerating sparse matrix-vector multiplication on GPUs using bit-
representation-optimized schemes. In: SC '13: International Conference for High Per-
formance Computing, Networking, Storage and Analysis (SC), pp. 1–12 (2013)
32. Greathouse, J.L., Daga, M.: Efficient sparse matrix-vector multiplication on GPUs using the
CSR storage format. In: SC '14: International Conference for High Performance Computing,
Networking, Storage and Analysis, pp. 769–780 (2014)
33. Ashari, A., Sedaghati, N., Eisenlohr, J., Parthasarath, S., Sadayappan, P.: Fast sparse matrix-
vector multiplication on GPUs for graph applications. In: SC '14: International Conference for
High Performance Computing, Networking, Storage and Analysis, pp. 781–792 (2014)
34. Grigoras, P., Burovskiy, P., Hung, E., Luk, W.: Accelerating SpMV on FPGAs by compressing
nonzero values. In: 2015 IEEE 23rd Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), pp. 64–67 (2015)
35. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and
Applied Mathematics, Philadelphia, US (2003)
36. Basermann, A., Jaekel, U., Hachiya, K.: Preconditioning parallel sparse iterative solvers for
circuit simulation. In: Proceedings of the 8th SIAM Conference on Applied Linear Algebra,
Williamsburg, VA (2003)
37. Suda, R.: New iterative linear solvers for parallel circuit simulation. Ph.D. thesis, University
of Tokyo (1996)
38. Basermann, A., Jaekel, U., Nordhausen, M., Hachiya, K.: Parallel iterative solvers for sparse
linear systems in circuit simulation. Future Gener. Comput. Syst. 21(8), 1275–1284 (2005)
39. Li, Z., Shi, C.J.R.: An efficiently preconditioned GMRES method for fast parasitic-sensitive
deep-submicron VLSI circuit simulation. In: Design, Automation and Test in Europe, vol.
2, pp. 752–757 (2005)
40. Li, Z., Shi, C.J.R.: A quasi-Newton preconditioned Newton–Krylov method for robust and
efficient time-domain simulation of integrated circuits with strong parasitic couplings. Asia
S. Pac. Conf. Des. Autom. 2006, 402–407 (2006)
41. Li, Z., Shi, C.J.R.: A quasi-Newton preconditioned Newton–Krylov method for robust and
efficient time-domain simulation of integrated circuits with strong parasitic couplings. IEEE
Trans. Comput.-Aided Des. Integr. Circuits Syst. 25(12), 2868–2881 (2006)
42. Zhao, X., Han, L., Feng, Z.: A performance-guided graph sparsification approach to scalable
and robust SPICE-accurate integrated circuit simulations. IEEE Trans. Comput.-Aided Des.
Integr. Circuits Syst. 34(10), 1639–1651 (2015)
43. Zhao, X., Feng, Z.: GPSCP: a general-purpose support-circuit preconditioning approach to
large-scale SPICE-accurate nonlinear circuit simulations. In: 2012 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), pp. 429–435 (2012)
44. Zhao, X., Feng, Z.: Towards efficient SPICE-accurate nonlinear circuit simulation with on-
the-fly support-circuit preconditioners. In: Design Automation Conference (DAC), 2012
49th ACM/EDAC/IEEE, pp. 1119–1124 (2012)
45. Bern, M., Gilbert, J.R., Hendrickson, B., Nguyen, N., Toledo, S.: Support-graph precondi-
tioners. SIAM J. Matrix Anal. Appl. 27(4), 930–951 (2006)
46. Thornquist, H.K., Keiter, E.R., Hoekstra, R.J., Day, D.M., Boman, E.G.: A parallel precondi-
tioning strategy for efficient transistor-level circuit simulation. In: 2009 IEEE/ACM Inter-
national Conference on Computer-Aided Design: Digest of Technical Papers, pp. 410–417
(2009)
47. Chan, K.W.: Parallel algorithms for direct solution of large sparse power system matrix equa-
tions. IEE Proc.: Gener. Transm. Distrib. 148(6), 615–622 (2001)
48. Zecevic, A.I., Siljak, D.D.: Balanced decompositions of sparse systems for multilevel parallel
processing. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl. 41(3), 220–233 (1994)
49. Koester, D.P., Ranka, S., Fox, G.C.: Parallel block-diagonal-bordered sparse linear solvers
for electrical power system applications. In: Proceedings of the Scalable Parallel Libraries
Conference, 1993, pp. 195–203 (1993)
38 2 Related Work
50. Paul, D., Nakhla, M.S., Achar, R., Nakhla, N.M.: Parallel circuit simulation via binary link
formulations (PvB). IEEE Trans. Compon. Packag. Manuf. Technol. 3(5), 768782 (2013)
51. Hu, Y.F., Maguire, K.C.F., Blake, R.J.: Ordering unsymmetric matrices into bordered block
diagonal form for parallel processing. In: Euro-Par99 Parallel Processing: 5th International
Euro-Par Conference Toulouse, pp. 295302 (1999)
52. Aykanat, C., Pinar, A., atalyrek, U.V.: Permuting sparse rectangular matrices into Block-
Diagonal form. SIAM J. Sci. Comput. 25(6), 18601879 (2004)
53. Duff, I.S., Scott, J.A.: Stabilized bordered block diagonal forms for parallel sparse solvers.
Parallel Comput. 31(34), 275289 (2005)
54. Frohlich, N., Riess, B.M., Wever, U.A., Zheng, Q.: A new approach for parallel simulation
of VLSI circuits on a transistor level. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl.
45(6), 601613 (1998)
55. Honkala, M., Roos, J., Valtonen, M.: New multilevel Newton-Raphson method for parallel
circuit simulation. Proc. Eur. Conf. Circuit Theory Des. 1, 113116 (2001)
56. Zhu, Z., Peng, H., Cheng, C.K., Rouz, K., Borah, M., Kuh, E.S.: Two-Stage Newton-Raphson
method for Transistor-Level simulation. IEEE Trans. Comput.-Aided Des. Integr. Circuits
Syst. 26(5), 881895 (2007)
57. Rabbat, N., Sangiovanni-Vincentelli, A., Hsieh, H.: A multilevel newton algorithm with
macromodeling and latency for the analysis of Large-Scale nonlinear circuits in the time
domain. IEEE Trans. Circuits Syst. 26(9), 733741 (1979)
58. Smith, B., Bjorstad, P., Gropp, W.: Domain Decomposition: Parallel Multilevel Methods for
Elliptic Partial Differential Equations, 1st edn. Cambridge University Press (2004)
59. Peng, H., Cheng, C.K.: Parallel transistor level circuit simulation using domain decomposition
methods. In: 2009 Asia and South Pacific Design Automation Conference, pp. 397402 (2009)
60. Peng, H., Cheng, C.K.: Parallel transistor level full-Chip circuit simulation. In: 2009 Design,
Automation Test in Europe Conference Exhibition, pp. 304307 (2009)
61. Lelarasmee, E., Ruehli, A.E., Sangiovanni-Vincentelli, A.L.: The waveform relaxation
method for Time-Domain analysis of large scale integrated circuits. IEEE Trans. Comput.-
Aided Des. Integr. Circuits Syst. 1(3), 131145 (1982)
62. Achar, R., Nakhla, M.S., Dhindsa, H.S., Sridhar, A.R., Paul, D., Nakhla, N.M.: Parallel and
scalable transient simulator for power grids via waveform relaxation (PTS-PWR). IEEE Trans.
Very Large Scale Integr. (VLSI) Syst. 19(2), 319332 (2011)
63. Odent, P., Claesen, L., Man, H.D.: A combined waveform Relaxation-Waveform relaxation
newton algorithm for efficient parallel circuit simulation. In: Proceedings of the European
Design Automation Conference, 1990, EDAC, pp. 244248 (1990)
64. Rissiek, W., John, W.: A dynamic scheduling algorithm for the simulation of MOS and Bipolar
circuits using waveform relaxation. In: Design Automation Conference, 1992, EURO-VHDL
92, EURO-DAC 92. European, pp. 421426 (1992)
65. Saviz, P., Wing, O.: PYRAMID-A hierarchical waveform Relaxation-Based circuit simulation
program. In: IEEE International Conference on Computer-Aided Design, 1988. ICCAD-88.
Digest of Technical Papers, pp. 442445 (1988)
66. Erdman, D.J., Rose, D.J.: A newton waveform relaxation algorithm for circuit simulation. In:
1989 IEEE International Conference on Computer-Aided Design, 1989. ICCAD-89. Digest
of Technical Papers, pp. 404407 (1989)
67. Saviz, P., Wing, O.: Circuit simulation by hierarchical waveform relaxation. IEEE Trans.
Comput.-Aided Des. Integr. Circuits Syst. 12(6), 845860 (1993)
68. Fang, W., Mokari, M.E., Smart, D.: Robust VLSI circuit simulation techniques based on over-
lapped waveform relaxation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 14(4),
510518 (1995)
69. Gristede, G.D., Ruehli, A.E., Zukowski, C.A.: Convergence properties of waveform relax-
ation circuit simulation methods. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl. 45(7),
726738 (1998)
70. Dong, W., Li, P., Ye, X.: WavePipe: parallel transient simulation of analog and digital circuits
on Multi-Core Shared-Memory machines. In: Design Automation Conference, 2008. DAC
2008. 45th ACM/IEEE, pp. 238243 (2008)
References 39
71. Ye, X., Dong, W., Li, P., Nassif, S.: MAPS: Multi-Algorithm parallel circuit simulation. In:
2008 IEEE/ACM International Conference on Computer-Aided Design, pp. 7378 (2008)
72. Ye, X., Li, P.: Parallel program performance modeling for runtime optimization of Multi-
Algorithm circuit simulation. In: 2010 47th ACM/IEEE Design Automation Conference
(DAC), pp. 561566 (2010)
73. Ye, X., Li, P.: On-the-fly runtime adaptation for efficient execution of parallel Multi-Algorithm
circuit simulation. In: 2010 IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), pp. 298304 (2010)
74. Ye, X., Dong, W., Li, P., Nassif, S.: Hierarchical multialgorithm parallel circuit simulation.
IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 30(1), 4558 (2011)
75. Ye, Z., Wu, B., Han, S., Li, Y.: Time-Domain segmentation based massively parallel simulation
for ADCs. In: Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE, pp. 16
(2013)
76. Chua, L.O., Lin, P.Y.: Computer-Aided analysis of electronic circuits: algorithms and com-
putational techniques, 1st edn. Prentice Hall Professional Technical Reference (1975)
77. Sli, E., Mayers, D.F.: An Introduction to Numerical Analysis, 2nd edn. Cambridge University
Press, England (2003)
78. Dahlquist, G.G.: A special stability problem for linear multistep methods. BIT Numer. Math.
3(1), 2743 (1963)
79. Nie, Q., Zhang, Y.T., Zhao, R.: Efficient Semi-Implicit schemes for stiff systems. J. Comput.
Phys. 214(2), 521537 (2006)
80. Hochbruck, M., Lubich, C.: On Krylov subspace approximations to the matrix exponential
operator. SIAM J. Numer. Anal. 34(5), 19111925 (1997)
81. Saad, Y.: Analysis of some Krylov subspace approximations to the matrix exponential oper-
ator. SIAM J. Numer. Anal. 29(1), 209228 (1992)
82. Zhuang, H., Wang, X., Chen, Q., Chen, P., Cheng, C.K.: From circuit theory, simulation
to SPICE_Diego: a matrix exponential approach for Time-Domain analysis of Large-Scale
circuits. IEEE Circuits Syst. Mag. 16(2), 1634 (2016)
83. Zhuang, H., Yu, W., Kang, I., Wang, X., Cheng, C.K.: An algorithmic framework
for efficient Large-Scale circuit simulation using exponential integrators. In: 2015 52nd
ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 16 (2015)
84. Weng, S.H., Chen, Q., Wong, N., Cheng, C.K.: Circuit simulation via matrix exponential
method for stiffness handling and parallel processing. In: 2012 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), pp. 407414 (2012)
85. Chen, Q., Zhao, W., Wong, N.: Efficient matrix exponential method based on extended Krylov
subspace for transient simulation of Large-Scale linear circuits. In: 2014 19th Asia and South
Pacific Design Automation Conference (ASP-DAC), pp. 262266 (2014)
86. Zhuang, H., Weng, S.H., Lin, J.H., Cheng, C.K.: MATEX: A distributed framework for tran-
sient simulation of power distribution networks. In: 2014 51st ACM/EDAC/IEEE Design
Automation Conference (DAC), pp. 16 (2014)
87. NVIDIA Corporation: NVIDIA CUDA C Programming Guide. http://docs.nvidia.com/cuda/
cuda-c-programming-guide/index.html
88. Khronos OpenCL Working Group: The OpenCL Specification v1.1 (2010)
89. Gulati, K., Croix, J.F., Khatri, S.P., Shastry, R.: Fast circuit simulation on graphics processing
units. In: 2009 Asia and South Pacific Design Automation Conference, pp. 403408 (2009)
90. Poore, R.E.: GPU-Accelerated Time-Domain circuit simulation. In: 2009 IEEE Custom Inte-
grated Circuits Conference, pp. 629632 (2009)
91. Bayoumi, A.M., Hanafy, Y.Y.: Massive parallelization of SPICE device model evaluation
on GPU-based SIMD architectures. In: Proceedings of the 1st International Forum on Next-
generation Multicore/Manycore Technologies, pp. 12:112:5 (2008)
92. NVIDIA Corporation: CUDA BLAS. http://docs.nvidia.com/cuda/cublas/
93. Christen, M., Schenk, O., Burkhart, H.: General-Purpose sparse matrix building blocks Using
the NVIDIA CUDA technology platform. In: First Workshop on General Purpose Processing
on Graphics Processing Units. Citeseer (2007)
40 2 Related Work
94. Krawezik, G.P., Poole, G.: Accelerating the ANSYS direct sparse solver with GPUs. In: 2009
Symposium on Application Accelerators in High Performance Computing (SAAHPC09)
(2009)
95. Yu, C.D., Wang, W., Pierce, D.: A CPU-GPU hybrid approach for the unsymmetric multi-
frontal method. Parallel Comput. 37(12), 759770 (2011)
96. George, T., Saxena, V., Gupta, A., Singh, A., Choudhury, A.: Multifrontal factorization of
sparse SPD matrices on GPUs. In: 2011 IEEE International Parallel Distributed Processing
Symposium (IPDPS), pp. 372383 (2011)
97. Lucas, R.F., Wagenbreth, G., Tran, J.J., Davis, D.M.: Multifrontal Sparse Matrix Factorization
on Graphics Processing Units. Technical report. Information Sciences Institute, University of
Southern California (2012)
98. Lucas, R.F., Wagenbreth, G., Davis, D.M., Grimes, R.: Multifrontal computations on GPUs
and their Multi-Core Hosts. In: Proceedings of the 9th International Conference on High
Performance Computing for Computational Science, pp. 7182 (2011)
99. Kim, K., Eijkhout, V.: Scheduling a parallel sparse direct solver to multiple GPUs. In: 2013
IEEE 27th International Parallel and Distributed Processing Symposium Workshops Ph.D.
Forum (IPDPSW), pp. 14011408 (2013)
100. Hogg, J.D., Ovtchinnikov, E., Scott, J.A.: A sparse symmetric indefinite direct solver for GPU
architectures. ACM Trans. Math. Softw. 42(1), 1:11:25 (2016)
101. Sao, P., Vuduc, R., Li, X.S.: A distributed CPU-GPU sparse direct solver. In: Euro-Par 2014
Parallel Processing: 20th International Conference, pp. 487498 (2014)
102. Ren, L., Chen, X., Wang, Y., Zhang, C., Yang, H.: Sparse LU factorization for parallel circuit
simulation on GPU. In: Proceedings of the 49th Annual Design Automation Conference. DAC
12, pp. 11251130. ACM, New York, NY, USA (2012)
103. Chen, X., Ren, L., Wang, Y., Yang, H.: GPU-Accelerated sparse LU factorization for circuit
simulation with performance modeling. IEEE Trans. Parallel Distrib. Syst. 26(3), 786795
(2015)
104. He, K., Tan, S.X.D., Wang, H., Shi, G.: GPU-Accelerated parallel sparse LU factorization
method for fast circuit analysis. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 24(3),
11401150 (2016)
105. Kapre, N., DeHon, A.: Accelerating SPICE Model-Evaluation using FPGAs. In: 17th IEEE
Symposium on Field Programmable Custom Computing Machines, 2009. FCCM 09, pp.
3744 (2009)
106. Kapre, N.: Exploiting input parameter uncertainty for reducing datapath precision of SPICE
device models. In: 2013 IEEE 21st Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), pp. 189197 (2013)
107. Martorell, H., Kapre, N.: FX-SCORE: a framework for fixed-point compilation of SPICE
device models using Gappa++. In: Field-Programmable Custom Computing Machines
(FCCM), pp. 7784 (2012)
108. Kapre, N., DeHon, A.: Performance comparison of Single-Precision SPICE Model-Evaluation
on FPGA, GPU, Cell, and Multi-Core processors. In: 2009 International Conference on Field
Programmable Logic and Applications, pp. 6572 (2009)
109. Wu, W., Shan, Y., Chen, X., Wang, Y., Yang, H.: FPGA accelerated parallel sparse matrix
factorization for circuit simulations. In: Reconfigurable Computing: Architectures, Tools and
Applications: 7th International Symposium, ARC 2011, pp. 302315 (2011)
110. Kapre, N., DeHon, A.: Parallelizing sparse matrix solve for SPICE circuit simulation using
FPGAs. In: International Conference on Field-Programmable Technology, 2009. FPT 2009,
pp. 190198 (2009)
111. Wang, X., Jones, P.H., Zambreno, J.: A configurable architecture for sparse LU decomposition
on matrices with arbitrary patterns. SIGARCH Comput. Archit. News 43(4), 7681 (2016)
112. Wu, G., Xie, X., Dou, Y., Sun, J., Wu, D., Li, Y.: Parallelizing sparse LU decomposition
on FPGAs. In: 2012 International Conference on Field-Programmable Technology (FPT),
pp. 352359 (2012)
References 41
113. Johnson, J., Chagnon, T., Vachranukunkiet, P., Nagvajara, P., Nwankpa, C.: Sparse LU decom-
position using FPGA. In: International Workshop on State-of-the-Art in Scientific and Parallel
Computing (PARA) (2008)
114. Siddhartha, Kapre, N.: Heterogeneous dataflow architectures for FPGA-based sparse LU fac-
torization. In: 2014 24th International Conference on Field Programmable Logic and Appli-
cations (FPL), pp. 14 (2014)
115. Siddhartha, Kapre, N.: Breaking sequential dependencies in FPGA-Based sparse LU fac-
torization. In: 2014 IEEE 22nd Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), pp. 6063 (2014)
116. Kapre, N., DeHon, A.: VLIW-SCORE: beyond C for sequential control of SPICE FPGA
acceleration. In: 2011 International Conference on Field-Programmable Technology (FPT),
pp. 19 (2011)
117. Kapre, N., DeHon, A.: SPICE2: spatial processors interconnected for concurrent execution
for accelerating the SPICE circuit simulator using an FPGA. IEEE Trans. Comput.-Aided
Des. Integr. Circuits Syst. 31(1), 922 (2012)
118. Kapre, N.: SPICE2A spatial parallel architecture for accelerating the SPICE circuit simu-
lator. Ph.D. thesis, California Institute of Technology (2010)
Chapter 3
Overall Solver Flow
In this chapter, we will present the basic flow of our proposed solver NICSLU, as a
necessary background of the parallelization techniques. We will also introduce the
usage of NICSLU in SPICE-like circuit simulators. Basically, a sparse direct solver
uses the following three steps to solve sparse linear systems:
• Pre-analysis or pre-processing. This step performs row and column reordering to
minimize fill-ins which will be generated in numerical LU factorization. NICSLU
also performs a symbolic factorization to predict the sparsity of the matrix and
pre-allocate memory for numerical factorization.
• Numerical LU factorization. This step factorizes the matrix obtained from the
first step into LU factors. This is the most complicated and time-consuming step
in a sparse direct solver. NICSLU has two different factorization methods: full
factorization with partial pivoting and re-factorization without partial pivoting. In
circuit simulation, NICSLU can smartly decide which method to call according to
the numerical features of the matrix.
• Right-hand-solving. This step solves the linear system by forward/backward sub-
stitutions. NICSLU also has an iterative refinement step which can be invoked to
refine the solution when necessary. NICSLU can also smartly decide whether to
call iterative refinement according to the numerical features of the matrix.
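The three steps can be illustrated with a deliberately tiny dense sketch in Python (the helper functions and the example matrix are hypothetical; a real sparse solver such as NICSLU works on compressed sparse data structures and a far more elaborate flow):

```python
def lu_factorize(A):
    """LU factorization with partial pivoting (dense sketch, illustration only).

    Returns (LU, perm): LU stores L (unit lower triangular, diagonal implicit)
    and U in one matrix; perm records the row permutation.
    """
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: bring the largest |entry| in column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if p != k:
            LU[k], LU[p] = LU[p], LU[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]            # multiplier, stored in place of L
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, perm

def solve(LU, perm, b):
    """Right-hand-solving: forward substitution (Ly = Pb), then backward (Ux = y)."""
    n = len(LU)
    y = [b[p] for p in perm]
    for i in range(n):                      # forward substitution
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):            # backward substitution
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x

A = [[2.0, 1.0, 0.0], [4.0, 3.0, 1.0], [0.0, 1.0, 5.0]]
LU, perm = lu_factorize(A)
x = solve(LU, perm, [3.0, 8.0, 6.0])       # expect x = [1, 1, 1]
```

Note how factorization and right-hand-solving are separated: in a SPICE iteration loop, the factors can be recomputed (or merely re-factorized) while the cheap substitution step runs once per right-hand side.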
As the main contents of this book are focused on the numerical LU factorization part,
in this chapter, we will also present the sequential LU factorization algorithm adopted
by NICSLU, which will be the foundation of the proposed parallel LU factorization
algorithms. Although our descriptions are for NICSLU, most of the algorithms and
techniques are actually general and not restricted to NICSLU.
3.1 Overall Flow

Figure 3.1 shows the overall flow of NICSLU. The above-mentioned three steps are
clearly marked in this figure. The pre-analysis step is performed only once but the
numerical LU factorization and right-hand-solving steps are both executed many
times in the Newton-Raphson iterations in a SPICE-like circuit simulation flow.
During the SPICE iterations, the symbolic pattern of the matrix stays the same but
the values change. This is an important feature of the sparse matrix in SPICE-like
circuit simulators, which avoids multiple executions of the pre-analysis step.
The pre-analysis step of NICSLU includes three steps: a static pivoting or zero-free
permutation, the approximate minimum degree (AMD) algorithm, and a symbolic
factorization. Once the symbolic factorization is finished, we calculate a sparsity
ratio (SPR) which is an estimation of the sparsity of the matrix. The SPR will be
used to select the factorization algorithm, such that the performance of NICSLU is
always high for different matrix sparsity.
As mentioned above, NICSLU offers two numerical factorization methods: full
factorization and re-factorization. The factorization method is selected according to
the concept of pseudo condition number (PCN), which is calculated at the end of the
numerical factorization step. For both methods, NICSLU provides three different
factorization algorithms: map algorithm, column algorithm, and supernodal algo-
rithm. The factorization algorithm is selected according to the SPR value to achieve
high performance for various sparsity. For full factorization, there is a minimum
suitable sparsity such that parallel factorization can actually achieve speedup over
sequential factorization. If the sparsity of a matrix is smaller than this suitable
sparsity, parallel factorization may be even slower than sequential factorization, and,
thus, we should choose sequential factorization in this case. The SPR is used to control
whether full factorization should be executed in parallel or sequentially.
The right-hand-solving step includes two steps: forward/backward substitutions
and iterative refinement. Forward/backward substitutions obtain the solution by solv-
ing two triangular equations and the iterative refinement refines the solution to make
it more accurate. Substitutions involve much fewer numerical computations than
numerical factorization, so they are always executed sequentially in NICSLU. If the
iterative refinement step is selected to execute, NICSLU automatically controls when
the refinement should stop according to the PCN value.
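The refinement loop can be sketched as follows; the matrix, the tolerance, and the 1%-error stand-in solver are all hypothetical, chosen only to show how reusing an inexact factorization drives the residual down:

```python
def approx_solve(r):
    """Stand-in triangular solve for A = [[4, 1], [1, 3]], with a deliberate 1%
    error injected to mimic finite-precision forward/backward substitution.
    (Hypothetical example; in NICSLU the existing LU factors would be reused.)"""
    det = 4.0 * 3.0 - 1.0 * 1.0
    x = [(3.0 * r[0] - 1.0 * r[1]) / det,
         (4.0 * r[1] - 1.0 * r[0]) / det]
    return [v * 1.01 for v in x]            # deliberate 1% error

def refine(A, b, x, solve_fn, tol=1e-12, max_iters=5):
    """Iterative refinement sketch: while the residual is large, solve A*dx = r
    with the (inexact) existing factors and update x. The tolerance and iteration
    cap are illustrative; NICSLU stops automatically based on the PCN."""
    n = len(b)
    for _ in range(max_iters):
        # r = b - A*x (ideally accumulated in higher precision)
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) < tol:
            break
        dx = solve_fn(r)                    # reuses the factorization: much cheaper
        x = [x[i] + dx[i] for i in range(n)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [6.0, 7.0]                              # exact solution is [1, 2]
x = refine(A, b, approx_solve(b), approx_solve)
```

Each pass shrinks the error by roughly the relative error of the solve, which is why a handful of refinement steps usually suffices.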
All algorithms and parallelization techniques of NICSLU will be described in
three chapters. In this chapter, we will introduce the pre-analysis step, the sequential
column algorithm, and the right-hand-solving step, which render a general flow of
the solver. In the next chapter, we will introduce the parallelization techniques for
the column algorithm. In Chap. 5, we will introduce the map algorithm and the
supernodal algorithm, as well as their parallelization techniques.
Fig. 3.1 The overall flow of NICSLU: the pre-analysis step (static pivoting/zero-free
permutation, approximate minimum degree); then, within the Newton-Raphson iterations,
algorithm selection among the map, column, and supernodal full factorization or
re-factorization variants (the map is created if not created yet), followed by
forward/backward substitutions and iterative refinement (automatic control)
3.2 Pre-analysis
In this section, we will introduce the pre-analysis step of NICSLU. Since the pre-
analysis algorithms adopted by NICSLU are all existing algorithms, we only briefly
explain their fundamental theories without presenting detailed algorithm flows.
Interested readers may refer to the corresponding references cited in the following
contents.
This is the first step of pre-analysis. The primary purpose of this step is to obtain a
zero-free diagonal. NICSLU offers two options to perform the zero-free permutation.
The first option is to permute the matrix only based on the symbolic pattern regardless
of the numerical values. The other option is to permute the matrix such that the product
of the diagonal absolute values is maximized. Obtaining a zero-free diagonal or
putting large elements on the diagonal helps reduce off-diagonal pivots during the
numerical LU factorization phase. If the latter option is selected, we also call it static
pivoting. We adopt the MC64 algorithm [1, 2] from the Harwell subroutine library
(HSL) [3] to implement static pivoting. If one only wants to obtain a symbolically
zero-free diagonal, a zero-free permutation algorithm numbered MC21 [4, 5] in HSL
is invoked. We will briefly introduce the two algorithms in the following contents.
If the MC64 algorithm is not selected or it fails, NICSLU performs the MC21 algo-
rithm to obtain a zero-free diagonal. The MC21 algorithm tries to find a maximum
matching for all the rows and all the columns, such that each column is matched to
exactly one row and each row can only be matched to one column. If a complete matching
cannot be found, i.e., there are rows and columns that cannot be matched, it means
that the matrix is structurally singular and NICSLU returns an error code to indicate
such an error.
The MC21 algorithm is based on depth-first search (DFS). To perform DFS, a
bipartite graph with 2N vertexes is created from the matrix, in which N vertexes
correspond to rows and the other N vertexes correspond to columns. A vertex cor-
responding to row i is marked as R(i) and a vertex corresponding to column j is
marked as C(j). Any nonzero element A_{i,j} in the matrix corresponds to an undirected
edge (R(i), C(j)) in the bipartite graph. An array π = {π_1, π_2, . . . , π_N} is used to
record matched rows and columns. π_i = j means that row i is matched to column
j, and the nonzero element A_{i,j} is the matched element that will be exchanged to the
diagonal after the MC21 algorithm is finished. The MC21 algorithm starts from each
column vertex C( j). All the adjacent row vertexes of C( j) are visited. If there is a
3.2 Pre-analysis 47
row vertex R(i) that is not matched to any column vertex, then row i is matched to
column j, i.e., π_i = j. If all the adjacent row vertexes of C(j) have been matched,
then a DFS procedure is performed based on matched rows and columns to find a
path until an unmatched row vertex is reached. All the row and column vertexes on
the path are marked as matched one-to-one. Figure 3.2 shows an example of such
a procedure. Assume that the first 4 columns have already been matched and the
matched elements are marked in red in Fig. 3.2a. Now we are trying to visit column
5 which has two nonzero elements at rows 3 and 6. Unfortunately, rows 3 and 6 both
have already been matched. Therefore, we start DFS from the columns which are
matched to the rows of the nonzero elements in column 5. First, column 3 is revisited
and we find an unmatched row 5, so column 3 is now matched to row 5. Then, column
5 can be matched to row 3. The same procedure will be continued until all the rows
and columns are matched one-to-one. Figure 3.2b shows the final matching results of
this example by red edges. Finally, for j = 1, 2, . . . , N, column π_j is exchanged to
column j, and then all the diagonal elements are symbolic nonzeros. Mathematically,
the MC21 algorithm is equivalent to finding a column permutation matrix Q, such that
AQ has a zero-free diagonal.
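The augmenting-path idea of MC21 can be sketched as follows; this simplified helper (hypothetical, not the HSL implementation) returns, for each column, the row whose nonzero is exchanged to the diagonal:

```python
def zero_free_permutation(pattern, n):
    """MC21-style maximum matching via DFS augmenting paths (sketch).

    pattern[j] lists the rows holding a nonzero in column j. Returns match,
    where match[j] = i means column j is matched to row i (equivalently,
    π_i = j in the text's notation), or None if the matrix is structurally
    singular (no complete matching exists).
    """
    row_of = [-1] * n          # row_of[j]: row matched to column j
    col_of = [-1] * n          # col_of[i]: column matched to row i

    def augment(j, visited):
        for i in pattern[j]:
            if i in visited:
                continue
            visited.add(i)
            # Row i is free, or the column currently holding it can be re-matched
            # along an alternating path (the DFS described in the text).
            if col_of[i] == -1 or augment(col_of[i], visited):
                row_of[j], col_of[i] = i, j
                return True
        return False

    for j in range(n):
        if not augment(j, set()):
            return None        # structurally singular
    return row_of

match = zero_free_permutation([[0, 1], [0], [1, 2], [2, 3]], 4)
# match[j] is the row exchanged onto the diagonal of column j
```

In this example, matching column 1 forces column 0 to be re-matched along an alternating path, exactly the rematching behavior illustrated with Fig. 3.2.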
Static pivoting is an alternative and better method for zero-free permutation. The MC64
algorithm has two steps. First, it finds a column permutation such that the product of
all the diagonal absolute values is maximized. The second step is to scale the matrix
such that each diagonal element is ±1 and each off-diagonal element is bounded by
1 in absolute value.
The MC64 algorithm first tries to find a permutation π = {π_1, π_2, . . . , π_N} to
maximize the product of all the diagonal absolute values, i.e.,

∏_{j=1}^{N} |A_{j,π_j}| .    (3.1)
π_j records that row j is matched to column π_j. After the permutation is found, column
π_j is exchanged to column j, such that all the diagonal elements are nonzeros and
the product of the diagonal absolute values is maximized. Mathematically, this is
equivalent to finding a column permutation matrix Q, such that the product of the
diagonal absolute values of AQ is maximized. The algorithm to find the permutation is
based on Dijkstra's shortest-path algorithm. The basic idea is quite similar to zero-
free permutation. When performing a DFS, the length of the path, which equals
an inverse form of the product of the absolute values of the elements on the path, is
recorded. The shortest path is found from all possible paths, which corresponds to
the permutation that maximizes the product of the diagonal absolute values. Once
the permutation is found, two diagonal scaling matrices Dr and Dc are generated
to scale the matrix, such that each diagonal element of D_r A Q D_c is ±1 and all the
off-diagonal elements are in the range of [−1, +1]. Details of the MC64 algorithm
can be found in [2].
By default, NICSLU runs the MC64 algorithm first. If static pivoting cannot find
a shortest path that makes all the rows and columns matched one-to-one, this means
that the matrix is numerically singular. In this case, NICSLU will abandon static
pivoting and run zero-free permutation instead. NICSLU also provides an option to
specify whether scaling the matrix is required. If not, NICSLU only maximizes the
product of all the diagonal absolute values without scaling the matrix.
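For tiny matrices, the objective of Eq. (3.1) can be checked by brute force; this exhaustive sketch (a hypothetical helper; the real MC64 uses a Dijkstra-like shortest-path search, not enumeration) tries all permutations:

```python
from itertools import permutations

def max_product_matching(A):
    """Brute-force check of the MC64 objective (Eq. 3.1): find the permutation
    pi maximizing prod_j |A[j][pi[j]]|. Exponential cost, so only usable on
    toy examples; shown purely to make the objective concrete."""
    n = len(A)
    best, best_pi = -1.0, None
    for pi in permutations(range(n)):
        p = 1.0
        for j in range(n):
            p *= abs(A[j][pi[j]])   # |A_{j, pi_j}|
        if p > best:
            best, best_pi = p, pi
    return best_pi, best

A = [[0.1, 5.0, 0.0],
     [2.0, 0.0, 0.3],
     [0.0, 4.0, 1.0]]
pi, prod = max_product_matching(A)  # row j is matched to column pi[j]
```

Here the winning permutation puts 5.0, 2.0, and 1.0 on the diagonal (product 10), preferring the large off-diagonal entries exactly as static pivoting intends.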
The purpose of matrix ordering is to find an optimal permutation to reorder the matrix
such that fill-ins are minimized during sparse LU factorization. This is a special step
in sparse matrix factorizations. Figure 3.3 explains why matrix ordering is important
in sparse LU factorization. For sparse matrix factorization, different orderings can
generate significantly different fill-ins. If the matrix is ordered like the case shown
in Fig. 3.3a, then after LU factorization, both L and U are fully filled, leading to
a high fill-in ratio. On the contrary, if the matrix is ordered like the case shown in
Fig. 3.3b, no fill-ins are generated after LU factorization. For this simple example, it
is obvious that the ordering shown in Fig. 3.3b is a good one. As the computational
cost of sparse LU factorization is almost proportional to the number of FLOPs, which
in turn, depends on the number of fill-ins, generating too many fill-ins will greatly
degrade the performance of sparse direct solvers. Consequently, matrix ordering
is a necessary step for every sparse direct solver. Finding the optimal ordering to
minimize the fill-ins is actually an NP-complete (nondeterministic polynomial-time
complete) problem [6], and, hence, people use heuristic algorithms to find suboptimal solutions
to this problem.
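The effect of ordering on fill-ins can be demonstrated on a classic arrow-shaped pattern in the spirit of Fig. 3.3; the following sketch (a hypothetical helper, not part of NICSLU) simulates symmetric elimination and counts the fill-ins each ordering produces:

```python
def count_fill(adj, order):
    """Count fill-ins produced by eliminating a symmetric pattern in the given
    order (elimination-graph simulation; sketch for illustration only).
    adj maps each vertex to the set of its neighbours."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}
    fill = 0
    for v in order:
        nbrs = [u for u in g[v] if u in g]      # not-yet-eliminated neighbours
        for a in nbrs:                          # connect neighbours pairwise
            for b in nbrs:
                if a < b and b not in g[a]:
                    g[a].add(b)
                    g[b].add(a)
                    fill += 2                   # two symmetric matrix entries
        del g[v]
    return fill

# 5x5 "arrow" pattern: vertex 0 is connected to everyone
# (a dense first row/column, as in the bad ordering of Fig. 3.3a).
arrow = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
bad = count_fill(arrow, [0, 1, 2, 3, 4])    # eliminate the dense vertex first
good = count_fill(arrow, [4, 3, 2, 1, 0])   # eliminate it last
```

Eliminating the dense vertex first turns the remainder into a full clique (12 fill-ins here), while eliminating it last generates none, which is precisely the contrast Fig. 3.3 illustrates.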
NICSLU adopts the AMD algorithm [7, 8], which is a very popular ordering algo-
rithm, to perform matrix ordering for fill-in reduction. The heuristic in AMD is that
the matrix ordering is done step by step, and in each step, a greedy strategy is used to
select the pivot to eliminate, such that fill-ins are minimized only at the current step,
Fig. 3.3 Different orderings generate different fill-ins. a A bad ordering leads to full fill-ins.
b A good ordering does not generate any fill-in
without considering its impact on the subsequent elimination steps. AMD can only
be applied to symmetric matrices, so a matrix after the zero-free permutation/static
pivoting step, say A, is first symmetrized by calculating A' = A + A^T. Mathemati-
cally, AMD finds a permutation matrix P and then applies symmetric row and column
permutations to the symmetric matrix, i.e., PA'P^T, such that factorizing PA'P^T gen-
erates much fewer fill-ins than directly factorizing A'. As A' is constructed from A,
factorizing PA'P^T also tends to generate fewer fill-ins than factorizing A.
Figure 3.4 illustrates the basic theory of AMD based on the elimination graph (EG)
model. The EG is defined as an undirected graph, with N vertexes numbered from 1
to N corresponding to the rows and columns of the matrix. Except for the diagonal,
any nonzero element in A', say A'_{i,j}, corresponds to an undirected edge (i, j) in the
EG. According to the Gaussian elimination procedure, eliminating a vertex from the
EG will generate a clique (a clique is a subgraph whose vertexes are pairwise
connected) composed of the vertexes which are adjacent to the eliminated
vertex. For the example illustrated in Fig. 3.4, if vertex 1 is eliminated, vertexes
{2, 3, 4} form a new clique so they are connected pairwise. The newly generated
edges, i.e., (2, 4) and (3, 4), correspond to the four fill-ins in the matrix, i.e., A'_{2,4},
A'_{3,4}, A'_{4,2}, and A'_{4,3}, which are denoted by red squares in Fig. 3.4. According to this
observation, in order to minimize fill-ins, one should always select the vertex that
generates the fewest fill-ins at each step. However, calculating the exact number of
fill-ins is an expensive task, so AMD uses the approximate vertex degree instead of
the number of fill-ins when selecting pivots to eliminate. Such an approximation
leads to a very fast speed without affecting the ordering quality for most practical
matrices [7].
Fig. 3.4 An example of the elimination graph model: eliminating vertex 1 connects its
neighbours {2, 3, 4} into a clique and generates fill-ins
As can be seen, additional edges are generated in the EG during the elimination
process. This leads to two challenges in the implementation of the AMD algorithm.
First, it is difficult to predict the required memory for the EG before the algorithm
starts, so we need to dynamically reallocate the memory. Second, after a vertex is
eliminated from the EG, additional edges are required to be inserted into the EG.
This leads to a severe problem: the memory spaces that store the edges need to
be moved frequently. To overcome these two problems, a realistic implementation of
AMD actually adopts the concept of quotient graph [9], which can be operated in-
place and is much faster than the EG model. We omit the detailed implementation
of AMD in this book. Readers can refer to [7].
Symbolic prediction calculates the symbolic pattern of a column, and pruning is used
to reduce the computational cost for subsequent columns. Without any numerical
computations, the symbolic factorization is typically much faster than numerical LU
factorization.
Once the symbolic factorization is finished, we calculate the number of FLOPs
by using Algorithm 5, and then estimate the sparsity of the matrix by calculating the
SPR defined as
SPR = FLOPs / NNZ(L + U − I)    (3.2)
where NNZ means the number of nonzeros. The SPR estimates the average number of
FLOPs per nonzero in the LU factors, which is a good estimator of the sparsity of the
matrix. Davis has pointed out that circuit matrices typically have a very small SPR [11].
As mentioned above, in our symbolic factorization, the SPR may underestimate
the actual sparsity if some off-diagonal elements are selected as pivots during LU
factorization. Fortunately, in most cases, there are not too many off-diagonal pivots,
so the underestimated sparsity can be very close to the actual sparsity.
The estimated SPR is used to select the LU factorization algorithm (map algorithm,
column algorithm, and supernodal algorithm), as illustrated in Fig. 3.1. Basically, if
the matrix is too sparse, the map algorithm runs faster than the column algorithm.
When the matrix is slightly dense, the supernodal algorithm runs faster than the column
algorithm. Consequently, the optimal factorization algorithm should be selected
according to the matrix sparsity. We will further explain this point in Chaps. 5 and 6. In
addition, the SPR is also used to control whether full factorization will be executed in
parallel or sequentially. The basic observation behind such a strategy is that, for highly
sparse matrices, due to the extremely low computational cost, the overhead caused by
parallelism (scheduling overhead, synchronization overhead, workload imbalance,
memory and cache conflicts, etc.) can be a non-negligible part of the total runtime.
What we have found from experiments is that for extremely sparse matrices, parallel
full factorization cannot be faster than sequential full factorization. Consequently,
we use the SPR to automatically control the sequential or parallel execution of full
factorization. According to our results, the threshold is selected to be 50. Namely,
NICSLU runs parallel full factorization when the SPR is larger than 50; otherwise
sequential full factorization is selected. We will explain the selection of the threshold
by experimental results in Chap. 6.
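The SPR-based dispatch described above can be sketched as follows. The parallel/sequential threshold of 50 comes from the text; the map/column/supernodal cut-offs (2.0 and 200.0 here) are hypothetical placeholders, not NICSLU's actual values.

```python
# Sketch of SPR-based algorithm selection. Only the parallel threshold (50)
# is stated in the text; the algorithm cut-offs below are assumptions.

def spr(flops, nnz_l_plus_u_minus_i):
    """Eq. (3.2): average number of FLOPs per nonzero in the LU factors."""
    return flops / nnz_l_plus_u_minus_i

def choose_strategy(spr_value):
    if spr_value < 2.0:            # extremely sparse: map algorithm
        algorithm = "map"
    elif spr_value < 200.0:        # moderately sparse: column algorithm
        algorithm = "column"
    else:                          # slightly dense: supernodal algorithm
        algorithm = "supernodal"
    parallel = spr_value > 50      # threshold stated in the text
    return algorithm, parallel
```

For example, a matrix with 1000 FLOPs and 100 nonzeros in L + U − I has SPR = 10, so the column algorithm would be run sequentially.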
of the G-P algorithm, which are also the primary innovations of NICSLU, will be
presented in the next two chapters.
The modified G-P algorithm factorizes an N × N square matrix by sequentially
processing each column in four main steps: (1) symbolic prediction; (2) numerical
update; (3) partial pivoting; and (4) pruning, as shown in Algorithm 6. The algorithm
flow clearly explains why this algorithm is also called left-looking: when doing the
symbolic prediction and numerical update for a given column, dependent columns
on its left side will be visited. We will present brief descriptions and algorithms of
the four steps in the following four subsections.
As mentioned above, NICSLU offers two numerical LU factorization methods:
full factorization and re-factorization. The main difference between them is that re-
factorization does not invoke partial pivoting. In this section, we introduce the full
factorization algorithm, and in the next section, we will introduce the re-factorization
algorithm.
Symbolic prediction is the first step of factorizing a column. It calculates the symbolic
pattern of a given column, which also indicates the dependent columns that will be
visited in the numerical update step. Like zero-free permutation, symbolic prediction
is also done by DFS. In order to perform DFS, we also need to construct a DAG.
In symbolic prediction, the DAG is constructed from the symbolic pattern of L with
finished columns. The DAG has N vertexes corresponding to all the columns. Except
for the diagonal elements in L, any nonzero element in L, say L i, j , corresponds to
a directed edge (i, j) in the DAG. For a given column, say column k, the DFS
procedure starts from the nonzero elements in A(:, k) until all reachable vertexes are
visited. For each nonzero element in A(:, k), we can get a vertex sequence by DFS.
All the vertexes in all the sequences are topologically sorted, and, finally, we get
the symbolic pattern of column k. The resulting symbolic pattern contains nonzero
elements of the given column of both L and U.
Figure 3.5 illustrates an example of the DFS procedure. Suppose that we are
doing symbolic prediction for column 10. There are two nonzero elements in
A(:, 10): A1,10 and A2,10 . Starting from A1,10 , we get a DFS sequence {1, 3, 5, 8, 10}.
Starting from A2,10 , we get another DFS sequence {2, 4, 9, 12, 7, 10, 11}. The
two sequences are merged and topologically sorted, so we get the final sequence
{1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12}, which indicates the symbolic pattern of column 10.
Note that the DAG is updated once the symbolic prediction of a column is finished.
The updated DAG will be used for symbolic predictions of subsequent columns.
The above descriptions are more of a theory. In a practical implementation of the
symbolic prediction, the DAG does not need to be explicitly constructed. The storage
of L is directly used in symbolic prediction. In addition, topological sorting is not an
actual step, either. The topological order is automatically guaranteed by an elaborate
update order to the resulting sequence during the DFS procedure.
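The "elaborate update order" mentioned above is reverse postorder: emitting each vertex after all of its DFS descendants automatically yields a topological order, so no separate sort is needed. The sketch below illustrates this on a small hypothetical L pattern (not the Fig. 3.5 example).

```python
# Sketch of symbolic prediction by DFS over the pattern of the finished part
# of L: an edge j -> i exists when L(i, j) is nonzero (i > j). Emitting
# vertices in reverse postorder directly produces a topological order.
# The pattern below is a small hypothetical example.

def symbolic_pattern(l_pattern, start_rows):
    post = []                      # vertices in DFS postorder
    visited = set()

    def dfs(v):
        visited.add(v)
        for w in l_pattern.get(v, ()):
            if w not in visited:
                dfs(w)
        post.append(v)             # emit v only after all descendants

    for r in start_rows:           # nonzero rows of A(:, k)
        if r not in visited:
            dfs(r)
    return post[::-1]              # reverse postorder = topological order

# Hypothetical pattern: column 1 has nonzeros in rows 3 and 4, etc.
order = symbolic_pattern({1: [3, 4], 2: [4], 3: [4]}, [1, 2])
```

Every vertex in the returned sequence precedes all vertices it reaches, which is exactly the property the numerical update step relies on.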
[Figure: the sparse pattern of a 12 × 12 matrix and the DAG used for the DFS of column 10; the nonzeros in A(:, 10) and the fill-ins of column 10 are highlighted]
Fig. 3.5 Illustration of the DFS for symbolic prediction [11]. The example shows the symbolic
prediction of column 10
The purpose of numerical update is to calculate the numerical values for a given
column based on the symbolic pattern obtained in the symbolic prediction. Algo-
rithm 7 shows the algorithm flow of the numerical update for a given column. This
is typically the most time-consuming step in numerical LU factorization.
When updating a given column, say column k, numerical update uses dependent
columns on the left side to update column k. The dependence is determined by
the symbolic pattern of U(1 : k − 1, k). Namely, column k depends on column j
( j < k), if and only if U jk is a nonzero element. The numerical update is actually
a set of multiplication-and-add (MAD) operations. Figure 3.6 illustrates the MAD
operation in a clearer way. In this example, we are doing numerical update for column
k and U(1 : k − 1, k) has two nonzero elements. The numerical update for column
k involves three MAD operations, as marked by different colors in Fig. 3.6.
As can be seen from Algorithm 7, numerical update requires an uncompressed
array x of length N . This array serves as a temporary working space and stores all the
3.3 Numerical Full Factorization 55
intermediate results during numerical update, as well as the final results of numerical
update. The necessity of this array is explained as follows. The symbolic patterns
of column k and its dependent columns are different, so for compressed storage
formats, it is expensive to simultaneously access two nonzero elements at the same
row in the two columns with different symbolic patterns. For example, assume that
we are using column j to update column k. We traverse the compressed array of
L(:, j), and for each nonzero element in L(:, j), say L i j , we need to find the address
of L ik or Uik to perform the numerical update. Since L and U are both stored in
compressed arrays, finding the address of L ik or Uik requires a traversal on L(:, k)
or U(:, k). By contrast, if we use an uncompressed array x instead, the desired
address is simply the ith position of the array x. To integrate the uncompressed
array into numerical update, we need an operation named scatter-gather. Namely,
the numerical values of the nonzero elements are first scattered into x, and after
numerical update is finished, the numerical values stored in x will be gathered into
the compressed arrays of L and U. Figure 3.7 illustrates such an operation. Assume
that we are performing numerical update on column k. First, all the nonzero elements
[Figure 3.7: (a) compressed storage of columns j and k; (b) column k is scattered into an uncompressed array; (c) numerical update is performed on the uncompressed array; (d) the nonzero elements are gathered back into the compressed storage of column k]
in column k are scattered into the uncompressed array x. Then, numerical update is
performed using all the dependent columns. Finally, the numerical results stored in
the uncompressed array x are gathered into the compressed storage of column k.
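The scatter/update/gather sequence can be sketched as below. The dict-based column storage is a simplification of a real CSC layout, and the tiny example at the end is hypothetical.

```python
# Sketch of the scatter-gather numerical update: compressed columns are held
# as {row: value} dicts; column k is scattered into the dense working array
# x, updated by each dependent column j (one MAD per nonzero of L(:, j)),
# and the result would finally be gathered back into compressed storage.

def update_column(n, a_col_k, l_cols, u_col_k_rows):
    x = [0.0] * n                          # uncompressed working array
    for row, val in a_col_k.items():       # scatter A(:, k) into x
        x[row] = val
    for j in u_col_k_rows:                 # dependent columns, topological order
        ujk = x[j]                         # U(j, k) is final once row j is done
        for i, lij in l_cols[j].items():   # MAD: x(i) -= L(i, j) * U(j, k)
            x[i] -= lij * ujk
    return x                               # the gather step would compress x

# Tiny hypothetical example: column k depends on column 0 only, L(2, 0) = 0.5.
x = update_column(3, {0: 4.0, 2: 1.0}, {0: {2: 0.5}}, [0])
```

Because the working array is addressed directly by row index, each MAD costs O(1) per nonzero instead of a search through a compressed column.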
According to Algorithm 8, the word partial that describes the pivoting method
means that the pivot of a column is selected from the corresponding column of
L, rather than from the full column or the full matrix. Note that full pivoting can also be
adopted to achieve a better numerical stability. However, full pivoting involves more
complicated row and column permutations. In most cases, partial pivoting can achieve
satisfactory numerical stability.
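The core of partial pivoting can be sketched in a few lines: among the candidate rows of the working array that belong to L, pick the entry of largest magnitude. This is a simplified sketch; threshold pivoting variants, which trade some stability for sparsity, are omitted.

```python
# Sketch of partial pivoting on the working array x: among the candidate
# rows of column k belonging to L, choose the entry with the largest
# magnitude. Threshold pivoting is omitted for brevity.

def partial_pivot(x, candidate_rows):
    """Return the chosen pivot row, or raise if the column is numerically zero."""
    pivot_row = max(candidate_rows, key=lambda i: abs(x[i]))
    if x[pivot_row] == 0.0:
        raise ZeroDivisionError("numerically singular column")
    return pivot_row

x = [0.0, -3.5, 1.25, 0.0]
row = partial_pivot(x, [1, 2, 3])   # |x[1]| = 3.5 is the largest
```

After the pivot row is chosen, the row permutation is recorded and the sub-diagonal entries are divided by the pivot to form L(:, k).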
3.3.4 Pruning
Pruning is the last step of factorizing a column. Pruning is actually not a necessary
step in the left-looking algorithm; the original G-P algorithm also works without it.
However, pruning can significantly reduce the computational cost of the symbolic prediction
[Figure: columns j, k, and m of the LU factors; crosses mark the pruned nonzero elements in column j]
Fig. 3.8 Illustration of pruning. (a) After column k is factorized, column j is pruned. (b) When we
are doing symbolic prediction for column m, the pruned nonzero elements in column j are skipped
during the DFS
In the previous section, we have presented the full LU factorization algorithm. NIC-
SLU also offers another numerical factorization method named re-factorization. The
main difference between them is the use of partial pivoting. Re-factorization does
not perform partial pivoting. This difference leads to many other differences between
the two factorization methods. With partial pivoting, row orders can be exchanged,
so symbolic prediction is required for every column. This also
means that symbolic prediction cannot be separated from numerical factorization,
because the symbolic pattern depends on the numerical pivot choices. However,
if partial pivoting is not adopted, the symbolic pattern does not change, so all the
symbol-related computations, i.e., symbolic prediction and pruning, can be skipped.
Consequently, in numerical LU re-factorization, we only need to perform numerical
update for each column. The premise is that the symbolic pattern of the LU factors
must be known prior to re-factorization, so re-factorization can only be called after
full LU factorization has been called at least once. Re-factorization uses the symbolic
pattern and pivoting order obtained in the last full factorization. Algorithm 10 shows
the algorithm flow of numerical LU re-factorization. The scatter-gather operation is
also required in the re-factorization algorithm, which means that the uncompressed
array x is also required.
Without partial pivoting, there may be small elements on the diagonal, so numerical
instability may occur. However, in SPICE-like circuit simulation,
there is an opportunity to invoke many more re-factorizations than full fac-
torizations without making the results unstable. The opportunity comes from the
3.4 Numerical Re-factorization 59
convergence check of the Newton-Raphson iterations, i.e.,

||x^(k) − x^(k−1)|| < AbsTol + RelTol · min(||x^(k)||, ||x^(k−1)||)    (3.3)

where the superscript is the iteration count, and AbsTol and RelTol are two given
absolute and relative tolerances for checking convergence. Since the Newton-
Raphson method has the feature of quadratic convergence, we can simply relax
the two tolerances to larger values to judge whether the Newton-Raphson iterations
are converging, i.e.,
||x^(k) − x^(k−1)|| < BigAbsTol + BigRelTol · min(||x^(k)||, ||x^(k−1)||)    (3.4)
where BigAbsTol >> AbsTol and BigRelTol >> RelTol. They can be determined
empirically. If Eq. (3.4) holds, it indicates that the Newton-Raphson iterations are
converging, so one can invoke re-factorization instead of full factorization; otherwise,
full factorization must be called.
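The trend test of Eq. (3.4) can be sketched as follows. The infinity norm and the tolerance values are illustrative assumptions; the text only requires that the relaxed tolerances be much larger than the convergence tolerances.

```python
# Sketch of the convergence-trend test of Eq. (3.4): if successive Newton-
# Raphson iterates are close under the relaxed tolerances, re-factorization
# is invoked instead of full factorization. Norm choice and default
# tolerance values are assumptions, not NICSLU's actual settings.

def use_refactorization(x_new, x_old, big_abs_tol=1e-3, big_rel_tol=1e-3):
    diff = max(abs(a - b) for a, b in zip(x_new, x_old))   # infinity norm
    norm_new = max(abs(a) for a in x_new)
    norm_old = max(abs(b) for b in x_old)
    return diff < big_abs_tol + big_rel_tol * min(norm_new, norm_old)
```

When the test fails, the iterates are still changing substantially, so the next system is solved with full factorization (including partial pivoting) to stay safe.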
Although the above method is quite effective in practice, the solver is not a black
box under such a usage, which makes the solver harder to use. The
second method is completely controlled by the solver itself, so the usage is black box.
Toward this goal, we calculate the PCN after each full factorization or re-factorization
by
60 3 Overall Solver Flow
PCN = max_k |Ukk| / min_k |Ukk| .    (3.5)
where is a given threshold whose default value is 5. If Eq. (3.6) holds, it means
that the matrix values change dramatically so full factorization should be called;
otherwise we can invoke re-factorization instead.
Please note that for both methods, the thresholds should be selected to be a little
conservative such that the numerical stability can always be guaranteed.
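The PCN of Eq. (3.5) and the resulting decision can be sketched as below. Since the exact form of Eq. (3.6) is not shown here, the ratio test used in this sketch is an assumption; only the threshold default of 5 comes from the text.

```python
# Sketch of the pivot condition number of Eq. (3.5), computed from the
# diagonal of U, and the full-vs-re-factorization decision. The ratio test
# standing in for the missing Eq. (3.6) is an assumption.

def pcn(u_diag):
    """Eq. (3.5): ratio of the largest to smallest diagonal magnitude of U."""
    mags = [abs(u) for u in u_diag]
    return max(mags) / min(mags)

def need_full_factorization(pcn_now, pcn_last, threshold=5.0):
    # Assumed form of Eq. (3.6): the PCN grew by more than the threshold.
    return pcn_now > threshold * pcn_last
```

A sharply growing PCN signals that the matrix values have changed dramatically since the last pivoting order was chosen, so full factorization with fresh pivoting is the conservative choice.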
3.5 Right-Hand-Solving
The purpose of iterative refinement is to refine the solution to get a more accurate
solution. NICSLU automatically determines whether iterative refinement is required
according to whether PCN is in a given range, i.e.,
where the default values of the two bounds are 10^12 and 10^40, respectively. If the condition
number is small, it means that the matrix is well-conditioned and the solution is
accurate enough, so refinement is not required. If the condition number is too large,
it indicates that the matrix is highly ill-conditioned. In this case, iterative refinement
usually does not have any effect. These two points explain why we use Eq. (3.7) to
determine whether iterative refinement is required.
The iterative refinement algorithm used in NICSLU is shown in Algorithm 11. It
is a modified version of the well-known Wilkinson's algorithm [14]. If one of the
following four conditions holds, the iterations stop.
• The number of iterations reaches the allowed number maxiter (line 7). maxiter
is given by users and its default value is 3 in NICSLU.
• The residual ||Ax − b||_2^2 satisfies the requirement eps (line 12). eps is given by
users and its default value is 1 × 10^(−20).
• The residual saturates (line 19). This means that the residual changes only slightly
compared with the residual in the previous iteration. Although the residual may still
be reduced by running more iterations, doing so is uneconomical, as the iterative
refinement causes additional computational cost while the improvement of the
solution is tiny.
• The residual reaches the minimum (line 22). This means that the residual becomes
larger after a certain number of iterations. If this happens, NICSLU restores the
solution corresponding to the minimal residual and then stops the iterative refine-
ment.
It is worth mentioning that the iterative refinement algorithm is not always suc-
cessful. It is possible that for some ill-conditioned matrices, although the solution
is inaccurate, the iterative refinement algorithm cannot improve the solution at all.
Since it is an iterative algorithm, it has convergence conditions. Deriving the con-
vergence conditions is beyond the scope of this book. A detailed derivation can be
found in [15].
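The refinement loop with the four stopping rules can be sketched as follows. The dense 2 × 2 solver stands in for the sparse LU solve, and the saturation ratio is an illustrative assumption; only maxiter = 3 and the residual tolerance shape follow the text.

```python
# Sketch of iterative refinement with the four stopping rules described
# above (iteration cap, small residual, saturation, residual minimum).
# A dense 2x2 solve stands in for the sparse LU solve.

def solve2(A, b):
    """Solve a 2x2 system by Cramer's rule (stand-in for the LU solve)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def refine(A, b, x, maxiter=3, eps=1e-20, sat_ratio=0.5):
    best_x, best_r, prev_r = x[:], float("inf"), float("inf")
    for _ in range(maxiter):                      # rule 1: iteration cap
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
        rnorm = r[0] ** 2 + r[1] ** 2             # ||Ax - b||_2^2
        if rnorm > best_r:                        # rule 4: residual grew,
            break                                 #   keep the best solution
        best_x, best_r = x[:], rnorm
        if rnorm <= eps:                          # rule 2: residual small enough
            break
        if rnorm > sat_ratio * prev_r:            # rule 3: residual saturated
            break
        prev_r = rnorm
        d = solve2(A, r)                          # correction: A d = r
        x = [x[i] + d[i] for i in range(2)]
    return best_x
```

One refinement sweep costs a residual evaluation plus one extra solve with the already-computed factors, which is cheap compared with a new factorization.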
References
1. Duff, I.S., Koster, J.: The design and use of algorithms for permuting large entries to the
diagonal of sparse matrices. SIAM J. Matrix Anal. Appl. 20(4), 889–901 (1999)
2. Duff, I.S., Koster, J.: On algorithms for permuting large entries to the diagonal of a sparse
matrix. SIAM J. Matrix Anal. Appl. 22(4), 973–996 (2000)
3. STFC Rutherford Appleton Laboratory: The HSL Mathematical Software Library. http://www.
hsl.rl.ac.uk/
4. Duff, I.S.: On algorithms for obtaining a maximum transversal. ACM Trans. Math. Softw. 7(3),
315–330 (1981)
5. Duff, I.S.: Algorithm 575: permutations for a zero-free diagonal. ACM Trans. Math. Softw.
7(3), 387–390 (1981)
6. Yannakakis, M.: Computing the minimum fill-in is NP-complete. SIAM J. Algebraic Discrete
Meth. 2(1), 77–79 (1981)
7. Amestoy, P.R., Davis, T.A., Duff, I.S.: An approximate minimum degree ordering algorithm.
SIAM J. Matrix Anal. Appl. 17(4), 886–905 (1996)
8. Amestoy, P.R., Davis, T.A., Duff, I.S.: Algorithm 837: AMD, an approximate minimum degree
ordering algorithm. ACM Trans. Math. Softw. 30(3), 381–388 (2004)
9. George, A., Liu, J.W.H.: A quotient graph model for symmetric factorization. In: Sparse Matrix
Proceedings, pp. 154–175 (1979)
10. George, A., Ng, E.: Symbolic factorization for sparse Gaussian elimination with partial pivoting.
SIAM J. Sci. Stat. Comput. 8(6), 877–898 (1987)
11. Davis, T.A., Palamadai Natarajan, E.: Algorithm 907: KLU, a direct sparse solver for circuit
simulation problems. ACM Trans. Math. Softw. 37(3), 36:1–36:17 (2010)
12. Gilbert, J.R., Peierls, T.: Sparse partial pivoting in time proportional to arithmetic operations.
SIAM J. Sci. Stat. Comput. 9(5), 862–874 (1988)
13. Eisenstat, S.C., Liu, J.W.H.: Exploiting structural symmetry in a sparse partial pivoting code.
SIAM J. Sci. Comput. 14(1), 253–257 (1993)
14. Martin, R.S., Peters, G., Wilkinson, J.H.: Iterative refinement of the solution of a positive
definite system of equations. Numerische Mathematik 8(3), 203–216 (1966)
15. Moler, C.B.: Iterative refinement in floating point. J. ACM 14(2), 316–321 (1967)
Chapter 4
Parallel Sparse Left-Looking Algorithm
In this chapter, we will propose parallelization methodologies for the G-P sparse
left-looking algorithm. Parallelizing sparse left-looking LU factorization faces three
major challenges: the high sparsity of circuit matrices, the irregular structure of the
symbolic pattern, and the strong data dependence during sparse LU factorization.
To overcome these challenges, we propose an innovative framework to realize par-
allel sparse LU factorization. The framework is based on a detailed task-level data
dependence analysis and composed of two different scheduling modes to fit different
data dependences: a cluster mode suitable for independent tasks and a pipeline mode
that explores parallelism between dependent tasks. Under the proposed scheduling
framework, we will implement several different parallel algorithms for parallel full
factorization and parallel re-factorization. In addition to the fundamental theories,
we will also present some critical implementation details in this chapter.
In this section, we will present parallelization methodologies for numerical full fac-
torization. Due to partial pivoting, the symbolic pattern of the LU factors depends
on detailed pivot choices, so the column-level dependence cannot be
determined before numerical factorization. In addition, the dependence dynamically
changes during numerical factorization. However, we need to know the detailed data
dependence before scheduling the parallel algorithm. This is the major challenge
when developing scheduling techniques for parallel numerical full factorization.
According to the theory of the G-P sparse left-looking algorithm, it is easy to derive
that column k depends on column j ( j < k), if and only if U jk is a nonzero element.
This conclusion describes the fundamental column-level dependence in the sparse
left-looking algorithm. Our parallel algorithms are based on the column-level paral-
lelism. In order to schedule the parallel factorization, a DAG that expresses all the
column-level dependence is required. However, the problem is that we cannot obtain
the exact dependence graph before numerical factorization because partial pivoting
can change the symbolic pattern of the LU factors. To solve this problem, we adopt
the concept of ET [1], which has already been mentioned in Sect. 2.1.1.1, to construct
an inexact dependence graph. The ET describes an upper bound of the column-level
dependence by considering all possible pivoting choices during a partial pivoting-
based factorization. In other words, regardless of the actual pivoting choices, the
column-level dependence is always contained in the dependence graph described
by the ET. Consequently, the ET greatly overestimates the actual column-level
dependence.
An ET is actually a DAG, with N vertexes corresponding to all the columns in
the matrix. A directed edge in the ET (i, j) means that column j potentially depends
on column i. In this case, vertex j is the parent of vertex i, and vertex i is a child of
vertex j. Since the column-level dependence described by the ET is an upper bound,
the edge (i, j) does not necessarily mean that column j must depend on column i.
It just means that there exists a pivoting order, and if the matrix is strong Hall and
pivoted following that order, column j depends on column i. The original ET theory
is derived only based on symmetric matrices; however, the ET can also be applied
to unsymmetric matrices. For unsymmetric matrices, the ET can be constructed
from A^T A [2, 3]. More specifically, if L_c denotes the Cholesky factor of A^T A (i.e.,
L_c L_c^T = A^T A), then the parent of vertex i is the row index j of the first nonzero
element below the diagonal of column i of L_c. The ET can be computed from A
in time almost linear to the number of nonzero elements in A by a variant of the
algorithm proposed in [1], without explicitly constructing A^T A.
Once the ET is obtained, tasks (i.e., columns) can be scheduled by the ET, as the ET
contains all the potential column-level dependence. Many practical parallel applications
adopt dynamic scheduling, as it usually achieves good load balance. We take
SuperLU_MT [2, 3] as an example to introduce the dynamic scheduling method.
Each column is assigned a flag which indicates the status of the column which can
be one of the following four values: unready, ready, busy, and done. A ready
4.1 Parallel Full Factorization 65
task means that all of its children are finished. A task pool is maintained to store ready
tasks. The task pool is global and can be accessed by all the working threads. Once a
thread finishes its last task, it tries to fetch a new task from the task pool. As the task
pool is shared by all the threads, any access to the task pool is a critical section [4]
and requires a mutex [5] to avoid conflicts. For example, without using a mutex, two
threads may fetch the same ready task if they access the task pool simultaneously.
Mutex operations involve system calls [6], so the overhead is quite large. A mutex
operation can typically cost thousands of CPU clock cycles. Once a new task is
fetched from the task pool, it is removed from the task pool, and then the thread
thread marks it as busy and executes it. After the task is finished, it is marked as done. The
thread then searches for the unready tasks which have now become ready and puts them into
the task pool. This is the so-called dynamic scheduling method, a standard
scheduling method used in many practical parallel applications. Algorithm 12
shows a typical flow of the dynamic scheduling method.
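The conventional dynamic scheduling flow can be sketched as below. The task names and the toy dependence DAG are hypothetical; the point is that every access to the shared pool sits inside a mutex-protected critical section.

```python
# Sketch of conventional dynamic scheduling (Algorithm 12): a global pool of
# ready tasks protected by a mutex, with per-task states unready/ready/busy/
# done. The toy DAG below is hypothetical.

import threading

def run_dynamic(children, parents, nthreads=2):
    """children[t]: tasks t depends on; parents[t]: tasks that depend on t."""
    state = {t: "unready" for t in children}
    remaining = {t: len(c) for t, c in children.items()}
    pool = [t for t, n in remaining.items() if n == 0]   # ready leaf tasks
    for t in pool:
        state[t] = "ready"
    lock = threading.Lock()                              # the mutex
    done_order = []

    def worker():
        while True:
            task = None
            with lock:                    # pool access is a critical section
                if pool:
                    task = pool.pop()
                    state[task] = "busy"
                elif all(s == "done" for s in state.values()):
                    return
            if task is None:
                continue                  # spin until a task becomes ready
            # ... factorize the column here ...
            with lock:
                state[task] = "done"
                done_order.append(task)
                for p in parents.get(task, ()):   # release new ready tasks
                    remaining[p] -= 1
                    if remaining[p] == 0:
                        state[p] = "ready"
                        pool.append(p)

    workers = [threading.Thread(target=worker) for _ in range(nthreads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return done_order

# Diamond DAG: tasks 1 and 2 are leaves, 3 depends on both, 4 depends on 3.
order = run_dynamic({1: [], 2: [], 3: [1, 2], 4: [3]},
                    {1: [3], 2: [3], 3: [4]})
```

The two lock acquisitions per task are exactly the per-task scheduling cost that becomes prohibitive when the tasks themselves are tiny, as for circuit matrices.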
However, such a dynamic scheduling method is not suitable for parallel LU fac-
torization for circuit matrices. The difficulty comes from the high sparsity of circuit
matrices. Sparse matrices from other applications are generally denser than circuit
matrices, so the computational cost of a task can be much larger than its scheduling
cost. In this case, dynamic scheduling can be adopted, since the scheduling cost of
each task can be ignored compared with the computational cost. However, for circuit
matrices, the computational cost of a task can be extremely small, so the schedul-
ing cost may be larger than the computational cost, leading to very low scheduling
efficiency.
To reduce the scheduling cost, we propose two different scheduling methods for
NICSLU: a static scheduling method and a pseudo-dynamic scheduling method. In
66 4 Parallel Sparse Left-Looking Algorithm
both scheduling methods, tasks are sorted in a topological order, such that sequen-
tially finishing these tasks does not violate any dependence constraint. Suppose there
are M tasks and they are denoted as T1 , T2 , . . . , TM in a topological order. Let P
be the number of available threads. Static scheduling says that tasks are assigned to
threads orderly, as shown in Fig. 4.1. In short, task T j is assigned to thread
j mod P,  if j mod P ≠ 0,
P,        if j mod P = 0.    (4.1)
Once a thread finishes its last task, it begins to process the next task by increasing
the task index by P. Such a static scheduling method is quite easy to implement with
a negligible assignment overhead, as the assignment is completely known and fixed
before execution. However, it is well known that static scheduling may cause load
imbalance due to the unequal workloads of tasks. Load imbalance can also be caused
by runtime factors. For example, when a thread begins to execute a new task, say task
Ti, the previous task in the task sequence, Ti−1, may not have been started yet. This also
means that task Ti−1 is skipped in the time sequence. Figure 4.1 shows such an example
in which thread 2 runs faster than the other threads. In this case, the workloads of the
threads may differ greatly, and, hence, a load imbalance problem arises.
To solve the load imbalance problem of static scheduling, we further propose
a pseudo-dynamic scheduling method, which uses atomic operations and combines
advantages of both dynamic scheduling and static scheduling. In the pseudo-dynamic
scheduling method, a pointer named max_busy is maintained to point to the head-
most task that is being executed. Once any thread finishes its last task, max_busy is
atomically increased by one and then pointed to the next task. Figure 4.2 illustrates
the pseudo-dynamic scheduling method. The atomicity guarantees that even if multi-
ple threads are increasing max_busy simultaneously, they will get different results.
Algorithm 13 shows the proposed pseudo-dynamic scheduling method. It has two
advantages compared with static scheduling and conventional dynamic scheduling.
On one hand, such a method ensures that any thread always executes the next task with
the smallest index and no task can be skipped, and, thus, workloads of threads tend
to be balanced and load imbalance can be improved. On the other hand, compared
with the conventional dynamic scheduling method, the pseudo-dynamic scheduling
......
max_busy
Busy tasks
......
max_busy
Busy tasks
method greatly reduces the scheduling overhead, since an atomic operation is much
cheaper than a mutex operation.
In NICSLU, except for the parallel supernodal full factorization (described
in the next chapter), which uses the static scheduling method, all other factorization and
re-factorization methods use the pseudo-dynamic scheduling method.
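The pseudo-dynamic method can be sketched as follows. CPython has no bare atomic integer, so a small lock-protected counter stands in for the single hardware fetch-and-add used in NICSLU's C implementation; the task names are hypothetical.

```python
# Sketch of pseudo-dynamic scheduling: tasks are pre-sorted in a topological
# sequence, and each thread atomically claims the next index via a
# fetch-and-add on max_busy. The lock-based counter below models the single
# atomic instruction used in a real implementation.

import threading

class AtomicCounter:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self):
        with self._lock:              # models one atomic fetch-and-add
            v = self._value
            self._value += 1
            return v

def run_pseudo_dynamic(tasks, nthreads=2):
    max_busy = AtomicCounter()
    claimed = [[] for _ in range(nthreads)]

    def worker(tid):
        while True:
            idx = max_busy.fetch_add()        # claim the next task index
            if idx >= len(tasks):
                return
            claimed[tid].append(tasks[idx])   # ... factorize the column ...

    workers = [threading.Thread(target=worker, args=(i,))
               for i in range(nthreads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return claimed

claimed = run_pseudo_dynamic(["T%d" % i for i in range(1, 9)], nthreads=3)
```

Because each claim is one atomic increment, every task is executed exactly once and no task is skipped, while the per-task scheduling cost stays far below that of a mutex-guarded task pool.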
Figure 4.3b shows an example of the ET. Here we first give a simple explanation
of the statement that the ET is an upper bound of the column-level dependence.
If we do not consider any pivoting, the column-level dependence is determined by
the symbolic pattern of U. As can be seen from Fig. 4.3a, column 10 only depends
on column 7. However, the ET shows that column 10 can potentially depend on 8
columns out of all the 10 columns, except column 3 and column 10 itself. In order to
schedule tasks by utilizing the ET, we further levelized the ET, as shown in Fig. 4.3c.
The levelization is actually an ASAP scheduling of the ET. In other words, we can
define a level for each vertex in the ET as the maximum length from the vertex to
leaf vertexes, where a leaf vertex is defined as a vertex without any children vertex.
The level of a vertex can be calculated by the following equation:

level(k) = max{level(c1), level(c2), . . .} + 1,

where c1, c2, . . . are the children vertexes of vertex k, and the level of a leaf vertex
is 0. Visiting all the vertexes in a
topological order can calculate their levels in linear time. After the ET is levelized,
we can rewrite the ET into a tabular form, which is named Elimination Scheduler
(ESched), as illustrated in Fig. 4.4. It is obvious that tasks in the same level are
completely independent, so they can be factorized in parallel. Guided by the ESched,
we will propose a dual-mode scheduling method for parallel LU factorization. In
NICSLU, all parallel factorization methods are based on the proposed dual-mode
scheduling method.
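The levelization and the tabular ESched can be sketched as below; the small ET at the end is a hypothetical example, not the one in Fig. 4.3.

```python
# Sketch of the ASAP levelization of the ET: each vertex's level is one plus
# the maximum level of its children (leaves get level 0), and the ESched
# groups vertices by level. The ET below is a small hypothetical example.

def levelize(children):
    level = {}

    def compute(v):                  # effectively a topological-order visit
        if v not in level:
            kids = children.get(v, [])
            level[v] = 1 + max((compute(c) for c in kids), default=-1)
        return level[v]

    for v in children:
        compute(v)
    esched = {}
    for v, l in level.items():       # tabular form: level -> list of tasks
        esched.setdefault(l, []).append(v)
    return esched

# Hypothetical ET: 1 and 2 are leaves, 3 depends on both, 4 depends on 3.
esched = levelize({1: [], 2: [], 3: [1, 2], 4: [3]})
```

Each level of the resulting table holds mutually independent tasks, which is exactly the property the cluster mode exploits.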
There is a fundamental observation about the ESched: some levels at the front
have many tasks, but the remaining levels have much fewer tasks. This observation is
explained by the ASAP nature of the ESched: leaf tasks are all put into the first level, and
tasks with weak dependence are put into the front levels. According to this observation,
[Figure: the sparse pattern of a 10 × 10 matrix, its ET, and the levelized ET; the legend marks the tasks assigned to thread 1 and thread 2]
we can set a threshold to distinguish the two cases. In what follows, we assume that
there are L levels in total, and the first L_c levels and the remaining L_p = L − L_c
levels are distinguished, i.e., each of the first L_c levels has many tasks and each of
the remaining L_p levels has very few tasks.
For the L c front levels that have many tasks in each level, tasks in each level can
be factorized in parallel as tasks in the same level are completely independent. This
parallel mode is called cluster mode. All the levels belonging to cluster mode are
processed level by level. For each level, tasks are assigned to different threads (tasks
assigned to one thread are regarded as a cluster), and the load balance is achieved
by equalizing the number of tasks among all the clusters. Each thread executes the
same code (i.e., the modified G-P sparse left-looking algorithm) to factorize the
tasks which are assigned to it. Task-level synchronization is not required since tasks
in the same level are independent, which eliminates the bulk of the synchronization cost.
However, a barrier is required to synchronize all the threads, which means that the
cluster mode is a level-synchronization algorithm. Figure 4.5 shows an example of
task assignment to 2 threads in the cluster mode.
For the remaining L p levels, each level has very few tasks, which also means that
there is insufficient task-level parallelism, so the cluster mode cannot be efficient. We
explore parallelism between dependent levels by proposing a new approach called
pipeline mode. First, all the tasks belonging to the pipeline mode are sorted into a
Fig. 4.6 Time diagram of the cluster mode and the pipeline mode
topological sequence (in the above example shown in Fig. 4.4, the topological
sequence is {7, 8, 9, 10}), and then perform a static scheduling or pseudo-dynamic
scheduling to assign tasks to working threads. Parallelism is explored between depen-
dent tasks, and, thus, task-level synchronization is required in the pipeline mode. Each
thread factorizes a fetched column at a time. During the factorization, it needs to wait
for dependent columns to finish. Figure 4.5 also shows an example of task assign-
ment to 2 threads in the pipeline mode. To better understand the two modes, Fig. 4.6
illustrates the time diagram of the two parallel modes, compared with sequential
factorization.
In the cluster mode, each thread executes the modified G-P sparse left-looking algo-
rithm to factorize columns that are assigned to the thread. Since there is no column-
level synchronization in the cluster mode, fine-grained inter-thread communication
is not required. We only need a barrier to synchronize all the threads for each level
belonging to the cluster mode.
The pipeline mode is more complicated. In the pipeline mode, all the available
threads run in parallel, as shown in Algorithm 14. Suppose that a thread begins to
factorize a new column, say column k. The pseudo-code can be partitioned into two
parts: pre-factorization and post-factorization. In both parts, a set S is maintained to
store all the newly detected columns that are found in the last symbolic prediction.
Pre-factorization is composed of two passes of incomplete symbolic prediction and
numerical update. In both passes, symbolic prediction skips all unfinished columns,
and then all the finished columns stored in S are used to update the current column
k. These columns are marked as used and they will not be put into S again in later
symbolic predictions when factorizing column k. The second pass of symbolic pre-
diction starts from the skipped columns in the first pass, and then the thread waits for
all the children of column k to finish. After that, the thread enters post-factorization.
In post-factorization, the thread performs a complete symbolic prediction without
skipping any columns, as all the dependent columns are finished now, to determine
the exact symbolic pattern of column k. However, used columns will not be put into
S so S only contains the dependent columns which have not been used by column k.
The thread uses these newly detected columns to perform the remaining numerical
update on column k. Finally, partial pivoting and pruning are performed.
The pipeline mode exploits parallelism by pre-factorization. In the sequential
algorithm, one column, say column k, starts strictly after the previous column, i.e.,
column k - 1, is finished. However, in the pipeline mode, before the previous column
is finished, column k has already accumulated some numerical update from some
dependent and finished columns.
Although partial pivoting can change the row ordering, it cannot cause inter-
thread conflicts in the pipeline mode algorithm. The reason is that the ET contains
all possible column-level dependence if partial pivoting is adopted. If two columns
can cause conflicts due to partial pivoting, they cannot be factorized at the same
time since one of the two columns must depend on the other column in the ET.
However, pruning in the pipeline mode algorithm may cause inter-thread conflicts.
For example, if one thread is pruning a column but another thread is trying to visit
that column, it will cause unpredictable results or even a program crash. We will
discuss how to solve this problem in the next subsection.
The pipeline mode algorithm involves two practical issues in the implementation,
which require special attention.
How to determine whether a column is finished and how to guarantee the topo-
logical order during the symbolic prediction in the pre-factorization? We have
found that only using a flag for each column to indicate whether it is finished is
[Fig. 4.7 Example used to illustrate the problem of symbolic prediction in
pre-factorization: one thread visits and skips the unfinished column a, then visits
d and c, while other threads finish a and c in the meantime]
is too long. Our solution is to store an additional copy of the L indexes for each
column. The original L indexes are used for pruning and the copy will never be
changed. If a thread is going to visit a column during symbolic prediction, it first
checks whether this column is pruned. If so, it visits the pruned indexes. This will
not cause any problem since if a column is pruned, it will not be pruned again, so it
will not be changed any more. If the column being visited is not yet pruned, it
may be pruned at any time in the future, so we can only visit the copied
indexes. This method completely avoids the conflict but leads to some additional
storage overhead and runtime penalty.
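A minimal sketch of this dual-copy scheme (the class layout is hypothetical; NICSLU keeps these indexes in compressed arrays):

```python
class ColumnLIndexes:
    # Sketch of the dual-copy scheme: `copy` is the pristine L index list
    # that is never modified, so concurrent readers are always safe; the
    # pruned list is written exactly once, and since a column is never
    # pruned twice, a published pruned list never changes again.
    def __init__(self, l_indexes):
        self.copy = list(l_indexes)   # read during symbolic prediction, never changed
        self.pruned = None            # set once by the pruning thread

    def prune(self, kept_indexes):
        self.pruned = list(kept_indexes)

    def visit(self):
        # Readers take the (stable) pruned list if it has been published,
        # otherwise the unchanging pristine copy.
        p = self.pruned
        return p if p is not None else self.copy
```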
Please note that we do not need to explicitly construct the dependence graph
for parallel re-factorization. As the column-level dependence can be completely
determined by the symbolic pattern of U, the dependence graph is implied in the
symbolic pattern of U. Namely, the symbolic pattern of U is just the EG.
For parallel re-factorization, we also adopt the dual-mode scheduling method pro-
posed in Sect. 4.1.2.2 to schedule tasks. First, the EG is levelized by calculating the
level of each vertex using Eq. (4.2), as illustrated in Fig. 4.8c. The EG has a similar
feature to the ET: the front levels each contain many tasks, while the remaining
levels contain very few. An ESched is constructed according to the
levelized EG. The cluster mode and the pipeline mode are launched based on the
ESched. For the example shown in Fig. 4.8, the scheduling result is shown in Fig. 4.9,
assuming that there are 2 threads.
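Assuming the levelization rule of Eq. (4.2) is level(k) = max over dependencies j of level(j) + 1, with level 0 for columns that have no dependencies, the EG can be levelized directly from the symbolic pattern of U; a sketch:

```python
def levelize_eg(upattern):
    # upattern[k]: row indices of the off-diagonal nonzeros in column k of U.
    # Column k depends on every column j < k with U(j, k) != 0, so the level
    # of k is one more than the maximum level of its dependencies (level 0
    # for columns with no dependencies).
    n = len(upattern)
    level = [0] * n
    for k in range(n):
        deps = [j for j in upattern[k] if j < k]
        level[k] = 1 + max((level[j] for j in deps), default=-1)
    return level
```

Columns with equal level values form one level of the ESched; levels at the front tend to be wide and are handled by the cluster mode, while the narrow tail goes to the pipeline mode.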
In the cluster mode, each thread executes Algorithm 10 to factorize the columns that
are assigned to it. Like the cluster mode of full factorization, inter-thread synchro-
nization is not required, but we need a barrier to synchronize all the threads for each
level belonging to the cluster mode.
The pipeline mode algorithm in re-factorization is also much simpler than that
in full factorization. Algorithm 15 shows the pipeline mode re-factorization algo-
rithm. The major difference between Algorithm 15 and Algorithm 10 is in line 6
of Algorithm 15. In the pipeline mode re-factorization algorithm, when a thread is
trying to access a column, it will first wait for that column to finish. This is the only
inter-thread communication in the pipeline mode re-factorization algorithm. Such
a pipeline mode algorithm breaks the computational task of each column into fine-
grained subtasks, such that column-level dependence is also broken. Parallelism is
explored between dependent columns by running multiple subtasks in parallel. The
pipeline mode algorithm ensures a detailed computational order such that all the
numerical updates are done in a correct topological order.
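One column task of pipeline-mode re-factorization can be sketched as follows (the `update_by` and `finalize` callbacks stand in for the real numerical kernels, and `state` is the per-column finish flag used later for spin waiting):

```python
def pipeline_refactorize_column(k, udeps, state, update_by, finalize):
    # udeps: the dependencies of column k (nonzero rows of U(:, k)) in
    # increasing order, which is a valid topological order.  Each dependent
    # column is awaited (state[j] flips to 1 when column j finishes) and its
    # update is applied immediately, so work on column k overlaps with the
    # factorization of its later dependencies.
    for j in udeps:
        while state[j] == 0:      # wait for column j to finish
            pass
        update_by(k, j)           # numerical update of column k by column j
    finalize(k)                   # divide by the pivot, store column k
    state[k] = 1                  # publish: column k is finished
```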
We use Fig. 4.10 to illustrate the pipeline mode algorithm. Suppose that 2 threads
are factorizing column j and column k simultaneously. Column k depends on column
j and another column i (i < j < k). Assume that column i is already finished. While
factorizing column k, column k can be first updated by column i, corresponding to
the red line in Fig. 4.10. When it needs to use column j, it waits for column j until
it is finished (if currently column j is already finished, then no waiting is required).
Once column j is finished, column k can be updated by column j, corresponding
to the blue line in Fig. 4.10. At this moment, the thread that factorized column j
just now is already factorizing another unfinished column. Such a parallel execution
approach is very similar to the pipeline mechanism of CPUs, so we call it pipeline
mode.
Waiting for a dependent column to finish can be done by two methods: blocked wait-
ing and spin waiting. Blocked waiting does not consume CPU resources; however,
it involves system calls, so the performance overhead is quite large. In the pipeline
mode algorithm, since inter-thread synchronization happens very frequently, blocked
waiting can significantly degrade the performance. Consequently, in NICSLU, we
use spin waiting for inter-thread synchronization. Implementing spin waiting is quite
easy. A binary flag is set for each column. If the flag is 0, it indicates that the column
is unfinished; otherwise the column is finished. We use a spin loop to implement the
waiting operation. There is another problem that must be resolved. If the column that
is being waited for fails in factorization for some reason, e.g., a zero pivot, then the
waiting thread will never exit the waiting loop, because the dependent column can
never be finished; the waiting thread falls into an infinite loop. To resolve this
problem, we set an error code for each thread. During the waiting loop, we check all
the error codes. Once an error from some other thread is detected, the waiting thread
exits the waiting loop and also exits the current function. The spin waiting method is
shown in Algorithm 16, assuming that we are waiting for column k and there are P
available threads in total. In Algorithm 16, err is the array that stores the error codes of all
the threads, and state is the state flag assigned to each column to indicate whether
a column is finished. The overhead of spin waiting is that it always consumes CPU
resources. Therefore, when spin waiting is adopted, the number of invoked working
threads cannot exceed the number of available cores; otherwise the performance will
be dramatically degraded due to CPU resource conflicts.
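A sketch of the spin-waiting loop of Algorithm 16 (variable names follow the text; the busy-wait is deliberately simple):

```python
def spin_wait(k, state, err, P):
    # Busy-wait until column k is finished (state[k] == 1); while spinning,
    # scan the per-thread error codes so that a failure elsewhere (e.g., a
    # zero pivot) breaks the loop instead of spinning forever.
    while state[k] == 0:
        for t in range(P):
            if err[t] != 0:
                return err[t]     # propagate the error, abandon the wait
    return 0                      # column k finished normally
```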
Chapter 5
Improvement Techniques
In the previous two chapters, we have presented the basic flow of our solver
and the parallelization methodologies for both numerical full factorization and re-
factorization, as well as the factorization method selection strategy. The numerical
factorization algorithms described are based on the G-P sparse left-looking algo-
rithm, which is a column-level algorithm. Although the G-P algorithm is widely
used in circuit simulation problems, whether it is really the best algorithm
for circuit matrices is unclear. To date, very little work has been published that
comprehensively analyzes the performance of different computational granularities
for circuit matrices; most efforts have targeted general sparse matrices, which are
much denser than circuit matrices. In this chapter, we will point out that the
pure G-P algorithm is not always the best for circuit matrices. We will introduce two
improvement techniques for the G-P sparse left-looking algorithm. Inspired by the
observation that the best algorithm depends on the matrix sparsity, we will propose a
map algorithm and a supernodal algorithm which are suitable for extremely sparse
and slightly dense circuit matrices, respectively. Together with the G-P algorithm,
we will integrate three algorithms in NICSLU. For a given matrix, the best algorithm
is selected according to the matrix sparsity, such that NICSLU always achieves high
performance for circuit matrices with various sparsity. In addition, based on the
observation that the matrix values change slowly during Newton-Raphson iterations,
we will propose a novel pivoting reduction technique for numerical full factorization
to reduce the computational cost of symbolic prediction without affecting the
numerical stability.
In this section, we will introduce the map algorithm including the map definition and
the algorithm flow in detail. The map algorithm is proposed to reduce the overhead
of cache miss and data transfer for extremely sparse circuit matrices, so that the
performance can be improved for such matrices.
© Springer International Publishing AG 2017
X. Chen et al., Parallel Sparse Direct Solver for Integrated Circuit Simulation,
DOI 10.1007/978-3-319-53429-9_5
5.1.1 Motivation
[Figure: the uncompressed array x used by the G-P algorithm, with a few scattered
entries among positions 1-8]
The map algorithm is proposed to resolve the above two problems for extremely
sparse matrices. In the map algorithm, the uncompressed array x is avoided. Instead,
the addresses corresponding to the positions that will be updated during sparse LU
factorization are recorded in advance.
The map algorithm does not use the uncompressed array x. Instead, the compressed
storages of L and U are directly used in the numerical factorization. To solve the
indexing problem, the concept of map is proposed. The map is defined as a pointer
array which records all the addresses corresponding to the positions that will be
updated during the G-P left-looking sparse LU factorization. The map records all such
addresses in sequence. By employing the map, in the G-P algorithm, we only need to
directly update the numerical values which are pointed by the corresponding pointers
recorded in the map, instead of searching the update positions from compressed
arrays. After each update operation, the pointer is increased by one to point to the
next update position.
Creating the map is trivial. We just need to go through the factorization process
and record all the positions which are updated during sparse LU factorization in
sequence. Algorithm 17 shows the algorithm flow for creating the map. Besides
the map itself, we also record another array ptr, the location of each row's first
pointer in the map, which will be used for parallel map-based re-factorization. In
SPICE-like circuit simulation, the map is created after each full factorization. As
most of the factorizations are re-factorizations, the map is re-created very few times,
so its computational cost can be ignored. Actually our tests have shown that the time
overhead of creating a map is generally less than the runtime of one full factorization.
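A toy sketch of map creation and map-based update (the flat value array and the per-column lists of update positions are hypothetical simplifications of the real compressed LU storage, and ptr is recorded per column here for simplicity):

```python
def create_map(update_positions):
    # Replay the fixed factorization order once, recording in sequence every
    # flat index into the LU value array that a numerical update will touch
    # (in the spirit of Algorithm 17).  ptr[k] is the offset of column k's
    # first pointer in the map; a final sentinel marks the end.
    map_, ptr = [], []
    for positions in update_positions:     # one list per column, in order
        ptr.append(len(map_))
        map_.extend(positions)
    ptr.append(len(map_))
    return map_, ptr

def map_update_column(k, map_, ptr, values, contributions):
    # Map-based numerical update of column k: no searching in compressed
    # index arrays; dereference the recorded pointers in order, advancing
    # a cursor by one after each update.
    cursor = ptr[k]
    for c in contributions:                # aligned with the recorded order
        values[map_[cursor]] += c
        cursor += 1
```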
[Figure (a): numerical update of column k by dependent columns of L and U]
dominates the total runtime. Second, for non-extremely sparse matrices, the map can
be so long that the main memory may not hold the map.
Please note that the map algorithm can only be applied to re-factorization but not
full factorization, because the map can be created only when the symbolic pattern
of the LU factors is known. As shown in Fig. 3.1, in NICSLU, if the map algo-
rithm is selected, we still perform the column algorithm in full factorization. In
re-factorization, the map is first created if it is not yet created. In SPICE-like
circuit simulation, the map algorithm not only takes advantage of the high sparsity
of circuit matrices, but also exploits the unique feature that the matrix values
change slowly in Newton-Raphson iterations. Since full factorization is performed
very few times, map creation is also required infrequently. Successive
re-factorizations that follow the same full factorization can use the same map, so
the
map is not required to be re-created for these re-factorizations. This feature signifi-
cantly saves the overhead of map creation.
In parallel map re-factorization, we also apply the dual-mode scheduling strategy (i.e.,
the cluster and pipeline modes) to schedule tasks. The only point that is worth men-
tioning is that, in the parallel map re-factorization algorithm, since each thread does
not compute successive columns, the map pointers ptr constructed in Algorithm 17
are required to obtain the map starting position for desired columns. Algorithm 19
shows the algorithm flow of the pipeline mode map re-factorization algorithm. Before
factorizing a column, a thread first obtains the map for that column from ptr , i.e.,
the first update position of that column (line 4). The numerical update part is almost
the same as that in the sequential map algorithm, i.e., Algorithm 18. In the pipeline
mode, before visiting a dependent column, we also need to wait for it to finish (line
6).
In this section, we will present the supernodal algorithm in detail. The supernodal
algorithm is proposed to enhance the performance for slightly dense circuit matri-
ces by utilizing dense submatrix kernels. Different from the supernodal algorithm
adopted by SuperLU and SuperLU_MT [2, 3] which is actually a supernode-panel (in
SuperLU and SuperLU_MT, a panel means a set of successive columns which may
have different symbolic patterns) algorithm, our supernodal algorithm is a
supernode-column algorithm. Although circuit matrices can sometimes be slightly
dense, they are still much sparser than sparse matrices from other applications,
such as finite element analysis. This observation prevents us from adopting such a
heavyweight supernode-panel algorithm. Instead, we adopt the lightweight
supernode-column algorithm, which fits slightly dense circuit matrices well.
5.2.1 Motivation
Although circuit matrices are usually very sparse, they can also be dense in some
special cases. For example, post-layout circuits will contain large power and ground
meshes so matrices created by MNA can be dense due to the mesh nature. For the
LU factors of such matrices, there are many nonzero elements that can form dense
submatrices. To efficiently solve such matrices, we borrow the concept of supernode
from SuperLU and develop a lightweight supernode-column algorithm which is quite
suitable for slightly dense circuit matrices. The performance can be greatly improved
by utilizing a vendor-optimized BLAS library.
After grouping columns with the same symbolic pattern together, numerical updates
from these columns can be combined together by utilizing supernodal operations, i.e.,
supernode-column updates. Figure 5.4 explains why we can perform a
supernode-column update instead of multiple column-column updates. Suppose we
are factorizing column k, and there is a nonzero element U(j, k) in column k. This
means
that column k depends on column j. We further assume that column j belongs to a
supernode which ends at column s, as illustrated in Fig. 5.4. We do not care about
the first (leftmost) column of the supernode, since it has no impact on the
supernode-column update. According to the theory of symbolic prediction presented
in Sect. 3.3.1, there must be fill-ins at rows j + 1, j + 2, . . . , s in column k.
Consequently, columns j to s can update column k together through a single
supernode-column update. Algorithm 20 shows the
algorithm flow of the sequential supernodal full factorization algorithm, where the
numerical update flow is shown in Algorithm 21. Compared with the basic column-
based G-P algorithm which is shown in Algorithms 6 and 7, there are two major
differences. First, after the symbolic prediction of each column, supernode detec-
tion (line 4 of Algorithm 20) is performed to determine whether the current column
belongs to the same supernode as the previous column. Second, the numerical update
is different. As shown in lines 4-10 of Algorithm 21, if a dependent column belongs
to a supernode, we use two BLAS routines to perform a supernodal-column update;
otherwise the conventional column-column update is executed. It is easy to verify that
the supernode-column update is equivalent to multiple successive column-column
updates in theory.
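The equivalence can be checked with a small dense example (plain Python lists stand in for the supernode's dense storage; the forward substitution plays the role of the BLAS trsv call, and the matrix-vector loop plays the role of gemv):

```python
def column_column_updates(L, x, j, s):
    # Sequential G-P style updates of x by columns j..s of unit-diagonal L
    # (L is a dense list-of-rows here purely for illustration).
    x = list(x)
    n = len(x)
    for c in range(j, s + 1):
        for r in range(c + 1, n):
            x[r] -= x[c] * L[r][c]
    return x

def supernode_column_update(L, x, j, s):
    # Equivalent supernode-column update: a dense unit-lower-triangular
    # solve on the supernode's diagonal block (the role of trsv), followed
    # by one matrix-vector update of the rows below it (the role of gemv).
    x = list(x)
    n = len(x)
    for r in range(j + 1, s + 1):          # trsv: forward substitution
        x[r] -= sum(L[r][c] * x[c] for c in range(j, r))
    for r in range(s + 1, n):              # gemv: rows below the supernode
        x[r] -= sum(L[r][c] * x[c] for c in range(j, s + 1))
    return x
```

The two routines perform the same arithmetic grouped differently, so the results agree up to floating-point rounding; the supernodal grouping is what allows a vendor BLAS to exploit dense storage.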
The proposed supernodal algorithm has three advantages compared with the
column-based G-P algorithm for slightly dense matrices. First, due to the dense
storage of supernodes, indirect memory accesses within supernodes are avoided.
Second, we can utilize a vendor-optimized BLAS library to compute dense submatrix
operations, so that the performance can be significantly enhanced. Finally, the cache
efficiency can also be improved because supernodes are stored by continuous arrays.
The proposed supernode-column algorithm is different from SuperLU or PAR-
DISO, although they also utilize supernodes to enhance the performance for dense
submatrices. SuperLU and PARDISO both use a so-called supernode-supernode or
supernode-panel algorithm, where each supernode is updated by dependent
supernodes. The reason they use such a method is that, when multiple columns
depend on the same supernode, the common dependent supernode will be read
multiple times to update these columns separately. Consequently, gathering these
columns into a destination supernode (regardless of whether they have the same
symbolic pattern) and updating them together will make the common dependent
supernode be read only once. However, considering that modern CPUs have large
caches and supernodes in circuit matrices are generally not very large, many
supernodes can reside in cache simultaneously, so reading a supernode multiple
times does not significantly degrade the performance. In addition, the
supernode-panel algorithm adopted
by SuperLU and SuperLU_MT can introduce some additional computations and
fill-ins. Consequently, we develop the supernode-column algorithm which is more
lightweight than the supernode-supernode or supernode-panel algorithm adopted by
SuperLU and PARDISO. Another difference from SuperLU lies in the implementation
of the supernodal numerical update step. In SuperLU and SuperLU_MT, there
are only supernodes but there is no concept of column. Even if a column cannot
form a supernode with its neighboring columns, it is still treated as a supernode.
Any numerical update is performed by calling BLAS routines. In NICSLU, how-
ever, we do not call BLAS for column-column updates, which are computed by our
own code. As calling library routines involves some extra penalty, such as the stack
operations, using BLAS to compute a single-column supernode is not a good idea,
since the computational cost is too small, compared with other overhead associated
with calling library routines.
In re-factorization, the symbolic pattern of the LU factors is fixed so all the supern-
odes are also fixed. Namely, whether a column belongs to a supernode and which
supernode it belongs to are known and fixed. Consequently, like the column-based
re-factorization algorithm, we also only need to perform the numerical update in the
supernodal re-factorization algorithm.
columns belonging to the supernode are all marked as used so they will not be used
to update column k again (line 12); otherwise a column-column update is performed
(line 14). A special case is that columns j and k belong to the same supernode. In this
case, the last column of the supernode is larger than or equal to k; however, only
columns j to k - 1 of the supernode are required to update column k, so we need to
set the last column of the supernode to column k - 1 instead of its actual last column
(lines 7-9).
[Fig. 5.5 Time diagrams of the pipeline for columns j to s, comparing
column-column updates and supernode-column updates, including the waiting caused
by previous unfinished column-column updates]
A naive method is that, when a dependent column belongs to a supernode, we wait
for the entire supernode to finish. This does not cause
any accuracy problem but really causes a performance problem. If the supernode
is very large, i.e., it is composed of many columns, the waiting cost can be high,
and the performance may be even poorer than column-based re-factorization. In the
column-based pipeline mode, we can access a dependent column immediately after
it is finished; however, if we wait for the entire supernode to finish, we can access
the supernode only after all the columns belonging to the supernode are finished.
In this case, the waiting time can be very long. We still use the example shown in
Fig. 5.4 to illustrate this problem. Column k depends on columns j to s. When we
are factorizing column k and want to use columns j to s to perform a supernode-
column update, if column s is not finished, we need to wait for column s until it is
finished, and then perform a supernode-column update. In other words, column j is
accessed after column s is finished, instead of column j itself. In the column-based
pipeline mode algorithm, updates from finished columns can be performed before
column s is finished. Figure 5.5 illustrates and compares the two cases (naive
supernodal pipeline and column pipeline). Note that in the column-based pipeline
mode algorithm, we may wait for some additional time due to previous unfinished
column-column updates.
To solve this problem, we propose to partition a large supernode into two parts.
Please note that the partition does not mean that we explicitly store a large supern-
ode by two separated parts. It only means that when performing supernode-column
updates, a large supernode is treated as two smaller supernodes so that two
supernode-column updates are performed. We can treat the finished columns in a
large supernode as the first part and use them immediately for the first
supernode-column update. Figure 5.5 also illustrates this case (supernodal
pipeline). Due to the higher
performance of a supernode-column update than multiple column-column updates,
the waiting time caused by the unfinished first supernode-column update tends to
be significantly reduced, and, hence, the second supernode-column update may be
started immediately after column s is finished. Consequently, the total runtime may
be reduced compared with the column-based pipeline mode. To optimize this imple-
mentation, the second part of the supernode should contain only a few columns;
otherwise the second supernode-column update may still consume too much
runtime. In NICSLU, the threshold for judging whether a supernode is large enough
to require partitioning into two parts is 2P, where P is the number of invoked
threads. The size of the second part of the supernode is always set to P. The key rea-
son behind this setting is that, if there are two columns, say columns j and k (column
j is on the left of column k), whose positions in the pipeline sequence differ by more
than P, and if column k is being factorized, then column j must have been finished,
because there are only P threads. Consequently, setting the size of the second part of
the supernode to P ensures that no waiting happens for the first supernode-column
update. According to this principle, we present the algorithm flow of the pipeline
mode supernodal re-factorization in Algorithm 24. Lines 12-14 correspond to the
case in which only one supernode-column update is invoked. Lines 16-21 correspond
to the case in which two supernode-column updates are invoked. The other operations
in this algorithm flow have already been explained, so we will skip them here.
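The partition rule itself is tiny; a sketch (the `(first, last)` column ranges are an illustrative representation of a supernode):

```python
def split_supernode(first, last, P):
    # If the supernode is wider than 2*P columns, treat it as two smaller
    # supernodes, with the second part always exactly P columns wide: with
    # P threads, any column more than P positions behind the one currently
    # being factorized must already be finished, so the first (larger)
    # supernode-column update never has to wait.
    width = last - first + 1
    if width <= 2 * P:
        return [(first, last)]                  # single supernode-column update
    return [(first, last - P), (last - P + 1, last)]
```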
In this section, we will present a fast full factorization algorithm based on a novel
pivoting reduction technique. The proposed technique is used to accelerate full fac-
torization and improve its scalability for sparse matrices. It is also well compatible
with the SPICE-like circuit simulation flow.
KLU and NICSLU both have full factorization and re-factorization to perform numer-
ical LU factorization. Re-factorization does not perform any pivoting so it is faster
than full factorization. However, re-factorization is numerically unstable, so we can
use it only when we can guarantee the numerical stability. Full factorization
accommodates partial pivoting, but it is slower and its scalability is poor. In the
Newton-Raphson iterations of SPICE-like circuit simulation, the matrix values
change slowly, and the difference of matrix values between two successive iterations
is small, especially when the Newton-Raphson method is converging. In this case, if
full factorization is invoked, it tends to reuse most of the previous pivot choices.
Consider an
extreme case in which the second full factorization completely reuses the pivoting
order generated in the first full factorization. In this case, the symbolic predictions
performed in the second full factorization are actually useless because the symbolic
pattern is unchanged. However, before the second full factorization, we do not know
whether it really reuses the pivoting order so we still need to do pivoting during
factorization. If very few columns change their pivot choices, the same issue arises:
the symbolic predictions of some columns in the second full factorization are useless.
Our test statistics show that the symbolic prediction costs on
average 20% of the total runtime of full factorization. For extremely sparse matrices,
this ratio can be up to 50%. Therefore, if the useless symbolic predictions can be
avoided, the performance of full factorization can be significantly improved.
Why not borrow some ideas from re-factorization? Re-factorization is based on
the prerequisite that the symbolic pattern of the LU factors and the pivoting order are
known and fixed. In the second full factorization, when we are factorizing a column,
its symbolic pattern can be considered known from the first full factorization.
Here the only difference between full factorization and re-factorization is that, in full
factorization, the symbolic pattern may be changed if the pivot choice of that column
is changed from the first full factorization. However, before the symbolic prediction
of the column, we can assume that its symbolic pattern is known so we can directly
use the symbolic pattern obtained in the first full factorization. Then the symbolic
prediction of that column can be skipped, and the numerical update can be done as
usual. After that, partial pivoting is performed. If the pivot choice is changed, it means
that for subsequent columns, the symbolic pattern is also changed so the symbolic
prediction cannot be skipped. On the contrary, if that column still uses the previous
pivot choice, our assumption holds and the symbolic prediction of the next column
can still be skipped. To maximize the number of skipped symbolic predictions, we
should reuse previous pivot choices as much as possible. Toward this goal, we develop
a pivoting reduction technique, which is quite simple but effective. This gives us an
opportunity to skip the symbolic prediction for as many columns as possible.
In the conventional partial pivoting method, the diagonal element has the highest
priority when searching for the pivot. As shown in Algorithm 8, if the diagonal
absolute value is larger than or equal to the product of the threshold and the maximum
absolute value in the corresponding column, then the diagonal element can be the
pivot, even if it does not have the maximum magnitude in the column. In the pivoting
reduction technique, the element at the previous pivot position has the highest priority.
Namely, when we are doing partial pivoting for a column, we first check whether
the element at the previous pivot position is still large enough in absolute value to
remain the pivot. If so, the pivot order is not changed; otherwise, a conventional
partial pivoting is performed. This is the so-called pivoting reduction technique.
Algorithm 25 shows the algorithm flow of the pivoting reduction technique. It reuses
previous pivot choices whenever possible and, hence, helps keep the symbolic
pattern unchanged as much as possible. By employing the pivoting
reduction technique, we will develop a fast full factorization algorithm.
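A sketch of the pivoting reduction test (assuming, as in Algorithm 8, a threshold no larger than 1; `rows` and `values` are a hypothetical flat view of the current column's candidate entries):

```python
def pivot_with_reduction(rows, values, prev_pivot_row, threshold):
    # The element at the previous pivot position gets the highest priority:
    # if it passes the threshold test against the column's maximum absolute
    # value, the old pivot is reused and no re-pivoting occurs.
    maxval = max(abs(v) for v in values)
    for r, v in zip(rows, values):
        if r == prev_pivot_row and abs(v) >= threshold * maxval:
            return r, False                     # previous pivot reused
    # fall back to conventional partial pivoting (maximum magnitude)
    best = max(range(len(values)), key=lambda i: abs(values[i]))
    return rows[best], True                     # re-pivoting occurred
```

With a threshold of at most 1, the maximum-magnitude entry always passes the test, so the fallback is reached only when the previous pivot has genuinely become too small.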
Algorithm 26 shows the algorithm flow of the sequential fast full factorization,
which has two major parts: fast full factorization and normal full factorization. In
the fast full factorization (lines 3-13), for each column, the symbolic prediction
is skipped and the numerical update is performed based on the symbolic pattern
obtained in a previous full factorization. Then the pivoting reduction-based partial
pivoting shown in Algorithm 25 is performed. Once a re-pivoting has occurred, the
fast factorization is stopped and we enter the normal factorization (lines 14-22)
to compute the remaining columns by the normal factorization algorithm without
skipping the symbolic prediction.
It should be noticed that when a re-pivoting occurs at a column k, not all the
subsequent columns (i.e., columns k + 1, k + 2, . . . , N ) are required to be factorized
by the normal full factorization. Only those columns which directly or indirectly
depend on column k require the normal full factorization. However, searching for all
the dependent columns from the subsequent columns will traverse all the subsequent
columns so it is time-consuming. Consequently, we use a simple but effective method
that once a re-pivoting occurs, all the subsequent columns are computed by the normal
full factorization algorithm.
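The control flow just described, with the re-pivoting fallback applied to all subsequent columns, can be sketched with hypothetical callback kernels:

```python
def fast_full_factorization(n, fast_update, symbolic_predict, update, pivot):
    # fast_update(k): numerical update of column k using the symbolic
    #   pattern of the previous full factorization (no symbolic prediction);
    # pivot(k): pivoting-reduction partial pivoting; returns True iff
    #   re-pivoting occurred at column k.
    # All four callbacks are stand-ins for the real kernels.
    for k in range(n):
        fast_update(k)
        if pivot(k):                      # re-pivoting: pattern may change
            for kk in range(k + 1, n):    # remaining columns: normal algorithm
                symbolic_predict(kk)
                update(kk)
                pivot(kk)
            return k                      # first column that re-pivoted
    return n                              # entirely fast: no re-pivoting
```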
From the fast full factorization algorithm presented above, it can be concluded
that the performance of fast factorization strongly depends on the matrix change
during the NewtonRaphson iterations. If the matrix values change little during iter-
ations, each fast factorization can always use the previous pivoting order so that
no re-pivoting happens, which is the best case. On the contrary, if the matrix val-
ues change dramatically, re-pivoting will always happen. The worst case is that
re-pivoting happens at the first column of each fast factorization so that fast factor-
ization degenerates to the normal full factorization. Consequently, fast factorization
should never be slower than normal full factorization.
Although Algorithm 26 is for the column algorithm, the idea of fast factorization
can also be easily applied to the supernodal full factorization algorithm. One point
that is worth mentioning is that, if re-pivoting occurs at a column which belongs
to a supernode, the supernode will be changed. If the column in which re-pivoting
occurs is the first column of a supernode, then the supernode will be completely
destroyed; otherwise the supernode is ended at that column. Note that subsequent
columns may still belong to the same supernodes as before; however, this cannot be
determined until those columns are factorized, since their symbolic patterns may
change due to the re-pivoting. For concision, we do not present the details of the
supernodal fast factorization algorithm.
We also use the dual-mode scheduling method to perform the parallel fast full factorization. After applying the pivoting reduction-based partial pivoting method to
the cluster mode and the pipeline mode, we call the new cluster mode and new pipeline
mode the fast cluster mode and the fast pipeline mode, respectively. Figure 5.6 shows
the scheduling framework of the parallel fast full factorization. If no re-pivoting
occurs, the fast cluster and fast pipeline modes are executed successively. If a
re-pivoting occurs in the fast cluster mode, the remaining levels belonging to the
cluster mode are computed by the normal cluster mode, and the other levels are
computed by the normal pipeline mode. If re-pivoting occurs in the fast pipeline mode,
the fast pipeline mode stops and the normal pipeline mode is invoked for the remaining
columns.
It is worth mentioning that in the fast pipeline mode, once a re-pivoting has
occurred at a column, say column k, all the finished computations of subsequent
columns by other threads must be abandoned, and the normal pipeline mode must
be completely restarted from column k + 1, as the finished computations of columns
after column k are based on the old symbolic pattern of column k before the re-
pivoting occurs.
Chapter 6
Test Results
In this chapter, we will present the experimental results of NICSLU and the compar-
isons with PARDISO and KLU. The excellent performance of NICSLU is demon-
strated by two tests: benchmark test and simulation test. We will first describe the
experimental setup, and then present the detailed results of the two tests.
Both the benchmark test and the simulation test are carried out on a Linux server
equipped with two Intel Xeon E5-2690 CPUs running at 2.9 GHz and 64 GB of memory.
All codes are compiled by the Intel C++ compiler (version 14.0.2) with -O3 optimization. PARDISO
is from Intel Math Kernel Library (MKL) 11.1.2. Both NICSLU and PARDISO use
BLAS provided by Intel MKL.
In the benchmark test, we compare NICSLU with PARDISO and KLU for 40
benchmarks obtained from the University of Florida sparse matrix collection [1].
Table 6.1 shows the basic information (dimension, number of nonzeros, and the
average number of nonzeros in each row) of the benchmarks. All these bench-
marks are unsymmetric circuit matrices obtained from SPICE-based DC, transient,
or frequency-domain simulations. We exclude symmetric circuit matrices because
Cholesky factorization [2] is about 2× more efficient than LU factorization for
symmetric matrices. The dimension of these benchmarks covers a very wide range,
from two thousand to five million. The average number of nonzeros in each row
clearly shows that circuit matrices are extremely sparse: most of these benchmarks
have fewer than 10 nonzero elements per row on average. Even for the few slightly
denser circuit matrices, the average number of nonzeros in each row is only a little
larger than 10.
In the simulation test, we use an in-house SPICE-like circuit simulator to compare
NICSLU and KLU by running three self-generated circuits and six circuits modified
Springer International Publishing AG 2017 99
X. Chen et al., Parallel Sparse Direct Solver for Integrated Circuit Simulation,
DOI 10.1007/978-3-319-53429-9_6
from IBM power grid benchmarks [3]. Our self-generated benchmarks are post-
layout-like, i.e., there are large power and ground networks with a few transistors.
Since IBM power grid benchmarks are pure linear circuits, a few inverter chains are
inserted between the power network and the ground network to make the benchmarks
nonlinear. In order to reduce the impact of device model evaluation as much as
possible such that the total simulation time is dominated by the solver time, only a
few transistors are inserted in each benchmark.
6.2.1 Speedups
Speedup is the most intuitive factor that can be used to compare the runtime of dif-
ferent solvers. In the following results, two types of speedups are defined to compare
the performance:
speedup = (runtime of the other solver) / (runtime of NICSLU),    (6.1)

relative speedup = (runtime of sequential NICSLU) / (runtime of parallel NICSLU).    (6.2)
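In code, the two definitions read as follows (the runtimes in the usage test are illustrative placeholders, not measured values):

```python
def speedup(runtime_other, runtime_nicslu):
    """Eq. (6.1): speedup of NICSLU over another solver."""
    return runtime_other / runtime_nicslu

def relative_speedup(runtime_sequential, runtime_parallel):
    """Eq. (6.2): speedup of parallel NICSLU over sequential NICSLU."""
    return runtime_sequential / runtime_parallel
```

A value above 1 means NICSLU (or its parallel version) is the faster of the two.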
In the following results, some figures will be plotted by the concept of performance
profile [4], which is defined as follows, taking computational time as an example.
Assume that we have a solver set S and a problem set P. t_{p,s} is defined as the
runtime to solve problem p ∈ P by solver s ∈ S; if solver s cannot solve problem p,
t_{p,s} is set to infinity. The performance ratio is defined as

r_{p,s} = t_{p,s} / min{t_{p,s'} : s' ∈ S},    (6.3)

which measures the ratio of the runtime of a solver s on problem p to the runtime of
the best solver on the same problem. If solver s can solve problem p within α (α ≥ 1)
times the runtime of the best solver, i.e., the runtime of solver s is no more than
α · min{t_{p,s'} : s' ∈ S}, then for problem p, solver s is called α-solvable. The
performance profile of solver s is defined as the ratio of the number of α-solvable
problems to the total number of problems, i.e.,

P_s(α) = |{p ∈ P : r_{p,s} ≤ α}| / |P|,  α ≥ 1,    (6.4)

where |·| denotes the size of a set. P_s(α) measures the probability for solver s that
r_{p,s} is within a factor α of the best possible ratio. For a given α, a higher
performance-profile value means that solver s has higher performance. For α = 1, the
performance profile measures the fraction of problems on which solver s is the best
solver.
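The performance-profile definition above translates directly into code. The runtime table in the usage test below is illustrative, not taken from the actual experiments:

```python
import math

def performance_profile(times, solver, alpha):
    """Fraction of problems that `solver` solves within `alpha` times
    the best runtime, i.e., P_s(alpha) per Eq. (6.4).

    times: dict mapping each problem p to a dict {solver s: t_{p,s}},
    with t_{p,s} = math.inf if solver s fails on problem p.
    """
    solvable = 0
    for runtimes in times.values():
        best = min(runtimes.values())    # min over s' of t_{p,s'}
        ratio = runtimes[solver] / best  # performance ratio r_{p,s}
        if ratio <= alpha:
            solvable += 1
    return solvable / len(times)
```

A failed run (infinite runtime) yields an infinite ratio and is never counted as α-solvable.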
6.3 Results of Benchmark Test

In this section, we will present the detailed performance results of the benchmark
test, and also analyze the relation between the performance and the matrix sparsity.
We will first investigate how to select the optimal algorithm from the map algorithm,
column algorithm, and the supernodal algorithm by analyzing the matrix sparsity. We
will then present the relative speedups of NICSLU. We will also comprehensively
compare NICSLU with KLU and PARDISO in terms of factorization time, residual,
and the number of fill-ins to show the superior performance of NICSLU.
In this subsection, we will analyze and compare the performance of the map algo-
rithm, the column algorithm, and the supernodal algorithm. As explained in Chap. 5,
the performance of sparse LU factorization strongly depends on the matrix spar-
sity, which is evaluated by the SPR defined in Eq. (3.2). Figure 6.1 plots the SPR
values of all the 40 benchmarks in increasing order. As can be seen, the SPR
of circuit matrices covers a wide range from zero to more than 1000. For the 40
Fig. 6.1 Sparsity ratio
benchmarks, their SPR values are almost uniformly distributed in the logarithmic
scale. By analyzing the performance of these benchmarks, we are able to compre-
hensively investigate the performance of NICSLU.
To investigate how to select the optimal algorithm according to the value of SPR,
the map algorithm and the supernodal algorithm are compared with the pure column
algorithm, i.e., the G-P algorithm. Figure 6.2 shows the comparison, which is for the
re-factorization time. It clearly shows that the performance of the three algorithms
strongly depends on the matrix sparsity. The map algorithm is generally faster than
the column algorithm for extremely sparse matrices, i.e., the matrices on the leftmost
side. By comparing the map algorithm with the column algorithm in the sequential
and parallel cases, we can conclude that for matrices with SPR < 20, we should
select the map algorithm. The parallel map algorithm achieves higher speedups over
the corresponding column algorithm than the sequential map algorithm does. This
is because the parallel column algorithm has a higher cache miss rate than the
sequential column algorithm, as multiple uncompressed arrays x share the same
cache in the parallel column algorithm. For the parallel map algorithm, the threshold
can be up to 40; however, for simplicity of implementation, NICSLU uses the same
SPR threshold to select either the sequential or the parallel map algorithm. For
denser matrices, the map algorithm not only runs slower than the column algorithm,
but also consumes more memory to store the map. As shown in Fig. 6.2, the map
algorithm fails on three large matrices due to insufficient memory. The supernodal
algorithm is faster than the column algorithm for nearly half of the matrices on the
rightmost side, i.e., the slightly dense matrices. By comparing the supernodal algorithm
with the column algorithm, we can conclude that for matrices with SPR > 80, we should
select the supernodal algorithm rather than the column algorithm. By applying such
SPR-based thresholds, NICSLU automatically selects the fastest of the three algorithms
for a given matrix.
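The resulting selection rule can be summarized in a few lines, with the thresholds 20 and 80 as derived above (the handling of the exact boundary values is an implementation detail, not specified in the text):

```python
def select_algorithm(spr):
    """Select a factorization algorithm from the sparsity ratio (SPR).

    Map algorithm for extremely sparse matrices (SPR < 20),
    supernodal algorithm for slightly dense matrices (SPR > 80),
    and the column (G-P) algorithm in between.
    """
    if spr < 20:
        return "map"
    if spr > 80:
        return "supernodal"
    return "column"
```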
Fig. 6.2 Comparison of different algorithms
In this subsection, we will analyze the relative speedups of full factorization and
re-factorization of NICSLU to evaluate the scalability. Here we focus on the relative
performance of the parallel algorithms of NICSLU, so we only consider the
factorization or re-factorization time; the right-hand-side solving time is not
included.
Fig. 6.3 Relative speedups of full factorization (T = 8 and T = 16)

6.3.2.1 Full Factorization
Figure 6.3 shows the relative speedups of full factorization for all the 40 benchmarks.
As mentioned in Sect. 3.2.3, NICSLU selects sequential full factorization if the SPR
is smaller than 50. Therefore, for the first 22 matrices on the leftmost side, NICSLU
does not run parallel full factorization, so the relative speedup is always 1. For the
other 18 matrices, NICSLU runs parallel full factorization. However, the relative
speedups are not high. The average relative speedups of the 18 matrices when using
8 threads and 16 threads are 2.0× and 2.22×, respectively. The reason for the low
scalability is that we use the ET to schedule tasks in parallel full factorization, and
the ET severely overestimates the column-level dependences. For a few matrices, the
performance when using 16 threads is even lower than that when using 8 threads.
This abnormal phenomenon is caused by the hardware platform. We use two Intel
CPUs to run all the experiments. Each CPU has 8 cores, so if we run the solver with
8 threads, all the communication stays within one CPU. However, if we run the solver
with more than 8 threads, inter-CPU communication is invoked, whose overhead is
much larger than that of intra-CPU communication. This also limits the scalability of
the solver when too many threads are used. Fortunately, in SPICE-like circuit simulation,
full factorization is invoked only a few times, so its low scalability will not
significantly affect the overall performance of circuit simulators.
Fig. 6.4 Relative speedups of re-factorization (T = 8 and T = 16)
6.3.2.2 Re-factorization
Figure 6.4 shows the relative speedups of re-factorization of NICSLU for all the 40
benchmarks. Since in SPICE-like circuit simulation, most factorizations during the
Newton–Raphson iterations are re-factorizations, the scalability of re-factorization
will have a significant impact on the overall performance of circuit simulation. For-
tunately, compared with full factorization, the scalability of re-factorization is much
better. The reason is that the EG used for task scheduling in parallel re-factorization
stores the exact column-level dependence. Compared with the ET, the EG is wider
and shorter, indicating that the EG implies more parallelism. For almost all of these
benchmarks, parallel re-factorization can be faster than sequential re-factorization.
The average relative speedups of re-factorization when using 8 threads and 16 threads
are 3.76× and 4.29×, respectively.
Figure 6.4 also shows that the relative speedups of re-factorization tend to be
higher for denser matrices. To investigate the relation between the relative speedup
and the matrix sparsity, we show a scatter plot which draws the relation between the
SPR and the relative speedup of re-factorization in Fig. 6.5. It clearly shows that the
relative speedup has an approximate linear relation with the logarithm of the SPR.
This observation indicates that the relative speedup, i.e., the scalability, is better for
denser matrices. However, circuit matrices are highly sparse, so the scalability of
circuit matrix-oriented sparse solvers cannot be as high as that of solvers for general
sparse matrices from other applications. The reason is simply that the communication
overhead is relatively large for highly sparse matrices, since the computational cost
is small. From this observation, we can also obtain an early
estimation of the relative speedup for re-factorization when the SPR is known in the
pre-analysis step.
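Such an early estimate could be sketched as follows; this is an illustrative sketch only, and the coefficients `a` and `b` are hypothetical placeholders that would have to be fitted to measured (SPR, speedup) pairs like those behind Fig. 6.5, not values given in this book:

```python
import math

def estimate_relative_speedup(spr, a, b):
    """Early estimate of the re-factorization relative speedup from the
    SPR, using the observed approximately linear relation between the
    speedup and log10(SPR). The coefficients a and b are hypothetical
    placeholders to be fitted to measurements. The estimate is clamped
    below at 1.0 (a parallel run should not be predicted to be slower
    than the sequential one).
    """
    return max(1.0, a * math.log10(max(spr, 1.0)) + b)
```

Since the SPR is already available after the pre-analysis step, such an estimate costs essentially nothing at runtime.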
Fig. 6.5 Relation between the sparsity ratio and the relative speedup of re-factorization
6.3.3 Speedups
In this subsection, we will compare NICSLU with KLU and PARDISO in terms of
runtime. Since our purpose here is to evaluate the three solvers in circuit simulation
applications, we will compare the total runtime of factorization/re-factorization and
forward/backward substitutions, as these two steps are both repeated in the
Newton–Raphson iterations.

6.3.3.1 Full Factorization
Table 6.2 compares the total runtime of full factorization and forward/backward sub-
stitutions. Please note that for PARDISO, the runtime also includes the iterative
refinement step which is a necessary step for PARDISO. When comparing NICSLU
with KLU and PARDISO, due to the different pre-analysis algorithms adopted, the
number of fill-ins may differ dramatically, so the runtime also shows great differences.
Therefore, the geometric mean is fairer than the arithmetic mean when comparing the
runtime. Recall that KLU is a sequential solver. Compared with KLU, NICSLU
achieves 3.46×, 4.56×, and 4.56× speedups on average when using 1 thread, 8 threads,
and 16 threads, respectively. NICSLU is on average faster than PARDISO when using
1 thread, and slower than PARDISO when using multiple threads. This is mainly due
to the low scalability of full factorization in NICSLU: NICSLU uses the ET, which
contains all the potential column-level dependences, to schedule tasks in full
factorization, whereas PARDISO uses a fixed dependence graph to schedule tasks, so
PARDISO naturally has better scalability than the full factorization of NICSLU.
However, such a direct comparison is unfair, because by adopting partial pivoting,
NICSLU has much better numerical stability than PARDISO, which can only select
pivots from diagonal blocks.
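The geometric mean used for averaging the speedups can be computed as follows; unlike the arithmetic mean, it is not dominated by the few benchmarks whose fill-in, and hence runtime, differs most between solvers:

```python
import math

def geometric_mean(speedups):
    """Geometric mean of per-benchmark speedups, exp(mean(log x)).

    One benchmark with a 100x speedup and one with a 100x slowdown
    average out to 1x, whereas the arithmetic mean would report ~50x.
    """
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```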
6.3.3.2 Re-factorization
Table 6.3 compares the total runtime of re-factorization and forward/backward substi-
tutions. Re-factorization has better scalability than full factorization, so the speedups
of NICSLU compared with KLU and PARDISO are also higher. Compared with
KLU, NICSLU is faster for almost all of the benchmarks. The average speedups are
2.58×, 7.51×, and 7.94× when NICSLU uses 1 thread, 8 threads, and 16 threads,
respectively. Compared with PARDISO, NICSLU is faster for most of the benchmarks
and slower for only a few very dense matrices, for which the supernode–supernode
algorithm adopted by PARDISO is more suitable. The average speedups compared
with PARDISO are 3.15×, 2.01×, and 1.9× when NICSLU and PARDISO both use
1 thread, 8 threads, and 16 threads, respectively.
Figure 6.6 shows the performance profile for the total runtime of re-factorization
and forward/backward substitutions, which approximately evaluates the overall
solver performance in SPICE-like circuit simulators. It clearly shows that multi-
threaded NICSLU has the highest performance, and multi-threaded PARDISO is the
second best. The performance of sequential NICSLU is just a little lower than that of
multi-threaded PARDISO. Sequential PARDISO and KLU generally have the lowest
performance.
Figure 6.7 compares the three solvers in terms of giga floating-point operations per
second (GFLOP/s), which measures the floating-point computational performance
achieved by each solver. The general trend is that the GFLOP/s of the three solvers
increases as the matrix becomes denser. 16-threaded NICSLU generally has the
highest GFLOP/s and sequential PARDISO has the lowest, which is consistent with
the runtime performance of the three solvers. For a few benchmarks, PARDISO shows
very high GFLOP/s. This is sometimes due to the large number of fill-ins caused by
the pre-analysis step of PARDISO. For example, for benchmark circuit5M, 16-threaded
PARDISO can run at a high computational rate of 125 GFLOP/s; however, PARDISO
actually runs slowly on this benchmark, as shown in Tables 6.2 and 6.3. For rajat21,
rajat18, rajat29, and asic_680k, we can see a similar situation. Only for the last three
benchmarks (onetone1, rajat31, and memchip) is the high GFLOP/s of PARDISO
really a result of high computational performance. Consequently, GFLOP/s is a
one-sided metric that cannot well estimate the real performance of sparse solvers;
when comparing GFLOP/s, we should also compare the runtime or speedup to avoid
this one-sidedness.
On the other hand, the GFLOP/s values shown in Fig. 6.7 indicate that the
floating-point performance achieved by the three solvers is far from the peak
performance of the CPUs used in our experiments.
Fig. 6.6 Performance profile for the total runtime of re-factorization and forward/backward substitutions

Fig. 6.7 Comparison of GFLOP/s among the three solvers
In addition to the runtime, speedup, and GFLOP/s, which directly or indirectly reflect
the performance of sparse solvers, we will also compare some other factors among
the three solvers to present a comprehensive analysis. Figures 6.8 and 6.9 compare
NICSLU with KLU and PARDISO in terms of the residual and the number of
fill-ins, by plotting the corresponding performance profiles.
Figure 6.8 compares the three solvers in terms of the residual, which is defined as
the root-mean-square error (RMSE) of the residual vector Ax − b, i.e.,

r = Ax − b,    residual = √((1/N) · Σ_{i=1}^{N} r_i²).    (6.5)
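The residual of Eq. (6.5) can be computed as in the following sketch. The matrix is stored densely here for clarity; a real solver would of course use its sparse data structures:

```python
import math

def residual_rmse(A, x, b):
    """RMSE of the residual vector r = Ax - b, per Eq. (6.5).

    A: dense N x N matrix as a list of rows; x, b: length-N vectors.
    """
    n = len(b)
    total = 0.0
    for i in range(n):
        # i-th component of r = Ax - b
        r_i = sum(A[i][j] * x[j] for j in range(n)) - b[i]
        total += r_i * r_i
    return math.sqrt(total / n)
```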
As mentioned in Sects. 3.1 and 3.5, NICSLU has a feature that it can automatically
control the iterative refinement step, which can potentially improve the accuracy of
the solution. In Fig. 6.8, we evaluate the residual of NICSLU both when the iterative
refinement step is disabled and when it is enabled. Please note that when iterative
refinement is enabled, NICSLU may not perform any iterations if the solution is
already accurate enough or cannot be refined, as shown in Algorithm 11. The
comparison illustrated in Fig. 6.8 clearly shows that NICSLU with refinement generally has
the highest solution accuracy. Even when iterative refinement is disabled, NICSLU
still generates more accurate solutions than KLU and PARDISO. Compared with
KLU, NICSLU has an additional step of static pivoting, i.e., the MC64 algorithm,
which is introduced in Sect. 3.2.1.2, so the accuracy of the solution can be improved.
Compared with PARDISO, NICSLU adopts the partial pivoting strategy, which has a
larger pivoting selection space and generates more accurate solutions than the block
supernode diagonal pivoting method adopted by PARDISO. Actually, we have found
that for a few matrices, due to the incomplete pivot selection space, PARDISO fails
to get an accurate solution, which means that the residual is unreasonably large. For
NICSLU, by integrating the MC64 algorithm, partial pivoting, and/or the iterative
refinement algorithm together, we can always obtain accurate solutions even when
the matrix is nearly ill-conditioned.
Figure 6.9 compares the three solvers in terms of the number of fill-ins, i.e.,
the number of nonzero elements of L + U − I. Generally, NICSLU generates the
fewest fill-ins, and KLU and PARDISO have a similar performance on the number
of fill-ins. The difference in the fill-ins is mainly caused by the different algorithms
adopted in the pre-analysis step. KLU permutes the matrix into a block triangular form
(BTF) [7, 8] in the pre-analysis step. It is claimed that nearly all circuit matrices are
permutable to a BTF [9]; however, whether such a form can improve the performance
is unclear and needs further investigations. Our results from the benchmark test tend
to indicate that the effect of BTF on reducing fill-ins is somewhat small. On the
contrary, the MC64 algorithm adopted by NICSLU is helpful for improving the
numerical stability and reducing fill-ins. Although PARDISO also adopts the MC64
algorithm in the pre-analysis step, it uses a different ordering algorithm based on the
nested dissection method [10, 11], which generates better orderings only for very
large matrices. Combined with the MC64 algorithm, the AMD [12, 13] algorithm
adopted by NICSLU is generally more efficient for most practical problems.
In this section, we will present the detailed results of the simulation test. We have
created an in-house SPICE-like circuit simulator with the BSIM3.3 and BSIM4.7
MOSFET models [14] integrated. The simulator integrates NICSLU and KLU, so we
can easily compare the performance of NICSLU and KLU by running the simulator.
Six IBM power grid benchmarks for transient simulation [3] are adopted. Since they
are pure linear circuits, only forward/backward substitutions would be required during
transient simulation, which makes it difficult to evaluate the performance of numerical
LU factorization; we therefore artificially insert a few transistors into each benchmark
to make it nonlinear. We also create three power grid-like benchmarks
with large power and ground networks. The power and ground networks in the self-
generated benchmarks are completely regular meshes. A few inverter chains which
act as the functional circuit are inserted between the power network and the ground
network, making the circuit nonlinear as well. Figure 6.10 illustrates the power and
ground networks.
Table 6.4 compares the total transient simulation time between NICSLU and KLU.
NICSLU is faster than KLU in transient simulation for all of the nine adopted
benchmarks, regardless of the number of threads invoked by NICSLU. NICSLU achieves
3.62×, 6.42×, and 9.03× speedups on average compared with KLU in transient
simulation, when NICSLU uses 1 thread, 4 threads, and 8 threads, respectively. The
high performance of NICSLU comes from two factors: fewer fill-ins/FLOPs and more
advanced algorithms. To explain this, Table 6.5 compares the numbers of fill-ins and
FLOPs. For some benchmarks (ibmpg1t mod., ibmpg2t mod., ibmpg3t mod., ibmpg5t
mod., and ibmpg6t mod.), NICSLU generates far fewer fill-ins and FLOPs than KLU,
and, thus, NICSLU runs much faster than KLU in transient simulation. For the other
benchmarks, NICSLU generates more fill-ins and FLOPs than KLU, but still runs
faster. For example, for ibmpg4t mod., NICSLU generates 6% more FLOPs than
KLU, yet NICSLU is 2.36× faster than KLU even when running sequentially. This
speedup is clearly due to the more advanced algorithms adopted by NICSLU.
The comparison of the number of fill-ins and FLOPs shown in Table 6.5 indicates
that the BTF algorithm adopted by KLU seems to be more suitable for regular meshed
circuits, as KLU generates fewer fill-ins and FLOPs than NICSLU for the three
self-generated regular meshed circuits. However, whether this conclusion holds in
general requires further investigation, which is beyond the scope of this book.
In summary, NICSLU has proven to deliver high performance in time-consuming
post-layout simulation problems.
References
1. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011)
2. Davis, T.A.: Direct Methods for Sparse Linear Systems, 1st edn. Society for Industrial and Applied Mathematics, US (2006)
3. Li, Z., Li, P., Nassif, S.R.: IBM Power Grid Benchmarks. http://dropzone.tamu.edu/~pli/PGBench/
4. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
5. Wikipedia: Hyper-threading. https://en.wikipedia.org/wiki/Hyper-threading
6. Wikipedia: SSE2. https://en.wikipedia.org/wiki/SSE2
7. Duff, I.S., Reid, J.K.: Algorithm 529: permutations to block triangular form [F1]. ACM Trans. Math. Softw. 4(2), 189–192 (1978)
8. Duff, I.S.: On permutations to block triangular form. IMA J. Appl. Math. 19(3), 339–342 (1977)
9. Davis, T.A., Palamadai Natarajan, E.: Algorithm 907: KLU, a direct sparse solver for circuit simulation problems. ACM Trans. Math. Softw. 37(3), 36:1–36:17 (2010)
10. George, A.: Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. 10(2), 345–363 (1973)
11. Lipton, R.J., Rose, D.J., Tarjan, R.E.: Generalized nested dissection. SIAM J. Numer. Anal. 16(2), 346–358 (1979)
12. Amestoy, P.R., Davis, T.A., Duff, I.S.: An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl. 17(4), 886–905 (1996)
13. Amestoy, P.R., Davis, T.A., Duff, I.S.: Algorithm 837: AMD, an approximate minimum degree ordering algorithm. ACM Trans. Math. Softw. 30(3), 381–388 (2004)
14. BSIM Group: Berkeley Short-Channel IGFET Model. http://bsim.berkeley.edu/
Chapter 7
Performance Model
In the previous chapter, we have shown the test results of NICSLU, where the relative
speedups vary over a wide range for different benchmarks. In order to understand the
performance difference and find possible limiting factors of the scalability, further
investigations are required. Toward this goal, in this chapter, we will build a perfor-
mance model to analyze the performance and find bottlenecks of the scalability of
NICSLU. The performance model is based on an as-soon-as-possible (ASAP) analy-
sis on the dependence graph (i.e., the EG) used for parallel numerical re-factorization.
Under a unified assumption about the computational and synchronization costs, the
performance model predicts the theoretical maximum relative speedup and the max-
imum relative speedup when using given cores. With the performance model, one
can also analyze the parallel efficiency to further understand the bottlenecks in the
parallel algorithm.
The FLOPs of re-factorization can be divided into two parts. One part is related to
the update of column k by a dependent column j, which is denoted
as OPupd (j, k) and takes 2 · NNZ(L(j + 1 : N, j)) units of runtime. The other
part is related to the normalization of column k of L, corresponding to line 8 of
Algorithm 10, which is denoted as OPnorm (k) and takes NNZ(L(k + 1 : N , k))
units of runtime. Finishing OPnorm (k) is equivalent to finishing the factorization of
column k.
The above-mentioned operations can be easily mapped onto the dependence graph
used for scheduling parallel re-factorization. A directed edge ( j, k) in the dependence
graph corresponds to OPupd ( j, k) and a node labeled as k corresponds to OPnorm (k).
According to this mapping, the dependence graph becomes a task flow graph that
describes all the FLOPs which are required to factorize the matrix. The task flow
graph also implies the timing constraints that must be satisfied during parallel
re-factorization. Figure 7.1 shows an example of the task flow graph. Take node 6 as
an example to illustrate the timing constraints:
- OPupd (3, 6) can only be started after OPnorm (3) is finished, and the same holds for OPupd (4, 6) and OPupd (5, 6).
- OPnorm (6) can only be started after OPupd (3, 6), OPupd (4, 6), and OPupd (5, 6) are all finished.
- According to the thread-level scheduling method, the four tasks OPupd (3, 6), OPupd (4, 6), OPupd (5, 6), and OPnorm (6) are executed by one thread, i.e., sequentially.
These timing constraints imply that we can use an ASAP algorithm to calculate the
earliest finish time of all the tasks shown in the dependence graph. Before presenting
the ASAP algorithm, we first define some symbols which will be used in the ASAP
algorithm.
- FT(k): the earliest finish time of OPnorm (k), which is also the earliest finish time of the factorization of column k.
- FT: the earliest finish time of the entire dependence graph.
- FTcore (p): the time when core p finishes its last task.
7.1 DAG-Based Performance Model
We have two algorithms to evaluate the performance of NICSLU. The first one
is shown in Algorithm 27. It assumes infinite cores and calculates the earliest finish
time of the entire graph. The algorithm calculates the theoretical minimum finish
time for a given matrix by accumulating the computational cost of FLOPs and the
synchronization cost, while the above-mentioned timing constraints are satisfied.
After the earliest finish time is calculated, the predicted relative speedup can be
calculated as follows:
predicted relative speedup = FLOPs / FT.    (7.1)
As Algorithm 27 assumes that infinite cores are used, the relative speedup estimated by Algorithm 27 and Eq. (7.1) is the theoretical upper limit of the relative speedup for a given matrix. Namely, the actual relative speedup of any practical execution cannot exceed this value, regardless of how many threads run in parallel. The theoretical maximum relative speedup cannot be used to predict actual relative speedups, as it assumes infinite cores; however, it gives a good estimate of the parallelism of a given matrix. In other words, it estimates the maximum parallelism that can be achieved by parallel re-factorization, regardless of the number of cores used. If the theoretical maximum relative speedup is too low, the given matrix is not suitable for parallel factorization.
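A minimal ASAP sketch in the spirit of Algorithm 27 (not the book's exact pseudocode) is shown below. With unlimited cores, the earliest finish time FT(k) of column k is limited only by its predecessors in the dependence graph; the example graph, the per-column cost, and the value of T_sync are all illustrative assumptions:

```python
# Illustrative ASAP sketch: FT(k) = (latest predecessor finish + T_sync)
# + cost of column k; with infinite cores, no task ever waits for a core.
T_SYNC = 10.0                    # synchronization cost per dependence
preds = {1: [], 2: [], 3: [1], 4: [2], 5: [2], 6: [3, 4, 5]}
cost = {k: 20.0 for k in preds}  # assumed FLOP cost of factorizing column k

FT = {}
for k in sorted(preds):          # column order is a topological order here
    ready = max((FT[j] + T_SYNC for j in preds[k]), default=0.0)
    FT[k] = ready + cost[k]      # start as soon as all predecessors finish

total_flops = sum(cost.values())
print(FT[6])                            # 80.0, earliest finish of column 6
print(total_flops / max(FT.values()))   # 1.5, predicted speedup of Eq. (7.1)
```

Because no task ever waits for a core, the resulting speedup of 1.5 is the theoretical upper limit for this toy graph, no matter how many threads are used.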
We have another algorithm, shown in Algorithm 28, which calculates the earliest finish time when a limited number of cores is used. For each task (i.e., a column), the core that finishes its last task earliest among all available cores is selected to execute the current task. Apart from this difference, the calculation of the earliest finish time is the same as in Algorithm 27. After Algorithm 28 finishes, we can estimate the maximum relative speedup under limited cores using Eq. (7.1). This estimated relative speedup can be used to predict actual relative speedups.
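A sketch of the limited-core variant, in the spirit of Algorithm 28 (again not the book's exact pseudocode): each column is assigned to the core with the minimum FTcore(p), as described above. The graph, costs, and T_sync are the same illustrative values as before:

```python
# Illustrative limited-core ASAP sketch: a column starts when both its
# chosen core and all of its predecessor columns are ready.
def asap_limited(preds, cost, n_cores, t_sync=10.0):
    ft = {}                        # FT(k): earliest finish time of column k
    core_free = [0.0] * n_cores    # FTcore(p): when core p finishes its last task
    for k in sorted(preds):        # column order is a topological order here
        p = min(range(n_cores), key=lambda i: core_free[i])
        ready = max((ft[j] + t_sync for j in preds[k]), default=0.0)
        start = max(core_free[p], ready)   # wait for both the core and the data
        ft[k] = start + cost[k]
        core_free[p] = ft[k]
    return max(ft.values())        # FT: earliest finish time of the whole graph

preds = {1: [], 2: [], 3: [1], 4: [2], 5: [2], 6: [3, 4, 5]}
cost = {k: 20.0 for k in preds}
print(asap_limited(preds, cost, n_cores=2))   # 100.0
```

With 2 cores the toy graph finishes at time 100, later than the infinite-core bound of 80, because column 5 must wait for a free core.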
Besides predicting relative speedups, we are also interested in investigating the bottlenecks of the parallel algorithm. Two potential factors may limit the scalability. The first is the parallelism: if there is not enough parallelism, the parallel efficiency will be low. The other is the synchronization cost: if synchronization takes a large portion of the total computational time, the parallel efficiency will also be low. Parallelism is not easy to measure directly, as sparse LU factorization is a task-driven application, so in this model we use the waiting cost as a proxy for the parallelism. When we try to use one column to update another, the former column must be finished; otherwise, we need to wait until it is finished. Intuitively, the waiting cost can be treated as an estimate of the parallelism. If the parallelism is high,
there tend to be many independent columns that can be factorized in parallel; therefore, the dependence graph tends to be wide and the critical path tends to be short. In other words, the data dependence in the pipeline mode tends to be weak, and weak dependence leads to a low waiting cost. On the contrary, if the parallelism is low, the dependence graph will be narrow and the critical path tends to be long. In this case, the dependence is strong, leading to a high waiting cost, as tasks are closely dependent. Note that directly analyzing the dependence graph used for scheduling parallel re-factorization does not yield a good estimate of the parallelism, because the proposed pipeline mode scheduling strategy exploits parallelism between dependent vertices in the DAG. In other words, an inter-vertex-level analysis underestimates the parallelism. To analyze the impact of the parallelism and the synchronization on the parallel efficiency, we also collect the waiting cost and the synchronization cost in Algorithm 28, as shown in lines 9 and 13. Once Algorithm 28 is finished, we can calculate the percentages of the waiting cost and the synchronization cost based on
waiting% = (Cwait / FLOPs) × 100%,    (7.2)
synchronization% = (Csync / FLOPs) × 100%.
Bottlenecks of parallel LU re-factorization can be investigated by comparing the waiting cost and the synchronization cost obtained from Algorithm 28. One can also judge whether a matrix is suitable for parallel factorization by analyzing the waiting percentage and the synchronization percentage of Eq. (7.2). If at least one percentage is high, e.g., above 50%, the parallel efficiency cannot be high for the given matrix due to the high waiting or synchronization cost.
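To make Eq. (7.2) concrete, the sketch below re-runs the limited-core schedule while accumulating the two costs. The cost model (idle time until predecessors finish for Cwait, one T_sync charge per dependence for Csync) and the example graph are illustrative assumptions, not NICSLU's exact bookkeeping:

```python
# Illustrative sketch of the bottleneck metrics of Eq. (7.2): simulate
# the limited-core schedule and accumulate the waiting cost C_wait and
# the synchronization cost C_sync, then normalize both by total FLOPs.
def bottleneck_percentages(preds, cost, n_cores, t_sync=10.0):
    ft, core_free = {}, [0.0] * n_cores
    c_wait = c_sync = 0.0
    for k in sorted(preds):                       # topological order
        p = min(range(n_cores), key=lambda i: core_free[i])
        ready = 0.0
        for j in preds[k]:
            ready = max(ready, ft[j] + t_sync)
            c_sync += t_sync                      # one synchronization per dependence
        c_wait += max(0.0, ready - core_free[p])  # idle time waiting on predecessors
        ft[k] = max(core_free[p], ready) + cost[k]
        core_free[p] = ft[k]
    flops = sum(cost.values())                    # Eq. (7.2): normalize by total work
    return 100.0 * c_wait / flops, 100.0 * c_sync / flops

preds = {1: [], 2: [], 3: [1], 4: [2], 5: [2], 6: [3, 4, 5]}
w, s = bottleneck_percentages(preds, {k: 20.0 for k in preds}, n_cores=2)
print(round(w, 2), round(s, 2))   # ≈ 41.67 50.0
```

Note that both percentages can exceed 100% when the per-dependence costs dominate the FLOP count, which is exactly the regime in which a matrix is a poor candidate for parallel factorization.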
7.2 Results and Analysis
In this section, we show and analyze the results of the proposed performance model from three aspects: the theoretical maximum relative speedup, the predicted relative speedup, and the bottlenecks of parallel LU re-factorization. Tsync is set to 10 in these experiments.
Figure 7.2 plots the theoretical maximum relative speedup of all the 40 benchmarks calculated by Algorithm 27. Since the theoretical maximum relative speedup is the theoretical upper limit of the relative speedup, Fig. 7.2 shows the maximum possible relative speedup that we can achieve, regardless of how many cores are used.
Fig. 7.2 Predicted theoretical maximum relative speedup of re-factorization (bar chart with a logarithmic axis from 1 to 1000; one bar per benchmark matrix, from rajat21 on the left to twotone on the right)
Figure 7.3 shows a scatter plot of the predicted relative speedup versus the actual relative speedup of re-factorization when 8 threads are used. The predicted relative speedup is consistent with the actual relative speedup, with an approximately linear relationship between them. Consequently, the proposed performance model can be used to predict the parallel efficiency of re-factorization in NICSLU. Of course, many detailed factors can affect the actual performance, and not all of them can be captured by our model. However, a simple performance model can capture the major factors and reasonably predict the performance. In what follows, we analyze the bottlenecks that affect the scalability of NICSLU.
Fig. 7.3 Relation between the predicted relative speedup and the actual relative speedup of re-factorization (T = 8; both axes range from 0 to 8)
Fig. 7.4 Percentages of the waiting cost and the synchronization cost (Waiting% and Synchronization%, from 0% to 200%; one pair of bars per benchmark matrix, from rajat21 on the left to onetone1 on the right)
As matrices become denser, both percentages tend to decrease, as the computational cost, i.e., the number of FLOPs, tends to increase
for dense matrices. For a few extremely sparse matrices, i.e., the matrices on the far left, the synchronization cost is higher than the waiting cost, and both can be very high. This observation indicates that extremely sparse matrices are not suitable for parallel factorization, as the synchronization cost is too high; in addition, the waiting cost is also high due to insufficient parallelism. However, when the matrix is not so sparse, the synchronization cost decreases rapidly and the waiting cost dominates the parallel overhead. Even for slightly denser matrices, the waiting percentage can be up to 20%. This means that the parallelism is the major factor limiting the scalability of NICSLU for those matrices.
Chapter 8
Conclusions
Springer International Publishing AG 2017
X. Chen et al., Parallel Sparse Direct Solver for Integrated Circuit Simulation,
DOI 10.1007/978-3-319-53429-9