
PERFORMANCE IMPROVEMENT OF SPICE

Presented by: Abhilash Wase, Prince Harinkhede, Rajat Karkare, Souvik Das
Guided by: Prof. S.R. Pandey
CIRCUIT SIMULATOR
Software that performs various analyses:
1. DC operating Point
2. Transient
3. Small Signal AC
4. Fourier
5. Noise
6. Distortion

Allows a design to be tested prior to manufacture on silicon.
VLSI ERA
As circuit size scales up, simulation time increases.
Today's ICs pack billions of transistors.
Simulating VLSI circuits with SPICE running on a general-purpose processor (GPP) takes ever longer.
Innovations in CPU architecture and clock-speed improvements have now hit a speed wall.
SPICE
Simulation Program with Integrated Circuit Emphasis.
The de-facto standard for circuit simulation.
Developed at the University of California, Berkeley.
The SPICE2G6 version was written in FORTRAN; later upgraded to SPICE3f5, written in C.
Theoretically Speaking
To solve any given electric circuit:
1. Write the KCL or KVL equations.
2. Assemble the matrix equation [G][V] = [I] (or [R][I] = [V]).
3. Solve for the unknowns.
As the number of nodes increases, the size of the matrix increases.
SPICE does more or less the same thing.
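As a minimal sketch of those three steps, consider a hypothetical two-node circuit: a 1 mA current source into node 1, R1 = 1 kOhm between nodes 1 and 2, R2 = 2 kOhm from node 2 to ground (the circuit and values are made up for illustration, not taken from the slides). KCL at each node gives [G][V] = [I], which the code below assembles and solves directly.

    /* Illustration only (not SPICE code): nodal analysis of a hypothetical
     * two-node resistive circuit and direct solution of the 2x2 system. */
    #include <stdio.h>

    int main(void)
    {
        double G1 = 1.0 / 1e3, G2 = 1.0 / 2e3;   /* conductances in siemens */

        /* Conductance matrix assembled from KCL at nodes 1 and 2. */
        double G[2][2] = { {  G1, -G1      },
                           { -G1,  G1 + G2 } };
        double I[2] = { 1e-3, 0.0 };             /* 1 mA injected into node 1 */

        /* Solve the 2x2 system [G][V] = [I] by Cramer's rule. */
        double det = G[0][0] * G[1][1] - G[0][1] * G[1][0];
        double V1 = (I[0] * G[1][1] - G[0][1] * I[1]) / det;
        double V2 = (G[0][0] * I[1] - I[0] * G[1][0]) / det;

        printf("V1 = %g V, V2 = %g V\n", V1, V2);   /* prints V1 = 3 V, V2 = 2 V */
        return 0;
    }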
Inside SPICE
The netlist input describes the circuit connections, model parameters and the types of analysis to run.
Modified Nodal Analysis (MNA) translates this into a set of equations modeling the electrical behavior.
This generates [A][x] = [b].
Two dominant phases: Model-Evaluation and Matrix-Solve.
Flowchart of SPICE Simulator
Example Circuit

1. Is - IR = 0
2. IR - Id - Ic = 0
3. Is = IR
4. IR = (V1 - V2) * G1
5. Id = Gd(eq) * V2 + Id(eq)
6. Ic = Gc(eq) * V2 + Ic(eq)
Resulting Matrix Equation

Model Equations for Diode and Capacitor
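The original slide shows these equations as an image. For reference, the standard companion-model forms behind Gd(eq), Id(eq), Gc(eq) and Ic(eq) are assumed here to be the usual ones (Newton-Raphson linearization of the diode, backward-Euler integration of the capacitor):

    Diode, linearized around the previous Newton-Raphson voltage Vk:
        Gd(eq) = (Is/Vt) * exp(Vk/Vt)
        Id(eq) = Is * (exp(Vk/Vt) - 1) - Gd(eq) * Vk

    Capacitor, backward Euler with timestep h:
        Gc(eq) = C/h
        Ic(eq) = -(C/h) * V(tn)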


SPICE rusage Analysis
Metric | Value
Total CPU time (s) | 2.400
Nominal temperature (°C) | 27
Operating temperature (°C) | 27
Total iterations | 468118
Circuit equations | 5
Transient timepoints | 200059
Total analysis time (s) | 0.7
Transient time (s) | 0.7
Matrix reordering time (s) | 0
LU decomposition time (s) | 0.12
Matrix solve time (s) | 0.09
Load time (s) | 0.22
Where is the problem?
Total iterations.
Repeated evaluation of the model-evaluation phase for nonlinear elements.
Repeated evaluation of the matrix-solve phase for transient elements.
Iterations increase as the circuit size scales.
Each iteration means a repeated LU factorization.
Analyzing SPICE
SPICE works on a matrix [A] which models the circuit topology.
Matrix [A] is generated by MNA.
At every iteration this matrix goes through LU factorization followed by forward- and back-solve phases.
The SPARSE 1.3 solver handles these sparse matrices.
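To make the repeated work concrete, here is a minimal dense LU factorization with forward and back solves in C. It is a sketch of the algorithm only (no pivoting, dense storage, made-up 3x3 system); it is not the Sparse 1.3 implementation.

    /* Illustrative dense LU factorization and solve for [A][x] = [b]. */
    #include <stdio.h>

    #define N 3

    static void lu_factor(double A[N][N])
    {
        /* In-place Doolittle factorization: A becomes L (strictly below the
         * diagonal, unit diagonal implied) and U (diagonal and above). */
        for (int k = 0; k < N; k++)
            for (int i = k + 1; i < N; i++) {
                A[i][k] /= A[k][k];                 /* multiplier L[i][k] */
                for (int j = k + 1; j < N; j++)
                    A[i][j] -= A[i][k] * A[k][j];   /* eliminate column k */
            }
    }

    static void lu_solve(double A[N][N], const double b[N], double x[N])
    {
        double y[N];
        for (int i = 0; i < N; i++) {               /* forward solve L y = b */
            y[i] = b[i];
            for (int j = 0; j < i; j++)
                y[i] -= A[i][j] * y[j];
        }
        for (int i = N - 1; i >= 0; i--) {          /* back solve U x = y */
            x[i] = y[i];
            for (int j = i + 1; j < N; j++)
                x[i] -= A[i][j] * x[j];
            x[i] /= A[i][i];
        }
    }

    int main(void)
    {
        double A[N][N] = { {4, 1, 0}, {1, 3, 1}, {0, 1, 2} };
        double b[N] = {1, 2, 3}, x[N];

        lu_factor(A);
        lu_solve(A, b, x);
        printf("x = %g %g %g\n", x[0], x[1], x[2]);
        return 0;
    }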
Sparse Matrices
The intermediate matrix [A] generated from MNA exhibits high sparsity, because each node in the underlying circuit has only a few devices connected to it.
In practice, roughly 99% of the entries of [A] are zeros.
Only the numerical values at the nonzero locations are updated between iterations.
SPARSE 1.3
The LU solver shipped with SPICE3f5.
Based on an orthogonal linked-list structure.
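A rough sketch of such an element (field names are illustrative, not the exact Sparse 1.3 record): each nonzero is linked into both its row and its column.

    /* Schematic orthogonal-linked-list matrix element (illustrative names). */
    struct element {
        double          value;        /* numerical value of the nonzero  */
        int             row, col;     /* its position in the matrix      */
        struct element *next_in_row;  /* next nonzero to the right       */
        struct element *next_in_col;  /* next nonzero below              */
    };

Walking next_in_row and next_in_col lets the factorization visit only the nonzeros, which is why the roughly 99% zero entries cost nothing.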
Inside SPARSE
A collection of C sub-routines: build, allocation, factorization, solve and print.
Holds the matrix in a MatrixFrame structure.
The spOrderAndFactor routine performs the factorization.
It contains several sub-routines: SearchForPivot, ExchangeRowAndCol and RealRowColElimination.
These are called at every elimination step, so the loop runs for the order of the matrix.
Analyzing Sparse1.3
To analyze Sparse 1.3 we first extract it from SPICE.
We make Sparse run LU factorizations on benchmark matrices from the University of Florida Sparse Matrix Collection.
We record the CPU (Intel Core i3) time used by every executed function with a profiling tool (AQtime).
Benchmark Matrices

Fpga_dcop_01, Bmhof_1, Rajat28, Rajat24, Bmhof_2, Grund_meg1


Characteristics of Circuit Matrices

* Number of nonzero elements.
** Numerical symmetry: the fraction of nonzeros matched by equal values in symmetric locations.
*** Structural symmetry: the fraction of nonzeros matched by nonzeros in symmetric locations.
Conversion of Benchmarks
The benchmark matrices are stored in .mat file format.
They are converted to .txt using MATLAB and read into SPARSE through file handling.
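A sketch of the file-handling side, assuming each text line holds one "row col value" triplet (the exact layout exported from MATLAB, and the file name, are assumptions for illustration):

    /* Minimal reader for a matrix exported as text, one triplet per line. */
    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("matrix.txt", "r");   /* hypothetical file name */
        if (fp == NULL) { perror("matrix.txt"); return 1; }

        int row, col, nnz = 0;
        double value;
        while (fscanf(fp, "%d %d %lf", &row, &col, &value) == 3) {
            /* here each element would be inserted into the solver's structure */
            nnz++;
        }
        fclose(fp);
        printf("read %d nonzeros\n", nnz);
        return 0;
    }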
Result of CPU Time Analysis
Sr | Matrix Name   | Matrix Order | spOrderAndFactor (s) | SearchForPivot (s) | ExchangeRowAndCol (s) | RealRowColElim (s)
1  | YZhou         | 1020   | 0.22    | 0.06   | 0.05    | 0.05
2  | Fpga_dcop_01  | 1220   | 0.03    | 0.01   | 0.01    | 0.00
3  | Adder_dcop    | 1813   | 0.13    | 0.04   | 0.08    | 0.02
4  | Grund_meg4    | 5860   | 0.22    | 0.04   | 0.14    | 0.04
5  | Hamrle2       | 5952   | 4.03    | 3.64   | 0.27    | 0.12
6  | Bmhof3        | 12127  | 6.62    | 1.97   | 3.47    | 1.17
7  | Rajat27       | 20640  | 345.97  | 98.19  | 227.73  | 20.00
8  | Rajat26       | 51032  | 257.25  | 110.44 | 126.65  | 20.09
9  | Rajat28       | 87190  | 2677.86 | 500.96 | 1753.88 | 422.56

Analyzing The Results
Three main functions are encountered.
Analyzing these functions further, we conclude that SearchForPivot() unnecessarily goes through four sub-functions.
We force it to follow only one routine (SearchEntireMatrix) and recalculate the timing.
This routine is more generalized in nature.
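To illustrate what a full-matrix pivot search does, here is a conceptual sketch: dense toy storage, a made-up 4x4 matrix and a simplified Markowitz rule, not the Sparse 1.3 SearchEntireMatrix source. The routine scans every remaining candidate and keeps the acceptable one with the smallest Markowitz product, which is why it is general but slow.

    /* Conceptual full-matrix pivot search with a simplified Markowitz rule. */
    #include <math.h>
    #include <stdio.h>

    #define N 4
    #define THRESHOLD 0.001   /* minimum acceptable pivot magnitude (assumed) */

    int main(void)
    {
        double A[N][N] = { { 0.0, 2.0, 0.0, 1.0 },
                           { 3.0, 0.0, 0.0, 0.0 },
                           { 0.0, 4.0, 5.0, 0.0 },
                           { 1.0, 0.0, 0.0, 6.0 } };
        int best_row = -1, best_col = -1, best_markowitz = N * N;

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                if (fabs(A[i][j]) < THRESHOLD)
                    continue;                       /* too small to pivot on */
                int r = 0, c = 0;                   /* count other nonzeros  */
                for (int k = 0; k < N; k++) {
                    if (k != j && A[i][k] != 0.0) r++;
                    if (k != i && A[k][j] != 0.0) c++;
                }
                int markowitz = r * c;              /* predicted fill-in cost */
                if (markowitz < best_markowitz) {
                    best_markowitz = markowitz;
                    best_row = i;
                    best_col = j;
                }
            }

        printf("pivot at (%d,%d), Markowitz product %d\n",
               best_row, best_col, best_markowitz);
        return 0;
    }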
Modified Results
Sr | Matrix Name   | Matrix Order | spOrderAndFactor (s) | SearchEntireMatrix (s) | ExchangeRowAndCol (s) | RealRowColElim (s)
1  | YZhou         | 1020   | 0.28    | 0.12    | 0.03   | 0.01
2  | Fpga_dcop_01  | 1220   | 0.18    | 0.17    | 0.01   | 0.00
3  | Adder_dcop01  | 1813   | 0.47    | 0.39    | 0.08   | 0.00
4  | Grund_meg4    | 5860   | 2.69    | 2.38    | 0.27   | 0.03
5  | Hamrle2       | 5952   | 3.21    | 3.09    | 0.10   | 0.02
6  | Bmhof3        | 12127  | 14.07   | 13.38   | 0.67   | 0.02
7  | Rajat27       | 20640  | 36.04   | 35.00   | 0.92   | 0.10
8  | Rajat26       | 51032  | 208.33  | 203.20  | 4.64   | 0.43
9  | Rajat28       | 87190  | 1393.92 | 1013.12 | 299.01 | 81.56

Concluding
Though the pivot search now takes longer,
the ExchangeRowAndCol and RealRowColElimination times have reduced drastically.
Reducing the overall factorization time is therefore a possible solution.
Accelerating the pivot search (SearchEntireMatrix) would give further performance improvements.
Accelerating Compute-Intensive Algorithms
Over the last decade, performance gains have relied on CPU architecture improvements or clock-frequency increases.
These are now hitting a speed wall.
Hardware platforms supporting spatial parallel execution are on the rise.
The algorithm must be polished well before acceleration.
General Purpose Processors
Gains came from computer-architecture innovations and clock-frequency increases.
Traditional GPPs have now hit a speed wall.
High Performance Computing

Further speed-up can be harnessed


at hardware level
Firstly, exploit parallelism at software
level
FPGA provide the ideal platform
High Performance Reconfigurable Computing
Off-loads compute-intensive tasks.
Hard or soft core processors can be used to reconfigure the programmable logic.
Spatial implementation.
Reconfiguration at run time.
Parallelism at different granularities.
Easy-to-use CAD tools.
Spatial Implementation
Back to SPICE...
Model evaluation of the devices can be done in parallel.
As the circuit scales, Matrix-Solve dominates the simulation run time.
This is nothing but the LU factor and solve phase.
Inside Sparse 1.3, SearchEntireMatrix can be targeted.
Revisiting SPARSE 1.3
Search Entire Matrix
HLS
High-Level Synthesis software is used.
It transforms a C specification into an RTL design.
The following C features are not supported:
1. Dynamic memory allocation
2. File I/O
3. Pointer-to-pointer referencing
4. Recursive function calls
Several code modifications are therefore required.
Pointer-to-Pointer Problem
As it turns out, there are a number of restrictions when synthesizing C code onto hardware:
1. Every port of the resulting RTL should point to a static memory location.
2. Every loop trip count should be known in advance.
3. Dynamic memory allocation is not supported.
4. Together, these require a substantial modification of the source code.
Modifications..
Static data structures.
New allocation and build routines.
Works with the reduced, re-ordered matrix.
The pointer-to-pointer problem is solved.
The code can now be used with HLS.
Modified code (a sketch follows below).
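A minimal sketch of the kind of change involved (names and the MAX_NNZ bound are illustrative assumptions, not the actual modified SPARSE source): the dynamically allocated, pointer-linked element storage is replaced by fixed-size arrays indexed by position, so every HLS port maps to a static memory.

    /* Illustrative static, array-based nonzero store that an HLS tool can
     * map to on-chip memory. */
    #define MAX_NNZ 4096              /* upper bound fixed at compile time */

    static double nz_value[MAX_NNZ];  /* numerical values of the nonzeros  */
    static int    nz_row[MAX_NNZ];    /* row index of each nonzero         */
    static int    nz_col[MAX_NNZ];    /* column index of each nonzero      */
    static int    nz_count;           /* how many entries are in use       */

    /* Build routine: appends one nonzero; replaces the malloc-based element
     * allocation of the original code. Returns 0 on success, -1 when full. */
    int add_element(int row, int col, double value)
    {
        if (nz_count >= MAX_NNZ)
            return -1;
        nz_row[nz_count]   = row;
        nz_col[nz_count]   = col;
        nz_value[nz_count] = value;
        nz_count++;
        return 0;
    }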
Changes in the Source Code
The portion of the code off-loaded to hardware should be fed only with statically allocated memory, or with a pointer to a static location.
The pivot returned by the software and hardware functions should match.
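A minimal self-checking sketch of that rule (both functions below are toy stand-ins, not SPARSE routines; in HLS C simulation the hardware candidate is still called as an ordinary C function and compared against the software reference):

    /* Illustrative testbench: compare the HLS candidate against a reference. */
    #include <math.h>
    #include <stdio.h>

    #define N 6

    /* Software reference: index of the largest-magnitude entry. */
    static int search_pivot_sw(const double v[N])
    {
        int best = 0;
        for (int i = 1; i < N; i++)
            if (fabs(v[i]) > fabs(v[best]))
                best = i;
        return best;
    }

    /* HLS candidate: same behaviour, written without library calls so it is
     * straightforward to synthesize. */
    static int search_pivot_hw(const double v[N])
    {
        int best = 0;
        double best_mag = v[0] < 0.0 ? -v[0] : v[0];
        for (int i = 1; i < N; i++) {
            double mag = v[i] < 0.0 ? -v[i] : v[i];
            if (mag > best_mag) {
                best_mag = mag;
                best = i;
            }
        }
        return best;
    }

    int main(void)
    {
        double test[N] = { 0.5, -3.0, 2.5, -0.1, 2.9, 1.0 };
        int sw = search_pivot_sw(test);
        int hw = search_pivot_hw(test);
        if (sw == hw)
            printf("pivots match (index %d)\n", sw);
        else
            printf("MISMATCH: sw=%d hw=%d\n", sw, hw);
        return 0;
    }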
Result Analysis of HLS
Performance Estimate:

Utilization Estimate:
Static Timing Analysis
A method of computing the expected timing of a digital circuit without requiring simulation of the full circuit.
Floor Plan
Idea of SPICE on ZYNQ
CONCLUSION
We presented an empirical analysis of the SPICE runtime and of the type of matrices that typically arise in circuit simulations. We studied the total SPICE execution time and demonstrated how the runtime scales as the circuit size increases.
We presented a thorough analysis of the SPARSE 1.3 LU solver package, profiling every routine to choose the most suitable candidate for hardware acceleration. The internal matrix data structure of SPARSE 1.3 was also discussed. We further presented an empirical analysis of the runtime of the various routines within SPARSE 1.3 that are responsible for LU factorization.
We gave a detailed analysis of the hardware design created and the major modifications made to the existing SPARSE source code. We also defined custom routines and data types inside SPARSE 1.3 which support the proper execution of our hardware design. Finally, we presented our implemented design using HLS tools and reported its STA. The device implementation was also reported in the form of a floor plan as generated by Vivado.
THANK YOU
