ISSN: 2455-5703
Abstract
Matrix multiplication is a basic operation used in many image and signal processing applications. This work presents an
effective design for matrix multiplication using a systolic architecture. The architecture increases computing speed by
combining parallel processing and pipelining in a single concept. The chosen platform is an FPGA (Field Programmable Gate
Array) device because, in systolic computing, FPGAs can be used as dedicated computers to perform certain
computations at high frequencies. The paper presents a systolic architecture for the matrix multiplication algorithm using an FPGA.
The approach uses four processing elements, which reduces area and improves computation time.
Keywords- FPGA, matrix multiplication, Systolic architecture, processing element (PE)
I. INTRODUCTION
Matrix multiplication has high computational complexity, and its design and efficient implementation on an FPGA, where resources
are very limited, is especially demanding. Ever-growing computational applications require more processing power than
before, and parallel computers have been used to achieve higher computing speed. To meet this demand for higher
computing speed, the target architecture used here is a systolic architecture. A systolic architecture is
easy to implement because of its regularity and reconfigurability. This work presents the design and implementation
of matrix multiplication with a systolic architecture.
Matrix multiplication can be carried out in software running on fast processors or on dedicated
hardware. Software-based matrix multiplication is slow and becomes a bottleneck in overall system operation.
Therefore, to achieve matrix multiplication with a large speed-up in computation time and with flexibility, a Field Programmable Gate
Array (FPGA) based design is used. Recently, FPGAs have become an attractive platform for hardware realization of
computation-intensive applications. Computation algorithms can be evaluated on FPGAs because they allow time-efficient,
resource-efficient designs with easy reconfigurability compared with full-custom VLSI designs.
This work presents the development of a matrix multiplication algorithm that is independent of matrix size and architecture,
with efficient use of area at a specific clock frequency. It also proposes a matrix multiplication algorithm and its
realization on an FPGA, which reduces the number of CLBs used and improves computation speed.
High Speed and Resource Efficient Systolic Architecture for Matrix Multiplication using FPGA
(GRDJE/ Volume 1 / Issue 5 / 015)
components that minimize area, reduce routing complexity, and improve the Area/Speed metric. Bravo [8] proposed
four matrix multiplication algorithms and their realization on FPGA, which reduced the number of CLBs and improved the speed of
computation.
III. METHODOLOGY
Many algorithms that rely on matrix multiplication require high-performance dense linear algebra codes, so a time-efficient and
resource-efficient implementation of matrix multiplication becomes a key challenge. To reduce area, improve computation
speed, and minimize resources, a systolic architecture is adopted for matrix multiplication. The computation
speed, area, and I/O bandwidth of the design all depend on the MAC units, the PEs, and the sequential control
logic. Memory usage, clock frequency, and latency of the FPGA implementation therefore become the evaluation criteria.
A. Basic Principle of Systolic Architecture
Replacing a single PE with an array of PEs increases the achievable throughput in millions of operations per second. A systolic
system consists of an array of processing elements (PEs), also called cells. In general, every cell performs the same operation:
each cell performs one operation, or a small number of operations, on a data item and then passes it to its neighbour, as shown in fig. 1.
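The cell-to-neighbour data flow described above can be illustrated with a small software model. This is a minimal Python sketch, not the hardware design of the paper: the function name systolic_pipeline, the per-cell operation op, and the use of None for an empty register are all assumptions made for illustration.

```python
def systolic_pipeline(inputs, num_cells, op):
    """Push a stream of values through a linear array of identical cells.

    Hypothetical model: every cell applies the same operation `op` to the
    value it receives and hands the result to its right-hand neighbour on
    the next clock tick. None marks an empty register.
    """
    regs = [None] * num_cells                      # output register of each cell
    outputs = []
    stream = list(inputs) + [None] * num_cells     # extra ticks drain the array
    for value in stream:
        # The value held by the last cell leaves the array on this tick.
        if regs[-1] is not None:
            outputs.append(regs[-1])
        # Update from right to left so each cell reads its neighbour's old value.
        for i in range(num_cells - 1, 0, -1):
            regs[i] = op(regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = op(value) if value is not None else None
    return outputs


if __name__ == "__main__":
    # Two cells, each adding 1: every input leaves the array increased by 2.
    print(systolic_pipeline([1, 2, 3], 2, lambda x: x + 1))
```

Once the array is full, one result leaves it on every tick, which is the pipelining benefit the section refers to.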
B. Proposed Algorithm
The matrix multiplication algorithm proposed in this section uses four PEs arranged systolically. The basic
unit, the PE, consists of a multiplier block, an adder, one register, and a sequential control logic block that controls the
feeding of data to the multiplier, adder, and register, as shown in figure 2.
The algorithm arranges the data flow of matrix A and matrix B so that
one input is a row of matrix A and the other input is a column of matrix B. Figure 3 shows the proposed systolic array of four
PEs used to realize the matrix multiplication algorithm.
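The PE described in this section (multiplier, adder, one register) behaves like a multiply-accumulate unit. The following is a minimal Python sketch of that behaviour under stated assumptions: the class name ProcessingElement, its methods, and the helper dot_with_pe are hypothetical, and the sequential control logic is reduced to a simple loop that feeds one operand pair per cycle.

```python
class ProcessingElement:
    """Hypothetical software model of one PE: multiplier, adder, one register."""

    def __init__(self):
        self.acc = 0          # the single accumulator register

    def reset(self):
        self.acc = 0

    def step(self, a, b):
        # One clock cycle: multiply the two inputs and add to the register.
        self.acc += a * b
        return self.acc


def dot_with_pe(row, col):
    """Feed one row of A and one column of B through a PE, one pair per cycle."""
    if len(row) != len(col):
        raise ValueError("row and column must have the same length")
    pe = ProcessingElement()
    for a, b in zip(row, col):
        pe.step(a, b)
    return pe.acc             # one element of the result matrix C


if __name__ == "__main__":
    print(dot_with_pe([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6
```

With a row of A on one input and a column of B on the other, the register holds one element of C after N cycles.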
Consider an example of matrix multiplication in which matrices A and B both have dimension 6x6. PE1 operates on the
first row of matrix A and the first column of matrix B, producing an element of matrix C; similarly, PE2, PE3, and
PE4 operate on the second, third, and fourth rows of matrix A and the second, third, and fourth columns of matrix B, as shown
in figure 6.
In the first step, the algorithm takes the first four rows of matrix A and the first four columns of matrix B, which gives
a 4x4 block of matrix C, as shown in fig. 7.
In the second step, the algorithm takes the first four rows of matrix A and the last two columns of matrix B, which gives the last
two columns of matrix C for those rows, as shown in fig. 8.
Likewise, in the third step the algorithm takes the last two rows of matrix A and the first four columns of matrix B, which gives
the last two rows of matrix C for those columns. In the last step, the algorithm takes the last two rows of matrix A and the last
two columns of matrix B, which gives the remaining elements of matrix C.
This algorithm uses only four processing elements throughout the operation to reduce the number of computation steps in the
multiplication. Each processing element consists of one multiplier, one adder, and one accumulator to perform the
computation of matrix multiplication.
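The four-step schedule for a 6x6 product can be sketched in software as follows. This is a minimal Python sketch, not the authors' hardware design: the function name blocked_matmul, the num_pes parameter, and the assumption that each PE accumulates the entries of one row of the current block are illustrative choices.

```python
def blocked_matmul(a, b, num_pes=4):
    """Multiply square matrices a and b block by block.

    Hypothetical model: at most num_pes rows of A and num_pes columns of B
    are processed per step; each PE is assumed to handle one row of the block.
    Returns the product and the (rows, cols) shape of each step's block.
    """
    n = len(a)
    if len(b) != n or any(len(row) != n for row in a) or any(len(row) != n for row in b):
        raise ValueError("both matrices must be square and of the same size")

    c = [[0] * n for _ in range(n)]
    steps = []
    for r0 in range(0, n, num_pes):            # block of rows taken from A
        for c0 in range(0, n, num_pes):        # block of columns taken from B
            rows = range(r0, min(r0 + num_pes, n))
            cols = range(c0, min(c0 + num_pes, n))
            steps.append((len(rows), len(cols)))
            for i in rows:                     # one (assumed) PE per row
                for j in cols:
                    acc = 0                    # accumulator register of the PE
                    for k in range(n):
                        acc += a[i][k] * b[k][j]   # multiply-accumulate
                    c[i][j] = acc
    return c, steps


if __name__ == "__main__":
    n = 6
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i * j + 1 for j in range(n)] for i in range(n)]
    product, steps = blocked_matmul(a, b)
    print(steps)
```

For the 6x6 case the recorded block shapes are (4, 4), (4, 2), (2, 4), and (2, 2), in the same order as the four steps described in the text.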
D. Tradeoff for Choice of PEs
For a square matrix of even size N, we get N/4 blocks of size Nx4. This holds for matrices of size 4x4
and larger. An odd matrix size also results in N/4 blocks of size Nx4, and to obtain the
remaining elements of the output matrix the PEs must be rearranged according to the remaining columns of matrix B. The PEs are
used effectively when their number is limited to four. If more than four PEs are chosen,
PE utilization decreases sharply as the matrix size increases. Figure 9 shows
utilization versus underutilization of PEs for an 8x8 matrix. This result also holds for 7x7 matrices. The graph
indicates that four PEs are optimal for matrix sizes 4x4 and larger, balancing execution time against resource usage.
With eight PEs, all PEs are used effectively for an 8x8 matrix; however, for matrices larger than 8x8, underutilization of the
eight PEs increases.
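One way to make the utilization argument concrete is to count how many PE slots do useful work when N rows are processed in passes of P rows each. The metric below is an assumption introduced for illustration, not the measure plotted in Figure 9; the function name pe_utilization is hypothetical.

```python
import math


def pe_utilization(n, num_pes):
    """Fraction of PE slots doing useful work (assumed metric).

    Assumption: N rows are handled in ceil(N / P) passes of P PEs each,
    so the final pass may leave some PEs idle.
    """
    if n <= 0 or num_pes <= 0:
        raise ValueError("matrix size and PE count must be positive")
    passes = math.ceil(n / num_pes)
    return n / (passes * num_pes)


if __name__ == "__main__":
    for n in (4, 6, 7, 8, 9):
        print(n, pe_utilization(n, 4), pe_utilization(n, 8))
```

Under this assumed metric, four PEs and eight PEs are both fully used for an 8x8 matrix, while a 9x9 matrix leaves the eight-PE array much less busy than the four-PE array.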
The computation time T1 for an N x N matrix can be calculated using the formula
$T_1 = \frac{N^3}{P} \times L$
(1)
where L is the number of clock cycles required to match the input clock frequency (it is constant for a
specific clock frequency), N is the size of the matrix, and P is the number of PEs performing the computation in parallel.
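As a worked example of equation (1), assuming it is read as N^3 divided by P and then multiplied by L, the two matrix sizes used in this paper with P = 4 PEs give the following multiples of L. The value of L itself is not stated in the text, so it is left symbolic.

```latex
\begin{align*}
N = 4,\ P = 4:\quad T_1 &= \frac{4^3}{4} \times L = 16\,L \\
N = 6,\ P = 4:\quad T_1 &= \frac{6^3}{4} \times L = 54\,L
\end{align*}
```

Under this reading, going from a 4x4 to a 6x6 matrix multiplies the computation time by 54/16, about 3.4, for a fixed clock frequency.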
E. Limitations of the Algorithm
For asymmetric (non-square) matrices, zero padding is required before multiplication to make the
matrix symmetric (square), because the algorithm works on symmetric (square) matrices only. In our algorithm, four PEs are used
effectively regardless of matrix size, and the area is reduced compared with other designs.
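The zero-padding step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the helper names pad_to_square and matmul are hypothetical, and padding is placed at the bottom and right so that the original product appears in the top-left corner of the result.

```python
def pad_to_square(matrix, size=None):
    """Return a square copy of `matrix`, zero-padded at the bottom and right."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    n = max(rows, cols) if size is None else size
    if n < max(rows, cols):
        raise ValueError("size is smaller than the matrix")
    padded = [[0] * n for _ in range(n)]
    for i in range(rows):
        for j in range(cols):
            padded[i][j] = matrix[i][j]
    return padded


def matmul(a, b):
    """Plain square matrix product, used only to check the padding."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]          # 2x3
    b = [[7, 8], [9, 10], [11, 12]]     # 3x2
    c = matmul(pad_to_square(a, 3), pad_to_square(b, 3))
    print(c)   # the 2x2 product sits in the top-left corner
```

The padded rows and columns contribute only zeros, so they cost extra cycles without changing the useful part of the result.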
Device     Matrix size   CLBs   No. of PEs   Clock (MHz)   Computation time ( s )
XC3S500E   4x4           93     4            50            0.140
XC3S500E   6x6           190    4            50            0.550

Design         Matrix size   CLBs   Clock (MHz)   Area/Speed Ratio
Proposed       6x6           190    50            3.8
D.N.Sonawane   6x6           277    50            5.54
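The Area/Speed figures in the comparison are consistent with dividing the CLB count by the clock frequency in MHz (190/50 = 3.8 and 277/50 = 5.54). The short Python check below works under that assumed definition; the function name area_speed_ratio is hypothetical.

```python
def area_speed_ratio(clbs, clock_mhz):
    """CLB count divided by clock frequency in MHz (assumed definition)."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return clbs / clock_mhz


if __name__ == "__main__":
    proposed = area_speed_ratio(190, 50)     # proposed design, 6x6
    reference = area_speed_ratio(277, 50)    # D.N.Sonawane design, 6x6
    print(round(proposed, 2), round(reference, 2))
```

A lower ratio means fewer CLBs are needed at the same clock frequency.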
V. CONCLUSION
In this paper we presented a resource-efficient implementation of a matrix multiplication algorithm that reduces area and improves
speed. The proposed systolic architecture is generic and can be implemented for any symmetric (square) matrix using four PEs, for matrix sizes
4x4 and above. We also claim that four PEs offer the most efficient trade-off between
area and execution time. The execution time can be improved further by running the algorithm at a higher clock frequency.
REFERENCES
[1] J. Jang, S. Choi, and V. Prasanna, "Area and Time Efficient Implementation of Matrix Multiplication on FPGAs," in Proceedings, IEEE International
Conference on Field Programmable Technology, pp. 93-100, Dec. 2002.
[2] Scott J. Campbell and Sunil Khatri, "Resource and Delay Efficient Matrix Multiplication using Newer FPGA Devices," GLSVLSI, 2006.
[3] A. Amira, A. Bouridane, and P. Milligan, "Accelerating Matrix Product on Reconfigurable Hardware for Signal Processing," Field-Programmable Logic and
Applications (FPL), pp. 101-111, 2001.
[4] J. Jang and S. Choi, "Energy and Time Efficient Matrix Multiplication on FPGAs," IEEE Transactions on VLSI Systems, vol. 13, pp. 1305-1319, Nov. 2007.
[5] A. Amira and F. Bansali, "An FPGA Based Parameterisable System for Matrix Product Implementation," in IEEE Workshop on Signal Processing Systems, pp.
75-79, Oct. 2002.
[6] D. N. Sonawane, M. S. Sutaone, and Inayat Malek, "Systolic Architecture for Integer Point Matrix Multiplication using FPGA," in IEEE Conference on
Industrial Electronics and Applications, pp. 3822-3825, May 2009.
[7] J. Jang and S. Choi, "Energy and Time Efficient Matrix Multiplication on FPGAs," IEEE Transactions on VLSI Systems, vol. 13, pp. 1305-1319, Nov. 2007.
[8] O. Mencer, M. Morf, and M. Flynn, "PAM-Blox: High Performance FPGA Design for Adaptive Computing," IEEE Symposium on FPGAs for Custom
Computing Machines, pp. 167-174, 1998.
[9] Ling Zhuo and Victor K. Prasanna, "Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs," Los Angeles, USA, Feb. 16,
2004.
[10] J. Jang and S. Choi, "Energy and Time Efficient Matrix Multiplication on FPGAs," IEEE Transactions on VLSI Systems, vol. 13, pp. 1305-1319, Nov. 2007.
[11] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, Wiley, 1999.
[12] Xilinx, Spartan XC3S500E FPGA User's Guide, www.xilinx.com