High Speed and Resource Efficient Systolic Architecture For Matrix Multiplication Using FPGA

GRD Journals- Global Research and Development Journal for Engineering | Volume 1 | Issue 5 | April 2016
ISSN: 2455-5703
Anitha
PG Scholar
Department of Electronics and Communication Engineering
KIT Tiptur, India
Mr. Pradeep Kumar S K

Assistant Professor
Department of Electronics and Communication Engineering
KIT Tiptur, India
Abstract
Matrix multiplication is the kernel operation used in many image and signal processing applications. This work presents an efficient design for matrix multiplication using a systolic architecture. The design increases computing speed by combining parallel processing and pipelining into a single concept. The chosen platform is an FPGA (Field Programmable Gate Array) device since, in systolic computing, FPGAs can be used as dedicated computers to perform certain computations at high frequencies. The paper presents a systolic architecture for the matrix multiplication algorithm on an FPGA. The approach uses four processing elements, which reduces area and improves computation time.
Keywords- FPGA, matrix multiplication, Systolic architecture, processing element (PE)
I. INTRODUCTION
Matrix multiplication has high computational complexity, and its design and efficient implementation on an FPGA, where resources are very limited, is especially demanding. Ever-growing computational applications require more processing power than before, and parallel computers have been used to achieve higher computing speed. To meet this demand for higher processing speed, the target architecture used here is a systolic architecture. A systolic architecture is easy to implement because of its regularity and reconfigurability. This work presents the design and implementation of matrix multiplication with a systolic architecture.
Matrix multiplication can be performed by software running on fast processors or by dedicated hardware. Software-based matrix multiplication is slow and can become a bottleneck in overall system operation. Consequently, to achieve matrix multiplication with a large speed-up in computation time and with flexibility, a Field Programmable Gate Array (FPGA) based design is used. FPGAs have recently become an attractive platform for hardware realization of computation-intensive applications, since they allow time-efficient, resource-conscious design and easy reconfigurability compared with full-custom VLSI designs.
This work presents a matrix multiplication algorithm that is independent of matrix size, and an architecture with efficient use of area at a given clock frequency. It also proposes the realization of the algorithm on an FPGA, which reduces the area in CLBs and improves computation speed.
II. LITERATURE SURVEY

Earlier studies focused on implementing matrix multiplication algorithms on different FPGA platforms. A. Amira and F. Bansali [1] proposed two designs for matrix multiplication: the first is a systolic architecture and the second uses distributed arithmetic. J. Jang, S. Choi and V. Prasanna [2] present new algorithms and architectures for matrix multiplication on configurable hardware. These designs significantly reduce latency as well as area, and they improve the area/speed metric, where speed denotes the maximum achievable running frequency. Scott J. Campbell and Sunil Khatri [3] propose several techniques to reduce the algorithmic complexity of the matrix multiplication operation. Processors execute matrix multiplication in O(n^3) running time; to reduce this complexity, many parallel methods have been developed. Recent advances in Field Programmable Gate Array (FPGA) technology have opened new possibilities for implementing more efficient parallel matrix multiplication algorithms. A. Amira, A. Bouridane and P. Milligan [4] present an FPGA-based hardware realization of matrix multiplication based on a parallel architecture. The proposed parallel architecture uses advanced design techniques and exploits architectural features of FPGAs. J. Jang and S. Choi [5] propose algorithmic techniques to improve energy performance, rather than low-level (gate-level) optimizations. A. Amira and F. Bansali [6] present a novel architecture for efficient implementation of matrix products using an FPGA-based parameterisable system; it proposes a systolic architecture for matrix multiplication using the Baugh-Wooley algorithm. D. N. Sonawane, M. S. Sutaone and Inayat Malek [7] propose a systolic array for integer point matrix multiplication on an FPGA. Their approach uses four processing elements, which minimizes area, reduces routing complexity and improves the area/speed metric. Bravo [8] proposed four matrix multiplication algorithms and their realization on an FPGA, which reduced the area in CLBs and improved computation speed.
All rights reserved by www.grdjournals.com
(GRDJE/ Volume 1 / Issue 5 / 015)
III. METHODOLOGY
Many algorithms that rely on matrix multiplication require high-performance dense linear algebra codes, so a time-efficient and resource-efficient implementation of matrix multiplication becomes a key challenge. To reduce area, improve computation speed and minimize resources, a systolic design is adopted for matrix multiplication. Computation speed, area and I/O bandwidth all depend on the MAC units, the PEs and the sequential control logic. For an FPGA implementation, memory usage, clock frequency and latency therefore become the evaluation criteria.
A. Basic Principle of Systolic Architecture
Replacing a single PE with an array of PEs raises the number of operations performed per second. A systolic system consists of an array of PEs (Processing Elements); the processors are called cells. In general the operation is the same in every cell: each cell performs one operation, or a small number of operations, on a data item and then passes it to its neighbour, as shown in Fig. 1.
Fig. 1: Basic principle of Systolic Array
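To make the cell-to-neighbour data flow concrete, the following Python sketch simulates a linear chain of identical cells, clock tick by clock tick, computing a matrix-vector product. It is only an illustration of the general systolic principle, not the architecture proposed in this paper; the function name `systolic_matvec` and the choice of a matrix-vector product are assumptions made for the example.

```python
def systolic_matvec(A, x):
    """Compute y = A * x on a chain of len(x) cells.

    Cell j holds x[j]. A partial sum for one row of A enters cell 0,
    and on every tick each cell adds its own product and hands the
    partial sum to its right-hand neighbour.
    """
    n_cells = len(x)
    regs = [None] * n_cells  # output register of each cell: (row index, partial sum)
    results = []
    # Feed one row per tick, then empty slots to flush the pipeline.
    feed = list(range(len(A))) + [None] * n_cells
    for row in feed:
        # The value held by the last cell leaves the array on this tick.
        if regs[-1] is not None:
            results.append(regs[-1][1])
        # Update right-to-left so every cell reads its neighbour's old value.
        for j in range(n_cells - 1, 0, -1):
            left = regs[j - 1]
            if left is None:
                regs[j] = None
            else:
                i, partial = left
                regs[j] = (i, partial + A[i][j] * x[j])
        regs[0] = None if row is None else (row, A[row][0] * x[0])
    return results
```

Every cell performs the same multiply-add, and once the pipeline is full one result leaves the chain on each tick.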
B. Proposed Algorithm
The matrix multiplication algorithm proposed in this section uses 4 PEs arranged systolically. The unit element, i.e. the PE, consists of a multiplier block, an adder, one register and a sequential control logic block, which controls the data feed to the multiplier, adder and register, as shown in Fig. 2.
Fig. 2: Single Processing Element
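As a rough behavioural illustration of the PE in Fig. 2, the Python sketch below models the multiplier, adder and register as a multiply-accumulate unit. The class name `ProcessingElement`, its methods and the helper `dot_product` are hypothetical, and the sequential control logic is reduced to explicit `clear` and `step` calls; this is an assumption for illustration, not the hardware description used in this work.

```python
class ProcessingElement:
    """Behavioural model of one PE: multiplier, adder and one register."""

    def __init__(self):
        self.register = 0  # accumulator register

    def clear(self):
        """Control logic resets the register before a new output element."""
        self.register = 0

    def step(self, a, b):
        """One clock step: multiply the inputs and accumulate the product."""
        product = a * b                           # multiplier block
        self.register = self.register + product  # adder output stored in the register
        return self.register


def dot_product(pe, row, column):
    """Feed a row of A and a column of B through one PE, element by element."""
    pe.clear()
    for a, b in zip(row, column):
        pe.step(a, b)
    return pe.register
```

Feeding one row of matrix A and one column of matrix B through `dot_product` yields a single element of matrix C.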
The algorithm arranges the data flow for matrix A and matrix B in such a way that one input is a row of matrix A and the second input is a column of matrix B. Fig. 3 shows the proposed systolic array of four PEs used in the realization of the matrix multiplication algorithm.
Fig. 3: Proposed Systolic Architecture of PEs
1) Pseudo Code for Matrix Multiplication

procedure MatrixMultiplication(A, B)
    input: A, B, n*n matrices
    output: C, n*n matrix
begin
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C[i,j] = 0;
        end for
    end for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i,j] = C[i,j] + A[i,k] * B[k,j];
            end for
        end for
    end for
end MatrixMultiplication
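For readers who want to run the pseudo code, a direct Python transcription might look as follows. It is a plain software reference for checking results, assuming square matrices stored as lists of lists; the function name `matrix_multiplication` is chosen here for illustration.

```python
def matrix_multiplication(A, B):
    """Reference n*n matrix product, following the pseudo code step by step."""
    n = len(A)
    # Initialise the output matrix C to zero.
    C = [[0] * n for _ in range(n)]
    # Accumulate A[i][k] * B[k][j] over k for every output element.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] = C[i][j] + A[i][k] * B[k][j]
    return C
```

The three nested loops make the O(n^3) running time mentioned in the literature survey directly visible.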
C. Wallace Multiplier
To improve the computation speed of matrix multiplication, this work uses a Wallace multiplier. A Wallace tree multiplier works in three steps. In the first step, the bit product terms are formed by multiplying the bits of the multiplicand and the multiplier. In the second step, the bit product matrix is reduced to fewer rows using half and full adders; this continues until only the final addition remains. In the third step, the final addition is done using adders to obtain the result.
Fig. 4: Block diagram of multiplier architecture
Fig. 5: 8x8 Wallace multiplier
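The three steps of the Wallace multiplier can be sketched in software as below. This is a bit-level Python model under stated assumptions: operands are unsigned and fit in `width` bits, and the reduction is done column by column, which is one common way of organising the adder tree and not necessarily the exact wiring of Fig. 5. The names `full_adder`, `half_adder` and `wallace_multiply` are hypothetical.

```python
def full_adder(x, y, z):
    """Add three bits; return (sum bit, carry bit)."""
    return x ^ y ^ z, (x & y) | (y & z) | (x & z)


def half_adder(x, y):
    """Add two bits; return (sum bit, carry bit)."""
    return x ^ y, x & y


def wallace_multiply(a, b, width=8):
    """Multiply two unsigned `width`-bit integers by adder-tree reduction."""
    n_cols = 2 * width
    # Step 1: form the bit products; a_i AND b_j has weight 2**(i + j).
    columns = [[] for _ in range(n_cols)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append((a >> i) & (b >> j) & 1)

    # Step 2: reduce until every column holds at most two bits.
    while any(len(col) > 2 for col in columns):
        reduced = [[] for _ in range(n_cols)]
        for w, col in enumerate(columns):
            k = 0
            while len(col) - k >= 3:
                s, c = full_adder(col[k], col[k + 1], col[k + 2])
                reduced[w].append(s)
                # The product fits in 2*width bits, so a carry out of
                # the top column is always 0 and can be dropped.
                if w + 1 < n_cols:
                    reduced[w + 1].append(c)
                k += 3
            rest = col[k:]
            if len(rest) == 2 and k > 0:
                s, c = half_adder(rest[0], rest[1])
                reduced[w].append(s)
                if w + 1 < n_cols:
                    reduced[w + 1].append(c)
            else:
                reduced[w].extend(rest)  # pass leftover bits through
        columns = reduced

    # Step 3: final addition of the two remaining rows.
    row0 = sum(col[0] << w for w, col in enumerate(columns) if len(col) > 0)
    row1 = sum(col[1] << w for w, col in enumerate(columns) if len(col) > 1)
    return row0 + row1
```

In hardware the speed benefit comes from step 2, where all adders of one reduction stage operate in parallel; the software loop only mirrors the data flow.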
Consider an example of matrix multiplication in which matrices A and B have dimension 6x6. PE1 operates on the first row of matrix A and the first column of matrix B, which produces an element of matrix C; similarly PE2, PE3 and PE4 operate on the second, third and fourth rows of matrix A and the second, third and fourth columns of matrix B, as shown in Fig. 6.
Fig. 6: of 6x6 dimension matrix
In the first step, the algorithm takes the first 4 rows of matrix A and the first 4 columns of matrix B; this gives a 4x4 block of matrix C, as shown in Fig. 7.
Fig. 7: 1st operation of 6x6 matrix multiplication
In the second step, the algorithm takes the first 4 rows of matrix A and the last 2 columns of matrix B; this gives the last 2 columns of matrix C for those rows, as shown in Fig. 8.
Fig. 8: 2nd operation of 6x6 matrix multiplication
Similarly, in the third step the algorithm takes the last 2 rows of matrix A and the first 4 columns of matrix B; this gives the first 4 columns of the last 2 rows of matrix C. In the last step the algorithm takes the last 2 rows of matrix A and the last 2 columns of matrix B; this gives the remaining elements of matrix C.
In this algorithm only 4 processing elements are used throughout the operation to reduce the computation steps of the multiplication. Each processing element consists of one multiplier, one adder and one accumulator to perform the computation of matrix multiplication.
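The four steps described for the 6x6 case can be reproduced with a short Python sketch that walks over strips of at most four rows of A and four columns of B. The function name `blocked_multiply`, the `num_pes` parameter and the mapping of one PE to one row of the current strip are assumptions made for illustration; the hardware schedules the PEs with its own control logic.

```python
def blocked_multiply(A, B, num_pes=4):
    """Multiply n*n matrices in steps of at most num_pes rows and columns.

    Returns the product and the list of (rows, columns) handled per step.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    steps = []
    for r0 in range(0, n, num_pes):
        rows = list(range(r0, min(r0 + num_pes, n)))
        for c0 in range(0, n, num_pes):
            cols = list(range(c0, min(c0 + num_pes, n)))
            steps.append((rows, cols))
            # Assumed mapping: each row of the strip is served by one PE,
            # which accumulates a row-by-column dot product.
            for i in rows:
                for j in cols:
                    acc = 0
                    for k in range(n):
                        acc += A[i][k] * B[k][j]
                    C[i][j] = acc
    return C, steps
```

For a 6x6 input the returned `steps` list contains four entries, in the same order as the four steps of the text: rows 0-3 with columns 0-3, rows 0-3 with columns 4-5, rows 4-5 with columns 0-3, and rows 4-5 with columns 4-5.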
D. Tradeoff for Choice of PE's
For a square matrix of even order, we get N/4 blocks of size Nx4. This holds for matrices of size 4x4 and larger. A matrix of odd order also yields N/4 blocks of size Nx4, and to obtain the remaining elements of the output matrix the PEs must be reassigned to the remaining columns of matrix B. The PEs are used effectively when their number is limited to four. If more than four PEs are chosen, PE utilization decreases sharply as the matrix size increases. Fig. 9 shows utilization versus underutilization of PEs for an 8x8 matrix; this result also holds for 7x7 matrices. The graph indicates that 4 PEs are optimal for matrix sizes of 4x4 and larger, giving a tradeoff between execution time and resource usage. With 8 PEs, all PEs are effectively used for an 8x8 matrix; however, for matrices larger than 8x8, underutilization of the 8 PEs increases.
Fig. 9: Number of PEs and their underutilization.
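One simple way to reason about underutilization is sketched below. The model is an assumption introduced here and is not taken from Fig. 9: it supposes that the N rows (or columns) are served in passes of P at a time, so that in the last pass some PEs may have nothing to do. The function name `pe_utilization` is hypothetical, and the figure's actual values may follow a different accounting.

```python
import math


def pe_utilization(n, num_pes):
    """Fraction of PE slots doing useful work under the assumed pass model."""
    passes = math.ceil(n / num_pes)   # passes needed to cover n rows
    return n / (passes * num_pes)     # useful slots / available slots
```

Under this model, 4 PEs and 8 PEs are both fully used for an 8x8 matrix, while for a 9x9 matrix the utilization is 0.75 with 4 PEs and 0.5625 with 8 PEs, which agrees in direction with the observation that underutilization grows for 8 PEs once the matrix exceeds 8x8.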
The computation time T1 for an N x N matrix can be calculated using the formula
$$T_1 = \frac{N^3}{P} \cdot L \tag{1}$$
where L is the number of clock cycles required to match the input clock frequency and is constant for a given clock frequency, N is the size of the matrix, and P is the number of PEs performing the computation in parallel.
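A small Python helper shows how Eq. (1) scales with the matrix size and the number of PEs. It reads the formula as T1 = (N^3 / P) * L, which is an interpretation of the printed equation, and the value of L used in the usage note is a hypothetical placeholder, since the text does not state it.

```python
def computation_time(n, p, l):
    """Evaluate Eq. (1), read as T1 = (N**3 / P) * L.

    n: matrix size N, p: number of PEs P, l: clock-cycle factor L.
    The result is in the same unit as l.
    """
    return (n ** 3 / p) * l
```

With P = 4 and a placeholder L = 1, a 4x4 matrix gives 16 and a 6x6 matrix gives 54, so the time grows with the cube of N while the four PEs divide the work by four.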
E. Limitations of the Algorithm
For non-square matrices, zero padding is required before multiplication to make the matrices square, as the algorithm works on square matrices only. In our algorithm, 4 PEs are used effectively irrespective of matrix size, which reduces area compared with other configurations.
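Zero padding of this kind can be sketched in a few lines of Python. The sketch assumes that the matrices in question are rectangular and must be extended to square form before being fed to the array; the function name `zero_pad_to_square` is hypothetical.

```python
def zero_pad_to_square(M):
    """Return a square copy of M, padded with zeros on the right and bottom."""
    rows = len(M)
    cols = len(M[0]) if rows else 0
    n = max(rows, cols)
    # Extend every existing row to n entries, then append all-zero rows.
    padded = [list(row) + [0] * (n - cols) for row in M]
    padded += [[0] * n for _ in range(n - rows)]
    return padded
```

Because the added rows and columns contain only zeros, they contribute nothing to the products, and the valid part of the result is the top-left block of the padded output.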
IV. EXPERIMENTAL RESULTS

Table I shows the experimental results of the proposed algorithm for matrix multiplication, obtained with a Xilinx Spartan XC3S500E target board and the Xilinx design tool suite ISE 14.5, and Table II compares them with a previously reported approach.
A. Simulation Results
This section discusses the results obtained in this work. First, the simulation results of the modules in the Xilinx ISE 14.5 simulator are shown.
1) 4x4 Matrix Multiplication Simulation Results
Fig. 10: Simulation results of 4x4 matrix multiplication
2) 6x6 Matrix Multiplication Simulation Results
Fig. 11: Simulation results of 6x6 matrix multiplication
Table I: Results of Proposed Algorithm

The experimental results show that the resources used by the algorithm are independent of matrix size at a given frequency.

| Device   | Matrix size | CLBs | No. of PEs | Clock (MHz) | Computation time (s) |
|----------|-------------|------|------------|-------------|----------------------|
| XC3S500E | 4x4         | 93   | 4          | 50          | 0.140                |
| XC3S500E | 6x6         | 190  | 4          | 50          | 0.550                |
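Table II: Performance Comparison Results

| Design       | Matrix size | CLBs | Clock (MHz) | Area/Speed Ratio |
|--------------|-------------|------|-------------|------------------|
| Proposed     | 6x6         | 190  | 50          | 3.8              |
| D.N.Sonawane | 6x6         | 277  | 50          | 5.54             |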
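The Area/Speed values in Table II are consistent with dividing the CLB count by the clock frequency in MHz. That reading is inferred from the numbers and is not stated explicitly in the text; under this assumption the ratio can be checked as follows, with the hypothetical helper `area_speed_ratio`.

```python
def area_speed_ratio(clbs, clock_mhz):
    """Area/speed metric, assumed here to be CLBs divided by clock in MHz."""
    return clbs / clock_mhz


proposed = area_speed_ratio(190, 50)    # proposed design, 6x6
reference = area_speed_ratio(277, 50)   # D.N.Sonawane design, 6x6
```

A lower ratio means less area for the same clock frequency, so at 50 MHz the comparison reduces to the difference in CLB count.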
V. CONCLUSION
In this paper we presented a resource-efficient implementation of the matrix multiplication algorithm that reduces area and improves speed. The proposed systolic architecture is generic and can be implemented for any square matrices using four PEs, for matrix sizes of 4x4 and larger. We also claim that four PEs give the most efficient tradeoff between area and execution time. The execution time can be further improved by running the algorithm at a higher frequency.
REFERENCES
[1] J. Jang, S. Choi and V. Prasanna, "Area and Time Efficient Implementation of Matrix Multiplication on FPGAs," in Proceedings, IEEE International Conference on Field Programmable Technology, pp. 93-100, Dec. 2002.
[2] Scott J. Campbell and Sunil Khatri, "Resource and Delay Efficient Matrix Multiplication using Newer FPGA Devices," GLSVLSI 2006.
[3] A. Amira, A. Bouridane and P. Milligan, "Accelerating Matrix Product on Reconfigurable Hardware for Signal Processing," Field-Programmable Logic and Applications (FPL), pp. 101-111, 2001.
[4] J. Jang and S. Choi, "Energy and Time Efficient Matrix Multiplication on FPGAs," IEEE Transactions on VLSI Systems, vol. 13, pp. 1305-1319, Nov. 2007.
[5] A. Amira and F. Bansali, "An FPGA Based Parameterisable System for Matrix Product Implementation," in IEEE Workshop on Signal Processing Systems, pp. 75-79, Oct. 2002.
[6] D. N. Sonawane, M. S. Sutaone and Inayat Malek, "Systolic Architecture for Integer Point Matrix Multiplication using FPGA," in IEEE Conference on Industrial Electronics and Applications, pp. 3822-3825, May 2009.
[8] O. Mencer, M. Morf and M. Flynn, "PAM-Blox: High Performance FPGA Design for Adaptive Computing," IEEE Symposium on FPGAs for Custom Computing Machines, pp. 167-174, 1998.
[9] Ling Zhuo and Victor K. Prasanna, "Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs," Los Angeles, USA, Feb. 16, 2004.
[11] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, Wiley, 1999.
[12] Xilinx, Spartan XC3S500E FPGA user's guide. www.xilinx.com
