
Simulation Manual

for
Configurable MapReduce Accelerator
(work in progress)

Gheorghe M. Stefan

The emergence of the hybrid computation domain is an incipient process. Roughly speaking, it is about a system
containing two parts: a standard computing engine, used as host and to run the complex part of the code1, and an accelerator,
for running the intense part of the code2. While for the host there are a few established solutions (off-the-shelf mono- or
multi-core processors), only a few solutions compete for the position of the accelerator. Some of them have a considerable
advance: various GPUs (such as Nvidia's, AMD's ATI), MICs3 (such as Intel's Xeon Phi, Adapteva's Epiphany), or FPGA-
implemented circuits. GPU solutions are limited because the architecture is biased by the graphics functionality legacy,
while MIC processors are limited because of their ad hoc structured organization. The FPGA solutions look the most
promising because of their flexibility. The flexibility is used to provide well-fitted solutions and, at the same time, it helps in
the prototyping process when the final target is an ASIC-implemented hybrid system.
The only drawback in using FPGAs is the requirement of circuit design abilities for defining and implementing the
circuit used as accelerator. A good compromise is to use a predefined framework for the FPGA design as a configurable
programmable parallel system. In the following, a configurable MapReduce programmable structure [11] is considered
as a generic accelerator engine.
In the second section the structure of the simulated system is described. The assembly language is described in the third
section. The fourth section contains examples. The fifth section develops a library of functions. The last section is reserved
for upgrades expected as outcomes of the evaluation process.
1 A code is said to be complex if its size is in the same range as its execution time.
2 A code is said to be intense if its size is much smaller than its execution time.
3 Many Integrated Core

1
Contents
1 Functional Electronics 4

I SIMULATOR 5
2 The General Description of Configurable MapReduce Accelerator 5

3 The Assembly Language 10


3.1 Host-Accelerator Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 External memory & its initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 DMA subset of assembly instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 MapReduce Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.1 Input-output instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Load instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.3 Store instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.4 Address register load instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.5 Two-operand n-bit integer instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.6 Floating point instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.7 Shift instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.8 Send controllers operand as co-operand for array . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.9 Sequential control instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.10 Spatial control instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.11 Global shift instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.12 Global search/insert/delete instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.13 Serial register instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 How to Use the Assembler 28


4.1 How to Use the Host-ACCELERATOR Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 How to Program the ACCELERATOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Data transfer programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2 Simple Vector & Reduction Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

II LIBRARY OF FUNCTIONS 47
5 The list of the reserved storage resources 47
5.1 The list of the reserved storage resources in the scalar memory . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 The list of the reserved storage resources in the vector memory . . . . . . . . . . . . . . . . . . . . . . . . 48

6 Transfer Functions 49
6.1 Two-dimension Array Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1.1 Load N full horizontal vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1.2 Store N full horizontal vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.1.3 Load M m-component vertical vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.1.4 Store M m-component vertical vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2 Two-dimension Arrays Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

7 Dense Linear Algebra 53


7.1 Matrix-Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.2 Matrix Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.3 Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

2
8 Sparse Linear Algebra 62
8.1 Sparse matrix representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.1.1 Band matrices representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.1.2 Sparse matrices with randomly distributed non-zero elements representation . . . . . . . . . . . . . 62
8.2 Band Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.2.1 Band Matrix Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.3 Random Sparse Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.3.1 Sparse Matrix Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.3.2 Sparse Matrix Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.3.3 Sparse Matrices Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

9 Graphs 73
9.1 Minimum Spanning Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
9.2 All-Pairs Shortest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
9.3 Breadth-First Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

III UPGRADES 77
References 78

3
1 Functional Electronics
The evolution of electronics tends naturally toward the emergence of systems where circuits interleave with information
in order to achieve high functional capabilities. The action of Moore's Law provides big-sized circuits, but there is no
Moore's Law for functional complexity. Structures get big, but only if they remain simple, characterized by repetitive
patterns. Complexity comes only if flexible informational structures can be inserted in the big pattern-based physical
structures.
Indeed, it is easy to design, verify, implement and test silicon chips with billions of transistors, but only if the description
of these circuits is kept within reasonable limits. If the structure is big & complex (the description of the circuit is of the
same order of magnitude as the size of the circuit), then it is impossible to provide a verifiable design and a credible
test procedure for it.
In this context, Functional Electronics is the emergent domain of the functionally big & complex systems built by
tightly interleaving pattern-based big circuits with complex information. Thus:

Circuit & Information = Functional Electronics

Because circuits are naturally parallel engines, Functional Electronics is equivalent, in the commercial space, to
Parallel Embedded Systems.

The accelerator described in the following is a typical product of Functional Electronics, with applications in the Parallel
Embedded Systems domain. As a circuit, it is based on an N-order digital system with a scan super (global) loop, a reduction
super (global) loop and a controlled super (global) loop [12]. As a computation engine, it is based on the synergy between
Stephen Kleene's mathematical model of computation and John Backus's Functional Programming Systems [11].
The physical implementation of the accelerator provides, in 28 nm technology, for less than 10 Watts:

2 32-bit TOPS

1 32-bit TFLOPS for highly intense floating-point applications

2 16-bit TFLOPS for highly intense floating-point applications

The degree of parallelism depends on the application. For the linear algebra domain it tends to be more than 90%. For
molecular dynamics it is already proved to be over 75%. In the Artificial Neural Network domain the performance of our
programmable solution is similar to that of ASICs.

4
Part I
SIMULATOR
2 The General Description of Configurable MapReduce Accelerator
The structure of the development system we consider (see Figure 1) consists of:

- Host + External Memory, with functionality specified only at the level of the interaction assembly language used to
  control the accelerator;
- ACCELERATOR.

The Host is supposed to run a program whose intense part is sent to be run by the Accelerator. The entire program is loaded
in the External Memory. The intense part of the program and the associated data are sent to the Accelerator using the interface
subsystem containing:

- DMA: Direct Memory Access controller, which receives commands from the Host, through inFIFO, or from the
  Accelerator's Controller
- inFIFO: used to receive
  - commands from the Host
  - the program from the External Memory, under the Host control
  - data from the External Memory
- outFIFO: used to
  - send back the result of the computation
  - send requests for data from the External Memory

[Figure 1 is a block diagram: the ACCELERATOR (Controller, MapReduce Array, DMA, inFIFO, outFIFO) connected to the Host and to the External Memory; the Host drives the control and reset inputs, commands and data enter through inFIFO, results and requests leave through outFIFO, and the DMA exchanges data with the External Memory.]

Figure 1: The hybrid system: Host & ACCELERATOR.

The computational part of the accelerator (see Figure 2) performs functions dealing with scalars or vectors and consists
of three parts:

CONTROLLER: performing functions defined on scalars with values in scalars; it has a hardware RISC architecture
with its program memory (prog mem), data memory (mem) and execution unit (eng)

MAP section: performing functions defined on vectors with values in vectors; it is a linear array of cells, each with
its own data memory and execution unit, similar to those of the controller

REDUCTION network: performing functions defined on vectors with values in scalars; it is a log-depth circuit.

5
[Figure 2 is a block diagram: the MAP section as a linear array of cells, each with its execution unit (eng) and data memory (mem); the CONTROLLER with its own eng, mem and prog mem, connected to the inFIFO, to the outFIFO and to/from the DMA; the outputs of the cells feed the log-depth REDUCE network.]

Figure 2: The functional organization of the accelerator's core.

The user's image of the system is presented in Figure 3. It consists of the memory resources accessible at the level of
the assembly language. There are three levels of storage in the system we simulate:

External Memory: loaded at the beginning of the simulation with the program and data; at the end of the simulation it
contains the results.

Controller's Memory resources are:

Accumulator Register: is a 32-bit register in the accumulator-based execution unit; it provides one of the
operands and stores the result of the unary and binary operations performed by the execution unit
reg [n-1:0] acc
Carry Bit: is a 1-bit register whose content is updated at each arithmetic operation (shifts are arithmetic
operations)
reg cr
Scalar Memory: is the data memory of the controller; it provides, as a rule, the second operand for binary
operations.
reg [n-1:0] mem[0:(1<<s)-1]
Address Register: is a register used to form the address for the Scalar Memory when the relative addressing mode is
used; its content is added with the immediate value provided by the controller's instruction
reg [s-1:0] addr
Program Memory: contains at each location a pair of instructions, one for the CONTROLLER and another for the
MAP-REDUCE array; it is loaded under the control of the DMA unit
reg [31:0] progMem[0:(1<<p)-1]

Array's Memory resources are:

Boolean Vector: is a vector of p = 2^x one-bit words; if b[i] = 1 cell i is active, i.e., the instruction received from
the controller is executed, else, if b[i] = 0, cell i is inactive
reg boolVect[0:(1<<x)-1]
Accumulator Vector: is a vector of p n-bit words distributed along the p cells of the MAP section; its components
are used as accumulators in the execution units of each cell
reg [n-1:0] accVect[0:(1<<x)-1]
Carry Vector: is a vector of p one-bit words distributed along the p cells of the MAP section; its content is updated
at each arithmetic and shift operation
reg crVect[0:(1<<x)-1]
Address Vector: is a vector distributed along the p cells of the MAP section; it is used for relative addressing of
the data memory of each cell
reg [v-1:0] addrVect[0:(1<<x)-1]

Vector Memory: contains m = 2^v p-component vectors
reg [n-1:0] vectMem[0:(1<<x)-1][0:(1<<v)-1]
as follows:
7
[Figure 3 is a memory map showing the user's view: the ARRAY'S MEMORY (Serial Register, Vector Memory vector[0] ... vector[m-1], Address Vector, Carry Vector, Accumulator Vector, Boolean Vector, Index Vector, all indexed 0 ... 2^x - 1), the CONTROLLER'S MEMORY (Scalar Memory, Program Memory, Address Register, Carry Bit, Accumulator Register) and the External Memory.]

Figure 3: The user's view of the architecture.

vector[0]: reg [n-1:0] vectMem[0:(1<<x)-1][0]
vector[1]: reg [n-1:0] vectMem[0:(1<<x)-1][1]
...
vector[j]: reg [n-1:0] vectMem[0:(1<<x)-1][j]
...
vector[m-1]: reg [n-1:0] vectMem[0:(1<<x)-1][m-1]
Serial Register: is a serial-parallel register distributed along the MAP's cells; each of the p cells contains an n-bit
parallel register serially connected to the previous and to the next cell
reg [n-1:0] serialReg[0:(1<<x)-1]
Index Vector: is a constant vector used to index the p cells of the MAP section
reg [x-1:0] ixVect[0:(1<<x)-1]

There are the following five operation modes in the storage space just described:
1. vector to scalar mode: performed in the REDUCTION section, starting from accVect and providing a value in acc
or back to the MAP section.
Important note: the REDUCTION unit is a log-depth circuit with a latency lambda(p) = 1 + 0.5*log2(p). Therefore, any
scalar generated at the output of the REDUCTION unit is valid with a lambda-cycle delay, i.e., between the instruction
which sets the content of accVect submitted to a reduction operation and the instruction which uses the result of
the reduction operation, lambda other instructions must be inserted; if there is nothing useful to do, no-operation
instructions are welcome (a short sketch is given after this list).
2. scalar-scalar to scalar mode: performed in the CONTROLLER between acc and mem[i], or the immediate value con-
tained in the instruction, or coOperand, with the result in acc; coOperand is the scalar value received, with lambda cycles
latency, through the REDUCTION unit from the MAP section

3. vector-scalar to vector mode: performed in the MAP section between accVect and the immediate value contained in
the instruction or coOperand, with the result in accVect; coOperand is the scalar value received from the CONTROLLER
or, with lambda cycles latency, from the REDUCTION unit

4. vector-vector to vector mode: performed in the MAP section between accVect and vectMem[j]

5. vector to vector mode: performed in the MAP section on accVect

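As a minimal sketch of the first mode, the fragment below (illustrative; the exact number of padding lines depends on x, and Example 4.12 gives a complete, runnable version) shows the gap that must separate the vector instruction whose result enters the REDUCTION unit from the instruction that consumes the reduced scalar:

cNOP;      IXLOAD;  // acc[i] <= i; the sum of the indexes starts propagating through REDUCTION
cNOP;      NOP;     // latency step: repeat this pair, or do useful work, for lambda(p) cycles
...        ...
cCLOAD(0); NOP;     // acc <= reductionAdd, valid only after the lambda(p)-cycle delay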

The compiler generates the binary form of the program in the External Memory. The data supposed to be received as input
for the intense computation is stored in the same memory. There are two operation modes for the Accelerator:

slave mode: the host loads the program from External Memory into the Controller's Program Memory and then controls
all the data transfers between the External Memory and the Array's Memory

autonomous mode: the host loads the program from External Memory into the Controller's Program Memory and then
the program loaded in the Controller's Program Memory controls all the data transfers between the External Memory and
the Array's Memory

9
3 The Assembly Language
Instruction formats:

executed by DMA unit as interface between HOST and ACCELERATOR:


dmaInstr[31:0] = {3'b111, transCode[2:0], value[23:0]}

executed by MapReduce Accelerator (MRA) on its internal data structures (see Figure 3):
mraInstr[31:0] = {controllerInstr[7:0], value[7:0], arrayInstr[7:0], value[7:0]}

10
3.1 Host-Accelerator Interface
The host-accelerator interface allows program load form the external memory and data transfers between external memory
and the internal vector memory.

3.1.1 External memory & its initialization


The external memory is managed, in the real system, by the host engine. For simulation purposes this memory is
positioned in the module host as reg[31:0] extMem[0:1023].
The assembler loads the program starting from the address 0. The external memory is initialized with data, if needed,
by loading the file initialData.txt. The file is defined in hexadecimal and has the form:

@yyy

yyyyyyyy
yyyyyyyy
...
yyyyyyyy

where y is a hexadecimal digit. The file starts with @yyy, which represents the starting address in extMem. This address must
be chosen carefully, in concordance with the size of the program loaded starting from the address 0.
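For instance, an initialData.txt with purely illustrative values, placing four 32-bit data words starting at address 0x040 of extMem (assumed here to be above the area occupied by the program), would be:

@040
00000003
0000000c
00000010
ffffffff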

3.1.2 DMA subset of assembly instructions


DMA instructions are unrequested 32-bit words received through inFIFO:

cPLOAD : program load
cPRUN(labelIndex) : program run from LB(labelIndex)
cLSIZE(size) : vector size in words
cTRUN(transferType) : transfer run (load, store)

coded as follows:

cPLOAD : {8'b1111_1111, 24'b0}
cPRUN(7) : {8'b1111_1110, startAddr[23:0]}
cLSIZE(256) : {3'b001, 29'b1_0000_0000}
cTRUN(1) : {3'b000, 3'b001, 25'b0} // load
cTRUN(2) : {3'b000, 3'b010, 25'b0} // store

For load and store operations the HOST knows the addresses in the External Memory, while the Accelerator knows
the address in the vector memory. The Host takes care of the strided, gathered and scattered aspects of the transfer, if any.

11
3.2 MapReduce Accelerator
The parameters used to configure the ACCELERATOR are the following:

parameter
  n = 32 , // word size
  x = 10 , // index size -> 2^x = 1024 cells
  v = 11 , // vector memory address size -> 2048 1024-component vectors
  s = 9 ,  // scalar memory address size -> 512 32-bit scalars
  p = 8 ,  // program memory address size -> 256 pairs of instructions
  c = 8 ,  // value size in instruction
  a = 5    // size of activation counter -> 32 embedded WHEREs

for an engine characterized by:

32-bit words

1024-cell array

2048-word local memory in each cell, which translates into a Vector Memory of 2048 vectors of 1024 32-bit scalars
each

512-word Data Memory in CONTROLLER

256-instruction Program Memory in CONTROLLER

8-bit immediate value in the instruction, for both the array and the controller.

The assembly language provides a sequence of lines, each containing an instruction for the Controller (with the prefix c)
and another for the Array. For jumps and branches, some of the lines are labeled (LB(n), where n is a positive integer).
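For example, a labeled line which loads the constant 5 into the controller's accumulator and the cell index into every array accumulator would look as follows (the label number is arbitrary):

LB(3)  cVLOAD(5); IXLOAD;  // controller: acc <= 5; array: acc[i] <= i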

12
3.2.1 Input-output instructions
send from mem[cScalar] to the DMA unit the size of the vector to be transferred :
cLSIZE(cScalar): size <= mem[cScalar[s-1:0]][x-1:0] // in DMA unit

send from mem[cScalar] to the DMA unit the address in the external memory where the transfer starts :
cLADDR(cScalar): addr <= mem[cScalar[s-1:0]][28:0] // in DMA unit

send from mem[cScalar] to the DMA unit the size of the burst to be transferred :
cLBURST(cScalar): burst <= mem[cScalar[s-1:0]][x-1:0] // in DMA unit

send from mem[cScalar] to the DMA unit the size of the stride in the external memory :
cLSTRIDE(cScalar): stride <= mem[cScalar[s-1:0]][28:0] // in DMA unit

send to the DMA unit the type of transfer, encoded in cScalar[2:0], and the transfer start command :
cTRUN(cScalar): case(cScalar[2:0])
001: load
010: store
011: strided load
100: strided store
101: gathered load
110: scattered store
endcase

starts the program load; it is executed by the DMA unit :
cPLOAD: load the program until the instruction cPRUN

starts the run of the program from address cScalar :
cPRUN(cScalar): pc <= cScalar
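Putting the subset together, a minimal transfer sequence (with illustrative sizes and addresses; the complete, runnable version is Example 4.5) stages the parameters in the Scalar Memory and then fires the DMA:

cVLOAD(16);  NOP;     // acc <= 16: the vector size (illustrative value)
cSTORE(1);   NOP;     // mem[1] <= size
cVLOAD(64);  NOP;     // acc <= 64: the start address in the external memory
cSTORE(2);   NOP;     // mem[2] <= address
cLSIZE(1);   NOP;     // send mem[1] to the DMA unit as size
cLADDR(2);   NOP;     // send mem[2] to the DMA unit as address
cTRUN(1);    NOP;     // start the load transfer
cNOP;        NOP;     // at least one cycle before waiting (see Example 4.5)
cIOWAIT;     NOP;     // wait for the end of the transfer
cNOP;        IOLOAD;  // copy the received vector from ioReg into accVect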

13
3.2.2 Load instructions
The subset of load instructions is used to load n-bit words, from various storage resources inside the accelerator, into the
accumulator of the controller, acc, or into the accumulators of each cell, acc[i].
no operation :
cNOP: acc <= acc
NOP: acc[i] <= acc[i]

index load :
IXLOAD: acc[i] <= i

immediate load :
cVLOAD(cScalar): acc <= {{(n-c){cScalar[c-1]}}, cScalar}
VLOAD(aScalar): acc[i] <= {{(n-c){aScalar[c-1]}}, aScalar}

absolute load :
cLOAD(cScalar): acc <= mem[cScalar]
LOAD(aScalar): acc[i] <= vectMem[i][aScalar]

relative load :
cRLOAD(cScalar): acc <= mem[addr + cScalar]
RLOAD(aScalar): acc[i] <= vectMem[i][addrVect[i] + aScalar]

relative load & increment :


cRILOAD(cScalar): acc <= mem[addr + cScalar]
addr <= addr + cScalar
RILOAD(aScalar): acc[i] <= vectMem[i][addrVect[i] + aScalar]
addrVect[i] <= addrVect[i] + aScalar

co-operand load :
cCLOAD(0): acc <= reductionAdd
cCLOAD(1): acc <= reductionMin
cCLOAD(2): acc <= reductionMax
cCLOAD(3): acc <= reductionFlag
cCLOAD(4): acc <= serialReg[0]
cCLOAD(5): acc <= serialReg[(1<<x)-1]
CLOAD: acc[i] <= acc
CALOAD: acc[i] <= vectMem[i][acc]
CRLOAD: acc[i] <= vectMem[i][addrVect[i] + acc]

serial register load :


SRLOAD: acc[i] <= serialReg[i]

input-output register load :


IOLOAD: acc[i] <= ioReg[i]

14
3.2.3 Store instructions
The subset of store instructions is used to store n-bit words, into various storage resources inside the accelerator, from the
accumulator of the controller, acc, or from the accumulators of each cell, acc[i].
absolute store :
cSTORE(cScalar): mem[cScalar] <= acc
STORE(aScalar): vectMem[i][aScalar] <= acc[i]

relative store :
cRSTORE(cScalar): mem[addr + cScalar] <= acc
RSTORE(aScalar): vectMem[i][addrVect[i] + aScalar] <= acc[i]

relative store & increment :


cRISTORE(cScalar): mem[addr + cScalar] <= acc
addr <= addr + cScalar
RISTORE(aScalar): vectMem[i][addrVect[i] + aScalar] <= acc[i]
addrVect[i] <= addrVect[i] + aScalar

co-operand store :
CSTORE: vectMem[i][acc] <= acc[i]
cCRSTORE(0): mem[addr + reductionAdd] <= acc (reduction applied to acc[i] (1+x/2) cycles before)
cCRSTORE(1): mem[addr + reductionMin] <= acc (reduction applied to acc[i] (1+x/2) cycles before)
cCRSTORE(2): mem[addr + reductionMax] <= acc (reduction applied to acc[i] (1+x/2) cycles before)
cCRSTORE(3): mem[addr + reductionFlag] <= acc (reduction applied to acc[i] (1+x/2) cycles before)
CRSTORE: vectMem[i][addrVect[i] + acc] <= acc[i]

serial register store :


SRSTORE: serialReg[i] <= acc[i]

input-output register store :


IOSTORE: ioReg[i] <= acc[i]

15
3.2.4 Address register load instructions
These instructions are used to set the value of the address register in the controller, addr, and in each cell of the array,
addrVect[i]. The address register is used to provide differentiation in the address space of each local data memory
distributed in the array at the cell level.
address register takes the value from accumulator :
cADDRLD: addr <= acc
ADDRLD: addrVect[i] <= acc[i]

address registers in array take the value from controllers accumulator :


CADDRLD: addrVect[i] <= acc
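An illustrative use of the address registers (a sketch; the addresses are arbitrary) is to give every cell a different base address in its local memory, here the cell index, so that a subsequent relative load reads the "diagonal" of the Vector Memory:

cNOP;  IXLOAD;   // acc[i] <= i
cNOP;  ADDRLD;   // addrVect[i] <= i
cNOP;  RLOAD(0); // acc[i] <= vectMem[i][i + 0]: cell i reads component i of vector[i]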

16
3.2.5 Two-operand n-bit integer instructions
The pattern for the two-operand instructions is presented using the function ADD (addition). Each of the two-operand in-
structions has the following 12 forms (5 for the Controller and 7 for the Array), according to the way the second operand is selected.
(For the sake of simplicity, in the following, acc[i] stands for accVect[i] and cr[i] stands for crVect[i].)
immediate add :
cVADD(cScalar): {carry, acc} <= acc + {{(n-8){cScalar[7]}}, cScalar}
VADD(aScalar): {carry[i], acc[i]} <= acc[i] + {{(n-8){aScalar[7]}}, aScalar}

absolute add :
cADD(cScalar): {carry, acc} <= acc + mem[cScalar]
ADD(aScalar): {carry[i], acc[i]} <= acc[i] + vectMem[i][aScalar]

relative add :
cRADD(cScalar): {carry, acc} <= acc + mem[addr + cScalar]
RADD(aScalar): {carry[i], acc[i]} <= acc[i] + vectMem[i][addrVect[i] + aScalar]

relative add & increment :


cRIADD(cScalar): {carry, acc} <= acc + mem[addr + cScalar]
addr <= addr + cScalar
RIADD(aScalar): {carry[i], acc[i]} <= acc[i] + vectMem[i][addrVect[i] + aScalar]
addrVect[i] <= addrVect[i] + aScalar

co-operand add :
cCADD(0): {carry, acc} <= acc + reductionAdd(applied to acc[i] (1+x/2) cycles before)
cCADD(1): {carry, acc} <= acc + reductionMin(applied to acc[i] (1+x/2) cycles before)
cCADD(2): {carry, acc} <= acc + reductionMax(applied to acc[i] (1+x/2) cycles before)
cCADD(3): {carry, acc} <= acc + reductionFlag(applied to acc[i] (1+x/2) cycles before)
cCADD(4): {carry, acc} <= acc + serialReg[0]
cCADD(5): {carry, acc} <= acc + serialReg[(1<<x)-1]
CADD: {carry[i], acc[i]} <= acc[i] + acc
CAADD: {carry[i], acc[i]} <= acc[i] + vectMem[i][acc]
CRADD: {carry[i], acc[i]} <= acc[i] + vectMem[i][addrVect[i] + acc]

relative to co-operand add4 :


cCRADD(0): {carry, acc} <= acc + reductionAdd(applied to acc[i] (1+x/2) cycles before)
cCRADD(1): {carry, acc} <= acc + reductionMin(applied to acc[i] (1+x/2) cycles before)
cCRADD(2): {carry, acc} <= acc + reductionMax(applied to acc[i] (1+x/2) cycles before)
cCRADD(3): {carry, acc} <= acc + reductionFlag(applied to acc[i] (1+x/2) cycles before)

For the following mnemonics, the previously described 12 instruction forms are the same:
ADDC - add with carry: {carry, acc} <= acc + op + carry
SUB - subtract: {carry, acc} <= acc - op
RSUB - reverse SUB: {carry, acc} <= op - acc
SUBC - SUB with carry: {carry, acc} <= acc - op - carry
RSUBC - reverse SUBC: {carry, acc} <= op - acc - carry
DIV - division: acc <= acc / op

17
RDIV - reverse DIV: acc <= op / acc
MULT - multiplication: acc <= acc * op
AND - bitwise and: acc <= acc & op
OR - bitwise or: acc <= acc | op
XOR - bitwise xor: acc <= acc ^ op
COMPARE - compare: {carry, acc} <= (acc - op)&(10...0)|{0, acc};

Thus, instead of the suffix ADD in one of the previous 12 instruction descriptions, any of the previous mnemonics can be used;
for example, VADDC instead of VADD. Thus, 12 × 13 instructions are already described.
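For example, the following short sequence (with illustrative addresses and constants) combines the immediate and the absolute forms, issued simultaneously to the controller and to the array:

cVLOAD(10);  IXLOAD;  // acc <= 10;             acc[i] <= i
cADD(5);     VADD(1); // acc <= acc + mem[5];   acc[i] <= acc[i] + 1
cNOP;        ADD(3);  //                        acc[i] <= acc[i] + vectMem[i][3]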

18
3.2.6 Floating point instructions
The floating point set of instructions uses as second operand only the local memory content addressed by the immediate value
from the instruction: mem[cScalar] for the controller and vectMem[i][aScalar] for each array cell. The execution times
for float operations are:
float add: 3 cycles for the following sequence of instructions (exemplified for controller):

cFADD(28); NOP; // fadd(acc, mem[28])


cMADD; NOP;
cAPACK; NOP;

float multiplication: 2 cycles for the following sequence of instructions (exemplified for controller):

cFMULT(28); NOP; // fmult(acc, mem[28])


cMPACK; NOP;

float division: 26 cycles for the following sequence of instructions (exemplified for controller):

cFDIV(28); NOP; // fdiv(acc, mem[28])


cMDIV; NOP; // executed in 24 cycles
cDPACK; NOP;

first step float addition: unpack & align :


cFADD(cScalar): performs fadd(acc, mem[cScalar])
FADD(aScalar): performs fadd(acc[i], vectMem[i][aScalar])

second step float addition: add :


cMADD: one-cycle operation on mantissa
MADD: one-cycle operation on mantissa

third step float addition: pack back :


cAPACK: acc <= fadd(acc, mem[cScalar])
APACK: acc[i] <= fadd(acc[i], vectMem[i][aScalar])

first step float multiplication :


cFMULT(cScalar): performs fmult(acc, mem[cScalar])
FMULT(aScalar): performs fmult(acc[i], vectMem[i][aScalar])

second step float multiplication :


cMPACK: acc <= fmult(acc, mem[cScalar])
MPACK: acc[i] <= fmult(acc[i], vectMem[i][aScalar])

first step float division :


cFDIV(cScalar): performs fdiv(acc, mem[cScalar])
FDIV(aScalar): performs fdiv(acc[i], vectMem[i][aScalar])

19
second step float division :
cMDIV: 24-cycle operation on mantissa
MDIV: 24-cycle operation on mantissa

third step float division :


cDPACK: acc <= fdiv(acc, mem[cScalar])
DPACK: acc[i] <= fdiv(acc[i], vectMem[i][aScalar])
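By analogy with the controller sequences given at the top of this subsection, an element-wise floating-point addition in the array would be issued as the three-step sequence below (a sketch; the vector memory addresses are illustrative):

cNOP;  LOAD(4);  // acc[i] <= vectMem[i][4]
cNOP;  FADD(5);  // step 1: unpack & align acc[i] and vectMem[i][5]
cNOP;  MADD;     // step 2: one-cycle operation on the mantissa
cNOP;  APACK;    // step 3: acc[i] <= fadd(acc[i], vectMem[i][5])
cNOP;  STORE(6); // vectMem[i][6] <= acc[i]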

20
3.2.7 Shift instructions
shift right one bit position :
cSHRIGHT: {cr, acc} <= {acc[0], 1'b0, acc[n-1:1]}
SHRIGHT: {cr[i], acc[i]} <= {acc[i][0], 1'b0, acc[i][n-1:1]}

shift right one bit position with carry :
cSHRIGHTC: {cr, acc} <= {acc[0], cr, acc[n-1:1]}
SHRIGHTC: {cr[i], acc[i]} <= {acc[i][0], cr[i], acc[i][n-1:1]}

shift right arithmetic one bit position :
cSHARIGHT: {cr, acc} <= {acc[0], acc[n-1], acc[n-1:1]}
SHARIGHT: {cr[i], acc[i]} <= {acc[i][0], acc[i][n-1], acc[i][n-1:1]}

insert value in the least significant positions :
cINSVAL: acc <= {acc[(n-c):0], cScalar}
INSVAL: acc[i] <= {acc[i][(n-c):0], aScalar}

21
3.2.8 Send controller's operand as co-operand for the array
These instructions are executed only by the controller. They are used to send the controller's operand as co-operand for the
array. They must be used in conjunction with an array instruction which requests the co-operand.

send as co-operand to array mem[cScalar] :


cSEND(cScalar): opVect[k] = mem[cScalar]

send as co-operand to array mem[addr + cScalar] :


cRSEND(cScalar): opVect[k] = mem[addr + cScalar]

send as co-operand to array mem[addr + cScalar] and update addr :


cRISEND(cScalar): opVect[k] = mem[addr + cScalar]
addr <= addr + cScalar

send as co-operand to array the output of the reduction unit selected with cScalar[1:0] :
cCSEND(0): opVect[k] = reductionAdd
cCSEND(1): opVect[k] = reductionMin
cCSEND(2): opVect[k] = reductionMax
cCSEND(3): opVect[k] = reductionFlag
cCSEND(4): opVect[k] = serialReg[0]
cCSEND(5): opVect[k] = serialReg[(1<<x)-1]

This subset of instructions is used in conjunction with the instructions CXXX, where XXX is one of the two-operand
instructions previously defined. For example:

cSEND(5); CADD; // {cr[i], acc[i]} <= acc[i] + mem[5]


// each accumulator in array is summed with the content of location 5
// from Scalar Memory

22
3.2.9 Sequential control instructions
unconditioned jump to the instruction labeled with LB(cScalar) :
cJMP(cScalar): pc <= pc + valueComputedByAssembler

branch if acc is zero to the instruction labeled with LB(cScalar) :


cBRZ(cScalar): pc <= (acc = 0) ? pc + valueComputedByAssembler : pc + 1

branch if acc is not zero to the instruction labeled with LB(cScalar) :


cBRNZ(cScalar): pc <= (acc = 0) ? pc + 1 : pc + valueComputedByAssembler

branch if acc is zero to the instruction labeled with LB(cScalar) & decrement :
cBRZDEC(cScalar): pc <= (acc = 0) ? pc + valueComputedByAssembler : pc + 1
acc <= acc - 1

branch if acc is not zero to the instruction labeled with LB(cScalar) & decrement :
cBRNZDEC(cScalar): pc <= (acc = 0) ? pc + 1 : pc + valueComputedByAssembler
acc <= acc - 1

branch if acc+1 is zero to the instruction labeled with LB(cScalar) & increment :
cBRZINC(cScalar): pc <= (acc+1 = 0) ? pc + valueComputedByAssembler : pc + 1
acc <= acc + 1

branch if acc+1 is not zero to the instruction labeled with LB(cScalar) & increment :
cBRNZINC(cScalar): pc <= (acc+1 = 0) ? pc + 1 : pc + valueComputedByAssembler
acc <= acc + 1

branch if acc is negative to the instruction labeled with LB(cScalar) :
cBRSGN(cScalar): pc <= (acc[n-1] = 1) ? pc + valueComputedByAssembler : pc + 1

branch if acc is not negative to the instruction labeled with LB(cScalar) :
cBRNSGN(cScalar): pc <= (acc[n-1] = 1) ? pc + 1 : pc + valueComputedByAssembler

skip next instruction if acc = mem[cScalar] :


cSKIPEQ(cScalar): pc <= (acc = mem[cScalar]) ? pc + 2 : pc + 1

skip next instruction if acc != mem[cScalar] :


cSKIPNEQ(cScalar): pc <= (acc = mem[cScalar]) ? pc + 1 : pc + 2

wait for IO system to end the current transfer :


cIOWAIT: pc <= idleState ? pc : pc + 1;

halt :
cHALT: pc <= pc
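As an illustration of the branch subset, the fragment below (a sketch with an arbitrary label and counter value) repeats a body a fixed number of times using cBRNZDEC; each iteration increments every active cell's accumulator:

        cVLOAD(7);    NOP;     // acc <= 7: the loop counter (illustrative value)
LB(1)   cNOP;         VADD(1); // loop body: acc[i] <= acc[i] + 1
        cBRNZDEC(1);  NOP;     // while acc != 0: branch to LB(1) and decrement acc
        cHALT;        NOP;     // done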

23
3.2.10 Spatial control instructions
The instructions from this subset are used to select the active cells.
The cell i is active if boolVect[i] = 1, where: boolVect[i] = (actVect[i] == 0).
activate all cells :
ACTIVATE: actVect[i] <= 0

keep active where zero :


WHEREZERO: actVect[i] <= (boolVect[i]&(condVect[i][0])) ? actVect[i] : actVect[i]+1

keep active where carry :


WHERECARRY: actVect[i] <= (boolVect[i]&(condVect[i][1])) ? actVect[i] : actVect[i]+1

keep active where first :


WHEREFIRST: actVect[i] <= (boolVect[i]&(condVect[i][2])) ? actVect[i] : actVect[i]+1

keep active where next :


WHERENEXT: actVect[i] <= (boolVect[i]&(condVect[i][3])) ? actVect[i] : actVect[i]+1

keep active where not zero :


WHERENZERO: actVect[i] <= (boolVect[i]&!(condVect[i][0])) ? actVect[i] : actVect[i]+1

keep active where not carry :


WHERENCARRY: actVect[i] <= (boolVect[i]&!(condVect[i][1])) ? actVect[i] : actVect[i]+1

keep active where not first :


WHERENFIRST: actVect[i] <= (boolVect[i]&!(condVect[i][2])) ? actVect[i] : actVect[i]+1

keep active where not next :


WHERENNEXT: actVect[i] <= (boolVect[i]&!(condVect[i][3])) ? actVect[i] : actVect[i]+1

activate the cells inactivated by the last where :


ELSEWHERE: actVect[i] <= (actVect[i]==0) ? 1 : ((actVect[i]==1) ? 0 : actVect[i])

restore actVect before the corresponding where :


ENDWHERE: actVect[i] <= (actVect[i] == 0) ? actVect[i] : (actVect[i] - 1)

add to the active cells the cells where accVect[i] == acc :


ACTWHERE: actVect[i] <= (!boolVect[i] && (accVect[i] == acc)) ? 0 : actVect[i]

save the active cells going back to the previous selection pattern :
SAVEACT: actVect[i] <= actVect[i] - 1

restore the activation pattern saved by the last saveact :


RESTACT: actVect[i] <= actVect[i] + 1
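The sketch below (illustrative constants) shows the typical WHERE / ELSEWHERE / ENDWHERE pattern, assuming the zero condition tested by WHEREZERO reflects the last result computed in each cell:

cNOP;  IXLOAD;     // acc[i] <= i
cNOP;  VAND(1);    // acc[i] <= i & 1: 0 in the even cells, 1 in the odd cells
cNOP;  WHEREZERO;  // keep active only the cells where the last result was zero
cNOP;  VLOAD(100); // even cells: acc[i] <= 100
cNOP;  ELSEWHERE;  // switch the selection to the cells deactivated above
cNOP;  VLOAD(50);  // odd cells: acc[i] <= 50
cNOP;  ENDWHERE;   // restore the activation pattern existing before WHEREZERO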

24
3.2.11 Global shift instructions
global rotate with one position :
GROTATE: acc[i] <= acc[(i+1)%(1<<x)]

global right shift with one position :


GRSHIFT: accVect[i] <= (i==0) ? 0 : accVect[i-1]

global left shift with one position :


GLSHIFT: acc[i] <= (i < (1<<x)-1) ? acc[i+1] : 0

25
3.2.12 Global search/insert/delete instructions
search for co-operand in all cells :
SRCALL: boolVect[i] <= (acc[i] == acc) ? 1'b1 : 1'b0

search for value aScalar in all cells :


VSRCALL(aScalar): boolVect[i] <= (acc[i] == aScalar) ? 1'b1 : 1'b0

search for co-operand :


SEARCH: boolVect[i] <= (boolVect[i] & (acc[i] == acc)) ? 1'b1 : 1'b0

search for value aScalar :


VSEARCH(aScalar): boolVect[i] <= (boolVect[i] & (acc[i] == aScalar)) ? 1'b1 : 1'b0

conditioned search for co-operand :


CSEARCH: boolVect[i] <= (i==0) ? 0 : ((acc[i] == acc) & boolVect[i-1]) ? 1 : 0

conditioned search for value aScalar :


VCSEARCH(aScalar): boolVect[i] <= (i==0) ? 0 : ((acc[i]==aScalar)&boolVect[i-1]) ? 1 : 0

insert value aScalar in the first active position :


INSERT(aScalar): acc[i] <= (firstVect[i]) ? aScalar : ((nextVect[i]) ? acc[i-1] : acc[i])

insert co-operand in the first active position :


CINSERT: acc[i] <= (firstVect[i]) ? acc : ((nextVect[i]) ? acc[i-1] : acc[i])

delete the first active accumulator :


DELETE: acc[i] <= (firstVect[i] | nextVect[i]) ?
((i == ((1<<x)-1)) ? 0 : acc[i+1]) : acc[i]

move selections of active cells one position right :


SELSHIFT: boolVect[i] <= (i == 0) ? 1'b0 : boolVect[i-1]

26
3.2.13 Serial register instructions
push right cScalar in serial register :
cVPUSHR(cScalar): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : {{(n-c){cScalar[c-1]}}, cScalar}

push right mem[cScalar] in serial register :


cPUSHR(cScalar): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : mem[cScalar]

push right mem[cScalar[s-1:0] + addr] in serial register :


cRPUSHR(cScalar): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : mem[cScalar + addr]

push right mem[cScalar[s-1:0] + addr] in serial register & update addr :


cRIPUSHR(cScalar): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : mem[cScalar + addr]
addr <= cScalar + addr

push right in the serial register the co-operand selected by cScalar[1:0] :


cCPUSHR(0): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : reductionAdd
cCPUSHR(1): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : reductionMin
cCPUSHR(2): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : reductionMax
cCPUSHR(3): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : reductionFlag

push left cScalar in serial register :


cVPUSHL(cScalar): serialReg[i] <= (i == 0) ? {{(n-c){cScalar[c-1]}}, cScalar} : serialReg[i-1]

push left mem[cScalar] in serial register :


cPUSHL(cScalar): serialReg[i] <= (i == 0) ? mem[cScalar] : serialReg[i-1]

push left mem[cScalar[s-1:0] + addr] in serial register :


cRPUSHL(cScalar): serialReg[i] <= (i == 0) ? mem[cScalar + addr] : serialReg[i-1]

push left mem[cScalar[s-1:0] + addr] in serial register & update addr :


cRIPUSHL(cScalar): serialReg[i] <= (i == 0) ? mem[cScalar + addr] : serialReg[i-1]
addr <= cScalar + addr

push left in the serial register the co-operand selected by cScalar[1:0] :


cCPUSHL(0): serialReg[i] <= (i == 0) ? reductionAdd : serialReg[i-1]
cCPUSHL(1): serialReg[i] <= (i == 0) ? reductionMin : serialReg[i-1]
cCPUSHL(2): serialReg[i] <= (i == 0) ? reductionMax : serialReg[i-1]
cCPUSHL(3): serialReg[i] <= (i == 0) ? reductionFlag : serialReg[i-1]
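A typical use of this subset, sketched below with illustrative constants, is to push a few scalars from the controller into the serial register and then copy it into the accumulator vector with SRLOAD (defined in the load subset):

cVPUSHL(3);  NOP;    // push 3 at position 0; the previous contents shift toward higher indexes
cVPUSHL(7);  NOP;    // now serialReg[0] = 7 and serialReg[1] = 3
cNOP;        SRLOAD; // acc[i] <= serialReg[i]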

27
4 How to Use the Assembler
In order to evaluate the maximum performance of our architecture, the assembly language must be used.
Therefore, the initial stage of evaluation must be done in assembly language. (The next stage, beyond the scope of our
approach, is to provide an efficient compiler from a high-level language to the machine language.) We provide a few simple
examples of using the previously described assembly language. The behavioral description of the generic structure is
simulated on ISE Design Suite 14.2 provided by Xilinx.
For simulation reasons, the engine is kept small. It is defined by the content of the 00_parameters.v file:

parameter
  n = 32 , // word size
  x = 4 ,  // index size -> 16 cells
  v = 6 ,  // vector memory address size -> 64 vectors
  s = 8 ,  // scalar memory address size -> 256 scalars
  p = 8 ,  // program memory address size -> 256 32-bit instructions
  c = 8 ,  // value size in instruction
  a = 5    // size of activation counter

For editorial reasons, the simulator's monitor has the following, compressed form:

initial begin
  $monitor("t=%0d pc=%d a=%0d a[0]=%0d a[1]=%0d a[2]=%0d ... a[6]=%0d a[7]=%0d b=%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d cc=%0d",
    $time/2,
    dut.pc,
    dut.acc,
    dut.accVect[0],
    dut.accVect[1],
    dut.accVect[2],
    dut.accVect[14],
    dut.accVect[15],
    dut.boolVect[0], dut.boolVect[1], dut.boolVect[2], dut.boolVect[3],
    dut.boolVect[4], dut.boolVect[5], dut.boolVect[6], dut.boolVect[7],
    dut.boolVect[8], dut.boolVect[9], dut.boolVect[10], dut.boolVect[11],
    dut.boolVect[12], dut.boolVect[13], dut.boolVect[14], dut.boolVect[15],
    dut.cc
  );
end

4.1 How to Use the Host-ACCELERATOR Interface


Example 4.1 How to load and run a program.

Listing 1: Loading and running the program XXX


/*
PPP

The program PPP is loaded starting from address 0 and starts running from address LB(50)

cXXX: generic instruction executed in Controller
XXX:  generic instruction executed in the MapReduce array
*/
cPLOAD;            // executed by DMA
// follows the program executed by the accelerator
cXXX; XXX;
...   ...
LB(50) cXXX; XXX;  // the starting line of code in the program PPP
...   ...
cXXX; XXX;
// ends the program executed by the accelerator
cPRUN(50);         // executed by DMA
//

Example 4.2 How to re-launch a program already loaded.

Listing 2: Running the program GGG


/*
GGG

The program GGG, already loaded, starts running from address LB(80)
*/
cPRUN(80);
//

Example 4.3 Working in slave mode, the accelerator receives a sequence of unrequested operations.

Listing 3: Slave-mode execution


/*
SSS

The program SSS is executed in slave mode and starts running from address LB(50).

The two operands are loaded from the external memory as 128-scalar vectors, while the
result is stored back in the external memory.

The sequence of unrequested operations is:
*/
cPLOAD;
...            // here goes the program SSS
cPRUN(12);
cLSIZE([128]); // set the vector size; [...] means actual value
cTRUN(1);      // load the first vector; SSS knows where in vector memory
cTRUN(1);      // load the second vector; SSS knows where in vector memory
cTRUN(2);      // when the program ends, send the result back
//

Example 4.4 Example of program with data transfer:

29
Listing 4: Data transfer in the program YYY
/*
YYY

The program YYY starts running from address LB(50).

The two operands are loaded as 128-scalar vectors at the addresses 15 and 16 in the vector
memory from the addresses 64 and 500 located in the external memory, while the
result is stored at the address 200 in the external memory
*/
dPLOAD;
...
cVLOAD(128); NOP;      // set the size of the vector to be transferred
cSTORE(10);  NOP;      // save the size in mem[10]
cVLOAD(64);  NOP;      // set the external address for the transfer
cSTORE(12);  NOP;      // save the external address in mem[12]
cLSIZE(10);  NOP;      // send to DMA the vector size
cLADDR(12);  NOP;      // send to DMA the first address in external memory
cTRUN(1);    NOP;      // the "load the first vector" command is sent to DMA
cVLOAD(500); NOP;      // set the external address for the next transfer
cSTORE(12);  NOP;      // save the external address in mem[12]
cIOWAIT;     NOP;      // wait for the transfer of the first vector
cLADDR(12);  IOLOAD;   // send the external address for the next transfer;
                       // load in accVect the result of the transfer
cTRUN(1);    LOAD(15); // the "load the second vector" command is sent to DMA;
                       // the first vector is loaded at memVect[15]
cVLOAD(200); NOP;      // set the external address for the result vector
cSTORE(12);  NOP;      // save the external address in mem[12]
cIOWAIT;     NOP;      // wait for the transfer of the second vector
cNOP;        IOLOAD;   // load in accVect the result of the second transfer
cNOP;        LOAD(16); // the second vector is loaded at memVect[16]
...                    // compute the result vector in accVect
cLADDR(12);  IOSTORE;  // send the first address in external memory to DMA;
                       // store accVect in ioReg
cTRUN(2);    NOP;      // the "store the result vector" command is sent to DMA
...
dPRUN(50);
//

30
4.2 How to Program the ACCELERATOR
The following examples are presented in order to show how the main features of the MapReduce engine, with the generic
architecture, work. The following classes of operations are exemplified:

1. data transfer operations


2. vector operations

3. reduction operations

31
4.2.1 Data transfer programs
The following data transfer operations are possible in this generic version of the accelerator:
load : vector load, which requests the following parameters:

size: the number of n-bit scalars of vector


address: the starting address in the external memory
store : vector store, which requests the following parameters:

size: the number of n-bit scalars of vector


address: the starting address in the external memory
strided load : vector strided load, which requests the following parameters:

size: the number of n-bit scalars of vector


address: the starting address in the external memory
burst: the number of n-bit scalars transferred in a burst
stride: the stride between the beginnings of each burst
strided store : vector strided store, which requests the following parameters:

size: the number of n-bit scalars of vector


address: the starting address in the external memory
burst: the number of n-bit scalars transferred in a burst
stride: the stride between the beginnings of each burst

gathered load : vector gathered load, which requests the following parameters:
size: the number of n-bit scalars of vector containing n addresses

scattered store : vector scattered store, which requests the following parameter:
size: the number of n-bit scalars of the vector containing n/2 address-data pairs

Once a parameter of a certain type is received by the DMA unit, it is maintained until a new value is sent to the DMA unit
for the same type of parameter. For example, once the vector size is established for the first data transfer, we don't need to
re-send the size for the next transfers if the size remains the same.

32
Example 4.5 Load vector requests two parameters: the size of the vector and the starting address in the external memory,
because the program knows where to load the vector in the vector memory. The parameters of the transfer are loaded
in two locations of the controller's memory in order to be used in more complex programs, where eventually they can be
submitted to some computations. In our example the size of the vector is stored in mem[1] and the initial address in the external
memory is stored in mem[2]. These values are used in the instructions cLSIZE and cLADDR, pointed by the locations in mem
where they are stored: 1 and 2, respectively.
Important note: the instruction cIOWAIT cannot follow the instruction cTRUN in the next cycle. At least one cycle of delay
must be introduced before starting to wait for the end of the transfer. Any kind of processing instruction can be executed meanwhile.
In our example we inserted a cNOP instruction.

/*
TEST PROGRAM FOR: load vector
*/
cPLOAD;      NOP;      // load command
// BEGIN PROGRAM
cSTART;      NOP;      // start cycle counter
cVLOAD(16);  NOP;      // size = 16
cSTORE(1);   NOP;      // save size at mem[1]
cVLOAD(0);   ACTIVATE; // addr = 0; activate all cells
cSTORE(2);   IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);   NOP;      // send addr to DMA
cLSIZE(1);   NOP;      // send size to DMA
cTRUN(1);    NOP;      // start in DMA the load operation
cNOP;        NOP;      // wait to start before waiting to end
cIOWAIT;     NOP;      // wait for the transfer end
cSTOP;       IOLOAD;   // stop cycle counter; load ioReg in accVect
cHALT;       NOP;      // halt the program
// END PROGRAM
cPRUN(0);    NOP;      // run command
/////

The program ends with the transferred vector in accumulator vector.


33
Example 4.6 Store vector requests two parameters: the size of the vector and the starting address in the external memory,
because the program knows from where, in the vector memory, to take the vector. The parameters of the transfer are
loaded in two locations of the controller's memory in order to be used in more complex programs, where eventually they can
be submitted to some computations. In our example the size of the vector is stored in mem[1] and the initial address in the external
memory is stored in mem[2]. These values are used in the instructions cLSIZE and cLADDR, pointed by the locations in mem
where they are stored: 1 and 2, respectively.
Important note: the instruction cIOWAIT cannot follow the instruction cTRUN in the next cycle. At least one cycle of delay
must be introduced before starting to wait for the end of the transfer. Any kind of processing instruction can be executed meanwhile.
In our example we inserted a cNOP instruction.

/*
TEST PROGRAM FOR: store vector

The program:
- loads 10 in mem[1] and 14 in mem[2]
- stores the first mem[1] components of the index vector in the external memory
  starting from the address mem[2]
*/
cPLOAD;      NOP;      // load command
// BEGIN PROGRAM
cSTART;      NOP;      // start cycle counter
cVLOAD(10);  NOP;      // size = 10
cSTORE(1);   NOP;      // save size at mem[1]
cVLOAD(14);  ACTIVATE; // addr = 14; activate all cells
cSTORE(2);   IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);   NOP;      // send addr to DMA
cLSIZE(1);   IOSTORE;  // send size to DMA; load io register with acc
cTRUN(2);    NOP;      // start in DMA the store operation
cNOP;        NOP;      // wait to start before waiting to end
cIOWAIT;     NOP;      // wait for the transfer end
cSTOP;       NOP;      // stop cycle counter
cHALT;       NOP;      // halt the program
// END PROGRAM
cPRUN(0);    NOP;      // run command
/////

The program ends with the stream 0, 1, ... 9 loaded in the external memory starting from the address 14.

34
Example 4.7 Store - load vector is a program which stores the full index vector in the external memory starting from the
address 32, then loads the same vector back into the accumulator vector. Meanwhile, the content of the accumulator vector
is incremented (VADD(1)).

/*
TEST PROGRAM FOR: store - load vector
*/
cPLOAD;      NOP;      // load command
// BEGIN PROGRAM
cSTART;      NOP;      // start cycle counter
cVLOAD(16);  NOP;      // size = 16
cSTORE(1);   NOP;      // save size at mem[1]
cVLOAD(32);  ACTIVATE; // addr = 32; activate all cells
cSTORE(2);   IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);   NOP;      // send addr to DMA
cLSIZE(1);   IOSTORE;  // send size to DMA; load io register with acc
cTRUN(2);    NOP;      // start the send operation in DMA
cNOP;        VADD(1);  // wait to start before waiting to end; accVect + 1
cIOWAIT;     NOP;      // wait for the transfer end
cLADDR(2);   NOP;      // send addr to DMA
cLSIZE(1);   NOP;      // send size to DMA
cTRUN(1);    NOP;      // start the load operation in DMA
cNOP;        NOP;      // wait to start before waiting to end
cIOWAIT;     NOP;      // wait for the transfer end
cSTOP;       IOLOAD;   // stop cycle counter; load ioReg in accVect
cHALT;       NOP;      // halt the program
// END PROGRAM
cPRUN(0);    NOP;      // run command
/////

35
Example 4.8 The strided load vector program has two parts. The first part loads the external memory with two full vectors
starting from the location 16. The first vector is the index vector and the second is the incremented index vector.

/*
TEST PROGRAM FOR: strided load vector
*/
cPLOAD;       NOP;      // load command
// BEGIN PROGRAM
cSTART;       NOP;      // start cycle counter
cVLOAD(16);   NOP;      // size = 16
cSTORE(1);    NOP;      // save size at mem[1]
cVLOAD(16);   ACTIVATE; // addr = 16; activate all cells
cSTORE(2);    IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);    NOP;      // send addr to DMA
cLSIZE(1);    IOSTORE;  // send size to DMA; load io register with acc
cTRUN(2);     NOP;      // start the send operation in DMA
cVLOAD(32);   NOP;      // addr = 32
cSTORE(2);    VADD(1);  // save addr at mem[2]; increment accVect
cIOWAIT;      NOP;      // wait for the transfer end
cLADDR(2);    IOSTORE;  // send addr to DMA; load io register with acc
cTRUN(2);     NOP;      // start the send operation in DMA
cNOP;         NOP;      // wait to start before waiting to end
cIOWAIT;      NOP;      // wait for the transfer end
// the strided part
cVLOAD(16);   NOP;      // size = 16
cSTORE(1);    NOP;      // mem[1] <= size
cVLOAD(18);   NOP;      // addr = 18
cSTORE(2);    NOP;      // mem[2] <= addr
cVLOAD(4);    NOP;      // burst = 4
cSTORE(3);    NOP;      // mem[3] <= burst
cVLOAD(7);    NOP;      // stride = 7
cSTORE(4);    NOP;      // mem[4] <= stride
cLSIZE(1);    NOP;      // size -> DMA
cLADDR(2);    NOP;      // addr -> DMA
cLBURST(3);   NOP;      // burst -> DMA
cLSTRIDE(4);  NOP;      // stride -> DMA
cTRUN(3);     NOP;      // run strided load
cNOP;         NOP;      // wait to start the transfer
cIOWAIT;      NOP;      // wait to end the transfer
cSTOP;        IOLOAD;   // stop cycle counter; load ioReg in accVect
cHALT;        NOP;      // halt the program
// END PROGRAM
cPRUN(0);     NOP;      // run command
/////

The program loads 4 words starting from location 18 in the external memory, then loads another 4 words starting from 18 + 7,
and so on, until 16 words are transferred as a 16-word vector into the accumulator vector accVect of the array.

36
Example 4.9 The strided store vector program prepares the parameters of the transfer in 4 locations of the controller's data memory
starting from 1. Then it loads the accumulator vector, accVect, into the input-output register, ioReg, and runs the transfer.
The program waits for the end of the transfer, stops the cycle counter and halts the accelerator.

/*
TEST PROGRAM FOR: strided store vector
*/
cPLOAD;       NOP;      // load command
// BEGIN PROGRAM
cSTART;       NOP;      // start cycle counter
cVLOAD(12);   NOP;      // size = 12
cSTORE(1);    NOP;      // save size at mem[1]
cVLOAD(14);   ACTIVATE; // addr = 14; activate all cells
cSTORE(2);    IXLOAD;   // save addr at mem[2]; load index in all cells
cVLOAD(2);    NOP;      // burst = 2
cSTORE(3);    NOP;      // save burst at mem[3]
cVLOAD(5);    NOP;      // stride = 5
cSTORE(4);    NOP;      // save stride at mem[4]
cLBURST(3);   NOP;      // send burst to DMA
cLSTRIDE(4);  NOP;      // send stride to DMA
cLADDR(2);    NOP;      // send addr to DMA
cLSIZE(1);    IOSTORE;  // send size to DMA; load io register with acc
cTRUN(4);     NOP;      // start the send operation in DMA
cNOP;         NOP;      // wait to start before waiting to end
cIOWAIT;      NOP;      // wait for the transfer end
cSTOP;        NOP;      // stop cycle counter
cHALT;        NOP;      // halt the program
// END PROGRAM
cPRUN(0);     NOP;      // run command
/////

The program stores in the external memory, starting from the address 14, a burst of 2 words, then does the same from the
address 14+5, and so on, until 12 words from the accumulator vector are written into the external memory.

37
Example 4.10 Scattered store starts with accReg holding a vector containing address-data pairs. The first element
of each pair is used as the address of the external memory location where the second element of the pair is stored. Consequently,
the size of the transfer, the only parameter of this transfer type, must be an even number.

/*
TEST PROGRAM FOR: scattered store
*/
cPLOAD;      NOP;      // load command
// BEGIN PROGRAM
cSTART;      NOP;      // start cycle counter
cVLOAD(10);  ACTIVATE; // size = 10; activate all cells
cSTORE(1);   IXLOAD;   // save size at mem[1]; load index in all cells
cLSIZE(1);   IOSTORE;  // send size to DMA; load io register with acc
cTRUN(6);    NOP;      // start the scattered store operation in DMA
cNOP;        NOP;      // wait to start before waiting to end
cIOWAIT;     NOP;      // wait for the transfer end
cSTOP;       NOP;      // stop cycle counter
cHALT;       NOP;      // halt the program
// END PROGRAM
cPRUN(0);    NOP;      // run command
/////

Because the accumulator register, accReg, is loaded with the index vector, the program stores at the address 0 in the external memory the value 1, at the address 2 the value 3, and so on, until storing at the address 8 the value 9. Since the size of the transfer is 10, only 5 values are transferred into the external memory.
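The effect of the scattered store can be summarized by the following illustrative C sketch (the names are hypothetical):

#include <stddef.h>

/* Illustrative reference model of the scattered store: the source vector
   holds (address, data) pairs, so `size` must be even and size/2 words end
   up in the external memory.                                               */
static void scattered_store(int *extMem, const int *pairs, size_t size)
{
    for (size_t k = 0; k + 1 < size; k += 2)
        extMem[pairs[k]] = pairs[k + 1];   /* pairs[k] = address, pairs[k+1] = data */
}

/* With pairs = {0, 1, 2, ..., 9} (the index vector) and size = 10:
   extMem[0] = 1, extMem[2] = 3, extMem[4] = 5, extMem[6] = 7, extMem[8] = 9. */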

Example 4.11 The gathered load program first prepares the content of the external memory by loading, starting from location 16, the index vector followed by the incremented index vector. Then the accumulator is incremented by 17 and used as the address vector to gather data from the external memory.

/
TEST PROGRAM FOR : g a t h e r e d l o a d
/
/
cPLOAD ; NOP ; / / l o a d command
/ / BEGIN PROGRAM
cSTART ; NOP ; / / s t a r t cycle counter
cVLOAD ( 1 6 ) ; NOP ; / / s i z e = 16
cSTORE ( 1 ) ; NOP ; / / s a v e s i z e a t mem [ 1 ]
cVLOAD ( 1 6 ) ; ACTIVATE ; / / a d d r = 16 ; a c t i v a t e a l l c e l l s
cSTORE ( 2 ) ; IXLOAD ; / / s a v e a d d r a t mem [ 2 ] ; l o a d i n d e x i n a l l c e l l s
cLADDR ( 2 ) ; NOP ; / / s e n d a d d r t o DMA
cLSIZE ( 1 ) ; IOSTORE ; / / s e n d s i z e t o DMA; l o a d i o r e g i s t e r w i t h a c c
cTRUN ( 2 ) ; NOP ; / / s t a r t t h e s e n d o p e r a t i o n i n DMA
cVLOAD ( 3 2 ) ; NOP ; / / a d d r = 32
cSTORE ( 2 ) ; VADD( 1 ) ; / / s a v e a d d r a t mem [ 2 ] ; i n c r e m e n t a c c V e c t
cIOWAIT ; NOP ; / / w a i t t h e t r a n s f e r end
cLADDR ( 2 ) ; IOSTORE ; / / s e n d a d d r t o DMA; l o a d i o r e g i s t e r w i t h a c c
cTRUN ( 2 ) ; VADD( 1 7 ) ; / / s t a r t t h e s e n d o p e r a t i o n i n DMA
cNOP ; NOP ; / / w a i t t o s t a r t b e f o r e w a i t t o end
cIOWAIT ; NOP ; / / w a i t t h e t r a n s f e r end

cVLOAD ( 1 6 ) ; NOP ;
cSTORE ( 1 ) ; NOP ;
cLSIZE ( 1 ) ; IOSTORE ;
cTRUN ( 5 ) ; NOP ;
cNOP ; NOP ;
cIOWAIT ; NOP ;
cSTOP ; IOLOAD ;
cHALT ; NOP ;
/ / h a l t t h e program
/ / END PROGRAM
cPRUN ( 0 ) ; NOP ; / / r u n command
/////

The program gathers into accVect data from the external memory starting from the address 18, because the address vector loaded in ioReg is 18, 19, . . . , 33. The vector coming back in accVect is 2, 3, . . . , 15, 1, 2.
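The gathered load itself reduces to an indexed read, as in the following illustrative C sketch (names are hypothetical):

#include <stddef.h>

/* Illustrative reference model of the gathered load: ioReg holds a vector of
   addresses and each cell receives the external memory word at its address. */
static void gathered_load(const int *extMem, const int *addrVect,
                          int *accVect, size_t size)
{
    for (size_t i = 0; i < size; i++)
        accVect[i] = extMem[addrVect[i]];
}

/* With addrVect = {18, 19, ..., 33}, extMem[16..31] = {0, ..., 15} and
   extMem[32..47] = {1, ..., 16}, the gathered vector is 2, 3, ..., 15, 1, 2. */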

4.2.2 Simple Vector & Reduction Programs
Example 4.12 The program which provides in the controller's accumulator, acc, the sum of the indexes loaded in the accumulators of the cells, accVect[i], is:

/
T e s t p r o g r a m : 03 e x a m p l e s . v ( a p p r o p r i a t e l y commented )
activate all cells
a c c [ i ] <= i , f o r i = 0 , 1 , . . . , 1 5
a c c <= a c c [ 0 ] + a c c [ 1 ] + . . . + a c c [ 1 5 ]
o n l y t h r e e l a t e n c y s t e p s a r e i n s e r t e d b e c a u s e x = 4 ( lambda = 2 + x )
/
//
cSTART ; ACTIVATE ; / / s t a r t c y c l e c o u n t e r ; a c t i v a t e a l l c e l l s
cNOP ; IXLOAD ; / / load the index of each c e l l in accumulator
cNOP ; NOP ; / / latency step 1
cNOP ; NOP ; / / latency step 2
cNOP ; NOP ; / / latency step 3
cNOP ; NOP ; / / latency step 4
cNOP ; NOP ; / / latency step 5
cNOP ; NOP ; / / latency step 6
cCLOAD ( 0 ) ; NOP ; / / a c c <= sum o f i n d e x e s
cSTOP ; NOP ; / / stop cycle counter
cNOP ; NOP ; / / t o show c y c l e c o u n t e r s t o p p e d
cHALT ; NOP ;
/////

Appropriately commented means //* instead of /* before the first line of code.
The assembled code, provided by the simulator, is:

progMem [ 0 ] = 00110111000000000111011100000000
progMem [ 1 ] = 01101111000000000000000000000000
progMem [ 2 ] = 00000000000000000000000000000000
progMem [ 3 ] = 00000000000000000000000000000000
progMem [ 4 ] = 00000000000000000000000000000000
progMem [ 5 ] = 00000000000000000110010000000000
progMem [ 6 ] = 00000000000000000111111100000000
progMem [ 7 ] = 00000000000000000000000000000000
progMem [ 8 ] = 00000000000000000000011100000000

The result of simulation is:

t =0 pc = x a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x b= xxxxxxxxxxxxxxxx c c =x


t =0 pc =255 a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x b= xxxxxxxxxxxxxxxx c c =0
t =1 pc = 0 a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x b= xxxxxxxxxxxxxxxx c c =0
t =2 pc = 1 a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x b =1111111111111111 c c =0
t =3 pc = 2 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =1
t =4 pc = 3 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =2
t =5 pc = 4 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =3
t =6 pc = 5 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =4
t =4 pc = 6 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =5
t =5 pc = 7 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =6
t =6 pc = 8 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =7
t =7 pc = 9 a =120 a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =8
t =8 pc = 10 a =120 a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =9
t =9 pc = 11 a =120 a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b =1111111111111111 c c =9

In the initial cycle (t=0) the system reset clears the cycle counter, cc = 0. In the first cycle (t=1) the program activates all the cells of the array (the Boolean vector is filled up with 1s). The operation takes effect in the next cycle, when b <= 11...1. Then the accumulator in each cell takes the value of its index. During the following cycles the reduction network computes the sum of the indexes. Then, in t=7, the controller's accumulator is loaded with the sum of the indexes, i.e., acc = 0 + 1 + 2 + ... + 15 = 120, because we instantiated for our simulation an array with 16 cells. The cycle counter stops in the next cycle; since the instruction which stops the counter is also counted, the program performed the task in cc - 1 cycles.

Example 4.13 The program which stores at mem[24] the inner product of the index vector with itself is:

/
T e s t p r o g r a m : 03 e x a m p l e s . v ( a p p r o p r i a t e l y commented )
activate all cells
a c c [ i ] <= i f o r 1 = 0 , 1 , . . . , 1 5
memVect [ i ] [ 4 ] <= a c c [ i ] f o r 1 = 0 , 1 , . . . , 1 5
a c c [ i ] <= a c c [ i ] x vectMem [ i ] [ 4 ]
a c c <= a c c [ 0 ] + a c c [ 1 ] + . . . + a c c [ 1 5 ]
mem[ 2 4 ] <= a c c = i n n e r P r o d u c t ( i n d e x , i n d e x )
/
//
cSTART ; ACTIVATE ; / / a c t i v a t e a l l c e l l s
cNOP ; IXLOAD ; / / a c c [ i ] <= i n d e x
cNOP ; STORE ( 4 ) ; / / memVect [ i ] [ 4 ] <= a c c [ i ] , f o r a l l i
cNOP ; MULT ( 4 ) ; / / a c c [ i ] <= a c c [ i ] memVect [ i ] [ 4 ]
cNOP ; NOP ; / / latency step 1
cNOP ; NOP ; / / latency step 2
cNOP ; NOP ; / / latency step 3
cNOP ; NOP ; / / latency step 4
cNOP ; NOP ; / / latency step 5
cNOP ; NOP ; / / latency step 6
cCLOAD ( 0 ) ; NOP ; / / a c c <= r e d u c t i o n A d d ( a c c [ i ] )
cSTORE ( 2 4 ) ; NOP ; / / mem[ 2 4 ] <= a c c
cSTOP ; NOP ; / / stop cycle counter
cHALT ; NOP ;
///
//=============================================================================

The simulation provides the following results (slightly edited to fit in the page):

t =1 pc = 0 a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x b=xx . . . x c c =0


t =2 pc = 1 a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x b=11...1 c c =0
t =3 pc = 2 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b=11...1 c c =1
t =4 pc = 3 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 b=11...1 c c =2
t =5 pc = 4 a=x a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =3
t =6 pc = 5 a=x a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =4
t =7 pc = 6 a=x a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =5
t =8 pc = 7 a=x a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =6
t =9 pc = 5 a=x a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =7
t =10 pc = 6 a=x a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =8
t =11 pc = 7 a=x a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =9
t =12 pc = 8 a =1240 a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =10
t =13 pc = 9 a =1240 a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =11
t =14 pc =10 a =1240 a [0]=0 a [1]=1 a [2]=4 ... a [14]=196 a [15]=225 b=11...1 c c =12

In cycle 9 the controller's accumulator is loaded with the value of the inner product (0·0 + 1·1 + · · · + 15·15 = 1240) and in the next cycle its content is stored in the local scalar memory.

Example 4.14 The program which provides in acc the number of components of the index vector not smaller than 5 and smaller than 15 is:

/
T e s t p r o g r a m : 03 e x a m p l e s . v ( a p p r o p r i a t e l y commented )
activate all cells
a c c [ i } <= i
k e e p a c t i v e c e l l s where ( a c c [ i ] >= 5 )
k e e p a c t i v e c e l l s where ( a c c [ i ] < 1 5 )
a c c [ i ] <= 1 o n l y i n a l l a c t i v e c e l l s
a c c <= a c c [ 0 ] + a c c [ 1 ] + . . . + a c c [ 1 5 ] o n l y f o r t h e a c t i v e c e l l s
/
//
cNOP ; ACTIVATE ; // activate all cells
cNOP ; IXLOAD ; / / a c c [ i ] <= i n d e x
cNOP ; VSUB ( 5 ) ; / / { c r , a c c [ i ] } <= a c c [ i ] 5
cNOP ; WHERENCARRY; / / where c r =1 r e m a i n a c t i v e
cNOP ; VSUB ( 1 0 ) ; / / {{ c r , a c c [ i ] } <= a c c [ i ] ( 1 5 5 )
cNOP ; WHERECARRY; / / where c r =0 r e m a i n a c t i v e
cNOP ; VLOAD ( 1 ) ;
cNOP ; ENDWHERE; / / r e a c t i v a t e where t h e s e c o n d WHERE a c t e d
cNOP ; ENDWHERE; / / r e a c t i v a t e where t h e f i r s t WHERE a c t e d
cNOP ; NOP ; / / latency step 3
cNOP ; NOP ; / / latency step 4
cNOP ; NOP ; / / latency step 5
cNOP ; NOP ; / / latency step 6
cCLOAD ( 0 ) ; NOP ; / / a c c <= number o f a c t i v e c e l l s
cHALT ; NOP ;
///
//=============================================================================

The simulation provides:

t =1 pc = 0 a=x a [0]= x ... a [6]= x a [7]= x b= xxxxxxxxxxxxxxxx


t =2 pc = 1 a=x a [0]= x ... a [6]= x a [7]= x b =1111111111111111
t =3 pc = 2 a=x a [0]=0 ... a [6]=14 a [7]=15 b =1111111111111111
t =4 pc = 3 a=x a [0]=4294967291 ... a [14]=9 a [15]=10 b =1111111111111111
t =5 pc = 4 a=x a [0]=4294967291 ... a [14]=9 a [15]=10 b =0000011111111111
t =6 pc = 5 a=x a [0]=4294967291 ... a [14]=4294967295 a [15]=0 b =0000011111111111
t =7 pc = 6 a=x a [0]=4294967291 ... a [14]=4294967295 a [15]=0 b =0000011111111110
t =8 pc = 7 a=x a [0]=4294967291 ... a [14]=1 a [15]=0 b =0000011111111110
t =9 pc = 8 a=x a [0]=4294967291 ... a [14]=1 a [15]=0 b =0000011111111111
t =10 pc = 9 a=x a [0]=4294967291 ... a [14]=1 a [15]=0 b =1111111111111111
t =11 pc =10 a=x a [0]=4294967291 ... a [14]=1 a [15]=0 b =1111111111111111
t =12 pc =11 a=x a [0]=4294967291 ... a [14]=1 a [15]=0 b =1111111111111111
t =13 pc =12 a=x a [0]=4294967291 ... a [14]=1 a [15]=0 b =1111111111111111
t =14 pc =13 a=x a [0]=4294967291 ... a [14]=1 a [15]=0 b =1111111111111111
t =15 pc =14 a =10 a [0]=4294967291 ... a [14]=1 a [15]=0 b =1111111111111111

Indeed, the index vector contains 10 components not smaller than 5 and smaller than 15.
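The predicated selection performed by the WHERE instructions can be summarized by the following scalar reference model (illustrative C, not simulator code), where a Boolean activity flag plays the role of the selection and the final reduction add counts the active cells:

#include <stddef.h>

/* Illustrative reference model of the predicated count: keep "active" only
   the components with lo <= v[i] < hi, load 1 in the active ones and reduce. */
static int count_in_range(const int *v, size_t p, int lo, int hi)
{
    int count = 0;
    for (size_t i = 0; i < p; i++) {
        int active = (v[i] >= lo) && (v[i] < hi); /* the two nested WHEREs    */
        if (active)
            count += 1;                           /* acc[i] <= 1, then redAdd */
    }
    return count;
}

/* count_in_range(index, 16, 5, 15) returns 10, matching the simulation.      */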

Example 4.15 Load the index in the cells' accumulators and then, l = 9 times, divide each accumulator by 2 (integer division) and increment it by 99. The program is:

/
T e s t p r o g r a m : 03 e x a m p l e s . v ( a p p r o p r i a t e l y commented )
activate all cells
a c c <= 8 ; i n i t i a l i z e t h e l o o p c o u n t e r w i t h l 1
a c c [ i } <= i ; l o a d i n d e x
do ( a c c + 1 ) t i m e s
a c c [ i ] <= a c c [ i ] / 2
a c c [ i ] <= a c c [ i ] + 99
/
//
cNOP ; ACTIVATE ;
cVLOAD ( 8 ) ; IXLOAD ;
LB ( 1 ) ; cNOP ; SHRIGHT ;
cBRNZDEC ( 1 ) ; VADD( 9 9 ) ; / / b r a n c h i f a c c =0 and acc <=acc 1
cHALT ; NOP ;
///
//=============================================================================

The simulation provides:

t =1 pc =0 a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x


t =3 pc =2 a =8 a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15
t =4 pc =3 a =8 a [0]=0 a [1]=0 a [2]=1 ... a [14]=7 a [15]=7
t =5 pc =2 a =7 a [0]=99 a [1]=99 a [2]=100 ... a [14]=106 a [15]=106
t =6 pc =3 a =7 a [0]=49 a [1]=49 a [2]=50 ... a [14]=53 a [15]=53
t =7 pc =2 a =6 a [0]=148 a [1]=148 a [2]=149 ... a [14]=152 a [15]=152
t =8 pc =3 a =6 a [0]=74 a [1]=74 a [2]=74 ... a [14]=76 a [15]=76
t =9 pc =2 a =5 a [0]=173 a [1]=173 a [2]=173 ... a [14]=175 a [15]=175
t =10 pc =3 a =5 a [0]=86 a [1]=86 a [2]=86 ... a [14]=87 a [15]=87
t =11 pc =2 a =4 a [0]=185 a [1]=185 a [2]=185 ... a [14]=186 a [15]=186
t =12 pc =3 a =4 a [0]=92 a [1]=92 a [2]=92 ... a [14]=93 a [15]=93
t =13 pc =2 a =3 a [0]=191 a [1]=191 a [2]=191 ... a [14]=192 a [15]=192
t =14 pc =3 a =3 a [0]=95 a [1]=95 a [2]=95 ... a [14]=96 a [15]=96
t =15 pc =2 a =2 a [0]=194 a [1]=194 a [2]=194 ... a [14]=195 a [15]=195
t =16 pc =3 a =2 a [0]=97 a [1]=97 a [2]=97 ... a [14]=97 a [15]=97
t =17 pc =2 a =1 a [0]=196 a [1]=196 a [2]=196 ... a [14]=196 a [15]=196
t =18 pc =3 a =1 a [0]=98 a [1]=98 a [2]=98 ... a [14]=98 a [15]=98
t =19 pc =2 a =0 a [0]=197 a [1]=197 a [2]=197 ... a [14]=197 a [15]=197
t =20 pc =3 a =0 a [0]=98 a [1]=98 a [2]=98 ... a [14]=98 a [15]=98
t =21 pc =4 a =4294967295 a [0]=197 a [1]=197 a [2]=197 ... a [14]=197 a [15]=197

The initial value of accVect is {0, 1, ..., 14, 15}. After 9 executions of the two-cycle loop it becomes: {197, 197, ..., 197, 197}.
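The loop body is equivalent to the following illustrative C sketch (names are ours):

/* Illustrative reference model of the loop: in every cell, 9 times, an
   integer division by 2 followed by an increment with 99.                  */
static void halve_and_add(unsigned acc[], unsigned p)
{
    for (unsigned k = 0; k < 9; k++)         /* the counter starts at 8 and   */
        for (unsigned i = 0; i < p; i++)     /* BRNZDEC runs the body 9 times */
            acc[i] = acc[i] / 2 + 99;
}

/* Starting from acc = {0, 1, ..., 15}, every component reaches 197, the
   fixed point of x -> x/2 + 99, as in the simulation trace.                 */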

Example 4.16 Add to each component of accVect, initialized with the index, the sum of all the indexes. The program is:

/
T e s t p r o g r a m : 03 e x a m p l e s . v ( a p p r o p r i a t e l y commented )
activate all cells
a c c [ i } <= i ; l o a d i n d e x
a c c <= a c c [ 0 ] + a c c [ 1 ] + . . . a c c [ 1 5 ]
a c c [ i ] <= a c c [ i ] + a c c
/

cSTART ; ACTIVATE ; / / c c E n a b l e <= 1 ; b o o l V e c t <= 1 1 . . . 1


cNOP ; IXLOAD ; / / a c c [ i ] <= i
cNOP ; NOP ; / / latency step 1
cNOP ; NOP ; / / latency step 2
cNOP ; NOP ; / / latency step 3
cNOP ; NOP ; / / latency step 4
cNOP ; NOP ; / / latency step 5
cNOP ; NOP ; / / latency step 6
cCLOAD ( 0 ) ; NOP ; / / a c c <= a c c [ 0 ] + a c c [ 1 ] + . . . a c c [ 1 5 ]
cNOP ; CADD; / / a c c [ i ] <= a c c [ i ] + a c c
cSTOP ; NOP ; / / stop cycle counter
cHALT ; NOP ;
//=============================================================================

The result of simulation is:

t =1 pc =0 a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x c c =0


t =2 pc =1 a=x a [0]= x a [1]= x a [2]= x ... a [14]= x a [15]= x c c =0
t =3 pc =2 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 c c =1
t =4 pc =3 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 c c =2
t =5 pc =4 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 c c =3
t =6 pc =5 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 c c =4
t =7 pc =6 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 c c =5
t =8 pc =7 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 c c =6
t =9 pc =8 a=x a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 c c =7
t =10 pc =9 a =120 a [0]=0 a [1]=1 a [2]=2 ... a [14]=14 a [15]=15 c c =8
t =11 pc =10 a =120 a [0]=120 a [1]=121 a [2]=122 ... a [14]=134 a [15]=135 c c =9
t =12 pc =11 a =120 a [0]=120 a [1]=121 a [2]=122 ... a [14]=134 a [15]=135 c c =10

In 6 cycles a computation consisting of 29 additions is performed. In the general case, for a number of p cells, the execution time is 3 + (1 + 0.5 log p) = 4 + 0.5 log p cycles, where 1 + 0.5 log p is the latency of the reduction net. Therefore, 2p - 1 additions are performed in 4 + 0.5 log p cycles by an engine with p cells. The acceleration belongs to O(p / log p), which is normal for a computation involving communication.

Example 4.17 A program with data transfer:

Listing 5: Data transfer in the program YYY


/
YYY

The p r o g r a m YYY s t a r t s r u n n i n g from a d d r e s s LB ( 5 0 ) .

The two o p e r a n d s a r e l o a d e d a s 128 s c a l a r v e c t o r s a t 15 and 16 i n v e c t o r memory from 64


and 64+128 , w h i l e t h e r e s u l t i s computed i n 16 and s t o r e d a t 64+128+128 i n s c a l a r memory
/
dPLOAD ;
...
cVLOAD ( 1 2 8 ) ; NOP ; / / a c c <= 128
cSTORE ( 1 0 ) ; NOP ; / / mem[ 1 0 ] <= s i z e = 128
cVLOAD ( 1 5 ) ; NOP ; / / a c c <= 15
cSTORE ( 1 1 ) ; NOP ; / / mem[ 1 1 ] <= v e c t o r A d d r e s s = 15
cVLOAD ( 6 4 ) ; NOP ; / / a c c <= 64
cSTORE ( 1 2 ) ; NOP ; / / mem[ 1 2 ] <= s c a l a r A d d r e s s = 64
cLSIZE ( 1 0 ) ; NOP ; / / s e n d t h e v e c t o r s i z e t o DMA u n i t
cLADDR ( 1 1 ) ; NOP ; / / s e n d t h e a d d r e s s i n v e c t o r memory t o DMA u n i t
cLEADDR ( 1 2 ) ; NOP ; / / s e n d t h e f i r s t a d d r e s s i n s c a l a r memory t o DMA u n i t
cTRUN ( 1 ) ; NOP ; / / load the f i r s t vector
cLOAD ( 1 1 ) ; NOP ; / / a c c <= v e c t o r A d d r e s s
cVADD ( 1 ) ; NOP ; / / a c c <= v e c t o r A d d r e s s + 1
cSTORE ( 1 1 ) ; NOP ; / / mem[ 1 1 ] <= a c c = new v e c t o r A d d r e s s
cLOAD ( 1 2 ) ; NOP ; / / a c c <= s c a l a r A d d r e s s
cVADD ( 1 2 8 ) ; NOP ; / / a c c <= s c a l a r A d d r e s s + 128
cSTORE ( 1 2 ) ; NOP ; / / mem[ 1 2 ] <= a c c = new s c a l a r A d d r e s s
cLADDR ( 1 1 ) ; NOP ; / / s e n d t h e a d d r e s s i n v e c t o r memory t o DMA u n i t
cLEADDR ( 1 2 ) ; NOP ; / / s e n d t h e f i r s t a d d r e s s i n s c a l a r memory t o DMA u n i t
cTRUN ( 1 ) ; NOP ; / / load the second v e c t o r
...
cLOAD ( 1 2 ) ; NOP ; / / a c c <= s c a l a r A d d r e s s
cVADD ( 1 2 8 ) ; NOP ; / / a c c <= s c a l a r A d d r e s s + 128
cSTORE ( 1 2 ) ; NOP ; / / mem[ 1 2 ] <= a c c = new s c a l a r A d d r e s s
cLEADDR ( 1 2 ) ; NOP ; / / s e n d t h e f i r s t a d d r e s s i n s c a l a r memory t o DMA u n i t
cTRUN ( 2 ) ; NOP ; / / store the r e s u l t
...
dPRUN ( 5 0 ) ;
//

Part II
LIBRARY OF FUNCTIONS
5 The list of the reserved storage resources
5.1 The list of the reserved storage resources in the scalar memory
mem[16 ] = numberOfLines : number of lines in array, i.e., number of vectors in vectMem, for functions
05 arrayLoad
05 arrayStore
mem[17 ] = numberOfcolumns : number of columns in array, i.e., number of component per vector, for functions
05 arrayLoad
05 arrayStore
mem[18 ] = vectorAddress : the address of the first line (vector) in vectMem, for functions
05 arrayLoad
05 arrayStore
mem[19 ] = scalarAddress : the address of the first location in extMem where starts the stream of vectors to be trans-
ferred, for functions
05 arrayLoad
05 arrayStore
mem[20 ] = burstSize :
mem[21 ] = strideSize :
mem[22 ] = size : the edge size of the matrix submitted to one of the following functions
05 matrixTranspose
05 matrixVectorMultiply
05 matrixMatrixMultiply
mem[23 ] = destMatrix : the address in vectMem of the first line of the result matrix, for functions
05 matrixTranspose
05 matrixMatrixMultiply
mem[24 ] = destVect : the address of the result vector for 05 matrixVectorMultiply
mem[25 ] = firstMatrix : the address pointing to the matrix used as operand for
05 matrixTranspose
05 matrixVectorMultiply , for which it points to the last line
05 matrixMatrixMultiply , as multiplicand
mem[26 ] = secondMatrix : the address of the first line in the matrix used as second operand for
05 matrixMatrixMultiply
mem[27 ] = operandVector : the address of the vector used as operand for 05 matrixVectorMultiply
mem[28 ] =
mem[29 ] =
mem[30 ] =
mem[31 ] =

5.2 The list of the reserved storage resources in the vector memory
vectMem[16 ]:
vectMem[17 ]:

vectMem[18 ]:
vectMem[19 ]:

vectMem[20 ]:
vectMem[21 ]:

vectMem[22 ]:
vectMem[23 ]:

vectMem[24 ]:
vectMem[25 ]:

vectMem[26 ]:
vectMem[27 ]:

vectMem[28 ]:

vectMem[29 ]:
vectMem[30 ]:

vectMem[31 ]:

6 Transfer Functions
The reserved locations in the controller's data memory are: mem[16], ..., mem[21]

/
FUNCTION NAME:
AUTHOR: Gheorghe M. S t e f a n
DATE :
/
//

/////

6.1 Two-dimension Array Transfer


Blocks of N full vectors are transferred.

6.1.1 Load N full horizontal vectors

/
FUNCTION NAME: Twod i m e n s i o n a r r a y l o a d
AUTHOR: Gheorghe M. S t e f a n
DATE : S e p t . 25 2016

The f u n c t i o n l o a d s i n t h e v e c t o r memory , s t a r t i n g from t h e a d d r e s s v e c t o r A d d r e s s ,


numberOfLines v e c t o r s e a c h o f l e n g t h numberOfColumns s t o r e d i n t h e e x t e r n a l memory
s t a r t i n g from s c a l a r A d d r e s s .

The p a r a m e t e r s f o r t h e f u n c t i o n a r e s e t i n c o n t r o l l e r s d a t a memory i n f o u r s u c c e s s i v e
l o c a t i o n s s t a r t i n g w i t h 1 6 . Recommended p a r a m e t e r i n i t i a l i z a t i o n s e q u e n c e :

cVLOAD( n u m b e r O f L i n e s ) ; NOP ; // number o f l i n e s


cSTORE ( 1 6 ) ; NOP ; // mem[ 1 6 ] <= number of l i n e s
cVLOAD( numberOfColumns ) ; NOP ; // number o f c o l u m n s
cSTORE ( 1 7 ) ; NOP ; // mem[ 1 7 ] <= number of columns ( v e c t o r s s i z e )
cVLOAD( v e c t o r A d d r e s s ) ; NOP ; // vector address
cSTORE ( 1 8 ) ; NOP ; // mem[ 1 8 ] <= v e c t o r address
cVLOAD( s c a l a r A d d r e s s ) ; NOP ; // scalar address
cSTORE ( 1 9 ) ; NOP ; // mem[ 1 9 ] <= s c a l a r address

Example : i f i n t h e d a t a memory o f c o n t r o l l e r t h e r e i s t h e f o l l o w i n g c o n t e n t
mem[ 1 6 ] = 4
mem[ 1 7 ] = 16
mem[ 1 8 ] = 8
mem[ 1 9 ] = 16

and i n t h e e x t e r n a l memory
extMem [ 1 6 ] = 16
extMem [ 1 7 ] = 17
...
extMem [ 7 9 ] = 79

t h e n t h e r u n o f t h e f u n c t i o n p r o v i d e s i n v e c t o r memory
v e c t [ 8 ] = <16 , 1 7 , . . . , 31>
v e c t [ 9 ] = <32 , 3 3 , . . . , 47>
v e c t [ 1 0 ] = <48 , 4 9 , . . . , 63>
v e c t [ 1 1 ] = <64 , 6 5 , . . . , 79>
/
//
cLSIZE ( 1 7 ) ; NOP ; / / s i z e > DMA
LB ( 1 6 ) ; cLADDR ( 1 9 ) ; NOP ; / / a d d r > DMA
cTRUN ( 1 ) ; NOP ; / / r u n l o a d v e c t o r > DMA
cLOAD ( 1 6 ) ; NOP ; / / a c c <= number o f t r a n s f e r s
cVADD ( 2 5 5 ) ; NOP ; / / a c c <= a c c 1
cSTORE ( 1 6 ) ; NOP ; //
cLOAD ( 1 9 ) ; NOP ; / / a c c <= a d d r
cADD ( 1 7 ) ; NOP ; / / a c c <= a c c + s i z e = n e x t a d d r e s s
cSTORE ( 1 9 ) ; NOP ; / / save next address
cLOAD ( 1 8 ) ; NOP ; / / a c c <= v e c t o r a d d r e s s
cVADD ( 1 ) ; CADDRLD; / / inc v e c t o r address ; addrVect [ i ] = acc
cSTORE ( 1 8 ) ; NOP ; / / save next vector address
cIOWAIT ; NOP ; / / w a i t t h e end o f l o a d
cLOAD ( 1 6 ) ; IOLOAD ; / / l o a d number o f l i n e s ; a c c V e c t [ i ] <= i o R e g [ i ]
cBRNZ ( 1 6 ) ; RSTORE ( 0 ) ; / / i f a c c ! = 0 imp t o LB ( 1 6 )
/////
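A reference model of what the function computes is sketched below in C (illustrative only; the flat extMem array, the fixed 16-word vector width of this simulated instance, and the function name are our assumptions):

#include <stddef.h>

/* Illustrative reference model of the two-dimension array load: move
   numberOfLines rows of numberOfColumns words from the external memory
   (a flat array, starting at scalarAddress) into consecutive lines of the
   vector memory starting at vectorAddress.                                 */
static void array_load(const int *extMem, int vectMem[][16],
                       size_t numberOfLines, size_t numberOfColumns,
                       size_t vectorAddress, size_t scalarAddress)
{
    for (size_t line = 0; line < numberOfLines; line++)
        for (size_t col = 0; col < numberOfColumns; col++)
            vectMem[vectorAddress + line][col] =
                extMem[scalarAddress + line * numberOfColumns + col];
}

/* With numberOfLines = 4, numberOfColumns = 16, vectorAddress = 8,
   scalarAddress = 16 and extMem[k] = k, the lines vect[8]..vect[11] receive
   16..31, 32..47, 48..63, 64..79, as in the example above.                 */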

6.1.2 Store N full horizontal vectors

/
FUNCTION NAME: Twod i m e n s i o n a r r a y s t o r e
AUTHOR: Gheorghe M. S t e f a n
DATE : S e p t . 25 2016

The f u n c t i o n s t o r e s from t h e v e c t o r memory , s t a r t i n g from t h e a d d r e s s v e c t o r A d d r e s s ,


numberOfLines v e c t o r s e a c h o f l e n g t h numberOfColumns s t o r e d i n t o t h e e x t e r n a l memory
s t a r t i n g from s c a l a r A d d r e s s .

The p a r a m e t e r s f o r t h e f u n c t i o n a r e s e t i n c o n t r o l l e r s d a t a memory i n f o u r s u c c e s s i v e
l o c a t i o n s s t a r t i n g w i t h 1 6 . Recommended p a r a m e t e r i n i t i a l i z a t i o n s e q u e n c e :

cVLOAD( n u m b e r O f L i n e s ) ; NOP ; // number o f l i n e s


cSTORE ( 1 6 ) ; NOP ; // mem[ 1 6 ] <= number of l i n e s
cVLOAD( numberOfColumns ) ; NOP ; // number o f c o l u m n s
cSTORE ( 1 7 ) ; NOP ; // mem[ 1 7 ] <= number of columns ( v e c t o r s s i z e )
cVLOAD( v e c t o r A d d r e s s ) ; NOP ; // vector address
cSTORE ( 1 8 ) ; NOP ; // mem[ 1 8 ] <= v e c t o r address
cVLOAD( s c a l a r A d d r e s s ) ; NOP ; // scalar address
cSTORE ( 1 9 ) ; NOP ; // mem[ 1 9 ] <= s c a l a r address

Example : i f i n t h e d a t a memory o f c o n t r o l l e r t h e r e i s
mem[ 1 6 ] = 4
mem[ 1 7 ] = 16
mem[ 1 8 ] = 8
mem[ 1 9 ] = 16

and i n t h e v e c t o r memory
v e c t [ 8 ] = <16 , 1 7 , . . . , 31>
v e c t [ 9 ] = <32 , 3 3 , . . . , 47>
v e c t [ 1 0 ] = <48 , 4 9 , . . . , 63>
v e c t [ 1 1 ] = <64 , 6 5 , . . . , 79>

t h e n , t h e f u n c t i o n s t o r e i n t h e e x t e r n a l memory , s t a r t i n g from t h e a d d r e s s 16 t h e
f o l l o w i n g s t r e a m o f d a t a : <16 , 1 7 , . . . , 79>
/
/
cLOAD ( 1 8 ) ; NOP ;
cLSIZE ( 1 7 ) ; CADDRLD; / / s i z e > DMA; a d d r V e c t [ i ] <= a c c
LB ( 1 7 ) ; cLADDR ( 1 9 ) ; RILOAD ( 0 ) ; / / a d d r > DMA; a c c [ i ] <= memVect [ a d d r V e c t [ i ] ]
cTRUN ( 2 ) ; IOSTORE ; / / r u n l o a d v e c t o r > DMA; i o R e g [ i ] <= a c c [ i ]
cLOAD ( 1 6 ) ; RILOAD ( 1 ) ; / / acc <=n u m b e r O f T r a n s f e r s ; a d d r V e c t [ i ]<= a d d r V e c t [ i ] + 1
cVADD ( 2 5 5 ) ; NOP ; / / a c c <= a c c 1
cSTORE ( 1 6 ) ; NOP ; //
cLOAD ( 1 9 ) ; NOP ; / / a c c <= a d d r
cADD ( 1 7 ) ; NOP ; / / a c c <= a c c + s i z e = n e x t a d d r e s s
cSTORE ( 1 9 ) ; NOP ; / / save next address
cIOWAIT ; NOP ; / / w a i t t h e end o f l o a d
cLOAD ( 1 6 ) ; NOP ; / / l o a d number o f l i n e s ;
cBRNZ ( 1 7 ) ; NOP ; / / i f a c c ! = 0 imp t o LB ( 1 7 )
/////

6.1.3 Load M m-component vertical vectors
6.1.4 Store M m-component vertical vectors

6.2 Two-dimension Arrays Transfer


P matrices of M × N elements are transferred, organized as a block of N vectors, using strided transfer operations with burst = stride = M.

7 Dense Linear Algebra
The main features of the architecture stressed by this dwarf [1] are vector multiplication and reduction add.

7.1 Matrix-Vector Multiplication

/
FUNCTION NAME: M a t r i x v e c t o r m u l t i p l i c a t i o n ( r e s e r v e d p r e f i x : MV)
FILE NAME: 05 m a t r i x V e c t o r M u l t i p l y . v
AUTHOR: Gheorghe M. S t e f a n
DATE : J a n u a r y 05 2017

The f u n c t i o n m u l t i p l i e s a NxN m a t r i x w i t h a v e c t o r
Initial :
addr [ i ] = M : address of the l a s t l i n e in matrix
a c c [ i ] = V[ i ] : t h e v e c t o r
Final :
acc [ i ] = r e s u l t

EXAMPLE: f o r t h e d e f i n i t i o n below
acc = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
addr = 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
vect [0] = x x x x x x x x x x x x x x x x
vect [1] = x x x x x x x x x x x x x x x x
vect [2] = x x x x x x x x x x x x x x x x
vect [3] = x x x x x x x x x x x x x x x x
vect [4] = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
vect [5] = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
vect [6] = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
vect [7] = 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
vect [8] = 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
vect [9] = 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
vect [10] = 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
vect [11] = 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
vect [12] = 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
vect [13] = 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
vect [14] = 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
vect [15] = 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
vect [16] = 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
vect [17] = 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
vect [18] = 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14
vect [19] = 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
then f i n a l l y the following changes are produced :
acc = 39 52 65 78 91 104 117 130 143 156 169 182 195 0 0 0
addr = 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
vect [0] = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
DEFINITIONS
Parameters :
d e f i n e MV N 13 / / m a t r i x edge s i z e
d e f i n e MV W 0 / / working space : to save v e c t o r
d e f i n e MV S ( x /2 2) / / latency size
Labels :
d e f i n e MV M 1 / / main l o o p l a b e l
d e f i n e MV L 2 / / l a t e n c y loop l a b e l
/
cNOP ; STORE ( MV W ) ; / / mem[ i ] [W] <= a c c [ i ] = V[ i ]
cVLOAD( MV N ) ; RLOAD ( 0 ) ; / / a c c <= N; a c c [ i ] <= l a s t m a t r i x l i n e
cVSUB ( 1 ) ; MULT( MV W) ; / / a c c <= N1; a c c [ i ] <= a c c [ i ] mem[ i ] [W]
LB ( MV M ) ; cCPUSHL ( 0 ) ; RILOAD ( 2 5 5 ) ; / / p u s h redSum ; a c c [ i ] <= p r e v i o u s l i n e
cBRNZDEC( MV M ) ; MULT( MV W) ; / / l o o p c o n t r o l ; a c c [ i ]<= a c c [ i ] mem[ i ] [W]
cVLOAD( MV S ) ; NOP ; / / l o a d f o r l a t e n c y l o o p : x /2 2
LB ( MV L ) ; cBRNZDEC( MV L ) ; NOP ; / / l a t e n c y loop
cNOP ; SRLOAD ; / / load r e s u l t in acc [ i ]

The execution time is 2N + 4 + 0.5x. In our example there are only 2 latency steps because x = 4 (the simulation is for an array with 2^x = 16 cells).

7.2 Matrix Transpose

/
FUNCTION NAME: M a t r i x t r a n s p o s e ( r e s e r v e d p r e f i x : MT)
FILE NAME: 05 m a t r i x T r a n s p o s e . v
AUTHOR: Gheorghe M. S t e f a n
DATE : J a n u a r y 05 2017

M: t h e m a t r i x t o be t r a n s p o s e d s t o r e d s t a r t i n g from t h e a d d r e s s MT S
MT: t h e t r a n s p o s e d m a t r i x s t o r e d s t a r t i n g from t h e a d d r e s s MT D
N : t h e s i z e o f t h e s q u a r e m a t r i x , named MT N

Example : f o r t h e d e f i n i t i o n s b e l l o w , i f t h e i n i t i a l s t a t e i s :
vect [16] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [17] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [18] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [19] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [20] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [21] = x x x x x x x x x x x x x x x x
then the f i n a l s t a t e i s :
vect [0] = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0
vect [1] = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0
vect [2] = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0
vect [3] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [4] = 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11
vect [5] = x x x x x x x x x x x x x x x x
...
vect [16] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [17] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [18] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [19] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [20] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [21] = x x x x x x x x x x x x x x x x
...
vect [32] = 0 0 0 0 0 5 5 5 5 5 10 10 10 10 10 15
vect [33] = 1 1 1 1 1 6 6 6 6 6 11 11 11 11 11 0
vect [34] = 2 2 2 2 2 7 7 7 7 7 12 12 12 12 12 0
vect [35] = 3 3 3 3 3 8 8 8 8 8 13 13 13 13 13 0
vect [36] = 4 4 4 4 4 9 9 9 9 9 14 14 14 14 14 0
vect [37] = x x x x x x x x x x x x x x x x
The work s p a c e u s e d by t h e f u n c t i o n i s : vectMem [ 0 ] , . . . , vectMem [ 4 ]

DEFINITIONS
Parameters :
d e f i n e MT N 5 / / m a t r i x edge s i z e
d e f i n e MT S 16 / / s o u r c e a d d r e s s i n v e c t o r memory
d e f i n e MT D 32 / / d e s t i n a t i o n a d d r e s s i n v e c t o r memory
Labels :
d e f i n e MT M 1 / / main l o o p l a b e l
d e f i n e MT L 2 / / l e f t s h i f t loop l a b e l
d e f i n e MT R 3 / / r i g h t s h i f t loop l a b e l
/
cNOP ; IXLOAD ; / / a c c [ i ] <= i n d e x
cNOP ; VDIV ( MT N ) ; / / a c c [ i ] <= i n d e x /N
cNOP ; VMULT( MT N ) ; / / a c c <= N ( i n d e x /N) i n i n t e g e r s
cNOP ; STORE ( 0 ) ; / / mem [ 0 ] [ i ] <= a c c [ i ]
cVLOAD( MT N ) ; IXLOAD ; / / a c c [ i ] <= i n d e x
cVSUB ( 1 ) ; SUB ( 0 ) ; / / acc <=acc 1; a c c [ i ]<= i n d e x N ( i n d e x /N) = ixModN
cSTORE ( 1 ) ; STORE ( 0 ) ; / / mem[5] <= s i z e 1= c y c l e s ; mem [ 0 ] [ i ]<=ixModN [ i ]
/ / mem [ 1 ] [ i ]<= sAddr [ i ] = ( ixModN [ i ] c y c l e s ) modN

cNOP ; CSUB ; // a c c <= ixModN c y c l e s
cNOP ; WHERECARRY; // a c c <= N; s e l e c t where c a r r y
cNOP ; VADD( MT N ) ; // a c c [ i ] <= a c c [ i ] + a c c
cNOP ; ENDWHERE; // reselect all cells
cNOP ; STORE ( 1 ) ; // s t o r e a t mem [ 1 ] [ i ]
// mem [ 2 ] [ i ]<=dAddr [ i ] = ( ixModN [ i ] + c y c l e s ) modN
cNOP ; LOAD ( 0 ) ; // a c c <= c y c l e s ; a c c [ i ] <= ixModN [ i ]
cNOP ; CADD; // a c c <= N; a c c [ i ] <= ixModN [ i ] + c y c l e s
cNOP ; VCOMPARE( MT N ) ; / / compare w i t h N ( a c c N)
cNOP ; WHERENCARRY; // s e l e c t where n o t c a r r y
cNOP ; VSUB( MT N ) ; // a c c [ i ] <= a c c N;
cNOP ; ENDWHERE; // reselect all cells
cNOP ; STORE ( 2 ) ; // s t o r e a t mem [ 2 ] [ i ]
// r e a d on d i a g o n a l
LB ( MT M ) ;
cVLOAD( MT S ) ; LOAD ( 1 ) ; // l o a d s o u r c e ; l o a d ( ixModN [ i ] c y c l e s ) modN
cNOP ; ADDRLD; // a d d r [ i ] <= ( ixModN [ i ] c y c l e s ) modN
cLOAD ( 1 ) ; CRLOAD; // a c c [ i ] <= mem[ i ] [ S + ( ixModN [ i ] c y c l e s ) modN ]
// l o c a l , modN r o t a t e w i t h c y c l e s
cVSUB ( 1 ) ; STORE ( 3 ) ; // s a v e d i a g o n a l ( a r e g i s t e r s h o u l d be good )
LB ( MT L ) ;
cBRNZDEC( MT L ) ; GLSHIFT ; // global l e f t s h i f t cycle times
cNOP ; STORE ( 4 ) ; // save the l e f t s h i f t e d diagonal
// w r i t e on d i a g o n a l
cNOP ; LOAD ( 2 ) ; // l o a d d e s t ; l o a d ( ixModN [ i ] + c y c l e s ) modN
cNOP ; ADDRLD; // a d d r [ i ] <= ( ixModN [ i ] + c y c l e s ) modN
cVLOAD( MT N ) ; LOAD ( 4 ) ; // reload the shifted diagonal
cSUB ( 1 ) ; RSTORE( MT D ) ; // a c c [ i ] <= mem[ i ] [ D + ( ixModN [ i ] + c y c l e s ) modN ]
cVSUB ( 1 ) ; LOAD ( 3 ) ; // reload the diagonal
LB ( MT R ) ;
cBRNZDEC( MT R ) ; GRSHIFT ; // g l o b a l r i g h t s h i f t Nc y c l e s t i m e s
cVLOAD( MT N ) ; STORE ( 4 ) ; // save the r i g h t s h i f t e d diagonal
cSUB ( 1 ) ; LOAD ( 0 ) ; // a c c <= c y c l e s ; a c c [ i ] <= ixModN [ i ]
cNOP ; CCOMPARE; // compare ixModN [ i ] w i t h c y c l e s
cNOP ; WHERENCARRY; // where n o t c a r r y
cVLOAD( MT D ) ; LOAD ( 4 ) ; // restore the right shifted diagonal
cNOP ; CRSTORE ; // a c c [ i ] <= mem[ i ] [ D + ( ixModN [ i ] + c y c l e s ) modN ]
cVLOAD( MT N ) ; ENDWHERE; // a c c <= N; r e s e l e c t a l l c e l l s
// increment source diagonal
cVSUB ( 1 ) ; LOAD ( 1 ) ; // a c c <= N1; l o a d s o u r c e d i a g o n a l a d d r e s s e s
cNOP ; CSUB ; // a c c [ i ] <= ( ixModN [ i ] + c y c l e s ) modN (N1)
cVLOAD( MT N ) ; WHERENZERO; // s e l e c t where n o t c a r r y
cNOP ; CADD; // a c c [ i ] <= a c c [ i ] + a c c
cNOP ; ENDWHERE; // reselect all cells
cNOP ; STORE ( 1 ) ; // s t o r e b a c k aAddr [ i ]
// decrement dest diagonal
cVSUB ( 1 ) ; LOAD ( 2 ) ; // acc <=N1;
// a c c [ i ]<=dAddr [ i ] = ( ixModN [ i ] + c y c l e s ) modN
cNOP ; WHEREZERO; // s e l e c t where z e r o
cNOP ; CLOAD; // where 0 a c c <= N1
cLOAD ( 1 ) ; ELSEWHERE ; // s e l e c t where n o t z e r o
cVSUB ( 1 ) ; VSUB ( 1 ) ; // a c c <= c y c l e s ; a c c [ i ] <= a c c [ i ] 1
cSTORE ( 1 ) ; ENDWHERE; // a c c <= a c c 1 ; r e s e l e c t a l l c e l l s
cBRNZ ( MT M ) ; STORE ( 2 ) ; // mem [ 5 ] <= c y c l e s ; s t o r e b a c k dAddr [ i ]
// move d i a g o n a l
cVLOAD( MT S ) ; LOAD ( 0 ) ; // a c c <= S ; a c c [ i ] <= ixModN [ i ]
cNOP ; ADDRLD; // a d d r [ i ] <= ixModN [ i ]
cVLOAD( MT D ) ; CRLOAD; // a c c <= D; a c c [ i ] <= mem[ i ] [ S + ixModN [ i ] ]
cNOP ; CRSTORE ; // mem[ i ] [ D + ixModN [ i ] ] <= a c [ i ]

The execution time is: T(N) = N^2 + 29N - 7.

7.3 Matrix-Matrix Multiplication

/
FUNCTION NAME: M a t r i x m a t r i x m u l t i p l i c a t i o n ( r e s e r v e d p r e f i x : MM)
FILE NAME: 05 m a t r i x M a t r i x M u l t i p l y . v
AUTHOR: Gheorghe M. S t e f a n
DATE : J a n u a r y 07 2017

The f u n c t i o n m u l t i p l i e s two NxN m a t r i c e s , b o t h s t o r e d i n t h e v e c t o r memory .


The t r a n s p o s e o f t h e s e c o n d m a t r i x i s computed i n t h e r e s u l t s p a c e
The r e s u l t i s r e t u r n e d i n t h e same memory .

The p a r a m e t e r s f o r t h e f u n c t i o n a r e s e t i n c o n t r o l l e r s d a t a memory i n f o u r s u c c e s s i v e
l o c a t i o n s s t a r t i n g w i t h 2 6 . Recommended c e l l a c t i v a t i o n & p a r a m e t e r i n i t i a l i z a t i o n
sequence :

EXAMPLE: i f t h e i n i t i a l s t a t e o f t h e v e c t o r memory is :
vect [0] = 0 1 2 3 4 5 6 7 8 9 10 x x x x x
...
vect [16] = 1 2 3 4 5 6 7 8 9 10 11 x x x x x
vect [17] = 2 3 4 5 6 7 8 9 10 11 12 x x x x x
vect [18] = 3 4 5 6 7 8 9 10 11 12 13 x x x x x
vect [19] = 4 5 6 7 8 9 10 11 12 13 14 x x x x x
vect [20] = 5 6 7 8 9 10 11 12 13 14 15 x x x x x
vect [21] = 6 7 8 9 10 11 12 13 14 15 16 x x x x x
vect [22] = 7 8 9 10 11 12 13 14 15 16 17 x x x x x
vect [23] = 8 9 10 11 12 13 14 15 16 17 18 x x x x x
vect [24] = 9 10 11 12 13 14 15 16 17 18 19 x x x x x
vect [25] = 10 11 12 13 14 15 16 17 18 19 20 x x x x x
vect [26] = 11 12 13 14 15 16 17 18 19 20 21 x x x x x
...
vect [48] = 0 0 0 0 0 0 0 0 0 0 0 x x x x x
vect [49] = 1 1 1 1 1 1 1 1 1 1 1 x x x x x
vect [50] = 2 2 2 2 2 2 2 2 2 2 2 x x x x x
vect [51] = 3 3 3 3 3 3 3 3 3 3 3 x x x x x
vect [52] = 4 4 4 4 4 4 4 4 4 4 4 x x x x x
vect [53] = 5 5 5 5 5 5 5 5 5 5 5 x x x x x
vect [54] = 6 6 6 6 6 6 6 6 6 6 6 x x x x x
vect [55] = 7 7 7 7 7 7 7 7 7 7 7 x x x x x
vect [56] = 8 8 8 8 8 8 8 8 8 8 8 x x x x x
vect [57] = 9 9 9 9 9 9 9 9 9 9 9 x x x x x
vect [58] = 10 10 10 10 10 10 10 10 10 10 10 x x x x x
...
then the f i n a l s t a t e i s :
vect [0] = 0 1 2 3 4 5 6 7 8 9 10 x x x x x
vect [1] = 0 1 2 3 4 5 6 7 8 9 10 x x x x x
vect [2] = 0 1 2 3 4 5 6 7 8 9 10 x x x x x
vect [3] = 10 0 1 2 3 4 5 6 7 8 9 x x x x x
vect [4] = 0 0 0 0 0 0 0 0 0 0 10 x x x x x
...
vect [16] = 1 2 3 4 5 6 7 8 9 10 11 x x x x x
vect [17] = 2 3 4 5 6 7 8 9 10 11 12 x x x x x
vect [18] = 3 4 5 6 7 8 9 10 11 12 13 x x x x x
vect [19] = 4 5 6 7 8 9 10 11 12 13 14 x x x x x
vect [20] = 5 6 7 8 9 10 11 12 13 14 15 x x x x x
vect [21] = 6 7 8 9 10 11 12 13 14 15 16 x x x x x
vect [22] = 7 8 9 10 11 12 13 14 15 16 17 x x x x x
vect [23] = 8 9 10 11 12 13 14 15 16 17 18 x x x x x
vect [24] = 9 10 11 12 13 14 15 16 17 18 19 x x x x x
vect [25] = 10 11 12 13 14 15 16 17 18 19 20 x x x x x

vect [26] = 11 12 13 14 15 16 17 18 19 20 21 x x x x x
...
vect [32] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [33] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [34] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [35] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [36] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [37] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [38] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [39] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [40] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [41] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [42] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
...
vect [48] = 0 0 0 0 0 0 0 0 0 0 0 x x x x x
vect [49] = 1 1 1 1 1 1 1 1 1 1 1 x x x x x
vect [50] = 2 2 2 2 2 2 2 2 2 2 2 x x x x x
vect [51] = 3 3 3 3 3 3 3 3 3 3 3 x x x x x
vect [52] = 4 4 4 4 4 4 4 4 4 4 4 x x x x x
vect [53] = 5 5 5 5 5 5 5 5 5 5 5 x x x x x
vect [54] = 6 6 6 6 6 6 6 6 6 6 6 x x x x x
vect [55] = 7 7 7 7 7 7 7 7 7 7 7 x x x x x
vect [56] = 8 8 8 8 8 8 8 8 8 8 8 x x x x x
vect [57] = 9 9 9 9 9 9 9 9 9 9 9 x x x x x
vect [58] = 10 10 10 10 10 10 10 10 10 10 10 x x x x x
vect [59] = x x x x x x x x x x x x x x x x

DEFINITIONS :
Parameters :
define N 11 / / m a t r i x edge s i z e
d e f i n e M1 16 / / f i r s t matrix address
d e f i n e M2 48 / / second matrix address
d e f i n e MT 32 / / transposed matrix address
d e f i n e MR 32 / / r e s u l t matrix address
Label :
d e f i n e MM 0 / / matrix multiply loop l a b e l
Parameters for matrix transpose :
define S M2 / / s o u r c e a d d r e s s i n v e c t o r memory
define D MT / / d e s t i n a t i o n a d d r e s s i n v e c t o r memory
Labels for matrix transpose :
d e f i n e TL 1 / / t r a n s p o s e loop l a b e l
d e f i n e LS 2 / / l e f t s h i f t loop l a b e l
d e f i n e RS 3 / / r i g h t s h i f t loop l a b e l
Parameters for matrix vector multiply :
define M ( M1+ N1) / / a d d r e s s o f t h e LAST l i n e i n m a t r i x
define W 0 / / working space : to save v e c t o r
define L ( x /2 2) / / latency size
Labels for matrix vector multiply :
d e f i n e MV 4 / / loop l a b e l
d e f i n e LL 5 / / l a t e n c y loop l a b e l
/
i n c l u d e 05 m a t r i x T r a n s p o s e . v

cVLOAD( MR) ; NOP ; // a c c <= result pointer


cSTORE ( 0 ) ; NOP ; // mem [ 0 ] <= r e s u l t p o i n t e r
cVLOAD( MT) ; VLOAD( M) ; // a c c <= MT; a c c [ i ] <= M
cSTORE ( 1 ) ; ADDRLD; // mem [ 1 ] <= MT; a d d r [ i ] <= M
cVLOAD( N ) ; CALOAD; // a c c <= N; a c c [ i ] <= v e c t o r
cSTORE ( 2 ) ; NOP ;
LB ( MM) ;

i n c l u d e 05 m a t r i x V e c t o r M u l t i p l y . v

cLOAD ( 0 ) ; NOP ; / / a c c <= r e s u l t p o i n t e r


cVADD ( 1 ) ; CSTORE ; / / a c c <= a c c + 1 ; mem[ i ] [ r e s u l t p o i n t e r ] <= a c c [ i ]
cSTORE ( 0 ) ; NOP ; / / mem [ 0 ] <= new r e s u l t p o i n t e r

cLOAD ( 1 ) ; VLOAD( M) ; / / a c c <= MT; a c c [ i ] <= M


cVADD ( 1 ) ; ADDRLD; / / a c c <= a c c + 1 ; a d d r [ i ] <= M
cSTORE ( 1 ) ; NOP ; / / mem [ 1 ] <= new t r a n s M a t r i x p o i n t e r

cLOAD ( 2 ) ; CALOAD; / / a c c <= l o o p c o u n t e r ; a c c [ i ] <= new v e c t o r


cVSUB ( 1 ) ; NOP ; / / a c c <= acc 1;
cSTORE ( 2 ) ; NOP ; / / mem [ 2 ] <= new l o o p c o u n t e r

cBRNZ ( MM) ; NOP ;

8 Sparse Linear Algebra
The two kinds of sparse matrices are investigated:

sparse matrices with randomly distributed non-zero elements


band matrices

8.1 Sparse matrix representation


8.1.1 Band matrices representation
The algorithm uses a number of vectors equal to the number of non-zero diagonals. The diagonals are positioned according to their relation with the main diagonal. The main diagonal and the diagonals under the main diagonal are left aligned, while the other diagonals are right aligned in the first N positions of the vectors. For example, the following band matrix

2 3 4 0 0 0 0 0
1 2 3 4 0 0 0 0
0 1 2 3 4 0 0 0
0 0 1 2 3 4 0 0
0 0 0 1 2 3 4 0
0 0 0 0 1 2 3 4
0 0 0 0 0 1 2 3
0 0 0 0 0 0 1 2

is represented in vector memory as follows:

vect[b1] = <0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0>


vect[b2] = <0 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0>
vect[b3] = <2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0> = main diagonal
vect[b4] = <1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0>

The algorithms for sparse matrices presented in this section are designed only for the limited case N ≤ P.
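This layout places the element A[i][j] of each stored diagonal at its column index j, which produces exactly the left/right alignment described above. The following C sketch is illustrative only (the function name and the flat row-major A are our assumptions); it builds the diagonal vectors:

#include <stddef.h>

/* Illustrative packing of a band matrix into bw diagonal vectors of P words,
   topmost diagonal first; md is the 1-based position of the main diagonal
   among the stored ones, so diagonal d holds the offset md - 1 - d.        */
static void pack_band(const int *A, size_t N, size_t P,
                      size_t bw, int md, int *diag /* bw x P */)
{
    for (size_t d = 0; d < bw; d++) {
        int o = md - 1 - (int)d;                 /* diagonal offset j - i    */
        for (size_t j = 0; j < P; j++)
            diag[d * P + j] = 0;
        for (size_t i = 0; i < N; i++) {
            long j = (long)i + o;
            if (j >= 0 && j < (long)N)
                diag[d * P + j] = A[i * N + (size_t)j];
        }
    }
}

/* For the 8x8 matrix above (N = 8, P = 16, bw = 4, md = 3) the four packed
   vectors are exactly vect[b1] .. vect[b4].                                 */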

8.1.2 Sparse matrices with randomly distributed non-zero elements representation


The algorithm is designed to exploit the two main features of the pRISC architecture: the search operations and the scan function first (which keeps active only the first active cell).
The representation used for the sparse matrix is the coordinate list (COO). COO stores a list of (row, column, value) tuples as three vectors. Ideally, the vectors are sorted (by row index, then column index) to improve the mono-core computation. For our architecture the sorting is not required because of the very efficient search operation.
For example, the 4 × 4 matrix, with M = 8 non-zero elements, for an engine with P = 2^x = 16 cells:

8 0 0 7
0 6 0 5
4 0 3 0
0 2 0 1

is represented by the following 3 vectors:


valuVector = < 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 >
lineIndexVector = < 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4 >
columnIndexVector = < 0 3 1 3 0 2 1 3 4 4 4 4 4 4 4 4 >
The algorithms for sparse matrices presented in this section are designed only for the limited case M ≤ P.
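The packing of a dense matrix into this padded COO form is summarized by the following illustrative C sketch (the struct and function names are ours; the fixed size 16 corresponds to the simulated P = 16 instance):

/* Illustrative COO layout used here: three vectors of P = 16 elements; the
   unused tail positions hold 0 in the value vector and N in both index
   vectors, as in the example above.                                        */
typedef struct {
    int value[16];    /* non-zero values, zero padded   */
    int line[16];     /* row indexes, padded with N     */
    int column[16];   /* column indexes, padded with N  */
} coo16;

static void pack_coo(const int *A, int N, coo16 *m)
{
    int k = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (A[i * N + j] != 0) {
                m->value[k]  = A[i * N + j];
                m->line[k]   = i;
                m->column[k] = j;
                k++;
            }
    for (; k < 16; k++) {            /* pad the unused cells */
        m->value[k]  = 0;
        m->line[k]   = N;
        m->column[k] = N;
    }
}

/* Applied to the 4x4 example above it yields exactly valuVector,
   lineIndexVector and columnIndexVector.                                   */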

8.2 Band Matrix Operations
8.2.1 Band Matrix Vector Multiplication

/
FUNCTION NAME: Band m a t r i x v e c t o r m u l t i p l i c a t i o n
FILE NAME: 05 BdMV . v
AUTHOR: Gheorghe M. S t e f a n
DATE : J a n u a r y 7 2017

The f u n c t i o n m u l t i p l i e s i n w a band NxN m a t i c e , A, w i t h a d e n s e v e c t o r , v , s t o r e d i n


v e c t o r memory . The m a t r i x i s r e p r e s e n t e d by bw ( band w i d t h ) s e q u e n c e s i n v e c t o r s o f P
e l e m e n t s ( P : number o f c e l l s ) , s t a r t i n g from t h e a d d r e s s f d a ( f i r s t d i a g o n a l a d d r e s s ) ,
as follows :

vector [ fda ] = 0 0 ... 0 0 0 v v ... v v 0 0 ... 0


v e c t o r [ f da +1] = 0 0 ... 0 0 v v v ... v v 0 0 ... 0
v e c t o r [ f da +2] = 0 0 ... 0 v v v v ... v v 0 0 ... 0
...
v e c t o r [ f d a +md1] = v v . . . v v v v v . . . v v 0 0 . . . 0 : main d i a g o n a l
v e c t o r [ f d a +md ] = v v ... v v v v v ... v 0 0 0 ... 0
v e c t o r [ f d a +md+ 1 ] = v v ... v v v v v ... 0 0 0 0 ... 0
...
v e c t o r [ f d a +bw1] = v v ... v v v v v ... 0 0 0 0 ... 0

where : v !== 0
THE ALGORITHM
========================================================================
z = md
v e c t o r [ r v a ] <= <0 0 . . . 0>
f o r i = 0 ; i <bw ; i = i + 1 ;
z <= z 1
i f ( ! ( z <0))
v e c t o r [ r v a ] <= v e c t o r [ r v a ] + ( v e c t o r [ f d a + i ] v e c t o r [ va ] ) << z
else
v e c t o r [ r v a ] <= v e c t o r [ r v a ] + ( v e c t o r [ f d a + i ] v e c t o r [ va ] ) >> | z |
========================================================================
DEFINITIONS :
Parameters :
define N 8 / / m a t r i x edge s i z e
define R 4 / / r e s u l t v e c t o r a d d r e s s
define F 8 / / f i r s t diagonal address
define W 4 / / number o f d i a g o n a l s
define M 3 / / main d i a g o n a l p o s i t i o n
define V 6 / / vector address
Labels :
d e f i n e ML 0 / / main l o o p l a b e l
d e f i n e LS 1 // left shift label
d e f i n e RS 2 // right shift label
d e f i n e SK 3 / / skip label
EXAMPLE: l e t be t h e o p e r a t i o n p e r f o r m e d w i t h t h e p r e v i o u s l y d e f i n e d
p a r a m e t e r s and l a b e l s :
|2 3 4 0 0 0 0 0 | |1| |20|
|1 2 3 4 0 0 0 0 | |2| |30|
|0 1 2 3 4 0 0 0 | |3| |40|
|0 0 1 2 3 4 0 0 | |4| |50|
|0 0 0 1 2 3 4 0 | X |5| = |60|
|0 0 0 0 1 2 3 4 | |6| |70|
|0 0 0 0 0 1 2 3 | |7| |44|
|0 0 0 0 0 0 1 2 | |8| |23|

With t h e p r e v i o u s i n i t i a l i z a t i o n data is represented i n v e c t o r memory a s follows :
vect [6] = 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0
vect [7] = x x x x x x x x x x x x x x x x
vect [8] = 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0
vect [9] = 0 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0
vect [10] = 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0
vect [11] = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0

The f i n a l c o n t e n t o f t h e v e c t o r memory i s :
vect [0] = 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0
vect [1] = 20 30 40 50 60 70 44 23 0 0 0 0 0 0 0 0
vect [2] = x x x x x x x x x x x x x x x x
vect [3] = x x x x x x x x x x x x x x x x
vect [4] = 20 30 40 50 60 70 44 23 0 0 0 0 0 0 0 0
vect [5] = x x x x x x x x x x x x x x x x
vect [6] = 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0
vect [7] = x x x x x x x x x x x x x x x x
vect [8] = 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0
vect [9] = 0 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0
vect [10] = 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0
vect [11] = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
/
cVLOAD( V ) ; ACTIVATE ; / / a c c <= va ; activate all cells
cVLOAD( W) ; CALOAD; / / a c c <= bw ; a c c [ i ] <= mem[ va ]
cSTORE ( 0 ) ; STORE ( 0 ) ; / / mem [ 0 ] <= bw ; mem [ 0 ] [ i ] <= v
cVLOAD( F ) ; VLOAD ( 0 ) ; / / a c c <= f d a ; a c c [ i ] <= 0
cVSUB ( 1 ) ; STORE ( 1 ) ; / / a c c <= f d a 1; mem [ 1 ] [ i ] <= 0
cVLOAD( M) ; CLOAD; / / a c c <= md ; a c c [ i ] <= f d a 1
cSTORE ( 1 ) ; ADDRLD; / / mem [ 1 ] <= md ; a d d r [ i ] <= f d a 1
LB ( ML) ; cLOAD ( 1 ) ; NOP ; / / a c c <= mem [ 1 ]
cVSUB ( 1 ) ; RILOAD ( 1 ) ; / / a c c <= acc 1; a c c [ i ] <= mem[ a d d r + 1 ]
cSTORE ( 1 ) ; MULT ( 0 ) ; / / mem [ 1 ] <= a c c ; a c c [ i ] <= a c c [ i ] mem [ 0 ] [ i ]
cBRSGN ( RS ) ; NOP ; / / i f a c c [ n 1] jmp ( 3 2 ) ( 2 nd s t e p o f f l o a t m u l t )
cBRZDEC ( SK ) ; NOP ; / / i f a c c =0 jmp 33
LB ( LS ) ; cBRNZDEC( LS ) ; GLSHIFT ; / / i f a c c =0 jmp ( 3 3 ) a c c [ i ] <= a c c [ i + 1 ]
cJMP ( SK ) ; NOP ;

LB ( RS ) ; cBRNZINC ( RS ) ; GRSHIFT ; // i f ! ( a c c +1)=0 jmp ( 3 2 ) a c c [ i ] <= a c c [ i 1]

LB ( SK ) ; cLOAD ( 0 ) ; ADD( 1 ) ; // a c c <= mem [ 0 ] ; a c c [ i ] <= a c c [ i ] +mem [ 1 ] [ i ]


cVSUB ( 1 ) ; NOP ; // a c c <= acc 1; ( 2 nd s t e p o f f l o a t add )
cSTORE ( 0 ) ; NOP ; // mem [ 0 ] <= a c c ; (3 rd step o f f l o a t add )
cBRNZ ( ML) ; STORE ( 1 ) ; // i f ! a c c =0 jmp ( 3 0 ) mem [ 1 ] [ i ] <= a c [ i ]
// add on l i n e
cVLOAD( R ) ; LOAD ( 1 ) ; // a c c <= rwa ; a c c [ i ] <= mem [ 1 ] [ i ]
cNOP ; CSTORE ; // mem[ rwa ] [ i ] <= a c c [ i ]

Evaluation: TBdMVmax(W) = 0.5W^2 + 9.5W + 8. For floating point operations the execution time is the same, because NOPs are reserved for the second and third steps of the floating point operations; see parentheses like (2nd step of float mult) in the comments of the program.
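What the function computes can be checked against the following illustrative C reference (names are ours; diag is the bw x P column-indexed layout built as in Section 8.1.1):

#include <stddef.h>

/* Illustrative reference of the band matrix-vector multiply on the
   column-indexed diagonal layout; md is the 1-based position of the main
   diagonal among the bw stored diagonals (topmost first).                  */
static void band_mv(const int *diag /* bw x P */, size_t P, size_t bw,
                    int md, const int *v, int *r, size_t N)
{
    for (size_t i = 0; i < N; i++)
        r[i] = 0;
    int z = md;                            /* offset of the next diagonal   */
    for (size_t d = 0; d < bw; d++) {
        z = z - 1;                         /* as in the algorithm above     */
        for (size_t i = 0; i < N; i++) {
            long j = (long)i + z;          /* column touched on row i       */
            if (j >= 0 && j < (long)N)
                r[i] += diag[d * P + (size_t)j] * v[j];
        }
    }
}

/* For the 8x8 example (bw = 4, md = 3) and v = {1, ..., 8} the result is
   r = {20, 30, 40, 50, 60, 70, 44, 23}, the content of the result vector
   vect[4] above.                                                           */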

8.3 Random Sparse Matrix Operations
8.3.1 Sparse Matrix Transpose

/
FUNCTION NAME: S p a r s e m a t r i x t r a n s p o s e
AUTHOR: Gheorghe M. S t e f a n
DATE : Oct . 30 2016

The f u n c t i o n t r a n s p o s e s a s p a r s e NxN m a t r i x i n v e c t o r memory . The m a t r i x i s


r e p r e s e n t e d by t h r e e v e c t o r s o f P e l e m e n t s ( P : number o f c e l l s ) , a s f o l l o w s :

m: ( ( v v ... v 0 0 ...) // nonz e r o v a l u e s o f t h e m a t r i x


(x x ... x N N ...) // l i n e i n d e x e s o f t h e nonz e r o v a l u e s
(y y ... y N N ...) // column i n d e x e s o f t h e nonz e r o v a l u e s
) = ( ( fmv ) ( f m l ) ( fmc ) ) // m a t r i x w i t h F =< P e l e m e n t s

w : ( . . . . . . ) / / working v e c t o r
x, y, < N

INITIALIZATION i s done by t h e f o l l o w i n g s e q u e n c e i n c o n t r o l l e r s s i d e o f t h e c o d e :

cVLOAD( mv ) ; // a c c <= m = address of the matrix value vector


cSTORE ( 2 5 ) ; // mem[ 2 5 ] <= m
cVLOAD(w) ; // a c c <= w = a d d r e s s of t h e working v e c t o r
cSTORE ( 2 7 ) ; // mem[ 2 7 ] <= w

EXAMPLE:
| 8 0 0 7 | | 8 0 4 0 |
| 0 6 0 5 | | 0 6 0 2 |
| 4 0 3 0 | | 0 0 3 0 |
| 0 2 0 1 |T = | 7 5 0 1 |

The a l g o r i t h m i s e m b a r r a s s i n g l y s i m p l e :
t h e l i n e v e c t o r i s swapped w i t h t h e column v e c t o r

I f t h e d a t a memory o f t h e c o n t r o l l e r i s i n i t i a l i z e d by t h e s e q u e n c e :
cVLOAD ( 8 ) ;
cSTORE ( 2 5 ) ; / / mem[ 2 5 ] <= 8 = fmv
cVLOAD ( 1 4 ) ;
cSTORE ( 2 7 ) ; / / mem[ 2 7 ] <= 14 = wm

and t h e c o n t e n t o f t h e v e c t o r memory i s initially :


f i r s t matrix :
vect [8] = 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0
vect [9] = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4
vect [10] = 0 3 1 3 0 2 1 3 4 4 4 4 4 4 4 4
working v e c t o r :
vect [14] = x x x x x x x x x x x x x x x x

a t t h e end o f r u n n i n g t h e f u n c t i o n t h e memory becomes :

vect [8] = 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0
vect [9] = 0 3 1 3 0 2 1 3 4 4 4 4 4 4 4 4
vect [10] = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4

vect [14] = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4

/
cLOAD ( 2 5 ) ; NOP ; / / a c c <= l = a d d r e s s o f l i n e v e c t o r

cLOAD ( 2 7 ) ; CALOAD; / / a c c <= w = a d d r e s s o f w o r k i n g v e c t o r ; a c c [ i ] <= mem[ l ] [ i ]
cLOAD ( 2 5 ) ; CSTORE ; / / a c c <= l ; mem[w] <= ml
cVADD ( 1 ) ; NOP ; / / a c c <= l +1 = c : a d d r e s s o f column v e c t o r
cVSUB ( 1 ) ; CALOAD; / / a c c <= l ; a c c [ i ] <= mem[ c ] [ i ]
cLOAD ( 2 7 ) ; CSTORE ; / / a c c <= w ; mem[ l ] [ i ] <= a c c [ i ] = mem[ c ] [ i ]
cLOAD ( 2 5 ) ; CALOAD; / / a c c <= l ; a c c [ i ] <= ml
cVADD ( 1 ) ; NOP ; / / a c c <= c ;
cNOP ; CSTORE ; / / mem[ c ] [ i ] <= ml
//
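In a scalar reference model (illustrative C, names are ours), the transpose is just the swap of the two index vectors through the working vector:

#include <stddef.h>

/* Illustrative reference of the sparse (COO) matrix transpose: the value
   vector is untouched; the line and column index vectors are swapped using
   the working vector, which keeps a copy of the old line indexes.          */
static void coo_transpose(int *line, int *column, int *work, size_t P)
{
    for (size_t i = 0; i < P; i++) work[i]   = line[i];
    for (size_t i = 0; i < P; i++) line[i]   = column[i];
    for (size_t i = 0; i < P; i++) column[i] = work[i];
}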

8.3.2 Sparse Matrix Vector Multiplication

/
FUNCTION NAME: S p a r s e m a t r i x v e c t o r m u l t i p l i c a t i o n
AUTHOR: Gheorghe M. S t e f a n
DATE : Nov . 10 2016

The f u n c t i o n m u l t i l i e s i n w a s p a r s e NxN m a t i c e , A, w i t h a d e n s e v e c t o r , v , s t o r e d i n
v e c t o r memory . The m a t r i x i s r e p r e s e n t e d by t h r e e s e q u e n c e s i n v e c t o r s o f P e l e m e n t s
( P : number o f c e l l s ) , a s f o l l o w s :

A: ((v v ... v 0 0 ...) // nonz e r o v a l u e s s e q u e n c e o f t h e m a t r i x


(x x ... x N N ...) // l i n e i n d e x e s s e q u e n c e o f t h e nonz e r o v a l u e s
(y y ... y N N ...) // column i n d e x e s s e q u e n c e o f t h e nonz e r o v a l u e s
) = ( ( vs ) ( l s ) ( cs )) // m a t r i x w i t h M =< P e l e m e n t s

x, y < N
w: r e s u l t r e g i s t e r
z : working r e g i s t e r
sv : s e r i a l v e c t o r implemented i n hardware d i s t r i b u t e d along t h e c e l l s

THE ALGORITHM
==========================
s r <= v
f o r i = 0 ; i <N; i = i + 1 ;
s e l e c t ( where c s == i )
z <= s r [ 0 ]
s r <= s r << 1
z <= z v s
f o r i =N1; i =<0; i = i 1;
s e l e c t ( where l s == i )
s r <= { redAdd ( z ) , s r }
w <= s r
==========================

INITIALIZATION i s done by t h e f o l l o w i n g s e q u e n c e i n c o n t r o l l e r s s i d e o f t h e c o d e :

cVLOAD ( 4 ) ;
cSTORE ( 2 2 ) ; / / mem[ 2 2 ] <= N; m a t r i x / v e c t o r s i z e
cVLOAD ( 1 5 ) ;
cSTORE ( 2 3 ) ; / / mem[ 2 3 ] <= wsa ; r e s u l t v e c t o r ( ws ) a d d r e s s
cVLOAD ( 8 ) ;
cSTORE ( 2 4 ) ; / / mem[ 2 4 ] <= v s a ; v a l u e s e q u e n c e ( v s ) a d d r e s s
cVLOAD ( 9 ) ;
cSTORE ( 2 5 ) ; / / mem[ 2 5 ] <= l s a ; l i n e s e q u e n c e ( l s ) a d d r e s s
cVLOAD ( 1 0 ) ;
cSTORE ( 2 6 ) ; / / mem[ 2 6 ] <= c s a ; columns e q u e n c e ( c s ) a d d r e s s
cVLOAD ( 1 1 ) ;
cSTORE ( 2 7 ) ; / / mem[ 2 7 ] <= va ; v e c t o r ( v ) a d d r e s s

EXAMPLE: t h e i n i t a l d a t a i s

A = |8 0 0 7| v = |4|
|0 6 0 5| |3|
|4 0 0 0| |2|
|0 2 0 1| |1|

t h e n , t h e i n i t a i l c o n t e n t o f v e c t o r memory :
vect [8] = 8 7 6 5 4 2 1 0 0 0 0 0 0 0 0 0
vect [9] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4

vect [10] = 0 3 1 3 0 1 3 4 4 4 4 4 4 4 4 4
vect [11] = 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0

The f i n a l c o n t e n t o f v e c t o r memory ( a f t e r 26 c l o c k c y c l e s ) :
vect [0] = 32 7 18 5 16 6 1 x x x x x x x x x
vect [1] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
...
vect [8] = 8 7 6 5 4 2 1 0 0 0 0 0 0 0 0 0
vect [9] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
vect [10] = 0 3 1 3 0 1 3 4 4 4 4 4 4 4 4 4
vect [11] = 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0
...
vect [15] = 39 23 16 7 0 0 0 0 0 0 0 0 0 0 0 0

which c o r r e s p o n d t o :w = |39|
|23|
|16|
| 7|
/
cLOAD ( 2 7 ) ; ACTIVATE ; / / a c c <= va ; a c t i v a t e a l l c e l l s
cNOP ; CALOAD; / / a c c [ i ] <= mem[ va ] [ i ] = v [ i ]
cLOAD ( 2 6 ) ; SRSTORE ; / / a c c <= c s a ; s r [ i ] <= v [ i ]
cVLOAD ( 0 ) ; CALOAD; / / a c c <= 0 ; a c c [ i ] <= mem[ c s a ] [ i ] = c s [ i ]
cNOP ; STORE ( 1 ) ; / / mem [ 1 ] [ i ] <= c s [ i ]
/ / DISTRIBUTE VECTOR COMPONENTS
LB ( 2 6 ) ; cVADD ( 1 ) ; SEARCH ; / / a c c <= a c c + 1 ; s e l e c t column a c c
cCSEND ( 4 ) ; CLOAD; / / coOp = s r [ 0 ] ; a c c [ i ] <= s r [ 0 ]
cVPUSHR ( 0 ) ; STORE ( 0 ) ; / / pop s r [ 0 ] ; mem [ 0 ] [ i ] <= a c c [ i ] = v [ j ]
cSKIPEQ ( 2 2 ) ; ACTIVATE ; / / s k i p i s (mem[ 2 2 ] = N ) ; a c t i v a t e a l l c e l l s
cJMP ( 2 6 ) ; LOAD ( 1 ) ; / / jump t o LB ( 2 6 ) ; a c c [ i ] <= c s [ i ]
/ / MUTIPLY
cLOAD ( 2 4 ) ; LOAD ( 0 ) ; / / a c c <= v s a ; a c c [ i ] <= mem [ 0 ] [ i ] = m u l t i p l i e r
cNOP ; CAMULT; / / a c c [ i ] <= a c c [ i ] v s
cLOAD ( 2 5 ) ; STORE ( 0 ) ; / / a c c <= l s a ; mem [ 0 ] [ i ] <= p r o d u c t s
/ / ADD LINES
cLOAD ( 2 2 ) ; CALOAD; / / a c c <= N ; a c c [ i ] <= l s [ i ]
cVSUB ( 1 ) ; STORE ( 1 ) ; / / a c c <= N1; mem [ 1 ] [ i ] <= l i n e i n d e x e s
cNOP ; SRCALL ; / / a c c <= N2; s e a r c h N1 i n l s

LB ( 2 7 ) ; cVSUB ( 1 ) ; LOAD ( 0 ) ; / / a c c <= acc 1; a c c [ i ] <= p r o d u c t s [ i ]


cCPUSHL ( 0 ) ; LOAD ( 1 ) ; / / s r <= { redAdd , s r } ; a c c [ i ] <= l i n e i n d e x e s
cBRNZ ( 2 7 ) ; SRCALL ; / / i f ( a c c ==0) jump t o LB ( 2 7 ) ; s e a r c h a c c i n a c c [ i ]

cNOP ; LOAD ( 0 ) ; / / a c c [ i ] <= l i n e i n d e x e s


cCPUSHL ( 0 ) ; ACTIVATE ; / / s r <= { redAdd , s r } ; a c t i v a t e a l l c e l l s
cNOP ; NOP ; / / latency step
cNOP ; NOP ; / / latency step
cNOP ; NOP ; / / latency step
/ / cNOP ; NOP ; // latencies
cLOAD ( 2 3 ) ; SRLOAD ; / / a c c <= r a ; a c c [ i ] <= s r [ i ]
cNOP ; CSTORE ; / / mem[ r a } [ i ] <= a c c [ i ]
//

Evaluation: TSpMV(N, P) = 8N + 0.5 log2(P) + 12, which belongs to O(N).
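The result computed by the function can be checked against the following illustrative C reference of a COO matrix-vector multiply (names are ours):

#include <stddef.h>

/* Illustrative reference of the sparse matrix-vector multiply on the COO
   representation; the padded tail entries (line index N) are skipped.      */
static void coo_mv(const int *vs, const int *ls, const int *cs,
                   size_t P, int N, const int *v, int *w)
{
    for (int i = 0; i < N; i++)
        w[i] = 0;
    for (size_t k = 0; k < P; k++)
        if (ls[k] < N)                     /* skip the padding entries      */
            w[ls[k]] += vs[k] * v[cs[k]];
}

/* For the example above this gives w = {39, 23, 16, 7}, the content of
   vect[15] after the run.                                                  */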

8.3.3 Sparse Matrices Multiplication

/*
FUNCTION NAME: Sparse matrix multiplication
AUTHOR: Gheorghe M. Stefan
DATE: Oct. 31, 2016

The function multiplies sparse NxN matrices stored in the vector memory. Each matrix is
represented by three vectors of P elements (P: number of cells), as follows:

fm: ((v v ... v 0 0 ...)      // non-zero values of the first matrix
     (x x ... x N N ...)      // line indexes of the non-zero values
     (y y ... y N N ...)      // column indexes of the non-zero values
    ) = ((fmv)(fml)(fmc))     // first matrix with F <= P elements
sm: ((v v ... v 0 0 ...)      // non-zero values of the second matrix
     (z z ... z N N ...)      // line indexes of the non-zero values
     (w w ... w N N ...)      // column indexes of the non-zero values
    ) = ((smv)(sml)(smc))     // second matrix with S <= P elements
rm: ((    ...       ...)      // non-zero values of the result matrix
     (N N ... N N N ...)      // line indexes of the non-zero values
     (N N ... N N N ...)      // column indexes of the non-zero values
    ) = ((rmv)(rml)(rmc))     // result matrix with R <= P elements

wm: (0 0 ... 0 0 0 ...)       // working matrix with the shape of sm

x, y, z, w < N

THE ALGORITHM
=================================================
initialize wm to zero
do
    select first non-zero column in sm
    take column index: c
    do
        select first non-zero scalar in column c
        take value: v
        remove first scalar
        take line index: l
        select column l in fm
        multiply, in wm, column l of fm with v
    loop until (no scalar in column)
    do
        select first non-empty line in wm
        take index: l
        compute redAdd: r
        store (r, l, c) in rm
        remove line l in wm
    loop until (no non-zero line in wm)
    clear first non-zero column in sm
loop until (no non-zero column in sm)
=================================================

INITIALIZATION is done by the following sequence in the controller's side of the code:

cVLOAD(N);       // acc <= N = size of matrices
cSTORE(22);      // mem[22] <= N
cVLOAD(fmv);     // acc <= fmv = address of the first matrix value vector
cSTORE(25);      // mem[25] <= fmv
cVLOAD(smc);     // acc <= smc = address of the second matrix column index vector
cSTORE(26);      // mem[26] <= smc
cVLOAD(wm);      // acc <= wm = address of the working vector
cSTORE(27);      // mem[27] <= wm
cVLOAD(rmv);     // acc <= rmv = address of the result matrix value vector
cSTORE(23);      // mem[23] <= rmv

EXAMPLE:
| 8 0 0 7 |     | 1 0 1 0 |     | 8  7 8 7 |
| 0 6 0 5 |     | 1 1 0 0 |     | 6 11 0 5 |
| 4 0 3 0 |     | 0 0 1 0 |     | 4  0 7 0 |
| 0 2 0 1 |  X  | 0 1 0 1 |  =  | 2  3 0 1 |

If the data memory of the controller is initialized by the sequence:

cVLOAD(4);
cSTORE(22);      // mem[22] <= 4 = N
cVLOAD(8);
cSTORE(25);      // mem[25] <= 8 = fmv
cVLOAD(13);
cSTORE(26);      // mem[26] <= 13 = smc
cVLOAD(14);
cSTORE(27);      // mem[27] <= 14 = wm
cVLOAD(15);
cSTORE(23);      // mem[23] <= 15 = rmv

and the content of the vector memory is initially:

first matrix:
vect[8]  = 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0
vect[9]  = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4
vect[10] = 0 3 1 3 0 2 1 3 4 4 4 4 4 4 4 4
second matrix:
vect[11] = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
vect[12] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
vect[13] = 0 2 0 1 2 1 3 4 4 4 4 4 4 4 4 4
working vector:
vect[14] = x x x x x x x x x x x x x x x x
space reserved for the result:
vect[15] = x x x x x x x x x x x x x x x x
vect[16] = x x x x x x x x x x x x x x x x
vect[17] = x x x x x x x x x x x x x x x x

at the end of running the function the memory becomes:

vect[8]  = 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0
vect[9]  = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4
vect[10] = 0 3 1 3 0 2 1 3 4 4 4 4 4 4 4 4
vect[11] = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
vect[12] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
vect[13] = 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
vect[14] = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
vect[15] = 8 6 4 2 7 11 3 8 7 7 5 1 x x x x
vect[16] = 0 1 2 3 0 1 3 0 2 0 1 3 4 4 4 4
vect[17] = 0 0 0 0 1 1 1 2 2 3 3 3 4 4 4 4
*/
cLOAD(27);   VLOAD(0);
cLOAD(22);   CSTORE;      // wm <= 0
cLOAD(23);   CLOAD;       // acc[i] <= N
cVADD(1);    NOP;
cVADD(1);    CSTORE;      // rml <= N
cNOP;        CSTORE;      // rmc <= N

cLOAD(26);   CSTORE;      // mem[14] <= wm
cNOP;        CALOAD;      // acc[i] <= smc
cNOP;        NOP;         // latency

LB(23); cNOP;       NOP;         // latency
cNOP;        NOP;         // latency
// cNOP;     NOP;         // latencies
cCLOAD(1);   NOP;         // acc <= min(smc)
cSTORE(1);   NOP;         // mem[1] <= c

LB(24); cLOAD(1);   NOP;         // acc <= c; acc[i] = smc
cLOAD(26);   SEARCH;      // acc[i] <= acc[i] - c
cVSUB(1);    NOP;         // select first non-zero column in sm
cNOP;        CALOAD;      // load line indexes on first column
cVSUB(1);    WHEREFIRST;  // select first non-zero scalar on column
cVADD(1);    CALOAD;      // latency; acc[first] <= first value on column
// cNOP;     NOP;         // latencies
cNOP;        NOP;         // latency
cLOAD(22);   CALOAD;      // latency; acc[i] <= line index in sm
cLOAD(26);   CLOAD;       // acc[i] <= N
cCLOAD(0);   CSTORE;      // acc <= first value on column; mem[13] <= N
cSTORE(2);   ACTIVATE;    // mem[2] <= v; select all cells
cCLOAD(0);   NOP;
cSTORE(3);   NOP;
cLOAD(25);   NOP;
cVADD(2);    NOP;
cLOAD(3);    CALOAD;      // acc <= line index in sm; acc[i] <= fmc
cLOAD(25);   SEARCH;      // select the column in fm
cLOAD(2);    CALOAD;      // acc[i] <= fmv
cLOAD(27);   CMULT;       // acc[i] <= acc[i] * v
cNOP;        CSTORE;      // store in wm
cLOAD(26);   ENDWHERE;    // acc <= c; activate all
cLOAD(1);    CALOAD;      // acc[i] <= smc
cNOP;        SEARCH;      // select the current column
// cNOP;     NOP;         // latencies
cNOP;        NOP;         // latency; activate all cells
cLOAD(26);   ENDWHERE;    // latency
cNOP;        NOP;         // latency
cCLOAD(3);   CALOAD;      // acc <= redFlag
cBRNZ(24);   NOP;         // loop if not zero
cLOAD(27);   NOP;         //
cNOP;        CALOAD;      //

LB(25); cLOAD(25);  NOP;         // acc[i] <= wm[i]
cVADD(1);    WHERENZERO;  // select where not zero
cNOP;        CALOAD;      // acc[i] <= fml[i]
cNOP;        NOP;         // latency
cNOP;        NOP;         // latency
cNOP;        NOP;         // latency
// cNOP;     NOP;         // latencies
cCLOAD(1);   NOP;         // acc <= l, first non-zero line
cSTORE(2);   SEARCH;      // mem[2] <= l; select the line to sum
cLOAD(27);   NOP;         //
cNOP;        CALOAD;      // load scalars on first line
cNOP;        VLOAD(0);    // acc[i] <= 0
cLOAD(23);   CSTORE;      // clear the content of the first line
cVADD(1);    ACTIVATE;    // all cells activated
// cNOP;     NOP;         // latencies
cCLOAD(0);   CALOAD;      // acc <= r; acc[i] <= rml
cSTORE(3);   NOP;         // mem[3] <= r; save result

cLOAD(22);   NOP;         // acc <= N
cLOAD(3);    SEARCH;      // acc <= r; select free space in rm
cNOP;        NOP;         // latency for first
cNOP;        WHEREFIRST;  //
cLOAD(23);   CINSERT;     // acc[first] <= r
cLOAD(2);    CSTORE;      // acc <= l; mem[15][first] <= r
cLOAD(23);   CINSERT;     // acc[first] <= l
cVADD(1);    NOP;         //
cLOAD(1);    CSTORE;      // acc <= c; mem[16][first] <= l
cLOAD(23);   CINSERT;     // acc[first] <= c
cVADD(2);    NOP;
cNOP;        CSTORE;      // mem[17][first] <= c
cLOAD(27);   ACTIVATE;    // all cells activated
cNOP;        CALOAD;      //
cNOP;        NOP;         // latency
cLOAD(27);   NOP;         // latency
cNOP;        CALOAD;      // latency
// cNOP;     NOP;         // latencies
cCLOAD(0);   NOP;
cBRNZ(25);   NOP;

cLOAD(26);   NOP;
cLOAD(22);   CALOAD;
cNOP;        CSUB;
// cNOP;     NOP;         // latencies
cNOP;        NOP;
cLOAD(26);   NOP;         // latency
cNOP;        NOP;         // latency
cCLOAD(0);   CALOAD;
cBRNZ(23);   NOP;         // branch; latency
//
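As a functional cross-check of the algorithm described in the header comment, the following minimal Python sketch follows the same column-by-column strategy: for each non-zero column c of sm, the selected columns of fm are scaled into a working vector, which is then reduced line by line into the result. The function name spmm_sparse and the use of plain Python lists in place of the vector memory are illustrative assumptions, not the accelerator's API.

def spmm_sparse(fmv, fml, fmc, smv, sml, smc, n):
    # Reference model of rm = fm x sm, all matrices in the three-vector
    # (values, line indexes, column indexes) representation used above.
    rmv, rml, rmc = [], [], []
    for c in sorted(set(smc)):
        if c >= n:                               # skip padding (index N)
            continue
        wm = [0] * len(fmv)                      # working vector, one slot per fm entry
        for v, l, cc in zip(smv, sml, smc):      # scalars of column c in sm
            if cc == c:
                for k, col in enumerate(fmc):    # column l of fm, scaled by v
                    if col == l:
                        wm[k] += fmv[k] * v
        lines = sorted(set(fml[k] for k in range(len(wm)) if wm[k] != 0))
        for line in lines:                       # redAdd per non-empty line
            r = sum(wm[k] for k in range(len(wm)) if fml[k] == line)
            rmv.append(r); rml.append(line); rmc.append(c)
    return rmv, rml, rmc

# Example data from above; the call reproduces vect[15], vect[16], vect[17].
fm = ([8, 7, 6, 5, 4, 3, 2, 1], [0, 0, 1, 1, 2, 2, 3, 3], [0, 3, 1, 3, 0, 2, 1, 3])
sm = ([1, 1, 1, 1, 1, 1, 1], [0, 0, 1, 1, 2, 3, 3], [0, 2, 0, 1, 2, 1, 3])
print(spmm_sparse(*fm, *sm, 4))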

The performance depends on:

C2: the number of non-zero columns in sm
M2: the number of non-zero elements in sm
R:  the number of non-zero elements in rm

and is bounded above by:

T_SparseMatrixMultiplication = 9 + C2(10 + log2(P)) + M2(22 + 1.5 log2(P)) + R(28 + 1.5 log2(P))
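As a quick worked example, a small Python helper (the name is ours) evaluates this bound for the 4x4 example above, where C2 = 4, M2 = 7, R = 12 and P = 16:

import math

def spmm_cycles_upper_bound(c2, m2, r, p):
    # Upper bound on the cycle count, as given by the formula above.
    lg = math.log2(p)
    return 9 + c2 * (10 + lg) + m2 * (22 + 1.5 * lg) + r * (28 + 1.5 * lg)

print(spmm_cycles_upper_bound(4, 7, 12, 16))   # -> 669.0 cycles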

9 Graphs
A possible weakness of the pRISC circuit is its choice of the simplest interconnection network between the cells in the
MAP section. In the worst case, the distance from one cell to another is in O(log P) (the depth of the reduction
network). The advantage is the small size of the pRISC circuit (S_pRISC ∈ O(P)) and the small inter-connectivity compared,
for example, with a hyper-cube interconnection organization. The ninth computational motif, graph traversal, is used to
prove that, despite the simplicity of the interconnection network, the pRISC-based hybrid computing version achieves the
same performance as the hyper-cube version of parallel engines which, according to [5], has a size in O(P log P).

9.1 Minimum Spanning Tree
The evaluation used Prim's algorithm for computing the minimum spanning tree, MST, of a graph with N vertices.
The main pRISC functions involved in providing an efficient algorithm are the vector-to-scalar functions (reduction-min,
reduction-add) and the spatial control functions (WHERE[COND], SEARCH, WHEREFIRST). The evaluation program provides for
dense graphs:
T_MST_Dense = (N-1)(20 + log2(P)) ∈ O(N log P)
while for sparse graphs:
T_MST_Sparse = 2(N-1) log2(P) + 31N - 24 ∈ O(N log P)
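The structure of the computation can be seen in a minimal Python sketch of Prim's algorithm over a dense adjacency matrix; each iteration is essentially one reduction-min over a vector of candidate distances (the part mapped on the reduction network), followed by a vector update in the cells. The function name and the use of float('inf') for missing edges are illustrative assumptions.

def prim_mst_weight(adj):
    # Prim's algorithm on a dense adjacency matrix; adj[i][j] is the edge
    # weight or float('inf') if there is no edge. Each iteration selects the
    # cheapest vertex outside the tree with one reduction-min over 'dist'.
    n = len(adj)
    in_tree = [False] * n
    dist = [float('inf')] * n
    dist[0] = 0
    total = 0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        total += dist[u]
        in_tree[u] = True
        for v in range(n):                      # map step: per-cell update
            if not in_tree[v] and adj[u][v] < dist[v]:
                dist[v] = adj[u][v]
    return total

INF = float('inf')
adj = [[INF, 2, 3, INF],
       [2, INF, 1, 4],
       [3, 1, INF, 5],
       [INF, 4, 5, INF]]
print(prim_mst_weight(adj))   # -> 7 (edges 1-2, 0-1, 1-3)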

9.2 All-Pairs Shortest Path
The N x N adjacency matrix A of graph G is used to compute the matrix of the shortest paths in G, A*, using the modified
matrix multiplication X ⊗ Y. If X and Y are matrices, then computing X ⊗ Y means substituting, in the matrix multiplication
algorithm, the scalar multiplication with addition and the reduction sum with reduction-min. The algorithm is:
A* = A ⊗ A ⊗ ... ⊗ A    ((N-1) times)
The algorithm is not the optimal one, but it is used in systems which perform the (modified) matrix
multiplication efficiently. The time for computing A* from A is T_APSP ∈ O(N^2 log P).
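A minimal Python sketch of the modified matrix multiplication (addition in place of scalar multiplication, reduction-min in place of the reduction sum) and of the repeated product computing A*; the names and the use of float('inf') for "no edge" are illustrative assumptions, and the diagonal of A is assumed to be zero.

def mod_mult(x, y):
    # Modified matrix product: '+' plays the role of '*',
    # 'min' plays the role of the reduction sum.
    n = len(x)
    return [[min(x[i][k] + y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_shortest_paths(a):
    # A* obtained as the modified product of (N-1) copies of A,
    # i.e. (N-2) product operations applied to A (diagonal assumed 0).
    result = a
    for _ in range(len(a) - 2):
        result = mod_mult(result, a)
    return result

INF = float('inf')
A = [[0, 5, INF],
     [5, 0, 2],
     [INF, 2, 0]]
print(all_pairs_shortest_paths(A))   # A*[0][2] == 7, the path through vertex 1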

9.3 Breadth-First Search
The breadth-first search algorithm uses mainly the same specific functions as the minimum spanning tree algorithm (only,
instead of reduction-min, reduction-max is used). The simulation program provides for dense graphs:

T_BFS_Dense ≈ (N-1) log2(P) + 33N - 17 ∈ O(N log P)

while for sparse graphs:

T_BFS_Sparse ≈ 3.5(N-1) log2(P) + 85N - 66 ∈ O(N log P)

The log2(P) component in T_BFS, as in T_MST, is due to the latency introduced by the log-depth reduction circuit. Similarly, in
hyper-cube engines the log term is due to the log-depth interconnection network.
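A minimal Python sketch of the level-synchronous breadth-first search underlying this evaluation: each outer iteration expands the current frontier with a vector-style pass over the adjacency matrix and updates the per-vertex level, the step that the accelerator performs with vector operations and one reduction. The dense 0/1 adjacency representation and the function name are illustrative assumptions.

def bfs_levels(adj, source):
    # Level-synchronous BFS on a dense 0/1 adjacency matrix.
    # Returns the BFS level of every vertex (-1 if unreachable).
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:                    # expand the current frontier
            for v in range(n):
                if adj[u][v] and level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

adj = [[0, 1, 1, 0],
       [1, 0, 0, 1],
       [1, 0, 0, 0],
       [0, 1, 0, 0]]
print(bfs_levels(adj, 0))   # -> [0, 1, 1, 2]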

Part III
UPGRADES
Envisaged versions:

- Stack-based engines instead of accumulator-based engines.

- Register-File-based engines instead of accumulator-based engines.

- ...

References
[1] Krste Asanovic, et al., The landscape of parallel computing research: A view from Berkeley, 2006.
See at: www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
[2] John Backus, Can programming be liberated from the von Neumann style? A functional style and its algebra of
programs. Communications of the ACM 21, 8 (August) 1978. 613-641.
[3] Calin Bira, R. Hobincu, Lucian Petrica, OPINCAA: A Light-Weight and Flexible Programming Environment For
Parallel SIMD Accelerators, Romanian Journal of Information Science and Technology, Volume 16, Number 4, 2013,
336-350.

[4] Stephen Kleene, General recursive functions of natural numbers. Mathematische Annalen 112, 5, 1936. 727-742.

[5] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing. Design and Analysis of Algorithms,
The Benjamin/Cummings Pub. Comp., Inc., 1994.

[6] Mihaela Malita, Gheorghe M. Stefan, Dominique Thiebaut, Not Multi-, but Many-Core: Designing Integral Parallel
Architectures for Embedded Computation. ACM SIGARCH Computer Architecture News, Vol. 35, No. 5, December
2007. 32-39.
[7] Mihaela Malita, and Gheorghe M. Stefan, Backus language for functional nano-devices. CAS 2011, vol. 2, 331-334.

[8] Gheorghe M. Stefan, et al., The CA1024: A fully programmable system-on-chip for cost-effective HDTV media
processing. Hot Chips: A Symposium on High Performance Chips. Memorial Auditorium, Stanford University.

[9] Gheorghe M. Stefan, One-chip TeraArchitecture. Proceedings of the 8th Applications and Principles of Information
Science Conference. Okinawa, Japan, 2009.
See at: www.dropbox.com/s/5oqncu71t7zf8es/teraArchitecture.pdf?dl=
[10] Gheorghe M. Stefan, Integral parallel architecture in system-on-chip designs. The 6th International Workshop on
Unique Chips and Systems, Atlanta, GA, USA, December 4, 2010, pp. 23-26.
[11] Gheorghe M. Stefan, Mihaela Malita, Can One-Chip Parallel Computing Be Liberated From Ad Hoc Solutions? A
Computation Model Based Approach and Its Implementation, 18th Inter. Conf. on Circuits, Systems, Communications
and Computers, Santorini, July 17-21, 2014, 582-597.
See at: www.dropbox.com/s/rtzzs1d06526jzj/COMPUTERS2-42.pdf?dl=0
[12] Gheorghe Stefan, Loops & Complexity in DIGITAL SYSTEMS. Lecture Notes on Digital Design in Giga-Gate/Chip
Era, (work in endless progress) 2016 version.
See at: www.dropbox.com/s/neooi2cca5y8lxa/0-BOOK.pdf?dl=0

