Architectural optimization techniques
Camille Diou
diou@univ-metz.fr
Camille DIOU, Master EAII, Sp. RSEE

1 Microprocessor basics
Tristate components (inputs/outputs)
[Figure: CONTROLLER (state machine) and DATAPATH. The datapath contains a BUS, a register file (t1, t2, t3, A, B, C) and an Arithmetic and Logic Unit (ALU).]

Computation example: S = Ax² + By + C
CONTROLLER program, one register transfer per state:
  t1 <- x
  t2 <- y
  t3 <- A.t1
  t3 <- t3.t1
  t2 <- B.t2
  t3 <- t2+t3
  out <- t3+C
DATAPATH: registers t1, t2, t3, A, B, C and the ALU.

Execution trace of S = Ax² + By + C on the datapath:
#CYCLES  Transfer       ALU function  ALU operands  Value on the bus
1        t1 <- x        (none)        (none)        x
2        t2 <- y        (none)        (none)        y
3        t3 <- A.t1     X (multiply)  t1, A         A.t1
4        t3 <- t3.t1    X (multiply)  t1, t3        t3.t1
5        t2 <- B.t2     X (multiply)  B, t2         B.t2
6        t3 <- t2+t3    +             t2, t3        t2+t3
7        out <- t3+C    +             t3, C         t3+C = S
Total: 7 cycles.

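The seven-cycle schedule can be replayed in software. The sketch below is a minimal Python model, assuming one register transfer per clock cycle; the function name `run_datapath` and the tuple encoding of each step are illustrative choices, not part of the original material.

```python
def run_datapath(A, B, C, x, y):
    """Replay the register-transfer program for S = A*x^2 + B*y + C."""
    regs = {"t1": 0, "t2": 0, "t3": 0, "A": A, "B": B, "C": C}
    inputs = {"x": x, "y": y}
    # (destination, operation, source 1, source 2): one entry per clock cycle.
    program = [
        ("t1", "load", "x", None),
        ("t2", "load", "y", None),
        ("t3", "mul", "A", "t1"),
        ("t3", "mul", "t3", "t1"),
        ("t2", "mul", "B", "t2"),
        ("t3", "add", "t2", "t3"),
        ("out", "add", "t3", "C"),
    ]
    cycles = 0
    for dst, op, a, b in program:
        if op == "load":
            value = inputs[a]            # value arrives on the bus
        elif op == "mul":
            value = regs[a] * regs[b]    # ALU function X
        else:
            value = regs[a] + regs[b]    # ALU function +
        regs[dst] = value
        cycles += 1
    return regs["out"], cycles


result, cycles = run_datapath(A=2, B=3, C=4, x=5, y=6)
print(result, cycles)  # 72 7
```

Because the single ALU performs one operation per cycle, the cycle count equals the number of transfers in the program.
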
Execution principle
START -> Fetch cycle (fetch next instruction) -> Execute cycle (execute instruction) -> back to the fetch cycle, or HALT

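A single accumulator machine
MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register
ACC: accumulator
[Figure: the memory (16 bits wide, 16M words) is addressed through the MAR, which is loaded either from the PC (instruction path, with incrementer and branch input) or from the address operand of the IR. The load path brings memory data to ALU input B; ACC feeds ALU input A and receives the ALU output S; the store path writes ACC back to memory. The FSM receives the opcode and drives the control signals (ALU function, LD). Data paths are n bits wide, address paths m bits wide.]
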
Single-address instruction: one of the registers is fixed (the accumulator)
AC is an implicit operand
AC := AC <operation> Memory(Address)
Instruction (16 bits):
  bits 15-14: Opcode
  bits 13-0:  Address
Opcode:
  00: Load
  01: Store
  10: Add
  11: Branch

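The instruction layout can be checked with a few lines of Python. This is a minimal sketch, assuming the 2-bit opcode sits in bits 15-14 and the 14-bit address in bits 13-0; the names `encode`, `decode` and `OPCODES` are hypothetical helpers introduced for illustration. The example word is the one that appears in the execution walk-through.

```python
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}


def encode(opcode, address):
    """Pack a 2-bit opcode and a 14-bit address into one 16-bit word."""
    assert 0 <= opcode < 4 and 0 <= address < (1 << 14)
    return (opcode << 14) | address


def decode(word):
    """Split a 16-bit instruction word into (mnemonic, address)."""
    return OPCODES[(word >> 14) & 0b11], word & 0x3FFF


word = encode(0b10, 0b00110100110011)
print(format(word, "016b"))  # 1000110100110011
print(decode(word))          # ('Add', 3379)
```
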
Example: executing an Add instruction
Path widths: data 16 bits, address 14 bits, opcode 2 bits. Memory: 16 bits wide (the figure is labelled 16M words, although a 14-bit address reaches 2^14 = 16K words).
1. Instruction fetch:
   - PC is moved into MAR                    PC = MAR = 10110100110011
   - Read from memory                        Memory(MAR) = 1000110100110011
   - Load instruction into IR                IR = 1000110100110011
2. Instruction decode:
   - Opcode bits to FSM (ADD)                opcode = 10
   - The remaining bits are the operand address   address = 00110100110011
3. Operand fetch:
   - IR<address> -> MAR                      MAR = 00110100110011
   - Read data from memory                   Memory(MAR) = 0101010101110001
4. Instruction execute:
   - Memory to ALU B                         B = 0101010101110001
   - AC to ALU A                             A = 0011001101110110
   - ALU Add                                 S = 1000100011100111
   - S to AC                                 ACC = 1000100011100111
5. Housekeeping:
   - Increment PC                            PC = 10110100110100

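The five steps can be condensed into a small interpreter. The following Python sketch is illustrative only: the function name `step`, the dictionary used as memory, and the treatment of Branch as an unconditional jump are assumptions, since the slides do not specify the branch condition. It reproduces the binary values of the walk-through.

```python
def step(memory, pc, acc):
    """Run one fetch/decode/execute cycle; return the new (pc, acc)."""
    mar = pc                                   # 1. fetch: PC -> MAR
    ir = memory[mar]                           #    memory -> IR
    opcode, address = ir >> 14, ir & 0x3FFF    # 2. decode: opcode to FSM
    next_pc = (pc + 1) & 0x3FFF                # 5. housekeeping: increment PC
    if opcode == 0b00:                         # Load
        acc = memory[address]
    elif opcode == 0b01:                       # Store
        memory[address] = acc
    elif opcode == 0b10:                       # Add: 3. operand fetch, 4. execute
        acc = (acc + memory[address]) & 0xFFFF
    else:                                      # Branch (assumed unconditional)
        next_pc = address
    return next_pc, acc


pc = 0b10110100110011
memory = {pc: 0b1000110100110011, 0b00110100110011: 0b0101010101110001}
pc, acc = step(memory, pc, acc=0b0011001101110110)
print(format(acc, "016b"), format(pc, "014b"))
# 1000100011100111 10110100110100
```
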
A simple microprocessor: Architecture
[Figure: datapath built around 16x16 registers, with status signals going to the controller (FSM), data to/from memory and address to memory.]
A simple microprocessor: Instruction format
[Figure: instruction format fields and the instruction/action table; contents not recoverable from the extracted text.]

A simple microprocessor: test program
What will it do?
0000 7C0A ;
0001 8C00 ; LOAD RC, #A
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...
...

Compiler dependency detection for ILP
• Detect data dependencies at compile time
  – examples:
    c[i]=a[i]+b[i];
    d[i]=a[i]+c[j];     potential dependency: c[i] might be c[j]

    c[1]=a[i]+b[i];
    d[i]=a[i]+c[2];     no dependency: c[1] is never c[2]

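The two cases can be told apart mechanically. The sketch below is a deliberately simplified Python illustration, not a real compiler analysis: indices are modelled as either integer constants or symbolic variable names, and the function name `classify_dependency` is hypothetical.

```python
def classify_dependency(write_index, read_index):
    """Compare the index written by one statement with the index read by the next.

    Indices are ints (compile-time constants) or strings (symbolic variables).
    """
    if write_index == read_index:
        return "dependency"            # same element for sure
    if isinstance(write_index, int) and isinstance(read_index, int):
        return "no dependency"         # two different constants never alias
    return "potential dependency"      # cannot be decided at compile time


print(classify_dependency("i", "j"))   # potential dependency
print(classify_dependency(1, 2))       # no dependency
```
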
2 Systolic ring
Reconfigurable computing: Instruction level parallelism (ILP)
• Superscalar processors must find the dataflow graph at run time
• Reconfigurable architectures construct the dataflow graph at compile time
• No FU limitations
• No control logic overhead
• No window size limitations

Reconfigurable computing: Instruction level parallelism (ILP)
• General Purpose Computer: sequential instruction stream
    add r1, r2, r4
    add r1, r3, r5
    sub r3, r2, r6
    add r4, r5, r1
    add r5, r6, r2
• RC scheme: [Figure: the same operations laid out as a dataflow graph, with inputs r1, r2, r3, intermediate values r4, r5, r6 and outputs r1, r2.]
Question: what is the advantage of RC over superscalar?
Answer: the dataflow graph is constructed at compile time, so there is no run-time overhead.

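Building the dataflow graph ahead of time amounts to computing, for each instruction, the earliest step at which its operands are ready. The Python sketch below illustrates this under two assumptions that are not stated in the slides: the operand order is (source, source, destination), and only true read-after-write dependencies matter. The function name `dataflow_levels` is hypothetical.

```python
def dataflow_levels(program):
    """Return, for each instruction, the earliest step at which it can run.

    Only true (read-after-write) dependencies are tracked, as in a dataflow graph.
    """
    producer_level = {}   # register -> level of the instruction that last wrote it
    levels = []
    for op, src1, src2, dst in program:
        level = 1 + max(producer_level.get(src1, -1), producer_level.get(src2, -1))
        levels.append(level)
        producer_level[dst] = level
    return levels


program = [
    ("add", "r1", "r2", "r4"),
    ("add", "r1", "r3", "r5"),
    ("sub", "r3", "r2", "r6"),
    ("add", "r4", "r5", "r1"),
    ("add", "r5", "r6", "r2"),
]
print(dataflow_levels(program))  # [0, 0, 0, 1, 1]
```

Under these assumptions the five instructions collapse into two levels: three independent operations followed by two operations that consume their results.
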
Reconfigurable computing: Why now?
• Increasing number of transistors
• Complexity and cost of chip design increase fast
• Current computing demands are RC friendly:
  desktop and embedded demands are driven NOT by Word or Excel but by multimedia, encryption and filters (dataflow-oriented applications)

Reconfigurable architectures (RA) versus microprocessors
• RA are less flexible (like a VLIW with fixed instructions)
but
• RA provide more (customized) computation elements
• RA can decrease memory traffic
• RA can be tailored for specific algorithms and data types
RA will not replace µP, but complement them

Systolic computing: definition
• A set of simple processing elements with regular and local connections which takes external inputs and processes them in a predetermined manner in a determined fashion
H.T. Kung

Systolic computing: characteristics of best RC design
• Simple PE
• Regular and local interconnect
• Pipeline between PEs
• I/O at boundary

Coarse grain RA model
In abstract: instructions configure both PE and interconnect every cycle
In reality: the required instruction bandwidth / memory is too high, so… COMPROMISE

Communications…
Relationship of communication among processors:
• Shared clock (pipelined)
• Shared registers (VLIW)
• Shared memory (SMM)
• Shared network

Reconfigurable computing
[Figure: a program larger than the actual available hardware; some instructions are currently in hardware, the others are paged out.]

Finite Impulse Response filter (FIR)
$$y(n)=\sum_{i=0}^{N-1} a(i)\,x(n-i-1)$$
[Figure: delay line of $z^{-1}$ elements fed by $x_n$; the taps, labelled $a_0, a_1, \dots, a_{N-1}, a_N$, are summed into $y_n$.]
3-coefficient filter:
$$y(n)=a_0\,x(n-1)+a_1\,x(n-2)+a_2\,x(n-3)$$
[Figure: the same structure with three $z^{-1}$ elements and taps $a_0$, $a_1$, $a_2$.]

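As a reference point for the hardware mappings that follow, the FIR sum can be evaluated directly. This Python sketch assumes samples before the start of the sequence are zero; the function name `fir` is illustrative.

```python
def fir(coeffs, samples):
    """y(n) = sum_{i=0}^{N-1} a(i) * x(n-i-1), with x(k) = 0 outside the sequence."""
    out = []
    for n in range(len(samples) + 1):
        acc = 0
        for i, a in enumerate(coeffs):
            k = n - i - 1
            if 0 <= k < len(samples):
                acc += a * samples[k]
        out.append(acc)
    return out


print(fir([1, 2, 3], [1, 0, 0, 0]))  # [0, 1, 2, 3, 0]
```

Feeding a unit impulse returns the coefficients themselves, delayed by one sample because of the $x(n-i-1)$ indexing.
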
Systolic FIR implementation (MAC unit)
[Figure sequence: systolic FIR built from MAC units; the diagrams are not recoverable from the extracted text.]
Successive optimizations shown in the figures:
• Optimize outer loop, preload repeated value
• Optimize outer loop, broadcast common value
• Optimize outer loop, retime to eliminate broadcast

The Systolic Ring
• Coarse grain architecture
• Multi-mode dynamic reconfiguration
• Scalable, two-dimensional array
• VHDL design
• Designed for SoC integration

Dnode: word-level processing unit
Constitution
• Optimized datapath (16 bits)
• Register file (4 x 16 bits)
• Hardwired ALU and multiplier
Features
• Complex computations in local mode (FIR, IIR, WT…)
• Low silicon area (0.07 mm², 0.18 µm CMOS process)
• Single-cycle operations (e.g. MAC + register load)
[Figure: a microinstruction (µinst.) drives the register file and the ALU + MULT block.]

Local controller: dynamic reconfiguration at the Dnode level
Constitution
• 8 configuration registers (reg0 to reg7)
• 3 different run modes
• 1 programming mode
Dnode operand sources: In(1,2), fifo(1,2), bus, Rp(i,j) (i=1~4, j=1~2)
[Figure: the controller and decoder (signals ck, mode, wait, inhib, enex) select one of the configuration registers reg0 to reg7 as the microinstruction (µinst.) that drives the input multiplexers, the register file and the ALU + MULT block; output: out.]

Programming mode
[Figure sequence: at successive clock cycles, Instruction 0, Instruction 1, Instruction 2 and Instruction 3 are written into the configuration registers.]
Run-mode 1: Fixed
[Figure sequence: Instruction 0 (reg0) is applied at every clock cycle.]
Run-mode 2: Dynamic (one-time or loop)
[Figure sequence: Instruction 1, Instruction 2 and Instruction 3 are applied on successive clock cycles.]

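The behaviour of the local controller can be mimicked with a toy model. The Python sketch below is an assumption-laden illustration: the class `LocalController`, its method names, and the way the dynamic mode walks through a range of registers are all hypothetical, since the slides only show the figures and do not give the exact sequencing rules.

```python
class LocalController:
    """Toy model of a Dnode local controller with 8 configuration registers."""

    def __init__(self):
        self.regs = [None] * 8

    def program(self, instructions):
        """Programming mode: write one microinstruction per register."""
        assert len(instructions) <= 8
        for i, inst in enumerate(instructions):
            self.regs[i] = inst

    def run_fixed(self, cycles, index=0):
        """Fixed mode: the same register is applied at every cycle."""
        return [self.regs[index]] * cycles

    def run_dynamic(self, cycles, first, last, loop=True):
        """Dynamic mode: step through regs[first..last], once or in a loop."""
        trace, i = [], first
        for _ in range(cycles):
            if i > last:
                if not loop:
                    break
                i = first
            trace.append(self.regs[i])
            i += 1
        return trace


ctrl = LocalController()
ctrl.program(["inst0", "inst1", "inst2", "inst3"])
print(ctrl.run_fixed(3))          # ['inst0', 'inst0', 'inst0']
print(ctrl.run_dynamic(5, 1, 3))  # ['inst1', 'inst2', 'inst3', 'inst1', 'inst2']
```
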
Array structure
• Unidirectional communications between neighbours: scalable
• Hard to implement a datapath with a greater pipeline depth than the array: use of a ring structure
• Hard to implement recursive operations: use of a bi-dataflow structure (forward dataflow and reverse dataflow)
[Figure: configurable processing blocks separated by switches; the main dataflow is unidirectional, from inputs to outputs; BUS: shared resource.]

Systolic Ring architecture: forward dataflow
• No complex data routing problems (crossbars…)
• Unidirectional data transfers between adjacent layers (pipeline)
• Performance increases linearly with the number of Dnodes
• Peak performance: 3200 MIPS at 200 MHz for a 16-Dnode realization
Switch components:
• Direct FIFO connection for data injection
• BUS connection for RISC communication
• Full connectivity between 2 Dnode layers
Dnode modes:
• Local mode: stand-alone
• Global mode: FPGA-like
[Figure: ring of Dnode layers (layer n-1, layer n, layer n+1) separated by switches with I/O ports; the configuration controller sits at the centre; the forward dataflow runs around the ring.]

Feedback pipelines: reverse dataflow
• Each switch writes computed data in its own feedback pipeline
• Each switch has read ports on the other switches' pipelines
• Easy implementation of various recursive algorithms (IIR, WT…)
[Figure sequence: ring of D-Nodes and switches showing data travelling back through the feedback pipelines.]

Two-level dynamically reconfigurable architecture:
• Global mode (first level)
  1. The program which manages the configuration runs on the RISC processor
  2. The configuration of an entire cluster can be modified at each clock cycle
  3. The operating layer computes the data coming from the host processor
• Local mode (second level)
  Each Dnode runs its own program of up to 8 instructions
[Figure: (1) the configuration controller executes the management code held in RAM; (2) it writes the configuration layer; (3) the operating layer of Dnodes (register file, ALU + MULT) processes the data exchanged with the host µP.]

8 Dnodes version…
• ST* CMOS process, 0.25 µm and 0.18 µm
Features:
• Parametrizable core (number of Dnodes)
• Good performance / cost tradeoff (Ring-8 at 200 MHz Systolic Ring system):
  • 1600 MIPS (PII at 450 MHz: 400 MIPS)
  • 3 Gb/s bandwidth
           Ring-8     Ring-8     Dnode
Process    0.25 µm    0.18 µm    0.18 µm
Area       0.9 mm²    0.7 mm²    0.04 mm²
Frequency  150 MHz    200 MHz    200 MHz
• Low Dnode area: possible to realize 128-Dnode versions…
• Suited as an IP core for SoC
*: ST: STMicroelectronics

Assembly-level programming
Each line holds a RISC instruction (r:), a layer selection (M1, M2) and the Dnode instructions (N1, N2):
0000 r:ldl(0,8) M1: N1:clr N2:clr
0001 r:ldl(1,2) M2: N1:clr N2:clr
0002 r:dec(0,0) M1: N1:add(fifo1,fifo1) N2:sub(fifo1,fifo1)
0003 r:jnz(1)   M2: N1:mac(in1) N2:mac(in2)
0004 r:halt
[Figure: tool flow. The assembler produces File1.bin, loaded in the RAM of the FPGA prototype, and File2.m, loaded in the RAM of the Ring-8 simulator testbench.]

FIR filter: edge detection
Convolution mask: [ -1 1 0 ], i.e. y(n) = x(n) - x(n-1)
Assembly code:
0000 r:ldl(0,1) M1: N1:rst N2:rst
0001 r:jmp(0)   M1: N2:sub(fifo,fifo)
[Figure: tool flow (assembler, simulator testbench, File2.m, RAM, Ring-8), timing diagrams, input image and output image.]

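The effect of this mask is easy to see on a single image row. The following Python sketch assumes the sample before the first pixel is zero; the function name `edge_detect` and the pixel values are illustrative.

```python
def edge_detect(row):
    """Apply y(n) = x(n) - x(n-1) along one image row, taking x(-1) = 0."""
    previous, out = 0, []
    for pixel in row:
        out.append(pixel - previous)
        previous = pixel
    return out


print(edge_detect([10, 10, 10, 200, 200, 10]))  # [10, 0, 0, 190, 0, -190]
```

Flat regions give zero, while each intensity jump gives a large positive or negative value, which is what marks the edges in the output image.
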
Polynomial calculus
• P(x) = a.x + b.x² + c.x³
Register transfers on one Dnode (register file reg0 to reg2, ALU + MULT, accumulator ACC):
  x → reg0               /* load reg0,x */
  x.x → reg1             /* load reg1,x² */
  reg0.reg1 → reg2       /* load reg2,x³ */
  a.reg0 → ACC           /* load ACC,a.x */
  b.reg1 + ACC → ACC     /* load ACC,a.x+b.x² */
  c.reg2 + ACC → ACC     /* load ACC,a.x+b.x²+c.x³ */
[Figure sequence: the register file successively holds x, x², x³; the coefficients a, b, c enter the ALU + MULT block one after the other while ACC accumulates a.x, then a.x+b.x², then a.x+b.x²+c.x³.]

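The sequence of transfers can be transcribed line by line. This Python sketch mirrors the register names of the slide; the function name `polynomial` is an illustrative choice.

```python
def polynomial(a, b, c, x):
    """Evaluate P(x) = a.x + b.x^2 + c.x^3 with the register transfers of the slide."""
    reg0 = x                  # x -> reg0
    reg1 = x * x              # x.x -> reg1
    reg2 = reg0 * reg1        # reg0.reg1 -> reg2
    acc = a * reg0            # a.reg0 -> ACC
    acc = b * reg1 + acc      # b.reg1 + ACC -> ACC (one MAC)
    acc = c * reg2 + acc      # c.reg2 + ACC -> ACC (one MAC)
    return acc


print(polynomial(1, 2, 3, 2))  # 34
```
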
FIR filter:
$$y(n)=\sum_{i=0}^{N-1} a_i\,x(n-i-1)$$
[Figure: two equivalent structures, one with the delay line of $z^{-1}$ elements on the input side and one with the delays between the taps $a_0, a_1, a_2, \dots, a_{N-1}, a_N$.]
3-coefficient case:
$$y(n)=a_0\,x(n-1)+a_1\,x(n-2)+a_2\,x(n-3)$$

FIR implementation
Use of a 3 Dnodes / layer architecture
• Pipelined implementation
• Samples are injected through dedicated lines
• Coefficients loaded during the first cycle
Each Dnode performs a MAC and passes its result to the next Dnode through the feedback path:
Cycle  Injected sample  Dnode 1 (a2)  Dnode 2 (a1)    Dnode 3 (a0)
1      x0               coefficients a2, a1, a0 loaded
2      x1               a2.x0
3      x2               a2.x1         a2.x0+a1.x1
4      x3               a2.x2         a2.x1+a1.x2     a2.x0+a1.x1+a0.x2
5      x4               a2.x3         a2.x2+a1.x3     a2.x1+a1.x2+a0.x3
With the next sample, x5, the three values become a2.x4, a2.x3+a1.x4 and a2.x2+a1.x3+a0.x4.
OPTIMAL IMPLEMENTATION: 1 SAMPLE / CYCLE

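The pipelined behaviour can be checked against a direct evaluation of the filter. The Python sketch below models three chained MAC stages that all receive the same injected sample; the function name `systolic_fir` and the register names `r0`, `r1` are hypothetical, and the model ignores the initial coefficient-loading cycle.

```python
def systolic_fir(a0, a1, a2, samples):
    """Three chained MAC stages; every stage sees the same injected sample."""
    r0 = r1 = 0          # feedback values passed from one Dnode to the next
    outputs = []
    for x in samples:
        y = r1 + a0 * x                # third Dnode completes one output
        r0, r1 = a2 * x, r0 + a1 * x   # first and second Dnodes (old r0 is used)
        outputs.append(y)
    return outputs


print(systolic_fir(1, 2, 3, [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

Once the pipeline is filled, each injected sample yields one finished output of the form a2.x(k-2)+a1.x(k-1)+a0.x(k), which is the one-sample-per-cycle behaviour claimed above.
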
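6-coefficient filter on two layers:
$$y(n)=a_0\,x(n-1)+a_1\,x(n-2)+a_2\,x(n-3)+a_3\,x(n-4)+a_4\,x(n-5)+a_5\,x(n-6)$$
[Figure: a first layer of MAC Dnodes holding a2, a1, a0 and a second layer holding a5, a4, a3, both fed by xn and chained through the inter-layer feedback to produce yn.]
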
Discrete Cosine Transform
• Usually a two-dimensional 8x8-point DCT
• Very demanding algorithm…
Compression: original image -> DCT -> DCT coefficients -> quantization -> quantized coefficients -> coding -> compressed image
Decompression: compressed image -> decoding -> inverse quantization -> iDCT -> decompressed image

DCT algorithm
Direct transform:
$$z_k=\sqrt{\frac{2}{N}}\,\alpha(k)\sum_{n=0}^{N-1}x_n\cos\left(\frac{\pi(2n+1)k}{2N}\right),\qquad k=0,1,\dots,N-1$$
Inverse transform:
$$x_n=\sqrt{\frac{2}{N}}\sum_{k=0}^{N-1}\alpha(k)\,z_k\cos\left(\frac{\pi(2n+1)k}{2N}\right),\qquad n=0,1,\dots,N-1$$
with
$$\alpha(k)=\begin{cases}1/\sqrt{2} & \text{for } k=0\\ 1 & \text{otherwise}\end{cases}$$

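A quick numerical check confirms that the direct and inverse transforms undo each other. The Python sketch below assumes the normalization factor is the square root of 2/N, which is the value for which the two formulas are exact inverses with this definition of α(k); the function names `alpha`, `dct` and `idct` are illustrative.

```python
import math


def alpha(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """Direct transform, evaluated term by term from the definition."""
    N = len(x)
    return [math.sqrt(2 / N) * alpha(k) *
            sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
            for k in range(N)]


def idct(z):
    """Inverse transform, evaluated term by term from the definition."""
    N = len(z)
    return [math.sqrt(2 / N) *
            sum(alpha(k) * z[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
            for n in range(N)]


x = [1, 2, 3, 4, 5, 6, 7, 8]
restored = idct(dct(x))
print(all(abs(u - v) < 1e-9 for u, v in zip(x, restored)))  # True
```
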
Image
• 64x64 points
• 8x8-pixel blocks
• 16-bit coded image
The initial 64x64 image is split into 64 blocks of 8x8 pixels, each block being
$$\begin{bmatrix}x_{0,0}&x_{0,1}&\cdots&x_{0,7}\\x_{1,0}&x_{1,1}&&\vdots\\\vdots&&\ddots&\vdots\\x_{7,0}&\cdots&\cdots&x_{7,7}\end{bmatrix}$$

Implementation
• Matrix implementation: $z=\sqrt{\frac{2}{N}}\,T(N)\,x$
• Even / odd frequency decomposition of the DCT algorithm
$$\begin{bmatrix}z_0\\z_2\\z_4\\z_6\end{bmatrix}=\begin{bmatrix}1&1&1&1\\\beta&\delta&-\delta&-\beta\\\alpha&-\alpha&-\alpha&\alpha\\\delta&-\beta&\beta&-\delta\end{bmatrix}\begin{bmatrix}x_0+x_7\\x_1+x_6\\x_2+x_5\\x_3+x_4\end{bmatrix}$$
with $\alpha=\cos(\pi/4)$, $\beta=\cos(\pi/8)$, $\delta=\sin(\pi/8)$
$$\begin{bmatrix}z_1\\z_3\\z_5\\z_7\end{bmatrix}=\begin{bmatrix}\lambda&\gamma&\mu&\nu\\\gamma&-\nu&-\lambda&-\mu\\\mu&-\lambda&\nu&\gamma\\\nu&-\mu&\gamma&-\lambda\end{bmatrix}\begin{bmatrix}x_0-x_7\\x_1-x_6\\x_2-x_5\\x_3-x_4\end{bmatrix}$$
with $\lambda=\cos(\pi/16)$, $\gamma=\cos(3\pi/16)$, $\mu=\sin(3\pi/16)$, $\nu=\sin(\pi/16)$

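The even/odd decomposition can be verified numerically against the plain cosine sums. The Python sketch below works with unscaled sums (no normalization factor and no α(k) weighting), which is how the two 4x4 matrices are written; the function names `dct8_direct` and `dct8_even_odd` are hypothetical.

```python
import math


def dct8_direct(x):
    """Unscaled 8-point sums: z_k = sum_n x_n cos(pi (2n+1) k / 16)."""
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / 16) for n in range(8))
            for k in range(8)]


def dct8_even_odd(x):
    al, be, de = math.cos(math.pi / 4), math.cos(math.pi / 8), math.sin(math.pi / 8)
    la, ga = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
    mu, nu = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)
    s = [x[n] + x[7 - n] for n in range(4)]   # sums feeding the even outputs
    d = [x[n] - x[7 - n] for n in range(4)]   # differences feeding the odd outputs
    even = [[1, 1, 1, 1], [be, de, -de, -be], [al, -al, -al, al], [de, -be, be, -de]]
    odd = [[la, ga, mu, nu], [ga, -nu, -la, -mu], [mu, -la, nu, ga], [nu, -mu, ga, -la]]
    z = [0.0] * 8
    for r in range(4):
        z[2 * r] = sum(c * v for c, v in zip(even[r], s))      # z0, z2, z4, z6
        z[2 * r + 1] = sum(c * v for c, v in zip(odd[r], d))   # z1, z3, z5, z7
    return z


x = [1, 2, 3, 4, 5, 6, 7, 8]
print(all(abs(u - v) < 1e-9 for u, v in zip(dct8_direct(x), dct8_even_odd(x))))  # True
```

The decomposition halves the number of multiplications: each output needs 4 products instead of 8, at the cost of one addition or subtraction per input pair.
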
Coefficients coding
• Fixed point (16-bit two's complement)
$$z_k=\sqrt{\frac{2}{N}}\,\alpha(k)\sum_{n=0}^{N-1}x_n\cos\left(\frac{\pi(2n+1)k}{2N}\right)$$
Example: n=6
 α = 0000000000010110    -α = 1111111111101010
 β = 0000000000011101    -β = 1111111111100011
 δ = 0000000000001100    -δ = 1111111111110100
 λ = 0000000000011111    -λ = 1111111111100001
 γ = 0000000000011010    -γ = 1111111111100110
 μ = 0000000000010001    -μ = 1111111111101111
 ν = 0000000000000110    -ν = 1111111111111010

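The binary values can be regenerated from the cosine and sine definitions of the coefficients. The Python sketch below assumes 5 fractional bits and truncation toward zero; both are inferred from the listed bit patterns rather than stated in the slides, and the names `FRACTION_BITS` and `to_fixed` are hypothetical.

```python
import math

FRACTION_BITS = 5  # assumption inferred from the binary values on the slide


def to_fixed(value, width=16):
    """Truncate toward zero, then return the two's-complement bit string."""
    scaled = int(value * (1 << FRACTION_BITS))
    return format(scaled & ((1 << width) - 1), "0{}b".format(width))


print(to_fixed(math.cos(math.pi / 4)))    # 0000000000010110
print(to_fixed(-math.cos(math.pi / 4)))   # 1111111111101010
```

With only 5 fractional bits the coefficients carry a quantization error of up to 1/32, so the computed DCT is an approximation of the exact transform.
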
Implementation:
• ADD and SUB on the first Dnode layer
• Multiply-accumulate operations (MAC) on the second Dnode layer
[Figure: in the first layer, Dnode1 computes xn + x(N-1)-n and Dnode2 computes xn - x(N-1)-n; in the second layer, Dnode1 accumulates (MAC) z0, z2, z4, z6 and Dnode2 accumulates z1, z3, z5, z7; the coefficients come with the configuration.]

Computing… (first layer: add/sub Dnodes; second layer: MAC Dnodes; example for z0 and z1)
t  First-layer inputs  Second-layer inputs   Second-layer operation  Coefficients (even, odd)
0  x0, x7              (none)                (none)                  (none)
1  x1, x6              x0 + x7, x0 – x7      MAC                     1, λ
2  x2, x5              x1 + x6, x1 – x6      MAC                     1, γ
3  x3, x4              x2 + x5, x2 – x5      MAC                     1, μ
4  (none)              x3 + x4, x3 – x4      MAC                     1, ν
5  (none)              (none)                clear; z0 and z1 are output
Configurations:
  t=0:   M0: N1:add(fifo,fifo) N2:sub(fifo,fifo)
  t=1-4: M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)
  t=5:   M1: N1:clear N2:clear
Results
– 2 transforms issued every 5 machine cycles
– "Clear" performed during addition
– 20 cycles for 8 samples

Achievable parallelism on an 8-Dnode structure: Ring-8
[Figure: the ring holds four layers with configurations M0, M1, M2 and M3, each made of Dnode 1, Dnode 2 and a switch; one half of the ring computes the 1D DCT of the 4 first lines, the other half that of the 4 last lines.]

Overall performances (8x8 block of results z'0,0 … z'7,7)
• 2 partial transforms ⇒ 5 cycles
• 1 line, 8 partial transforms ⇒ 20 cycles
• 4 lines, 32 partial transforms (M0, M1) ⇒ 80 cycles
• 8 columns, 64 transforms (M2, M3) ⇒ 80 cycles
• 2D DCT on 8 points: 160 CYCLES
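The cycle counts follow from simple arithmetic. The Python sketch below reproduces them under one assumption that the slides only suggest through the Ring-8 figure: two add/sub + MAC units work in parallel, each handling half of the lines (and then half of the columns). All constant names are hypothetical.

```python
CYCLES_PER_PAIR = 5   # from the slides: 2 coefficients every 5 cycles
POINTS = 8            # 8-point 1D DCT
PARALLEL_UNITS = 2    # assumption: two add/sub + MAC units run side by side

cycles_1d = (POINTS // 2) * CYCLES_PER_PAIR             # one line: 4 pairs of coefficients
cycles_rows = POINTS * cycles_1d // PARALLEL_UNITS      # 8 lines shared by 2 units
cycles_columns = POINTS * cycles_1d // PARALLEL_UNITS   # 8 columns shared by 2 units
print(cycles_1d, cycles_rows, cycles_rows + cycles_columns)  # 20 80 160
```
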

Comparisons: execution time (cycles)
VLIW: CPU64, TM1000, TI 320C60
Superscalar: Pentium I, Pentium II, NEC V830
[Figure: bar chart of the number of cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62 (VLIW), Ring-8, Ring-64, and Pentium I, Pentium II, NEC V830 (superscalar); the bar values are not recoverable from the extracted text.]
