
Architectural optimization techniques

Camille Diou
diou@univ-metz.fr

DIOU Master EAII


Camille Sp. RSEE 1
1 Microprocessor basics

[Figure: a CONTROLLER (state machine) drives a DATAPATH over a BUS. The datapath contains a register file (t1, t2, t3 and the constants A, B, C), tristate components (inputs/outputs) and an Arithmetic and Logic Unit (ALU).]

1 Microprocessor basics

Computation example: S=Ax²+By+C

CONTROLLER program, one register transfer per clock cycle:

Cycle 1: t1 <- x        (x is read from the input)
Cycle 2: t2 <- y        (y is read from the input)
Cycle 3: t3 <- A.t1     (ALU multiplies A and t1)
Cycle 4: t3 <- t3.t1    (ALU multiplies t3 and t1)
Cycle 5: t2 <- B.t2     (ALU multiplies B and t2)
Cycle 6: t3 <- t2+t3    (ALU adds t2 and t3)
Cycle 7: out <- t3+C    (ALU adds t3 and C; S appears on the output)

DATAPATH: registers t1, t2, t3, constants A, B, C, and a single ALU used by every operation.

#CYCLES: 7
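
The register-transfer sequence above can be replayed in software. The following is only a minimal sketch: the function name `run_datapath` and the numeric inputs are illustrative assumptions, not part of the slides.

```python
def run_datapath(x, y, A, B, C):
    """Replay the seven register transfers, one per clock cycle."""
    regs = {"t1": 0, "t2": 0, "t3": 0}
    # (destination, operation) pairs in the order issued by the controller
    program = [
        ("t1", lambda r: x),
        ("t2", lambda r: y),
        ("t3", lambda r: A * r["t1"]),
        ("t3", lambda r: r["t3"] * r["t1"]),
        ("t2", lambda r: B * r["t2"]),
        ("t3", lambda r: r["t2"] + r["t3"]),
        ("out", lambda r: r["t3"] + C),
    ]
    out = None
    cycles = 0
    for dest, op in program:
        value = op(regs)          # the single ALU handles one operation per cycle
        if dest == "out":
            out = value
        else:
            regs[dest] = value
        cycles += 1
    return out, cycles


result, cycles = run_datapath(x=3, y=4, A=2, B=5, C=7)
print(result, cycles)  # 45 7
```

Because the datapath has a single ALU, the cycle count equals the number of register transfers.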

1 Microprocessor basics
Execution principle

Fetch cycle / execute cycle:

START -> Fetch next instruction -> Execute instruction -> (back to fetch) ... -> HALT

1 Microprocessor basics
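A single accumulator machine

MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register

[Figure: block diagram of the machine. The FSM decodes the Opcode held in IR and drives the control signals (ALU function, LD, PC increment, branch). The ALU takes input A from the accumulator ACC and input B from memory and returns its result S to ACC. Memory is 16 bits wide and is addressed through MAR. Paths shown: load path, store path, instruction path, address operand, data flow (n bits) and address (m bits).]
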
1 Microprocessor basics

Single-address instruction: one of the registers is fixed (the accumulator).

AC is an implicit operand:

AC := AC <operation> Memory(Address)

Instruction format (16 bits):
bits 15-14: Opcode
bits 13-0: Address

Opcode:

00: Load
01: Store
10: Add
11: Branch
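
As an illustration of this format, the sketch below splits a 16-bit word into its 2-bit opcode and 14-bit address. The names `OPCODES` and `decode` are assumptions made for the example.

```python
# Opcode table taken from the slide: bits 15-14 of the instruction word
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}


def decode(instruction):
    """Split a 16-bit instruction into (operation name, 14-bit address)."""
    opcode = (instruction >> 14) & 0b11   # two most significant bits
    address = instruction & 0x3FFF        # fourteen least significant bits
    return OPCODES[opcode], address


name, address = decode(0b1000110100110011)
print(name, format(address, "014b"))  # Add 00110100110011
```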

1 Microprocessor basics
MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register

[Figure: the single accumulator machine annotated with bus widths. The load path, store path, ALU operands and instruction path are 16 bits wide; the Opcode sent to the FSM is 2 bits; the address operand, MAR, PC and branch address are 14 bits. Memory is 16 bits wide and holds 16K words (14-bit address).]

1 Microprocessor basics
1. Instruction fetch:
- PC is moved into MAR
- Read from memory
- Load instruction into IR
2. Instruction decode:
- Opcode bits to FSM (ADD)
- The rest of the bits is the operand address

[Figure: values at this step. PC = MAR = 10110100110011; the word read from memory and loaded into IR is 1000110100110011.]

1 Microprocessor basics
3. Operand fetch:
- IR<address> -> MAR
- Read data from memory
4. Instruction execute:
- Memory to ALU B
- AC to ALU A
- ALU Add
- S to AC

[Figure: values at this step. IR = 1000110100110011; MAR = 00110100110011 (address field of IR); ACC = 0011001101110110 on ALU input A; memory operand 0101010101110001 on ALU input B; result S = 1000100011100111 written back to ACC; PC = 10110100110011.]

1 Microprocessor basics
5. Housekeeping:
- Increment PC

[Figure: values at this step. PC goes from 10110100110011 to 10110100110100; ACC holds the sum 1000100011100111.]

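
The five steps (fetch, decode, operand fetch, execute, housekeeping) can be sketched as one function that executes a single instruction. This is an illustrative model only: the name `step`, the dictionary used as memory and the unconditional behaviour of Branch are assumptions; the bit patterns are the ones shown on the slides.

```python
MASK16 = 0xFFFF      # data words are 16 bits wide
ADDR_MASK = 0x3FFF   # addresses are 14 bits wide


def step(memory, pc, acc):
    """Execute one instruction and return the new (pc, acc)."""
    mar = pc                          # 1. fetch: PC -> MAR, memory -> IR
    ir = memory[mar]
    opcode = ir >> 14                 # 2. decode: opcode bits go to the FSM
    mar = ir & ADDR_MASK              # 3. operand fetch: IR<address> -> MAR
    if opcode == 0b00:                # Load
        acc = memory[mar]
    elif opcode == 0b01:              # Store
        memory[mar] = acc
    elif opcode == 0b10:              # 4. execute Add: S = AC + Memory(MAR)
        acc = (acc + memory[mar]) & MASK16
    else:                             # Branch (assumed unconditional here)
        return mar, acc
    return (pc + 1) & ADDR_MASK, acc  # 5. housekeeping: increment PC


memory = {
    0b10110100110011: 0b1000110100110011,  # ADD instruction at the PC address
    0b00110100110011: 0b0101010101110001,  # operand
}
pc, acc = step(memory, pc=0b10110100110011, acc=0b0011001101110110)
print(format(pc, "014b"), format(acc, "016b"))
# 10110100110100 1000100011100111
```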
1 Microprocessor basics

A simple microprocessor : Architecture


[Figure: datapath built around a 16x16 register file, with status signals going to the controller (FSM), a data port to/from memory and an address port to memory.]

1 Microprocessor basics

A simple microprocessor : Instruction format

[Figure: instruction word layouts, with alternative fields ("or") and a shift field.]

1 Microprocessor basics
A simple microprocessor : Instruction format
[Table with columns: Instruction format, Instruction, Action.]

1 Microprocessor basics
A simple microprocessor: test program
What will it do?

0000 7C0A ;
0001 8C00 ; LOAD RC, #A
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...

...

1 Microprocessor basics

Compiler dependency detection for ILP

• Detect data dependencies at compile time:

– examples:
c[i]=a[i]+b[i]; potential dependency:
d[i]=a[i]+c[j]; c[i] might be c[j]

c[1]=a[i]+b[i]; no dependency:
d[i]=a[i]+c[2]; c[1] is never c[2]
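
The two examples follow a simple conservative rule: a dependency can be ruled out only when both indices are known constants that differ. A minimal sketch of that rule, assuming a hypothetical helper `may_depend` where an index is either an `int` constant or a `str` naming a variable:

```python
def may_depend(write_index, read_index):
    """Conservative compile-time test between a write c[w] and a read c[r]."""
    both_constant = isinstance(write_index, int) and isinstance(read_index, int)
    if both_constant:
        # Constant indices: the two accesses alias only if they are equal
        return write_index == read_index
    # At least one symbolic index: aliasing cannot be excluded
    return True


print(may_depend("i", "j"))  # True: c[i] might be c[j]
print(may_depend(1, 2))      # False: c[1] is never c[2]
```

A real compiler would refine the symbolic case with more precise index analysis; the sketch only reproduces the distinction made on the slide.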

2 Systolic ring

Reconfigurable computing: instruction-level parallelism (ILP)

• Superscalar processors must find the dataflow graph at run time
• Reconfigurable architectures construct the dataflow graph at compile time
• No FU limitations
• No control logic overhead
• No window size limitations

2 Systolic ring

Reconfigurable computing: instruction-level parallelism (ILP)

• General-purpose computer (sequential instruction stream):
add r1, r2, r4
add r1, r3, r5
sub r3, r2, r6
add r4, r5, r1
add r5, r6, r2

• RC scheme: the same operations laid out as a dataflow graph. r1, r2 and r3 feed two adders and a subtractor that produce r4, r5 and r6; two further adders combine r4 with r5 into r1 and r5 with r6 into r2.

Question: what is the advantage of RC over superscalar?

Answer: the dataflow graph is constructed at compile time, so there is no run-time overhead.

2 Systolic ring

Reconfigurable computing: why now?

• Increasing number of transistors
• Complexity and cost of chip design increase fast
• Current computing demands are RC-friendly:
desktop and embedded demands are driven NOT by Word or Excel but by
multimedia, encryption and filters (dataflow-oriented applications)

2 Systolic ring

RA versus microprocessors

• RA less flexible (like a VLIW with fixed instructions)

but

• RA provides more (customized) computation elements


• RA can decrease memory traffic
• RA can be tailored for specific algorithms and data types

RA will not replace µP, but complement them

2 Systolic ring

Systolic computing: definition

• A set of simple processing elements with regular and local connections, which takes external inputs and processes them in a predetermined manner in a determined fashion

H.T. Kung

2 Systolic ring

Systolic computing : characteristics of best RC design

• Simple PE

• Regular and local interconnect

• Pipeline between PEs

• I/O at boundary

2 Systolic ring

Coarse grain RA model

In abstract :
Instructions configure both PE and interconnect every cycle

In reality :
Instruction Bandwidth / Memory too high, so…
COMPROMISE
2 Systolic ring

Communications…
Relationship of communication among processors
• Shared clock (Pipelined)
• Shared registers (VLIW)
• Shared memory (SMM)
• Shared network

2 Systolic ring

Reconfigurable computing

[Figure: a program whose instructions are split between those currently in the actually available hardware and those paged out.]

2 Systolic ring

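Finite impulse response (FIR) filter

$$ y(n) = \sum_{i=0}^{N-1} a(i)\, x(n-i-1) $$

[Figure: FIR block diagram with input x_n, a chain of Z^{-1} delay elements, coefficients a_N, a_{N-1}, a_{N-2}, ..., a_1, a_0 and output y_n.]

3-coefficient filter:

$$ y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3) $$

[Figure: the same block diagram with coefficients a_2, a_1, a_0 and three Z^{-1} delay elements.]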
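
The FIR sum can be evaluated directly. The sketch below is a straightforward reference implementation, assuming samples before the start of the sequence are zero; the function name `fir` is illustrative.

```python
def fir(coeffs, samples):
    """y(n) = sum_i a(i) * x(n-i-1), taking x(k) = 0 for k < 0."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, a in enumerate(coeffs):
            k = n - i - 1             # index of the delayed sample
            if k >= 0:
                acc += a * samples[k]
        outputs.append(acc)
    return outputs


# Impulse response of the 3-coefficient filter with a0=1, a1=2, a2=3
print(fir([1, 2, 3], [1, 0, 0, 0, 0]))  # [0, 1, 2, 3, 0]
```

The one-sample shift in the impulse response comes from the `x(n-i-1)` term of the definition.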

2 Systolic ring

Systolic FIR implementation

(MAC unit)

[Figure sequence: step-by-step construction of the systolic FIR from MAC units.]

2 Systolic ring

Systolic FIR implementation

Optimize outer loop, preload-repeated value

2 Systolic ring

Systolic FIR implementation

Optimize outer loop, broadcast common value

2 Systolic ring

Systolic FIR implementation

Optimize outer loop, retime to eliminate broadcast

2 Systolic ring
The Systolic Ring

• Coarse-grain architecture
• Multi-mode dynamic reconfiguration
• Scalable, two-dimensional array
• VHDL design
• Designed for SoC integration

2 Systolic ring
Dnode: word-level processing unit
Constitution
• Optimized datapath (16 bits)
• Register file (4x16 bits)
• Hardwired ALU and multiplier
Features
• Complex computations in local mode (FIR, IIR, WT…)
• Low silicon area (0.07 mm², 0.18 µm CMOS process)
• Single-cycle operations (e.g. MAC + register load)

[Figure: Dnode block diagram with the µinstruction input, the register file and the ALU + multiplier.]

2 Systolic ring

Local controller: dynamic reconfiguration at the Dnode level

Constitution
• 8 configuration registers
• 3 different run modes
• 1 programming mode

[Figure: Dnode local controller. Eight configuration registers (reg0 to reg7) supply the µinstruction to the Dnode; a controller/decoder is driven by the ck, mode, wait and enex signals, and an inhib signal is also shown. Multiplexers select the operands among In(1,2), fifo(1,2), bus and Rp(i,j) (i=1~4, j=1~2) for the register file and the ALU + multiplier, whose result leaves on out.]

2 Systolic ring

Programming mode

[Figure sequence on the local-controller diagram, one frame per clk cycle: Instruction 0, Instruction 1, Instruction 2 and Instruction 3 are presented in turn and loaded into the configuration registers.]

2 Systolic ring

Run-mode 1: Fixed

[Figure sequence on the local-controller diagram, four consecutive clk cycles: the µinstruction sent to the Dnode stays Instruction 0 (reg0) on every cycle.]

2 Systolic ring

Run-mode 2: Dynamic (one-time or loop)

[Figure sequence on the local-controller diagram: the µinstruction sent to the Dnode changes on each clk cycle (Instruction 1, then Instruction 2, then Instruction 3).]

2 Systolic ring
Array structure
• Unidirectional communications between neighbours; scalable
• Hard to implement a datapath with a greater pipeline depth than the array
  -> use of a RING STRUCTURE
• Hard to implement recursive operations
  -> use of a two-dataflow structure (forward dataflow and reverse dataflow)

[Figure: INPUTS -> processing units linked by configurable switch blocks -> OUTPUTS. The main dataflow is unidirectional and the BUS is a shared resource. Later frames close the array into a ring and add a reverse dataflow next to the forward dataflow.]

2 Systolic ring

Systolic Ring architecture: forward dataflow

• Peak power: 3200 MIPS @ 200 MHz (16-Dnode version)

[Figure: ring of Dnode layers (layer n, layer n+1, ...) separated by switches with I/O ports; the dataflow circulates from one layer to the next.]

2 Systolic ring

Systolic Ring architecture: forward dataflow

• No complex data routing problems (crossbars…)
• Unidirectional data transfers between adjacent layers (pipeline)
• Performance increases linearly with the number of Dnodes
• Provides 3200 MIPS @ 200 MHz of computing power for a 16-Dnode implementation

Switch components:
• Direct FIFO connection for data injection
• BUS connection for RISC communication
• Full connectivity between 2 Dnode layers

Local mode: stand-alone
Global mode: FPGA-like

[Figure: ring of D-Node layers (layer n-1, layer n, layer n+1) separated by switches with I/O ports; a configuration controller drives the nodes and the forward dataflow runs around the ring.]

2 Systolic ring
Feedback pipelines: reverse dataflow
• Each switch writes computed data into its own feedback pipeline
• Each switch has read ports on the other switches' pipelines
• Easy implementation of various recursive algorithms (IIR, WT…)

[Figure sequence: ring of D-Nodes and switches; successive frames highlight the feedback pipelines.]

2 Systolic ring
Two-level dynamically reconfigurable architecture:
• Global mode (first level)
  1. The program which manages the configuration runs on the RISC processor
  2. The configuration of an entire cluster can be modified at each clock cycle
  3. The operating layer computes the data coming from the host processor
• Local mode (second level)
  – Each Dnode runs its own program of up to 8 instructions

[Figure: (1) the management code feeds the configuration controller; (2) the configuration layer (RAM) configures the operating layer; (3) data from the host µP flow through the operating layer, made of Dnodes (register file, ALU + multiplier).]

2 Systolic ring
8-Dnode version…
• ST* CMOS process, 0.25 µm & 0.18 µm
Features:
• Parametrizable core (number of Dnodes)
• Good performance/cost tradeoff (Ring-8 @ 200 MHz Systolic Ring system):
  • 1600 MIPS (PII @ 450 MHz: 400 MIPS)
  • 3 Gb/s bandwidth

            Ring-8     Ring-8     Dnode
Process     0.25 µm    0.18 µm    0.18 µm
Area        0.9 mm²    0.7 mm²    0.04 mm²
Frequency   150 MHz    200 MHz    200 MHz

• Low Dnode area: possible to realize 128-Dnode versions…
• Suited as an IP core for SoC

*: ST: STMicroelectronics

2 Systolic ring
Assembly-level programming

Each line holds a RISC instruction (r:), a layer selection (M1:, M2:) and the Dnode instructions (N1:, N2:):

0000 r:ldl(0,8)  M1: N1:clr N2:clr
0001 r:ldl(1,2)  M2: N1:clr N2:clr
0002 r:dec(0,0)  M1: N1:add(fifo1,fifo1) N2:sub(fifo1,fifo1)
0003 r:jnz(1)    M2: N1:mac(in1) N2:mac(in2)
0004 r:halt

[Figure: tool flow. The assembler produces File1.bin, loaded into the RAM of the FPGA prototype, and File2.m, loaded into the RAM of the Ring-8 simulator testbench.]

2 Systolic ring
FIR filter: edge detection
Convolution mask: [ -1 1 0 ], y(n) = x(n) - x(n-1)

Assembly code:

0000 r:ldl(0,1) M1: N1:rst N2:rst
0001 r:jmp(0)   M1: N2:sub(fifo,fifo)

[Figure: the assembler output File2.m is loaded into the RAM of the Ring-8 simulator testbench; timing diagrams; input image and output image.]

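
The edge-detection filter y(n) = x(n) - x(n-1) is a first difference along each image row. A minimal software sketch, assuming the sample before the first pixel is 0 (the slides do not state the boundary handling) and using the illustrative name `edge_detect`:

```python
def edge_detect(row):
    """First difference of one image row: y(n) = x(n) - x(n-1)."""
    previous = 0                  # assumed value of x(-1)
    outputs = []
    for x in row:
        outputs.append(x - previous)
        previous = x
    return outputs


print(edge_detect([10, 10, 10, 50, 50, 50, 10]))
# [10, 0, 0, 40, 0, 0, -40]
```

Flat regions give 0, while a rising or falling edge gives a positive or negative peak.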
2 Systolic ring
Polynomial calculus
• P(x) = a.x + b.x² + c.x³

Dnode program (register file reg0 to reg2, accumulator ACC, ALU + MULT):

0. x -> reg0               /* load reg0,x */
1. x.x -> reg1             /* load reg1,x² */
2. reg0.reg1 -> reg2       /* load reg2,x³ */
3. a.reg0 -> ACC           /* load ACC,a.x */
4. b.reg1 + ACC -> ACC     /* load ACC,a.x+b.x² */
5. c.reg2 + ACC -> ACC     /* load ACC,a.x+b.x²+c.x³ */

[Figure sequence: register file and ALU + MULT contents after each step; the register file fills with x, x², x³ and the accumulator goes through a.x, a.x+b.x², a.x+b.x²+c.x³.]

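
The Dnode program for the polynomial maps one-to-one onto the following sketch, where each assignment stands for one step. The function name `dnode_polynomial` and the test values are assumptions for illustration.

```python
def dnode_polynomial(x, a, b, c):
    """Evaluate P(x) = a.x + b.x^2 + c.x^3 with the register schedule of the slide."""
    reg0 = x                  # load reg0 with x
    reg1 = x * x              # reg1 = x^2
    reg2 = reg0 * reg1        # reg2 = x^3
    acc = a * reg0            # ACC = a.x
    acc = b * reg1 + acc      # ACC = a.x + b.x^2 (one MAC operation)
    acc = c * reg2 + acc      # ACC = a.x + b.x^2 + c.x^3 (one MAC operation)
    return acc


print(dnode_polynomial(x=2, a=1, b=2, c=3))  # 34
```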
2 Systolic ring

Finite impulse response (FIR) filter

$$ y(n) = \sum_{i=0}^{N-1} a_i\, x(n-i-1) $$

[Figure: FIR block diagram with input x_n, a chain of Z^{-1} delay elements, coefficients a_0, a_1, a_2, ..., a_{N-1}, a_N and output y_n.]

3-coefficient filter:

$$ y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3) $$

[Figure: the same block diagram with coefficients a_2, a_1, a_0 and three Z^{-1} delay elements.]

2 Systolic ring
FIR implementation
Use of a 3-Dnodes-per-layer architecture
• Pipelined implementation
• Samples are injected through dedicated lines
• Coefficients loaded during the first cycle

Cycle 1: x0 is injected into the three Dnodes; coefficients a2, a1, a0 are loaded.
Cycle 2: x1 is injected; the first MAC holds a2.x0 (passed on through the feedback).
Cycle 3: x2 is injected; the MACs hold a2.x1 and a2.x0+a1.x1.
Cycle 4: x3 is injected; the MACs hold a2.x2 and a2.x1+a1.x2; first result a2.x0+a1.x1+a0.x2.
Cycle 5: x4 is injected; the MACs hold a2.x3 and a2.x2+a1.x3; result a2.x1+a1.x2+a0.x3.
The following cycles continue the same pattern, with one new result per cycle.

OPTIMAL IMPLEMENTATION: 1 SAMPLE / CYCLE

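
The pipelined schedule can be mimicked with two partial-sum registers and a sample broadcast to all three MAC stages. This is a behavioural sketch only, not the Dnode microcode: the name `systolic_fir` is an assumption, and the first two outputs are warm-up values computed from an empty pipeline.

```python
def systolic_fir(a0, a1, a2, samples):
    """Three MAC stages; every stage receives the same sample each cycle."""
    s1 = 0   # partial sum leaving the first MAC  (a2.x)
    s2 = 0   # partial sum leaving the second MAC (a2.x + a1.x)
    outputs = []
    for x in samples:
        y = s2 + a0 * x    # third stage completes the sum started two cycles ago
        s2 = s1 + a1 * x   # second stage extends the sum started one cycle ago
        s1 = a2 * x        # first stage starts a new sum
        outputs.append(y)
    return outputs


print(systolic_fir(1, 10, 100, [1, 2, 3, 4, 5]))
# [1, 12, 123, 234, 345]
```

From the third output onward each value equals a2.x(t-2) + a1.x(t-1) + a0.x(t), and one result is produced per input sample.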
2 Systolic ring

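6-coefficient filter

$$ y(n) = a_0 x(n-1) + a_1 x(n-2) + a_2 x(n-3) + a_3 x(n-4) + a_4 x(n-5) + a_5 x(n-6) $$

[Figure: two layers of three MAC Dnodes. One layer holds a2, a1, a0 and the other a5, a4, a3; both receive x_n, and the partial sum is passed from one layer to the other through the inter-layer feedback to produce y_n.]
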
2 Systolic ring

Discrete Cosine Transform


• Usually a two-dimensional 8x8-point DCT
• Very demanding algorithm…

Compression: original image -> DCT -> DCT coefficients -> quantization -> quantized coefficients -> coding -> compressed image

Decompression: compressed image -> decoding -> inverse quantization -> iDCT -> decompressed image

2 Systolic ring

DCT algorithm

Direct transform

$$ z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{\pi(2n+1)k}{2N}\right), \qquad k = 0,1,\dots,N-1 $$

Inverse transform

$$ x_n = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k)\, z_k \cos\!\left(\frac{\pi(2n+1)k}{2N}\right), \qquad n = 0,1,\dots,N-1 $$

with $\alpha(k) = 1/\sqrt{2}$ for $k = 0$ and $\alpha(k) = 1$ otherwise.

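
A direct, unoptimized evaluation of the two transforms can serve as a reference when checking a hardware mapping. The sketch assumes the normalisation factor sqrt(2/N), which makes the direct and inverse transforms exact inverses of each other; the function names are illustrative.

```python
import math


def alpha(k):
    """Scaling term: 1/sqrt(2) for k = 0, 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """Direct transform z_k, k = 0..N-1."""
    N = len(x)
    return [
        math.sqrt(2 / N) * alpha(k)
        * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def idct(z):
    """Inverse transform x_n, n = 0..N-1."""
    N = len(z)
    return [
        math.sqrt(2 / N)
        * sum(alpha(k) * z[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
        for n in range(N)
    ]


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
restored = idct(dct(samples))
print(all(abs(a - b) < 1e-9 for a, b in zip(samples, restored)))  # True
```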
2 Systolic ring

Image
• 64x64 points
• 8x8-pixel blocks
• 16-bit coded image

Each 8x8 block:

$$ \begin{bmatrix} x_{0,0} & x_{0,1} & \cdots & x_{0,7} \\ x_{1,0} & x_{1,1} & & \vdots \\ \vdots & & \ddots & \vdots \\ x_{7,0} & \cdots & \cdots & x_{7,7} \end{bmatrix} $$

[Figure: initial 64x64 image divided into 64 blocks of 8x8 pixels.]

2 Systolic ring

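Implementation
• Matrix implementation
• Even/odd frequency decomposition of the DCT algorithm

$$ z = \sqrt{\frac{2}{N}}\; T(N)\, x $$

Even frequencies:

$$ \begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ \beta & \delta & -\delta & -\beta \\ \alpha & -\alpha & -\alpha & \alpha \\ \delta & -\beta & \beta & -\delta \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix} $$

Odd frequencies:

$$ \begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix} = \begin{bmatrix} \lambda & \gamma & \mu & \nu \\ \gamma & -\nu & -\lambda & -\mu \\ \mu & -\lambda & \nu & \gamma \\ \nu & -\mu & \gamma & -\lambda \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} $$

with
α = cos(π/4), β = cos(π/8), δ = sin(π/8),
λ = cos(π/16), γ = cos(3π/16), μ = sin(3π/16), ν = sin(π/16)
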
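
The even/odd decomposition can be checked numerically against the plain cosine sum. The sketch below leaves out the common scaling (the sqrt(2/N) factor and the α(k) term of the DCT definition), as the matrices on the slide do; the function names are assumptions.

```python
import math


def dct8_even_odd(x):
    """8-point transform via the two 4x4 matrices (unscaled)."""
    al, be, de = math.cos(math.pi / 4), math.cos(math.pi / 8), math.sin(math.pi / 8)
    la, ga = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
    mu, nu = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)
    even = [[1, 1, 1, 1], [be, de, -de, -be], [al, -al, -al, al], [de, -be, be, -de]]
    odd = [[la, ga, mu, nu], [ga, -nu, -la, -mu], [mu, -la, nu, ga], [nu, -mu, ga, -la]]
    sums = [x[n] + x[7 - n] for n in range(4)]    # first layer: additions
    diffs = [x[n] - x[7 - n] for n in range(4)]   # first layer: subtractions
    z = [0.0] * 8
    for r in range(4):                            # second layer: MAC operations
        z[2 * r] = sum(even[r][c] * sums[c] for c in range(4))
        z[2 * r + 1] = sum(odd[r][c] * diffs[c] for c in range(4))
    return z


def dct8_direct(x):
    """Unscaled reference: z_k = sum_n x_n cos(pi (2n+1) k / 16)."""
    return [
        sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / 16) for n in range(8))
        for k in range(8)
    ]


x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(all(abs(a - b) < 1e-9 for a, b in zip(dct8_even_odd(x), dct8_direct(x))))  # True
```

The decomposition halves the multiplications: each output needs 4 products instead of 8.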
2 Systolic ring

Coefficients coding
• Fixed point

$$ z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{\pi(2n+1)k}{2N}\right) $$

Example: n=6

α = 0000000000010110    −α = 1111111111101010
β = 0000000000011101    −β = 1111111111100011
δ = 0000000000001100    −δ = 1111111111110100
λ = 0000000000011111    −λ = 1111111111100001
γ = 0000000000011010    −γ = 1111111111100110
μ = 0000000000010001    −μ = 1111111111101111
ν = 0000000000000110    −ν = 1111111111111010

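
The listed bit patterns are consistent with truncating each coefficient to 5 fractional bits and storing negative values in 16-bit two's complement. That reading is an inference from the values, not something the slides state, and the helper name `to_fixed` is hypothetical.

```python
import math

FRACTION_BITS = 5   # assumed: inferred from the listed patterns


def to_fixed(value, bits=16):
    """Truncate |value| to FRACTION_BITS fractional bits, two's-complement string."""
    magnitude = math.floor(abs(value) * (1 << FRACTION_BITS))
    signed = -magnitude if value < 0 else magnitude
    return format(signed & ((1 << bits) - 1), "0{}b".format(bits))


print(to_fixed(math.cos(math.pi / 4)))    # 0000000000010110  (alpha)
print(to_fixed(-math.cos(math.pi / 8)))   # 1111111111100011  (-beta)
print(to_fixed(math.sin(math.pi / 16)))   # 0000000000000110  (nu)
```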
2 Systolic ring

Implementation:
• ADD and SUB on the first Dnode layer
• Multiply-accumulate operations (MAC) on the second Dnode layer

[Figure: first layer: Dnode1 computes x_n + x_{(N-1)-n} and Dnode2 computes x_n − x_{(N-1)-n}. Second layer: Dnode1 (MAC) accumulates z0, z2, z4, z6 and Dnode2 (MAC) accumulates z1, z3, z5, z7, with the coefficients supplied by the configuration.]

2 Systolic ring
Computing… (even/odd matrix products, first column pair z0 and z1)

t=0, n=0: the first layer receives x0 and x7.
          Config: M0: N1:add(fifo,fifo) N2:sub(fifo,fifo)
t=1, n=1: the first layer delivers x0+x7 and x0−x7 and receives x1 and x6; the second layer performs MAC with coefficients λ,x,1,x.
          Config: M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)
t=2, n=2: the first layer delivers x1+x6 and x1−x6 and receives x2 and x5; MAC with coefficients γ,x,1,x.
t=3, n=3: the first layer delivers x2+x5 and x2−x5 and receives x3 and x4; MAC with coefficients μ,x,1,x.
t=4, n=4: the first layer delivers x3+x4 and x3−x4; MAC with coefficients ν,x,1,x.
t=5, n=0: the second layer outputs z0 and z1 and clears its accumulators.
          Config: M1: N1:clear N2:clear

2 Systolic ring
Computing…

Results
– 2 transforms issued every 5 machine cycles

– "Clear" performed during the addition

20 cycles for 8 samples

2 Systolic ring
Achievable parallelism on an 8-Dnode structure: Ring-8

[Figure: Ring-8 with four layers of two Dnodes (Dnode 1, Dnode 2) separated by switches and configured as M0, M1, M2 and M3. One half of the ring computes the 1D DCT of the first 4 lines, the other half the 1D DCT of the last 4 lines.]

2 Systolic ring

Overall performances

Intermediate result block:

$$ \begin{bmatrix} z'_{0,0} & z'_{0,1} & \cdots & z'_{0,7} \\ z'_{1,0} & z'_{1,1} & & \vdots \\ \vdots & & \ddots & \vdots \\ z'_{7,0} & \cdots & \cdots & z'_{7,7} \end{bmatrix} $$

• 2 partial transforms ⇒ 5 cycles
• 1 line, 8 partial transforms ⇒ 20 cycles
• 4 lines, 32 partial transforms (M0, M1) ⇒ 80 cycles
• 8 columns, 64 transforms (M2, M3) ⇒ 80 cycles

2D DCT on 8 points: 160 CYCLES

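
The cycle counts add up as follows. The sketch assumes, based on the Ring-8 mapping (4 first lines on one half of the ring, 4 last lines on the other), that the two halves work in parallel during each pass; the function name `pass_cycles` is illustrative.

```python
CYCLES_PER_PAIR = 5   # 2 partial transforms are issued every 5 machine cycles


def pass_cycles(lines, points=8, parallel_halves=2):
    """Cycles for one 1D pass over `lines` lines of `points` coefficients each."""
    transforms_per_half = lines * points // parallel_halves
    return transforms_per_half // 2 * CYCLES_PER_PAIR


one_line = 8 // 2 * CYCLES_PER_PAIR   # 8 partial transforms on one half of the ring
row_pass = pass_cycles(8)             # 8 lines, 4 per half
column_pass = pass_cycles(8)          # 8 columns, 64 transforms in total
print(one_line, row_pass, column_pass, row_pass + column_pass)  # 20 80 80 160
```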
2 Systolic ring

Comparisons: execution time (cycles)

VLIW: CPU64, TM1000, TI 320C60
Superscalar: Pentium I, Pentium II, NEC V830

[Figure: bar chart of the number of cycles (vertical axis from 0 to 400) for CPU64, TM-1000 and 320C62 (VLIW), Ring-8, Ring-64, and Pentium I, Pentium II and NEC V830 (superscalar).]