architectural optimisation
Camille Diou
diou@univ-metz.fr
Computation example : S=Ax²+By+C (CONTROLLER + DATAPATH)
[Figure, 7 frames: a state machine (controller) drives a datapath made of registers t1, t2, t3, one ALU, constants A, B, C, inputs x and y, output S]
Controller program, one register transfer per cycle:
Cycle 1: t1 <- x
Cycle 2: t2 <- y
Cycle 3: t3 <- A.t1
Cycle 4: t3 <- t3.t1
Cycle 5: t2 <- B.t2
Cycle 6: t3 <- t2+t3
Cycle 7: out <- t3+C
#CYCLES: 7
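The controller/datapath split can be illustrated with a small sketch. It is only a behavioural model, assuming each register transfer of the slide takes exactly one cycle; the function name `run_controller` and the sample values are hypothetical.

```python
def run_controller(x, y, A, B, C):
    """Run the seven register transfers for S = A*x**2 + B*y + C.

    Behavioural sketch only: one transfer per cycle, a single shared ALU.
    """
    regs = {"t1": 0, "t2": 0, "t3": 0, "out": 0}
    # Each entry is (destination register, operation on the current registers).
    program = [
        ("t1", lambda r: x),                  # t1 <- x
        ("t2", lambda r: y),                  # t2 <- y
        ("t3", lambda r: A * r["t1"]),        # t3 <- A.t1
        ("t3", lambda r: r["t3"] * r["t1"]),  # t3 <- t3.t1
        ("t2", lambda r: B * r["t2"]),        # t2 <- B.t2
        ("t3", lambda r: r["t2"] + r["t3"]),  # t3 <- t2+t3
        ("out", lambda r: r["t3"] + C),       # out <- t3+C
    ]
    cycles = 0
    for dest, operation in program:
        regs[dest] = operation(regs)
        cycles += 1
    return regs["out"], cycles


result, cycles = run_controller(x=3, y=4, A=2, B=5, C=7)
print(result, cycles)  # 45 7
```

With a single ALU every operation is serialised, which is why the cycle count equals the number of transfers.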
[Figure: instruction cycle state machine with states START, Fetch Instruction, Execute Instruction, Fetch Next Instruction, HALT]
[Figure: accumulator machine. Datapath: Memory (16 bits wide, 16K words addressed by the 14-bit address), ALU with inputs A and B and output S, ACC, 16-bit load path and store path. Control: FSM driven by the 2-bit opcode and issuing the ALU function controls, IR, 14-bit PC with incr and Branch inputs, 14-bit MAR with LD. Legend: data flow, control signals]
Instruction (16 bits):
bits 15-14: Opcode
bits 13-0: Address
Opcode:
00: Load
01: Store
10: Add
11: Branch
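The instruction layout (2-bit opcode in bits 15-14, 14-bit address in bits 13-0) can be decoded with a shift and a mask. The sketch below is illustrative; the helper name `decode` is hypothetical, and the sample word 1000110100110011 is the one shown on the fetch slide.

```python
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}


def decode(word):
    """Split a 16-bit instruction word into (mnemonic, 14-bit address)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction words are 16 bits wide")
    opcode = (word >> 14) & 0b11   # bits 15-14
    address = word & 0x3FFF        # bits 13-0
    return OPCODES[opcode], address


mnemonic, address = decode(0b1000110100110011)
print(mnemonic, format(address, "014b"))  # Add 00110100110011
```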
DIOU 16 Address operand
Master EAII
Camille Sp. RSEE Instruction path 14
1 Microprocessor basics
MAR : Memory Address Register
IR : Instruction Register
PC : Program Counter register
1. Instruction fetch:
- PC is moved into MAR
- Read from memory
- Load instruction into IR
2. Instruction decode:
- Opcode bits to FSM (ADD)
- Rest of the bits is the operand address
[Figure: same machine with example values. PC = MAR = 10110100110011; word read from memory into IR = 1000110100110011]
3. Operand fetch:
- IR<address> -> MAR
- Read data from memory
4. Instruction execute:
- Memory to ALU B
- ACC to ALU
- ALU Add
- S to ACC
[Figure: same machine with example values. MAR = 00110100110011; ALU A = 0011001101110110 (ACC); ALU B = 0101010101110001 (memory); S = 1000100011100111, written to ACC]
5. Housekeeping:
- Increment PC
[Figure: same machine with example values. PC goes from 10110100110011 to 10110100110100; ACC holds 1000100011100111]
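The five steps (fetch, decode, operand fetch, execute, housekeeping) can be replayed in software. This is a minimal sketch, not the author's implementation: the class name `AccumulatorMachine` is hypothetical, memory is modelled as a sparse dictionary, and Branch is assumed to be an unconditional jump since the slides do not give its condition. The bit patterns are the ones shown in the figures.

```python
ADDRESS_MASK = 0x3FFF  # 14-bit address field
WORD_MASK = 0xFFFF     # 16-bit data words


class AccumulatorMachine:
    def __init__(self, memory, pc=0, acc=0):
        self.memory = dict(memory)  # sparse memory: address -> 16-bit word
        self.pc = pc
        self.acc = acc
        self.mar = 0
        self.ir = 0

    def step(self):
        # 1. Instruction fetch: PC -> MAR, memory -> IR
        self.mar = self.pc
        self.ir = self.memory.get(self.mar, 0)
        # 2. Instruction decode: opcode to the FSM, rest is the operand address
        opcode = (self.ir >> 14) & 0b11
        address = self.ir & ADDRESS_MASK
        if opcode == 0b11:          # Branch (assumed unconditional)
            self.pc = address
            return
        # 3. Operand fetch: IR<address> -> MAR
        self.mar = address
        # 4. Instruction execute
        if opcode == 0b00:          # Load
            self.acc = self.memory.get(self.mar, 0)
        elif opcode == 0b01:        # Store
            self.memory[self.mar] = self.acc
        else:                       # Add: memory to ALU B, ACC to ALU, S to ACC
            operand = self.memory.get(self.mar, 0)
            self.acc = (self.acc + operand) & WORD_MASK
        # 5. Housekeeping: increment PC
        self.pc = (self.pc + 1) & ADDRESS_MASK


machine = AccumulatorMachine(
    memory={
        0b10110100110011: 0b1000110100110011,  # Add instruction
        0b00110100110011: 0b0101010101110001,  # its operand
    },
    pc=0b10110100110011,
    acc=0b0011001101110110,
)
machine.step()
print(format(machine.acc, "016b"), format(machine.pc, "014b"))
# 1000100011100111 10110100110100
```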
16x16
registers
shift
or or
or
0000 7C0A ;
0001 8C00 ; LOAD RC, #A
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...
...
c[1]=a[i]+b[i]; no dependency
d[i]=a[i]+c[2]; c[1] is never c[2]
r1 r2
RA versus microprocessors
but
H.T. Kung
• Simple PE
• I/O at boundary
In abstract :
Instructions configure both PE and interconnect every cycle
In reality :
Instruction Bandwidth / Memory too high, so…
COMPROMISE
2 Systolic ring
Communications…
Relationship of communication among processors
• Shared clock (Pipelined)
• Shared registers (VLIW)
• Shared memory (SMM)
• Shared network
Reconfigurable computing
Actual available
hardware
Instructions
currently in hardware
ram
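[Figure: FIR filter: delay line of Z⁻¹ elements, coefficients aN, aN-1, aN-2, …, a1, a0, output yn]
3 coefficients filter
y(n)=a0.x(n−1)+a1.x(n−2)+a2.x(n−3)
[Figure: 3-coefficient filter: input xn, Z⁻¹ delay elements, coefficients a2, a1, a0, output yn]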
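The 3-coefficient filter equation can be evaluated directly. The sketch below is a plain software reference, assuming that samples before n = 0 are zero; the function name `fir_delayed` and the sample values are hypothetical.

```python
def fir_delayed(x, coeffs):
    """Compute y(n) = sum over k of coeffs[k] * x(n-1-k).

    coeffs = [a0, a1, a2, ...]; samples with a negative index count as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for k, a in enumerate(coeffs):
            index = n - 1 - k
            if index >= 0:
                acc += a * x[index]
        y.append(acc)
    return y


# a0 = 1, a1 = 10, a2 = 100 make each term visible in the decimal digits.
print(fir_delayed([1, 2, 3, 4, 5], [1, 10, 100]))  # [0, 1, 12, 123, 234]
```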
Dnode (MAC unit)
[Figure: Dnode architecture. Register file (Reg FILE) reg0 to reg7, two multiplexers (Mux), ALU + MULT, controller and decoder. Operand sources: In(1,2), fifo(1,2), bus, Rp(i,j) (i=1~4, j=1~2). Signals: clk (ck), inhib, wait, mode, enex, µinst. Output: out]
Programming mode
[Figure, successive frames: Instruction 0, Instruction 1, Instruction 2, then Instruction 3 are presented to the Dnode]
Run-mode 1 : Fixed
[Figure, successive frames: Instruction 0 is shown as the active instruction in every frame]
Run-mode 2 : Dynamic
[Figure, successive frames: the active instruction changes from frame to frame: Instruction 1, Instruction 2, Instruction 3]
Array structure
• Unidirectional communications between neighbours
• Scalable
• Hard to implement a datapath with a greater pipeline depth than the array
• Hard to implement recursive operations
[Figure: array of configurable blocks (processing units) and switches; main dataflow (unidirectional) from INPUTS to OUTPUTS]
⇒ RING STRUCTURE
[Figure: the same blocks and switches arranged as a ring, with a forward dataflow and a reverse dataflow]
[Figure: ring of Dnodes and switches, with I/O ports on the switches]
I/O dataflow
• Switch components:
⇒ Direct FIFO connection for data injection
Local mode : stand-alone
Global mode : FPGA-like
Feedback pipelines (reverse dataflow)
Each switch writes computed data into its own feedback pipeline
Each switch has read ports on the other switches' pipelines
Easy implementation of various recursive algorithms (IIR, WT…)
[Figure, 3 frames: ring of D-Nodes and switches showing the feedback pipelines]
2-level dynamically reconfigurable architecture:
• Global mode (first level)
1. The program which manages the configuration runs on the RISC processor
2. The configuration of an entire cluster can be modified at each clock cycle
3. The operating layer computes the data coming from the host processor
• Local mode (second level)
Each Dnode runs its own program of up to 8 instructions
[Figure: OPERATING layer (Dnodes with Reg FILE and ALU + MULT), CONFIGURATION layer (RAM, CONFIG), configuration controller running the MANAGEMENT CODE, host µP supplying DATA; callouts 1, 2, 3 match the items above]
8 Dnodes version…
• ST* CMOS process 0.25 µm & 0.18 µm
Features :
• Parameterizable core (number of Dnodes)
• Good performance/cost trade-off (Ring-8@200MHz Systolic Ring system):
• 1600 MIPS (PII@450MHz : 400 MIPS)
• 3 Gb/s bandwidth
*: ST: STMicroelectronics
[Figure: development flow: Assembler, Simulator, Testbench (File2.m), RAM, Ring-8 prototype]
FIR filter : edge detection
Convolution mask : [ -1 1 0 ], y(n) = x(n) − x(n−1)
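The edge-detection filter is a first difference between neighbouring samples. A minimal sketch, assuming one image row is processed at a time and that outputs start at n = 1 (so no sample before the row is needed); the function name `edge_detect` and the pixel values are hypothetical.

```python
def edge_detect(row):
    """Return y(n) = x(n) - x(n-1) for n = 1 .. len(row) - 1."""
    return [row[n] - row[n - 1] for n in range(1, len(row))]


# Flat areas give 0, a rising edge a positive value, a falling edge a negative one.
print(edge_detect([10, 10, 10, 50, 50, 10]))  # [0, 0, 40, 0, -40]
```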
[Figure, 5 frames: Dnode register file holding x, x², x³ (reg0, reg1, reg2) and ALU + MULT; the accumulator successively holds a.x, a.x+b.x², a.x+b.x²+c.x³]
2: reg0.reg1 → reg2        /* load reg2, x³ */
3: a.reg0 → ACC            /* load ACC, a.x */
4: b.reg1 + ACC → ACC      /* load ACC, a.x+b.x² */
5: c.reg2 + ACC → ACC      /* load ACC, a.x+b.x²+c.x³ */
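The micro-program can be followed step by step in software. This is a sketch under assumptions: reg0 and reg1 are taken to already hold x and x² when step 2 starts (the step that produces x² is not shown on the slides), and the function name `dnode_polynomial` is hypothetical.

```python
def dnode_polynomial(x, a, b, c):
    """Replay steps 2 to 5 of the Dnode micro-program for a.x + b.x^2 + c.x^3."""
    reg = [0] * 8          # Dnode register file reg0 .. reg7
    reg[0] = x
    reg[1] = x * x         # assumed to be computed by an earlier, unshown step
    trace = []

    reg[2] = reg[0] * reg[1]       # 2: reg0.reg1 -> reg2   (x^3)
    acc = a * reg[0]               # 3: a.reg0 -> ACC
    trace.append(acc)
    acc = b * reg[1] + acc         # 4: b.reg1 + ACC -> ACC (multiply-accumulate)
    trace.append(acc)
    acc = c * reg[2] + acc         # 5: c.reg2 + ACC -> ACC (multiply-accumulate)
    trace.append(acc)
    return acc, trace


value, trace = dnode_polynomial(x=2, a=3, b=5, c=7)
print(value, trace)  # 82 [6, 26, 82]
```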
[Figure: FIR filter structures with coefficients a0 … aN, Z⁻¹ delay elements and output yn]
3 coefficients filter
y(n)=a0.x(n−1)+a1.x(n−2)+a2.x(n−3)
[Figure, successive cycles: the input sample is fed back to three MAC Dnodes holding a2, a1, a0; partial sums move from one MAC to the next]
Input    MAC a2    MAC a1          MAC a0
x0       a2.x0
x1       a2.x1     a2.x0+a1.x1
x2       a2.x2     a2.x1+a1.x2     a2.x0+a1.x1+a0.x2
x3       a2.x3     a2.x2+a1.x3     a2.x1+a1.x2+a0.x3
x4       a2.x4     a2.x3+a1.x4     a2.x2+a1.x3+a0.x4
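The cycle-by-cycle partial sums can be reproduced with a small model of three chained MAC units. It is a behavioural sketch only: the registers start at zero, each input sample reaches all three MACs in the same cycle, and the function name `systolic_fir3` is hypothetical.

```python
def systolic_fir3(samples, a0, a1, a2):
    """Return, for each cycle, the values produced by the a2, a1 and a0 MACs."""
    p2 = 0  # partial sum left by the a2 MAC on the previous cycle
    p1 = 0  # partial sum left by the a1 MAC on the previous cycle
    rows = []
    for x in samples:
        out = p1 + a0 * x      # a0 MAC completes the three-term sum
        new_p1 = p2 + a1 * x   # a1 MAC extends the previous a2 product
        new_p2 = a2 * x        # a2 MAC starts a new sum
        p2, p1 = new_p2, new_p1
        rows.append((p2, p1, out))
    return rows


rows = systolic_fir3([1, 2, 3, 4], a0=1, a1=10, a2=100)
print(rows[2])  # (300, 230, 123): a2.x2, a2.x1+a1.x2, a2.x0+a1.x1+a0.x2
```

During the first two cycles the a1 and a0 outputs are only partial sums; from the third sample on, one complete result leaves the chain per cycle.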
6 coefficients filter
y(n)=a0.x(n−1)+a1.x(n−2)+a2.x(n−3)+a3.x(n−4)+a4.x(n−5)+a5.x(n−6)
[Figure: input xn feeding two MAC Dnodes, one with coefficients a2, a1, a0 and one with coefficients a5, a4, a3]
[Figure: image compression and decompression chain; recoverable labels: original image, decoding, inverse quantification, iDCT, decompressed image]
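DCT algorithm
Direct transform
$$ z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi(2n+1)k}{2N}\right), \qquad k = 0, 1, \ldots, N-1 $$
Inverse transform
$$ x_n = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k)\, z_k \cos\left(\frac{\pi(2n+1)k}{2N}\right), \qquad n = 0, 1, \ldots, N-1 $$
with
$$ \alpha(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{else} \end{cases} $$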
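The direct and inverse transforms can be checked against each other numerically. The sketch below assumes the orthonormal scaling factor sqrt(2/N), for which the two transforms are exact inverses; the function names `dct` and `idct` are chosen for illustration and no external library is used.

```python
import math


def alpha(k):
    """Normalisation term: 1/sqrt(2) for k = 0, 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """Direct transform z_k, k = 0 .. N-1."""
    size = len(x)
    scale = math.sqrt(2 / size)
    return [
        scale * alpha(k) * sum(
            x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * size))
            for n in range(size)
        )
        for k in range(size)
    ]


def idct(z):
    """Inverse transform x_n, n = 0 .. N-1."""
    size = len(z)
    scale = math.sqrt(2 / size)
    return [
        scale * sum(
            alpha(k) * z[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * size))
            for k in range(size)
        )
        for n in range(size)
    ]


pixels = [52, 55, 61, 66, 70, 61, 64, 73]   # arbitrary sample row
restored = idct(dct(pixels))
print(all(abs(p - r) < 1e-9 for p, r in zip(pixels, restored)))  # True
```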
Image
• 64x64 points
• 8x8 pixel blocks
• 16-bit coded image
• 64 blocks of 8x8
$$ \begin{bmatrix} x_{0,0} & x_{0,1} & \cdots & x_{0,7} \\ x_{1,0} & x_{1,1} & & \vdots \\ \vdots & & \ddots & \vdots \\ x_{7,0} & \cdots & \cdots & x_{7,7} \end{bmatrix} $$
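Implementation
• Matrix implementation
$$ z = \sqrt{\frac{2}{N}}\, T(N)\, x $$
• Even / Odd frequency decomposition of the DCT algorithm
$$ \begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ \beta & \delta & -\delta & -\beta \\ \alpha & -\alpha & -\alpha & \alpha \\ \delta & -\beta & \beta & -\delta \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix} $$
α = cos (π/4), β = cos (π/8), δ = sin (π/8)
$$ \begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix} = \begin{bmatrix} \lambda & \gamma & \mu & \nu \\ \gamma & -\nu & -\lambda & -\mu \\ \mu & -\lambda & \nu & \gamma \\ \nu & -\mu & \gamma & -\lambda \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} $$
λ = cos (π/16), γ = cos (3π/16), μ = sin (3π/16), ν = sin (π/16)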
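The even/odd decomposition can be verified against the cosine sums of the 8-point DCT. The sketch below compares the two 4x4 matrix products with the unscaled sums Σ x_n cos(π(2n+1)k/16), so the overall scaling factor and α(k) are deliberately left out; all function names are hypothetical.

```python
import math


def cosine_sums(x):
    """Unscaled 8-point DCT sums: sum over n of x[n] * cos(pi*(2n+1)*k/16)."""
    return [
        sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / 16) for n in range(8))
        for k in range(8)
    ]


def even_odd_sums(x):
    """Same sums obtained from the two 4x4 matrices of the decomposition."""
    al, be, de = math.cos(math.pi / 4), math.cos(math.pi / 8), math.sin(math.pi / 8)
    la, ga = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
    mu, nu = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)

    sums = [x[n] + x[7 - n] for n in range(4)]    # first layer: ADD
    diffs = [x[n] - x[7 - n] for n in range(4)]   # first layer: SUB

    even = [[1, 1, 1, 1], [be, de, -de, -be], [al, -al, -al, al], [de, -be, be, -de]]
    odd = [[la, ga, mu, nu], [ga, -nu, -la, -mu], [mu, -la, nu, ga], [nu, -mu, ga, -la]]

    z = [0.0] * 8
    for row in range(4):   # second layer: multiply-accumulate
        z[2 * row] = sum(c * s for c, s in zip(even[row], sums))
        z[2 * row + 1] = sum(c * d for c, d in zip(odd[row], diffs))
    return z


x = [52, 55, 61, 66, 70, 61, 64, 73]   # arbitrary sample row
direct = cosine_sums(x)
split = even_odd_sums(x)
print(all(abs(p - q) < 1e-9 for p, q in zip(direct, split)))  # True
```

Each output needs 4 multiplications instead of 8, which is what makes the two-layer ADD/SUB then MAC mapping attractive.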
Coefficients coding
• Fixed point
$$ z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi(2n+1)k}{2N}\right) $$
Example : n=6
α = 0000000000010110    -α = 1111111111101010
β = 0000000000011101    -β = 1111111111100011
δ = 0000000000001100    -δ = 1111111111110100
λ = 0000000000011111    -λ = 1111111111100001
γ = 0000000000011010    -γ = 1111111111100110
μ = 0000000000010001    -μ = 1111111111101111
ν = 0000000000000110    -ν = 1111111111111010
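The positive bit patterns in the list are consistent with a scale factor of 2^5 = 32 followed by truncation, with negative values stored in 16-bit two's complement. That scale factor is an inference from the listed values, not something the slides state; the helper names below are hypothetical.

```python
import math

FRACTION_BITS = 5  # assumed: inferred from the listed bit patterns


def to_fixed(value):
    """Scale by 2**FRACTION_BITS and truncate toward zero."""
    return int(value * (1 << FRACTION_BITS))


def bits16(value):
    """16-bit two's complement representation as a bit string."""
    return format(value & 0xFFFF, "016b")


coefficients = {
    "alpha": math.cos(math.pi / 4),
    "beta": math.cos(math.pi / 8),
    "delta": math.sin(math.pi / 8),
    "lambda": math.cos(math.pi / 16),
    "gamma": math.cos(3 * math.pi / 16),
    "mu": math.sin(3 * math.pi / 16),
    "nu": math.sin(math.pi / 16),
}

for name, value in coefficients.items():
    fixed = to_fixed(value)
    print(f"{name:>6}: {bits16(fixed)}   -{name}: {bits16(-fixed)}")
```

With only 5 fractional bits the coding error can reach almost 1/32 per coefficient, so the choice trades accuracy for a small multiplier.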
Implementation :
• ADD and SUB on the first Dnode layer
• Multiply-accumulate operations (MAC) on the second Dnode layer
[Figure: the Dnode1 pair receives xn and x(N-1)-n and computes xn + x(N-1)-n and xn - x(N-1)-n; the Dnode2 pair (MAC, coefficients supplied by Config) accumulates z0, z2, z4, z6 and z1, z3, z5, z7]
Computing… (first row of the even and odd matrices: z0 and z1)
t=0: x0 and x7 enter the Dnode1 layer
t=1: Dnode1 layer outputs x0 + x7 and x0 – x7; x1 and x6 enter; coefficients λ,x,1,x; Config M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)
t=2: Dnode1 layer outputs x1 + x6 and x1 – x6; x2 and x5 enter; coefficients γ,x,1,x
t=3: Dnode1 layer outputs x2 + x5 and x2 – x5; x3 and x4 enter; coefficients μ,x,1,x
t=4: Dnode1 layer outputs x3 + x4 and x3 – x4; coefficients ν,x,1,x
t=5: z0 and z1 are output; Config M1: N1:clear N2:clear
Results
– 2 transforms issued every 5 machine cycles
[Figure: ring of switches and Dnode1/Dnode2 pairs with their Config inputs, labelled M1, M2, M3]
DCT 1D - 4 last lines
Overall performances
[Figure, successive frames: 8x8 result block z'(0,0) … z'(7,7), with a growing part of the block highlighted]
• 2 partial transforms ⇒ 5 cycles
• ⇒ 20 cycles
• ⇒ 80 cycles (M0, M1)
• 8 columns - 64 transforms
[Figure: bar chart of # cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II, NEC V830; group labels: VLIW, Superscalar]