CDMA-based Network-on-Chip (NoC)
Modulation/demodulation algorithm
Hierarchical star topology
Applications to FFT and MPEG processors
Hardware router implementations
Multicasting and bandwidth reallocation
Asynchronous FIFO interfaces
Background: Networks-on-Chip
Solutions to overcome the limitations of bus-based Systems-on-Chip
Scalability, bandwidth, design regularity, IP reuse. Packet switching via routers. Parallel processing. Resource efficiency.
Motivation
[Figure: Networks-on-Chip. System cores, SRAM, DRAM, a processor core, a DSP core and a mixed-signal block attached through network interfaces (NI).]
Wireless CDMA techniques must be modified for the wired on-chip data communication environment.
Spreading code: a Walsh code is used rather than a PN sequence.
Walsh codewords have zero cross-correlation. The scheme scales by using longer Walsh codewords.
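The zero cross-correlation property can be checked with a short sketch. It assumes the codewords are rows of a Sylvester-ordered Walsh-Hadamard matrix written as 0/1 chips, which reproduces the 8-chip codewords listed in these slides (for example W1: 01010101 and W7: 01101001). The function names are illustrative, not taken from the original work.

```python
def walsh_codeword(k, length=8):
    """Row k of a Sylvester-ordered Walsh-Hadamard matrix as 0/1 chips.

    Assumption: length is a power of two and 0 <= k < length.
    Chip i is the parity of the bitwise AND of k and i.
    """
    return [bin(k & i).count("1") % 2 for i in range(length)]


def cross_correlation(a, b):
    """Correlate two codewords after mapping chip 0 -> +1 and chip 1 -> -1."""
    return sum((1 - 2 * x) * (1 - 2 * y) for x, y in zip(a, b))


# Codewords W1 ... W7 for an 8-chip code (W0 is the all-zero row).
codes = {k: walsh_codeword(k) for k in range(1, 8)}

for k, code in codes.items():
    print(f"W{k}: {''.join(map(str, code))}")

# Every pair of distinct codewords correlates to zero.
pairs_ok = all(
    cross_correlation(codes[a], codes[b]) == 0
    for a in codes for b in codes if a != b
)
print("all distinct pairs orthogonal:", pairs_ok)
```

Calling `walsh_codeword(k, 16)` gives 16-chip codewords, which is one way to read the remark that longer codewords make the scheme scalable.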
[Figure: on-chip network connecting an FPGA block, a DSP core, a media CPU, SRAM and a UART through network interfaces (NI).]
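[Figure: modulation and demodulation block diagram. Each source S1 ... Sn is spread with a Walsh codeword W, the modulated values are summed, and each destination D1 ... Dn applies a decision to the sum.]
S: source, D: destination, W: Walsh codeword.
The decision factor is the average of the decision variable D[i] over the L chips of the codeword:
$$\text{decision factor} = \frac{1}{L} \sum_{i=0}^{L-1} D[i]$$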
Summation.
Modulated data is represented as a positive value.
Decision factor +1: data 1.
Decision factor -1: data 0.
Decision factor 0: no data sent.
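The modulation and summation steps can be sketched as follows. The slides do not spell out the modulation rule, so this sketch makes an assumption: each chip is the data bit XOR the codeword chip, and an idle source contributes all-zero chips. That assumption is consistent with the sums used in the worked examples in these slides, but it is not confirmed by the source. Function names are illustrative.

```python
def modulate(bit, codeword):
    """Spread one data bit over a codeword.

    Assumption: chip = bit XOR codeword chip; bit=None means the source
    is idle and contributes nothing (all-zero chips).
    """
    if bit is None:
        return [0] * len(codeword)
    return [bit ^ chip for chip in codeword]


def code_sum(chip_streams):
    """Add the modulated chips of all sources position by position."""
    return [sum(column) for column in zip(*chip_streams)]


W1 = [0, 1, 0, 1, 0, 1, 0, 1]
W2 = [0, 0, 1, 1, 0, 0, 1, 1]
W3 = [0, 1, 1, 0, 0, 1, 1, 0]

streams = [
    modulate(1, W1),     # source sending data 1
    modulate(1, W2),     # source sending data 1
    modulate(None, W3),  # idle source
]
S = code_sum(streams)
print(S)  # every entry is a non-negative count
```

Because each chip is 0 or 1, the summed value S[i] is always non-negative, which matches the note that modulated data is represented as a positive value.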
Example - continued
Demodulation step.
Demodulation rule (L = 8): D[i] = 2S[i] - L if codeword[i] = 0; D[i] = -2S[i] + L if codeword[i] = 1.
[Figure: data flows from sources S1 ... S7 to destinations D1 ... D7. Codewords shown: W1: 01010101, W2: 00110011, W3: 01100110, W4: 00001111, W7: 01101001.]

D1 (W1)
2S[i]   10  10  10   2   6   6   6   6
W1       0   1   0   1   0   1   0   1
D[i]     2  -2   2   6  -2   2  -2   2
Sum of D[i] = 8, 8 / 8 = +1, data = 1

D2 (W2)
2S[i]   10  10  10   2   6   6   6   6
W2       0   0   1   1   0   0   1   1
D[i]     2   2  -2   6  -2  -2   2   2
Sum of D[i] = 8, 8 / 8 = +1, data = 1

D3 (W3)
2S[i]   10  10  10   2   6   6   6   6
W3       0   1   1   0   0   1   1   0
D[i]     2  -2  -2  -6  -2   2   2  -2
Sum of D[i] = -8, -8 / 8 = -1, data = 0

D7 (W7)
2S[i]   10  10  10   2   6   6   6   6
W7       0   1   1   0   1   0   0   1
D[i]     2  -2  -2  -6   2  -2  -2   2
Sum of D[i] = -8, -8 / 8 = -1, data = 0
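The demodulation step of this example can be reproduced with a minimal sketch. It applies the rule stated in the slides (D[i] = 2S[i] - L where the codeword chip is 0, D[i] = -2S[i] + L where it is 1) to the summed values of the example, S[i] = 5 5 5 1 3 3 3 3, that is 2S[i] = 10 10 10 2 6 6 6 6. The function name and the use of `None` for "no data sent" are illustrative choices.

```python
def demodulate(sums, codeword):
    """Return (D, data) for one destination.

    D[i] = 2*S[i] - L when the codeword chip is 0, else -2*S[i] + L.
    The decision factor is sum(D) / L: +1 -> data 1, -1 -> data 0,
    0 -> no data sent (returned as None).
    """
    L = len(codeword)
    d = [2 * s - L if chip == 0 else -2 * s + L
         for s, chip in zip(sums, codeword)]
    total = sum(d)
    if total == L:
        return d, 1
    if total == -L:
        return d, 0
    if total == 0:
        return d, None
    raise ValueError(f"unexpected decision sum {total}")


S = [5, 5, 5, 1, 3, 3, 3, 3]  # summed chips from the example
codewords = {
    "D1": [0, 1, 0, 1, 0, 1, 0, 1],  # W1
    "D2": [0, 0, 1, 1, 0, 0, 1, 1],  # W2
    "D3": [0, 1, 1, 0, 0, 1, 1, 0],  # W3
    "D7": [0, 1, 1, 0, 1, 0, 0, 1],  # W7
}
results = {name: demodulate(S, w) for name, w in codewords.items()}
for name, (d, data) in results.items():
    print(name, d, "data =", data)
```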
Example - continued
Demodulation step.
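Demodulation rule (L = 8): D[i] = 2S[i] - L if codeword[i] = 0; D[i] = -2S[i] + L if codeword[i] = 1.
[Figure: data flows from sources to destinations in which some sources send no data. Codeword shown: W7: 01101001.]

D1 (W1)
2S[i]    6   8   4   6   4   6   6   0
W1       0   1   0   1   0   1   0   1
D[i]    -2   0  -4   2  -4   2  -2   8
Sum of D[i] = 0, 0 / 8 = 0, no data sent

D2 (W2)
2S[i]    6   8   4   6   4   6   6   0
W2       0   0   1   1   0   0   1   1
D[i]    -2   0   4   2  -4  -2   2   8
Sum of D[i] = 8, 8 / 8 = +1, data = 1

D3 (W3)
2S[i]    6   8   4   6   4   6   6   0
W3       0   1   1   0   0   1   1   0
D[i]    -2   0   4  -2  -4   2   2  -8
Sum of D[i] = -8, -8 / 8 = -1, data = 0

D7 (W7)
2S[i]    6   8   4   6   4   6   6   0
W7       0   1   1   0   1   0   0   1
D[i]    -2   0   4  -2   4  -2  -2   8
Sum of D[i] = 8, 8 / 8 = +1, data = 1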
Hierarchical switching
Packet structure
Packet fields: GROUP | SOURCE | DESTINATION | PAYLOAD
Group [3 bits]: determines local switch group.
Source [3 bits]: determines source address.
Destination [3 bits]: determines destination address.
Payload [N bits]: includes actual data.
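The packet layout above can be illustrated with a small pack/unpack sketch. The field widths come from the slide. The bit ordering (GROUP in the most significant bits, PAYLOAD in the least significant bits) and the function names are assumptions made for illustration only.

```python
FIELD_BITS = 3  # width of GROUP, SOURCE and DESTINATION per the slide


def pack_packet(group, source, destination, payload, payload_bits):
    """Pack the four fields into one integer (assumed order: G, S, D, payload)."""
    for name, value in (("group", group), ("source", source),
                        ("destination", destination)):
        if not 0 <= value < (1 << FIELD_BITS):
            raise ValueError(f"{name} does not fit in {FIELD_BITS} bits")
    if not 0 <= payload < (1 << payload_bits):
        raise ValueError(f"payload does not fit in {payload_bits} bits")
    header = (group << 2 * FIELD_BITS) | (source << FIELD_BITS) | destination
    return (header << payload_bits) | payload


def unpack_packet(word, payload_bits):
    """Split a packed word back into its fields."""
    mask = (1 << FIELD_BITS) - 1
    header = word >> payload_bits
    return {
        "group": (header >> 2 * FIELD_BITS) & mask,
        "source": (header >> FIELD_BITS) & mask,
        "destination": header & mask,
        "payload": word & ((1 << payload_bits) - 1),
    }


# Hypothetical packet: local switch group 1, source 2, destination 6.
word = pack_packet(group=1, source=2, destination=6,
                   payload=0xBEEF, payload_bits=16)
print(format(word, "025b"))
print(unpack_packet(word, payload_bits=16))
```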
[Figure: example data flows between routers R1 ... R7 of local switches LS1 ... LS7 in the hierarchical star topology. Codewords shown: W1: 01010101, W2: 00110011.]
Codeword assignment
Local switch group   Destination   Assigned address of packet   Assigned codeword
LS1                  R1            001                          W1: 01010101
LS2                  R2            010                          W2: 00110011
LS3                  R3            011                          W3: 01100110
LS4                  R4            100                          W4: 00001111
LS5                  R5            101                          W5: 01011010
LS6                  R6            110                          W6: 00111100
LS7                  R7            111                          W7: 01101001
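Since codewords are assigned per destination, two sources that address the same destination at the same time would need the same codeword, so one of them has to wait. The sketch below models one local switch under stated assumptions: modulation is data XOR codeword chip, codeword Wk is row k of a Sylvester-ordered Walsh-Hadamard matrix, and the arbitration policy is "first listed flow wins" (a placeholder policy, not taken from the source). The four flows are those of the Local Switch 1 example in these slides.

```python
def walsh(k, length=8):
    """Codeword Wk as 0/1 chips (parity of k AND i)."""
    return [bin(k & i).count("1") % 2 for i in range(length)]


def local_switch(flows, length=8):
    """flows: list of (source, destination, bit). Returns (sums, recovered, lost)."""
    granted, lost = {}, []
    for src, dst, bit in flows:
        if dst in granted:
            lost.append((src, dst))      # contention: codeword already in use
        else:
            granted[dst] = (src, bit)    # assumed policy: first flow wins

    # Modulate each granted flow with the destination's codeword and add.
    sums = [0] * length
    for dst, (_, bit) in granted.items():
        for i, chip in enumerate(walsh(dst, length)):
            sums[i] += bit ^ chip

    # Each destination demodulates with its own codeword.
    recovered = {}
    for dst in granted:
        d = [2 * s - length if chip == 0 else -2 * s + length
             for s, chip in zip(sums, walsh(dst, length))]
        recovered[dst] = {length: 1, -length: 0}.get(sum(d))
    return sums, recovered, lost


flows = [(1, 2, 1), (2, 6, 0), (3, 4, 0), (5, 6, 1)]
sums, recovered, lost = local_switch(flows)
print("sum:", sums)
print("recovered:", recovered)
print("lost to contention:", lost)
```

With these flows the summed chips come out as 1 1 1 1 3 3 1 1, and the flow from R5 to R6 is the one that loses, which agrees with the example.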
Example - continued
Sum and demodulation step.
D[i] = 2s[i] - L if codeword[i] = 0; D[i] = -2s[i] + L if codeword[i] = 1 (i: integer, 0 ≤ i ≤ L-1).

Local Switch 1
Data flow: R1(LS1) -> R2 (1), R2(LS1) -> R6 (0), R3(LS1) -> R4 (0), R5(LS1) -> R6 (1)
Sum (s[i])    1   1   1   1   3   3   1   1
2s[i]         2   2   2   2   6   6   2   2
D[i] at R2   -6  -6   6   6  -2  -2   6   6     8/8 = +1, data 1
D[i] at R6   -6  -6   6   6   2   2  -6  -6    -8/8 = -1, data 0
D[i] at R4   -6  -6  -6  -6   2   2   6   6    -8/8 = -1, data 0
R5(LS1) -> R6 loses the contention for R6.

Local Switch 2
D[i]         -4  -6   6   8  -4  -6   6   8     8/8 = +1, data 1
D[i]         -4   6  -6   8  -4   6  -6   8     8/8 = +1, data 1

Another local switch
Sum (s[i])    0   1   2   1   1   2   1   0
2s[i]         0   2   4   2   2   4   2   0
D[i]         -8   6   4  -6  -6   4   6  -8
D[i]         -8  -6   4   6   6   4  -6  -8

Local Switch 4
Sum (s[i])    0   2   2   0   1   1   1   1
2s[i]         0   4   4   0   2   2   2   2
D[i]         -8   4   4  -8   6  -6  -6   6    -8/8 = -1, data 0
D[i]         -8   4   4  -8  -6   6   6  -6    -8/8 = -1, data 0

Local Switch 6
Data flow: R3(LS6) -> R7 (1)
Sum (s[i])    1   0   0   1   0   1   1   0
2s[i]         2   0   0   2   0   2   2   0
D[i]: -8 8 8 -8 8 -8 8

Simulation environment
The entire architecture was simulated using SystemC. All of the data was randomly generated.
System clock and latency model: Tsys_clk = N * Tcode_clk.
Simulation results
Performance metrics.
Throughput [packets/sec] (rather than [bits/sec]):
$$\text{Throughput} = \frac{\text{number of packets}}{\text{unit time [sec]}}$$
Latency [ns]: packet transmission delay from MOD to DEMOD via the CDMA switch.

Packet size [bits]   Throughput [packets/sec]   Latency [ns]
24                   182 M                      22.6
36                   121 M                      28.4
48                   91 M                       36.2
56                   78 M                       44.8
new data in a parallel and pipelined fashion, and all data transmissions between source and destination PEs are executed concurrently through the CDMA switch. Post-processing step: storing the computation results. Each value is represented as a signed 16-bit fixed-point number with 10 fractional bits.
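The number format mentioned above can be illustrated with a short sketch. It reads the format as a signed 16-bit word with 10 fractional bits (one sign bit, five integer bits, ten fractional bits). Rounding to nearest and saturating on overflow are assumptions, since the slides do not say how the PEs handle either.

```python
WORD_BITS = 16
FRAC_BITS = 10
SCALE = 1 << FRAC_BITS                  # 1024 steps per unit
RAW_MIN = -(1 << (WORD_BITS - 1))       # -32768
RAW_MAX = (1 << (WORD_BITS - 1)) - 1    # 32767


def to_fixed(x):
    """Convert a float to the raw signed 16-bit value (assumed: round, saturate)."""
    raw = round(x * SCALE)
    return max(RAW_MIN, min(RAW_MAX, raw))


def to_float(raw):
    """Convert a raw fixed-point value back to a float."""
    return raw / SCALE


for value in (1.0, -0.5, 0.3, 31.999, 100.0):
    raw = to_fixed(value)
    print(f"{value:>8} -> raw {raw:>6} -> {to_float(raw):.6f}")
```

Under this reading the representable range is -32.0 to about 31.999 with a resolution of 1/1024.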
[Figure: FFT mapping onto processing elements: data resource, FFT computation PEs (including PE7 and PE8) and data sink.]
[Figure: panels (a) to (d): latency [us] and response time [usec] for direct and indirect FFT computation.]
Functional verification
Performance results
[Figure: MPEG-4 decoder core graph with annotated traffic between video out, audio out, media CPU, scheduler, 3D CPU, SDRAM, SRAM1, SRAM2, audio DSP, RISC CPU, up-sampler and scaling/quantization blocks, mapped onto CDMA SW 1 and CDMA SW 2.]
[Figure: test setup: a traffic generator drives TX/RX ports 2 to 7 of the CDMA switch (code adder). Flow steps shown: mapping decision, synthesize, area report.]
Simulation environment
Simulation is done using synthesizable VHDL.
Modelsim is used for functional verification.
Packets are generated from a normalized probability table.
Generated Traffic file is used to simulate an MPEG-4 mapped system.
[Figure: MPEG-4 decoder core graph mapped onto CDMA SW 1 and CDMA SW 2.]
Simulation parameters: payload size, codeword size, operating clock period, buffer depth, topology, switching (CDMA or crossbar), etc.
The Synplify ASIC tool is used for synthesis, with the Chip Express CX4001 0.25 µm structured library.
Performance metrics
Latency comparison (average)
CDMA Star NoC: 28 clock cycles
Crossbar Mesh NoC: 269 clock cycles
After analyzing the traffic log file produced by the simulation, we can obtain the travel times of the packets during transmission. Latency values include the effects of contention between packets destined for the same address at the same time.
Hop count: the number of routers a packet has been forwarded through. The CDMA star topology has favorable hop count values. A buffer depth of 8 is used in both platforms. The platform described in VHDL is synthesized using the Synplify ASIC tool with the Chip Express CX4001 0.25 µm structured library. The estimated maximum frequency is 76 MHz.
The estimated maximum frequency was found to be 76 MHz. The best-case throughput is this frequency multiplied by the payload size in bytes. Of the cases considered, the 64-bit payload size is sufficient to meet this constraint: 76 MHz x 8 bytes = 608 MBytes/sec. This is a best-case value that does not take into account possible effects due to contention. However, the number is sufficiently high to strongly suggest that the network is fast enough to meet the MPEG-4 throughput requirements.
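The arithmetic behind the 608 MBytes/sec figure is simple enough to check directly. The sketch below assumes one payload is delivered per clock cycle at 76 MHz, which is how the best-case number in the text is formed. The list of payload sizes other than 64 bits is hypothetical and only shows how the figure scales.

```python
def best_case_throughput_mbytes(freq_mhz, payload_bits):
    """Best-case throughput in MBytes/sec: one payload per clock, no contention."""
    return freq_mhz * payload_bits / 8


FREQ_MHZ = 76  # estimated maximum frequency from the text

# Payload sizes other than 64 bits are illustrative assumptions.
for payload_bits in (8, 16, 32, 64):
    rate = best_case_throughput_mbytes(FREQ_MHZ, payload_bits)
    print(f"{payload_bits:>3}-bit payload: {rate:.0f} MBytes/sec")
```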
[Figure: CDMA router architecture. IP 2 ... IP 7 each connect through a header decoder (HD), buffer (BUFF), modulator (MOD) and demodulator (DEMOD) to a central code adder, with a scheduler and a Walsh code source.]
[Figure: modulator (MOD). For each bit of PAYLOAD[16], 8 MUXes select the codeword (example: 01100110) or its complement (10011001) according to the data input (0, 1 or no data), and the result is stored in a buffer.]
[Figure: code adder. A tree of full adders (FA) with carry (C) and sum (S) outputs adds the modulated inputs x1 ... x7 into the sum bits S2 S1 S0.]
[Figure: demodulator (DEMOD). Adders on the sum bits S4 ... S0 form 2S[i]-L and -2S[i]+L; a select signal (sel) driven by the codeword chooses between them, and the chosen values are accumulated over the codeword.]

S[i] is the summation of all modulated values. L is the codeword length. D[i] is the decision variable, equal to 2S[i]-L or -2S[i]+L depending on the codeword bit. The decision factor is the sum of D[i] for i = 0 ... L-1 divided by L.
Decision factor +1: data 1. Decision factor -1: data 0. Decision factor 0: no data.

Resource utilization by payload size (out of 178156 available)

4-input LUTs
Component     8 bits   16 bits   32 bits   64 bits   128 bits
MOD           1769     2835      5009      7839      14534
BUFF          1239     1984      3507      5487      10174
HD            0        0         0         0         0
SCHE          49       46        48        49        48
CA            3033     4736      7826      17336     25016
DEMOD         4822     7289      12757     20224     111900
Total         10912    16890     29147     50935     161672
Utilization   6.1 %    9.5 %     16.4 %    28.6 %    90.7 %

Flip flops
Component     8 bits   16 bits   32 bits   64 bits   128 bits
MOD           0        0         0         0         0
BUFF          1920     2870      4774      8686      16202
HD            42       42        42        42        42
SCHE          7        7         7         7         7
CA            0        0         0         0         0
DEMOD         105      161       273       497       945
Total         2074     3080      5096      9232      17196
Utilization   1.2 %    1.7 %     2.9 %     5.2 %     9.6 %
Performance Results (Throughput & Latency)
Payload size   7-port aggregate throughput           Average latency
8 bits         7 * 8 * 71.4 MHz = 3.998 Gbps         90 ns
16 bits        7 * 16 * 71.4 MHz = 7.996 Gbps        90 ns
32 bits        7 * 32 * 71.4 MHz = 15.993 Gbps       90 ns
64 bits        7 * 64 * 71.4 MHz = 31.987 Gbps       90 ns
128 bits       7 * 128 * 55.5 MHz = 49.728 Gbps      115.7 ns
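The aggregate throughput figures follow the pattern ports x payload bits x clock frequency. A minimal sketch of that calculation, using the payload sizes and frequencies from the results above (the function name is illustrative):

```python
def aggregate_throughput_gbps(ports, payload_bits, freq_mhz):
    """Aggregate throughput in Gbps when every port moves one payload per clock."""
    return ports * payload_bits * freq_mhz / 1000.0


PORTS = 7
# (payload size in bits, operating frequency in MHz) as reported in the results
configurations = [(8, 71.4), (16, 71.4), (32, 71.4), (64, 71.4), (128, 55.5)]

for payload_bits, freq_mhz in configurations:
    gbps = aggregate_throughput_gbps(PORTS, payload_bits, freq_mhz)
    print(f"{payload_bits:>3} bits at {freq_mhz} MHz: {gbps:.3f} Gbps")
```

The 128-bit configuration runs at a lower clock (55.5 MHz), so its throughput is less than double that of the 64-bit configuration.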
0.18 µm ChipExpress cx5000 structured ASIC library
Payload size   Optimal estimated frequency   Optimal estimated period   Gate count (cell usage)   Area
8 bits         94.9 MHz                      10.539 ns                  19416                     26390.0 µm²
16 bits        94.7 MHz                      10.560 ns                  29113                     39736.0 µm²
32 bits        94.2 MHz                      10.615 ns                  47754                     65746.0 µm²
64 bits        93.8 MHz                      10.660 ns                  86912                     119000.0 µm²
128 bits       93.4 MHz                      10.706 ns                  167740                    228480.0 µm²

Throughput by library
Payload size   0.18 µm (cx5000)             0.25 µm (cx4001)
8 bits         5314 Mbps (5.314 Gbps)       2800 Mbps (2.800 Gbps)
16 bits        10606 Mbps (10.606 Gbps)     5532 Mbps (5.532 Gbps)
32 bits        21100 Mbps (21.100 Gbps)     11200 Mbps (11.200 Gbps)
64 bits        42022 Mbps (42.022 Gbps)     22400 Mbps (22.4 Gbps)
128 bits       83686 Mbps (83.686 Gbps)     44800 Mbps (44.8 Gbps)

[Figure: transfer latency of packets 1 to 7.]
Multicasting Example
(same codeword W2 assigned to multiple demodulators)
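Multicasting by sharing a codeword can be sketched as follows: every demodulator that is handed W2 correlates the same summed signal with W2 and therefore recovers the same bit. The sketch assumes XOR modulation and the demodulation rule D[i] = 2S[i] - L or -2S[i] + L used elsewhere in these slides. The receiver names and the second, unicast flow on W5 are illustrative.

```python
W1 = [0, 1, 0, 1, 0, 1, 0, 1]
W2 = [0, 0, 1, 1, 0, 0, 1, 1]
W5 = [0, 1, 0, 1, 1, 0, 1, 0]


def modulate(bit, codeword):
    """Assumed modulation: chip = data bit XOR codeword chip."""
    return [bit ^ chip for chip in codeword]


def demodulate(sums, codeword):
    """Return 1, 0, or None (no data) from the decision factor."""
    L = len(codeword)
    total = sum(2 * s - L if chip == 0 else -2 * s + L
                for s, chip in zip(sums, codeword))
    return {L: 1, -L: 0, 0: None}[total]


# One multicast sender on W2 (data 1) and one unicast sender on W5 (data 0).
streams = [modulate(1, W2), modulate(0, W5)]
sums = [sum(column) for column in zip(*streams)]

# Three demodulators share W2, one uses W5, and one listens on the unused W1.
assigned = {"PE1": W2, "PE5": W2, "PE7": W2, "PE6": W5, "PE3": W1}
received = {pe: demodulate(sums, code) for pe, code in assigned.items()}
print(received)
```

The sender transmits once, and the multicast group is defined only by which demodulators are assigned the shared codeword.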
[Figure: hierarchical CDMA NoC with PE1 ... PE20 attached through NA blocks to local CDMA switches and a central switch (C-SW) with an arbiter (ARB). Each switch contains FIFOs, modulators (MOD), demodulators (DEMOD), selectors (SEL), a code adder, a scheduler and a BCN pool.]

Source   Transfer type
PE2      Multicasting to PE1, PE5 and PE7
PE4      Unicasting to PE6
PE5      Unicasting to PE3 with DBW
PE9      Unicasting through C-SW to PE4
PE19     Broadcasting to SW3
Benefits - Throughput
Benefits - Latency
GALS (Globally Asynchronous, Locally Synchronous)
No serious clock skew problems. No global synchronization issues. No obstacle to the increasing demand for mixed-clock systems.
Our asynchronous FIFOs
Switch input asynchronous FIFO: mixed-clock interface.
Switch output asynchronous FIFO: output-queue (OQ) buffer and mixed-clock interface.
[Figure: fully asynchronous platform. IPs on clocks CLK 2 ... CLK 7 connect to the switch fabric (CLK 1) through asynchronous FIFOs, with an arbiter and r_data / buff_full signals.]
[Figure: platform with asynchronous FIFO virtual output queues (VOQ) at the input ports. IPs on clocks CLK 3 ... CLK 7 connect to the switch fabric through asynchronous FIFOs, with an arbiter and r_data / buff_full signals.]
Assigned operating frequencies
IP module          Test Scenario I   Test Scenario II   Test Scenario III
CLK1 IP (Switch)   71.4 MHz          83.3 MHz           100.0 MHz
CLK2 IP            62.5 MHz          71.4 MHz           83.3 MHz
CLK3 IP            55.6 MHz          62.5 MHz           71.4 MHz
CLK4 IP            50.0 MHz          55.6 MHz           62.5 MHz
CLK5 IP            71.4 MHz          83.3 MHz           100.0 MHz
CLK6 IP            62.5 MHz          71.4 MHz           83.3 MHz
CLK7 IP            55.6 MHz          62.5 MHz           71.4 MHz
Line rate equal to the switch speed for a fair comparison of the two platforms.
Under Test Scenario II
OCSN platform        Buffer overflow management       Average latency   Average aggregated throughput   Average packet drop probability
Platform I (OQ)      Async FIFO receiver controller   305 ns            2.60 Gbps                       16.3 %
Platform II (CIOQ)   Async FIFO receiver controller   295 ns            2.69 Gbps                       14.3 %

Under Test Scenario III
OCSN platform        Buffer overflow management       Average latency   Average aggregated throughput   Average packet drop probability
Platform I (OQ)      Async FIFO receiver controller   295 ns            2.69 Gbps                       15.8 %
Platform II (CIOQ)   Async FIFO receiver controller   270 ns            2.94 Gbps                       14.4 %