
Outline

CDMA-Based Network-on-Chip Design

CDMA-based Network-on-Chip (NoC)
Modulation/demodulation algorithm
Hierarchical star topology
Applications to FFT and MPEG processors
Hardware Router Implementations
Multicasting and Bandwidth Reallocation
Asynchronous FIFO Interfaces

Background: Networks-on-Chip
Solutions to overcome the limitations of bus-based Systems-on-Chip:
Scalability, bandwidth, design regularity, IP reuse.
Packet switching via routers.
Parallel processing.
Resource efficiency.

CDMA-based NoC architecture


Proposal
Novel NoC architecture based on a CDMA-based modulation/demodulation algorithm.

Motivation
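[Figure: Networks-on-Chip. System cores, processor and DSP cores, mixed-signal blocks, FPGA and CPLD blocks, dedicated hardware blocks and memory blocks (SRAM, DRAM, RAM, ROM, CAM, cache) attach through network interfaces (NI) to a regular switch-based communication platform.]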

Wireless CDMA techniques must be modified for the wired on-chip data communication environment. Spreading code.
Walsh code is used rather than a PN sequence.
Zero cross-correlation effect. Scalable by using longer Walsh codewords.

CDMA Using a Walsh code


General concepts:
A unique code is assigned to each user.
Modulation = original data x codeword.
Demodulation = inner product of the encoded signal and the codeword.
The signals are spread with a destination code.
This avoids complex internal routing.
Data can be transmitted to different outputs at the same time.
Orthogonality property: multiple IP blocks can transmit simultaneously without interference.
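The orthogonality property can be checked numerically. The sketch below is illustrative only: it assumes the 8-chip codewords W1 to W7 used in the examples of this deck are the non-trivial rows of a Sylvester-ordered Hadamard matrix with +1 written as bit 0 and -1 written as bit 1. The function names `walsh_codeword` and `correlation` are hypothetical helpers, not part of the original design.

```python
def walsh_codeword(k, length=8):
    """Bit form of row k of a Sylvester-ordered Hadamard matrix.

    Chip j is the parity of the bitwise AND of k and j
    (assumed mapping: +1 -> bit 0, -1 -> bit 1).
    """
    return [bin(k & j).count("1") % 2 for j in range(length)]


def correlation(a, b):
    """Inner product of two codewords after mapping bit 0 -> +1, bit 1 -> -1."""
    return sum((1 - 2 * x) * (1 - 2 * y) for x, y in zip(a, b))


codes = {k: walsh_codeword(k) for k in range(1, 8)}

for k, w in codes.items():
    print(f"W{k}: {''.join(map(str, w))}")

# Distinct codewords have zero cross-correlation; a codeword correlated
# with itself gives the codeword length.
for a in codes:
    for b in codes:
        expected = 8 if a == b else 0
        assert correlation(codes[a], codes[b]) == expected
```

Longer codes follow from the same rule with a larger `length` (a power of two), which is the scalability point made on the motivation slide.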

CDMA NoC switch architecture


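Block diagram
[Figure: CDMA NoC switch. Each source S1 ... Sn is spread with a Walsh codeword (Wi, Wj, Wk, ..., Wl); each destination D1 ... Dn recovers its data through a decision block (Decision 1 ... Decision n).]
Legend: S: source, D: destination, W: Walsh codeword. The decision blocks compute the decision factor.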

CDMA mod/demod algorithm

Algorithm
Modulation.

Data            Modulated data
0               Codeword itself (W)
1               Inverted codeword (~W)
No data sent    All-zero codeword

Summation.
Modulated data is represented as a positive value.

CDMA mod/demod algorithm (cont.)

Demodulation.

$$D[i] = \begin{cases} 2S[i] - L & \text{if codeword}[i] \text{ is } 0 \\ -2S[i] + L & \text{if codeword}[i] \text{ is } 1 \end{cases} \qquad 0 \le i \le L-1 \ (i: \text{integer})$$

$$\text{decision factor} = \frac{1}{L} \sum_{i=0}^{L-1} D[i]$$

S[i]: summation of all modulated values.
L: codeword length.
D[i]: the decision variable.

Direct bit mapping without integer conversion.

Decision factor    Demodulated data [bit]
+1                 1
-1                 0
0                  No data sent
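The modulation table and demodulation rule above can be sketched in a few lines. This is a minimal software model under stated assumptions, not the hardware implementation: the function names (`modulate`, `code_sum`, `demodulate`) are hypothetical, and the flow list is taken from the seven-sender example in this deck (S1 to D6, S2 to D7, S3 to D4, S4 to D3, S5 to D2, S6 to D5, S7 to D1).

```python
# 8-chip Walsh codewords as listed in the examples of this deck.
WALSH = {
    1: "01010101", 2: "00110011", 3: "01100110", 4: "00001111",
    5: "01011010", 6: "00111100", 7: "01101001",
}


def modulate(bit, codeword):
    """Data 0 -> codeword, data 1 -> inverted codeword, None -> all zeros."""
    chips = [int(c) for c in codeword]
    if bit is None:
        return [0] * len(chips)
    return chips if bit == 0 else [1 - c for c in chips]


def code_sum(signals):
    """Chip-wise sum S[i] of all modulated codewords."""
    return [sum(column) for column in zip(*signals)]


def demodulate(s, codeword):
    """Apply D[i] = 2S[i]-L (chip 0) or -2S[i]+L (chip 1), then decide."""
    length = len(codeword)
    d = [2 * s_i - length if c == "0" else -2 * s_i + length
         for s_i, c in zip(s, codeword)]
    total = sum(d)
    if total == length:      # decision factor +1
        return 1
    if total == -length:     # decision factor -1
        return 0
    if total == 0:           # decision factor 0
        return None          # no data sent
    raise ValueError("unexpected decision factor (contention?)")


# (data bit, destination index) for sources S1..S7.
flows = [(1, 6), (0, 7), (1, 4), (0, 3), (1, 2), (1, 5), (1, 1)]
S = code_sum([modulate(bit, WALSH[dst]) for bit, dst in flows])
recovered = {dst: demodulate(S, WALSH[dst]) for dst in sorted(WALSH)}

print(S)          # [5, 5, 5, 1, 3, 3, 3, 3]
print(recovered)  # {1: 1, 2: 1, 3: 0, 4: 1, 5: 1, 6: 1, 7: 0}
```

Passing `None` as the data bit models an idle sender; its destination then sees a decision factor of 0.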

Example without the no-data-sent case

Walsh code
W1: 01010101    W2: 00110011    W3: 01100110    W4: 00001111
W5: 01011010    W6: 00111100    W7: 01101001

Modulation and sum step.

Data flow    Data    Modulated codeword
S1 -> D6     1       11000011 (~W6)
S2 -> D7     0       01101001 ( W7)
S3 -> D4     1       11110000 (~W4)
S4 -> D3     0       01100110 ( W3)
S5 -> D2     1       11001100 (~W2)
S6 -> D5     1       10100101 (~W5)
S7 -> D1     1       10101010 (~W1)
Sum (S[i])           5 5 5 1 3 3 3 3

Example - continued
Demodulation step.

$$D[i] = \begin{cases} 2S[i] - L & \text{if codeword}[i] = 0 \\ -2S[i] + L & \text{if codeword}[i] = 1 \end{cases} \qquad 0 \le i \le L-1 \ (i: \text{integer})$$

2S[i] = 10 10 10 2 6 6 6 6 for every destination.

Dest.   Codeword        D[i]                        Decision      Data
D1      W1 01010101     2 -2  2  6 -2  2 -2  2       8 / 8 = +1   1
D2      W2 00110011     2  2 -2  6 -2 -2  2  2       8 / 8 = +1   1
D3      W3 01100110     2 -2 -2 -6 -2  2  2 -2      -8 / 8 = -1   0
D4      W4 00001111     2  2  2 -6  2  2  2  2       8 / 8 = +1   1
D5      W5 01011010     2 -2  2  6  2 -2  2 -2       8 / 8 = +1   1
D6      W6 00111100     2  2 -2  6  2  2 -2 -2       8 / 8 = +1   1
D7      W7 01101001     2 -2 -2 -6  2 -2 -2  2      -8 / 8 = -1   0

Example with the no-data-sent case

Walsh code
W1: 01010101    W2: 00110011    W3: 01100110    W4: 00001111
W5: 01011010    W6: 00111100    W7: 01101001

Modulation and sum step.

Data flow    Data    Modulated codeword
S1 -> D4     1       11110000 (~W4)
S2 -> D5     0       01011010 ( W5)
S3 -> D7     1       10010110 (~W7)
S4           No      00000000
S5 -> D3     0       01100110 ( W3)
S6           No      00000000
S7 -> D2     1       11001100 (~W2)
Sum (S[i])           3 4 2 3 2 3 3 0

Example - continued
Demodulation step.

$$D[i] = \begin{cases} 2S[i] - L & \text{if codeword}[i] = 0 \\ -2S[i] + L & \text{if codeword}[i] = 1 \end{cases} \qquad 0 \le i \le L-1 \ (i: \text{integer})$$

2S[i] = 6 8 4 6 4 6 6 0 for every destination.

Dest.   Codeword        D[i]                        Decision      Data
D1      W1 01010101    -2  0 -4  2 -4  2 -2  8       0 / 8 =  0   No data sent
D2      W2 00110011    -2  0  4  2 -4 -2  2  8       8 / 8 = +1   1
D3      W3 01100110    -2  0  4 -2 -4  2  2 -8      -8 / 8 = -1   0
D4      W4 00001111    -2  0 -4 -2  4  2  2  8       8 / 8 = +1   1
D5      W5 01011010    -2  0 -4  2  4 -2  2 -8      -8 / 8 = -1   0
D6      W6 00111100    -2  0  4  2  4  2 -2 -8       0 / 8 =  0   No data sent
D7      W7 01101001    -2  0  4 -2  4 -2 -2  8       8 / 8 = +1   1

CDMA star topology NoC architecture


CDMA based local & central switch

CDMA-based star topology NoC


Hierarchical switch architecture:

Local Switch (LS)

Central Switch (CS)



Hierarchical switching
Packet structure
GROUP | SOURCE | DESTINATION | PAYLOAD

Group [3 bits]: determines local switch group.
Source [3 bits]: determines source address.
Destination [3 bits]: determines destination address.
Payload [N bits]: includes actual data.

Codeword assignment

Local switch group   Destination   Assigned address of packet   Assigned codeword
LS1                  R1            001                          01010101 (W1)
LS2                  R2            010                          00110011 (W2)
LS3                  R3            011                          01100110 (W3)
LS4                  R4            100                          00001111 (W4)
LS5                  R5            101                          01011010 (W5)
LS6                  R6            110                          00111100 (W6)
LS7                  R7            111                          01101001 (W7)

Example of hierarchical switching

Modulation step.

Data flow               Data   Modulated codeword for central switch   Modulated codeword for local switch
R1 (LS1) -> R2 (LS1)    1      N.A.                                    11001100 (~W2)
R2 (LS1) -> R6 (LS1)    0      N.A.                                    00111100 (W6)
R3 (LS1) -> R4 (LS1)    0      N.A.                                    00001111 (W4)
R5 (LS1) -> R6 (LS1)    1      N.A.                                    11000011 (~W6)
R7 (LS1) -> R7 (LS4)    0      00001111 (W4)                           01101001 (W7)
R1 (LS3) -> R3 (LS3)    0      N.A.                                    01100110 (W3)
R4 (LS3) -> R2 (LS2)    1      00110011 (W2)                           11001100 (~W2)
R1 (LS5) -> R6 (LS3)    0      01100110 (W3)                           00111100 (W6)
R2 (LS5) -> R3 (LS4)    0      00001111 (W4)                           01100110 (W3)
R3 (LS6) -> R7 (LS7)    1      01101001 (W7)                           10010110 (~W7)
R3 (LS7) -> R1 (LS2)    1      00110011 (W2)                           10101010 (~W1)
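The codeword selection in the hierarchical modulation table can be sketched as a small lookup. This is an illustrative model under assumptions: the function name `hierarchical_codewords` and its argument names are hypothetical, and, following the table, the central-switch entry is returned as the assigned (non-inverted) group codeword, or `None` where the table shows N.A. because source and destination share a local switch.

```python
# Codeword assignment from the table: index k serves both local switch
# group LSk (central switch) and destination Rk (local switch).
WALSH = {
    1: "01010101", 2: "00110011", 3: "01100110", 4: "00001111",
    5: "01011010", 6: "00111100", 7: "01101001",
}


def invert(codeword):
    """Bitwise inversion of a codeword string."""
    return "".join("1" if c == "0" else "0" for c in codeword)


def hierarchical_codewords(src_group, dst_group, dst_router, bit):
    """Return (central_codeword, local_modulated_codeword) for one flow.

    bit is the data bit: 0 selects the codeword itself, 1 its inverse.
    """
    local = WALSH[dst_router] if bit == 0 else invert(WALSH[dst_router])
    # Flows that stay inside one local switch never reach the central switch.
    central = None if src_group == dst_group else WALSH[dst_group]
    return central, local


# R7 (LS1) -> R7 (LS4), data 0
print(hierarchical_codewords(1, 4, 7, 0))  # ('00001111', '01101001')
# R1 (LS1) -> R2 (LS1), data 1
print(hierarchical_codewords(1, 1, 2, 1))  # (None, '11001100')
```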

Example - continued
Sum and demodulation step.

$$D[i] = \begin{cases} 2s[i] - L & \text{if codeword}[i] = 0 \\ -2s[i] + L & \text{if codeword}[i] = 1 \end{cases} \qquad 0 \le i \le L-1 \ (i: \text{integer})$$

Local Switch 1: sum (s[i]) = 1 1 1 1 3 3 1 1, 2s[i] = 2 2 2 2 6 6 2 2

Data flow            D[i]                         Decision     Recovered data
R1(LS1) -> R2 (1)    -6 -6  6  6 -2 -2  6  6       8/8 = +1    1
R2(LS1) -> R6 (0)    -6 -6  6  6  2  2 -6 -6      -8/8 = -1    0
R3(LS1) -> R4 (0)    -6 -6 -6 -6  2  2  6  6      -8/8 = -1    0
R5(LS1) -> R6 (1)    contention (lose)

Local Switch 2: sum (s[i]) = 2 1 1 0 2 1 1 0, 2s[i] = 4 2 2 0 4 2 2 0

Data flow            D[i]                         Decision     Recovered data
R4(LS3) -> R2 (1)    -4 -6  6  8 -4 -6  6  8       8/8 = +1    1
R3(LS7) -> R1 (1)    -4  6 -6  8 -4  6 -6  8       8/8 = +1    1

Local Switch 3: sum (s[i]) = 0 1 2 1 1 2 1 0, 2s[i] = 0 2 4 2 2 4 2 0

Data flow            D[i]                         Decision     Recovered data
R1(LS3) -> R3 (0)    -8  6  4 -6 -6  4  6 -8      -8/8 = -1    0
R1(LS5) -> R6 (0)    -8 -6  4  6  6  4 -6 -8      -8/8 = -1    0

Local Switch 4: sum (s[i]) = 0 2 2 0 1 1 1 1, 2s[i] = 0 4 4 0 2 2 2 2

Data flow            D[i]                         Decision     Recovered data
R7(LS1) -> R7 (0)    -8  4  4 -8  6 -6 -6  6      -8/8 = -1    0
R2(LS5) -> R3 (0)    -8  4  4 -8 -6  6  6 -6      -8/8 = -1    0

Local Switch 7: sum (s[i]) = 1 0 0 1 0 1 1 0, 2s[i] = 2 0 0 2 0 2 2 0

Data flow            D[i]                         Decision     Recovered data
R3(LS6) -> R7 (1)    -6  8  8 -6  8 -6 -6  8       8/8 = +1    1

Simulation environment

Simulated our entire architecture using SystemC.
All of the data was randomly generated.
System clock and latency model.
Tsys_clk = N * Tcode_clk.
Tone_packet_delivery = Lpacket_length * Tsys_clk + 3 * Tsys_clk. (Within a local switch.)
Tone_packet_delivery = Lpacket_length * Tsys_clk + 11 * Tsys_clk. (Between different local switches through a central switch.)
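The latency model of the simulation environment is simple enough to evaluate directly. The sketch below only restates the two formulas; it assumes N is the codeword length (as stated later for the FFT experiments) and treats the packet length as a count of system clock periods, which the slides do not spell out. The function and parameter names are hypothetical.

```python
def packet_delivery_time(packet_length, t_code_clk, codeword_length=8,
                         via_central=False):
    """One-packet delivery time under the slide's latency model.

    packet_length   -- assumed to be in system clock periods
    t_code_clk      -- code (chip) clock period, any time unit
    codeword_length -- N in Tsys_clk = N * Tcode_clk (assumed codeword length)
    via_central     -- True when the packet crosses the central switch
    """
    t_sys_clk = codeword_length * t_code_clk
    overhead_cycles = 11 if via_central else 3
    return packet_length * t_sys_clk + overhead_cycles * t_sys_clk


# Hypothetical numbers: 4-cycle packet, 1 ns code clock, 8-chip codewords.
print(packet_delivery_time(4, 1.0))                    # 56.0
print(packet_delivery_time(4, 1.0, via_central=True))  # 120.0
```

Under this model a packet that crosses the central switch always pays eight extra system clock periods, independent of its length.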

Simulation results
Performance metrics.
Throughput [packets/sec] (rather than [bits/sec]):

$$\text{Throughput} = \frac{\text{The number of packets}}{\text{Unit time [sec]}}$$

Latency [ns]: packet transmission delay from MOD to DEMOD via CDMA switch.

Packet size [bits]   Throughput [packets/sec]   Latency [ns]
24                   182 M                      22.6
36                   121 M                      28.4
48                   91 M                       36.2
56                   78 M                       44.8

Parallel FFT Computation with a CDMA NoC

Mapping the FFT Algorithm onto the NoC
16-point radix-2 decimation-in-frequency (DIF) FFT.
FFT computation steps.
Pre processing step: Loading the input data.
Main processing step: All processing elements (PEs) accept new data in a parallel and pipelined fashion, and all data transmissions between source and destination PEs are executed concurrently through the CDMA switch.
Post processing step: Storing the computation results.
Each value is represented as a signed 16-bit fixed-point number with 10 fractional bits.

Parallel FFT Computation with a CDMA NoC

Direct mapping
[Figure: direct mapping of the FFT onto the PEs.]

Parallel FFT Computation with a CDMA NoC

Indirect mapping
[Figure: indirect mapping of the FFT onto the PEs.]

Labels recoverable from the two mapping diagrams: Stage I to Stage IV; PE groups (PE1, PE2), (PE3, PE4, PE5, PE6) and (PE7, PE8); roles Data Resource, FFT Comp. and Data Sink.

Parallel FFT Computation with a CDMA NoC


Experimental Results and Analysis
Simulation environments
Cycle-accurate SystemC.
Channel and PEs are operated synchronously.
FFT computational load is equally distributed.
Switch operates at 64 MHz.
Clock period: Tsys_clk = L x Tcode_clk (L: codeword length)
Performance metrics
Latency: Packet transmission delay.

$$\text{Throughput}_{(max,avg)} = \frac{\text{Data transferred between nodes}}{\text{Latency}} \cdot I_{(max,avg)}$$

I: the number of simultaneous data transactions.

Parallel FFT Computation with a CDMA NoC

Performance analysis
[Figure: panels (a) to (d) compare Direct FFT and Indirect FFT mapping for packet sizes of 33, 49 and 65 bytes, plotting max throughput [Mbytes/sec], latency [us], avg throughput [Mbytes/sec] and FFT computation response time [usec].]

Response time: Elapsed time of one FFT computation.

$$\text{Network utilization} = \frac{\text{Average throughput}}{\text{Maximum throughput}}$$

Functional verification

Performance results

MPEG-4 Performance Analysis for a CDMA NoC

Overall CDMA star NoC platform for MPEG-4 mapping
[Figure: CDMA star NoC. Seven transmitter/receiver pairs (TX 1/RX 1 to TX 7/RX 7) attach to a CDMA switch built around a code adder.]

MPEG-4 Performance Analysis for a CDMA NoC

MPEG-4 mapping and implementation procedure
[Figure: procedure diagram. Blocks shown: Parameters (communication pattern and other user-specific values such as bandwidth, payload size, buffer size and operating clock period), Traffic generator, Generated input traffic, Mapping decision, CDMA NoC Platform Model described in VHDL, Post simulation log file, Latency estimation report, Synthesize, Area report.]

MPEG-4 Performance Analysis for a CDMA NoC

Simulation methodology

Log file format

MPEG-4 Performance Analysis for a CDMA NoC

MPEG-4 mapping on two comparison platforms
Map to CDMA switched star NoC
Map to crossbar switched mesh NoC
[Figure: MPEG-4 core graph (Video out, Audio out, Media CPU, 3D CPU, SDRAM, SRAM1, SRAM2, Audio DSP, RISC CPU, Scaling, QUANT, Up Sample) mapped onto CDMA SW 1 and CDMA SW 2. Edge labels shown: 190, 0.5, 60, 600, 40, 40, 0.5, 910, 32, 250, 173, 500, 670.]

MPEG-4 Performance Analysis for a CDMA NoC

Constrained bandwidth requirement
1. Audio Output processor
2. Audio DSP processor
3. Media CPU
4. Video output processor
5. 3D Graphic processor
6. SDRAM
8. SRAM1
9. Quantization module
10. SRAM2
11. RISC CPU
12. Scaling module
13. Upsampling module

[Figure: Bandwidth requirements for the MPEG-4 System]

[Figure: Normalized Bandwidth requirements for the MPEG-4 System]

Simulation environment
Simulation is done using synthesizable VHDL.
Modelsim is used for functional verification.
Packets are generated from a normalized probability table.
The generated traffic file is used to simulate an MPEG-4 mapped system.
Simulation parameters: payload size, codeword size, operating clock period, buffer depth, topology, switching (CDMA or crossbar), etc.
Synplify ASIC tool is used for synthesis.
Chip Express CX4001 0.25 µm structured library.

Performance metrics
Latency: Packet transmission delay.
Area overhead.
Hop count.

MPEG-4 Performance Analysis for a CDMA NoC

Results

Hop Count Comparison
                     CDMA Star NoC   Crossbar Mesh NoC
Average              1.451           3.063
Min                  1               2
Max                  2               6
Standard deviation   0.65            1.39

Area Comparison
Payload size   CDMA Star NoC   Crossbar Mesh NoC
8 bits         109,090.0       42,383.5
16 bits        165,003.6       69,535.8
32 bits        273,014.0       128,946.0
64 bits        498,256.4       232,813.8

Hop count: The number of routers a packet has been forwarded through.
The CDMA star topology has favorable hop count values.
Buffer depth of 8 is used in both platforms.
The platform described in VHDL is synthesized using the Synplify ASIC tool with the Chip Express CX4001 0.25 µm structured library.
The estimated maximum frequency is 76 MHz.

MPEG-4 Performance Analysis for a CDMA NoC

Results

$$\text{average latency} = \frac{\sum_{i=1}^{N} \left( T(i)_{\text{received}} - T(i)_{\text{transmitted}} \right)}{N}$$

N: Total number of received packets

Latency Comparison
            CDMA Star NoC      Crossbar Mesh NoC
Average     28 clock cycles    269 clock cycles

By analyzing the traffic log file after simulation, we can obtain the travel times of the packets during transmission.
Latency values include the effects of contention between packets destined for the same address at the same time.

MPEG-4 Performance Analysis for a CDMA NoC

Results
Bandwidth constraints
The highest bandwidth requirement of the system is 455 MBytes/sec.
We would like to determine whether our CDMA NoC can meet that constraint.
The largest possible bandwidth is the maximum clock frequency, which was found to be 76 MHz, multiplied by the payload size in bytes.
Of the cases considered, the 64-bit payload size is sufficient to meet this constraint: 76 MHz x 8 bytes = 608 MBytes/sec.
This is a best-case value that does not take into account possible effects due to contention.
However, the number is sufficiently high to strongly suggest that the network is fast enough to meet the MPEG-4 throughput requirements.
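The bandwidth check can be repeated for each payload size considered in the area comparison. The sketch below uses only the figures quoted in the slides (76 MHz maximum frequency, 455 MBytes/sec requirement, payloads of 8 to 64 bits); the function name `peak_bandwidth` is a hypothetical helper, and the result is a best-case bound that ignores contention, as the slide notes.

```python
F_MAX_HZ = 76_000_000            # estimated maximum frequency from synthesis
REQUIRED_BYTES_PER_S = 455_000_000  # highest MPEG-4 bandwidth requirement


def peak_bandwidth(payload_bits, f_hz=F_MAX_HZ):
    """Best-case bandwidth in bytes/sec: one payload per clock cycle."""
    return f_hz * payload_bits // 8


for bits in (8, 16, 32, 64):
    bw = peak_bandwidth(bits)
    verdict = "meets" if bw >= REQUIRED_BYTES_PER_S else "misses"
    print(f"{bits:2d}-bit payload: {bw / 1e6:6.1f} MBytes/sec ({verdict} 455 MBytes/sec)")

# Only the 64-bit payload (608 MBytes/sec) clears the requirement.
sufficient = [b for b in (8, 16, 32, 64)
              if peak_bandwidth(b) >= REQUIRED_BYTES_PER_S]
print(sufficient)  # [64]
```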
CDMA Router Design and Implementation

CDMA router architecture
[Figure: seven IP blocks (IP 1 to IP 7), each with a buffer (BUFF), header decoder (HD), modulator (MOD) and demodulator (DEMOD), arranged around a central code adder, with a scheduler and the Walsh codes shared by all ports.]

CDMA Router Design and Implementation


FIFO Buffer and Walsh Codeword
Input buffering, which has a lower cost and a simpler implementation, is used.
The width of each buffer is equal to the packet size, and store-and-forward switching is used for its implementation simplicity.
The spreading code used in our design is the 8-chip Walsh code, which has the zero cross-correlation (orthogonality) property.
The size of the code is variable in sizes of 2^N (N >= 1, integer).
Each codeword corresponds to a given packet destination address.

CDMA Router Design and Implementation


Header Decoder and Scheduler
The header decoder (HD) detects the packet destination information from the FIFO buffer and then communicates with the scheduler through request/grant signals.
The scheduler (SCHE) was designed for two purposes. First, it must determine whether data is present in the buffer. Second, it has to manage any data contention with a priority scheme when packets from two or more IP blocks are destined for the same target address simultaneously.

CDMA Router Design and Implementation

Modulator
[Figure: modulator (MOD) block diagram. Blocks shown: BUFFER holding SRC[3], DST[3] and PAYLOAD[16]; Header Decoder passing DST[3] to the Scheduler (Req/Gnt); Walsh Codewords Memory; selected codeword (example: 01100110); 8 MUXes.]

Data       Codeword Assignment
0          Codeword itself
1          Inverted Codeword
No Data    All-zero Codeword

CDMA Router Design and Implementation

Code Adder
Example: the seven modulated codewords 11000011, 01101001, 11110000, 01100110, 11001100, 10100101 and 10101010 add chip by chip to S1[i] = [5 5 5 1 3 3 3 3].

[Figure: Overall Calculation of a Code Adder Block Diagram.]

[Figure: An S[i] Example in Code Adder Block Diagram using CSAs. Inputs x1 to x7 feed full adders (FA) whose sum (S) and carry (C) outputs produce the 3-bit result S2 S1 S0.]
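The code adder reduces seven one-bit chips to a 3-bit count per chip position. The sketch below shows one standard way to do this with four full adders used as carry-save adders; the exact wiring of the slide's diagram is not recoverable from the text, so the arrangement here (and the names `full_adder`, `count7`, `code_adder`) should be read as an assumption that merely matches the FA, S, C and S2 S1 S0 labels.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def count7(x):
    """Add seven one-bit inputs x1..x7 into a 3-bit value S2 S1 S0."""
    x1, x2, x3, x4, x5, x6, x7 = x
    s_a, c_a = full_adder(x1, x2, x3)     # first carry-save stage
    s_b, c_b = full_adder(x5, x6, x7)
    s0, c_c = full_adder(s_a, x4, s_b)    # weight-1 bits -> S0
    s1, s2 = full_adder(c_a, c_b, c_c)    # weight-2 bits -> S1, S2
    return (s2 << 2) | (s1 << 1) | s0


def code_adder(codewords):
    """Chip-wise sum S[i] over seven equal-length modulated codewords."""
    return [count7([int(cw[i]) for cw in codewords])
            for i in range(len(codewords[0]))]


modulated = ["11000011", "01101001", "11110000", "01100110",
             "11001100", "10100101", "10101010"]
print(code_adder(modulated))  # [5, 5, 5, 1, 3, 3, 3, 3]
```

Because each full adder preserves the weighted sum of its inputs, the 3-bit output equals the number of ones among the seven inputs for every input combination.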

CDMA Router Design and Implementation

Demodulator

$$D[i] = \begin{cases} 2S[i] - L & \text{if codeword}[i] \text{ is } 0 \\ -2S[i] + L & \text{if codeword}[i] \text{ is } 1 \end{cases}$$

$$\text{decision factor} = \frac{1}{L} \sum_{i=0}^{L-1} D[i]$$

S[i] is the summation of all modulated values.
L is the codeword length.
D[i] is the decision variable.

[Figure: demodulator (DEMOD) datapath for L = 8. A 1-bit shift register forms 2S[i]; a 5-bit RCA computes 2S[i] - L; a second 5-bit RCA forms -(2S[i] - L) as the 2's complement of (2S[i] - L); the carry-out C5 is ignored; the sel signal chooses between 2S[i] - L and -2S[i] + L.]

Decision Factor    Demodulated Data [bit]
+1                 1
-1                 0
0                  No Data Sent

CDMA Router Design and Implementation

FPGA synthesis
Area result
Xilinx Virtex4 XC4VLX200 (Area)

4-input LUTs, by payload size
Components     8 bits    16 bits   32 bits   64 bits   128 bits
MOD            1769      2835      5009      7839      14534
BUFF           1239      1984      3507      5487      10174
HD             0         0         0         0         0
SCHE           49        46        48        49        48
CA             3033      4736      7826      17336     25016
DEMOD          4822      7289      12757     20224     111900
Total          10912     16890     29147     50935     161672
(out of 178156)
Utilization    6.1 %     9.5 %     16.4 %    28.6 %    90.7 %

Flip Flops, by payload size
Components     8 bits    16 bits   32 bits   64 bits   128 bits
MOD            0         0         0         0         0
BUFF           1920      2870      4774      8686      16202
HD             42        42        42        42        42
SCHE           7         7         7         7         7
CA             0         0         0         0         0
DEMOD          105       161       273       497       945
Total          2074      3080      5096      9232      17196
(out of 178156)
Utilization    1.2 %     1.7 %     2.9 %     5.2 %     9.6 %

CDMA Router Design and Implementation

FPGA Synthesis
Timing result.
Xilinx Virtex4 XC4VLX200 (Timing)

                                                   Total Path Delay (Propagation + Setup)
Payload Size   Optimal Estimated Frequency         Logic               Route
               (to avoid negative slack)
8 bits         74.3 MHz (13.451 ns)                6.742 ns (50.1%)    6.709 ns (49.9%)
16 bits        76.4 MHz (13.082 ns)                6.377 ns (48.7%)    6.706 ns (51.3%)
32 bits        78.0 MHz (12.815 ns)                6.134 ns (47.9%)    6.681 ns (52.1%)
64 bits        76.1 MHz (13.139 ns)                6.377 ns (48.6%)    6.752 ns (51.4%)
128 bits       61.8 MHz (16.174 ns)                8.826 ns (54.6%)    7.348 ns (45.4%)

CDMA Router Design and Implementation

Performance results (throughput and latency)

Payload Size   7-port Aggregate Throughput           Average Latency
8 bits         7 * 8 * 71.4 MHz = 3.998 Gbps         90 ns
16 bits        7 * 16 * 71.4 MHz = 7.996 Gbps        90 ns
32 bits        7 * 32 * 71.4 MHz = 15.993 Gbps       90 ns
64 bits        7 * 64 * 71.4 MHz = 31.987 Gbps       90 ns
128 bits       7 * 128 * 55.5 MHz = 49.728 Gbps      115.7 ns
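The aggregate throughput column follows directly from ports x payload bits x clock frequency. A short check, using only the clock frequencies printed in the performance table (71.4 MHz, and 55.5 MHz for the 128-bit case); the function name `aggregate_throughput_bps` is a hypothetical helper.

```python
PORTS = 7  # seven-port router


def aggregate_throughput_bps(payload_bits, f_hz, ports=PORTS):
    """Aggregate throughput in bits/sec: every port moves one payload per cycle."""
    return ports * payload_bits * f_hz


# (payload bits, operating frequency in Hz) as used in the table.
cases = [(8, 71_400_000), (16, 71_400_000), (32, 71_400_000),
         (64, 71_400_000), (128, 55_500_000)]

for bits, f_hz in cases:
    gbps = aggregate_throughput_bps(bits, f_hz) / 1e9
    print(f"{bits:3d} bits @ {f_hz / 1e6:.1f} MHz: {gbps:.4f} Gbps")
```

The computed values (3.9984, 7.9968, 15.9936, 31.9872 and 49.728 Gbps) agree with the table, which truncates to three decimals.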

CDMA Router Design and Implementation

ASIC synthesis timing and area results

0.25 µm ChipExpress cx4001 structured ASIC library
Payload Size   Optimal Estimated Frequency   Optimal Estimated Period   Gate Count (Cell Usage)   Area
8 bits         50.0 MHz                      20.005 ns                  17838                     39304.2 µm2
16 bits        49.4 MHz                      20.223 ns                  27153                     59780.5 µm2
32 bits        50.0 MHz                      20.005 ns                  44314                     99480.8 µm2
64 bits        50.0 MHz                      20.000 ns                  81266                     181060.0 µm2
128 bits       50.0 MHz                      19.997 ns                  160906                    354877.0 µm2

0.18 µm ChipExpress cx5000 structured ASIC library
Payload Size   Optimal Estimated Frequency   Optimal Estimated Period   Gate Count (Cell Usage)   Area
8 bits         94.9 MHz                      10.539 ns                  19416                     26390.0 µm2
16 bits        94.7 MHz                      10.560 ns                  29113                     39736.0 µm2
32 bits        94.2 MHz                      10.615 ns                  47754                     65746.0 µm2
64 bits        93.8 MHz                      10.660 ns                  86912                     119000.0 µm2
128 bits       93.4 MHz                      10.706 ns                  167740                    228480.0 µm2

CDMA Router Design and Implementation

ASIC synthesis components area overhead

Components area overhead of 8 bits payload case in 0.25 µm technology

Components             Area [µm2]
Modulator (MOD)        3930.4
Buffer (BUFF)          6681.7
Header Decoder (HD)    432.4
Scheduler (SCHE)       786.1
Code Adder (CA)        2358.2
Demodulator (DEMOD)    24761.6
Others                 353.8
TOTAL                  39304.2

CDMA Router Design and Implementation

Performance results

Ideal 7-port Aggregate Throughput
Payload     0.18 µm (cx5000)             0.25 µm (cx4001)
8 bits      5314 Mbps (5.314 Gbps)       2800 Mbps (2.800 Gbps)
16 bits     10606 Mbps (10.606 Gbps)     5532 Mbps (5.532 Gbps)
32 bits     21100 Mbps (21.100 Gbps)     11200 Mbps (11.200 Gbps)
64 bits     42022 Mbps (42.022 Gbps)     22400 Mbps (22.4 Gbps)
128 bits    83686 Mbps (83.686 Gbps)     44800 Mbps (44.8 Gbps)

Transfer Latency
Rows: Packet 1, Packet 2, Packet 3, Packet 4, Packet 5, Packet 6, Packet 7, Average Latency.
Values as listed: 140 ns, 140 ns, 220 ns, 140 ns, 140 ns, 100 ns, 100 ns, 160 ns.

Multicasting & Bandwidth Reallocation

Multicasting Example
(same codeword W2 assigned to multiple demodulators)
[Figure: sources S1 ... SN and their demodulators, with several demodulators sharing codeword W2.]

Bandwidth Reallocation Example
(PE 1 operates in double-bandwidth mode by utilizing PE2's bandwidth)
[Figure: sources S1 ... SN with PE 1 using two channels.]
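Multicasting by sharing a codeword can be illustrated with the same modulation and decision rules used throughout the deck. In this sketch one sender spreads a bit with W2 and three demodulators are all configured with W2, while an unrelated flow uses W3. The listener names and helper functions (`spread`, `decide`) are hypothetical; only the codewords and the decision rule come from the slides.

```python
W2, W3 = "00110011", "01100110"


def spread(bit, codeword):
    """Data 0 -> codeword chips, data 1 -> inverted chips."""
    return [int(c) ^ bit for c in codeword]


def decide(s, codeword):
    """Decision rule: sum D[i], then +L -> 1, -L -> 0, 0 -> no data."""
    length = len(codeword)
    total = sum(2 * v - length if c == "0" else length - 2 * v
                for v, c in zip(s, codeword))
    return {length: 1, -length: 0, 0: None}[total]


# One multicast flow (bit 1 on W2) plus one unicast flow (bit 0 on W3).
channel = [a + b for a, b in zip(spread(1, W2), spread(0, W3))]

# Three demodulators share W2 (the multicast group); one listens on W3.
listeners = {"PE_a": W2, "PE_b": W2, "PE_c": W2, "PE_d": W3}
received = {name: decide(channel, cw) for name, cw in listeners.items()}
print(received)  # {'PE_a': 1, 'PE_b': 1, 'PE_c': 1, 'PE_d': 0}
```

No extra transmission is needed for the additional receivers, since every demodulator holding W2 recovers the same bit from the shared sum.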

Modified Switch Architectures

[Figure: modified Local Switch (8-bit codewords) and Central Switch (4-bit codewords). Labels shown: PE1 to PE7, FIFO, MOD 0 to MOD 7, DEMOD 0 to DEMOD 9, Code Adder, Scheduler, SEL, BCN Pool, ARB (C-SW), Local CDMA Switch, Central CDMA Switch; legend: Normal BW, Double BW.]

20-Core Design Example

[Figure: PE1 to PE20 attached through NA blocks to Local CDMA Switch 1 to Local CDMA Switch 4, which connect to a Central CDMA Switch; flows (1) to (5) are marked.]

Flow   FLOW TYPE   PE      TRANSFER TYPE
(1)    Local       PE2     Multicasting to PE1, PE5 and PE7
(2)    Local       PE4     Unicasting to PE6
(3)    Local       PE5     Unicasting to PE3 with DBW
(4)    Global      PE9     Unicasting through C-SW to PE4
(5)    Global      PE19    Broadcasting to SW3

Benefits - Throughput
[Figure: throughput comparison (worst-case, average and best-case patterns, with unicasting or multicasting).]

Benefits - Latency
[Figure: latency comparison (local and global flows, with unicasting or multicasting).]

Asynchronous FIFO Interfaces for GALS NoC Platform

GALS (Globally Asynchronous, Locally Synchronous)
No serious clock skew problems.
No global synchronization issues.
No obstacle to the increasing demand for mixed-clock systems.
Our platform with asynchronous FIFOs.
[Figure: four clocking styles: Single global clock, Fully asynchronous, Point-to-point based mixed-clock GALS, Switch based mixed-clock GALS.]

Asynchronous FIFO Interfaces for GALS NoC Platform

Asynchronous FIFO based mixed-clock GALS NoC platform I
Asynchronous FIFO for output queue (OQ) switch of the GALS NoC.
Switch input asynchronous FIFO: mixed-clock interface purpose.
Switch output asynchronous FIFO: buffer (OQ) and mixed-clock interface purposes.
[Figure: on-chip 4 x 4 OQ switch domain in the switch clock domain (CLK 1). IP blocks clocked by CLK 2 to CLK 7 attach through Async FIFO input ports (r_data) and output Async FIFOs; the Arbiter uses r_request, r_grant and r_empty; a Buffer Overflow Controller drives buff_full; the ports surround the Switch Fabric.]

Asynchronous FIFO Interfaces for GALS NoC Platform

Asynchronous FIFO based mixed-clock GALS NoC platform II
Asynchronous FIFO for combined input output queue (CIOQ) switch of the GALS NoC.
Switch input asynchronous FIFO: buffer (VOQ) and mixed-clock interface purposes.
Switch output asynchronous FIFO: buffer (OQ) and mixed-clock interface purposes.
[Figure: on-chip 4 x 4 CIOQ switch domain in the switch clock domain (CLK 1). IP blocks clocked by CLK 2 to CLK 7 attach through Async FIFO VOQ input ports (r_data) and output Async FIFOs; the Arbiter uses r_request, r_grant and r_empty; a Buffer Overflow Controller drives buff_full; the ports surround the Switch Fabric.]

Asynchronous FIFO Interfaces for GALS NoC Platform

Asynchronous FIFO block diagram
[Figure: Asynchronous FIFO block diagram.]
[Figure: Asynchronous FIFO block diagram with buffer overflow control signal.]

Asynchronous FIFO Interfaces for GALS NoC Platform


Asynchronous FIFO buffer overflow control principle
When the buffer overflow flag is asserted, the buffer full flag signal is sent to the read controller so that a temporary false empty signal is issued to the arbiter.
Thus, the arbiter immediately disables the grant signal.
Packet transmission out of the buffer stops, since none of the data stored in the buffer can be read at this point.
An empty flag is issued when the read pointer and the synchronized write pointer are equal. Therefore, the temporary false empty flag can be generated by comparing the buffer overflow signal with the read pointer signal.
While the grant signal is temporarily disabled and the actual read pointer and the synchronized write pointer are not equal, the writing operation continues successfully without regard to the false empty flag.
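The control principle can be modeled at the flag level. The sketch below is a behavioral abstraction under assumptions, not the synthesized read controller: it treats the empty flag seen by the arbiter as the true empty condition OR-ed with the buffer full flag, and it treats the grant as a request gated by that flag. Signal names follow the figure labels (r_request, r_grant, r_empty, buff_full); the function names are hypothetical.

```python
def r_empty(read_ptr, write_ptr_sync, buff_full):
    """Empty flag reported to the arbiter.

    True empty: read pointer equals the synchronized write pointer.
    False empty: forced while the buffer full flag is asserted.
    """
    true_empty = read_ptr == write_ptr_sync
    return true_empty or buff_full


def r_grant(r_request, empty_flag):
    """Arbiter grants a read only when the FIFO does not report empty."""
    return r_request and not empty_flag


# Data is waiting (pointers differ) and no overflow: the read is granted.
print(r_grant(True, r_empty(read_ptr=2, write_ptr_sync=5, buff_full=False)))  # True

# Overflow asserted: a false empty is reported and the grant is withdrawn,
# even though the pointers still differ and writes can continue.
print(r_grant(True, r_empty(read_ptr=2, write_ptr_sync=5, buff_full=True)))   # False
```

Once buff_full is released, the empty flag reverts to the pointer comparison alone, so reads resume without losing the data written in the meantime.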

Asynchronous FIFO Interfaces for GALS NoC Platform


Schematic diagram of the synthesized read controller with buffer overflow control signal
Synthesized with the Chip Express CX4000 structured ASIC library for 0.25 µm technology.
Area: 93.8 µm2.
Timing: estimated frequency of 177.4 MHz and a slack of 1.165 ns.

Asynchronous FIFO Interfaces for GALS NoC Platform


Simulation parameters:
Size of the asynchronous FIFO we implemented: 32 words (256 bits) and a 4-bit memory address (16 addresses).
Packet size: 256 bits (4-bit source address, 2-bit switch group address, 2-bit destination address and 248-bit payload field).
Six dummy IP cores attached to the switch groups generated uniform random traffic data at different clock rates.
Three different test scenarios with increasingly larger frequency ranges.
Two different network platforms:
Using only output queueing (OQ)
Using combined input and output queueing (CIOQ)

Asynchronous FIFO Interfaces for GALS NoC Platform


Results and analysis
Mixed-clock domain frequency assignments for the 3 test scenarios.

Assigned Operating Frequencies
IP Modules          Test Scenario I   Test Scenario II   Test Scenario III
CLK1 IP (Switch)    71.4 MHz          83.3 MHz           100.0 MHz
CLK2 IP             62.5 MHz          71.4 MHz           83.3 MHz
CLK3 IP             55.6 MHz          62.5 MHz           71.4 MHz
CLK4 IP             50.0 MHz          55.6 MHz           62.5 MHz
CLK5 IP             71.4 MHz          83.3 MHz           100.0 MHz
CLK6 IP             62.5 MHz          71.4 MHz           83.3 MHz
CLK7 IP             55.6 MHz          62.5 MHz           71.4 MHz

Line rate equal to the switch speed for a fair comparison of the two platforms.

Asynchronous FIFO Interfaces for GALS NoC Platform


Test scenario simulation results: CIOQ performs better than OQ in each case.
Buffer overflow management in every case: with Async FIFO Receiver Controller.

Test Scenario   OCSN Platform         Average Latency   Average Aggregated Throughput   Average Packet Drop Probability
I               Platform I (OQ)       320 ns            2.48 Gbps                       15.1 %
I               Platform II (CIOQ)    280 ns            2.83 Gbps                       14.9 %
II              Platform I (OQ)       305 ns            2.60 Gbps                       16.3 %
II              Platform II (CIOQ)    295 ns            2.69 Gbps                       14.3 %
III             Platform I (OQ)       295 ns            2.69 Gbps                       15.8 %
III             Platform II (CIOQ)    270 ns            2.94 Gbps                       14.4 %
