MMTH

Design of a Low Power NoC Router using Marching Memory Through Type

Ryota Yasudo, Takahiro Kagami, Hideharu Amano, Yasunobu Nakase,
Masashi Watanabe, Tsukasa Oishi, Toru Shimizu, Tadao Nakamura

Keio University, Japan
Renesas Electronics Corp., Japan
Outline
Background.

Marching Memory Through Type (MMTH).

Proposed router.

Evaluation.

Conclusion.
September 18, 2014
NOCS 2014
NoC (Network-on-Chip)
In many-core chips, the NoC consumes a significant share of power

MIT 16-core RAW CMP: 36%

Intel 80-core Tera FLOPS: 28%

Intel 48-core SCC: 10%

Reducing the power of NoCs is essential

Input buffers in routers
Input buffers in routers consume a significant part of the total power of NoCs

About 46% of the total power [P. Kundu, Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems06]

The largest leakage power consumer [X. Chen et al., ISLPED03]

Dynamic power is also high, and it increases rapidly as the traffic increases [T. T. Ye et al., DAC02]

Power optimization of the input buffers is required

Various approaches: reduction of the power in buffers

Remove buffers completely
Buffer-less deflection routers [T. Moscibroda and O. Mutlu, ISCA09]

Reduce the number of buffers
Reconfigurable routers [D. Matos, et al., IEEE Transactions on VLSI Systems11]

Utilize next-generation memory
A hybrid buffer design with STT-MRAM [H. Jang, et al., NOCS12]
Our approach
Use a novel power-efficient buffer memory called Marching Memory Through Type (MMTH) in the input buffers

Complicated buffer management is unnecessary

What is Marching Memory (MM)?
A novel memory that can avoid the memory bottleneck

[T. Nakamura and M. J. Flynn, UCAS10]

Data are shifted toward the CPU in synchronization with the clock

The original MM uses a DRAM-based memory cell
MMTH (Marching Memory Through Type)
An implementation example of the MM concept

The structure and target are completely different from the original MM

Memory cells are transparent latches, not capacitors

Behavior of MMTH

Write: Data are written from the input port to the column indicated by the W-pointer.

Read: Data in the column indicated by the R-pointer are transferred to the output port.

MMTH requires some time for signals to propagate, which appears as read latency.
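The write and read behavior described on this slide can be mimicked in software with two pointers over a fixed set of columns. The following is a minimal behavioral sketch in Python, not the authors' circuit: the class name `MMTHModel`, its methods, and the wrap-around of both pointers are assumptions made for illustration. The default depth of 8 and the 1-cycle read latency are the figures quoted elsewhere in the slides.

```python
class MMTHModel:
    """Behavioral sketch of a pointer-addressed buffer (hypothetical model)."""

    def __init__(self, depth=8, read_latency=1):
        self.columns = [None] * depth    # one entry per column
        self.depth = depth
        self.w_ptr = 0                   # column the next write goes to
        self.r_ptr = 0                   # column the next read comes from
        self.count = 0                   # number of valid columns
        self.read_latency = read_latency
        self.in_flight = []              # [cycles_remaining, data] pairs

    def write(self, data):
        """Write data into the column indicated by the W-pointer."""
        if self.count == self.depth:
            raise OverflowError("buffer full")
        self.columns[self.w_ptr] = data
        self.w_ptr = (self.w_ptr + 1) % self.depth   # assumed wrap-around
        self.count += 1

    def issue_read(self):
        """Start transferring the column indicated by the R-pointer."""
        if self.count == 0:
            raise IndexError("buffer empty")
        data = self.columns[self.r_ptr]
        self.r_ptr = (self.r_ptr + 1) % self.depth   # assumed wrap-around
        self.count -= 1
        self.in_flight.append([self.read_latency, data])

    def tick(self):
        """Advance one clock; return data reaching the output port, or None."""
        for entry in self.in_flight:
            entry[0] -= 1
        if self.in_flight and self.in_flight[0][0] <= 0:
            return self.in_flight.pop(0)[1]
        return None
```

In this model a flit issued for reading becomes visible at the output only after `read_latency` calls to `tick()`, which is the cost that the router pipeline later has to hide.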

Structure of MMTH
A memory cell is composed of a transparent latch

[Figure labels: Inverted T, Control signal]

This structure reduces power and area

Simple memory cell

Local clock is unnecessary

Power consumption of MMTH
Power consumption depends on data contents

If the same bit value is written repeatedly, no switching power is consumed. Switching power is consumed only when an overwrite changes the stored bit.

BCR (Bit Change Rate): probability of a bit change

[Figure example: 57%]
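One plausible way to measure BCR from a trace is to count the bit positions that differ between each word and the word written before it. The sketch below assumes exactly that definition, applied to successive words written to the same storage, with a default width of 64 bits (the flit size in the simulation setup). The function name `bit_change_rate` is hypothetical, and the slides do not state how the authors computed BCR.

```python
def bit_change_rate(words, width=64):
    """Percentage of bit positions that flip between consecutive words."""
    if len(words) < 2:
        return 0.0
    mask = (1 << width) - 1
    flips = 0
    for prev, cur in zip(words, words[1:]):
        # XOR marks the positions whose value changed on this overwrite.
        flips += bin((prev ^ cur) & mask).count("1")
    return 100.0 * flips / (width * (len(words) - 1))


# Two overwrites of a 4-bit cell: 0000 -> 0000 (no flips), 0000 -> 1111 (4 flips).
print(bit_change_rate([0b0000, 0b0000, 0b1111], width=4))  # 50.0
```

Writing the same word repeatedly yields a BCR of 0, which corresponds to the case in which no switching power is consumed.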
Summary: Characteristics of MMTH
Advantages

Low power (68% lower power than an equivalent buffer built from flip-flops (FFs))

High speed (2GHz)

A traditional register-based FIFO works at only 800MHz

Disadvantage

Read latency (1 clock cycle at 2GHz with a depth of 8)
Router architecture
Standard input-buffered router for 2D mesh

Virtual-channel flow control (2 VCs per input port)

Input buffers in the baseline: flip-flops (FFs)

Input buffers in the proposed router: MMTH
Router pipeline
Speculation and look-ahead routing are used.

[Figure: pipeline of the baseline (using register-based FIFO) and of the naive design with MMTH, which needs an extra Buffer Read (BR) stage]
Reduction of BR stage
Routing information is stored in an additional temporary flit (pre-header flit)

The pre-header flit arrives at the next router 1 clock earlier than the header flit

[Figure labels: Producer of BR stage, Bypass buffers]
Proposed router pipeline
Pre-header flit bypasses buffers!

Latency from source to destination

H: the number of hops

Baseline: Latency = 3H
Naive design with MMTH: Latency = 4H
Proposed: Latency = 3H + 1 (overhead of 1 cycle)
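The three expressions can be compared directly for a given hop count. The sketch below assumes they map to the baseline (3H), the naive MMTH design with its extra BR stage at every hop (4H), and the proposed design (3H + 1). That mapping is inferred from the pipeline slides. The function name `zero_load_latency` and the design labels are hypothetical.

```python
def zero_load_latency(hops, design):
    """Source-to-destination latency in cycles for H hops, per the slide's formulas."""
    formulas = {
        "baseline": lambda h: 3 * h,       # register-based FIFO
        "naive": lambda h: 4 * h,          # extra BR stage at every hop
        "proposed": lambda h: 3 * h + 1,   # BR cost paid once per packet
    }
    return formulas[design](hops)


for name in ("baseline", "naive", "proposed"):
    print(name, zero_load_latency(6, name))
```

For H = 6 this gives 18, 24, and 19 cycles. Under these formulas the naive penalty grows with the hop count, while the proposed design's penalty stays at one cycle.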
Additional external signals
A reset signal

Resets the MMTH after a packet has been transmitted

An invalid signal

Invalidates data during the BR stage (flit type <= none)
Overview of evaluation
Performance: cycle-accurate simulation

GEM5 full system simulator

NAS Parallel Benchmarks

A set of programs designed to help evaluate the performance of parallel supercomputers.

Derived from computational fluid dynamics applications.

Power consumption: RTL-based simulation

Renesas 40nm CMOS design technology

Apache PowerArtist

Synopsys Liberty library format
Simulation setup

(a) CMP system configuration in GEM5 full system simulator

Processor: X86-64
# of processors: (value missing)
# of directories: (value missing)
# of L2 caches: 16
L1 I/D cache size: 32KB
L2 cache size: 256KB
Coherence protocol: MOESI directory

4x4 mesh: each router has an L2 cache bank; routers in the four corners are connected with processors and directories

(b) NoC system configuration in RTL models

Clock frequency: 2GHz
Topology: 2D-Mesh
# of cores: (value missing)
# of VCs / input port: (value missing)
Buffer size: 8 flits
Routing: XY routing
Arbiter type: Round-robin
Flit size: 64 bit
Packet size: 1 header + 6 bodies
Traffic pattern: Uniform
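XY routing, listed in the NoC configuration, is dimension-order routing: a packet first travels along the X dimension until its column matches the destination, then along Y. The sketch below illustrates this on mesh coordinates. The function name `xy_route` and the `(x, y)` coordinate convention are assumptions for illustration and are not taken from the authors' RTL.

```python
def xy_route(src, dst):
    """Return the routers visited after the source, moving in X first and then in Y."""
    x, y = src
    path = []
    while x != dst[0]:                 # resolve the X offset first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then resolve the Y offset
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


# Corner to corner on a 4x4 mesh.
print(len(xy_route((0, 0), (3, 3))))   # 6
```

The length of the returned path is the hop count between the two routers.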
Performance overhead: Network simulation

[Figure: average flit latency [cycle] versus injected traffic [packets/node/cycle] for Baseline, MM(naive), and MM(proposed). (a) Uniform random traffic. (b) Bit-complement traffic.]
Performance overhead: Full system simulation

[Figure: normalized execution time of Baseline, MM(naive), and MM(proposed) for the benchmark programs IS, MG, CG, FT, BT, LU, EP, SP, UA, and their average. Annotations at CG (conjugate gradient): 10% and 2%.]
Bit Change Rate

[Figure: frequency [%] of the bit transitions 1 to 0, 0 to 1, 1 to 1, and 0 to 0, and the resulting BCR [%], for the benchmark programs IS, MG, CG, FT, BT, LU, EP, SP, UA, and their average]

BCR is only 25% on average

Power = P_min + (P_max - P_min) × BCR / 100
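The formula interpolates linearly between the power at a BCR of 0% (P_min) and at a BCR of 100% (P_max). A small sketch of the arithmetic follows. The function name `estimated_power` is hypothetical, and the numbers in the example are placeholders chosen only to show the calculation; they are not measurements from the slides.

```python
def estimated_power(p_min, p_max, bcr_percent):
    """Linear interpolation: P_min + (P_max - P_min) * BCR / 100."""
    return p_min + (p_max - p_min) * bcr_percent / 100.0


# Placeholder values (not from the slides): P_min = 10 mW, P_max = 20 mW, BCR = 25%.
print(estimated_power(10.0, 20.0, 25.0))   # 12.5
```

Because the relation is linear, a low average BCR keeps the estimate close to P_min.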

Power consumption

[Figure: power consumption [mW], split into input buffers and the other components, for Baseline, MM(MAX), MM(IS), MM(MG), MM(CG), MM(FT), MM(BT), MM(LU), MM(EP), MM(SP), MM(UA), and MM(Ave.). Annotations: Ave. 13%, Ave. 68%, Max 45.4%, Ave. 42.4%.]
Conclusion
A router using MMTH (a novel power-efficient buffer memory) has been presented.

We compared a router using a traditional register-based FIFO with our proposed router.

The proposed router reduces power consumption by 42.4% on average at 2GHz.

The performance overhead is only 0.5-2.0% with the proposed mechanism, which is based on the look-ahead technique.
