Performance Evaluation and Design Trade-Offs for Wireless


Network-on-Chip Architectures
KEVIN CHANG and SUJAY DEB, Washington State University
AMLAN GANGULY, Rochester Institute of Technology
XINMIN YU, SUMAN PRASAD SAH, PARTHA PRATIM PANDE, BENJAMIN BELZER,
and DEUKHYOUN HEO, Washington State University
Massive levels of integration are making modern multicore chips all pervasive in several domains. High
performance, robustness, and energy-efficiency are crucial for the widespread adoption of such platforms.
Networks-on-Chip (NoCs) have emerged as communication backbones to enable a high degree of integration
in multicore Systems-on-Chip (SoCs). Despite their advantages, an important performance limitation in
traditional NoCs arises from planar metal interconnect-based multihop links with high latency and power
consumption. This limitation can be addressed by drawing inspiration from the evolution of natural complex
networks, which offer great performance-cost trade-offs. Analogous with many natural complex systems,
future multicore chips are expected to be hierarchical and heterogeneous in nature as well. In this article
we undertake a detailed performance evaluation for hierarchical small-world NoC architectures where
the long-range communication links are established through millimeter-wave wireless communication
channels. Through architecture-space exploration in conjunction with novel power-efficient on-chip wireless
link design, we demonstrate that it is possible to improve the performance of conventional NoC architectures
significantly without incurring high area overhead.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture
and Design
General Terms: Design, Performance
Additional Key Words and Phrases: Multicore, NoC, small-world, wireless links
ACM Reference Format:
Chang, K., Deb, S., Ganguly, A., Yu, X., Sah, S. P., Pande, P. P., Belzer, B., and Heo, D. 2012. Performance
evaluation and design trade-offs for wireless network-on-chip architectures. ACM J. Emerg. Technol. Comput.
Syst. 8, 3, Article 23 (August 2012), 25 pages.
DOI = 10.1145/2287696.2287706 http://doi.acm.org/10.1145/2287696.2287706
1. INTRODUCTION
Power density limitations will continue to drive an increase in the number of cores in
modern electronic chips. While traditional cluster computers are more constrained by
power and cooling costs for solving extreme-scale (or exascale) problems, the continuing
progress and integration levels in silicon technologies make possible complete end-user
systems on a single chip. This massive level of integration makes modern multicore
This article is an extended version of the conference paper that appeared in ASAP [Deb et al. 2010].
This work was supported by the National Science Foundation under CAREER grant CCF-0845504 and in
part by the National Science Foundation under CAREER grant ECCS-0845849.
Authors' addresses: K. Chang and S. Deb, Washington State University; A. Ganguly, Rochester Institute of
Technology; X. Yu, S. P. Sah, P. P. Pande (corresponding author), B. Belzer, and D. Heo, Washington State
University; email: pande@eecs.wsu.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2012 ACM 1550-4832/2012/08-ART23 $15.00
DOI 10.1145/2287696.2287706 http://doi.acm.org/10.1145/2287696.2287706
ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 3, Article 23, Pub. date: August 2012.
chips all pervasive in domains ranging from climate forecasting and astronomical data
analysis, to consumer electronics, and biological applications [Pande et al. 2011]. Ac-
cording to the U.S. Environmental Protection Agency (EPA), one of the promising ways
to reduce energy dissipation of data centers is to design energy-efficient multicore chips
[EPA 2007]. With an increasing number of cores, high performance, robustness, and low
power are crucial for the widespread adoption of such platforms. These goals cannot
be attained simply through traditional design paradigms; we are forced to rethink
the basis of designing such systems, in particular the overall interconnect architec-
ture. Network-on-Chip (NoC) is accepted as the preferred communication backbone
for multicore Systems-on-Chip (SoCs). The achievable performance gain of a tradi-
tional NoC is limited by planar metal interconnect-based multihop links, where the
data transfer between two far apart blocks causes high latency and power consump-
tion. With a further increase in the number of cores on a chip, this problem will be
significantly aggravated. On the other hand, natural complex networks are known to
provide excellent trade-offs between latency and power with limited resources [Peter-
mann et al. 2006]. Thus, drawing inspiration from such networks could enable radically
new designs. The human brain, colonies of microbes, and many other natural complex
networks have the so-called small-world property, which means that the average hop
count between any two nodes is very short due to the addition of a few long-range
links. Such an approach can be incorporated in NoCs, as has been done with metal
wires in the past [Ogras et al. 2006]. However, the performance gain was limited due
to the multihop wired links that are necessary for longer distances. In this article we
evaluate the performance of hierarchical small-world NoC architectures with millime-
ter (mm)-wave wireless communication channels used as long-range shortcuts. These
on-chip wireless shortcuts are CMOS-compatible and do not need any new technology.
But they have associated antenna and wireless transceiver area and power overheads.
Thus, to achieve the best performance, the wireless resources need to be placed and
used optimally. To accomplish that goal, hybrid, hierarchical networks where nearby
cores communicate through traditional metal wires, but long distance communications
are predominantly achieved through high-performance single-hop wireless links, have
been proposed [Ganguly et al. 2010]. In this article we perform a detailed performance
analysis and establish trade-offs for various architectural choices for hierarchical
wireless NoCs. The novel contributions of this work are as follows.
—The hybrid and hierarchical nature of the mm-wave wireless NoC (mWNoC) intro-
duces various possibilities for the overall system architecture. We benchmark the
performance of several mWNoC architectures and establish suitable design trade-
offs. The analysis undertaken in this article helps us to choose the topological con-
figuration of a particular mWNoC architecture that offers the best trade-off in terms
of achievable peak bandwidth, energy dissipation, and area overhead.
—As a part of the performance evaluation, we also evaluate the performance of the
mWNoC architecture with respect to two other types of hierarchical small-world NoC
architectures where the long-range links are implemented with the RF-Interconnect
(RF-I) [Chang et al. 2008] and G-lines [Mensink et al. 2007].
—On-chip wireless transceiver circuits are crucial components of the mWNoCs. The
energy efficiencies of mWNoC architectures are shown to improve by incorporating
novel body-biased mm-wave transceiver circuit design methodologies.
2. RELATED WORK
The NoC paradigm has emerged as a communication backbone to enable a high degree
of integration in multicore Systems-on-Chip (SoCs) [Pande et al. 2005]. To alleviate
the problem of multihop communication links, the concept of express virtual channels
is introduced in Kumar et al. [2008b]. It is shown that by using express virtual lanes
to connect distant cores in the network, it is possible to avoid the router overhead
at intermediate nodes, and thereby greatly improve NoC performance. Performance
is further improved by incorporating ultra low-latency multidrop on-chip global lines
(G-lines) for flow control signals [Krishna et al. 2008]. NoCs have been shown to perform
better by inserting long-range wired links following principles of small-world graphs
[Ogras et al. 2006]. Despite significant performance gains, the preceding schemes still
require laying out long wires across the chip, and hence performance improvements
beyond a certain limit may not be achievable.
The design principles of photonic NoCs are elaborated in various recent publications
[Shacham et al. 2008; Joshi et al. 2009; Kurian et al. 2010]. The components of a
complete photonic NoC, including dense waveguides, switches, optical modulators, and
detectors, are now viable for integration on a single silicon chip. It is estimated that
a photonic NoC will dissipate significantly less power than its electrical counterpart.
Another alternative is NoCs with multiband RF interconnects [Chang et al. 2008]. In
these NoCs, Electromagnetic (EM) waves are guided along on-chip transmission lines
created by multiple layers of metal and dielectric stack. As the EM waves travel at
the effective speed of light, low-latency and high-bandwidth communication can be
achieved.
Recently, the design of a wireless NoC based on CMOS Ultra Wideband (UWB)
technology was proposed [Zhao et al. 2008]. The antennas used in Zhao et al. [2008]
achieve a transmission range of 1 mm with a length of 2.98 mm. Consequently, for a
NoC typically spreading over a die area of 20 mm × 20 mm, this architecture essen-
tially requires multihop communication through the on-chip wireless channels. The
performance of silicon integrated on-chip antennas for intra- and inter-chip commu-
nication with longer range has already been demonstrated in Lin et al. [2007], which
primarily used metal zig-zag antennas operating in the range
of tens of GHz. The propagation mechanisms of radio waves over intra-chip channels
with integrated antennas were also investigated [Zhang et al. 2007]. At mm-wave fre-
quencies, the effect of metal interference structures such as power grids, local clock
trees, and data lines on on-chip antenna characteristics like gain and phase is
investigated in Seok et al. [2005]. The demonstration of intra-chip wireless interconnection
in a 407-pin flip-chip package with a Ball Grid Array (BGA) mounted on a PC board
[Branch et al. 2005] has addressed the concerns related to the influence of packaging
on antenna characteristics. Design rules for increasing the predictability of on-chip
antenna characteristics have been proposed in Seok et al. [2005]. Using antennas with
a differential or balanced feed structure can significantly reduce coupling of switching
noise, which is mostly common-mode in nature [Mehta et al. 2002]. In Lee et al. [2009],
the feasibility of designing on-chip wireless communication networks with miniature
antennas and simple transceivers that operate at the sub-THz range of 100–500 GHz
has been demonstrated. If the transmission frequencies can be increased to THz/optical
range then the corresponding antenna sizes decrease, occupying much less chip real es-
tate. One possibility is to use nanoscale antennas based on Carbon NanoTubes (CNTs)
operating in the THz/optical frequency range [Kempa et al. 2007]. Consequently, build-
ing an on-chip wireless interconnection network using THz frequencies for inter-core
communications becomes feasible. The design of a small-world wireless NoC operating
in the THz frequency range using CNT antennas is elaborated in Ganguly et al. [2010].
Though this particular NoC is shown to improve the performance of traditional wire-
line NoC by orders of magnitude, the integration and reliability of CNT devices need
more investigation. The basic ideas regarding the design of a small-world NoC with
mm-wave wireless links were proposed in Deb et al. [2010]. Following the basic de-
sign principles proposed in Deb et al. [2010], the current article undertakes a detailed
performance evaluation and aims to establish the design trade-offs associated with
hierarchical small-world mm-wave wireless NoC architectures and highlight the key
design considerations necessary for high-bandwidth and low-power on-chip wireless
transceivers.
3. MM-WAVE WIRELESS NOC ARCHITECTURES
3.1. Proposed Architecture
In a traditional wired NoC, communication among embedded cores generally takes
place via multiple switches/routers and wired links. This multihop communication
becomes a major performance bottleneck, giving rise to high latency and energy
dissipation. To overcome this limitation we propose novel architectures inspired by
complex network theory, in conjunction with strategically placed on-chip mm-wave
wireless links, to design high-performance and low-power NoCs.
Modern complex network theory [Albert et al. 2002] provides a powerful method to
analyze network topologies. Between a regular, locally interconnected mesh network
and a completely random Erdős–Rényi topology, there are other classes of graphs, such
as small-world and scale-free graphs. Networks with the small-world property have
a very short average path length, defined as the number of hops between any pair of
nodes. The average shortest path length of small-world graphs is bounded by a polyno-
mial in log(N), where N is the number of nodes, making them particularly interesting
for efficient communication with minimal resources [Buchanan 2003; Teuscher 2007].
A small-world topology can be constructed from a locally connected network by
rewiring connections randomly, thus creating shortcuts in the network [Watts et al.
1998]. These random long-range links can be established following probability distri-
butions depending on the inter-node distances [Petermann et al. 2006] and frequency
of interaction between nodes. NoCs incorporating these shortcuts can perform sig-
nificantly better than locally interconnected mesh-like networks [Ogras et al. 2006;
Teuscher 2007], yet they require far fewer resources compared to a fully connected
system.
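The benefit of small-world shortcuts is easy to quantify with a toy experiment. The sketch below (illustrative only; the 64-node ring and the count of 8 shortcuts are arbitrary choices, not parameters from this article) measures the average hop count of a locally connected ring before and after adding a few random long-range links:

```python
import random
from collections import deque

def avg_path_length(adj):
    """Mean shortest-path hop count over all ordered node pairs (BFS per node)."""
    n = len(adj)
    total = 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))

def ring(n):
    """Locally connected ring: node i links to its two neighbors."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]

n = 64
print(avg_path_length(ring(n)))  # plain ring: about 16.25 hops on average

random.seed(1)
sw = ring(n)
added = 0
while added < 8:                 # add 8 random long-range shortcuts
    a, b = random.sample(range(n), 2)
    if b not in sw[a]:
        sw[a].add(b); sw[b].add(a)
        added += 1
print(avg_path_length(sw))       # noticeably shorter average path
```

Even a handful of shortcuts substantially cuts the average hop count, which is precisely the property the hierarchical mWNoC exploits with its wireless links.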
Our goal here is to use the small-world approach to build a highly efcient NoC
based on both wired and wireless links. The small-world topology can be incorporated
in NoCs by introducing long-range, high-bandwidth, and low-power wireless links be-
tween far apart cores. We first divide the whole system into multiple small clusters
of neighboring cores and call these smaller networks subnets. Subnets consist of rela-
tively fewer cores, enhancing flexibility in designing their architectures. These subnets
have NoC switches and links as in a standard NoC. As subnets are smaller networks,
intra-subnet communication will have a shorter average path length than a single NoC
spanning the whole system. The cores are connected to a centrally located hub through
direct links and the hubs from all subnets are connected in a second-level network forming
a hierarchical structure. This upper hierarchical level is designed to have small-world
graph characteristics constructed with both wired and wireless links. The hubs con-
nected through wireless links require Wireless Interfaces (WIs). To reduce wireless
link overheads and increase network connectivity, neighboring hubs are connected by
traditional wired links, and a few wireless links are distributed between hubs separated
by relatively long physical distances. As will be described in a later section, we use a
Simulated Annealing (SA) [Kirkpatrick et al. 1983]-based algorithm to optimally place
the WIs. The key to our approach is establishing an optimal overall network topology
under a given resource constraint, that is, the number of WIs.
The proposed hybrid (wireless/wired) and hierarchical NoC architecture is shown in
Figure 1 with the augmenting heterogeneous subnets. The hubs are interconnected
via both wireless and wired links while the subnets are wired only. The hubs with




Fig. 1. A hybrid (wireless/wired) hierarchical NoC architecture with heterogeneous subnets and small-
world-based upper-level configuration.
wireless links are equipped with WIs that transmit and receive data packets over the
wireless channels. For inter-subnet data exchange, a packet first travels to its respective
hub and reaches the hub of the destination subnet via the small-world network, where
it is then routed to the final destination core.
There can be various subnet architectures, like mesh, star, ring, etc. Similarly, the ba-
sic architecture of the second level of the hierarchy may vary. As an example, the hubs may
be connected in a mesh architecture with a few long-range wireless links spread across
them, creating a small-world network in the second level of the hierarchy. As case studies,
in this work we consider two types of subnet architectures, namely mesh and star-ring
(a ring architecture with a central hub connecting to every core). Corresponding to each
subnet architecture, we consider two upper-level small-world configurations, mesh and
ring, with long-range wireless shortcuts distributed among the hubs. Thus, the fol-
lowing four hierarchical mm-wave NoC architectures are considered: Ring-StarRing,
Ring-Mesh, Mesh-StarRing, and Mesh-Mesh. As an example, in the Ring-StarRing ar-
chitecture, the first term (Ring) denotes the upper-level architecture and the second
term (StarRing) indicates that the subnet is a star-ring topology. The same nomencla-
ture applies to the rest of the hierarchical architectures in this article.
3.2. Placement of WIs
The WI placement is crucial for optimum performance gain as it establishes high-speed,
low-energy interconnects on the network. Finding an optimal network topology with a
limited number of WIs is a nontrivial problem with a large search space. It is shown
in Ganguly et al. [2010] that for placement of wireless links in a NoC, the Simulated
Fig. 2. Flow diagram for the simulated annealing-based optimization of mWNoC architectures.
Annealing (SA) algorithm converges to the optimal configuration much faster than an
exhaustive search. Hence, we adopt an SA [Kirkpatrick et al. 1983]-based
optimization technique for placement of the WIs to get the maximum benefit from
the wireless shortcuts. SA offers a simple, well-established, and scalable approach for
the optimized placement of WIs as opposed to an exhaustive search. Initially, the WIs
are placed randomly, with each hub having an equal probability of getting a WI. The only
constraint observed while deploying the WIs is that a single hub can have
at most one WI.
Once the network is initialized randomly, an optimization step is performed using
SA. Since the WIs are deployed only on the hubs, the optimization is performed
solely on the second-level network of hubs. If there are N hubs in the network and n WIs
to distribute, the size of the search space S is given by

|S| = (N choose n) = N! / (n!(N − n)!). (1)
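To see why exhaustive search quickly becomes impractical, consider hypothetical values of N = 64 hubs and n = 8 WIs (illustrative numbers, not the configurations evaluated later in the article):

```python
import math

# Size of the search space from Eq. (1): choose n WI locations among N hubs.
N, n = 64, 8
print(math.comb(N, n))  # 4426165368 candidate placements for N=64, n=8
```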
Thus, with increasing N, it becomes increasingly difficult to find the best solution by
exhaustive search. To perform SA, a metric μ is established, which is closely related to
the connectivity of the network. To compute μ, the shortest distances between all pairs
of hubs are computed following the routing strategy outlined in the next section. The
distances are then weighted with a normalized frequency of communication between
the particular pair of hubs. The optimization metric can be computed as

μ = Σ_{i,j} h_ij · f_ij, (2)
where h_ij is the distance (in hops) between the ith source and jth destination. The
normalized frequency f_ij of communication between the ith source and jth destination
is the a priori probability of traffic interaction between the subnets, determined by the
particular traffic patterns of the application mapped onto the NoC. In
this case, equal weight is attached to both inter-hub distance and frequency of
communication. The steps used to optimize the network are shown in Figure 2.
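The flow of Figure 2 can be sketched in a few dozen lines. In the sketch below, the upper-level hub network is modeled as a grid with Manhattan wired distances, and the hop count h_ij is approximated as the cheaper of the direct wired path and a path through the nearest WIs plus one wireless hop; this simplified cost model and all parameter values are assumptions for illustration, not the simulator used in the article:

```python
import math
import random

def manhattan(a, b, cols):
    """Wired hop distance between hubs a and b on a cols-wide grid."""
    ay, ax = divmod(a, cols)
    by, bx = divmod(b, cols)
    return abs(ax - bx) + abs(ay - by)

def metric(wis, n_hubs, cols, freq):
    """mu = sum over pairs of h_ij * f_ij (Eq. 2); h_ij is the cheaper of the
    wired path and a route via the nearest WIs plus one wireless hop."""
    mu = 0.0
    for i in range(n_hubs):
        for j in range(n_hubs):
            if i == j:
                continue
            wired = manhattan(i, j, cols)
            wireless = (min(manhattan(i, w, cols) for w in wis) + 1 +
                        min(manhattan(w, j, cols) for w in wis))
            mu += min(wired, wireless) * freq[i][j]
    return mu

def anneal(n_hubs, cols, n_wi, freq, steps=1000, t0=2.0, alpha=0.995):
    """Simulated-annealing search for the WI placement minimizing mu."""
    random.seed(0)                            # deterministic initial placement
    wis = random.sample(range(n_hubs), n_wi)  # at most one WI per hub
    cur = metric(wis, n_hubs, cols, freq)
    best_wis, best = wis[:], cur
    t = t0
    for _ in range(steps):
        cand = wis[:]                         # perturb: move one WI to a free hub
        cand[random.randrange(n_wi)] = random.choice(
            [h for h in range(n_hubs) if h not in wis])
        m = metric(cand, n_hubs, cols, freq)
        # accept every downhill move; accept uphill moves with Boltzmann probability
        if m < cur or random.random() < math.exp((cur - m) / t):
            wis, cur = cand, m
            if cur < best:
                best_wis, best = wis[:], cur
        t *= alpha                            # geometric cooling
    return best_wis, best

# Example: 16 hubs on a 4x4 upper-level grid, 4 WIs, uniform traffic.
n, cols = 16, 4
freq = [[1.0] * n for _ in range(n)]
placement, mu = anneal(n, cols, 4, freq)
print(sorted(placement), mu)
```

The single-move perturbation preserves the one-WI-per-hub constraint by construction, and the geometric cooling schedule mirrors the standard Kirkpatrick formulation.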
An important point to note here is that similar results can also be obtained using
other optimization techniques, like Evolutionary Algorithms (EAs) [Eiben et al. 2003]
and coevolutionary algorithms [Sipper 1997]. Although EAs are generally believed to
give better results, SA reaches comparably good solutions much faster [Jansen et al.
2007]. We have used SA in this work as an example.
Fig. 3. Zig-zag antenna simulation setup.
4. COMMUNICATION SCHEME
The hubs with WIs are responsible for supporting efficient data transfer between the
distant nodes within the mWNoC by using the wireless communication channel. In
this section we describe the various components of the WIs and the adopted data
routing strategy. The two principal components of the WIs are the antenna and the
transceiver. Characteristics of these two components are discussed in Sections 4.1 and
4.2, respectively.
4.1. On-Chip Antennas
To be effective for the mWNoC application, the on-chip antenna must be wideband,
highly efficient, and sufficiently small. It has to provide the best power gain for the
smallest area overhead. A metal zig-zag antenna [Floyd et al. 2002] has been demon-
strated to possess these characteristics. This antenna also shows a negligible effect of
rotation (relative angle between transmitting and receiving antennas) on received signal
strength, making it most suitable for the mWNoC application [Zhang et al. 2007]. The
zig-zag antenna is designed with 10 μm trace width, 60 μm arm length, and 30° bend
angle. The axial length depends on the operating frequency of the antenna, which is
determined in Section 5.1. The details of the antenna simulation setup and antenna
structure are shown in Figure 3.
4.2. Wireless Transceiver Circuit
The design of a low-power wideband wireless transceiver is the key to guarantee the
desired performance of the mWNoC. Therefore, at both architecture and circuit lev-
els of the transceiver, low-power design considerations were taken into account. As
illustrated in the transceiver architecture diagram in Figure 4, the transmitter (TX)
circuitry consists of an up-conversion mixer and a Power Amplifier (PA). At the receiver
(RX) side, a direct-conversion topology is adopted, which consists of a Low Noise Amplifier
(LNA), a down-conversion mixer, and a baseband amplifier. An injection-lock Voltage-
Controlled Oscillator (VCO) is reused for TX and RX. With both direct-conversion and
injection-lock techniques, a power-hungry Phase-Locked Loop (PLL) is eliminated in
techniques [Deen et al. 2002], including both Forward Body-Bias (FBB) with DC volt-
ages, as well as body-driven by AC signals [Kathiresan et al. 2006], are implemented
in most of the circuit subblocks to further decrease their power consumptions.
Fig. 4. Block diagram of the mm-wave direct-conversion transceiver with injection-lock VCO.
Fig. 5. Schematic of the body-biased LNA with a feed-forward path for bandwidth extension.
The LNA is a crucial component in the RX chain as it determines the sensitivity of
the entire receiver. To achieve a wide bandwidth, a novel feed-forward path is imple-
mented, while body-enabled design keeps the power consumption low.
Figure 5 shows the circuit topology of the proposed low-power wideband LNA,
consisting of three stages. A Common-Source (CS) amplifier with inductive source de-
generation is chosen for the first stage since it has better noise performance than the
cascode topology. At the drain of transistor M1, inductors L3 and L4 form a bridged-
shunt-series peaking structure that serves to extend the bandwidth [Shekhar et al.
2006]. The second stage employs a cascode topology to enhance the overall gain and
reverse isolation of the LNA. Inductor L5 is adjusted to peak the gain at a slightly dif-
ferent frequency from the first stage, realizing a wideband overall frequency response.
Moreover, a feed-forward path, which can boost the gain of the first stage, is intro-
duced in the third stage, directly coupling the gate of M2 to M4 [Yu et al. 2010]. With
the peak gain of the second stage set at a higher frequency than that of the first stage,
this feed-forward path extends the bandwidth of the entire LNA at the lower end.
Moreover, the feed-forward path causes only trivial degradation to the overall Noise
Figure (NF) of the LNA, since the noise introduced by M4 is suppressed by the gain of
the first stage. In addition, M4 reuses the bias current of M5, hence no extra power
consumption is introduced.
As can be seen in Figure 5, FBB is implemented in the last two stages of the LNA.
The threshold voltage of an NMOS transistor can be expressed as [Sedra et al. 2004]

V_t = V_t0 + γ(√(2φ_F + V_SB) − √(2φ_F)),
Fig. 6. Schematic of the body-driven down-conversion mixer with body-biased dummy switching pair for
LO-feedthrough cancellation.
Fig. 7. Circuit topology of the low-power wideband body-biased PA.
where V_SB is the voltage between the body and source terminals, V_t0 is the threshold
voltage when V_SB = 0, γ is a process-dependent parameter, and φ_F is the bulk Fermi
potential. This indicates that by applying a positive bias voltage at the body terminal,
the threshold voltage of the NMOS can be effectively decreased without degradation
of device characteristics in terms of gain, linearity, and noise figure [Deen et al. 2002].
Accordingly, in the second stage of the LNA, since the source voltages of M2 and M3
are different, two different DC voltage levels are generated by the bias voltage V_B
and the voltage divider R_B1 and R_B2, and applied to the body terminals of M2 and
M3, respectively. The FBB in the third stage is implemented in a similar way. This
decreases the threshold voltages of these transistors, and hence the supply voltage is
reduced from 1 V to 0.8 V.
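The threshold reduction from FBB follows directly from the body-effect equation above. The sketch below evaluates it numerically; the parameter values (V_t0 = 0.45 V, γ = 0.4 V^0.5, φ_F = 0.35 V) are illustrative placeholders, not the process data used in this work. Under FBB the body-source junction is forward biased, so V_SB is negative:

```python
import math

def vth(vsb, vt0=0.45, gamma=0.4, phi_f=0.35):
    """Body-effect equation: Vt = Vt0 + gamma*(sqrt(2*phi_F + Vsb) - sqrt(2*phi_F)).
    All parameter values are illustrative placeholders, not measured process data."""
    return vt0 + gamma * (math.sqrt(2 * phi_f + vsb) - math.sqrt(2 * phi_f))

print(vth(0.0))   # zero body bias: Vt equals Vt0 = 0.45 V
print(vth(-0.4))  # forward body bias (Vsb = -0.4 V): Vt drops by roughly 0.12 V
```

A lower V_t directly permits the lower 0.8 V supply mentioned above, which is where the power saving comes from.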
The down-conversion mixer shown in Figure 6 uses a bulk-driven topology to save
power without sacrificing performance. Since the body terminal acts as a back-
gate, the RF signal is fed directly into the body terminals of the switching pair. In
this way, not only can the switching pair be biased at a very low DC current, but the
removal of the stacked transconductance stage also allows a lower supply voltage. In
addition, in order to eliminate Local Oscillator (LO) feedthrough at the Intermediate
Frequency (IF) port, a novel body-biased dummy switching pair consisting of M3 and
M4 is introduced. By adjusting the body-bias voltage of the dummy pair, the level of
LO cancellation can be optimized.
At the TX side, due to the short communication range of the mWNoC, the required
PA output power is much lower than in conventional mm-wave power amplifiers. Nev-
ertheless, it still needs to maintain a wide bandwidth for the required high data rate.
The circuit topology of the proposed three-stage PA is shown in Figure 7. The cascode
structure is used in the first stage for its high gain and better reverse isolation. Sim-
ilar to the LNA design, FBB is implemented in the cascode stack to lower the power
Fig. 8. The body-driven up-conversion mixer circuit.
Fig. 9. The body-biased injection-lock VCO circuit.
consumption of the PA. The other two are both CS stages, which can provide larger volt-
age headroom and thus better linearity. Moreover, for bandwidth extension, inductive
peaking is created by L2 and L4 at the outputs of the first and second stages, respec-
tively. Note that the bias current densities of the last two stages are set to around 0.3
mA/μm for maximum linearity [Yao et al. 2007]. A body-driven double-balanced mixer
serves as the up-conversion mixer. As depicted in Figure 8, the baseband signal is fed
into the body terminals of the switching pair, modulating the 55-GHz carrier signal.
The proposed schematic of the injection-lock VCO is shown in Figure 9. The injection-
locking technique [Razavi 2004] not only lowers the phase noise, but also reduces the
frequency and phase variation in the VCO without the use of a PLL. Moreover, FBB is
applied at the body terminals of all the transistors to lower their threshold voltages, so
that a lower bias voltage can be used to decrease the power consumption. As shown in
the schematic, transistors M1 and M2 form an NMOS cross-coupled pair. Transistor M3
acts as a tail current source as well as the signal injection point for the VCO. Furthermore,
in order to achieve a desirable locking range for the VCO, an injection amplifier M4 is
also implemented to boost the signal before it is fed into M3.
4.3. Adopted Routing Strategy
In the proposed hierarchical NoC, intra-subnet data routing depends on the
topology of the subnet. In this work, we consider two subnet topologies (i.e., mesh and
star-ring). In a mesh subnet, the data routing follows deadlock-free dimension order
(e-cube) routing. In a star-ring subnet, if the destination core is within two hops of the
source along the ring, then the flit is routed along the ring. If the destination is more
than two hops away, then the flit goes through the central hub to its destination. Thus,
within the star-ring subnet, each core is at a distance of at most two hops from any
other core. To avoid deadlock, we adopt the virtual channel management scheme of the
Fig. 10. An example of token-ow-control-based distributed routing.
Red Rover algorithm [Draper et al. 1997], in which the ring is divided into two equal
sets of contiguous nodes. Messages originating from each group of nodes use dedicated
virtual channels. This scheme breaks cyclic dependencies and prevents deadlock.
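The star-ring rule above (ring for destinations within two hops, central hub otherwise) can be sketched as a small routing function. `'HUB'` below is a hypothetical marker for the central hub, and the core numbering is an assumption for illustration:

```python
def star_ring_route(src, dst, n_cores):
    """Star-ring routing: take the ring when the destination is within two ring
    hops of the source, otherwise go through the central hub (at most two hops)."""
    cw = (dst - src) % n_cores   # clockwise ring distance
    ccw = (src - dst) % n_cores  # counter-clockwise ring distance
    if min(cw, ccw) <= 2:
        step = 1 if cw <= ccw else -1     # walk the shorter way around the ring
        path, cur = [], src
        while cur != dst:
            cur = (cur + step) % n_cores
            path.append(cur)
        return path
    return ['HUB', dst]                   # one hop up to the hub, one hop down

print(star_ring_route(0, 2, 8))  # [1, 2]: two ring hops
print(star_ring_route(0, 4, 8))  # ['HUB', 4]: via the central hub
```

Every core is thus reachable in at most two intra-subnet hops, matching the bound stated above.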
Inter-subnet data routing, however, requires the flits to use the upper-level net-
work consisting of wired and wireless links. By using the wireless shortcuts between
the hubs with WIs, flits can be transferred in a single hop between them. If the
source hub does not have a WI, the flits are routed to the nearest hub with a WI via
the wired links and are then transmitted through the wireless channel. Likewise, if the
destination hub does not have a WI, then the hub nearest to it with a WI receives the
data and routes it to the destination through wired links. Between a pair of source and
destination hubs without WIs, the routing path involving the wireless medium is cho-
sen if it reduces the total path length compared to the wired path. This can potentially
give rise to a hotspot situation at the WIs, because many messages try to access the
wireless shortcuts simultaneously, thus overloading the WIs and resulting in higher
latency. A token flow control [Kumar et al. 2008a] along with a distributed routing
strategy is adopted to alleviate this problem. Tokens are used to communicate the sta-
tus of the input buffers of a particular WI to the other nearby hubs, which need to use
that WI for accessing wireless shortcuts. Every input port of a WI has a token and the
token is turned on if the availability of the buffer at that particular port is greater than
a xed threshold and turned off otherwise. The routing adopted here is a combination
of dimension order routing for the hubs without WIs and South-East routing algorithm
for the hubs with WIs. This routing algorithm is proved to be deadlock free in Ogras
et al. [2006]. Figure 10 shows a particular communication snapshot of a mesh-based
upper-level network where hub 8 wants to communicate with hub 3. First at source 8,
the nearest WI (4 in this case) is identied. Then the routing algorithm checks whether
taking this WI reduces the total hop count. If so, the token for the south input port of
hub 4 is checked, and this path is taken only if the token is available. If this is not the
case, the message at hub 8 follows dimension order routing towards the destination
and arrives at hub 9. At hub 9, the shortest path using WIs is again searched, and
if the token from hub 10 allows the usage of the wireless shortcut, then the message is
routed through hub 10. Otherwise, the message follows dimension order routing and
keeps looking for the shortest path using WIs at every hub until the destination hub
is reached. Consequently, the distributed routing along with token flow control pre-
vents deadlocks and effectively improves performance by distributing traffic through
alternative paths. It is also livelock free since it generates a minimal path towards the
destination: the adopted routing ensures that the wireless shortcuts are only
followed if they reduce the hop count between source and destination. As a result, this
routing always tries to find the shortest path and never allows routing away from the
destination.
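The per-hub decision described above reduces to a simple token-gated comparison of hop counts. The following is a minimal sketch under assumed names, not the authors' implementation:

```python
def route_via_wi_or_dor(hops_wired, hops_via_wi, wi_token_available):
    """Token-gated shortcut selection at a hub (sketch).

    `hops_wired`: hop count of the pure dimension-order (wired) path.
    `hops_via_wi`: hop count if the message detours to the nearest WI
    and takes the single-hop wireless shortcut.
    The shortcut is taken only when it strictly shortens the path AND
    the WI's input-port token signals sufficient buffer space;
    otherwise the message keeps following dimension order routing and
    re-evaluates at the next hub.
    """
    if hops_via_wi < hops_wired and wi_token_available:
        return "wireless_shortcut"
    return "dimension_order"
```

Because the shortcut is taken only when it strictly reduces the hop count, the path is always minimal, which is what makes the scheme livelock free.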
In a ring-based upper-level network, the same principle of distributed routing and
token flow control is used. The message follows ring routing and keeps looking for the
shortest path with an available WI at every hub until the destination hub is reached. As
mentioned before, the ring routing adopted here is based on the Red Rover algorithm
[Draper et al. 1997], which provides deadlock-free routing by dividing the ring into
two equal sectors and using virtual channels. In this case too, the routing never allows
any packet to be routed away from the destination, and hence it is livelock free.
As all the wireless hubs are tuned to the same channel and can send or receive
data from any other wireless hub on the chip, an arbitration mechanism needs to be
designed to grant access to the wireless medium to a particular hub at a given
instant, avoiding interference and contention. To eliminate the need for centralized control
and synchronization, the adopted arbitration policy is a wireless token
passing protocol. It should be noted that the use of the word token in this case differs
from its usage in the aforementioned token flow control. According to this scheme,
the particular WI possessing the wireless token can broadcast flits into the wireless
medium. All other hubs will receive the flit, as their antennas are tuned to the same
frequency band. However, only if the destination address matches the address of the
receiving hub is the flit accepted for further routing, either to a core in the subnet of
that hub or to an adjacent hub. The wireless token is forwarded to the next hub with
a WI after all flits belonging to a packet at the current wireless token-holding hub are
transmitted.
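The arbitration round can be modeled as a circular token schedule: the holder transmits all flits of one packet, then passes the token on. This is an illustrative simulation sketch with assumed names, not the paper's implementation:

```python
from collections import deque

def simulate_token_passing(pending_packets, rounds):
    """Wireless-medium arbitration by token passing (sketch).

    `pending_packets[i]` is a queue of packet lengths (in flits) waiting
    at WI i. Only the token holder may broadcast; it sends every flit of
    one packet, then hands the token to the next WI in a fixed circular
    order. Returns the transmission log as (wi_index, flits_sent) pairs.
    """
    queues = [deque(p) for p in pending_packets]
    holder, log = 0, []
    for _ in range(rounds):
        if queues[holder]:
            # All flits of one packet go out before the token moves on.
            log.append((holder, queues[holder].popleft()))
        holder = (holder + 1) % len(queues)  # pass the wireless token
    return log
```

The model makes the scaling issue discussed later concrete: with more WIs sharing one token, each WI waits longer for its turn on the shared medium.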
5. PERFORMANCE EVALUATION
In this section we characterize the performance of the proposed mWNoC through rig-
orous simulation and analysis in the presence of various traffic patterns. First, we present
the characteristics of the on-chip wireless communication channel by elaborating on the
performance of the antenna and the transceiver circuits. Then we describe the detailed
network-level simulations considering various system sizes and traffic patterns.
Figure 11 shows an overview of the performance evaluation setup for the mWNoC. To
obtain the gain and bandwidth of the antennas we use the ADS Momentum tool [Agilent
2012]. The bandwidth and gain of the antennas are necessary for establishing the required
design specifications for the transceivers. The mm-wave wideband wireless transceiver
is designed and simulated using Cadence tools with a TSMC 65-nm standard CMOS pro-
cess to obtain its power and delay characteristics. The subnet switches and the digital
components of the hubs are synthesized using Synopsys tools with a 65-nm standard cell
library from TSMC at a clock frequency of 2.5 GHz. The energy dissipation of all the wired
links is obtained from actual layout in Cadence assuming a 20 mm × 20 mm die area.
[Figure content: the on-chip antenna is characterized using ADS Momentum (yielding gain and bandwidth); the wireless transceiver is implemented using Cadence tools with a 65-nm CMOS library from TSMC (yielding power and delay); the subnet routers and inter-subnet hubs are implemented using Synopsys tools with the same 65-nm library (power and delay); and metal wires of different lengths for inter-subnet and intra-subnet communication are laid out using Cadence tools to obtain exact power and delay numbers. These results, together with the Simulated Annealing (SA)-based optimized network configuration, feed the NoC simulator, which reports throughput, latency, and energy.]
Fig. 11. Overview of performance evaluation setup for mWNoC.
[Figure content: S21 in dB (−32 to −26) versus frequency in GHz (45 to 70), with the 3-dB bandwidth indicated.]
Fig. 12. Antenna S21 response.
All the power and delay numbers of various components are then fed into the network
simulator to obtain overall mWNoC performance.
5.1. Wireless Channel Characteristics
The metal zig-zag antennas described earlier are used to establish the on-chip wireless
communication channels. Figure 3 shows the simulation setup for the on-chip wireless
antennas. A high-resistivity silicon substrate (ρ = 5 kΩ-cm) is used for the simulation. In
our experimental setup, depending on the architecture of the 2nd level of the hierarchy,
the maximum distance between a transmitter and receiver pair is 18 mm, and this
communication range is used for the antenna simulations. The forward transmission gain
(S21) of the antenna is shown in Figure 12. For optimum power efficiency, the quarter-
wave antenna needs an axial length of 0.38 mm in the silicon substrate.
As shown in Figure 11 earlier, the mm-wave wideband wireless transceiver is de-
signed and simulated using the TSMC 65-nm standard CMOS process. The overall con-
version gain and Noise Figure (NF) of the receiver, including the proposed LNA and
mixer, are shown in Figure 13. The VCO generated 5 dBm of output power, and the
[Figure content: (a) conversion gain in dB versus frequency in GHz (45 to 65); (b) noise figure in dB versus frequency in GHz (55 to 65).]
Fig. 13. Simulated (a) gain, and (b) double-sideband NF of the receiver.
[Figure content: conversion gain in dB versus frequency in GHz (40 to 70).]
Fig. 14. Simulated TX conversion gain.
LO frequency was set to 55 GHz, in accordance with the center frequency of the LNA's
pass-band. The conversion gain is 20 dB at the center frequency and rises up to 20.5 dB
at the peak. This indicates that the mixer contributes around 7 dB of conversion gain.
Figure 13(a) also shows that the overall 3-dB bandwidth of the receiver front-end is
18 GHz. From Figure 13(b), it can be seen that the overall NF stays below 6 dB.
Note that double sideband (DSB) NF is used since the proposed receiver has a direct-
conversion structure. The LNA and the mixer drain 10.8 mA and 0.75 mA of current,
respectively, from the 0.8 V DC bias. Accordingly, the aggregate power consumption of
the receiver front-end is 9.24 mW.
The conversion gain of the transmitter is illustrated in Figure 14. The transmitter
has a peak gain of 15 dB and a 3-dB bandwidth of 18.1 GHz. Furthermore, circuit
simulation also shows that the output 1-dB gain compression point (P1dB) of the trans-
mitter is 0 dBm. With the implementation of body-enabled design in both the PA and
the up-conversion mixer, the TX front-end consumes only 14 mW from a 0.8-V supply
voltage.
The achieved aggregate power consumption of the entire transceiver is 36.7 mW,
16% lower than the previous design without body-enabled techniques [Yu et al.
2010; Deb et al. 2010]. It is able to support a data rate of at least 16 Gbps and a BER
< 10⁻¹⁵ using an OOK modulation scheme.
Using the preceding characteristics of the antennas and mm-wave transceivers, we
carried out a comparative performance analysis between the wireless and wired com-
munication channels. The wired channels are considered to be 32 bits wide, which is
equal to the flit width considered in this article. Each link of the wired channel is
designed with the optimum number of uniformly placed and sized repeaters. Figure 15
Fig. 15. The variation of energy dissipation per bit with distance for a wired and a wireless link.
presents how the energy dissipated per bit changes as a function of length for both wireless
and wired links. From this plot it can be observed that wireless shortcuts are more
energy efficient whenever the link length is 7 mm or more. Moreover, wireless links
eliminate the need to lay out long buffered wire links with many via cuts from the
upper-layer wires all the way down to the substrate. Hence, implementing long-
range links beyond 7 mm wirelessly makes the design more energy efficient and simpler
in terms of layout. In our implementation, the minimum and maximum distances be-
tween the WIs communicating over the wireless channel are 7.07 mm and 18 mm,
respectively. Therefore, in this case, using the wireless channel is always more energy
efficient.
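The crossover rule above can be written down directly. This is a hypothetical helper, assuming the 7 mm crossover read off Figure 15; the names are not from the paper:

```python
WIRELESS_CROSSOVER_MM = 7.0  # from Figure 15: wireless wins at >= 7 mm

def pick_link_type(length_mm):
    """Choose the physical link type for a long-range shortcut (sketch).

    Below the crossover length, a repeated wire dissipates less energy
    per bit; at or beyond it, the mm-wave wireless link is preferred.
    """
    return "wireless" if length_mm >= WIRELESS_CROSSOVER_MM else "wired"
```

Since all WI-to-WI distances in this design fall between 7.07 mm and 18 mm, the rule always selects the wireless channel for the upper-level shortcuts.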
5.2. Network-Level Performance Evaluation
In this section, we analyze the characteristics of the proposed mWNoC and study the
trends in performance as the number of WIs is increased for a particular system.
Our aim is to establish a design trade-off point that helps us select the optimum
number of WIs for certain performance requirements. For our experiments, we consider
three different system sizes, namely 128, 256, and 512 cores, and the die area is kept
fixed at 20 mm × 20 mm in all cases. The subnet switch architecture is adopted
from Pande et al. [2005]. It has three functional stages, namely, input arbitration,
routing/switch traversal, and output arbitration. Each packet consists of 64 flits. The
input and output ports, including the ones on the wireless links, have 4 virtual channels
per port, each having a buffer depth of 2 flits. The subnet switches and the hubs handle
the inputs and outputs separately as explained in Dally [1992]. Although the average
message latency decreases as buffer size increases, the effect of buffer size on performance
is small [Duato et al. 2002]. Moreover, increasing buffer capacity does not improve
performance significantly if messages are longer than the diameter of the network times
the total buffer capacity of a virtual channel, which is the case for all the system sizes
considered in this study [Duato et al. 2002]. This is also verified using our own
simulations, and the trade-off point for buffer depth that provides optimum performance
without consuming excessive silicon area is obtained as two flits.
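The Duato et al. condition invoked above can be stated as a one-line check. This is an illustrative sketch with assumed parameter names, not code from the cited work:

```python
def buffer_depth_insensitive(msg_len_flits, network_diameter, vc_buffer_flits):
    """Rule of thumb from Duato et al. [2002] (sketch).

    Extra buffering yields little performance gain when a message is
    longer than the network diameter times the per-virtual-channel
    buffer capacity: the wormhole-routed message then spans the whole
    path regardless of buffer depth.
    """
    return msg_len_flits > network_diameter * vc_buffer_flits
```

For example, 64-flit packets with 2-flit virtual-channel buffers satisfy the condition on any network of diameter less than 32 hops, which covers all the system sizes considered here.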
Similar to the intra-subnet communication, we adopt wormhole routing in the wire-
less channel too. Consequently, the hubs have architectures similar to the NoC switches
in the subnets. Hence, each port of the hub has the same input and output arbiters, and
an equal number of virtual channels with the same buffer depths as the subnet switches.
The number of ports in a hub depends on the number of links connected to it. The hubs
Fig. 16. Performance variation with change in buffer depth for the ports associated with WIs for a 256-core
Mesh-StarRing system.
also have three functional stages, but as the number of cores increases in a subnet,
the delays in arbitration and switching in some cases exceed one clock cycle.
Depending on the subnet sizes, traversal through these stages needs multiple cycles,
and this has been taken into consideration when evaluating the overall performance of the
mWNoC. The ports associated with the WIs have an increased buffer depth of 8 flits
to avoid an excessive latency penalty while waiting for the token. Increasing the buffer
depth beyond this limit does not produce any further performance improvement for
this particular packet size, but gives rise to additional area overhead. This is shown
in Figure 16 for a 256-core Mesh-StarRing system divided into 16 subnets. The wireless
ports of the WIs are assumed to be equipped with antennas and wireless transceivers.
A self-similar traffic injection process is assumed.
The network architectures developed earlier are simulated using a cycle-accurate
simulator. The delays in flit traversals along all the wired interconnects that constitute the
proposed hybrid NoC architecture are considered when quantifying the performance.
These include the intra-subnet core-to-hub wired links and the inter-hub links in the
upper level of the network. The delays through the switches and inter-switch wires of
the subnets and the hubs are taken into account as well.
To quantify the energy dissipation characteristics of the proposed mWNoC architec-
ture, we determine the packet energy dissipation, E_pkt. The packet energy is the energy
dissipated on average by a packet from its injection at the source to its delivery at the
destination. This is calculated as

E_pkt = (N_intrasubnet · E_subnet,hop · h_subnet + N_intersubnet · E_sw · h_sw) / (N_intrasubnet + N_intersubnet),

where N_intrasubnet and N_intersubnet are the total numbers of packets routed within the
subnets and between the subnets, respectively; E_subnet,hop is the energy dissipated by a
packet traversing a single hop on the wired subnet, including a wired link and a switch;
and E_sw is the energy dissipated by a packet traversing a single hop on the 2nd level
of the mWNoC network, which has the small-world property. The average numbers of
hops per packet in the subnet and in the upper-level small-world network are denoted by
h_subnet and h_sw, respectively.
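The packet-energy metric is a weighted average over the two packet classes and can be sketched directly from the definition (the function and argument names are illustrative):

```python
def packet_energy(n_intra, n_inter, e_subnet_hop, h_subnet, e_sw, h_sw):
    """Average packet energy E_pkt of the hierarchical mWNoC (sketch).

    Weighted average over intra-subnet packets (n_intra, each spending
    e_subnet_hop joules per hop over h_subnet hops on average) and
    inter-subnet packets (n_inter, each spending e_sw joules per hop
    over h_sw hops on the upper-level small-world network).
    """
    total = n_intra + n_inter
    return (n_intra * e_subnet_hop * h_subnet
            + n_inter * e_sw * h_sw) / total
```

With equal packet counts in both classes, E_pkt is simply the mean of the two per-class path energies.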
To determine the optimum division of the proposed hierarchical architecture in terms
of achievable bandwidth, we evaluate the performance by dividing the whole system
in various alternative ways. Figure 17 shows the achievable bandwidth for a 256-core
Mesh-StarRing mWNoC divided into different numbers of subnets. As can be seen from
the plot, the division of the whole system into 16 subnets with 16 cores each performs
the best. Similarly, the suitable hierarchical division that achieves the best performance is
Fig. 17. Bandwidth of a 256-core Mesh-StarRing mWNoC for various hierarchical congurations.
Fig. 18. Achievable bandwidth and energy trade-offs for Mesh-Mesh and Mesh-StarRing architectures with
varying number of WIs.
Fig. 19. Achievable bandwidth and energy trade-offs for Ring-Mesh and Ring-StarRing architectures with
varying number of WIs.
determined for all the other system sizes. For system sizes of 128 and 512 cores, the optimum
number of subnets turns out to be 8 and 32, respectively.
Figures 18 and 19 show the achievable bandwidth and packet energy trade-offs of the
proposed mWNoC with varying numbers of WIs under a uniform random spatial traffic
distribution for the three system sizes of 128, 256, and 512 cores divided into 8, 16,
and 32 subnets, respectively. Figure 18 considers two specific architectures (i.e., Mesh-
Mesh and Mesh-StarRing), where the upper levels are mesh-based topologies, and the
subnets are Mesh and StarRing, respectively. Figure 19 represents the characteristics
of Ring-Mesh and Ring-StarRing architectures. It can be observed that, for all system
sizes and architectures, the bandwidth initially increases as more WIs are placed in the
system, then reaches a saturation point and starts decreasing. Placing more
WIs in the system beyond a certain threshold has a negative impact on system perfor-
mance. This happens because the wireless communication channel is a shared medium.
As explained in Section 4.3, each WI needs to obtain a token to access the wireless
channel. Since there is only one wireless token circulating among all the WIs, as the
number of WIs goes up, the delay in acquiring the wireless medium for a particular WI
increases. Consequently, the overall system performance degrades. Moreover, as the
number of WIs increases, the overall energy dissipation from the WIs becomes higher,
which causes the packet energy to increase as well. Considering all these factors, we
can find the optimum number of WIs corresponding to each system size.
From our experiments we find that the overall bandwidth reaches a maximum with
4, 6, and 10 WIs for the 128-, 256-, and 512-core systems, respectively. Beyond this limit, the
bandwidth degrades while the energy dissipation continues to increase. This study helps
us determine the optimum number of WIs for each system size. It is also evident
that the Mesh-StarRing architecture always outperforms the Mesh-Mesh architecture, as
it has better connectivity in the subnets. This results in lower packet energy for the
Mesh-StarRing architecture compared with the Mesh-Mesh architecture. The same trend
is also observed in the architectures with ring-based upper levels, where systems with
StarRing subnets always achieve higher bandwidth than those with Mesh subnets. In
terms of upper-level topologies, systems with a mesh-based upper level always perform
better and have lower packet energy than those with a ring-based upper level due to a
more efficient upper-level network. These advantages of the mesh-based upper-level
network come at the cost of extra wiring overhead. For a 256-core system, a mesh-based
upper-level network has 24 wired links whereas a ring-based counterpart has only 16
wired links, or 33% fewer links. The new Mesh-StarRing architecture along with the
routing mechanism elaborated in Section 4.3 results in a 14.6% bandwidth improvement
and a 48% saving in packet energy for a 256-core system with 6 WIs in comparison with
our previous work [Deb et al. 2010].
5.3. Comparative Performance Evaluation with Other Interconnect Technologies
In this section, we consider two other possible interconnect technologies, namely, RF-I
and G-line, that can be used as long-range links in the proposed hierarchical small-
world architecture. Here, we consider a 256-core Mesh-StarRing architecture and
undertake a comparative study to establish the relative performance benefits achieved
by using these alternatives as long-range shortcuts in the upper-level small-world
network. We first design a small-world NoC (RFNoC) using RF-I links as the shortcuts,
maintaining the same hierarchical topology. As mentioned in Chang et al. [2008], in
65-nm technology it is possible to have 8 different frequency channels each operating
at a data rate of 6 Gbps. Like the wireless channel, these RF links can be used
as the long-range shortcuts in the hierarchical NoC architecture. These shortcuts are
optimally placed using the same SA-based optimization as used for placing the WIs in
the mWNoC. The optimum locations of the hubs with RF-I interfaces differ from those
of the mWNoC due to the difference in their characteristics. Next, we construct a
small-world NoC (GLNoC) by replacing the shortcuts of the RFNoC with G-lines [Mensink
et al. 2007], again maintaining the same hierarchical topology. Each G-line uses a
capacitive pre-emphasis transmitter that increases the bandwidth and decreases the
voltage swing without the need for an additional power supply. To avoid cross-talk,
differential interconnects are implemented with a pair of twisted wires. A decision
feedback equalizer is employed at the receiver to further increase the achievable data
Fig. 20. Implementation details of different NoC shortcuts.
Fig. 21. Comparative analysis of hierarchical small-world NoCs with three types of long-range shortcuts
under uniform random traffic.
rate. Each G-line can sustain a bandwidth of around 2.5 Gbps for a wire length of
11 mm. The exact implementation details for the NoC architectures with different
shortcuts are shown in Figure 20.
Figure 21 shows the overall system bandwidth and packet energy dissipation for
mWNoC, RFNoC, and GLNoC in a 256-core Mesh-StarRing system under uniform
random traffic. A 256-core flat mesh is also included for comparison. Compared to the
flat mesh architecture, the three hierarchical NoCs achieve much higher bandwidth
and consume significantly less energy. This is because a hierarchical network reduces
the average hop count and hence improves the performance. Moreover, packets get
routed faster and hence occupy resources for less time and dissipate less energy in the
process. The energy dissipation figures for RF-I and G-line are obtained from Chang et al.
[2008] and Mensink et al. [2007], respectively. It can be observed that, among the three types
of small-world NoCs, GLNoC has the lowest bandwidth and the highest packet energy.
The length of the shortcuts is more than 11 mm, and hence the G-line communication
essentially becomes multihop. Consequently, this reduces the overall achievable band-
width of GLNoC. In addition, the high capacitance of G-line links causes more energy
Fig. 22. (a) 256-core system with mesh-based upper-level architecture. (b) 256-core system with ring-based
upper-level architecture.
dissipation. Though mWNoC and RFNoC have the same hierarchical architecture, the
latter performs better because in RFNoC multiple shortcuts can work simultaneously,
whereas in mWNoC only one pair can communicate at a particular instant of time since
the wireless channel is a shared medium. However, the long-range link area overhead and
layout challenges of the RFNoC are greater than those of the mWNoC. For example,
in the case of a 20 × 20 mm² die, an RF interconnect of approximately 100 mm length
has to be laid out for RFNoC. This is significantly longer than the combined length of all
the antennas used in mWNoC, which is 3.8 mm for the largest system size (the 512-core
system with 10 WIs) considered in this article. A detailed link area overhead analysis
for all the different NoC architectures is presented in Section 5.5.
5.4. Performance Evaluation with Nonuniform Traffic
In order to evaluate the performance of the proposed mWNoC architecture with nonuni-
form traffic patterns, we consider both synthetic and application-based traffic distri-
butions. In the following analysis, the system size considered is 256 (with 16 subnets
and 16 cores per subnet), and the number of WIs is chosen to be 6, as obtained in
Section 5.2. The WIs are placed optimally depending on the particular traffic scenario,
following the algorithm discussed in Section 3.2.
We consider two types of synthetic traffic to evaluate the performance of the proposed
mWNoC architecture. First, a transpose traffic pattern [Ogras et al. 2006] is considered,
where certain pairs of hubs communicate more frequently with each other. We have
considered 3 such pairs of subnets, and 50% of the packets generated from one of these
subnets are targeted towards the other in the pair. The other synthetic traffic pattern
considered was the hotspot pattern [Ogras et al. 2006], where each hub communicates with a
certain number of hubs more frequently than with the others. We have considered three
such hotspot locations, to which all other hubs send 50% of the packets that originate
from them. To represent a real application, a 256-point Fast Fourier Transform (FFT)
is considered on the same 256-core mWNoC subdivided into 16 subnets. Each core is
considered to perform a 2-point radix-2 FFT computation. The traffic pattern generated
in performing multiplication of two 128 × 128 matrices is also used to evaluate the
performance of the mWNoC.
Figure 22 presents the bandwidth and packet energy for the different traffic patterns.
The same performance-energy trade-off trend as with the uniform random traffic is
observed in this case as well. The architectures with mesh-based upper levels always
outperform the architectures with ring-based upper levels. Systems with StarRing
subnets also perform better than those with Mesh subnets.
We also evaluate the performance of a 256-core Mesh-StarRing mWNoC with re-
spect to a corresponding RFNoC and GLNoC of the same size in the presence of the
aforementioned application-specific traffic patterns. As shown in Figure 23, the same
Fig. 23. Comparison of the three types of hierarchical small-world NoC architectures with nonuniform
traffic patterns.
Fig. 24. Total silicon area for 256-core mWNoC, RFNoC, and GLNoC.
bandwidth and energy dissipation trend is observed as in the uniform random traffic
case.
5.5. Area Overheads
In this subsection, we quantify the area overhead for the hierarchical mWNoC as
well as for the RFNoC and the GLNoC studied in the previous section. In the mWNoC,
the antenna used is a 0.38 mm long zig-zag antenna. The area of the transceiver cir-
cuits required per WI is the total area required for the OOK modulator/demodulator,
LNA, PA, and VCO. The total area overhead per wireless transceiver turns out to be
0.3 mm² for the selected frequency range. The digital part of each WI, which is very
similar to a traditional wireline NoC switch, has an area overhead of 0.40 mm². There-
fore, the total circuit area overhead per hub with a WI is 0.7 mm². The transceiver area
overheads for RF-I and G-line are taken from Chang et al. [2008] and Mensink et al.
[2007], respectively. The total silicon area overheads for mWNoC, RFNoC, and GLNoC in
a 256-core Mesh-StarRing system are shown in Figure 24. The required silicon area
is largely dominated by the intra-subnet switches associated with the NoC architec-
tures. The area overheads arising from the hubs along with the required transceivers
for the different shortcuts are shown separately. The hub area is dominated by the non-
transceiver digital components. Though the transceiver areas of the mm-wave wireless
links, RF-I, and G-line vary, the digital components of the hubs are the same irre-
spective of the particular interconnect technology used. Consequently, the overall hub
area overhead does not vary much with the interconnect technology. This is
Fig. 25. Area overhead and performance degradation trade-off with decreasing the number of connections
of each hub for a 256-core mWNoC.
Fig. 26. Total wiring requirements for 256-core mWNoC, RFNoC, and GLNoC.
also clearly evident from the plots in Figure 24. By decreasing the number of core-
to-hub direct connections in each subnet, the hub area overhead can be reduced, but
this results in performance degradation. In Figure 25, we show the area overhead versus
performance degradation trade-off obtained by decreasing the number of connections of
each hub. It is clear that reducing the number of core-to-hub direct connections reduces
the hub area overhead but also significantly affects the overall achievable bandwidth
of the system.
All three hierarchical NoC architectures have different link area requirements.
Figure 26 shows the wiring overhead of the three NoCs (i.e., mWNoC, RFNoC, and GLNoC)
built upon a 256-core Mesh-StarRing architecture in a 20 mm × 20 mm die. The wiring
requirements for a flat mesh architecture are shown for comparison. From Figure 26, it is
evident that RFNoC requires an additional 100 mm of wire to establish the long-range short-
cuts in the upper level compared to mWNoC. Similarly, GLNoC also requires several
additional long wires as the long-range shortcuts. In mWNoC, long-range communica-
tion is predominantly done over wireless links, and as a result there is no requirement for
long wires, as can be seen from Figure 26. It may be noted that in the hierarchical NoC
architectures, there are no inter-subnet direct core-to-core links. Hence, all three
hierarchical NoC architectures eliminate a number of wireline links along the subnet
boundaries which are present in the flat mesh topology.
6. CONCLUSIONS
The Network-on-Chip (NoC) is an enabling methodology for integrating large numbers of
embedded cores on a single die. The existing method of implementing a NoC with planar
metal interconnects is deficient due to the high latency and significant power consumption
arising from the multihop links used in data exchanges. One of the promising ways to ad-
dress this is to replace multihop wired interconnects with high-bandwidth single-hop
long-range millimeter (mm)-wave wireless links. Through a detailed performance eval-
uation, we establish the relevant design trade-offs for various mm-wave wireless NoC
(mWNoC) architectures. A detailed comparison of hierarchical NoCs based on wireless,
RF-I, and G-line links shows their respective merits and limitations. It is shown that in the
mWNoC, by adopting a hierarchical small-world architecture incorporating body-
biased on-chip wireless links, the performance improves significantly without unduly
large area overhead compared to more conventional wireline counterparts.
Received April 2011; revised August 2011; accepted September 2011
Best of Both Worlds: A Bus Enhanced NoC (BENoC)



Ran Manevich¹, Isask'har Walter¹, Israel Cidon², and Avinoam Kolodny²
Electrical Engineering Department, Technion - Israel Institute of Technology, Israel
¹{ranman,zigi}@tx.technion.ac.il
²{cidon,kolodny}@ee.technion.ac.il


Abstract

While NoCs are efficient in delivering high
throughput point-to-point traffic, their multi-hop
operation is too slow for latency sensitive signals. In
addition, NoCS are inefficient for multicast operations.
Consequently, although NoCs outperform busses in
terms of scalability, they may not facilitate all the
needs of future SoCs. In this paper, the benefit of
adding a global, low latency, low power shared bus as
an integral part of the NoC architecture is explored.
The Bus-enhanced NoC (BENoC) is equipped with a
specialized bus that has low and predictable latency
and performs broadcast and multicast. We introduce
and analyze MetaBus, a custom bus optimized for such
low-latency low power and multicast operations. We
demonstrate its potential benefits using an analytical
comparison of latency and energy consumption of a
BENoC based on MetaBus versus a standard NoC.
Then, simulation is used to evaluate BENoC in a
dynamic non-uniform cache access (DNUCA)
multiprocessor system.



1. Introduction

Novel VLSI literature promotes the use of a multi-
stage Network-on-Chip (NoC) as the main on-chip
communication infrastructure (e.g., [2], [4], [6]). NoCs
are conceived to be more cost effective than buses in
terms of traffic scalability, area and power in large
scale systems [3]. Thus, NoCs are considered to be the
practical choice for future CMP (Chip Multi-
Processor) and SoC (System on Chip) system
communication.
The majority of the traffic delivered by the
interconnect in SoC and CMP systems involves latency
insensitive, point-to-point, large data transfers such as
DMA memory replication. However, other kinds of
communication should also be facilitated by the


interconnect. Examples include L2 cache read
requests, invalidation commands for cache coherency,
interrupt signals, cache line search operations, global
timing and control signals, and broadcast or multicast
queries for valid resources. Although the volume of traffic
of these operations is relatively small, the manner in
which the interconnect supports them heavily affects
the performance of the system due to their latency
sensitive nature. While interconnect architectures
which solely rely on a network are cost effective in
delivering large blocks of data, they have significant
drawbacks when other services are required. Multi-hop
networks impose inherent multi-cycle packet delivery
latency on the time-sensitive communication between
modules. Moreover, advanced communication services
like broadcast and multicast incur prolonged latency
and involve additional hardware mechanisms or
massive duplication of unicast messages.
Current NoC implementations are largely
distributed (borrowing concepts from off-chip
networks). We argue that the on-chip environment
provides the architect with a new and unique
opportunity to use "the best of both worlds" from the
on-chip and the off-chip worlds. In particular,
communication schemes that are not feasible in large
scale networks become practical, since on-chip
modules are placed in close proximity to each other.
Consequently, we propose a new architecture termed
BENoC (Bus-Enhanced Network on-Chip) composed
of two tightly-integrated parts: a low latency, low
bandwidth specialized bus, optimized for system-wide
distribution of control signals, and a high performance
distributed network that handles high-throughput data
communication between module pairs (e.g., XPipes
[2], QNoC [4], AEthereal [6]). We also propose an
implementation of a BENoC bus termed MetaBus,
optimized for low latency and multicast, that is used
throughout the paper. As the bus is inherently a single
hop, broadcast medium, BENoC is shown to be more
cost-effective than pure network-based interconnect.
Fig. 1 demonstrates BENoC for a cache-in-the-middle
CMP. In this example, a grid shaped NoC serves point-
to-point transactions, while global, time critical control
______________________________________________________________________________
A preliminary concise version of this work was published in IEEE Computer Architecture Letters, Volume 7, Issue 1, 2008.

Figure 1. A BENoC-based CMP system, composed of 16 processors and 4x4 L2 caches.

messages are sent using the bus.
BENoC's bus (e.g., MetaBus) is a synergetic medium
operating in parallel with the network, improving
existing functionality and offering new services. In
previous NoC proposals that include a bus (e.g., [10],
[11]) the bus typically serves as a local mechanism in
the interconnect geographical hierarchy. In [11], each
cluster of modules uses a shared bus for intra-cluster
traffic while inter-cluster traffic uses the network. This
way, the delivery of data to short distances does not
involve the multi-hop network. [10] suggests a bus-
NoC hybrid within a uniprocessor system. By
replacing groups of adjacent links and routers with fast
bus segments, performance is improved. However, the
data which is delivered over the network is also
delivered over the bus. In contrast, BENoC's bus is
used to send messages that are different than those
delivered by the network, such as control and multicast
messages.
This paper also introduces the benefits of BENoC for
facilitating cache access, coherency, and search in CMP
systems, in particular the search for migrating cache
lines in CMP DNUCA (Dynamic Non-Uniform Cache
Architecture). Further services of the built-in bus include
NoC subsystem control, multicast, anycast, and
convergecast.
The paper is organized as follows: in Section 2, we
discuss possible usages of the BENoC architecture. In
Section 3 we introduce MetaBus, a custom BENoC
bus implementation. MetaBus employs a novel
combination of a tree topology and a power-saving
masking mechanism that disables signal propagation
toward unused tree leaves. In Section 4 we analyze
MetaBus-based BENoC latency and energy
consumption and show its advantage over a
conventional NoC. Our latency model is confirmed
using a layout implementation described in that
section. In Section 5 we analyze the power-saving
benefits of the masking mechanism. In Section 6 we
evaluate the proposed architecture using network
modeler simulation. Section 7 concludes the paper.
2. BENoC Service

BENoC is composed of two tightly related parts: a
packet switched network (e.g., Xpipes[2], QNoC [4],
AEthereal [6]) for point-to-point massive data
transfers, and MetaBus that functions as a low latency
broadcast/multicast/unicast media. MetaBus is used for
NoC subsystem control, propagation of critical signals
and special custom services. In this section we describe
some of the benefits of using BENoC.

2.1 BENoC for latency sensitive signaling

In typical NoC-based systems, packets that traverse
a path of multiple hops suffer from high latency, due
to the router switching and routing delays accumulated
along the way. This latency is often unacceptable for
short but urgent signaling messages required for the
timely operation of the system. This is often cited as
one of the main obstacles to early
adoption of a NoC-based architecture. MetaBus,
designed for low bandwidth and short latency, offers a
valuable alternative: urgent messages may be sent over
the bus, traversing only a single arbitration stage. This
enables quick delivery of time critical signals between
modules.

2.2 BENoC multicast services

BENoC enables efficient implementation of
advanced communication services. For example, a SoC
may include multiple specialized resources distributed
across the chip (e.g., DSP processors, ALUs,
multipliers, memory banks). A processor may wish to
send a task to one (or more) of these resources. Hence,
the processor needs to know which of them are idle. As
an alternative to extensive probing, BENoC can
integrate an anycast service where a task is delivered
obliviously to one of the idle resources. For instance,
the processor may initiate a bus multicast destined at
"any idle multiplier". In response, idling multipliers
may arbitrate for the bus and send back their ID, while
the data itself is delivered point-to-point over the network.
Even more sophisticated buses may include a
convergecast mechanism that facilitates the efficient
collection of acknowledgements or negative responses
back to the initiator. Finally, the most basic service
provided by the bus is a multicast (broadcast)
operation: In order to deliver a message from one
source to a group of (all) destinations using a basic
NoC, the sender needs to generate multiple unicast
messages [5]. While NoCs may include a built-in
multicast mechanism (e.g., [7]), it will fall behind the
simplicity and low latency of the proposed bus.
2.3 BENoC for CMP cache

A broadcast operation is valuable in shared memory
CMP systems. Typically, each of these processors is
equipped with a local (L1) cache and they all share a
distributed L2 cache (Figure 1). In order to facilitate
cache coherency, the system should provide a
mechanism that prevents applications from reading
stale data. More specifically, when a processor issues a
read exclusive (i.e., read for ownership) command to
one of the L2 caches, all other processors holding a
copy of that cache line should invalidate their local
copy, as it no longer reflects the most updated data.
Such invalidation signal is best propagated using a
broadcast/multicast service. As wire latency becomes a
dominant factor, the L1 miss penalty is heavily
affected by the distance between the processor and the
L2 cache bank holding the fetched line. This
observation gave rise to the DNUCA (Dynamic Non-
Uniform Cache Architecture) approach: instead of
having a few statically allocated possible L2 locations,
cache lines are moved towards processors that access
them [8], [9]. One of the major difficulties in
implementing DNUCA is the need to lookup cache
lines: whenever a processor needs to conduct a line fill
transaction (fetch a line into its L1 cache), it needs to
determine the identity of the L2 cache bank/processor
storing its updated copy. As described above, in a
network-based interconnect, the line can be looked for
using multiple unicast probes. BENoC offers a more
efficient alternative: MetaBus can be used to broadcast
the query to all cache banks. The particular cache
storing the line can acknowledge the request over the
bus and simultaneously send the line's content over the
NoC. As queries include small meta-data (initiating
processor's ID and line's address), they do not create
substantial load on MetaBus.

2.4 BENoC for system management

MetaBus can also facilitate the configuration and
management of the NoC itself. For example, when
changing the system's operation mode ("usecases" in
[12]), the network may need to be configured. Such a
configuration may include updating routing tables,
adjusting link speeds and remapping the system
modules address space. It may also be desirable to shut
off parts of the NoC when they are not used for a long
time in order to save power. Interestingly, although
these operations are not performed during the run-time
of the system, they should be handled with extreme
care, since the configuration of network elements may
interfere. For example, if a configuration packet turns
off a certain link (or a router), other configuration
messages may not be able to reach their destination due
to "broken paths." Similarly, trying to update a routing
table while the network is being used to deliver other
configuration messages is problematic. Alternatively,
configuration can be done via MetaBus making the
configuration and management process much simpler.

3. MetaBus implementation

In this section we present MetaBus, a custom bus
for BENoC. The design principles and guidelines are
presented first, followed by bus architecture, control
mechanisms and communication protocol.

3.1 MetaBus Architecture

MetaBus serves as a low-bandwidth complementary
bus aside a high-bandwidth network. Conventional
system busses (e.g. [20], [21], [22]) are too expensive
in terms of area, power and system complexity for the
limited bus tasks in the BENoC architecture. Thus,
MetaBus does not utilize segmentation, spatial reuse,
pipelining, split transactions and other costly
throughput-boosting mechanisms. MetaBus should
instead provide a low-latency, predictable communication
medium that outperforms the network in terms of
power and latency for short unicast, broadcast and
multicast transactions.
MetaBus is constructed as a tree (not necessarily
binary or symmetric) with the communicating modules
on its leaves (Fig 2). The bus is comprised of a root
and "bus stations" on the internal vertices of the tree.
Bus access is regulated by the well-known Bus Request
(BR) / Bus Grant (BG) interface. For example, if
module 2 wishes to transmit a metadata packet to
module 9, it issues a bus request that propagates
through the bus stations up to the root. The root
responds with a bus grant that propagates all the way
back. At this stage of the transaction a combinatorial
path between the transmitting module and receiving
modules is built up. When the bus grant is received,
module 2 transmits an address flit that is followed by
data flits. These flits first go up to the root and then are
spread throughout the tree (broadcast) or through
selected branches (unicast or multicast).
A masking mechanism that is mastered by the root
and configured according to the address flit, prevents
the data from reaching unnecessary tree branches and
thus saves power in unicast and multicast transactions.
After the reception of all valid data flits, the receiver
(module 9) releases the bus by an acknowledgement to
the root and a new transaction may take place.


Figure 2. An example of a 9-module MetaBus.

MetaBus data and its control signals are
synchronized with a bus clock that is connected only to
the root and to the modules in the leaves of the tree.
Bus stations are clocked by a Bus Grant signal from
the level above them.

3.1.1 Arbitration. The proposed bus utilizes a
distributed arbitration mechanism. The BR/BG
interface is used between all bus units including
modules, bus stations and the root. A requesting bus
station, if granted, will pass a BG to one of its
descendant vertices. The arbitration block in the bus
stations (Fig. 3) uses the BG input as a clock for its
edge-sensitive arbitration state machine. The proposed
mechanism enables adjustment of arbitration priorities
by placing higher-priority modules at higher tree levels
or by manipulating the arbitration logic of specific bus stations.
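As an illustration, a round-robin version of this BR/BG arbitration can be sketched behaviorally. The class name, method name, and single-step grant model below are our own simplifications of the edge-clocked state machine, not the paper's implementation:

```python
# Behavioral sketch of one bus station's round-robin arbitration. This is a
# simplification: the real arbiter is an edge-sensitive state machine clocked
# by the Bus Grant from the level above; here one call to grant() models one
# arbitration decision.

class BusStationArbiter:
    def __init__(self, num_children):
        self.num_children = num_children
        self.last_granted = -1          # round-robin pointer

    def grant(self, br_inputs):
        """br_inputs: one Bus Request bit per child. Returns the index of the
        child that receives Bus Grant, or None if nobody is requesting."""
        for offset in range(1, self.num_children + 1):
            child = (self.last_granted + offset) % self.num_children
            if br_inputs[child]:
                self.last_granted = child
                return child
        return None

arb = BusStationArbiter(3)
print(arb.grant([1, 0, 1]))   # -> 0
print(arb.grant([1, 0, 1]))   # -> 2 (the pointer moves past child 0)
```

Priorities could be skewed, as the text suggests, by changing the order in which children are scanned, on top of placing high-priority modules closer to the root.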


Figure 3. Arbitration and data switch block in a
binary bus station and their connectivity.

3.1.2 Data Path. The MetaBus data path is composed of
two parts: from the transmitter to the root, and from
the root to the receivers. A combinatorial route
between the data sending module and the root is
established during bus grant and propagates down to
the transmitter by data switches in the bus stations
(Fig. 3). From the root down, the data propagates to the
designated recipients and is blocked from arriving to
irrelevant modules by a masking mechanism described
in the next sub-section. Once established, there is a
combinatorial data path between the transmitter and the
receivers. The delay of this path defines the minimum
clock period for the bus.

3.1.3 Masking. In conventional busses, data reach all
the receivers connected to the bus (or to a bus segment)
even in unicast transactions. The masking mechanism's
role is to save power by preventing the data from reaching
unnecessary modules.


Figure 4. Masking Concept.

The mask control logic is located in the root and
controls data propagation from the root down using a
dedicated line to every bus station. In Figure 4 module
3 transmits data to modules 1 and 5. Due to masking,
the data do not reach 6 of the 7 non-recipient modules,
saving redundant switching of wires and gates.
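The branch-enabling decision behind this masking can be modeled in a few lines. This is a sketch under our own assumptions, not the paper's logic: the tree is a dictionary mapping each internal node to its children, leaves are module ids, and the root opens exactly the branches whose subtrees contain a destination:

```python
def leaves_under(tree, node):
    """All module ids (leaves) in the subtree rooted at node."""
    if node not in tree:                 # a leaf: a module id
        return {node}
    out = set()
    for child in tree[node]:
        out |= leaves_under(tree, child)
    return out

def open_branches(tree, root, destinations):
    """The (parent, child) branches the root must leave unmasked so that
    data reaches every destination and no other leaf."""
    opened = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            if leaves_under(tree, child) & destinations:
                opened.add((node, child))
                if child in tree:        # keep descending through stations
                    stack.append(child)
    return opened

# A root with two stations: s1 holds modules 1-3, s2 holds modules 4-5.
tree = {"root": ["s1", "s2"], "s1": [1, 2, 3], "s2": [4, 5]}
print(open_branches(tree, "root", {4}))  # unicast: only the s2 side opens
```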

3.1.4 Addressing. Unlike conventional system busses,
MetaBus does not include a separate address bus for
simplicity and area reduction. We dedicate the first
word of every transaction for the destination address,
which is consistent with the NoC addressing scheme.
Transaction lengths are either constant or defined by
the sender using an End Of Transaction signal. We
allocate part of the address space to define commonly
used pre-defined multicast sets. For instance if the data
bus width is 8 (K=8) and we have a 64 modules
system, we have 192 possible addresses for pre-defined
multicast sets (one of them is broadcast). This scheme
is similar to the multicast address convention in
Ethernet and IP.
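With these numbers, the address-space split can be sketched as follows. The helper name and the convention that unicast addresses occupy the low codes are our assumptions, not the paper's exact encoding:

```python
# Address-word interpretation for an 8-bit data bus and a 64-module system:
# 2**8 = 256 codes, 64 for unicast, the remaining 192 for predefined
# multicast sets (one of which is broadcast).

BUS_WIDTH = 8        # data bus width in bits
NUM_MODULES = 64

NUM_MULTICAST_SETS = 2 ** BUS_WIDTH - NUM_MODULES   # 192

def decode_address(word):
    """Classify the first (address) word of a transaction."""
    if not 0 <= word < 2 ** BUS_WIDTH:
        raise ValueError("address word wider than the data bus")
    if word < NUM_MODULES:
        return ("unicast", word)                     # one destination module
    return ("multicast_set", word - NUM_MODULES)     # predefined set index

print(NUM_MULTICAST_SETS)    # -> 192
print(decode_address(9))     # -> ('unicast', 9)
print(decode_address(200))   # -> ('multicast_set', 136)
```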

3.1.5 Acknowledge signals. The bus supports two
feedback signals: Ack and Nack. Each module has
active high Ack and Nack outputs that are joined with
AND gates in the bus stations and form a global Ack
and a global Nack that are generated in the root. Since
the bus is acting as a fast, low-bandwidth metadata
transmission medium aside a high bandwidth network,
many of the signals delivered through the bus may
trigger a response through the NoC. The Ack and Nack
may be used to distinguish between the case when the
transmitter is supposed to receive data through the
NoC (Ack) and the case when all the recipients
intercepted the current message and free the bus
(Nack).

3.2 State machine and communication protocol

In this sub-section we present a communication
protocol for a variable length transaction bus with a
data valid line. The proposed bus is clocked at the root
and the modules in the leaves. The bus state machine is
implemented in the arbitration unit in the root and is
presented in Figure 5. The event sequence starts from
the "Wait for BR" state. As soon as one of the modules
issues a bus request by setting its BR high, one of
the root's BR inputs goes high. The root arbitration unit issues
a BG to one of the children that requested the bus
according, for example, to a round-robin principle.
Once BG is issued, the root waits for the first word
from the transmitter. This word includes the address of
the receiver or a multicast set and is marked with a data
valid signal. Then, according to address decoding,
mask lines are activated. Following the masking, data
is transmitted until sender sets data valid bit to "0".
Next, the root de-asserts the mask by masking all the
bus again, waits for Ack or Nack, de-asserts BG timed
according to the received acknowledgement type and
passes again to BR wait state.
We dedicate half of the bus clock cycle for
interactions between the modules and the root (bus
grant propagation, address word decoding,
acknowledgement, masking setting, etc.) and a full
clock cycle for data transmission between
communicating modules. Under these assumptions
each transaction requires 3 clock cycles for arbitration,
addressing, masking and acknowledgement processes.
For each transaction we define the transaction latency
as the number of bus clock cycles required from bus
request of the transmitter until the last data word
clocked by the receiver. In the proposed protocol, a K
data words transaction latency is K+2.5 clock cycles.
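This cost is easy to make concrete; the helper below (ours, for illustration) exposes the fixed 2.5-cycle protocol overhead on top of one cycle per data word:

```python
def metabus_latency_cycles(k_words):
    """Bus clock cycles from the transmitter's Bus Request until the last
    data word is clocked by the receiver: K + 2.5 under the protocol above."""
    return k_words + 2.5

print(metabus_latency_cycles(1))   # single-word signal -> 3.5 cycles
print(metabus_latency_cycles(4))   # 4 data words       -> 6.5 cycles
```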


Figure 5. Bus state machine. * BG goes low
according to the acknowledgement type.
4. Latency and power analysis

For simplicity, the NoC (including the NoC part of
BENoC) is assumed to have a mesh topology, where
system modules are square tiles. The following
notation is used: n is the number of modules in the
system, V is the logic voltage swing, C_0 and R_0 are
the capacitance and resistance of wires per millimeter
of length, and P is the tile size [mm]. The number of global
wire segments between the modules and the root is
denoted by B_D, and L_W is the data link width in bits.
The minimal clock period for MetaBus is the sum
of all segment latencies along the data path. The NoC
latency is the number of NoC clock cycles required to
complete a given wormhole transaction. Our power
dissipation analysis takes into account the switching of
global wires and their drivers, for both the NoC and
MetaBus. Local logic power dissipation is neglected.
We define τ = R_inv·C_inv as the technology time
constant, where C_inv is the input capacitance
and R_inv the effective output resistance of an inverter.
Assuming that the load capacitance C_Driver at the end of
each wire segment on the bus is a driver of another
identical segment, we get an upper bound for the bus
segment delay [13]:

T = 0.7·[(τ/C_Driver)·(C_Wire + C_Driver) + R_Wire·C_Driver] + 0.4·R_Wire·C_Wire    (1)
If the capacitance driven by the wire is a small gate,
such as in a NoC link, the load is negligible and the
delay becomes:

T = 0.7·(τ/C_Driver)·C_Wire + 0.4·R_Wire·C_Wire    (2)
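The two delay expressions can be evaluated numerically. The sketch below follows our reading of (1) and (2), with the driver resistance taken as τ/C_Driver; any parameter values used with it would be technology-specific, not from the paper:

```python
# Segment-delay model: tau is the technology time constant R_inv * C_inv,
# so a driver with input capacitance c_driver has resistance tau / c_driver.

def bus_segment_delay(tau, r_wire, c_wire, c_driver):
    """Eq. (1): the wire drives an identical driver at its far end."""
    r_driver = tau / c_driver
    return (0.7 * (r_driver * (c_wire + c_driver) + r_wire * c_driver)
            + 0.4 * r_wire * c_wire)

def link_delay_small_load(tau, r_wire, c_wire, c_driver):
    """Eq. (2): the load is a small gate, so its capacitance is neglected."""
    return 0.7 * (tau / c_driver) * c_wire + 0.4 * r_wire * c_wire
```

For any positive parameters, dropping the load terms can only shorten the delay, so (2) always lies below (1).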
For the latency and energy of transactions in a NoC-
based system, we assume that a broadcast is composed
of multiple unicast messages. Since an average unicast
message has to travel √n modules away from the
source, the minimal time to complete the unicast is:

T_net,uni = √n · N_CiR · T_Nclk + T_Nclk · N_flits    (3)

where T_Nclk stands for the NoC clock period, N_CiR for the
router latency in clock cycles, and N_flits for the number
of flits per transaction. For broadcast, we assume that
the first messages are sent to the most distant modules
and the last ones to the nearest neighbors, and that
the source transmits a flit every NoC clock cycle:

T_net,broad ≈ n · T_Nclk · N_flits    (4)

Note that the NoC latency approximations used do
not account for network congestion and thus may
underestimate the realistic values.
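Equations (3) and (4) translate directly into code; the function names and the sample parameters below are ours, and congestion is ignored just as in the analysis:

```python
import math

def noc_unicast_latency(n, t_nclk, n_cir, n_flits):
    """Eq. (3): sqrt(n) hops of router latency plus flit serialization."""
    return math.sqrt(n) * n_cir * t_nclk + t_nclk * n_flits

def noc_broadcast_latency(n, t_nclk, n_flits):
    """Eq. (4): the source injects one flit per cycle for n unicast copies."""
    return n * t_nclk * n_flits

# 64-module mesh, 2-cycle routers, 4-flit packets (illustrative values):
print(noc_unicast_latency(64, 1.0, 2, 4))    # -> 20.0 cycles
print(noc_broadcast_latency(64, 1.0, 4))     # -> 256.0 cycles
```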
Assuming a NoC link is P millimeters long, its
resistance and capacitance are R_NL = P·R_0 and
C_NL = P·C_0. We use (2) to find the appropriate NoC
driver strength C_ND. The NoC link delay is then
equal to one half of a NoC clock cycle, under the
assumption that between two routers data is sent and
received on opposite network clock edges. The input
capacitance of the routers is neglected, since it is the
capacitance of a driver that drives internal logic and
is therefore small. We get:

0.5·T_Nclk = 0.7·(τ/C_ND)·C_NL + 0.4·R_NL·C_NL    (5)
In order to calculate the total energy required for a
NoC broadcast, the number of packet transmissions
must be determined. In a regular mesh, a source node may have
at most 8 modules at a distance of one hop, 16 modules two
hops away, 24 modules three hops away, and so on. In
the energy-wise best case, the broadcasting module is
located exactly in the middle of the mesh. It therefore
has to send 8 messages that each travel a single
link, 16 messages that travel two links, and in
general 8j messages to a distance of j hops, until
a total of n - 1 messages have been transmitted. It can easily be
shown that if √n is an integral, odd number, then the
Manhattan distance between the module in the middle
of the mesh and the ones on its perimeter is exactly
D_max = (√n - 1)/2. Since a message transmitted to a
destination j hops away has to traverse j router-to-
router links, the minimal number of transmissions
required to complete the broadcast is

K = 8·1 + 16·2 + 24·3 + ... + 8·D_max·D_max = Σ_{j=0}^{D_max} 8j²    (6)

Consequently, the lower bound on the total energy
consumed by a single broadcast operation is:

E_net = V² · N_flits · K · (C_NL + C_ND)    (7)
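The transmission count (6) and the energy bound (7) are easy to check numerically. The sketch below assumes, as the text does, that √n is an odd integer, and all parameter values are illustrative:

```python
import math

def broadcast_link_traversals(n):
    """Eq. (6): 8j messages travel j links each, up to D_max = (sqrt(n)-1)/2."""
    d_max = (math.isqrt(n) - 1) // 2
    return sum(8 * j * j for j in range(d_max + 1))

def broadcast_energy_lower_bound(n, v, n_flits, c_nl, c_nd):
    """Eq. (7): each link traversal switches one link wire and its driver."""
    return v ** 2 * n_flits * broadcast_link_traversals(n) * (c_nl + c_nd)

# 9x9 mesh: 8+16+24+32 = 80 = n-1 messages, traversing 240 links in total.
print(broadcast_link_traversals(81))   # -> 240
```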
An average network unicast transaction comprises
D_max = (√n - 1)/2 hops; therefore:

E_net,uni = V² · N_flits · L_W · ((√n - 1)/2) · (C_NL + C_ND)    (8)
Note that (7) and (8) underestimate the NoC energy
consumption since in-router power is neglected.
Similarly, we estimate the latency and energy
consumption for MetaBus. The data path is comprised
of two parts: from the transmitter up to the root, and
from the root down to the receivers. The die size is
defined as P_D = √n·P. We assume that the average hop
distance between the modules and the root is P_D/2 and
that the bus stations are uniformly spread along this
distance. An average data link segment length between
two bus units would be P_D/(2B_D).
The actual structure of the first part of the MetaBus
data path (from the modules up to the root) is
illustrated in figure 6. The second part of the data path
(from the root down) differs in the fan-out of each
bus station. Moreover, the data switch is replaced with
a masking gate that drives the same driver, but this
difference does not change the delay according to the
model we use (1). The model of a driver that drives a
wire with an identical driver at its end implies an
upper delay bound for the second part of the data path,
since the capacitances of the large drivers are loaded
simultaneously by the input gates of the bus stations.
The first data path part is identical for both
broadcast and unicast transactions: data goes through
B_D global wire segments. The energy required for this
part is given by:

    E_bus,up = V^2 * N_flits * B_D * (C_BL + C_BD,up)    (9)

where C_BL stands for the MetaBus data segment
capacitance and equals C_0*P_D/(2*B_D), and C_BD,up is the
capacitance of the drivers that drive data segments in the
first part of the data path.


Figure 6. MetaBus modules-to-root bus
segment. BG's are Bus Grant signals to the next
bus stations. These signals also control the data
switch.

In the second data path part, data spreads from the root
across the whole tree in broadcast and, using masking,
propagates to a much smaller set of modules that
includes the recipient of a unicast (for space
considerations, multicast is not presented). We
distinguish between the data driver strength of the first
part and that of the second part, since the driver of the
second part potentially drives a larger capacitance, by a
factor of B_R, where B_R is the bus tree rank. We define
gamma as the "wires overlap factor". This factor is 1 if data
wires from the root down are physically separated, and it
asymptotically reaches 1/B_R if data wires are separated
only very near to units that share the same upper level
unit. The role of this factor is illustrated in figure 7. In
broadcast, the energy consumption on the way down is:

    E_bus,broad,down = V^2 * N_flits * ( gamma * SUM[n=1..B_D] C_BL * B_R^n
                                         + SUM[n=1..B_D] C_BD,down * B_R^n )    (10)
where C_BD,down is the same as C_BD,up, but for the second
part of the data path (from the root down).
In a unicast, due to masking, only B_R links are
loaded in every tree level, thus:

    E_bus,uni,down = V^2 * N_flits * B_R * (gamma * B_D * C_BL + B_D * C_BD,down)    (11)
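Under the reconstruction of (10) and (11) used here, the saving that masking provides on the down path can be illustrated numerically. The capacitance values below are illustrative assumptions of ours, not the paper's numbers:

```python
# Sketch of Eqs. (10)-(11), root-to-leaves energy, for a rank-4 tree of
# depth B_D = 3 with wire-overlap factor gamma = 0.3 (illustrative values).
V, N_flits = 1.5, 3
B_R, B_D, gamma = 4, 3, 0.3
C_BL = 1e-13        # assumed segment capacitance [F]
C_BD_down = 5e-14   # assumed down-path driver capacitance [F]

# Eq. (10): in broadcast, tree level n loads B_R**n segments and drivers.
E_broad = V**2 * N_flits * sum(
    (gamma * C_BL + C_BD_down) * B_R**n for n in range(1, B_D + 1))

# Eq. (11): with masking, only B_R links are loaded per level.
E_uni = V**2 * N_flits * B_R * (gamma * B_D * C_BL + B_D * C_BD_down)

print(E_broad, E_uni)
```

For this rank-4, depth-3 tree the broadcast loads 4 + 16 + 64 = 84 link/driver pairs, while a masked unicast loads only 4 per level, so the down-path unicast energy is a factor of 7 smaller.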

Figure 7. The parameter gamma describes the
amount of wire sharing in the MetaBus datapath
from the root to the leaves (gamma=1 means no
wire sharing, gamma = 1/2 for maximal wire
sharing in a tree of rank 2).

We deduce the bus data path delay by summing the
delays of the global wire segments. Using (1) we get:

    T_BUS,UP = B_D * T_link,UP
             = B_D * ( 0.7*tau*(C_BL + C_BD,up)/C_BD,up
                       + 0.7*R_BL*C_BD,up + 0.4*R_BL*C_BL )    (12)

    T_BUS,DOWN = B_D * T_link,DOWN
               = B_D * ( 0.7*tau*(gamma*B_R*C_BL + C_BD,down)/C_BD,down
                         + 0.7*R_BL*C_BD,down + 0.4*R_BL*C_BL )    (13)
where R_BL stands for the bus data segment resistance
and equals R_0*P_D/(2*B_D). We find the optimal driver
sizes C_BD,up and C_BD,down by finding the minima of the
functions in (12) and (13), and get:

    C_BD,up,opt = sqrt(tau*C_BL / R_BL),
    C_BD,down,opt = sqrt(gamma*B_R*tau*C_BL / R_BL)    (14)

Consequently, using (12) and (13) with the optimal
driver sizes from (14), we can get the estimation of the
MetaBus minimum data path delay:

    T_MetaBus = T_BUS,UP(C_BD,up,opt) + T_BUS,DOWN(C_BD,down,opt)    (15)
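The repeater effect noted later in this section (MetaBus latency falling as bus stations are added) already follows from (12) and (14). The sketch below uses our reconstruction of those equations with the 0.18um values quoted later in this section; it is illustrative only:

```python
# Sketch: up-path delay (Eq. 12) with the optimally sized driver (Eq. 14),
# as a function of the number of bus stations B_D along the path.
import math

C0, R0, tau = 243e-15, 22.0, 7.5e-12   # 0.18um values from this section
P_D = 10.0                              # die size [mm]

def t_bus_up(b_d):
    C_BL = C0 * P_D / (2 * b_d)         # per-segment capacitance [F]
    R_BL = R0 * P_D / (2 * b_d)         # per-segment resistance [ohm]
    C_drv = math.sqrt(tau * C_BL / R_BL)                 # Eq. (14)
    return b_d * (0.7 * tau * (C_BL + C_drv) / C_drv
                  + 0.7 * R_BL * C_drv + 0.4 * R_BL * C_BL)  # Eq. (12)

for b_d in (1, 2, 4, 8):
    print(b_d, t_bus_up(b_d))
```

The quadratic 0.4*R_BL*C_BL wire term shrinks as 1/B_D per segment, so adding stations initially reduces the total delay; eventually the 0.7*tau per-stage cost dominates and the delay rises again.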
A bus transaction requires 1.5 clock cycles for
arbitration and bus access management before data
transmission, and a single clock cycle afterwards for
acknowledgement (section 3.3). For latency calculation
we consider only the first 1.5 cycles; thus, the latency of
an N_flits transaction would be:

    Bus Latency = (N_flits + 1.5) * T_MetaBus    (16)

Using (9), (10), (11), and (14) we can deduce the
total bus energy consumption for an N_flits transaction.
The derived expressions are evaluated using typical
values for 0.18um technology (C_0 = 243e-15 F/mm;
R_0 = 22 ohm/mm; tau = 7.5 ps; V = 1.5 V, [15] [16]). We also
assume a die size P_D = 10 mm, gamma = 0.3 (a reasonable
number for a topology where architecturally close bus
stations and modules are topologically close as well), a
NoC clock frequency 1/T_Nclk = 1 GHz, a data link width
L_W = 16, and N_flits = 3 flits, including the
address flit, for both the MetaBus and NoC cases.
Analytical results of power dissipation vs. number of
modules for MetaBus and NoC broadcast and unicast
transactions are presented in figure 8. MetaBus is
assumed to be a homogeneous tree with a rank of 4,
and its drivers were speed-optimized according to (14).
The broadcast transaction latencies of the NoC and
MetaBus are presented in figure 9. Note that the
MetaBus latency decreases with the number of
modules, which seems counterintuitive. This happens
because the bus stations act as repeaters along the
long resistive data interconnect. Calculations for a 64-
module system are presented in table 1 for future use.

[Figure 8 plot: energy per transaction [nJ], 0-3.5, vs. number of modules, 0-40; curves: MetaBus Broadcast, Network Broadcast, MetaBus Unicast, Network Unicast]
Figure 8. Bus and NoC unicast and broadcast
energy per transaction.

[Figure 9 plot: latency [ns], 0-120, vs. number of modules, 0-40; curves: MetaBus, Network Broadcast, Network Unicast]
Figure 9. Bus and NoC broadcast latencies.

Note that for a reasonable system size, MetaBus
significantly outperforms the NoC in terms of latency
for both unicast and broadcast transactions. Therefore,
BENoC latency should also outperform a NoC with a
built in broadcast replication mechanism. Our analysis
predicts MetaBus to also outperform NoC in energy
consumption for broadcast transactions.

Table 1. Energy and latency for a 64-module system.

Type                  | Energy [nJ]    | Latency [ns]
                      | Broad  | Uni   | Broad  | Uni
MetaBus (Speed Opt.)  | 4.13   | 1.26  | 1.98   | 1.98
NoC                   | 5.69   | 0.13  | 192    | 27
Our latency model was verified by simulations
using extracted layout parameters of a 0.18um process.
First, we constructed a floor-plan of a balanced 64-
module binary tree on a 10mm x 10mm die (Figure
10). Then automated layout and routing with timing
optimization was performed. Our tools do not perform
repeater insertion. In addition, although the MetaBus
infrastructure occupied less than 0.1% of the die area,
the layout tool did not perform wire sizing and rarely
used high metal levels. Despite these limitations, by
deducing the average driver strength and the wire
capacitance and resistance per unit of length
(C_0 = 130e-15 F/mm, R_0 ~ 1 kohm/mm,
C_BD,UP ~ 3.7e-15 F, C_BD,DOWN ~ 5.8e-15 F) and
using tau = 7.5 ps, we estimated the data-path delay
using our model and compared the results to the post-
layout timing simulation results. The bus delay
according to our model was ~8.5 ns, while the minimum
post-layout simulated data-path delay was ~10 ns,
within 15% of the estimate.


Figure 10. A 64-module binary MetaBus floor-
plan. The bus stations are bounded by
rectangles that were placed by hand. The thick
blue lines stand for airlines of the data links and
arbitration interfaces between bus stations and
form a tree structure. The thin lines are the
masking wires that run separately from the root to
each bus station.

5. Masking mechanism benefits

In figure 4, the inputs of modules and bus stations
that are driven by an unmasked bus station are counted
as "active gates". The power calculations account for the
power of the active gates and of the wires. We define the
"power ratio" as the ratio between the power consumption
of a data transfer from the root to the modules with and
without the use of the masking mechanism. The power
consumption of the first data path part is much smaller
than that of the second data path part, even with
masking (by a factor of the tree rank for unicast and by
a higher ratio for multicast and broadcast). Therefore,
we disregard the first part in our power ratio
calculations. In terms of active gates this ratio would
be:

    PR = AG (Active Gates) / TG (Total Gates)    (17)

We also disregard the contribution of the masking
mechanism to the power consumption, since it only
drives a single global line per active bus station, once
per transaction.


(a) (b)
Figure 11. Three-level trees with a rank of 4. The 4
highlighted leaves are the recipients. Blue bus
stations are not masked (active).

For the sake of simplicity, we assume a balanced
tree. In multicast transactions, the power-saving
efficiency of the masking mechanism depends on the
recipients' distribution across the tree leaves. Dispersed
multicast sets utilize non-adjacent bus stations across the
tree, a fact that increases the number of active gates
(Fig. 11). We deduce an analytical lower bound on the
number of active gates for a given multicast set size
(MN). It is obvious that the lowest active gate count is
achieved if the modules are arranged from left to right
(or vice versa) as in figure 11a. In this example, AG
starts from 12 and jumps by 4 (R = Tree Rank) every 4
modules, by 8 (2R) every 16 modules, by 12 (3R)
every 64 modules, etc. Generally, in a tree of depth D:

    AG_DATA(MN) = R * ( 1 + SUM[K=1..D-1] ceil(MN / R^K) )    (18)
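Equation (18), as reconstructed here, can be checked against the example above: AG = 12 for a unicast in the rank-4, three-level tree of figure 11, and AG equals the total gate count for a full broadcast. A small sketch of ours:

```python
# Sketch: best-case active gate count of Eq. (18) and the total gate count
# used in the power ratio of Eq. (17), for a rank-R tree of depth D.
import math

def active_gates(mn, r, d):
    # Eq. (18): each of the D-1 internal levels activates ceil(MN / R^K)
    # stations in the best (left-packed) recipient arrangement.
    return r * (1 + sum(math.ceil(mn / r**k) for k in range(1, d)))

def total_gates(r, d):
    return sum(r**k for k in range(1, d + 1))

R, D = 4, 3
print(active_gates(1, R, D))                       # unicast
print(active_gates(5, R, D))                       # first jump of R
print(active_gates(64, R, D), total_gates(R, D))   # full broadcast
```

A unicast activates 12 gates; adding a fifth recipient forces a second level-2 station and jumps the count to 16; a full 64-leaf broadcast activates all 84 gates, matching PR = 1.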
A simulation of power ratio vs. multicast set size for
256-module trees with ranks of 2, 4 and 16 was
performed. For each rank and multicast set size we
deduced the average PR over 10,000 different
random sets. In figure 12 we present the results
together with the analytical minimum-bound results
from Eq. 18. As expected, the masking mechanism is
highly effective for unicast and small multicast sets.
Moreover, if modules are topologically arranged with
masking and multicast-set awareness, such that
commonly used sets are concentrated topologically in
the tree, the masking mechanism stays effective even
for large multicast sets.
[Figure 12 plot: power ratio, 0-1, vs. relative multicast set size, 0-1; curves: R=2 Average, R=4 Average, R=16 Average, R=2 Min, R=16 Min]
Figure 12. Average and analytical minimum
power ratios.

6. Experimental results

In this section, we evaluate the performance of a
large-scale chip multiprocessor (CMP) using dynamic
non-uniform cache access (DNUCA) for the on-chip
shared cache memory. The cache is comprised of
multiple banks, as illustrated in Figure 1. We compare
a classic NoC and a MetaBus-based BENoC
infrastructure, using the 0.18um technology
parameters from section 3. Two time-critical
operations are addressed. The first is a basic line-fill
("read") transaction, performed by a processor that
reads a line into its L1 cache. If an L2 cache has a valid
copy of the line, it must provide its content to the
reading processor. If the most updated copy resides in
an L1 cache of another processor, that processor is asked
to write back the line. Otherwise, the line is fetched from
a lower level of the memory hierarchy (L3 cache or
memory). The second transaction addressed is the read-
for-ownership ("read-exclusive") transaction. While
similar to the basic line-fill operation, it also implies
that the reading processor wishes to own the single
valid copy of the line in order to update its content. To
complete the transaction, all other L1 copies of the
line (held by an owning processor or by sharers) must
be invalidated.
In a classic DNUCA implementation, the processor
has to look up the line prior to the read/read-exclusive
operation. When a regular NoC is used, the line is
sought using multiple unicast messages, while in
BENoC the search is conducted over MetaBus. In this
work, a distributed directory model is assumed: each
L2 cache line includes some extra (directory) bits to
keep track of the current sharers/owner of the line [1].
As explained in Section 2, a static directory would
render the DNUCA policy useless; hence these bits
migrate together with the line. The simulated system
consists of 16 processors and 64 L2 cache tiles (80
modules in total). The modules are arranged as
depicted in Fig. 1, with the 4x4 cache array replaced
by an 8x8 array of banks. The network link rate is set at
16 Gbits/s (reflecting a 1 GHz network clock and 16-bit
wide links). MetaBus is 16 bits wide with a transaction
latency of 4 ns (twice the optimal number from table 1).
In order to evaluate the proposed technique, a
combination of two simulators is used. The BENoC
architecture is simulated using OPNET [16]. The
model accounts for all network layer components,
including wormhole flow control, virtual channels,
routing, buffers and link capacities. In addition, it
simulates the bus arbitration and propagation latencies.
The DNUCA system is modeled using the Simics [17]
simulator running the SPLASH-2 and PARSEC
benchmarks [18] [19].

[Figure 13 bar charts comparing NoC and BENoC for APACHE, BARNES, CANNEAL, FACESIM, OCEAN, SPECJBB, WATER and Average: (a) transaction latency [ns], 0-700; (b) mega transactions/sec, 0-160]
Figure 13. Performance improvement in
BENoC compared to a NoC-based CMP for 7
different applications. Upper (a): average read
transaction latency [ns]. Lower (b): application
speed [trans./sec].

Fig. 13 presents the improvement achieved by the
BENoC architecture compared to a classic NoC in
terms of average transaction latency and application
speedup. Average transaction latency is 3-4 times
lower in BENoC (Fig. 13a). Fig. 13b depicts
application execution speed in terms of read
transactions/sec. On average, BENoC introduced an
execution speedup of around 300%. The FACESIM
benchmark was accelerated by only 3%, since it
produces relatively little coherency traffic.
7. Summary

The BENoC architecture, described in this paper,
combines a customized bus with a NoC to get the
best of both worlds: the low latency and broadcast
capability inherent in the bus, together with the spatial
reuse and high throughput of the network. The
customized bus can circumvent most weaknesses of the
NoC, since critical signals that require low latency
typically comprise just a few bits. Similarly, the
complexity and cost of broadcast operations in the
NoC can be avoided by using the bus, because only
short metadata messages are usually transmitted in
broadcast mode. Operations requiring global
knowledge or central control can be readily
implemented on the bus, and typically do not involve
massive data transfer. The bus can also support
specialized operations and services, such as broadcast,
anycast and convergecast, which are important for
common operations such as cache line search and
cache invalidation. BENoC is superior to a classical
NoC in terms of delay and power. In addition to the
BENoC concept, a customized and power/latency
optimized bus termed MetaBus was presented. Our
analysis shows that a BENoC using a MetaBus is more
advantageous than a traditional NoC even for a
relatively small system size of 10-20 modules, and the
advantage becomes very significant as system size
grows. Consequently, BENoC type architectures
introduce a lower risk and a faster migration path for
future CMP and SoC designs.

8. References

[1] B. M. Beckmann and D. A. Wood, "Managing Wire Delay in
Large Chip Multiprocessor Caches", Proc. 37th IEEE/ACM Int.
Symposium on Microarchitecture (MICRO-37), pp. 319-330, 2004
[2] D. Bertozzi and L. Benini, "Xpipes: A Network-on-Chip
Architecture for Gigascale Systems-on-Chip", IEEE Circuits
and Systems Magazine, Vol. 4, Issue 2, pp. 18-31, 2004
[3] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "Cost
Considerations in Network on Chip", Integration - the
VLSI Journal, Vol. 38, pp. 19-42, 2004
[4] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "QNoC:
QoS Architecture and Design Process for Network on Chip",
Journal of Systems Architecture, Vol. 50, pp. 105-128,
2004
[5] E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, and A. Kolodny,
"The Power of Priority: NoC based Distributed Cache
Coherency", Proc. First Int. Symposium on Networks-on-
Chip (NOCS), pp. 117-126, 2007
[6] K. Goossens, J. Dielissen, and A. Radulescu, "AEthereal
Network on Chip: Concepts, Architectures, and
Implementations", IEEE Design and Test of Computers, pp.
414-421, 2005
[7] Y. Jin, E. J. Kim, and K. H. Yum, "A Domain-Specific On-
Chip Network Design for Large Scale Cache Systems", Proc.
13th Int. Symp. on High-Performance Computer
Architecture, pp. 318-327, 2007
[8] C. Kim, D. Burger, and S.W. Keckler, "An Adaptive, Non-
Uniform Cache Structure for Wire-Delay Dominated On-
Chip Caches", 10th International Conference on
Architectural Support for Programming Languages and
Operating Systems, pp. 211-222, 2002
[9] C. Kim, D. Burger, and S.W. Keckler, "Nonuniform Cache
Architectures for Wire Delay Dominated on-Chip Caches",
IEEE Micro, 23:6, pp. 99-107, 2003
[10] N. Muralimanohar and R. Balasubramonian, "Interconnect
design considerations for large NUCA caches", Proc. 34th
annual International Symposium on Computer architecture,
pp. 369-380, 2007
[11] T.D. Richardson, C. Nicopoulos, D. Park, V. Narayanan, Y.
Xie, C. Das, V. Degalahal, "A Hybrid SoC Interconnect with
Dynamic TDMA-Based Transaction-Less Buses and On-
Chip Networks", Proc. 19th International Conference on
VLSI Design, pp. 657-664, 2006
[12] S. Murali, M. Coenen, A. Radulescu, K. Goossens, and
G. De Micheli, "A Methodology for Mapping Multiple
Use-Cases onto Networks on Chips", Proc. Design,
Automation and Test in Europe (DATE) 2006
[13] I. Sutherland, R.F. Sproull, and D. Harris, "Logical Effort:
Designing Fast CMOS Circuits", The Morgan Kaufmann
Series in Computer Architecture and Design, ISBN: 978-1-
55860-557-2
[14] H.B. Bakoglu, Circuits, Interconnections and
Packaging for VLSI, Addison-Wesley, pp. 194-219, 1990
[15] ITRS International Technology Roadmap for
Semiconductors, http://www.itrs.net
[16] OPNET Modeler, http://www.opnet.com
[17] P. S. Magnusson, M. Christensson, J. Eskilson, D.
Forsgren, and G. Hallberg, "Simics: A Full System
Simulation Platform", IEEE Computer, 35(2), pp. 50-58,
2002
[18] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A.
Gupta, "The SPLASH-2 Programs: Characterization
and Methodological Considerations", Proc. 22nd Annual
International Symposium on Computer Architecture,
pp. 24-37, 1995
[19] C. Bienia, S. Kumar, J. P. Singh and K. Li, The
PARSEC Benchmark Suite: Characterization and
Architectural Implications, Proc. of the 17th
International Conference on Parallel Architectures and
Compilation Techniques, 2008
[20] K. K. Ryu, E. Shin, and V. J. Mooney, "A Comparison of
Five Different Multiprocessor SoC Bus Architectures",
Proc. Euromicro Symposium on Digital Systems Design, 2001
[21] AMBA System Architecture,
http://www.arm.com/products/solutions/AMBAHomePa
ge.html
[22] CoreConnect Bus Architecture, http://www-
01.ibm.com/chips/techlib/techlib.nsf/productfamilies/Co
reConnect_Bus_Architecture
