
2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

MACRON: The NoC-based Many-Core Parallel Processing Platform and its Applications in 4G Communication Systems

Xiang Ling, Yiou Chen, Zhiliang Yu, Shihua Chen, Xiaodong Wang, Gui Liang
National Key Laboratory of Science and Technology on Communications
University of Electronic Science and Technology of China
Chengdu, China
{xiangling, chenyiou}@uestc.edu.cn

Abstract—The increasing demand for computation capacity has made many-core parallel processing (MPP) a compelling choice for computation-intensive applications. The network-on-chip (NoC) architecture is an effective way to interconnect dozens of processing cores, but the logic circuits and the actual performance need to be verified on a concrete platform. We propose and implement the MACRON platform, which provides verification for complicated NoC-based applications by closely coordinating a software tool with the hardware devices. In MACRON, a virtual output queue with look-ahead routing is proposed to reduce the transmission delay through the NoC router. Heterogeneous processing elements, namely a vector processor core, a scalar processor core and an accelerator core, are designed, so that the signal processing can be carried out thoroughly in software. A real-time 4G wireless communication system based on NoC is demonstrated on the MACRON platform.

Keywords—networks-on-chip; many-core architecture; FPGA; digital signal processing; wireless communications; vector processor

I. INTRODUCTION

Over the past decades, wireless communication systems have evolved from the 2nd Generation (e.g. GSM) to the 3rd Generation (e.g. CDMA, WiMAX) and on to the 4th Generation (e.g. LTE), while the techniques of the 5th Generation have already been put on the agenda [1]. This evolution has led to the coexistence of a variety of communication standards, and multi-mode operation will become an important trend in the global wireless communications market. In this situation, the traditional hardware-defined radio can no longer satisfy the demand. Hence the so-called software defined radio (SDR) has emerged, in which the signal processing is mostly implemented in software and therefore requires far more computing power. The traditional uni-processor, or even a multi-processor with a few processing cores, cannot meet this demand for massive computation. In addition, since the semiconductor feature size cannot keep shrinking indefinitely, it becomes harder to achieve higher computing performance simply by raising the clock speed. As a result, multi-processor parallel processing, which uses several processors to process data concurrently, becomes an inevitable trend and can greatly boost computing performance. In particular, heterogeneous multi-processors can cooperate to fulfill different types of tasks [2]. In general, a design that organizes several up to a dozen cores, usually connected by a bus, is called multi-core; if dozens or even hundreds of cores are grouped together, we call it many-core. The many-core parallel processing (MPP) architecture is a prominent solution for heavy-load computation in digital signal processing (DSP) [3,4]. In the MPP architecture, the traditional bus-based interconnection suffers from low bandwidth: while one processor holds the bus token, the other dozens of cores are excluded from bus usage, so inter-core communication becomes the bottleneck of the many-core scenario. Therefore, how to organize, connect and schedule a large number of processing cores to meet the computation requirements is an urgent problem in the digital signal processing of wireless communications.

Since the beginning of this century, several research groups have proposed a new integrated-circuit architecture, called network-on-chip (NoC), which applies computer-network techniques to chip design and thoroughly solves the problems of the traditional bus-based SoC architecture [5]. NoC makes full use of the communication mode of distributed computer networks: the bus-based operation is replaced by routing and packet-switching techniques. NoC provides a new design methodology and a promising prospect for MPP applications.

In this paper, we propose a NoC-based architecture for an MPP platform, called MACRON (MAny-Core paRallel On Noc). The main novelties of MACRON are: (1) it combines a large number of processor cores with the NoC architecture to provide flexible solutions; (2) it integrates the hardware platform with a software tool to verify signal processing applications. The rest of the paper is organized as follows. Section II reviews related work. Section III discusses the hardware architecture and its critical circuits. Section IV introduces the software tool. Section V describes the processing elements for parallel processing. Section VI provides numerical analysis. Section VII verifies the NoC-based MPP platform with a 4G Long Term Evolution (LTE) wireless communication system. Finally, Section VIII concludes the paper.

II. RELATED WORK

With the success of the MPP technique in high-performance computers, the idea of MPP has also been adopted in microprocessor design. PC20x from Picochip, Tile64 from Tilera and Fermi from NVIDIA are typical commercial chips that integrate dozens of processing units.

The combination of many-core architecture and NoC infrastructure has been proposed and studied recently. In [6], topology virtualization techniques are proposed for NoC-based many-core processors with core-level redundancy, in order to isolate the hardware changes caused by defective on-chip cores. In [7], the problem of how to tolerate run-time core failures on a homogeneous many-core platform through system reconfiguration for real-time embedded applications is discussed. Zhang et al. [8] address fault tolerance in NoC and present an on-the-field test and configuration infrastructure for a 2D-mesh NoC, which can be used in many generic shared-memory many-core tiled architectures and MPSoCs. Ebrahimi et al. [9] try to maintain the performance of the NoC in the presence of faults by taking advantage of a fully adaptive routing algorithm using one and two virtual channels along the X and Y dimensions. Kumar and Lipari [10] provide an approach to analyze the communication latency of NoCs with wormhole switching and credit-based virtual channel flow control.

Some hardware platforms have been designed and implemented to verify algorithms and circuits in the NoC architecture, in which Field Programmable Gate Arrays (FPGA) are widely used to prototype or emulate the NoC circuits and their performance [11,12]. Wang et al. [13] introduce a fast and flexible FPGA-based NoC simulation architecture, which virtualizes the NoC by mapping its components onto a generic NoC simulation engine composed of a fully connected collection of fundamental components. Abdelfattah and Betz [14] examine the interconnection, area and power consumption of NoC solutions on FPGAs.

Beyond these existing platforms, our MACRON platform coordinates the software tool and the hardware devices closely to provide verification for complicated applications based on the NoC architecture. A virtual output queue (VOQ) with look-ahead routing is proposed to reduce the transmission delay through the NoC router. The processing elements (PE) in MACRON can be homogeneous or heterogeneous: a vector processor core, a scalar processor core and an accelerator core are designed, so that thorough soft signal processing can be executed. Finally, a complicated real-time 4G wireless communication system is demonstrated on the MACRON platform.

III. MACRON HARDWARE IMPLEMENTATION

A. FPGA Array

The MACRON hardware platform is composed of a baseband board and a radio frequency (RF) board. The baseband board consists of four XC7K325T FPGA chips and one CPU chip, as shown in Fig. 1. The processing cores we designed are placed on the four FPGAs and take charge of the baseband signal processing, while the CPU is responsible for handling the high-level protocol stacks. There are three kinds of interconnections among these FPGAs: (i) 4.9 Gbps high-speed serial gigabit transceiver (GTX) channels among the FPGAs; (ii) an sRIO switch (CPS1432) providing a 10 Gbps sRIO channel to each FPGA; (iii) a GE (gigabit Ethernet) switch (BCM5396) connecting every FPGA and providing 1 Gbps GE ports. This abundant interconnection guarantees fast data switching, which satisfies the demand of a large amount of data transmission among processing cores. Moreover, several high-speed ports are brought out to the front panel and the backplane, through which two or more baseband boards can be cascaded to form a larger FPGA array. The NoC architecture and the many-core parallel processing are implemented in this FPGA array.

Fig. 1. The interconnection among FPGAs, front panel and backplane. (For concision, part of the GTX channels between FPGAs on the diagonal are not shown.)

B. NoC Architecture

The NoC architecture mainly consists of processing elements (PE), network interfaces (NI), routers and links. Routers are connected by links to form specific topologies. The function of the NI is to accomplish data transfer between the router and the PE, so that data packets can be transmitted and received by PEs. After a PE finishes its own task, it sends the generated data in packets to the next PE for subsequent processing. The router connected to the local PE via the NI stores the data packets in its buffer and then triggers the forwarding procedure. During the forwarding procedure, the router determines the routing direction, allocates the queuing resources and switches the data to the output port. The data packets are then sent to the next router, and the same routing procedure is repeated until they reach the target PE.

C. Switch and Queuing

(i) Switch architecture

Aiming at the head-of-line (HOL) blocking problem [7] in router design, we take advantage of the idea of the virtual output queue (VOQ) [15] and propose a router architecture in which the HOL blocking problem is addressed and the processing delay is minimized. As shown in Fig. 2, there are four VOQs located at each input port, corresponding to the output ports in the four directions. VOQ(i,j) stores the flits from input port i to output port j. Note that i ≠ j, because a data packet arriving at a port will not be sent back to the same port. When a flit arrives at an input queue, the queue requests the corresponding output port, and the switch allocator (SA) of each output port arbitrates among all the requests it receives. If a request is granted, the flit in that input queue is sent out immediately. Compared with virtual channel (VC) switching, the processing delay of VOQ is decreased, because no virtual channel allocation is needed.
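To make the queuing and arbitration scheme concrete, the following C sketch models one router's input ports, each holding one VOQ per output port, and a per-output-port switch allocator with a rotating (round-robin) grant. It is only a simplified software model written for illustration, not the platform's actual Verilog design; identifiers such as fifo_t, input_port_t and sa_arbitrate are ours.

/* Simplified software model of the VOQ organization and the per-output
 * switch allocator (SA) described above: one VOQ per (input, output)
 * pair and a round-robin grant per output port. */
#include <stdio.h>

#define NPORTS 5            /* East, South, West, North, local PE */
#define VOQ_DEPTH 6         /* buffer length per VOQ (see Table I) */

typedef struct {
    int flits[VOQ_DEPTH];   /* queued flits (payload only, for brevity) */
    int count;
} fifo_t;

typedef struct {            /* one input port: a VOQ per output port */
    fifo_t voq[NPORTS];
} input_port_t;

/* Switch allocator of output port j: scan the VOQs of all input ports
 * that target j and grant one of them with rotating priority. */
static int sa_arbitrate(input_port_t in[NPORTS], int j, int *rr)
{
    for (int k = 0; k < NPORTS; k++) {
        int i = (*rr + k) % NPORTS;
        if (i == j) continue;               /* no packet returns to its own port */
        if (in[i].voq[j].count > 0) {
            *rr = (i + 1) % NPORTS;         /* advance the round-robin pointer */
            return i;                       /* grant input port i */
        }
    }
    return -1;                              /* no request pending for port j */
}

int main(void)
{
    input_port_t in[NPORTS] = {0};
    int rr = 0;
    fifo_t *q = &in[0].voq[2];              /* one flit queued from port 0 towards port 2 */
    q->flits[q->count++] = 42;
    printf("output port 2 granted to input port %d\n", sa_arbitrate(in, 2, &rr));
    return 0;
}

In the hardware router the grant also depends on the look-ahead information and on the credits from the next router, which the sketch omits for brevity.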

Fig. 2. The infrastructure of the VOQ router.

Meanwhile, the HOL problem is resolved, because the data packets arriving at an input port for different output ports are naturally assigned to different queues.

When VOQ is applied with wormhole routing, the flits of the same packet move consecutively in order to improve the utilization of the storage resources, as shown in Fig. 3(a). In our design, look-ahead route calculation is adopted, in which the output ports in the current router and in the next router are considered jointly; more details are discussed below. When flits in different VOQs request the same output port in the current router but different output ports in the next router, the flits are sent to the next router by polling, as shown in Fig. 3(b).

Fig. 3. The packet transmission between VOQ routers. (a) Packets request the same output port in the next router. (b) Packets request different output ports in the next router. (The output ports in the current router and in the next router are considered jointly during arbitration. H: head flit, B: body flit, T: tail flit.)

(ii) Look-ahead Routing

In order to reduce the delay of a packet going through the router, look-ahead routing is designed into our VOQ architecture to decrease the number of pipeline stages compared with the VC architecture. In the VC switching structure, the route decision unit of the present router determines the output port (East, South, West, North or PE). Suppose the source and the destination of a data transmission are node 31 (in row 3 and column 1) and node 36 (in row 3 and column 6). In the VC switching structure, when the head flit arrives at node 33, its route decision unit calculates the output port (East in this case) and sends it towards node 34. In the VOQ switching structure, however, the queue VOQ(i,j) of each input port i is bound to a particular output port j. When the data packet arrives at node 33, the output-port marker for node 34 is needed for arbitration, and it has already been determined at node 32 by the look-ahead routing calculation. Therefore, look-ahead routing shortens the processing pipeline from five stages to four at a small cost in complexity.

(iii) Switch Allocator

According to the above analysis, we propose a switch allocator (SA) structure based on wormhole routing. Each output port has its own SA, as shown in Fig. 2. The requests for output port j, together with the look-ahead information, are delivered to SA j by the look-ahead route decision units, and credits from the next router are fed back into SA j to indicate the empty/full states of the VOQs of the next router. The SA outputs the grant signals that control the crossbar. Fig. 4 shows the infrastructure of an SA. The requests with the same output port in the current router and the same output port in the next router enter a 1st-stage arbiter. When one of the requests is granted in the 1st-stage arbiter, the grant remains valid; when the tail flit of the data packet leaves the router, a corresponding signal triggers the next round of arbitration, so flits from different data packets will not overlap at the output port of the next node. After the 1st-stage arbitration, the packets enter the 2nd-stage arbitration, in which data packets from different input ports requesting the same output port of the current router share that output port and are sent out alternately. The final grants are sent to the crossbar after this two-stage arbitration.

Fig. 4. The infrastructure of the switch allocator (SA).
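The look-ahead computation itself is simple: while a flit sits in the current router, the output port it will need at the next router is pre-computed from the destination coordinates. The C sketch below illustrates this for XY dimension-order routing on an 8×8 mesh; the coordinate and port encodings are our own assumptions for illustration, not the exact encoding used in the Verilog design.

/* Look-ahead XY dimension-order routing on an 8x8 mesh: given the node
 * a flit is about to enter and its destination, compute the port that
 * node will use, one hop in advance. */
#include <stdio.h>

enum port { PORT_EAST, PORT_SOUTH, PORT_WEST, PORT_NORTH, PORT_LOCAL };

/* XY routing: correct the column (X) first, then the row (Y). */
static enum port xy_route(int row, int col, int dst_row, int dst_col)
{
    if (col < dst_col) return PORT_EAST;
    if (col > dst_col) return PORT_WEST;
    if (row < dst_row) return PORT_SOUTH;   /* assumed: row index grows southwards */
    if (row > dst_row) return PORT_NORTH;
    return PORT_LOCAL;                      /* arrived: eject to the PE */
}

/* Look-ahead: while the flit is still in the current router, compute
 * the port it will request at the *next* router, so that the next
 * router can skip its own route-computation stage. */
static enum port lookahead(int cur_row, int cur_col, int dst_row, int dst_col)
{
    enum port p = xy_route(cur_row, cur_col, dst_row, dst_col);
    int nr = cur_row, nc = cur_col;
    switch (p) {
    case PORT_EAST:  nc++; break;
    case PORT_WEST:  nc--; break;
    case PORT_SOUTH: nr++; break;
    case PORT_NORTH: nr--; break;
    default: break;
    }
    return xy_route(nr, nc, dst_row, dst_col);
}

int main(void)
{
    /* Example from the text: source node 31 (row 3, col 1), destination
     * node 36 (row 3, col 6); at node 33 the port for node 34 is East. */
    printf("port pre-computed for node 34: %d (East = %d)\n",
           lookahead(3, 3, 3, 6), PORT_EAST);
    return 0;
}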

D. Network Interface

The network interface (NI) plays the role of a bridge between a PE and its corresponding router. When a PE completes its computation task, it sends the packet from the local processor to the router via the NI. The NI is implemented on the basis of the processor local bus (PLB). Since a large gap exists between the clock frequencies of the processor and the underlying router, the NI has to perform speed matching when transmitting data between the PE and the router.
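As a software analogy of this speed matching, the sketch below shows a bounded ring buffer that decouples a fast producer (the PE side) from a slower consumer (the router side), with back-pressure when the buffer is full. The real NI is a hardware block on the PLB with clock-domain-crossing logic; the C code and its names (ni_fifo_t, ni_push, ni_pop) are only an illustrative assumption about the buffering idea.

/* Bounded ring buffer illustrating rate decoupling between two sides
 * running at different speeds. */
#include <stdbool.h>
#include <stdio.h>

#define NI_DEPTH 16

typedef struct {
    unsigned data[NI_DEPTH];
    int head, tail, count;
} ni_fifo_t;

static bool ni_push(ni_fifo_t *f, unsigned word)
{
    if (f->count == NI_DEPTH) return false;    /* back-pressure the producer */
    f->data[f->tail] = word;
    f->tail = (f->tail + 1) % NI_DEPTH;
    f->count++;
    return true;
}

static bool ni_pop(ni_fifo_t *f, unsigned *word)
{
    if (f->count == 0) return false;           /* nothing to forward */
    *word = f->data[f->head];
    f->head = (f->head + 1) % NI_DEPTH;
    f->count--;
    return true;
}

int main(void)
{
    ni_fifo_t f = {0};
    for (unsigned w = 0; w < 20; w++)
        if (!ni_push(&f, w)) printf("word %u stalled (buffer full)\n", w);
    unsigned w;
    while (ni_pop(&f, &w)) printf("forwarded %u\n", w);
    return 0;
}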
IV. SOFTWARE TOOL

In an MPP NoC, dozens of PEs can cooperate on a dedicated application, in which different tasks are mapped to different PEs. If an unsuitable mapping solution assigns tasks to PEs with large transmission delays or severe power consumption, the potential of the NoC is restrained. A software tool is therefore developed in our MACRON platform to provide a recommended mapping solution.

A. Estimation Flow

This electronic design automation (EDA) tool is simulation and estimation software developed using the Microsoft Foundation Classes (MFC). With the aid of the basic module library and the user-defined module library, the software offers features such as delay and power consumption statistics, user-defined modules and visual simulation, and finally gives the estimation result of the NoC mapping and scheduling. Fig. 5 shows the tool windows.

The operation procedure of this EDA tool is as follows. (i) Create a new project. The user creates a new project through the menu bar or tool bar, which automatically generates a Simulink project folder in a specified directory, including the schematic file, the configuration file and the relevant code of the user-defined modules. (ii) Connect the processing modules. The ports of the modules can be connected manually, or a script can be generated to connect the modules automatically. In addition, the tool can configure the communication bandwidth to set different constraints. (iii) Select the mapping algorithm. The tool provides a library of common mapping algorithms, and the user can choose one algorithm to map the parallel or pipelined tasks onto the NoC PEs. (iv) Assess the performance of the NoC under the specific mapping solution. The tool evaluates NoC performance metrics such as delay and power consumption, and the graphical interface intuitively shows the efficiency of the optimized mapping solution produced by the built-in analysis.

Fig. 5. The graphical windows of the EDA tool. (a) The task flow graph of a 4G communication transmitter; the tasks and the transmission loads among tasks are extracted from Simulink automatically. (b) A mapping solution given by the EDA tool; note that the bandwidths of the links are not identical, since the intra-FPGA and inter-FPGA transmission channels have different data rates.

B. Mapping Methods

The mapping algorithm library in this software tool mainly includes an ant colony algorithm, a genetic algorithm and other user-defined algorithms. The parameters of the algorithms can be adjusted by the user; for example, if the ant colony mapping algorithm is chosen, the user can set the colony size and the number of iterations. The optimization goal can be chosen as minimizing the delay, the power consumption or the bandwidth, so that the algorithm can provide different mapping solutions for the users' different requirements. User-defined algorithms are a scalable option and can be added to the software in a prescribed way.
V. PROCESSING CORES

To support state-of-the-art software radio, or soft baseband processing, the system gains great flexibility when processors take on most of the signal processing and computation. We designed three types of PE for the MACRON platform: a vector processor core, a scalar processor core and an accelerator. The vector processor core is dedicated to repetitive, compute-intensive signal processing, e.g. matrix computations. The scalar processor core is designed for control and bit-level processing. The accelerator consists of circuits dedicated to specific functions, e.g. channel decoding.

A. Vector Processor Core

The processing speed of the vector processor is greatly improved because the Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) techniques are applied, so that eight instructions are executed in parallel and sixteen data are computed simultaneously in every instruction slot, as shown in Fig. 6.

The vector processor is mainly used for channel estimation and equalization in our LTE system demonstration, so the instruction set is optimized specifically for baseband signal processing.

The vector processing has five pipeline stages: PF (pre-fetch), FE (fetch), DC (decode), EXE1 (execute) and EXE2 (only for memory access operations that take two cycles).

By analyzing the channel estimation and equalization algorithms, we group the vector processor instructions into four categories: arithmetic logic unit (ALU) instructions, multiplication and addition (MAC) instructions, memory access instructions and control instructions. A VLIW packs eight instructions into one word, where each instruction is encoded in 24 bits. The eight instructions are assigned to the different decoding units and executing units. At the decode stage, an ALU instruction is dispatched to instruction slot 1 or slot 2, a MAC instruction to slot 3 or slot 4, a memory access instruction to slot 5, slot 6 or slot 7, and the control instruction to slot 8.

After analyzing the resource usage, we designed sixteen scalar registers and eight vector registers to constitute the register file. The scalar registers are 16 bits wide and the vector registers are 256 bits wide, so the processor is able to process sixteen data simultaneously in one instruction slot. The vector processor can access one data memory, one program memory and sixteen coefficient memories; the program memory and the coefficient memories are read-only for the vector processor core.

Fig. 6. The block diagram of the vector processor core.
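The slot discipline described above can be illustrated with a small C model: eight 24-bit sub-instructions per VLIW word, with ALU operations restricted to slots 1-2, MAC operations to slots 3-4, memory operations to slots 5-7 and the control operation to slot 8. The bit-level encoding of the real instruction word is not specified in this paper, so the instruction class is carried explicitly here and all type names are ours.

/* Illustrative check that each sub-instruction of a VLIW word occupies
 * a slot its class allows (slots are numbered 1..8, indexed 0..7). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum iclass { I_ALU, I_MAC, I_MEM, I_CTRL, I_NOP };

typedef struct {
    uint32_t encoding;          /* 24 significant bits per sub-instruction */
    enum iclass cls;
} subinsn_t;

typedef struct { subinsn_t slot[8]; } vliw_word_t;

static bool vliw_slots_legal(const vliw_word_t *w)
{
    for (int s = 0; s < 8; s++) {
        enum iclass c = w->slot[s].cls;
        bool ok = (c == I_NOP)
               || (c == I_ALU  && s <= 1)             /* slots 1-2 */
               || (c == I_MAC  && s >= 2 && s <= 3)   /* slots 3-4 */
               || (c == I_MEM  && s >= 4 && s <= 6)   /* slots 5-7 */
               || (c == I_CTRL && s == 7);            /* slot 8    */
        if (!ok) return false;
    }
    return true;
}

int main(void)
{
    vliw_word_t w = { .slot = {
        {0x000001, I_ALU}, {0x000002, I_ALU}, {0x000003, I_MAC},
        {0x000004, I_MAC}, {0x000005, I_MEM}, {0x000006, I_MEM},
        {0x000007, I_MEM}, {0x000008, I_CTRL} } };
    printf("VLIW word legal: %s\n", vliw_slots_legal(&w) ? "yes" : "no");
    return 0;
}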
B. Scalar Processor Core

The scalar processor is designed on the basis of the reduced instruction set computer (RISC) architecture, with special instructions dedicated to communication signal processing. From the application perspective, the scalar processor is used for light-load or non-vector computation, such as control operations and bit-level operations. The instruction set of the scalar processor is therefore based on the RISC instruction set, and some special instructions are provided to support particular application functions, such as division and convolution, which are common in the instruction sets of DSP processors.

The RISC processor has six pipeline stages in order to improve the processing speed. The control instructions lead to a complicated program-counter finite state machine (FSM) and a complicated pipeline, and the multi-stage pipeline can cause control hazards or data hazards. A control hazard arises when several instructions have already been pushed into the pipeline before a branch instruction is executed; the compiler inserts NOOP instructions to solve this problem. For data hazards, where a register is read or written incorrectly because the register access is interfered with by another pipeline stage, bypass circuits and NOOP instruction insertion are adopted in our implementation.
C. Accelerator

In baseband signal processing, some tasks involve heavy-load computation under strict real-time constraints, such as Turbo decoding within every 0.5 ms slot. These extremely heavy-load tasks cannot be completed by a processor core in real time, so hardware accelerators are also used as one kind of PE in some applications.

Turbo codes are widely used in modern communication systems, so a parallel Turbo decoding circuit is implemented to raise the processing throughput [18]. Turbo decoding works iteratively: two soft-input soft-output (SISO) decoders exchange extrinsic information to generate soft outputs in each iteration. Popular algorithms for SISO decoding are the BCJR and forward-backward algorithms [16], such as the maximum a posteriori (MAP) algorithm. In a Turbo codec, the interleaver scrambles the data in a pseudo-random order to minimize the correlation of neighboring bits at the inputs of the convolutional encoders. The quadratic permutation polynomial (QPP) interleaver guarantees the desirable contention-free property for parallel memory access and has been adopted for the LTE turbo code. In our implementation, each input frame is divided into N sub-blocks and each sub-block is processed by an L-BCJR decoder with adequate initialization.
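The QPP interleaver used by the LTE turbo code has the form pi(i) = (f1*i + f2*i^2) mod K, where the pair (f1, f2) depends on the block size K and is tabulated in the LTE specification (3GPP TS 36.212). The sketch below computes the permutation and checks that it is a bijection; the (f1, f2) pair used is only an example that satisfies the QPP permutation conditions for K = 40, not a quote of the standard's table, and the function names are ours.

/* QPP interleaver pi(i) = (f1*i + f2*i^2) mod K and a bijection check. */
#include <stdio.h>
#include <string.h>

static unsigned qpp(unsigned i, unsigned f1, unsigned f2, unsigned K)
{
    /* 64-bit intermediate so f2*i*i cannot overflow for LTE block sizes. */
    unsigned long long x = (unsigned long long)f1 * i
                         + (unsigned long long)f2 * i * i;
    return (unsigned)(x % K);
}

int main(void)
{
    const unsigned K = 40, f1 = 3, f2 = 10;   /* example pair valid for K = 40 */
    int seen[40];
    memset(seen, 0, sizeof seen);
    for (unsigned i = 0; i < K; i++)
        seen[qpp(i, f1, f2, K)] = 1;

    int bijective = 1;
    for (unsigned i = 0; i < K; i++)
        if (!seen[i]) bijective = 0;
    printf("QPP mapping for K=%u is %s\n",
           K, bijective ? "a permutation" : "not a permutation");
    return 0;
}

In the parallel decoder, this algebraic form is what allows each of the N sub-block decoders to compute its interleaved addresses locally without memory-access contention.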
VI. NUMERICAL ANALYSIS

A. VOQ Switch Performance

We use the Verilog hardware description language (HDL) to implement both the VC switching structure and the VOQ switching structure and compare their performance in the NoC. The relevant experiment parameters are listed in Table I. In the experiment, a global clock is used to measure the transmission delay of all packets.

The average packet delay is computed as

\mathrm{Latency} = \frac{1}{P_{\mathrm{all}}} \sum_{i=1}^{P_{\mathrm{all}}} \left( T_e(i) - T_s(i) \right)    (1)

where P_all is the total number of data packets, T_s(i), i = 1, 2, ..., P_all, is the moment when the head flit of packet i is injected into the on-chip network, and T_e(i), i = 1, 2, ..., P_all, is the moment when the head flit is sunk by the destination PE.
In the measurement, every PE injects 100,000 data packets into the on-chip network. The injection rate R_inj and the throughput R_thr are calculated as

R_{\mathrm{inj}} = \frac{\sum_{j=1}^{N} \sum_{m=1}^{S_j} L_{jm}}{N \, T_{\mathrm{clk}}}    (2)

R_{\mathrm{thr}} = \frac{\sum_{j=1}^{N} \sum_{n=1}^{R_j} L_{jn}}{N \, T_{\mathrm{clk}}}    (3)

where T_clk is the clock period, N is the number of nodes, S_j and R_j are the numbers of packets injected and sunk at node j respectively, L_jm is the length of the m-th packet injected by node j, and L_jn is the length of the n-th packet sunk by node j.

Fig. 7. The processing delays of the VC and VOQ routers under the Bit Complement traffic pattern.
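As a worked example of Eqs. (1)-(3), the C sketch below evaluates the same quantities from a per-packet trace. The record layout and the normalisation parameters are assumptions made for this sketch; on the platform the corresponding counters are gathered in hardware against the global clock.

/* Evaluate average latency (Eq. 1) and an injection/throughput rate
 * (Eqs. 2-3, flits per node per unit of measurement time) from a trace. */
#include <stdio.h>

typedef struct {
    long t_inject, t_eject;   /* head-flit injection / ejection times */
    int  length;              /* packet length in flits               */
} pkt_record_t;

static double avg_latency(const pkt_record_t *p, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += (double)(p[i].t_eject - p[i].t_inject);
    return sum / n;
}

static double rate(const pkt_record_t *p, int n, int n_nodes, double duration)
{
    long flits = 0;
    for (int i = 0; i < n; i++)
        flits += p[i].length;
    return (double)flits / (n_nodes * duration);   /* mirrors the N*Tclk denominator */
}

int main(void)
{
    pkt_record_t trace[] = { {10, 35, 10}, {12, 44, 10}, {20, 49, 10} };
    printf("avg latency = %.1f cycles, rate = %.5f flits/node/cycle\n",
           avg_latency(trace, 3), rate(trace, 3, 64, 1000.0));
    return 0;
}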
We compare the delays of the two switching structures under the Bit Complement traffic pattern [17], as shown in Fig. 7. We found that when the injection rate is unsaturated, VOQ switching achieves an 11% delay improvement over VC switching, because one cycle is saved for each packet passing through the VOQ switching structure. When the injection rate goes beyond about 0.2 for VOQ and 0.22 for VC, the delays start to deteriorate significantly due to the lack of buffer resources. This means that, under XY dimension-order routing, extremely heavy traffic makes the VOQ switching suffer more performance degradation than the VC switching. With XY dimension-order routing, a packet entering the router from the southern or northern port never leaves through the eastern or western port. Thus VOQ(2,1), VOQ(2,3), VOQ(4,1) and VOQ(4,3) are under-utilized, i.e. 20% of the buffer resources in the VOQ switching structure are unused, which causes VOQ to block under extremely heavy traffic. In other words, the queuing resources of VOQ could be reduced further when XY dimension-order routing is adopted. According to our statistics, the throughputs of VC and VOQ are similar, because the factors that affect throughput, such as the number of active links and the average hop count across the network, are the same in the two experimental scenarios.
B. Speedup Ratio

In order to observe the speedup ratios of parallel processing on multiple cores, we tested Fast Fourier Transform (FFT) computations for different numbers of points. To guarantee the accuracy of the measurement, each processing delay is averaged over 10 measurements.

First, we obtain the processing times of FFTs of different sizes on a single core, dual cores and quad cores, as shown in Table II. The 1536-point FFT is tested only on quad cores because of the particularity of its parallelization.

For FFT sizes that are a power of two (2^M), dual cores achieve a speedup ratio of up to 1.7 and quad cores bring the speedup ratio up to 3.5. For the 1536-point FFT, whose number of points is not a power of two, the parallel processing procedure is quite different. The steps of the 1536-point Discrete Fourier Transform (DFT) parallelization are: (i) divide the input data into three 512-point sequences and perform conventional 512-point FFT computations; (ii) multiply the second and third transformed sequences by the common twiddle factors; (iii) sum the values of the three sequences to obtain the final result. Due to the increased data exchange among cores in this non-power-of-two DFT, the efficiency of the quad cores decreases, but they still provide a speedup ratio of 2.4. If more than a dozen cores are used for the parallel FFT computation, the system can operate at an even higher processing speed; however, the data traffic among the cooperating cores incurs communication latency, which decreases the processing efficiency when too many cores are used.
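Steps (i)-(iii) above correspond to the standard three-way decimation of a length-1536 DFT: X[k] = X0[k mod 512] + W^k X1[k mod 512] + W^{2k} X2[k mod 512] with W = e^{-j2*pi/1536} and x_r[m] = x[3m + r]. The following C sketch is a reference model of that combination, written only to make the arithmetic explicit; a naive O(N^2) DFT stands in for the 512-point FFT that actually runs on the cores, and the test signal is arbitrary.

/* Reference model of the 1536 = 3 x 512 DFT split described above. */
#include <complex.h>
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N  1536
#define N3 512

static void naive_dft(const double complex *in, double complex *out, int n)
{
    for (int k = 0; k < n; k++) {
        double complex acc = 0;
        for (int m = 0; m < n; m++)
            acc += in[m] * cexp(-I * 2.0 * M_PI * k * m / n);
        out[k] = acc;
    }
}

int main(void)
{
    static double complex x[N], sub[3][N3], X[3][N3], out[N];

    for (int n = 0; n < N; n++)                     /* arbitrary test input */
        x[n] = cos(0.01 * n) + I * sin(0.02 * n);

    for (int r = 0; r < 3; r++)                     /* step (i): decimate    */
        for (int m = 0; m < N3; m++)
            sub[r][m] = x[3 * m + r];
    for (int r = 0; r < 3; r++)                     /* three 512-point DFTs  */
        naive_dft(sub[r], X[r], N3);

    for (int k = 0; k < N; k++) {                   /* steps (ii)-(iii)      */
        double complex w1 = cexp(-I * 2.0 * M_PI * k / N);
        out[k] = X[0][k % N3] + w1 * X[1][k % N3] + w1 * w1 * X[2][k % N3];
    }

    /* Spot-check one bin against the direct 1536-point DFT. */
    double complex ref = 0;
    for (int n = 0; n < N; n++)
        ref += x[n] * cexp(-I * 2.0 * M_PI * 5.0 * n / N);
    printf("bin 5: split = %.3f%+.3fi, direct = %.3f%+.3fi\n",
           creal(out[5]), cimag(out[5]), creal(ref), cimag(ref));
    return 0;
}

On the platform, the three 512-point transforms run on different cores, and the twiddle-and-sum combination is the step that generates the extra inter-core traffic mentioned above.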
TABLE I. EXPERIMENT PARAMETERS FOR THE SWITCHING SCHEMES

Network topology: 2D 8×8 mesh
Routing algorithm: dimension-order routing
Traffic pattern: Bit Complement
Number of ports per router: 5
Number of VCs/VOQs per port: 4
Buffer length per VC/VOQ: 6
Flit size (bits): 32
Packet length (flits): 10

TABLE II. PROCESSING TIMES AND SPEEDUP RATIOS ON MULTIPLE CORES

Points of FFT:           128     256     512     1024    1536    2048
Single core (cycles)     2889    6634    15493   35377   46447   79542
Dual cores (cycles)      1768    3966    9332    21601   --      48090
  speedup ratio          1.6     1.7     1.7     1.6     --      1.7
Quad cores (cycles)      819     1870    4380    10008   19252   22709
  speedup ratio          3.5     3.5     3.5     3.5     2.4     3.5

C. Vector Processing Utilization

In order to save area, the instruction slots of the VLIW vector processor are not all-rounders: some slots are dedicated to ALU and MAC operations, and others to memory access and control. During vector processing, the vector processor is fully utilized only if all instruction slots issue instructions. In practice, only some of the instruction slots are active in each cycle while the others are idle, so the utilization of the instruction slots is critical to the vector processing efficiency.

Table III summarizes the utilization of the four types of vector instruction slots in typical signal processing tasks. We found that the Finite Impulse Response (FIR) filter is computation-intensive: the MAC slots are active in 97.0% of the cycles during the FIR computation.

TABLE III. VECTOR SIGNAL PROCESSING SLOT UTILIZATION

Vector signal processing                ALU     MAC     Memory access   Control (scalar)
2048-point FFT (complex numbers)        60.6%   39.4%   99.4%           0.3%
48-tap FIR filter (real numbers)        96.8%   97.0%   99.2%           0.3%
Channel estimation and equalization     28.3%   34.8%   66.1%           4.4%
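For clarity, the utilization figures of Table III can be understood as the fraction of cycles in which at least one slot of a given instruction class issues a real (non-NOOP) operation. The C sketch below computes such percentages from a per-cycle trace; the trace format and the names are assumptions made purely for illustration, not the instrumentation actually used on the platform.

/* Per-class slot utilization from an execution trace. */
#include <stdbool.h>
#include <stdio.h>

enum { C_ALU, C_MAC, C_MEM, C_CTRL, N_CLASSES };

typedef struct { bool active[N_CLASSES]; } cycle_t;   /* one issued VLIW word */

static void utilization(const cycle_t *trace, int n_cycles, double util[N_CLASSES])
{
    int busy[N_CLASSES] = {0};
    for (int c = 0; c < n_cycles; c++)
        for (int k = 0; k < N_CLASSES; k++)
            if (trace[c].active[k]) busy[k]++;
    for (int k = 0; k < N_CLASSES; k++)
        util[k] = 100.0 * busy[k] / n_cycles;
}

int main(void)
{
    cycle_t trace[4] = {
        {{true,  true,  true,  false}},
        {{false, true,  true,  false}},
        {{true,  false, true,  false}},
        {{false, false, true,  true }} };
    double u[N_CLASSES];
    utilization(trace, 4, u);
    printf("ALU %.0f%%  MAC %.0f%%  MEM %.0f%%  CTRL %.0f%%\n",
           u[C_ALU], u[C_MAC], u[C_MEM], u[C_CTRL]);
    return 0;
}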
VII. DEMONSTRATION

The many-core platform is a flexible way for wireless communication systems to support multiple standards, such as GSM, WCDMA, LTE and LTE-A, by simply downloading the instruction code without changing the circuits. We implement a 4×4 mesh NoC on the MACRON platform, divide a complicated 4G system into several tasks and assign them to different PEs for parallel and pipelined processing. By carrying out the baseband signal processing in parallel, an LTE communication system is realized for demonstration.

A. LTE Demo

We take a 2×2 Multiple-Input Multiple-Output (MIMO) wireless communication system as the target application and implement both the transmitter and the receiver. The main system parameters are listed in Table IV.

TABLE IV. LTE SYSTEM PARAMETERS

Antennas: 2 Tx / 2 Rx
Bandwidth: 5 MHz
FFT points: 2048
Turbo code rate: 1/3, 1/2, 2/3
Modulations: QPSK, 16QAM
Channel estimation method: least squares
Equalization method: zero forcing

The task mapping is first done in the MACRON EDA tool to provide an injective mapping solution. The LTE baseband processing tasks include Cyclic Redundancy Check (CRC), the Turbo codec, modulation, Orthogonal Frequency Division Multiplexing (OFDM), synchronization, channel estimation and equalization, etc. Since OFDM, synchronization, and channel estimation and equalization are compute-intensive, each of these three tasks is assigned two PEs for parallel processing, while each of the remaining tasks is processed by one PE. Fig. 8 shows the task partition of the LTE baseband processing; the numbers next to the modules identify the positions onto which they are mapped.

Fig. 8. The baseband task partition of the LTE communication system.

As shown in Fig. 9, the LTE transceiver is accommodated in the MACRON hardware platform. The modules of the transmitter are implemented in FPGA_C. The baseband signals from the OFDM module are sent to FPGA_D via GTX and delivered to the RF board through the HSIQ interface for wireless propagation. The receiver modules are accommodated in the other three FPGAs. When the RF board receives signals from the mobile terminal, it sends the sampled data to the baseband board through the HSIQ interface after analog-to-digital conversion. After a series of baseband signal processing steps, the voice or video stream is restored and sent to the user via an Ethernet interface.

Fig. 9. LTE baseband modules mapped onto the MACRON platform. (The yellow blocks are vector processor cores, the grey ones are scalar processor cores, the blue one is the accelerator, and the dashed blocks are idle.)

B. NoC State Monitoring

We can monitor the state of the NoC in real time by collecting the states of all the router buffers; the state of the LTE system is also monitored. The collected state information consists of the busy/idle states of the PEs and links, the MIMO channel gains, the signal-to-noise ratio, the constellation diagram and the frame error rate. This information is sent to a computer via an SFP port on the baseband board, a data acquisition board (ML605) and the PCIe interface of the computer in turn. Finally, the states of the NoC and the LTE system are displayed on the computer.

The busy/idle statistics of the PEs are shown in Fig. 10(a); a higher bar means a busier PE. Comparing with the mapping in Fig. 9, the bars of nodes 23 and 24 are the highest because the OFDM modules are executed in these two PEs, where the FFT computations lead to hotspots. The busy/idle statistics of the links are shown in Fig. 10(b); a thicker arrow means that more data are transmitted on that link. The data flows from node 13 to 23 and from node 14 to 24 are the busiest, since a large number of packets are transmitted from the synchronization modules to the OFDM modules.

Fig. 10. The states of the PEs and links on MACRON are monitored and displayed. (a) The visual states of the PEs. (b) The visual states of the links.

VIII. CONCLUSIONS

We have presented a NoC-based platform, MACRON, for many-core parallel processing verification. By combining a software tool with the hardware devices, MACRON provides estimation and verification for complicated applications. We implemented an LTE wireless communication system on MACRON, which shows that the proposed VOQ architecture with look-ahead routing is competent to reduce the transmission delay, and that the designed processor cores are effective for vector and scalar processing.

REFERENCES

[1] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, and J. C. Zhang, "What will 5G be?", IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1065-1082, 2014.
[2] O. Rosen and A. Medvedev, "Efficient parallel implementation of state estimation algorithms on multicore platforms", IEEE Transactions on Control Systems Technology, vol. 21, no. 1, pp. 107-120, 2013.
[3] B. D. de Dinechin, D. van Amstel, M. Poulhies, and G. Lager, "Time-critical computing on a single-chip massively parallel processor", Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014, pp. 1-6.
[4] D. Chen, L. Wang, G. Ouyang, and X. Li, "Massively parallel neural signal processing on a many-core platform", Computing in Science & Engineering, vol. 13, no. 6, pp. 42-51, 2011.
[5] R. Marculescu, U. Y. Ogras, L.-S. Peh, N. E. Jerger, and Y. Hoskote, "Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 1, pp. 3-21, 2009.
[6] L. Zhang, Y. Yu, J. Dong, Y. Han, S. Ren, and X. Li, "Performance-asymmetry-aware topology virtualization for defect-tolerant NoC-based many-core processors", Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010, pp. 1566-1571.
[7] Z. Li, S. Li, X. Hua, H. Wu, and S. Ren, "Run-time reconfiguration to tolerate core failures for real-time embedded applications on NoC manycore platforms", IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC/EUC), 2013, pp. 1990-1997.
[8] Z. Zhang, D. Refauvelet, A. Greiner, M. Benabdenbi, and F. Pecheux, "On-the-field test and configuration infrastructure for 2-D-mesh NoCs in shared-memory many-core architectures", IEEE Transactions on Very Large Scale Integration Systems, vol. 22, no. 6, pp. 1364-1376, 2014.
[9] M. Ebrahimi, M. Daneshtalab, and J. Plosila, "High performance fault-tolerant routing algorithm for NoC-based many-core systems", 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2013, pp. 462-469.
[10] S. Kumar and G. Lipari, "Latency analysis of network-on-chip based many-core processors", 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2014, pp. 432-439.
[11] F. Robino and J. Oberg, "From Simulink to NoC-based MPSoC on FPGA", Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014.
[12] S. N. Shelke and P. B. Patil, "Low-latency, low-area overhead and high throughput NoC architecture for FPGA based computing system", International Conference on Electronic Systems, Signal Processing and Computing Technologies, 2014.
[13] D. Wang, C. Lo, J. Vasiljevic, N. Enright Jerger, and J. G. Steffan, "DART: a programmable architecture for NoC simulation on FPGAs", IEEE Transactions on Computers, vol. 63, no. 3, pp. 664-678, Mar. 2014.
[14] M. S. Abdelfattah and V. Betz, "The case for embedded networks on chip on field-programmable gate arrays", IEEE Micro, vol. 34, no. 1, pp. 80-89, 2014.
[15] W. J. Dally, "Virtual-channel flow control", 17th Annual International Symposium on Computer Architecture, 1990, pp. 60-68.
[16] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate", IEEE Transactions on Information Theory, vol. IT-20, no. 2, pp. 284-287, Mar. 1974.
[17] J. Chen, P. Gillard, and C. Li, "Performance evaluation of three network-on-chip (NoC) architectures", 1st IEEE International Conference on Communications in China (ICCC), 2012, pp. 91-96.
[18] X. Ran, X. Ling, and B. Liu, "The design of a multi-core-based parallel Turbo-decoder for LTE-A", IEEE International Conference on Information Theory and Information Security (ICITIS), 2011.

