
J Sign Process Syst (2011) 64:75–92 DOI 10.


Exploration of Soft-Output MIMO Detector Implementations on Massive Parallel Processors

Robert Fasthuber · Min Li · David Novo · Praveen Raghavan · Liesbet Van Der Perre · Francky Catthoor

Received: 13 November 2009 / Revised: 11 May 2010 / Accepted: 12 May 2010 / Published online: 8 June 2010 © Springer Science+Business Media, LLC 2010

Abstract Emerging Software Defined Radio (SDR) baseband platforms are based on multiple processors with massive parallelism. Although the computational power of these platforms would theoretically enable SDR solutions with advanced wireless signal processing, existing work still implements rather basic algorithms. For instance, current Multiple-Input Multiple-Output (MIMO) detector implementations are typically based on simple linear hard-output detection and not on advanced near-Maximum Likelihood (ML) soft-output detection. However, only the latter makes it possible to exploit the full potential of MIMO technology. In this work, we explore the feasibility of advanced soft-output near-ML MIMO detectors on massive parallel processors. Although such detectors are considered to be very challenging due to their high computational complexity, we combine architecture-friendly algorithm design, application-specific instructions and instruction-level/data-level parallelism explorations to make SDR solutions feasible. We show that, by applying the proposed combination of techniques, it is possible to obtain SDR implementations that deliver data rates sufficient for future wireless systems. For example, a 2 × 4 Coarse Grain Array (CGA) processor with 16-way Single Instruction Multiple Data (SIMD) can deliver 192/368 Mbps throughput for 2 × 2 64/16-QAM transmissions. Finally, we estimate the area and power consumption of the programmable solution and compare it against a traditional Application Specific Integrated Circuit (ASIC) approach. This enables us to draw conclusions from the cost perspective.

Keywords MIMO · SDR · SSFE · LLR · CGA · ASIC

1 Introduction
R. Fasthuber (✉) · M. Li · D. Novo · P. Raghavan · L. Van Der Perre · F. Catthoor
IMEC, Kapeldreef 75, 3001 Leuven, Belgium
e-mail: robert.fasthuber@imec.be
M. Li e-mail: limin@imec.be
D. Novo e-mail: novo@imec.be
P. Raghavan e-mail: ragha@imec.be
L. Van Der Perre e-mail: vdperre@imec.be
F. Catthoor e-mail: catthoor@imec.be

With the exploding design and processing cost in the deep sub-micron era, programmable or reconfigurable baseband solutions are becoming popular. The Software Defined Radio (SDR) paradigm, which was mainly successful in the base-station and military segments, is emerging in the handset market. Parallel instruction set architectures, especially those that combine Instruction Level Parallel (ILP) and Data Level Parallel (DLP) features [4, 21, 23, 26, 29], are becoming prevalent. Most of these published architectures offer massive parallelism, i.e. they include multiple independent computational processing units and offer a data parallelism of 100. For instance, the NXP EVP processor includes ten Functional Units



(FUs) and six of them support 16-way Single Instruction Multiple Data (SIMD) [29]. The SODA processor includes four Processing Elements (PEs), each supporting 32-way SIMD instructions [21]. Theoretically, these massive parallel processors would enable SDR implementations of advanced wireless signal processing algorithms. However, only simple SDR systems and algorithms have been demonstrated and reported in the literature.

Multiple-Input Multiple-Output (MIMO) technology offers increased spectral efficiency compared to single-antenna systems. For this reason, it has become the basis of all upcoming wireless communication standards, such as IEEE 802.11n, WiMAX, 3GPP LTE and 3GPP2 UMB. Supporting advanced MIMO technology is therefore a necessity for future SDR systems. However, the implementations in [21, 23, 29] do not support MIMO technology. The references [4, 31] demonstrate MIMO processing, but based on simple linear detection, which does not fully exploit the potential of MIMO technology [13]. The implementation of MIMO processing on a Sandblaster processor in [16] does not include the computationally dominant soft-output computation. Wu et al. [33] demonstrate advanced MIMO processing on a floating-point Graphics Processing Unit (GPU). However, the energy efficiency of such a solution is typically insufficient for wireless devices.

In a MIMO Space Division Multiplexing (SDM) receiver, the MIMO detector recovers the multiple transmitted data streams. For the implementation of the detector, a wide range of detection algorithms is available [2]. Linear detection has a low complexity, but suffers from poor Bit-Error-Rate (BER) performance. In contrast, soft-output Maximum Likelihood (ML) detection offers maximal performance but at the cost of very high complexity. Near-ML detection typically provides the best trade-off. Recently, a near-ML Selective Spanning with Fast Enumeration (SSFE) detector has been proposed and implemented for SDR systems [18, 20]. The proposed implementation is based on hard-output detection. However, with hard-output detection, a large part of the remarkable potential of MIMO technology is still not exploited. The key reason is that modern Forward Error Correction (FEC) decoders, such as Turbo and Low Density Parity Check (LDPC) decoders, require soft information as input to deliver the best possible BER performance. In fact, soft-output near-ML MIMO detectors bring 2–4 dB Signal-to-Noise-Ratio (SNR) gain compared to their hard-output counterparts and 6–12 dB SNR gain compared to linear detectors. Efficient implementations of soft-output near-ML MIMO detectors, which have the

capability of approaching the Shannon bound [13], are therefore in high demand.

Our work explores the feasibility of advanced soft-output MIMO detector implementations on processors with massive parallelism. We specifically consider the TI TMS320C6416 Very Long Instruction Word (VLIW) processor [28] and the ADRES Coarse Grain Array (CGA) processor [22] in our explorations. First, we design an architecture-friendly algorithm with low complexity. The resulting algorithm, which is mostly based on area- and energy-efficient operators, makes it possible to fully exploit the abundant parallelism of SDR platforms. Second, we combine Application Specific Instruction (ASI) design and code transformations to significantly reduce the number of required computations and memory accesses. Then, we perform the dimensioning of ILP/DLP for a given throughput requirement. We show that, by applying the proposed combination of techniques, it is feasible to obtain SDR implementations that deliver data rates sufficient for future wireless systems. For instance, a 2 × 4 CGA processor with 16-way SIMD can deliver 192/368 Mbps throughput for 2 × 2 64/16-Quadrature Amplitude Modulation (QAM) transmissions. To advance the feasibility study further, we estimate the area and power consumption of the programmable solution and compare it against a traditional Application Specific Integrated Circuit (ASIC) design. For drawing conclusions, we take existing work on Application Specific Instruction Set Processors (ASIPs) into account.

This paper builds on the previous work presented in [19]. The main extensions of [19] are: 1) design of different ASIs, 2) mapping and ILP/DLP explorations, 3) comparison with an ASIC approach. The latter builds on ASIC design results previously published in [9]. The remaining part of this paper is structured as follows: Section 2 explains the MIMO system model and reviews the algorithmic background of soft-output MIMO detection.
In Section 3 the architecture-friendly algorithm design of the Log-Likelihood-Ratio (LLR) generator is explained. Section 4 provides an overview of subsequent implementation and exploration experiments. In Section 5 the mapping results for the TI TMS320C6416 processor are given. In Section 6 application specific instructions are proposed, code transformations and ILP/DLP explorations are shown and implementation results for an ADRES based solution are provided. Section 7 presents the design of an ASIC reference. In Section 8 the examined implementations and existing work are compared. Finally, Section 9 concludes the work.



2 Background

This section reviews the MIMO system model and explains the algorithmic background of MIMO signal detection. This background is essential for Section 3.

2.1 MIMO System Model

The MIMO system model used in this paper is illustrated in Fig. 1. For the sake of completeness, the Forward Error Correction (FEC) blocks are also shown. The numbers of transmit and receive antennas are denoted as $N_t$ and $N_r$, respectively. For a $C$-QAM modulation, a symbol represents one out of $C = 2^q$ constellation points. Note that for 16-QAM a symbol consists of 4 bits and for 64-QAM of 6 bits. In each channel use, the transmitter maps one $qN_t \times 1$ binary vector $x$ to an $N_t \times 1$ symbol vector $s$. The transmission of a vector $s$ over a flat-fading MIMO channel can be modeled as $y = Hs + n$. Here $y$ denotes an $N_r \times 1$ symbol vector, $H$ is the $N_r \times N_t$ channel matrix and $n$ is a noise vector whose entries are independent complex Gaussian random variables with mean zero and variance $N_0/2$.

2.2 MIMO Signal Detection

The task of a MIMO detector is to recover the symbol vector $s$ that was sent by the transmitter. Soft-output MIMO detectors provide not only the most likely symbol vector $s$ (as hard-output detectors do), but also the Log-Likelihood-Ratio (LLR) for each bit in $s$, i.e. the logarithm of the ratio between the probabilities that the bit is a logical 0 or 1. Modern FEC decoders, such as Turbo and LDPC decoders, which are an essential part of emerging standards, require soft input to achieve the best BER performance. Most soft-output MIMO detectors can be decomposed into two main parts: list generator and LLR generator.

2.2.1 List Generator

The list generator computes a list $\mathcal{L}$ of the most likely symbol vectors $s$. Popular schemes for this calculation include linear detection, Successive Interference Cancellation (SIC) and Maximum-Likelihood (ML)/near-ML detection. Linear detection has a low implementation complexity, but suffers from poor Bit-Error-Rate (BER) performance. In contrast, ML detection offers maximal performance but at the cost of high complexity. Recently, near-ML detection algorithms, which offer almost ML performance at a significantly lower implementation cost, have become popular. Extensive surveys of MIMO detection schemes can be found in [2] and [25].

In this paper, we use the near-ML Selective Spanning with Fast Enumeration (SSFE) algorithm for list generation [20]. The SSFE algorithm, which is the result of our previous work, was explicitly optimized for parallel architectures. In contrast to other near-ML algorithms, such as the traditionally utilized K-Best algorithm [6, 7, 12, 27], the SSFE algorithm results in a completely regular and deterministic data-flow structure. This is important for enabling an efficient mapping on parallel architectures. In addition, the SSFE does not require expensive memory operations. Moreover, the SSFE algorithm is based on very simple and architecture-friendly operations such as additions, subtractions and shifts, which clearly reduces the implementation complexity. Besides, the SSFE algorithm is well-suited for scalable implementations, because it offers a parameter which determines the complexity-performance trade-off of an algorithm instance.

For ML detection, the MIMO detector is designed to solve

$$\hat{s} = \arg\min_{s \in \Omega^{N_t}} \|y - Hs\|^2 \qquad (1)$$


Figure 1 MIMO system model including FEC blocks.

where $\Omega^{N_t}$ is the set containing all possible $N_t \times 1$ vector signals $s$. Solving (1) corresponds to an exhaustive search. For near-ML detection, not all, but only a limited number of vector signals $s$ are considered in the search. An SSFE algorithm instance is uniquely characterized by a scalar vector $m = [m_1, \ldots, m_{N_t}]$, $m_i \le C$. The entries in this vector specify the number of scalar symbols $s_i$ that are considered at antenna $i$. With the parameter $m$, the complexity-performance trade-off point of an algorithm instance is selected. The computation of $s$ can be visualized with a spanning tree (Fig. 2). In this tree each node at level $i \in \{1, 2, \ldots, N_t\}$ is uniquely described by a partial symbol vector $s^i = [s_i, s_{i+1}, \ldots, s_{N_t}]$. Starting from level $i = N_t$, SSFE spans each node at level $i + 1$ to $m_i$ nodes at level $i$. An example of a tree for $m = [1, 2, 2, 4]$ is shown in Fig. 2b.
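As a rough illustration of how the spanning vector $m$ fixes the tree size, the following sketch counts the nodes per level and the resulting list size. The function names are hypothetical and only the spanning rule stated above is assumed; the selection of which constellation points to span is not modeled.

```python
from math import prod

def ssfe_nodes_per_level(m):
    """Return {level i: number of nodes} for an SSFE tree with spanning vector m.

    m[0] is m_1 (antenna 1) and m[-1] is m_Nt; spanning starts at level Nt.
    """
    nodes = {}
    count = 1  # the single root node at level Nt + 1
    for i in range(len(m), 0, -1):
        count *= m[i - 1]  # every node at level i + 1 spans to m_i nodes at level i
        nodes[i] = count
    return nodes

def ssfe_list_size(m):
    """Number of complete candidate vectors (nodes at level 1)."""
    return prod(m)

print(ssfe_nodes_per_level([1, 2, 2, 4]))  # {4: 4, 3: 8, 2: 16, 1: 16}
print(ssfe_list_size([1, 16]), ssfe_list_size([1, 64]))  # 16 64
```

Because no node is ever discarded, the node counts depend only on $m$ and not on the received data, which is what makes the data-flow deterministic.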

Figure 2 K-Best (a) and SSFE (b) search-tree topologies for 4 × 4 Quadrature Phase Shift Keying (QPSK) modulation; the SSFE tree is shown for $m_4 = 4$, $m_3 = 2$, $m_2 = 2$, $m_1 = 1$. K-Best first spans the $K$ nodes at level $i + 1$ to $KC$ nodes. After spanning, K-Best sorts the $KC$ nodes, selects the $K$ best nodes and deletes the rest. This approach results in a non-deterministic data-flow. In contrast, the spanned nodes in SSFE are never deleted. Therefore the data-flow in SSFE is completely regular and deterministic.

Figure 3 A fast enumeration of eight constellation points shown on an example.

Initiate the root node with $T_{N_t+1} = 0$. Starting from level $i = N_t$, the Partial Euclidean Distance (PED) of a symbol vector $s^i = [s_i, s_{i+1}, \ldots, s_{N_t}]$ is given by

$$T_i(s^i) = T_{i+1}(s^{i+1}) + \|e_i(s^i)\|^2 \qquad (2)$$

where $\|e_i(s^i)\|^2$ describes the PED increment. The SSFE algorithm has to select a set of $s^i = [s_i, s_{i+1}, \ldots, s_{N_t}]$ so that the PED increment $\|e_i(s^i)\|^2$ from (2) is minimized. By assuming a previous QR decomposition of $H$ ($H = QR$, where $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix), the PED increment $\|e_i(s^i)\|^2$ can be computed as

$$\|e_i(s^i)\|^2 = \Big\|\hat{y}_i - \sum_{j=i}^{N_t} R_{ij} s_j\Big\|^2 \qquad (3)$$

where $\hat{y}_i$ is the $i$th entry of $\hat{y} = Q^H y$. Equation 3 can be rewritten to

$$\|e_i(s^i)\|^2 = \Big\|\underbrace{\hat{y}_i - \sum_{j=i+1}^{N_t} R_{ij} s_j}_{b_{i+1}(s^{i+1})} - R_{ii} s_i\Big\|^2. \qquad (4)$$

Since the minimization of $\|e_i(s^i)\|^2$ is equivalent to the minimization of $\|e_i(s^i)/R_{ii}\|^2$, (4) can be transformed to

$$\|e_i(s^i)/R_{ii}\|^2 = \|b_{i+1}(s^{i+1})/R_{ii} - s_i\|^2 = \|\xi_i - s_i\|^2. \qquad (5)$$

The task of the SSFE is to select a set of the closest constellation points around $\xi_i$. This is essentially done by minimizing $\|e_i(s^i)/R_{ii}\|^2$ in (5). When $m_i = 1$, the closest constellation point to $\xi_i$ is $p_1 = Q(\xi_i)$, where $Q$ is the slicing operator. When $m_i > 1$, more constellation points can be enumerated based on the vector $d = \xi_i - Q(\xi_i)$. Fundamentally, the technique applied here is to incrementally grow the set around $\xi_i$ by applying heuristic-based approximations. The heuristic in SSFE is called Fast Enumeration (FE). Figure 3 shows an example. Compared to other schemes [5, 10], the FE is independent of the constellation size, so that handling 64-QAM is as efficient as handling QPSK. Moreover, the FE can be implemented with simple and architecture-friendly operators, such as additions, subtractions, bit-negations and shifts. More information about the SSFE algorithm and BER performance comparisons with other schemes can be found in [18, 20].

2.2.2 LLR Generator

The list generator provides a list of most likely candidate symbol vectors $s$, denoted by $\mathcal{L}$. The task of the LLR generator is to compute $\mathrm{LLR}(j, b)$ for each $b$th bit of the $j$th scalar symbol in $s$. This is done for all candidate symbol vectors $s$ in $\mathcal{L}$. For the calculation of $\mathrm{LLR}(j, b)$ the max-log approximation can be used [13]. It is formulated as

$$\mathrm{LLR}(j, b) = \frac{1}{2\sigma^2}\Big(\min_{s \in \Omega^0_{j,b}} \|y - Hs\|^2 - \min_{s \in \Omega^1_{j,b}} \|y - Hs\|^2\Big). \qquad (6)$$

$\Omega^0_{j,b}$ and $\Omega^1_{j,b}$ are the disjoint sets of symbol vectors that have the $b$th bit of their $j$th scalar symbol set to 0 and 1, respectively. $\sigma^2$ is the variance of the noise.
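To make the max-log LLR in (6) concrete, the following minimal sketch evaluates it by exhaustive search for a toy 2 × 2 QPSK system. The bit-to-symbol mapping, the channel values and all function names are illustrative assumptions, and the search runs over all symbol vectors rather than a reduced candidate list.

```python
from itertools import product

# Hypothetical QPSK mapping (bit 0 -> +1, bit 1 -> -1 on each axis);
# chosen for illustration only, not taken from the paper.
def qpsk(bits):
    b_i, b_q = bits
    return complex(1 - 2 * b_i, 1 - 2 * b_q)

def sq_dist(y, H, s):
    """Squared Euclidean distance ||y - H s||^2 for a small dense H."""
    return sum(abs(y[r] - sum(H[r][c] * s[c] for c in range(len(s)))) ** 2
               for r in range(len(y)))

def max_log_llr(y, H, sigma2, n_tx=2, q=2):
    """Exhaustive max-log LLRs as in (6); returns llr[j][b]."""
    inf = float("inf")
    best = [[[inf, inf] for _ in range(q)] for _ in range(n_tx)]
    for bits in product((0, 1), repeat=n_tx * q):
        sym_bits = [bits[j * q:(j + 1) * q] for j in range(n_tx)]
        d = sq_dist(y, H, [qpsk(sb) for sb in sym_bits])
        for j in range(n_tx):
            for b in range(q):
                v = sym_bits[j][b]  # this vector belongs to the set with bit value v
                best[j][b][v] = min(best[j][b][v], d)
    return [[(best[j][b][0] - best[j][b][1]) / (2 * sigma2) for b in range(q)]
            for j in range(n_tx)]

H = [[1.0, 0.2], [0.1, 0.9]]
s_true = [qpsk((0, 1)), qpsk((1, 0))]
y = [sum(H[r][c] * s_true[c] for c in range(2)) for r in range(2)]  # noiseless
llr = max_log_llr(y, H, sigma2=0.5)
print(llr)  # positive entries favour bit value 1, negative entries favour 0
```

With the sign convention of (6), a positive LLR means that the best candidate with the bit set to 1 is closer to $y$ than the best candidate with the bit set to 0.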



Considering that the LLR generator needs to compute the LLR only for entries in $\mathcal{L}$, (6) can be transformed to

$$\mathrm{LLR}(j, b) = \frac{1}{2\sigma^2}\Big(\min_{s \in \mathcal{L} \cap \Omega^0_{j,b}} \|y - Hs\|^2 - \min_{s \in \mathcal{L} \cap \Omega^1_{j,b}} \|y - Hs\|^2\Big). \qquad (7)$$

In (7), $\Omega^0_{j,b}$ and $\Omega^1_{j,b}$ have been replaced by the joint sets $\mathcal{L} \cap \Omega^0_{j,b}$ and $\mathcal{L} \cap \Omega^1_{j,b}$ respectively. To simplify the computation, the bit-flipping strategy can be applied [30]. When flipping bits in symbol vectors in $\mathcal{L}$ to 0 or 1 respectively, two new sets $\mathcal{L}^0_{j,b}$ and $\mathcal{L}^1_{j,b}$ are obtained. Considering these new sets, (7) can be modified to

$$\mathrm{LLR}(j, b) = \frac{1}{2\sigma^2}\Big(\min_{s \in \mathcal{L}^0_{j,b}} \|y - Hs\|^2 - \min_{s \in \mathcal{L}^1_{j,b}} \|y - Hs\|^2\Big). \qquad (8)$$

By applying the QR decomposition to the channel matrix $H$, it can be shown that

$$\|y - Hs\|^2 = c + \|\hat{y} - Rs\|^2 \qquad (9)$$

where $\hat{y} = Q^H y$ and $c$ = constant. The equation above enables us to transform (8) to

$$\mathrm{LLR}(j, b) = \frac{1}{2\sigma^2}\Big(\min_{s \in \mathcal{L}^0_{j,b}} \|\hat{y} - Rs\|^2 - \min_{s \in \mathcal{L}^1_{j,b}} \|\hat{y} - Rs\|^2\Big). \qquad (10)$$

The squared Euclidean-Norm $\|\cdot\|^2$ of a complex number is calculated as $\Re(\cdot)^2 + \Im(\cdot)^2$. To avoid multiplications, the squared Euclidean-Norm can be approximated by the Manhattan-Norm

$$\mathrm{LLR}(j, b) = \frac{1}{2\sigma^2}\Big(\min_{s \in \mathcal{L}^0_{j,b}} \|\hat{y} - Rs\|_1 - \min_{s \in \mathcal{L}^1_{j,b}} \|\hat{y} - Rs\|_1\Big). \qquad (11)$$

The Manhattan-Norm $\|\cdot\|_1$ of a complex number is calculated as $|\Re(\cdot)| + |\Im(\cdot)|$. Note, this approximation causes a BER performance degradation. However, this degradation is typically below 1 dB [5, 17]. So far, the complexity of the LLR generation was significantly reduced by 1) applying the bit-flipping strategy, 2) applying the QR decomposition and 3) replacing the Euclidean-Norm with the Manhattan-Norm. In Section 3 we will show that further comprehensive optimizations are still possible. Importantly, the transformations will maintain I/O consistency.

3 LLR Generator Optimization

In this section we propose further techniques to decrease the implementation complexity of the LLR computation. In a first step, we apply the partial and incremental update approach to reduce the number of required update operations. In a second step, we reduce the number of required computations and memory accesses per update operation by performing algebraic simplifications and strength reductions on the low-level data-flow. Importantly, we also replace all multiplications by shift and add operations, which enables a more efficient implementation. After introducing the proposed optimization techniques, we estimate the achievable gain.

3.1 Optimization Technique 1: Selective and Incremental Update Approach

3.1.1 Overview

A list generator that works with the Euclidean-norm provides a set $\mathcal{L}$ of $s$ with $\|y - Hs\|^2$ minimized. By applying the QR decomposition, the minimization of $\|y - Hs\|^2$ is transformed to the minimization of $\|\hat{y} - Rs\|^2$. As mentioned in Section 2.2.1, solving the above equation can be explained on a spanning tree. In this tree each node at level $i \in \{1, 2, \ldots, N_t\}$ is uniquely described by a partial symbol vector $s^i = [s_i, s_{i+1}, \ldots, s_{N_t}]$. The PED of a partial symbol vector $s^i$ is given by (2). The PED increment $\|e_i(s^i)\|^2$ can be computed with (3).

As indicated in Section 2.2.2, the bit-flipping strategy can be applied for reducing the complexity of the LLR generation. When flipping the $b$th bit of the $j$th scalar symbol in $\mathcal{L}$ to get $\mathcal{L}^1_{j,b}$ and $\mathcal{L}^0_{j,b}$, an original partial symbol vector $s^i = [s_i, \ldots, s_j, \ldots, s_{N_t}]$ is flipped to $s^{i(0)}_{j,b} = [s_i, \ldots, s^0_{j,b}, \ldots, s_{N_t}]$ and $s^{i(1)}_{j,b} = [s_i, \ldots, s^1_{j,b}, \ldots, s_{N_t}]$ respectively. Note, $s^0_{j,b}$ means that the bit has been flipped to 0 and $s^1_{j,b}$ means that the bit has been flipped to 1. Considering the explanations above and the optimizations proposed in Section 2.2.2, the task of the LLR generator can be summarized and formulated as follows:

1. Calculation of the Partial Manhattan Distance (PMD) increments, $\|e_i(s^{i(0)}_{j,b})\|_1$ and $\|e_i(s^{i(1)}_{j,b})\|_1$, for flipped partial symbol vectors in $\mathcal{L}^1_{j,b}$ and $\mathcal{L}^0_{j,b}$ with

$$\begin{aligned}
\|e_i(s^{i(0)}_{j,b})\|_1 &= \Big\|\hat{y}_i - \sum_{k=i}^{j-1} R_{ik} s_k - R_{ij} s^0_{j,b} - \sum_{k=j+1}^{N_t} R_{ik} s_k\Big\|_1 \\
\|e_i(s^{i(1)}_{j,b})\|_1 &= \Big\|\hat{y}_i - \sum_{k=i}^{j-1} R_{ik} s_k - R_{ij} s^1_{j,b} - \sum_{k=j+1}^{N_t} R_{ik} s_k\Big\|_1.
\end{aligned} \qquad (12)$$

2. Update of PMD 0 and PMD 1, $T^0_i(s^{i(0)}_{j,b})$ and $T^1_i(s^{i(1)}_{j,b})$, for flipped partial symbol vectors with

$$\begin{aligned}
T^0_i(s^{i(0)}_{j,b}) &= T^0_{i+1}(s^{i+1(0)}_{j,b}) + \|e_i(s^{i(0)}_{j,b})\|_1 \\
T^1_i(s^{i(1)}_{j,b}) &= T^1_{i+1}(s^{i+1(1)}_{j,b}) + \|e_i(s^{i(1)}_{j,b})\|_1.
\end{aligned} \qquad (13)$$

3.1.2 Optimization 1

Since we rely on the bit-flipping strategy, the following is noticeable: When flipping the $b$th bit of the $j$th scalar symbol in the partial symbol vectors, only $\{s^i\}$ with $i \in \{1, \ldots, j\}$ are influenced, but $\{s^i\}$ with $i \in \{j + 1, \ldots, N_t\}$ remain unchanged. Hence, we only need to calculate (12) and (13) for $i \in \{1, \ldots, j\}$. Such a selective updating reduces the number of computations significantly.

3.1.3 Optimization 2

We can rewrite (12) as

$$\begin{aligned}
\|e_i(s^{i(0)}_{j,b})\|_1 &= \Big\|\hat{y}_i - \sum_{k=i}^{N_t} R_{ik} s_k + R_{ij}(s_j - s^0_{j,b})\Big\|_1 = \Big\|e_i(s^i) - \underbrace{R_{ij}(s^0_{j,b} - s_j)}_{e^{i(0)}_{j,b}}\Big\|_1 \\
\|e_i(s^{i(1)}_{j,b})\|_1 &= \Big\|e_i(s^i) - \underbrace{R_{ij}(s^1_{j,b} - s_j)}_{e^{i(1)}_{j,b}}\Big\|_1.
\end{aligned}$$

This has two advantages: First, only one scalar symbol in the partial symbol vectors is modified. Second, we can reuse $e_i(s^i)$, since it has already been computed by the list generator. If the intermediate results of $e_i(s^i)$ are temporarily stored and accessible by the LLR generator, we only need to calculate $e^{i(0)}_{j,b}$ and $e^{i(1)}_{j,b}$ and reuse $e_i(s^i)$ from the storage. With this incremental update approach, the complexity for calculating $\|e_i(s^{i(0)}_{j,b})\|_1$ and $\|e_i(s^{i(1)}_{j,b})\|_1$ is considerably reduced.

3.2 Optimization Technique 2: Algebraic Simplification and Strength Reduction

Practical communication systems adopt Gray-coded modulation schemes. Two examples of Gray-coded 16-QAM constellations are shown in Fig. 4. Figure 4a illustrates the scheme in 3GPP LTE and IEEE 802.16e-2005 (WiMAX) and Fig. 4b illustrates a common Gray-coded scheme that is used in other systems. As can be seen, the two ($N_b/2$) most significant bits out of the $N_b$ bits determine the position of the modulated signal on the I-axis and the two ($N_b/2$) least significant bits determine the position on the Q-axis. The characteristic that the position in the I/Q constellation diagram is determined by specific bits in the data word is very common for Gray-coded schemes. Because of this attribute, $s^0_{j,b} - s_j$ and $s^1_{j,b} - s_j$ have a real or imaginary part that is zero. We can exploit this observation for applying algebraic simplifications.

Let $b(s_j)$ denote the $b$th bit in the scalar symbol $s_j$ and let $\delta(b)$ denote the shift distance of constellations. The latter is relevant when flipping the $b$th bit of $s_j$ from 0 to 1. Note that $\delta(b)$ is a real number. As mentioned above, when $0 \le b < N_b/2$, the constellation-shift is on the Q-axis:

$$\begin{aligned}
\Re(e^{i(0)}_{j,b}) &= \Im(R_{ij})\, b(s_j)\, \delta(b) \\
\Im(e^{i(0)}_{j,b}) &= -\Re(R_{ij})\, b(s_j)\, \delta(b) \\
\Re(e^{i(1)}_{j,b}) &= \underbrace{\Im(R_{ij})\, b(s_j)\, \delta(b)}_{\Re(e^{i(0)}_{j,b})} - \Im(R_{ij})\, \delta(b) \\
\Im(e^{i(1)}_{j,b}) &= \Re(R_{ij})\, \delta(b) \underbrace{{} - \Re(R_{ij})\, b(s_j)\, \delta(b)}_{\Im(e^{i(0)}_{j,b})}
\end{aligned} \qquad (14)$$



Figure 4 Examples of Gray-coded 16-QAM constellations. a The scheme in 3GPP LTE and IEEE 802.16e-2005; b a common scheme in other systems.
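The single-axis property of Gray-coded bit flips can be checked with a small sketch. The per-axis Gray mapping below is an illustrative assumption (it is not claimed to be the exact bit assignment of either panel of Fig. 4), the constellation is unscaled, and bit positions are indexed from the most significant bit.

```python
# Hypothetical per-axis Gray mapping for 16-QAM (unscaled levels -3..3);
# an illustrative assumption, not the exact LTE/WiMAX bit assignment.
AXIS_LEVEL = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}

def modulate(bits):
    """Map 4 bits (MSB first) to a 16-QAM point: upper 2 bits -> I, lower 2 -> Q."""
    return complex(AXIS_LEVEL[tuple(bits[:2])], AXIS_LEVEL[tuple(bits[2:])])

def flip_shift(bits, pos, value):
    """Return the constellation shift when the bit at index pos is forced to value."""
    flipped = list(bits)
    flipped[pos] = value
    return modulate(flipped) - modulate(bits)

shifts = set()
for n in range(16):
    bits = [(n >> k) & 1 for k in (3, 2, 1, 0)]
    for pos in range(4):
        for value in (0, 1):
            d = flip_shift(bits, pos, value)
            # The shift is confined to one axis: real or imaginary part is zero.
            assert d.real == 0 or d.imag == 0
            shifts.add(abs(d))
print(sorted(shifts))  # [0.0, 2.0, 6.0]
```

For this assumed mapping every non-zero shift distance is an even integer, which is the property that later allows the multiplications to be replaced by shifts and additions.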



In contrast, when $N_b/2 \le b < N_b$, the constellation-shift is on the I-axis (with $\delta(b)$ the real-valued shift distance of the constellation point):

$$\begin{aligned}
\Re(e^{i(0)}_{j,b}) &= -\Re(R_{ij})\, b(s_j)\, \delta(b) \\
\Im(e^{i(0)}_{j,b}) &= -\Im(R_{ij})\, b(s_j)\, \delta(b) \\
\Re(e^{i(1)}_{j,b}) &= \Re(R_{ij})\, \delta(b) \underbrace{{} - \Re(R_{ij})\, b(s_j)\, \delta(b)}_{\Re(e^{i(0)}_{j,b})} \\
\Im(e^{i(1)}_{j,b}) &= \Im(R_{ij})\, \delta(b) \underbrace{{} - \Im(R_{ij})\, b(s_j)\, \delta(b)}_{\Im(e^{i(0)}_{j,b})}
\end{aligned} \qquad (15)$$

On parallel programmable architectures that support predication, the use of conditional executions does not reduce the number of operations. Since this feature is often present in massive parallel architectures, we did not exploit conditional executions based on $b(s_j) \in \{0, 1\}$ in the formulations above. However, if run-time conditional executions do not hamper the efficiency on the targeted architecture, we can refine the above formulations further. Since $b(s_j) \in \{0, 1\}$, $e^{i(0)}_{j,b}$ or $e^{i(1)}_{j,b}$ must be 0:

With $0 \le b < N_b/2$ and $b(s_j) = 0$:

$$e^{i(0)}_{j,b} = 0, \quad \Re(e^{i(1)}_{j,b}) = -\Im(R_{ij})\, \delta(b), \quad \Im(e^{i(1)}_{j,b}) = \Re(R_{ij})\, \delta(b) \qquad (16)$$

With $0 \le b < N_b/2$ and $b(s_j) = 1$:

$$\Re(e^{i(0)}_{j,b}) = \Im(R_{ij})\, \delta(b), \quad \Im(e^{i(0)}_{j,b}) = -\Re(R_{ij})\, \delta(b), \quad e^{i(1)}_{j,b} = 0 \qquad (17)$$

With $N_b/2 \le b < N_b$ and $b(s_j) = 0$:

$$e^{i(0)}_{j,b} = 0, \quad \Re(e^{i(1)}_{j,b}) = \Re(R_{ij})\, \delta(b), \quad \Im(e^{i(1)}_{j,b}) = \Im(R_{ij})\, \delta(b) \qquad (18)$$

With $N_b/2 \le b < N_b$ and $b(s_j) = 1$:

$$\Re(e^{i(0)}_{j,b}) = -\Re(R_{ij})\, \delta(b), \quad \Im(e^{i(0)}_{j,b}) = -\Im(R_{ij})\, \delta(b), \quad e^{i(1)}_{j,b} = 0 \qquad (19)$$

The major computations in the above formulations are $\Re(R_{ij})\, \delta(b)$ and $\Im(R_{ij})\, \delta(b)$. These multiplications can be converted to simple bit-shifts and additions if the original input signal $y$ is properly scaled. This is often the case, because QAM constellation points are usually scaled for normalized average power. For instance, in IEEE 802.16e-2005, QPSK, 16-QAM and 64-QAM are scaled by $1/\sqrt{2}$, $1/\sqrt{10}$ and $1/\sqrt{42}$, respectively. If we cancel the scaling at the receiver side and restore the original QAM constellations instead, which is possible because the I and Q values of constellations come from a specific set, $\Re(R_{ij})\, \delta(b)$ and $\Im(R_{ij})\, \delta(b)$ can be computed with bit-shifts and additions. For Gray-coded 16-QAM and 64-QAM schemes, $|\delta(b)| \in \{2, 4, 6, 8, 10, 12, 14\}$. Therefore the multiplication with $|\delta(b)|$ can be efficiently implemented with at most two bit-shifts and one addition or subtraction. Some examples ($\ll$ denotes the left bit-shift operation):

$x \cdot 6 = x \cdot 2 + x \cdot 4 = (x \ll 1) + (x \ll 2)$
$x \cdot 12 = x \cdot 4 + x \cdot 8 = (x \ll 2) + (x \ll 3)$
$x \cdot 10 = x \cdot 2 + x \cdot 8 = (x \ll 1) + (x \ll 3)$
$x \cdot 14 = x \cdot 16 - x \cdot 2 = (x \ll 4) - (x \ll 1)$

The proposed optimizations decrease the complexity of the LLR computation significantly. However, due to the lack of low-level (bit-level) instructions, the proposed optimizations are typically not efficiently implementable on state-of-the-art processors. To overcome this issue, we will propose specific instructions and demonstrate them on a CGA processor.

3.3 Estimation of Achievable Gain

To estimate the gain that is achievable by implementing the proposed optimizations, we compare the optimized LLR generator to a direct implementation of (11). We estimate the reduction of real additions, bit-shifts and memory operations (load and store operations). Note, the optimized LLR generator does not contain multiplications anymore. We calculate the number of low-level operations based on fixed-point arithmetic. The overhead of address generation is considered as well. Since we rely on the SSFE algorithm for list generation, the addresses can be computed with bit-shift and add operations only.

Figure 5 shows the reduction of operations for 2 × 2 and 4 × 4 transmissions with 16/64-QAM modulation schemes. We chose 2 × 2 and 4 × 4 transmissions because they are commonly used in commercial systems, such as WiMAX and 3GPP LTE. 16-QAM and 64-QAM are considered, because for lower-order modulation schemes, such as QPSK and BPSK, simple exhaustive search can be applied and therefore the proposed optimizations are not relevant. As can be seen in Fig. 5, multiple options for $m$ are investigated. As mentioned above, the parameter $m$ determines the search range during list generation and therefore the complexity of the algorithm instance. In addition to the complete removal of expensive multiplications, we can observe a significant reduction of additions, bit-shifts and memory operations. Specifically, 26% to 83% of additions, 76% to 94% of bit-shifts and 63% to 91% of memory operations were eliminated for the case study. When comparing Fig. 5a–d, we can notice that the gain increases with the modulation size, with more antennas and with larger $m$. The results show that the proposed optimizations lead to substantial improvements and are therefore very relevant.

Figure 5 Reductions of additions, bit-shifts and memory operations of the proposed LLR generator compared to a reference based on (11), plotted as reduction rate over the search range $m$ of SSFE for a 2 × 2 16-QAM, b 2 × 2 64-QAM, c 4 × 4 16-QAM and d 4 × 4 64-QAM. Besides, all multiplications were removed.
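The strength reduction described in Section 3.2 can be illustrated with a short sketch that multiplies a fixed-point (integer) value by each admissible shift distance using only shifts and a single addition or subtraction. The function name and the lookup-table structure are assumptions made for illustration; the decompositions for 6, 10, 12 and 14 are the ones quoted in the text.

```python
def shift_add_mul(x, k):
    """Multiply integer x by k in {2, 4, ..., 14} using shifts and one add/sub."""
    table = {
        2: lambda v: v << 1,
        4: lambda v: v << 2,
        6: lambda v: (v << 1) + (v << 2),
        8: lambda v: v << 3,
        10: lambda v: (v << 1) + (v << 3),
        12: lambda v: (v << 2) + (v << 3),
        14: lambda v: (v << 4) - (v << 1),
    }
    return table[k](x)

# Check every admissible factor against an ordinary multiplication.
for k in range(2, 15, 2):
    for x in (-37, 0, 5, 1023):
        assert shift_add_mul(x, k) == x * k
```

On a processor, each table entry would correspond to a fixed instruction pattern, so no multiplier is needed once the constellation scaling has been removed.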

4 Implementation Overview

4.1 Targeted Throughput and Algorithm Instance

Our work targets a minimum throughput of 120 Mbps for a 2 × 2 near-ML soft-output 64-QAM transmission. A previously reported MIMO receiver, based on the ADRES processor, delivers similar throughput in linear hard-output MIMO detection mode [4]. We focus on 2 × 2 64-QAM and 16-QAM systems for two main reasons: 1) this transmission scheme is part of all major wireless communication standards and 2) the complexity is lower compared to 4 × 4, which makes the implementation on programmable architectures more feasible. To take advantage of soft-decoding, we target not only high throughput but also high communication performance (BER). Therefore we allow a maximum SNR degradation of only 0.1 dB relative to the maximum obtainable performance (ML detection; $m = [C, C]$ in SSFE). To fulfill this specification with the typically lowest required complexity, we select the SSFE algorithm instance to be $m = [1, 16]$ for 16-QAM and $m = [1, 64]$ for 64-QAM. For the BER performance evaluation, we use the 3GPP/3GPP2 Spatial Channel Models (SCM): Suburban macro, Urban macro and Urban micro. The starting point for the implementations is a manually written low-level C code for the SSFE list generator, the optimized LLR generator and the reference LLR generator, which is based on (11).

4.2 Outline

In Section 3.3 we showed that the number of operations and memory accesses of the proposed LLR generator are significantly lower compared to the reference LLR generator. However, to obtain a higher gain from the proposed optimizations, the architecture has to support certain low-level instructions. We will first evaluate the effective gain by comparing both LLR generators on the TI TMS320C6416 processor in Section 5. In Section 6, we will propose specialized instructions for improving the implementation efficiency. Subsequently, the benefit of these instructions will be demonstrated on the basis of an extended CGA processor. Section 7 shows the implementation of the proposed MIMO detection algorithm as an ASIC. With all of these implementations available, fundamental conclusions about the feasibility of soft-output MIMO detectors on massive parallel processors can be drawn.

5 Implementation on a State-of-the-Art TI Processor

5.1 Architecture

We chose the TI TMS320C6416 VLIW DSP processor as a representative state-of-the-art reference architecture. It includes eight parallel FUs that are organized in two clusters. Each FU can execute one 32-bit instruction per cycle. The level-1 memory consists of a 16 KByte direct-mapped instruction cache (L1P) and a 16 KByte 2-way set-associative data cache (L1D). More information can be found in [28].

5.2 Implementation and Results

We implemented the SSFE list generator, the reference LLR generator (V0) and the optimized LLR generator (V1) on the TMS320C6416. Table 1 shows the mapping results for decoding one 2 × 2 64-QAM MIMO symbol. From the number of instructions we can observe that the complexity of LLR generation is indeed dominant for soft-output MIMO detection. Focusing on the optimization of the LLR generation is therefore essential. When comparing both LLR generator implementations, we notice that the proposed algorithm significantly reduces the number of instructions, the number of L1D accesses and the number of L1D misses. Remarkably, the number of L1D misses has been reduced by a factor of more than 100 (from 1,260 to 12), while the number of L1D accesses has dropped by a factor of almost five. This results from the fact that the proposed algorithm requires less intermediate storage and therefore fewer accesses than the reference algorithm.

From Table 1 we can further observe that the cycle count of LLR V1 is higher than the cycle count of LLR V0. At first sight this seems confusing because the number of instructions is actually lower for LLR V1. However, the result can be explained as follows: Since the TI processor does not offer specialized low-level instructions, the innermost loop of the optimized algorithm requires many standard instructions and a huge number of live registers (for storing all intermediate values). For this reason, the compiler fails to apply software pipelining techniques efficiently and, as a consequence, the number of cycles is higher than for LLR V0. Nevertheless, if we assume that the TI compiler can map the algorithm very efficiently, i.e. with an Instructions per Cycle (IPC) value of 6 (as in the case of the list generator), 13,841 cycles are required for decoding one MIMO symbol (the 83,043 instructions of the list generator and LLR V1 divided by 6). This optimistic assumption translates to a throughput of less than 2 Mbps, even with the four-way SIMD supported by the TMS320C6416 processor (800 MHz clock frequency). Clearly, for meeting the targeted 120 Mbps throughput, a processor with specialized instructions and more parallelism is necessary.

Table 1 Mapping results on the TI TMS320C6416.

                List G. (SSFE)   LLR G. V0 (ref.)   LLR G. V1 (opt.)
Instructions    6,209            234,898            76,834
Cycles          1,057            57,189             59,479
L1D accesses    791              52,651             11,236
L1D misses      96               1,260              12
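The derived figures quoted for the TI mapping (IPC values, reduction factors and the optimistic cycle count) can be cross-checked against Table 1 with a few lines of Python. This is only a sketch for verification and not part of the original implementation: the dictionary layout and the function names are illustrative assumptions, and only the numeric values are taken from Table 1.

```python
import math

# Values copied from Table 1 (decoding one 2 x 2 64-QAM MIMO symbol).
# The key names are hypothetical and chosen for this sketch only.
TABLE1 = {
    "list_gen": {"instructions": 6209, "cycles": 1057,
                 "l1d_accesses": 791, "l1d_misses": 96},
    "llr_v0": {"instructions": 234898, "cycles": 57189,
               "l1d_accesses": 52651, "l1d_misses": 1260},
    "llr_v1": {"instructions": 76834, "cycles": 59479,
               "l1d_accesses": 11236, "l1d_misses": 12},
}


def ipc(kernel: str) -> float:
    """Instructions per cycle actually achieved by one kernel."""
    entry = TABLE1[kernel]
    return entry["instructions"] / entry["cycles"]


def reduction(metric: str) -> float:
    """Factor by which LLR V1 improves on LLR V0 for a given metric."""
    return TABLE1["llr_v0"][metric] / TABLE1["llr_v1"][metric]


def optimistic_cycles(target_ipc: int = 6) -> int:
    """Cycles for list generator plus LLR V1 if the compiler reached target_ipc."""
    total = (TABLE1["list_gen"]["instructions"]
             + TABLE1["llr_v1"]["instructions"])
    return math.ceil(total / target_ipc)


if __name__ == "__main__":
    for kernel in TABLE1:
        print(f"IPC {kernel}: {ipc(kernel):.2f}")
    for metric in ("instructions", "l1d_accesses", "l1d_misses"):
        print(f"V0/V1 {metric}: {reduction(metric):.1f}x")
    print("optimistic cycles:", optimistic_cycles())
```

The low IPC of LLR V1 compared with the list generator makes the failed software pipelining visible directly in the numbers.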



6 Implementation on an Enhanced ADRES CGA Processor

6.1 Architecture

As demonstrated in Section 5, achieving the targeted throughput of 120 Mbps requires a processor that offers more parallelism and specialized instructions. In this work we investigate the ADRES CGA processor template [22]. An instance of the processor template is shown in Fig. 6. As can be seen, the parameterizable template consists of a Coarse Grain Array (CGA) of densely interconnected FUs that have local Register Files (RFs) and individual configuration memories (loop buffers). In addition, a few VLIW FUs are present. The VLIW FUs and a limited subset of the CGA FUs are connected to the global (shared) data RF, which allows data to be exchanged between both types of FUs. Since the VLIW FUs and the CGA FUs operate in a time-multiplexed fashion, two modes are available: VLIW mode and CGA mode. All FUs support SIMD. For our explorations, we rely on the DRESC C compiler framework [22]. The compiler supports both the VLIW mode and the CGA mode. In general, loops are mapped on the CGA section and the rest of the code is scheduled on the VLIW section.

The ADRES template makes it possible to instantiate a processor with a specific amount of ILP and DLP. By changing the size of the array (number of FUs), the amount of supported ILP can be tuned. By changing the number of SIMD slots in each FU, the amount of supported DLP can be tuned. The required amount of ILP and DLP is application dependent. To determine the best combination of ILP and DLP, i.e. the one that fulfills the performance requirements with the lowest implementation complexity, extensive explorations are typically needed. In Section 6.3 we show these explorations for the MIMO detector design.

6.2 Application Specific Instructions

In this section we propose Application Specific Instructions (ASIs) to increase the efficiency of the MIMO detector implementation. An ASI is a large cluster of connected operators. Packing cascaded operators into one instruction allows many operators to be executed in one clock cycle, which in turn leads to a higher throughput. In addition, ASIs reduce the requirements on intermediate storage. Furthermore, ASIs reduce the size of the optimization problem for the compiler. As a consequence, the compiler can allocate and schedule resources more effectively and efficiently. In the targeted MIMO detector design the computations are dominated by only a few equations. Therefore the overhead of implementing ASIs for these equations is acceptable. An estimation of the cost is given in Section 6.4.

6.2.1 Overview of the Instructions


In our work, we design ASIs based on algorithmic insights. The computationally dominant parts are the main candidates for ASIs. Table 1 showed that the LLR generator dominates in a soft-output MIMO detector. Designing ASIs only for the LLR generator could therefore be sufficient. However, to further increase the efficiency, we also consider the list generator as a candidate. In total we designed four ASIs, denoted ASI0-3. ASI0 is designed for the LLR generation and ASI1-3 for the list generation. ASI1 is used for the list generation at antenna 2 (Nt); ASI2 and ASI3 are used for the list generation at antenna 1 (other than Nt). An overview of the designed ASIs, with information about the implemented equations and the operation counts, is provided in Table 2. A schematic of the datapath implementations is shown in Fig. 7. It can be seen that a rather large number of operations has been included in each ASI. Because of the low-level algorithm optimizations of the SSFE list generator and the LLR generator, the ASIs are mostly based on low-cost operators. Except for ASI3, no multiplications are required. As expected, ASI0 is the most important one in terms of number of executions. Specifically, when the detector is configured for 2 × 2 64-QAM (m = [1, 64]), ASI0 is executed 1,152 times to detect one MIMO symbol, whereas ASI1-3 are executed only 64 times each. This unbalanced distribution of execution time has to be considered when defining the degree of parallelization.

Table 2 Overview of application specific instructions.

             ASI0            ASI1               ASI2                ASI3
Function     LLR G. (opt.)   List G. (ant. 2)   List G. (ant. 1a)   List G. (ant. 1b)
Equation     (13, 14)        (4)                (4)                 (5)
Multip.      0               0                  0                   2
Add./sub.    8               7                  8                   12
Abs.         2               2                  0                   0
Shift        4               4                  8                   4
Mux.         4               0                  0                   4

Figure 6 ADRES instance with 16 CGA FUs and three VLIW FUs.

Figure 7 Datapath implementation of ASI0-3 with embedded control logic: (a) ASI0 (LLR generator), (b) ASI1 (list generator for antenna 2), (c) ASI2 (list generator for antenna 1a), (d) ASI3 (list generator for antenna 1b). Except for ASI3, no multipliers are used.

6.2.2 ASI0 (LLR Generation)

The datapath of ASI0 is illustrated in Fig. 7a. Re() and Im() denote the real and imaginary part of a signal; PED_inc denotes $e_i(s_i)$. Part 1 of the datapath calculates ( Rij)| (b )| and ( Rij)| (b )| as required for $e_{ij,b}^{(0)}$ and $e_{ij,b}^{(1)}$. Because of the applied low-level optimizations (see Section 3.2), the computation can be performed with shift and add operations. Designed for (14), part 2 calculates |( (ei (si )) ( Rij)| (b )|)|, |( (ei (si )) ( Rij)| (b )|)| or |( (ei (si )) ( Rij)| (b )|)|, |( (ei (si )) ( Rij)| (b )|)|. Part 3 calculates $T_{ij,b}^{(0)}$ and $T_{ij,b}^{(1)}$ as formulated in (13). The computations that the ASI has to perform depend on $N_b$, $b$ and $s_{j,b}$. Instead of multiple fixed ASIs, only one flexible ASI, which supports the required parameter range, is implemented. To provide the necessary flexibility, a small control logic is embedded in the datapath. Embedding the control logic in the datapath typically reduces the cost of providing flexibility.

6.2.3 ASI1-3 (List Generation)

The datapaths of ASI1-3 are shown in Fig. 7b–d. ASI1 implements (4) and ASI2-3 implement (5). The upper datapath part of ASI1 generates the symbol for antenna 2 (s = 1..64) with the required data format (see Section 3.2). The multiplications, additions and subtractions in (4) have been implemented with specific SH-A/S units. As can be seen in Fig. 7b, an SH-A/S unit consists of two shifters, two adder/subtractors and embedded control logic. This efficient implementation was enabled by applying low-level optimizations. The division in (5), which has a low duty cycle, is considered part of the channel matrix pre-processing. However, the multiplication with 1/Rii is part of ASI3. The lowermost datapath parts of ASI1 and ASI3 compute the PED increment.

6.2.4 Implementation

To estimate the maximal delay and area, the proposed ASIs have been implemented in VHDL and synthesized with the TSMC 90 nm General-Purpose (GP) library. The signal s has been quantized with 8 bits (4 bits for the real part and 4 bits for the imaginary part) and all other data signals with 16 bits. For the standard-cell synthesis, Synopsys Design Compiler was used (optimization constraints: min. delay and min. area; worst-case design corner). As in [4], we target a clock frequency of 400 MHz for the ADRES processor. To estimate the number of clock cycles required for executing a certain ASI, we take the following into account:

• Critical path delay of an ASI based on synthesis results
• Additional overhead for integrating ASIs into FUs (i.e. delay of large multiplexers)

Table 3 shows the synthesis results, the required data input/output Bit-Width (BW) and the estimated number of clock cycles. Although the ASIs combine a substantial number of cascaded operators, only two or three clock cycles are required for their execution. By inserting pipeline registers in the ASIs, the total number of clock cycles required for the MIMO detection can potentially be reduced. Nevertheless, for the following ILP/DLP explorations, we rely on the clock cycle numbers provided in Table 3 (worst-case estimation).

Table 3 Implementation of ASI in TSMC 90 nm.

                                  ASI0     ASI1    ASI2     ASI3
Max. delay (ns)                   4.91     3.55    2.42     6.19
Area (µm²)                        13,070   9,058   12,278   20,104
Required clock cycles @ 400 MHz   2        2       2        3
Data input BW (per operand)       52       32      36       48
Data output BW                    32       56      32       56

6.3 ILP/DLP Explorations

6.3.1 Initial Code Transformations

We can apply pre-compiler code transformations to further improve the efficiency. Instead of executing the list generator and the LLR generator independently of each other, we can merge the loops from these



components together and execute the whole code in one common loop. By performing this optimization, storage requirements are reduced and the locality of data accesses is improved. For decoding one 2 × 2 64-QAM MIMO symbol, the transformed code with ASIs requires only 2,577 instructions and 36 L1D accesses, whereas the original code requires 76,834 instructions and 11,236 L1D accesses on the TMS320C6416. This is an improvement of 30× in the number of instructions and of more than 300× in the number of memory accesses. These results clearly show the benefit of ASIs.

6.3.2 Explorations and Results

We consider a one-processor solution for soft-output MIMO detection to be sufficient because of the following previous improvements: 1) architecture-friendly algorithm design, 2) extension with ASIs and 3) code transformations to increase the advantages of ASIs. The ILP/DLP explorations (array size and SIMD width) are combined with loop transformations to improve the scheduling density for a chosen configuration. During the explorations, all FUs in the CGA are assumed to support ASI0 (LLR generation), whereas ASI1, ASI2 and ASI3 (list generation) are supported in only one VLIW FU. This decision is based on the knowledge that ASI0 is executed more often than the other ASIs. As in the NXP EVP processor (16-way) [29] or in the SODA processor (32-way) [21], we exploit very wide SIMD slots. Since the proposed algorithm is explicitly designed for DLP architectures, SIMD can be fully exploited without causing major overhead. Therefore the throughput scales linearly with the number of SIMD slots. The results of the ILP/DLP explorations are summarized in Table 4. The VLIW and CGA cycle counts indicate how many cycles have been executed on the corresponding section (see Fig. 6). The Instructions Per Cycle (IPC) metric indicates how well the available
ILP has been exploited. A system-level metric that indicates how efficiently the architectural resources have been utilized is given by Mbps/FU/SIMD. The results show that the targeted 120 Mbps for 2 × 2 64-QAM is achievable with a feasible amount of parallelization. For instance, 192/368 Mbps (64/16-QAM) is obtainable on an ADRES instance with eight FUs, each with 16-way SIMD. A commercial design of comparable complexity is the NXP EVP processor, which has ten FUs, six of which support 16-way SIMD [29]. It should be mentioned that the instruction issue of the ADRES and NXP EVP FUs differs. From Table 4 we can further observe that the efficiency of resource utilization decreases with the size of the FU array. For instance, for 2 × 2 64-QAM, a 2 × 4 array achieves an IPC of 6.8, i.e. a scheduling density of 85%, whereas a 4 × 4 array achieves an IPC of only 11.4, i.e. a scheduling density of 71.25%. The Mbps/FU/SIMD metric provides a similar indication. This behavior can be explained as follows: Increasing the array size results in an exponential increase in complexity. Because of this high complexity, the compiler can no longer perform efficient resource allocation and scheduling, and therefore the IPC per FU goes down. Among the explored options, the 2 × 4 array with 16-way SIMD gives the best throughput and the best scheduling density (85%). Therefore we select this instance for further calculations.

Table 4 ILP/DLP explorations targeting 120 Mbps throughput with 2 × 2 64-QAM.

Mod.     CGA size   SIMD   VLIW cycles   CGA cycles   Total cycles   IPC    Total TP (Mbps)     TP/FU/SIMD
64-QAM   2 × 4      16     42            355          397            6.8    12.1 × 16 = 192.6   1.51
64-QAM   4 × 4      8      49            227          276            11.4   17.4 × 8 = 139.2    1.09
64-QAM   6 × 4      8      45            183          228            15.5   21.1 × 8 = 168.8    0.88
64-QAM   8 × 4      8      49            163          212            18.1   22.6 × 8 = 180.8    0.71
16-QAM   2 × 4      16     40            99           139            5.6    23.0 × 16 = 368.0   2.88
16-QAM   4 × 4      8      43            75           118            8.2    27.1 × 8 = 216.8    1.69
16-QAM   6 × 4      8      44            61           105            10.6   30.5 × 8 = 244.0    1.27
16-QAM   8 × 4      8      44            54           98             11.4   32.7 × 8 = 261.6    1.02

6.4 Area and Power Estimations

In order to get an idea of the cost of a soft-output MIMO detector implementation on a massive parallel processor, we roughly estimate the area and power consumption of a representative ADRES instance. For the estimation, we start from an ADRES template instance with the following configuration:

• Three VLIW FUs with 64 bit wide datapath



• Eight CGA FUs with 64 bit wide datapath (2 × 4 array)
• 512K data memory
• 32K instruction cache

We extend this template instance to support the required 16-way SIMD for ASIs. From Table 3 it can be seen that an operand width of 64 bit is sufficient for loading data in and out of ASIs. Therefore we choose the bit-width of an ASI SIMD slot to be 64 bit. We consider the following modifications:

• Add the datapath of ASI1, ASI2 and ASI3 with 16-way support to one VLIW FU (as considered in the ILP/DLP exploration in Section 6.3.2)
• Increase the VLIW global data register file from 4K to 8K (a size of 8K is sufficient for MIMO detection because basically only one VLIW FU is active and because the deployed ASIs reduce the intermediate storage requirements)
• Add the datapath of ASI0 with 16-way support to all eight CGA FUs (see the ILP/DLP exploration results in Section 6.3.2)
• Extend the default local CGA register file size by a factor of 16 (because of the 16-way SIMD of ASI0)

Note that the extension to 16-way SIMD effectively causes more overhead than considered here. For instance, we neglected the impact on the interconnect. However, this is not an issue if we assume that the routing for the 16-way extension is feasible by employing semi-custom design techniques [24]. The area consumption of the design was estimated based on synthesis results of the components. Therefore the obtained results are considered a lower bound. Figure 8 shows the area breakdown. The total area in TSMC 90 nm GP technology is about 9 mm². 40% of the total area is occupied by the datapath of the FUs, including 27% of the total area for the ASIs. We roughly estimate the power consumption of the MIMO detector in 64-QAM mode based on statistical power simulations and on experience and results from previous designs. We assume that the extended ADRES instance operates 10% of the time in VLIW mode and 90% in CGA mode (see Table 4). Based on this rough estimation, the ADRES consumes about 160 mW in VLIW mode and about 400 mW in CGA mode. The average power consumption is therefore about 376 mW at 400 MHz clock frequency.

Figure 8 Estimated area breakdown of the ADRES instance with 2 × 4 CGA and 16-way SIMD (total 9 mm²): ASIs 27%, FUs CGA 10%, FUs VLIW 3%, RFs CGA 16%, RFs VLIW 7%, Data Memory 16%, Instr. Cache 15%, Config. Memories 5%, Peripherals 1%.

7 Implementation as ASIC

7.1 Architecture

The ASIC implementation is based on the same algorithm as the ADRES implementation. It supports 2 × 2 near-ML soft-output MIMO detection for 16-QAM and 64-QAM. The architecture, which relies on a rather high degree of data parallelism and on pipelining, is shown in Fig. 9. Application Specific Block (ASB) 0, which performs the LLR computation, basically consists of six parallel ASI0 datapaths. With this degree of parallelization, one ASB0 can compute the LLRs for q bits (q = 6 in 64-QAM) of one antenna simultaneously. ASB1 and ASB2/3, which implement ASI1 and ASI2/ASI3 respectively, are the functional blocks for list generation. Comparison Blocks (CPBs) are required for selecting the symbol with the highest probability. The control unit, which generates control signals for the datapath and the output, is implemented as a Finite State Machine (FSM).

Figure 9 Architecture of the MIMO detector ASIC for 2 × 2 16/64-QAM (labels in the figure: start, qam, valid, busy, clk, rst, ŷ, R, 1/R, ASB0 for each antenna, soft outputs LLR00, LLR01, LLR10, LLR11).

In this architecture the number of clock cycles required to detect one symbol corresponds to the number of candidate symbols. Since we chose an algorithm instance in which 16/64 candidate symbols are considered for 16/64-QAM modulation, 16/64 clock cycles are required to detect one symbol. This translates to a throughput of 200 Mbps in 16-QAM mode and 75 Mbps in 64-QAM mode when assuming a clock frequency of 400 MHz. Opportunities to increase the throughput of this architecture include:

• Computing the candidate symbols in parallel (i.e. instantiating more blocks)
• Inserting more pipeline registers and increasing the clock frequency.
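The ASIC throughput figures can be reproduced with a simple relation between bits per MIMO symbol, clock frequency and cycles per symbol. The following Python sketch assumes that each MIMO symbol carries Nt · log2(M) bits and that throughput is counted in these bits; the function name and its parameters are illustrative and do not appear in the original design. Applied to the cycle counts of Table 4, the same relation also reproduces the per-SIMD-slot throughput listed there.

```python
import math


def throughput_mbps(n_tx: int, qam_order: int, cycles_per_symbol: int,
                    f_clk_mhz: float, simd_slots: int = 1) -> float:
    """Throughput in Mbps, assuming n_tx * log2(qam_order) bits per MIMO symbol.

    cycles_per_symbol is the number of clock cycles needed to detect one
    MIMO symbol; simd_slots models symbols processed in parallel.
    """
    bits_per_mimo_symbol = n_tx * int(math.log2(qam_order))
    return bits_per_mimo_symbol * f_clk_mhz / cycles_per_symbol * simd_slots


if __name__ == "__main__":
    # ASIC: one candidate symbol per clock cycle, 400 MHz clock.
    print("ASIC 64-QAM:", throughput_mbps(2, 64, 64, 400.0), "Mbps")
    print("ASIC 16-QAM:", throughput_mbps(2, 16, 16, 400.0), "Mbps")
    # ADRES 2 x 4 array, total cycles taken from Table 4, one SIMD slot.
    print("ADRES 64-QAM per slot:", round(throughput_mbps(2, 64, 397, 400.0), 1))
    print("ADRES 16-QAM per slot:", round(throughput_mbps(2, 16, 139, 400.0), 1))
```

Both listed opportunities map onto this relation: computing candidates in parallel lowers the cycles per symbol, and deeper pipelining raises the clock frequency.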

In general, it can be pessimistically assumed that the throughput scales linearly with area and power. More information about the scalability of this architecture can be found in [9].

7.2 Area and Power Estimations

The ASIC was implemented in VHDL, synthesized for TSMC 90 nm GP technology with Synopsys Design Compiler, and placed and routed with Cadence SoC Encounter. The resulting layout confirms that a clock frequency of 400 MHz is feasible. Figure 10 shows the area breakdown of the design. As expected, the area is clearly dominated by the LLR computation blocks. This confirms once more that optimizing the LLR computation is vital. The total area is 0.3 mm². Based on statistical activity, the ASIC implementation consumes about 25 mW. Because the ASIC is based on the optimized SSFE and LLR algorithms, it is more efficient than other state-of-the-art ASICs. A comparison is provided in [9].

Figure 10 Area breakdown of the ASIC implementation (total 0.3 mm²): ASB0 (LLR G.) 76%, ASB1-3 (List G.) 11%, Top-Level Register 9%, CPB 4%, Ctrl. Unit 0%.

8 Comparison

Table 5 summarizes the maximal achievable throughput, the area and the power consumption of the MIMO detector implementations considered in this work. As shown in Section 5.2, the achievable throughput on the TI TMS320C6416, which is a state-of-the-art VLIW processor, is less than 2 Mbps for 64-QAM. A soft-output MIMO detector implementation on a conventional VLIW processor is therefore not feasible. However, by adding support for specialized instructions, the throughput of processor implementations can be increased significantly. The result of the ADRES implementation shows that soft-output MIMO detection can be deployed on programmable architectures and that a considerably high throughput can be achieved. Nevertheless, when compared to the ASIC implementation and normalized to throughput, the ADRES solution consumes 12× more area and 6× more power (for 64-QAM). Considering that both rely on the same datapath, this overhead arises mainly from data transport, data storage, control overhead and inefficiency. The considered ADRES instance supports ASIs as well as generic instructions. Therefore different algorithms can be mapped on the architecture. However, since the introduced ASIs are very specific, typically only the proposed MIMO detector can benefit from them.

Table 5 Comparison of soft-output MIMO detection on different architectures.

                            TI processor    ADRES proc.   ASIC   Difference
                            (without ASI)   (with ASI)           ADRES/ASIC
Throughput 64-QAM (Mbps)    <2              192           75     2.6× (in favor of ADRES)
Throughput 16-QAM (Mbps)    <4              368           200    1.8× (in favor of ADRES)
Total area (mm²)            NA              9             0.3    30×
Total power (64-QAM) (mW)   NA              376           25     15×
Mbps/mm² 64-QAM             NA              21            250    12×
Mbps/mm² 16-QAM             NA              41            667    16×
Mbps/mW 64-QAM              NA              0.5           3      6×

Because of many differences, such as the provided flexibility, BER performance, technology or accuracy of estimations, a fair quantitative comparison with work in the literature is difficult. Nevertheless, the following overview gives an idea of the efficiency of related implementations: The references [1, 4, 8, 15, 16] are based on simple linear hard-output detection. Although the complexity of linear hard-output detection is much lower than that of near-ML soft-output detection, the implementation of [16] consumes about 1 mW power while offering less than 50 Mbps throughput for 2 × 2 64-QAM. Reference [32] implements a near-ML hard-output detector on an Nvidia 9600GT floating-point Graphics Processing Unit (GPU), which includes 64 stream processors and 512 MB DDR3 memory. The stream processors are clocked at 1.9 GHz and the memory at 2 GHz. It is interesting to observe that only 15 Mbps throughput for near-ML hard-output 4 × 4 64-QAM detection is achievable on the mentioned GPU. Reference [31] implements a 4 × 4 linear soft-output detector which achieves a throughput of 600 Mbps. Nevertheless, the LLR computation complexity for linear detection is significantly lower than for near-ML detection [31]. Besides, the multicore floating-point architecture of [31] is basically a reconfigurable ASIC rather than a processor, and its floating-point nature indicates a low energy efficiency. A near-ML soft-output MIMO detector implementation is demonstrated in [33]. However, this implementation offers less than 10 Mbps throughput for 2 × 2 64-QAM. Besides, [33] is again based on the aforementioned floating-point GPU, which is generally not a feasible option for portable devices. In contrast to these implementations from the literature, the ADRES-based solution offers the combination of advanced MIMO detection, reasonable throughput and more tolerable power consumption.

The results of this paper clearly show that providing flexibility is costly in terms of energy and area. Hence, it should be carefully analyzed in which parts of the architecture flexibility is essential and in which parts it is not needed. In general, the ASIP architecture style, which implies scalability in terms of customization, seems to be a promising choice [1, 11, 14, 15]. As motivated in [3], the energy efficiency of flexible implementations can be improved by exploiting application dynamism at run-time.
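The ratios reported in Table 5, as well as the 376 mW average power of the ADRES instance (10% of the time at about 160 mW in VLIW mode, 90% at about 400 mW in CGA mode), can be re-derived from the raw throughput, area and power numbers. The Python sketch below is only an illustration under these stated numbers; the dictionary keys and function names are hypothetical, and small deviations from the table arise because the paper rounds intermediate values.

```python
# Raw figures as reported for the two implementations (64-QAM power).
# Key names are hypothetical and chosen for this sketch only.
IMPLEMENTATIONS = {
    "adres": {"tp64": 192.0, "tp16": 368.0, "area_mm2": 9.0, "power_mw": 376.0},
    "asic": {"tp64": 75.0, "tp16": 200.0, "area_mm2": 0.3, "power_mw": 25.0},
}


def average_power_mw(p_vliw_mw: float = 160.0, p_cga_mw: float = 400.0,
                     vliw_percent: int = 10) -> float:
    """Time-weighted average power over VLIW and CGA mode."""
    return (vliw_percent * p_vliw_mw + (100 - vliw_percent) * p_cga_mw) / 100


def efficiency(name: str) -> dict:
    """Area and power efficiency metrics of one implementation."""
    impl = IMPLEMENTATIONS[name]
    return {
        "mbps_per_mm2_64": impl["tp64"] / impl["area_mm2"],
        "mbps_per_mm2_16": impl["tp16"] / impl["area_mm2"],
        "mbps_per_mw_64": impl["tp64"] / impl["power_mw"],
    }


def asic_advantage() -> dict:
    """How many times more efficient the ASIC is than the ADRES instance."""
    adres, asic = efficiency("adres"), efficiency("asic")
    return {metric: asic[metric] / adres[metric] for metric in asic}


if __name__ == "__main__":
    print("average ADRES power (mW):", average_power_mw())
    for metric, factor in asic_advantage().items():
        print(f"{metric}: ASIC better by {factor:.1f}x")
```

The efficiency gaps of roughly 12× (area) and 6× (power) for 64-QAM are the figures the comparison and the conclusion refer to.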

9 Conclusion

In this paper, we presented the algorithm-architecture co-design of a 2 × 2 soft-output near-ML MIMO detector on massive parallel processors. We combined architecture-friendly algorithm design, design of application specific instructions, code transformations and ILP/DLP explorations to enable efficient implementations. We showed that an ADRES CGA processor with a 2 × 4 CGA, 16-way SIMD and support for specialized instructions achieves a considerably high throughput of 192/368 Mbps in 64/16-QAM mode. Such a processor solution, which enables SDR, is therefore generally suitable for replacing traditional ASIC implementations. However, in this particular case, the corresponding ASIC implementation still consumes 12× less area and 6× less power. Clearly, this overhead is hardly affordable for low-cost mobile SDR platforms. Future work should therefore investigate solutions that are based on ASIPs, which typically offer a better trade-off between flexibility and efficiency [11, 14]. To further enhance the efficiency of programmable implementations, the application dynamism should be exploited [3].

References

1. Antikainen, J., Salmela, P., Silven, O., Juntti, M., Takala, J., & Myllyla, M. (2008). Fine-grained application-specific instruction set processor design for the K-Best list sphere detector algorithm. In International conference on embedded computer systems (IC-SAMOS) (pp. 108–115).
2. Bolcskei, H., Gesbert, D., Papadias, C. B., & van der Veen, A. (2006). Space-time wireless systems: From array processing to MIMO communications. Cambridge: Cambridge University Press.
3. Bougard, B., Li, M., Novo, D., Van Der Perre, L., & Catthoor, F. (2008). Bridging the energy gap in size, weight and power constrained software defined radio: Agile baseband processing as a key enabler. In IEEE international conference on acoustics, speech and signal processing (ICASSP).
4. Bougard, B., De Stutter, B., Rabou, S., Novo, D., Allam, O., Dupont, S., et al. (2008). A coarse-grained array based baseband processor for 100 Mbps+ software defined radio. In Design, automation and test in Europe (DATE) (pp. 716–721).
5. Burg, A., Borgmann, M., Wenk, M., Zellweger, M., Fichtner, W., & Bolcskei, H. (2005). VLSI implementation of MIMO detection using the sphere decoding algorithm. IEEE Journal of Solid-State Circuits, 40(7), 1566–1577.
6. Chen, S., & Zhang, T. (2007). Low power soft-output signal detector design for wireless MIMO communications systems. In Proceedings of the intern. symposium on low power electronics and design (pp. 232–237).
7. Chen, S., Zhang, T., & Xin, Y. (2007). Relaxed K-Best MIMO signal detector design and VLSI implementation. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(3), 328–337.
8. Eberli, S., Burg, A., & Fichtner, W. (2009). Implementation of a 2 × 2 MIMO-OFDM receiver on an application specific processor. Microelectronics Journal, 40(11), 1642–1649.
9. Fasthuber, R., Li, M., Novo, D., Raghavan, P., Van Der Perre, L., & Catthoor, F. (2009). Novel energy-efficient scalable soft-output SSFE MIMO detector architectures. In International conference on embedded computer systems (IC-SAMOS).
10. Garrett, D., Woodward, G. K., Davis, L., & Nicol, C. (2005). A 28.8 Mb/s 4 × 4 MIMO 3G CDMA receiver for frequency selective channels. IEEE International Solid-State Circuits Conference (ISSCC), 40(1), 320–330.
11. Gries, M., Keutzer, K., Meyr, H., & Martin, G. (2005). Building ASIPS: The mescal methodology. Berlin: Springer.
12. Guo, Z., & Nilsson, P. (2006). Algorithm and implementation of the K-Best sphere decoding for MIMO detection. IEEE Journal on Selected Areas in Communications, 24(3), 491–503.
13. Hochwald, B. M., & ten Brink, S. (2003). Achieving near-capacity on a multiple-antenna channel. IEEE Transactions on Communications, 51(3), 389–399.
14. Ienne, P., & Leupers, R. (2006). Customizable embedded processors: Design technologies and applications. San Francisco: Morgan Kaufmann.
15. Jafri, A. R., Karakolah, D., Baghdadi, A., & Jezequel, M. (2009). ASIP-based flexible MMSE-IC linear equalizer for MIMO turbo-equalization applications. In Design, automation and test in Europe (DATE).
16. Janhunen, J., Silven, O., Juntti, M., & Myllyla, M. (2008). Software defined radio implementation of K-Best list sphere detector algorithm. In International conference on embedded computer systems (IC-SAMOS) (pp. 100–107).
17. Koike, T., Seki, Y., Murata, H., Yoshida, S., & Araki, K. (2005). FPGA implementation of 1 Gbps real-time 4 × 4 MIMO-MLD. Vehicular Technology Conference, 2, 1110–1114.
18. Li, M., Bougard, B., Lopez, E., Bourdoux, A., Novo, D., Van Der Perre, L., et al. (2008). Selective spanning with fast enumeration: A near maximum-likelihood MIMO detector designed for parallel programmable baseband architectures. In IEEE intern. conference on communications (ICC) 2008 (pp. 737–741).
19. Li, M., Bougard, B., Naessens, F., Van Der Perre, L., & Catthoor, F. (2008). An implementation friendly low complexity multiplierless LLR generator for soft MIMO sphere decoders. In IEEE workshop on signal processing systems (SIPS).
20. Li, M., Bougard, B., Xu, W., Novo, D., Van Der Perre, L., & Catthoor, F. (2008). Optimizing near-ML MIMO detector for SDR baseband on parallel programmable architectures. In Design, automation and test in Europe (DATE) (pp. 444–449).
21. Lin, Y., Lee, H., Woh, M., Harel, Y., Mahlke, S., Mudge, T., et al. (2007). SODA: A high-performance DSP architecture for software-defined radio. IEEE Micro, 27(1), 114–123.
22. Mei, B., Lambrechts, A., Mignolet, J. Y., Verkest, D., & Lauwereins, R. (2005). Architecture exploration for a reconfigurable architecture template. IEEE Design and Test of Computers, 22(2), 90–101.
23. Nilsson, A., Tell, E., & Liu, D. (2008). An 11 mm² 70 mW fully-programmable baseband processor for mobile WiMAX and DVB-T/H in 0.12 µm CMOS. In Intern. solid-state circuits conference (ISSCC) (pp. 266–612).
24. Noll, T. G., Weiss, O., & Gansen, M. (2001). A flexible datapath generator for physical oriented design. In European solid-state circuits conf. (ESSCIRC) (pp. 393–396).
25. Paulraj, A. J., Gore, D. A., Nabar, R. U., & Bolcskei, H. (2004). An overview of MIMO communications - a key to gigabit wireless. Proceedings of the IEEE, 92(2), 198–218.
26. Ramacher, U. (2007). Software-defined radio prospects for multistandard mobile phones. Computer, 40(10), 62–69.
27. Shariat-Yazdi, R., & Kwasniewski, T. (2007). Reconfigurable K-Best MIMO detector architecture and FPGA implementation. In International symposium on intelligent signal processing and communication systems (ISPACS) (pp. 349–352).
28. Texas Instruments (2005). Datasheet of the TMS320C6416 fixed-point digital signal processor.
29. van Berkel, K., Heinle, F., Meuwissen, P., Moerman, K., & Weiss, M. (2005). Vector processing as an enabler for software-defined radio in handheld devices. Journal on Applied Signal Proc. (EURASIP), 2005, 2613–2625.
30. Wang, R., & Giannakis, G. B. (2004). Approaching MIMO channel capacity with reduced-complexity soft sphere decoding. IEEE Wireless Communications and Networking Conference (WCNC), 3, 1620–1625.
31. Wu, D., Eilert, J., & Liu, D. (2009). Implementation of a high-speed MIMO soft-output symbol detector for software defined radio. Journal of Signal Processing Systems, 1–11. ISSN 1939-8018. doi:10.1007/s11265-009-0369-9.
32. Wu, M., Gupta, S., Sun, Y., & Cavallaro, J. R. (2009). A GPU implementation of a real-time MIMO detector. In IEEE workshop on signal processing systems (SiPS09).
33. Wu, M., Sun, Y., & Cavallaro, J. R. (2009). Reconfigurable real-time MIMO detector on GPU. In Asilomar conf. on signals, systems and computers (ASILOMAR09).

Robert Fasthuber received the MSc degree in Hardware/Software Systems Engineering from the University of Applied Science (FH) Hagenberg, Austria, in 2007. In September 2007 he became a researcher at the Interuniversity MicroElectronics Center (IMEC) Belgium and at the Katholieke Universiteit (K.U.) Leuven. He is a PhD student since July 2008. His research focuses on technology-aware low-power architectures for Software Defined Radio (SDR) and Cognitive Radio (CR) implementations.

Min Li received the BE degree (with the highest honor) in July 2001 from Zhejiang University, Hangzhou, China. From September 2001 to September 2004 he was a postgraduate student at Zhejiang University. From January 2003 to September 2003 he was an employee at Lucent Bell Labs Research China, working on network processors. From September 2003 to September 2004 he was employed at Microsoft Research Asia, working on low-power mobile computing. From September 2004 to September 2009 he was a PhD researcher at IMEC Belgium and a PhD student at K.U. Leuven. Since October 2009 he has been a researcher at IMEC. His main technical interests are low-power signal processing and low-power implementations.

David Novo is a member of the wireless group at IMEC and a PhD candidate at the K.U. Leuven. He received the MSc degree in Electronic Engineering from the University Autonoma of Barcelona, Spain, in 2005. His research interests include energy-efficient circuits, architectures and systems for wireless communication, with a special focus on Software Defined Radio implementations.

Liesbet Van der Perre received the MSc degree in Electrical Engineering from the K.U. Leuven, Belgium, in 1992. The research for her thesis was completed at the Ecole Nationale Superieure de Telecommunications in Paris. She graduated with a PhD in Electrical Engineering from the K.U. Leuven in 1997, on the topic of radio propagation modelling. At IMEC, she was a system architect for OFDM ASICs and the project leader for the Turbo codec. She is the scientific director of the wireless research group for digital baseband SDR/CR. She is an author or co-author of over 150 scientific publications.

Praveen Raghavan received the bachelor's degree from the National Institute of Technology, Trichy, India, and the master's degree in Electrical Engineering from Arizona State University. He received his PhD in Electrical Engineering from the K.U. Leuven in 2009. As a wireless systems researcher, he is in charge of the next-generation platform architecture in the framework of the IMEC SDR/CR program. In addition, he coordinates PhD students in the field of low-power design at IMEC. His research interests include low-power design, low-power architectures, system design, and SDR/CR.

Francky Catthoor is a fellow at IMEC Belgium and an IEEE Fellow. He received the Engineering degree and a PhD in Electrical Engineering from the K.U. Leuven, Belgium, in 1982 and 1987 respectively. Between 1987 and 1999, he headed research domains in the area of architectural and system-level synthesis methodologies, within the DESICS (formerly VSDM) division at IMEC. His main current research activities belong to the field of architecture design methods and system-level explorations for power and memory footprints within real-time constraints, oriented towards data storage management, global data transfer optimization and concurrency exploitation. Platforms that contain both customizable/configurable architectures and (parallel) programmable instruction-set processors are targeted. Deep-submicron technology issues are also included.