FPGA Implementation of QRD-RLS Algorithm with Embedded Nios Soft Processor

Deepak Boppana, Altera Corporation, 101 Innovation Dr, San Jose, USA, 408-544-7887, dboppana@altera.com
Kully Dhanoa, Altera Corporation, Holmers Farm Way, High Wycombe, UK, 1494-602-000, kdhanoa@altera.com
Jesse Kempa, Altera Corporation, 101 Innovation Dr, San Jose, USA, 408-544-8520, jkempa@altera.com

CF-QRD031505-1.0

I. INTRODUCTION

Adaptive signal processing algorithms such as least mean squares (LMS), normalized LMS (NLMS), and recursive least squares (RLS) have historically been used in numerous wireless applications such as equalization, beamforming, and adaptive filtering [1], [2]. With the advent of wideband 3G wireless systems, adaptive weight calculation algorithms are also being considered for new applications, such as polynomial-based digital predistortion and multi-in/multi-out (MIMO) antenna solutions [3]. These applications generally involve solving an over-specified set of equations, as shown below, where m > N.

\[
\begin{aligned}
x_1(1)\,c_1 + x_2(1)\,c_2 + \cdots + x_N(1)\,c_N &= y(1) + e(1) \\
x_1(2)\,c_1 + x_2(2)\,c_2 + \cdots + x_N(2)\,c_N &= y(2) + e(2) \\
&\;\;\vdots \\
x_1(m)\,c_1 + x_2(m)\,c_2 + \cdots + x_N(m)\,c_N &= y(m) + e(m)
\end{aligned}
\]

Among the different algorithms, the recursive least squares algorithm is generally preferred for its fast convergence. The least squares approach attempts to find the set of coefficients \(c_n\) that minimizes the sum of squares of the errors, i.e., \(\min \sum_{m} e(m)^2\). Representing the above set of equations in matrix form, we have:

\[
Xc = y + e \tag{1}
\]

where X is a matrix (m x N, with m > N) of noisy observations, y is a known training sequence, and c is the coefficient vector to be computed such that the error vector e is minimized.

Direct computation of the coefficient vector c involves matrix inversion, which is generally undesirable for hardware implementation. Least squares schemes based on matrix decomposition, such as Cholesky, LU, SVD and QR decompositions, avoid explicit matrix inversion and are more robust and well suited to hardware implementation [1]. Such schemes are increasingly being considered for high sample rate applications, such as digital predistortion, beamforming and MIMO signal processing. FPGAs are the preferred hardware platform for such applications because of their capability to deliver enormous signal processing bandwidth. In recent years, FPGAs have become available with increasingly powerful embedded soft processor cores that give designers the flexibility and portability of high-level software design, while maintaining the performance benefits of parallel hardware operations in FPGAs [4].

The rest of this paper describes the proposed implementation of the QR-decomposition-based RLS algorithm (QRD-RLS) on Altera’s Stratix® FPGA with embedded Nios® soft processor technology and the Avalon™ switch fabric. Figure 1 gives an outline of the proposed implementation of the QRD-RLS algorithm using the concept of custom peripherals and custom instructions.

Figure 1. Nios processor-based QRD-RLS implementation
(Blocks shown: CORDIC Accelerator, Avalon, Nios, Multiply CI, Divide CI, Program & Data Memory)

II. OVERVIEW OF QRD-RLS ALGORITHM

As described earlier, the least squares algorithm attempts to solve for the coefficient vector c from X and y. To realize this, the QR-decomposition algorithm is first used to transform the matrix X into an upper triangular matrix R (N x N matrix) and the vector y into another vector u such that Rc = u. The coefficient vector c is then computed using a procedure called back substitution, which involves solving the equations shown below.
\[
c_N = \frac{u_N}{R_{NN}} \tag{2}
\]

\[
c_i = \frac{1}{R_{ii}} \left( u_i - \sum_{j=i+1}^{N} R_{ij}\, c_j \right), \quad i = N-1, \ldots, 1 \tag{3}
\]

The QRD-RLS algorithm flow is depicted below in Figure 2.

Figure 2. QR decomposition based least squares
(Flow shown: X, y -> QR Decomposition -> R, u -> Back Substitution -> c)

III. CORDIC-BASED QR-DECOMPOSITION

The QR-decomposition of the input matrix X can be performed, as illustrated in Figure 3, using the well-known systolic array architecture [5]. The rows of matrix X are fed as inputs to the array from the top along with the corresponding element of the vector y. The R and u values held in each of the cells once all the inputs have been passed through the array are the outputs of the QR-decomposition. These values are subsequently used to derive the coefficients using the back-substitution technique.

Figure 3. Systolic array architecture for QR Decomposition
(Inputs x1(k) ... xN(k), y(k) enter from the top; the triangular array holds R11 ... RNN with the column u1 ... uN alongside; the legend distinguishes cells in vectoring mode from cells in rotating mode)

Each of the cells in the array can be implemented as a coordinate-rotation digital computer (CORDIC) block. CORDIC describes a method to perform a number of functions, including trigonometric, hyperbolic and logarithmic functions [6]. The algorithm is iterative, and uses only additions, subtractions and shift operations. This makes it very attractive for hardware implementations. The number of iterations depends on the precision required, which correlates directly with the number of bits needed.

Altera’s CORDIC blocks have a deeply pipelined parallel architecture enabling speeds over 250 MHz on Stratix FPGAs in both vectoring and rotating modes. For real inputs, only one CORDIC block is required per cell. Many applications involve complex inputs and outputs to the algorithm, for which three CORDIC blocks are required per cell. In such cases, a single CORDIC block can be efficiently timeshared to perform the complex operations.

Table 1 illustrates the resource consumption for the CORDIC algorithm in terms of logic elements (LEs) for different input bit widths. The results were obtained using Altera’s Quartus® II push-button flow on its Stratix devices. As the input bit width increases, the resource consumption also increases, whereas the fMAX decreases.

Table 1. Altera CORDIC Resource Consumption

Input Width   Number of    Logic Elements   fMAX
(bits)        Iterations   (LEs)            (MHz)
8             8            380              264
16            16           1,300            219
24            24           2,670            198
32            32           4,600            189
40            40           7,010            163

Direct mapping of the CORDIC blocks onto the systolic array shown in Figure 3 consumes a significant number of LEs and yields enormous throughput that is not required for many applications. The resources required to implement the array can be reduced by trading throughput for resource consumption via mixed and discrete mapping schemes.

Mixed mapping: In mixed mapping schemes, the bottom rows in the systolic array are moved to the end of the top rows, to ensure the same number of cells in each row. A single CORDIC block can then be used to perform the operations of all the cells in a row, with the total number of CORDIC blocks required being equal to the total number of rows. Because each CORDIC block has to operate in both vectoring and rotating modes, the scheme is called mixed mapping [7].

Discrete mapping: In this scheme, at least two CORDIC blocks are required. One block is used
purely for vectoring operations, while the other is used for rotating operations [8]. Because each block performs a single function, any gains from hardware optimization can be fully realized. Further information on the different mapping schemes can be found in references [7], [8], and [9]. An example resource estimation of direct, mixed, and discrete mapping schemes is presented in Section V.

IV. BACK SUBSTITUTION ON NIOS SOFT PROCESSOR

As outlined in Section II, the final coefficient weight vector is derived from the outputs of the QR-decomposition algorithm using a procedure called back substitution. The back substitution procedure, as seen in equations (2) and (3), primarily involves multiplication and division operations. The Nios embedded soft processor, with its custom multiply and divide instructions, is an ideal fit to implement this function.

The Nios processor is based on the concept of embedding soft RISC processors within FPGAs and can operate at over 100 MHz on Stratix FPGAs [10]. Embedded designers can create custom processor-based systems using Altera’s SOPC Builder system development tool. SOPC Builder can be used to integrate one or more configurable Nios processors with any number of standard peripherals, gluing the system together with the automatically generated Avalon switch fabric. Figure 4 outlines the procedure involved in computing the coefficients via back substitution.

Figure 4. Nios processor for back substitution
(Blocks shown: Nios CPU with I and D ports, Avalon Bus, Program & Data Memory, I/O, and memory holding Rr, ur, cr and Ri, ui, ci)

The CORDIC block performs the QR-decomposition and stores the R and u values (real and imaginary) in memory accessible to the Nios processor, which then calculates the real and imaginary coefficient values and stores the results back into memory.

A 32-bit Nios CPU can optionally be configured to include a hardware 16x16 -> 32 integer multiplier implemented using the digital signal processing (DSP) blocks on Stratix FPGAs. The MUL instruction can then be used to complete the multiply operation in a single clock cycle. The divide operation can be implemented as a custom logic block that becomes a part of the Nios processor’s arithmetic logic unit (ALU), as shown below in Figure 5. The 32-bit division can then be completed in approximately 35 clock cycles by the Nios processor operating at 100 MHz. The Nios configuration wizard creates software macros in C/C++ and assembly, providing software access to the custom logic block. The back substitution operation can thus be efficiently implemented using hardware-accelerated code on the Nios processor, as observed in the simulation results in the next section.

Figure 5. Adding custom logic to the Nios ALU

V. RESOURCE ESTIMATES & SIMULATION RESULTS

Table 2 gives an overview of the resource estimates for performing the CORDIC-based QR-decomposition using the three different CORDIC mapping schemes described in Section III. An example scenario of m=64 and N=9 with 16-bit complex inputs to the CORDIC is considered. The system clock is assumed to be at 150 MHz with a single CORDIC block timesharing the three operations required for complex inputs. The estimates do not include the resources required for scheduling operations between the different CORDIC blocks.
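
The back substitution of equations (2) and (3), whose cycle counts on the Nios processor are reported later in this section, can be sketched as follows. This is a minimal floating-point illustration in Python, assuming R is stored as a list of rows; the function name back_substitute and the sample values are hypothetical and are not taken from the Nios C subroutine, which stores its arrays as signed 16-bit values.

```python
def back_substitute(R, u):
    """Solve R c = u for c, with R upper triangular (list of rows).

    Works for real or complex entries. Illustrative sketch only.
    """
    n = len(u)
    c = [0.0] * n
    # Equation (2): the last coefficient needs a single division.
    c[n - 1] = u[n - 1] / R[n - 1][n - 1]
    # Equation (3): i = N-1, ..., 1 in the paper's 1-based indexing
    # corresponds to i = n-2, ..., 0 with 0-based Python lists.
    for i in range(n - 2, -1, -1):
        acc = sum(R[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (u[i] - acc) / R[i][i]
    return c


# Hypothetical 3x3 example (values chosen for easy checking by hand).
R = [[2, 1, -1],
     [0, 3, 2],
     [0, 0, 4]]
u = [3, 13, 8]
c = back_substitute(R, u)
print(c)
```

For N coefficients the loop performs N divisions and N(N-1)/2 multiply-accumulate steps (9 and 36 for N = 9), each of them complex when the inputs are complex. This is the work that the multiply and divide instructions described in Section IV accelerate.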

Table 2. Resource Estimates Example for Different CORDIC Blocks

                   CORDIC Usage         Throughput                    Cost
Implementation     No. of    No. of     Update Delay   Updates per    LE/Update
Technique          Blocks    LEs        (µs)           Second
Direct Mapping     54        70200      5.1            196078         1.497
Mixed Mapping      4         5200       250.85         3986           1.305
Discrete Mapping   2         2600       198.11         5047           0.515

Some new terms are defined as follows.

Update Delay: Time required before all cells in the systolic array are updated with their R and u values.
Throughput: Number of input matrices (each m x N) that are processed per second = 1/update delay.
Cost: Number of LEs consumed per update.
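
The entries of Table 2 can be cross-checked with a little arithmetic. The sketch below assumes that each block costs the 1,300 LEs of the 16-bit row in Table 1, that updates per second is the truncated reciprocal of the update delay, that cost is the LE count divided by updates per second, and that the direct-mapping array has N(N+1)/2 cells for R plus N cells for u. These readings and all Python names are ours rather than statements from the text.

```python
# Illustrative cross-check of Table 2; every name here is hypothetical.
LES_PER_CORDIC_16BIT = 1300  # Table 1, 16-bit row

# scheme: (number of CORDIC blocks, update delay in microseconds), from Table 2
schemes = {
    "direct": (54, 5.1),
    "mixed": (4, 250.85),
    "discrete": (2, 198.11),
}


def estimate(blocks, delay_us):
    """Return (LEs, updates per second, LEs per update/s) for one scheme."""
    les = blocks * LES_PER_CORDIC_16BIT
    updates_per_s = int(1e6 / delay_us)  # truncated, as assumed above
    cost = les / updates_per_s
    return les, updates_per_s, cost


# Assumed cell count of the triangular array for N = 9:
# N(N+1)/2 cells holding R plus N cells holding u.
n = 9
cells = n * (n + 1) // 2 + n

results = {name: estimate(*row) for name, row in schemes.items()}
for name, (les, ups, cost) in results.items():
    print(f"{name:9s} LEs={les:6d} updates/s={ups:7d} cost={cost:.3f}")
print("cells for N=9:", cells)
```

Under these assumptions the LE totals, the updates per second, and the mixed and discrete cost figures (1.305 and 0.515) agree with the table, and the cell count of 54 matches the number of direct-mapping blocks. The direct-mapping ratio comes out near 0.358 rather than the listed 1.497, so that entry may rest on a different basis.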
As seen in Table 2, direct mapping offers the highest throughput (updates/s) but at the expense of fifty-four CORDIC blocks, which require a huge number of LEs. In almost all applications, the cost of these logic resources would make this implementation unviable. Mixed mapping reduces the resource consumption by employing only four CORDIC blocks, with a corresponding drop in throughput. In comparison, although the discrete mapping scheme needs five CORDIC blocks, it can be implemented using only two blocks. Because Altera’s CORDIC block is iterative and pipelined, the number of blocks can be reduced from five to two with only a three-clock-cycle delay penalty (i.e., one extra clock cycle for each timeshared CORDIC operation). This not only further reduces the resource consumption, but also provides higher throughput than the mixed mapping scheme.

The discrete mapping scheme, therefore, best exploits the pipelining ability of Altera’s CORDIC block and offers the optimum tradeoff between resource consumption and throughput.

Figure 6 illustrates the number of clock cycles required for the Nios processor, operating at 100 MHz, to calculate different numbers of complex coefficients. The back-substitution subroutine written in C code calculates the coefficient array and stores it into memory. The subroutine assumes that all complex arrays are stored as short (signed 16-bit precision) and all complex numbers are stored in complex conjugate form. The plot also shows the corresponding time taken when the multiply-accumulate loop involved in the back substitution is accelerated in hardware and accessed as a custom peripheral by the processor.

Figure 6. Back substitution results: Nios processor vs. hardware approach

For an example of N=9, the Nios processor takes approximately 12,000 clock cycles or 120 µs (software-only approach), which is acceptable for many applications. For faster updates, the Nios processor with the hardware-acceleration approach can be used; it requires only 300 clock cycles or 3 µs, but at the expense of additional hardware resources. Moreover, the Nios processor can be used to implement other data and control functions on the FPGA while it is waiting for the outputs from the CORDIC block. This facilitates a complete system-on-a-programmable-chip (SOPC) solution without the need for an external processor.

VI. SUMMARY

A novel implementation of the QRD-RLS algorithm using Altera’s Stratix FPGAs with embedded Nios soft processor technology was proposed. The ability to exploit the deeply pipelined parallel architecture of Altera’s CORDIC block to implement the QR-decomposition using the discrete mapping scheme was described, and resource estimates were presented. The back-substitution algorithm was implemented on the configurable Nios soft processor with custom instructions for hardware acceleration. Simulation results were also presented, illustrating the efficiency of the Nios processor in facilitating an integrated SOPC solution.

REFERENCES
[1] Simon Haykin, Adaptive Filter Theory, Fourth Edition, Prentice Hall.
[2] Tim Zhong Mingqian, A.S. Madhukumar, and Francois Chin, “QRD-RLS adaptive equalizer and its CORDIC-based implementation for CDMA systems”, International Journal on Wireless & Optical Communications, Vol. 1, No. 1, pp. 25-39, 2003.
[3] Babak Hassibi, “An efficient square-root algorithm for BLAST”, Proceedings of the 2000 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 737-40.
[4] Stratix FPGAs, http://www.altera.com
[5] W.M. Gentleman and H.T. Kung, “Matrix triangularization by systolic arrays”, Real-Time Signal Processing IV, Proc. SPIE 298, pp. 19-26.
[6] J. Volder, “The CORDIC trigonometric computing technique”, IRE Trans. Electron. Comput., Vol. EC-8, pp. 330-334, 1959.
[7] C.M. Rader, “VLSI systolic arrays for adaptive nulling”, IEEE Sig. Proc. Mag., Vol. 13, No. 4, pp. 29-49, 1996.
[8] G. Lightbody, R.L. Walke, R. Woods, and J. McCanny, “Novel mapping of a linear QR architecture”, Proc. ICASSP, Vol. IV, pp. 1933-6, 1999.
[9] R.L. Walke and R.W.M. Smith, “Architectures for adaptive weight calculation on ASIC and FPGA”, 33rd Asilomar Conference on Signals, Systems and Computers, 1999.
[10] Nios processor, http://www.altera.com/literature/lit-nio.jsp
