IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 1, JANUARY 2007
Abstract—This paper studies low-complexity high-speed decoder architectures for quasi-cyclic low density parity check (QC-LDPC) codes. Algorithmic transformation and architectural-level optimization are incorporated to reduce the critical path. Enhanced partially parallel decoding architectures are proposed to linearly increase the throughput of conventional partially parallel decoders by introducing a small percentage of extra hardware. Based on the proposed architectures, an (8176, 7154) Euclidean geometry-based QC-LDPC code decoder is implemented on a Xilinx Virtex-II 6000 field programmable gate array (FPGA), where an efficient nonuniform quantization scheme is employed to reduce the size of the memories storing soft messages. FPGA implementation results show that the proposed decoder can achieve a maximum (source data) decoding throughput of 172 Mb/s at 15 iterations.
Index Terms—Error correction codes, field programmable gate array (FPGA), low density parity check (LDPC), parallel processing, quasi-cyclic (QC) codes.
I. INTRODUCTION
Recently, a class of structured low density parity check (LDPC) codes, namely quasi-cyclic (QC) LDPC codes [1]–[6], that can achieve performance comparable to computer-generated random codes has been proposed. Further work includes Euclidean geometry-based QC-LDPC (EG-LDPC) codes [7]. EG-LDPC codes have shown an even lower error floor than equivalent computer-generated random LDPC codes in some cases. QC-LDPC codes are well suited for hardware implementation. The encoder of a QC-LDPC code can be easily built with shift registers [8], while random codes usually entail complex encoding circuitry to perform complex matrix and vector multiplications [9], [10]. In addition, QC-LDPC codes also facilitate efficient high-speed decoding due to the regularity of their parity check matrices. Randomly constructed LDPC codes, on the other hand, require complex routing in the (hardware) decoders, which not only consumes a large amount of chip area but also significantly increases the computation delay. For instance, a fully parallel decoder based on direct mapping presented in [21] for a rate-1/2 1024-bit LDPC code consumed 1.7-M gates with a maximum (source data) decoding throughput of 500 Mb/s.
A memory-based partially parallel decoding architecture was
presented in [11] to obtain a good tradeoff between hardware
Manuscript received August 26, 2005; revised January 19, 2006 and June 6,
2006.
The authors are with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97330 USA (e-mail: zwang@eecs.
oregonstate.edu; cuizh@eecs.oregonstate.edu).
Digital Object Identifier 10.1109/TVLSI.2007.891098
complexity and decoding speed. The proposed field programmable gate array (FPGA) implementation for a 9216-bit
rate-1/2 (3, 6)-regular code achieved a maximum (coded data)
decoding throughput of 54 Mb/s (equivalent to 27-Mb/s source
data rate) at 18 iterations, where (3, 6)-regular codes refer to
the LDPC codes whose parity check matrices have a constant
column weight of 3 and row weight of 6. The design presented in [23] achieved 30% higher decoding throughput than the previous one for a much smaller size (1038 b) of QC-LDPC code. In [12], Chen et al. presented a highly parallel FPGA implementation for a rate-1/2, 8088-bit irregular QC-LDPC code. The maximum (source data) decoding
throughput reached 40 Mb/s at 25 iterations. In [13], Karkooti et al. presented an FPGA implementation for a (3, 6)-regular 1536-bit QC-LDPC code decoder. To enable a high level (e.g., p-level) of parallelism, one additional constraint, in which the cyclic shift value of each circulant matrix needs to be a multiple of p, was introduced. This proposed decoder can
run at 121 MHz to obtain 63.5-Mb/s (source data) decoding
throughput at 20 iterations. Such a high clock speed largely
benefited from significantly less routing complexity because of
the smaller codeword size compared to two other designs [11],
[12]. On the other hand, the added constraint to the parity check
matrix will inevitably circumvent the performance of LDPC
codes. In this paper, we will present a low complexity FPGA
implementation for a 8176-bit (4, 32)-regular EG-LDPC code
that can achieve a maximum (source data) decoding throughput
of 172 Mb/s at 15 iterations.
Contributions of this paper are as follows. First, we propose
a modified sum-product algorithm (SPA), which balances the
computation load between two decoding phases in order to
reduce the critical path. The new algorithm also facilitates the
merger of two types of node process units of a decoder, which
will lead to 100% hardware utilization efficiency [14]. Second,
we present architectural optimization techniques for general
node processing units to further reduce the computation delay in
the critical path. Third, we propose enhanced partially parallel
decoding architectures that can linearly increase the decoding
throughput of the conventional partially parallel decoder with
small extra hardware. We also discuss an efficient nonuniform
quantization scheme that can reduce the size of memories for
storage of soft messages. Finally, we present a detailed FPGA
implementation of the 8176-bit EG-LDPC code decoder, where
novel circuit design techniques are introduced to raise the clock
speed.
This paper is organized as follows. Section II gives a brief
review of the conventional SPA and discusses the modified version based on algorithmic transformation. Section III presents
architectural optimizations for node processing units for general QC-LDPC codes.
WANG AND CUI: LOW-COMPLEXITY HIGH-SPEED DECODER DESIGN FOR QUASI-CYCLIC LDPC CODES
105
In the conventional SPA, the check-node (row) and variable-node (column) message updates can be written as

\varepsilon_{m,n} = 2 \tanh^{-1} \Big( \prod_{n' \in N(m) \setminus n} \tanh( z_{m,n'} / 2 ) \Big)    (1)

z_{m,n} = \lambda_n + \sum_{m' \in M(n) \setminus m} \varepsilon_{m',n}    (2)

where \varepsilon_{m,n} is the check-to-variable message from check node m to variable node n, z_{m,n} is the variable-to-check message, \lambda_n is the intrinsic (channel) message of variable node n, N(m) is the set of variable nodes connected to check node m, and M(n) is the set of check nodes connected to variable node n. We proposed a modified version based on algorithmic transformation in order to balance the computation load between the two decoding phases [14]. The new algorithm is expressed in (5)-(7).
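As a concrete illustration, the conventional SPA check-node and variable-node updates can be sketched in floating point. The node degrees and message values below are arbitrary examples, not the parameters of the code considered in this paper.

```python
import math

def check_to_variable(z_msgs):
    """Check-node (row) update: for each edge n, combine the tanh of
    all OTHER incoming variable-to-check messages z."""
    out = []
    for n in range(len(z_msgs)):
        prod = 1.0
        for n2, z in enumerate(z_msgs):
            if n2 != n:
                prod *= math.tanh(z / 2.0)
        # clamp to avoid atanh(+/-1) for extreme message values
        prod = max(min(prod, 1 - 1e-12), -1 + 1e-12)
        out.append(2.0 * math.atanh(prod))
    return out

def variable_to_check(intrinsic, eps_msgs):
    """Variable-node (column) update: intrinsic channel LLR plus all
    OTHER incoming check-to-variable messages eps."""
    total = intrinsic + sum(eps_msgs)
    return [total - e for e in eps_msgs]
```

Note that the variable-node update already uses the total-minus-own form, which is the kind of load-balancing transformation the modified SPA exploits.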
different LUTs, i.e., LUT-A and LUT-B. The sign bit will be included in the input of either LUT. The outputs of LUT-A and LUT-B will be in unsigned format (5 bits in this example) and 2's complement format (6 bits in the example), respectively. In
addition, we proposed to utilize architectural level optimization
techniques to further reduce the total computation delay for the
summation part [15]. The new architectures for VPU and CPU
are shown in Figs. 3 and 4, respectively. The pipeline latches
can be properly inserted into the data paths as explained in [16].
Therefore, the critical path of either type of node processing unit can be reduced to approximately one multi-bit (5- or 6-bit in this example) addition operation with 3-stage pipelining, which leads to a multifold speed-up of the node processing units compared with the traditional designs [25], [26].
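The lookup-table contents can be generated offline. The sketch below assumes the standard log-SPA kernel Phi(x) = -log(tanh(x/2)); the input/output bit widths and the fractional precision are illustrative choices, not necessarily the exact ones used in this design.

```python
import math

def phi(x):
    """Standard log-domain SPA kernel; phi is its own inverse."""
    return -math.log(math.tanh(x / 2.0))

def build_lut(in_bits=5, frac_bits=2, out_bits=5):
    """Tabulate phi() over all unsigned fixed-point input codes.
    Widths here are illustrative, not the paper's exact choices."""
    step = 1.0 / (1 << frac_bits)
    max_code = (1 << out_bits) - 1
    lut = []
    for code in range(1 << in_bits):
        x = max(code * step, step / 2)       # avoid phi(0) = inf
        q = min(int(round(phi(x) / step)), max_code)
        lut.append(q)
    return lut
```

Because phi is an involution, the same table shape serves both decoding phases, which is what makes sharing the LUT structure between CPU and VPU attractive.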
IV. PARTIALLY PARALLEL DECODING ARCHITECTURE
AND ENHANCEMENT
Several papers have addressed partially parallel decoding
architectures for regular LDPC codes such as [11] and [12].
These architectures generally achieve a good tradeoff between
hardware complexity and decoding throughput. A partially parallel decoder architecture for generic (3, 5) QC-LDPC codes is shown in Fig. 5, where a set of memory banks is used to store the soft message symbols conveyed at both decoding phases.
Fig. 6. Memory module corresponding to a submatrix: (a) one soft message symbol per entry and (b) two soft message symbols per entry.
For EG-LDPC codes, each submatrix of the parity check matrix consists of two superimposed cyclic-shifted identity matrices (see Fig. 11). A more general case is to have multiple independently cyclic-shifted identity matrices superimposed in one submatrix. To efficiently decode this class of codes, we proposed to have one separate memory bank for each independently cyclic-shifted identity matrix [28]. Hence, we need two memory banks for each submatrix in decoding the previously mentioned EG-LDPC codes.
To increase the parallelism of a partially parallel decoder, we propose the following three methods:
1) partition each memory bank into sub-banks (also called memory segments), where all the soft symbols corresponding to 1-components in adjacent rows of a submatrix are stored in different segments;
2) store soft messages corresponding to adjacent rows of a submatrix in one memory entry while utilizing extra buffers to resolve the memory access conflict;
3) combine methods 1) and 2).
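The address mapping behind method 1) can be sketched in a few lines; the sub-bank count of four below is a hypothetical example.

```python
def segment_address(row, n_segments):
    """Method 1) sketch: soft messages for n_segments adjacent rows of
    a submatrix land in different sub-banks, so they can be read in
    the same clock cycle. Returns (sub-bank index, local address)."""
    return row % n_segments, row // n_segments

# adjacent rows map to distinct sub-banks, cycling with period 4
segs = [segment_address(r, 4)[0] for r in range(8)]
```

Any window of n_segments consecutive rows hits n_segments distinct sub-banks, which is exactly the property that allows that many node processing units to fetch in parallel.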
In this paper, we will only discuss the first two methods. The principle of the third method is evident given the details of the first two. We will use the same example as before for illustration. Without loss of generality, we focus on the case of four sub-banks.
In the first approach, we could employ a straightforward partition scheme as shown in Fig. 7(a), where the number (1, 2, 3, or
4) associated with each 1-component in the submatrix indicates
in which memory sub-bank the data is stored. Memory access
conflict will occur at the second processing cycle in the column
decoding phase. During this cycle, it is supposed to process
Fig. 8. Example of packaging multiple data into one memory entry: (a) a straightforward packaging scheme and (b) a better packaging scheme.
Fig. 9. Block diagram for the switching network used in the column processing
phase when the straightforward packaging scheme is adopted.
Fig. 10. Switching networks for: (a) row processing phase; (b) column decoding phase; and (c) both decoding phases.
Fig. 11. Example of a 15 × 15 submatrix.

The parity check matrix is an array of circulant submatrices, as given in (8). Each submatrix is a circulant matrix with both a column and a row weight of 2. An example of a 15 × 15 submatrix is shown in Fig. 11. Observe that the inherent parallelism level of the code is 16 in the column processing phase and 2 in the row processing phase of the partially parallel decoding. In other words, we can instantiate up to 16 variable node processing units and two check node processing units if we adopt the conventional partially parallel decoding architecture. In this design, we will employ a two-level enhanced partially parallel decoding architecture to raise the parallelism level and thus further increase the decoding throughput.
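The submatrix structure described above can be reproduced mechanically. The size and shift values below are illustrative, not taken from the actual code.

```python
def shifted_identity(n, s):
    """n x n identity matrix cyclically shifted by s:
    row r has its single 1 at column (r + s) % n."""
    return [[1 if c == (r + s) % n else 0 for c in range(n)]
            for r in range(n)]

def eg_submatrix(n, s1, s2):
    """Sketch of an EG-LDPC submatrix: two superimposed cyclic-shifted
    identities (s1 != s2 mod n), giving row and column weight 2."""
    a, b = shifted_identity(n, s1), shifted_identity(n, s2)
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
```

Since each constituent identity contributes exactly one 1 per row and per column, the superposition has weight 2 in both directions whenever the two shifts differ.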
B. Enhanced Partially Parallel Decoding Architecture
In principle, either kind of the proposed enhanced partially parallel decoding architectures can linearly increase the decoding throughput with small hardware overhead. In this design, we chose the first approach with a parallelism-enhancement factor of 2. The modified SPA is employed for decoding.
Fig. 12 shows the architecture of a CPU, which performs the check-to-variable message computation. Each CPU has 32 inputs and 32 outputs. The LUT-A is introduced to perform the function Φ(x) = −log(tanh(x/2)). The magnitude of each output is the summation of 31 out of the 32 numbers coming from LUT-A. The sign bit of each output is the product of 31 out of the 32 sign bits which come from the inputs. In the last addition stage, each of the two addends is separated into high and low parts.
Two partial additions are performed in parallel to reduce the addition delay. In the CPU, pipeline latches are inserted as indicated by the dashed lines to reduce the critical path. The data representations for the inputs of the CPU, the outputs of the LUT-As, and the final outputs of the CPU are 2's complement, unsigned, and sign-magnitude, respectively.
The architecture of a VPU, which performs the variable-to-check message computation, is shown in Fig. 13. Each VPU has five inputs and five outputs; the intrinsic message and the tentatively decoded bit enter and leave through dedicated ports. The LUT-B performs the same function Φ(x). All data in the VPU computation, including the intrinsic message, the outputs of the LUT-Bs, and the outputs of the VPU, are represented in 2's complement. The inputs of the LUT-Bs, coming from the outputs of CPUs, are in sign-magnitude format.
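The leave-one-out sums and sign products in the CPU need not be computed independently for each of the 32 outputs. A common simplification, sketched below under the assumption that it matches the paper's intent, computes the full sum and the overall sign parity once and then removes each edge's own contribution.

```python
def cpu_outputs(mags, signs):
    """Leave-one-out combination for a CPU: output i sums the other
    magnitudes and multiplies the other sign bits (signs as 0/1,
    where 1 encodes a negative message). Computing the total once
    avoids a separate 31-input reduction per output."""
    total = sum(mags)
    parity = 0
    for s in signs:
        parity ^= s
    return [(total - m, parity ^ s) for m, s in zip(mags, signs)]
```

The same total-minus-own trick applies to the VPU sums, which is one reason the modified SPA can merge the two node-unit types.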
The block diagram of the proposed decoder architecture is shown in Fig. 14. Each memory block, which consists of two memory banks, is associated with a circulant submatrix of the parity check matrix. These memory blocks are used to store the extrinsic soft messages conveyed in the two decoding phases. Two further memory modules are used to store the intrinsic soft messages and the temporary decoded bits, respectively. As can be seen from the figure, the overall architecture has 16 × 2 = 32 VPUs and 2 × 2 = 4 CPUs.

Fig. 14. Proposed enhanced partially parallel decoder architecture for the EG-LDPC code.
With 2-parallel enhancement, each memory bank is partitioned into two segments, i.e., one contains all data corresponding to even-addressed rows and the other contains data
corresponding to all odd-addressed rows of the corresponding
submatrix. To illustrate the details of dataflow, the analyses
are performed, respectively, for three different cases, which
correspond to even, odd, and zero shift offsets (ranging from
0 to 510) of a cyclic-shift identity matrix. Fig. 15 shows an
example of the memory partitioning and data switching scheme
applied to a 15 15 cyclic-shifted identity matrix with an even
(excluding 0) shift offset of 6. In the check-to-variable message updating phase, the two data located in the even memory segment MEM_E and the odd memory segment MEM_O with the same index are sent to CPU_E and CPU_O in parallel, where
Fig. 15. Memory partitioning and data switching structure for even shifting
offset case.
CPU_E and CPU_O are the CPU components for even-row and odd-row message updating, respectively. In the variable-to-check message updating phase, the two data connected by an arrow are sent to the corresponding VPU_0 and VPU_1 in the same clock cycle. Here, VPU_0 and VPU_1 are the respective VPU components for even-column and odd-column data computation. A soft message stored in a memory segment corresponds to the 1-component located at a particular row and column of the cyclic-shifted identity matrix. In this example, the data located
in the even columns 6, 8, 10, 12, and 14 of the submatrix are
stored in the even addressed memory segment. However, the
data located in the even columns 0, 2, and 4 are stored in the
odd addressed memory segment. Similar cases exist for the
data located in the odd columns. Therefore, switching units
are needed to route data between memories and VPUs in the
variable-to-check message updating phase. Because the size
of each circulant matrix associated with the EG-LDPC code
is an odd number, only the data corresponding to the last row
(or column) of these matrices are accessed in the last clock
cycle of the check-to-variable (or variable-to-check) message
updating phase. In Fig. 15, two additional memory blocks are used to store intrinsic soft messages and temporary decoded bits, respectively. The symbol shown represents a constant for the initialization procedure.
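The even-offset example can be checked mechanically. For an n × n identity shifted by s, the 1-component in column c lies in row (c − s) mod n, and with 2-parallel partitioning the row parity selects the memory segment.

```python
def segment_of_column(col, shift, n):
    """For an n x n identity cyclically shifted by `shift`, the
    1-component in column `col` sits in row (col - shift) % n; with
    2-parallel partitioning, row parity selects the memory segment
    ('E' = even segment, 'O' = odd segment)."""
    row = (col - shift) % n
    return 'E' if row % 2 == 0 else 'O'
```

For n = 15 and shift 6 this reproduces the text's observation: even columns 6, 8, 10, 12, and 14 fall in the even segment, while even columns 0, 2, and 4 fall in the odd segment, hence the need for switching units between memories and VPUs.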
A similar example for a cyclic-shifted matrix with an odd shift
offset of 5 is shown in Fig. 16. If the memory is partitioned in
a straightforward way, the data corresponding to the last row of
this matrix should be stored in the last location of even memory
segment. Consequently, data access conflict will occur when the
two data from columns four and five are retrieved from the even
memory segment in the same cycle as indicated by the dashed
arrow. In our design, the last datum in the even rows is stored in
the odd memory sub-bank to eliminate the data access conflict.
Consequently, two multiplexers are needed to steer the displaced
data between the odd memory sub-bank and CPU_E.
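The conflict for odd offsets can likewise be detected by checking the row parities of the column pair processed in each cycle. This is a sketch of the analysis, not of the hardware.

```python
def column_phase_conflicts(shift, n):
    """Detect 2-parallel read conflicts in the variable-to-check
    phase: columns are processed in pairs (2k, 2k+1), and a conflict
    occurs when both columns' 1-components fall in rows of the same
    parity, i.e., in the same memory segment."""
    conflicts = []
    for k in range(n // 2):
        c0, c1 = 2 * k, 2 * k + 1
        r0, r1 = (c0 - shift) % n, (c1 - shift) % n
        if r0 % 2 == r1 % 2:
            conflicts.append((c0, c1))
    return conflicts
```

For n = 15 with odd shift 5 this flags exactly the column pair (4, 5) described above, while the even shift 6 produces no conflict, matching the two figures.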
For the third case, when the shift value is 0, the cyclic-shifted
identity matrix becomes an identity matrix. The details of
memory partitioning and data switching schemes are shown in
Fig. 17. This is in fact the simplest case.
Fig. 16. Memory partitioning and data switching structure for odd shifting
offset case.
Fig. 17. Memory partitioning and data switching structure for identity matrix.
TABLE I
UNIFORM TO NONUNIFORM QUANTIZATION CONVERSION
D. Fixed-Point Implementation
The word length of the soft messages directly affects the
memory size, the node processing unit size and the decoding
performance of an LDPC code decoder. Unless the target datarate is very high, e.g., over 4 Gb/s, the overall hardware of an
LDPC decoder is predominantly determined by the size of the
memories holding intrinsic and extrinsic soft messages. Therefore, it is very important to find efficient quantization schemes
for these soft messages.
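As a back-of-envelope check of why these memories dominate, one extrinsic message is stored per 1-component of the parity check matrix. The sketch below assumes, for simplicity, that intrinsic messages use the same word length as extrinsic ones.

```python
def message_memory_bits(code_length, col_weight, msg_bits):
    """Rough soft-message storage estimate: one extrinsic message per
    1-component of H (edges = code length x column weight) plus one
    intrinsic message per code bit. Returns (extrinsic, intrinsic)
    storage in bits."""
    edges = code_length * col_weight
    return edges * msg_bits, code_length * msg_bits
```

For the (4, 32)-regular 8176-bit code with 6-bit messages this gives 8176 × 4 × 6 = 196 224 bits of extrinsic storage alone, which is why shaving even one bit per message via quantization pays off.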
Let a uniform quantization scheme be specified by its finite word length in bits, together with the number of those bits used for the fractional part of the value to be quantized [19]. For the
considered EG-LDPC code, the target bit error rate (BER) is set based on the original requirement from NASA. With extensive simulation, we found that at least 7 bits are needed to represent the magnitude of a soft message to achieve this goal if a uniform quantization scheme is adopted (see Fig. 21). In this case, both the input and output of a LUT-A (or LUT-B) are 8 bits, which is quite large.
A nonuniform quantization scheme, which generally outperforms uniform quantization schemes of the same word length, was proposed in [20]. However, in the original method, a nonuniform quantization scheme generally performs worse than a uniform quantization scheme with one additional bit of word length, since significantly less precision is maintained for large values. This paper presents a new nonuniform quantization scheme that can achieve decoding performance almost identical to that of the
TABLE II
XILINX VIRTEXII-6000 FPGA UTILIZATION STATISTICS
both Eb/N0 = 4.2 dB and Eb/N0 = 4.4 dB. At Eb/N0 = 4.1 dB or below, over 50 block errors were observed for each
simulation case. It can be seen from the figure that there is no
observable difference between the decoding performance of
using a 7-bit uniform quantization scheme and that of using
a 6-bit nonuniform quantization scheme. It is also clear that
using a 6-bit uniform quantization scheme can hardly meet the
target BER requirement.
The new architectures for CPU and VPU with the nonuniform quantization scheme are shown in Figs. 22 and 23, respectively. The uniform to nonuniform quantization converters
(U2NUs) are introduced as shown in the two figures. They are
implemented with simple combinational logic. The LUTs for
both CPU and VPU are the same.
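The U2NU converters are simple combinational mappings from a 7-bit uniform code to a 6-bit nonuniform code. The sketch below is a hypothetical example of such a compression, keeping full precision for small magnitudes and coarser steps for large ones; it is not the paper's actual Table I mapping.

```python
def u2nu(code7):
    """Hypothetical 7-bit-to-6-bit uniform-to-nonuniform converter:
    the lower half of the range keeps full resolution, the upper
    half uses coarser steps (NOT the paper's exact Table I)."""
    if code7 < 32:                  # small values: full precision
        return code7
    return 32 + (code7 - 32) // 3   # large values: coarser steps

def nu2u(code6):
    """Approximate inverse conversion used inside the node units."""
    if code6 < 32:
        return code6
    return 32 + (code6 - 32) * 3
```

The key property any such table must preserve is monotonicity and exactness for small magnitudes, since small soft messages dominate near-threshold decoding behavior.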
E. FPGA Implementation Results
Based on the architectures described before, the (8176, 7154)
EG-based LDPC decoder was modeled in VHDL and simulated
using ModelSim. We then synthesized and performed place and
route for the design using the Xilinx ISE 5.1i software package.