IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 1, JANUARY 2007
Abstract—This paper studies low-complexity high-speed decoder architectures for quasi-cyclic low density parity check (QC-LDPC) codes. Algorithmic transformation and architectural-level optimization are incorporated to reduce the critical path. Enhanced partially parallel decoding architectures are proposed to linearly increase the throughput of conventional partially parallel decoders by introducing a small percentage of extra hardware. Based on the proposed architectures, an (8176, 7154) Euclidean geometry-based QC-LDPC code decoder is implemented on a Xilinx Virtex-II 6000 field programmable gate array (FPGA), where an efficient nonuniform quantization scheme is employed to reduce the size of the memories storing soft messages. FPGA implementation results show that the proposed decoder can achieve a maximum (source data) decoding throughput of 172 Mb/s at 15 iterations.
Index Terms—Error correction codes, field programmable gate array (FPGA), low density parity check (LDPC), parallel processing, quasi-cyclic (QC) codes.
I. INTRODUCTION
Recently, a class of structured low density parity check (LDPC) codes, namely quasi-cyclic (QC) LDPC codes [1]–[6], that can achieve performance comparable to computer-generated random codes has been proposed. Further work includes Euclidean geometry-based QC-LDPC (EG-LDPC) codes [7]. EG-LDPC codes have shown an even lower error floor than equivalent computer-generated random LDPC codes in some cases. QC-LDPC codes are well suited for hardware implementation. The encoder of a QC-LDPC code can be easily built with shift registers [8], while random codes usually entail complex encoding circuitry to perform complex matrix and vector multiplications [9], [10]. In addition, QC-LDPC codes also facilitate efficient high-speed decoding due to the regularity of their parity check matrices. Randomly constructed LDPC codes, on the other hand, require complex routing in the (hardware) decoders, which not only consumes a large amount of chip area but also significantly increases the computation delay. For instance, a fully parallel decoder based on direct mapping presented in [21] for a rate-1/2 1024-bit LDPC code consumed 1.7-M gates with a maximum (source data) decoding throughput of 500 Mb/s.
A memory-based partially parallel decoding architecture was
presented in [11] to obtain a good tradeoff between hardware
Manuscript received August 26, 2005; revised January 19, 2006 and June 6,
2006.
The authors are with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97330 USA (e-mail: zwang@eecs.
oregonstate.edu; cuizh@eecs.oregonstate.edu).
Digital Object Identifier 10.1109/TVLSI.2007.891098
complexity and decoding speed. The proposed field programmable gate array (FPGA) implementation for a 9216-bit
rate-1/2 (3, 6)-regular code achieved a maximum (coded data)
decoding throughput of 54 Mb/s (equivalent to 27-Mb/s source
data rate) at 18 iterations, where (3, 6)-regular codes refer to
the LDPC codes whose parity check matrices have a constant
column weight of 3 and row weight of 6. The design presented in [23] achieved 30% higher decoding throughput than the previous one for a much smaller size (1038 b) of QC-LDPC code. In [12], Chen et al. presented a highly parallel FPGA implementation for a rate-1/2, 8088-bit irregular QC-LDPC code. The maximum (source data) decoding
throughput reached 40 Mb/s at 25 iterations. In [13], Karkooti et al. presented an FPGA implementation for a (3, 6)-regular 1536-bit QC-LDPC code decoder. To enable a high level (e.g., p-level) of parallelism, one additional constraint, in which the cyclic shift value of each circulant matrix needs to be a multiple of p, was introduced. This proposed decoder can
run at 121 MHz to obtain 63.5-Mb/s (source data) decoding
throughput at 20 iterations. Such a high clock speed largely
benefited from significantly less routing complexity because of
the smaller codeword size compared to two other designs [11],
[12]. On the other hand, the added constraint to the parity check
matrix will inevitably circumvent the performance of LDPC
codes. In this paper, we will present a low complexity FPGA
implementation for a 8176-bit (4, 32)-regular EG-LDPC code
that can achieve a maximum (source data) decoding throughput
of 172 Mb/s at 15 iterations.
Contributions of this paper are as follows. First, we propose
a modified sum-product algorithm (SPA), which balances the
computation load between two decoding phases in order to
reduce the critical path. The new algorithm also facilitates the
merger of two types of node process units of a decoder, which
will lead to 100% hardware utilization efficiency [14]. Second,
we present architectural optimization techniques for general
node processing units to further reduce the computation delay in
the critical path. Third, we propose enhanced partially parallel
decoding architectures that can linearly increase the decoding
throughput of the conventional partially parallel decoder with
small extra hardware. We also discuss an efficient nonuniform
quantization scheme that can reduce the size of memories for
storage of soft messages. Finally, we present a detailed FPGA
implementation of the 8176-bit EG-LDPC code decoder, where
novel circuit design techniques are introduced to raise the clock
speed.
This paper is organized as follows. Section II gives a brief
review of the conventional SPA and discusses the modified version based on algorithmic transformation. Section III presents
architectural optimizations for node processing units for general QC-LDPC codes.
WANG AND CUI: LOW-COMPLEXITY HIGH-SPEED DECODER DESIGN FOR QUASI-CYCLIC LDPC CODES
105
In the conventional SPA, the check-node (row) and variable-node (column) message updates can be written as

\varepsilon_{m,n} = 2 \tanh^{-1} \Big( \prod_{n' \in N(m) \setminus n} \tanh( z_{m,n'} / 2 ) \Big)    (1)

z_{m,n} = \lambda_n + \sum_{m' \in M(n) \setminus m} \varepsilon_{m',n}    (2)

where \varepsilon_{m,n} is the check-to-variable message from check node m to variable node n, z_{m,n} is the variable-to-check message, \lambda_n is the intrinsic (channel) message of variable node n, N(m) is the set of variable nodes connected to check node m, and M(n) is the set of check nodes connected to variable node n. We proposed a modified version based on algorithmic transformation in order to balance the computation load between the two decoding phases [14]. The new algorithm is expressed in (5)-(7).
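As a concrete illustration, the conventional SPA check-node and variable-node updates can be sketched in floating point. The node degrees and message values below are arbitrary examples, not the parameters of the code considered in this paper.

```python
import math

def check_to_variable(z_msgs):
    """Check-node (row) update: for each edge n, combine the tanh of
    all OTHER incoming variable-to-check messages z."""
    out = []
    for n in range(len(z_msgs)):
        prod = 1.0
        for n2, z in enumerate(z_msgs):
            if n2 != n:
                prod *= math.tanh(z / 2.0)
        # clamp to avoid atanh(+/-1) for extreme message values
        prod = max(min(prod, 1 - 1e-12), -1 + 1e-12)
        out.append(2.0 * math.atanh(prod))
    return out

def variable_to_check(intrinsic, eps_msgs):
    """Variable-node (column) update: intrinsic channel LLR plus all
    OTHER incoming check-to-variable messages eps."""
    total = intrinsic + sum(eps_msgs)
    return [total - e for e in eps_msgs]
```

Note that the variable-node update already uses the total-minus-own form, which is the kind of load-balancing transformation the modified SPA exploits.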
different LUTs, i.e., LUT-A and LUT-B. The sign bit will be included in the input of either LUT. The outputs of LUT-A and LUT-B will be in unsigned format (5 bits in this example) and 2's complement format (6 bits in the example), respectively. In
addition, we proposed to utilize architectural level optimization
techniques to further reduce the total computation delay for the
summation part [15]. The new architectures for VPU and CPU
are shown in Figs. 3 and 4, respectively. The pipeline latches
can be properly inserted into the data paths as explained in [16].
Therefore, the critical path of either type of node processing unit can be reduced to approximately one multi-bit (5- or 6-bit in this example) addition operation with 3-stage pipelining, which leads to a multifold speed-up of the node processing units compared with the traditional designs [25], [26].
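The lookup-table contents can be generated offline. The sketch below assumes the standard log-SPA kernel Phi(x) = -log(tanh(x/2)); the input/output bit widths and the fractional precision are illustrative choices, not necessarily the exact ones used in this design.

```python
import math

def phi(x):
    """Standard log-domain SPA kernel; phi is its own inverse."""
    return -math.log(math.tanh(x / 2.0))

def build_lut(in_bits=5, frac_bits=2, out_bits=5):
    """Tabulate phi() over all unsigned fixed-point input codes.
    Widths here are illustrative, not the paper's exact choices."""
    step = 1.0 / (1 << frac_bits)
    max_code = (1 << out_bits) - 1
    lut = []
    for code in range(1 << in_bits):
        x = max(code * step, step / 2)       # avoid phi(0) = inf
        q = min(int(round(phi(x) / step)), max_code)
        lut.append(q)
    return lut
```

Because phi is an involution, the same table shape serves both decoding phases, which is what makes sharing the LUT structure between CPU and VPU attractive.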
IV. PARTIALLY PARALLEL DECODING ARCHITECTURE
AND ENHANCEMENT
Several papers have addressed partially parallel decoding
architectures for regular LDPC codes such as [11] and [12].
These architectures generally achieve a good tradeoff between
hardware complexity and decoding throughput. A partially parallel decoder architecture for generic (3, 5) QC-LDPC codes is shown in Fig. 5, where a set of memory banks is used to store the soft message symbols conveyed at both decoding phases.
Fig. 6. Memory module corresponding to a submatrix: (a) one soft message symbol per entry and (b) two soft message symbols per entry.
For EG-LDPC codes, each submatrix of the parity check matrix consists of two superimposed cyclic-shifted identity matrices (see Fig. 11). A more general case is to have multiple independently cyclic-shifted identity matrices superimposed in one submatrix. To efficiently decode this class of codes, we proposed to have one separate memory bank for each independently cyclic-shifted identity matrix [28]. Hence, we need two memory banks for each submatrix in decoding the previously mentioned EG-LDPC codes.
To increase the parallelism of a partially parallel decoder, we propose the following three methods:
1) partition each memory bank into sub-banks (also called memory segments), where all the soft symbols corresponding to 1-components in adjacent rows of a submatrix are stored in different segments;
2) store soft messages corresponding to adjacent rows of a submatrix in one memory entry while utilizing extra buffers to resolve the memory access conflict;
3) combine methods 1) and 2).
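The address mapping behind method 1) can be sketched in a few lines; the sub-bank count of four below is a hypothetical example.

```python
def segment_address(row, n_segments):
    """Method 1) sketch: soft messages for n_segments adjacent rows of
    a submatrix land in different sub-banks, so they can be read in
    the same clock cycle. Returns (sub-bank index, local address)."""
    return row % n_segments, row // n_segments

# adjacent rows map to distinct sub-banks, cycling with period 4
segs = [segment_address(r, 4)[0] for r in range(8)]
```

Any window of n_segments consecutive rows hits n_segments distinct sub-banks, which is exactly the property that allows that many node processing units to fetch in parallel.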
In this paper, we will only discuss the first two methods. The principle of the third method is evident given the details of the first two. We will use the same example as before for illustration. Without loss of generality, we focus on the case of four sub-banks.
In the first approach, we could employ a straightforward partition scheme as shown in Fig. 7(a), where the number (1, 2, 3, or
4) associated with each 1-component in the submatrix indicates
in which memory sub-bank the data is stored. Memory access
conflict will occur at the second processing cycle in the column
decoding phase. During this cycle, it is supposed to process
Fig. 8. Example of packaging multiple data into one memory entry: (a) a straightforward packaging scheme and (b) a better packaging scheme.
Fig. 9. Block diagram for the switching network used in the column processing
phase when the straightforward packaging scheme is adopted.
Fig. 10. Switching networks for: (a) row processing phase; (b) column decoding phase; and (c) both decoding phases.
Fig. 11. Example of a 15 × 15 submatrix.

The parity check matrix is an array of circulant submatrices, as given in (8). Each submatrix is a circulant matrix with both a column and a row weight of 2. An example of a 15 × 15 submatrix is shown in Fig. 11. Observe that the inherent parallelism level of the code is 16 in the column processing phase and 2 in the row processing phase of the partially parallel decoding. In other words, we can instantiate up to 16 variable node processing units and two check node processing units if we adopt the conventional partially parallel decoding architecture. In this design, we will employ a two-level enhanced partially parallel decoding architecture to raise the parallelism level and thus further increase the decoding throughput.
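The submatrix structure described above can be reproduced mechanically. The size and shift values below are illustrative, not taken from the actual code.

```python
def shifted_identity(n, s):
    """n x n identity matrix cyclically shifted by s:
    row r has its single 1 at column (r + s) % n."""
    return [[1 if c == (r + s) % n else 0 for c in range(n)]
            for r in range(n)]

def eg_submatrix(n, s1, s2):
    """Sketch of an EG-LDPC submatrix: two superimposed cyclic-shifted
    identities (s1 != s2 mod n), giving row and column weight 2."""
    a, b = shifted_identity(n, s1), shifted_identity(n, s2)
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
```

Since each constituent identity contributes exactly one 1 per row and per column, the superposition has weight 2 in both directions whenever the two shifts differ.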
B. Enhanced Partially Parallel Decoding Architecture
In principle, either kind of the proposed enhanced partially parallel decoding architectures can linearly increase the decoding throughput with small hardware overhead. In this design, we chose the first approach with a parallelism-enhancement factor of 2. The modified SPA is employed for decoding.
Fig. 12 shows the architecture of a CPU, which performs the check-to-variable message computation. Each CPU has 32 inputs and 32 outputs. The LUT-A is introduced to perform the function Φ(x) = −log(tanh(x/2)). The magnitude of each output is the summation of 31 out of the 32 numbers coming from LUT-A. The sign bit of each output is the product of 31 out of the 32 sign bits which come from the inputs. In the last addition stage, each of the two addends is separated into high and low parts.
Two partial additions are performed in parallel to reduce the addition delay. In the CPU, pipeline latches are inserted as indicated by the dashed lines to reduce the critical path. The data representations for the inputs of the CPU, the outputs of the LUT-As, and the final outputs of the CPU are 2's complement, unsigned, and sign-magnitude, respectively.
The architecture of a VPU, which performs the variable-to-check message computation, is shown in Fig. 13. Each VPU has five inputs and five outputs; the intrinsic message and the tentatively decoded bit enter and leave through dedicated ports. The LUT-B performs the same function Φ(x). All data in the VPU computation, including the intrinsic message, the outputs of the LUT-Bs, and the outputs of the VPU, are represented in 2's complement. The inputs of the LUT-Bs, coming from the outputs of CPUs, are in sign-magnitude format.
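The leave-one-out sums and sign products in the CPU need not be computed independently for each of the 32 outputs. A common simplification, sketched below under the assumption that it matches the paper's intent, computes the full sum and the overall sign parity once and then removes each edge's own contribution.

```python
def cpu_outputs(mags, signs):
    """Leave-one-out combination for a CPU: output i sums the other
    magnitudes and multiplies the other sign bits (signs as 0/1,
    where 1 encodes a negative message). Computing the total once
    avoids a separate 31-input reduction per output."""
    total = sum(mags)
    parity = 0
    for s in signs:
        parity ^= s
    return [(total - m, parity ^ s) for m, s in zip(mags, signs)]
```

The same total-minus-own trick applies to the VPU sums, which is one reason the modified SPA can merge the two node-unit types.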
The block diagram of the proposed decoder architecture is shown in Fig. 14. Each memory block, which consists of two memory banks, is associated with a circulant submatrix of the parity check matrix. These memory blocks are used to store the extrinsic soft messages conveyed in the two decoding phases. Two further memory modules are used to store the intrinsic soft messages and the temporary decoded bits, respectively. As can be seen from the figure, the overall architecture has 16 × 2 = 32 VPUs and 2 × 2 = 4 CPUs.

Fig. 14. Proposed enhanced partially parallel decoder architecture for the EG-LDPC code.
With 2-parallel enhancement, each memory bank is partitioned into two segments, i.e., one contains all data corresponding to even-addressed rows and the other contains data
corresponding to all odd-addressed rows of the corresponding
submatrix. To illustrate the details of dataflow, the analyses
are performed, respectively, for three different cases, which
correspond to even, odd, and zero shift offsets (ranging from
0 to 510) of a cyclic-shift identity matrix. Fig. 15 shows an
example of the memory partitioning and data switching scheme
applied to a 15 15 cyclic-shifted identity matrix with an even
(excluding 0) shift offset of 6. In the check-to-variable message updating phase, the two data located in the even memory segment MEM_E and the odd memory segment MEM_O with the same index are sent to CPU_E and CPU_O in parallel, where
Fig. 15. Memory partitioning and data switching structure for even shifting
offset case.
CPU_E and CPU_O are the CPU components for even-row and odd-row message updating, respectively. In the variable-to-check message updating phase, the two data connected by an arrow are sent to the corresponding VPU_0 and VPU_1 in the same clock cycle. Here, VPU_0 and VPU_1 are the respective VPU components for even-column and odd-column data computation. A soft message stored in a memory segment corresponds to the 1-component located at a particular row and column of the cyclic-shifted identity matrix. In this example, the data located
in the even columns 6, 8, 10, 12, and 14 of the submatrix are
stored in the even addressed memory segment. However, the
data located in the even columns 0, 2, and 4 are stored in the
odd addressed memory segment. Similar cases exist for the
data located in the odd columns. Therefore, switching units
are needed to route data between memories and VPUs in the
variable-to-check message updating phase. Because the size
of each circulant matrix associated with the EG-LDPC code
is an odd number, only the data corresponding to the last row
(or column) of these matrices are accessed in the last clock
cycle of the check-to-variable (or variable-to-check) message
updating phase. In Fig. 15, two additional memory blocks are used to store intrinsic soft messages and temporary decoded bits, respectively. The symbol shown represents a constant for the initialization procedure.
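The even-offset example can be checked mechanically. For an n × n identity shifted by s, the 1-component in column c lies in row (c − s) mod n, and with 2-parallel partitioning the row parity selects the memory segment.

```python
def segment_of_column(col, shift, n):
    """For an n x n identity cyclically shifted by `shift`, the
    1-component in column `col` sits in row (col - shift) % n; with
    2-parallel partitioning, row parity selects the memory segment
    ('E' = even segment, 'O' = odd segment)."""
    row = (col - shift) % n
    return 'E' if row % 2 == 0 else 'O'
```

For n = 15 and shift 6 this reproduces the text's observation: even columns 6, 8, 10, 12, and 14 fall in the even segment, while even columns 0, 2, and 4 fall in the odd segment, hence the need for switching units between memories and VPUs.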
A similar example for a cyclic-shifted matrix with an odd shift
offset of 5 is shown in Fig. 16. If the memory is partitioned in
a straightforward way, the data corresponding to the last row of
this matrix should be stored in the last location of even memory
segment. Consequently, data access conflict will occur when the
two data from columns four and five are retrieved from the even
memory segment in the same cycle as indicated by the dashed
arrow. In our design, the last datum in the even rows is stored in
the odd memory sub-bank to eliminate the data access conflict.
Consequently, two multiplexers are needed to steer the displaced
data between the odd memory sub-bank and CPU_E.
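The conflict for odd offsets can likewise be detected by checking the row parities of the column pair processed in each cycle. This is a sketch of the analysis, not of the hardware.

```python
def column_phase_conflicts(shift, n):
    """Detect 2-parallel read conflicts in the variable-to-check
    phase: columns are processed in pairs (2k, 2k+1), and a conflict
    occurs when both columns' 1-components fall in rows of the same
    parity, i.e., in the same memory segment."""
    conflicts = []
    for k in range(n // 2):
        c0, c1 = 2 * k, 2 * k + 1
        r0, r1 = (c0 - shift) % n, (c1 - shift) % n
        if r0 % 2 == r1 % 2:
            conflicts.append((c0, c1))
    return conflicts
```

For n = 15 with odd shift 5 this flags exactly the column pair (4, 5) described above, while the even shift 6 produces no conflict, matching the two figures.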
For the third case, when the shift value is 0, the cyclic-shifted
identity matrix becomes an identity matrix. The details of
memory partitioning and data switching schemes are shown in
Fig. 17. This is in fact the simplest case.
Fig. 16. Memory partitioning and data switching structure for odd shifting
offset case.
Fig. 17. Memory partitioning and data switching structure for identity matrix.
TABLE I
UNIFORM TO NONUNIFORM QUANTIZATION CONVERSION
D. Fixed-Point Implementation
The word length of the soft messages directly affects the
memory size, the node processing unit size and the decoding
performance of an LDPC code decoder. Unless the target datarate is very high, e.g., over 4 Gb/s, the overall hardware of an
LDPC decoder is predominantly determined by the size of the
memories holding intrinsic and extrinsic soft messages. Therefore, it is very important to find efficient quantization schemes
for these soft messages.
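As a back-of-envelope check of why these memories dominate, one extrinsic message is stored per 1-component of the parity check matrix. The sketch below assumes, for simplicity, that intrinsic messages use the same word length as extrinsic ones.

```python
def message_memory_bits(code_length, col_weight, msg_bits):
    """Rough soft-message storage estimate: one extrinsic message per
    1-component of H (edges = code length x column weight) plus one
    intrinsic message per code bit. Returns (extrinsic, intrinsic)
    storage in bits."""
    edges = code_length * col_weight
    return edges * msg_bits, code_length * msg_bits
```

For the (4, 32)-regular 8176-bit code with 6-bit messages this gives 8176 × 4 × 6 = 196 224 bits of extrinsic storage alone, which is why shaving even one bit per message via quantization pays off.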
Let a uniform quantization scheme be specified by its finite word length in bits, together with the number of those bits used for the fractional part of the value to be quantized [19]. For the
considered EG-LDPC code, the target bit error rate (BER) is set based on the original requirement from NASA. With extensive simulation, we found that at least 7 bits are needed to represent the magnitude of a soft message to achieve this goal if a uniform quantization scheme is adopted (see Fig. 21). In this case, both the input and output of a LUT-A (or LUT-B) are 8 bits, which is quite large.
A nonuniform quantization scheme, which generally outperforms uniform quantization schemes of the same word length, was proposed in [20]. However, in the original method, a nonuniform quantization scheme generally performs worse than a uniform quantization scheme with one additional bit of word length, since significantly less precision is maintained for large values. This paper presents a new nonuniform quantization scheme that can achieve decoding performance almost identical to that of the
TABLE II
XILINX VIRTEXII-6000 FPGA UTILIZATION STATISTICS
both Eb/N0 = 4.2 dB and Eb/N0 = 4.4 dB. At Eb/N0 = 4.1 dB or below, over 50 block errors were observed for each
simulation case. It can be seen from the figure that there is no
observable difference between the decoding performance of
using a 7-bit uniform quantization scheme and that of using
a 6-bit nonuniform quantization scheme. It is also clear that
using a 6-bit uniform quantization scheme can hardly meet the
target BER requirement.
The new architectures for CPU and VPU with the nonuniform quantization scheme are shown in Figs. 22 and 23, respectively. The uniform to nonuniform quantization converters
(U2NUs) are introduced as shown in the two figures. They are
implemented with simple combinational logic. The LUTs for
both CPU and VPU are the same.
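The U2NU converters are simple combinational mappings from a 7-bit uniform code to a 6-bit nonuniform code. The sketch below is a hypothetical example of such a compression, keeping full precision for small magnitudes and coarser steps for large ones; it is not the paper's actual Table I mapping.

```python
def u2nu(code7):
    """Hypothetical 7-bit-to-6-bit uniform-to-nonuniform converter:
    the lower half of the range keeps full resolution, the upper
    half uses coarser steps (NOT the paper's exact Table I)."""
    if code7 < 32:                  # small values: full precision
        return code7
    return 32 + (code7 - 32) // 3   # large values: coarser steps

def nu2u(code6):
    """Approximate inverse conversion used inside the node units."""
    if code6 < 32:
        return code6
    return 32 + (code6 - 32) * 3
```

The key property any such table must preserve is monotonicity and exactness for small magnitudes, since small soft messages dominate near-threshold decoding behavior.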
E. FPGA Implementation Results
Based on the architectures described before, the (8176, 7154)
EG-based LDPC decoder was modeled in VHDL and simulated
using ModelSim. We then synthesized and performed place and
route for the design using the Xilinx ISE 5.1i software package.