
UNIVERSIDAD POLITÉCNICA DE MADRID
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN

PROYECTO FIN DE CARRERA

TURBO DECODER IMPLEMENTATION BASED ON THE SOVA ALGORITHM

Carlos Arrabal Azzalini

Madrid, April 2007
PROYECTO FIN DE CARRERA

TURBO DECODER IMPLEMENTATION BASED ON THE SOVA ALGORITHM

Author: Carlos Arrabal Azzalini
Advisor: Pablo Ituero Herrero

DEPARTAMENTO DE INGENIERÍA ELECTRÓNICA
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN
UNIVERSIDAD POLITÉCNICA DE MADRID

Madrid, April 2007


PROYECTO FIN DE CARRERA: Turbo Decoder Implementation Based on the SOVA Algorithm

AUTHOR: Carlos Arrabal Azzalini

ADVISOR: Pablo Ituero Herrero

The examination board appointed to evaluate the Project named above, composed of the following members:

PRESIDENT: D. Carlos Alberto López Barrio

MEMBER: Dña. María Luisa López Vallejo

SECRETARY: D. José Luis Ayala Rodrigo

SUBSTITUTE: D. Gabriel Caffarena Fernández

agree to award it the grade of:

Madrid, de de 2007

The Secretary of the Board


To my parents
Acknowledgements

First of all I would like to thank Marisa for assigning this project and the scholarship to
me. I have enjoyed working on it all along.

I would like to give special thanks to my mentor and friend Pablo for his advice and
support. I had a great time working with him.

Thanks to my friends at the Lab for the fantastic environment.

Finally I would like to thank Sandra for all her support and patience and for being there
all the time.

Abstract

Today's most common architectures for implementing the SOVA algorithm are governed by
two parameters: the traceback depth and the reliability updating depth. These parameters
play an important role in the trade-offs between BER performance, power consumption,
area and system throughput. In this work, we present a new approach to SOVA decoding
that is not limited by these parameters and leads to an optimum execution of the SOVA
algorithm. Moreover, the architecture is built from recursive units that consume less power,
since the number of registers employed is reduced. We also present a new scheme to
improve the SOVA BER performance, based on an approximation to the BR-SOVA
algorithm. With this scheme the BER achieved is within 0.1 dB of that obtained with the
Max-Log-MAP algorithm.

Contents

1 Introduction

2 Turbo Codes
  2.1 Binary Phase Shift Keying Communication System Model
  2.2 Soft Information and Log-Likelihood Ratios in Channel Coding
  2.3 Convolutional Encoders
  2.4 Trellis Diagrams
  2.5 Turbo Codes Encoders
  2.6 Trellis Termination

3 Decoding Turbo Codes: Soft Output Viterbi Algorithm
  3.1 Turbo Codes decoding process
  3.2 SISO Unit: SOVA
    3.2.1 Viterbi Algorithm Decoding Example
    3.2.2 Soft Output extension for the VA
    3.2.3 Improving the soft output information of the SOVA algorithm

4 Hardware Implementation of a Turbo Decoder based on SOVA
  4.1 Turbo Decoder RAM buffers
  4.2 Interleaving/Deinterleaving unit of the turbo decoder
  4.3 SOVA as the core of the SISO
  4.4 Branch Metric Unit
  4.5 Add Compare Select Unit
  4.6 Survival Memory Unit
    4.6.1 Register Exchange Survival Memory Unit
    4.6.2 Systolic Array Survival Memory Unit
    4.6.3 Two Step approach for the Survival Memory Unit
    4.6.4 Other Architectures
    4.6.5 Fusion Points Survival Memory Unit
  4.7 Fusion Points based Reliability Updating Unit
  4.8 Control Unit
  4.9 Improvements

5 Methodology

6 Measures and Results
  6.1 Quantization Scheme
  6.2 Synthesis Results
  6.3 Bit Error Rate Results
  6.4 Throughput Results
  6.5 Power Results

7 Conclusions and future work

Bibliography

List of Figures

2.1 Simplified communication system model
2.2 Discrete AWGN channel
2.3 NSC encoder of rate 1/2
2.4 RSC encoder of rate 1/2
2.5 RSC encoder used in the UMTS standard. Pfb = [1011], Pg = [1101]
2.6 Trellis example of an RSC encoder with Pfb = [111], Pg = [101]
2.7 Serial concatenated Turbo encoder
2.8 Parallel concatenated Turbo encoder. RSC encoder with Pfb = [111], Pg = [101]
2.9 Turbo encoder with trellis termination in one encoder. Pfb = [111], Pg = [101]

3.1 Turbo Decoder generic scheme
3.2 Output during state transition for a given trellis
3.3 Trellis diagram for VA, code given by Pfb = [111], Pg = [101]
3.4 Soft Output extension example for the Viterbi Algorithm. Code given by Pfb = [111], Pg = [101]

4.1 Hardware implementation of a turbo decoder
4.2 Overall system state diagram
4.3 Data-in RAM
4.4 Data-out RAM
4.5 RAM La/Le and RAM Le/La connections
4.6 Interleaving/Deinterleaving Unit
4.7 Viterbi and SOVA decoder schemes
4.8 BMU for the RSC encoder
4.9 Add Compare Select Unit for the SOVA. Pfb = [111], Pg = [101]
4.10 Modular representation of the path metrics. Each path metric register has a width of nb bits
4.11 Merging of paths in the traceback
4.12 Register Exchange SMU for the SOVA. Pfb = [111], Pg = [101]
4.13 Register Exchange processing elements
4.14 Systolic Array for the Viterbi Algorithm
4.15 Survival unit for the Systolic Array
4.16 Two Step idea. First tracing back, and then reliability updating
4.17 Fusion Points based SMU
4.18 Possibility of fusion points
4.19 Fusion Point detection algorithm
4.20 Sequence of the Fusion Point algorithm
4.21 FPU architecture for a code with constraint length K = 3
4.22 Reliability updating problem
4.23 One possible solution to the problem of bit reliabilities releasing
4.24 Solution adopted for the bit reliabilities releasing problem
4.25 Fusion Points based Reliability updating unit
4.26 Recursive Updating Unit
4.27 Recursive Updating Process
4.28 Control Unit General Scheme
4.29 Control Unit State Diagram
4.30 Reliability Updating Unit with BR-SOVA approximation
4.31 Recursive Update with BR-SOVA approximation

5.1 Project Work Flow
5.2 Hardware-in-the-loop approach
5.3 Hardware-in-the-loop verification procedure

6.1 Δ quantization effect on the system BER performance. BR-SOVA approximation scheme. Simulation with quantization. MCF. Pfb = [111], Pg = [101]
6.2 HR-BRapprox comparison. Infinite precision simulations. MCF interleaver. Pfb = [111], Pg = [101]
6.3 HR-SOVA HIL results. MCF interleaver. Pfb = [111], Pg = [101]
6.4 BR-SOVA approximation HIL results. MCF interleaver. Pfb = [111], Pg = [101]
6.5 HR-BRapprox HIL comparison. MCF interleaver. Pfb = [111], Pg = [101]
6.6 HR-BRapprox comparison. Infinite precision simulations. RAND interleaver. Pfb = [1011], Pg = [1101]
6.7 BR-SOVA approximation HIL results. RAND interleaver. Pfb = [1011], Pg = [1101]
6.8 Throughput statistics. f = 25 MHz, fRUU = 25 MHz. Pfb = [111], Pg = [101]
6.9 Throughput statistics. f = 25 MHz, fRUU = 50 MHz. Pfb = [111], Pg = [101]
6.10 Throughput statistics. f = 16.66 MHz, fRUU = 25 MHz. Pfb = [111], Pg = [101]
6.11 Throughput statistics. f = 25 MHz, fRUU = 25 MHz. Pfb = [1011], Pg = [1101]
6.12 Throughput statistics. f = 25 MHz, fRUU = 50 MHz. Pfb = [1011], Pg = [1101]
6.13 Throughput statistics. f = 16.66 MHz, fRUU = 50 MHz. Pfb = [1011], Pg = [1101]

Chapter 1

Introduction

The goal of any communication system is to achieve highly reliable communication with
reduced transmitted power while reaching data rates as high as possible. These parameters
usually represent a trade-off that designers have to deal with. Bandwidth is also a limited
resource in communication systems. Error-detecting and error-correcting techniques are
used in digital communication systems in order to obtain higher spectral and power
efficiencies. This is based on the fact that with these techniques more channel errors can
be tolerated, so the communication system can operate with a lower transmitted power,
transmit over longer distances, tolerate more interference, use smaller antennas, and
transmit at higher data rates.
One of the most widespread of these techniques is Forward Error Correction (FEC). On
the transmitter side, an FEC encoder adds redundancy to the data in the form of parity
information. Then at the receiver, an FEC decoder is able to exploit the redundancy in
such a way that a reasonable number of channel errors can be corrected. Claude Shannon
—father of Information Theory— showed that if long random codes are used, reliable
communications can take place at the minimum required Signal to Noise Ratio (SNR).
However, truly random codes are not practical to implement. Codes must possess some
structure in order to have computationally tractable encoding and decoding algorithms.
Turbo Codes were introduced by Berrou, Glavieux and Thitimajshima in 1993 [3].
These codes exhibit an astonishing performance close to the theoretical Shannon limit,
in addition to a good feasibility of VLSI (Very Large Scale Integration) implementation.
Turbo Codes are used in the two most widely adopted third-generation cellular standards
(UMTS and CDMA2000). They are also incorporated into standards used by NASA for
deep space communications (CCSDS) and digital video broadcasting (DVB-T).
Decoding in Turbo Codes is carried out by a soft-output decoding algorithm: an
algorithm that provides a measure of reliability for each bit that it decodes. Specifically
two of the component decoding algorithms that are used in Turbo Codes are known as
MAP (Maximum a Posteriori) and SOVA (Soft Output Viterbi Algorithm). The high
computational complexity of the MAP algorithm makes its implementation expensive
and power-hungry. This is why most implementations perform a simplified version of
the algorithm. The most common simplifications are the Log-MAP and Max-Log-MAP
algorithms, which work in the logarithmic domain. Nevertheless, these algorithms are still
more complex and power-hungry than the SOVA algorithm, which in turn presents the
drawback of a worse BER (Bit Error Rate) performance.


This work deals with a SOVA algorithm implementation. Today's most common archi-
tectures for implementing the SOVA algorithm are governed by two parameters: the
traceback depth and the reliability updating depth. These parameters play an important
role in the trade-offs between BER performance, power consumption, area and system
throughput. In this work, we present a new approach to SOVA decoding that is not
limited by these parameters and leads to an optimum execution of the SOVA algorithm.
Moreover, the architecture is built from recursive units that consume less power, since the
number of registers employed is reduced. We also present a new scheme to improve the
SOVA BER performance. With this scheme the BER achieved is within 0.1 dB of that
obtained with the Max-Log-MAP algorithm.
The design was implemented on a low cost Spartan III FPGA (Field Programmable
Gate Array). The system was tested for two major polynomials and the system BER was
measured for different SNR input messages. Throughput measurements were also taken,
while power estimates were obtained through simulation.
The key points of this work can be summarized in the following list:

• A complete Turbo Decoder implementation based on the SOVA algorithm has been
achieved:

– A two step approach for the SOVA decoding has been adopted [9].
– A new algorithm for SOVA decoding that does not depend on the traceback
depth of the survival path has been introduced.
– A new architecture for the previous algorithm has been designed.
– A new architecture for updating bit reliabilities according to the HR-SOVA
algorithm has been designed.
– A novel updating process that approximates the BR-SOVA algorithm for binary
RSC codes has been presented. With this scheme the BER performance is less
than 0.1 dB from the Max-Log-MAP approach.

• The system has been described with generic VHDL code.

• The system has been thoroughly tested.

– BER curves have been measured for the HR-SOVA and the BR-SOVA approx-
imation with different codes (real system).
– Throughput estimations have been obtained for different codes (real system).
– Power estimations have been obtained with simulation tools (VHDL post-place-
and-route model).

The structure of this document is as follows. The second chapter introduces Turbo
Codes and sets the environment in which this work resides. The third chapter describes
the SOVA algorithm in depth and lays out the main ideas for the fourth chapter, which
describes today's most common architectures and introduces the SOVA implementation
proposed in this work. It is in the fourth chapter that the new algorithm, in conjunction
with the new architectures, is presented. The fifth chapter illustrates the practical design,
from implementation to verification. Finally, the sixth chapter presents the results and
measurements carried out on the real system, while the seventh chapter gives the
conclusions and establishes the basis for future work.
Chapter 2

Turbo Codes

Turbo Codes were presented by Berrou, Glavieux and Thitimajshima [3] in 1993. They
had a tremendous impact on the discipline of channel coding. They are, along with LDPC
(Low Density Parity Check) codes, the closest approximation ever to the codes that
Claude Shannon proved to exist in the mid-20th century and which can achieve error-free
communications. Since their introduction, they have been intensively studied. The first
commercial application was presented in 1997 [1] and today they are already part of the
UMTS (Universal Mobile Telecommunication System) standards. They have become the
first choice when working with low SNRs (Signal to Noise Ratios), such as in wireless
applications and deep space communications.
In this chapter we first introduce the communication system model which has been
employed in this work as the scenario for channel coding tests. Next we introduce the
concept of soft information, which is the key to Turbo Codes. We then describe Turbo
Code encoders and finally discuss trellis termination. The decoding process is left to the
next chapter.

2.1 Binary Phase Shift Keying Communication System Model.

In order to explain the soft information concept and the log-likelihood ratio, we will
develop a simplified communication model that will be the base example for the concepts
that follow. This communication model is shown in Figure 2.1. On the transmitter side
there is a source of information that we assume provides equally likely symbols. There is
a block for channel coding, which is the main subject of this work and is carried out by
a Turbo Code. The modulation scheme is BPSK (Binary Phase-Shift Keying) and the
channel is assumed to be AWGN (Additive White Gaussian Noise). On the receiver side,
the complementary blocks for those in the transmitter are found. There is also a matched
filter which maximizes the SNR before sampling the received data. Note that we have
omitted the synchronization recovery subsystem, which will be assumed to be ideal.
As a starting point, the source provides message bits mi at a rate of 1/T bits/s, which
are fed into the channel coding block. In a Turbo Code context, these bits are grouped
to form a frame of size L bits. The channel coding block outputs a coded frame of size
2L. So, for each message bit mi there is a symbol made of two bits xi = {xsi, xpi}. Then
Figure 2.1: Simplified communication system model.

Figure 2.2: Discrete AWGN channel

the code rate is r = 1/2 (one input bit, two output bits). The modulator generates the
waveform signals from the input bits and transmits them through the AWGN channel.
The matched filter filters the received signals which, at the corresponding time instants,
are sampled, and so the yi symbols are obtained. The AWGN channel, in conjunction with
the matched filter and the sampling unit, can be modeled as a discrete AWGN channel
as shown in figure 2.2. Modeling a discrete channel is desirable, since computer
simulations are simplified and the computing time is reduced. The equation that governs
the behavior of this channel is the following:

yi = a √Es (2xi − 1) + nG    (2.1)

where a is a fading amplitude, which is assumed to be 1. If a fading channel were under
the scope of study, then a would be assumed to be a random variable with a Rayleigh
distribution. Es is the energy of the transmitted symbol and it relates to the energy per
bit of information as Es = r·Eb. Finally, nG represents white Gaussian noise with
zero mean and a power spectral density of N0/2. For simulation purposes equation 2.1 is
rewritten as:

yi = a (2xi − 1) + n'G    (2.2)

where the variance of n'G becomes σ² = N0/(2Es).
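The discrete channel of equation (2.2) is easy to reproduce in software. Below is a minimal sketch in plain Python; the function name and the rate default are our own choices for illustration, not part of this work's VHDL design:

```python
import math
import random

def awgn_channel(bits, ebno_db, rate=0.5, a=1.0):
    """Discrete AWGN channel of equation (2.2): BPSK-map each coded
    bit x in {0,1} to a*(2x-1) and add Gaussian noise of variance
    sigma^2 = N0/(2*Es) = 1/(2*rate*Eb/N0)."""
    ebno = 10.0 ** (ebno_db / 10.0)            # Eb/N0 as a linear ratio
    sigma = math.sqrt(1.0 / (2.0 * rate * ebno))
    return [a * (2 * x - 1) + random.gauss(0.0, sigma) for x in bits]
```

At high Eb/N0 the samples cluster around ±a; at low Eb/N0 sign flips (channel errors) become frequent, which is exactly what the decoder must correct.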

2.2 Soft Information and Log-Likelihood Ratios in Channel Coding.

Whenever a symbol yi is received at the decoder, the following test rule helps us to
determine what the transmitted symbol was, based only on the observation yi and without
the help of the code.

P (xi = 1 | yi ) > P (xi = 0 | yi ) ⇒ xi = 1


P (xi = 1 | yi ) < P (xi = 0 | yi ) ⇒ xi = 0

This rule is known as MAP (Maximum A Posteriori) since P(xi = 1 | yi) and P(xi = 0 | yi)
are the a posteriori probabilities. Using Bayes' theorem, the previous rule can be
rewritten as:

P(yi | xi = 1) P(xi = 1) / P(yi) > P(yi | xi = 0) P(xi = 0) / P(yi) ⇒ xi = 1

P(yi | xi = 1) P(xi = 1) / P(yi) < P(yi | xi = 0) P(xi = 0) / P(yi) ⇒ xi = 0

and rewriting the equations as ratios yields:

[P(yi | xi = 1) / P(yi | xi = 0)] · [P(xi = 1) / P(xi = 0)] > 1 ⇒ xi = 1

[P(yi | xi = 1) / P(yi | xi = 0)] · [P(xi = 1) / P(xi = 0)] < 1 ⇒ xi = 0

If we apply the natural logarithm to the previous equations, the testing result is not
altered, and we obtain:

ln[P(yi | xi = 1) / P(yi | xi = 0)] + ln[P(xi = 1) / P(xi = 0)] > 0 ⇒ xi = 1

ln[P(yi | xi = 1) / P(yi | xi = 0)] + ln[P(xi = 1) / P(xi = 0)] < 0 ⇒ xi = 0

The previous ratios in the log domain are the LLR (Log-Likelihood Ratio) metrics, which
are a useful way to represent the soft decisions of receivers or decoders. We can summarize
the previous steps with a single equation as follows:

L(xi | yi) = L(yi | xi) + L(xi)

where L(xi | yi) = ln[P(xi = 1 | yi) / P(xi = 0 | yi)], L(yi | xi) = ln[P(yi | xi = 1) / P(yi | xi = 0)]
and L(xi) = ln[P(xi = 1) / P(xi = 0)]. The notation of the previous equation is usually
rewritten as:

Λ'i = Lc(yi) + Lai

where Lai is the LLR of the a priori information and Lc(yi) is related to a measure of the
channel reliability. Note that the sign of Λ'i indicates the hard decision.
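For the BPSK/AWGN model of section 2.1, the channel term has a simple closed form, Lc(yi) = (2a/σ²)·yi; this is a standard result for that model, though it is not derived explicitly above. A small sketch of the decision rule:

```python
def channel_llr(y, sigma2, a=1.0):
    """L(yi|xi) = ln[P(yi|xi=1)/P(yi|xi=0)] for BPSK over AWGN: the
    Gaussian likelihood ratio collapses to Lc*yi, with channel
    reliability Lc = 2a/sigma^2."""
    return (2.0 * a / sigma2) * y

def a_posteriori_llr(y, sigma2, la=0.0):
    """Lambda'_i = Lc(yi) + La_i; the sign gives the hard decision."""
    return channel_llr(y, sigma2) + la

def hard_decision(llr):
    return 1 if llr > 0 else 0
```

Note how a strong a priori term can overturn the channel observation; this is precisely how the extrinsic information of one decoder helps the other in the concatenated scheme introduced below.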

Figure 2.3: NSC encoder of rate 1/2. Pg1 = [101], Pg2 = [111]

So far we have introduced the equations of soft information based on the received
symbol at the input of the decoder, without the aid of the underlying code. Using
channel coding in the communication system lets us improve the LLR of the a posteriori
probability, as shown in [3]. The LLR of the a posteriori information at the output of the
decoder is:

Λi = Λ'i + Lei = Lc(yi) + Lai + Lei    (2.3)

The term Lei is known as the extrinsic information, which is in fact the improvement
on the soft information achieved by the decoder and the decoding process. The extrinsic
information will be the data fed as a priori information to the other decoder in a
concatenated decoding scheme. It is important to remark that all terms in equation 2.3
can be added because they are statistically independent [3]. Statistical independence of
the terms is essential to allow iterative decoding, and this is the reason for the interleavers
in the concatenation schemes of Turbo encoders and Turbo decoders.

2.3 Convolutional Encoders.


Turbo Code encoders are mainly based on convolutional encoders. In these encoders
the output sequences are generated by convolving the input sequence with the encoder's
generator sequences in several different configurations, consequently adding redundancy
to the code. Convolutional codes can be either Non-Systematic Convolutional codes (NSC),
when the input word is not among the outputs, or Recursive Systematic Convolutional
codes (RSC), when the input word is one of the outputs [8]. Figure 2.3 illustrates an
example of an NSC encoder while figure 2.4 shows an RSC encoder. A set of registers
and modulo-two adders can be seen in the figures. The connections among those registers
and the modulo-two adders determine the output sequence of the encoder. Dividing the
number of inputs I by the number of outputs O gives the code rate I/O. The examples
cited throughout this work will always use an RSC encoder with rate 1/2.
To define a convolutional encoder we need a set of polynomials which represent the
connections among the registers and the modulo-two adders. For an NSC encoder, two
code generator polynomials define the rate-1/2 encoder (see figure 2.3). On the other
hand, an RSC encoder is defined by both a feedback polynomial and a generator
polynomial (see figure 2.4).

Figure 2.4: RSC encoder of rate 1/2. Pfb = [111], Pg = [101]

Figure 2.5: RSC encoder used in the UMTS standard. Pfb = [1011], Pg = [1101]
The status of the set of registers represents the state of the encoder. Input bits mi
make the encoder memory elements change and move into another state while producing
the output bits xsi, xpi (in the case of the RSC encoder). Convolutional encoders are
characterized by the constraint length K. An encoder with constraint length K has K − 1
memory elements, which allows the encoder to move through 2^(K−1) states.
RSC encoders are used in Turbo Code schemes more often than NSC encoders, since
better BER performance has been achieved with them. For instance, the encoder used in
UMTS is the one depicted in figure 2.5.
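As a sketch, the RSC encoder of figure 2.4 (Pfb = [111], Pg = [101], K = 3) can be written in a few lines. The tap convention below (first polynomial coefficient applied to the input/feedback bit, the rest to the shift register) is our own reading of the figure:

```python
def rsc_encode(bits, pfb=(1, 1, 1), pg=(1, 0, 1)):
    """Rate-1/2 RSC encoder: returns (systematic, parity) pairs.
    pfb/pg are the feedback and generator taps; K = len(pfb)."""
    regs = [0] * (len(pfb) - 1)            # shift register, regs[0] newest
    out = []
    for m in bits:
        d = m                              # feedback bit
        for tap, r in zip(pfb[1:], regs):
            d ^= tap & r
        p = pg[0] & d                      # parity bit
        for tap, r in zip(pg[1:], regs):
            p ^= tap & r
        out.append((m, p))
        regs = [d] + regs[:-1]             # clock edge: shift d in
    return out
```

Feeding m = 110 reproduces the output sequence 11 10 00 of the trellis example in figure 2.6.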

2.4 Trellis Diagrams.

A trellis diagram is a graphical representation of the states of the encoder. It is a
powerful tool since it not only allows us to see state transitions, but also their time
evolution. The MAP (Maximum A Posteriori) and SOVA (Soft Output Viterbi Algorithm)
algorithms used to decode Turbo Codes base their calculations on the trellis branches in
order to reduce computation, and this is why we explain trellis diagrams.

Figure 2.6: Trellis example of an RSC encoder with Pfb = [111], Pg = [101]. Input
m = <110...> gives output x = <11 10 00 ...>; solid lines denote mi = 0, dashed lines
mi = 1.

Figure 2.6 shows the trellis for the RSC encoder of figure 2.4. The figure also shows an
example of an input message and how this input message represents a path in the trellis
diagram. This path is colored in blue and it is known as the state sequence s.
In order to find the trellis representation of an encoder we follow these steps:

• The trellis will have 2^(K−1) states at each time instant.

• The memory elements of the encoder are set to represent a given state. Usually the
first state is 0. Then we want to calculate the connections between the present state
and the subsequent states.

– An input bit mi equal to zero is assumed. Then the output symbol is calculated
by operating with the adders and the values of the registers. The next state
is calculated by shifting the register inputs at the clock edge. For example, in
figure 2.6 we see that at state s0 an input message bit mi = 0 produces a
transition to state s0. In contrast, a bit mi = 1 produces a transition to state
s2.
– An input bit mi equal to one is assumed. Again, the output symbol is calculated
by operating with the adders and the values of the registers, and the next
state is calculated by shifting the register inputs at the clock edge. Note that
whenever a transition is due to a zero input bit, that transition is drawn
as a solid line. In contrast, whenever the transition is due to a one input bit,
that transition is drawn as a dashed line.

• Repeat the previous steps for the rest of the states, s1 to s3 in the example.

The trellis diagram is given by the polynomials and therefore it is the same for all
stages. The encoded message can be thought of as a particular path within the trellis
diagram, as shown in the example of figure 2.6.
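The steps above can be turned into a small routine that enumerates the trellis directly from the polynomials. The state numbering below (newest register bit as the most significant bit, so that s0 with mi = 1 moves to s2 as in figure 2.6) is an assumption of this sketch:

```python
def build_trellis(pfb=(1, 1, 1), pg=(1, 0, 1)):
    """Map (state, input bit) -> (next state, (systematic, parity))
    for the RSC code defined by the feedback/generator taps."""
    k = len(pfb) - 1                       # number of memory elements
    trellis = {}
    for state in range(2 ** k):
        # decode state into register contents, regs[0] = newest bit (MSB)
        regs = [(state >> (k - 1 - i)) & 1 for i in range(k)]
        for m in (0, 1):
            d = m                          # feedback bit
            for tap, r in zip(pfb[1:], regs):
                d ^= tap & r
            p = pg[0] & d                  # parity bit
            for tap, r in zip(pg[1:], regs):
                p ^= tap & r
            nxt = [d] + regs[:-1]          # clock edge: shift d in
            next_state = sum(b << (k - 1 - i) for i, b in enumerate(nxt))
            trellis[(state, m)] = (next_state, (m, p))
    return trellis
```

The resulting table reproduces the transitions of figure 2.6, e.g. s0 stays at s0 with output {0,0} for mi = 0 and jumps to s2 with output {1,1} for mi = 1.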

Figure 2.7: Serial concatenated Turbo encoder

Figure 2.8: Parallel concatenated Turbo encoder. RSC encoder with Pfb = [111],
Pg = [101].

2.5 Turbo Codes Encoders.


As we mentioned in section 2.3, Turbo Code encoders are mainly based on convolutional
encoders. However, Turbo encoders also include one or more interleavers for shuffling data.
Figure 2.7 shows a serial concatenated Turbo encoder, while figure 2.8 shows a parallel
concatenated Turbo encoder of rate 1/2, which is the one used in our communication
system model. Many combinations can be achieved by concatenating different convolutional
encoders with interleavers. The purpose of the interleavers is to decorrelate the data
streams so that iterative decoding can take place at the decoder. In figure 2.8, there is a
block known as the puncturer, which composes the parity bit of the resulting encoder
by selecting one parity bit from each convolutional encoder at a time. If no puncturing
were done, the rate of the entire Turbo encoder would be 1/3; the rate of the resulting
Turbo encoder can thus differ from the rate of the convolutional encoders.
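A compact sketch of the parallel scheme of figure 2.8, with a hard-wired (7,5) RSC parity computation and even/odd puncturing. The exact alternation pattern is an assumption of this sketch; the figure only shows that one parity bit is selected at a time:

```python
def rsc_parity(bits):
    """Parity stream of the RSC encoder with Pfb=[111], Pg=[101]."""
    r1 = r2 = 0
    out = []
    for m in bits:
        d = m ^ r1 ^ r2                    # feedback bit
        out.append(d ^ r2)                 # parity taps: d and r2
        r1, r2 = d, r1                     # shift register update
    return out

def turbo_encode(bits, interleave):
    """Parallel concatenated rate-1/2 turbo encoder (figure 2.8):
    each output symbol is (systematic bit, punctured parity bit)."""
    par1 = rsc_parity(bits)
    par2 = rsc_parity([bits[j] for j in interleave])
    # puncturing: even positions keep encoder-1 parity, odd keep encoder-2
    return [(m, par1[i] if i % 2 == 0 else par2[i])
            for i, m in enumerate(bits)]
```

Without the puncturing line, emitting (m, par1[i], par2[i]) for every position would give the unpunctured rate-1/3 code mentioned above.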

2.6 Trellis Termination


Before getting into the decoding process, it is important to mention the trellis termination
of the convolutional encoders, since it affects the BER performance of the code. The trellis
termination is basically the final state the memory elements of the convolutional encoders
adopt when the end of the frame being encoded is reached. Since there is an interleaver
between both convolutional encoders, their trellis termination is not a trivial task [16].
For the purposes of this work, we will choose to terminate the first encoder and leave the
second encoder open. Figure 2.9 shows the resulting Turbo encoder. The system works
as follows: at the beginning, switch s1 is closed and switch s2 is opened. A data frame
of size L − 2 is encoded; then switch s1 is opened and s2 is closed, and the remaining two
bits are encoded, which leads the first convolutional encoder to state 0. Note that the
data frame, in this case L − 2 bits long, and the remaining two bits together terminate
the trellis.

Figure 2.9: Turbo encoder with trellis termination in one encoder. Pfb = [111], Pg = [101].
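For the Pfb = [111] encoder, the role of switch s2 can be sketched numerically: once the data frame has been encoded, the tail input at each step is chosen equal to the feedback sum, so that a zero is shifted in and the encoder reaches state 0 in K − 1 = 2 steps. This is a sketch of the mechanism only, not the VHDL of this work:

```python
def terminate_75(r1, r2):
    """Tail bits that return the (7,5) RSC encoder to state zero.
    Choosing m = r1 XOR r2 makes the feedback bit d = m^r1^r2 = 0,
    so a zero enters the register each step."""
    tail = []
    for _ in range(2):                     # K - 1 = 2 memory elements
        m = r1 ^ r2                        # cancel the feedback sum
        d = m ^ r1 ^ r2                    # = 0 by construction
        tail.append(m)
        r1, r2 = d, r1                     # shift the zero in
    return tail, (r1, r2)
```

Whatever state the encoder is in after the data frame, two such tail bits drive it back to the all-zero state.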
Chapter 3

Decoding Turbo Codes: Soft Output Viterbi Algorithm

In this chapter we will introduce a general scheme of a Turbo decoder for a parallel
concatenated code. We will go step by step through the entire decoding process and
describe in depth one of the algorithms used in the SISO (Soft Input Soft Output) unit:
the SOVA algorithm.

3.1 Turbo Codes decoding process.

In the previous chapter we presented Turbo Codes and the encoding process. Now it is
time to talk about the decoding process. Turbo Codes are asymmetrical codes: while the
encoding process is relatively easy and straightforward, the decoding process is complex
and time consuming.

The power of Turbo Codes resides in the decoding process, which, unlike other techniques,
is performed iteratively. Figure 3.1 shows a general scheme of a turbo decoder. As
we can see, the decoding is carried out by two SISO decoders. Signals arriving at the
receiver are sampled and processed with the aid of the channel reliability before becoming
the soft information "parity info 1,2" and "systematic info" shown in figure 3.1. We can

Figure 3.1: Turbo Decoder generic scheme.

see the output of one SISO decoder becoming the input of the other decoder and vice
versa, forming a feedback loop. The name turbo code is due to this feedback loop, whose
operation is reminiscent of a turbo engine.
Final decoding is achieved by an iterative process. Soft input information is processed
and, as a result, soft output information is obtained. The second decoder takes this soft
information as input and produces new soft output information that the first decoder
will use as input. This process continues until the system makes a hard decision. The
BER obtained improves drastically over the first iterations until it begins to converge
asymptotically [3]. A trade-off exists between the decoding delay and the bit error rate
achieved. Even though eight iterations are enough to obtain a reasonable BER, decoders
do not always perform them all; instead, they check the parity of the message header and
then decide whether to keep iterating or not.

Note that between the decoders there is an interleaver or a deinterleaver, depending on
the data flow. As we mentioned in chapter 2, the interleaver/deinterleaver unit is a major
issue in turbo coding. This unit reorders the soft information so that a priori data, parity
data and systematic data are all time coherent at the moment of processing.

Figure 3.1 also shows how the soft input information is subtracted from the output, in
order to avoid the positive feedback that would degrade the BER performance of the system.
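The loop of figure 3.1 can be sketched in software as follows; `siso`, `interleave` and `deinterleave` are caller-supplied placeholders, the fixed iteration count stands in for the header-check stopping rule, and the ordering of the two half-iterations is simplified here:

```python
def turbo_decode(sys_llr, par1_llr, par2_llr, interleave, deinterleave,
                 siso, n_iter=8):
    """Sketch of the iterative loop of figure 3.1. `siso(sys, par, La)`
    is assumed to return the full output LLRs Λ for one half-iteration."""
    La = [0.0] * len(sys_llr)                 # first iteration: La_i = 0
    for _ in range(n_iter):
        # One decoder works in the natural (deinterleaved) order.
        Lam1 = siso(sys_llr, par1_llr, La)
        Le1 = [l - a - s for l, a, s in zip(Lam1, La, sys_llr)]  # extrinsic
        # The other decoder works in the interleaved order.
        La2 = interleave(Le1)
        sys2 = interleave(sys_llr)
        Lam2 = siso(sys2, par2_llr, La2)
        Le2 = [l - a - s for l, a, s in zip(Lam2, La2, sys2)]
        La = deinterleave(Le2)                # a priori for the next pass
    return [1 if l > 0 else 0 for l in deinterleave(Lam2)]  # hard decision
```

The subtractions `Λ − La − ys` implement the extrinsic-information extraction shown with the minus signs in figure 3.1 (with Lc = 1).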

3.2 SISO Unit: SOVA.


Even though the SOVA algorithm and the MAP algorithm are both trellis based —they
take advantage of the trellis diagram to reduce computations— they differ in the final
estimation they obtain. MAP performs better at low SNR, and both perform about the
same at high SNR. MAP finds the most probable value of each symbol in the message
sequence, while SOVA finds the most probable sequence of states associated with a path
within the trellis. Nevertheless, MAP is computationally much heavier than SOVA.

SOVA stands for Soft Output Viterbi Algorithm. It is actually a modification of the
Viterbi Algorithm [7]. We will introduce the Viterbi Algorithm based on the explanation
given in [16] and then add the soft output extension. The VA is widely used because
it finds the most probable sequence within a trellis, and a trellis diagram can represent
any finite-state Markov process.
Recalling our communication model, let ŝ = (s_0, s_1, . . . , s_L) be the sequence we want to
estimate and let y be the received sequence of symbols. The VA finds:

ŝ = arg max_s { P[s | y] }    (3.1)

where y is the noisy set of symbols we have at the decoder after sampling. To be more
precise y is the observation. From Bayes theorem we have:
 
ŝ = arg max_s { P[y | s] P[s] / P[y] }    (3.2)

Since P[y] does not change with s, we can rewrite equation 3.2 as:

ŝ = arg max_s { P[y | s] P[s] }    (3.3)

In order to compute equation 3.3, we could try all sequences s and find the one that
maximizes the expression. However, this brute-force search does not scale when the frame
size is large.
Since there is a first order Markov process involved, we can take advantage of two of
its properties to simplify the search for ŝ. These properties are:

P [si+1 | s0 . . . si ] = P [si+1 | si ] (3.4)


P [yi | s] = P [yi | si → si+1 ] (3.5)

Equation 3.4 establishes that the probability of the next state does not depend on the entire
past sequence; it depends only on the last state. Equation 3.5 states that the observation
symbol y_i, received through white noise, depends only on the state transition during which
it was produced.

Using these properties we can work on 3.3:

P[y | s] = ∏_{i=0}^{L−1} P[y_i | s_i → s_{i+1}] ,

P[s] = ∏_{i=0}^{L−1} P[s_{i+1} | s_i] ,

ŝ = arg max_s { ∏_{i=0}^{L−1} P[y_i | s_i → s_{i+1}] P[s_{i+1} | s_i] }    (3.6)

A hardware implementation of an adder requires fewer resources than a hardware implementation
of a multiplier. So, if we apply the natural logarithm to 3.6, we can replace
multiplications with additions without altering the final result. Thus it yields:

ŝ = arg max_s { ∑_{i=0}^{L−1} [ ln P[y_i | s_i → s_{i+1}] + ln P[s_{i+1} | s_i] ] }    (3.7)

Introducing λ(s_i → s_{i+1}) = ln P[y_i | s_i → s_{i+1}] + ln P[s_{i+1} | s_i], we can rewrite
equation 3.7 as:

ŝ = arg max_s { ∑_{i=0}^{L−1} λ(s_i → s_{i+1}) }    (3.8)

λ (si → si+1 ) is known as the branch metric associated with transition si → si+1 .

The observation yi during state transition si → si+1 is actually the output of the
encoder observed through white noise during the state transition. For our BPSK model this

Figure 3.2: Output during state transition for a given trellis (BPSK mapping: u_si = 2x_si − 1, u_pi = 2x_pi − 1).

observation is related to the systematic and parity bit pair (Figure 3.2). Thus, assuming
noise independence, we can express the conditional probability of yi during state transition
as follows:

P [yi | si → si+1 ] = P [ysi | usi ] P [ypi | upi ] (3.9)

where u_si and u_pi are the systematic and parity bits respectively after BPSK modulation,
and

P[y_si | u_si] = (1/(σ√(2π))) exp( −(1/2)((y_si − u_si)/σ)² ) dy_si ,
P[y_pi | u_pi] = (1/(σ√(2π))) exp( −(1/2)((y_pi − u_pi)/σ)² ) dy_pi ,

since we are dealing with white Gaussian noise of variance σ². In addition, it is more
convenient to express P[s_{i+1} | s_i] in terms of the message bit m_i, since state transitions
are due to this bit. Then,

P [si+1 | si ] = P [mi ] (3.10)

This is our a priori probability. For turbo decoding it is easier to work with log-likelihood
ratios, then:

La_i = ln( P[m_i = 1] / P[m_i = 0] )

P[m_i] = e^{La_i} / (1 + e^{La_i})  for  m_i = 1 ;  P[m_i] = 1 / (1 + e^{La_i})  for  m_i = 0

⇒  ln P[m_i] = La_i m_i − ln(1 + e^{La_i})

It is important to remark that for the first iteration, all message bits are assumed to be
equally likely, then P [mi = 1] = P [mi = 0] = 0.5 → Lai = 0. For successive iterations
Lai is the extrinsic information provided by the other decoder through the interleaver.
Replacing equation 3.9 and the above expression in the branch metric equation, we have:

λ(s_i → s_{i+1}) = ln( (1/(σ² 2π)) dy_si dy_pi ) − (1/2)((y_si − u_si)/σ)² − (1/2)((y_pi − u_pi)/σ)² + La_i m_i − ln(1 + e^{La_i})

= −(1/(2σ²)) [ (y_si − u_si)² + (y_pi − u_pi)² ] + La_i m_i

= −(1/(2σ²)) [ y_si² − 2 y_si u_si + u_si² + y_pi² − 2 y_pi u_pi + u_pi² ] + La_i m_i

= (1/σ²) [ y_si u_si + y_pi u_pi ] + La_i m_i

Note that, in order to simplify the equations, we have neglected the terms that do not
change when varying the sequence s. From chapter 2 we know that σ² = N0/(2Es) and
Es = r Eb, where r = 1/2 is the code rate. So finally we obtain:

λ(s_i → s_{i+1}) = (Eb/N0) [ y_si u_si + y_pi u_pi ] + La_i m_i    (3.11)

It is more common to express equation 3.11 as shown below, in terms of the channel
reliability Lc = 4a Es/N0 (a = 1 for our model):

λ(s_i → s_{i+1}) = Lc y_si x_si + Lc y_pi x_pi + La_i m_i    (3.12)

then 3.8 becomes:

ŝ = arg max_s { ∑_{i=0}^{L−1} Lc y_si x_si + Lc y_pi x_pi + La_i m_i }    (3.13)

where xsi , xpi are the raw bits at the output of the channel encoder before the BPSK
modulation. Also mi = xsi for our RSC encoder.
It is important to remark that, according to [11], for the SOVA algorithm Lc can be
assumed to be equal to 1. This means that there is no need to estimate the SNR of
the channel. This is possible because, at the first iteration, La_i = 0, which leaves the
resulting extrinsic information weighted by Lc. This extrinsic information becomes La_i
for the next SISO decoder, so all the terms in equation 3.13 end up weighted by Lc.
Therefore Lc has no influence on the decoding process. The fact that the SOVA does not
need the channel estimation avoids considerable difficulty and represents a big advantage
over the MAP algorithm.
Summarizing, table 3.1 shows the relevant equations for applying the SOVA algorithm.

Element              Equation

Branch Metric        λ(s_i → s_{i+1}) = y_si x_si + y_pi x_pi + La_i m_i    (3.14)

Sequence Estimator   ŝ = arg max_s { ∑_{i=0}^{L−1} y_si x_si + y_pi x_pi + La_i m_i }    (3.15)

where {x_si, x_pi} is the encoder output symbol when the input message bit is m_i;
{y_si, y_pi} is the received symbol after the encoder output symbol is BPSK-modulated
and transmitted through an AWGN channel; and La_i represents the LLR of the message
bit m_i.

Table 3.1: Equations summary.
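With Lc = 1, the branch metric of equation 3.14 reduces to two products and two additions; the following sketch (illustrative naming; m_i = x_si for this RSC code) computes it:

```python
def branch_metric(ys, yp, xs, xp, La):
    """Equation 3.14 with Lc = 1: ys, yp are the received soft samples,
    xs, xp the candidate encoder output bits (0/1) labelling the branch,
    La the a priori LLR of the message bit (m = xs for this RSC code)."""
    return ys * xs + yp * xp + La * xs
```

For example, branches labelled xs = xp = 0 always contribute a metric of 0, which is the simplification the BMU of chapter 4 exploits.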

In the next subsection we will develop an example in order to show how expression
3.15 and the trellis diagram are applied in the decoding process.

3.2.1 Viterbi Algorithm Decoding Example

Figure 3.3 shows a trellis diagram example for a code with Pf b = [111], Pg = [101], and
tries to clarify the decoding process.

• As shown in figure 3.3.a, the process begins at time i = 0 from state 0, because
that is the state the encoder takes when initialized. Thus, the probability of being
at state 0 is one, and the probability of being at any other state is zero. We assign
these probabilities, as path metrics in the log domain, to each state:

pm_{0,0} = 0 ;  pm_{0,k} = −∞  ∀ k ≠ 0

• Then, the branch metrics are computed at each state for message bits 0 and 1 and
the corresponding parity bits.

Figure 3.3: Trellis diagram for the VA, code given by P_fb = [111], P_g = [101]. Panels: (a) computing branch metrics; (b) surviving branches; (c) continuing at i = 1; (d) tracing back from the last state along the survival path.


λ(s_{i,k} → s_{i+1,k′}) ⇒
λ(s_{0,0} → s_{1,0}) = (y_si + La_i)·0 + y_pi·0
λ(s_{0,0} → s_{1,2}) = (y_si + La_i)·1 + y_pi·1
λ(s_{0,1} → s_{1,2}) = (y_si + La_i)·0 + y_pi·0
λ(s_{0,1} → s_{1,0}) = (y_si + La_i)·1 + y_pi·1
λ(s_{0,2} → s_{1,3}) = (y_si + La_i)·0 + y_pi·1
λ(s_{0,2} → s_{1,1}) = (y_si + La_i)·1 + y_pi·0
λ(s_{0,3} → s_{1,1}) = (y_si + La_i)·0 + y_pi·1
λ(s_{0,3} → s_{1,3}) = (y_si + La_i)·1 + y_pi·0

• The incoming path metrics for each state at time i = 1 are calculated by adding the
incoming branch metrics to the corresponding path metrics of the states at time i = 0
(figure 3.3.b).

• For each state at time i = 1, the incoming branch with the greater incoming path
metric is kept. The new path metrics of these states are the survival incoming path
metrics.

pm_{1,0} = max( pm_{0,0} + λ(s_{0,0} → s_{1,0}) , pm_{0,1} + λ(s_{0,1} → s_{1,0}) )
pm_{1,1} = max( pm_{0,3} + λ(s_{0,3} → s_{1,1}) , pm_{0,2} + λ(s_{0,2} → s_{1,1}) )
pm_{1,2} = max( pm_{0,1} + λ(s_{0,1} → s_{1,2}) , pm_{0,0} + λ(s_{0,0} → s_{1,2}) )
pm_{1,3} = max( pm_{0,2} + λ(s_{0,2} → s_{1,3}) , pm_{0,3} + λ(s_{0,3} → s_{1,3}) )

In figures 3.3.b and 3.3.c the survival branches are drawn thicker.

• The algorithm is repeated from the second step until time i = L − 1. Note that the
final states will be at i = L.

• In order to find ŝ at this point, there are two possibilities: if the encoder was
terminated, the system should trace back from the state at which the encoder was
terminated —usually state 0— through all the linked survival branches. If the encoder
was not terminated, the system should choose the state with the greatest path metric
and trace back from there. Each branch within the trellis has an associated message
bit m̂_i; the set of those bits is the most probable message. This step is shown in
figure 3.3.d, where the survival path is highlighted.
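The steps above can be condensed into a short sketch; the data layout (`branch_metrics` and `trellis` dictionaries) is illustrative, not the thesis' implementation:

```python
def viterbi_decode(branch_metrics, trellis, n_states, L):
    """VA sketch: branch_metrics[i][(k, m)] is λ for leaving state k at
    time i with message bit m; trellis[(k, m)] gives the next state.
    Assumes the encoder starts and is terminated in state 0."""
    NEG = float("-inf")
    pm = [0.0] + [NEG] * (n_states - 1)       # pm_{0,0}=0, others -inf
    decisions = []
    for i in range(L):
        new_pm = [NEG] * n_states
        best = [None] * n_states
        for k in range(n_states):
            for m in (0, 1):
                nk = trellis[(k, m)]
                cand = pm[k] + branch_metrics[i][(k, m)]
                if cand > new_pm[nk]:          # keep the survivor branch
                    new_pm[nk], best[nk] = cand, (k, m)
        pm = new_pm
        decisions.append(best)
    k, bits = 0, []                            # trace back from state 0
    for i in range(L - 1, -1, -1):
        k, m = decisions[i][k]
        bits.append(m)
    return bits[::-1]
```

The nested loop is the add-compare-select recursion and the final loop is the trace back through the surviving branches.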

3.2.2 Soft Output extension for the VA.

The Viterbi Algorithm is able to find the most probable sequence within the trellis and
hence its associated bits. Turbo coding techniques also demand the SISO unit to supply
soft output information. There are two well-known extensions for the Viterbi Algorithm
that produce soft output [11]. One was proposed by Battail [2] and it is known as BR-
SOVA. The other one was proposed by Hagenauer [7] and it is known as HR-SOVA. The
latter is used more often than the former, even though BR-SOVA performs better in terms
of BER, because HR-SOVA allows an easier hardware implementation. We will explain
the HR-SOVA extension and remark its main idea.

Soft output information represents a measure of the bit reliabilities. As a starting
point for the algorithm, a reliability ρ of infinity is assumed for every bit in the frame,
thus ρ_i = ∞ ∀ i. The remaining steps proceed as follows:

• As shown in the example of figure 3.4.a, at time i = L and state k = 0 the trace back
of the survival path starts. The survival path has been colored in green as exhibited
in the legend of the figure. In order to find the bit reliabilities, the competing path
also needs to be traced back from time i = L and state k = 0 to the time it merges
with the survival path. This competing path has been colored in orange, and for
the example of figure 3.4.a, the time where both paths merge is im = L − 4. Also
the difference between both incoming path metrics at time i and state k has to be
found. In figure 3.4 this value is represented as:
 
∆_{i,k} = [ pm_{i−1,k′} + λ(s_{i−1,k′} → s_{i,k}) ] − [ pm_{i−1,k″} + λ(s_{i−1,k″} → s_{i,k}) ]    (3.16)

where k is the next state of k′ and k″, for a message bit m_i ∈ {0, 1} respectively.
See figure 3.4.a for references.

• Let j be a new time index in the range im < j ≤ i. At every time instant j, the
system compares the message bit of the survival path with the message bit of the
competing path. If they differ then the reliability ρj has to be updated according to

ρj ⇐ min (ρj , ∆i,k ) (3.17)

In figure 3.4 a red square is placed on the branches that differ in the message bit.
The BR-SOVA also has an updating rule for the case where the message bit of the
survival path does not differ from the message bit of the competing path:

ρ_j ⇐ min( ρ_j , ∆_{i,k} + ρ^c_j )    (3.18)

This is the main difference between HR-SOVA and BR-SOVA. Nevertheless this
updating rule implies the knowledge of the bit reliabilities of the competing paths
ρcj [11].

• Once the system reaches the state where the survival path and the competing path
merge, it moves one time instant back from i to i − 1 through the survival path
and traces back once again the competing path at that state. This process is shown
in figure 3.4.b. For the example, the system now starts at time i = L − 1 and the
corresponding state k = 0. For this case, the competing path and the survival path
now merge at time im = L − 5.

• This algorithm continues from step 2 until time i = 1, thus allowing all the bit
reliabilities to be updated. Figure 3.4.c shows one more iteration with the aim of
clarifying this process.

• Finally, soft output information is obtained in terms of the LLR (log-likelihood ratio)
as follows:

Λ_i = (2m̂_i − 1) ρ_i ,  0 ≤ i ≤ L − 1    (3.19)




Figure 3.4: Soft output extension example for the Viterbi Algorithm, code given by P_fb = [111], P_g = [101]. Panels: (a) survival path and competing path at time i = L, state k = 0; (b) survival path and competing path at time i = L − 1, state k = 0; (c) survival path and competing path at time i = L − 2, state k = 1.

where m̂i is the estimated message bit —m̂i ∈ {0, 1}. Note that (2m̂i − 1) only gives
the sign to Λi ; its magnitude is provided by ρi .

After explaining the previous algorithm, it is important to remark the main idea of the
process. At a given time 0 ≤ i ≤ L − 1, the question to ask is: how reliable is the message
bit m̂_i? The extension for soft output indicates that the correctness of bit m̂_i can only be
as good as the decision to choose the most likely path over its "closest" competing path.
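The update rule of equation 3.17 can be sketched as follows; this is an illustrative software model, not the thesis' hardware, and the data layout (`surv_bits` and `comp_bits` indexed by time) is an assumption:

```python
def hr_sova_update(rel, surv_bits, comp_bits, delta, i, i_m):
    """One HR-SOVA update: compare the survivor's and competitor's
    message bits from the merge time i_m (exclusive) up to time i, and
    apply rho_j <- min(rho_j, Delta_{i,k}) wherever they disagree."""
    for j in range(i_m + 1, i + 1):
        if surv_bits[j] != comp_bits[j]:   # bits disagree on this branch
            rel[j] = min(rel[j], delta)    # equation 3.17
    return rel
```

Note that, following HR-SOVA, nothing is updated when the bits agree; that is exactly the case where BR-SOVA would apply equation 3.18 instead.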

3.2.3 Improving the soft output information of the SOVA algorithm.

The soft output generated by the HR-SOVA turns out to be overoptimistic [12]. This means
that the HR-SOVA algorithm produces an LLR that is greater in magnitude than the LLR
produced by the BR-SOVA or by the MAP algorithm. These overoptimistic LLR values
give HR-SOVA a worse performance in terms of BER.
In [12] two problems associated with the output of the HR-SOVA are described. One
is due to the correlation between extrinsic and intrinsic information when the HR-SOVA
is used in a turbo code scheme. The other problem is due to the fact that the output of
the HR-SOVA is biased. The first problem is not easy to solve, and most of the hardware
implementations do not deal with it. In contrast, for the second problem there have
been several proposals that are based on a normalization method. The idea behind a
normalization method can be shown by assuming that the output of the HR-SOVA, given
a message bit mi , is a random variable with a Gaussian distribution, then:
P[Λ_i | m_i = 1] = (1/(√(2π) σ_Λ)) exp( −(Λ_i − µ_Λ)² / (2σ_Λ²) ) dΛ_i ,    (3.20)

P[Λ_i | m_i = 0] = (1/(√(2π) σ_Λ)) exp( −(Λ_i + µ_Λ)² / (2σ_Λ²) ) dΛ_i ,    (3.21)
where µ_Λ is the expectation of Λ_i and σ_Λ = √( E[Λ_i²] − µ_Λ² ) is the standard deviation. In
order to find the LLR of the message bit m_i, given the output of the HR-SOVA, we can
define:

Λ′_i = ln( P[m_i = 1 | Λ_i] / P[m_i = 0 | Λ_i] ) ,    (3.22)

Using Bayes' theorem, assuming P[m_i = 1] = P[m_i = 0], and working on the previous
expression with 3.20 and 3.21, yields:

Λ′_i = (2µ_Λ / σ_Λ²) Λ_i ,    (3.23)

which indicates that the HR-SOVA output should be multiplied by the factor c = 2µ_Λ/σ_Λ²
to obtain the LLR.
The factor c, according to [12], depends on the BER of the decoder output. Some
schemes try to estimate the factor c while others set a fixed value for it. In our hardware
implementation we will use a fixed scaling factor, since [10] reports that the BER
performance with a fixed scaling factor is better than with a variable one.
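For illustration only, the factor c = 2µ_Λ/σ_Λ² could be estimated offline from a batch of HR-SOVA outputs and the corresponding known bits; the helper below is a hypothetical sketch (the actual design fixes c instead):

```python
def scaling_factor(llrs, bits):
    """Estimate c = 2*mu/sigma^2 from HR-SOVA outputs `llrs` whose true
    message bits `bits` are known (e.g. from a training sequence)."""
    # Fold the sign so every sample looks drawn from the m=1 Gaussian.
    folded = [l if b else -l for l, b in zip(llrs, bits)]
    mu = sum(folded) / len(folded)
    var = sum((x - mu) ** 2 for x in folded) / len(folded)  # E[Λ²] − µ²
    return 2 * mu / var
```

In hardware the multiplication by a fixed c is typically reduced to shifts and adds, which is one more reason to prefer a constant factor.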
Chapter 4

Hardware Implementation of a Turbo Decoder based on SOVA

In the previous chapter we introduced the general ideas of a turbo decoder and presented
the HR-SOVA algorithm —from now on we will refer to it simply as SOVA— as the active
part of the SISO unit. In this chapter we deal with the implementation issues and analyze
today's most commonly used hardware architectures. Next, we introduce a new algorithm
for finding points of the survival path and present the architecture that implements it. We
then describe the unit that updates the bit reliabilities and finally present the improvements
that allow the decoder to boost its BER performance.
Figure 4.1 presents the general scheme. There are two blocks of RAM used as input and
output buffers. Two more blocks of RAM store temporary data such as a priori and
extrinsic information. Then there is a unit that handles the interleaving process, a unit
to control the system and interact with the user, and finally the SISO unit that implements
the SOVA algorithm. Note that only one SISO unit is used. This is possible because the
interleaver/deinterleaver does not allow concurrent processing, so a frame has to be
completed by one decoder before it can be processed by the other. For the proposed
architecture, this processing is always done by the same physical decoder.

Data arriving at the receiver is processed and fed into the data-in RAM buffer; then
a start command is delivered to the control unit. The states the system goes through
are shown in figure 4.2. The system processes the interleaved data first and, at the last
iteration, ends up with the deinterleaved data. This is done in order to save an access
through the interleaver at the end of the decoding process, which saves power and allows
a simpler control unit. However, the system has to wait until the entire frame is received
before decoding can take place.
Even though the same unit is used as decoder 1 and decoder 0, its behavior changes
slightly, depending on the role the unit is playing. We can summarize the following tasks
for each role:

• SOVA unit is acting as decoder 1:



– When the SOVA unit addresses the data-in RAM buffer, its addresses belong to the
interleaved domain.
– Since its addresses belong to the interleaved domain, it has to go through the
deinterleaver in order to get systematic data.
– It can address "parity data 2" directly.
– If the first iteration is running, the a priori information is assumed to be 0.
Otherwise, it fetches the a priori information through the deinterleaver from RAM
La/Le.
– It writes extrinsic information directly to the RAM Le/La. This entails that,
when acting as decoder 0, it has to access the a priori information through the
interleaver.

• SOVA unit is acting as decoder 0:

– Its addresses belong to the deinterleaved domain, i.e. the domain where the
information bits are in order.
– It can access the systematic data and "parity data 1" directly from the data-in
RAM buffer.
– The a priori information is accessed through the interleaver, since each word was
written to an address of RAM Le/La that belongs to the interleaved domain.
– It writes extrinsic information directly to the RAM La/Le.
– It writes the hard output directly to the data-out RAM buffer. This can be done at
each iteration, allowing the user to check for a frame header, or only at the last
iteration with the aim of saving power.

Figure 4.1: Hardware implementation of a turbo decoder.



Figure 4.2: Overall system states diagram (Idle → Deco 1 → Deco 0, looping until the last iteration, then Done; the per-role tasks are those listed above).

4.1 Turbo Decoder RAM buffers.


All the RAM buffers are based on double port RAMs. The figure 4.3 shows the scheme
of data-in RAM. Since the systematic data and parity data 2 belong to different time
domains, two double port RAMs are used to store either information data. In figure 4.4
the scheme of the data-out RAM is shown. Finally figure 4.5 presents the RAM La/Le
and the RAM Le/La, which are equivalent.

Figure 4.3: Data-in RAM.

Figure 4.4: Data-out RAM.

Figure 4.5: RAM La/Le and RAM Le/La connections.

4.2 Interleaving/Deinterleaving unit of the turbo decoder


There have been several proposals to design an area-efficient interleaver. In [14],
contention-free interleavers that allow concurrent processing are studied. In our case, for
the sake of simplicity and versatility, a ROM is used to carry out the interleaving/deinterleaving
functions as look-up tables. Figure 4.6 shows the interleaving/deinterleaving unit along
with some control signals. The signal named "deco" indicates the role the SOVA unit
is playing. Note that when working with "deco=1", the address of the "parity data 2" is

Figure 4.6: Interleaving/Deinterleaving Unit.

delayed one cycle, while the address of the systematic data goes through the deinterleaver.
Also, the a priori data is fetched from the RAM La/Le and the extrinsic information is
written directly to the RAM Le/La. In contrast, when working with "deco=0", there is
no need to access the "parity data 2" RAM, since the "parity data 1" and the systematic
data are stored in the same RAM position. In this case, a priori information is accessed
through the interleaver and extrinsic information is written directly to the RAM La/Le.
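The look-up-table approach can be sketched as follows; the 4-entry permutation is purely illustrative, standing in for the actual ROM contents:

```python
# ROM-based interleaving as look-up tables (illustrative permutation).
PERM = [2, 0, 3, 1]                       # interleaver ROM contents
INV = [PERM.index(i) for i in range(4)]   # deinterleaver ROM contents

def interleave_addr(addr):
    return PERM[addr]                     # one ROM read per access

def deinterleave_addr(addr):
    return INV[addr]
```

Storing both tables in ROM trades area for a trivially simple address path: every interleaved or deinterleaved access costs exactly one extra memory read.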

4.3 SOVA as the core of the SISO.


Before getting into our hardware implementation of the SOVA algorithm, it is important
to comment on some of today's most commonly used hardware architectures.

Since the SOVA algorithm is an extension of the Viterbi Algorithm, most of the main
units are based on the implementations developed for the Viterbi Algorithm. These
architectures are complemented with reliability updating units to produce the soft output.
Figure 4.7 shows a comparison between Viterbi decoders and SOVA decoders. Both
decoders have a BMU (Branch Metric Unit), an ACSU (Add Compare Select Unit), and
an SMU (Survival Memory Unit). However, the SOVA ACSU has to provide the ∆
difference between path metrics, and the SMU includes an RUU (Reliability Updating
Unit) that provides the soft output information. In the following sections we will discuss
the issues related to the SOVA components.

Figure 4.7: Viterbi and SOVA decoder schemes.

4.4 Branch Metric Unit.


As its name suggests, this unit calculates the branch metrics. According to equation 3.14,
the possible branch metrics depend on the bits x_si, x_pi and m_i. When working with an
RSC encoder of rate 1/2, x_si = m_i and there is only one parity bit x_pi, which means that
there are four possible branch metrics at each time instant i:

• (xsi , xpi ) = (0, 0) → λ0 = 0

• (xsi , xpi ) = (0, 1) → λ1 = ypi

• (xsi , xpi ) = (1, 0) → λ2 = ysi + Lai

• (xsi , xpi ) = (1, 1) → λ3 = ysi + Lai + ypi

The BMU for an RSC encoder of rate 1/2 is shown in figure 4.8.


Figure 4.8: BMU for the RSC encoder.
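A software sketch of this BMU, mirroring the four metrics listed above (the shared adder models the hardware sharing of the y_si + La_i term):

```python
def bmu(ys, yp, La):
    """Branch metric unit sketch for the rate-1/2 RSC code: returns
    (λ0, λ1, λ2, λ3) for (xs, xp) = (0,0), (0,1), (1,0), (1,1)."""
    l2 = ys + La              # one adder, shared by λ2 and λ3
    return (0.0, yp, l2, l2 + yp)
```

Only two adders are needed per time instant, which is exactly the structure of figure 4.8.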

4.5 Add Compare Select Unit.


Applying equation 3.15 to the trellis diagram yields the following expression:

pm_{i,k} = pm_{i−1,k′} + λ(s_{i−1,k′} → s_{i,k})

where k is the next state of the predecessor k′ that produces the higher incoming path
metric. The previous expression suggests that the path metric pm_{i,k} can be obtained by
recursion. Figure 4.9 presents an ACSU for the SOVA unit.
The set of registers holds the previous path metrics. The branch metrics are mapped to
the corresponding adders according to the outputs during state transitions to produce the
incoming path metrics. Then these incoming path metrics are connected to the selectors,
which choose the higher incoming path metric and produce the decision vector along with
the ∆ difference between incoming path metrics. The connections between adders and
selectors represent the trellis butterfly.
One problem that might arise is the overflow of the path metrics after a certain amount
of time. Since the relevant information is the difference between path metrics, a
normalization method can be adopted. Many normalization methods have been proposed
since the introduction of Viterbi decoders. We find the modulo technique reported in [13]
to be a good solution, since it actually allows the overflow.
The idea behind the modulo technique is that the maximum difference ∆B between
path metrics at all states is bounded. Figure 4.10 shows the mapping of all the numbers
representable by the nb-bit path metric register onto a circumference.

Let ipm′_{i,k} and ipm_{i,k} be two incoming path metrics at a given time i and state k; then
it is shown in [13] that ipm′_{i,k} > ipm_{i,k} if ipm′_{i,k} − ipm_{i,k} > 0 in a two's-complement
representation. The number of bits nb relates to the bound as follows:

C = 2^{nb} = 2∆B

Figure 4.9: Add Compare Select Unit for the SOVA. P_fb = [111], P_g = [101].

This means that, even though the path metrics may grow in different ways, they all remain
within half of the representation space provided by C. An appropriate bound is ∆B = 2nB,
where n is the minimum number of stages that ensures complete connectivity among
all trellis states and B is the upper bound on the branch metrics [13].
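A minimal sketch of the modulo comparison, assuming an nb-bit register (NB = 8 here is an arbitrary example width):

```python
NB = 8                                    # path metric register width
MOD = 1 << NB

def modulo_greater(ipm_a, ipm_b):
    """Modulo-normalization comparison [13] (sketch): path metrics are
    kept modulo 2^nb and compared via a two's-complement subtraction,
    valid while the true difference stays below 2^(nb-1)."""
    diff = (ipm_a - ipm_b) % MOD          # nb-bit wrap-around subtraction
    if diff >= MOD // 2:                  # reinterpret as two's complement
        diff -= MOD
    return diff > 0
```

For instance, a metric that wrapped from 261 to 5 still compares as greater than 250, so the ACSU keeps selecting correctly through the overflow.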

Figure 4.10: Modular representation of the path metrics. Each path metric register has a width of nb bits.

Figure 4.11: Merging of paths in the traceback.

4.6 Survival Memory Unit.

The remaining SOVA units must obtain the soft output information for every bit in the
frame, along with the maximum likelihood path. One way to do so is to store all the data
the ACSU provides; then, when the last time instant is reached, the data is traced back
and the bit reliabilities are updated according to the SOVA algorithm. However, most
hardware architectures do not do it that way, because the latency is high and the amount
of memory grows considerably with the frame size, the number of states of the encoder
and the quantization width of ∆_{i,k}.

Most SMUs take advantage of a trellis property to solve this problem. This property is
illustrated in figure 4.11, where a trellis diagram from a decoding process is shown. If all
the paths are traced back from all the states at a given time i, it is found that they merge
at time instant i_FP. Therefore, from time instant i_FP down to i = 1, the only path
remaining in the trace

Figure 4.12: Register Exchange SMU for the SOVA. P_fb = [111], P_g = [101].

back started at time i, is the survival path. We define the time instant, along with the
state, where the paths merge as an FP (Fusion Point). Then, looking at the example of
figure 4.11, for time instant i there is an FP at (i_FP, s3). Simulations have shown that
the distance between the time instant i and the FP i_FP is a random variable. It is also
observed that the probability of the paths merging increases with the depth of the trace
back and is proportional to the constraint length of the code. Hence, a trace back depth
of 10 times the constraint length of the code might allow the paths to merge with high
probability. Below we describe the most widely used architectures based on this property.

4.6.1 Register Exchange Survival Memory Unit.

The RE (Register Exchange) SMU for an RSC encoder of rate 1/2 is shown in Figure 4.12.
This scheme is reported in [9]. It is an array of PEs (Processing Elements) with n rows
and D columns, where n is the number of states of the encoder and D is the trace back
depth. The connection topology between PEs is given by the trellis of the encoder. In
Figure 4.12, two types of PEs can be distinguished. The first U PEs (red outline),
besides tracing back the paths, update the bit reliabilities. Figure 4.13.a shows a PE
with updating capability, and Figure 4.13.b shows a normal PE. The system allows the
trace back of all the paths from the states at time instant i. The ACSU provides the
data that enters the RE from the left. The first U units update the bit reliabilities
of each path according to the SOVA algorithm. Each row of the array holds the
information of one path: the first row holds the information of the path traced back
from state 0 at time i, the second row that of the path traced back from state 1, and
so on. After D clock cycles, if D is large enough to allow the paths to merge, the
message bit and its reliability are obtained. Note that if the paths merge before D,
the data coming out of the rows is the same for all states, since the tails of all the
paths belong to the survival path; therefore only the data from one row needs to be
selected.
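The register-exchange idea can be sketched in software. The predecessor table PRED and the per-state message bits BIT below are illustrative assumptions for a generic 4-state butterfly, not the exact trellis of this encoder, and only the trace back part (no reliability updating) is modelled:

```python
PRED = {0: (0, 1), 1: (2, 3), 2: (0, 1), 3: (2, 3)}   # assumed butterfly topology
BIT = {0: 0, 1: 0, 2: 1, 3: 1}                         # assumed bit per arriving state

def register_exchange(decisions):
    """decisions[i][s]: ACSU decision bit of state s at step i.
    Row s holds the message bits of the survivor ending in state s; each
    step every row is rebuilt from the row of its selected predecessor
    plus the newly decided bit, exactly like the PE array of Figure 4.12."""
    rows = {s: [] for s in PRED}
    for v in decisions:
        rows = {s: rows[PRED[s][v[s]]] + [BIT[s]] for s in PRED}
    return rows   # after enough steps, all rows share the same tail
```

After D steps the oldest column is identical in every row whenever the paths have merged, which is why the hardware only needs to select one row at the output.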
Parameters U and D represent a trade-off. Some architectures set U in a range from two
to five times the constraint length of the code, while D is set between five and ten
times the constraint length. If U and D are large, the BER performance improves, but so
do the power consumption and the area. The area increase is also due to the resources
spent on the connections, which becomes a serious problem as the number of states of
the encoder grows. If U is large and D is not, resources are wasted since the BER
performance does not improve; the same happens if D is large while U is short, or when
both are short. The decoding latency of this scheme is D clock cycles and, as can be
observed, the pipelined style of the architecture suggests high activity and hence a
relatively high dynamic power consumption.

(a) PE with updating capability. (b) Normal PE. Trace back only.

Figure 4.13: Register Exchange processing elements.

4.6.2 Systolic Array Survival Memory Unit.

The RE scheme presents one major problem that leads to a high power consumption: all
the paths are traced back D steps. The idea behind the SA (Systolic Array) is to trace
back only one path; after D steps this path will have merged with the survival path and
will become the path we are looking for. The SA is presented in [15].
Figure 4.14.a introduces the scheme of the SA for an RSC encoder of rate 1/2 with four
states in the encoder. The figure only shows the SMU for the VA. It is composed of an
array of elements arranged in n rows and 2D columns, where n is the number of states of
the encoder and D is the depth of the trace back. There is also one more row, with D TB
(trace back) elements, which holds the sequence of the states belonging to the survival
path. It can be observed that the connections between the elements in the array are
much simpler than in the RE scheme.

(a) Systolic Array for the Viterbi Algorithm. (b) Trace Back element of the Systolic
Array.

Figure 4.14: Systolic Array for the Viterbi Algorithm.
The system works as follows: the selection unit feeds the decision bits vi,k provided
by the ACSU into the left of the array. After D clock cycles, the SA is half full and
the selection unit begins to feed the state si,k with the highest path metric
accumulated in the ACSU registers into the leftmost TB element. The system also works
if the selection unit feeds any other state; however, the state with the highest path
metric is more likely to belong to the survival path. Once the most likely state is
fed, the TB elements, along with the decision vectors, trace back that state for D more
cycles. Figure 4.14.b shows the details of the TB cell. Finally, after 2D cycles, the
SU (Survival Unit) of Figure 4.15 provides the most likely message bit. Note that the
latency of this scheme is twice that of the RE scheme; however, the trace back depth is
only D. Note also that this structure suggests high activity and a relatively high
dynamic power consumption.
So far the SA deals only with the VA. The SOVA extension of the SA presents some major
problems that were cited in [6]:

• SOVA requires path metrics differences for every state,

• trace back must occur on two paths (survivor and competitor),


• each state must have access to all the information about the path metric differences
and decision vectors for that particular time instant.

These issues make the SA a poor choice for a complete SOVA based decoder. However, the
SA has been used in [17] as a reliability updating unit in a Two Step configuration.

Figure 4.15: Survival unit for the Systolic Array.

Figure 4.16: Two Step idea. First tracing back, and then reliability updating.

4.6.3 Two Step approach for the Survival Memory Unit.

This scheme was proposed in [9] with the intention of discarding all the operations
that do not affect the output. The idea is to postpone the updating process until the
survival path is found. Figure 4.16 shows this concept. The first D steps intend to
find the survival path, while the remaining U steps update the bit reliabilities. A
FIFO (First In, First Out) memory is usually employed to delay the path metric
differences, along with the decision vectors, until the updating process begins. The
SMU we propose in this document is actually a Two Step configuration; however, we
introduce a new scheme for finding the survival path.
Figure 4.17: Fusion Points based SMU.

4.6.4 Other Architectures.

Many architectures and schemes have been proposed in recent years. In [4], different
SMUs for the VA are studied and compared. In [6], a trace back architecture based on an
orthogonal memory is presented. However, all these schemes deal with a finite trace
back depth D and with a finite updating length U, which leads to a non-optimum
algorithm execution. In the next subsection we introduce a new architecture for the
SOVA algorithm that does not depend on the D-U trade-off.

4.6.5 Fusion Points Survival Memory Unit.

So far, two of the most common schemes have been studied: the RE and the SA. Both of
them carry out the trace back with the aid of a pipeline architecture whose size has an
impact on the area, the power consumption and the BER performance. One of the
contributions of this work is a new algorithm and the development of the architecture
that implements it. The major advantage of this new scheme is that it is independent of
the D-U trade-offs and that it allows recursive processing, which lessens the register
activity.
The new architecture we propose to implement the SOVA algorithm deals, as its name
suggests, with the Fusion Points. Figure 4.17 shows the general scheme. It consists of
a FPU (Fusion Point Unit), which finds the time instants and the states where the
survival paths merge; it is inside this unit where the new algorithm is implemented.
There is a dual port RAM to store the data the ACSU provides, and finally there is a
RUU that updates the bit reliabilities based on the information provided by the FPU.
The unit works as follows: the data the ACSU provides is stored in the dual port RAM,
and the decision bits vi,k are also used by the FPU to implement the FP search
algorithm.
Figure 4.18: Possibility of fusion points.

Whenever a FP is found, it is indicated to the RUU, which updates the bit reliabilities
by tracing back, aided by the data fetched through the second port of the dual port
RAM.

4.6.5.1 Fusion Points Unit

This unit finds the Fusion Points along the trellis for a code with rate 1/2 by means
of a new algorithm¹. The algorithm is based on the idea that a fusion point for a
rate-1/2 code will always reside at the merging point of two paths. Figure 4.18 shows
these possible fusion points. The following reasoning explains the previous idea:
whenever a trace back operation takes place, the system traces back from a given time
instant i; while tracing back, paths at different time instants merge in groups of two.
The last of these two-path merging points is a Fusion Point. Therefore a FP will always
reside at the merging point of two paths.
The following steps, along with the example of Figure 4.19, introduce part of the
algorithm:

• Decision vectors coming from the ACSU are used to identify the merging paths, or
possible fusion points (Figure 4.19.a).

• Each possible fusion point is marked. Whenever a mark is set, the mark time and state
are held in registers (Figure 4.19.a).

• The mark is propagated along the branches to the next states (Figure 4.19.b).

• The mark is propagated at every clock cycle.

• If a mark propagates to all the states at a given time, then the origin of that mark
is a fusion point. The fusion point coordinate is held by the register and can be
recalled immediately (Figure 4.19.c).
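The detection-and-propagation steps above can be sketched in software. The predecessor table PRED below is an illustrative 4-state butterfly, not necessarily the exact trellis of this encoder, and each mark is kept as the set of states it currently covers:

```python
N = 4                                                  # number of trellis states
PRED = {0: (0, 1), 1: (2, 3), 2: (0, 1), 3: (2, 3)}    # assumed butterfly topology

def fp_search(decisions):
    """decisions[i][s]: decision bit of state s at step i (from the ACSU).
    Yields each fusion point as its origin coordinate (time, state)."""
    marks = []                                         # list of (origin, coverage)
    for i, v in enumerate(decisions):
        surv = {s: PRED[s][v[s]] for s in range(N)}    # survivor predecessor of s
        # propagate the stored marks along the survivor branches
        marks = [(org, {s for s in range(N) if surv[s] in cov})
                 for org, cov in marks]
        # detect new possible FPs: a predecessor chosen by two states
        for p in set(surv.values()):
            succ = {s for s in range(N) if surv[s] == p}
            if len(succ) == 2:
                marks.append(((i, p), succ))
        # apply the rules
        kept = {}
        for org, cov in marks:
            if len(cov) == N:                          # covers every state: FP found
                yield org
            elif len(cov) > 1:                         # single-state marks are freed
                key = frozenset(cov)
                if key not in kept or org > kept[key]: # coinciding marks: latest wins
                    kept[key] = org
        marks = [(org, set(key)) for key, org in kept.items()]
```

The two pruning rules of the text appear directly: a mark whose coverage shrinks to one state is dropped, and two marks with identical coverage keep only the one with the latest origin.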

¹We develop the algorithm for a code of rate 1/2; however, it can be extended to any
code rate.

(a) Possible fusion point detection. (b) Mark propagation. (c) Mark propagation and
fusion point detection.

Figure 4.19: Fusion Point detection algorithm.

After introducing the mark movements, Figure 4.20 shows a sequence example where more
than one mark is handled at the same time. Two columns can be appreciated in the
figure. The left column indicates the time instant the system is processing, along with
the system status. The status is composed of three pointers that hold the times and
states of FPs: the first two pointers hold the possible FPs detected, while the third
pointer
Figure 4.20: Sequence of the Fusion Point algorithm. The system status at each time
instant is:

    i    Pointer 0    Pointer 1    FP
    0    (0, s0)      (-, -)       (-, -)
    1    (1, s0)      (1, s2)      (0, s0), detected at i = 2
    2    (1, s0)      (2, s2)      (-, -)
    3    (1, s0)      (3, s0)      (-, -)
    4    (4, s0)      (4, s2)      (3, s0), detected at i = 5
    5    (4, s0)      (4, s2)      (-, -)

indicates a FP. The right column shows the sequence from time i = 0 to time i = 5.
The algorithm proceeds as follows:

• i = 0, a possible fusion point is detected at (0, s0). A green mark is set and is
propagated to the states (1, s0), (1, s3). Its coordinate (0, s0) is held in the
pointer 0 register.

• i = 1, a possible fusion point is detected at (1, s0). A blue mark is set and is
propagated to the states (2, s0), (2, s3). Another possible fusion point is detected at
(1, s2); a fuchsia mark is set and is propagated to the states (2, s1), (2, s3). Since
the green mark propagates to all the states at i = 2, its origin becomes a fusion
point. The fusion point register is set with the data of pointer 0, which holds the
coordinate of the green mark, and pointer 0 is freed. A green straight line across all
the states at time i = 2 indicates the time the FP is detected; note that even though
the actual time instant is i = 1, the detection line of the FP is at i = 2. Before
moving to the next time instant, the coordinates of the blue and fuchsia marks are
stored in the pointer 0 and pointer 1 registers respectively. Note that the pointer 0
register was free when the fusion point was detected.

• i = 2, a possible fusion point is detected at (2, s0). A red mark is set and is
propagated to the states (3, s1), (3, s3). The fuchsia mark is propagated to the state
(3, s2) only, and its pointer is freed; the reason will be explained later. The blue
mark propagates to the states (3, s0), (3, s1), (3, s3). Before moving to the next time
instant, the coordinate of the red mark is stored in pointer 1, since it is the only
free pointer available.

• i = 3, a possible fusion point is detected at (3, s0). A yellow mark is set and is
propagated to the states (4, s0), (4, s2). The red mark is propagated to the state
(4, s3) only, and its pointer is freed for the same reason as the fuchsia mark pointer
in the previous instant. The blue mark is propagated to (4, s0), (4, s2), (4, s3).
Before moving to the next time instant, the coordinate of the yellow mark is stored in
pointer 1, since it is the only free pointer available.

• i = 4, a possible fusion point is detected at (4, s0). A turquoise mark is set and is
propagated to the states (5, s0), (5, s2). Another possible fusion point is detected at
(4, s2); a brown mark is set and is propagated to the states (5, s1), (5, s3). Both the
blue mark and the yellow mark propagate to all the states at i = 5, which means that
the origin of the blue mark and the origin of the yellow mark are both fusion points.
However, the point we are looking for is the FP closest to the time being processed,
which in this case is the origin of the yellow mark at (3, s0). The reason lies in the
definition of a FP: if the system traces back from time i = 5, it finds that all paths
merge at (3, s0), so (3, s0) represents the point where all paths merge in a trace back
operation from i = 5. The point (1, s0), corresponding to the origin of the blue mark,
belongs to the survival path, but it does not represent a merging point for a trace
back operation that starts at time i = 5. We can extend the previous thought in the
following way: suppose that two marks propagate to the same states. Then, in the
future, their propagations will always be the same, and they will have the same
possibilities of becoming fusion points. However, the FP closest to the time being
processed is the true FP, so it is not necessary to propagate and process the behavior
of both marks: the relevant mark is the one with the origin closest to the time being
processed. Therefore, we can state the following rule:

    whenever two marks coincide, the one with the latest origin is kept.

Finally, before getting back to the algorithm, it is time to explain why the red mark
pointer and the fuchsia mark pointer were freed in the previous steps. We saw that
either mark propagated to only one state. Therefore, if the system keeps propagating
those marks, in the best case they will coincide in the future with a possible fusion
point, and whenever two marks coincide the one with the latest origin is kept; in that
case the mark to be kept is the future possible fusion point. Summarizing, this last
rule becomes:

    whenever a mark propagates to only one state, it has no chance of becoming a FP in
    the future, and its pointer can be freed.

Now that we have set the main ideas and rules, we return to the algorithm. The fusion
point register is set with the pointer 1 data. Pointer 1 and pointer 0 are freed, and
the coordinates of the turquoise mark and the brown mark are stored in them.

• i = 5, the algorithm is executed, but there are no possible fusion point detections,
only mark propagations.

Figure 4.21: FPU architecture for a code with constraint length K = 3.

Figure 4.21 presents a design of the FPU for a code with constraint length K = 3. It
consists of a Mark Detection unit, which uses the decision bits vi,k provided by the
ACSU to detect possible FPs according to the trellis butterfly. There is a Mark
Propagation block, which propagates the new marks and the stored marks along the
trellis. There is a processing unit, which compares all the marks at the input and
proceeds as follows:

• if there are two equal marks, then the one with the latest address is kept.

• if there is a mark with only one bit set, then its corresponding register is freed, since
it has no chance to become a FP in the future.

• if there is a mark with all bits set, then a FP is indicated with its address and state.

Finally there is a set of registers used to hold marks, addresses, and state codes.
It is important to point out some major concerns:

• The algorithm can be computed by recursion.

• There are at most n/2 new possible FPs at each time instant, where n is the number of
states of the encoder.

• Simulations have shown that for an RSC encoder of rate 1/2 with n = 2^(K-1) states,
the amount of registers the FPU needs is:

  – n − 2 registers of n + 1 bits to hold marks (the remaining bit is used to indicate
    whether the register is empty),
  – n − 2 registers of K − 1 bits to hold state codes,
  – n − 2 registers of A bits to hold addresses, where A is the number of bits used to
    code the frame size.

• Since the processing unit compares all the marks at the same time to see if there are
equal marks, the number of XOR gates increases drastically with the constraint length
of the code. However, it has been observed that Turbo Code schemes built on encoders
with a short constraint length have better BER performance than those with a large
constraint length [18].

Comparing our approach with the previous implementations, we obtain the results of
Table 4.1 for an RSC code of rate 1/2, K = 3 and a message frame size of 1024. For a
code with constraint length K = 3, a frame size 2^A = 2^10 = 1024 bits, and a trace
back depth of D = 5 ∗ K, the RE SMU needs (5 ∗ 3) ∗ 4 = 60 one-bit registers, while the
FPU needs (4 − 2) ∗ (4 + 1) + (4 − 2) ∗ 2 + (4 − 2) ∗ 10 = 34 one-bit registers. Also,
the FPU will always find the correct FP, while the RE SMU might produce wrong results
if the paths do not merge within the trace back pipeline. Another difference is that
the RE outputs the symbol sequence of the survival path, while the FPU outputs the
sequence of FPs, which are spread along the trellis. However, in a turbo code context,
the RUU may take advantage of these FPs, as we will show in the next subsection.
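The register counts above can be reproduced with a few lines of arithmetic; the formulas follow the text (the RE SMU uses one one-bit register per PE, hence D * n bits):

```python
# Register-count comparison of Table 4.1, reproduced numerically from the
# formulas in the text: the RE SMU needs D * n one-bit registers, while the
# FPU needs (n-2)*(n+1) + (n-2)*(K-1) + (n-2)*A.

K, A = 3, 10              # constraint length, address width (frame size 2^A)
n = 2 ** (K - 1)          # number of encoder states
D = 5 * K                 # trace back depth assumed for the RE SMU

re_bits = D * n
fpu_bits = (n - 2) * (n + 1) + (n - 2) * (K - 1) + (n - 2) * A

print(re_bits, fpu_bits)  # 60 34
```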
Observation          REU                               FPU
One-bit registers    60                                34
Reliability          Depends on the trace back depth   Optimum
Output rate          One state per clock cycle         Random

Table 4.1: Comparison between the REU and the FPU for a code with rate 1/2, K = 3 and a
frame size 2^A = 2^10 = 1024.

Figure 4.22: Reliability updating problem. The reliability of the bits at i = 2 and
i = 3 could depend on ∆4,0 or ∆4,2.

4.7 Fusion Points based Reliability Updating Unit.

Before getting into the hardware issues, it is important to highlight the main problem
we face when updating the bit reliabilities. Figure 4.22 illustrates it with an
example. While processing data at time instant i = 4, a FP is found at (3, s0), colored
in green. The figure shows the survival path and the competing path traced back from
the FP until they merge. The blue branches indicate possible future branches of the
survival path, while the red paths indicate possible competing paths in the future. The
RUU could start to update the bit reliabilities as soon as a FP is detected. However,
Figure 4.22 shows how the reliability of bits i = 2, i = 3 might depend on ∆4,0 or
∆4,2. The early release of those bit reliabilities leads to a non-optimum SOVA
algorithm execution.
One solution to this problem is illustrated in Figure 4.23. The idea is to trace back U
steps to allow all the competing paths that start after time i to merge. After U steps,
the remaining bit reliabilities could be released. However, this solution introduces
the U factor, which is a trade-off between BER performance and power consumption. It
has no impact on the area since, as we will show later, the bit reliabilities are
updated recursively. In any case, the introduction of the U factor leads to a
non-optimum SOVA algorithm execution.
The solution we adopted is introduced by the example of Figure 4.24. By time i, two FPs
have been detected. Since the second FP resides after the detection line of the first
one, the updating process takes place starting from the second FP. Once the first FP is
reached, the system continues updating and releasing the bit reliabilities. The fact
that
Figure 4.23: One possible solution to the problem of bit reliability releasing.

Figure 4.24: Solution adopted for the bit reliability releasing problem. Bit
reliabilities before iFP1 will not be affected by future competing paths and can
therefore be released; paths traced back from any instant after iDFP1 will merge at
iFP1.

the second FP needs to reside after the detection line of the first one is due to the
concept that any path traced back from after the detection line will merge at the FP of
that detection line. Therefore, any future competing path of the survival path will
merge at most at the first FP, and will not affect the bit reliabilities before the
first FP.
We can generalize this solution in an algorithm as follows:

1. Wait for the first FP provided by the FPU.

2. Wait for the second FP.

3. If the second FP is detected after the detection line of the first one, proceed with
the updating process.

4. If the second FP is detected before the detection line of the first FP, wait for one
more FP:

   – If the third FP resides after the detection line of the second FP, proceed to the
     updating process with the information of the second and third FPs.
   – If the third FP does not reside after the detection line of the second FP, but
     does after the detection line of the first one, the updating process proceeds with
     the information of the first and third FPs.
   – If the third FP does not reside after the detection line of either of the other
     two FPs, it is discarded and the RUU continues from step 4.

5. When the updating process finishes, the last FP becomes the first FP, and the
process is repeated from step 2.

6. If the end of the frame is reached by the ACSU, the RUU is interrupted and it begins
to update the bit reliabilities from the end.
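As a behavioural sketch (an assumption about the control flow, not the thesis RTL), the FP-pair selection above can be written as a small generator that emits the (start, stop) FP pairs delimiting each updating run, with each FP modelled as a tuple (time, state, detection_line):

```python
def ruu_schedule(fps):
    """fps: iterable of (time, state, detection_line) tuples in detection
    order.  Yields (first_fp, next_fp) pairs over which the RUU updates."""
    fps = iter(fps)
    first = next(fps)            # wait for the first FP
    second = None
    for cand in fps:             # wait for the next FP
        if second is not None and cand[0] > second[2]:
            yield second, cand   # candidate lies after second's detection line
            first, second = cand, None
        elif cand[0] > first[2]:
            yield first, cand    # candidate lies after first's detection line
            first, second = cand, None
        elif second is None:
            second = cand        # keep it and wait for one more FP
        # otherwise the candidate lies after neither detection line: discard
```

Each yielded pair marks a run in which the RUU traces back from the later FP, updating only until it reaches the earlier one and releasing the bits before it.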

Figure 4.25.a presents the RUU general scheme. There is a state machine which controls
the unit and carries out the previous algorithm. The registers at the left of the
figure hold the FP state codes, FP addresses and FP detection lines, which are used to
address the RAM block and to control the updating process. The lastState unit
calculates the previous state in the trellis based on the current state and the
decision bit for that state; this unit is actually doing the trace back of the survival
path, one step per clock cycle. The current state is used to drive the multiplexers
that select the message bit associated with the survival path and the difference ∆
between the metric of the survival path and that of a competing path. These elements
are fed into the recursive updating unit, which calculates the reliability magnitude of
the bits, ρi.
and ∆i,k . This term is equivalent to:

Lepi = ysi + Lai


which is used to calculate the final extrinsic information Lei :

Lei = Λi − Lc ysi − Lai


Lei = Λi − (ysi + Lai )
Lei = Λi − Lepi

The term Lepi is calculated when ysi and Lai are available at the time of the branch
metrics, because it saves clock cycles at the time of computing Lei. Not doing it at
that time would require accessing the data-in RAM buffer and the RAM La/Le-Le/La again;
besides, the access would have to be done through the
interleaving/deinterleaving unit, which might be in use. The calculation of Lei is done
in the following way: the recursive unit outputs ρi, which is actually the magnitude of
Λi, and the bit mi gives the sign of Λi. Since a two's complement representation is
used, the bit mi indicates whether to complement ρi or not. Then we have:

Lei = ρi + (0 − Lepi)         mi = 1
Lei = not(ρi) + (1 − Lepi)    mi = 0

The operation in parentheses is done first and its result is delayed until ρi comes out
of the recursive unit, which allows the combinational delays to be distributed among
the registers. The resulting Lei is stored in the RAM La/Le-Le/La depending on the
decoder.
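The two's-complement identity behind the two cases above, −x = not(x) + 1, can be checked numerically against plain arithmetic; the word width NB below is an arbitrary illustrative choice:

```python
NB = 8                        # illustrative word width
MASK = (1 << NB) - 1

def le_datapath(rho, lep, m):
    """Le = Λ - Lep with Λ = ±rho, formed as in the text's two cases."""
    if m == 1:
        return (rho + (0 - lep)) & MASK
    return ((~rho & MASK) + (1 - lep)) & MASK   # not(rho) + 1 gives -rho

def le_reference(rho, lep, m):
    """Reference: plain arithmetic reduced to the same NB-bit word."""
    lam = rho if m == 1 else -rho
    return (lam - lep) & MASK

for rho in range(32):
    for lep in range(-16, 16):
        for m in (0, 1):
            assert le_datapath(rho, lep, m) == le_reference(rho, lep, m)
```

Splitting the "+1" of the negation into the precomputed parenthesis is what lets the hardware fold the complement of ρi into a single adder stage.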
The recursive updating unit is shown in Figure 4.26. This unit updates the bit
reliabilities by managing all the competing paths at once. In the scheme there is a set
of registers that holds the different ∆ for each state. These ∆ are propagated to the
corresponding
Figure 4.25: Fusion Points based Reliability Updating Unit.

Figure 4.26: Recursive Updating Unit.

Figure 4.27: Recursive Updating Process.

previous states by the pair of multiplexers and the reverse trellis connection
topology, according to the trellis decision vector. The movement of the ∆ is actually
the trace back of the competing paths; its similarity to the recursive procedure of the
ACSU can be observed. Whenever two competing paths merge, the one with the minimum ∆ is
kept. At each stage, the decision bits, along with the estimated message bit, are used
to drive the multiplexers that select the relevant ∆. The minimum ∆ among these
relevant ∆ is the resulting bit reliability.
In order to clarify how the recursive unit works, we will introduce the example of
Figure 4.27. The set of registers from Figure 4.26 will hold the colored ∆ from Figure
4.27. When the updating process is launched, the registers are set to ∆MAX.

• The unit begins at the time instant i = 10. The orange ∆ is fed into the system
through the multiplexer of state 1. At the same time, a minimizing process is started
with this orange ∆ and the remaining ∆ of the registers. The orange ∆ is sent to
state 3.

• At time i = 9, the blue ∆ is fed into the system through the multiplexer of state 2.
The orange ∆ from state 3 and the blue ∆ from state 2 participate in the minimizing
process. The blue ∆ is sent to state 1, while the orange ∆ is sent to state 3 again.

• At time i = 8, the fuchsia ∆ is fed into the system through the multiplexer of state
0. Now there are three ∆ participating in the minimizing process. Finally, the orange,
blue and fuchsia ∆ are sent to state 2, state 3 and state 0 respectively.

• The remaining steps are much the same. Note that at time i = 6, at state 2, two
competing paths merge. For this example the blue ∆ is assumed to be less than the
turquoise ∆, and that is why it is kept.
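The quantity this unit computes is the HR-SOVA reliability rule. Written out directly in software, in the textbook form of the rule rather than the register-level datapath of Figure 4.26:

```python
DELTA_MAX = float("inf")   # the hardware uses a saturated maximum instead

def hr_sova_reliabilities(deltas, disagree):
    """deltas[j]: metric difference at merge point j of the survival path.
    disagree[j][i]: True if the competitor merging at j differs from the
    survivor in message bit i (i <= j).  Returns the reliability of each
    bit: the minimum delta over the competitors that disagree on it."""
    n = len(deltas)
    rho = [DELTA_MAX] * n
    for j in range(n):
        for i in range(j + 1):
            if disagree[j][i]:
                rho[i] = min(rho[i], deltas[j])
    return rho
```

The recursive unit reaches the same minima without storing whole paths: moving the ∆ registers backwards through the trellis and merging them by min is exactly this double loop, folded into one pass.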

Before moving into the next section it is important to talk about some throughput issues.
In figure 4.24 we can see that the unit RUU updates some distance before it can release
the final bit reliabilities. Therefore if we think of the time distance between fusion points
as a random variable with a mean D̄. Then the RUU processes 2D̄ time instants for each
FP detected by the FPU. This means, that the FIFO input data rate will be higher than
FIFO output data rate and the FIFO will get full. If the FIFO gets full, then the RUU
misses some FPs, however, this is not as bad as is seems, since the algorithm that manages
the FPs is still valid.
Let DR denote the number of bits remaining to be updated when the ACSU unit reaches
the end of the frame. Then the throughput of the SOVA SISO can be estimated by

THSISO = [L / (L + DR)] · f  [bps]    (4.1)
where L is the frame size and f is the frequency of the system. It is straightforward
that, if we want to increase the throughput of the system, DR should be reduced. This can
be achieved by increasing the working frequency of the RUU so that it processes more FPs
per time unit; then, at the end of the frame, fewer bits remain to be updated.

4.8 Control Unit

We finally present the design of the control unit, which is basically a finite state machine
that delays and synchronizes modules. Figure 4.28 shows the scheme. There are two
counters: one is responsible for the frame address count, and the other for the
iteration count. The iteration counter is first loaded with the number of iterations
that the user indicates. Figure 4.29 shows the state diagram that the entire system goes
through. Once the user drives the go signal high, the system begins to work. It first
initializes the units and progressively activates the corresponding modules before settling

[Figure: block diagram with an iteration counter (loaded with niters, asserting Iters Finished at zero), a bit counter compared against the Frame Length (asserting Frame Finished), and the state machine.]
Figure 4.28: Control Unit General Scheme.

[Figure: state diagram with states Idle, Initializing Modules and Decoding; transitions on go, Frame Finished and Finishing/Iters Finished.]
Figure 4.29: Control Unit State Diagram.

down in the decoding state. Once the end of the decoding process is reached, the system
checks whether any iterations remain.

4.9 Improvements
The most common implementations of the SOVA decoder only update bit reliabilities
by the HR-SOVA rule that was described in 3.2.2. A BR-SOVA updating rule would
be desirable, since it has been proved in [5] that the max-log-map algorithm and BR-SOVA
are equivalent and that the max-log-map algorithm performs better in terms of BER than
the HR-SOVA. However, the BR-SOVA updating rule requires knowledge of the bit
reliabilities of the competing paths, which implies a higher complexity in the decoder.
This is why we do not implement a strict BR-SOVA; instead, we approximate its behavior
by introducing a bound for the bit reliability of the competing path, as shown below.
The BR-SOVA and HR-SOVA updating rules are the same when the estimated bit
and the competing bit are different. In contrast, the following equations
recall the updating rules for each algorithm when the estimated bit and the competing bit
coincide.

ρBRj ⇐ min(ρj , ∆i,k + ρcj)
ρHRj ⇐ ρj       (4.2)

If we assume ρcj = ∞, equation 4.2 can be rewritten as:

ρj ⇐ min(ρj , ∆i,k + ρcj)
That is why we can think of the HR-SOVA as a BR-SOVA with an unbounded ρcj . The
improvement proposed in this work is to bound ρcj to a known value. When working
with an RSC binary code, the two incoming branches, at any state of a trellis diagram,
are associated with different message bits. Therefore, the ∆ difference between the path
metrics is actually a bound for the reliability of those message bits. The resulting updating
rule becomes:

ρj ⇐ min(ρj , ∆i,k)         if m̂i ≠ ci
ρj ⇐ min(ρj , ∆i,k + ∆cj)   if m̂i = ci

where ρj is the reliability of bit j; ∆i,k is the path metric difference between the competing
path and the survival path; m̂i is the estimated message bit; ci is the message bit
associated with the competing path; and finally ∆cj is the path metric difference,
at the state at time j, that belongs to the competing path.
Figures 4.30 and 4.31 show the modified RUU and the Recursive Updating Unit respec-
tively. They allow the previous rule to be executed. Note that the main difference is the
handling of all the ∆, since they represent the bounds for the competing bit reliabilities.
[Figure: modified RUU datapath with the SMU RAM, recursive-update state machine, down counter and delay lines; all ∆ are kept available for the recursive update.]

Figure 4.30: Reliability Updating Unit with BR-SOVA approximation

[Figure: recursive-update datapath that traces back all the competing paths and computes the minimum ∆; each ∆i+1,k is the bound for the competing reliability ρc i+1.]

Figure 4.31: Recursive Update with BR-SOVA approximation


Chapter 5

Methodology

The whole practical design process was carried out with the aid of powerful software tools.
Three main tools were employed in this thesis:

• Matlab 7.1. The mathematics software package Matlab was extensively used in the
simulation and verification of the design. It was employed to model the whole com-
munication system: encoder, channel, receiver and decoder. We also used Matlab
for the HIL (Hardware In the Loop) verification of the design. It was carried out
by establishing a serial port communication with an interface circuit specifically
developed for testing purposes.

• Xilinx ISE 8.2. The synthesis software package of Xilinx, ISE 8.2, was used in
all the tasks related to the implementation, specifically the mapping, translation,
placement and routing, along with the back annotation and the static timing analysis.
The FPGA programmer iMPACT is also included in this package; it was used to
download our design into the Xilinx Spartan III FPGA.

• ModelSim 6.1. VHDL code and Post-Place and Route models were simulated with
this tool.

Figure 5.1 summarizes the work flow. Five stages took place, with some feedback
between them. On the rightmost part we have the fundamental stages of this process,
whereas on the leftmost part the verification tasks associated with each stage are displayed.
The blue boxes show the main tool employed in the related task. We now give a description
of the stages of the process:

• Information gathering. A considerable number of papers and journal articles were
collected. They allowed us to understand the main problem and to focus our main
concerns on some aspects of the subject.

• Specification. The specification of this work consisted of the design and implemen-
tation of a SOVA-based Turbo Decoder.

• High Level Design. A high level model was programmed using the software tool
MATLAB 7.1. This model allowed us to try the system in different environments
and also to fine-tune the design specifications cited in step two.
[Figure: work flow with five stages — information gathering, design specifications, high level design (Matlab), VHDL implementation (behavioral verification in ModelSim) and VHDL synthesis (Post-Place & Route model verification in ModelSim, in-circuit verification with Matlab).]
Figure 5.1: Project Work Flow.

• VHDL Implementation. Once we were familiar with all the concepts related to
the decoding algorithm, we started to work on the structure of the datapath. It
was described in VHDL code and all the combinational modules were verified by
appropriate test benches in ModelSim. After the datapath was totally defined, we
began to specify the control needs of our system and the way it would communicate
with the exterior; subsequently we gradually defined the whole system.

• VHDL Synthesis. After a VHDL functional model was achieved, the synthesis was
carried out. The targeted device was a Spartan 3 X3S200FT256. The system
was first verified through a Post-Place and Route model. Later the FPGA was programmed
with the iMPACT tool for In-Circuit verification. Figure 5.2 illustrates the approach
employed for this purpose while figure 5.3 shows the procedure followed. The serial
port baud rate was set to 115200 bps.
[Figure: Matlab communicates over an RS232 serial port with an interface unit and the decoder inside the Spartan 3 X3S200FT256 FPGA.]
Figure 5.2: Hardware-in-the-loop approach

[Figure: Matlab models the source, channel coding, BPSK modulation, the AWGN discrete channel and the BER calculation; the channel decoding of the received symbols yi runs on the Spartan 3 X3S200FT256 FPGA, returning the estimates m̂i.]
Figure 5.3: Hardware-in-the-loop verification procedure


Chapter 6

Measures and Results

The system presented in chapter 4 was described using VHDL (Very High Speed Integrated
Circuit Hardware Description Language). A generic and parameterizable VHDL code was
written. A VHDL package includes the frame size, quantization scheme, polynomials
of the code, and the SOVA algorithm mode (HR-SOVA or BR-SOVA approximation).
The system can be configured through this package before the synthesis is performed.
The targeted device was a general purpose Xilinx FPGA Spartan 3 X3S200FT256.
All the tests have been done for two major polynomial pairs. One is the pair we have
been using throughout this work, Pf b = [111], Pg = [101]. The other pair is the UMTS
polynomial pair, Pf b = [1011], Pg = [1101]. The size of the data frame has been set to 1024
bits and it is the same for all simulations and syntheses. The depth of the RUU FIFO has
been set to 16 FPs. We have employed two types of interleavers. One of the interleavers
is given in [14] —from now on, MCF. It is described by the following equations:

α(x) = (31x + 64x²) mod 1024,  Deinterleaver;
β(x) = (991x + 64x²) mod 1024, Interleaver.


The other interleaver was randomly generated —from now on, RAND. Its function is
described by a look-up table. The tests with normalization by fixed scaling factors
have not been done yet and are left as future work. We will present the results in the
following subsections according to their nature.

6.1 Quantization Scheme


The quantization scheme is presented in table 6.1. The same quantization scheme is used
in all the tests. This scheme has been adopted from [18].

Element Word width : Fractional Part


Received Symbols yi 4:2
Extrinsic Information 7:2
Path Metrics 10:2
∆s 4:2

Table 6.1: Quantization Scheme Summary



[Figure: BER vs. Eb/No curves for ∆ quantization schemes 4:2, 6:2 and 8:2.]

Figure 6.1: ∆ quantization effect on the system BER performance. BR-SOVA approxi-
mation scheme. Simulation with quantization. MCF. Pf b = [111], Pg = [101]

The only quantization study that has been carried out is related to the path metric
difference ∆, which has a significant impact on the system BER performance. Figure 6.1
shows the BER curve against the received signal SNR. It is observed that, for the current
example, the 4:2 scheme is better than the 6:2 and 8:2 schemes. This behavior has been
reported in [11] as a method of improving the system BER performance. Since quantization
saturates the ∆, the overoptimistic values of the bit reliabilities are lessened and consequently
the system BER performance improves. Note that adopting the reduced quantization scheme
yields further benefits: the RAM that stores the data the ACSU provides is reduced, and
the logic related to the RUU is reduced as well.

6.2 Synthesis Results

Tables 6.2 and 6.3 present the synthesis results for the short pair of polynomials and the
UMTS polynomials respectively. Both pairs of polynomials were synthesized with the
quantization scheme given in table 6.1.

Observation HR BRap Resources


Logic Utilization
Number of Slice Flip Flops 720(18%) 752(19%) 3840
Number of 4 input LUTs 776(20%) 803(20%) 3840
Logic Distribution
Number of occupied Slices 677(35%) 674(35%) 1920
Number of Slices containing only related logic 677(100%) 674(100%) 674
Number of Slices containing unrelated logic 0 0 674
Total Number 4 input LUTs 789(20%) 816(21%) 3840
Number used as logic 776 803 1
Number used as a route-thru 13 13 1
Number of Block RAMs 10(83%) 10(83%) 12
Number of MULT18X18s 1(8%) 1(8%) 12
Number of GCLKs 4(50%) 4(50%) 8
Total equivalent gate count for design 671207 671658

Table 6.2: Synthesis Results for Pf b = [111], Pg = [101]

Observation HR BRap Resources


Logic Utilization
Number of Slice Flip Flops 1045(27%) 1108(28%) 3840
Number of 4 input LUTs 2067(53%) 2096(54%) 3840
Logic Distribution
Number of occupied Slices 1256(65%) 1329(69%) 1920
Number of Slices containing only related logic 1256(100%) 1329(100%) 674
Number of Slices containing unrelated logic 0 0 674
Total Number 4 input LUTs 2082(54%) 2111(54%) 3840
Number used as logic 2067 2096 1
Number used as a route-thru 15 15 1
Number of Block RAMs 11(91%) 11(91%) 12
Number of MULT18X18s 1(8%) 1(8%) 12
Number of GCLKs 4(50%) 4(50%) 8
Total equivalent gate count for design 748769 749432

Table 6.3: Synthesis Results for Pf b = [1011], Pg = [1101]

Note that the BR-SOVA approximation spends almost the same amount of resources
as the HR implementation. In contrast, the amount of used resources increases
significantly when working with the UMTS polynomials. This is due to the fact that the
UMTS encoder has twice the number of states.
Table 6.4 shows the maximum frequencies that the system can attain. When working with
the pair of short polynomials, the system can reach up to 85 MHz. The critical path is
located in the ACSU unit and is related to the add, compare, select and ∆ quantiza-
tion delays. On the other hand, when working with the UMTS pair of polynomials, the
maximum clock frequency suffers a considerable degradation. This is due to the excessive

[Figure: BER vs. Eb/No curves for HR and BRapprox at iterations 1, 3, 5 and 8, plus max-log-map at iteration 8.]

Figure 6.2: HR-BRapprox comparison. Infinite precision simulations. MCF interleaver.
Pf b = [111], Pg = [101]

combinational logic that the FPU requires for an eight-state code. The optimization of these
units should be considered as future work.

Polynomials Maximum clock frequency Critical Path


Short Pf b = [111], Pg = [101] 85 MHz ACSU
UMTS Pf b = [1011], Pg = [1101] 29 MHz FPU

Table 6.4: Maximum clock frequencies

6.3 Bit Error Rate Results


Before getting into the HIL results, we will discuss the BR-SOVA approximation BER
performance that is shown in figure 6.2. These results were obtained by simulation
with a floating point numeric representation. We observe that for an error probability of
10−4 the BR-SOVA approximation gains 0.3 dB over the HR-SOVA at the eighth iteration.
For an error probability of 10−5 the BR-SOVA approximation gains only 0.23 dB over the
HR-SOVA at the eighth iteration. We also observe that, for higher SNRs, the curves begin
to converge and the distance between them gets shorter.
Figure 6.3 exhibits the real system BER performance when implementing the HR-
SOVA for the short pair of polynomials. The figure illustrates the comparison between
the hardware-implemented HR-SOVA and the floating point simulations. Note that the
real HR-SOVA performs better. This is due to the ∆ quantization effect that was explained
in 6.1.

[Figure: BER vs. Eb/No curves comparing HR infinite precision and HR HIL at iterations 1, 5 and 8, plus max-log-map at iteration 8.]

Figure 6.3: HR-SOVA HIL results. MCF interleaver. Pf b = [111], Pg = [101]

Figure 6.4 shows the real system BER performance when implementing the BR-SOVA
approximation for the short pair of polynomials. For low SNRs, we see that the real
decoder performs worse than the floating point simulation; for high SNRs the opposite
situation is observed. Note that, for the BR-SOVA approximation, the BER performance of
the real decoder is about the same as that of the floating point simulation.
The ∆ quantization does not improve the BER as much as in the HR implementation.
The comparison between the HR-SOVA implementation and BR-SOVA approximation
implementation is shown in figure 6.5. The figure also shows a partial plot of a quantized
max-log-map algorithm with the following quantization scheme:

Element Word width : Fractional Part


Received Symbols yi 4:2
Extrinsic Information 7:2
γ 7:2
α 9:2
β 9:2

Table 6.5: Quantization Scheme Summary

We observe that, in the worst case, the HR-SOVA is 0.14 dB from the BR-SOVA
approximation, and the latter is only 0.1 dB from the quantized implementation of the
max-log-map.
Finally figures 6.6 and 6.7 show some partial results of the BER performance with the
UMTS polynomials and the randomly generated interleaver.

[Figure: BER vs. Eb/No curves comparing BRap infinite precision and BRap HIL at iterations 1, 5 and 8, plus max-log-map at iteration 8.]

Figure 6.4: BR-SOVA approximation HIL results. MCF interleaver. Pf b = [111], Pg =
[101]

[Figure: BER vs. Eb/No curves for HR HIL and BRap HIL at iteration 8, plus quantized and infinite precision max-log-map at iteration 8.]

Figure 6.5: HR-BRapprox HIL comparison. MCF interleaver. Pf b = [111], Pg = [101]



[Figure: BER vs. Eb/No curves for HR and BRap at iterations 1, 3, 5 and 8.]

Figure 6.6: HR-BRapprox comparison. Infinite precision simulations. RAND interleaver.
Pf b = [1011], Pg = [1101]

[Figure: BER vs. Eb/No curves comparing BRap infinite precision and BRap HIL at iterations 1, 5 and 8.]

Figure 6.7: BR-SOVA approximation HIL results. RAND interleaver. Pf b = [1011],
Pg = [1101]

6.4 Throughput Results


In this section we investigate the effect of running the RUU at higher frequencies and its
impact on the system throughput. A DCM (Digital Clock Manager) was used in order to
generate the corresponding frequencies.
Figures 6.8, 6.9 and 6.10 show the throughput histogram statistics for the frequency
relations fRU U = f , fRU U = 2f and fRU U = 3f respectively, for the short pair of
polynomials. The statistics were generated with 50000 samples. We observe that the
throughput increases with the RUU working frequency, as we expected.
In a real application context, the system has to guarantee a constant throughput, so
it could be set to one of the minimum intervals observed in the histograms. These values
are summarized in table 6.6. We can think of a power saving benefit since the system,
according to the figures, will work faster than the guaranteed throughput. Therefore,
when the system finishes the execution it goes to an idle state until a new set of data
arrives; during this idle state no activity is performed in the circuit, which implies an
important reduction in the power consumption.
Figures 6.11, 6.12 and 6.13 show the same throughput histogram statistics, this
time for the UMTS pair of polynomials. We observe the same effect as with the short
pair. However, we notice a slight difference in the statistics between them. This is due
to the frequency of appearance of FPs, which is higher for higher constraint lengths.

Observation Short Polynomials UMTS Polynomials


fRU U = f 0.5259f [bps] 0.5270f [bps]
fRU U = 2f 0.8258f [bps] 0.8308f [bps]
fRU U = 3f 0.9543f [bps] 0.9399f [bps]

Table 6.6: Minimum estimated throughput.



[Figure: histogram (number of observations vs. SISO throughput as a percentage of the system clock).]

Figure 6.8: Throughput statistics. f = 25M Hz, fRU U = 25M Hz. Pf b = [111], Pg = [101]

[Figure: histogram (number of observations vs. SISO throughput as a percentage of the system clock).]

Figure 6.9: Throughput statistics. f = 25M Hz, fRU U = 50M Hz. Pf b = [111], Pg = [101]

[Figure: histogram (number of observations vs. SISO throughput as a percentage of the system clock).]

Figure 6.10: Throughput statistics. f = 16.66M Hz, fRU U = 25M Hz. Pf b = [111],
Pg = [101]

[Figure: histogram (number of observations vs. SISO throughput as a percentage of the system clock).]

Figure 6.11: Throughput statistics. f = 25M Hz, fRU U = 25M Hz. Pf b = [1011], Pg =
[1101]

[Figure: histogram (number of observations vs. SISO throughput as a percentage of the system clock).]

Figure 6.12: Throughput statistics. f = 25M Hz, fRU U = 50M Hz. Pf b = [1011], Pg =
[1101]

[Figure: histogram (number of observations vs. SISO throughput as a percentage of the system clock).]

Figure 6.13: Throughput statistics. f = 16.66M Hz, fRU U = 50M Hz. Pf b = [1011],
Pg = [1101]

6.5 Power Results


The power consumption has been estimated by simulation. Table 6.7 summarizes the
results. The system frequencies were set to f = 25M Hz, fRU U = 50M Hz. The simulation
test bench was carefully designed in order to guarantee a SISO throughput of 0.8f =
20M bps. This throughput is feasible according to figures 6.9 and 6.12. We observe a
dynamic power consumption of (22 − 12) = 10mW for the short pair of polynomials.
The dynamic power consumption rises up to (29 − 12) = 17mW when working with the
UMTS polynomials. This effect was expected, since the area increase is about 50% when
jumping from four states to eight. Table 6.7 only shows the power consumption of the BR-
SOVA approximation, since the difference between the BR-SOVA approximation scheme
and the HR-SOVA scheme is negligible.

Observation Short Polynomials UMTS Polynomials


Total estimated power consumption[mW] 47 54
Vccint 1.20V: 22 29
Vccaux 2.50V: 25 25
Vcco25 2.50V: 0 0
Clocks: 6 6
Inputs: 1 1
Outputs: 2 4
Vcco25 0 0
Signals: 2 5
Quiescent Vccint 1.20V: 12 12
Quiescent Vccaux 2.50V: 25 25

Table 6.7: Estimated Power consumption. BRapprox. f = 25M Hz, fRU U = 50M Hz
Chapter 7

Conclusions and future work

We have designed a complete Turbo Decoder based on the SOVA algorithm. For this purpose
we have introduced a new algorithm for performing the SOVA decoding and we have designed
the architecture that implements it. The resulting design is not affected by the D-U trade-
off and it achieves an optimum SOVA execution. We have also introduced a modification
to the previous architecture that approximates the BR-SOVA. The resulting BER of this
last scheme is 0.1 dB from a comparable Max-Log-Map algorithm.
As future work, the following key points are proposed:

• The system throughput is affected by the management of the fusion points. Different
schemes should be studied with the aim of improving the resulting throughput. For
example, a LIFO memory could be employed instead of a FIFO at the input of the
RUU.

• The power consumption of the system could be reduced by properly selecting the
FPs that launch the reliability updating process. This way, a long updating-without-
releasing process can be avoided.

• The critical path of the system, for the UMTS polynomials, resides inside the FPU.
Optimization strategies should be analyzed in order to reduce the combinational
delays.

• A non-optimum SOVA execution should be adopted by properly reducing and man-
aging the RAM buffer that is used to store the data the ACSU units provide. Taking
into account other implementations, this memory could be reduced by more than
50% without major BER performance degradation.

• A complete BR-SOVA should be carefully studied for implementation. This could
probably be achieved by replicating the recursive updating unit. One of these units
traces back and updates the survival path, while the others do the same with the
competing paths.
Bibliography

[1] Sorin Adrian Barbulescu. What a wonderful turbo world ... E-book, 2004.

[2] G. Battail. Pondération des symboles décodés par l'algorithme de Viterbi. Ann.
Télécommun., 42:31–38, January 1987.

[3] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon Limit Error-Correcting
Coding and Decoding: Turbo-Codes. Proceedings of the IEEE International Confer-
ence on Communications, Geneva, Switzerland, May 1993.

[4] Gennady Feygin and P.G. Gulak. Architectural Tradeoffs for Survivor Sequence Mem-
ory Management in Viterbi Decoders. IEEE TRANSACTIONS ON COMMUNICA-
TIONS, 41:425–429, March 1993.

[5] Marc P. C. Fossorier, Frank Burkert, Shu Lin, and Joachim Hagenauer. On the Equiv-
alence Between SOVA and Max-Log-Map Decoding. IEEE COMMUNICATIONS
LETTERS, 2(5), May 1998.

[6] David Garrett and Mircea Stan. Low Power Architecture of the Soft-Output Viterbi
Algorithm. Low Power Electronics and Design, 1998. Proceedings. 1998 International
Symposium on, pages 262–267, August 1998.

[7] Joachim Hagenauer and Peter Hoeher. A Viterbi Algorithm with Soft-Decision Out-
puts and its Applications. Proc. GLOBECOM IEEE, 3:1680–1686, November 1989.

[8] Pablo Ituero Herrero. Implementation of an ASIP for Turbo Decoding. Master’s
thesis, KTH, May 2005.

[9] Olaf Joeressen, Martin Vaupel, and Heinrich Meyr. High-Speed VLSI Architectures
for Soft-Output Viterbi Decoding. Proc. IEEE ICASAP'92, Oakland, California,
pages 373–384, August 1992.

[10] D. W. Kim, T. W. Kwon, J. R. Choi, and J. J. Kong. A modified two-step SOVA-
based turbo decoder with a fixed scaling factor. Circuits and Systems, 2000. Pro-
ceedings. ISCAS 2000 Geneva. The 2000 IEEE International Symposium on, 4:37–40,
May 2000.

[11] Lang Lin and Roger S. Cheng. Improvements in SOVA-Based Decoding For Turbo
Codes. Communications, 1997. ICC 97 Montreal, ’Towards the Knowledge Millen-
nium’. 1997 IEEE International Conference on, 3:1473–1478, June 1997.

[12] Lutz Papke and Patrick Robertson. Improved Decoding with the SOVA in a Parallel
Concatenated (Turbo-code) Scheme. IEEE International Conference on Communi-
cations, Conference Record, Converging Technologies for Tomorrow’s Applications.,
1:102–106, June 1996.

[13] C. B. Shung, P. H. Siegel, G. Ungerboeck, and H. K. Thapar. VLSI Architectures for
Metric Normalization in the Viterbi Algorithm. Communications, 1990. ICC 90, In-
cluding Supercomm Technical Sessions. SUPERCOMM/ICC '90. Conference Record.,
IEEE International Conference on, 4:1723–1728, April 1990.

[14] Oscar Y. Takeshita. On Maximum Contention-Free Interleavers and Permutation
Polynomials over Integer Rings. Submitted as a Correspondence to the IEEE Trans-
actions on Information Theory, April 2005.

[15] T. K. Truong, Ming-Tang Shih, Irving S. Reed, and E. H. Satorius. A VLSI Design for
a Trace-Back Viterbi Decoder. Communications, IEEE Transactions on, 40:616–624,
March 1992.

[16] Matthew C. Valenti. Iterative Detection and Decoding for Wireless Communications.
PhD thesis, Virginia Polytechnic Institute and State University, July 1999.

[17] Yan Wang, Chi-Ying Tsui, and Roger S. Cheng. A Low Power VLSI Architecture
of SOVA-based Turbo-code decoder using scarce State Transition Scheme. IEEE
International Symposium on Circuits and Systems, Geneva, Switzerland, 00:00–00,
May 2000.

[18] Zhongfeng Wang. High Performance, Low Complexity VLSI Design of Turbo De-
coders. PhD thesis, University of Minnesota, September 2000.
