Liang Li Nine Month Report

UNIVERSITY OF SOUTHAMPTON Faculty of Engineering, Science and Mathematics School of Electronics and Computer Science
A progress report submitted for continuation towards a PhD
Supervisors: Dr. Rob Maunder, Prof. Bashir M Al-Hashimi and Prof. Lajos Hanzo
Examiner: Dr Song Xin Ng
Analysis of Low Power Implementational Issues of Turbo-like Codes in Body Area Networks
by Liang Li November 3, 2009
ABSTRACT
Body Area Networks (BANs) are a promising application of wireless sensor networks (WSNs) and are attracting considerable research interest. A BAN is a WSN located in the vicinity of a human body for continual monitoring of certain physiological parameters, which can provide a healthcare service in a more comfortable, convenient and economical way than conventional methods. The extremely low power and high reliability requirements of BANs make communication challenging. In this report, a state-of-the-art survey of research on communication technologies in BANs is given. Based on this survey, a proposal to use Turbo-like codes as the channel coding scheme of BANs is discussed. Because of the low power requirement of BAN applications, the low power implementation issues of Turbo decoding schemes are discussed. A method to determine the optimal data width specification in a fixed-point implementation of a Turbo decoder from a low power point of view is presented. A framework to compare and evaluate different Turbo-like codes from the energy consumption point of view is proposed.
Contents
Acknowledgements
1 Introduction
1.1 Introduction of Body Area Networks (BANs)
1.2 Communication in BANs
1.2.1 Communication requirements
1.2.1.1 Frequency conditions
1.2.1.2 Network scale and communication range
1.2.1.3 Data rate
1.2.1.4 Reliability, accuracy and latency
1.2.1.5 Energy consumption
1.2.1.6 Network topology
1.2.1.7 Security
1.3 Candidate options for Body Area Networks
1.3.1 Turbo-like Codes
1.4 Outline of the report
2 Turbo-like Code Solutions in BANs
2.1 Introduction
2.2 Turbo codes and BCJR decoding algorithm
2.2.1 UMTS encoder and decoder architecture
2.2.2 BCJR algorithm
2.2.2.1 Log-BCJR algorithm
2.3 EXIT chart analysis
2.4 Fixed-point representation in a Turbo decoder
3 Optimal Data-width Settings for Fixed-point Implementation
3.1 Introduction
3.2 Fixed-point EXIT chart analysis of UMTS Turbo Decoder
3.3 Simulation and Analysis Results
3.3.1 Comparison between different Logarithm methods
3.3.2 Comparison and Analysis in Fixed-point simulation
3.3.2.1 Wrapping Technique
3.3.2.2 Saturation Technique
3.3.2.3 Normalisation Technique
3.3.2.4 Final validation
4 Energy Estimation Decoding Algorithm
4.1 Introduction
4.2 Previous works
4.3 A framework for quantifying the energy consumption of a Turbo-like decoder
4.3.1 Level 1 of the framework
4.3.2 Future work: Level 2 of the framework
5 Conclusions and Further Works
Bibliography
List of Figures
1.1 A typical BANs architecture.
1.2 Two concatenation ways of Turbo-like codes.
1.3 Two decoding schemes of two types of Turbo-like codes.
2.1 Transmission scheme of serial concatenation codes.
2.2 A typical BER chart for Turbo codes.
2.3 Performance comparison of a Turbo code and a convolutional code [?].
2.4 A classical Turbo encoder.
2.5 A classical Turbo decoder.
2.6 A classical SC decoder.
2.7 Scheme of UMTS Turbo encoder.
2.8 Scheme of the convolutional encoder and the trellis diagram.
2.9 Trellis diagram of a transition sequence.
2.10 An example transition sequence.
2.11 Scheme of UMTS Turbo decoder.
2.12 An example trellis of a short terminated trellis code.
2.13 Scheme of the EXIT chart generating.
2.14 One EXIT curve I(a_e) = F(I(a_a)) of UMTS Turbo code using BPSK to transmit over an AWGN channel having an SNR of -4 dB.
2.15 EXIT chart of UMTS Turbo decoder.
2.16 The decoding trajectories in the EXIT chart.
3.1 correction function.
3.2 A possible accumulation route in the trellis.
3.3 Example of difference calculation in twos complement representation.
3.4 Three different Turbo codes in previous works.
3.5 EXIT chart of different log algorithms.
3.6 EXIT chart of different fraction lengths.
3.7 Scheme of UMTS Turbo decoder.
3.8 EXIT chart of different integer lengths with wrapping technique - 1.
3.9 EXIT chart of different integer lengths with wrapping technique - 2.
3.10 EXIT chart of different integer lengths with wrapping technique - 3.
3.11 EXIT chart of different integer lengths with wrapping technique - 4.
3.12 EXIT chart of different integer lengths with saturation technique.
3.13 EXIT chart of different integer lengths with normalisation technique.
3.14 Simulation results of 5114-bit block length in fixed-point with normalisation and floating-point.
3.15 Simulation results of 40-bit block length in fixed-point with normalisation and floating-point.
3.16 Simulation results of SNR=-4.83dB/453-bit block length in fixed-point with normalisation and floating-point.
3.17 Simulation results of 5114-bit block length in fixed-point with wrapping technique and floating-point.
3.18 Simulation results of 40-bit block length in fixed-point with wrapping technique and floating-point.
3.19 Simulation results of SNR=-4.83dB/453-bit block length in fixed-point with wrapping technique and floating-point.
4.1 The dependence between the different stages.
4.2 Flowchart of energy estimation framework.
List of Tables
1.1 Data rate requirement of different applications in BANs
2.1 Different representation methods for integer numbers
2.2 Twos complement representation method for fraction numbers
3.1 Different representation methods for integer numbers
Acknowledgements
I would like to express my gratitude to all those who helped me to complete this report. I would like to give special thanks to my supervisors, Dr. Rob Maunder, Prof. Bashir M Al-Hashimi and Prof. Lajos Hanzo. Their insightful guidance, direction and wise advice allowed this work to become reality. I am deeply indebted to Dr. Rob Maunder. His stimulating suggestions and encouragement helped me throughout the research for and writing of this report. Many thanks also to my colleagues and the staff of the Communications Group and the Electronic System Design Group for the useful discussions and comments throughout my research. Special thanks to my colleagues Amit Acharyya and Dr. Jos Akhtman for their technical support. I would also like to express my appreciation to my parents, who taught me all the good things that really matter in life. Especially, I would like to thank my girlfriend Nuofei Lu, whose patient love enabled me to complete this work.
Chapter 1
Introduction
A wireless sensor network (WSN) is a network composed of a number of sensor devices with the ability to communicate with each other or with upper-level networks. A sensor node in a WSN consists of data sensing, data processing and communication components. The sensors are deployed either inside the phenomenon or very close to it. Typically, one or more central devices are included in a WSN for collecting data from the sensors and communicating with upper-level networks. With the development of wireless communication technologies, WSNs are starting to play an important role in supporting networks of different scales that connect a person anywhere, at any time and with anybody. In this report, a promising application of WSNs, Body Area Networks (BANs), is introduced. The communication requirements of BANs are investigated based on a literature review. The candidate technologies for wireless communication in BANs are discussed, including the potential of using Turbo-like coding schemes in BANs, which is advocated in this report. Finally, the outline of the report is given.
1.1 Introduction of Body Area Networks (BANs)
Body Area Networks (BANs), or Wireless Body Area Networks (WBANs), are a promising application of short-range wireless sensor networks (WSNs) in the healthcare industry. The basic scenario is to place a number of wireless sensors on the human body for continual monitoring of physiological parameters such as heart rate, electrocardiogram (ECG) data, electroencephalography (EEG) data, blood pressure, body temperature, levels of certain chemicals in the blood such as sugar, oxygen and medications, motion, etc. [1]. These parameters could be important or even life critical to some people or patients, such as the aging population, chronic disease patients, and cerebrovascular and cardiovascular disease patients. Long-term monitoring and logging of the physiological parameters of the patients could help doctors to treat the patient or discover risks earlier. The purpose of deploying BANs on such groups of people is to create a more comfortable, convenient and economical way to perform the required monitoring and logging tasks, either in hospitals or in a home-based healthcare system.
The concept of (W)BANs was first introduced by T. G. Zimmerman in 1996 [2]. Over the past few years, advances in electronic systems and wireless technologies have enabled the development of small and intelligent medical sensors which can be attached to or implanted into the human body. The healthcare industry is becoming increasingly interested in using such technologies to develop practical BANs [3]. Hence, many topics in this area are being widely researched, including energy harvesting, signal processing and communication. Recently the IEEE 802.15 working group established a study group on body area networks (802.15.TG6) to develop guidelines for using wireless technologies for BAN applications in various healthcare services. This report focuses on the wireless communication technology for BANs.
Furthermore, BANs can be divided into two categories depending on their operating environments. Those operating mainly on the surface or in the vicinity of the human body are called wearable BANs. Those operating inside the body are called implantable BANs. Due to the different operating environments, the communication requirements of the two types of application are different, which leads to different strategies for developing the communication technology in the networks. Most recent works, and the IEEE 802.15.TG6 group, target wearable BANs, which are also the focus of this report. In the rest of this report, the term BANs refers to wearable BANs.
1.2 Communication in BANs
To discuss the wireless communication issues in BANs, we start with the scenarios of BAN applications. A typical BAN scenario is given in Figure 1.1. It consists of a number of sensor nodes attached to the human body, which perform different functions to collect different physiological parameters. A central device such as a PDA or a mobile phone takes the data from the sensor nodes via a wireless network formed between them, namely a BAN. The sensors are supposed to be as small as possible for comfort and convenience. Therefore, their functions are limited. The basic assumption is that the sensors only collect the data from the body, do the necessary signal processing and transmit the data to the central device in real time. The central device then performs further processing of the data. It might connect to an external network to communicate with a higher-level system, such as a hospital, depending on the requirements of the scenario. The number of sensors required varies between applications. For one particular disease, usually only a few (< 3) sensor nodes are required [3]. However, for more complicated situations, more sensors might be required. In particular, when motion detection is involved, for example for people who need after-treatment to help recover mobility, sensors might be required on every movable part of the body, and including more sensors means more accurate motion detection. According to [1], typically no more than 20 sensors will be used for any one person.
Figure 1.1: A typical BANs architecture (sensor nodes and central node).
Based on such a scenario, there are several features that need to be considered carefully while developing communication technologies for BANs. Firstly, the resources on the sensors are limited due to their limited size. The nodes need to be light, small, wireless and long lived. One of the purposes of BANs is to create a convenient healthcare system for patients without support from professionals, so the wireless sensors on the human body need to be battery operated or powered by energy harvesting, and they are required to last a long time without any maintenance. This leads to an extremely demanding energy efficiency requirement. Secondly, all the sensors and the central device are always around the patient, which leads to a very short coverage range requirement for BANs. Thirdly, all the data are only required to be transmitted from the sensors to the central node. Since the function of a sensor is simple and limited, there is almost no need for two-way communication in the system. One-way communication (from the sensors to the central node) is sufficient for BANs. In some scenarios, there may be an overall energy saving from letting nodes listen to each other, and including relays in the network might require nodes to receive. Nevertheless, the basic assumption that no transmission is required from the central node to the sensors remains valid.
Based on this discussion of the basic scenario of BANs, the requirements of communication in BANs are discussed in the next section.
1.2.1 Communication requirements
To develop new technologies for a particular type of WSN, the basic requirements need to be considered first. In this section, we discuss the communication requirements of BANs, which include frequency bands, network scale, communication range, data rate, reliability, energy consumption, network topology and security issues.
1.2.1.1 Frequency conditions
Currently, there is no clearly defined frequency range that can be used for BANs. According to the latest news [4], the Federal Communications Commission (FCC) is considering several possible frequency bands for use by BANs:
2300-2305 MHz and 2360-2395 MHz bands: The 802.15.TG6 group and GE Healthcare (GEHC) propose to use these bands for BANs. However, they are currently used by several other services, including Aeronautical Mobile Telemetry (AMT), federal radiolocation and amateur radio users. This can be a problem regarding interference and security. The FCC is considering the proposed use of these bands by BANs on a coexistence and non-interference basis.
2400-2483.5 MHz band: This band is used by Industrial, Scientific and Medical (ISM) equipment on a non-licensed basis under the FCC's rules. The FCC seeks comment on whether BANs could operate in this band under current rules or whether new rules would be required to regulate BANs using this band.
Other frequency bands: The FCC seeks comment on whether other frequency bands may be appropriate for BANs, including the 5150-5250 MHz band, which is now allocated for federal and non-federal aeronautical navigation, non-federal fixed-satellite use and Unlicensed National Information Infrastructure (U-NII) devices.
An alternative solution is to use Ultra-WideBand (UWB) technology, which is authorised to communicate between 3.1 GHz and 10.6 GHz [5]. The potential and advantages of applying UWB technology to BANs are discussed in [6].
1.2.1.2 Network scale and communication range
As discussed in Section 1.2, a BAN is a small-scale network which could include up to 20 sensor nodes. One important feature of BANs is that all the devices in the network are around a human body. This greatly limits the communication range required of the network. A widely agreed communication range for BANs is 2-5 meters [1, 7, 8], which is shorter than the coverage range of any existing WSN application. An obvious advantage of such a short communication range is that it directly leads to a low emission level from the transmitter. On the other hand, all the devices are in each other's vicinity, which could cause an interference problem.
1.2.1.3 Data rate
Most previous works agree that BANs require real-time, low data rate communication, but the detailed data rate requirements they report differ. For example, the results from [3] and [7] are summarised in Table 1.1.
Healthcare applications: Heartbeat; Body temperature; Electrocardiogram (ECG); Electroencephalography (EEG); Electromyography (EMG); Blood pressure; Blood sugar level; Blood analysis
B. Zhen's work [3] (kbps): <0.1; <0.1; 2.5; 0.54; <0.1; <0.1
H. Li's work [7] (kbps): 0.05; 0.05; 72; 131.1; 1152; 0.05; 8.192
Table 1.1: Data rate requirement of different applications in BANs
Note that for the low data rate applications, such as heartbeat, body temperature and blood pressure, the results are quite close. However, for more complicated parameters, such as ECG and EEG, the results are very different. The reason could be different assumptions about the pre-processing of the signals before transmission. For example, if a sensor transmits compressed data rather than raw data, the required data rate could be reduced. Across the previous works reviewed, the highest data rate assumption is given by [8], which claims that a data rate of up to 1 Mbps is required by BANs. This still gives a data rate requirement lower than that of any existing WSN application. For example, for another low-speed, short-range WSN application, Wireless Personal Area Networks (WPANs), the required data rate is up to 10 Mbps. However, some previous works include video stream transmission in BAN scenarios, in which case the data rate requirement could increase to 100 Mbit/s [1].
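The effect of pre-processing on the required data rate can be illustrated with a back-of-the-envelope calculation. The following Python sketch computes a raw data rate as channels x sampling rate x bits per sample and then applies a compression ratio. The ECG front-end parameters (12 channels, 500 Hz, 12 bits per sample) and the compression ratio of 8 are illustrative assumptions, not figures taken from [3] or [7].

```python
def raw_data_rate_bps(channels: int, sample_rate_hz: float, bits_per_sample: int) -> float:
    """Uncompressed sensor data rate in bits per second."""
    return channels * sample_rate_hz * bits_per_sample


def transmitted_rate_bps(raw_bps: float, compression_ratio: float) -> float:
    """Rate after on-sensor compression (ratio = raw size / compressed size)."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_bps / compression_ratio


# Hypothetical multi-lead ECG front end (assumed values, for illustration only).
ecg_raw = raw_data_rate_bps(channels=12, sample_rate_hz=500, bits_per_sample=12)
ecg_sent = transmitted_rate_bps(ecg_raw, compression_ratio=8.0)
print(f"raw: {ecg_raw / 1e3:.1f} kbps, compressed: {ecg_sent / 1e3:.1f} kbps")
```

Under these assumptions the same sensor needs either tens of kbps or only a few kbps, depending purely on how much processing is done before transmission, which is one plausible source of the spread between surveys.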
1.2.1.4 Reliability, accuracy and latency
The performance requirements of BANs, which include reliability, accuracy and latency, are still under discussion. However, since the monitored signals are life critical, faulty data transmission or a delay of a few minutes may be fatal for the patient, so the performance requirements of BANs should be relatively high. Many previous works agree that delays and communication errors should be kept within strictly defined limits in order to avoid disastrous behaviour [3, 9, 10]. According to the 802.15.TG6 group's officially released report [11], a fast reaction of < 1 second with a reliability of 99.99% is expected. Also, latency should be < 250 ms and jitter < 50 ms. Performance can be evaluated through the delay profile, the information loss rate, the bit error rate (BER) and the frame error rate (FER), which need to be considered carefully during the design of a communication system.
In addition, when a human body moves, the sensors change position relative to each other. When the environment changes, channel conditions also change and affect the network performance. A BAN must maintain its reliability under all possible conditions. The different states of the human body, such as walking, running and turning, must be considered. In practice, a BAN will pass through different environments, such as tunnels, subways and parks, where multiple different BANs will coexist and multiple different kinds of interference will occur. The design of BANs must be prepared for all realistic scenarios.
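To relate the BER and FER metrics mentioned above, the following Python sketch uses the standard relation FER = 1 - (1 - BER)^N for a frame of N bits, which holds only under the assumption of independent bit errors. Interpreting the 99.99% reliability figure as a per-frame success probability, and the 1024-bit frame length, are assumptions made here for illustration; retransmissions and residual errors after decoding are ignored.

```python
def frame_error_rate(ber: float, frame_bits: int) -> float:
    """FER of a frame of frame_bits bits, assuming independent bit errors."""
    return 1.0 - (1.0 - ber) ** frame_bits


def max_ber_for_reliability(reliability: float, frame_bits: int) -> float:
    """Largest BER for which a whole frame is error-free with the given probability."""
    return 1.0 - reliability ** (1.0 / frame_bits)


frame_bits = 1024   # hypothetical frame length
target = 0.9999     # assumed per-frame success probability
ber_limit = max_ber_for_reliability(target, frame_bits)
print(f"BER must stay below about {ber_limit:.2e} for {frame_bits}-bit frames")
```

The sketch suggests that a per-frame reliability target translates into a BER requirement that tightens as frames get longer, which is one reason the frame length and the strength of the channel code have to be considered together.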
1.2.1.5 Energy consumption
Energy consumption on the sensor nodes is a crucial issue in BANs due to the limited energy resources and the long lifetime requirement. As discussed in Section 1.2, the sensors in BANs have to be battery operated or rely on energy harvesting, and must not need to be recharged frequently. Since the sensors are expected to be as small as possible, the energy resources that can be provided on them are extremely limited. This requires every function on the sensors to be energy efficient, including the wireless communication mechanism. In addition, a distinguishing feature of BANs is that the sensors are required to be attached to the human body. On the other hand, previous works suggest that, for the safety of the human body, wireless devices should be kept at least 30 cm away from it [3]. Hence, extremely low transmission power is required to protect human tissue. It is widely agreed that the low power requirement is one of the most challenging issues in developing BANs [1, 3, 12]. Because of the very short distance and the special requirement to protect human tissue, BANs are not covered by existing wireless standards [1].
1.2.1.6 Network topology
Because of the small network scale and the one-way communication assumption, a star topology is a direct solution for BANs [13, 14]. The advantages of a star topology are its simple architecture and the concentration of system complexity in the central node, which are suitable features for BANs. However, the devices are located on a human body that can be in motion. BANs should therefore be robust against frequent changes in the network topology. Moreover, the human body strongly attenuates RF signals [10]. Both of these factors favour a multi-hop network topology in BANs. In addition, using multi-hop transmission instead of direct transmission can achieve lower energy consumption in communication [12]. Hence, many recent works focus on multi-hop network solutions for BANs [15, 16].
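The energy argument for multi-hop transmission can be sketched with a simplified path-loss model. The Python example below assumes that the radiated energy needed per bit grows as distance raised to a path-loss exponent alpha, and that relays are equally spaced. Both the model and the exponent values are assumptions for illustration and are not taken from [12]. The energy spent by relays on receiving and processing is deliberately left out.

```python
def relative_radiated_energy(hops: int, path_loss_exponent: float) -> float:
    """Total radiated energy of an equally spaced multi-hop link, relative to one
    direct hop over the same distance, assuming energy ~ distance**alpha."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    # Each hop covers 1/hops of the distance and there are `hops` transmissions.
    return hops * (1.0 / hops) ** path_loss_exponent


for alpha in (2.0, 3.0, 4.0):
    ratio = relative_radiated_energy(hops=2, path_loss_exponent=alpha)
    print(f"alpha={alpha}: two hops radiate {ratio:.3f} of the direct-link energy")
```

In this model the saving grows with the path-loss exponent, so strongly attenuating channels would benefit most. Any real comparison would also need to include the receive and processing energy of the relays.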
1.2.1.7 Security
Security is a major issue in medical applications [8]. Safety and privacy should be addressed for all involved parties, including doctors, nurses, patients, administrative personnel and medical service providers. Devices also need authentication for security. Interference from external devices and intentional attacks must be considered as safety issues. On the other hand, due to the limited resources on the sensors and the user-friendliness requirement, the system must be simple. Adding a node to a BAN and removing it must be easy for the user.
1.3 Candidate options for Body Area Networks
Based on the discussion in Section 1.2.1, BANs are distinguished from other networks by their shorter communication range, extremely low power, and high accuracy and reliability requirements. Since the existing WSN technologies are not suitable for BANs, a new IEEE study group, 802.15.TG6, is working on developing a new standard and protocol for BAN applications. In creating a new protocol suitable for BANs, there is a choice between defining a new PHY/MAC and evaluating and improving currently available or emerging technologies.
Some papers suggest modifying existing standards [1, 3]. For low power purposes, a possible approach is to scale down existing standards, for example by turning down the transmit power or introducing a duty cycle mechanism [1]. IEEE 802.15.4a/b is a popular standard to evaluate or improve for BAN applications in previous works [8, 17-19]. Other existing standards such as the Medical Implant Communication Service (MICS) and the Wireless Medical Telemetry Service (WMTS) have also been suggested [20]. Based on an investigation of previous works on using existing standards in BANs [8, 17-23], we found that most of these works focus on evaluating the performance of the standards in BAN applications. Despite the importance of the low power requirement of BANs, the power issue has not been addressed much in these works. As pointed out in [23], the energy consumption of the candidate standards is hard to investigate because the platforms currently available for evaluating them are not designed with the low power techniques needed for extremely low power applications such as BANs. Therefore, it is not fair to evaluate the energy consumption based on these platforms, since the existing standards could be further scaled down for low power purposes, for example by turning down the transmission power or introducing a duty cycle mechanism, as suggested in [1]. However, scaling down a standard must lead to degradation in performance. For example, although some previous works claim that the 802.15.4 standard provides sufficient performance for BAN applications, it cannot be guaranteed that this performance can be maintained after proper scaling down techniques are applied to the standard.
Another option is to define a new standard for BANs. Since the existing low power standards cannot meet the ultra-low power requirement of BANs, how to further scale down the energy consumption in a WSN while keeping high performance becomes a challenge. Two problems challenge the performance of communication in BANs. One is that the transmission power of the sensors should ideally be as low as possible to protect human tissue. The other is that human tissue is composed primarily of water molecules, which tend to absorb RF energy [24]. These problems create a requirement for a proper Error Correction Code in the channel coding scheme to maintain high reliability and accuracy under such harsh conditions. However, due to the reduction of the transmission power in the system, the energy consumed by channel coding could make up a larger share of the whole system's consumption, which makes a low power design of the channel coding scheme desirable. To overcome this challenge, we investigate the potential of using Turbo-like codes in the communication of BANs. The novel way of applying Turbo-like codes in BANs and its advantages are discussed in Section 1.3.1.
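The duty cycle mechanism mentioned in this section as a way of scaling down an existing standard can be illustrated with a simple average-power model. The Python sketch below assumes a radio that is either fully active or asleep. The power figures, the 1% duty cycle and the battery capacity are hypothetical values chosen for illustration, not measurements of any particular standard or platform.

```python
def average_power_mw(duty_cycle: float, active_mw: float, sleep_mw: float) -> float:
    """Average power of a radio that is active for a fraction duty_cycle of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must lie in [0, 1]")
    return duty_cycle * active_mw + (1.0 - duty_cycle) * sleep_mw


def lifetime_hours(battery_mwh: float, avg_power_mw: float) -> float:
    """Idealised battery lifetime, ignoring self-discharge and conversion losses."""
    return battery_mwh / avg_power_mw


# Hypothetical sensor node: 30 mW when active, 3 uW when asleep, 1% duty cycle.
avg = average_power_mw(duty_cycle=0.01, active_mw=30.0, sleep_mw=0.003)
print(f"average power: {avg:.5f} mW")
print(f"lifetime on a 600 mWh cell: {lifetime_hours(600.0, avg):.0f} hours")
```

The model shows why duty cycling is attractive, but it says nothing about the latency and reliability cost of sleeping, which is exactly the performance degradation the text warns about.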
1.3.1 Turbo-like Codes
We will give all overall discussion of Turbo principle and Turbo-like codes in next chapter. In this section, for the purpose of discussing the protential advantages of applying Turbolike codes in the channel coding scheme of BANs, we will give a brief introduction of some distinctive features of Turbo-like codes. Turbo-like codes refer to a type of ECC that includes two component codes in one coding scheme. The concatenation between the component codes on the encoding process could be parallel or serial. An interleaver is used between the component encoders. The two type of concatenation in Turbo-like codes are illustrated in Figure 1.2. The success of Turbo-like codes is that it introduces an iterative decoding process to approach the best decoding results. The two decoding schemes corresponding to the two encoding schemes
Input
Encoder1
Encoder2
MUX
Output
Input
Encoder1
Encoder2
Output
Parallel Concatenated Code
Serial Concatenated Code
Figure 1.2: Two concatenation way of Turbo-like codes.
are given in Figure 1.3. An iterative decoding process is performed between the two
Decoder1
1
deMUX
Input
Output
Decoder2
1
Decoder1
Input
Output
Decoder2
Parallel Decoding Scheme
Serial Decoding Scheme
Figure 1.3: Two decoding schemes of two types of Turbo-like codes.
concatenated decoders by feeding the decoded results back to each others input. Under such a scheme, the decoded result improves in each iteration, until the best result is achieved after a certain number of iterations. The details of the Turbo-like codes are given in next chapter. The advantages of the Turbo-like code are its high reliability and the near Shannon capacity performance [25, 26], which could conquer the crucial transmission condition in BANs. However, the disadvantage of it is its relatively high complexity decoding scheme. Turbo-like codes are usually not appropriate for low power communication since the iterative decoding process consumes a lot of energy [27]. However, the oneway communication assumption in BANs provide an opportunity to apply Turbo-like codes in its communication system. In contrast of the complicated decoding schemes of the Turbo-like codes, the encoding schemes of them are simple and easy to have a low power implementation. Therefore, in a star topology network, the decoding scheme does not need to be implemented on the sensors based on the one-way communication assumption. Although it need to be implemented on the central nodes, in BANs, the central node usually has a sucient energy recourse since it is usually a much bigger equipment than the sensors, such as a PDA or a smart mobile. Hence, the Turbo-like codes are naturally suited for a star topology BAN. Furthermore, as we discussed, the multihop network is necessary in some scenario of BANs. In a multihop network, relays are used reduce the transmission distance. Thus the transmission power on each nodes can be reduced, which is an especially desired feature for BANs. On the other hand, it is required to induce extra receiving, transmitting and coding process on the relays,
which might increase the overall energy consumption of the relays and reduce their lifetime. From the channel coding point of view, one way to avoid extra coding processes on the relays is to amplify and retransmit the received signals without any other processing. However, this is not a desirable solution, since the noise in the transmission is also amplified at the relay, which might increase the decoding error rate at the central node and degrade the communication performance. As an alternative solution, we propose a novel decoding scheme for the relay, which performs fewer decoding iterations and transmits the sub-optimal decoding result. It is possible to find a balance point at which, with a certain number of decoding iterations on the relays, the energy consumption of the extra coding process and the overall communication performance are both acceptable, while the transmission power is reduced by the multihop mechanism of the network. To further research the proposed decoding scheme, we need to investigate the energy consumption of low power implementations of Turbo-like codes, such as fixed-point ASIC implementations. In this report, we propose a method to determine the optimal data width specification in fixed-point implementations of Turbo-like codes from a low power point of view. As discussed in Section 1.3, the likely energy consumption of a communication standard or a coding scheme for low power applications such as BANs is hard to evaluate. In this report, we propose a framework to compare and evaluate the energy consumption of different Turbo-like code decoding schemes. To sum up, in this report we discuss the possibility of introducing Turbo-like codes into BAN communication systems, and we introduce the related requirement for a method to evaluate the energy consumption of a Turbo-like code decoding system.
An important issue in the low power fixed-point implementation of Turbo-like decoding algorithms, the data width specification, is explored. The outline of the report is given in the next section.
1.4 Outline of the report
The later chapters in this report are organised as follows. Chapter 2 is a background chapter. A full introduction to the Turbo principle and Turbo-like codes is presented. The fixed-point implementation issues of hardware design for the algorithms are also introduced. Chapter 3 proposes a method to determine the optimal data width specification in fixed-point implementations of Turbo-like codes from a low power point of view. Chapter 4 proposes a framework to evaluate and compare different Turbo-like codes from the energy consumption point of view. Part of the framework is future work for the project, which is also discussed in that chapter. Chapter 5 gives the conclusion of this report.
Chapter 2
Turbo-like Code Solutions in BANs

In this chapter, we introduce the background to this report. It includes a brief introduction to the Turbo principle and its encoding and decoding algorithms, the EXIT chart analysis tool and the basic theory of fixed-point representation in hardware implementation.
2.1 Introduction
The Turbo principle is a concept in error correcting codes (ECCs) that involves iterative decoding processes, also referred to as turbo decoding processes; examples include serially or parallel concatenated codes [25, 28] and LDPC codes [29]. A distinguishing feature of Turbo-like codes is that two or more component codes are concatenated in the scheme. Such codes are called concatenated codes, which were first proposed in [30]. The first concatenated codes were serially concatenated (SC) codes, in which two or more component codes are concatenated in a serial structure. A famous example consists of a Reed-Solomon code [31] as the outer code (applied first and removed last) followed by a convolutional code [32] as the inner code (applied last and removed first) [33]. In the early concatenated coding schemes, despite the inclusion of two or more component codes, there is no iterative decoding process in the decoder. The decoder generates hard decisions (i.e. the determined bit values) directly. In a communication receiver, the demodulator usually produces soft decisions. Soft decisions are the counterpart of hard decisions: instead of giving the decoded bit values, they are reliability information expressed as the a posteriori probability of each bit. Soft decisions express not just what the most likely value of a bit is, but also how likely it is, while hard decisions only express the former. Before the Turbo principle was discovered, a typical decoder would utilise the soft decisions in the decoding process and
generate hard decisions at its output. Such a decoder can be called a Soft-in Hard-out (SIHO) decoder. Therefore, a straightforward way of decoding SC codes is to concatenate a SIHO decoder for the inner code with a Hard-in Hard-out (HIHO) decoder for the outer code. Where a convolutional encoder is used, a Viterbi Algorithm (VA) [34] decoder is used at the corresponding place to give hard decisions. As discussed in [35], the first drawback of such a structure is that the inner decoder generates hard decisions, thus preventing the outer decoder from utilising its ability to accept soft decisions at its input. The second drawback is that if the inner decoder produces a burst of consecutive errors, the outer decoder is unable to correct them. The second drawback can be overcome by inserting an interleaver between the inner and the outer encoder and, correspondingly, a deinterleaver between the inner and the outer decoder. The function of an interleaver is to rearrange the order of a sequence in a pseudo-random way. The function of a deinterleaver, with knowledge of the rearranging method of the corresponding interleaver, is to restore the order of an interleaved sequence. Thus, a burst of errors from the inner decoder becomes dispersed at the input to the outer decoder. The transmission scheme is shown in Figure 2.1. However, if errors occurred
Figure 2.1: Transmission scheme of serial concatenation codes. Transmit chain: input bits, outer encoder, interleaver, inner encoder, channel. Receive chain: inner decoder, deinterleaver, outer decoder, decoded bits.
at the output of the outer decoder, these would remain in the final decoding results. A Turbo-like code can be considered a refinement of the concatenated encoding schemes, with an improved decoding process that uses iterative algorithms. The concept of turbo decoding is for a system with two component codes to pass soft decisions from the output of one decoder to the input of the other decoder, and to iterate this process many times to produce more reliable decisions. To benefit from an iterative decoding process, the two decoders must feed soft decisions to each other, because using hard decisions as the input of a decoder degrades system performance compared with soft decisions [36]. Therefore, Soft-in Soft-out (SISO) decoders are required for the decoding of each component code. The introduction of Turbo codes in [25] was also the first introduction of parallel concatenated (PC) codes. It was reported that the scheme can achieve a bit error rate (BER) of $10^{-5}$ using a rate 1/2 code over an additive white Gaussian noise (AWGN) channel with BPSK modulation at an $E_b/N_0$ of 0.7 dB [25, 26]. According to the discussion in [25, 26], the Shannon capacity for binary modulation corresponds to an error probability $P_e = 0$ ($P_e = 10^{-5}$ can be taken as a reference here) at $E_b/N_0 = 0$ dB. Hence the performance is 0.7 dB from the Shannon capacity. Most importantly, with the iterative decoding scheme, the complexity of a Turbo decoder is much lower than that of a non-iterative decoder having the same performance. According to [37], the complexity required to allow the earlier codes to approach the Shannon capacity would
be infeasible to implement. The discovery of Turbo codes revolutionised the field of error correcting codes, since it achieved performance very close to the Shannon capacity in practice for the first time. To evaluate the performance of a Turbo or Turbo-like code, a Bit Error Rate (BER) chart is a commonly used tool. A typical BER chart for Turbo codes looks like Figure 2.2. The Y axis is the BER of the decoding result after a certain number of decoding iterations and the X axis is $E_b/N_0$, where $E_b$ is the energy per bit and $N_0$ is the noise power spectral density (i.e. the noise power in a 1 Hz bandwidth). As shown in Figure 2.2, a typical Turbo
Figure 2.2: A typical BER chart for Turbo codes. Axes: BER (logarithmic, from $10^{0}$ down to $10^{-7}$) against $E_b/N_0$ (0 to 2.5). Annotated features: threshold $E_b/N_0$, turbo cliff and error floor.
code can achieve a very low BER once $E_b/N_0$ reaches a certain point. In the figure, the point at which the BER curve starts to decrease is called the threshold $E_b/N_0$, the region where the BER curve falls quickly is called the turbo cliff region, and the region where the BER curve is flat at a very low value is called the error floor region. To understand how Turbo codes outperform the earlier coding schemes, we quote Figure 2.3 from [?]. It shows simulation results for the original rate R = 1/2 Turbo code presented in [25] and a maximum free distance (MFD), rate R = 1/2, memory 14, (2, 1, 14) convolutional code with Viterbi decoding. The simulation results show that the Turbo code outperforms the convolutional code by 1.7 dB at a BER of $10^{-5}$. The comparison is striking, especially since a detailed complexity analysis reveals that the complexity of the Turbo decoder is much lower than that of the Viterbi decoder used for the convolutional code. A classical Turbo encoder is composed of two recursive systematic convolutional (RSC) encoders, as shown in Figure 2.4. The input information sequence is encoded twice by the two RSC encoders. The first encoder processes the information in its original order, while the second encoder processes the same sequence in a different order obtained by an interleaver. In this scheme the systematic bit sequence is also transmitted to the
Figure 2.3: Performance comparison of a Turbo code and a convolutional code [?].
decoder. As shown in the figure, sequences c and d are the outputs of the two encoders. Sequence a is the systematic bit sequence and b is the interleaved systematic bit sequence. Note that only a is transmitted, since b can be obtained by an identical interleaver at the decoder. In the decoding process, as shown in Figure 2.5, two a posteriori probability
Figure 2.4: A classical Turbo encoder. Blocks: Input, RSC1, RSC2, Puncturing, Output. Labelled signals: c, d.
(APP) decoders are used, one for each of the two convolutional encoders in the encoding scheme, to obtain the minimal bit error probability. In the figure, the decoder inputs are the soft decision sequences, obtained by the demodulator, that correspond to the output sequences a, c and d in Figure 2.4. The purpose of an APP decoder is to compute a posteriori probabilities of either the information bits or the encoded symbols. Its application in Turbo-like codes has made it the major representative of SISO decoders. The algorithm was originally invented by Bahl, Cocke, Jelinek and Raviv in 1972 and is therefore called the BCJR algorithm [38]. Its capability of generating soft decisions is well suited to iterative decoding schemes. In Figure 2.5, the two decoders work alternately in
an iterative way. To obtain the correct order of the input sequences, an interleaver identical to the one used in the encoding scheme and a corresponding deinterleaver are used between the decoders. An extra interleaver is used to provide the systematic sequence to both decoders. The main advantage of this decoding process compared with using VA decoders is that it utilises the ability of the decoders to accept soft decisions at their inputs. However, in iterative decoding schemes, the information passed from one decoder to the other is extrinsic information rather than a posteriori information. The extrinsic information represents the new information obtained by a decoder. The reason for using extrinsic information is to prevent the decoding scheme from becoming a positive feedback amplifier [35]. As shown in Figure 2.5, the a priori information from the systematic sequence is added to the input of the decoders. Since the a posteriori information already includes the a priori information that the other decoder supplied in the previous decoding step, feeding it back would create a positive feedback amplifier in the loop. Using extrinsic information instead of a posteriori information solves this problem. Therefore, the output of the decoders in Figure 2.5 is extrinsic information. It can be obtained by a simple subtraction of the a priori information from the a posteriori information. Alternatively, it can be generated directly by a modified BCJR algorithm. By receiving new extrinsic information from the other decoder, the reliability of the decoding increases in each iteration. The whole decoding process stops when the required reliability is reached or when no further reliability can be gleaned. In practice, the modified
Figure 2.5: A classical Turbo decoder. Blocks: APP1, APP2, Output. Labelled input: a.
BCJR algorithm avoids the final subtraction operations, which makes it more suitable for iterative decoding schemes. Moreover, a further improved version of the BCJR algorithm, called the Log-APP or Log-BCJR algorithm, is a version of the BCJR algorithm transformed into the logarithmic domain. Its purpose is to avoid the large number of multiplication operations in the BCJR algorithm. More importantly, the Log-BCJR algorithm has variables with a much more manageable dynamic range than those of the BCJR algorithm, reducing the memory requirement and allowing fixed-point processing to be used. Since it avoids the complex circuitry of the many multipliers required by the original BCJR algorithm and requires much less memory, the Log-BCJR algorithm is widely used in practice. Hence, in this report, since we only investigate the applications of the BCJR algorithm in iterative decoding schemes, the algorithm we discuss and simulate is the
modified Log-BCJR algorithm that generates the extrinsic information directly. We will discuss the details of the algorithm in the next section. The Turbo principle can also be applied to SC codes, giving another primary category of Turbo-like code, serially concatenated convolutional codes (SCCCs) [28]. Instead of the decoding scheme in Figure 2.1, a scheme similar to the Turbo decoder is used, as shown in Figure 2.6. The inner and outer decoders are replaced with SISO decoders, and an interleaver and a deinterleaver are required to form the iterative decoding loop. According to [35], serial Turbo codes perform better than parallel Turbo codes in the
Figure 2.6: A classical SC decoder. Blocks: Input, Inner decoder, Outer decoder, Output, with an interleaver and a deinterleaver forming the loop between the two decoders.
error floor region. On the other hand, in the turbo cliff region, parallel Turbo codes perform better at the same overall coding rate.
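As a rough illustration of the interleaver and deinterleaver described in this section, the following Python sketch permutes a sequence pseudo-randomly and then restores it. The function names, the block length of 16 and the seed are hypothetical choices made only for this example; a practical interleaver follows a defined construction rather than a generic shuffle.

```python
import random

def make_interleaver(length, seed=0):
    """Return a pseudo-random permutation of range(length) (illustrative only)."""
    rng = random.Random(seed)
    perm = list(range(length))
    rng.shuffle(perm)
    return perm

def interleave(seq, perm):
    """Output position i takes the symbol from input position perm[i]."""
    return [seq[p] for p in perm]

def deinterleave(seq, perm):
    """Undo interleave(): put each symbol back at its original position."""
    out = [None] * len(seq)
    for i, p in enumerate(perm):
        out[p] = seq[i]
    return out

perm = make_interleaver(16, seed=1)
data = list(range(16))
restored = deinterleave(interleave(data, perm), perm)

# A burst of errors on interleaved positions 4..7 ...
burst = [1 if 4 <= i < 8 else 0 for i in range(16)]
# ... lands on positions perm[4..7] of the original ordering after deinterleaving.
dispersed = deinterleave(burst, perm)
error_positions = [i for i, flag in enumerate(dispersed) if flag]
```

The deinterleaver needs only the same permutation as the interleaver, which is why both ends of the link must agree on it in advance.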
2.2 Turbo codes and BCJR decoding algorithm
Turbo-like codes generally have a simple encoding scheme and a relatively complicated decoding scheme. In this section, we use the 3rd Generation Partnership Project (3GPP) Universal Mobile Telecommunications System (UMTS) standard [39] as an example to introduce a typical Turbo coding scheme, its constituent convolutional code and the SISO decoding algorithm for convolutional codes, the BCJR algorithm. The UMTS Turbo code and the BCJR algorithm are also used as examples to present the work in this report, so the descriptions in this section are referred to in later chapters.
2.2.1 UMTS encoder and decoder architecture
To simplify the description, we assume that BPSK modulation is used, so each transmitted symbol is a bit. For other modulation methods, the transmitted bits here would be replaced by transmitted symbols. According to [39], the constituent RSC encoder of the UMTS Turbo code is a rate R = 1/2 convolutional code with constraint length K = 4 and memory m = 3. Two such identical encoders form a rate 1/3, 8-state parallel concatenated convolutional code (PCCC), illustrated in Figure 2.7. In each RSC encoder, the three memory bits form an 8-state finite-state machine (FSM). We use the notation $N_a$ to represent the block length of the sequence a to be encoded. Before the encoding of the bit sequence a commences, the shift registers of each constituent convolutional code are initialised in a state that is known to the receiver. Typically, the m = 3 memory elements of each shift register are

Figure 2.7: Scheme of UMTS Turbo encoder. Blocks: Input, Interleaver, delay elements (D) of the two RSC encoders, MUX, Output. Labelled signals: a, b, c, f.
initialised with logic-zeros, placing them in what is referred to as the all-zero state. However, following the encoding of the $N_a$ bits in the sequence a, the shift registers will enter states that are not inherently known to the receiver. A number of techniques have been proposed to cope with this [35]:
No termination: In this case, the decoder considers every state to be equally likely at the end of a block. No information about the final state needs to be provided. The decoding process is then less effective for the last encoded data and the performance may be reduced. The degradation is a function of the block length. However, for some applications the degradation may be acceptable.
Termination: This method appends several extra bits to the end of each block to force the encoder to return to the all-zero state. The UMTS Turbo code of Figure 2.7 is one example of this technique. The extra tail bits also need to be sent to the decoder. This method resolves the uncertainty about the final state but introduces two drawbacks. Firstly, extra redundancy is added to the transmission. Nevertheless, the redundancy is negligible except for very short blocks and it is useful for error correction. Secondly, for parallel codes, the tail bits are not identical for each constituent code, which means that the extrinsic information of the tail bits cannot be exchanged between the decoders in the iterative decoding process. Hence, the data at the end of the block will benefit less from the Turbo decoding process. SCCCs have a similar problem.
Tail-biting: [40] introduced a technique that allows any state of the encoder to be the initial state. This method involves a double encoding process. Firstly, a normal encoding of the sequence starting from the all-zero state is performed, but the output of the encoder is ignored. Only the final state of the encoder is stored. Secondly, the encoding process is performed again in order to actually generate the output. In this step, the initial state is a function of the final state previously stored. The result of this process is that the final state of the encoder is equal to its initial state. The advantage of this method is that no extra bits have to be added and transmitted. However, the double encoding process is its main drawback. In addition, it only works for convolutional codes for which the BCJR algorithm is specially adapted.
In the UMTS Turbo code, the termination technique is used, as shown in Figure 2.7. The shift registers are initialised to the all-zero state before a bit block a is encoded. After the $N_a$ bits of the source sequence a have been encoded, the two switches in the figure switch down to form a closed loop in each of the two encoders. Following this, m = 3 further bits are encoded in order to reset the contents of each shift register to the all-zero state; these tail bits are taken from the shift register feedback. The output of the Turbo encoder is a, c, d, e and f, where a is the systematic bit sequence, c and d are the encoded bit sequences of the two encoders, respectively, and e and f are the termination sequences of the two encoders, respectively. Since it takes m bits to force each encoder back to the all-zero state, when a comprises $N_a$ bits, c and d comprise $N_c = N_d = N_a + m = N_a + 3$ bits, while e and f comprise $N_e = N_f = m = 3$ bits. The interleaved sequence b has length $N_b = N_a$. In the UMTS standard, the block length of the Turbo code (i.e. the length of the bit sequence a) satisfies $N_a \in [40, 5114]$. Note that the additional termination bits (e and f) make the coding rate R of the encoder slightly lower than 1/3, namely $R = N_a/(N_a + N_c + N_d + N_e + N_f) = N_a/(3N_a + 4m)$. To understand the operation of the FSM, the state transitions can be shown as a trellis diagram, as in Figure 2.8. In the figure, $a_n$ is the input sequence, and $c_n$ and $e_n$ are the output sequences.
$S_1$, $S_2$ and $S_3$ are the current states of the three memory bits in the encoder. $S_1^+$, $S_2^+$ and $S_3^+$ are the next states of the three memory bits. The state transitions and the encoder outputs can be expressed by the following equations. For encoding bits:
$S_1^+ = a_k \oplus S_2 \oplus S_3$ (2.1)
$S_2^+ = S_1$ (2.2)
$S_3^+ = S_2$ (2.3)
$c_k = S_1^+ \oplus S_1 \oplus S_3$ (2.4)

Figure 2.8: Scheme of the convolutional encoder and the trellis diagram. Panels: transition trellis for encoding bits (branches labelled $a_k/c_k$) and transition trellis for termination bits (branches labelled $e_k/c_k$). State1 to State8 correspond to $S_1 S_2 S_3$ = 000, 001, 010, 011, 100, 101, 110 and 111.
For termination bits:
$S_1^+ = 0$ (2.5)
$S_2^+ = S_1$ (2.6)
$S_3^+ = S_2$ (2.7)
$e_k = S_2 \oplus S_3$ (2.8)
$c_k = 0 \oplus S_1 \oplus S_3$ (2.9)
The eight possible states correspond to State1 to State8, as shown in the figure. The trellis diagrams give all the possible transitions of the FSM. The left trellis diagram shows the transitions for the encoding bits in a sequence. The right diagram shows the transitions for the termination bits in a sequence. Note that the first state in a transition sequence is the all-zero state, which is State1 in the figure. With the termination technique, the last state in the sequence is forced back to State1. As a result, the possible transitions in the first three steps and the last three steps are limited. A trellis diagram of a transition sequence is shown in Figure 2.9. The input sequence is $a_n$, and the output sequences $c_n$ and $e_n$ can be obtained by tracking the state transitions in Figure 2.9. For instance, for a 5-bit input sequence a = [0, 1, 1, 0, 1], the transitions in the trellis are shown in Figure 2.10. Note that there are 8 steps in the trellis since three termination bits are included. The encoded bit sequence is c = [0, 1, 0, 0, 1, 0, 1, 1] and the systematic bit sequence actually transmitted is [a, e] = [0, 1, 1, 0, 1, 0, 0, 1], matching the branch labels in Figure 2.10. The trellis diagram is not only helpful to understand the encoding operations of a convolutional
Figure 2.9: Trellis diagram of a transition sequence. Steps are labelled $a_1/c_1$, $a_2/c_2$, ..., $a_{N_a}/c_{N_a}$, $e_1/c_{N_a+1}$, $e_2/c_{N_a+2}$, $e_3/c_{N_a+3}$; rows are State1 (000) to State8 (111).
code, but also useful to explain the BCJR decoding algorithm, as we shall discuss later.
Figure 2.10: An example transition sequence. Branch labels $y_n/c_n$ along the traced path: 0/0, 1/1, 1/0, 0/0, 1/1, 0/0, 0/1, 1/1; rows are State1 (000) to State8 (111).
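The state-update equations for the encoding and termination bits can be checked with a short Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, and the parity output is taken as $S_1^+ \oplus S_1 \oplus S_3$, an assumption chosen to agree with the termination form in Equation (2.9). With that choice the sketch reproduces the $y_n/c_n$ branch labels of Figure 2.10 for the 5-bit example.

```python
M = 3  # number of memory elements (and termination bits) per RSC encoder

def rsc_encode_terminated(a):
    """Encode bit list a with the 8-state RSC code; return (y, c), y = [a, e]."""
    s1 = s2 = s3 = 0                      # all-zero initial state
    y, c = [], []
    for bit in a:                         # encoding bits
        s1_next = bit ^ s2 ^ s3           # feedback into the first register
        c.append(s1_next ^ s1 ^ s3)       # parity bit (assumed taps, see lead-in)
        y.append(bit)                     # systematic bit
        s1, s2, s3 = s1_next, s1, s2
    for _ in range(M):                    # termination bits
        y.append(s2 ^ s3)                 # tail bit e_k taken from the feedback
        c.append(0 ^ s1 ^ s3)             # parity with S1+ forced to 0
        s1, s2, s3 = 0, s1, s2
    assert (s1, s2, s3) == (0, 0, 0)      # encoder is back in State1
    return y, c

def turbo_code_rate(n_a, m=M):
    """Overall rate R = Na / (3 Na + 4 m) including both tails."""
    return n_a / (3 * n_a + 4 * m)

y, c = rsc_encode_terminated([0, 1, 1, 0, 1])
rate_shortest = turbo_code_rate(40)       # shortest UMTS block
rate_longest = turbo_code_rate(5114)      # longest UMTS block
```

The two rate values show how the tail overhead matters mainly for short blocks: the rate approaches 1/3 as $N_a$ grows.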
The architecture of the decoder is shown in Figure 2.11. A data loop is formed between decoder 1 and decoder 2 to realise the iterative decoding process. Each iteration consists of two half iterations, one for each constituent RSC code. The two decoders operate alternately, since the input of one decoder includes the output of the other decoder from the previous half iteration. The operation of the RSC decoder (i.e. the BCJR algorithm) is described in Section 2.2.2. In the figure, the input of the decoding scheme is assumed to be in soft decision form, which means that the channel gain and

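Figure 2.11: Scheme of UMTS Turbo decoder. Blocks: Input, deMUX, Decoder 1, Decoder 2, interleavers ($\pi$) and deinterleaver ($\pi^{-1}$). Labelled signals: $a^c$, $c^c$, $d^c$, $e^c$, $f^c$, $a^a$, $a^e$, $a^p$, $b^a$, $b^c$, $b^e$, $y^a$, $y^e$, $z^a$, $z^e$.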
noise variance have been properly taken into account. The five inputs $a^c$, $c^c$, $d^c$, $e^c$ and $f^c$ are the soft decisions corresponding to the coded outputs a, c, d, e and f of the encoding scheme. Each decoder receives two information sequences. One is the soft decisions of the encoded sequence, which are received directly from the transmission channel ($c^c$ for decoder 1 and $d^c$ for decoder 2). The other is the uncoded input sequence, which is formed by adding the a priori information provided by the other decoder to the received systematic information. The a priori information is the extrinsic information generated by the other decoder, reordered by the appropriate interleaver ($\pi$) or deinterleaver ($\pi^{-1}$). For decoder 1, the uncoded input LLR sequence $y^a$ is the sum of $a^a$ and $a^c$, followed by $e^c$, as shown in the figure. Because the two encoders have independent tails, the soft decisions of the tail bits are not passed between the decoders. Thus the information of the termination bits needs to be handled carefully. The systematic information of decoder 1's termination bits, $e^c$, needs to be appended to complete $y^a$. Therefore, $y^a = [a^a + a^c, e^c]$. On the other hand, the termination bits need to be cut off from the extrinsic information generated by decoder 1, $y^e$, before it is interleaved and passed to decoder 2, as shown in the figure. Therefore, the lengths of $y^a$ and $y^e$ are $N_y = N_a + m$. For decoder 2, correspondingly, the uncoded information is the sum of $b^a$ (i.e. the interleaved $a^e$) and the interleaved systematic information $b^c$, followed by $f^c$. Therefore, $z^a = [b^a + b^c, f^c]$, and the same processing of the termination bits is applied to $z^a$ and $z^e$. The lengths of $z^a$ and $z^e$ are $N_z = N_b + 3 = N_a + 3$. In the first iteration, $b^e$ is initialised with a sequence of zero-valued LLRs, which imply that the values of the corresponding bits are completely unknown. $a^c$ is the received systematic information.
Note that two interleavers identical to the one in the encoding scheme and a corresponding deinterleaver are used between the decoders to give the correct order of the input sequences. As discussed
before, in the BCJR algorithm we use, the extrinsic information is generated directly by the decoding algorithm inside the decoder. After all the iterations are completed, the a posteriori output of the decoding scheme is obtained by adding the final extrinsic output to the final a priori input of decoder 1, as shown in the figure. The SISO decoding process is then complete. Based on the soft decisions, hard bit decisions can be made to give the final decoding result.
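The handling of the termination bits around the decoders can be sketched in a few lines of Python. This is a minimal illustration of the data flow only, not of the SISO decoders themselves; the helper names and the numeric LLR values are hypothetical.

```python
def assemble_uncoded_input(a_priori, systematic, tail):
    """Build y^a = [a^a + a^c, e^c] (or z^a = [b^a + b^c, f^c])."""
    if len(a_priori) != len(systematic):
        raise ValueError("a priori and systematic LLRs must have equal length")
    return [p + s for p, s in zip(a_priori, systematic)] + list(tail)

def strip_tail(extrinsic, m=3):
    """Drop the m termination LLRs before the extrinsic LLRs are passed on."""
    return list(extrinsic[:len(extrinsic) - m])

def a_posteriori(extrinsic, a_priori_input):
    """Final soft output: extrinsic output plus a priori input, element-wise."""
    return [e + p for e, p in zip(extrinsic, a_priori_input)]

def hard_decisions(llrs):
    """LLR = ln(P(bit = 0) / P(bit = 1)), so a negative LLR means bit 1."""
    return [1 if llr < 0 else 0 for llr in llrs]

# First half iteration: nothing is known yet, so a^a is all zeros.
a_c = [1.5, -2.0]                 # hypothetical received systematic LLRs
e_c = [0.5, 0.5, -0.5]            # hypothetical received tail LLRs
y_a = assemble_uncoded_input([0.0, 0.0], a_c, e_c)

# Hypothetical extrinsic output of decoder 1 for the same block.
y_e = [0.7, -0.4, 0.1, 0.2, -0.3]
to_decoder_2 = strip_tail(y_e)    # tail LLRs are not exchanged
bits = hard_decisions(a_posteriori(y_e, y_a))
```

Keeping the tail handling in small helpers like these makes it easy to apply the same processing to $z^a$ and $z^e$ for decoder 2.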
2.2.2 BCJR algorithm
As we discussed, the APP algorithm we investigate is the Log-BCJR algorithm. In this section, we give a brief description of the Approx-Log-BCJR algorithm. The original BCJR algorithm is introduced in detail in [37].
2.2.2.1 Log-BCJR algorithm
There are two main advantages of introducing the Log-BCJR algorithm. Firstly, the original BCJR algorithm consists of many multiplication operations, which lead to very complex circuits when implemented in hardware. The Log-BCJR algorithm avoids the multiplications by transforming the algorithm into the logarithmic domain, where multiplications become additions. Secondly, the values of soft decisions in the normal domain can have a very large, theoretically unlimited, dynamic range, which requires a large amount of memory in practice. Transforming them into the logarithmic domain reduces the dynamic range of the soft decisions and consequently of all the internal variables in the algorithm. Hence this approach significantly reduces the memory required to implement the algorithm. We use the notation y to represent the systematic bits of the encoder including the termination bits, which means y = [a, e], and use $y^a$ to represent the received uncoded LLR sequence in our BCJR decoder, according to Figure 2.11.
We have $y = \{y_n\}_{n=1}^{N_y}$. In the normal domain, the soft decision of a received bit is defined as:
$y_n^a = \frac{P(y_n = 0)}{P(y_n = 1)}$ (2.10)
where $y_n^a$ is the soft decision of the received bit $y_n$. In the logarithmic domain, the soft decisions become log-likelihood ratios (LLRs) defined as:
$y_n^a = \ln \frac{P(y_n = 0)}{P(y_n = 1)}$ (2.11)
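As a small numerical illustration of Equation (2.11), the Python sketch below converts between a bit probability and its LLR. The function names are hypothetical and the probabilities are arbitrary example values.

```python
import math

def llr_from_prob0(p0):
    """LLR = ln(P(bit = 0) / P(bit = 1)) for 0 < p0 < 1."""
    return math.log(p0 / (1.0 - p0))

def prob0_from_llr(llr):
    """Inverse mapping: P(bit = 0) = 1 / (1 + e^(-LLR))."""
    return 1.0 / (1.0 + math.exp(-llr))

unknown = llr_from_prob0(0.5)      # a completely unknown bit has LLR 0
likely_zero = llr_from_prob0(0.9)  # positive LLR: bit 0 is more likely
likely_one = llr_from_prob0(0.1)   # negative LLR: bit 1 is more likely
```

The sign of the LLR carries the hard decision and its magnitude carries the reliability, which is why a zero-valued LLR is a natural initial value for unknown bits.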
The two basic operations in the original BCJR algorithm are addition and multiplication. For $A = \ln(a)$ and $B = \ln(b)$, multiplication in the normal domain becomes addition in the logarithmic domain: $\ln(ab) = \ln(e^A e^B) = A + B$ (2.12)
Addition in the normal domain can be computed in the logarithmic domain using the Jacobian logarithm, for which we define the function $\max^*$: $\ln(a + b) = \ln(e^A + e^B) = \max(A, B) + \ln(1 + e^{-|A-B|}) = \max^*(A, B)$ (2.13)
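Equation (2.13) can be sketched in Python as follows, together with a table-based approximation of its correction term of the kind suited to hardware. The table size (8 entries) and step (0.5) are assumptions made for this example only, not values taken from this report, and the function names are hypothetical.

```python
import math
from functools import reduce

def max_star(a, b):
    """Exact Jacobian logarithm: ln(e^a + e^b)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

# Hypothetical table of the correction term ln(1 + e^-d) at d = 0, 0.5, ..., 3.5.
LUT_STEP = 0.5
LUT = [math.log1p(math.exp(-i * LUT_STEP)) for i in range(8)]

def max_star_lut(a, b):
    """Approximate max*: compare, then select a correction from the table."""
    idx = int(abs(a - b) / LUT_STEP)
    correction = LUT[idx] if idx < len(LUT) else 0.0  # negligible for large |a - b|
    return max(a, b) + correction

def max_star_many(values, op=max_star):
    """More than two terms are combined by successive pairwise operations."""
    return reduce(op, values)

values = [0.3, -1.2, 2.5]
exact = math.log(sum(math.exp(v) for v in values))
pairwise = max_star_many(values)
approx = max_star_many(values, op=max_star_lut)
```

The exact and table-based versions differ only in the correction term, so the approximation error per operation is bounded by the spacing of the table entries.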
The $\max^*$ function is usually computed by successive pairwise operations when it has more than two arguments. In practice, the correction function $f_c = \ln(1 + e^{-|A-B|})$ can be implemented by a Look-Up Table (LUT), so it can be evaluated by a select operation on the LUT. The LUT-based version of the Log-BCJR algorithm is called the Approx-Log-BCJR algorithm. In Approx-Log-BCJR, the max function is performed by a compare operation between A and B. Thus, all the operations required in the algorithm are add, compare and select, the so-called ACS operations. To present the Log-BCJR algorithm, we use the convolutional code of the UMTS Turbo code as an example. Figure 2.12 shows the example trellis diagram provided in Section 2.2. $y_n$ are the systematic bits and $c_n$ are the encoded bits. Three tail bits are used for termination, as shown in the trellis, driving the encoder back into the all-zero state, State1. Note that we use the notation for decoder 1 in Figure 2.11 here, but by simply replacing the sequences y and c with z and d, the same decoding trellis also applies to decoder 2 in Figure 2.11.
Figure 2.12: An example trellis of a short terminated trellis code. Branch labels $y_n/c_n$ along the traced path: 0/0, 1/1, 1/0, 0/0, 1/1, 0/0, 0/1, 1/1; states are numbered $S_1$ to $S_{38}$ and transitions $T_1$ to $T_{60}$; rows are State1 (000) to State8 (111).
In the trellis diagram, there are 16 possible transitions in each step, except in the three initial steps at the start and the three termination steps at the end. For a given input systematic sequence $y_n$ and the corresponding encoded sequence $c_n$, only one transition is used in each step of an encoding trellis, as exemplified in Figure 2.12. The corresponding systematic bit and encoded bit of each transition are also given in the figure. To represent each state in Figure 2.12, we use
the notation $\{S_1, S_2, S_3, \ldots, S_{38}\}$ to denote the possible states in the trellis, ordered from top to bottom and from left to right, as shown in the figure. Similarly, we use the notation $\{T_1, T_2, T_3, \ldots, T_{60}\}$ to denote the possible transitions in the trellis in the same order. In addition, we use $t_n$ to represent the transition taken in the encoder trellis for the nth bit. Similarly, $s_n$ is the state entered by the encoder after the nth bit. Therefore, for the example sequences y and c in Figure 2.12, we have the traced transitions $\{t_n\}_{n=1}^{N_y} = \{T_1, T_4, T_{12}, T_{28}, T_{46}, T_{54}, T_{58}, T_{60}\}$ and traced states $\{s_n\}_{n=0}^{N_y} = \{S_1, S_2, S_6, S_{14}, S_{23}, S_{31}, S_{35}, S_{37}, S_{38}\}$. To describe the algorithm, we define the following notation.
fr(T) is the starting state of the transition T. For example, in Figure 2.12, $fr(T_1) = S_1$ and $fr(T_3) = S_2$.
to(T) is the ending state of the transition T. For example, in Figure 2.12, $to(T_1) = S_2$ and $to(T_2) = S_3$.
fr(S) is the set of all the transitions starting from state S. For example, in Figure 2.12, $fr(S_2) = \{T_3, T_4\}$.
to(S) is the set of all the transitions ending at state S. For example, in Figure 2.12, $to(S_{38}) = \{T_{59}, T_{60}\}$.
y(T) is the value of the bit in y that is implied by the transition T. For example, in Figure 2.12, $y(T_1) = 0$ since $t_1 = T_1$ implies that $y_1 = 0$. Similarly, $y(T_4) = 1$.
c(T) is the value of the bit in c that is implied by the transition T. For example, in Figure 2.12, $c(T_1) = 0$ and $c(T_2) = 1$.
n(T) is the bit index associated with the transition T. For example, in Figure 2.12, $n(T_1) = n(T_2) = 1$ and $n(T_3) = n(T_4) = n(T_5) = n(T_6) = 2$.
With the notation above, we can describe the Log-BCJR algorithm. The ultimate purpose of the algorithm is to calculate the extrinsic LLRs $y^e$ of the decoded sequence. However, within the algorithm it is more direct to calculate the probability that the encoder traversed a specific transition in the trellis. The calculation of the extrinsic LLRs $y^e$ requires the calculation of three further groups of internal variables, $\gamma$, $\alpha$ and $\beta$. The $\gamma$ values are conditional transition probabilities. In our case, the $\gamma$ values are divided into two sub-groups, the a priori transition probabilities $\gamma^y$ and the channel transition probabilities $\gamma^c$. They correspond to each transition in the trellis: for each transition in each step, there is a $\gamma^y(T)$ and a $\gamma^c(T)$. $\gamma^y(T)$ represents the probability $\ln[P(t_{n(T)} = T \mid y^a_n)]$. $\gamma^c(T)$ represents the probability $\ln[P(t_{n(T)} = T \mid c^c_n)]$.
27
The $\alpha$ values correspond to each state in each step of the trellis. An $\alpha$ value is the conditional probability that, in step $n$ (i.e. when the decoding process is working on the trellis step that corresponds to the received $y^a_n$ and $c^c_n$), the traversed transition T starts from a particular state S. That is, $\alpha(S)$ represents the log-domain probability

$$\ln[P(s_{n(S)} = S \mid \{y^a_n\}_{n=1}^{n(S)}, \{c^c_n\}_{n=1}^{n(S)})].$$

A $\beta$ value, on the other hand, is the conditional probability that a traversed transition T ends at a particular state. That is, $\beta(S)$ represents the log-domain probability

$$\ln[P(s_{n(S)} = S \mid \{y^a_n\}_{n=n(S)+1}^{N_y}, \{c^c_n\}_{n=n(S)+1}^{N_c})].$$

Finally, the three groups of variables can be used to calculate the probability that the encoder traversed a specific transition T in the trellis. We use $\delta$ to represent such a probability. For calculating the extrinsic information, $\delta_y$ is considered here, which represents the log-domain probability

$$\ln[P(t_{n(T)} = T \mid \{y^a_n\}_{n=1, n \neq n(T)}^{N_y}, \{c^c_n\}_{n=1}^{N_c})].$$

It is the joint probability of the corresponding $\gamma_c$, $\alpha$ and $\beta$ of the transition T. To compute all the variables above, the Log-BCJR algorithm is composed of the following steps.

1. $\gamma$ calculation: The values of $\gamma$ depend on the inputs of the convolutional decoder. There are two inputs, the encoded LLRs and the uncoded LLRs. As shown in Figure 2.7, the encoded LLR input $c^c$ contains the LLRs of the encoded sequence received from the channel. The uncoded LLR input is $y^a$. For a transition T, $\gamma_y$ and $\gamma_c$ can be calculated as:

$$\gamma_y(T) = (1 - y(T))\, y^a_{n(T)} \tag{2.14}$$
$$\gamma_c(T) = (1 - c(T))\, c^c_{n(T)} \tag{2.15}$$

2. $\alpha$ calculation: The values of $\alpha$ depend on the $\gamma$ values and on the $\alpha$ values from the previous step in the trellis. Hence, a forward recursion through the trellis is required to obtain all the $\alpha$ values. For a state S in step $n$, $\alpha$ is calculated as:

$$\alpha(S) = \max^{*}_{T \in \mathrm{to}(S)} \big(\gamma_y(T) + \gamma_c(T) + \alpha(\mathrm{fr}(T))\big) \tag{2.16}$$

where $\alpha(S_1) = 0$.

3. $\beta$ calculation: The values of $\beta$ depend on the $\gamma$ values and on the $\beta$ values from the next step in the trellis. Hence, a backward recursion through the trellis is required to obtain all the $\beta$ values. For a state S in step $n$, $\beta$ is calculated as:

$$\beta(S) = \max^{*}_{T \in \mathrm{fr}(S)} \big(\gamma_y(T) + \gamma_c(T) + \beta(\mathrm{to}(T))\big) \tag{2.17}$$

where $\beta(S_{38}) = 0$.

4. $\delta_y$ calculation: The values of $\delta_y$ can be calculated according to (2.18).

$$\delta_y(T) = \gamma_c(T) + \alpha(\mathrm{fr}(T)) + \beta(\mathrm{to}(T)) \tag{2.18}$$

5. Finally, the extrinsic information can be calculated from the $\delta_y$ values. The extrinsic LLRs $y^e$ of the uncoded bits are:

$$y^e_n = \max^{*}_{T \mid n(T)=n,\, y(T)=0} \big(\delta_y(T)\big) - \max^{*}_{T \mid n(T)=n,\, y(T)=1} \big(\delta_y(T)\big) \tag{2.19}$$

This completes the algorithm.
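The five steps above can be illustrated with a minimal Python sketch. It is not the decoder used in this report: the `Transition` container, the function names and the two-bit toy trellis are assumptions introduced purely for illustration, and the sketch assumes, as in Figure 2.12, that every state label is unique to one trellis step and that the max* operator is evaluated exactly.

```python
import math
from collections import namedtuple

# Illustrative container for one trellis transition T (names are assumptions):
# fr/to are state labels, y and c are the implied bits, n is the 1-based index n(T).
Transition = namedtuple("Transition", "fr to y c n")

NEG_INF = float("-inf")


def max_star(a, b):
    """Exact Jacobian logarithm ln(e^a + e^b)."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def log_bcjr(transitions, first_state, last_state, y_a, c_c):
    """Extrinsic LLRs y^e for a trellis whose state labels are unique per step."""
    ts = sorted(transitions, key=lambda t: t.n)
    gamma_y = [(1 - t.y) * y_a[t.n - 1] for t in ts]  # (2.14)
    gamma_c = [(1 - t.c) * c_c[t.n - 1] for t in ts]  # (2.15)

    # Forward recursion (2.16), starting from alpha(first_state) = 0.
    alpha = {first_state: 0.0}
    for t, gy, gc in zip(ts, gamma_y, gamma_c):
        cand = gy + gc + alpha.get(t.fr, NEG_INF)
        alpha[t.to] = max_star(alpha.get(t.to, NEG_INF), cand)

    # Backward recursion (2.17), starting from beta(last_state) = 0.
    beta = {last_state: 0.0}
    for t, gy, gc in zip(reversed(ts), reversed(gamma_y), reversed(gamma_c)):
        cand = gy + gc + beta.get(t.to, NEG_INF)
        beta[t.fr] = max_star(beta.get(t.fr, NEG_INF), cand)

    # delta_y (2.18) and extrinsic LLRs (2.19), grouped by bit index.
    zeros = [NEG_INF] * len(y_a)
    ones = [NEG_INF] * len(y_a)
    for t, gc in zip(ts, gamma_c):
        delta = gc + alpha[t.fr] + beta[t.to]
        if t.y == 0:
            zeros[t.n - 1] = max_star(zeros[t.n - 1], delta)
        else:
            ones[t.n - 1] = max_star(ones[t.n - 1], delta)
    return [z - o for z, o in zip(zeros, ones)]


# Toy two-bit trellis (an assumption for illustration, not the trellis of Figure 2.12).
toy = [
    Transition("S1", "S2", 0, 0, 1),
    Transition("S1", "S3", 1, 1, 1),
    Transition("S2", "S4", 0, 0, 2),
    Transition("S3", "S4", 1, 0, 2),
]
y_e = log_bcjr(toy, "S1", "S4", y_a=[0.5, -1.0], c_c=[2.0, 0.3])
```

In this toy trellis the two valid paths force both uncoded bits to be equal, so the extrinsic LLR of each bit collects the a priori LLR of the other bit plus the channel LLR of the first encoded bit.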
2.3 EXIT chart analysis
As mentioned, the BER chart is a powerful tool for analysing the performance of a Turbo-like code. However, it is unable to characterise the convergence behaviour of a Turbo-like code, for example at the onset of the turbo cliff. This requires a different analysis tool, namely the extrinsic information transfer (EXIT) chart [41]. An EXIT chart uses mutual information (MI) measurements to quantify the quality of the extrinsic information exchanged between the constituent decoders in an iterative decoding system. It comprises two curves, one for each of the two decoders in the system. Each curve plots the mutual information of the extrinsic LLRs versus the mutual information of the a priori LLRs of one decoder, which essentially measures the quality of the output of the decoder as a function of the quality of its input. Taking the UMTS decoding scheme as an example, for the first decoder the EXIT curve plots $I(a^e, a)$ as a function of $I(a^a, a)$, as shown in Figure 2.13, where $I(a^a, a)$ is the mutual information between $a^a$ and $a$, while $I(a^e, a)$ is the mutual information between $a^e$ and $a$. To draw the EXIT curve, we use a simulator to generate sequences of a priori LLRs $a^a$ having a range of mutual information values ($0 < I(a^a, a) < 1$). Using simulations that include the channel model, the modulation model and the BCJR decoder, the extrinsic output $a^e$ can be obtained and measured. If we use $I(a^e)$ to represent $I(a^e, a)$ and $I(a^a)$ to represent $I(a^a, a)$, the EXIT function $I(a^e) = F(I(a^a))$ of the UMTS Turbo code is shown in Figure 2.14. In the simulation, we use the exact convolutional code shown in Figure 2.13, with BPSK modulation and an AWGN channel. The Signal-to-Noise Ratio (SNR) is -4 dB, where the SNR is defined as:

$$\mathrm{SNR} = \frac{E_s}{N_0} \tag{2.20}$$
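As a rough illustration of the channel and modulation part of such a simulation, the following Python sketch produces channel LLRs for BPSK over AWGN at a given SNR as defined in (2.20). The function names, the mapping of bit 0 to +1, the unit symbol energy and the LLR convention ln[P(bit = 0)/P(bit = 1)] are assumptions made for this sketch and are not taken from the simulator used in this report.

```python
import math
import random


def llr_from_sample(r, snr_db):
    """LLR ln[P(bit=0)/P(bit=1)] of a received BPSK sample r (assumes Es = 1)."""
    snr = 10.0 ** (snr_db / 10.0)          # Es/N0 as a linear ratio
    return 4.0 * snr * r                   # 2*r/sigma^2 with sigma^2 = N0/2


def channel_llrs(bits, snr_db, rng):
    """Map bits to BPSK (0 -> +1, 1 -> -1), add Gaussian noise, return LLRs."""
    snr = 10.0 ** (snr_db / 10.0)
    sigma = math.sqrt(1.0 / (2.0 * snr))   # noise standard deviation per dimension
    llrs = []
    for b in bits:
        r = (1.0 - 2.0 * b) + rng.gauss(0.0, sigma)
        llrs.append(llr_from_sample(r, snr_db))
    return llrs


# Example: three bits at the SNR of -4 dB, with a seeded generator.
example = channel_llrs([0, 1, 1], -4.0, random.Random(1))
```

At low SNR values such as -4 dB the noise standard deviation exceeds the symbol amplitude, so individual LLRs frequently have the wrong sign; it is the decoder that has to recover from this.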
For the other decoder, another EXIT curve can be drawn in the same way from simulation. For a Turbo code, owing to the symmetry of the two concatenated codes, the EXIT function of the lower convolutional code is identical to that of the upper
Figure 2.13: Scheme of the EXIT chart generation. (Block diagram with the blocks Input, MUX, Modulator, Channel, deModulator, deMUX and BCJR decoder, together with the "generate LLRs" block for $I(a^a, a)$ and the "measure MI" block for $I(a^e, a)$.)

Figure 2.14: One EXIT curve $I(a^e) = F(I(a^a))$ of UMTS Turbo code using BPSK to transmit over an AWGN channel having an SNR of -4 dB. (Horizontal axis: $I(a^a)$; vertical axis: $I(a^e)$.)
convolutional code. In an EXIT chart, the second curve is displayed with swapped axes, that is, the horizontal axis is the mutual information of the extrinsic output and the vertical axis is the mutual information of the a priori input. The axes are swapped because, in the iterative decoding process, the output of one decoder is the input of the other decoder in the next iteration. By placing the input of one decoder and the output of the other decoder on the same axis, the interaction of the two concatenated decoders can be predicted on an EXIT chart. The complete EXIT chart of the UMTS Turbo code generated from our simulation results is given in Figure 2.15. The iterative decoding process of the Turbo code can be revealed by
Figure 2.15: EXIT chart of UMTS Turbo decoder. (Horizontal axis: $I(a^a)$; vertical axis: $I(a^e)$.)
decoding trajectories in the EXIT chart, as shown in Figure 2.16. A decoding trajectory
Figure 2.16: The decoding trajectories in the EXIT chart. (Horizontal axis: $I(a^a)$; vertical axis: $I(a^e)$.)
starts at the (0,0) point since, at the start of the decoding process, there is no a priori information coming from the other decoder. The mutual information of the output of the first decoder is obtained from the upper curve in the EXIT chart and is provided as the input of the second decoder. Based on the mutual information provided by the first decoder, the mutual information of the output of the second decoder is obtained from the lower curve in the EXIT chart. The decoding performance of the next iteration is obtained in the same way. Thus, a decoding trajectory is obtained. Note that the decoding trajectory has a high probability of reaching the (1,1) point only if the EXIT chart has an open tunnel. On reaching the (1,1) point, the maximum likelihood decoding has been found and the BER will be in the error floor region. However, the EXIT chart is a statistical result obtained from a large number of simulated samples, so in practice individual trajectories deviate from it. As shown in Figure 2.16, the three trajectories are all different and depart from the EXIT chart. The EXIT chart gives the average convergence behaviour of the investigated code. An EXIT chart allows the two concatenated codes to be considered in isolation from each other. Since EXIT charts can predict the iterative interaction of the two codes, the iterative decoding process does not need to be simulated in order to draw an EXIT chart. Thus, EXIT charts can be obtained faster than BER/FER charts.

There are a number of different methods for measuring the mutual information. The first is the averaging method, which uses the equation:

$$I(\tilde{\mathbf{a}}, \mathbf{a}) = 1 + \frac{1}{N_a} \sum_{n=1}^{N_a} \sum_{a'=0}^{1} \frac{e^{(1-a')\tilde{a}_n}}{1 + e^{\tilde{a}_n}} \log_2\left[\frac{e^{(1-a')\tilde{a}_n}}{1 + e^{\tilde{a}_n}}\right] \tag{2.21}$$
This method has the advantage of not requiring any knowledge of the bit sequence $\mathbf{a}$. This is achieved by assuming that the LLRs in $\tilde{\mathbf{a}}$ satisfy the consistency condition, that is, the LLRs express neither too much nor too little confidence. Since the averaging method believes what the LLRs say, it does not need to consider the true values of the bits in $\mathbf{a}$. However, this assumption is only valid if there are no sub-optimalities in the receiver design. This requires perfect channel estimation, perfect carrier recovery, perfect synchronisation, perfect equalisation and optimal decoding using the Log-BCJR algorithm. The histogram method of measuring mutual information does not make this assumption and is therefore better suited when a sub-optimal receiver is employed. This method uses knowledge of the true values of the bits in $\mathbf{a}$ to avoid having to believe what the LLRs say.
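The averaging method can be sketched in a few lines of Python. The sketch below is an illustration only: the function name is an assumption, and it assumes the LLR convention ln[P(a = 0)/P(a = 1)], so that each LLR is converted into the pair of probabilities it claims and the mutual information is one minus the average binary entropy of those pairs. Because the binary entropy is symmetric, the magnitude of the LLR is used, which also avoids numerical overflow.

```python
import math


def mi_averaging(llrs):
    """Mutual information estimated from LLRs alone (averaging method)."""
    total = 0.0
    for llr in llrs:
        p = 1.0 / (1.0 + math.exp(-abs(llr)))   # probability of the likelier bit value
        for prob in (p, 1.0 - p):
            if prob > 0.0:                      # 0 * log2(0) is taken as 0
                total += prob * math.log2(prob)
    return 1.0 + total / len(llrs)
```

LLRs of zero express no knowledge and give a mutual information of 0, while LLRs of large magnitude give a value approaching 1, regardless of whether their signs are actually correct. This is exactly why the method relies on the consistency condition.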
2.4 Fixed-point representation in a Turbo decoder
In this section, we give an introduction to fixed-point representation in hardware design. Compared with floating-point representation, fixed-point representation is easily implemented in a small memory space and is fast to execute. Therefore, it is well suited to real-time or low-power applications. Internally, fixed-point computations treat the values as integers, but interpret the integer part and the fraction part separately by means of an imaginary point. Two's complement representation is the most widely used fixed-point representation in practice. A two's complement binary number is divided into three parts: a sign bit, an integer part and a fraction part. First, let us consider the two's complement representation of signed integers before considering the representation of numbers having a fraction part. The most significant bit is used as the sign bit, where 0 represents a positive sign and 1 represents a negative sign. The rest of the bits represent the magnitude of the number. For a negative number, the two's complement representation is obtained by complementing the representation of its magnitude bit by bit and incrementing the result by 1. For example, the 3-bit representation of 2 is 010. The complement of this is 101. Adding 1 to this gives the two's complement representation of -2, namely 110. The complete set of 3-bit two's complement representations is given in Table 2.1. In addition, two other signed integer representations, sign and absolute value notation and one's complement notation, are also given in the table for comparison.

Binary number   Sign and absolute value   One's complement   Two's complement
000             +0                        +0                 +0
001             +1                        +1                 +1
010             +2                        +2                 +2
011             +3                        +3                 +3
100             -0                        -3                 -4
101             -1                        -2                 -3
110             -2                        -1                 -2
111             -3                        -0                 -1

Table 2.1: Different representation methods for integer numbers

As shown, compared with the other two methods, two's complement notation avoids the double representation of zero. As a consequence, the negative range extends one resolution step further than the positive range. The main advantage of two's complement notation is the ability to perform additions involving negative numbers without needing to take the sign of the operands into consideration. In two's complement notation, subtraction is achieved by complementing and adding. For example, in a 3-bit representation, 2 - 3 can be obtained by calculating the sum of 2 (010) and -3 (101):

$$2 - 3 = 2 + (-3) = 010 + 101 = 111 = -1 \tag{2.22}$$
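The complement-and-add rule can be checked with a short Python sketch. The helper names are assumptions introduced for illustration; Python integers are unbounded, so the n-bit behaviour is emulated by masking to the lowest n bits.

```python
def encode(value, n_bits):
    """Two's complement bit string of a signed integer, reduced modulo 2^n."""
    return format(value & ((1 << n_bits) - 1), "0{}b".format(n_bits))


def to_signed(value, n_bits):
    """Interpret the lowest n_bits of value as a two's complement number."""
    value &= (1 << n_bits) - 1
    if value & (1 << (n_bits - 1)):          # sign bit set: negative number
        return value - (1 << n_bits)
    return value


def add_wrap(a, b, n_bits):
    """Add two signed integers and discard any carry out of the top bit."""
    return to_signed(a + b, n_bits)


minus_two = encode(-2, 3)                    # complement of 010 plus one
difference = add_wrap(2, -3, 3)              # 010 + 101
```

Here `minus_two` is the string "110" and `difference` is -1, whose 3-bit encoding is "111", in agreement with (2.22).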
For subtractions in two's complement notation, it is necessary to let the result overflow. Take the following calculation as an example:

$$3 - 3 = 3 + (-3) = 011 + 101 = (1)000 = 000 = 0 \tag{2.23}$$

Since the overflowed bit is discarded, the calculation naturally gives the correct result. In contrast, subtraction in the other two notations is more complicated, since complementing and adding does not give the correct result. For example, in one's complement notation:

$$3 - 2 = 3 + (-2) = 011 + 101 = (1)000 = 000 = 0 \tag{2.24}$$

which differs from the correct result of 1.
Therefore, in those notations, additions involving operands of different signs need to be handled carefully and an extra correction is required. For a fractional fixed-point number A, an imaginary point is set at a certain place. An example for a 3-bit two's complement fixed-point number with a 2-bit fraction part is given in Table 2.2. In the table, the imaginary point is placed after the most significant bit of the binary representation.

Binary number   Two's complement
0.00            +0.00
0.01            +0.25
0.10            +0.50
0.11            +0.75
1.00            -1.00
1.01            -0.75
1.10            -0.50
1.11            -0.25

Table 2.2: Two's complement representation method for fraction numbers

For an n-bit two's complement representation, we use the notation Qp.q to represent the position of the point, where p is the number of bits in the integer part and q is the number of bits in the fraction part. The total number of bits is n = p + q + 1. For example, consider an 8-bit two's complement number with an imaginary point after the 5th bit: 01100.010. The integer part is 12 and the fraction part is 0.25, thus the value of 01100.010 is 12.25. The minimum and maximum limits of the representation are given by (2.25), and the resolution r is given by (2.26).

$$-\frac{2^{p+q}}{2^{q}} \le A \le \frac{2^{p+q} - 1}{2^{q}} \tag{2.25}$$
$$r = 2^{-q} \tag{2.26}$$
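The Qp.q notation can be explored with the following Python sketch, in which the function names are assumptions made for illustration. It evaluates the limits and resolution of a Qp.q format and decodes a two's complement bit string that has q fraction bits.

```python
def q_limits(p, q):
    """Return (minimum, maximum, resolution) of a Qp.q two's complement number."""
    scale = 1 << q                           # 2^q
    minimum = -(1 << (p + q)) / scale
    maximum = ((1 << (p + q)) - 1) / scale
    return minimum, maximum, 1 / scale


def q_decode(bits, q):
    """Value of a two's complement bit string with q fraction bits."""
    raw = int(bits, 2)
    if bits[0] == "1":                       # sign bit set: subtract 2^n
        raw -= 1 << len(bits)
    return raw / (1 << q)


value = q_decode("01100010", 3)              # the Q4.3 example 01100.010
limits = q_limits(4, 3)
```

For the Q4.3 example, `value` is 12.25 and `limits` is (-16.0, 15.875, 0.125). Calling `q_limits(0, 2)` gives (-1.0, 0.75, 0.25), which matches the range of Table 2.2.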
Chapter 3
Optimal Data-width Settings for Fixed-point Implementation

3.1 Introduction
In Turbo-like decoding schemes, the algorithms are usually specified in the floating-point domain. However, in practical implementations, a fixed-point number representation is mandatory for energy efficiency in most architectures, such as DSP systems, FPGA or VLSI implementations [42], since a fixed-point implementation allows significant reductions in energy consumption with only insignificant reductions in performance [43]. As discussed in Chapter 2, one of the advantages of the Log-BCJR algorithm is the reduced dynamic range of the internal variables and the LLRs. In practice, this allows a fixed-point representation to be used. In a fixed-point implementation, the hardware complexity increases linearly with the internal bit-width of the data representation, since this bit-width determines the width of all the data buses and the computing resources in the datapath structure [35]. Moreover, the iterative decoding process of Turbo-like coding schemes requires a large amount of memory to store the internal variables. Using fewer bits for each variable can significantly reduce the memory requirement and hence the energy consumption of the decoder. Therefore, for a low-power implementation, minimising the number of bits required to represent the fixed-point quantities in the algorithm is a very important issue. However, the information lost by reducing the data width degrades the performance. Therefore, there is a trade-off between communication performance and hardware complexity, which needs to be explored for a low-power design.

Many papers have investigated the fixed-point implementation of Turbo decoders by exploring the minimum data width of the different quantities that gives an acceptable degradation on the BER/FER chart [42-50]. However, no universal conclusion has been obtained. Even though some of the papers used the same simulation specification, namely the UMTS Turbo decoder with BPSK modulation simulated over an AWGN channel, their conclusions differ [42, 43, 45, 47, 49, 50]. The reason is that, in a fixed-point implementation, different issues affect the decoding performance and different techniques exist to deal with them.

The performance degradation caused by a fixed-point implementation is due to lost information, that is, underflow and overflow. For underflow, the fraction bit-width limits the accuracy of the calculations in the algorithm. In particular, since we investigate the Log-BCJR algorithm [51] using a Look-Up-Table (LUT) to realise the Jacobian logarithm, the precision of the fixed-point representation is directly related to the number of elements in the LUT. As discussed in Chapter 2, the max* operator of the Jacobian logarithm is defined as:

$$\max^{*}(x, y) = \ln(e^{x} + e^{y}) \tag{3.1}$$
$$= \max(x, y) + \ln(1 + e^{-|y-x|}) \tag{3.2}$$
$$\approx \max(x, y) + f_c(|y-x|) \tag{3.3}$$
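The relation between the fraction bit-width and the correction table can be illustrated with the Python sketch below. It is only one possible construction: the function names are assumptions, and the sketch assumes that both the argument |y - x| and the table entries are rounded to the nearest multiple of the resolution $2^{-q}$, with the table ending at the first entry that rounds to zero.

```python
import math


def build_lut(frac_bits):
    """Correction values f_c(k * r) rounded to the resolution r = 2^-frac_bits."""
    r = 2.0 ** -frac_bits
    lut = []
    k = 0
    while True:
        exact = math.log1p(math.exp(-k * r))     # ln(1 + e^-|y-x|) at |y-x| = k*r
        level = math.floor(exact / r + 0.5)      # nearest multiple of r
        lut.append(level * r)
        if level == 0:                           # all later entries are zero too
            return lut
        k += 1


def max_star_lut(x, y, lut, frac_bits):
    """max*(x, y) as in (3.3), with the correction term read from the LUT."""
    index = int(abs(y - x) * 2 ** frac_bits + 0.5)
    correction = lut[index] if index < len(lut) else 0.0
    return max(x, y) + correction


levels = {bits: len(set(build_lut(bits))) for bits in (3, 2, 1)}
```

Under this rounding convention the table takes 7, 4 and 2 distinct output values for 3, 2 and 1 fraction bits respectively, so `levels` is `{3: 7, 2: 4, 1: 2}`.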
Function $f_c$ is a quantised version of the function $\ln(1 + e^{-|y-x|})$, which is implemented by a LUT. Therefore, the bit-width of the fraction part determines the largest number of elements in the LUT, as shown in Figure 3.1. For example, a 3-bit fraction gives a LUT resolution of 0.125, so that the largest possible LUT has 7 elements. By analogy, a 2-bit fraction gives a 4-element LUT and a 1-bit fraction
Figure 3.1: Correction function and its LUT implementations. (Curves: correction function, 3-bit fraction LUT, 2-bit fraction LUT and 1-bit fraction LUT; horizontal axis: $|y-x|$; vertical axis: $f(|y-x|)$.)
gives only a 2-element LUT, as shown in Figure 3.1. Hence, the fraction bit-width affects not only the width of the data buses, the computing resources and the memory requirement, as discussed, but also the complexity of the LUT used in the algorithm.

The occurrence of overflow depends on the dynamic range of the variables in the algorithm and on the number of bits used in the integer part of the fixed-point representation. In the event of overflow, the lost information could be fatal to the system performance. However, the dynamic range of the variables is difficult to predict and is sometimes quite large, requiring a large number of bits in the fixed-point representation to guarantee that the range is covered. In a Log-BCJR decoder, there are only three different operations, namely add, compare and select, referred to as ACS operations. The compare and select operations cannot induce any overflow. However, any add operation may overflow. Taking the Log-BCJR algorithm used in the UMTS decoder of Chapter 2 as an example, each $\alpha$ in the decoding trellis is the sum of two $\gamma$ values, an $\alpha$ from the previous step and a correction function $f_c$ from (3.3). Since it includes an $\alpha$ from the previous step, the calculation of $\alpha$ accumulates the $\gamma$ values along the trellis. Therefore, the values of $\alpha$ would increase without limit as the block length is increased. The resulting overflow in a limited data width is the most significant effect that needs to be considered. The calculation of $\beta$ has the same problem. The $\delta$ calculation is the sum of a $\gamma$, an $\alpha$ and a $\beta$, and may also overflow. To deal with these issues, a number of different techniques have been proposed [45, 48]. The first approach is to saturate the overflowing data during its processing. This method is widely used in fixed-point digital filters [52, 53]. A disadvantage of this approach is that it requires additional saturation hardware on each computing unit that could cause an overflow, such as adders.
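The saturation approach can be sketched in Python as follows. The function names and the 4-bit example are assumptions chosen for illustration; the sketch models a saturating adder by clipping each sum to the representable two's complement range.

```python
def saturating_add(a, b, n_bits):
    """Add two signed integers and clip the sum to the n-bit two's complement range."""
    lowest = -(1 << (n_bits - 1))
    highest = (1 << (n_bits - 1)) - 1
    return max(lowest, min(highest, a + b))


def accumulate(increments, n_bits):
    """Running accumulation of a metric with saturation applied at every add."""
    total = 0
    history = []
    for increment in increments:
        total = saturating_add(total, increment, n_bits)
        history.append(total)
    return history


history = accumulate([2, 2, 2, 2], 4)        # 4-bit range is -8 ... 7
```

In this example `history` is `[2, 4, 6, 7]`: the last sum of 8 does not fit into 4 bits and is clipped to 7. Once several accumulated metrics have been clipped to the same limit, the differences between them are no longer preserved.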
Our simulation results showed that this technique alone is not suitable for the Log-BCJR algorithm, but that it works well in combination with a second technique, namely normalisation [45]. Normalisation is applied in the Log-BCJR algorithm specifically to deal with overflow of the $\alpha$ and $\beta$ internal variables. It scales down the growing metrics in each step, in order to prevent them from increasing without bound. This reduces the occurrence of overflow and allows the data width for representing the variables to be further reduced. As discussed, the $\alpha$ and $\beta$ values accumulate along the decoding trellis. Taking $\alpha$ as an example, each $\alpha$ is the sum of a previous $\alpha$, two $\gamma$ values and a correction function value. For each $\alpha$, there is an accumulation history route in the trellis. An example is shown in Figure 3.2. Based on the algorithm described in Chapter 2, $\alpha(S_4)$, $\alpha(S_5)$, $\alpha(S_6)$ and $\alpha(S_7)$ accumulate from $\alpha(S_2)$ and $\alpha(S_3)$, which in turn accumulate from $\alpha(S_1)$. This accumulation continues as the forward recursion proceeds, with subsequent $\alpha$ values typically becoming higher and higher. In this way, overflow can occur for the $\alpha$ values calculated towards the end of the forward recursion. However, the extrinsic LLRs that are generated by the Log-BCJR algorithm are not sensitive to the particular value of any $\alpha$, only to the differences between the $\alpha$ values of states having the same bit index [48]. For example, the Log-BCJR algorithm is not sensitive to the values of $\alpha(S_4)$, $\alpha(S_5)$, $\alpha(S_6)$ and $\alpha(S_7)$, only to the differences $\alpha(S_4) - \alpha(S_5)$, $\alpha(S_4) - \alpha(S_6)$, $\alpha(S_4) - \alpha(S_7)$ and so on. The same conclusion also applies to the $\beta$ values. As shown in (2.19), the basic operation of the extrinsic LLR calculation is to select two $\delta$ values and to calculate their difference, where each $\delta$ is the sum of a particular $\gamma$, $\alpha$ and $\beta$ in a single step. Therefore, if the
Figure 3.2: A possible accumulation route in the trellis. (Trellis fragment showing the states $S_1$ to $S_7$ and the transitions $T_1$ to $T_6$, drawn against the eight encoder states State1 = 000 to State8 = 111, with the branches labelled by $y_n/c_n$.)
$\alpha$ values from the previous step are reduced by a common value before the $\alpha$ values of the current step are calculated, the differences of interest are preserved, but the growth of the $\alpha$ values is slowed down. The same method also works for the $\beta$ values. The normalisation technique serves this purpose. However, the normalisation process requires extra calculations and operations, increasing the datapath complexity. In addition, the normalisation technique itself has different variants. The most widely used normalisation technique is subtractive normalisation [45, 48, 54], in which the path metrics are normalised by subtracting a constant from all the metrics at a particular time. Even this method has different versions. In [45], the minimum path metric of each step is subtracted from all path metrics of that step. In [48], the maximum path metric of the step is subtracted instead. This technique requires extra computations to find the maximum path metric and to perform the subtractions. In [54], a modified version is described in which, instead of searching for the smallest or largest metric at each step, the metric of a fixed state is subtracted from all path metrics. Hence, the comparison operations for finding the required metric can be avoided. These different versions lead to different data width requirements in the conclusions of the papers.

The third approach exploits the nature of the two's complement representation [45]. It was first introduced for Viterbi decoders [55] and later applied to SISO decoders. The Log-BCJR decoding process is only concerned with the differences between the path metrics. It can be proven that all possible differences between pairs of path metrics are upper bounded [55]. Therefore, in two's complement representation, as long as the difference between two metrics does not exceed the largest value that can be represented by the specified data width, the subtraction can be performed correctly using modulo $2^n$ arithmetic by simply ignoring the overflow of the operands. Three examples of difference calculations using this method are given in Figure 3.3. Note that in the calculations of (1+3)-2 and (2+2+2)-3, the results in the brackets both overflow in a 3-bit two's complement representation, but the equivalent calculation in two's complement representation still gives the correct answer, as long as the final result does not overflow. However, for the third calculation, the difference in the last calculation step overflows. In this situation, the two's complement representation cannot keep the result correct, as shown in the figure. Modulo $2^n$ arithmetic is naturally implemented in a VLSI architecture. Thus,
(3-bit two's complement number circle: 000 = 0, 001 = 1, 010 = 2, 011 = 3, 100 = -4, 101 = -3, 110 = -2, 111 = -1)

(1+3)-2 = 2:    (001+011)-010 = 100-010 = 100+110 = 010 = 2
(2+2+2)-3 = 3:  (010+010+010)-011 = 110-011 = 110+101 = 011 = 3
(2+2+2)-1 = 5:  (010+010+010)-001 = 110-001 = 110+111 = 101 = -3

Figure 3.3: Example of difference calculation in two's complement representation.
no additional hardware is required in this approach. According to [45], 1-2 more bits of data width may be required for this approach compared to subtractive normalisation, and it is shown that for a high-speed MAP decoder one additional bit results in approximately 25% higher area and power consumption. In conclusion, when implementing a Turbo decoding algorithm in a fixed-point representation, different choices of the related techniques lead to different optimal data width requirements. In the eight similar previous works we investigated [42-47, 49, 50], the different environment, design and implementation configurations lead to different conclusions. Some of them did not even provide a clear configuration of their simulations, which makes the results unrepeatable. A brief summary of the configurations of the eight papers is given in Table 3.1. Three similar turbo codes are considered in the papers, as shown in Figure 3.4. In the figure, type-2 corresponds to the UMTS Turbo encoder discussed in Chapter 2. Only a few papers discussed the effects of this issue based on mathematical proofs [45, 47, 48]. However, it has been shown that mathematical proofs are not sufficient to decide the optimal data width specifications in practice. Some of the mathematical proofs give upper bounds on the path metrics that will never be exceeded in practice [48]. Moreover, when saturation and normalisation techniques are applied, the data width requirement can be further reduced with a tolerable decrease in communication performance (i.e. BER/FER degradation) [45]. Therefore, our simulation results show that the actual dynamic range
Figure 3.4: Three different Turbo codes in previous works. (Encoder structures Type-1, Type-2 and Type-3.)

authors               J. Hsu [44]    M. A. Castellon [43]   G. Montorsi [47]   M. A. Castellon [50]
encoder               Type-1         Type-2                 Type-2             Type-2
modulation            BPSK           BPSK                   BPSK/PAM           BPSK
channel               AWGN           AWGN                   AWGN               AWGN
interleaver           helical        block prime            N/A                N/A
block length (bit)    216            1024                   4828               N/A
iteration times       5              3/8                    10                 5/8
normalisation         Yes            N/A                    N/A                N/A
wrapping/saturation   N/A            saturation             saturation         N/A
Look-Up-Table         16 elements    2 elements             22 elements        7 elements

authors               H. Michel [42]   T. K. Blankenship [46]   A. Worm [45]     R. Hoshyar [49]
encoder               Type-2           Type-3                   Type-2           Type-2
modulation            BPSK             BPSK                     BPSK             BPSK
channel               AWGN/Rayleigh    AWGN/Rayleigh            AWGN/Rayleigh    AWGN/Rayleigh
interleaver           N/A              N/A                      3GPP compliant   ideal
block length (bit)    600              640                      600              2896
iteration times       5/10             N/A                      5/7/10           7/18 half
normalisation         N/A              N/A                      Yes              Yes
wrapping/saturation   N/A              N/A                      saturation       N/A
Look-Up-Table         7/10 elements    2/4/8 elements           N/A              N/A

Table 3.1: Simulation configurations of the eight previous works
used in a fixed-point implementation can be less than the theoretical bounds predicted by mathematical analysis. As a result, the data-width decisions for a decoding algorithm cannot be made on the basis of mathematical analysis alone. Traditional BER/FER chart simulation is time consuming. Sometimes different types of variables in a decoding scheme have different optimal data-widths. This leads to a large number of combinations that have to be tested when simulation is used to find the optimal settings. If the effects of the different techniques are also considered, the amount of simulation required for drawing the BER/FER charts becomes unacceptable. In addition, BER/FER chart analysis does not give an insight into the iterative decoding convergence process. Hence, to fully investigate the optimal data width in a fixed-point implementation of a decoding algorithm, we propose a method based on EXIT chart [41] analysis to determine the optimal fixed-point specification of a Turbo-like decoder in practical implementations. Our method is less time consuming than previous works that use BER/FER charts for the same analysis. Moreover, our results showed that the EXIT chart provides more useful information than the BER/FER chart when determining the optimal fixed-point specification. Instead of only giving the performance result, the EXIT chart shows the convergence behaviour of the decoder, so that the reasons for the performance degradation caused by an insufficient bit-width can be analysed. Hence, a suitable technique for preventing the degradation can be introduced to further optimise the system. To present our method, we investigate the 3GPP UMTS Turbo decoder [39]; the optimal data width specifications for its fixed-point implementation are determined and compared with previous works. The method is easy to apply to any Turbo code and potentially to any Turbo-like code with an iterative decoding scheme that can be analysed by an EXIT chart.

As introduced in Chapter 2, EXIT chart analysis is a powerful tool for analysing and optimising the convergence behaviour of iterative systems, such as Turbo-like decoders. Unlike BER/FER simulation, it is less time consuming, since the simulation of the interleaver in the decoder and of the actual iterative decoding process is not required. Although the effect of a sub-optimal interleaver on the performance cannot be revealed, an interleaver only changes the order of the information sequence, and no information is lost during this process as a result of the fixed-point implementation. Since our purpose is to investigate the performance degradation caused by the fixed-point implementation, leaving out the interleaver does not affect the result of our method. Moreover, a BER plot can only give the performance for a particular number of iterations in the decoding process, while an EXIT chart traces the convergence behaviour of the decoder, allowing an arbitrary number of iterations to be considered. Our results show that, based on the analysis of the EXIT chart, not only can the performance be investigated, but the reasons for the performance degradation caused by the fixed-point implementation can also be identified. Hence, a suitable combination of techniques can be chosen to improve the performance. One drawback is that an EXIT chart only considers a fixed Signal-to-Noise Ratio (SNR), while a BER/FER chart considers a wide range of SNR or $E_b/N_0$ values.
However, since the EXIT simulation is a lot less time consuming than BER/FER simulation, it is possible to draw different EXIT charts for different SNRs if necessary. To sum up, the EXIT chart is more suitable than the BER/FER chart for finding the optimal data-width setting for a decoding scheme implemented in fixed-point. Also, we have tested an SNR where the tunnel is narrow and the performance is most sensitive to the limitations of the fixed-point representation. The idea of using EXIT charts to analyse the impact of finite precision arithmetic on Turbo codes was first introduced in [56]. However, no convincing analysis procedure or conclusion was given. In this chapter, we introduce for the first time a detailed analysis method that uses EXIT charts to determine the optimal data width specification for fixed-point implementations of Turbo-like decoders, by giving a full investigation of the UMTS Turbo decoder [39]. In Section 3.2, to demonstrate our method, we use it to select the optimal data-width specification for the UMTS Turbo decoder, with a comprehensive consideration of fixed-point implementation techniques. The conclusion is then compared with previous works. In the last section, our conclusion is given.
3.2 Fixed-point EXIT chart analysis of UMTS Turbo Decoder
To present our method, we use EXIT chart analysis to investigate the xed-point eects of the UMTS Turbo decoder implemented by Log-BCJR with Jacobian logarithm. The specication and structure of the UMTS encoder and decoder are presented in Chapter 2. In our simulation, BPSK modulation is assumed, with an AWGN channel. We rst use a SNR=-4dB noise level for simulation, which the EXIT chart of the UMTS Turbo code has a moderately open tunnel, so the degradation of the performance can be easy to observe. We also chose an SNR=-4.83 dB where the tunnel is almost closed (i.e. the onset of the Turbo cli) and the performance is most sensitive to the xed point representations limitations to validate our optimal data width specication. Random bit sequences are given on the input of the Turbo encoder. We use 453-bit frame length (i.e. interleaver length), which is the geometric mean of the minimum and maximum block length of the UMTS standard. Since the performance degradation of Turbo codes proportion with the block length in logarithm domain, we use it to investigate the optimal data-width specication. The shortest (40-bit) and longest (5114-bit) frame lengths in UMTS Standard is then simulated under the optimal specication for investigate the performance eect of the frame length. In addition, we gathered dierent conclusions from eight previous works [4247, 49, 50] as a comparison to show the validity of our work. Firstly, the eects on the EXIT chart by using three BCJR algorithms are simulated in oating-point representation. The three algorithms are Log-BCJR using exact calculation of the Jocobian logarithm, Log-BCJR using 8 elements look-up-table (LUT) Jacobian logarithm and Max-Log-BCJR [57]. It has been proved that the performance loss by using Jocobian logarithm is less than 0.1dB relative to the exact log calculation, which is usually considered acceptable [54]. The performance degradation of MaxLog-BCJR UMTS Turbo decoder is also well explored. 
According to [58], the Eb/N0 performance degradation at a BER of 10^-5 between Log-BCJR and Max-Log-BCJR is 0.3 dB for a 640-bit block length and 0.54 dB for a 5114-bit block length in an AWGN channel, and worse in a Rayleigh fading channel, which is considered significant (not acceptable). In our analysis, we aim to obtain a fixed-point EXIT chart as close to the floating-point Log-BCJR result as possible, and we consider degradation similar to that of the floating-point Max-Log-BCJR result unacceptable. Secondly, we investigate the effects of a limited fraction part in the fixed-point representation. Since limiting the fraction length also limits the number of elements in the LUT, the number of LUT elements is also considered. Thirdly, we investigate the effects of a limited integer part in the fixed-point representation. The three overflow control approaches discussed before are all investigated.
Finally, based on the analysis, the optimal combination of fraction length and integer length is investigated under different block lengths. The effect of termination techniques is also investigated here. The conclusions are given and compared with previous works.
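The three logarithm methods compared in this analysis can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the LUT step of 0.5 and the choice of sampling the correction term at the centre of each bin are hypothetical and are not necessarily the configuration used in the simulations.

```python
import math

def jac_exact(a, b):
    """Exact Jacobian logarithm: ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def make_lut(n_elements=8, step=0.5):
    """Hypothetical LUT: correction term sampled at the centre of each |a-b| bin."""
    return [math.log1p(math.exp(-(i + 0.5) * step)) for i in range(n_elements)]

def jac_lut(a, b, lut, step=0.5):
    """Approximate Jacobian logarithm: max plus a table-based correction."""
    idx = int(abs(a - b) / step)
    # Beyond the last table entry the correction is treated as zero.
    correction = lut[idx] if idx < len(lut) else 0.0
    return max(a, b) + correction

def jac_maxlog(a, b):
    """Max-Log approximation: the correction term is dropped entirely."""
    return max(a, b)
```

With these assumed parameters, the LUT version stays close to the exact value, while the Max-Log version can be off by up to ln 2 when the two arguments are equal.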
3.3 Simulation and Analysis Results

3.3.1 Comparison between different Logarithm methods
Figure 3.5 gives the EXIT charts of the UMTS Turbo decoder using the three log algorithms mentioned before. The tunnel between the two curves is narrower for Max-Log-BCJR because of the information it loses. Therefore, after a given number of decoding iterations, the mutual information is lower than that of the Log-BCJR implementation. In other words, more decoding iterations may be required to reach a given target BER. We can therefore assert that the BER degradation due to information lost by the implementation is reflected in the EXIT chart. Our further simulation results support this conclusion.
Figure 3.5: EXIT chart of different log algorithms (legend: LogBCJR, JacobiLogBCJR, MaxLogBCJR; axes: I(ae)/I(ba) and I(aa)/I(be)).
3.3.2 Comparison and Analysis in Fixed-point simulation
To analyse the effects of fixed-point representation, fixed-point data types are used for all the variables in simulations. Later, we will use a long bit-width, 32-bit, for the fraction
part but a limited bit length for the integer part in order to investigate the degradation caused by the limited dynamic range. First, however, we consider the opposite in order to investigate the performance of limited precision with a sufficient integer bit-width (32-bit). Note that the effect on the LUT in Log-BCJR is also considered here. For an n-bit fraction part, up to 2^n elements are used in the LUT, as described in Section 3.1. The EXIT chart results are shown in Figure 3.6. The simulation results show that a 1-bit fraction part with 2 elements in the LUT gives observable degradation in the EXIT chart, but a 2-bit fraction part with 4 elements in the LUT gives almost no observable degradation. As shown in
Figure 3.6: EXIT chart of different fraction lengths (legend: 0bit fraction, 1bit fraction, 2bit fraction, 3bit fraction, floating point; axes: I(ae)/I(ba) and I(aa)/I(be)).
the figure, the 2-bit fraction length gives no difference compared with the floating-point result. The 1-bit fraction length also gives an EXIT chart very close to the floating-point result, and the degradation is much less than that of the floating-point Max-Log-BCJR result. Note that a 0-bit fraction part effectively removes the LUT, transforming the decoder from the approx-Log-BCJR to the Max-Log-BCJR. However, the EXIT chart degradation is worse than that of the floating-point Max-Log-BCJR because of the low resolution used for the BCJR variables. Considering the trade-off between energy consumption and performance, both 1-bit and 2-bit fraction parts may be sufficient for most applications. The final decision should refer to the later combined simulation results with limited fraction and integer lengths. BER analyses for different fraction lengths are given in [43, 47, 50]. Both [43, 47] concluded that a 2-bit fraction length approaches the performance of the floating-point decoder and could be chosen as the optimal specification. However, [50] stated that a 3-bit fraction length gives better performance, incurring a penalty of only 0.015 dB. The simulation results of [47] showed that a 1-bit fraction length causes a loss of only 0.1 dB at medium-low SNR but has no consequences
on the error floor performance. Of the eight papers we selected, five determined that a 2-bit fraction length is the optimal choice and three chose a 3-bit fraction length. To determine the optimal bit-width of the integer part, many papers investigated the optimal bit-width setting for the different internal variables (the input LLRs and the individual BCJR metrics) separately. However, in practice, it is not convenient to store different variables in memory blocks of different data widths. Using different settings for different variables does not reduce the memory requirement compared with using a single data-width setting. Although [42] claimed that further bit-width minimisation for individual variables can reduce the switching activity, which influences the energy consumption, the contribution of dynamic power to the total power consumption becomes smaller and smaller as process technology scales down. Thus, the benefit of considering the different variables separately is reduced. On the other hand, such a strategy requires additional extension and clipping mechanisms on the data buses in the datapath, which increases the design complexity of the datapath. Therefore, we consider a single data-width setting in our analysis. However, it is valuable and necessary to consider the input LLRs and the internal variables of the SISO decoder separately, because the limit on the input LLRs directly affects the dynamic range of the internal variables, such as α and β. According to [47], the maximum possible difference between pairs of path metrics, $\Delta_{MAX}$ (i.e. the possible difference between the α values or the β values in a single step of the decoding trellis), which is of significant importance in the BCJR decoder as discussed in the last section, is upper-bounded by a function of the dynamic range of the input LLRs:

$\Delta_{MAX} = \min_{w}\left(w M_u + d_{min}(w) M_c\right)$ (3.4)
where $d_{min}(w)$ is the minimum weight of the code sequences generated by input sequences of weight $w$, and $M_u$ and $M_c$ are the dynamic ranges of the two inputs of the SISO decoder, namely the extrinsic information and the LLRs received from the soft demodulator, respectively. Hence, $d_{min}(w)$ depends on the considered code, while $M_u$ and $M_c$ are simply related to the bit-width of the integer part of the input LLRs. As discussed in the previous section, it is important to preserve the difference between pairs of metrics to maintain the performance, and different overflow control techniques require different data widths to guarantee this condition. Therefore, based on the discussion in [47], to avoid an insufficient data-width specification, the internal variables typically require a couple more bits than the input LLRs. As discussed, with different bit-width settings for different variables, the conversion between data of different lengths needs to be carefully managed. Converting shorter data to longer data does not cause any problem, since the values remain unaltered. An extension mechanism that simply fills the extra most significant bits (with copies of the sign bit in two's complement representation) solves the problem. It is easy to realise in hardware and requires no extra operation in our simulation. However, the conversion in the other direction may
cause information loss. Moreover, the most significant bit in two's complement representation determines the sign of the value, which means that simply discarding the extra bits during the conversion may not only reduce the magnitude but also change the sign of the value. This would significantly affect the correctness of the decoding process. Hence, a clipping mechanism with saturation is required during the conversion. If the original data value is over
Figure 3.7: Scheme of UMTS Turbo decoder (blocks: Decoder 1, Decoder 2 and clip operations).
the limit of the target data width, the converted data must be set to the limit value. Thus, the information loss is minimised. Such a method requires extra hardware in practice and extra operations in our simulation. Figure 3.7 shows the clipping operations that we use in the decoding scheme to simulate the data conversion between different data widths.
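The difference between simply discarding the extra high bits and clipping with saturation can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and fixed-point values are modelled as plain integers (i.e. already scaled by the fraction length).

```python
def truncate_twos_complement(value, bits):
    """Keep only the lowest `bits` bits and reinterpret them as a signed value."""
    mask = (1 << bits) - 1
    low = value & mask
    # If the new most significant bit is set, the result is negative.
    return low - (1 << bits) if low >= (1 << (bits - 1)) else low

def clip_saturate(value, bits):
    """Clip to the signed range of a `bits`-bit two's complement word."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))
```

For a 4-bit target width, the value 9 becomes -7 when the extra bits are discarded (the sign flips), whereas saturation returns 7, the closest representable value.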
3.3.2.1 Wrapping Technique
We investigated the optimal bit-width settings for the integer part under different overflow control techniques by using EXIT chart analysis. As discussed before, two's complement representation can naturally avoid the effect of overflow in the BCJR algorithm, since overflowed data can be considered as wrapping around a circle, so the distance between two values is preserved. The benefit of this wrapping technique is that no extra operation or hardware is required. Therefore, it is suitable for cases where memory is sufficient or a simple datapath is required. Note that saturation is still required for the input LLRs, since they have a shorter bit-width than the internal variables and the external input, as discussed before. The wrapping technique is only suitable for the internal variables. We use the notation (LLR:X,VAR:Y) to describe the integer length setting, where X is the integer length of the input LLRs of a BCJR decoder and Y is the integer length of all the other internal variables. Figure 3.8 shows the EXIT chart results of settings (LLR:5,VAR:7),
(LLR:4,VAR:7) and (LLR:3,VAR:7). The simulation results showed that with setting
Figure 3.8: EXIT chart of different integer lengths with wrapping technique - 1 (legend: 5bit LLRs/7bit VARs, 4bit LLRs/7bit VARs, 3bit LLRs/7bit VARs).
(LLR:5,VAR:7), there is almost no degradation in the EXIT chart compared with the floating-point result. It is obvious that the EXIT chart of setting (LLR:3,VAR:7) fails to create a tunnel to the (1,1) point, which means that the BER performance of the decoding result would be significantly degraded. The settings (LLR:5,VAR:7) and (LLR:4,VAR:7) also give different results in the EXIT chart analysis, as shown in Figure 3.9, which is a zoomed-in version of Figure 3.8. For setting (LLR:5,VAR:7), each curve of the EXIT function Ie(Ia)
Figure 3.9 (zoomed-in EXIT chart; legend: 5bit LLRs/7bit VARs, 4bit LLRs/7bit VARs, 3bit LLRs/7bit VARs; both axes from 0.92 to 0.99).
reaches a peak value at a certain Ia and then starts decreasing, which closes the tunnel. Although the closing point is very near the (1,1) point, this means that the best possible decoding result of (LLR:5,VAR:7) cannot match that of (LLR:4,VAR:7). Correspondingly, in a BER simulation, there will be a degradation in BER from (LLR:4,VAR:7) to (LLR:5,VAR:7). Note that the tunnels of (LLR:5,VAR:7) and (LLR:3,VAR:7) close in different ways, which indicates that the closures have different causes. For (LLR:3,VAR:7), the closure is caused by a slower rise of the curves Ie(Ia). Since the only difference between (LLR:3,VAR:7) and (LLR:4,VAR:7) is that the input LLRs are 1 bit shorter, it can be conjectured that the lower Ie(Ia) is due to the information lost in the LLRs by the decreased bit-width. Thus, a 4-bit integer length is the minimum sufficient bit-width setting for the input LLRs. For (LLR:5,VAR:7), the closure is due to the decline of the curves Ie(Ia) after their peaks. The result of (LLR:4,VAR:7) shows that a 7-bit setting is sufficient for maintaining the valid information in all the variables, so the performance degradation of (LLR:5,VAR:7) is due to the insufficient bit-width difference between the input LLRs and the internal variables. As the number of iterations increases, the mutual information in the a priori LLRs increases, which means that the average absolute value of the LLRs increases. Because of the accumulated additions of the input LLRs in the BCJR algorithm, an insufficient difference in bit-width between the input LLRs and the internal variables may cause a serious overflow problem in the calculation of the internal variables. Therefore, at the end of the EXIT chart, the function Ie(Ia) starts decreasing: the overflow of the internal variables exceeds the tolerance limit of the wrapping technique, which causes the EXIT chart to fail to reach the (1,1) point. This effect can be seen more clearly in Figure 3.10 and Figure 3.11. In Figure 3.10, for the results of (LLR:5,VAR:7) and (LLR:6,VAR:7), the peak points of the curves occur earlier because of the even smaller bit-width difference between the input LLRs and the internal variables. As the difference increases, the performance reaches its best point at (LLR:4,VAR:7). On the other hand, in Figure 3.11, when the integer length of the input LLRs becomes shorter than 4 bits, the performance gets worse again. However, since the difference is sufficient, no decline occurs in the curves of (LLR:3,VAR:7) and (LLR:2,VAR:7). Only the rise of the curves is slower, owing to the insufficient bit-width of the input LLRs, which causes the closure of the tunnel. If we further reduce the integer bit-width of the internal variables to 6 bits, the tunnel in the EXIT chart always closes before the (1,1) point, irrespective of the integer bit-width of the LLRs. In conclusion, a 4-bit integer width for the input LLRs is the minimum acceptable setting for UMTS Turbo decoders. With the wrapping technique, the minimum difference between the input LLRs and the internal variables is 3 bits. Therefore, the optimal integer length setting is (LLR:4,VAR:7).
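The tolerance limit of the wrapping technique can be illustrated with a short Python sketch. It is a hedged illustration rather than the simulation code: the function names are hypothetical, and a 7-bit internal variable is modelled as a 7-bit signed integer, which is an assumption about how the integer length counts the sign bit.

```python
def wrap(value, bits):
    """Two's complement wrap-around to a signed `bits`-bit integer."""
    half = 1 << (bits - 1)
    return (value + half) % (1 << bits) - half

def wrapped_difference(a, b, bits):
    """Difference of two metrics when every value and the result wrap around."""
    return wrap(wrap(a, bits) - wrap(b, bits), bits)
```

With 7 bits (range -64 to 63), the metrics 70 and 60 both overflow or approach the limit, yet their wrapped difference is still 10. For 100 and 20, the true difference of 80 exceeds the representable range and the wrapped result is -48, so the information is lost.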
Figure 3.10 (zoomed-in EXIT chart; legend: 6bit LLRs/7bit VARs, 5bit LLRs/7bit VARs, 4bit LLRs/7bit VARs; axes: I(ae)/I(ba) and I(aa)/I(be)).
Figure 3.11 (EXIT chart; legend: 4bit LLRs/7bit VARs, 3bit LLRs/7bit VARs, 2bit LLRs/7bit VARs; axes: I(ae)/I(ba) and I(aa)/I(be)).
3.3.2.2 Saturation Technique
The wrapping technique is effectively a "do nothing" technique: no additional operation or hardware is used to deal with overflow in the internal variables. Another simple overflow control technique is saturation. As mentioned before, the input LLRs are forced to saturate because their data width is shorter than that of the internal variables in our specification. The same technique can also be applied to the internal variables.
The problem is that the BCJR algorithm relies on the distances between the metrics, which are the internal variables, as described before. The saturation technique limits all overflowed values to the maximum or minimum value, which changes the differences between the variables. The differences between overflowed values become 0, and the differences between overflowed and non-overflowed values are also decreased. The simulation results show that this problem makes the results under the saturation technique even worse than those using the wrapping technique. However, when subtractive normalisation, also referred to as the rescaling normalisation technique, is applied, saturation is a necessary condition for obtaining the benefit of normalisation [54]. Our further simulation results show that normalisation without saturation cannot reduce the data-width requirement below that of the wrapping technique. Figure 3.12 gives the simulation results of using the saturation technique. Since the conditions for the input LLRs are unchanged, their minimum integer width remains 4 bits. However, the required bit-width difference between the input LLRs and the internal variables is significantly increased by the application of the saturation technique.
Figure 3.12: EXIT chart of different integer lengths with saturation technique (legend: (LLR:4,VAR:13), (LLR:4,VAR:12), (LLR:4,VAR:11)).
Although it can be observed that for setting (LLR:4,VAR:12) the tunnel closes before the (1,1) point, as shown in the figure, the closing point is very close to the (1,1) point and the EXIT curves show almost no difference from the floating-point result. Hence, (LLR:4,VAR:12) is the optimal integer bit-width setting for the saturation technique. In the result of (LLR:4,VAR:11), the tunnel closes far before the (1,1) point. Note that, unlike the results of the wrapping technique, when the integer length of the internal variables is reduced to 11 bits, the function Ie(Ia) falls to near-zero values very soon after the peak point. As discussed before, the reason that the EXIT chart curves (i.e. the function Ie(Ia)) start
decreasing is the insufficient difference in integer length between the input LLRs and the internal variables. Since the EXIT chart of (LLR:4,VAR:13) can reach the (1,1) point, a 4-bit integer length for the input LLRs is still sufficient under the saturation technique. Hence, the saturation technique increases the required integer length difference between the input LLRs and the internal variables. As mentioned before, the decoding result depends only on the differences between path metrics (i.e. internal variables). The saturation technique fixes overflowed variables at the positive and negative limits. It can be speculated that when overflowed internal variables are fixed at the limit values, the differences between path metrics become 0, so a reliable soft output cannot be obtained. When a certain number of the internal variables overflow, the EXIT chart curves fall to 0 very quickly, as shown in the result of (LLR:4,VAR:11). Therefore, the saturation technique on its own is not suitable for convolutional codes. However, it is a precondition for applying the normalisation technique. Our simulation results with the normalisation technique show that it is important to combine the saturation and normalisation techniques to obtain the optimal bit-width specification.
3.3.2.3 Normalisation Technique
The limitation of the wrapping technique is that if the difference between the path metrics exceeds the dynamic range, the subtraction no longer gives the correct result. The purpose of the saturation technique is to fix this problem. However, as our saturation simulation results showed, it introduces another problem, in which many overflowed variables are fixed at the same value. The normalisation technique is introduced to deal with this problem. Our simulation results show that, with the combination of saturation and normalisation, the required integer bit-width of the internal variables can be further reduced. In our simulation, for each group of the increasing variables α and β, the largest value in the group is subtracted from all of them in each step. The EXIT chart results are shown in Figure 3.13. The optimal integer length setting is (LLR:4,VAR:5), which requires two fewer bits for the internal variables than the wrapping technique.
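The interaction between saturation and normalisation can be sketched with a deliberately simplified Python model. The sketch below is an assumption-laden illustration, not the BCJR recursion itself: it merely accumulates one hypothetical increment per state in each step (no trellis and no max* operation), and all names and numbers are illustrative.

```python
def saturate(value, bits):
    """Clip to the signed range of a `bits`-bit two's complement word."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def accumulate(increments, bits, normalise):
    """Accumulate per-state metrics step by step with saturation.

    increments: list of steps, each a list holding one increment per state.
    normalise:  if True, subtract the largest metric from all metrics
                after every step (subtractive normalisation).
    """
    metrics = [0] * len(increments[0])
    for step in increments:
        metrics = [saturate(m + inc, bits) for m, inc in zip(metrics, step)]
        if normalise:
            largest = max(metrics)
            metrics = [saturate(m - largest, bits) for m in metrics]
    return metrics

# Hypothetical example: three states, five identical steps, 5-bit metrics.
steps = [[6, 4, 1]] * 5
saturation_only = accumulate(steps, 5, normalise=False)    # [15, 15, 5]
with_normalisation = accumulate(steps, 5, normalise=True)  # [0, -10, -16]
```

In this toy example, saturation alone pins the two largest metrics at the same limit, so their true difference of 10 is lost. With normalisation, that difference is preserved and only the smallest metric saturates.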
3.3.2.4 Final validation
To finally determine and validate the optimal data width specification for the fixed-point implementation of the UMTS Turbo code, we investigate the EXIT chart performance under the combination of limited integer and fraction lengths. Since the simulation results of Figure 3.6, which limited only the fixed-point fraction length, were not sufficient to determine the optimal fraction length, we consider both 1-bit and 2-bit fraction lengths in our final validation. We combine the fraction length settings with the optimal integer length settings for the wrapping technique (LLR:4,VAR:7) and for the normalisation technique (LLR:4,VAR:5). We use the notation
Figure 3.13: EXIT chart of different integer lengths with normalisation technique (legend: (LLR:4,VAR:7), (LLR:4,VAR:6), (LLR:4,VAR:5), (LLR:4,VAR:4)).
(LLR:X,VAR:Y,FRC:Z) to represent the settings in our simulation results, where Z is the length of the fraction part. Moreover, we simulate each setting in different situations, namely the longest block length (5114-bit), the shortest block length (40-bit) and the most performance-sensitive SNR (SNR = -4.83 dB), where the tunnel in the EXIT chart is only just open. For the normalisation technique, the final validation is shown in Figure 3.14 for the longest block length, Figure 3.15 for the shortest block length and Figure 3.16 for the most sensitive SNR of -4.83 dB. According to the results, setting (LLR:4,VAR:5,FRC:2) gives almost the same performance as the floating-point results in all situations, while setting (LLR:4,VAR:5,FRC:1) gives further degradation due to the combined effects of the limited integer and fraction lengths, although the degradation is not as bad as that of the Max-Log-BCJR. Moreover, according to the simulation results, the block length does not have a significant effect on the EXIT chart. Consequently, a single optimal specification can work for any block length. Therefore, for the normalisation technique, we conclude that (LLR:4,VAR:5,FRC:2) is the optimal specification of the UMTS Turbo decoder. For the wrapping technique, the final validation is shown in Figure 3.17 for the longest block length, Figure 3.18 for the shortest block length and Figure 3.19 for the most sensitive SNR of -4.83 dB. The results in the figures clearly show that a 2-bit fraction part is the best option in our case. Hence, for the wrapping technique, (LLR:4,VAR:7,FRC:2) is the optimal specification of the UMTS Turbo decoder.
Figure 3.14: Simulation results of 5114-bit block length in fixed-point with normalisation and floating-point (legend: Fixed-point (LLR:4,VAR:5,FRC:1), Fixed-point (LLR:4,VAR:5,FRC:2), Floating-point).
Figure 3.15: Simulation results of 40-bit block length in fixed-point with normalisation and floating-point (legend: Fixed-point (LLR:4,VAR:5,FRC:1), Fixed-point (LLR:4,VAR:5,FRC:2), Floating-point).
To compare our results with previous works: for the input LLRs, [42, 47, 49] claimed that a 3-bit integer length is sufficient. However, they considered only the input LLRs received from the channel. In EXIT chart simulation, the input LLRs considered are the a priori inputs of the concatenated decoders, which include the channel input and the extrinsic information from the other decoder. Since both are inputs of the concatenated decoders, it is more reasonable to assume that they have the same bit-width. Hence, our results show that a 4-bit integer length is the optimal setting for the input LLRs. [46] reached the same conclusion about the integer bit-width of the input
Figure 3.16: Simulation results of SNR=-4.83dB/453-bit block length in fixed-point with normalisation and floating-point (legend: Fixed-point (LLR:4,VAR:5,FRC:1), Fixed-point (LLR:4,VAR:5,FRC:2), Floating-point).
Figure 3.17: Simulation results of 5114-bit block length in fixed-point with wrapping technique and floating-point.
LLRs. The other papers mentioned before did not consider the input LLRs separately. For the internal variables, [44, 46] considered all the different internal variables (including beta and delta) separately; the longest variables require 8 bits for the integer part. [50] concluded that 7 bits is the optimal setting, and the conclusion of [49] is 6 bits. [42, 43] reached the same conclusion as our result, namely that 5 bits is the optimal setting. The reason for so many different conclusions is the different circumstances used in the simulations,
Figure 3.18: Simulation results of 40-bit block length in fixed-point with wrapping technique and floating-point.
Figure 3.19: Simulation results of SNR=-4.83dB/453-bit block length in fixed-point with wrapping technique and floating-point.
as discussed before. For example, [49] used normalisation in their simulation, but only at a few particular steps in the decoding process, whereas we use it in each step. Hence, it concluded that the internal variables require one more bit than in our conclusion. By using EXIT charts, we showed that, with proper overflow control techniques, the optimal bit-width specification for the UMTS Turbo decoder is (integer: 4-bit, fraction: 2-bit) for the input LLRs and (integer: 5-bit, fraction: 2-bit) for the internal variables.
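As an illustration of what such a specification means for a single value, the following Python sketch quantises a real number onto a fixed-point grid with saturation. It is a minimal sketch under stated assumptions: the function name is hypothetical, rounding to the nearest grid point is assumed, and the integer length is assumed to include the sign bit.

```python
def quantise(x, int_bits, frac_bits):
    """Round x to a fixed-point grid and saturate at the limits.

    Assumption of this sketch: `int_bits` includes the sign bit, so the
    representable range is [-2^(int_bits-1), 2^(int_bits-1) - 2^-frac_bits].
    """
    scale = 1 << frac_bits
    lo = -(1 << (int_bits - 1)) * scale
    hi = (1 << (int_bits - 1)) * scale - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale
```

Under these assumptions, an input LLR of 3.3 with (integer: 4-bit, fraction: 2-bit) is stored as 3.25, and any value above the range saturates at 7.75.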
In conclusion, we introduced a method, based on EXIT chart analysis, to determine the optimal data width specification for implementing a Turbo code in a low-power fixed-point system. By applying the method to the UMTS Turbo code, we demonstrated its advantage over the conventional method based on BER chart analysis. The different techniques for reducing the data width requirement of fixed-point Turbo code implementations were also discussed.
Chapter 4
Energy Estimation Decoding Algorithm

In this chapter, a framework to estimate the energy consumption of an encoder/decoder at the algorithmic level is proposed.
4.1 Introduction
There are different aspects from which to evaluate a system's power/energy consumption. For instance, the average power is directly related to chip heating and temperature issues, and the worst-case instantaneous power affects the voltage drop problem [59]. In low-power WSN applications, such as Body Area Networks, a long lifetime is the most important motivation for applying low-power techniques. Since many advanced techniques, such as clock gating and power gating, can help to increase the lifetime of a system without changing the average power while the system is fully operating, estimating the power consumption of a design at an early design stage is not sufficient to investigate its potential lifetime. Hence, energy consumption estimation is more suitable for this purpose. Indeed, the latest works on long-lifetime WSN design focus more on energy-consumption-based design [60–62]. Power/energy estimation is required at all levels of abstraction in the design flow, with different purposes [63]. In later stages, such as the gate level or transistor level, very accurate estimates can be given, since most of the implementation information is available. On the other hand, most of the design effort has already been invested by this stage, so little power reduction can be achieved after the estimation. The purposes of power/energy estimation at these stages are only to fine-tune the design and verify that the power constraints have been met. Therefore, to design an extremely low-power system, power/energy estimation is more important at the early stages. By being aware
of the forecast energy consumption during the early design stages, more energy reduction can be achieved. However, in very early design stages, such as algorithm-level design, most knowledge of the physical parameters affecting the energy consumption is not available, which makes energy estimation at this stage very difficult. Hence, in the traditional design flow, communication engineers estimate the computational complexity of an algorithm, instead of its energy, to evaluate an algorithm design. The complexity indicates the computing resources that an algorithm needs. However, in the case of a Turbo decoder, for example, [64] demonstrated that memory access rather than computational complexity is the most critical part of the decoder in terms of energy consumption. The same assertion can be made for many other types of system design [65, 66]. There are also other components in the implementation of a system, such as the datapath selection logic, the internal registers and the controller. Again, their contribution to the energy consumption cannot be predicted from the computational complexity of an algorithm. Therefore, a lower-complexity algorithm cannot guarantee a lower-energy implementation. For low-power design, energy estimation provides more information than complexity estimation. As discussed in Chapter 1, in short-range, low-power WSNs such as BANs, owing to the low transmission power, the energy consumption in the physical layer, especially in the channel coding scheme, could make a significant contribution to the energy consumption of the whole system. In this chapter, we propose a framework for estimating the energy consumption of a channel coding system at a very early stage, namely the algorithm level. As we focus on the energy consumption rather than the average power consumption of a channel coding system, many decisions are required in algorithm-level design that can affect the energy consumption of the final implementation. 
Moreover, after this stage, the basic scheme is fixed, and the potential reduction in power consumption is then limited. Therefore, the decisions made in algorithm-level design are very important to a low-power design. Our framework aims to rank various coding scheme design options and thus helps in selecting the one that is potentially most effective from the energy point of view. Since encoding algorithms are typically of low complexity and energy consumption, we are particularly interested in the energy consumption of decoding algorithms, which is typically much higher. The framework is suitable for all turbo-like algorithms and even for other types of algorithms in channel coding systems, such as equalisation, interference cancellation and MIMO (Multi-Input and Multi-Output) detection. Knowledge of the hardware design in later design stages is not required when applying the framework. There are two classic approaches to implementing a coding scheme, namely DSP (Digital Signal Processing) implementation and ASIC (Application-Specific Integrated Circuit) implementation. A DSP system is based on a general-purpose processor with an instruction set; thus, the algorithm is realised by an assembly language program. An ASIC system, on the other hand, is designed specifically for a particular application. Therefore, the hardware design can be highly optimised for the algorithm being applied. DSP
implementation is widely used in traditional WSN applications because of its general applicability. However, compared with ASIC implementation, the long execution time and low hardware usage efficiency of a DSP implementation are not suitable for low-power systems. Moreover, the lower bound on the energy consumption of a coding scheme is an important issue that needs to be considered in physical layer design. Therefore, our framework aims to estimate the possible energy consumption of an algorithm in ASIC implementations.
4.2 Previous works
In previous works, power/energy estimation at an early design stage, referred to as high-level power/energy estimation, can be divided into two categories. One is based on DSP or FPGA implementations [60, 67, 68]. More specifically, in order to simplify the problem, these methods assume the fixed architectural templates that are offered by DSPs and FPGAs. The benefit of such methods is that it is easy to make the approaches suitable for a wide range of algorithms. However, DSP or FPGA implementations are not suitable for extremely low-power applications, since the unique characteristics of an algorithm cannot be exploited in these architectures. Such characteristics may be exploited in a hardware implementation and are very important for low-power design, which is why the lower-bound energy consumption is of particular interest to communication engineers. To investigate the distinct characteristics of an algorithm, energy estimation of possible ASIC implementations is required. This is the other category of high-level power/energy estimation, mostly referred to as behavioural-level power/energy estimation, which is based on executable behavioural descriptions [63]. The algorithm level, which usually refers to a clear mathematical description of an algorithm, is a more general concept than the behavioural level: the term "algorithm level" is widely used in the communications area, whereas "behavioural level" is a concept from the hardware design area, meaning an executable program or a clear flow-chart description with detailed operation requirements. However, the two have no clear distinction from the power/energy estimation point of view. Both refer to a clear mathematical description of an algorithm and lack knowledge of the architecture of the implementation. 
One type of behavioural level power estimation method is the activity-based model, which typically assumes some architectural style or template and produces physical capacitance and switching activity estimations of the resources based on it [69]. The dynamic power is then expressed as:
$$P = \sum_{r \in \{\text{all resources}\}} f_r C_r V_{dd}^2 \qquad (4.1)$$
where $f_r$ is the access frequency of resource $r$, which is produced by activity prediction, $C_r$ is the switched capacitance of $r$ and $V_{dd}$ is the supply voltage [69]. The equation takes a number of equivalent forms in different methods, but they are all based on (4.1). Typically, such methods [70–72] consider only dynamic power. However, as IC process technology enters deep submicron sizes, the subthreshold leakage current increases exponentially, which makes the leakage power of CMOS circuits non-negligible. Moreover, switching activity estimation for sequential circuits is difficult and time consuming, which makes this type of method difficult to use at a practical algorithm design stage.

An alternative approach is offered by the complexity-based model, which considers the power/energy consumption of a system to be the sum of the power/energy consumption of its different entities. In [73], the power consumption of cryptographic algorithms is estimated based on how many different components (registers, adders, etc.) are used and what type of memory is chosen. In [74], the energy consumption of a digital CMOS circuit is expressed as:
$$E = \alpha N_{gates} E_{gate} \qquad (4.2)$$
where $\alpha$ is the circuit activity, $E_{gate}$ is the energy consumption per switching gate of a reference cell (e.g. a 2-input NAND gate) in a particular technology and $N_{gates}$ is the approximate equivalent gate count of the design in terms of the reference cell. These parameters can be obtained from the specification parameters of the technology or by simulation. Hence, all three types of power components (datapath, memory and controller) in a CMOS circuit are considered automatically. The drawback of these methods is that the activity of the circuit is only roughly estimated, using a single parameter. The framework we propose overcomes this drawback by considering the different components separately.

The other challenge of high-level power/energy estimation is that, unlike a DSP or FPGA implementation, an ASIC design can be optimised for the specific algorithm. As a result, the possible implementation can be difficult to predict at the algorithm level. Some previous works obtained the specifics of the hardware implementation by using high-level synthesis tools [73, 74]. Others transform the behavioural description into a more detailed description, which may include boolean functions, truth tables or circuit designs [70, 72]. These approaches require knowledge of hardware design and synthesis processes. Furthermore, the required programming and simulation processes are time consuming, which is not desirable at an algorithm level design stage. To overcome this challenge, our framework relies on the designer to specify the algorithm partitioning and resource constraints, but avoids an actual hardware design process. In this way, the framework not only estimates the potential energy consumption of an algorithm but also provides feedback on the quality of a design strategy.
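As a rough illustration of the two model families reviewed in this section, the following Python sketch evaluates the activity-based expression (4.1) and the complexity-based expression (4.2). The function names, the pair layout used for the resources and the numerical values in the usage example are assumptions made for illustration only; they are not taken from [69] or [74].

```python
def dynamic_power_w(resources, vdd):
    """Activity-based model, Equation (4.1).

    resources: iterable of (f_r, C_r) pairs, with the access frequency f_r
    in Hz and the switched capacitance C_r in farads (assumed layout).
    vdd: supply voltage in volts. Returns the dynamic power in watts.
    """
    return sum(f_r * c_r * vdd ** 2 for f_r, c_r in resources)


def gate_count_energy(activity, n_gates, e_gate):
    """Complexity-based model, Equation (4.2): activity * N_gates * E_gate.

    The result is in the same energy unit as e_gate.
    """
    return activity * n_gates * e_gate


if __name__ == "__main__":
    # Illustrative placeholder values, not measurements.
    example_resources = [(1e6, 1e-12), (2e6, 0.5e-12)]
    print(dynamic_power_w(example_resources, vdd=1.2))
    print(gate_count_energy(activity=0.5, n_gates=1000, e_gate=2.0))
```

In this sketch, (4.1) needs one activity figure per resource, whereas (4.2) collapses the whole circuit into a single activity parameter, which is the coarse approximation identified above as the drawback of the complexity-based model.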
4.3 A framework for quantifying the energy consumption of a Turbo-like decoder
In this section, we propose a framework that allows us to compare and estimate the energy consumption of a Turbo-like decoder design at the algorithm level. Traditional low power system design methods can only guide algorithm design using computational complexity analysis. Our framework provides the opportunity to feed the energy estimation result back into the algorithm choice or design steps. Since the purpose of our work is the comparison and evaluation of different Turbo-like code algorithms, we develop two levels of the framework.

For the comparison of different algorithms, it is not necessary to estimate every possible contribution to the energy consumption of their implementations, since this could be time consuming at the algorithm level. Therefore, level 1 of our framework aims to provide a quick method that allows communication engineers to compare different algorithms from the energy consumption point of view with little extra effort. We consider only the two main parts of the energy consumed when an algorithm is implemented in hardware: the energy consumed by all the operations in the algorithm and the energy consumed by the memory requirement of the algorithm. In our case, for a Turbo-like code, all the operations in the algorithm are ACS operations. We select these two parts for the level 1 framework because only they are directly related to the target algorithm. The other parts of the system energy consumption, such as the contributions of the controller and the datapath structure, vary with the design strategy, so only an approximate estimate of them can be given at the algorithm level. In the level 2 framework, however, we aim to give an energy estimate that considers all the possible energy contributions in the system. The level 1 framework is presented in Section 4.3.1. The level 2 framework is still at the future work planning stage and is discussed in Section 4.3.2.
4.3.1 Level 1 of the framework
To consider the energy consumption of the computing operations in the target algorithm, a conventional complexity analysis of the algorithm needs to be performed. We take one convolutional decoder in the UMTS Turbo decoding scheme as an example; the algorithm is introduced in Chapter 2. We categorise all the ACS operations into two types: additions (including subtractions) and max operations. For an $n$-bit decoding frame, the complexity analysis shows that the decoding algorithm includes $97n - 10$ additions and $30n - 20$ max operations.
To consider the energy consumption of each operation in the algorithm, we implemented the addition and the max operation as gate-level designs based on the STMicroelectronics 0.12 µm technology standard cell library. We consider a max operation supported by a 4-element LUT here. The data width of the operation units is 8 bits, which is sufficient for the target convolutional decoding process, according to the results in Chapter 3. We then analysed the energy consumption of each operation using the power analysis tool Synopsys PrimeTime [75]. This procedure strictly follows the standard ASIC design procedure. Under the assumptions that the critical path of the implementation contains no more than 10 adders and that the system clock is lower than 10 MHz, our power analysis shows that the typical energy consumption of an addition operation in our specification is $E_{add} = 0.04591$ pJ. With the same specification, the typical energy consumption of a max operation is $E_{max} = 175$ pJ. Note that a max operation includes more than one comparison and addition operation as well as a 4-element LUT, so it consumes much more energy than an addition operation. Conventional complexity analysis cannot take such differences between the operations into account. Therefore, based on the complexity analysis and our power analysis results, the total energy consumed by the operations in the UMTS decoding algorithm can be calculated as:
$$E_{operations} = E_{add}(97n - 10) + E_{max}(30n - 20) \qquad (4.3)$$
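Equation (4.3) is simple enough to check numerically. The following Python sketch, with function and constant names chosen here purely for illustration, combines the operation counts from the complexity analysis with the per-operation energies quoted above and evaluates the result for the two frame lengths considered in this section.

```python
# Per-operation energies quoted in the text, in pJ (8-bit operation units).
E_ADD_PJ = 0.04591
E_MAX_PJ = 175.0


def operation_counts(n):
    """Operation counts for one n-bit frame, from the complexity analysis."""
    return {"add": 97 * n - 10, "max": 30 * n - 20}


def operations_energy_pj(n, e_add=E_ADD_PJ, e_max=E_MAX_PJ):
    """Evaluate Equation (4.3) for an n-bit frame; the result is in pJ."""
    counts = operation_counts(n)
    return e_add * counts["add"] + e_max * counts["max"]


if __name__ == "__main__":
    for n in (40, 5114):
        print(f"n = {n}: E_operations = {operations_energy_pj(n):.3e} pJ")
```

Passing different values for `e_add` and `e_max` would allow the same counts to be re-evaluated for another cell library or data width, under the assumption that the operation counts themselves do not change.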
Evaluating (4.3) gives $E_{operations} = 5254.4n - 3500$ pJ. Therefore, for a 40-bit frame, the energy consumed by the decoding operations is $2.07 \times 10^5$ pJ. For a 5114-bit frame, it is $2.69 \times 10^7$ pJ.

To consider the energy consumption due to the memory requirement of the algorithm, we first need to determine the total memory requirement, that is, how many variables must be stored during decoding. This requires an analysis of the dependence between the different stages of the algorithm. As introduced in Chapter 2, the decoding algorithm of the convolutional code in the UMTS Turbo decoder includes five stages: the calculation of $\gamma$, $\alpha$, $\beta$, $\delta$ and ye. The dependence between the stages is shown in Figure 4.1. As shown in the figure, the $\gamma$ values must be stored for the calculation of $\alpha$, $\beta$ and $\delta$. The $\gamma$, $\alpha$ and $\beta$ values all need to be stored for the calculation of $\delta$. Since the input can be used to calculate $\gamma$ straight away, and $\delta$ can be used to calculate ye and the output straight away, there is no need to store these variables.

[Figure 4.1: The dependence between the different stages.]

The energy consumption of the memory depends mainly on the number of read and write accesses. The number of writes equals the number of variables that need to be stored, since each variable is written into the memory only once. The number of reads equals the number of times the variables are used in the algorithm, since the variables are held only in the memory and every use of a variable therefore requires one read. According to the algorithm, for an $n$-bit frame there are $32n - 40$ $\gamma$ values, $8n - 10$ $\alpha$ values and $8n - 10$ $\beta$ values. Therefore, $48n - 60$ writes are required during decoding. Each $\gamma$ is used once in the $\alpha$ calculation and once in the $\beta$ calculation, and only half of the $\gamma$ values are used once in the $\delta$ calculation. Therefore, the $\gamma$ values require $2.5(32n - 40) = 80n - 100$ reads. Each $\alpha$ is used once in the $\delta$ calculation, which requires $8n - 10$ reads. Each $\beta$ is used once in the $\delta$ calculation, which requires a further $8n - 10$ reads. In total, $96n - 120$ reads are required for the decoding.

Because our standard cell library contains no memory standard cells, a power analysis of the memory unit cannot be performed. Therefore, we used the datasheet of a 64 Mbit memory product, the NEC uPD4564163 [76], to calculate the energy consumption of the memory. According to the datasheet, assuming that no read misses occur during decoding, each read or write operation requires at least 2 clock cycles and consumes 9900 pJ. Note that this product is outdated compared with the technology we used to estimate the operation energy consumption. With a more suitable memory product, the energy consumption might therefore be reduced. In this case, reading and writing consume the same amount of energy in the memory. The total energy consumption is then calculated as:
$$E_{memory} = 9900(96n - 120) = 9.5 \times 10^5 n - 1.19 \times 10^6 \text{ pJ} \qquad (4.4)$$
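The access counting behind (4.4) can be reproduced with a short Python sketch. The labels `gamma`, `alpha` and `beta` are assumed names for the three stored quantities, and the table layout and function names are illustrative choices. By default the energy function follows (4.4) as written, which multiplies the per-access energy by the read count $96n - 120$; the `include_writes` flag is a hypothetical option, not used in the text, that also charges the $48n - 60$ writes.

```python
from fractions import Fraction

ACCESS_ENERGY_PJ = 9900  # energy per read or write access, as quoted from [76]

# For each stored variable: (slope, offset, reads_per_value), where the number
# of stored values per n-bit frame is slope * n - offset and reads_per_value
# is the average number of times each stored value is read back.
STORED_VARIABLES = {
    "gamma": (32, 40, Fraction(5, 2)),  # two full uses plus one half use
    "alpha": (8, 10, Fraction(1)),
    "beta": (8, 10, Fraction(1)),
}


def access_counts(n):
    """Return (writes, reads) for decoding one n-bit frame."""
    writes = 0
    reads = Fraction(0)
    for slope, offset, reads_per_value in STORED_VARIABLES.values():
        values = slope * n - offset
        writes += values  # every stored value is written exactly once
        reads += reads_per_value * values
    return writes, int(reads)


def memory_energy_pj(n, include_writes=False):
    """Memory energy in pJ; the default matches Equation (4.4) as written."""
    writes, reads = access_counts(n)
    accesses = reads + writes if include_writes else reads
    return ACCESS_ENERGY_PJ * accesses


if __name__ == "__main__":
    for n in (40, 5114):
        print(n, access_counts(n), memory_energy_pj(n))
```

Keeping the counts in a table makes it straightforward to substitute the storage requirements of another decoding algorithm, which is the comparison that level 1 of the framework is intended to support.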
Therefore, for a 40-bit frame, the total energy consumed by the memory is $E_{memory} = 3.68 \times 10^7$ pJ. For a 5114-bit frame, $E_{memory} = 4.86 \times 10^9$ pJ.

At this level of the framework, the analysis results allow different Turbo decoding algorithms to be compared from the operation energy point of view and the memory energy point of view. If the memory and the standard cells are both available in the same technology library, it is reasonable to regard the sum of these two parts as the estimate of the energy consumption directly related to the algorithm. As discussed, the remaining part of the system energy consumption depends strongly on the design strategy and hence cannot be accurately estimated at the algorithm level. In the next section, we discuss the future work on level 2 of our framework, which aims to give a total energy consumption estimate that considers all the possible energy contributions in the system. Since an accurate estimate of the total system energy consumption is impossible at the algorithm level, reasonable assumptions are required for level 2 of the framework.
4.3.2 Future work: Level 2 of the framework
The level 2 framework we propose is based on complexity, memory and parallelism analysis of the mathematical description of the algorithm. By converting the description into factor graphs, the computing resources, the memory requirement and the parameters of the control unit can be obtained. The energy estimation is based on a look-up table of the energy consumption of the different entities in the design. The look-up table is built by simulation with a particular technology library, in our case the STMicroelectronics 0.12 µm process standard cell library. A digital circuit system can be divided into three components, namely the datapath architecture, the system memory and the controller. Hence, the total energy consumption of a system is divided into three parts, as expressed in (4.5):
$$E_{total} = E_{datapath} + E_{memory} + E_{controller} \qquad (4.5)$$
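A minimal Python sketch of how a look-up table could feed (4.5) is given below. Since level 2 is still at the planning stage, everything here is an assumption: the table layout, the entity names and in particular the controller entry, which is a placeholder value with no basis in measurement. Only the adder, max and memory access energies are the figures quoted in Section 4.3.1.

```python
# Hypothetical look-up table: entity -> (component, energy per use in pJ).
EXAMPLE_LUT = {
    "adder_8bit": ("datapath", 0.04591),   # quoted in Section 4.3.1
    "max_8bit": ("datapath", 175.0),       # quoted in Section 4.3.1
    "memory_access": ("memory", 9900.0),   # quoted in Section 4.3.1
    "controller_step": ("controller", 1.0),  # placeholder, not measured
}


def total_energy_pj(usage, lut):
    """Split the energy into the three parts of Equation (4.5).

    usage: entity name -> number of uses during the task.
    lut: entity name -> (component, energy per use in pJ).
    Returns a dict with the three parts and their sum under "total".
    """
    parts = {"datapath": 0.0, "memory": 0.0, "controller": 0.0}
    for entity, count in usage.items():
        component, energy = lut[entity]
        parts[component] += count * energy
    parts["total"] = parts["datapath"] + parts["memory"] + parts["controller"]
    return parts


if __name__ == "__main__":
    n = 40  # frame length in bits
    usage = {
        "adder_8bit": 97 * n - 10,
        "max_8bit": 30 * n - 20,
        "memory_access": 96 * n - 120,
        "controller_step": 1000,  # hypothetical count for illustration
    }
    print(total_energy_pj(usage, EXAMPLE_LUT))
```

In the proposed framework the usage counts would come from the factor graph and flowchart analysis rather than being entered by hand as they are in this usage example.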
Our framework estimates the energy using clock-cycle-accurate analysis and timing analysis. The cycle-accurate analysis considers the energy consumption of the different hardware components in their different modes of operation, typically operating mode and idle mode. The timing analysis considers the required operating times of the different components and the total time needed to process a typical task (e.g. decoding a data frame). From these, the energy consumption of the system, such as the average energy consumption per clock cycle or the energy consumption of a particular task, can be obtained.

A flowchart of the framework is shown in Figure 4.2. The mathematical description of an algorithm is the basic input of the framework; an executable program is not required. A complicated algorithm is usually divided into many steps for ease of implementation. Therefore, a partitioning analysis of the algorithm is needed. After this, the algorithm can be converted into a factor-graph-based description. In our framework, a factor graph is used to describe the computational complexity of the algorithm and an overall flowchart is used to describe the dependence between the steps. The computing resources required can be estimated from the factor graph. Note that the computing resource estimation includes all the entities in the datapath. By considering the resource requirement of each step and the dependence information in the overall flowchart, the overall resource constraint can be obtained. In addition, estimates of the control signals, the controller states, the memory requirement and the memory accesses are produced. Timing analysis is performed using the information from the factor graph and the overall flowchart. This yields the total clock cycle requirement, which gives the clock frequency constraint and is used in the later cycle-accurate estimation. Finally, the three parts of the energy consumption, namely those of the datapath, the memory and the controller, can be obtained, as shown in Figure 4.2.

[Figure 4.2: Flowchart of energy estimation framework. Blocks: Mathematical Description; Partitioning Analysis; Factor Graph for Complexity Analysis; Computing Resource Estimation; Overall Flowchart; Resource Constraint; Control Signal Estimation; Memory Access Estimation; Timing Analysis; Memory Requirement Estimation; Controller State Estimation; Energy Estimation of Datapath; Energy Estimation of Memory; Energy Estimation of Controller.]
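The combination of cycle-accurate analysis and timing analysis described in this section can be sketched in Python as follows. This is only one possible reading of the idea, under the assumption that each component has a fixed energy per clock cycle in operating mode and in idle mode, and that the timing analysis supplies the number of operating cycles per component and the total cycle count of the task. The class, the function names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    operating_pj_per_cycle: float  # energy per cycle in operating mode
    idle_pj_per_cycle: float       # energy per cycle in idle mode
    operating_cycles: int          # from the timing analysis


def task_energy_pj(components, total_cycles):
    """Energy per component for one task (e.g. decoding one frame)."""
    energy = {}
    for c in components:
        if not 0 <= c.operating_cycles <= total_cycles:
            raise ValueError(f"invalid cycle count for {c.name}")
        idle_cycles = total_cycles - c.operating_cycles
        energy[c.name] = (c.operating_cycles * c.operating_pj_per_cycle
                          + idle_cycles * c.idle_pj_per_cycle)
    return energy


def average_energy_per_cycle_pj(components, total_cycles):
    """Average system energy per clock cycle over the whole task."""
    return sum(task_energy_pj(components, total_cycles).values()) / total_cycles


if __name__ == "__main__":
    # Placeholder figures chosen for illustration only.
    system = [
        Component("datapath", 2.0, 0.5, 60),
        Component("memory", 10.0, 1.0, 20),
        Component("controller", 1.0, 1.0, 100),
    ]
    print(task_energy_pj(system, total_cycles=100))
    print(average_energy_per_cycle_pj(system, total_cycles=100))
```

Summing the per-component results gives the three terms of (4.5) for the task, while dividing by the total cycle count gives the average energy per clock cycle mentioned above.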
Chapter 5
Conclusions and Further Works

In this report, we investigated the state of the art in the development of wireless communication systems for BANs. Based on this investigation of previous works, we proposed the promising solution of applying Turbo-like codes in the channel coding scheme of BAN communication systems. This proposal raised the need to explore the fixed-point low power implementation of Turbo-like codes and to evaluate different Turbo-like codes from the energy consumption point of view.

Therefore, in Chapter 3, we proposed a method based on EXIT chart analysis to determine the optimal data width specification of a Turbo-like decoding algorithm in a fixed-point low power implementation. This issue is highly significant for the energy consumption of the implementation. We demonstrated our method by applying it to the UMTS Turbo decoder. We considered the different conditions of the overflow issue in the implementation and compared our results with previous works. The advantages of our method over the conventional BER/FER chart analysis method were revealed.

In Chapter 4, we proposed a framework to evaluate different Turbo-like codes from the energy consumption point of view. The framework has two levels. Level 1 of the framework considers the energy consumption of the ACS operations required by the algorithm and of the related memory requirement. Level 1 has a relatively simple procedure, which can easily be applied to the target algorithms. From the energy consumption point of view, it offers a better evaluation of the algorithms than conventional complexity evaluation, with little extra effort. Level 2 of the framework is the future work plan of the project. It aims to create a procedure that allows gate-level energy consumption estimation of the target algorithms without requiring hardware design knowledge. Some details of this level of the framework were discussed.
The work presented in this report is the preparation for exploring the novel decoding scheme on the relays in BANs proposed in Chapter 1. Therefore, the future work is to apply the two proposed methods to the Turbo-like codes that are considered suitable for BAN applications. By also considering the other parts of the energy consumption on the relays, including the receiving and transmission power and the modulation and demodulation schemes, our novel decoding scheme can be investigated.
Bibliography
[1] S. Drude, "Requirements and applications scenarios for body area networks," in Mobile and Wireless Communications Summit, 2007. 16th IST, 2007.
[2] T. G. Zimmerman, "Personal area networks: Near-field intrabody communication," IBM System Journal, vol. 35, pp. 609–617, 1996.
[3] B. Zhen, H. Li, and R. Kohno, "IEEE body area networks for medical applications," in Wireless Communication Systems, 2007. ISWCS 2007. 4th International Symposium on, 2007.
[4] M. L. R. Fox, H. Symons, S. Berson, and H. Westphal, "Fcc proposes rules for body area networks (mban)," Jul. 2009, access from: http://mobihealthnews.com/3078/fcc-proposes-rules-for-body-area-networks-mban/.
[5] "Revision of part 15 regarding ultra-wideband transmission systems. first report and order, et docket, 98-153, fcc 02-48," Federal Communications Commission (FCC), Tech. Rep., 2002.
[6] J. Ryckaert, C. Desset, V. de Heyn, M. Badaroglu, P. Wambacq, G. V. der Plas, and B. V. Poucke, "Ultra-wideband transmitter for wireless body area networks," in Proceeding on 14th IST Mobile & Wireless Communications Summit, Jun. 2005.
[7] H. Li, K. Takizawa, B. Zhen, and R. Kohno, "Body area network and its standardization at IEEE 802.15.MBAN," in Mobile and Wireless Communications Summit, 2007. 16th IST, Jul. 2007, pp. 1–5.
[8] J. A. D. Moutinho, "Wireless body area network," 2009, rECIN2009.
[9] V. M. Jones, R. G. A. Bults, D. Konstantas, and P. A. M. Vierhout, "Healthcare pans: Personal area networks for trauma care and home care," in 4th International Symposium on Wireless Personal Multimedia Communications (WPMC), 2001, pp. 1369–1374.
[10] M. Soini, J. Nummela, P. Oksa, L. Ukkonen, and L. Sydänheimo, "Wireless body area network for hip rehabilitation system," Ubiquitous Computing and Communication Journal, vol. 3, p. 7, 2008.
[11] B. Zhen, "Ban technical requirements," IEEE 802.15.TG6, Tech. Rep., Sep. 2008.
[12] B. Latré, I. Moerman, B. Dhoedt, and P. Demeester, "Networking in wireless body area networks," in 5th FTW PHD Symposium, Interactive poster session, Dec. 2004, p. 113.
[13] C. K. Singh and A. Kumar, "Performance evaluation of an IEEE 802.15.4 sensor network with a star topology," Wireless Networks, vol. 14, no. 4, pp. 543–568, Aug. 2008.
[14] S. Choi, S. Song, K. Sohn, H. Kim, J. Kim, J. Yoo, and H. Yoo, "A low-power star-topology body area network controller for periodic data monitoring around and inside the human body," in 2006 10th IEEE International Symposium on Wearable Computers, Oct. 2006, pp. 139–140.
[15] A. G. Ruzzelli, R. Jurdak, G. M. P. O'Hare, and P. V. D. Stok, "Energy-efficient multi-hop medical sensor networking," in Proceedings of the 1st ACM SIGMOBILE international workshop on Systems and networking support for healthcare and assisted living environments, 2007, pp. 37–42.
[16] B. Latré, B. Braem, I. Moerman, C. Blondia, E. Reusens, W. Joseph, and P. Demeester, "A low-delay protocol for multihop wireless body area networks," in MobiQuitous 2007. Fourth Annual International Conference on Mobile and Ubiquitous Systems: Networking & Services, Aug. 2007, pp. 1–8.
[17] J. Misic, "Enforcing patient privacy in healthcare WSNs using ECC implemented on 802.15.4 beacon enabled clusters," in Pervasive Computing and Communications, 2008. PerCom 2008. Sixth Annual IEEE International Conference on, 2008.
[18] J. Rousselot, A. El-Hoiydi, and J.-D. Decotignie, "Performance evaluation of the IEEE 802.15.4a UWB physical layer for body area networks," in Computers and Communications, 2007. ISCC 2007. 12th IEEE Symposium on, 2007.
[19] D. Domenicali and M.-G. D. Benedetto, "Performance analysis for a body area network composed of IEEE 802.15.4a devices," in Proceedings of 4th Workshop on Positioning, Navigation and Communication 2007 (WPNC07), Hannover, Germany, Mar. 2007, pp. 273–276.
[20] M. R. Yuce, "Implementation of body area networks based on MICS/WMTS medical bands for healthcare systems," in IEEE Engineering in Medicine and Biology Society Conference (IEEE EMBC08), Aug. 2008, pp. 3417–3421.
[21] S. Stoa, I. Balasingham, and T. A. Ramstad, "Data throughput optimization in the ieee 802.15.4 medical sensor networks," in ISCAS 2007. IEEE International Symposium on Circuits and Systems, May 2007, pp. 1361–1364.
[22] X. Liang and I. Balasingham, "Performance analysis of the IEEE 802.15.4 based ecg monitoring network," in Proceeding of The Seventh IASTED International Conferences on Wireless and Optical Communications (WOC07), 2007.
[23] R. C. Shah, L. Nachman, and C. Wan, "On the performance of bluetooth and ieee 802.15.4 radios in a body area network," in Proceedings of the ICST 3rd international conference on Body area networks, Tempe, Arizona, 2008.
[24] D. D. Arumugam and D. W. Engels, "Impacts of rf radiation on the human body in a passive rfid environment," in 2008 IEEE Antennas and Propagation Society International Symposium, Jul. 2008, pp. 1–4.
[25] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near shannon limit error correcting coding and decoding: Turbo codes," in IEEE Proceedings of the Int. Conf. on Communications, 1993.
[26] C. Berrou and A. Glavieux, "Near optimum error correcting coding and decoding: Turbo-codes," IEEE Trans. on Communications, vol. 44, no. 10, pp. 1261–1271, Oct. 1996.
[27] I. Joe, "Energy efficiency maximization for wireless sensor networks," International Federation for Information Processing, vol. 211, pp. 115–122, 2006.
[28] S. Benedetto and G. Montorsi, "Serial concatenated of block and convolutional codes," Electronics Letters, vol. 32, no. 10, pp. 887–888, May 1996.
[29] R. Gallager, "Low-density parity-check codes," IRE Transaction on Information Theory, vol. 8, no. 1, pp. 21–28, Jan. 1962.
[30] J. G. D. Forney, "Concatenated codes," Massachusetts Institute of Technology Research Lab of Electronics, Tech. Rep., 1966.
[31] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," SIAM Journal of Applied Math, vol. 8, pp. 300–304, 1960.
[32] P. Elias, "Coding for noisy channels," in IRE Convention Record Pt. 4, 1955, pp. 37–37.
[33] J. H. Yuen, M. K. Simon, W. Miller, F. Pollara, C. R. Ryan, D. Divsalar, and J. C. Morakis, "Modulation and coding for satellite and space communications," in Proceedings of the IEEE, vol. 78, no. 7, Jul. 1990, pp. 1250–1265.
[34] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. IT-13, pp. 493–497, Apr. 1967.
[35] E. Boutillon, C. Douillard, and G. Montorsi, "Iterative decoding of concatenated convolutional codes: Implementation issues," in Proceedings of the IEEE, 2007.
[36] B. Sklar, Fundamentals of Turbo Codes, Digital Communications: Fundamentals and Applications, Second Edition. Prentice-Hall, 2001.
[37] C. Schlegel and L. Perez, Trellis and Turbo coding, ser. IEEE Press series on digital & mobile communication, J. B. Anderson, Ed. John Wiley & Sons, 2004.
[38] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Transactions on Information Theory, vol. 20, no. 3, pp. 284–287, Mar. 1974.
[39] "3rd generation partnership project; technical specification group radio access network; multiplexing and channel coding (tdd) (release 7)," 3GPP Organizational Partners (ARIB, ATIS, CCSA, ETSI, TTA, TTC), Tech. Rep., 2008.
[40] C. Weiss, C. Bettstetter, and S. Riedel, "Code construction and decoding of parallel concatenated tail-biting codes," IEEE Transactions on Information Theory, vol. 47, no. 10, pp. 366–386, Jan. 2001.
[41] S. ten Brink, "Convergence behavior of iteratively decoded parallel concatenated codes," IEEE Transactions on Communications, vol. 49, no. 10, pp. 1727–1737, Oct. 2001.
[42] H. Michel and N. Wehn, "Turbo-decoder quantization for umts," IEEE Communication Letters, vol. 5, no. 2, pp. 55–57, Feb. 2001.
[43] M. A. Castellon, I. J. Fair, and D. G. Elliott, "Fixed-point Turbo decoder implementation suitable for embedded applications," in Electrical and Computer Engineering, 2005. Canadian Conference on, May 2005, pp. 1065–1068.
[44] J. Hsu and C. Wang, "On finite-precision implementation of a decoder for turbo codes," in Proceedings of the 1999 IEEE International Symposium on, vol. 4, Orlando, FL, USA, Jul. 1999, pp. 423–426.
[45] A. Worm, H. Michel, F. Gilbert, G. Kreiselmaier, M. Thul, and N. Wehn, "Advanced implementation issues of turbo-decoders," in Proc. 2nd International Symposium on Turbo-Codes and Related Topics, 2000, pp. 351–354.
[46] T. K. Blankenship and B. Classon, "Fixed-point performance of low-complexity turbo decoding algorithms," in Vehicular Technology Conference, 2001. IEEE VTS 53rd, vol. 2, Rhodes, Greece, 2001, pp. 1483–1487.
[47] G. Montorsi and S. Benedetto, "Design of fixed-point iterative decoders for concatenated codes with interleavers," IEEE Journal on Selected Areas in Communications, vol. 19, pp. 871–882, 2001.
[48] Y. Wu, B. D. Woerner, and T. K. Blankenship, "Data width requirements in siso decoding with modulo normalization," IEEE Transactions on Communications, vol. 49, no. 11, pp. 1861–1868, Nov. 2001.
[49] R. Hoshyar, A. R. S. Bahai, and R. Tafazolli, "Finite precision Turbo decoding," in Proc. 3rd International Symposium on Turbo Codes and Related Topics, Brest, France, Sep. 2003, pp. 483–486.
[50] A. Morales-Cortes, R. Parra-Michel, L. F. Gonzalez-Perez, and T. G. Cervantes, "Finite precision analysis of the 3gpp standard turbo decoder for fixed-point implementation in fpga devices," in Reconfigurable Computing and FPGAs, 2008. International Conference on, Dec. 2008, pp. 43–48.
[51] S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, "A soft-input soft-output app module for iterative decoding of concatenated codes," IEEE Communications Letters, vol. 1, no. 1, pp. 22–24, Jan. 1997.
[52] V. Singh, "Elimination of overflow oscillations in fixed-point state-space digital filters using saturation arithmetic," IEEE Transactions on Circuits and Systems, vol. 37, no. 6, pp. 814–818, Jun. 1990.
[53] D. A. Balley and A. A. Beer, "Simulation of filter structures for fixed-point implementation," in Proceeding of the 28th Southeastern Symposium on System Theory, Baton Rouge, LA, USA, 1996, pp. 270–274.
[54] G. Masera, Turbo Code Applications: a journey from a paper to realization, K. Sripimanwat, Ed. Springer Netherlands, 2005.
[55] A. Hekstra, "An alternative to metric rescaling in viterbi decoders," IEEE Transactions on Communications, vol. 37, pp. 1220–1222, Nov. 1989.
[56] B. Riaz and J. Bajcsy, "Impact of finite precision arithmetics on exit chart analysis of turbo codes," in 5th IEEE Consumer Communications and Networking Conference, 2008. CCNC 2008., 2008.
[57] P. Robertson, E. Villebrun, and P. Hoeher, "A comparison of optimal and suboptimal map decoding algorithm operating in the log domain," in Proceeding of IEEE International Conference of Communication, 1995, pp. 1009–1013.
[58] M. C. Valenti and J. Sun, "The UMTS Turbo code and an efficient decoder implementation suitable for software-defined radios," International Journal of Wireless Information Networks, vol. 8, no. 4, pp. 203–215, Oct. 2001.
[59] F. N. Najm, "A survey of power estimation techniques in VLSI circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 446–455, Dec. 1994.
[60] O. Celebican, T. S. Rosing, and V. J. M. III, "Energy estimation of peripheral devices in embedded systems," in Proceedings of the 14th ACM Great Lakes symposium on VLSI, Boston, MA, USA, 2004, pp. 430–435.
[61] J. Kaza and C. Chakrabarti, "Design and implementation of low-energy turbo decoders," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 12, no. 9, pp. 968–977, Sep. 2004.
[62] S. Chouhan, R. Bose, and M. Balakrishnan, "A framework for energy-consumption-based design space exploration for wireless sensor nodes," IEEE Transaction on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 7, pp. 1017–1024, Jul. 2009.
[63] E. Macii, "CAD algorithms, methods and tools for low-power circuits and systems," IEEE Technology Surveys, Tech. Rep., 2006.
[64] G. Masera, M. Mazza, G. Piccinini, F. Viglione, and M. Zamboni, "Architectural strategies for low-power VLSI Turbo decoders," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 3, pp. 279–285, Jun. 2002.
[65] K. Hildingsson, T. Arslan, and A. T. Erdogan, "Energy evaluation methodology for platform based system-on-chip design," in Proceedings of IEEE Computer society Annual Symposium on, Feb. 2004, pp. 61–68.
[66] T. V. Aa, M. Jayapala, F. Barat, H. Corporaal, F. Catthoor, and G. Deconinck, "A high-level memory energy estimator based on reuse distance," in Proceedings of the 3rd Workshop on Optimizations for DSP and Embedded Systems (ODES05), San Jose, Calif, USA, Mar. 2005.
[67] J. Laurent, E. Senn, N. Julien, and E. Martin, "High-level energy estimation for DSP systems," in Proceedings of Int. Workshop on Power And Timing Modeling, Optimization and Simulation PATMOS 2001, 2001, pp. 311–316.
[68] C. Menn, O. Bringmann, and W. Rosenstiel, "Controller estimation for FPGA target architectures during high-level synthesis," in Proceedings of the 15th international symposium on System Synthesis, 2002.
[69] P. Landman, "High-level power estimation," in Proceedings of ISLPED, 1996, pp. 29–35.
[70] P. Surti and L. Chao, "Controller power estimation using information from behavioral description," in ISCAS 96., vol. 4, May 1996, pp. 679–682.
[71] J. N. Kozhaya and F. N. Najm, "Accurate power estimation for large sequential circuits," in Proceedings of the 1997 IEEE/ACM international conference on Computer-aided design, San Jose, California, United States, 1997, pp. 488–493.
[72] M. Lesser and V. Ohm, "Accurate power estimation for sequential cmos circuits using graph-based methods," VLSI Design, vol. 12, pp. 187–203, 2001.
[73] M. Khaddour and O. Hammami, "High level energy consumption estimation of cryptographic algorithms," in ICTTA 2008. 3rd International Conference on, Apr. 2008, pp. 1–6.
[74] A. B. A. Garcia, J. Gobert, T. Dombek, H. Mehrez, and F. Petrot, "Energy estimations in high level cycle-accurate descriptions of embedded systems," in Proceedings of 5th International Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS2002), Brno, Czech Republic, Apr. 2002, pp. 228–235.
[75] "Datasheet of primetime," Synopsys, Tech. Rep., 2009.
[76] "Nec upd4564163 datasheets: 64mbit synchronous dram, 4 bank, lvttl," NEC, Tech. Rep., 1998.


Introduction of Body Area Networks (BANs)

Figure 1.1: A typical BANs architecture.

1.2.1.2 Network scale and communication range

Table 1.1: Data rate requirement of different applications in BANs

1.2.1.4 Reliability, accuracy and latency

1.2.1.6 Network topology

Candidate options for Body Area Networks

Figure 1.2: Two ways of concatenating Turbo-like codes (panels: Parallel Concatenated Code; Serial Concatenated Code).

Figure 1.3: Two decoding schemes of two types of Turbo-like codes (panels: Parallel Decoding Scheme; Serial Decoding Scheme).

Outline of the report

Chapter 5 gives the conclusion of this report.

Turbo-like Code Solutions in BANs

Figure 2.1: Transmission scheme of serial concatenation codes.

Figure 2.4: A classical Turbo encoder.

Figure 2.5: A classical Turbo decoder.

Turbo codes and BCJR decoding algorithm

UMTS encoder and decoder architecture

Figure 2.7: Scheme of UMTS Turbo encoder.

[Figure: state transition trellises of the encoder (panels: Transition trellis for encoding bits; Transition trellis for termination bits).]

For termination bits:

Figure 2.9: Trellis diagram of a transition sequence.

Figure 2.11: Scheme of UMTS Turbo decoder.

decisions become log-likelihood ratios (LLRs) defined as:

Figure 2.12: An example trellis of a short terminated trellis code.

For example, in Figure 2.12,


T is started from a particular state S, that is (S) represents the probability

The algorithm is accomplished.

EXIT chart analysis