
A Project Report On

IMPLEMENTATION OF WAVELET TRANSFORM BASED


IMAGE COMPRESSION ON FPGA

 
Submitted to the faculty

Of

Department of Electronics and Communication Engineering

NATIONAL INSTITUTE OF TECHNOLOGY, WARANGAL (A.P)

In the partial fulfillment of the requirements

for the award of the degree

MASTER OF TECHNOLOGY

(Specialization: Electronic Instrumentation)

By

E.VAMSHI KRISHNA (062502)


Under the guidance of

Sri P.MURALIDHAR
Lecturer

Department of Electronics and Communication Engineering


NATIONAL INSTITUTE OF TECHNOLOGY, WARANGAL (A.P)

(DEEMED UNIVERSITY)

ANDHRA PRADESH, PIN-506004,

2006-2008

 
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY
(DEEMED UNIVERSITY)
WARANGAL – 506 004
2008

CERTIFICATE

This is to certify that this is the bonafide record of the project work
“IMPLEMENTATION OF WAVELET TRANSFORM BASED IMAGE COMPRESSION
ON FPGA” carried out by E.VAMSHI KRISHNA (062502) of final year M.Tech
(Electronic Instrumentation) during the academic year 2007-2008 in the partial
fulfillment of the requirements for the award of the degree of Master of Technology.

Sri P.MURALIDHAR,
Project Guide,
Dept. of E.C.E,
National Institute of Technology,
Warangal-506004.

Dr. N.V.S.N. SARMA,
Professor and Head,
Dept. of E.C.E,
National Institute of Technology,
Warangal-506004.

ACKNOWLEDGEMENT

I would like to express my sincere thanks to Sri P.Muralidhar, Lecturer,
Department of Electronics and Communication Engineering, National Institute of
Technology, Warangal, for his supervision, guidance, suggestions, and invaluable
encouragement during this project. His continuous involvement in the progress of the
project was of great help to me. He has been a constant source of inspiration and has
helped at each step of the project.

I am grateful to Dr. N.V.S.N. Sarma, Professor and Head, Department of
Electronics and Communication Engineering, National Institute of Technology,
Warangal, for providing the necessary lab facilities at the college to carry out this
project.

I wish to thank all the staff members in the department for their kind cooperation
and support given throughout my project work. I am also thankful to all of my friends
who have given valuable suggestions throughout the project work.

Finally, I wish to thank all those who have been involved, directly or indirectly, in the
successful completion of my project work.

E.VAMSHI KRISHNA

 
ABSTRACT

Image processing systems can encode raw images with different degrees of
precision, achieving varying levels of compression. Different encoders with different
compression ratios can be built and used for different applications. The need to
dynamically adjust the compression ratio of the encoder arises in many applications. One
example involves the real-time transmission of encoded data over a packet switched
network.

To suitably adapt the encoder to varying compression requirements, adaptive
adjustments of the compression parameters are required. This involves reconfiguring the
encoder in an efficient manner. My approach exploits the reconfigurable nature of Field
Programmable Gate Arrays (FPGAs) to adapt the encoder to the varying requirements in
real time. A wavelet transform based image compression scheme is implemented for
encoding gray-scale frames of 512 by 512 pixels on FPGAs. By varying the zero
thresholds, the encoder can achieve varying compression levels.

In this thesis, a design for efficient hardware acceleration of the Discrete Wavelet
Transform is proposed and implemented on FPGA. The complete design of the codec on
FPGA is presented. Implementation details of the individual blocks are discussed in great
detail. Finally, results from testing are reported and discussed. 

 
CONTENTS

Abstract
List of Figures
List of Tables

Chapter 1: INTRODUCTION
1.1 Motivation
1.2 Background
1.3 Problem Description
1.4 Organization of Thesis

Chapter 2: WAVELET TRANSFORM BASED IMAGE COMPRESSION
2.1 Introduction
2.2 Principles of Image Compression
2.3 Wavelet Transform as Source Encoder
2.4 Measuring Frequency content by W.T
2.5 Wavelets
2.5.1 Compact support, Vanishing moments & Smoothness
2.5.2 Orthogonal & Bi-orthogonal Wavelets
2.5.3 The Haar Wavelet
2.5.4 Lifting Scheme
2.5.5 Wavelets that map Integer to Integer
2.5.6 (2,2) Bi-orthogonal Cohen Daubechies Feauveau Wavelet
2.5.7 Boundary Treatment
2.5.8 Advantages of Wavelets

Chapter 3: HARDWARE
3.1 Field Programmable Gate Array Architecture
3.1.1 Applications of FPGA
3.2 Altera DE2 Board
3.2.1 Examine the Board
3.2.2 Features
3.3 NIOS II Processor
3.3.1 Introduction
3.3.2 NIOS II Architecture
3.3.3 NIOS II Hardware Development
3.3.4 System Development Flow
3.3.5 Generating the System in SOPC builder
3.3.6 NIOS II Software Development Tools
3.3.7 Custom Instructions
3.3.8 Summary of Development of NIOS II System
3.3.9 SOPC Builder

Chapter 4: DESIGN AND IMPLEMENTATION
4.1 Design Parameters and Constraints
4.1.1 Memory Read/Write
4.1.2 Design Partitioning
4.2 Stage 1: Discrete Wavelet Transform
4.2.1 (2,2) Wavelet
4.2.2 DWT in X and Y Direction
4.2.3 3 Stages of Waveletting
4.2.4 Overall Architecture of Stage 1
4.3 Stage 2
4.3.1 Dynamic Quantization
4.3.2 Zero Thresholding & RLE on Zeros
4.3.3 Entropy Encoding
4.3.4 Bit Packing
4.3.5 Output File Format
4.3.6 Stage 2 Overall Architecture
4.4 Stage 3
4.4.1 Entropy Decoding
4.4.2 Run Length Decoding
4.4.3 Dequantization
4.4.4 Stage 3 Overall Architecture
4.5 Stage 4: Inverse DWT
4.5.1 3 Stages of Inverse Waveletting
4.6 Implementations

Chapter 5: RESULTS
5.1 Metrics for Testing
5.2 VHDL Simulation Results
5.3 NIOS II Implementation Results
5.3.1 Results Analysis of LENA Image
5.3.2 Results Analysis of BARBARA Image
5.3.3 Results Analysis of GOLD HILL Image
5.4 Performance Graphs

CONCLUSION
FUTURE WORK
APPENDIX
REFERENCES

List of Figures
 

Figure 2.1: A Typical Lossy Image Compression System
Figure 2.2: Lifting Scheme
Figure 3.1: Structure of FPGA
Figure 3.2: Altera DE2 Board
Figure 3.3: DE2 Board Block Diagram
Figure 3.4: Example Design of NIOS
Figure 3.5: NIOS II Development Flow
Figure 3.6: NIOS II IDE Workbench
Figure 3.7: Custom Instruction Logic in NIOS
Figure 3.8: Components of SOPC System
Figure 3.9: Multi Master System Module with Avalon Switch Fabric
Figure 4.1: Flow Chart showing the sequence of steps in Compression
Figure 4.2: Stage wise Ordering of Routines
Figure 4.3: Coefficient ordering along X Direction
Figure 4.4: Coefficient ordering along Y Direction
Figure 4.5: Forward Wavelet Transform Data Flow Blocks
Figure 4.6: RTL view of Forward Wavelet X
Figure 4.7: RTL view of Forward Wavelet Y
Figure 4.8: High Pass and Low Pass Coefficients of Stage1 in X Direction
Figure 4.9: Mallat Ordering along the 3 Stages of Waveletting
Figure 4.10: Interleaved Ordering along the 3 Stages of Waveletting
Figure 4.11: Stage1 Architecture
Figure 4.12: Dynamic Quantizer
Figure 4.13: RTL view of Dynamic Quantizer
Figure 4.14: Run Length Encoder for Continuous zeros
Figure 4.15: RTL view of Run-Length Encoder
Figure 4.16: Entropy Encoding, bit allocation
Figure 4.17: Entropy Encoder
Figure 4.18: Binary Shifter for bit packing
Figure 4.19: Output File Format
Figure 4.20: Stage2, Data Flow Diagram
Figure 4.21: Stage2, Control Flow Diagram
Figure 4.22: Entropy Decoder
Figure 4.23: Run Length Decoder
Figure 4.24: Dequantizer
Figure 4.25: Stage3, Data Flow Diagram
Figure 4.26: Coefficient Processing along X direction
Figure 4.27: Coefficient Processing along Y direction
Figure 4.28: RTL view of Inverse Wavelet
Figure 4.29: Custom Logic System Module with Avalon Switch Fabric
Figure 4.30: Block Diagram of Processing Element (PE)
Figure 4.31: 3-way Handshaking between PE & Host
Figure 4.32: Timing Diagram for Memory Read after a Write Access
Figure 5.1: VHDL Simulation Output for Forward Wavelet X
Figure 5.2: VHDL Simulation Output for Forward Wavelet Y
Figure 5.3: VHDL Simulation Output for Stage1 Top Level
Figure 5.4: VHDL Simulation Output for Inverse Wavelet Transform
Figure 5.5: VHDL Simulation Output for Quantizer
Figure 5.6: VHDL Simulation Output for Dequantizer
Figure 5.7: VHDL Simulation Output for Run Length Encoder
Figure 5.8: VHDL Simulation Output for Shifter
Figure 5.9: PSNR & RMSE Equations
Figure 5.10: Original Image of LENA
Figure 5.11: Reconstructed Images of LENA
Figure 5.12: Original Image of BARBARA
Figure 5.13: Reconstructed Images of BARBARA
Figure 5.14: Original Image of GOLD HILL
Figure 5.15: Reconstructed Images of GOLD HILL
Figure 5.16: Compression Ratio Vs PSNR Graph of LENA
Figure 5.17: Compression Ratio Vs PSNR Graph of BARBARA
Figure 5.18: Compression Ratio Vs PSNR Graph of GOLD HILL

 
List of Tables

Table 2.1: (2,2) CDF Wavelet with Lifting Scheme
Table 4.1: Bit Range allocation for RLE
Table 4.2: Coefficient Value Calculation for Dequantizer
Table 5.1: Compression Level & Noise Measurement of LENA
Table 5.2: Compression Level & Noise Measurement of BARBARA
Table 5.3: Compression Level & Noise Measurement of GOLD HILL


 
CHAPTER 1
INTRODUCTION

1.1 Motivation for Research


A majority of today’s Internet bandwidth is estimated to be used for images and
video. Recent multimedia applications for handheld and portable devices place heavy
demands on the available wireless bandwidth, which remains limited even with new
connection standards. JPEG image compression, which is in widespread use today, took
several years to be perfected. Wavelet based techniques for image compression, such as
JPEG2000, have a lot more to offer than conventional methods in terms of compression
ratio. Wavelet implementations are currently still under development and are
being perfected. Flexible, energy-efficient hardware implementations that can handle
multimedia functions such as image processing, coding and decoding are critical,
especially in hand-held portable multimedia wireless devices.

1.2 Background

Among the several compression standards available, the JPEG image
compression standard is in widespread use today. JPEG uses the Discrete Cosine
Transform (DCT), applied to 8-by-8 blocks of image data. Because only the
correlation of pixels inside each 8-by-8 block is considered and the correlation
between neighboring blocks is neglected, the DCT introduces undesirable blocking
artifacts at the block boundaries. For this reason, the newer JPEG2000 standard is based
on the Wavelet Transform (WT). The Wavelet Transform offers multi-resolution image
analysis, which appears to be well matched to the low-level characteristics of human
vision. The DCT is essentially unique, but the WT has many possible realizations.
Wavelets provide a basis more suitable for representing images. In terms of
implementation, a hardware platform provides faster execution than a software platform.

1.3 Problem Description
Data compression is a powerful enabling technology that plays a vital role in the
information age. Among the various types of data commonly transferred over networks,
image and video data comprise the bulk of the bit traffic. For example, current
estimates indicate that image data take up over 40% of the volume on the Internet. The
explosive growth in demand for image and video data, coupled with delivery
bottlenecks, has kept compression technology at a premium. To address this problem, an
efficient image compression technique using the Wavelet Transform is proposed.
Execution time matters as much as compression efficiency, so an FPGA is used as the
implementation platform: it provides fast execution and is also reconfigurable. The
implementation of Wavelet Transform based image compression on an FPGA is thus the
main focus of my work.

1.4 Organization of Thesis


In this thesis, Chapter 2 gives a background discussion on wavelet transform
based image compression. Chapter 3 introduces FPGA programmable logic and
describes the tools and methodologies involved. It also describes in depth the
Altera DE2 board used in this research. Chapter 4 presents the design and
implementation of the wavelet transform scheme using FPGAs. Chapter 5 summarizes
the VHDL implementation and discusses the hardware testing results.
Finally, Chapter 6 provides the conclusions and future extensions of the work.

CHAPTER 2
WAVELET TRANSFORM BASED IMAGE COMPRESSION

Many evolving multimedia applications require transmission of high quality
images over the network. One obvious way to accommodate this demand is to increase
the bandwidth available to all users. Of course, this "solution" is not without
technological and economic difficulties. Another way is to reduce the volume of the
data that must be transmitted. There has been a tremendous amount of progress in the
field of image compression during the past 15 years. In order to make further progress in
image coding, many research groups have begun to use wavelet transforms.
In this chapter, image compression, the nature of wavelets, and some of the
salient features of image compression technologies using wavelets are discussed. Since
this is a very rapidly evolving field, only the basic elements are presented.

2.1 Introduction
Image compression is different from binary data compression. When
binary data compression techniques are applied to images, the results are not optimal. In
lossless compression, data such as executables and documents are compressed so that
decompression gives an exact replica of the original; such data must be reproduced
exactly. For example, popular PC utilities like WinZip and Adobe Acrobat perform
lossless compression. Images, on the other hand, need not be reproduced exactly. A
‘good’ approximation of the original image is enough for most purposes, as long as the
error between the original and the reconstructed image is tolerable, so lossy
compression techniques can be used. Images have certain statistical properties that can
be exploited by encoders specifically designed for them, and some of the finer details in
the image can be sacrificed for the sake of saving bandwidth or storage space.

In digital images the neighboring pixels are correlated and therefore contain
redundant information. Before the image is compressed, the correlated pixels must be
identified. The fundamental components of compression are redundancy reduction and
irrelevancy reduction. Redundancy means duplication, and irrelevancy means the parts
of the signal that will not be noticed by the signal receiver, which is the Human Visual
System (HVS). Three types of redundancy can be identified:
Spatial Redundancy is the correlation between neighboring pixel values.
Spectral Redundancy is the correlation between different color planes or
spectral bands.
Temporal Redundancy is the correlation between adjacent frames in a sequence
of images (in video applications).

Image compression focuses on reducing the number of bits needed to represent
an image by removing the spatial and spectral redundancies. Since my work focuses on
still image compression, temporal redundancy is not discussed.

2.2 Principles of Image Compression


A typical lossy image compression scheme is shown in Figure 2.1. The system
consists of three main components, namely the source encoder, the quantizer, and the
entropy encoder. The input signal (image) has a lot of redundancies that need to be
removed to achieve compression. These redundancies are not obvious in the time (or
spatial) domain. Therefore, a transform such as the discrete cosine, Fourier, or wavelet
transform is applied to the input signal to bring it to the spectral domain. The
spectral domain output from the transformer is quantized using some quantizing scheme.
The signal then undergoes entropy encoding to generate the compressed signal. The
wavelet transform serves as the source encoder component, which is
the focus of this project.

Input Signal -> Source Encoder -> Quantizer -> Entropy Encoder -> Compressed Signal

Figure 2.1 A Typical Lossy Image Compression System

Source Encoder
The source encoder is the first major component of an image compression system. A
variety of linear transforms are available, such as the Discrete Fourier Transform
(DFT), Discrete Cosine Transform (DCT), and Discrete Wavelet Transform (DWT). The
Discrete Wavelet Transform is the main focus of my work.

Quantizer
A quantizer reduces the precision of the values generated by the source encoder and
therefore reduces the number of bits required to store the transform coefficients. This
process is lossy. Quantization can be performed on each individual coefficient, which is
known as Scalar Quantization (SQ). If it is performed on a group of coefficients together,
it is called Vector Quantization (VQ).
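
As a minimal sketch of scalar quantization, assuming a uniform step size, each
coefficient can be divided by the step and truncated toward zero, and reconstruction
multiplies the index back by the step. The names `quantize`, `dequantize` and `step` are
illustrative only and are not taken from the quantizer designed in Chapter 4.

```python
def quantize(coeffs, step):
    """Uniform scalar quantization: map each coefficient to an integer index."""
    # int() truncates toward zero, so coefficients smaller than one step become 0.
    return [int(c / step) for c in coeffs]


def dequantize(indices, step):
    """Approximate reconstruction of the coefficients from their indices."""
    return [q * step for q in indices]


# Hypothetical transform coefficients, chosen only for illustration.
coeffs = [37, -5, 12, 3, -18, 0, 1, 64]
indices = quantize(coeffs, step=8)
restored = dequantize(indices, step=8)
print(indices)
print(restored)
```

With this truncating rule the reconstruction error of each coefficient stays below one
step, which is where the loss of precision, and the saving in bits, comes from.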

Entropy Encoder
An entropy encoder does further compression on the quantized values. This is
done to achieve even better overall compression. The various commonly used entropy
encoders are the Huffman encoder, arithmetic encoder, and simple run-length encoder.
For improved performance with the compression technique, it’s important to have the
best of all the three components.
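
The simplest of these, run-length encoding, can be sketched as follows. This is a
generic illustration that collapses runs of zeros only, under the assumption that
quantized coefficients contain long zero runs; the pair format `(0, run_length)` and the
function names are hypothetical and do not describe the encoder of Chapter 4.

```python
def rle_zeros(values):
    """Replace each run of zeros by a (0, run_length) pair; keep other values as-is."""
    out = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            if run:
                out.append((0, run))
                run = 0
            out.append(v)
    if run:
        out.append((0, run))  # flush a run that reaches the end of the data
    return out


def rle_zeros_decode(encoded):
    """Expand (0, run_length) pairs back into runs of zeros."""
    out = []
    for item in encoded:
        if isinstance(item, tuple):
            out.extend([0] * item[1])
        else:
            out.append(item)
    return out


data = [4, 0, 0, 0, 1, 0, -2, 0, 0, 0, 0, 8]
encoded = rle_zeros(data)
print(encoded)
```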

2.3 Wavelet Transform as the Source Encoder


The wavelet method for image compression follows the same procedure as any other
image compression scheme. The discrete wavelet transform performs
the function of the source encoder. The theory behind the wavelet transform is discussed
below.

2.4 Measuring Frequency Content by Wavelet Transform


The wavelet transform is capable of providing time and frequency information
simultaneously; hence it gives a time-frequency representation of the signal. This is
useful when we are interested in knowing which spectral components exist at a given
instant of time. In such cases it may be very beneficial to know the time intervals in
which these particular spectral components occur.
Wavelets (small waves) are functions defined over a finite interval and having an
average value of zero. The basic idea of the wavelet transform is to represent any
arbitrary function ƒ(t) as a superposition of a set of such wavelets or basis functions.
These basis functions are obtained from a single wave by dilations or contractions
(scaling) and translations (shifts). The discrete wavelet transform of a finite length signal
x(n) having N components, for example, is expressed by an N x N matrix, similar to the
discrete cosine transform.

2.5 Wavelets

A wavelet is a function localized in time (or space in the case of images) with
mean zero. A wavelet basis is derived from the wavelet (small wave) by its own dilations
and translations:

$$W_{j,k}(t) = 2^{-j/2}\, W(2^{-j} t - k)$$

Functionally, the Discrete Wavelet Transform (DWT) is very similar to the
Discrete Fourier Transform, in that the transformation function is orthogonal. A signal
passed through an orthogonal transform and then through its inverse is unchanged. As
the input signal is a set of samples, both transforms are convolutions. While the basis
function of the Fourier transform is a sinusoid, the wavelet basis is a set of waves
obtained by the dilations and translations of the mother wavelet.

2.5.1 Compact support, Vanishing moments, and Smoothness


Wavelets are localized functions that are zero outside a bounded interval. This
compact support corresponds to an FIR implementation. Another way to characterize
wavelets is by the number of coefficients and the level of iteration. If the frequency
response of the corresponding filter has p zeroes at π, the approximation order is p. In
other words, a wavelet basis with p vanishing moments can give a pth order
approximation for any signal. The smoothness of the transfer function is measured by
the number of its derivatives.

2.5.2 Orthogonal and Bi-orthogonal Wavelets

The wavelet basis forms an orthogonal basis if the basis vectors are orthogonal to
their own dilations and translations. A less stringent condition is that the vectors be
bi-orthogonal. The DWT and inverse DWT can be implemented by filter banks, each
consisting of an analysis filter and a synthesis filter. When the analysis and synthesis
filters are transposes as well as inverses of each other, the whole filter bank is orthogonal.
When they are inverses, but not necessarily transposes, the filter bank is bi-orthogonal.

2.5.3 A simple example -the Haar wavelet

One of the first wavelets was that of Haar. The Haar scaling function is shown
below.

$$w(t) = \begin{cases} 1, & 0 \le t \le 1 \\ 0, & \text{otherwise} \end{cases}$$

Applying the Haar wavelet to a sequence of values computes its sums and
differences. For example, a pair of values a, b would be replaced by s=(a+b)/2 and
d=(b-a). The values of a and b can be reconstructed as a=s-d/2 and b=s+d/2. The input
signal with $2^n$ samples is replaced with $2^{n-1}$ averages $s_0(i)$ and $2^{n-1}$
differences $d_0(i)$. The averages can be thought of as a coarser representation of the
signal and the differences as the information needed to go back to the original
resolution. The averages and differences are now computed on the coarser signal
$s_0(i)$ of length $2^{n-1}$. This gives $s_1(i)$ and $d_1(i)$ of length $2^{n-2}$ each.
This operation can be performed n times, until we run out of samples. The inverse
operation starts by computing $s_{n-2}(i)$ from $s_{n-1}(i)$ and $d_{n-1}(i)$.
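
The repeated averaging and differencing described above can be sketched as follows.
This is an illustrative sketch only: the function names are hypothetical, the signal
length is assumed to be a power of two, and floating-point averages are used rather than
the integer arithmetic discussed later.

```python
def haar_step(signal):
    """One level: pairwise averages s=(a+b)/2 and differences d=b-a."""
    s = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    d = [signal[i + 1] - signal[i] for i in range(0, len(signal), 2)]
    return s, d


def haar_inverse_step(s, d):
    """Undo one level using a=s-d/2 and b=s+d/2."""
    out = []
    for si, di in zip(s, d):
        out.append(si - di / 2)
        out.append(si + di / 2)
    return out


def haar_decompose(signal):
    """Apply haar_step to the averages until a single average remains."""
    details = []
    s = list(signal)
    while len(s) > 1:
        s, d = haar_step(s)
        details.append(d)
    return s, details


def haar_reconstruct(s, details):
    """Rebuild the signal, starting from the coarsest level."""
    for d in reversed(details):
        s = haar_inverse_step(s, d)
    return s


x = [9, 7, 3, 5, 6, 10, 2, 6]          # 2**3 samples, so three levels
coarse, details = haar_decompose(x)
print(coarse, details)
print(haar_reconstruct(coarse, details))
```

For this input the first level gives averages 8, 4, 8, 4 and differences -2, 2, 4, 4, and
the final coarse value is the overall mean of the signal.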

2.5.4 Lifting scheme


The above computation of the Haar wavelet needs intermediate storage for the
average and the difference. The computed average cannot be written back in place of a
until the difference has been computed. The lifting scheme, on the other hand, allows an
in-place computation. In the first step, we compute only the difference d=(b-a) and store
it in place of b. Next, the average is computed from a and the newly computed
difference now stored in b, as s=a+b/2, and stored in place of a. The inverse is computed
by reversing the order of the steps and flipping the signs.
This is a simple instance of lifting.
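
The two in-place steps and their inverse can be sketched as below. The function names
are hypothetical, and the sketch assumes an even number of samples stored in a Python
list that is overwritten in place.

```python
def haar_lift_inplace(x):
    """Forward Haar by lifting: even slots end up holding s, odd slots hold d."""
    for i in range(0, len(x) - 1, 2):
        x[i + 1] = x[i + 1] - x[i]   # d = b - a, stored in place of b
        x[i] = x[i] + x[i + 1] / 2   # s = a + d/2, stored in place of a


def haar_unlift_inplace(x):
    """Inverse: the same steps in reverse order with the signs flipped."""
    for i in range(0, len(x) - 1, 2):
        x[i] = x[i] - x[i + 1] / 2   # a = s - d/2
        x[i + 1] = x[i + 1] + x[i]   # b = d + a


data = [9, 7, 3, 5]
haar_lift_inplace(data)
lifted = list(data)                   # snapshot of the transformed values
haar_unlift_inplace(data)
print(lifted, data)
```

No temporary buffer is needed because each step only reads a value that has not yet been
overwritten or that has already been replaced by the quantity the next step requires.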

Figure 2.2: Lifting Scheme

A more general lifting scheme consists of three steps: split, predict and update.
The splitting stage splits the signal into two disjoint sets of samples. In the above
example, these are the even numbered samples and the odd numbered samples. Each
group contains half as many samples as the original signal. If the signal has local
correlation, consecutive samples will be highly correlated. In other words, given one set
it should be possible to predict the other. In the diagram, the even samples are used to
predict the odd samples. The detail is then the difference between the odd sample and its
prediction. In the Haar case the prediction is simple: every even value is used to predict
the next odd value. The order of the predictor in the Haar case is 1 and it eliminates
zeroth order correlation. The reverse operation is done as undo-update, undo-predict and
merge.

2.5.5 Wavelets that map Integer to Integer


We return to the Haar transform. Because of the division by 2 in the average
computation, it is not an integer transform. A simple alternative is to calculate the sum
instead of the average. Another solution, known as the S (sequential) transform, is to
round the average to an integer value. The sum and difference of two integers
are either both even or both odd, so their last bits are identical. Hence the last bit of the
sum can be omitted, which yields the rounded average, without losing information.
In the general case, though rounding adds a non-linearity to the transform, it has been
shown to be invertible.
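
A minimal sketch of the S transform on one pair of integers is given below, written in
lifting form. The function names are hypothetical, and rounding down (floor) is assumed
as the rounding rule; in Python the right shift `>> 1` performs a floor division by 2,
including for negative values.

```python
def s_transform(a, b):
    """Integer-to-integer Haar: difference and floor of the average."""
    d = b - a
    s = a + (d >> 1)          # equals floor((a + b) / 2)
    return s, d


def s_inverse(s, d):
    """Exact inverse: the same rounded term is subtracted again."""
    a = s - (d >> 1)
    b = a + d
    return a, b


s, d = s_transform(7, 4)
print(s, d)                   # rounded average and difference
print(s_inverse(s, d))        # the original pair
```

The inverse is exact because it recomputes the same rounded quantity `d >> 1` from the
stored difference, so whatever was dropped by rounding is never needed.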

2.5.6 (2,2) Bi-orthogonal Cohen Daubechies Feauveau Wavelet


The main intent of the wavelet transform is to decompose a signal f in terms of its
basis vectors $w_i$.

$$f = \sum_i a_i w_i$$

To have an efficient representation of the signal f using only a few coefficients $a_i$,
the basis functions should match the features of the signal we want to represent.
The (2,2) Cohen Daubechies Feauveau Wavelet [COHEN] is widely used for
image compression because of its good compression characteristics. The original filters
have 5+3=8 filter coefficients, whereas an implementation with the lifting scheme has
only 2+2=4 filter coefficients. The forward and reverse filters are shown in Table 2.1.
Fractional numbers are converted to integers at each stage. Though such an operation
adds non-linearity to the transform, the transform is fully invertible as long as the
rounding is deterministic.
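Forward transform

$s_i \leftarrow x_{2i}$
$d_i \leftarrow x_{2i+1}$
$d_i \leftarrow d_i - (s_i + s_{i+1})/2$
$s_i \leftarrow s_i + (d_{i-1} - d_i)/4$

Inverse transform

$s_i \leftarrow s_i - d_i/2$
$d_i \leftarrow d_i + s_i$
$x_{2i} \leftarrow s_i$
$x_{2i+1} \leftarrow d_i$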

Table 2.1: (2,2) CDF wavelet with lifting scheme
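
An integer lifting implementation of the split, predict and update steps can be sketched
as follows, under several stated assumptions. The update step adds the sum of the two
neighbouring details, which is the commonly published form of this wavelet (the table
shows a difference at that position). The inverse is obtained by undoing the forward
steps in reverse order with flipped signs, as described for lifting in Section 2.5.4.
Fractions are rounded down with shifts, samples beyond the ends are mirrored, the input
length is assumed even, and all function names are hypothetical.

```python
def cdf22_forward(x):
    """Integer (2,2) lifting on an even-length list; returns (s, d)."""
    s = x[0::2]                      # split: even samples (slicing copies)
    d = x[1::2]                      # split: odd samples
    n = len(s)
    # Predict: subtract the rounded mean of the two neighbouring even samples.
    # Past the right edge, s[i+1] is mirrored back to s[i].
    for i in range(n):
        right = s[i + 1] if i + 1 < n else s[i]
        d[i] -= (s[i] + right) >> 1
    # Update: add a quarter of the two neighbouring details (assumed sign, see above).
    # Before the left edge, d[i-1] is mirrored to d[0].
    for i in range(n):
        left = d[i - 1] if i > 0 else d[0]
        s[i] += (left + d[i]) >> 2
    return s, d


def cdf22_inverse(s, d):
    """Undo update, undo predict, then merge even and odd samples."""
    s, d = list(s), list(d)
    n = len(s)
    for i in range(n):
        left = d[i - 1] if i > 0 else d[0]
        s[i] -= (left + d[i]) >> 2
    for i in range(n):
        right = s[i + 1] if i + 1 < n else s[i]
        d[i] += (s[i] + right) >> 1
    x = [0] * (2 * n)
    x[0::2] = s
    x[1::2] = d
    return x


ramp = [10, 12, 14, 16, 18, 20, 22, 24]
s, d = cdf22_forward(ramp)
print(s, d)
print(cdf22_inverse(s, d))
```

On a linear ramp the interior details are zero, since each odd sample is predicted
exactly by the mean of its neighbours; only the mirrored right edge leaves a non-zero
detail. The round trip is exact for any integer input because every rounded term is
recomputed identically in the inverse.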

2.5.7 Boundary treatment


Real world signals are limited to a finite interval. However, filter bank algorithms
assume infinite lengths. The computation of the s and d coefficients refers to k signal
samples before and after the current sample, depending on the filter length k. Different
methods of extending the signal at the boundaries have been suggested. One widely used
scheme is the symmetric extension, which extends the finite signal by mirroring it
around its boundaries.
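
A small sketch of symmetric extension is shown below. Whether the edge sample itself is
repeated in the mirror image is a design choice; this sketch does not repeat it, which is
an assumption, and the function name and the parameter `k` (number of samples added on
each side, assumed smaller than the signal length) are illustrative.

```python
def symmetric_extend(signal, k):
    """Mirror k samples around each boundary without repeating the edge sample."""
    left = [signal[i] for i in range(k, 0, -1)]
    right = [signal[-1 - i] for i in range(1, k + 1)]
    return left + list(signal) + right


extended = symmetric_extend([1, 2, 3, 4, 5], 2)
print(extended)
```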

2.5.8 Advantages of Wavelets


Real-world signals are both time-limited (or space-limited in the case of images)
and band-limited. Time-limited signals can be efficiently represented by a basis of block
functions (Dirac delta functions for infinitesimally small blocks). But block functions are
not band-limited. Band-limited signals, on the other hand, can be efficiently represented
by a Fourier basis. But sines and cosines are not time-limited. Wavelets are localized in
both the time (space) and frequency (scale) domains. Hence it is easy to capture local
features in a signal.

Another advantage of a wavelet basis is that it supports multi resolution. Consider
the windowed Fourier transform. The effect of the window is to localize the signal being
analyzed. Because a single window is used for all frequencies, the resolution of the
analysis is same at all frequencies. To capture signal discontinuities (and spikes), one
needs shorter windows, or shorter basis functions. At the same time, to analyze low
frequency signal components, one needs longer basis functions. With wavelet based
decomposition, the window sizes vary. Thus it allows analyzing the signal at different
resolution levels.

CHAPTER 3
HARDWARE
 
3.1 FPGA Basics
3.1.1 Field Programmable Gate Array (FPGA) Architecture:
A Field Programmable Device (FPD) is a type of integrated circuit used for
implementing digital hardware, where the chip can be configured by the end user to
realize different designs. Programming of such a device often involves placing the chip
into a special programming unit, but some chips can also be configured “in-system”.
Another name for FPDs is programmable logic devices (PLDs).
FPGA: a Field-Programmable Gate Array is an FPD featuring a general
structure that allows very high logic capacity. The basic structure of FPGAs is
array-based, meaning that each chip comprises a two dimensional array of logic blocks
that can be interconnected via horizontal and vertical routing channels. FPGAs comprise
an array of uncommitted circuit elements, called logic blocks, and interconnect resources,
and FPGA configuration is performed through programming by the end user. As the only
type of FPD that supports very high logic capacity, FPGAs have been responsible for a
major shift in the way digital circuits are designed.

Figure 3.1 Structure of FPGA

Besides logic, the other key feature that characterizes an FPGA is its interconnecting
structure. The interconnect is arranged in horizontal and vertical channels.

There are two basic categories of FPGAs used today:
1. SRAM-based FPGAs.
2. Antifuse-based FPGAs.

3.1.2 Applications of FPGA:


Because of the very wide range of applications, FPGAs have gained rapid acceptance
and growth over the past decade. Some of the applications of FPGAs are
• Random logic, integrating multiple SPLDs (Simple PLDs), device controllers,
communication encoding and filtering, small to medium sized systems with SRAM
blocks, and many more.

• Other interesting applications of FPGAs are prototyping of designs to be
implemented in gate arrays, and also emulation of entire large hardware systems.
Some of the applications might be possible using a single FPGA and some of them
may require many FPGAs connected by some interconnect.

• Another promising area for FPGA application is the usage of FPGAs as custom
computing machines. This involves using the programmable parts to “execute”
software, rather than compiling the software for execution on a regular CPU.

3.2 ALTERA DE2 Board


The DE2 board is designed using the same strict design and layout practices used
in high-end volume production products such as high-density PC motherboards and car
infotainment systems, with the highest QC standard. Major design and layout
considerations are listed below:

• Layout traces and components are carefully arranged so that they are properly
aligned. This alignment increases the manufacturing yield and eases the
board debugging procedure.
• Jumper-free design for robustness. Jumpers are a common point of failure and
might cause frustration for users who don’t keep the manuals with them all the
time.

• Components' selection was made according to the volume shipped. We selected
the most common component configuration used in PC and DVD players to
ensure the continuous supply of the component resource in the future.
• Protection on Power and IOs is considered to cover most of the accidental
cases in the field.

3.2.1 Examine the Board

Figure 3.2 Altera DE2 Board

3.2.2 Features:
The DE2 board provides users many features to enable various multimedia project
developments. Component selection was made according to the most popular design in
volume production multimedia products such as DVD, VCD, and MP3 players. The DE2
platform allows users to quickly understand all the insight tricks to design real
multimedia projects for industry.
• Altera Cyclone II 2C35 FPGA with 35000 LEs
• Altera Serial Configuration devices (EPCS16) for Cyclone II 2C35
• USB Blaster built in on board for programming and user API controlling
• JTAG Mode and AS Mode are supported
• 8Mbyte (1M x 4 x 16) SDRAM
• 1Mbyte Flash Memory (upgradeable to 4Mbyte)
• SD Card Socket
• 4 Push-button switches
• 18 DPDT switches
• 9 Green User LEDs
• 18 Red User LEDs
• 50MHz Oscillator and 27MHz Oscillator for external clock sources
• 24-bit CD-Quality Audio CODEC with line-in, line-out, and microphone-in
jacks
• VGA DAC (10-bit high-speed triple DACs) with VGA out connector
• TV Decoder (NTSC/PAL) and TV in connector
• 10/100 Ethernet Controller with socket.
• USB Host/Slave Controller with USB type A and type B connectors.
• RS-232 Transceiver and 9-pin connector
• PS/2 mouse/keyboard connector
• IrDA transceiver
• Two 40-pin Expansion Headers with diode protection
• DE2 Lab CD-ROM which contains many examples with source code to exercise
the boards, including: SDRAM and Flash Controller, CD-Quality Music Player,
VGA and TV Labs, SD Card reader, RS-232/PS-2 Communication Labs,
NIOS II, and Control Panel API.

Figure 3.3 DE2 Block Diagram

3.3 NIOS II PROCESSOR
3.3.1 Introduction
The NIOS II soft core processor is a general purpose RISC processor having the
following features
• Full 32 bit instruction set, data path and address space
• 32 general purpose registers
• 32 external interrupt sources
• Single-instruction 32 × 32 multiply and divide producing a 32-bit result
• Dedicated instructions for computing 64-bit and 128-bit products of
multiplication
• Single-instruction barrel shifter
• Access to a variety of on-chip peripherals, and interfaces to off-chip memories
and peripherals
• Hardware-assisted debug module enabling processor start, stop, step and trace
under integrated development environment (IDE) control
• Software development environment based on the GNU C/C++ tool chain and
Eclipse IDE
• Instruction set architecture (ISA) compatible across all NIOS II processor
systems
A NIOS II processor system is equivalent to a microcontroller or computer on a chip that
includes a CPU and a combination of peripherals and memory on a single chip. The
NIOS II software development environment is called The NIOS II integrated
development environment (IDE).

3.3.2 NIOS II Architecture


The NIOS II architecture defines the following user-visible functional units.
• Register file
• Arithmetic logic unit
• Interface to custom instruction logic
• Exception controller
• Interrupt controller
• Instruction bus
• Data bus

• Instruction and data cache memories
• Tightly coupled memory interfaces for instructions and data
• JTAG debug module
These functional units form the foundation of the NIOS II instruction set. Each
unit can be implemented in hardware, emulated in software, or omitted entirely.
The NIOS II architecture supports the following addressing schemes: register
addressing, displacement addressing, immediate addressing, register indirect addressing,
and absolute addressing. In register addressing, all operands are registers, and results are
stored back to a register. In displacement addressing, the address is calculated as the sum
of a register and a signed, 16-bit immediate value. In immediate addressing, the operand
is a constant within the instruction itself. Register indirect addressing uses displacement
addressing, but the displacement is the constant 0. Absolute addressing, with a limited
range, is obtained by using displacement addressing with register r0, whose value is
always 0.
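
The address arithmetic behind these schemes can be illustrated with a short Python sketch. It is only a behavioural model, not NIOS II code: the register contents and the helper names (sign_extend16, effective_address) are assumptions made for the example, and the absolute case is modelled as a displacement from r0, which always reads as 0 on NIOS II.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(regs, base_reg, imm16):
    """Displacement addressing: base register plus signed 16-bit immediate,
    wrapped to the 32-bit address space."""
    return (regs[base_reg] + sign_extend16(imm16)) & 0xFFFFFFFF


regs = [0] * 32        # 32 general purpose registers; r0 stays 0
regs[4] = 0x00001000   # hypothetical base pointer held in r4

displacement = effective_address(regs, 4, 0xFFFC)   # r4 + (-4)
register_indirect = effective_address(regs, 4, 0)   # displacement fixed at 0
absolute = effective_address(regs, 0, 0x0200)       # base register is r0
```

Because the immediate is sign-extended, the encoded value 0xFFFC acts as -4, so one instruction format covers both forward and backward offsets.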

3.3.3 NIOS II Hardware Development


Building embedded systems in FPGAs is a broad subject, involving system
requirements analysis, hardware design tasks, and software design tasks. The building of
NIOS II hardware and creating a software program to run in NIOS II system is described
below.
The NIOS II system is for control applications, which can monitor input stimuli and
respond by turning on or off output signals. This NIOS II system can also communicate
with a host computer, allowing the host computer to control logic inside the FPGA.

The example NIOS II system contains the following:


• NIOS II/s processor core
• On chip memory
• Timer
• JTAG UART
• 8 bit parallel I/O pins to control LEDs
• System identification peripheral

Figure 3.4 Example design of NIOS

The design can be implemented on any NIOS II development board. To build a
NIOS II system the board should meet the following requirements:
• The board must have an Altera FPGA.
• An oscillator must drive a constant clock frequency to an FPGA pin. The maximum
frequency limit depends on the speed grade of the FPGA. Frequencies of 50 MHz or
less are generally safe.
• The board must have a 10-pin header connected to the dedicated JTAG pins on the
FPGA to provide a communication link to the NIOS II system.
• FPGA I/O pins can optionally connect to 8 (or fewer) LEDs to provide a visual
indicator of processor activity.
• JTAG cable or USB Blaster download cable.

3.3.4 System Development Flow


The NIOS II system development flow consists of three steps: hardware design
steps, software design steps, and system design steps. The development flow starts with a
pre-design activity which includes an analysis of the application requirements and
determines the system requirements.

Figure 3.5 NIOS II development flow

3.3.5 Defining and generating the system in SOPC builder
After analyzing the system hardware requirements, use the SOPC Builder tool which
is included in the Altera Quartus II software. Using SOPC Builder specify the NIOS II
processor core(s), memory, and other peripherals the system requires. SOPC Builder
automatically generates the interconnect logic to integrate the components in the
hardware system.
Select from a list of standard processor cores and peripherals provided with the
NIOS II development tools. We can also add our own custom hardware to accelerate
system performance. We can add custom instruction logic to the NIOS II core which
accelerates CPU performance, or we can add a custom peripheral which offloads tasks
from the CPU.
The primary outputs of the SOPC builder are
1. SOPC builder system file (.ptf) - This file stores the hardware contents of the
system. The NIOS II IDE requires the .ptf file to compile software for the target
hardware.
2. Hardware description files (HDL) - These are the hardware design files that
describe the SOPC builder system. The Quartus II software uses these HDL files to
compile the overall design.
Using Quartus II software we can assign pin locations for the I/O signals,
specify timing requirements and other design constraints. Finally the Quartus II project is
compiled to get the FPGA configuration file (.sof).
Download the FPGA configuration file to the target FPGA board using an Altera
download cable such as the USB Blaster. After configuration, the FPGA behaves as the
hardware specified, which in this case is the NIOS II processor.

3.3.6 NIOS II software development tasks
Using the NIOS II IDE we can perform all the software development tasks for the NIOS
II processor system built. After the generation of the system in SOPC builder, we can
design C/C++ application code with the NIOS II IDE. In addition to the application code
we can design and reuse the custom libraries available in our NIOS II IDE projects.

Figure 3.6 NIOS II IDE workbench
E workbencch
We can simulate the application program using the Instruction set simulator (ISS).
This simulation does not require the target board. The ISS simulates the processor, memory,
and stdin/stdout/stderr streams, which allows verifying the program flow and algorithm
correctness.

After the configuration of Altera FPGA board with NIOS II system, we can
download the software using a download cable.

The NIOS II IDE produces several outputs as listed below


1. System.h file: It defines symbols for referencing hardware in the system. It is created
automatically when a project in NIOS II IDE is created.

2. Software executable file (.elf): This file is a result of compilation of C/C++
application program which can be downloaded directly to the NIOS II processor

3. Memory initialization files (.hex): These are the initialization files for the on-chip
memories that support initialization content.

4. Flash programming data: The IDE includes a flash programmer, which allows the
program to be written to flash memory. The flash programmer adds appropriate boot
code to allow the program to boot from flash memory.

Running and Debugging the Code


The NIOS II IDE provides complete facilities for downloading software to a target
board, and running or debugging the program on hardware. The IDE debugger allows us
to start and stop the processor, step through code, set breakpoints, and analyze variables
as the program executes.

3.3.7 Custom Instructions


With custom instructions we can reduce a complex sequence of standard instructions to a
single instruction implemented in hardware. This feature can be used for a variety of
applications, e.g., to optimize software inner loops for digital signal processing (DSP),
and computation-intensive applications. The NIOS II CPU configuration wizard, which
is accessed via the Quartus II software’s SOPC Builder, provides a graphical user
interface (GUI) used to add up to 256 custom instructions to the NIOS II processor.
The custom instruction logic connects directly to the NIOS II processor ALU
logic as shown below

Figure 3.7 Custom instruction logic in NIOS

NIOS II processor custom instructions are custom logic blocks adjacent to the
ALU in the CPU data path. This gives the designers the ability to tailor the NIOS II
processor to meet the needs of a particular application.

The basic operation of NIOS II custom instruction logic is to receive input on the
dataa[31..0] and/or datab[31..0] ports and drive out the result on its result[31..0] port. The
designer generates the custom instruction logic that produces the results. For each
custom instruction, the NIOS II integrated development environment (IDE) produces a
macro that is defined in the system header file. The macro can be called from C or C++
application code as a normal function call. There are different custom instruction
architectures available to suit the application's requirements. The architectures range
from a simple, single cycle combinatorial architecture to an extended variable-length,
multicycle custom instruction architecture.

3.3.8 Summary of Development of NIOS II System
1) Analyze system requirements: Based on the application the components of the
system should be decided.
2) Start the Quartus II software and open a project: Open a new Quartus II project. This
project serves as an easy starting point of the NIOS development flow.
3) Start a new SOPC builder system: SOPC Builder is used to generate the NIOS II
processor system. Add the desired peripherals, and configure how they connect
together. The following steps are to be performed to create a system in SOPC
builder.
• Choose SOPC Builder (Tools menu) in the Quartus II software. SOPC Builder
starts and displays the Create New System dialog box.
• Enter the system name
• Select Verilog or VHDL as the target HDL
4) Define the system in SOPC builder:
Define the hardware characteristics of the NIOS II system such as the NIOS II core to
use and what peripherals to include in the system. SOPC builder does not define the
software behavior. The following steps are to be followed to define a system.
• Specify the target FPGA and clock settings.
• Add the NIOS II soft core, on-chip memory, and other peripherals.
• Specify base address and interrupt request (IRQ) priorities.
• Specify more NIOS II settings
• Generate the SOPC builder system

Figure 3.8 Components of the SOPC system

5) Integrate SOPC builder into Quartus II project:
The following steps are to be performed to complete the hardware design.
• Instantiate the SOPC Builder system module in the Quartus II project.
• Assign FPGA pins.
• Compile the Quartus II project
• Verify timing
6) Download hardware design to target FPGA:
• Connect the board to the host computer with the download cable, and apply
power to the board.
• Choose Programmer (Tools menu) in the Quartus II software. The Programmer
window appears and automatically displays the appropriate configuration file
(nios2_quartus2_project.sof).
• Click Hardware Setup in the top-left corner of the Programmer window to verify
your download cable settings. The Hardware Setup dialog box appears.
• Select the appropriate download cable in the currently selected hardware list.
• Turn on Program/ configure.
7) Develop software using NIOS II IDE:
• Create a new C/C++ application project.
• Compile the project
8) Run the program: We can run the program on target hardware or on the NIOS II
Instruction set simulator.

3.3.9 SOPC Builder


SOPC Builder is a powerful system development tool for creating systems based
on processors, peripherals, and memories. SOPC Builder makes it possible to define and
generate a complete system-on-a-programmable-chip (SOPC) in much less time than using
traditional, manual integration methods. SOPC Builder automates the task of integrating
hardware components into a larger system. Using traditional system-on-chip (SOC)
design methods, top-level HDL files are manually written that join together the pieces of
the system. Using SOPC Builder, the system components are specified in a graphical user
interface (GUI), and SOPC Builder generates the interconnect logic automatically. SOPC
Builder outputs HDL files that define all components of the system, and a top-level HDL
design file that connects all the components together. SOPC Builder generates both
Verilog HDL and VHDL equally, and does not favor one over the other.

3.3.9.1 Architecture of SOPC Builder
An SOPC Builder component is a design module that SOPC Builder recognizes
and can automatically integrate into a system. SOPC Builder connects multiple
components together to create a top-level HDL file called the system module. SOPC
Builder generates Avalon switch fabric that contains logic to manage the connectivity of
all components in the system.

Figure 3.9 A multi-master system module with Avalon switch fabric

3.3.9.2 SOPC builder components
SOPC Builder components are the building blocks of the system module. SOPC
Builder components use the Avalon interface for the physical connection of components,
and SOPC Builder can be used to connect any logical device (either on-chip or off-chip)
that has an Avalon interface. The Avalon interface uses an address-mapped read/write
protocol that allows master components to read and/or write any slave component.
Altera provides ready-to-use SOPC Builder components, which include:
• Microprocessors, such as the NIOS II processor
• Microcontroller peripherals
• Timers
• Serial communication interfaces, such as a UART and a serial peripheral
interface (SPI)
• General purpose I/O
• Digital signal processing (DSP) functions
• Communications peripherals
• Interfaces to off-chip devices
− Memory controllers
− Buses and bridges
− Application-specific standard products (ASSP)
− Application-specific integrated circuits (ASIC)
− Processors
The purpose of SOPC Builder is to abstract away the complexity of interconnect logic,
allowing designers to focus on the details of their custom components and the high-level
system architecture.
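
The address-mapped read/write protocol mentioned above can be pictured with a small Python sketch of address decoding. It is a simplified model of what the interconnect does when a master issues an address, not generated SOPC Builder logic; the slave names, base addresses and spans in ADDRESS_MAP are hypothetical values chosen only for illustration.

```python
# Hypothetical address map: (slave name, base address, span in bytes).
ADDRESS_MAP = [
    ("onchip_memory", 0x00000000, 0x5000),
    ("timer",         0x00008000, 0x20),
    ("jtag_uart",     0x00008020, 0x8),
    ("led_pio",       0x00008030, 0x10),
]


def decode(address):
    """Return (slave name, offset inside the slave) for a master address."""
    for name, base, span in ADDRESS_MAP:
        if base <= address < base + span:
            return name, address - base
    raise ValueError("address 0x%08X is not mapped to any slave" % address)
```

For example, decode(0x8004) selects the timer at offset 4. Each slave only ever sees the offset, which is why base addresses can be reassigned in SOPC Builder without changing the component itself.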

3.3.9.3 On-Chip RAM and ROM


FPGAs include on-chip memory blocks that can be used as RAM or ROM in SOPC
Builder systems. On-chip memory has the following benefits for SOPC Builder systems:
• On-chip memory has fast access time, compared to off-chip memory.
• SOPC Builder automatically instantiates on-chip memory inside the system
module, so you do not have to make any manual connections.
• Certain memory blocks can have initialized contents when the FPGA powers up.
This feature is useful, for example, for storing data constants or processor boot
code.
FPGAs have limited on-chip memory resources, which limits the maximum practical
size of an on-chip memory to approximately one megabyte in the largest FPGA family.

3.3.9.4 SDRAM
Altera provides a free SDRAM controller core, which uses inexpensive SDRAM
as bulk RAM in FPGA designs. The SDRAM controller core is necessary, because
Avalon signals cannot describe the complex interface on an SDRAM device. The
SDRAM controller acts as a bridge between the Avalon switch fabric and the pins on an
SDRAM device. The SDRAM controller can operate in excess of 100 MHz. The choice
of SDRAM device(s) and the configuration of the device(s) on the board heavily
influence the component-level design for the SDRAM controller. Typically, the
component-level design task involves parameterizing the SDRAM controller core to
match the SDRAM device(s) on the board.

3.3.9.5 Off-Chip SRAM & FLASH Memory


SOPC Builder systems can directly access many off-chip RAM and ROM
devices, without a controller core to drive the off-chip memory. Avalon signals can
exactly describe the interfaces on many standard memories, such as SRAM and flash
memory. In this case, I/O signals on the SOPC Builder system module can connect
directly to the memory device. These components make it easy to create memory
systems targeting Altera development boards. However, these components target only
the specific memory device on the board; they do not work for different devices. For
general memory devices, RAM or ROM, a custom interface can be created to the device
with the SOPC Builder component editor. Using the component editor, the I/O pins on
the memory device and the timing requirements of the pins are defined.

3.3.9.6 Avalon Tristate Bridge


A Tristate bridge connects off-chip devices to on-chip Avalon switch fabric. The
tristate bridge creates I/O signals on the SOPC Builder system module, which must be
connected to FPGA pins in the top-level Quartus II project. These pins represent the
Avalon switch fabric to off-chip devices. The Tristate Bridge creates address and data
pins which can be shared by multiple off-chip devices. This feature allows conservation
of FPGA pins when connecting the FPGA to multiple devices with mutually exclusive
access.

A tristate bridge may be used in either of the following cases:


• The off-chip device has bidirectional data pins.
• Multiple off-chip devices share the address and/or data buses.

The Avalon Tristate Bridge automatically adds registers to output signals from the
Tristate Bridge to off-chip devices. Registering the input and output signals shortens the
register-to-register delay from the memory device to the FPGA, resulting in higher
system fMAX performance. However, in each direction, the registers add one additional
cycle of latency for Avalon master ports accessing memory connected to the Tristate
Bridge. The registers do not affect the timing of the transfers from the perspective of the
memory device.
The Avalon interface for the PWM component requires a single slave port using a small
set of Avalon signals to handle simple read and write transfers to the registers.
The component's Avalon slave port has the following characteristics:
• It is synchronous to the Avalon slave port clock.
• It is readable and writeable.
• It has zero wait states for reading and writing, because the registers are able to
respond to transfers within one clock cycle.
• It has no setup or hold restrictions for reading and writing.
• Read latency is not required, because all transfers can complete in one clock
cycle. Read latency would not improve performance.
• It uses native address alignment, because the slave port is connected to registers
rather than a memory device.

3.3.9.7 Flash Memory


In SOPC Builder, instantiate an interface to Common Flash Interface (CFI) flash
memory by adding a Flash Memory component. If the flash memory is not CFI
compliant, create a custom interface to the device with the SOPC Builder component
editor. The choice of flash device(s) and the configuration of the device(s) on the board
heavily influence the component-level design for the flash memory configuration wizard.
Typically, the component-level design task involves parameterizing the flash memory
interface to match the device(s) on the board. Using the Flash Memory (Common Flash
Interface) configuration wizard, specify the structure (address width and data width) and
the timing specifications of the device(s) on the board.

3.3.9.8 SRAM
The choice of RAM device(s) and the configuration of the device(s) on the board
determine how the interface component is created. The component-level design task
involves entering parameters into the component editor to match the device(s) on the
board.

3.3.9.9 UART
A UART (universal asynchronous receiver/transmitter) is responsible for
performing the main task in serial communications. The device changes incoming
parallel information to serial data which can be sent on a communication line. A
second UART can be used to receive the information. The UART performs all the tasks,
timing, parity checking, etc. needed for the communication. The only extra devices
attached are line driver chips capable of transforming the TTL level signals to line
voltages and vice versa. To use the UART in different environments, registers are
accessible to set or review the communication parameters. Eight I/O bytes are used for
each UART to access its registers. Settable parameters are for example the
communication speed, the type of parity check, and the way incoming information is
signaled to the running software.
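
The parallel-to-serial conversion and parity check described above can be sketched in Python as follows. The frame layout used here (one start bit, eight data bits sent least significant bit first, one parity bit, one stop bit) is a common UART configuration and is an assumption of this example, as are the helper names uart_frame and parity_ok; the actual format depends on how the UART registers are programmed.

```python
def _parity_bit(data_bits, parity):
    """Parity bit that makes the count of ones even ("even") or odd ("odd")."""
    ones = sum(data_bits) % 2
    return ones if parity == "even" else 1 - ones


def uart_frame(byte, parity="even"):
    """Serialize one byte: start bit (0), 8 data bits LSB first, parity, stop bit (1)."""
    data = [(byte >> i) & 1 for i in range(8)]
    return [0] + data + [_parity_bit(data, parity)] + [1]


def parity_ok(frame, parity="even"):
    """Receiver-side check of the parity bit in an 11-bit frame."""
    data, received = frame[1:9], frame[9]
    return received == _parity_bit(data, parity)
```

A single flipped data bit changes the number of ones by one, so parity_ok returns False for that frame; two flipped bits would go undetected, which is the usual limitation of a single parity bit.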
 

CHAPTER 4
DESIGN AND IMPLEMENTATION

This chapter explains aspects of design and implementation of the encoder and decoder.

Compression: Original Image -> Discrete Wavelet Transform -> Quantization ->
Run-Length Coding -> Huffman Coding -> Compressed Image

Decompression: Compressed Image -> Huffman Decoding -> Run-Length Decoding ->
Inverse Quantization -> Inverse Discrete Wavelet Transform -> Reconstructed Image

Figure 4.1 Flow Chart showing the sequence of steps in Compression & Decompression

The above flow charts give the sequence of routines executed in Image
Compression and Decompression

4.1 Design parameters and constraints
4.1.1 Memory read/write
The input image to the encoder is raw gray scale frames of 512 by 512 pixels.
Each pixel is represented by 256 gray scale levels (8 bits). The input frame is loaded to the
embedded memory by the host computer and results are read back once all the PEs have
processed it. The top level VHDL module of each stage is referred to as a Processing
Element (PE). The PE also uses the embedded memory as intermediate storage to hold
results between different stages of processing.

The memory has a read latency of 2 cycles while memory writes are completed in
the same cycle. Memory reads can be pipelined so that the effect of this latency is
minimized. However, a clock cycle is wasted when there is a read to write turn around.
The design concerns are to minimize memory read/write turn arounds and to allow
longer spells of read or write cycles instead. Attempts have also been made to minimize
memory operations.
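
The benefit of longer read or write spells can be seen with a toy cost model in Python. The model is an assumption made for illustration: it charges one cycle per access plus one extra cycle for every read-to-write turn around, and it ignores the 2-cycle read latency on the grounds that pipelined reads hide it. The function name access_cycles is hypothetical.

```python
def access_cycles(pattern, turnaround_penalty=1):
    """Cycles for a string of 'R'/'W' accesses: one per access plus a
    penalty for each read immediately followed by a write."""
    cycles = len(pattern)
    for prev, cur in zip(pattern, pattern[1:]):
        if prev == "R" and cur == "W":
            cycles += turnaround_penalty
    return cycles


alternating = access_cycles("RW" * 4)        # R W R W R W R W
burst = access_cycles("RRRR" + "WWWW")       # four reads, then four writes
```

Under these assumptions the alternating pattern costs 12 cycles and the burst pattern 9 cycles for the same eight accesses, which is the motivation for grouping reads and writes.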

4.1.2 Design partitioning
The whole computation is partitioned into four stages. The first stage computes
the discrete wavelet transform coefficients of the input image frame and writes them back to
the embedded memory. The second stage operates on this result and does dynamic
quantization, zero thresholding, run length encoding for zeroes, and entropy encoding on
the coefficients. The third stage does the entropy decoding, run length decoding of zeros
and dequantization. The fourth stage computes the pixel data from the wavelet coefficients.

Figure 4.2 Stage wise Ordering of Routines

4.2 Stage 1: Discrete Wavelet Transform
The discrete wavelet transform is implemented by filter banks. The filter used is the
(2,2) Cohen-Daubechies-Feauveau wavelet filter. Though much longer filters are common
for audio data, relatively short filters are used for video.

4.2.1 (2, 2) wavelet


A modified form of the Bi-orthogonal (2,2) Cohen-Daubechies-Feauveau wavelet
filter is used. The analysis filter equations are shown below.

High pass coefficients: g(k) = 2x(2k+1) - x(2k) - x(2k+2)


Low pass coefficients: f(k) = x(2k) + [g(k-1) + g(k)] /8

The boundary conditions are handled by symmetric extension of the coefficients
as shown below:
x[2], x[1], [ x[0], x[1],….., x[n-1], x[n] ], x[n-1], x[n-2]

The synthesis filter equations are shown below.


Even samples: x(2k) = f(k) - [ g(k-1) + g(k) ]/8
Odd samples: x(2k+1) = [ g(k) + x(2k) + x(2k+2) ]/2
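
A minimal Python sketch of one level of this filter on a single row is given below. It uses the analysis equations as stated above; the inverse is obtained by solving those two equations for the samples, recovering the even samples first and then the odd samples from them. Floating-point arithmetic and the function names (reflect, forward_cdf22, inverse_cdf22) are assumptions of the sketch; the hardware data path would use fixed-point values.

```python
def reflect(i, n):
    """Whole-sample symmetric extension: map index i into the range 0..n-1."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def forward_cdf22(x):
    """One analysis level on a row of even length. Returns (f, g)."""
    n = len(x)
    half = n // 2
    # High pass: g(k) = 2x(2k+1) - x(2k) - x(2k+2)
    g = [2 * x[2 * k + 1] - x[2 * k] - x[reflect(2 * k + 2, n)]
         for k in range(half)]
    # Low pass: f(k) = x(2k) + [g(k-1) + g(k)]/8, where g(-1) equals g(0)
    # under the symmetric extension of x.
    f = [x[2 * k] + (g[max(k - 1, 0)] + g[k]) / 8 for k in range(half)]
    return f, g


def inverse_cdf22(f, g):
    """Undo forward_cdf22: even samples first, then odd samples."""
    half = len(f)
    n = 2 * half
    x = [0.0] * n
    for k in range(half):
        x[2 * k] = f[k] - (g[max(k - 1, 0)] + g[k]) / 8
    for k in range(half):
        x[2 * k + 1] = (g[k] + x[2 * k] + x[reflect(2 * k + 2, n)]) / 2
    return x
```

On a linear ramp such as [1, 2, 3, 4, 5, 6, 7, 8] the interior high pass coefficients are zero, which illustrates why smooth image regions compress well; only the last coefficient, which touches the mirrored boundary, is non-zero.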

4.2.2 DWT in X and Y directions


Each pixel in the input frame is represented by 16 bits, accounting for 2 pixels
per memory word. Thus, each memory read brings in two consecutive pixels of a row.
Each clock cycle generates one value each of f and g coefficients. These have to be
written back in place. The f coefficients are used again in the next stage of wave-letting.
Two consecutive values of f are written back in one memory location (figure 4.3). This
saves on memory reads of the f coefficients in the next stage. In the next stage, where
only the fs are processed, only alternate memory words are read from. Thus, the f and g
coefficients are written back in an interleaved fashion. Another way to write back the
coefficients is to put all the low frequency coefficients (f) ahead of the high frequency
coefficients (g). This scheme of ordering the coefficients is called Mallat ordering. It
allows progressive image transmission/reconstruction. The bulk of the 'average'
information is ahead, followed by the minor 'difference' information. However, this
ordering scheme requires temporary storage to hold the computed coefficients until they
can be written back. In our design, we use the in-place ordering scheme described above
which is optimized for memory read/write operation. Once the three stages of
wave-letting are done, we revert to Mallat ordering.
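
The two orderings can be compared with a short Python sketch for one row. It assumes that each memory word holds two coefficients and that words holding f pairs alternate with words holding g pairs, starting with an f word; that starting order and the function names are assumptions of the example.

```python
def interleaved_words(f, g):
    """Pack coefficients two per word, alternating f words and g words."""
    words = []
    for i in range(0, len(f), 2):
        words.append((f[i], f[i + 1]))
        words.append((g[i], g[i + 1]))
    return words


def mallat_order(words):
    """Regroup interleaved words so that all f values precede all g values."""
    fs = [value for word in words[0::2] for value in word]
    gs = [value for word in words[1::2] for value in word]
    return fs + gs


words = interleaved_words([1, 2, 3, 4], [10, 20, 30, 40])
row = mallat_order(words)
```

Here words is [(1, 2), (10, 20), (3, 4), (30, 40)] and row is [1, 2, 3, 4, 10, 20, 30, 40]. In the interleaved layout the next stage reads only every other word (words[0::2]), which is the saving in memory reads described above.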

Figure 4.3 Coefficient ordering along X direction

Once the filter has been applied along all rows in a stage, the same filter is applied along
the columns. With the aforementioned interleaved ordering scheme, alternate columns
are all fs or all gs. Unlike the row traversal, the two values obtained in a memory read on
a column traversal are not consecutive values of the same column. Rather, they are
corresponding values from two different vertically parallel streams (figure 4.4).

Figure 4.4 Coefficient ordering along Y direction

These differences along the row and column computations are accounted for by
having two separate data flow blocks along the two directions. The data flow block in X
direction (ForwardWaveletX) accepts two successive values of the same row and outputs
either two consecutive fs or two consecutive gs, in alternate fashion. The data flow block
in Y direction (ForwardWaveletY) accepts one value each from two parallel streams and
outputs either the fs for the two streams or the gs in an alternate manner (figure 4.5).
These blocks also need information on when a row/column starts/ends to handle the
boundary conditions. They also have a pipeline latency of 3 cycles.

Figure 4.5 Forward Wavelet transform data flow blocks

Figure 4.6 RTL view of Forward Wavelet X

   

Figure 4.7 RTL view of Forward Wavelet Y

4.2.3 3 stages of wave-letting


The 512 by 512 pixel input image frame is processed with three stages of
wave-letting. In the first stage, 512 pixels of each row are used to compute 256 high pass
coefficients (g) and 256 low pass coefficients (f) (figure 4.8). The coefficients are written
back in place of the original row.

Figure 4.8 High pass and Low pass coefficients at stage 1, X direction

Once all the 512 rows are processed, the filters are applied in the Y direction.
This completes the first stage of wave-letting. While the conventional Mallat ordering
scheme aggregates coefficients into the 4 quadrants, our ordering scheme interleaves the
coefficients in the memory. The second stage of wave-letting only processes the low
frequency coefficients from the first stage. This corresponds to the upper left hand
quadrant in the Mallat scheme. Thus, the second stage operates on rows and columns of
length 256, while the third stage operates on rows and columns of length 128. The
aggregation of coefficients along the 3 stages under Mallat ordering is shown in figure
4.9. The memory map with the interleaved ordering is shown in figure 4.10.

Figure 4.9 Mallat ordering along the 3 stages of wave-letting

4.2.4 Overall architecture of Stage 1


Stage one starts with a raw frame and does three stages of wave-letting. The
overall architecture is shown in figure 4.11. Memory addressing is done with a pair of
address registers: the read and write address registers. The difference between write and
read registers is the latency of the pipelined data-flow blocks. The maximum and minimum
coefficient values for each block (each quadrant in the multi stage wave-letting) are
maintained on the FPGA. These values are written back to a known location in the lower
half (lower 0.5MB) of the embedded memory. The second stage uses these values for the
dynamic quantization of the coefficients.

Figure 4.10 Interleaved ordering along the 3 stages of wave-letting

Figure 4.11 Stage1 Architecture

4.3 Stage 2
Stage 2 does the rest of the processing on the wavelet coefficients computed in
the first stage. The coefficients are quantized and zero-thresholded, runs of zeroes are
run length encoded, and the result is entropy encoded to get the final compressed image.

4.3.1 Dynamic quantization
The coefficients from different sub-bands (different quadrants with the Mallat
ordering scheme) are quantized separately. The dynamic range of the coefficients for
each sub-band (computed in first stage) is divided into 16 quantization levels. The
coefficients are quantized into one of the 16 possible levels. The maximum and
minimum value of the coefficients for each sub-band is also needed while decoding the
image.

[Block diagram: Dynamic Quantizer. Inputs: Enable, Clock, Maximum (16 bits), Minimum
(16 bits), Coefficients (16 bits). Output: Quantized output (4 bits).]

Figure 4.12 Dynamic Quantizer

The dynamic quantizer is implemented as a binary search tree look-up in hardware
(figure 4.12). A table look-up based quantization scheme is not feasible since
the range is dynamic, different for each sub-band, and different for each frame. The
incoming stream of coefficients in the range [min, max] is translated to [0, max-min] by
adding (or subtracting) the minimum. The shifted incoming value is then compared with
half the dynamic range (r/2, where r = max-min) to determine whether it lies in the lower
eight or upper eight quantization levels. The result forms the first bit (most significant
bit) of the quantizer output. Depending on the outcome, the value is then compared with
(r/2)+(r/4) or (r/2)-(r/4). This forms the second bit of the quantized output. The next
two comparisons provide the remaining bits. The quantizer is a pipelined design with 4
stages.
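
The four successive comparisons can be sketched in software as follows. This is a minimal behavioural model in Python, not the VHDL of the design. The function name `quantize`, the use of `>=` at each comparison, and the scaling of both sides of the comparison by 16 (to avoid fractional thresholds) are assumptions made for illustration.

```python
def quantize(coeff, cmin, cmax):
    """Map coeff in [cmin, cmax] to a 4-bit level using four comparisons."""
    r = cmax - cmin          # dynamic range of the sub-band
    shifted = coeff - cmin   # translate [cmin, cmax] to [0, r]
    threshold = 8            # thresholds are held in sixteenths of r (8/16 = r/2)
    step = 4                 # first adjustment is r/4
    level = 0
    for _ in range(4):       # one comparison per pipeline stage
        # Assumed comparison: "greater than or equal" selects the upper half.
        bit = 1 if 16 * shifted >= threshold * r else 0
        level = (level << 1) | bit            # most significant bit comes first
        threshold += step if bit else -step   # e.g. (r/2)+(r/4) or (r/2)-(r/4)
        step //= 2
    return level
```

With these assumptions, a coefficient exactly at the middle of the range, `quantize(80, 0, 160)`, falls into level 8, and the maximum of the range falls into level 15.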

Figure 4.13 RTL view of Dynamic Quantizer

4.3.2 Zero thresholding and RLE on zeroes
Regions with abrupt changes will have larger wavelet coefficients, while regions
of little or no change will have smaller coefficients. Coefficients of small magnitude
can be neglected without considerable distortion to the image. The error introduced is
proportional to the magnitude of the coefficient being neglected. Coefficients are
truncated to zero, based on a threshold. Different thresholds are used for different
sub-bands, resulting in different resolution in different sub-bands. Further, different
sets of thresholds are used to achieve different levels of compression. Three different
sets of thresholds are used for each sub-band to get three different variants of the
encoder with different compression levels. The corresponding levels for the three
configurations of the encoder are shown in the appendix.

After the zero thresholding, a large number of coefficients are truncated to zero.
Long sequences of zeroes can be effectively compressed by run length encoding, which
replaces a continuous spell of zeroes with a single count indicating the length of the
spell. To decode a run length encoded stream, this count has to be distinguishable from
other characters of the input data set. The other valid character is the 4 bit output
from the quantizer. Sixteen numbers, 0 to 15, are reserved for the quantizer output
values, while numbers 16 to 255 (240 numbers) are free. Thus, any continuous spell of
zeroes ranging from 1 (represented by the number 16) to 240 (represented by the number
255) can be replaced by the corresponding count. Longer spells have to be broken down to
fall within this range. Table 4.1 shows the bit range allocation. The run length encoder
might not have an output on every cycle. The
succeeding block has to be signaled as to when to read the RLE count, and when to wait
for a spell to finish. Whenever RLE detects a zero, it asserts 'RLErunning' and starts
counting the sequence of continuous zeroes. The current count of zeroes is always
available on 'RLEout'. When the continuous spell of zeroes ends, 'RLErunning' is
deasserted, and 'RLEspellEnd' is asserted for one cycle to allow the next block to read
off the RLE count. The RLE counter is then reset to 15, so that the first zero of the
next spell produces the count 16.

In this set-up, there is a look-ahead problem. Before RLE can signal the end of a
spell, it needs to see the next value in the stream. But RLE is used in conjunction with
the dynamic quantizer (RLE and the quantizer are connected in parallel), which is a
4-stage pipeline.
RLE might face an arbitrarily long sequence of zeroes, but it can count only up to
a maximum of 240 zeroes. Thus, when RLE has seen 240 continuous zeroes and still
more zeroes are arriving, 'RLEspellEnd' is asserted for one clock cycle, and the
internal counter is reset to 15. Here, 'RLErunning' stays high throughout the spell.

The logic followed by the succeeding block is as follows. If 'RLErunning' is
asserted, then wait till 'RLEspellEnd' is asserted and read 'RLEout'. Else, read the
output of the dynamic quantizer.
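
The byte stream produced by the quantizer and the run length encoder working together can be modelled as below. This is a hedged software sketch, not the hardware: the names `rle_encode`, `RLE_BASE` and `MAX_SPELL` are hypothetical, the zero test `abs(c) < threshold` is an assumption about how the threshold is applied, and the final flush of a pending spell stands in for the Flush input of the block.

```python
RLE_BASE = 15    # counter reset value; the first zero of a spell gives 16
MAX_SPELL = 240  # longest spell that fits in the codes 16..255

def rle_encode(coeffs, quantized, threshold):
    """Merge 4-bit quantizer outputs (0..15) with zero-run counts (16..255)."""
    out = []
    count = RLE_BASE
    for c, q in zip(coeffs, quantized):
        if abs(c) < threshold:               # assumed zero-threshold test
            count += 1
            if count == RLE_BASE + MAX_SPELL:  # 240 zeroes seen: emit 255
                out.append(count)
                count = RLE_BASE
        else:
            if count > RLE_BASE:             # a spell has just ended
                out.append(count)
                count = RLE_BASE
            out.append(q)                    # pass the quantizer output through
    if count > RLE_BASE:                     # flush a spell still in progress
        out.append(count)
    return out
```

For example, three thresholded coefficients between two significant ones are replaced by the single code 18, and a spell of 250 zeroes is split into the codes 255 (240 zeroes) and 25 (10 zeroes).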

[Block diagram: Run Length Encoder. Inputs: Enable, Flush, Clock, Input (16 bits), Zero
Threshold (16 bits). Outputs: RLE out (8 bits), RLE running, RLE Spell end.]

Figure 4.14 Run Length Encoder for continuous zeroes

Table 4.1 Bit range allocation for RLE

Figure 4.15 RTL view of Run-Length Encoder

4.3.3 Entropy encoding
Entropy encoding involves assigning a shorter encoding to more frequently used
characters in the data set and a longer encoding to infrequently used characters. This
involves variable length encoding of the input data. To efficiently retrieve the original
data, an encoded word should not be a proper prefix of any other encoded word. Huffman
trees are an efficient way of coming up with a variable length encoding for a set of
characters, given the relative frequencies. Further, for a Huffman tree based encoding,
decoding can be done in linear time (linear in the length of the encoded word). Various
other encoding schemes, using different levels of context sensitive information, exist.
These might incur a costlier decoding function.
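
The prefix condition stated above can be checked mechanically for any candidate code table. The following Python sketch is illustrative only: the function name `is_prefix_free` is hypothetical, code words are represented as strings of '0' and '1', and the example tables in the usage note are made up and are not the tables used in this design.

```python
def is_prefix_free(codewords):
    """Return True if no code word is a prefix of another code word."""
    words = sorted(codewords)
    # After lexicographic sorting, a word that is a prefix of any other word
    # is also a prefix of its immediate successor, so checking neighbours
    # is sufficient. Duplicate words are rejected as well.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))
```

For instance, the table {"0", "10", "110", "111"} satisfies the condition, while {"0", "01", "11"} does not, because "0" is a prefix of "01".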

Encoding scheme
In this implementation, I use an encoding scheme which is not a Huffman tree
based code. The bit allocation is shown in figure 4.16. Eight bit inputs are variable
length encoded into 3 to 18 bits. The encoding is implemented by two look-up tables on
the FPGA. Given an eight bit input, the first look-up table (LUT) provides information
about the size of the encoding. The second LUT gives the actual encoding. Only the
relevant bits from the second LUT should be used. The rest of the bits in the output are
don't cares and are chosen as either logic 0 or 1 during logic optimization.

Figure 4.16 Entropy encoding, bit allocation

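[Block diagram: Entropy encoder. The 8 bit input addresses two look-up tables: the LUT
for encoding length (5 bit output) and the LUT for variable length bit encoding (18 bit
output).]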

Figure 4.17 Entropy encoder

4.3.4 Bit packing
The output of the entropy encoder varies from 3 to 18 bits. The bits need to be
packed into 32 bit words before being written back to the embedded memory. This is
achieved by the shifter discussed below. This shifter is inspired by the Xtetris
computer game and the binary search algorithm.

Shifter
The shifter consists of 5 register stages, each 32 bits wide. The input data can be
shifted (rotated) by 16 or latched without shifting, to stage 1. The data can be shifted by
8 or passed on straight from stage 1 to stage 2. Similarly data can be shifted by 4, 2, and
1 when moving between the remaining stages. Data is shifted from stage to stage, and is
accumulated at the last stage. When the last stage has 32 bits of data, a memory write is
initiated and the last stage is flushed.

Figure 4.18 Binary Shifter for bit packing

The data is shifted to the right place over the 5 stages in order to complete a word
at the last stage. The key decision is whether to shift or not at each stage. A 5 bit counter
is maintained to store the length of the data currently held. For example, let the lengths
of the words arriving at stage 1 be a1, a2, a3, etc. The counter will have values 0, a1,
a1+a2, etc. in the corresponding clock cycles. The counter is allowed to overflow (wrap
around) once it passes 31. Thus, the counter value indicates where the next word should
start by the time it reaches the last stage. Since the shift amounts 16, 8, 4, 2, and 1
are the weights of the five counter bits, different bits of the counter (delayed
appropriately) are used to decide whether to shift or not at each stage.

Part of the last stage needs double buffering. To determine the size of the double buffer
needed, consider the worst case. The last stage already has 31 bits and the next data
coming from stage 4 is of maximum size (18 bits). Only 1 out of the 18 bits can be added
to the last stage and a memory write initiated. The rest of the 17 bits need to be buffered
for this cycle, and brought out in the next cycle. Thus, 17 out of the 32 bits in the last
stage are double buffered. Thus, whenever an overflow is detected, the double buffer is
loaded with the excess bits and taken out during the next cycle.
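
The net effect of the shifter and the double buffer can be modelled functionally as below. This Python sketch does not reproduce the 5-stage pipeline; it only shows the word boundaries and the carry-over of excess bits. The name `pack_words` is hypothetical, and placing earlier codes in the more significant bits of each word is an assumption, since the bit order within a word is not stated here.

```python
def pack_words(codes):
    """Pack (value, length) codes into 32-bit words.

    Returns (completed_words, leftover_bits, leftover_length).
    """
    words = []
    acc = 0      # bits collected so far, earliest code in the highest bits
    nbits = 0    # number of valid bits in acc (plays the role of the counter)
    for value, length in codes:
        acc = (acc << length) | (value & ((1 << length) - 1))
        nbits += length
        if nbits >= 32:                       # a full word is available
            nbits -= 32                       # excess bits to carry over
            words.append((acc >> nbits) & 0xFFFFFFFF)
            acc &= (1 << nbits) - 1           # keep only the excess bits
    return words, acc, nbits
```

In the worst case discussed above, 31 bits are already held when an 18 bit code arrives: one word is written and 17 bits are carried over, which matches the size of the double buffer.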

4.3.5 Output file format


At the end of the second stage, the upper memory (upper 0.5MB) contains the
packed bit stream. The total count of the bit stream approximated to the nearest WORD
is written to memory location 0. To reconstruct the data from the bit stream, the
following information is needed.
• The actual bit stream. On Huffman decoding, the actual 8 bit codes are
retrieved. These codes are either the quantizer output, or the RLE count.
On expanding the RLE count to the corresponding number of zeroes, we
get the actual quantized stream.
• The four quadrants of the final stage of wave-letting can be located at the
first four 128*128 byte blocks. The three quadrants of the next stage can
be located at the next three blocks, sized at 256*256 bytes each. Each
quadrant (sub-band) is quantized separately. The dynamic range of each
quadrant must be known to reconstruct the original stream.

The output file written has all the information needed to reconstruct the image.
The format of the output file generated is shown in figure 4.19.

Figure 4.19 Output file format

4.3.6 Stage 2, Overall architecture


The top level data flow diagram of the second stage is shown in figure 4.20.
Wavelet coefficients are read from the lower half of the embedded memory. The block
(sub-band) minimum and maximum are also read from the memory. The packed bit stream
output is written to the upper memory, and the bit stream length is written to memory
location 0. The control software reads the embedded memory and generates the compressed
image file.

Figure 4.20 Stage 2, data flow diagram

The control flow is shown in figure 4.21. Before reading the wavelet coefficients,
the maximum and minimum of the coefficients in each sub-band are read from the lower
memory. The coefficients are then read and processed for each sub-band, starting with
the lowest frequency band. As shown in the state diagram, a memory read is fired in
state Read 001. Memory read has a latency of 2 clock cycles. The result of the read is
finally available in state Read 100. Memory writes are completed in the same cycle. The
two intermediate states, Read 010 and Write, can be used to write back the output, if
output is available. Each memory read brings in two wavelet coefficients. Consider the
worst case, where the two coefficients get expanded to 18 bits each. There are two
memory write cycles before the next read. Whenever a memory write is performed, the
memory address register is incremented.

The read address generators read each sub-band from the interleaved memory
pattern. The output is written as a continuous stream, starting with the lowest sub-band.
Thus the output is effectively in Mallot ordering and can be progressively transmitted or
decoded.

Figure 4.21 Stage 2, control flow diagram

4.4 Stage 3
Stage 3 decodes the compressed data into wavelet coefficients. The compressed
image data is entropy decoded, runs of zeroes are decoded to the zero threshold value,
and the quantized values are dequantized to the coefficient values.

4.4.1 Entropy Decoding


The file length at the start of the compressed data file, followed by the minimum and
maximum values of the blocks and the lengths of the blocks, makes it easy to identify
the actual data.
The bit pattern of the encoded data is the valid data length followed by the valid data.
So the bit unpacking is done as follows: first the 5 bit length is read, and then the
valid data of that length is read. Each data length and its corresponding data are copied
into an 18 bit register and compared with the Huffman table (built in the Huffman encoder
stage) to give the appropriate 8 bit code. This process goes on for the entire data of
all the blocks. The sum of the length and the actual data may vary from 3 to 18 bits;
when they are copied into the 18 bit register, the unused bits are set to logic zero.

The decoding is implemented by a single look-up table on the FPGA. The 18 bit
input pattern is searched in the look-up table and its corresponding eight bit output
pattern is given as output.

Figure 4.22 Entropy Decoder

4.4.2 Run Length Decoding
The 8 bit output from the entropy decoder may hold either the count value of a run of
zeroes or the 4 bit quantized value.
When the 8 bit value is the count value of zeroes, the output is kept at the zero
threshold value for a number of clock cycles equal to the value of the count. This is
implemented as follows. Whenever the 8 bit value is a count of zeroes, the run length
decoder is activated by driving RLEen high and placing the count value on RLEin.
RLErunning is asserted, notifying that the output is valid and is the zero threshold
value. RLEin is buffered into a count variable, which is decreased by 1 in every clock
cycle, and the output is kept at the zero threshold value until the count reaches zero.
As soon as the count equals zero, the RLErunning signal is deasserted, signaling the end
of the run length decoding process to the top module.

Figure 4.23 Run-Length Decoder

4.4.3 Dequantization
Here the quantized 4 bit values are dequantized into 16 bit wavelet coefficients.
Inputs to this block are the 16 bit maximum and 16 bit minimum values of the block and
the 4 bit quantized value. Based upon the 4 bit value, the 16 bit coefficient value is
calculated as shown in Table 4.2. Thus coefficient values for each block are calculated
with their respective maximum and minimum values.

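\[
\begin{array}{lcl}
0000 & \Rightarrow & \text{Min} + \dfrac{1 \times (\text{Max} - \text{Min})}{16} \\[6pt]
0001 & \Rightarrow & \text{Min} + \dfrac{2 \times (\text{Max} - \text{Min})}{16} \\[6pt]
0010 & \Rightarrow & \text{Min} + \dfrac{3 \times (\text{Max} - \text{Min})}{16} \\[6pt]
0011 & \Rightarrow & \text{Min} + \dfrac{4 \times (\text{Max} - \text{Min})}{16} \\[6pt]
     & \vdots & \\[6pt]
1100 & \Rightarrow & \text{Min} + \dfrac{13 \times (\text{Max} - \text{Min})}{16} \\[6pt]
1101 & \Rightarrow & \text{Min} + \dfrac{14 \times (\text{Max} - \text{Min})}{16} \\[6pt]
1110 & \Rightarrow & \text{Min} + \dfrac{15 \times (\text{Max} - \text{Min})}{16} \\[6pt]
1111 & \Rightarrow & \text{Max}
\end{array}
\]

Table 4.2 Coefficient values calculation for Dequantizer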
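
The mapping of Table 4.2 can be written compactly as Min + (q+1)(Max-Min)/16 for a 4 bit level q, which gives exactly Max for q = 15. A minimal Python sketch is given below; the name `dequantize` and the use of integer (floor) division are assumptions for illustration, since the rounding used in hardware is not stated.

```python
def dequantize(level, cmin, cmax):
    """Reconstruct a coefficient from a 4-bit level as in Table 4.2."""
    if not 0 <= level <= 15:
        raise ValueError("level must be a 4-bit value")
    # Level q maps to cmin + (q + 1) * (cmax - cmin) / 16; level 15 gives cmax.
    # Floor division is an assumed rounding rule.
    return cmin + ((level + 1) * (cmax - cmin)) // 16
```

For a sub-band with Min = 0 and Max = 160, level 0000 reconstructs to 10 and level 1111 reconstructs to 160.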

Figure 4.24 Dequantizer

4.4.4 Stage 3, Overall architecture

Figure 4.25 Stage3 Data Flow Diagram

4.55 Stage 4: Inverse Discrete Wavelet Transform


In the IDWT, the inverse wavelet transformation is applied to the wavelet coefficients
to get the pixel data. The output of stage 3 is the memory interleaved model of the
wavelet coefficients, which is the input for stage 4. The IDWT is applied to the data
row wise, followed by column wise. The IDWT in the X direction accepts two coefficient
values of the same row, which may be either 2 fs or 2 gs, and outputs 2 pixel data, which
are written back sequentially (Figure 4.26). The IDWT in the Y direction accepts two
coefficient values, which may be either 2 fs or 2 gs, and outputs 2 pixel data (one for
each column) (Figure 4.27). The input in both directions is 2 fs or 2 gs, so a single
IDWT block is sufficient to process data in both directions (X and Y).

Figure 4.26 Coefficient values processing along X direction

Figure 4.27 Coefficient values processing along Y direction

Figure 4.28 RTL view of Inverse Wavelet

4.5.1 3 stages of Inverse wave-letting

The wavelet coefficients of the image are processed in three stages of inverse
wave-letting. In the first stage, 128 high pass coefficients (g) and 128 low pass
coefficients (f) are processed. The coefficients are written back in place of the
original row. Once all the 128 rows are processed, the filters are applied in the Y
direction. This completes the first stage of inverse wave-letting. The second stage of
inverse wave-letting processes the 256 low frequency coefficients and 256 high frequency
coefficients. Thus, the second stage operates on rows and columns of length 256, while
the third stage operates on rows and columns of length 512. Thus, finally, we get the
512 x 512 image pixel data.

4.6 Implementation
The VHDL top level modules (PE) are added to the NIOS system as custom
logic peripherals. Initially, the NIOS system is created using the NIOS core processor,
SRAM, SDRAM, JTAG UART and PIOs. Then the top level VHDL module of each stage is
added as a custom logic peripheral to the NIOS system. The block diagram of the NIOS
system with the custom logic peripherals is shown in Figure 4.29.

[Block diagram: System module containing the NIOS processor, Ethernet MAC, and custom
logic blocks 1 to 4, attached through master (M) and slave (S) ports to the Avalon Switch
Fabric. The fabric connects to the flash memory interface (slave), SRAM interface (slave
component), SDRAM controller (slave component) and UART (slave component). External
devices: Ethernet PHY chip, flash chip, SRAM chip, SDRAM chip, RS 232.]

Figure 4.29 Custom Logic System Module with Avalon Switch Fabric

The top level VHDL module for each stage is referred to as a Processing Element
(PE). The PE uses a set of control signals to access memory and communicate with the
host. The PE along with the memory and the control signals is shown in Figure 4.30.

Figure 4.30 Block Diagram of Processing Element (PE)

PE and memory create a block interface to the outside world (host). Signals
having arrows pointing to the block denote the input signals to PE and memory interface.
Signals having arrows leaving the block denote the output signals from interface to host.

Figure 4.31 3 way Handshaking between PE & Host

Figure 4.31 illustrates the three-way handshaking communication between the PE
and the host. When the PE wants to send an interrupt signal, it drives PE_Interrupt_Req
low until the host drives the PE_Interrupt_Ack_n signal low, which informs the PE that
its interrupt has been acknowledged by the host system. The PE can use this three-way
handshaking method to inform the host when it has finished its calculation.

The timing diagram for the PE accessing its memory is shown in Figure 4.32. Each
PE communicates with its memory through a 22-bit address bus, a 32-bit bi-directional
data bus, a read-write select control signal PE_Writeselect_n, and a strobe signal
PE_MemStrobe. When the PE is intended to do the processing, NIOS enables the PE by
driving the PE_MemBusGrant_n signal to '0'. PE_MemStrobe_n should be set to '0' when
the PE is accessing the memory. A PE_MemWriteSel_n value of '0' denotes a write access
and '1' denotes a read access. The PE is required to assert the address of the access on
the address signal. If the PE wants to write to memory, the data to be written must be
asserted in the same cycle as the address. If the PE wants to read data from memory, the
data will be available three clock cycles after the PE asserts the address.

Figure 4.32 Timing Diagram for Memory Read after a Write Access

CHAPTER 5
RESULTS

5.1 Metrics for testing


The metrics on which the codec is graded include the compression ratio, processing
noise, and bpp (bits per pixel). Results are verified against two test criteria:
1. VHDL simulation results of individual components.
2. Image results, when attached to the NIOS system as a custom peripheral.
 

5.2 VHDL simulation results


Before any hardware implementation and testing is performed, all the VHDL
modules are tested for correct functionality using the Model Sim-Altera functional
simulation tool. The Model Sim-Altera tool provides the necessary environment for
complete functional simulation of the target hardware, i.e., Altera DE2 FPGAs. The
Model Sim-Altera tool features high speed and target hardware platform adaptability. It
also allows for modification and verification of the VHDL code simultaneously.
Captured outputs for a run of the wavelet transform are presented and explained here.

Forward Wavelet X

                                           Figure 5.1 VHDL Simulation Output for Forward Wavelet X 

Figure 5.1, captured from Model Sim-Altera, shows the timing of the Forward Wavelet X.
Fwav_p2 and Fwav_p3 are the input signals; Fwav_f and Fwav_g are the output signals.
Fwavstart and Fwavend signal the start and end of the row data. Fwavclk
provides the clock signal to the component with a time period of 100ns. In every cycle, 2
pixel data are given to the input lines, and 2 fs or 2 gs are latched on to the output lines.

Forward Wavelet Y

 
                                           Figure 5.2 VHDL Simulation Output for Forward Wavelet Y

Figure 5.2, captured from Model Sim-Altera, shows the timing of the Forward Wavelet Y.
Fwav_a4 and Fwav_b4 are the input signals; Fwav_a and Fwav_b are the output signals.
Fwavstart and Fwavend signal the start and end of the column data. Fwavclk
provides the clock signal to the component with a time period of 100ns. In every cycle, 2
pixel data are given to the input lines, and 2 fs or 2 gs are latched on to the output lines.
 
Stage1 Top Level

 
                                    Figure 5.3 VHDL Simulation Output for Stage1 Top Level

Figure 5.3, captured from Model Sim-Altera, shows the timing of the Stage1 top level.
Data_InReg is the input signal; Data_OutReg and Addr_OutReg are the output signals.
MemReadSel_n, MemWriteSel_n, and MemStrobe_n are the control signals used for
synchronizing the read and write operations. Fwavclk provides the clock signal to the
component with a time period of 100ns. In every cycle, 2 pixel data are given to
Data_InReg, and 2 fs or 2 gs are latched on to Data_OutReg.
 
Inverse Wavelet

 
                                             Figure 5.4 VHDL Simulation Output for Inverse Wavelet  

Figure 5.4, captured from Model Sim-Altera, shows the timing of the Inverse Wavelet.
Fwav_f and Fwav_g are the input signals; Fwav_p2 and Fwav_p3 are the output signals.
Fwavstart signals the start of the row data. Fwavclk provides the clock signal to the
component with a time period of 100ns. In every cycle, 2 coefficient values (2 fs or 2 gs)
are given to the input lines, and 2 pixel data are latched on to the output lines.
 
Quantizer

 
                                                Figure 5.5 VHDL Simulation Output for Quantizer 

Figure 5.5, captured from Model Sim-Altera, shows the timing of the Quantizer.
QUANTin, QUANTmax, and QUANTmin are the input signals; QUANTout is the output
signal. QUANTclk provides the clock signal to the component with a time period of
100ns. For every sub-band, the max and min values are given and held until all the
coefficients of the sub-band are quantized. In every cycle, a coefficient value is given
to QUANTin, and the quantized 4 bit output is latched on to QUANTout.

Dequantizer

 
                                              Figure 5.6 VHDL Simulation Output for Dequantizer 

Figure 5.6, captured from Model Sim-Altera, shows the timing of the Dequantizer.
QUANTin, QUANTmax, and QUANTmin are the input signals; QUANTout is the output
signal. QUANTclk provides the clock signal to the component with a time period of
100ns. For every sub-band, the max and min values are given and held until all the
quantized values of the sub-band are dequantized. In every cycle, a quantized 4 bit value
is given to QUANTin, and the coefficient value is latched on to QUANTout.
 
Run-Length Encoder

 
                                         Figure 5.7 VHDL Simulation Output for Run‐Length Encoder 

Figure 5.7, captured from Model Sim-Altera, shows the timing of the Run Length
Encoder. RLEin and RLEzeroth are the input signals; RLEout, RLEspellend, and RLErunning
are the output signals. RLEclk provides the clock signal to the component with a time
period of 100ns. For every sub-band, RLEzeroth is given and held until all the values
of the sub-band are run length coded. In every cycle, a coefficient value is given to
RLEin and compared with RLEzeroth; if it is less than RLEzeroth, the zero count
is activated by asserting RLErunning, and the number of zeroes is counted. When the run
of zeroes is broken, the count is latched on RLEout and RLEspellend is driven high for
1 clock cycle.
 
Shifter

 
                                                      Figure 5.8 VHDL Simulation Output for Shifter 

Figure 5.8, captured from Model Sim-Altera, shows the timing of the shifter. SFTRdatain
and SFTRlenin are the input signals; SFTRout and SFTRouten are the output signals.
SFTRclk provides the clock signal to the component with a time period of 100ns. In every
clock cycle, SFTRlenin gives the size of the valid data on SFTRdatain. As soon as the
data received reaches 32 bits, the 32 bit packed output is latched on SFTRout and
SFTRouten is asserted for 1 clock cycle.
 

5.3 NIOS II Implementation Results
The encoder runs in two stages and the decoder runs in two stages. A raw frame of
512 by 512 pixels is loaded to the embedded memory. The starting address of the image
data memory location and the control are transferred to stage 1 by NIOS. After each
stage finishes its processing on this memory, it interrupts the NIOS, signaling the end
of operation; NIOS then transfers the control and the starting address of the memory
locations to the next stage. The hardware configuration runs at a system clock of 10MHz.
For each level of compression, the hardware is reconfigured and downloaded onto the FPGA.
The embedded memory is loaded and unloaded by the host computer using the
operating system driver routines. The control software used for loading the memory
and servicing each stage's interrupts is written in C. The device driver routines
provided by the board vendor are employed for this task.
 

5.3.1 Compression level Vs noise


Three different hardware configurations with different compression levels were
built and tested. The characteristics of the three configurations over three different
frames are displayed in Tables 5.1, 5.2, and 5.3. An available software decoder was used
to reconstruct the encoded image in order to compare it with the original. Noise figures
from a software encoder (using 32 bit integer arithmetic) are also quoted (in braces). The
PSNR and RMSE metrics are computed as per the equations given below. The compression
ratio is the ratio of the original image size (512x512 bytes) to the compressed image
size. Bits per pixel (bpp) is the ratio of the compressed image size in bits to the
number of pixels.

                                                           Figure 5.9 PSNR & RMSE Equations
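
The metrics can be computed as in the following Python sketch. The equations of Figure 5.9 are not reproduced in the text, so the standard definitions for 8 bit images are assumed here: RMSE as the root of the mean squared pixel difference, and PSNR = 20 log10(255 / RMSE). The function names are hypothetical. These definitions are consistent with the tabulated values: for example, 28775 bytes gives a ratio of about 9.11 and about 0.878 bpp, and an RMS of 7.274 gives a PSNR of about 30.89 dB.

```python
import math

def rmse(original, reconstructed):
    """Root mean square error between two equal-length pixel sequences."""
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / n)

def psnr_db(rmse_value, peak=255):
    """Peak signal to noise ratio in dB, assuming 8-bit pixels (peak 255)."""
    return 20 * math.log10(peak / rmse_value)

def compression_ratio(original_bytes, compressed_bytes):
    """Original size divided by compressed size."""
    return original_bytes / compressed_bytes

def bits_per_pixel(compressed_bytes, num_pixels):
    """Compressed size in bits divided by the number of pixels."""
    return 8 * compressed_bytes / num_pixels
```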

5.3.2 Results analysis for LENA (512 x 512) image

Figure 5.10 Original Image of LENA

Configuration         Compressed Size (Bytes)   Compression Ratio   bpp     PSNR (dB)   RMS
Minimum Compression   28775                     9.11                0.878   30.894      7.274
Medium Compression    5556                      47.18               0.169   29.530      8.511
Maximum Compression   3767                      69.58               0.114   28.059      10.082
 

                                   Table 5.1 Compression Level & Noise Measurements of LENA Image 
 
 
 

                            
[Panels, left to right: reconstructed image after minimum, medium, and maximum compression]
 

                                                           Figure 5.11 Reconstructed Images of LENA


5.3.3 Results analysis for BARBARA (512 x 512) image


 

 
Figure 5.12 Original Image of BARBARA

Configuration         Compressed Size (Bytes)   Compression Ratio   bpp     PSNR (dB)   RMS
Minimum Compression   29208                     8.82                0.906   25.017      14.31
Medium Compression    8187                      32.01               0.249   24.472      15.23
Maximum Compression   4915                      53.33               0.149   23.427      17.18
 

                            Table 5.2 Compression Level & Noise Measurements of BARBARA Image 

                                
 
[Panels, left to right: reconstructed image after minimum, medium, and maximum compression]
 

                                                       Figure 5.13 Reconstructed Images of BARBARA


5.3.4 Results analysis for GOLD HILL (512 x 512) image


 

            
Figure 5.14 Original Image of GOLD HILL

Configuration         Compressed Size (Bytes)   Compression Ratio   bpp     PSNR (dB)   RMS
Minimum Compression   30024                     8.7                 0.916   30.038      8.027
Medium Compression    6070                      43.18               0.185   28.119      10.01
Maximum Compression   3636                      72.09               0.110   26.417      12.18
 

                         Table 5.3 Compression Level & Noise Measurements of GOLD HILL Image 

                                
[Panels, left to right: reconstructed image after minimum, medium, and maximum compression]
 

                                                  Figure 5.15 Reconstructed Images of GOLD HILL

5.4 Performance Graphs (Compression Ratio Vs PSNR (db)) 
 

[Plot: PSNR (dB) versus compression ratio for LENA. Data points: (9.11, 30.894),
(47.18, 29.53), (69.58, 28.059)]
                            

                                  Figure 5.16 Compression Ratio Vs PSNR (db) Graph of LENA

[Plot: PSNR (dB) versus compression ratio for BARBARA. Data points: (8.82, 25.017),
(32.01, 24.472), (53.33, 23.427)]
                            

Figure 5.17 Compression Ratio Vs PSNR (db) Graph of BARBARA

[Plot: PSNR (dB) versus compression ratio for GOLD HILL. Data points: (8.7, 30.038),
(43.18, 28.119), (72.09, 26.417)]
                            

                              Figure 5.18 Compression Ratio Vs PSNR (db) Graph of GOLD HILL

CONCLUSIONS

We have designed and implemented a wavelet transform based image codec on
re-programmable hardware (FPGA). The encoder has multiple configurations which
support different compression levels.
• Wavelet based image compression is ideal for adaptive compression since it is
inherently a multi-resolution scheme. Variable levels of compression can be
easily achieved. The number of wave-letting stages can be varied, resulting in
a different number of sub-bands. The zero thresholds for truncating coefficients of
small magnitude can be varied. Different filter banks with different
characteristics can be used. For example, audio data has much longer correlation,
and hence longer filters are used for audio than for video. Filters tuned to the
nature of the data achieve much higher compression.
• An efficient fast algorithm (pyramidal computing scheme) for the computation of
discrete wavelet coefficients makes a wavelet transform based encoder
computationally efficient.
• Computationally intensive problems often require a hardware intensive solution.
Unlike a microprocessor with a single MAC unit, a hardware implementation
achieves greater parallelism, and hence higher throughput.
• Reconfigurable hardware is best suited for rapid prototyping applications where
the lead time for implementation can be critical. It is an ideal development
environment, since bugs can be fixed and multiple design iterations can be done
without incurring any non-recurring engineering costs.
• Reconfigurable hardware is also suited for applications with rapidly changing
requirements. In effect, the same piece of silicon can be reused.
• With respect to limitations, achieving good timing/area performance on FPGAs
is much harder than on an ASIC or a custom IC implementation. There are two
reasons for this. The first is the fixed size of the look-up tables, which leads
to underutilization of the device. The second is that the pre-fabricated routing
resources run out fast at higher device utilization.
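The pyramidal computing scheme mentioned above can be illustrated with a minimal sketch. It is not the implemented design: it assumes a Haar filter pair for brevity (the filter bank used in this work may differ), works on a one-dimensional signal, and the names haar_step, pyramid_dwt and subband_count_2d are hypothetical. Each stage is re-applied only to the low-pass output, so the total work stays proportional to the signal length.

```python
def haar_step(signal):
    """One analysis stage on an even-length signal.

    Returns the low-pass half (pairwise averages) and the
    high-pass half (pairwise differences). Haar is an assumption
    made for this sketch only.
    """
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def pyramid_dwt(signal, stages):
    """Pyramidal scheme: re-apply the stage to the low-pass output only.

    Returns [approximation, coarsest detail, ..., finest detail].
    The signal length must be divisible by 2**stages.
    """
    details = []
    low = list(signal)
    for _ in range(stages):
        low, high = haar_step(low)
        details.insert(0, high)  # coarser details go in front
    return [low] + details


def subband_count_2d(stages):
    """A separable 2-D decomposition adds three detail sub-bands per stage."""
    return 3 * stages + 1


bands = pyramid_dwt([8, 6, 2, 4, 9, 7, 3, 1], stages=2)
print(bands)                # [[5.0, 5.0], [2.0, 3.0], [1.0, -1.0, 1.0, 1.0]]
print(subband_count_2d(2))  # 7
```

Under this counting, two stages of a two-dimensional decomposition give seven sub bands, which is consistent with sub bands 0 to 6 listed in the appendix if those come from a two-stage decomposition.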
 

 
FUTURE WORK
 

The lessons learned from this work will help us enhance similar implementations in the
future. A few of the improvements that we now foresee are listed below:

• A corresponding wavelet transform codec for color images can be built on the
FPGA.
• Efficiency can be increased by using wavelets with larger filter lengths, such as
CDF 9/7.
• Throughput should be increased so that the codec can be used in real-time
applications.
 

 
APPENDIX
 
 

Zero threshold levels for different configurations

Sub band    Minimum        Medium         Maximum
            compression    compression    compression
0           0              0              0
1           39             78             156
2           27             54             108
3           104            208            416
4           79             156            316
5           50             100            200
6           191            382            764
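As an illustration of how these thresholds might be applied, the sketch below zeroes every coefficient whose magnitude falls below the threshold of its sub band. The threshold values are copied from the table above; the function name apply_zero_threshold, the dictionary layout, and the choice to keep coefficients whose magnitude equals the threshold are assumptions made for this example, not details confirmed by the implementation.

```python
# Zero thresholds per sub band (index 0 to 6), copied from the appendix table.
ZERO_THRESHOLDS = {
    "minimum": [0, 39, 27, 104, 79, 50, 191],
    "medium": [0, 78, 54, 208, 156, 100, 382],
    "maximum": [0, 156, 108, 416, 316, 200, 764],
}


def apply_zero_threshold(coefficients, subband, level):
    """Set coefficients with magnitude below the sub-band threshold to zero.

    Assumption: a coefficient whose magnitude equals the threshold is kept.
    """
    threshold = ZERO_THRESHOLDS[level][subband]
    return [c if abs(c) >= threshold else 0 for c in coefficients]


coeffs = [-120, 40, 77, -78, 300]  # made-up sample coefficients for sub band 1
for level in ("minimum", "medium", "maximum"):
    print(level, apply_zero_threshold(coeffs, subband=1, level=level))
# minimum [-120, 40, 77, -78, 300]
# medium [-120, 0, 0, -78, 300]
# maximum [0, 0, 0, 0, 300]
```

A higher compression configuration zeroes more coefficients, which leaves longer runs of zeros for the later coding stage at the cost of reconstruction quality. Sub band 0 has a threshold of 0 in every configuration, so its coefficients are never truncated.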
 

 