
VHDL implementation of wavelet packet transforms using

Mukul Shirvaikar and Tariq Bushnaq
Electrical Engineering Department
The University of Texas at Tyler
Tyler, TX 75799
e-mail: mshirvaikar@uttyler.edu


The wavelet transform is currently used in many engineering fields. The real-time implementation of the Discrete
Wavelet Transform (DWT) is an active area of research, as it is one of the most time-consuming steps in the JPEG2000
standard. The standard implements two different wavelet transforms: irreversible and reversible Daubechies. The former
is a lossy transform, whereas the latter is a lossless transform. Many current JPEG2000 implementations are
software-based and not efficient enough to meet real-time deadlines. Field Programmable Gate Arrays (FPGAs) are
revolutionizing image and signal processing. Many major FPGA vendors like Altera and Xilinx have recently developed
SIMULINK tools to support their FPGAs. These tools are intended to provide a seamless path from system-level
algorithm design to FPGA implementation. In this paper, we investigate the FPGA implementation of 2-D lifting-based
Daubechies 9/7 and Daubechies 5/3 transforms using a Matlab/Simulink tool that generates synthesizable VHSIC
Hardware Description Language (VHDL) code. The goal is to study the feasibility of this approach for real-time image
processing by comparing the performance of the high-level toolbox with a handwritten VHDL implementation. The
hardware platform used is an Altera DE2 board with a 50 MHz Cyclone II FPGA chip, and the Simulink tool chosen is
DSPBuilder by Altera.

Keywords: Wavelet Transform, FPGA, MATLAB, SIMULINK, JPEG 2000.


There is a current demand for high-quality digital images and videos over bandwidth-limited channels such as the
Internet and cellular phones. Current communication systems, facing an onslaught of media production, are under
pressure to increase their capabilities proportionally. Instead of replacing the physical infrastructure to reach higher
bandwidth, it is more economical to develop image compression schemes that represent images in a compact form for use
over the current communication systems. The JPEG standard has become one of the best-known image compression
techniques and is used in many applications. The compression standard was developed by the Joint Photographic Experts
Group (JPEG), whose abbreviation became the name of the standard. When applied to a still image, the standard uses the
Discrete Cosine Transform (DCT), which uses cosine functions of different frequencies for data analysis and
decorrelation. The DCT yields transform-domain coefficients that are quantized and entropy encoded 1. JPEG is a lossy
technique that delivers a compression ratio in the range of 10:1 to 20:1 while preserving acceptable image quality for
most consumer applications 2. However, when the desired compression ratio is 24:1 or above, the decrease in available
bits allows only the average value of each 8 × 8 block to be encoded 2. The decompressed image is built from these
8 × 8 blocks, thereby creating undesired checkerboard effects in the reconstructed image.

For this reason, and for others such as variable-rate data stream compression, the Joint Photographic Experts Group has
developed a new image compression standard called JPEG2000 3. The JPEG2000 standard provides a better compression
ratio than the original JPEG standard 4. JPEG2000 also includes features that were not available in previous
standards, such as allowing different compression schemes for digital images. Although the standard does not specify a
particular compression algorithm at its heart, the Discrete Wavelet Transform (DWT) has replaced the DCT in the most
popular implementations of the new standard. The DWT supports both the lossy and lossless types of image compression
mentioned earlier. The JPEG2000 standard can be implemented using two different wavelet transforms: the irreversible
Daubechies 9/7 and the reversible Daubechies 5/3 (also called Le Gall 5/3). The Daubechies 9/7 is a real-to-real lossy
transform that uses non-rational filter coefficients, whereas the Daubechies 5/3 is an integer-to-integer lossless
transform that uses rational filter coefficients. The standard also provides improved functionality such as enhanced
error resilience and a flexible file format 5.

Real-Time Image Processing 2008, edited by Nasser Kehtarnavaz, Matthias F. Carlsohn,
Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 6811, 68110T, © 2008 SPIE-IS&T · 0277-786X/08/$18
SPIE-IS&T/ Vol. 6811 68110T-1

Figure 1: Comparison between JPEG and JPEG2000 compression 2 ((a) original image, 158 Kb; (b) JPEG, 1.8 Kb; (c) JPEG2000, 1.8 Kb)

In Figure 1, the original image (a) was compressed by a ratio of approximately 88:1 using JPEG and JPEG2000. The
JPEG image in (b) is degraded by the checkerboard effect mentioned earlier, while the JPEG2000 standard yields
superior image quality, as shown in (c). These features and others qualify JPEG2000 to become the image compression
standard of choice for the future.

These powerful features come at the expense of computational complexity. JPEG2000 implementations are up to six times
more computationally complex than JPEG 6. Software implementations of JPEG2000 have the advantage of being
flexible; however, they may not be able to meet the hard deadlines of most real-time systems. On the other hand,
hardware implementations such as ASICs offer high performance in terms of speed but lack the flexibility of software
implementations on DSP platforms. A compromise between these two is the use of Field Programmable Gate Arrays
(FPGAs). FPGAs can be used as coprocessors alongside software components to execute the time-consuming stages of
the JPEG2000 encoding process. These stages are designed in hardware and loaded into the reconfigurable FPGA for
fast hardware execution. This approach preserves the flexibility of software and the high performance of hardware 1.
Figure 2 shows benchmark performance comparisons of software and hardware implementations of JPEG2000. It shows
that entropy encoding and the DWT are the most time-consuming stages of the standard. Comparing (a) and (b), we see
that hardware can accelerate the execution of the DWT substantially. This is mainly due to the multiple parallel data
paths in the algorithm, which make it ideally suited to FPGA implementation. The entropy encoding component, on the
other hand, is better suited to a software implementation. It would therefore be prudent to implement the DWT
component on a coprocessor FPGA when designing a real-time system that incorporates the JPEG2000 standard.

Figure 2: Software vs. hardware benchmark for the JPEG2000 standard 6 ((a) software JPEG2000 benchmark; (b) hardware JPEG2000 benchmark; each panel covers lossless and lossy coding)

The introduction of FPGAs adds the engineering burden of CAD tools and Hardware Description Languages (HDLs).
The engineering personnel’s knowledge and expertise in this area can be a limiting factor in meeting the current market
demand for hardware implementation of algorithms such as the DWT. MATLAB is currently one of the most popular
engineering software packages. It provides a powerful graphical system design toolbox known as SIMULINK. With this
in mind, major FPGA vendors like Altera and Xilinx have recently developed SIMULINK tools to support their FPGAs.
These tools are intended to provide a seamless path from system-level algorithm design to FPGA implementation.

In this paper, we specifically investigate the FPGA implementation of 2-D lifting-based Daubechies 9/7 and Daubechies
5/3 transforms using a MATLAB/SIMULINK tool that generates synthesizable VHDL code. The goal is to study the
feasibility of this approach for real-time image processing by comparing the performance of the high-level toolbox with
a handwritten VHDL implementation. The hardware platform used in this experiment is an Altera DE2 FPGA board with
8 MB of SDRAM, 4 MB of flash memory, an SD card slot, and a 50 MHz Cyclone II FPGA chip. The Simulink tool chosen
is Altera's DSPBuilder, which has an extensive library of building blocks. Its design flow allows users to convert designs
to VHDL, synthesize them, and program the board from a single GUI. The results of the transform can be displayed on
the 16 × 2 LCD mounted on the board and are compared to the correct values from MATLAB implementations. The
SIMULINK implementation is compared with a handwritten VHDL implementation for both transforms, evaluating FPGA
hardware utilization (number of logical components) and execution time.


The first phase in the JPEG2000 encoding process is two levels of discrete wavelet decomposition 6. As mentioned
earlier, JPEG2000 implements reversible and irreversible DWTs. The inputs to the DWT stage are tiles of the image data
and the number of decomposition levels, L. The wavelet transform is performed using either the 9/7 floating-point
wavelet discussed in Antonini et al 7, or the 5/3 integer wavelet discussed in Calderbank et al 8. Progression is possible
with either wavelet, but the 5/3 must be used to progress to a lossless transform, as noted in Marcellin et al 9. It is
important to keep in mind that all wavelet transforms described in the JPEG2000 standard are one-dimensional.
Therefore, to perform a 2-D DWT on an image, the transform is first applied horizontally and then vertically to the
pixels of the input image or tile. The standard supports two filtering modes: convolution mode and lifting mode 10.
Due to its simplicity, we chose the lifting scheme for our DWT implementation.

2.1 Daubechies 5/3

The Daubechies 5/3 transform uses the lifting-based implementation described in Sweldens 11. The name refers to the
filter lengths: the analysis low-pass filter has 5 taps and the analysis high-pass filter has 3 taps. Daubechies 5/3 is a
reversible integer-to-integer wavelet transform constructed using the algorithm described in Calderbank et al 8.
Reversible transforms have many advantages: they can be computed with finite-precision arithmetic, they map integers to
integers, and they approximate the linear wavelet transform from which they are derived 12. The lifting scheme splits
the data into odd and even samples, resulting in a more compact representation. Figure 3 shows the different stages of
the lifting scheme, namely the split, predict, and update phases. The first stage splits the input image pixels into two
halves, separating the even samples from the odd ones. This is also called the lazy wavelet transform because it does
not decorrelate the data, but rather splits it into odd and even samples 13.

The second stage uses the even samples to predict the odd ones, as given by equation (1.1). The more correlated the
data, the closer the predicted values will be to the odd samples.


Figure 3: Lifting implementation of the Daubechies 5/3 DWT (split, predict, and update stages)

$$Y(2n+1) = X_{ext}(2n+1) - \left\lfloor \frac{X_{ext}(2n) + X_{ext}(2n+2)}{2} \right\rfloor \tag{1.1}$$

The final wavelet coefficients are the differences between the original odd samples and their predicted values, which is
the outcome of equation (1.1). If the data were perfectly predictable from their even neighbors (for example, locally
linear), the outcome of equation (1.1) would be zero. The wavelet coefficients therefore capture the high-frequency
content (deviations from the local DC level) of the image.

In the final (update) phase, the even samples are lifted using the wavelet coefficients as follows:

$$Y(2n) = X_{ext}(2n) + \left\lfloor \frac{Y(2n-1) + Y(2n+1) + 2}{4} \right\rfloor \tag{1.2}$$

The scaling coefficients are the outcome of equation (1.2). They preserve the DC level of the input data, capturing the
low-frequency component of the image.
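To make equations (1.1) and (1.2) concrete, here is a minimal Python sketch of one level of the 1-D forward 5/3 lifting transform and its inverse. It is an illustration, not the authors' VHDL: it assumes an even-length integer input and whole-sample symmetric extension at both edges, so that Y(-1) equals Y(1) and X_ext(N) equals X(N-2). Python's `//` operator floors toward negative infinity, matching the floor brackets in the equations; because the inverse subtracts exactly the same floored terms, the round trip is lossless.

```python
"""Sketch of one level of the 1-D reversible 5/3 lifting DWT, Eqs. (1.1)-(1.2).

Assumptions (not from the paper): even-length integer input and whole-sample
symmetric extension at both edges, i.e. X(-k) = X(k) and X(N-1+k) = X(N-1-k).
"""


def _x_ext(x: list[int], i: int) -> int:
    """Read X_ext(i), reflecting an index just outside [0, N) back inside."""
    n = len(x)
    if i < 0:
        i = -i
    if i >= n:
        i = 2 * (n - 1) - i
    return x[i]


def dwt53_forward(x: list[int]) -> tuple[list[int], list[int]]:
    """Return (scaling, wavelet) coefficients of an even-length signal."""
    assert len(x) >= 2 and len(x) % 2 == 0, "sketch assumes an even length"
    half = len(x) // 2
    # Predict, Eq. (1.1): odd sample minus the floored mean of its even neighbors.
    high = [_x_ext(x, 2 * k + 1) - (_x_ext(x, 2 * k) + _x_ext(x, 2 * k + 2)) // 2
            for k in range(half)]
    # Update, Eq. (1.2). Symmetric extension makes Y(-1) equal Y(1), i.e. high[0].
    low = [_x_ext(x, 2 * k) + (high[max(k - 1, 0)] + high[k] + 2) // 4
           for k in range(half)]
    return low, high


def dwt53_inverse(low: list[int], high: list[int]) -> list[int]:
    """Undo the update and predict steps in reverse order (exact for integers)."""
    half = len(low)
    even = [low[k] - (high[max(k - 1, 0)] + high[k] + 2) // 4 for k in range(half)]
    # At the right edge X(N) reflects to X(N-2), the last even sample.
    odd = [high[k] + (even[k] + even[min(k + 1, half - 1)]) // 2 for k in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


if __name__ == "__main__":
    print(dwt53_forward([0, 1, 2, 3, 4, 5, 6, 7]))
```

For the ramp 0, 1, ..., 7 the wavelet coefficients are zero except at the right edge, where the symmetric extension turns the ramp into a peak.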


Real signals are time-limited and do not extend infinitely. The lifting scheme assumes an infinite signal, which becomes
a problem near the edges of a real, finite signal. This problem is solved by extending the signal at its left and right
edges. To avoid discontinuities at the edges, the signal is extended periodically and symmetrically. JPEG2000 uses the
1D_EXTR procedure described in the standard 3. Table 1 shows how the signal is extended when using the Daubechies
5/3 transform. Depending on whether the index i0 of the first data element is even or odd, the signal is symmetrically
extended by i_left samples at the left edge and i_right samples at the right edge of the finite signal.

Table 1: Signal extension for the Daubechies 5/3 DWT

i0 (parity)   i_left   i_right
even          2        1
odd           1        2
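The extension rule can be sketched as an index map. The helper below reflects any index back into the valid range [i0, i1) using whole-sample periodic symmetric extension, which is our reading of the 1D_EXTR procedure rather than a transcription of the standard. The function names are hypothetical, and the (i_left, i_right) pairs are simply copied from Table 1 and, for the 9/7 case, from Table 3.

```python
"""Sketch of symmetric edge extension before lifting (hypothetical helper names).

The index map is whole-sample periodic symmetric extension, assumed here to
match the behavior of the JPEG2000 1D_EXTR procedure.
"""

# (i_left, i_right) copied from Table 1 (5/3) and Table 3 (9/7), keyed on i0 parity.
EXTENSION = {
    "5/3": {"even": (2, 1), "odd": (1, 2)},
    "9/7": {"even": (4, 3), "odd": (3, 4)},
}


def mirror_index(i: int, i0: int, i1: int) -> int:
    """Map any integer index onto [i0, i1) by periodic whole-sample reflection."""
    length = i1 - i0
    if length == 1:
        return i0
    period = 2 * (length - 1)
    m = (i - i0) % period          # Python's % is non-negative for a positive period
    return i0 + min(m, period - m)


def extend(x: list[int], i0: int, wavelet: str) -> list[int]:
    """Return X_ext for samples x occupying indices [i0, i0 + len(x))."""
    parity = "even" if i0 % 2 == 0 else "odd"
    i_left, i_right = EXTENSION[wavelet][parity]
    i1 = i0 + len(x)
    return [x[mirror_index(i, i0, i1) - i0] for i in range(i0 - i_left, i1 + i_right)]


if __name__ == "__main__":
    print(extend([10, 20, 30, 40], 0, "5/3"))
```

Because the reflection is about the first and last samples themselves (not between samples), the edge values are never duplicated, which keeps the extended signal free of artificial steps.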

2.2 Daubechies 9/7

Daubechies 9/7 is an irreversible transform commonly used to implement JPEG2000. The name refers to the filter lengths:
9 taps for the analysis low-pass filter and 7 taps for the analysis high-pass filter. Figure 4 shows the different stages of
a lifting-based implementation of the 9/7 DWT. The split stage separates the odd and even samples, as in the
Daubechies 5/3 DWT.

Figure 4: Lifting implementation of the Daubechies 9/7 DWT

Equations (2.1) through (2.6) describe the different stages of the transform, namely predict I, update I, predict II,
update II, scale I, and scale II, which yield the final wavelet and scaling coefficients. The coefficient values, as defined
by the JPEG2000 standard and used in the equations below, are shown in Table 2.

$$\text{Predict I:}\quad Y(2n+1) = X_{ext}(2n+1) + \alpha\,\bigl[X_{ext}(2n) + X_{ext}(2n+2)\bigr] \tag{2.1}$$

$$\text{Update I:}\quad Y(2n) = X_{ext}(2n) + \beta\,\bigl[Y(2n-1) + Y(2n+1)\bigr] \tag{2.2}$$

$$\text{Predict II:}\quad Y(2n+1) = Y(2n+1) + \gamma\,\bigl[Y(2n) + Y(2n+2)\bigr] \tag{2.3}$$

$$\text{Update II:}\quad Y(2n) = Y(2n) + \delta\,\bigl[Y(2n-1) + Y(2n+1)\bigr] \tag{2.4}$$

$$\text{Scale I:}\quad Y(2n+1) = -K \times Y(2n+1) \tag{2.5}$$

$$\text{Scale II:}\quad Y(2n) = \frac{1}{K} \times Y(2n) \tag{2.6}$$

Table 2: Filter coefficients

Coefficient   Value
α             -1.586 134 342
β             -0.052 980 118
γ              0.882 911 075
δ              0.443 506 852
K              1.230 174 105
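The following Python sketch applies equations (2.1) to (2.6) in place on a floating-point copy of an even-length signal, using the Table 2 coefficients, together with an inverse obtained by running the steps backwards with negated coefficients. The edge handling (whole-sample symmetric extension, so Y(-1) = Y(1) and Y(N) = Y(N-2)) and the function names are assumptions for illustration. Because the coefficients are non-rational and the arithmetic is finite-precision, reconstruction is exact only up to rounding error, which is one reason this transform is treated as irreversible.

```python
"""Sketch of one level of the 1-D irreversible 9/7 lifting DWT, Eqs. (2.1)-(2.6).

Assumptions (not from the paper): even-length input, floating-point arithmetic,
whole-sample symmetric extension at both edges. Names are hypothetical.
"""

# Filter coefficients copied from Table 2.
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.882911075
DELTA = 0.443506852
K = 1.230174105


def _lift_odd(y: list[float], c: float) -> None:
    """Y(2n+1) += c * (Y(2n) + Y(2n+2)); Y(N) reflects to Y(N-2) at the right edge."""
    n = len(y)
    for i in range(1, n, 2):
        right = y[i + 1] if i + 1 < n else y[i - 1]
        y[i] += c * (y[i - 1] + right)


def _lift_even(y: list[float], c: float) -> None:
    """Y(2n) += c * (Y(2n-1) + Y(2n+1)); Y(-1) reflects to Y(1) at the left edge."""
    for i in range(0, len(y), 2):
        left = y[i - 1] if i > 0 else y[1]
        y[i] += c * (left + y[i + 1])


def dwt97_forward(x: list[float]) -> tuple[list[float], list[float]]:
    """Return (scaling, wavelet) coefficients of an even-length signal."""
    assert len(x) >= 2 and len(x) % 2 == 0, "sketch assumes an even length"
    y = [float(v) for v in x]
    _lift_odd(y, ALPHA)    # Predict I,  Eq. (2.1)
    _lift_even(y, BETA)    # Update I,   Eq. (2.2)
    _lift_odd(y, GAMMA)    # Predict II, Eq. (2.3)
    _lift_even(y, DELTA)   # Update II,  Eq. (2.4)
    high = [-K * v for v in y[1::2]]   # Scale I,  Eq. (2.5)
    low = [v / K for v in y[0::2]]     # Scale II, Eq. (2.6)
    return low, high


def dwt97_inverse(low: list[float], high: list[float]) -> list[float]:
    """Undo the scaling, then each lifting step in reverse with negated coefficients."""
    y = [0.0] * (2 * len(low))
    y[0::2] = [v * K for v in low]
    y[1::2] = [v / -K for v in high]
    _lift_even(y, -DELTA)
    _lift_odd(y, -GAMMA)
    _lift_even(y, -BETA)
    _lift_odd(y, -ALPHA)
    return y


if __name__ == "__main__":
    print(dwt97_forward([100] * 8))
```

On a constant input the scaling coefficients reproduce the constant and the wavelet coefficients are essentially zero, as expected of a low-pass/high-pass pair.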

Finally, as in the Daubechies 5/3 DWT, Table 3 lists the minimum symmetric extension values described in the
JPEG2000 1D_EXTR procedure. This extension provides enough samples beyond each edge of the finite signal for every
lifting step to be computed.

Table 3: Signal extension for the Daubechies 9/7 DWT

i0 (parity)   i_left   i_right
even          4        3
odd           3        4

3. Matlab/Simulink

Nowadays, many DSP algorithms are implemented on FPGAs instead of traditional ASICs or DSP processors. The main
FPGA producers in the market are Altera and Xilinx, and both companies have reported revenues greater than $1
billion 14. A new visual design methodology introduced by both companies is based on the popular and established
Matlab/Simulink interface. This approach is promising given the large number of Matlab programmers, which exceeds
1 million worldwide 14. Xilinx has introduced its own Simulink tool, called System Generator, and Altera has developed
DSPBuilder. In this paper we use DSPBuilder because of its numerous features, which are discussed below.

The Altera library includes 171 design blocks arranged in 11 subgroups. These building blocks allow the user to design
DSP systems without the need to import any custom Matlab or HDL code into the design 14. In addition, DSPBuilder has
a very smooth and intuitive design flow. The program is fully integrated with Simulink and does not require the user to
use any other support or third-party software throughout the entire design process. DSPBuilder provides a single
graphical user interface (GUI) that allows the user to perform synthesis, fitting, and board programming. The program
offers a collection of board components such as LEDs, DIP switches, A/D, and D/A converters. These blocks allow the
user to effectively eliminate the need for manual pin assignment and configuration for the different components on the
development kit. Figure 5 shows some of the subcategories and building blocks in DSPBuilder.


The objective of the experiments in this section is to evaluate the Quality of Results (QoR) of the Simulink tool
compared to traditional hand-written HDL. To do so, we implement the Daubechies 5/3 and 9/7 transforms using both
hand-written VHDL and DSPBuilder and compare the performance of each implementation in terms of speed and number
of logical components.

The decomposition level was set to one in all implementations in order to normalize the comparison. Also, for
simplicity, the design implements only integer arithmetic instead of the more accurate fixed-point arithmetic. Figure 6
shows the general topology of the design for both the reversible and irreversible DWT. The input image is the standard
64 × 64 gray-level Lena portrait, stored in RAM. Both DSPBuilder and Altera's Quartus II provide a ready interface to a
dual-port RAM allocated on the system's development kit.
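The paper does not state how its integer-only datapath approximates the non-integer 9/7 coefficients of Table 2. As a hedged illustration only, the sketch below shows one common fixed-point approach: scale each coefficient by 2^F, round it to an integer, and multiply samples by that integer followed by an arithmetic right shift of F bits. The function names and the choice of F are assumptions, not the authors' design.

```python
"""Sketch: quantizing a 9/7 lifting coefficient for integer-only hardware.

An assumed fixed-point scheme for illustration; not the design evaluated in the paper.
"""

ALPHA = -1.586134342  # Predict I coefficient from Table 2


def to_fixed(coeff: float, frac_bits: int) -> int:
    """Round a real coefficient to an integer scaled by 2**frac_bits."""
    return round(coeff * (1 << frac_bits))


def fixed_mul(sample: int, coeff_q: int, frac_bits: int) -> int:
    """Integer multiply, then drop the fraction with an arithmetic right shift (floor)."""
    return (sample * coeff_q) >> frac_bits


if __name__ == "__main__":
    for bits in (0, 4, 8, 12):
        q = to_fixed(ALPHA, bits)
        error = q / (1 << bits) - ALPHA
        print(f"{bits:2d} fractional bits: q = {q:6d}, coefficient error = {error:+.6f}")
```

With zero fractional bits, ALPHA collapses to -2, which shows why pure integer arithmetic is the less accurate option; more fractional bits reduce the error at the cost of wider multipliers.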

Figure 5: DSPBuilder menu and components (Simulink Library Browser showing the Altera DSP Builder blockset, with groups such as Boards, Complex Type, Gate & Control, IO & Bus, Rate Change, and State Machine Functions, and blocks such as AltBus, Barrel Shifter, Bus Builder, and Bus Concatenation)

The top-level 2-D DWT entity provides the starting offset and the step size to the 1-D DWT entity. The 1-D DWT entity
applies the DWT to a given row or column of the image data. The implementation does not reorder the results of the
1-D transform, since we are only interested in a performance comparison. The design extends the data as described
earlier and applies the sliding window method to calculate the scaling and wavelet coefficients 15. The standard
approach to implementing the DWT requires a separate pass over each row for each lifting and scaling stage of the
transform: the Daubechies 5/3 would require two passes per row and the Daubechies 9/7 six passes per row 15. The
sliding window method requires only one pass over a row of data regardless of the number of stages of the DWT,
avoiding the need to buffer the entire row. The size of the window depends on the number of data points needed to
compute the final scaling and wavelet coefficients at each window slide. The technique reduces the number of pixels
that need to be buffered: for example, the Daubechies 5/3 requires only a 4-pixel buffer and the Daubechies 9/7 a
6-pixel buffer in each window slide 15.
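The single-pass idea and the offset/step interface can be sketched together. In the Python below, a software model with hypothetical names rather than the authors' VHDL, the 2-D driver passes each row (offset r*W, step 1) and then each column (offset c, step W) of a flat row-major buffer to a 1-D routine. That routine makes one pass over the line with a window holding four values (X(2k), X(2k+1), X(2k+2), and the previous wavelet coefficient Y(2k-1)), evaluates equations (1.1) and (1.2) directly, and writes the results back interleaved without reordering. Whole-sample symmetric extension at the edges is an assumption.

```python
"""Sketch: single-pass (sliding-window) 5/3 lifting over strided lines.

Rows and columns of a flat row-major buffer are addressed by (offset, step).
Results are written back interleaved (low at even, high at odd positions)
and are not reordered. Names are hypothetical; symmetric extension is assumed.
"""


def lift53_line(buf: list[int], offset: int, step: int, count: int) -> None:
    """Transform buf[offset + i*step] for i in [0, count) in a single pass."""
    assert count >= 2 and count % 2 == 0, "sketch assumes an even line length"

    def read(i: int) -> int:
        if i >= count:                 # right edge: X(count) reflects to X(count - 2)
            i = 2 * (count - 1) - i
        return buf[offset + i * step]

    even = read(0)                     # window: X(2k), X(2k+1), X(2k+2), Y(2k-1)
    prev_high = None
    for k in range(count // 2):
        odd = read(2 * k + 1)
        next_even = read(2 * k + 2)    # still unmodified: it is overwritten one slide later
        high = odd - (even + next_even) // 2          # Eq. (1.1)
        if prev_high is None:
            prev_high = high                          # left edge: Y(-1) == Y(1)
        low = even + (prev_high + high + 2) // 4      # Eq. (1.2)
        buf[offset + 2 * k * step] = low
        buf[offset + (2 * k + 1) * step] = high
        even, prev_high = next_even, high             # slide the window by two samples


def dwt53_2d(buf: list[int], width: int, height: int) -> None:
    """One decomposition level: every row first, then every column."""
    for r in range(height):
        lift53_line(buf, offset=r * width, step=1, count=width)
    for c in range(width):
        lift53_line(buf, offset=c, step=width, count=height)


if __name__ == "__main__":
    img = [7] * 16
    dwt53_2d(img, 4, 4)
    print(img)
```

Each sample is read before it is overwritten: position 2k+2 is read at slide k but only written at slide k+1, so no full-row buffer is needed.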

Figure 6: Design structure of the DWT implementation

The correctness of the transforms is verified by comparing the output of both transforms to that of a hand-written Matlab
algorithm performing the same operations. The Matlab code implements the 2-D DWT and complies with the constraints
and specifications of the HDL design mentioned earlier. The output of the transform is displayed on the LCD screen
attached to the development kit and compared to the output of the Matlab code. Note that a 64 × 64 image generates
4096 outputs. To save time, we compare only a randomly chosen row and column of the transform's output with the
same row and column of the Matlab output.

Table 4: DSPBuilder vs. hand-written VHDL

                  Hand-written VHDL        DSPBuilder
Design            LCs    fmax (MHz)        LCs    fmax (MHz)
Daubechies 5/3    110    203.13            79     250.40
Daubechies 9/7    133    66.08             107    109.0


The outputs of each design were compared against the Matlab reference to verify correctness. Table 4 summarizes the
synthesis reports of the hand-written VHDL and DSPBuilder designs. For each tool, it lists the number of logical
components (LCs) and the maximum frequency (fmax). In the case of the Daubechies 5/3, DSPBuilder generated 79 LCs
compared to 110 LCs for the hand-written VHDL. The DSPBuilder design therefore used fewer logical components,
implying a more efficient design. fmax is the maximum clock frequency that can be achieved without violating setup
and hold time requirements. Again, DSPBuilder outperformed the VHDL code in the timing analysis. In the case of the
Daubechies 9/7, the fmax achieved by DSPBuilder is 109 MHz, which corresponds to a 9.17 ns clock period, whereas the
VHDL design achieved an fmax of 66.08 MHz, which corresponds to a 15.13 ns clock period. DSPBuilder outperformed
the hand-written VHDL code in all the experiments performed.
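To make the comparison concrete, this short sketch derives clock periods (1000 / fmax in MHz gives nanoseconds) and relative figures directly from the Table 4 numbers; it adds no new measurements, and the dictionary layout and function names are illustrative assumptions.

```python
"""Sketch: figures derived from the Table 4 synthesis results (no new data)."""

# (logical components, fmax in MHz) transcribed from Table 4.
TABLE4 = {
    "Daubechies 5/3": {"VHDL": (110, 203.13), "DSPBuilder": (79, 250.40)},
    "Daubechies 9/7": {"VHDL": (133, 66.08), "DSPBuilder": (107, 109.0)},
}


def period_ns(fmax_mhz: float) -> float:
    """Minimum clock period in nanoseconds implied by fmax in MHz."""
    return 1000.0 / fmax_mhz


def compare(design: str) -> dict[str, float]:
    """LC saving and clock-rate ratio of DSPBuilder relative to hand-written VHDL."""
    lc_vhdl, f_vhdl = TABLE4[design]["VHDL"]
    lc_dspb, f_dspb = TABLE4[design]["DSPBuilder"]
    return {
        "lc_saving_pct": 100.0 * (lc_vhdl - lc_dspb) / lc_vhdl,
        "fmax_ratio": f_dspb / f_vhdl,
        "period_vhdl_ns": period_ns(f_vhdl),
        "period_dspbuilder_ns": period_ns(f_dspb),
    }


if __name__ == "__main__":
    for name in TABLE4:
        print(name, {k: round(v, 2) for k, v in compare(name).items()})
```

These are clock-rate comparisons only; because the number of cycles per coefficient is not reported, the ratios should not be read as throughput ratios.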

The results of the experiments suggest that DSPBuilder delivers the desired QoR and can be considered a competitive
alternative to the traditional method of FPGA development. An additional advantage of DSPBuilder is that it allows
faster implementation and hence a shorter time-to-market. As a bonus, it provides automated test benches at no extra cost or


The wavelet transform is currently used in many engineering fields. The real-time implementation of the Discrete
Wavelet Transform (DWT) is an active area of research, as it is one of the most time-consuming steps in the JPEG2000
standard. Popular realizations of the standard implement two different wavelet transforms: irreversible and reversible
Daubechies. FPGAs are revolutionizing image and signal processing, and many major FPGA vendors like Altera and
Xilinx have recently developed SIMULINK tools to support their products. These tools are intended to provide a
seamless path from system-level algorithm design to FPGA implementation. For the two transforms studied, the results
of this experiment demonstrate that the Simulink design flow for FPGAs achieves a better QoR than hand-written VHDL
code and is a viable alternative for DSP design. The sophisticated DSPBuilder toolbox is another strength of this
approach. The Quality of Results achieved by DSPBuilder matches or exceeds that of handwritten VHDL, making
Matlab/SIMULINK tools worthy of consideration for DSP algorithm implementation.


The authors would like to thank Dr. Yea Zong Kuo, Comcept Division, L3 Communications Inc. for her role as a
technical consultant during the course of this project.


1. Zafarifar, B., “Micro-codable Discrete Wavelet Transform,” Computer Engineering Laboratory, Delft University of
Technology, July 2002.
2. Brennan, J., “FPGA Coprocessing in a JPEG2000 Implementation,” School of Information Technology and
Electrical Engineering, University of Queensland, October 2001.
3. ISO/IEC FCD 15444-1:2000 V1.0, 16 March 2000, http://www.jpeg.org (last visited December 12, 2007).
4. Santa-Cruz, D. and Ebrahimi, T., “An Analytical Study of JPEG 2000 Functionalities,” Proc. Int’l Conf. Image
Processing, IEEE, New Jersey, 2000.
5. Adams, M. D., “The JPEG2000 Still Image Compression Standard,” Department of Electrical and Computer
Engineering, University of Victoria, December 2005.
6. Cantineau, O., “Enabling Real-Time JPEG2000 with FPGA Architecture,” Global Signal Processing Conferences &
Expos (GSPx), International Signal Processing Con., CF-JPG031505-1.0, March 2005.
7. Antonini, M., Barlaud, M., Mathieu, P. and Daubechies, I., “Image coding using wavelet transform,” IEEE Trans. on
Image Proc., vol. 1, no. 2, pp. 205-220, Apr. 1992.
8. Calderbank, A. R., Daubechies, I., Sweldens, W. and Yeo, B.-L., “Wavelet transforms that map integers to
integers,” Appl. Comput. Harmon. Anal., vol. 5, pp. 332-369, July 1998.
9. Marcellin, M. W., Gormish, M. J., Bilgin, A. and Boliek, M. P., “An Overview of JPEG-2000,” Proc. of IEEE Data
Compression Conference, pp. 532-541, 2000.
10. Skodras, A., Christopoulos, C. and Ebrahimi, T., “The JPEG 2000 Still Image Compression Standard,” IEEE Signal
Processing Magazine, pp. 36-58, September 2001.
11. Sweldens, W., “The lifting scheme: A new philosophy in biorthogonal wavelet constructions,” Proc. SPIE, vol.
2569, pp. 68-79, Sept. 1995.
12. Adams, M. D. and Kossentini, F., “Reversible Integer-to-integer Wavelet Transforms for Image Compression:
Performance Evaluation and Analysis,” IEEE Transactions on, vol. 9, no. 6, June 2000.


13. Shim, M. and Laine, A., “Overcomplete lifted wavelet representations for multiscale feature analysis,” IEEE
International Conference on Image Proc., vol. 2, pp. 242-246, Oct. 1998.
14. Meyer-Baese, U., Vera, A., Meyer-Baese, A., Pattichis, M. and Perry, R., “Discrete Wavelet Transform FPGA
Design using Matlab/Simulink,” ECE Dept., FAMU-FSU, 2004.
15. Andreas, S. and Richard, C., “Discrete Wavelet Transform Core for Image Processing Applications,” Proc. of SPIE-IS&T
Electronic Imaging, SPIE, vol. 5671, 2005.
