
Efficient Pipelined VLSI Architecture with Dual Scanning Method for 2-D Lifting-Based Discrete Wavelet Transform
Anand Darji, S. N. Merchant, A. N. Chandorkar
Electrical Engineering Department
Indian Institute of Technology Bombay
Powai, India
anand@ee.iitb.ac.in, mechant@ee.iitb.ac.in, anc@ee.iitb.ac.in

Abstract—In this paper, we describe a high-speed, memory-efficient, very low-power, dual-memory-scan-based pipelined VLSI architecture for the 2-D Discrete Wavelet Transform (DWT) based on the LeGall 5/3 filter. The proposed architecture consists of two 1-D pipelined architectures along with a transpose unit (TU). The architecture consumes two inputs and produces two outputs per clock cycle. Moreover, a dual scan technique is employed to enhance throughput with 100% hardware utilization efficiency without a significant increase in power. The architecture uses a 2N on-chip buffer and five transpose registers to process a single-level 2-D DWT of an image of size N×N. The RTL (Register Transfer Level) description is written in VHDL, and the netlist is compiled using Synopsys Design Vision with the UMC 180 nm MMRF technology cell library. After formal verification, the netlist is imported into Cadence SoC Encounter to generate the GDS-II file for the Application Specific Integrated Circuit (ASIC). Simulation results show positive slack at a frequency of 200 MHz. The core area of the proposed architecture is only 0.73 mm² with a low power consumption of 13.38 mW.

Keywords- ASIC, DWT, Dynamic Power, JPEG 2000, RTL, VHDL

978-1-61284-865-5 ©2011 IEEE

I. INTRODUCTION

The DWT is used in many image processing applications such as compression, bioinformatics, and texture discrimination [1]. The Discrete Cosine Transform (DCT) has several disadvantages in comparison to the DWT, such as blocking artifacts and lower resolution capability. The DWT matches the Human Visual System (HVS) better, so it has been accepted in the JPEG 2000 standard and adopted as the transform coder in MPEG-4 still texture coding. The DWT has many useful properties such as symmetrical transform, integer-to-integer transform, and in-place computation. The conventional implementation of the 2-D DWT using the filter bank approach demands very high computational power, and most applications demand real-time processing with low power consumption. A high-speed yet low-power implementation of the 2-D DWT that meets the timing requirements of real-time, low-power applications is therefore considered a challenging task.

As the convolution-based approach to the 2-D DWT demands more silicon area and power, Daubechies and Sweldens [2] suggested a lifting-based scheme for biorthogonal filters, in which a linear filter is factorized into a few lifting steps. The lifting-based scheme uses far fewer adders and multipliers than the convolution-based approach.

The 2-D DWT is usually implemented using a 1-D DWT as the leaf cell, along with a TU and storage of O(N²). Line-based architectures have been proposed in [3]-[5] to reduce the size of the transposition buffer. Liao et al. [5] proposed two DWT architectures with recursive and dual scan methods for multi-level and single-level 2-D DWT, respectively. Xiong et al. [4] suggested an improved method to reduce the size of the on-chip buffers. Wu et al. [6] proposed a pipelined architecture with a modified lifting scheme that reduces the critical path to only one multiplier delay. Lai et al. [7] proposed a pipelined architecture using a 1-D core based on the dual scan method. In this paper, we propose a 5/3 lifting filter based 2-D DWT architecture that uses two pipelined 1-D units as sub-cells, together with its VLSI implementation. We use the decimation property of the DWT for interleaving to enhance throughput.

The rest of this paper is organized as follows. In Section II, the proposed high-speed, low-power 2-D DWT structure is presented. ASIC implementation results and a performance comparison are presented in Section III, and concluding remarks are presented in Section IV.

II. PROPOSED ARCHITECTURE

The proposed 2-D DWT architecture, depicted in Fig. 1, has been used for the ASIC implementation. The image or video frame is read from a dual-port ROM/RAM using the dual scan method and passed to the Row Processing Unit (RPU), which calculates the 1-D DWT. These 1-D coefficients are passed to the TU. In the dual scan, two pixels from one row are scanned per clock and passed to the RPU; in the next clock, two pixels from the second row are scanned. This process is repeated until the end. The TU manages the column-wise input to the Column Processing Unit (CPU), which calculates the 2-D DWT coefficients. This architecture is scalable and can be used for multilevel DWT.

The proposed architecture shown in Fig. 2 takes two inputs and gives two outputs per cycle. Data1 and Data2 are the odd and even input samples given to the hardware in a single clock for 100% hardware utilization. This architecture is a very simple design compared to the architectures suggested in [5] and [8], which have complex control paths to achieve 100%
hardware utilization efficiency. A simple control path helps minimize power by introducing less switching. Usually, a 2-D DWT architecture has to wait for one complete row to be processed before column processing can start, and it requires a line buffer of size N for an image of size N×N. The proposed architecture processes two rows at alternate clocks, which provides the 1-D coefficients required to start column processing simultaneously and obtain the 2-D DWT coefficients. This dual scanning not only saves a line buffer but also reduces latency. The Transpose Unit (TU) is responsible for sequencing the coefficients from the RPU to the CPU. The architecture of the CPU is the same as that of the RPU, but it has 2N line buffers. Here, in order to reduce power, we use direct-mapped divide-by-two and divide-by-four operations instead of a shifter or multiplier. Further, the designed TU uses only five registers, compared to the usual requirement of 1.5N transposing buffers [9], and uses two multiplexers that work at half the clock rate. This design reduces the number of transitions, which reduces power. Moreover, the design of the TU is independent of the input image size. The proposed architecture has a very low memory requirement and produces 2-D coefficients at a latency of only three cycles. In this way, a lot of parallelism is introduced to save clocks, as shown in the data sequencing of Table I. This architecture has a critical path of four adder delays, which can be reduced to only two adder delays by inserting pipeline registers.

Table I DATA SEQUENCE OF RPU AND CPU

Clk   Input          1D DWT Output   2D DWT Output
1     X1,1 ; X1,2
2     X2,1 ; X2,2    L1,1 ; H1,2
3     X1,3 ; X1,4    L2,1 ; H2,2
4     X2,3 ; X2,4    L1,3 ; H1,4     LL1,1 ; LH1,2
5     X1,5 ; X1,6    L2,3 ; H2,4     HL2,1 ; HH2,2
6     X2,5 ; X2,6    L1,5 ; H1,6     LL1,3 ; LH1,4
7     X1,7 ; X1,8    L2,5 ; H2,6     HL2,3 ; HH2,4
..    ....           ....            ....

Figure 1. Pipelined 2-D DWT Architecture

Figure 2. Row Processing Unit (RPU)

III. ASIC IMPLEMENTATION AND PERFORMANCE ANALYSIS

The proposed architecture is implemented using the UMC 180 nm technology standard cell library with a clock frequency of 200 MHz. The RTL design is synthesized using Synopsys Design Vision with the standard cell library to estimate power and area. The gate-level netlist obtained after synthesis is excited using test vectors for verification, and the resulting waveforms are shown in Fig. 3. The power estimated by Design Vision is based on the voltage, capacitance, and time values provided as a standard.

Figure 3. Post-Synthesis Simulation Results of Gate-Level Netlist

The netlist synthesized by Design Vision is then imported into Cadence SoC Encounter for ASIC implementation. SoC Encounter places and routes the imported netlist according to the user constraints on area and speed and produces a circuit-level netlist. The detailed report from SoC Encounter is given in Table II. Post-layout circuit power is calculated using Synopsys PrimePower for a 2-D DWT operation on a 256×256 image tile. The post-routed netlist is simulated with the same input vectors to see the effect of wire load, and the corresponding waveforms are shown in Fig. 4. It is clearly visible from Fig. 3 and Fig. 4 that the post-synthesis simulation matches the post-route simulation for the same input vectors.

Table II CHIP LAYOUT REPORTS OF PROPOSED ARCHITECTURE USING SOC ENCOUNTER

Parameter                        Value
Standard Cells                   28459
Total area of Standard Cells     0.7386 mm²
Total area of Core               0.7388 mm²
Core Density                     99.976 %
Total Wire Length                0.3611 um
Cell with Maximum Capacitance    4.8244 pF

Figure 4. Post-Route (Post-Layout) Simulation
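The paper does not list the lifting equations evaluated inside the RPU and CPU of Section II. As a minimal sketch, assuming the reversible 5/3 lifting steps used in JPEG 2000 with symmetric boundary extension (an assumption, not taken from the paper), one level of the 1-D transform can be written as below. The function names legall53_forward and legall53_inverse are hypothetical. The right shifts by one and two play the role of the divide-by-two and divide-by-four operations mentioned in Section II.

```python
def legall53_forward(x):
    """One level of the reversible 5/3 lifting transform on a list of ints.

    Assumes an even-length input and symmetric extension at both ends.
    Returns (low, high): the low-pass and high-pass coefficient lists.
    """
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("this sketch assumes an even length of at least 2")
    even, odd = x[0::2], x[1::2]
    half = n // 2

    # Predict step: high[i] = odd[i] - floor((even[i] + even[i+1]) / 2)
    high = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]  # mirror at the right edge
        high.append(odd[i] - ((even[i] + right) >> 1))

    # Update step: low[i] = even[i] + floor((high[i-1] + high[i] + 2) / 4)
    low = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]  # mirror at the left edge
        low.append(even[i] + ((left + high[i] + 2) >> 2))
    return low, high


def legall53_inverse(low, high):
    """Undo legall53_forward by running the lifting steps in reverse order."""
    half = len(low)
    even = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]
        even.append(low[i] - ((left + high[i] + 2) >> 2))
    odd = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        odd.append(high[i] + ((even[i] + right) >> 1))
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x
```

For the ramp input [10, 12, 14, 16, 18, 20, 22, 24] this sketch gives low = [10, 14, 18, 23] and high = [0, 0, 0, 2]; the only nonzero high-pass value comes from the mirrored right edge. Because each lifting step is undone exactly, the inverse recovers the integer input without loss.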

330 2011 International Symposium on Integrated Circuits
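The input order of the dual scan described in Section II and listed in the Input column of Table I can be sketched as a small generator. This is only an illustration: the name dual_scan is hypothetical, indices are 1-based to match Table I, even image dimensions are assumed, and the move to the next row pair after one pair is finished is an assumption, since the paper only states that the process is repeated until the end.

```python
def dual_scan(n_rows, n_cols):
    """Yield (clk, first_pixel, second_pixel) in dual-scan order.

    Each pixel is a 1-based (row, col) tuple. Two horizontally adjacent
    pixels are read per clock, alternating between the two rows of a pair.
    """
    if n_rows % 2 or n_cols % 2:
        raise ValueError("this sketch assumes even image dimensions")
    clk = 1
    for top in range(1, n_rows, 2):        # row pairs (1,2), (3,4), ...
        for col in range(1, n_cols, 2):    # column pairs (1,2), (3,4), ...
            for row in (top, top + 1):     # alternate rows on successive clocks
                yield clk, (row, col), (row, col + 1)
                clk += 1


if __name__ == "__main__":
    # Print the first clocks for a two-row strip, in the notation of Table I.
    for clk, a, b in dual_scan(2, 8):
        print(f"{clk}  X{a[0]},{a[1]} ; X{b[0]},{b[1]}")
```

Under these assumptions an N×N image takes N²/2 clocks to read, which agrees with the two-inputs-per-clock rate stated for the architecture.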


Table III COMPARISONS OF CHIP LEVEL IMPLEMENTATION

Parameter       Liao et al. [5]        Lai et al. [7]          Proposed
Specification   2-D 9/7 filter (RA)    2-D 5/3 & 9/7 filter    2-D 5/3 filter
Technology      TSMC 180nm CMOS        TSMC 180nm CMOS         UMC 180nm CMOS
Core Area       2.25 mm²               0.704 mm²               0.7388 mm²
Frequency       50 MHz                 100 MHz                 200 MHz
Power           ---                    102.6 mW                13.38 mW

IV. CONCLUSION

Our design is compared with other chip-level implementations in Table III. The core size of our design is slightly larger than that in [7] but much smaller than that in [5], which is a recursive architecture (RA), and the power is reduced drastically. The increase in core area is negligible in view of the power reduction achieved. The reduction in power is due to the optimized design using pipelining, dual scanning, and fewer transpose buffers compared to other familiar architectures for the same throughput rate. Both designs use a 256×256 image tile for design evaluation.

REFERENCES

[1] S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989.
[2] I. Daubechies and W. Sweldens, "Factoring Wavelet Transforms into Lifting Schemes," The Journal of Fourier Analysis and Applications, vol. 4, no. 1, pp. 247-269, 1998.
[3] P. C. Wu and L. G. Chen, "An efficient architecture for two-dimensional discrete wavelet transform," IEEE Trans. Circuits and Systems for Video Technology, vol. 11, no. 4, pp. 536-545, Apr. 2001.
[4] Chengyi Xiong, Jinwen Tian, and Jian Liu, "Efficient Architectures for Two-Dimensional Discrete Wavelet Transform Using Lifting Scheme," IEEE Transactions on Image Processing, vol. 16, no. 3, March 2007.
[5] Hongyu Liao and Mrinal Kr. Mandal, "Efficient Architecture for 1-D and 2-D Lifting Based Wavelet Transform," IEEE Transactions on Signal Processing, vol. 52, no. 5, May 2004.
[6] Bing-Fei Wu and C-F. Lin, "A High-Performance and Memory-Efficient Pipeline Architecture for the 5/3 and 9/7 Discrete Wavelet Transform of JPEG2000 Codec," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 12, December 2005.
[7] Yeong-Kang Lai, L-F. Chen, and Y-C. Shih, "A High-Performance and Memory-Efficient VLSI Architecture with Parallel Scanning Method for 2-D Lifting-Based Discrete Wavelet Transform," IEEE Transactions on Consumer Electronics, vol. 55, no. 2, May 2009.
[8] M. Ferretti and D. Rizzo, "A parallel architecture for the 2-D discrete wavelet transform with integer lifting scheme," J. VLSI Signal Processing, vol. 28, pp. 165-185, July 2001.
[9] P-C. Tseng, C-T. Huang, and L-G. Chen, "Generic RAM-based architecture for two-dimensional discrete wavelet transform with line-based method," IEEE Trans. Circuits Syst. Video Technology, vol. 15, no. 7, July 2005.
