
A Low Complexity Platform-Based Psychoacoustic Model (PAM) for MPEG-2/4 AAC Encoder

Hsing-Chuang Liu
Department of Electrical Engineering
National Central University
Chung-Li, 32001, Taiwan, R.O.C
Phone: 886-3-4227151, Ext. 34580, Fax: 886-3-4255830
Email: metero@dsp.ee.ncu.edu.tw




Abstract
Since MP3 was published and became a popular consumer format, digital audio has become an
important part of daily life. Its applications include broadcast systems (DAB/DAB+), portable
players such as the iPod, and mobile phones. The Moving Picture Experts Group (MPEG)
proposed the MPEG-2/4 AAC standard as the next-generation audio coding technique. AAC
outperforms MP3 in both performance and compression ratio, but its algorithm is more
complex and computation-intensive. Reducing the computation while maintaining quality is
therefore a major challenge for an AAC encoder.
In this thesis, we optimize the key component of the MPEG-2/4 AAC encoder, the
psychoacoustic model (PAM), which uses several complicated functions to model the human
auditory system. To keep the cost low, this work uses a memory-based architecture for the
filterbank, a DSP-oriented threshold generator, shared memory, and a coefficient-merging
scheme. A fully pipelined MDCT and a fast filterbank algorithm improve performance.
Moreover, cache registers, clock gating, operand isolation, and multi-Vth cells reduce power
consumption. According to the synthesis results, our PAM requires 43k gates in TSMC
0.13 µm CMOS technology, operates at 3.1 MHz, and consumes 3.69 mW for the AAC encoder.
We also integrate the design into an SoC platform and verify it on that platform.



2008 7 14


Content
Abstract
Content
List of Figures
List of Tables
Chapter 1 Introduction
1.1 The History and Feature of Audio Application
1.2 The MPEG-2/4 AAC and HE-AAC v1/v2, SLS Encoder System
1.3 Overview of SoC Platform-Based Design
1.4 Motivation
1.5 Thesis Organization
Chapter 2 The Overview of MPEG-2/4 AAC Encoder
2.1 Filterbank
2.2.1 Window Shape Adaptation
2.2.2 Window Type Decision
2.2.3 Modified Discrete Cosine Transform
2.2 Psychoacoustic Model
2.3 The Other Signal Processing of AAC Encoder
Chapter 3 The Algorithm of Low Complexity MDCT-Based Psychoacoustic Model
3.1 Fast FFT-Based MDCT Algorithm
3.3 Low Complexity MDCT-Based Psychoacoustic Model
Chapter 4 The Architecture of Low Complexity Psychoacoustic Model
4.1 Architecture of PAM
4.2 Design of MDCT
4.3 Design of Threshold Generator
Chapter 5 Platform-Based Design of Low Complexity Psychoacoustic Model
5.1 Design Approach of MPEG-2/4 AAC Codec
5.2 Software/Hardware Development
5.2.1 Software development
5.2.2 Deliverable IP development
5.3 Software/Hardware Co-simulation and Co-verification
Chapter 6 Implementations and Results
6.1 Performance Evaluation
6.2 Power Analysis and Evaluation
6.3 Design For Testing Strategy
6.4 Comparison
Chapter 7 Conclusions
Reference


List of Figures

Fig. 1.1 The block diagram of MPEG-2/4 AAC encoder
Fig. 1.2 The principle of HE-AAC v1 (SBR)
Fig. 1.3 The block diagram of HE-AAC v1 (SBR)
Fig. 1.4 The principle of HE-AAC v2 (PS)
Fig. 1.5 The block diagram of HE-AAC v2 (PS)
Fig. 1.6 The block diagram of AAC-SLS (Lossless)
Fig. 2.1 The block diagram of MPEG-2 AAC encoder
Fig. 2.2 The block diagram of MPEG-4 AAC encoder
Fig. 2.3 The different window shape of MPEG-2/4 AAC standard
Fig. 2.4 Window shape transfer mechanism for different window shapes
Fig. 2.5 The masking effect of the human hearing
Fig. 2.6 The block diagram of PAM
Fig. 2.7 The detail data flow of PAM
Fig. 2.8 The quantization loop flow chart of AAC encoder
Fig. 3.1 The block diagram of AAC encoder with different PAM
Fig. 3.2 The flow chart of MDCT/IMDCT/FFT
Fig. 3.3 The signal flow graph of radix-2^3 DIT algorithm
Fig. 3.4 (a) The coefficient table merge scheme I
Fig. 3.4 (b) The coefficient table merge scheme II
Fig. 3.5 The pseudo code of the spreading function
Fig. 3.6 The linear array method of look-up table
Fig. 3.7 The window shape decision scheme of Huang's method
Fig. 3.8 The flow chart of logarithmic-based threshold generator
Fig. 4.1 (a) The original MDCT-based architecture
Fig. 4.1 (b) The folding architecture design for MDCT-based PAM
Fig. 4.2 The utilization of memory usage between original and rescheduled
Fig. 4.3 The memory requirement of different approach: (a) local memory (b) shared memory
Fig. 4.4 The block diagram of hardware shared design for MDCT/IMDCT/FFT
Fig. 4.5 Architecture design of coefficient generator
Fig. 4.6 The signal flow chart of butterfly unit
Fig. 4.7 Architecture design of butterfly unit
Fig. 4.8 (a) Pre/Post twiddle operation timing chart of pipeline
Fig. 4.8 (b) Butterfly unit timing chart of pipeline
Fig. 4.9 The memory-based architecture of (a) original (b) with cache register
Fig. 4.10 The memory access scheme of original and cache register design
Fig. 4.11 The block diagram of DSP-oriented TG design
Fig. 5.1 The property of HW/SW co-design approach
Fig. 5.2 The block diagram of SoC platform
Fig. 5.3 Design flow of constructed deliverable IP
Fig. 5.4 The timing chart of HW/SW co-design (a) encoder flow (b) decoder flow
Fig. 5.5 The IP design with AMBA-slave wrapper
Fig. 5.6 The co-verification by SystemC tool (CoWare Platform Architect)
Fig. 6.1 The cell-based IC design flow
Fig. 6.2 (a) The cycle counts compare with previous design with each application
Fig. 6.2 (b) The ROM table size compare with previous design with each application
Fig. 6.2 (c) The Memory access counts compare with previous design with each application


List of Tables

Table 1.1 Brief history and features of MPEG audio standards
Table 1.2 The complexity of MPEG-2/4 AAC encoder
Table 2.1 Coding tool usage of different profile AAC encoder
Table 3.1 The complexity analysis of MDCT (N=2048)
Table 3.2 The table merge scheme of each reference coefficient
Table 6.1 The power analysis of each application
Table 6.2 Memory utilization compare with previous design
Table 6.3 ROM table utilization compare with previous design
Table 6.4 Cycle counts compare with previous design
Table 6.5 Logic gate compare with previous design


Chapter 1
Introduction

In multimedia applications, audio is ubiquitous in consumer products, from internet audio to
digital audio broadcasting, and especially in portable devices such as Apple's iPod, MP3
players, and mobile phones. MPEG-1 Layer 3 (MP3) [1] is the most widespread audio
standard because it delivers CD-like audio quality at about 10% of the CD bit-rate.
MPEG-2/4 Advanced Audio Coding (AAC) [2-4] has since become increasingly popular thanks
to its high compression ratio and high audio quality, including support for multi-channel and
surround sound. AAC achieves the same audio quality as MP3 at a higher compression ratio
by using several more complex algorithms in its encoding and decoding flow.

1.1 The History and Feature of Audio Application

The first revolution of the digital audio industry was the invention of the Compact Disc (CD)
in 1982, which offers high-quality stereo audio at a sampling rate of 44.1 kHz. As mobile and
internet applications became more and more popular, the CD format proved unsuitable for
transmission in low bit-rate, bandwidth-limited environments. Table 1.1 briefly lists the
evolution of the MPEG audio family.

The first MPEG audio standard, MPEG-1 [1], appeared in 1992 and posed a new challenge for
mobile and internet technology. It defines three layers for different communication-based and
storage-based applications, such as Digital Audio Broadcasting (DAB), synchronized video
and audio on CD-ROM, and the Integrated Services Digital Network (ISDN). MP3 (MPEG-1
Layer 3) became popular worldwide and is a milestone in audio compression because of its
high compression ratio: it consumes only about 10% of the CD bit-rate while maintaining
transparent audio quality.

MPEG-2 [2][3] and MPEG-4 [4] were proposed in 1994 and 1998. The first version of MPEG-2
(MPEG-2 BC, backward compatible) [2], created in 1994, is the multi-channel extension of
MPEG-1. Compared with MPEG-1, it adds multi-channel capability and more flexible
sampling-rate and bit-rate support. In 1997, MPEG-2 NBC (non-backward compatible) [3],
later named MPEG-2 AAC, was created for High-Definition Television (HDTV) and other
high-quality applications. Its main feature is that it changes the hybrid filterbank scheme to
obtain higher frequency resolution. The MPEG-4 AAC standard is almost the same as MPEG-2
AAC; it further improves coding efficiency by adding Perceptual Noise Substitution (PNS),
Long Term Prediction (LTP), and TWIN-VQ to MPEG-2 AAC.

In 2003, Spectral Band Replication (SBR) was proposed and became the prime coding tool of
the MPEG-4 AAC extension named MPEG-4 HE-AAC v1 (High Efficiency AAC) [5]. SBR
reconstructs the high-frequency band from the low-frequency spectrum information. This
technique significantly improves compression efficiency, reducing the bit consumption of the
original audio encoder by about 50%. It reaches perceptually transparent quality at 64 kbit/s
per channel and can also be used in multi-channel schemes. In 2004, MPEG published a
further coding tool, MPEG-4 HE-AAC v2 [6], which exploits parametric stereo (PS) coding:
the multi-channel spectrum is reconstructed from single-channel information. It provides
good audio quality at bit-rates of around 16 to 24 kbit/s for stereo content.

Today's multimedia applications require higher audio quality than the earlier standards
provide. To satisfy this requirement, MPEG and industry proposed HD-audio and lossless
audio schemes. In 2005, two lossless coding standards were published: MPEG-4 Audio
Lossless Coding (ALS) [7] and MPEG-4 Scalable Lossless Coding (SLS) [8]. Lossless audio
coding compresses digital audio data without any loss of quality and is suited to professional
and high-end consumer applications. ALS uses linear prediction for de-correlation: each
sample of the original signal is predicted from previous samples, and the difference between
the original and the predicted signal is called the residual. If the prediction works well, the
residual is smaller than the original value, and it is usually coded with a simple entropy code
such as a Rice code. SLS is a scalable transform-based coder that gradually refines the
description of the transform coefficients and selects perceptually weighted reconstruction
levels. It uses the Integer MDCT (IntMDCT) to avoid arithmetic errors and achieve lossless
reconstruction. Because the SLS coding tool includes an integer transform and entropy
coding, it can also be used as a stand-alone lossless codec without the AAC kernel. With all of
these additional MPEG-2/4 coding tools, different applications can choose the enhancement
tools that suit a low bit-rate or a high-quality target.
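
The ALS principle described in this section (predict, keep the residual, entropy-code it with a Rice code) can be illustrated with a minimal sketch. This is not the MPEG-4 ALS algorithm: the fixed second-order predictor, the signed-to-unsigned mapping, the Rice parameter k, and all function names are assumptions chosen only to make the idea concrete.

```python
def lpc_residual(samples):
    """Residual of a fixed 2nd-order predictor: pred[n] = 2*x[n-1] - x[n-2].
    (Illustrative predictor, not the adaptive one used by real codecs.)"""
    residual = []
    for n, x in enumerate(samples):
        prev1 = samples[n - 1] if n >= 1 else 0
        prev2 = samples[n - 2] if n >= 2 else 0
        residual.append(x - (2 * prev1 - prev2))
    return residual


def lpc_reconstruct(residual):
    """Invert lpc_residual exactly; integer arithmetic means no loss."""
    out = []
    for n, r in enumerate(residual):
        prev1 = out[n - 1] if n >= 1 else 0
        prev2 = out[n - 2] if n >= 2 else 0
        out.append(r + 2 * prev1 - prev2)
    return out


def rice_encode(values, k):
    """Rice-code signed integers: map to unsigned, then write the quotient
    in unary and the k-bit remainder in binary."""
    bits = []
    for v in values:
        u = 2 * v if v >= 0 else -2 * v - 1       # signed -> unsigned
        bits.append("1" * (u >> k) + "0")         # unary quotient
        if k > 0:
            bits.append(format(u & ((1 << k) - 1), "0%db" % k))
    return "".join(bits)


def rice_decode(bits, k, count):
    """Decode `count` Rice-coded signed integers from a bit string."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                                   # skip the terminating "0"
        r = int(bits[pos:pos + k], 2) if k > 0 else 0
        pos += k
        u = (q << k) | r
        values.append(u // 2 if u % 2 == 0 else -(u + 1) // 2)
    return values


pcm = [0, 3, 7, 12, 16, 19, 20, 18]               # made-up smooth samples
res = lpc_residual(pcm)
code = rice_encode(res, k=1)
assert lpc_reconstruct(rice_decode(code, 1, len(res))) == pcm
print(res, len(code), "bits")
```

Because the samples vary smoothly, the residuals stay close to zero and the Rice code spends only a few bits on each of them, which is the effect the paragraph describes.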


Table 1.1: Brief history and features of MPEG audio standards.

Year  Standards                   Sampling rate (kHz)  Bit-rate (kbits/sec)  Channels
1992  MPEG-1 Layer I              32, 44.1, 48         32 - 448              1 - 2
1992  MPEG-1 Layer II             32, 44.1, 48         32 - 384              1 - 2
1992  MPEG-1 Layer III            32, 44.1, 48         32 - 320              1 - 2
1994  MPEG-2 Layer I              32, 44.1, 48         32 - 448              1 - 5.1
                                  16, 22.05, 24        32 - 256
1994  MPEG-2 Layer II             32, 44.1, 48         32 - 384              1 - 5.1
                                  16, 22.05, 24        8 - 160
1994  MPEG-2 Layer III            32, 44.1, 48         32 - 384              1 - 5.1
                                  16, 22.05, 24        8 - 160
1997  MPEG-2 AAC                  8 to 96 (*)          8 - 64 per channel    1 - 96
1998  MPEG-4 AAC                  8 to 96 (*)          8 - 64 per channel    1 - 96
2003  MPEG-4 HE-AAC (SBR)         8 to 96 (*)          8 - 64 per channel    1 - 96
2004  MPEG-4 HE-AAC (PS)          8 to 96 (*)          8 - 64 per channel    1 - 96
2005  MPEG-4 ALS                  Based on linear prediction coding scheme
      (Lossless coding)
2005  MPEG-4 SLS                  Based on original AAC with extra integer transform and
      (Lossless coding)           entropy coding

(*) Supported sampling rates: 8, 11.025, 12, 16, 22.05, 24, 32, 44.1, 48, 64, 88.2, 96 kHz


1.2 The MPEG-2/4 AAC and HE-AAC v1/v2, SLS Encoder System

Original MPEG-2/4 AAC encoder coding scheme

The MPEG-2/4 AAC encoder consists of many functional units, such as the filterbank
(windowing followed by the modified discrete cosine transform, MDCT), the psychoacoustic
model (PAM), the quantization loop (Q Loop), joint stereo coding, and temporal noise shaping
(TNS). Figure 1.1 shows the encoder flow. The time-domain audio samples (PCM data) are fed
into the filterbank to obtain the frequency spectrum. The PAM calculates the
Signal-to-Masking Ratio (SMR), which determines the quantization precision in the Q Loop,
and selects the window shape used by the filterbank. After the MDCT converts the
time-domain data into a frequency spectrum, the MDCT coefficients are passed to the SPP
stage, which removes redundancy and irrelevance by joint stereo coding, mid/side coding,
and TNS. Finally, in the Q Loop, the spectrum is non-uniformly quantized and noiselessly
coded according to the masking threshold and the number of available bits, so that the
audible quantization error is minimized.

Fig 1.1: The block diagram of MPEG-2/4 AAC encoder


HE-AAC v1 (SBR)

SBR uses the low-frequency components of the audio spectrum to reconstruct the high-band
information, so the SBR bit-stream stores only the low-frequency spectrum and control
signals. Figure 1.2 shows the principle and the spectrum recovery scheme, and Figure 1.3
depicts the block diagram of HE-AAC v1. The SBR coding scheme adds several tools to the
AAC kernel, such as the Analysis Quadrature Mirror Filterbank (AQMF), the Envelope Data
Calculator, SBR-related modules, and a down-sampler. HE-AAC v1 (SBR) is a dual-rate
system: the audio signal at the full sampling rate is fed directly into the SBR encoder and the
down-sampler, and the half-rate output of the down-sampler is fed into the AAC encoder.
The SBR encoder calculates control parameters that make the reconstructed high band
perceptually as similar as possible to the original high band.

Fig.1.2 The principle of HE-AAC v1 (SBR)

Fig.1.3 The block diagram of HE-AAC v1 (SBR)


HE-AAC v2 (PS)
The HE-AAC v2 bit-stream is obtained by downmixing the stereo audio to mono and applying
the SBR coding tool. The HE-AAC v2 decoder uses the mono audio information and 2-3 kbit/s
of side information (the Parametric Stereo information) to recover a perceptually transparent
multi-channel audio signal. Figure 1.4 shows the principle. In the PS coding scheme, only one
audio channel is transmitted together with the parametric side information. The bit-rate is
therefore spent on a single mono channel (plus a small amount of PS side information),
which substantially improves the perceived audio quality compared with a standard stereo
stream at a similar bit-rate. Figure 1.5 shows the block diagram of HE-AAC v2.

Fig.1.4 The principle of HE-AAC v2 (PS)

Fig.1.5 The block diagram of HE-AAC v2 (PS)


AAC-SLS (Lossless audio coding)

AAC-SLS is a lossless audio coding scheme that consists of a basic (core) layer and a lossless
enhancement (LLE) layer, as shown in Figure 1.6. The core layer is simply an MPEG-2/4 AAC
audio codec. In the AAC-SLS encoder, the input audio signal is losslessly transformed by the
IntMDCT, which uses a lifting scheme [9] so that the forward and inverse transforms
reproduce exactly the same data. The lossless frequency spectrum is fed to the core-layer
AAC encoder to generate the core-layer bit-stream. The LLE layer calculates the residual
between the lossless spectrum and the information reconstructed from the basic layer, and
the encoder uses bit-plane coding with Golomb-Rice codes to improve entropy coding
efficiency. Depending on the available bandwidth and the application requirements, the user
can decode only the basic layer with an ordinary AAC codec to obtain a lossy signal that is as
close as possible to the original, or decode both the basic layer and the enhancement layer to
obtain the lossless signal.
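
The reason a lifting scheme makes an integer transform exactly invertible can be shown on a single plane rotation. The sketch below is an illustration under assumptions, not the IntMDCT itself: it factors one rotation into three lifting steps with rounding, and the function names and test values are hypothetical.

```python
import math


def lifting_rotate(x, y, angle):
    """Integer-to-integer approximation of the rotation
    (x, y) -> (x*cos - y*sin, x*sin + y*cos) using three lifting steps.
    Requires sin(angle) != 0."""
    t = (math.cos(angle) - 1.0) / math.sin(angle)
    s = math.sin(angle)
    x = x + round(t * y)      # step 1: only x changes
    y = y + round(s * x)      # step 2: only y changes
    x = x + round(t * y)      # step 3: only x changes
    return x, y


def lifting_rotate_inverse(x, y, angle):
    """Undo lifting_rotate exactly by subtracting the same rounded terms
    in reverse order."""
    t = (math.cos(angle) - 1.0) / math.sin(angle)
    s = math.sin(angle)
    x = x - round(t * y)
    y = y - round(s * x)
    x = x - round(t * y)
    return x, y


angle = math.pi / 8
for x, y in [(100, -37), (-5, 12), (32767, -32768)]:
    fx, fy = lifting_rotate(x, y, angle)
    # The round trip is lossless even though every step rounds.
    assert lifting_rotate_inverse(fx, fy, angle) == (x, y)
```

Each lifting step changes only one of the two values by a rounded function of the other, so the decoder can recompute the same rounded term and subtract it. The rounding error therefore never accumulates across the forward and inverse transforms.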

Fig.1.6 The block diagram of AAC-SLS (Lossless)


1.3 Overview of SoC Platform-Based Design

System-on-a-chip (SoC) design integrates many components of a computer or electronic
system into a single integrated circuit, and the concept is increasingly popular in IC design.
Generally speaking, an SoC includes digital, analog, and mixed-signal circuits to realize a
complicated system with various targets. However, an SoC design vendor may not be able to
supply every functional unit in the specification, and design reuse overcomes this problem.
A reusable design in an SoC is called a Virtual Component (VC) or Intellectual Property
(IP) [10]. With this approach, a hardware circuit or software module can be reused in
different platforms or designs. Today, SoC designs usually integrate many different IPs to
provide sufficient functionality in the system architecture.
A typical SoC platform-based design consists of one or more microprocessors or DSPs,
internal and external memory, and IPs. Hardware/software (HW/SW) co-design is a powerful
methodology for SoC design and can serve a wide range of applications. It is driven by
time-to-market constraints, system flexibility, and the performance improvement offered by
hardware accelerators. When targeting a complex design such as a multimedia or digital
communication system, HW/SW partitioning is very important, because a good partition
yields better performance. In general, the software part offers higher flexibility but lower
performance, whereas the hardware part provides higher performance for
computation-intensive functions.


1.4 Motivation

Generally speaking, architectures for audio applications can be classified into three
approaches: DSP-based, pure-ASIC, and semi-ASIC. A DSP-based design costs more and
consumes more power. Pure-ASIC designs with physical implementations have been
proposed in previous papers, but a pure-ASIC design loses flexibility. A semi-ASIC
architecture combines the flexibility of a DSP-based design with the high performance of a
pure ASIC. As shown in [11], a semi-ASIC design uses HW/SW co-design to complete the
system, reaching a compromise between performance and flexibility with a hardware
accelerator and an embedded processor. Given these benefits, our design adopts the
semi-ASIC approach to implement a high-performance, highly flexible MPEG-2/4 AAC
encoder.
To implement the MPEG-2/4 system in an embedded system or portable device, both low
complexity and high quality must be considered. Because a portable device has limited
resources, an audio codec for portable consumer devices faces many constraints, such as low
power, low cost, low memory requirements, and real-time operation. AAC encoder
implementations can be classified into programmable, PC-based [12][13], DSP-based
[14][15], RISC-based, and SW/HW co-design [16] approaches. In a portable device, the AAC
encoder still faces the challenges of a computation-intensive and complicated algorithm,
especially given the limited battery capacity. PC-based and RISC-based approaches may not
provide enough computing power to meet the real-time constraint and consume considerable
power, so they are not suitable for portable applications. Most papers discuss implementing
and optimizing AAC on DSP-based systems, which also suffer from high power consumption.
Y. Takamizawa et al. proposed a good method for a low-power, low-resource DSP-based
implementation, but it does not support the window shape switching mechanism, and
omitting window switching may degrade the audio quality. SW/HW co-design, as in Lu et
al. [16], combines the flexibility of software with the performance of hardware. However,
Lu's method decides the window shape in the time domain, and a window shape decision
based purely on time-domain information does not give reliable audio quality [17]. Our
previous design [18] is also a HW/SW design; it implements the PAM as a hardware
accelerator module to speed up the encoding flow. However, that architecture has problems
with its special hardware units, register file, and resource sharing, and it does not use area
efficiently. The present design therefore modifies the previous architecture and improves its
performance, hardware cost, and efficiency using VLSI architecture design techniques
(pipelining, parallel processing, folding, and unfolding). After designing the hardware
accelerator, we also construct a deliverable IP for ARM or ARM-based systems.
Table 1.2 shows the complexity profile of the AAC encoder. The analysis was obtained on a PC
with Microsoft Visual C++ 6.0 by simulating the AAC LC (Low Complexity) encoder at a
sampling rate of 44.1 kHz and bit-rates of 128 and 64 kb/s. According to the profiling result,
the PAM and the Q Loop together account for over 90% of the computational load. The
algorithms of the PAM are analyzed in the following chapters; its special functions account
for most of its computational load. To improve system performance, we choose the PAM, the
most complex module, as the hardware accelerator and optimize its hardware design. We
also construct a platform-based component for the HW/SW co-design scheme. The essential
considerations in the hardware design are low complexity, low cost, low power, and the
real-time constraint, and we improve our previous design with all of these in mind.

Table 1.2 The complexity of MPEG-2/4 AAC encoder

                   Bit-rate 128 kb/s                 Bit-rate 64 kb/s
Functional unit    Complexity percentage   MOPS      Complexity percentage   MOPS
Filterbank         2.3%                    2.1       1.4%                    2.1
PAM                57.3%                   52.2      35.1%                   52.2
Q Loop             36.5%                   33        61.1%                   90.1
SPP                3.9%                    3.5       2.4%                    3.5
Total              100%                    90.8      100%                    147.9
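
As a quick check of the profiling figures, the short script below recomputes the totals and the per-unit shares from the MOPS columns of Table 1.2. It also estimates the software load that would remain if the PAM were moved entirely into hardware. That last figure rests on an assumption made here only for illustration, namely that an offloaded PAM costs the processor nothing. The dictionary layout and function names are likewise hypothetical.

```python
# MOPS figures copied from Table 1.2.
mops = {
    "128 kb/s": {"Filterbank": 2.1, "PAM": 52.2, "Q Loop": 33.0, "SPP": 3.5},
    "64 kb/s": {"Filterbank": 2.1, "PAM": 52.2, "Q Loop": 90.1, "SPP": 3.5},
}


def shares(units):
    """Percentage of the total MOPS spent in each functional unit."""
    total = sum(units.values())
    return {name: 100.0 * value / total for name, value in units.items()}


def remaining_after_offload(units, offloaded="PAM"):
    """Software MOPS left if one unit is moved to hardware (assumed free)."""
    return sum(value for name, value in units.items() if name != offloaded)


for rate, units in mops.items():
    pct = shares(units)
    print(rate, "total: %.1f MOPS" % sum(units.values()))
    for name in units:
        print("  %-10s %5.1f%%" % (name, pct[name]))
    print("  PAM + Q Loop: %.1f%%" % (pct["PAM"] + pct["Q Loop"]))
    print("  left in software without PAM: %.1f MOPS"
          % remaining_after_offload(units))
```

The recomputed totals match the table (90.8 and 147.9 MOPS). The recomputed shares agree with the printed percentages to within about 0.2 percentage points, presumably because the MOPS values are rounded. In both cases the PAM and the Q Loop together exceed 90% of the load.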


1.5 Thesis Organization

This thesis is organized into seven chapters, from the history of the MPEG family to our
MPEG-2/4 AAC encoder design. The content of each chapter is briefly described below.

Chapter 1 introduces the history of the MPEG audio family, the trend of current MPEG
audio development, and our motivation.

Chapter 2 gives an overview of the MPEG-2/4 AAC encoder and discusses the sub-modules
of the AAC LC profile: the filterbank, PAM, joint stereo coding, TNS, and Q Loop.

Chapter 3 describes algorithm-level optimization methods for the MPEG-2/4 AAC encoder.
It introduces the MDCT-based PAM and presents techniques that reduce the complexity of
the arithmetic computation.

Chapter 4 focuses on the architecture design of the MDCT-based PAM. We divide the
hardware into two parts: a hardware-sharing design for the filterbank and a DSP-like design
for the Threshold Generator (TG). The TG implements steps 3 to 13 of the PAM algorithm.

Chapter 5 introduces the platform-based design of our system. The platform includes an
embedded CPU and hardware IP. This chapter discusses the construction of the
platform-based system, the construction of the deliverable IP, and software/hardware
co-simulation and co-verification.

Chapter 6 presents the implementation results and discusses the low-power techniques
applied at the physical level. It also compares our results with previous work.

Chapter 7 concludes the thesis.


Chapter 2
The Overview of MPEG-2/4 AAC Encoder

Chapter 1 introduced the basic functional blocks of the AAC encoder. This chapter discusses
the algorithms of each sub-module of the AAC encoder in the Low Complexity (LC) profile.
MPEG-2/4 AAC provides state-of-the-art techniques for achieving transparent quality at low
bit-rates. Figures 2.1 and 2.2 show the detailed encoder flows of MPEG-2 AAC and MPEG-4
AAC. The two systems are very similar; MPEG-4 is an enhanced version of MPEG-2. The
basic coding tools are the filterbank, the psychoacoustic model, TNS, joint coding, and the
quantization loop. The enhancement tools of MPEG-4 are Long Term Prediction (LTP),
Perceptual Noise Substitution (PNS), and Transform-domain Weighted Interleave Vector
Quantization (TWIN-VQ). The tools common to MPEG-2 and MPEG-4 are: 1. gain control,
2. psychoacoustic model, 3. filterbank, 4. prediction, 5. quantization and coding,
6. noiseless coding, 7. Temporal Noise Shaping (TNS), 8. Mid/Side (M/S) stereo coding,
9. intensity stereo coupling, and 10. bitstream multiplexing.

Fig.2.1 The block diagram of MPEG-2 AAC encoder

Fig.2.2 The block diagram of MPEG-4 AAC encoder

The AAC standard defines three profiles, the Main profile, the Low Complexity (LC) profile,
and the Scalable Sampling Rate (SSR) profile, to serve applications with different network
bandwidths and storage capacities. Table 2.1 lists the tools used by the encoder in each
profile. The Main profile uses all coding tools except gain control; it demands substantial
processing power and yields the highest coding efficiency. The LC profile applies less
compression to save processing power and memory, and it is used in many applications
because it achieves sufficient quality with low complexity and memory usage. The SSR
profile has the lowest computational complexity and, thanks to its frequency scalability, can
adapt to various bandwidths. The three profiles represent different trade-offs between
performance and quality.


Table 2.1 Coding tool usage of different profile AAC encoder


Tools                   Main    LC        SSR
Gain control            NO      NO        YES
Psychoacoustic model    YES     YES       YES
Filterbank              YES     YES       YES
Prediction              YES     NO        NO
Quantization/coding     YES     YES       YES
Noiseless coding        YES     YES       YES
TNS                     YES     Limited   Limited
Mid/Side (M/S)          YES     YES       YES
Intensity coupling      YES     NO        NO

Power consumption and computational complexity are the major considerations in portable
consumer applications. Given these constraints, we focus only on the LC profile as a
compromise between audio quality and the requirements of portable applications.

2.1 Filterbank

The filterbank provides critical sampling, block overlapping, and good frequency
selectivity. Critical sampling is achieved by sub-sampling in the frequency domain in
combination with overlapping blocks. Sub-sampling causes aliasing in the time domain, which
is cancelled by the overlap-and-add operation on two adjacent blocks in the synthesis filterbank.
This technique is called time domain aliasing cancellation (TDAC) [9]. The filterbank
consists of the window overlap operation and the modified discrete cosine transform
(MDCT). Each module of the filterbank is described below.


2.2.1 Window Shape Adaptation


The MDCT and window switching effectively reduce the redundancy of audio signals while
minimizing the pre-echo effect that commonly occurs in transform coders. To obtain
better spectral separation, MPEG-2/4 AAC supports two window shapes that
can be switched dynamically: the Kaiser-Bessel derived (KBD) window and the sine
window. The KBD window achieves better stop-band attenuation, while the sine window has
better pass-band selectivity.
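
The two window shapes can be generated numerically. The following is a minimal Python sketch, not taken from the thesis: the function names are hypothetical, and the KBD alpha values (4 for the long window, 6 for the short window) are the ones commonly quoted for AAC and are an assumption here. Both windows satisfy the power-complementary condition w(n)^2 + w(n + N/2)^2 = 1 that TDAC relies on.

```python
import math

def sine_window(length):
    """Sine window: w(n) = sin(pi * (n + 0.5) / length)."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def bessel_i0(x, terms=50):
    """Zeroth-order modified Bessel function, summed from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total

def kbd_window(length, alpha):
    """Kaiser-Bessel derived window built from a Kaiser kernel of length/2 + 1 taps."""
    half = length // 2
    kernel = [
        bessel_i0(math.pi * alpha
                  * math.sqrt(max(0.0, 1.0 - (2.0 * j / half - 1.0) ** 2)))
        for j in range(half + 1)
    ]
    total = sum(kernel)
    rising, running = [], 0.0
    for n in range(half):
        running += kernel[n]              # cumulative sum of the kernel
        rising.append(math.sqrt(running / total))
    return rising + rising[::-1]          # second half mirrors the first

# Assumed AAC parameters: alpha = 4 (2048-point), alpha = 6 (256-point).
long_kbd = kbd_window(2048, 4.0)
short_kbd = kbd_window(256, 6.0)
```

The KBD window trades a wider main lobe for lower side lobes, which is consistent with the stop-band versus pass-band remark above.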

2.2.2 Window Type Decision


The AAC standard specifies several window shapes and window lengths for the MDCT in order to
avoid the pre-echo effect while keeping transform efficiency. With a longer window, the MDCT
obtains higher resolution in the frequency domain. Conversely, a shorter window gives the
filterbank higher resolution in the time domain. The MPEG-2/4 AAC standard
defines four window types (LONG WINDOW, SHORT WINDOW, START
WINDOW, and STOP WINDOW), shown in Fig. 2.3. The long, start, and stop
windows are 2048 points long, and the short window is 256 points long. Every window type
overlaps the previous window by 50%. When the audio signal is stationary, the long window is
used for better frequency resolution and transform efficiency. When the signal is transient,
the short window is used to avoid the pre-echo effect. The short window gives higher
precision in the time domain, but it adds computation and side information
compared with the long window.

Fig 2.3. The different window shape of MPEG-2/4 AAC standard.


Selecting among the window types therefore trades coding efficiency against time resolution,
which raises the question of how to choose the window type. The AAC encoder uses the
Perceptual Entropy (PE) estimated by the PAM to determine the signal property and select the
window type. The selection result decides the window type of the current frame and drives
the window type transition mechanism. The state diagram of the window decision is depicted
in Fig. 2.4. The START WINDOW and the STOP WINDOW act as buffers when the window type
changes from LONG to SHORT and from SHORT to LONG, respectively.

Fig.2.4 Windows shape transfer mechanism for different windows shape
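
The transition mechanism can be sketched as a small state machine. This is an illustrative Python sketch under stated assumptions: the transition set is the usual AAC one (LONG, START, SHORT, STOP) and may differ in detail from Fig. 2.4, and the function names and the PE threshold parameter are hypothetical.

```python
LONG, START, SHORT, STOP = "LONG", "START", "SHORT", "STOP"

def is_transient(pe, pe_threshold):
    """Hypothetical transient test: PE above a tuning threshold."""
    return pe > pe_threshold

def next_window(previous, transient):
    """Window type of the next frame, given the previous type and a transient flag."""
    if previous == LONG:
        return START if transient else LONG    # START buffers LONG -> SHORT
    if previous == START:
        return SHORT                           # START is always followed by SHORT
    if previous == SHORT:
        return SHORT if transient else STOP    # STOP buffers SHORT -> LONG
    # previous == STOP (assumed: a new transient re-enters through START)
    return START if transient else LONG

def window_sequence(flags, initial=LONG):
    """Apply next_window over a list of per-frame transient flags."""
    sequence, state = [], initial
    for flag in flags:
        state = next_window(state, flag)
        sequence.append(state)
    return sequence
```

Because START must precede SHORT, a real encoder needs the transient decision slightly ahead of the frame being coded; the sketch leaves that look-ahead to the caller.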

2.2.3 Modified Discrete Cosine Transform


The modified discrete cosine transform (MDCT) is a Fourier-related transform based on
the type-IV discrete cosine transform (DCT-IV), with the additional property of being lapped. It
is designed to be performed on consecutive blocks of a larger data set, where subsequent
blocks overlap so that the last half of one block coincides with the first half of the next
block. This overlapping, in addition to the energy-compaction qualities of the DCT, makes the
MDCT especially attractive for signal compression applications, since it helps to avoid
artifacts stemming from the block boundaries. The MDCT used in MPEG AAC employs TDAC to
eliminate the aliasing between two consecutive frames. The MDCT is written as follows:

$$X_t(m) = \frac{2}{N} \sum_{k=0}^{2N-1} w(k)\, x(k) \cos\left[\frac{\pi}{4N}(2m+1)(2k+N+1)\right], \quad m = 0, 1, \ldots, N-1$$

Where

$X_t(m)$ is the frequency-domain spectral coefficient of block t.
w(k) is the window coefficient.
x(k) is the time-domain input sequence.
N is the number of spectral coefficients per block (the transform window spans 2N samples).
k is the time-domain index.
m is the spectral-domain index.
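
To make the formula and the TDAC property concrete, here is a direct O(N^2) Python sketch. It is illustrative only: the function names are hypothetical, the 2/N forward scale follows the formula above, and the unit inverse scale is an assumption chosen so that windowed overlap-add reconstructs the input when the window satisfies w(k)^2 + w(k + N)^2 = 1.

```python
import math

def sine_window(length):
    """Sine window, which satisfies the power-complementary condition."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def _phase(m, k, n):
    """Cosine argument of the MDCT kernel for N = n coefficients."""
    return math.pi / (4 * n) * (2 * m + 1) * (2 * k + n + 1)

def mdct(block, window):
    """Direct MDCT of one block of 2N windowed samples, giving N coefficients."""
    n = len(block) // 2
    return [
        (2.0 / n) * sum(window[k] * block[k] * math.cos(_phase(m, k, n))
                        for k in range(2 * n))
        for m in range(n)
    ]

def imdct(coeffs, window):
    """Inverse MDCT with synthesis windowing; output still contains aliasing."""
    n = len(coeffs)
    return [
        window[k] * sum(coeffs[m] * math.cos(_phase(m, k, n)) for m in range(n))
        for k in range(2 * n)
    ]

def overlap_add(first, second):
    """Add the second half of one output block to the first half of the next."""
    n = len(first) // 2
    return [first[n + i] + second[i] for i in range(n)]
```

Each inverse block on its own is aliased; only the overlap-added halves of two consecutive blocks reproduce the original samples, which is the cancellation that TDAC refers to.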

2.2 Psychoacoustic Model

The psychoacoustic model is the most important functional unit of perceptual audio coding.
It models the human auditory system and separates the audible components of the audio signal
from the inaudible ones. We first discuss the masking effect in the audio spectrum. Masking
refers to a process in which one sound is rendered inaudible by the presence of another
sound. In Fig. 2.5 [19], a loud signal masks two other signals at nearby frequencies. Signal
components that lie below the masking curve created by the loud signal are not heard while
the masker is present. Just as with the threshold in quiet, we can exploit this effect and
remove the inaudible signal components that lie under the new threshold.


Fig.2.5. The masking effect of the human hearing.

The masking threshold of each audio frame strongly affects audio quality and
coding efficiency. Accordingly, the psychoacoustic model performs two tasks in the MPEG
audio encoder: deciding the window shape for the filterbank, which lies outside the PAM, and
calculating the masking threshold, expressed as the signal-to-mask ratio (SMR). According to
the ISO/IEC 14496-3 [4] standard, the PAM can be arranged into 13 steps. The block diagram of
the PAM is shown in Fig. 2.6 and the detailed flow chart in Fig. 2.7.

Fig.2.6 The block diagram of PAM


[Flow chart labels: Steps 1-2 (FFT): r(w), i(w); Steps 3-4: e(b), c(b); Step 5 (spreading function): cb(b), en(b); Step 6: tb(b), tonality index; Steps 7-10: nb(b), threshold; Step 11: PE; Step 12: block type; Step 13: SMR(n)]
Fig.2.7. The detail data flow of PAM

In the PAM, the 13 steps for calculating the masking threshold are arranged as follows: Steps 1-2
perform the time-to-frequency mapping; Steps 3-10 calculate the masking
thresholds; Steps 11-12 determine the window shape; Step 13 outputs the
final ratios of signal energy to threshold (SMR) for the Quantization Loop. In Steps 1-2, the
PAM normalizes the time-domain input samples and transforms them by FFT into a
frequency-domain spectrum with real part r(w) and imaginary part i(w). In Steps 3-4, the
real-part spectrum is used to calculate the partitioned energy e(b), and the imaginary-part
spectrum is used to calculate the weighted unpredictability measure c(b). In Step 5, the
partitioned energy and unpredictability are convolved with the spreading function to
estimate the masking effects across the partition bands. Step 6 estimates the tonality
index, which indicates how tone-like the signal is. Step 7 calculates the
Signal-to-Noise Ratio (SNR) and the masking energy threshold of each partition. Steps 8-10
estimate the masking curve of the spectrum. In Steps 11-12, the Perceptual Entropy (PE) is
calculated to determine the window shape; the window shape decision requires detecting
whether the frame contains a transient signal. Finally, the Signal-to-Mask Ratio (SMR) is
computed in Step 13 as the output. The indices w, b, and n refer to the spectral line domain,
the threshold calculation partition domain, and the coder scale-factor band domain, respectively.

2.3 The Other Signal Processing of AAC Encoder


Temporal Noise Shaping

Pre-echo is a problem in most block-based coding schemes. TNS applies prediction in the
frequency domain to shape the quantization noise in time so that pre-echoes and noise become
imperceptible. The encoder filters the original spectrum, quantizes the filtered spectrum,
and transmits the quantized filter coefficients in the bitstream. The filtering performed in
the encoder leads to a temporally shaped distribution of the quantization noise, so that,
when implemented correctly, the noise is not noticeable in the decoded audio signal. TNS is
applied only to long blocks, not to short blocks.

Joint Stereo Coding

Joint stereo coding can be classified into SS ("simple" or "L/R" stereo), MS ("mid-side"
stereo), and IS ("intensity" stereo). A joint stereo stream may employ a single coding
method, or it may switch among methods per frame or even per sub-frame for the sake of
efficiency or quality. The methods are introduced below. Simple stereo (SS), or
Left-Right (L/R) stereo, is the most straightforward method of coding a stereo signal: each
channel is treated as a completely separate entity. This can be inefficient and may adversely
affect quality when both channels contain nearly identical signals. Mid-side stereo coding
calculates a "mid" channel as the sum of the left and right channels and a "side" channel as
their difference. Mid-side stereo can significantly reduce the bit rate, because
the encoder can spend fewer bits on the side-channel information. M/S coding is a
special case of transform coding and retains the audio perfectly without introducing artifacts.
Finally, intensity stereo coding saves bit rate by replacing the left and right
channel signals with a single representative signal plus directional information. This
replacement is psychoacoustically justified in the higher frequency range, since the human
auditory system is insensitive to the signal phase at these frequencies. Intensity stereo is
by definition a lossy coding method, so it is primarily useful at low bit rates.
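
The mid-side transform described above can be sketched in a few lines of Python. The function names are hypothetical, and the 1/2 scaling of the sum and difference is an assumption (a common convention) rather than something stated in the text.

```python
def ms_encode(left, right):
    """Mid = (L + R) / 2, Side = (L - R) / 2, computed per spectral line."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Exact inverse of ms_encode: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

When the two channels are nearly identical the side channel is close to zero, which is where the bit-rate saving comes from; the transform itself is invertible, so any loss comes only from the subsequent quantization.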

Quantization Loop

Quantization is the combination of dividing and rounding a real quantity into a small
discrete number. A quantized value is more compact to store. However, quantization
introduces error: after multiplication, the reconstructed and original values may differ. The
quantizer of the AAC encoder uses two nested loops, an inner loop and an outer loop, to
encode the spectral data and reduce redundant information. The AAC encoder uses a
non-uniform quantizer with the nonlinear operation $|X|^{3/4}$. The equation of the
non-uniform quantizer is

$$x\_quantizer(i) = \mathrm{int}\left[\frac{|spectrum(i)|^{3/4}}{2^{\frac{3}{16}\left(gl - scf(i)\right)}} + 0.4054\right]$$

Where
spectrum(i) is the spectral data of the audio signal
gl is the global scale factor (rate-controlling parameter)
scf(i) is the scale factor (distortion-controlling parameter)
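
As a worked illustration of the quantizer equation, the sketch below assumes the common AAC reading in which the step size is 2^((gl - scf)/4), so that the 3/4 power turns it into 2^((3/16)(gl - scf)), and in which the rounding offset 0.4054 is added before truncation. The function name is hypothetical and only magnitudes are handled; the sign would be coded separately.

```python
def quantize_line(x, gl, scf):
    """Non-uniform quantization of one spectral line (magnitude only).

    gl  : global scale factor (assumed to coarsen all bands as it grows)
    scf : scale factor of the band (assumed to refine the band as it grows)
    """
    step_exponent = 3.0 / 16.0 * (gl - scf)
    return int(abs(x) ** 0.75 / 2.0 ** step_exponent + 0.4054)
```

For example, with gl equal to scf a line of magnitude 16 maps to 16^(3/4) = 8, and raising gl by 16 divides the compressed value by 2^3 = 8, giving 1.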

The quantization loop must satisfy two rules. The first is quality: the quantization noise
must stay below the masking threshold given by the SMR. The second is bit rate: the number of
bits consumed must be less than the number available. These two rules cannot always be
satisfied together, especially at low bit rates, so the standard defines two extra exit
rules: first, all scale factor bands have been amplified; second, the difference between two
consecutive scale factors exceeds 60. The flow of the quantization loop is shown in
Fig. 2.8.


Fig.2.8 The quantization loop flow chart of AAC encoder

First, the quantization loop initializes gl and scf(i). The rate control mechanism (inner
loop) then calculates the bit consumption; if it exceeds the specified requirement, the
inner loop adjusts gl and quantizes again until the bit consumption is within the
requirement. The control flow then enters the outer loop, in which the quantization noise is
estimated. The outer loop ends in one of two ways: either the quantization noise is below
the SMR curve, or control returns to the inner loop and quantization is repeated until the
exit rules (the primary rules or the extra rules) are satisfied.
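
The two nested loops can be sketched as follows. This is a simplified Python illustration, not the encoder of the thesis: `count_bits` is a hypothetical stand-in for noiseless coding, the quantizer uses the assumed step size 2^((gl - scf)/4), gl only ever increases, and the extra exit rules are checked before an amplification is applied so that the returned values stay consistent.

```python
def quantize(x, gl, scf):
    return int(abs(x) ** 0.75 / 2.0 ** (3.0 / 16.0 * (gl - scf)) + 0.4054)

def dequantize(q, gl, scf):
    return q ** (4.0 / 3.0) * 2.0 ** (0.25 * (gl - scf))

def count_bits(values):
    """Hypothetical bit estimate: magnitude bits plus a sign bit per non-zero value."""
    return sum(v.bit_length() + 1 for v in values if v)

def quantization_loop(bands, allowed_noise, bit_budget, max_diff=60):
    """bands: spectral lines grouped per scale factor band."""
    gl, scf = 0, [0] * len(bands)
    while True:
        # Inner loop (rate control): coarsen gl until the bit budget is met.
        while True:
            q = [[quantize(x, gl, scf[b]) for x in band]
                 for b, band in enumerate(bands)]
            if count_bits([v for band in q for v in band]) <= bit_budget:
                break
            gl += 1
        # Outer loop (noise control): find bands whose noise exceeds the threshold.
        noisy = [
            b for b, band in enumerate(bands)
            if sum((abs(x) - dequantize(v, gl, scf[b])) ** 2
                   for x, v in zip(band, q[b])) > allowed_noise[b]
        ]
        if not noisy:
            break                                   # primary rule: noise is masked
        trial = [s + 1 if b in noisy else s for b, s in enumerate(scf)]
        if all(s > 0 for s in trial):
            break                                   # extra rule 1: all bands amplified
        if any(abs(trial[b] - trial[b + 1]) > max_diff
               for b in range(len(trial) - 1)):
            break                                   # extra rule 2: scale factor jump
        scf = trial                                 # amplify noisy bands and retry
    return gl, scf, q
```

A production encoder would also track the best solution seen so far; the sketch only shows how the rate loop is nested inside the distortion loop.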


Chapter 3
The Algorithm of Low Complexity
MDCT-Based Psychoacoustic Model

The original encoder flow uses a filterbank with MDCT (one 2048-point LONG window
or eight 256-point SHORT windows) to transform the audio data from the time domain
to the frequency domain. The psychoacoustic module uses an FFT (with the same LONG and SHORT
window sizes) to achieve a similar function. With an MDCT-based PAM algorithm, the MDCT
spectrum can replace the complex FFT spectrum. The block diagrams of the original PAM scheme
and the MDCT-based PAM are shown in Fig. 3.1 (a) and (b).
The MDCT-based PAM algorithm was first published by Takamizawa [20], who found that the
spectral information from the MDCT is sufficient for the PAM calculation. Thus, the
time-to-frequency transform (FFT) in the PAM can be replaced by the MDCT and combined with
the filterbank outside the PAM, which removes the FFT computation. However, the
MDCT spectrum lacks the phase information carried by the imaginary part of the FFT. The
missing information can be substituted by a tonality estimate obtained from the Spectral
Flatness Measure (SFM) [21]. The MDCT-based PAM has two important advantages. One is the
reduction of computation: the original three transforms (one MDCT and two FFTs, LONG and
SHORT) are reduced to one transform (MDCT). The other is the reduction of frame memory,
because the SFM replaces the original calculation of unpredictability.
The MDCT-based PAM reduces the computation load and memory requirement,
but the window shape decision for the current frame becomes a problem. The AAC coding tools
apply dynamic window switching in the filterbank, so the window shape of the current frame
must be selected before the filterbank runs. If the window shape is not decided carefully,
the quality of the encoded signal degrades. In the MDCT-based PAM, however, the window shape
decision and the filterbank would have to be performed at the same time. To avoid
this problem, window shape decision in the time domain was proposed in [20][22].
Takamizawa [20] applies time-domain information but does not specify the details of the
technique. Dimkovic [22] also applies time-domain information, determining window shapes
from the difference between neighboring sub-blocks, and admits that tuning effort is
required to maintain quality. However, [23] shows that a window shape decision based purely
on time-domain information is not reliable enough for the quality requirement. Huang
[24][25] proposed three methods that reduce the computation load and memory requirement
while keeping transparent audio quality. They combine the advantages and remedy the
disadvantages of the previous designs. Method 1 simplifies the PAM algorithm. Method 2
addresses sound quality, because the simplified PAM affects the outputs (SMR and window
shape). Method 3 reduces the memory requirement and computation load, such as the size of
the look-up tables and the data memory. Huang's methods are discussed in more detail in the
following sections.
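
For readers unfamiliar with the Spectral Flatness Measure, the sketch below shows the textbook definition: the ratio of the geometric mean to the arithmetic mean of the power spectrum, expressed in dB, mapped to a tonality coefficient with a -60 dB reference. This is an assumption-laden illustration; how Huang's method applies the SFM per partition is not detailed here, and the function names and the `floor` guard are hypothetical.

```python
import math

def spectral_flatness_db(power, floor=1e-30):
    """10*log10(geometric mean / arithmetic mean) of a power spectrum."""
    safe = [max(p, floor) for p in power]          # avoid log(0)
    geometric = math.exp(sum(math.log(p) for p in safe) / len(safe))
    arithmetic = sum(safe) / len(safe)
    return 10.0 * math.log10(geometric / arithmetic)

def tonality_from_sfm(power, sfm_db_max=-60.0):
    """Tonality in [0, 1]: 0 for a flat (noise-like) spectrum, 1 for a pure tone."""
    alpha = spectral_flatness_db(power) / sfm_db_max
    return max(0.0, min(alpha, 1.0))
```

Because the measure needs only the magnitudes of the MDCT lines, it avoids both the phase information and the frame-to-frame prediction history that the unpredictability measure requires.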


(a) FFT-based PAM

(b) MDCT-based PAM


Fig 3.1 The block diagram of AAC encoder with different PAM

3.1 Fast FFT-Based MDCT Algorithm

Generally speaking, the MDCT is a complex transform requiring a large number of additions and
multiplications. Our design uses a fast algorithm for the MDCT to reduce the
computational complexity and improve performance. Various fast MDCT algorithms have
been proposed. Following [26], they can be classified into: (1) factorizing the MDCT
computation into a complex- or real-valued FFT, e.g. [27]; (2) mapping the MDCT to the
DCT-II through trigonometric equivalences and applying a fast DCT algorithm, e.g. [28];
(3) using trigonometric equivalences to convert the MDCT coefficients into twiddle-factor
form recursively, e.g. [29]; (4) using matrix decomposition to reduce the size from N to N/2
and then applying a DCT-II/IDCT-II kernel, e.g. [30].


Our design applies the FFT-based algorithm (1) to implement the MDCT, in consideration of
its VLSI architecture implementation. Based on [27], the MDCT formula can be
rewritten as

$$X(m) = e^{-\frac{2\pi i}{n}\left(m+\frac{1}{8}\right)} \sum_{k=0}^{n/4-1} \left( f_k\, e^{-\frac{2\pi i}{n}\left(k+\frac{1}{8}\right)} \right) e^{-\frac{2\pi i}{n/4} mk}, \quad m = 0 \sim n/4-1$$

Where
n = 2048 (long window), 256 (short window)
$$f_k = \left(f(2k) - f(n-2k-1)\right) + i\left(f(n/2+2k) - f(n/2-2k-1)\right)$$
$$X(2m) = \mathrm{Re}(X(m)), \quad X(2m+n/2) = \mathrm{Im}(X(m)), \quad m = 0 \sim n/4-1$$
The MDCT flow is shown in Fig. 3.2.

Fig.3.2 The flow chart of MDCT/IMDCT/FFT

First, the input data are reordered into an N/4-point sequence with real and imaginary parts;
$f_k$ denotes this reordering. The complex numbers are then multiplied by the pre-twiddle
coefficients to suit the FFT operation. The factor $e^{-\frac{2\pi i}{n}\left(m+\frac{1}{8}\right)}$
is the pre/post-twiddle operation in the FFT-based MDCT.
The time-to-frequency operation is then performed by an N/4-point FFT kernel. After that, the
FFT spectrum is converted to the MDCT spectrum by the post-twiddle operation, which is the
same as the pre-twiddle. Finally, the complex data are de-interleaved into real numbers,
mapping the 512-point complex FFT spectrum to the 1024-point real MDCT spectrum. To improve
hardware efficiency and reduce the computation load, we use the
radix-$2^3$ FFT algorithm, whose signal flow graph is shown in Fig. 3.3.

Fig 3.3 The signal flow graph of the radix-$2^3$ DIT algorithm

The radix-$2^3$ algorithm reduces the computation load by more than 50% relative to previous
work based on the radix-2 algorithm. Complex multiplications are the major operation of the
FFT algorithm. Traditionally, a complex multiplier is implemented with four real
multipliers and two adders:

$$A \cdot B = (a + bj)(c + dj) = (ac - bd) + j(ad + bc)$$

According to [31], the complex multiplier can be realized with three multipliers and five
adders, which reduces the multiplier count and improves performance:

$$A \cdot B = (a + bj)(c + dj) = [c(a - b) + b(c - d)] + j[d(a + b) + b(c - d)]$$
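
The three-multiplier form can be checked with a short Python sketch. The function name is hypothetical; the operation counts in the comments follow the formula above.

```python
def cmul3(a, b, c, d):
    """(a + bj) * (c + dj) using three real multiplications and five additions."""
    shared = b * (c - d)            # multiplication 1, addition 1 (shared term)
    real = c * (a - b) + shared     # multiplication 2, additions 2 and 3
    imag = d * (a + b) + shared     # multiplication 3, additions 4 and 5
    return real, imag

# (1 + 2j) * (3 + 4j) = -5 + 10j
product = cmul3(1.0, 2.0, 3.0, 4.0)
```

When (c, d) is a fixed twiddle factor, the difference c - d could be stored in advance, which would leave three multiplications and four additions at run time; whether the thesis hardware does so is not stated here.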

Based on [9], the complex multiplier needs only three multipliers and three adders when the
lifting scheme is used. This method uses matrix decomposition to factorize the original
matrix into three sub-matrices, each of which contains only one multiplication, but it
requires extra arithmetic operations such as division and subtraction. In our case the extra
operations can be removed by pre-processing, because the parameters (c, d) are the twiddle
factors of the FFT and of the pre/post-twiddle, which consist of sine and cosine
coefficients. The formula is

$$A \cdot B = \begin{bmatrix} c & -d \\ d & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 1 & \alpha \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta & 1 \end{bmatrix} \begin{bmatrix} 1 & \alpha \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$

where $\alpha = (c - 1)/d$ and $\beta = d$.
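
A lifting-based rotation can be sketched as below. This Python illustration assumes a unit-magnitude factor (c, d) = (cos θ, sin θ) with d ≠ 0 and uses the standard three-step factorization with coefficients (c - 1)/d and d; the function names are hypothetical. The division happens once, when the coefficients are prepared, which corresponds to the pre-processing mentioned above.

```python
import math

def lifting_coeffs(theta):
    """Precompute the lifting coefficients for the rotation (cos, sin)(theta)."""
    c, d = math.cos(theta), math.sin(theta)
    return (c - 1.0) / d, d           # requires sin(theta) != 0

def lifting_rotate(a, b, alpha, beta):
    """(a + bj) * (cos + j sin) with three multiplications and three additions."""
    a = a + alpha * b                 # first lifting step (rightmost matrix)
    b = b + beta * a                  # second lifting step
    a = a + alpha * b                 # third lifting step (leftmost matrix)
    return a, b
```

The cost is visible in `lifting_coeffs`: the lifting coefficients differ from the plain sine and cosine values, so they need their own table or extra arithmetic, which is the overhead weighed in the surrounding discussion.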
In summary, the lifting scheme requires fewer operators in the multiplier itself, but it
needs extra arithmetic operations, and this overhead causes either extra computation or extra
coefficient tables. Given our low-cost goal, this method is not suitable for our
design. Our design uses three multiplications and five additions for the complex
multiplications of the FFT to balance complexity and cost. Table 3.1 lists the complexity of
the various approaches to the FFT-based MDCT.
Table 3.1 The complexity analysis of MDCT (N=2048)

Arithmetic operator   Direct      With Radix-2 FFT   With Radix-2^2/2 FFT   With Radix-2^3 FFT
Multiplication        2,097,152   13,312             7,296                  5,760
Addition              2,095,104   12,288             21,376                 18,816

According to Table 3.1, the FFT-based approach with the radix-$2^3$ algorithm greatly reduces
the computation load. This approach can also exploit VLSI techniques such as pipelining or
folding to achieve the low-cost, low-complexity goal.
Next, we introduce the proposed coefficient merging scheme for the window operation, the
MDCT/IMDCT, and the FFT twiddle factors. In the MDCT-based filterbank flow, four modules
require coefficients: the window operation, the pre-twiddle, the FFT operation, and the
post-twiddle. Table 3.2 lists the original equation of each coefficient for the long and
short windows.
Table 3.2 The table merge scheme of each reference coefficient.

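$\sin\frac{2\pi i}{2048}$, $\cos\frac{2\pi i}{2048}$
$\sin\frac{2\pi i}{4096}$, $\cos\frac{2\pi i}{4096}$
$\sin\frac{2\pi i}{2048}$, $\cos\frac{2\pi i}{2048}$
$Re = \cos\frac{\pi(i+0.125)}{1024}$, $Im = \sin\frac{\pi(i+0.125)}{1024}$
$Re = \cos\frac{\pi(i+0.125)}{128}$, $Im = \sin\frac{\pi(i+0.125)}{128}$
$\sin\frac{\pi(i+0.5)}{2048}$
$\sin\frac{\pi(i+0.5)}{256}$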

According to Table 3.2, every coefficient equation is a trigonometric function, either sine
or cosine. The coefficients can therefore be merged by exploiting the symmetry of the
trigonometric functions and the similarity between sine and cosine. The coefficient merging
flow has three steps. Step one profiles the equations: they are very similar and differ only
in resolution. We then select one equation as the reference coefficient and reconstruct the
other coefficients from it. The FFT twiddle factor is used as the reference coefficient
because it has the higher resolution and only 1/8 of a period of the cosine and sine
coefficients needs to be stored in the table. The remaining coefficients are reconstructed
from the stored 1/8-period values. Step one is illustrated by the FFT twiddle factor in
Fig. 3.4. In step two, the other coefficients, such as the pre/post-twiddle and window
coefficients, are recovered from the stored 1/8-period coefficients. The coefficient merging
scheme comes in two variants for different applications. The resolution of the original
reference coefficient is sufficient to recover the pre/post-twiddle in the MDCT. The
filterbank, however, includes the window operation, which needs higher resolution, so the
resolution of the reference coefficient has to be increased to recover the window
coefficients. Coefficient merging schemes I and II are illustrated in Fig. 3.4 (a) and (b).
Step three compensates the numerical error of coefficients such as the pre/post-twiddle
(long) and the window coefficient (long). These coefficients would require an even higher
resolution to be recovered exactly, which would cause extra overhead. Our design instead uses
an approximation with error compensation to recover these coefficients, maintaining quality
without extra load. Overall, the coefficient merging schemes reduce the ROM table
requirement by more than 70% (original: 3976 words; scheme I: 512 words; scheme II: 1024
words).

Fig.3.4(a) The coefficient table merge scheme I

Fig.3.4(b) The coefficient table merge scheme II
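
The idea of storing only 1/8 of a period can be illustrated with the Python sketch below. It covers only the symmetry step for twiddle factors of the form 2πk/n; the recovery of the pre/post-twiddle and window coefficients and the error compensation of the proposed scheme are not reproduced. The function names are hypothetical and n is assumed to be a multiple of 8.

```python
import math

def build_reference_table(n):
    """cos and sin of 2*pi*i/n for i = 0 .. n/8 (one eighth of a period)."""
    eighth = n // 8
    cos_t = [math.cos(2.0 * math.pi * i / n) for i in range(eighth + 1)]
    sin_t = [math.sin(2.0 * math.pi * i / n) for i in range(eighth + 1)]
    return cos_t, sin_t

def twiddle(k, n, cos_t, sin_t):
    """Return (cos, sin) of 2*pi*k/n using only the 1/8-period tables."""
    k %= n
    quarter = n // 4
    quadrant, j = divmod(k, quarter)
    if j <= n // 8:
        c, s = cos_t[j], sin_t[j]                        # first octant: read directly
    else:
        c, s = sin_t[quarter - j], cos_t[quarter - j]    # swap: cos(x) = sin(pi/2 - x)
    if quadrant == 0:
        return c, s
    if quadrant == 1:
        return -s, c          # rotate by 90 degrees
    if quadrant == 2:
        return -c, -s         # rotate by 180 degrees
    return s, -c              # rotate by 270 degrees
```

In hardware terms, the octant test and the sign and swap selection correspond to address remapping and multiplexing, so the table shrinks at the cost of a small amount of control logic.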

Moreover, this MDCT approach can be extended to related functions such as the IMDCT for the
AAC decoder and the 2048-point FFT for the DAB+ system. Fig. 3.2 shows the MDCT/IMDCT/FFT
signal flow. The proposed coefficient merging schemes also apply to the IMDCT
and FFT functions. Hence this algorithm and hardware design can be used in audio codec
applications and DAB+ receiver systems. In VLSI architecture design, it achieves a low-cost,
configurable design for multiple applications because the algorithm is
VLSI-oriented. The details are discussed in the next chapter.

3.3 Low Complexity MDCT-Based Psychoacoustic Model

As discussed previously, the PAM is the most complex and important module in the MPEG-2/4
AAC encoder. Huang [24][25] proposes three methods to reduce the
computation load and memory bandwidth. These three methods are introduced in this section.

Method 1 : Pre-processing spreading function and optimization

Step 5 of the PAM applies the spreading function. However, the
spreading function involves complex arithmetic operations such as square roots and powers of
ten. The pseudo code of the spreading function is shown in Fig. 3.5.
Spreading function (bark value i, bark value j)
{
    if (j >= i) tmpx = 3.0 * (j - i)
    else tmpx = 1.5 * (j - i)
    tmpz = 8 * minimum((tmpx - 0.5)^2 - 2 * (tmpx - 0.5), 0)
    tmpy = 15.811389 + 7.5 * (tmpx + 0.474) - 17.5 * (1.0 + (tmpx + 0.474)^2)^0.5
    if (tmpy < -100) then return 0
    else return 10^((tmpz + tmpy) / 10)
}
Fig. 3.5 The Pseudo code of the spreading function.

The spreading function depends only on the bark values, which in turn depend on the
sampling rate and window shape. The complex operations can therefore be replaced by a look-up
table to reduce the computation load. Furthermore, Huang optimizes the size of the look-up
table: the non-zero values are concentrated around the diagonal of the table, so they can be
reordered into a linear array. Fig. 3.6 illustrates the linear array of the look-up table.
This method reduces the computation load with little overhead, and the optimization reduces
the look-up table requirement from 6664 words to 2067 words at a sampling rate of 44100 Hz.

[Figure: the 70 x 70 look-up table with non-zero values around the diagonal and zero values elsewhere, the linear array for the non-zero values, and the array for the start and end indices]

Fig 3.6 The linear array method of look-up table
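
The spreading function and the diagonal compaction of its table can be sketched in Python as follows. The spreading function follows the pseudo code of Fig. 3.5; the compaction layout (one start index, end index, and offset per row) is an assumption about how such a linear array could be organized, and the evenly spaced bark values used in the example are hypothetical placeholders for the real partition tables.

```python
import math

def spreading(i, j):
    """Spreading function between bark values i and j (see Fig. 3.5)."""
    tmpx = 3.0 * (j - i) if j >= i else 1.5 * (j - i)
    tmpz = 8.0 * min((tmpx - 0.5) ** 2 - 2.0 * (tmpx - 0.5), 0.0)
    tmpy = (15.811389 + 7.5 * (tmpx + 0.474)
            - 17.5 * math.sqrt(1.0 + (tmpx + 0.474) ** 2))
    if tmpy < -100.0:
        return 0.0
    return 10.0 ** ((tmpz + tmpy) / 10.0)

def build_compact_table(barks):
    """Store only the non-zero run of each row, plus (start, end, offset) per row."""
    values, index = [], []
    for bi in barks:
        row = [spreading(bi, bj) for bj in barks]
        nonzero = [k for k, v in enumerate(row) if v != 0.0]
        start, end = nonzero[0], nonzero[-1]
        index.append((start, end, len(values)))
        values.extend(row[start:end + 1])
    return values, index

def lookup(values, index, i, j):
    """Read entry (i, j) of the full table from the compact representation."""
    start, end, offset = index[i]
    if j < start or j > end:
        return 0.0
    return values[offset + (j - start)]

# Hypothetical example: 70 partitions spread evenly over 0..24 bark.
barks = [24.0 * k / 69 for k in range(70)]
values, index = build_compact_table(barks)
```

With these placeholder bark values the compact array is much shorter than the 4900 entries of the full 70 x 70 table, which illustrates the kind of saving reported above without reproducing its exact figures.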

Method 2 : Window type decision in frequency domain

Previous MDCT-based PAM work [20][22] cannot guarantee good quality when the
window shape decision is made in the time domain alone. Huang's method [24] reduces the
complexity and prevents quality degradation. Fig. 3.7 illustrates the window shape
decision scheme of Huang's method. The method executes in two phases. Phase 1 predicts
the window shape of the current frame from the Perceptual Entropy (PE). Phase 2 then
calculates the spectrum with the selected window shape and the signal-to-mask ratio (SMR) for
the quantization loop and the other coding tools.


[Figure: input buffer, Delay 1 and Delay 2 units, MDCT 1 (LONG) with Threshold Generation 1 (LONG), MDCT 2 (1 of 4) with Threshold Generation 2 (SHORT), window shape and SMR outputs, output buffer]

Fig.3.7. The window shape decision scheme of Huangs method

The scheme uses two parallel PAMs and two delay units for the two phases.
PAM 1, which uses only the LONG window shape, generates the threshold, detects transients,
and decides the window shape of the current frame. The input sequence is held in the delay
unit until PAM 1 finishes; the window shape information and the audio data are then fed into
PAM 2 to calculate the required information. The window shape transition follows the
definition in the standard. This scheme improves the quality of the MDCT-based PAM.

Method 3 : Logarithmic-based threshold generator

Even after the optimizations above, the PAM flow still includes complex operations such as
log10, division, and powers of ten. These arithmetic operations are difficult to implement in
hardware or with DSP instructions. To implement the PAM on a hardware or DSP platform with
less complex computation, Huang [13] uses a log-scale and rescheduling technique to calculate
Steps 7-13. With this method, multiplications and divisions in the original domain become
additions and subtractions in the logarithmic domain. The resulting flow chart is shown in
Fig. 3.8. Moreover, the memory storage and bandwidth requirements of the Threshold
Generation (TG) are reduced, because the energy and masking threshold data stay within the
block and their word length in logarithmic format is shorter than in the original format.

Fig.3.8 The flow chart of logarithmic-based threshold generator.
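
The benefit of the logarithmic domain can be shown with a small Python sketch. It is illustrative only: the function names are hypothetical, and a 10·log10 (dB) representation is assumed for the stored energies and thresholds.

```python
import math

def to_db(value):
    """Convert a linear energy to the logarithmic (dB) domain."""
    return 10.0 * math.log10(value)

def smr_db_linear(energy, threshold):
    """Reference: divide in the linear domain, then take the logarithm."""
    return 10.0 * math.log10(energy / threshold)

def smr_db_log(energy_db, threshold_db):
    """Log-domain version: the division becomes a subtraction."""
    return energy_db - threshold_db

def scale_threshold_db(threshold_db, factor_db):
    """Multiplying a threshold by a factor becomes an addition of dB values."""
    return threshold_db + factor_db
```

Once the energies are held in dB, the later steps need only adders and subtractors, and the values span a much smaller numeric range than the linear energies, which is consistent with the shorter word length noted above.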


Chapter 4
The Architecture of Low Complexity
Psychoacoustic Model

The previous chapters described the MDCT-based PAM algorithm and the AAC
encoder flow. This chapter describes the hardware design, focusing on an architecture
with low power and low complexity. Huang's method has the following implications for the
hardware: (1) The MDCT-based PAM algorithm replaces the FFT computation with the MDCT
spectrum, so the hardware needs only one type of filterbank, an FFT-based MDCT, which reduces
the computational complexity. (2) The unpredictability measure is replaced by the SFM, which
not only avoids computing special functions (sine, cosine, square root, and division) but
also reduces memory usage. (3) The complex special functions of the spreading function are
replaced by a look-up table, which requires an extra ROM table in hardware. (4) With the
logarithmic-based design, the complex equations of the TG need only the log10 and
power-of-ten operations. (5) The power-of-ten operation can be removed by a logarithmic-based
quantization loop algorithm, so that log10 is the only complex arithmetic operation
remaining in the PAM.


4.1 Architecture of PAM

In the proposed hardware design, the PAM has three modules: the Filterbank (MDCT), the
Threshold Generator, and the controller. Method 2 of Huang's MDCT-based algorithm
requires parallel PAMs to perform the window shape decision and the SMR
calculation. Fig. 4.1 (a) shows the original MDCT-based architecture, which has a
predicted phase and an evaluated phase. The predicted phase calculates the window shape of
the current frame. The evaluated phase calculates the SMR and the frequency spectrum. Our
design implements the PAM with the folding technique of VLSI architecture design to meet the
low-cost goal. Fig. 4.1 (b) shows the folded architecture for the
MDCT-based PAM. A data rescheduling technique rearranges the original input data schedule so
that it fits the folded architecture. With data rescheduling and folding, the
hardware requires only one PAM module, which executes both phases to obtain the
information needed by the encoder flow.

Fig 4.1 (a) The original MDCT-based architecture


[Figure: audio raw data, delay unit, folded MDCT-based PAM with MDCT (1 of 4) and threshold generator (Long/Short); predicted phase (Long type) producing the block type, evaluated phase (Long/Start/Short/Stop type) producing the spectrum, SMR, and block type; control signals and data path]

Fig 4.1 (b) The folding architecture design for MDCT-based PAM

In general, storage components such as RAM and ROM are a major cost in digital design.
After profiling the memory usage and analyzing the utilization of the PAM, we found that the
total memory usage of the previous design is 92160 bits; the utilization of each memory is
shown in Fig. 4.2. The original memory usage is inefficient: each memory is idle for longer
than it is active. Inefficient memory in a digital design causes extra power consumption.

Fig.4.2 The utilization of memory usage between original and rescheduled



The blocks of different area in Fig. 4.2 represent different memory sizes. The memories
are 1024x24, 128x24, and 128x24 in the original design, and 512x24 and 256x24 in the proposed
design. The previous PAM design uses local memory in each module (MDCT, TG), as
shown in Fig. 4.3 (a). The proposed design adopts shared memory to improve
efficiency and meet the low-cost constraint, as shown in Fig. 4.3 (b). It relies on two
techniques: memory rescheduling and memory partitioning. With rescheduling, memories of the
same word length are shared by multiple modules in different time slots. With partitioning,
utilization improves further: a RAM of word length 2N can be replaced by two RAMs of word
length N, or an N-point RAM can be replaced by two N/2-point RAMs serving different
modules. The "proposed" part of Fig. 4.2 shows the memory utilization with rescheduling
and partitioning. As a result, the memory requirement is reduced from 92160 bits to 49152
bits; in other words, the proposed method saves about 47% of the storage. The key
modules, MDCT and TG, are described in the following sections.

Fig 4.3 The memory requirement of different approach: (a) local memory (b) shared memory


4.2 Design of MDCT

The major computation load of the PAM comes from calculating the MDCT. Our design improves
performance not only at the algorithm level but also through VLSI architecture design. As
discussed in the previous chapter, the FFT-based MDCT is well suited to VLSI architecture
design, and the radix-$2^3$ algorithm is used to reduce complexity. The FFT
is the major part of the filterbank. The proposed design uses a memory-based FFT with
a fully pipelined butterfly unit as a compromise between the cost and performance
constraints. The same hardware can perform multiple functions, including MDCT, IMDCT, and
FFT, given temporary memory of the corresponding size. Fig. 4.4 shows the block diagram of
the shared hardware design for MDCT/IMDCT/FFT. The proposed design consists of a memory
unit (RAM), a butterfly unit, a cache-register, a controller, an address generator, and a
coefficient generator. Each block is described below.

Fig.4.4 The block diagram of hardware shared design for MDCT/IMDCT/FFT

Figure 4.5 shows the architecture of the coefficient generator. With the proposed
coefficient-merged method, the coefficient table needs only 1024 words, together with several
multiplexers and an offset-compensation stage built from a simple adder/subtractor, to
reconstruct all coefficient values. The original address is first remapped to a new address
to fetch the corresponding table entry. The new address and the coefficient type then
determine whether offset compensation is needed. The reconstructed value is obtained from
these two steps.

Fig 4.5 Architecture design of coefficient generator

A similar architecture, which performs the MDCT with FFT butterfly operations, is presented
in [32]. Building on that design, the proposed method uses a fast algorithm and cache
registers to improve performance and reduce power consumption. The butterfly unit and the
cache-register mechanism are introduced next. The signal flow chart of the butterfly is shown
in Figure 4.6. Because the proposed method uses the radix-2^3 algorithm, the signal flow
differs from that of the original radix-2 butterfly unit.

Fig 4.6 The signal flow chart of butterfly unit.

Figure 4.7 shows the architecture of the butterfly unit. It consists of one multiplier, three
adder/subtractors, and four pipeline registers. Figure 4.8 shows the timing charts of the
pipelined butterfly unit for the pre/post-twiddle operation and the butterfly operation. The
hardware takes 3 clock cycles per pre/post-twiddle operation and 6 clock cycles per butterfly
operation. In other words, a radix-2^3 butterfly operation needs two complex multiplications
and a pre/post-twiddle operation needs one, and each complex multiplication is completed in
3 clock cycles using the three-multiplication, five-addition method. The multiplier
utilization is 100%: one product is generated every clock cycle and fed to the add/sub
module. As the pipeline timing charts in Figure 4.8 show, results are output continuously
after 4 cycles for the pre/post-twiddle operation and after 7 cycles for the butterfly
operation.
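
The three-multiplication, five-addition complex multiply mentioned above can be sketched as follows. The text does not spell out which factorization the hardware uses (see [31] for multiplier structures), so the particular grouping of terms and the function name `complex_mul_3m5a` are assumptions made for illustration.

```python
def complex_mul_3m5a(a, b, c, d):
    """Return (a + jb) * (c + jd) using 3 real multiplications and 5 real additions."""
    s1 = c + d        # addition 1
    s2 = b - a        # addition 2
    s3 = a + b        # addition 3
    m1 = a * s1       # multiplication 1: ac + ad
    m2 = c * s2       # multiplication 2: bc - ac
    m3 = d * s3       # multiplication 3: ad + bd
    real = m1 - m3    # addition 4: ac - bd
    imag = m1 + m2    # addition 5: ad + bc
    return real, imag


# Check against Python's built-in complex multiplication.
expected = complex(3, 4) * complex(1, 2)
result = complex_mul_3m5a(3, 4, 1, 2)
print(result, expected)
```

With a single multiplier issuing one product per clock, the three products of such a scheme would occupy three consecutive cycles, which is consistent with the 3-cycle figure quoted in the text.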

Fig. 4.7 Architecture design of butterfly unit.

Fig 4.8 (a) Pre/Post twiddle operation timing chart of pipeline


Fig 4.8 (b) Butterfly unit timing chart of pipeline

Our design also addresses low power. In general, memory accounts for about 30%-50% of the
power consumption in a digital design, and access-intensive memory consumes even more. In a
cell-based design, however, memory is a hard IP whose architecture cannot be modified at the
transistor or circuit level. To reduce power consumption under this restriction, the designer
can only reduce the number of memory accesses and choose the memory type. The proposed design
selects single-port SRAM as the storage element and uses cache registers to reduce the number
of memory accesses, and hence the power consumption.

[Figure: (a) Original memory-based design: the radix-2 butterfly unit reads from and writes to memory directly. (b) Modified memory-based design with cache register: cache registers sit between the memory and the radix-2 butterfly unit.]

Fig 4.9 The memory-based architecture of (a) original (b) with cache register

Figure 4.9 shows the original and the modified memory-based designs. The cache registers add
a small overhead, but they reduce the number of memory accesses in the radix-2^3 butterfly
operation by more than 50%. They sit between the memory and the butterfly unit and hold the
temporary data of each 8-point FFT operation. Figure 4.10 shows the memory access schemes of
the two designs. In the original design, the processing element (PE) accesses memory for
every 2-point butterfly computation of an 8-point FFT. In the cache-register design, memory
is accessed only when the input data are read and when the results are written back; all
other accesses go to the cache registers. The access counts are calculated below. With this
scheme, the number of memory accesses is reduced by more than 60% relative to the original
scheme (from 96 to 32 in the 8-point example).

Original scheme, 8-point example:
  Memory accesses: 4 butterfly units x 2 points x (2+2) R/W accesses x 3 stages = 4 x 2 x (2+2) x 3 = 96
Cache-register scheme, 8-point example:
  Memory accesses: 8-point read and 8-point write = 8 x (2+2) = 32
  Cache-register accesses: 4 butterfly units x 2 points x (2+2) R/W accesses x (stages - 1) = 4 x 2 x (2+2) x 2 = 64
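
The access counts above can be reproduced with a short script. This is only a counting sketch under the assumptions stated in the text (radix-2 stages within an 8-point block, and (2+2) read/write accesses per point); the function names are hypothetical.

```python
def original_memory_accesses(n_points=8, rw_per_point=2 + 2):
    """Every 2-point butterfly in every stage reads and writes memory."""
    stages = n_points.bit_length() - 1      # log2(n_points) for a power of two
    butterflies_per_stage = n_points // 2
    return butterflies_per_stage * 2 * rw_per_point * stages


def cached_memory_accesses(n_points=8, rw_per_point=2 + 2):
    """Memory is touched only to load the block and to store the results."""
    return n_points * rw_per_point


def cache_register_accesses(n_points=8, rw_per_point=2 + 2):
    """Intermediate traffic that moves from memory into the cache registers."""
    stages = n_points.bit_length() - 1
    butterflies_per_stage = n_points // 2
    return butterflies_per_stage * 2 * rw_per_point * (stages - 1)


before = original_memory_accesses()
after = cached_memory_accesses()
saving = 100.0 * (before - after) / before
print(before, after, cache_register_accesses(), round(saving, 1))
```

For the 8-point example this gives 96 memory accesses before, 32 after, and 64 cache-register accesses, a reduction of about 66.7% in memory accesses.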

Fig 4.10 The memory access scheme of original and cache register design.

The address generator and the controller are implemented as finite state machines (FSMs).
The address generator also translates the original address into the coefficient-merged table
address and generates the memory read/write addresses and the cache-register read/write
enable signals. The memory unit is divided into eight parts, which reduces power consumption
and provides a flexible memory size for the various applications (MDCT/IMDCT/FFT).

4.3 Design of Threshold Generator

The threshold generator (TG) calculates the signal-to-mask ratio (SMR) for the quantization
loop and the perceptual entropy (PE) for the window-shape decision. In the previous designs
[33][34], the TG architecture is DSP-oriented to achieve low area cost and uses a
logarithm-based numerical format to reduce both the complexity of the arithmetic operations
and the word length of the data. In the logarithmic domain, the only operators needed are a
multiplier, a multiply-and-add unit, a logarithm unit, an adder/subtractor, and a comparator.
The TG comprises two blocks, an inner block and an outer block. The inner block performs the
arithmetic computation and consists of a logarithmic unit (LOG) [35], a
multiplication-addition unit (MAC), and an arithmetic logic unit (ALU). The outer block
handles control and the spreading function and consists of a controller and a ROM table.
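
The benefit of a logarithm-based format can be illustrated with a small sketch: in the logarithmic domain, multiplication, division, and powers reduce to addition, subtraction, and multiplication. The base-2 choice, the floating-point arithmetic, the helper names, and the sample values are all assumptions for illustration; the hardware described here uses a fixed-point logarithmic converter [35].

```python
import math


def to_log2(x):
    """Convert a positive linear value to the base-2 logarithmic domain."""
    return math.log2(x)


def from_log2(lx):
    """Convert a base-2 logarithmic value back to the linear domain."""
    return 2.0 ** lx


def log_mul(la, lb):
    """Multiplication of linear values becomes addition of their logs."""
    return la + lb


def log_div(la, lb):
    """Division of linear values becomes subtraction of their logs."""
    return la - lb


def log_pow(la, exponent):
    """Raising a linear value to a power becomes a multiplication."""
    return la * exponent


# Hypothetical band energy and masking threshold (illustrative values only).
energy, threshold = 1024.0, 16.0
ratio_log2 = log_div(to_log2(energy), to_log2(threshold))
print(ratio_log2, from_log2(ratio_log2))
```

A ratio such as signal energy over masking threshold therefore needs only a subtractor once both operands are in the logarithmic domain, which is why the operator set listed above stays small.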

Fig.4.11. The block diagram of DSP-oriented TG design.


Chapter 5
Platform-Based Design of Low
Complexity Psychoacoustic Model

Advances in IC fabrication and electronic design automation (EDA) technologies have made the
system-on-chip (SoC) design concept increasingly practical. An SoC can realize a complex
system on a single chip with low power, low cost, and high performance. The existing
platform-based methodology [36] is defined as an architectural framework with a set of
pre-qualified software and hardware IPs. The proposed design follows this methodology to
build a pre-qualified software and hardware IP that can be integrated into a platform-based
design. The designer only has to modify the wrapper to match the bus specification of the
target processor core (DSP, ARM, PowerPC, or a user-defined processor). The proposed design
is therefore a reusable IP for different applications (audio codec, DAB+ system). Its main
feature is that it can run at different clock rates for different applications: it can
operate at the lowest clock rate that meets the real-time constraint of each application,
which reduces the hardware resources required.


5.1 Design Approach of MPEG-2/4 AAC Codec

In general, architectures for audio applications fall into three approaches: DSP-based
[13][37], pure-ASIC [38][39], and semi-ASIC [11][16]. A DSP-based architecture has higher
cost and higher power consumption for a specific audio application, and software running on a
DSP is less efficient than dedicated hardware. As a result, a DSP-based approach always
requires a higher operating frequency than a dedicated hardware design to meet the real-time
constraint. The pure-ASIC design, on the other hand, solves the cost, performance, and power
problems but loses flexibility. The semi-ASIC design uses HW/SW co-design to build the
system, trading off the flexibility of the DSP-based approach against the advantages of the
pure-ASIC approach. In other words, the semi-ASIC design is more cost-effective than the
DSP-based design and more flexible than the pure-ASIC design.

5.2 Software/Hardware Development

The main idea of HW/SW co-design is that computation-intensive functions are handled by
hardware accelerators and the remaining functions by software. Figure 5.1 shows the
properties of the HW/SW co-design approach. As discussed in the previous chapter, the PAM and
the filterbank account for more than 50% of the computation load and consist of regular
arithmetic operations. They are therefore implemented as hardware accelerators, and the
control-intensive parts are implemented in software, to obtain better performance at lower
cost. The platform for this HW/SW co-design comprises an embedded processor, embedded memory,
other system functional blocks, and the hardware accelerators (PAM, filterbank), each with a
wrapper to communicate over a user-defined bus. The platform block diagram is shown in
Figure 5.2. It provides a complete SoC platform into which a well-designed IP can be
integrated easily. This work not only builds the SoC platform for HW/SW co-design of the
MPEG-2/4 AAC codec but also provides the integrated design flow and the verification
strategy. Details are given in the following sections.

Fig.5.1 The property of HW/SW co-design approach

Fig.5.2 The block diagram of SoC platform

5.2.1 Software development


The software part is based on the MPEG-2/4 AAC standard reference source code. The C code is
first developed in a PC environment and is then built for the target processor core with the
corresponding simulator and compiler. The designer has to port the C code to code compatible
with the embedded processor (DSP, ARM, etc.). The software mainly consists of two modules:
the control-intensive functions and the interface between the software and hardware parts.

5.2.2 Deliverable IP development


As described in the previous chapter, the hardware accelerator has been designed and
functionally verified. The finished accelerator must then be packaged as a deliverable IP so
that it can be integrated easily into various SoC platforms or systems. The flow for building
the deliverable IP is illustrated in Figure 5.3. Our design targets the AMBA bus and the ARM
processor, so a deliverable model is provided for ARM-based platforms. The flow consists of
three stages, described in detail below.

[Figure: deliverable IP design flow. The well-designed hardware passes through IP qualification, wrapper design, AMBA protocol verification, and co-simulation with software, and becomes an AMBA-slave-compatible IP.]

Fig. 5.3 Design flow of constructed deliverable IP

Stage.1 IP qualification [40] helps the designer achieve a better coding style. This stage
prevents potential problems in the Verilog code at the register-transfer level (RTL),
including simulation, synthesis, timing-analysis, and design-for-test problems. It also
improves readability for other designers who integrate this design.

Stage.2 The well-designed IP is integrated on an ARM-based platform and uses the AMBA bus to
communicate with the ARM processor, but its original I/O specification was designed for a
user-defined interface. The interface therefore has to be adapted to the AMBA specification
by a wrapper that communicates with the bus. The proposed IP uses an AHB slave wrapper to
communicate with the processor and the bus. The data transfer mechanism is separated into an
initial phase and a data phase, which transfer the control signals and the frame data
respectively. The initial phase sets up the control information in the IP core; the data
phase transfers the input data (time-domain signal) and receives the output data (SMR value,
window shape). Moreover, the wrapper is memory-mapped to ease software development. The
timing chart of the HW/SW co-design for the encode and decode flows is shown in Figure 5.4.

Fig. 5.4 The timing chart of HW/SW co-design (a) encoder flow (b) decoder flow

In the encoder flow, the time-domain data are transferred to the IP core through the wrapper
interface, and the IP calculates the SMR information. The processor then receives the SMR
data from the IP and executes the other essential coding tools. In the decoder flow,
conversely, the processor first performs the pre-processing. It then sends the uncompressed
frequency-domain data to the IP, which performs the IMDCT to obtain the time-domain data
(PCM out). In this co-design scheme the processor is free to execute other functions in the
meantime. In other words, the software and hardware modules can work in parallel to improve
system performance.
The wrapper has another task: transferring data between different clock domains. Because of
the real-time or IP performance constraints, the clock rate of the IP may differ from that of
the AMBA bus and the ARM processor. To let the IP and the processor communicate, the wrapper
needs a FIFO mechanism, which is implemented with a dual-port SRAM and control logic.
Figure 5.5 illustrates the IP design with the AMBA-slave wrapper.

Fig 5.5 The IP design with AMBA-slave wrapper

Stage.3 After the wrapper has been designed, it has to be verified with the processor
instruction set. This design uses the Synopsys DesignWare AMBA Verification IP [41] to verify
whether the wrapper complies with the AMBA protocol. This stage generates random commands and
patterns to exercise the wrapper and observes the responses to check whether the wrapper
functions correctly.

5.3 Software/Hardware Co-simulation and Co-verification

After the hardware and software have been developed separately, the design must be shown to
be practicable for physical implementation. Verifying it only after physical implementation,
however, may not be feasible. To show that the design is practicable at the system level, we
use a system-level CAD tool for co-simulation and co-verification. With CoWare Platform
Architect [42], we can model the whole system, including the processor, the bus, and the
hardware accelerator. This approach provides the models required for hardware/software
co-simulation and co-verification. Based on the co-simulation and co-verification results,
the design can be debugged easily and mismatches in performance requirements or functional
specifications can be avoided.

Fig 5.6 The co-verification by SystemC tool (CoWare Platform Architect)


Chapter 6
Implementations and Results

The proposed low-complexity architecture is implemented with the cell-based design flow of
the Chip Implementation Center (CIC). The platform-based design is implemented with the CIC
Multi-Project System-on-Chip (MPSoC) design flow. At the behavioral level, system development
starts from the ISO/IEC 14496-3 reference source code, in which the original algorithm is
replaced by the fast algorithm of the MDCT-based PAM. The behavioral model is then converted
to fixed point to simulate the numerical error of finite word length. Next, synthesizable
code is written in Verilog and simulated to verify the specified function. The RTL design is
synthesized by Synopsys Design Compiler with the TSMC 0.13 µm cell library. Finally, the
physical implementation uses an APR tool (Cadence SoC Encounter) to place and route the
gate-level netlist. After the cell-based design flow, we also follow the platform-based
design flow to build the HW/SW co-design with IPQ, virtual platform prototyping, and
RTL-platform verification. Figure 6.1 shows the cell-based IC design flow.

[Figure: cell-based IC design flow. System specification -> system-level design (C simulation) -> module design (Verilog coding) -> comparison (Verilog vs. C) -> synthesis (Synopsys Design Compiler) -> RTL-level and gate-level simulation -> comparison (Verilog vs. C) -> place and route (SoC Encounter).]

Fig 6.1 The cell-based IC design flow.

6.1 Performance Evaluation

In multimedia applications, the real-time constraint is the most important consideration.
This section analyzes the performance of the proposed design. The operating frequency
required for real-time operation in each application can be calculated from the cycle count
of each function, using the formulas below. The specifications are a 48 kHz sample rate and a
2-channel signal for the audio application, and a 2048-point FFT for DAB+ system Type I.

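MDCT of encoder (sample rate: 48 kHz, 2-channel):

$$\text{Frequency} = \frac{\text{Sample\_rate} \times \text{Channel} \times \text{phase}}{\text{frame\_size}} \times \text{Cycle\_count} = \frac{48000 \times 2 \times 2}{1024} \times 14233 \approx 2.7\ \text{MHz}$$

PAM of encoder (sample rate: 48 kHz, 2-channel):

$$\text{Frequency} = \frac{\text{Sample\_rate} \times \text{Channel} \times \text{phase}}{\text{frame\_size}} \times \text{Cycle\_count} = \frac{48000 \times 2 \times 2}{1024} \times 16432 \approx 3.1\ \text{MHz}$$

IMDCT of decoder (sample rate: 48 kHz, 2-channel):

$$\text{Frequency} = \frac{\text{Sample\_rate} \times \text{Channel} \times \text{phase}}{\text{frame\_size}} \times \text{Cycle\_count} = \frac{48000 \times 2 \times 1}{1024} \times 14233 \approx 1.4\ \text{MHz}$$

FFT of DAB+ system (2048-point FFT):

$$\text{Frequency} = \frac{\text{FFT\_cycle\_count}}{\text{Symbol\_required\_time}} = \frac{40460}{1.25\ \text{ms}} \approx 33\ \text{MHz}$$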

Based on these results, the required frequency is 1.4 MHz for the audio decoder, 3.1 MHz for
the audio encoder, and 33 MHz for the DAB+ system.
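
The frequencies above follow directly from the cycle counts and can be recomputed with a few lines. The function names are hypothetical; the numeric inputs are the ones given in the formulas of this section.

```python
def realtime_frequency_hz(sample_rate, channels, phases, frame_size, cycle_count):
    """Minimum clock rate (Hz) so that cycle_count cycles fit in each frame period."""
    frames_per_second = sample_rate * channels * phases / frame_size
    return frames_per_second * cycle_count


def fft_frequency_hz(cycle_count, symbol_time_s):
    """Minimum clock rate (Hz) so that one FFT finishes within one symbol time."""
    return cycle_count / symbol_time_s


mdct_hz = realtime_frequency_hz(48000, 2, 2, 1024, 14233)
pam_hz = realtime_frequency_hz(48000, 2, 2, 1024, 16432)
imdct_hz = realtime_frequency_hz(48000, 2, 1, 1024, 14233)
fft_hz = fft_frequency_hz(40460, 1.25e-3)

for name, hz in [("MDCT", mdct_hz), ("PAM", pam_hz), ("IMDCT", imdct_hz), ("FFT", fft_hz)]:
    print(f"{name}: {hz / 1e6:.2f} MHz")
```

The script gives about 2.67, 3.08, 1.33, and 32.37 MHz. The values quoted in the text (2.7, 3.1, 1.4, and 33 MHz) appear to be these figures rounded up.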

6.2 Power Analysis and Evaluation

Power consumption is an important issue in portable applications such as mobile phones,
MP3/AAC players, DAB+ receivers, and PDAs. We evaluate the power consumption of each
application to show that our design meets the low-power requirement, and then profile how the
power is distributed in each application. The evaluation uses the power analysis tool
PrimePower. The power consumption of the different applications is shown in Table 6.1.

Table 6.1 The power analysis of each application

| Application | MDCT @ 5 MHz | PAM @ 10 MHz | IMDCT @ 5 MHz | FFT @ 40 MHz |
|---|---|---|---|---|
| Power consumption | 1.279 mW | 6.13 mW | 1.293 mW | 11.89 mW |
| Dynamic / Leakage | 1.186 mW / 93 µW | 6 mW / 0.13 mW | 1.2 mW / 93 µW | 11.79 mW / 93 µW |
| Logic / Memory | 0.88 mW / 0.391 mW | 5.23 mW / 0.9 mW | 0.88 mW / 0.411 mW | 8.12 mW / 3.77 mW |
| After optimization | 0.8994 mW (70%) | 3.694 mW (60%) | 0.912 mW (70%) | 8.7 mW (73%) |
| Dynamic / Leakage (after optimization) | 0.849 mW / 50 µW | 3.64 mW / 53 µW | 0.862 mW / 50 µW | 8.653 mW / 50 µW |
| Logic / Memory (after optimization) | 0.517 mW / 0.382 mW | 2.88 mW / 0.8 mW | 0.52 mW / 0.4 mW | 4.98 mW / 3.71 mW |
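
The percentages in the "After optimization" row of Table 6.1 appear to express the optimized power as a fraction of the original power. That reading is an assumption, but it can be checked against the table values with a short script (the dictionary and function names are hypothetical):

```python
# Total power in mW, copied from Table 6.1.
before_mw = {"MDCT": 1.279, "PAM": 6.13, "IMDCT": 1.293, "FFT": 11.89}
after_mw = {"MDCT": 0.8994, "PAM": 3.694, "IMDCT": 0.912, "FFT": 8.7}


def remaining_percent(before, after):
    """Optimized power as a percentage of the original power."""
    return 100.0 * after / before


ratios = {name: remaining_percent(before_mw[name], after_mw[name]) for name in before_mw}
for name, pct in ratios.items():
    print(f"{name}: {pct:.1f}% of original, {100.0 - pct:.1f}% saved")
```

The computed ratios agree with the 70%, 60%, 70%, and 73% entries to within about one percentage point, which corresponds to savings of roughly 27% to 40%.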


Power consumption can be divided into dynamic power and leakage power. The two components
can be expressed by the following equations.

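Dynamic power:

$$P_{\text{dynamic}} = \alpha \cdot C \cdot f \cdot V^{2}$$

where $\alpha$ is the switching activity, $C$ the capacitance, $f$ the frequency, and $V$ the supply voltage.

Leakage power:

$$P_{\text{leakage}} = \left( I_{0}\, e^{\frac{V_{gs} - V_{th}}{n V_{T}}} \left(1 - e^{-\frac{V_{DS}}{n V_{T}}}\right) e^{\frac{\eta V_{DS}}{n V_{T}}} \right) \cdot V$$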

With the power consumption evaluated and the power equations in hand, various gate-level and
circuit-level techniques can be applied to meet the low-power constraint. Logic gate
transitions account for the major part of dynamic power, so clock gating and operand
isolation are used to save power. In general, dynamic power can be reduced by 30%-50% by
reducing the number of transitions. The supply voltage also has a large effect on dynamic
power: as the formula shows, power is proportional to the square of the voltage, which
motivates power gating and voltage scaling. Power gating turns off the power supply in the
idle state, and voltage scaling trades circuit performance against power consumption at the
physical level. Together, these two methods can save more than 50% of the dynamic power. Our
profiling shows that memory power consumption is also significant. Reducing the number of
memory accesses, as in the proposed architecture discussed earlier, is an efficient way to
save memory power.
Leakage power is a difficult problem in digital design and has to be addressed at the
physical or circuit level. At the circuit level, multi-threshold-voltage logic cells and body
biasing have been proposed to reduce leakage power.
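
As a rough illustration of the square-law dependence of dynamic power on supply voltage, assume that capacitance, frequency, and switching activity stay fixed while the supply is lowered from a hypothetical 1.2 V to 1.0 V (both values are chosen only for illustration and are not taken from this design):

$$\frac{P'_{\text{dynamic}}}{P_{\text{dynamic}}} = \left(\frac{V'}{V}\right)^{2} = \left(\frac{1.0}{1.2}\right)^{2} \approx 0.69$$

Under these assumptions, a supply reduction of about 17% lowers dynamic power by about 31%.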


6.3 Designs for Testing Strategy

To increase the testability of the hardware design, test circuits such as scan chains and
memory BIST are inserted. The logic part uses scan chains as the DFT circuit to detect
stuck-at faults, achieving a fault coverage of 90.59%. The memory part uses a BIST circuit to
verify the correctness of the memory.

6.4 Comparison

This section compares the proposed design with previous work: MDCT [18][34], IMDCT [43], and
PAM [18][34] for audio applications, and FFT [44][45] for the DAB/DAB+ system. Figure 6.2
compares the cycle count, ROM table size, and memory access count. The proposed shared
hardware design offers high performance with fewer cycles, low cost with a smaller ROM table,
and low power with fewer memory accesses. In the previous design [45], the processing element
uses four multipliers to complete a complex multiplication in each cycle. For a fair
comparison with our design, its cycle count has been normalized to the single-multiplier case
in Figure 6.2(a). We also compare with the previous PAM design [18][34]: Tables 6.2 to 6.5
compare the memory utilization, ROM table utilization, cycle count, and logic gate count. In
our design, the PAM requires 16432 cycles for the long window shape and 16063 cycles for the
eight-short window shape. The ROM size and memory requirement are 5916 bytes and 6144 bytes,
about 38% and 47% smaller than in the previous design.

[Figure 6.2: bar charts comparing the proposed design ("Ours") with previous designs: ISCAS 2006 for MDCT, IEICE 2007 for IMDCT, and IEEE Trans. on Broadcasting 2003 and 2007 for FFT.]

Fig 6.2 (a) Cycle counts compared with previous designs for each application

Fig 6.2 (b) ROM table sizes compared with previous designs for each application

Fig 6.2 (c) Memory access counts compared with previous designs for each application

Table 6.2 Memory utilization compared with previous design

| RAM unit | NCU [18] | Proposed |
|---|---|---|
| MDCT | 6144 Byte | 6144 Byte |
| TG | 5376 Byte | - |
| Total | 11520 Byte | 6144 Byte |

Table 6.3 ROM table utilization compared with previous design

| ROM unit | NCU [18] | Proposed |
|---|---|---|
| MDCT | 5632 Byte | 2048 Byte |
| TG | 3868 Byte | 3868 Byte |
| Total | 9500 Byte | 5916 Byte |

Table 6.4 Cycle counts compared with previous design

| Window shape | NCU [18] | Proposed |
|---|---|---|
| Long | 22480 | 16432 |
| Eight Short | 23572 | 16063 |

Table 6.5 Logic gate count compared with previous design

| Logic gates | NCU [18] | Proposed |
|---|---|---|
| Total PAM | 72K | 43K |

Chapter 7
Conclusions

In this thesis, we proposed an optimized design for the MDCT-based psychoacoustic model and
a shared hardware design for MDCT/IMDCT/FFT. At the algorithm level, we use Huang's method to
reduce complexity while maintaining quality. We also use a fast algorithm to reduce the
computation load of the filterbank and the size of the coefficient table. At the architecture
level, we reduce the memory requirement by memory rescheduling and partitioning. We then
proposed a shared hardware design with a fully pipelined butterfly unit and a merged
coefficient table for multiple applications, which achieves lower cost and higher performance
than the previous design. Moreover, we apply cache registers to reduce the number of memory
accesses and thus the power, and we use a DSP-oriented design for the threshold generator to
keep the cost low. At the circuit level, we reduce dynamic power by clock gating and operand
isolation, and use high-threshold-voltage cells to reduce leakage power. Overall, the
proposed design features low cost, low power, and high performance, which makes it suitable
for portable applications. We also adopt a platform-based architecture for the AAC encoder
and build a deliverable IP, which shortens time to market and improves performance and
flexibility. The proposed design can encode stereo data at a 48000 Hz sample rate in real
time with a clock rate below 10 MHz. Based on these results, the proposed architecture offers
high efficiency, low power, and low complexity.
In the future, the AAC kernel can be extended with advanced tools such as SBR and PS,
following the MPEG AAC family specifications. The proposed design can also be extended to a
pure-ASIC AAC encoder, or to a platform architecture for multi-function applications such as
an MPEG-4 system including video and audio. Future work can also focus on the quantization
loop, another key component of the AAC encoder, to improve performance and quality. We are
convinced that many applications of these audio coding techniques will appear around us in
the near future.


References
[1]. MPEG. Coding of moving pictures and associated audio for digital storage media at up to 1.5
Mbit/s, part 3: Audio, International Standard IS 11172-3, ISO/IEC JTC1/SC29 WG11, 1992.
[2]. MPEG. Information technology - Generic coding of moving pictures and associated audio, Part
3: Audio, International Standard IS 13818-3, ISO/IEC JTC1/SC29 WG11, 1994.
[3]. MPEG. MPEG-2 Advanced Audio Coding, AAC, International Standard IS 13818-7, ISO/IEC
JTC1/SC29 WG11, 1997.
[4]. MPEG. Information technology - Coding of audio-visual objects - Part 3: Audio, International
Standard IS 14496-3, ISO/IEC JTC1/SC29 WG11, 1999.
[5]. MPEG. Information technology - Coding of audio-visual objects - Part 3: Audio, Amendment
1: Bandwidth extension. ISO/IEC 14496-3:2001/Amd. 1:2003, Nov. 2003.
[6]. MPEG. Information technology - Coding of audio-visual objects - Part 3: Audio, Amendment 2:
Parametric coding for high-quality audio, ISO/IEC 14496-3/Amd. 2: 2004.
[7]. MPEG. Information technology - Coding of audio-visual objects - Part 3: Audio, Amendment 2:
Audio Lossless Coding, ISO/IEC 14496-3/Amd. 2: 2005.
[8]. MPEG. Information technology - Coding of audio-visual objects - Part 3: Audio, Amendment 3:
Scalable Lossless Coding, ISO/IEC 14496-3/Amd. 3: 2005.
[9]. R. Geiger, T. Sporer, J. Koller, and K. Brandenburg, Audio Coding based on Integer
Transform, in AES 111th Convention, New York, NY, USA Preprint 5471 Sept 2001.
[10]. P. Coussy, A. Baganne, and E. Martin. Virtual component IP re-use in telecommunication
systems design: a case study of MPEG-2/JPEG2000 encoder, IEEE Proc .ICECS2002. vol. 2,
pp.733-736, Sept. 2002
[11]. C.N. Liu, and T.H. Tsai, SoC platform based design of MPEG-2/4 AAC audio decoder, IEEE
Proc .ISCAS2005. vol. 3, pp.2581-2584, May. 2005
[12]. Domazet, D.; Kovac, M.; Advanced software implementation of MPEG-AAC audio encoder,
4th EURASIP Conference focused on Video/Image Processing and Multimedia
Communications, 2003. Volume 2, 2-5 July 2003 Page(s):679 - 684 vol.2
[13]. D. Huang, X. Gong, D. Zhou, T. Miki, S. Hotani, Implementation of the MPEG-4 Advanced
Audio Coding encoder on ADSP-21060 SHARC, in Proceedings of the 1999 IEEE
International Symposium on Circuits and Systems, Vol. 3, page(s): 544-547.


[14]. D. Alberto, P. Rafael, R. Begona, A. Enrique, and P. Antonio; A Robust and Efficient
Implementation of MPEG-2/4 AAC Natural Audio Coders , in AES 112th Convention 2002
May 10-13 Munich,Germany
[15]. P. Antonio, A. Enrique, R. Begona, P. Rafael, and D. Alberto; Realtime implementations of
MPEG-2 and MPEG-4 natural audio coders , in AES 110th Convention 2001 May 12-15
Amsterdam, The Netherlands
[16]. Y.C. Lu; C.-F. Shen and C.K. Chen; A novel hardware accelerator architecture for MPEG-2/4
AAC encoder, 2004 IEEE International Conference on Multimedia and Expo, 2004. ICME '04.
Volume 2, 27-30 June 2004 Page(s):1139- 1142 Vol.2
[17]. M. Kahrs, K. Brandenburg, Applications of digital signal processing to audio and acoustics.
Kluwer Academic Publishers, 1998, p.59.
[18]. J.H. Luo, Design and VLSI implementation of Low Complexity MDCT-based Psychoacoustic
Model Co-Processor for MPEG-2/4 AAC Encoder, Department of Electrical Engineering
National Central University Chung-Li; Master thesis, 2006
[19]. Fengduo Hu, ITE Technology Incorporated,2003.
[20]. Y. Takamizawa, T. Nomura, and M. Ikekawa, High-quality and processor-efficient
implementation of an MPEG-2 AAC encoder, in Proceedings of the 2001 IEEE International
Conference on Acoustics, Speech, and Signal Processing, Vol. 2, Page(s): 985-988.
[21]. J. D. Johnston, Transform coding of audio signals using perceptual noise criteria, IEEE
Journal on Selected Areas in Communications, Vol. 6, No 2, pp. 314-323, Feb., 1988.
[22]. I. Dimković, D. Milovanović, Z. Bojković, Fast software implementation of MPEG
advanced audio encoder, 2002 14th International Conference on Digital Signal Processing,
Vol. 2, Page(s): 839-843.
[23]. M. Kahrs, K. Brandenburg, Applications of digital signal processing to audio and acoustics.
Kluwer Academic Publishers, 1998, p.59.
[24]. S.W Huang; T.H. Tsai; L.G. Chen; A low complexity design of psycho-acoustic model for
MPEG-2/4 advanced audio coding, IEEE Transactions on Consumer Electronics Volume 50,
Issue 4, Nov. 2004 Page(s):1209 - 1217 Digital Object Identifier 10.1109/TCE.2004.1362521
[25]. S.W. Huang; L.G. Chen; T.H. Tsai; Memory and Computationally Efficient Psychoacoustic
Model for MPEG AAC on 16-bit Fixed-point Processors, IEEE International Symposium on
Circuits and Systems, 2005. ISCAS 2005. 23-26 May 2005 Page(s):3155-3158.
[26]. P.S. Wu, and Y.T. Hwan; Efficient IMDCT core designs for audio signal processing, IEEE
Workshop on Signal Processing Systems, 2003. SIPS 2003. 27-29 Aug. 2003 Page(s):275-280
[27]. P. Duhamel, Y. Mahieux, and J.P. Petit, A fast algorithm for the implementation of filter banks
based on time domain aliasing cancellation, International Conference on Acoustics, Speech,
and Signal Processing, Vol. 3, Page(s): 2209-2212, Apr, 1991
[28]. Britanak, V.; Rao, K.R.; An efficient implementation of the forward and inverse MDCT in
MPEG audio coding, Signal Processing Letters, IEEE Volume 8, Issue 2, Feb. 2001
Page(s):48 - 51 Digital Object Identifier 10.1109/97.895372.
[29]. Y.H. Fan, Madisetti, V.K. and Mersereau, R.M.; On fast algorithms for computing the inverse
modified discrete cosine transform, IEEE Signal Processing Letters, Volume 6, Issue 3, March
1999 Page(s):61 - 64 Digital Object Identifier 10.1109/97.744625
[30]. M.-H Cheng, Y.-H. Hsu;Fast IMDCT/MDCT algorithms-a matrix approach, IEEE Trans. On
Signal Processing. Jan 2003, pp 221-9
[31]. A. Wenzler and E. Luder, New structures for complex multipliers and their noise analysis, in
Proc. IEEE Int. Symp. Circuits Syst., May 1995, vol. 2, pp. 1432-1435.
[32]. Lau, W. and Chwu, A.; A common transform engine for MPEG and AC3 audio decoder,
IEEE Transactions on Consumer Electronics, Volume 43, Issue 3, Aug. 1997 Page(s):559-566
[33]. T.H. Tsai, S.W. Huang, J.H. Luo, Architecture Design of Psychoacoustic Model for
MPEG-2/4 AAC Audio Encoder The 16th VLSI Design/CAD Symposium (VLSI), 2005.
[34]. T.H. Tsai, J.H. Luo, S.W. Huang, Low Complexity Architecture Design of MDCT-Based
Psychoacoustic Model for MPEG 2/4 AAC Encoder, IEEE Proc .ISCAS2006. May. 2006
[35]. Abed, K.H.; Siferd, R.E.; CMOS VLSI implementation of a low-power logarithmic converter,
IEEE Transactions on Computers, Volume 52, Issue 11, Nov. 2003 Page(s):1421-1433.
[36]. H. Chang et al., Surviving the SoC Revolution: A Guide to Platform-based Designs, Kluwer
Academic, Norwell, Mass., 1999
[37]. M. A. Watson and P. Buettner, Design and implementation of AAC decoders, IEEE Trans.
Consumer Electronics, vol. 46, issue 3, pp.819-824, Aug. 2000.
[38]. T. H. Tsai, C. N. Liu and Y. W. Wang, A pure-ASIC design approach for MPEG-2 AAC audio
decoder, in Proc. 4th IEEE Int. Conf. Information, Communications & Signal Processing and
4th Pacific-Rim Conf. Multimedia (ICICS-PCM), vol. 3, pp.1633-1636, Dec. 2003.


[39]. P. Liu, L. Liu, N. Deng, X. Fu, J. Liu, Q. Liu, G. Zhang, and B. He, VLSI Implementation for
Portable Application Oriented MPEG-4 Audio Codec, Circuits and Systems, 2007. ISCAS
2007. Symposium on IEEE International (ISCAS2007), pp. 777 - 780, May. 2007.
[40]. IP Qualification Alliance, IP Qualification Guidelines, Industrial Technology Research
Institute, 2003
[41]. Synopsys Inc. DesignWare AHB Verification IP Databook , Synopsys, May 2006.
[42]. CoWare Inc. http://www.corware.com/
[43]. T.H. Tsai and C.N. Liu, A Configurable Common Filterbank Processor for Multi-Standard
Audio Decoder, IEICE Transactions on Fundamentals of Electronics, Communications and
Computer Sciences, Vol. E90-A, No.9, pp.1913-1923, Sep. 2007.
[44]. C.C. Wang, and Y.C. Lin An Efficient FFT Processor for DAB Receiver Using
Circuit-Sharing Pipeline Design IEEE Transactions on Broadcasting, Vol. 53, Issue.3,
pp.670-677, Sep. 2007.
[45]. S.C. Tai , C.C. Wang, and C.Y. Lin FFT and IMDCT circuit sharing in DAB receiver IEEE
Transactions on Broadcasting, Vol. 49, Issue.2, pp.124-131, June 2003.
