MPEG-2/4 AAC
Hsing-Chuang Liu
Department of Electrical Engineering
National Central University
Chung-Li, 32001, Taiwan, R.O.C
Phone: 886-3-4227151, Ext. 34580, Fax: 886-3-4255830
Email: metero@dsp.ee.ncu.edu.tw
Abstract
Since MP3 was published and became a popular consumer format, digital audio has become an
important part of daily life. Its applications include broadcast systems (DAB/DAB+), portable
players such as the iPod, and mobile phones. The Moving Picture Experts Group (MPEG) proposed
the MPEG-2/4 AAC standard as the next-generation audio coding technique. AAC offers both
better quality and a higher compression ratio than MP3, but its algorithm is more complex and
computation-intensive. Reducing the computation while maintaining quality is therefore a major
challenge for an AAC encoder.
In this thesis, we optimize the key component of the MPEG-2/4 AAC encoder, the
psychoacoustic model (PAM). The PAM uses several complicated functions to model the human
auditory system. This work applies several methods to lower the cost: a memory-based
architecture for the filterbank, a DSP-oriented threshold generator, shared memory, and a
coefficient merging scheme. We use a fully pipelined MDCT and a fast algorithm for the
filterbank to improve performance. Moreover, we apply cache registers, clock gating, operand
isolation, and multi-Vth cells to reduce power consumption. According to the synthesis results,
our PAM requires 43k gates in TSMC 0.13 µm CMOS technology, operates at 3.1 MHz, and
consumes 3.69 mW for the AAC encoder. We also integrate our design into an SoC platform and
verify it on that platform.
Content
Abstract
Content
List of Figures
List of Tables
Chapter 1 Introduction
1.1 The History and Feature of Audio Application
1.2 The MPEG-2/4 AAC and HE-AAC v1/v2, SLS Encoder System
1.3 Overview of SoC Platform-Based Design
1.4 Motivation
1.5 Thesis Organization
Chapter 2 The Overview of MPEG-2/4 AAC Encoder
2.1 Filterbank
2.2.1 Window Shape Adaptation
2.2.2 Window Type Decision
2.2.3 Modified Discrete Cosine Transform
List of Tables
Table 1.1: Brief the history and feature of MPEG audio standards.
Table 1.2 the complexity of MPEG-2/4 AAC encoder
Table 2.1 Coding tool usage of different profile AAC encoder
Table 3.1 The complexity analysis of MDCT (N=2048)
Table 3.2 The table merge scheme of each reference coefficient.
Table 6.1 The power analysis of each application
Table 6.2 Memory utilization compare with previous design
Table 6.3 ROM table utilization compare with previous design
Table 6.4 Cycle counts compare with previous design
Table 6.5 Logic gate compare with previous design
Chapter 1
Introduction
The first revolution in the digital audio industry was the invention of the Compact Disc (CD)
in 1982. It provides high-quality stereo audio at a sampling rate of 44.1 kHz. As mobile devices
and the Internet became more and more popular, the CD format proved unsuitable for
transmission in low bit-rate and bandwidth-limited environments. Table 1.1 briefly lists the
evolution of the MPEG family. The first MPEG audio standard, MPEG-1 [1], was created in 1992
to meet the new challenges of mobile and Internet technology. This standard is built with three
layers for different communication-based and storage-based applications, such as Digital Audio
Broadcasting (DAB), synchronized video and audio sequences on CD-ROM, and the Integrated
Services Digital Network (ISDN). MP3 (MPEG-1 Layer 3) became a popular audio coding
technology worldwide and is a milestone in audio compression, because it achieves a high
compression ratio, consuming only about 10% of the bit-rate of the CD format while
maintaining transparent audio quality. MPEG-2 [2][3] and MPEG-4 [4] were proposed in
1994 and 1998. The first version of MPEG-2 (MPEG-2 BC, backwards compatible) [2],
created in 1994, is the multi-channel extension of the MPEG-1 standard. It adds multi-channel
capability and supports more flexible sampling rates and bit-rates than MPEG-1. New audio
coding techniques continued to be developed. In 1997, MPEG-2 NBC (non-backwards
compatible) [3] was defined for High-Definition Television (HDTV) and other high-quality
applications. The main feature of the MPEG-2 NBC standard is that it replaces the hybrid
filterbank scheme in order to obtain higher frequency resolution. The MPEG-4 AAC standard is
almost the same as MPEG-2 AAC. In addition, MPEG-4 AAC improves coding efficiency by
adding Perceptual Noise Substitution (PNS), Long Term Prediction (LTP) and TWIN-VQ to
MPEG-2 AAC. In 2003, Spectral Band Replication (SBR) was proposed and became the principal
coding tool of the MPEG-4 AAC extension named MPEG-4 HE-AAC v1 (High Efficiency) [5].
SBR reconstructs the high-frequency band from the low-frequency spectrum information.
This technique significantly improves compression efficiency and reduces the bit consumption
of the original audio encoder by 50%. It reaches perceptually transparent quality at 64 kbit/s
per channel, and it can also be used in multi-channel schemes. In 2004, MPEG published a
more advanced coding tool named MPEG-4 HE-AAC v2. HE-AAC v2 [6] exploits parametric
stereo coding. The new coding tool is based on single channel information
Table 1.1: Brief history and features of the MPEG audio standards.
Year   Standards   Sampling rate (kHz)   Bit-rate (kbits/sec)
32, 44.1, 48
32 - 448
12
32, 44.1, 48
32 - 384
12
32, 44.1, 48
32 - 320
12
32, 44.1, 48
32 - 448
1 - 5.1
16, 22.05, 24
32 - 256
32, 44.1, 48
32 - 384
16, 22.05, 24
8 - 160
32, 44.1, 48
32 - 384
16, 22.05, 24
8 - 160
Channels
1 5.1
1 5.1
1 96
1 96
1 96
1 96
(Lossless coding)
2005 MPEG-4 SLS
(Lossless coding)
1.2 The MPEG-2/4 AAC and HE-AAC v1/v2, SLS Encoder System
MPEG-2/4 AAC includes many functional units, such as the filterbank (a modified discrete
cosine transform (MDCT) with a window operation), the psychoacoustic model (PAM), the
quantization loop (Q Loop), joint stereo coding, and temporal noise shaping (TNS). The flow
chart of the encoder is shown in Figure 1.1. The time-domain audio samples (PCM data) are fed
into the filterbank to obtain the frequency spectrum. The PAM calculates the Signal-to-Masking
Ratio (SMR), which determines the precision of the Q Loop, and selects the window shape used
by the filterbank. After the MDCT converts the time-domain data into a frequency spectrum,
the MDCT coefficients are passed to the SPP, which removes redundancy and irrelevance by
joint stereo coding, mid/side coding and TNS. Finally, in the Q Loop, the spectrum is
non-uniformly quantized and noiselessly coded based on the masking threshold and the number
of available bits, so as to minimize the audible quantization error.
HE-AAC v1 (SBR)
SBR uses the low-frequency components of the audio spectrum to reconstruct the high-band
information. The SBR bit-stream stores only the low-frequency band spectrum and control
signals. The principle and the spectrum recovery scheme are shown in Figure 1.2. The block
diagram of HE-AAC v1 is depicted in Figure 1.3. The SBR coding scheme adds several tools to
the AAC kernel, such as the Analysis Quadrature Mirror Filterbank (AQMF), the Envelope Data
Calculator, SBR-related modules, and a down-sampler. HE-AAC v1 (SBR) is a dual-rate
system. The audio signal at the full sampling rate is fed directly into the SBR encoder and the
down-sampler. The half-rate output of the down-sampler is fed into the AAC encoder. The SBR
encoder calculates control parameters to ensure that the reconstructed high-frequency band is
perceptually as similar as possible to the original high band.
HE-AAC v2 (PS)
The HE-AAC v2 bit-stream is obtained by downmixing the stereo audio to mono and applying
the SBR coding tool. The HE-AAC v2 decoder uses 2-3 kbit/s of side information (Parametric
Stereo information) together with the mono audio signal to recover a transparent stereo audio
signal. Figure 1.4 shows the principle. In the PS coding scheme, only one audio channel is
transmitted together with the parametric side information. Thus, the bit-rate is spent on a
single mono channel (plus a small amount of PS side information), which substantially
improves the perceived audio quality compared with a standard stereo stream at a similar
bit-rate. Figure 1.5 shows the block diagram of HE-AAC v2.
AAC-SLS is a lossless audio coding scheme consisting of a basic (core) layer and a lossless
enhancement (LLE) layer, as shown in Figure 1.6. The core layer is simply an MPEG-2/4 AAC
audio codec. In the AAC-SLS encoder, the input audio signal is losslessly transformed by the
IntMDCT, which uses the lifting scheme [9] so that the forward and inverse transforms
reproduce the same values exactly. The lossless frequency spectrum is fed to the core-layer
AAC encoder to generate the core-layer bit-stream. The LLE layer calculates the residual
between the lossless spectrum and the spectrum reconstructed by the basic layer. The encoder
uses bit-plane coding with Golomb-Rice codes to improve entropy coding efficiency. When
bandwidth is limited, the user can obtain a lossy audio signal from the basic layer alone with
the original AAC codec; otherwise, the user can obtain the lossless signal from the basic layer
plus the enhancement layer. As a result, the user can choose between a close approximation of
the original signal and the lossless signal according to the available bandwidth and the
application requirements.
1.4 Motivation
In general, architectures for audio applications can be classified into three approaches:
DSP-based, pure-ASIC, and semi-ASIC. A DSP-based design has a higher cost and consumes
more power. Pure-ASIC designs with physical implementations have been proposed in previous
papers, but a pure-ASIC design loses flexibility. A semi-ASIC architecture offers flexibility like
a DSP-based design and high performance like a pure-ASIC design. As shown in [11], a
semi-ASIC design uses HW/SW co-design to complete the system, balancing performance and
flexibility with a hardware accelerator and an embedded processor. Because of these benefits,
our design adopts the semi-ASIC approach to implement a high-performance, highly flexible
MPEG-2/4 AAC encoder.
The original MPEG-2/4 system takes low complexity and high quality into account so that
designs can be implemented in embedded systems or portable devices. Because a portable
device has limited resources, an audio codec for portable consumer devices faces many
constraints, such as low power, low cost, low memory requirements and real-time constraints.
Implementations of the AAC encoder can be classified as programmable-based, PC-based
[12][13], DSP-based [14][15], RISC-based, and SW/HW co-design approaches [16]. In a portable
device, the AAC encoder remains challenging because its algorithm is complicated and
computation-intensive, especially given the limited battery. The PC-based and RISC-based
approaches may not provide enough computing capability to meet the real-time constraint, and
they consume much power, so they are not suitable for portable applications. Most papers
discuss implementing and optimizing AAC on DSP-based systems, which also suffer from high
power consumption. Although Y. Takamizawa et al. proposed a good method for a low-power,
low-resource DSP-based implementation, their implementation did not support the window
shape switching mechanism, and omitting window switching may degrade the audio quality.
The SW/HW co-design approach, as in Lu et al. [16], combines the flexibility of software with
the performance of hardware. However, Lu's method makes the window shape decision in the
time domain, and a window shape decision based purely on time-domain information does not
yield reliable audio quality [17]. Our previous design is also a HW/SW co-design [18]. It
implements the PAM as a hardware accelerator module to speed up the encoding flow.
However, that architecture has problems in its special hardware units, register file and
resource sharing, and it is not efficient in area utilization. Therefore, our design modifies the
architecture of the previous design and improves performance, hardware cost and efficiency
using VLSI architecture design techniques (pipelining, parallelism, folding and unfolding).
After designing the hardware accelerator, we also construct a deliverable IP for ARM or
ARM-based systems.
First, the complexity profile of the AAC encoder is shown in Table 1.2. The analysis is
PC-based: an AAC LC (Low Complexity) encoder is simulated with Microsoft Visual C++ 6.0
at a sampling rate of 44.1 kHz and bit-rates of 128 and 64 kb/s. According to the profiling
result, the PAM and the Q Loop together account for over 90% of the computation load. The
algorithms of the PAM are analyzed in the following chapter. Special functions account for
most of the computation load of the PAM. To improve system performance, we choose the most
complex module, the PAM, as the hardware accelerator and optimize its hardware design for
better performance. We also construct a platform-based component for the HW/SW co-design
scheme. The essential considerations of the hardware design are low complexity, low cost, low
power, and the real-time constraint, and we improve our previous design with all of these in
mind.
Table 1.2 The complexity of MPEG-2/4 AAC encoder
             Bitrate 128 kb/s                  Bitrate 64 kb/s
Module       Complexity percentage   MOPS      Complexity percentage   MOPS
Filterbank   2.3%                    2.1       1.4%                    2.1
PAM          57.3%                   52.2      35.1%                   52.2
Q Loop       36.5%                   33        61.1%                   90.1
SPP          3.9%                    3.5       2.4%                    3.5
Total        100%                    90.8      100%                    147.9
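As a quick cross-check of the profiling figures, the following sketch recomputes each module's share from the MOPS values of Table 1.2. The dictionary layout and the function name `complexity_shares` are illustrative choices, not part of the original profiling tool. The recomputed shares agree with the published percentages to within a few tenths of a percentage point, and the PAM and Q Loop together exceed 90% at both bit-rates.

```python
# MOPS per module, copied from Table 1.2 (128 kb/s and 64 kb/s columns).
MOPS = {
    "128 kb/s": {"Filterbank": 2.1, "PAM": 52.2, "Q Loop": 33.0, "SPP": 3.5},
    "64 kb/s": {"Filterbank": 2.1, "PAM": 52.2, "Q Loop": 90.1, "SPP": 3.5},
}


def complexity_shares(mops_by_module):
    """Return (total MOPS, {module: percentage of the total})."""
    total = sum(mops_by_module.values())
    shares = {name: 100.0 * value / total for name, value in mops_by_module.items()}
    return total, shares


if __name__ == "__main__":
    for rate, modules in MOPS.items():
        total, shares = complexity_shares(modules)
        print(rate, "total MOPS:", round(total, 1))
        for name, pct in shares.items():
            print(f"  {name}: {pct:.1f}%")
        print(f"  PAM + Q Loop: {shares['PAM'] + shares['Q Loop']:.1f}%")
```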
This thesis is organized into seven chapters, from the history of the MPEG family to our
MPEG-2/4 AAC encoder design. The content of each chapter is briefly described as follows.
Chapter 1 introduces the history of the MPEG audio family, the trend of current MPEG
audio development, and our motivation.
Chapter 2 gives an overview of the MPEG-2/4 AAC encoder and discusses the
sub-modules of the AAC LC profile, which include the filterbank, PAM, joint stereo
coding, TNS, and Q Loop.
Chapter 3 describes the algorithm-level optimization methods for the MPEG-2/4 AAC
encoder. In this chapter, we introduce the MDCT-based PAM and present some
techniques to reduce the complexity of the arithmetic computation.
Chapter 4 focuses on the architecture design of the MDCT-based PAM. We divide the
hardware into two parts: a hardware-sharing design for the filterbank and a DSP-like design
for the Threshold Generator (TG). The TG implements steps 3 to 13 of the PAM.
Chapter 5 introduces the platform-based design of our system. This platform includes an
embedded CPU and hardware IP. In this chapter, we discuss the construction of the
platform-based system, the construction of the deliverable IP, and software/hardware
co-simulation and co-verification.
Chapter 6 presents the implementation results and discusses low-power techniques at the
physical level. Finally, we compare our results with previous papers.
Chapter 2
The Overview of MPEG-2/4 AAC Encoder
Chapter 1 introduced the basic functional blocks of the AAC encoder. In this chapter, we
discuss the algorithm of each sub-module of the AAC encoder with the Low Complexity (LC)
profile. MPEG-2/4 AAC provides state-of-the-art techniques for achieving transparent quality
at low bit-rates. The detailed encoder flows of MPEG-2 AAC and MPEG-4 AAC are shown in
Figures 2.1 and 2.2. The MPEG-2 and MPEG-4 systems are very similar; the MPEG-4 system is
an enhanced version of MPEG-2. The basic coding tools include the Filterbank, Psychoacoustic
Model, TNS, Joint Coding and Quantization Loop. The enhancement tools of MPEG-4 are Long
Term Prediction (LTP), Perceptual Noise Substitution (PNS), and Transform-domain Weighted
Interleave Vector Quantization (TWIN-VQ). The common tools of MPEG-2/4 are: 1. Gain
control, 2. Psychoacoustic model, 3. Filterbank, 4. Prediction, 5. Quantization and coding,
6. Noiseless coding, 7. Temporal Noise Shaping (TNS), 8. Mid/Side (M/S) stereo coding,
9. Intensity stereo coupling, 10. Bitstream multiplexing.
The AAC standard provides three profiles, the Main profile, the Low Complexity (LC)
profile, and the Scalable Sampling Rate (SSR) profile, to serve applications with various
network bandwidths and storage capacities. Table 2.1 shows the tools used by each profile's
encoder. The Main profile uses all coding tools except gain control. It demands substantial
processing power and yields the highest efficiency. The LC profile applies less compression to
save processing power and memory. It is used in a wide range of applications because it
achieves sufficient quality with low complexity and memory usage. The SSR profile has the
lowest computational complexity, and its frequency scalability lets it adapt to various
bandwidths. The three profiles trade off performance against quality.
Table 2.1 Coding tool usage of different profile AAC encoder
Tool                   Main   LC        SSR
Gain control           NO     NO        YES
Psychoacoustic model   YES    YES       YES
Filterbank             YES    YES       YES
Prediction             YES    NO        NO
Quantization/coding    YES    YES       YES
Noiseless coding       YES    YES       YES
TNS                    YES    Limited   Limited
Mid/Side (M/S)         YES    YES       YES
Intensity coupling     YES    NO        NO
Power consumption and computational load are the major considerations in portable
consumer applications. Given these constraints, we focus only on the LC profile as a
compromise between audio quality and the requirements of portable applications.
2.1 Filterbank
The filterbank provides critical sampling, overlapping of blocks and good frequency
selectivity. Critical sampling is achieved by sub-sampling in the frequency domain in
combination with overlapping blocks. The sub-sampling causes aliasing in the time domain,
which is cancelled by the overlap-and-add operation on two consecutive blocks in the
synthesis filterbank. This technique is called time-domain aliasing cancellation (TDAC) [9].
The filterbank thus consists of a window overlap operation and the modified discrete cosine
transform (MDCT). Each module of the filterbank is discussed below.
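The aliasing cancellation described above can be illustrated with a small numerical experiment. The sketch below is not the thesis implementation: it uses a direct O(N^2) MDCT and IMDCT, a sine window, and a toy block size of 16 samples, all of which are assumptions chosen for brevity. The function names are hypothetical. Each block alone is reconstructed with aliasing, yet the overlap-add of two neighbouring blocks recovers the input.

```python
import math


def mdct(block):
    """Direct MDCT: 2N windowed samples in, N coefficients out."""
    n_out = len(block) // 2
    return [
        sum(x * math.cos(math.pi / n_out * (k + 0.5 + n_out / 2) * (m + 0.5))
            for k, x in enumerate(block))
        for m in range(n_out)
    ]


def imdct(coeffs):
    """Direct IMDCT: N coefficients in, 2N time-aliased samples out."""
    n_out = len(coeffs)
    return [
        (2.0 / n_out) * sum(
            c * math.cos(math.pi / n_out * (k + 0.5 + n_out / 2) * (m + 0.5))
            for m, c in enumerate(coeffs))
        for k in range(2 * n_out)
    ]


def sine_window(length):
    """Sine window; it satisfies w[i]^2 + w[i + length/2]^2 = 1."""
    return [math.sin(math.pi * (i + 0.5) / length) for i in range(length)]


def tdac_roundtrip(signal, n_out):
    """Analyse 50%-overlapped blocks, then resynthesise by overlap-add."""
    win = sine_window(2 * n_out)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n_out + 1, n_out):
        block = [signal[start + i] * win[i] for i in range(2 * n_out)]
        recon = imdct(mdct(block))
        for i in range(2 * n_out):
            out[start + i] += recon[i] * win[i]  # window again, then add
    return out


if __name__ == "__main__":
    n_out = 8
    signal = [math.sin(0.3 * i) + 0.1 * i for i in range(5 * n_out)]
    rebuilt = tdac_roundtrip(signal, n_out)
    interior = range(n_out, len(signal) - n_out)
    print(max(abs(signal[i] - rebuilt[i]) for i in interior))
```

Only the samples covered by two blocks are reconstructed; the first and last half-blocks keep their aliasing because they have no overlapping partner.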
As a result, choosing the window shape appropriately yields better coding efficiency or better
resolution. The remaining question is how to decide the window shape. The AAC encoder uses
the Perceptual Entropy (PE) estimated by the PAM to determine the signal property and select
the window shape. The selection result decides the current window shape and drives the
window shape transition mechanism. The state diagram of the window decision is depicted in
Fig. 2-4. The START WINDOW and the STOP WINDOW serve as buffers when the window type
changes from LONG to SHORT and from SHORT to LONG, respectively.
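The window decision can be sketched as a four-state machine. The transitions below follow the usual AAC block-switching order (LONG, START, SHORT, STOP) and are an assumption, since the state diagram of Fig. 2-4 is not reproduced in this text. The threshold `PE_SWITCH_THRESHOLD` and the function names are hypothetical, and the sketch assumes that the PE test looks one frame ahead so that the START window can precede the SHORT windows.

```python
LONG, START, SHORT, STOP = "LONG", "START", "SHORT", "STOP"

# Hypothetical threshold: a PE above this value is treated as a transient.
PE_SWITCH_THRESHOLD = 1800.0


def next_window(current, transient):
    """Return the next window type given the current one and a transient flag."""
    if current == LONG:
        return START if transient else LONG
    if current == START:
        return SHORT  # a START window is always followed by SHORT windows
    if current == SHORT:
        return SHORT if transient else STOP
    if current == STOP:
        return START if transient else LONG
    raise ValueError(f"unknown window type: {current}")


def window_sequence(pe_values, initial=LONG):
    """Map a list of per-frame PE values to a list of window types."""
    state = initial
    sequence = []
    for pe in pe_values:
        state = next_window(state, pe > PE_SWITCH_THRESHOLD)
        sequence.append(state)
    return sequence


if __name__ == "__main__":
    print(window_sequence([100, 100, 2500, 100, 100, 100]))
```

With the PE values in the example, a single transient produces the sequence LONG, LONG, START, SHORT, STOP, LONG, so the START and STOP windows bracket the SHORT windows as described above.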
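aliasing effect between two discontinued frames. The MDCT formula is written as follows:
$$X_t(m) = \frac{2}{N} \sum_{k=0}^{2N-1} x_t(k) \cos\left[\frac{\pi}{N}\left(k + \frac{1}{2} + \frac{N}{2}\right)\left(m + \frac{1}{2}\right)\right], \qquad m = 0, 1, \ldots, N-1$$
where $x_t(k)$ is the k-th windowed input sample of block t.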
The psychoacoustic model is the most important functional unit of perceptual audio coding.
It models the human auditory system and separates the audible parts of the audio signal from
the inaudible ones. We now discuss the masking effect in the audio spectrum. Masking refers to
a process where one sound is rendered inaudible because of the presence of another sound. In
Figure 2.5 [19], a loud signal masks two other signals at nearby frequencies. Frequency
components that lie below the masking curve produced by the loud signal cannot be heard
while the masker is present. Just as with the threshold in quiet, we can exploit this effect to
remove the inaudible signal components below the new threshold.
The masking threshold of each audio frame is a major factor in audio quality and coding
efficiency. The psychoacoustic model therefore has two tasks in an MPEG audio encoder:
deciding the window shape for the filterbank, which is outside the PAM, and calculating the
masking threshold, expressed as the signal-to-mask ratio (SMR). According to the ISO/IEC
14496-3 [4] standard, the PAM can be arranged into 13 steps. The block diagram of the PAM is
shown in Fig. 2-6 and the detailed flow chart in Fig. 2-7.
Fig. 2.7. The detailed data flow of PAM. Blocks: Steps 1-2 (FFT), Steps 3-4, Spreading Function, Step 5, Step 6, Steps 7-10, Step 11, Step 12, Step 13. Signals: r(w), i(w), e(b), c(b), cb(b), en(b), tb(b) (tonality index), nb(b) (threshold), PE, block type, SMR(n).
In the PAM, the 13 steps for calculating the masking threshold are arranged as follows:
Steps 1-2 perform the time-to-frequency mapping; Steps 3-10 calculate the masking thresholds;
Steps 11-12 determine the window shape; Step 13 outputs the final ratios of signal energy to
threshold (SMR) for the Quantization Loop. In Steps 1-2, the PAM normalizes the time-domain
input samples and transforms them by FFT into a frequency-domain spectrum with real part
r(w) and imaginary part i(w). In Steps 3-4, the real-part spectrum is used to calculate the
partitioned energy e(b) and the imaginary-part spectrum is used to calculate the weighted
unpredictability measure c(b). In Step 5, the partitioned energy and unpredictability are
convolved with the spreading function to estimate the effects across the partitioned bands.
Step 6 estimates the tonality index tb(b), which indicates how tone-like the signal is. Step 7
calculates the Signal-to-Noise Ratio (SNR) and the masking partitioned energy threshold.
Steps 8-10 estimate the masking curve of the spectrum. Steps 11-12 calculate the Perceptual
Entropy (PE) to determine the window shape; the window shape decision requires detecting
whether there is a transient signal in the frame. Finally, the Signal-to-Mask Ratio (SMR) is
computed in Step 13 as the output. The indices w, b, and n refer to the spectral line domain,
the threshold calculation partition domain, and the coder scale-factor band domain,
respectively.
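Step 5 is the easiest of these steps to illustrate in isolation. The sketch below convolves a per-partition quantity with a spreading matrix. The geometric decay used in `toy_spreading_matrix` is purely a placeholder and an assumption; the real spreading function is defined by the standard, and both function names are hypothetical.

```python
def spread(values, spreading):
    """Convolve per-partition values with a spreading matrix.

    spreading[src][dst] is the weight with which partition `src`
    contributes to partition `dst`.
    """
    n = len(values)
    return [sum(values[src] * spreading[src][dst] for src in range(n))
            for dst in range(n)]


def toy_spreading_matrix(n, decay=0.5):
    """Placeholder spreading function: weight decays with partition distance."""
    return [[decay ** abs(src - dst) for dst in range(n)] for src in range(n)]


if __name__ == "__main__":
    energy = [0.0, 0.0, 8.0, 0.0, 0.0]  # a single loud partition
    print(spread(energy, toy_spreading_matrix(len(energy))))
    # prints [2.0, 4.0, 8.0, 4.0, 2.0]
```

The single loud partition raises the values of its neighbours, which is the across-band effect that the later steps turn into a masking threshold.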
Pre-echo is a problem in most block-based coding schemes. TNS uses frequency-domain
prediction to shape the quantization noise in time so that pre-echoes, or noise, become
unnoticeable. It applies a filter to the original spectrum, quantizes the filtered spectrum, and
transmits the quantized filter coefficients in the bit-stream. The filtering performed in the
encoder leads to a temporally shaped distribution of the quantization noise, so that, when
implemented correctly, the noise is not noticeable in the decoded audio signal. TNS is applied
only to long blocks, not to short blocks.
Joint stereo coding can be classified into SS ("simple" or "L/R" stereo), MS ("mid-side"
stereo), and IS ("intensity" stereo). A joint stereo stream may employ a single coding method,
or it may switch between methods from frame to frame or even within a frame for the sake of
efficiency or quality. The methods are introduced below. Simple stereo (SS), or Left-Right (L/R)
stereo, is the most straightforward method of coding a stereo signal: each channel is treated as
a completely separate entity. This can be inefficient and may adversely affect quality when
both channels contain nearly identical signals. Mid-side stereo coding calculates a "mid"
channel from the sum of the left and right channels and a "side" channel from their difference.
Mid-side stereo can significantly reduce the bit-rate, because the encoder needs fewer bits for
the side-channel information. M/S coding is a special case of transform coding and retains the
audio perfectly without introducing artifacts. Finally, intensity stereo coding saves bit-rate by
replacing the left and right channel signals with a single representative signal plus directional
information. This replacement is psychoacoustically justified in the higher frequency range,
since the human auditory system is insensitive to the signal phase at those frequencies.
Intensity stereo is by definition a lossy coding method, so it is primarily useful at low
bit-rates.
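The mid-side transform can be sketched in a few lines. The factor 1/2 in the encoder is an assumption made here so that the decoder is a plain sum and difference; the text above only states that the mid channel comes from the sum and the side channel from the difference. The function names are hypothetical.

```python
def ms_encode(left, right):
    """Convert left/right spectra into mid/side spectra."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert ms_encode: recover left/right from mid/side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


if __name__ == "__main__":
    left = [4.0, 6.0, -2.0]
    right = [4.0, 2.0, -2.0]
    mid, side = ms_encode(left, right)
    print(mid, side)             # the side channel is zero where L equals R
    print(ms_decode(mid, side))  # the original channels are recovered
```

When the two channels are nearly identical, the side channel is close to zero and costs few bits, which is the bit-rate saving described above.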
Quantization Loop
Quantization combines dividing and rounding to map a real quantity onto a small set of
discrete values. A quantized value is more compact to store. However, quantization introduces
error: after the inverse scaling (multiplication), the reconstructed and original values may
differ. The quantizer of the AAC encoder uses two nested loops, an inner loop and an outer
loop, to encode the spectral data and reduce redundant information. The AAC encoder uses a
non-uniform quantizer with the nonlinear operation |X|^(3/4). The equation of the non-uniform
quantizer is
$$x\_quantizer(i) = \mathrm{int}\left[\, |spectrum|^{3/4} \cdot 2^{-\frac{3}{16}\,(gl - scf(i))} + 0.4054 \,\right]$$
where
spectrum is the spectral data of the audio signal,
gl is the global scale factor (rate-controlling parameter), and
scf(i) is the scale factor (distortion-controlling parameter).
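A minimal sketch of the non-uniform quantizer follows. It assumes the usual AAC form, in which the rounding offset 0.4054 is added before truncation and the exponent is -(3/16)(gl - scf). The sign handling and the matching `dequantize` function are additions for illustration and are not taken from the text.

```python
MAGIC = 0.4054  # rounding offset used by the quantizer


def quantize(x, gl, scf):
    """Quantize one spectral value with global gain gl and scale factor scf."""
    q = int(abs(x) ** 0.75 * 2.0 ** (-3.0 / 16.0 * (gl - scf)) + MAGIC)
    return -q if x < 0 else q  # assumption: the sign is carried separately


def dequantize(q, gl, scf):
    """Approximate inverse of quantize (illustrative only)."""
    mag = abs(q) ** (4.0 / 3.0) * 2.0 ** (0.25 * (gl - scf))
    return -mag if q < 0 else mag


if __name__ == "__main__":
    print(quantize(16.0, 0, 0))   # fine step size
    print(quantize(16.0, 16, 0))  # coarser step after raising gl
    print(dequantize(quantize(16.0, 0, 0), 0, 0))
```

With gl equal to scf, the value 16.0 quantizes to 8 because 16^(3/4) is 8. Raising gl by 16 scales the result by 2^-3, so the same value quantizes to 1. This is how gl controls the bit-rate, while raising scf(i) refines a single band.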
The quantization loop must satisfy two rules. The first concerns quality: the quantization
noise must be below the SMR curve. The second concerns bit-rate: the number of bits consumed
must be less than the bit-rate requirement. However, these two rules cannot always both be
satisfied, especially at low bit-rates. The standard therefore defines two extra exit rules.
First, all of the scale factor bands have been amplified. Second, the difference between two
consecutive scale factor bands exceeds 60. Figure 2.8 shows the quantization loop flow, which
is described below.
First, the quantization loop initializes gl and scf(i). After initialization, the rate control
mechanism (inner loop) calculates the bit consumption and adjusts gl until the consumption is
less than the specified requirement. Control then passes to the outer loop, where the
quantization noise is estimated. The outer loop finishes in one of two ways: either the
quantization noise is below the SMR curve, or control jumps back to the inner loop and the
quantization step continues to execute until one of the exit rules, primary or extra, is
satisfied.
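The two nested loops can be sketched as follows. This is a simplified illustration under several assumptions, not the encoder of the standard: `count_bits` is a hypothetical stand-in for the noiseless coding cost, the noise measure is a plain squared error per band, gl and scf change in steps of one, and `max_outer` is an added safety limit.

```python
MAGIC = 0.4054


def quantize(x, gl, scf):
    q = int(abs(x) ** 0.75 * 2.0 ** (-3.0 / 16.0 * (gl - scf)) + MAGIC)
    return -q if x < 0 else q


def dequantize(q, gl, scf):
    mag = abs(q) ** (4.0 / 3.0) * 2.0 ** (0.25 * (gl - scf))
    return -mag if q < 0 else mag


def count_bits(quantized_bands):
    """Hypothetical bit cost: magnitude bits plus a sign bit per nonzero value."""
    return sum(abs(q).bit_length() + 1
               for band in quantized_bands for q in band if q != 0)


def rate_loop(bands, scf, gl, bit_budget):
    """Inner loop: raise gl until the quantized frame fits the bit budget."""
    while True:
        quantized = [[quantize(x, gl, s) for x in band]
                     for band, s in zip(bands, scf)]
        if count_bits(quantized) <= bit_budget:
            return gl, quantized
        gl += 1


def quantization_loop(bands, allowed_noise, bit_budget, max_outer=32):
    """Outer loop: amplify bands whose noise exceeds the allowed level."""
    if bit_budget < 0:
        raise ValueError("bit_budget must be non-negative")
    scf = [0] * len(bands)
    gl = 0
    for _ in range(max_outer):
        gl, quantized = rate_loop(bands, scf, gl, bit_budget)
        noise = [sum((x - dequantize(q, gl, s)) ** 2 for x, q in zip(band, qband))
                 for band, qband, s in zip(bands, quantized, scf)]
        too_noisy = [i for i, (n, limit) in enumerate(zip(noise, allowed_noise))
                     if n > limit]
        if not too_noisy:
            break  # primary rule: noise is below the allowed level everywhere
        for i in too_noisy:
            scf[i] += 1
        all_amplified = all(s > 0 for s in scf)                          # extra rule 1
        too_steep = any(abs(a - b) > 60 for a, b in zip(scf, scf[1:]))  # extra rule 2
        if all_amplified or too_steep:
            break
    # Re-run the rate loop so the returned data matches the final scf values.
    gl, quantized = rate_loop(bands, scf, gl, bit_budget)
    return gl, scf, quantized


if __name__ == "__main__":
    bands = [[100.0, -50.0, 25.0], [10.0, 5.0, -2.0]]
    print(quantization_loop(bands, allowed_noise=[1.0, 1.0], bit_budget=40))
```

The bit budget always holds on exit because the rate loop runs last. The noise target may not hold, which is exactly the situation the two extra exit rules are meant to handle.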
Chapter 3
The Algorithm of Low Complexity MDCT-Based Psychoacoustic Model
The original encoder flow uses a filterbank with the MDCT (a 2048-point LONG window
shape and eight 256-point SHORT window shapes) to transform the audio data from a
time-domain signal to a frequency-domain spectrum. The psychoacoustic model uses an FFT
(a 2048-point LONG window shape and eight 256-point SHORT window shapes) to achieve a
similar function. The MDCT spectrum can replace the complex FFT spectrum through the
MDCT-based PAM algorithm. The block diagrams of the original PAM scheme and the
MDCT-based PAM are shown in Figure 3.1 (a) and (b).
The MDCT-based PAM algorithm was first published by Takamizawa [20], who found that the
spectrum information from the MDCT is sufficient for the PAM calculation. Thus, the
time-to-frequency transform (FFT) in the PAM can be replaced by the MDCT and shared with
the filterbank, which is outside the PAM. The MDCT-based algorithm removes the FFT
computation load. However, the MDCT-based spectrum lacks the phase information carried by
the imaginary part of the FFT. The lack
In general, the MDCT is a computationally complex transform with a large number of
additions and multiplications. Our design uses a fast algorithm for the MDCT to reduce
computational complexity and improve performance. Various fast MDCT algorithms have been
proposed. According to [26], they can be classified into: (1) factorizing the MDCT computation
into a complex- or real-valued FFT, e.g. [27]; (2) mapping the MDCT to the DCT-II through
trigonometric equivalences and applying a fast DCT algorithm, e.g. [28]; (3) using
trigonometric equivalences to convert the MDCT coefficients into twiddle-factor form
recursively, e.g. [29]; and (4) using matrix decomposition to reduce the size from N to N/2 and
then applying a DCT/IDCT-II kernel, e.g. [30].
Considering the VLSI architecture implementation, our design applies the FFT-based
algorithm (1) to implement the MDCT. Based on [27], the MDCT formula can be rewritten as
$$X(m) = e^{-\frac{2\pi i}{n}\left(m + \frac{1}{8}\right)} \sum_{k=0}^{n/4-1} \left( f_k \, e^{-\frac{2\pi i}{n}\left(k + \frac{1}{8}\right)} \right) e^{-\frac{2\pi i}{n/4}\, mk}, \qquad m = 0 \sim n/4-1$$
where
$$n = 2048 \text{ (long window)},\ 256 \text{ (short window)}$$
$$f_k = \big(f(2k) - f(n - 2k - 1)\big) + i\,\big(f(n/2 + 2k) - f(n/2 - 2k - 1)\big)$$
$$X(2m) = \mathrm{Re}(X(m)), \quad X(2m + n/2) = \mathrm{Im}(X(m)), \qquad m = 0 \sim n/4 - 1$$
The MDCT flow is shown in Figure 3.2.
First, the input data are reordered into N/4 points, each with a real part and an imaginary
part; f_k is the reordering operation. Next, each complex number is multiplied by the
pre-twiddle coefficient $e^{-\frac{2\pi i}{n}\left(k + \frac{1}{8}\right)}$ to make it suitable
for the FFT operation. Then the time-to-frequency operation is performed by the N/4-point FFT
kernel. After that, the FFT spectrum is converted to the MDCT spectrum by the post-twiddle
operation, which is the same as the pre-twiddle. Finally, the complex data are de-interleaved
into real numbers, which maps the 512-point complex FFT spectrum to the 1024-point
real-valued MDCT spectrum. To improve hardware efficiency and reduce the computation load,
we use the radix-2^3 FFT algorithm. Figure 3.3 shows the signal flow graph of the radix-2^3
FFT.
The radix-2^3 algorithm can reduce the computation load by more than 50% relative to
previous work using the radix-2 algorithm. Complex multiplications are the major operation of
the FFT algorithm. Traditionally, a complex multiplier is implemented with four real
multipliers and two adders:
$$A \cdot B = (a + bj)(c + dj) = (ac - bd) + j\,(ad + bc)$$
According to [31], the complex multiplier can be realized with three multipliers and five
adders, which reduces the multiplier count and improves performance:
$$A \cdot B = (a + bj)(c + dj) = [\,c\,(a - b) + b\,(c - d)\,] + j\,[\,d\,(a + b) + b\,(c - d)\,]$$
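The two multiplier structures can be compared directly. The sketch below implements both forms for real inputs a, b, c, d, where A = a + bj and B = c + dj. The function names are illustrative. The shared product b(c - d) is what removes the fourth multiplication.

```python
def cmul_4m2a(a, b, c, d):
    """(a + bj)(c + dj) with four multiplications and two additions."""
    return a * c - b * d, a * d + b * c


def cmul_3m5a(a, b, c, d):
    """(a + bj)(c + dj) with three multiplications and five additions."""
    shared = b * (c - d)        # 1 multiplication, 1 addition
    real = c * (a - b) + shared  # 1 multiplication, 2 additions
    imag = d * (a + b) + shared  # 1 multiplication, 2 additions
    return real, imag


if __name__ == "__main__":
    print(cmul_4m2a(1, 2, 3, 4))  # prints (-5, 10)
    print(cmul_3m5a(1, 2, 3, 4))  # prints (-5, 10)
```

When (c, d) is a fixed twiddle factor, the term (c - d) can also be stored in the coefficient table instead of being computed.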
Based on [9], the complex multiplier needs only three multipliers and three adders when the
lifting scheme is used. This method uses matrix decomposition to factorize the original matrix
into three sub-matrices. Each sub-matrix has only one multiplier, but extra arithmetic
operations such as division and subtraction are needed. These extra operations can be removed
by pre-processing in our case, because the parameters (c, d) are the twiddle factors of the FFT
and of the pre/post-twiddle, which consist of sine and cosine coefficients. The formula is
$$A \cdot B = \begin{bmatrix} c & -d \\ d & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 1 & \alpha \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta & 1 \end{bmatrix} \begin{bmatrix} 1 & \alpha \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$
where $\alpha = (c - 1)/d$ and $\beta = d$.
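The lifting factorization holds when (c, d) lies on the unit circle, which is the case for twiddle factors. The sketch below, in Python, applies the three lifting steps in sequence and can be compared with the direct product; the function name `lifting_rotate` and the angle parameter `theta` are assumptions made for illustration only.

```python
import math


def lifting_rotate(a, b, theta):
    """Multiply (a + bj) by (cos(theta) + j*sin(theta)) with three lifting steps.

    alpha and beta depend only on the twiddle factor, so in hardware they
    would be precomputed; the sketch requires sin(theta) != 0.
    """
    c, d = math.cos(theta), math.sin(theta)
    alpha = (c - 1.0) / d
    beta = d
    a = a + alpha * b   # rightmost matrix [1 alpha; 0 1]
    b = b + beta * a    # middle matrix    [1 0; beta 1]
    a = a + alpha * b   # leftmost matrix  [1 alpha; 0 1]
    return a, b         # (real part, imaginary part)
```

Each step uses one multiplication and one addition, which matches the count of three multipliers and three adders; the division appears only in the precomputed `alpha`.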
In summary, the lifting scheme requires fewer operators, but it needs extra arithmetic
operations. This overhead causes extra computation load or extra coefficient tables. Under the
low-cost consideration, this method is not suitable for our design. Our design selects the
method with three multiplications and five additions for the FFT algorithm to balance
complexity and cost. Table 3.1 shows the complexity of various approaches to the MDCT
algorithm with FFT.
Table 3.1 The complexity analysis of MDCT (N=2048)
Arithmetic operator    Direct
Multiplication         2,097,152    13,312    7,296     5,760
Addition               2,095,104    12,288    21,376    18,816
According to Table 3.1, the FFT-based approach with the radix-2^3 algorithm greatly reduces the
computation load. This approach can also apply VLSI techniques such as pipelining or
folding to achieve the low-cost and low-complexity goal.
In the following, we introduce the proposed coefficient merge scheme for the window
operation, MDCT/IMDCT, and the twiddle factors of the FFT. In the MDCT-based filterbank
flow, four modules require coefficients: the window operation, pre-twiddle, FFT operation,
and post-twiddle. Table 3.2 shows the original equation of each coefficient for the long and
short windows.
Table 3.2 The table merge scheme of each reference coefficient.
FFT twiddle factor / reference coefficient    Sin(2πi/2048), Cos(2πi/2048); Sin(2πi/4096), Cos(2πi/4096)
Pre/post-twiddle (long)                       Re = Cos(π(i + 0.125)/1024), Im = Sin(π(i + 0.125)/1024)
Pre/post-twiddle (short)                      Re = Cos(π(i + 0.125)/128), Im = Sin(π(i + 0.125)/128)
Window (long)                                 Sin(π(i + 0.5)/2048)
Window (short)                                Sin(π(i + 0.5)/256)
Based on table 3.2, every coefficient is a trigonometric function, either sine or cosine. The
coefficients can be merged by exploiting the symmetry of trigonometric functions and the
similarity between sine and cosine. The coefficient merge flow has three steps. Step one
profiles each equation; the equations are very similar and differ only in resolution. We then
select one equation as the reference coefficient and reconstruct the other coefficients from it.
The FFT twiddle factor is used as the reference coefficient, because it has the higher resolution
and only 1/8 period of the cosine and sine coefficients needs to be stored in the table. The other
coefficients can be reconstructed from the stored 1/8-period values. Step one is shown in
figure 3.4 as the FFT twiddle factor. In step two, the other coefficients, such as the
pre/post-twiddle and window coefficients, are recovered from the stored 1/8-period
coefficients. Moreover, the coefficient merge scheme is separated into two schemes for
different applications. The resolution of the original coefficient is enough to recover the
pre/post-twiddle in the MDCT approach. However, the filterbank has a window operation
which needs higher resolution. To recover this coefficient, the resolution of the reference
coefficient has to be increased. The coefficient merge schemes I and II are illustrated in
figure 3.4 (a) and (b). Step three compensates the numerical error of coefficient tables such as
the pre/post-twiddle (long) and the window coefficient (long). Those coefficients require
higher resolution to be recovered completely, which would cause extra overhead. Our design
uses an approximation method with error compensation to recover those coefficients, which
maintains the quality and avoids additional load. Overall, the coefficient merge schemes
reduce the ROM table requirement by more than 70% (original: 3976 words; scheme I: 512
words; scheme II: 1024 words).
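The symmetry step of the merge scheme can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: it covers only the reconstruction of a full sine/cosine period from first-octant tables, not the offset coefficients or the error compensation of step three. The names `SIN_ROM`, `COS_ROM` and `sin_cos` are hypothetical, and the sketch stores indices 0 to N/8 inclusive (one entry more per table than an exact 1/8 period) so that the 45-degree point is available.

```python
import math

N = 2048            # assumed resolution: one full period in N steps
EIGHTH = N // 8

# First-octant tables (assumption: indices 0..N/8 inclusive).
SIN_ROM = [math.sin(2.0 * math.pi * k / N) for k in range(EIGHTH + 1)]
COS_ROM = [math.cos(2.0 * math.pi * k / N) for k in range(EIGHTH + 1)]


def sin_cos(i):
    """Return (sin, cos) of 2*pi*i/N using only the first-octant tables."""
    i %= N
    quadrant, r = divmod(i, N // 4)      # r is the offset inside the quadrant
    if r <= EIGHTH:
        s, c = SIN_ROM[r], COS_ROM[r]
    else:
        m = N // 4 - r                   # mirror: sin(x) = cos(pi/2 - x)
        s, c = COS_ROM[m], SIN_ROM[m]
    # Rotate the first-quadrant value by quadrant * 90 degrees.
    if quadrant == 0:
        return s, c
    if quadrant == 1:
        return c, -s
    if quadrant == 2:
        return -s, -c
    return -c, s
```

In hardware, the mirror and sign selection would correspond to address mapping and multiplexers rather than branches.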
Moreover, this MDCT approach can be extended to related functions such as the IMDCT for the
AAC decoder and the 2048-point FFT for the DAB+ system. Figure 3.2 shows the
MDCT/IMDCT/FFT signal flow chart. The proposed coefficient merge schemes can also be
applied to the IMDCT and FFT functions. This shows that the algorithm and hardware design
can be used in audio codec applications and DAB+ receiver systems. In VLSI architecture
design, it achieves a low-cost and configurable design for multiple applications, because the
algorithm is VLSI-oriented. The details are discussed in the next chapter.
As discussed previously, the PAM is the most complex and important module in the MPEG-2/4
AAC encoder. Huang's method [24][25] uses three methods to reduce the computation load
and memory bandwidth. These three methods are introduced in this section.
Step 5 of the PAM coding flow calculates the spreading function. However, the spreading
function includes complex arithmetic operations such as square roots and powers of ten.
The pseudo code of the spreading function is shown in figure 3.5.
Spreading function (bark value i, bark value j)
{
if (j >= i) tmpx = 3.0*(j-i)
else tmpx = 1.5*(j-i)
tmpz = 8 * minimum(((tmpx-0.5)^2 - 2*(tmpx-0.5)), 0)
tmpy = 15.811389 + 7.5*(tmpx+0.474) - 17.5*(1.0+(tmpx+0.474)^2)^0.5
if (tmpy < -100) then return 0
else return 10^((tmpz+tmpy)/10)
}
Fig. 3.5 The Pseudo code of the spreading function.
The spreading function depends on the bark values, and the bark values depend on the
sampling rate and window shape. The complex operations can therefore be replaced by a
look-up table to reduce the computation load. Furthermore, Huang optimizes the look-up table
size: he found that the non-zero values are distributed along the diagonal of the look-up table,
and reorders the non-zero values into a linear array. Figure 3.6 illustrates the linear array of the
look-up table. This method reduces the computation load with little overhead, and the
optimization reduces the look-up table requirement from 6664 words to 2067 words at a
sampling rate of 44100 Hz.
Fig. 3.6 The linear array of the look-up table: a linear array for the non-zero values and an array of start/end indices.
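The idea of storing only the non-zero band of the spreading table can be sketched in Python as below. The spreading function follows the pseudo code of figure 3.5; the packing layout (a `values` array plus one `(start, end, offset)` entry per row) and the names `pack_band` and `lookup` are assumptions for illustration, and the bark values passed in are placeholders, so the sketch does not reproduce the 6664-word and 2067-word figures.

```python
import math


def spreading(i, j):
    """Spreading function of figure 3.5 for bark values i and j."""
    tmpx = 3.0 * (j - i) if j >= i else 1.5 * (j - i)
    tmpz = 8.0 * min((tmpx - 0.5) ** 2 - 2.0 * (tmpx - 0.5), 0.0)
    tmpy = (15.811389 + 7.5 * (tmpx + 0.474)
            - 17.5 * math.sqrt(1.0 + (tmpx + 0.474) ** 2))
    if tmpy < -100.0:
        return 0.0
    return 10.0 ** ((tmpz + tmpy) / 10.0)


def pack_band(barks):
    """Store each table row from its first to its last non-zero entry."""
    values, index = [], []
    for bi in barks:
        row = [spreading(bi, bj) for bj in barks]
        nonzero = [k for k, v in enumerate(row) if v != 0.0]
        start, end = nonzero[0], nonzero[-1]   # the diagonal is never zero
        index.append((start, end, len(values)))
        values.extend(row[start:end + 1])
    return values, index


def lookup(values, index, i, j):
    """Read entry (i, j) of the full table from the packed arrays."""
    start, end, offset = index[i]
    if j < start or j > end:
        return 0.0
    return values[offset + (j - start)]
```

A call such as `pack_band([24.0 * k / 69 for k in range(70)])` builds the packed table for 70 hypothetical, evenly spaced partitions.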
Previous work [20][22] on MDCT-based PAM cannot guarantee good quality when the
window shape decision is made in the time domain alone. Huang's method [24] reduces the
complexity and prevents quality degradation. Figure 3.7 illustrates the window shape
decision scheme of Huang's method. This method executes in two phases. Phase 1 predicts
the window shape of the current frame by perceptual entropy (PE). Phase 2 then
calculates the spectrum with the selected window shape and the signal-to-mask ratio (SMR) for
the quantization loop and the other coding tools.
Fig. 3.7 The window shape decision scheme of Huang's method (blocks: input buffer, Delay 2, MDCT 1 (LONG), MDCT 2 (1 of 4), Threshold Generation 1 (LONG), Threshold Generation 2 (SHORT), Delay 1, window shape, SMR, output buffer).
The scheme uses two parallel PAMs and two delay units for the two phases. PAM 1, which uses
only the long window shape, generates the threshold, detects transients, and decides the
window shape of the current frame. The sequence is stored in the delay unit until PAM 1
finishes, and then the window shape information and audio data are fed into PAM 2 to calculate
the required information. The window shape transition method is the same as defined in the
standard. This scheme improves the quality of the MDCT-based PAM.
After the previous optimizations, the PAM flow still includes complex operations such as
log10, division, and powers of ten. In VLSI-oriented and DSP-oriented design, those
arithmetic operations are difficult to implement in hardware or with DSP instructions. To
implement the PAM on a hardware or DSP platform and reduce complex computation, Huang
[13] uses log-scale and rescheduling techniques to calculate steps 7-13. With this method,
multiplications and divisions in the original domain correspond to additions and
subtractions in the logarithmic domain. The flow chart of the optimized result is shown in
figure 3.8.
Moreover, the memory storage and bandwidth requirements of the Threshold Generation (TG)
are reduced, because energy and masking threshold data are kept only within the block. The
word length of those data in logarithmic format is less than in the original format.
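The correspondence between the two domains can be illustrated with a small Python sketch. It is not the actual schedule of steps 7-13; the function names and the example of a ratio and a weighted energy are assumptions chosen only to show how a divider and a multiplier turn into a subtractor and an adder.

```python
import math


def ratio_linear(energy, threshold):
    """Ratio in the original domain: needs a divider."""
    return energy / threshold


def ratio_log(log_energy, log_threshold):
    """Same ratio on log10 values: a subtractor is enough."""
    return log_energy - log_threshold


def weight_linear(energy, weight):
    """Weighted energy in the original domain: needs a multiplier."""
    return energy * weight


def weight_log(log_energy, log_weight):
    """Same weighting on log10 values: an adder is enough."""
    return log_energy + log_weight
```

Converting the log-domain result back with a power of ten gives the original-domain value, for example `10 ** ratio_log(math.log10(e), math.log10(t))` for positive `e` and `t`.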
Chapter 4
The Architecture of Low Complexity
Psychoacoustic Model
The previous chapters discussed the algorithm of the MDCT-based PAM and the AAC
encoder flow. This chapter describes the hardware design, focusing on an architecture with
low power and low complexity. Based on Huang's method: (1) The MDCT-based PAM
algorithm removes the FFT computation by using the MDCT spectrum, so the hardware needs
only one type of filterbank, the FFT-based MDCT, which reduces the computation
complexity. (2) The unpredictability measure is replaced by the SFM, which not only avoids
computing special functions (sine, cosine, square root, and division) but also reduces the
memory utilization. (3) The complex special functions of the spreading function are replaced
by a look-up table, which needs an extra ROM table in the hardware design. (4) With the
logarithmic-based design, the complex equations of the TG need only the log10 and
power-of-ten operations. (5) The power-of-ten operation can be removed by the
logarithmic-based quantization loop algorithm. The only complex arithmetic operation needed
in the PAM is log10.
In the proposed hardware design, the PAM has three modules: the filterbank (MDCT), the
threshold generator, and the controller. Huang's MDCT-based algorithm (method 2) requires
parallel PAMs to perform the window shape decision and the SMR calculation. Figure 4.1 (a)
shows the original MDCT-based architecture, which has a predicted phase and an evaluated
phase. The predicted phase calculates the window shape of the current frame. The evaluated
phase calculates the SMR and the frequency spectrum. Our design implements the PAM with
the folding technique, a VLSI architecture design technique, to meet the low-cost
consideration. Figure 4.1 (b) shows the folding architecture design for the MDCT-based PAM.
A data rescheduling technique rearranges the original input data schedule so that it fits the
folding architecture. With data rescheduling and the folding architecture, the hardware
requires only one PAM module and still completes the two-phase mechanism to obtain the
essential information for the encoder flow.
Fig 4.1 (b) The folding architecture design for MDCT-based PAM
In general, storage components such as RAM and ROM are the major cost in digital design.
After profiling the memory usage and analyzing the utilization of the PAM, the total memory
usage of the previous design is 92160 bits, and the utilization of each memory is shown in
figure 4.2. The original memory usage is inefficient: each memory is idle for longer than it is
active. Inefficient memory usage causes more power consumption in a digital design.
The different block areas in figure 4.2 represent different memory sizes. The memory sizes
are 1024x24, 128x24, and 128x24 in the original design, and 512x24 and 256x24 in the
proposed design. The previous PAM design uses local memory in each module (MDCT, TG),
as shown in figure 4.3 (a). The proposed method uses the concept of shared memory to
improve the efficiency and meet the low-cost constraint, as shown in figure 4.3 (b). It includes
two techniques for this goal: memory rescheduling and memory partition. In the rescheduling
scheme, memory of the same word length is shared by multiple modules in different time
slots. In the partition scheme, the memory utilization is improved further: a RAM with 2N
word length can be replaced by two RAMs with N word length, or an N-point RAM can be
replaced by two N/2-point RAMs for different modules. The proposed part of figure 4.2 shows
the memory utilization with rescheduling and partition. Finally, the memory requirement is
reduced from 92160 bits to 49152 bits. In other words, the proposed method saves about 50%
of the storage elements. The key modules, MDCT and TG, are described in the next sections.
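The saving quoted above can be checked with a few lines of Python. The sketch only restates the two totals from the text; the constant and function names are hypothetical, and the 24-bit word width is taken from the memory sizes listed above.

```python
WORD_BITS = 24            # word width of the memories listed above
PREVIOUS_BITS = 92160     # total memory of the previous design
PROPOSED_BITS = 49152     # total memory of the proposed design


def to_bytes(bits):
    """Convert a size in bits to bytes."""
    return bits // 8


def to_words(bits):
    """Convert a size in bits to 24-bit words."""
    return bits // WORD_BITS


def saving_percent(before, after):
    """Relative saving of `after` with respect to `before`, in percent."""
    return 100.0 * (1.0 - after / before)
```

With these totals, `saving_percent(PREVIOUS_BITS, PROPOSED_BITS)` is about 46.7, which the text rounds to about 50%.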
Fig 4.3 The memory requirement of different approach: (a) local memory (b) shared memory
The major computation load of the PAM is the MDCT calculation. Our design not only
improves the performance at the algorithm level, but also applies VLSI architecture design to
obtain better performance. As discussed in the previous chapter, the FFT-based MDCT has
advantages in VLSI architecture design and uses the radix-2^3 algorithm to reduce
complexity. The FFT design is the major part of the filterbank. The proposed method uses a
memory-based FFT with a fully pipelined butterfly unit as a compromise between cost and
performance constraints. In addition, this hardware design can perform multiple functions,
including MDCT, IMDCT, and FFT, with a temporary memory of the corresponding size.
Figure 4.4 shows the block diagram of the hardware-shared design for MDCT/IMDCT/FFT.
The proposed design consists of a memory unit (RAM), a butterfly unit, cache registers, a
controller, an address generator, and a coefficient generator. Each block of the hardware
design is described as follows.
Figure 4.5 shows the architecture of the coefficient generator. Based on the proposed
coefficient merge method, the coefficient table needs only 1024 words, together with multiple
multiplexers and optional offset compensation. Finally, the reconstructed value is calculated
by those two steps.
A similar MDCT architecture performs the butterfly operation of the FFT [32].
Based on the previous design [32], the proposed method uses a fast algorithm and cache
registers to improve the performance and reduce the power consumption. Next, the butterfly
unit and the cache-register mechanism are introduced. The signal flow chart of the butterfly is
shown in figure 4.6. Because the proposed method uses the radix-2^3 algorithm, the signal
flow differs from that of the original radix-2 butterfly unit.
Figure 4.7 shows the architecture of the butterfly unit. The butterfly unit consists of one
multiplier, three adders/subtractors, and four pipeline registers. Figure 4.8 is the timing
chart of the pipelined butterfly unit for the pre/post-twiddle operation and the butterfly
operation. The hardware achieves 3 clock cycles per pre/post-twiddle and 6 clock cycles per
butterfly. In other words, the butterfly operation needs two complex multiplications in
radix-2^3 and the pre/post-twiddle needs one complex multiplication, and each complex
multiplication takes 3 clock cycles with the three-multiplication, five-addition method. The
utilization of the multiplier is 100%, so that a product is generated every clock cycle and fed
to the add/sub module. As the pipeline timing chart in figure 4.8 shows, continuous results are
output after 4 cycles for the pre/post-twiddle and after 7 cycles for the butterfly operation.
Our design also proposes a solution for the low-power consideration. Generally speaking,
memory accounts for about 30%-50% of the power consumption in a digital design, and
access-intensive memory consumes more power. Meanwhile, memory is a hard IP in
cell-based design, so its architecture cannot be modified at the transistor level or circuit level.
How can the power consumption be reduced in this situation? The designer can only change
the number of memory accesses and select the memory type. The proposed design selects
single-port SRAM as the storage element and uses cache registers to reduce the number of
memory accesses and thus the power consumption.
Fig 4.9 The memory-based architecture of (a) original (b) with cache register
Figure 4.9 shows the original and the modified memory-based design. The cache-register
design adds a little overhead, but it reduces the memory access count by more than 50% in the
radix-2^3 butterfly operation. The cache registers sit between the memory and the butterfly
unit and store temporary data during each 8-point FFT operation. Figure 4.10 shows the
memory access schemes of the two designs. In the original design, the PE (processing
element) accesses the memory in every 2-point butterfly computation of the 8-point FFT. The
cache-register design accesses the memory only when reading data from the memory and
when saving data back; the other accesses are served by the cache registers. The memory
access counts are calculated by the equations below. With this scheme, the memory accesses
are reduced by more than 60% compared with the original scheme.
In the original scheme (8-point example):
Memory access: 4 butterfly units x 2-point R/W access x 3 stages = 4 x 2 x (2+2) x 3 = 96
In the cache-register scheme (8-point example):
Memory access: 8-point read / 8-point write = 8 x (2+2) = 32
Cache-register access: 4 butterfly units x 2-point R/W access x (stages - 1) = 4 x 2 x (2+2) x 2 = 64
Fig 4.10 The memory access scheme of original and cache register design.
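The access counts above can be reproduced with a short Python sketch. The function names are hypothetical, and the factor (2+2) is treated simply as four accesses per point, as written in the equations; the sketch assumes the FFT size is a power of two with radix-2 stages.

```python
def original_accesses(points=8, accesses_per_point=4):
    """Memory accesses when every 2-point butterfly reads and writes RAM."""
    stages = points.bit_length() - 1      # log2(points) radix-2 stages
    butterflies = points // 2
    return butterflies * 2 * accesses_per_point * stages


def cached_accesses(points=8, accesses_per_point=4):
    """Return (memory accesses, cache-register accesses) with cache registers."""
    stages = points.bit_length() - 1
    butterflies = points // 2
    memory = points * accesses_per_point                     # one read, one write pass
    cache = butterflies * 2 * accesses_per_point * (stages - 1)
    return memory, cache


def memory_reduction_percent(points=8):
    """Reduction of RAM accesses relative to the original scheme."""
    memory, _ = cached_accesses(points)
    return 100.0 * (1.0 - memory / original_accesses(points))
```

For the 8-point example, the RAM accesses drop from 96 to 32, a reduction of about 67%, which agrees with the statement of more than 60%.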
The address generator and the controller are implemented as finite state machines (FSM).
Moreover, the address generator translates the original address into the coefficient merge table
address, and calculates the memory R/W addresses and the cache-register R/W enable signals.
The memory unit is divided into eight parts, which benefits the power consumption and
provides a flexible memory size for the various applications (MDCT/IMDCT/FFT).
The threshold generator calculates the SMR for the quantization loop and the perceptual
entropy (PE) for the window shape decision. In the previous design [33][34], the TG
architecture is DSP-oriented to achieve low area cost, and uses a logarithmic numerical format
to reduce the complex arithmetic operations and the word length of the data. After conversion
to the logarithmic scale, the required operators are only a multiplier, a multiply-and-add unit,
a logarithm unit, an adder/subtractor, and a comparator. The TG includes two blocks, an inner
block and an outer block. The inner block performs the arithmetic computation and consists of
a logarithmic unit (LOG) [35], a multiplication-addition unit (MAC), and an arithmetic logic
unit (ALU). The outer block performs the control and the spreading function, and consists of
a controller and a ROM table.
Chapter 5
Platform-Based Design of Low
Complexity Psychoacoustic Model
The system-on-chip (SoC) design concept has become more and more practical with advances
in IC fabrication and electronic design automation (EDA) technologies. An SoC design can
realize a complex system on a single chip with low power, low cost, and high performance.
The existing platform-based methodology [36] is defined as an architectural framework with a
set of pre-qualified software and hardware IPs. The proposed design uses this methodology to
construct pre-qualified software and hardware IP which can be integrated into a
platform-based design. The designer only modifies the wrapper to match the bus specification
of a different processor core (DSP, ARM, PowerPC, or a user-defined processor).
The proposed design is a reusable IP for different applications (audio codec, DAB+ system).
The main feature of this IP is that it works flexibly at different clock rates for different
applications. It can run at the lowest clock rate that meets the real-time constraint, and reduce
the hardware resources in different applications.
Generally speaking, architecture designs for audio applications can be classified into three
approaches: DSP-based [13][37], pure-ASIC [38][39], and semi-ASIC [11][16] architectures.
A DSP-based architecture has higher cost and higher power consumption for specific audio
applications, and software-based programming is not as efficient as dedicated hardware.
As a result, the DSP-based approach always requires a higher operating frequency than the
dedicated hardware design to meet the real-time constraint. On the other hand, the pure-ASIC
design solves the cost, performance, and power consumption problems, but it loses flexibility.
The semi-ASIC design uses HW/SW co-design to complete the system. HW/SW co-design is
a trade-off between the flexibility of the DSP-based design and the advantages of the
pure-ASIC design. In other words, the semi-ASIC design is more cost-effective than the
DSP-based design and more flexible than the pure-ASIC design.
system. In fact, this work not only constructs this SoC platform to achieve HW/SW co-design
for the MPEG-2/4 AAC codec, but also provides the integrated design flow and verification
strategy. A detailed discussion is given in the next sections.
compatible code. It mainly consists of two modules, one is the control-intensive function and
the other is the interface between software/hardware parts.
Fig. 5.3 Design flow of constructed deliverable IP (blocks: well-designed hardware, wrapper design, AMBA protocol verification, co-simulation with software, AMBA slave compatible IP)
Stage 1. IP qualification [40] helps the designer achieve a better coding style. With this stage,
the Verilog code avoids potential problems at the register transfer level (RTL), including
simulation, synthesis, timing analysis, and design-for-test problems. It also improves
readability when another designer integrates this design.
Stage 2. The well-designed IP is integrated into an ARM-based platform and uses the AMBA
bus to communicate with the ARM processor, but the original I/O specification was designed
for a user-defined interface. The IP therefore needs its interface adapted to the AMBA
specification by a wrapper, which communicates with the bus. The proposed IP uses an AHB
slave wrapper to communicate with the processor and the bus. The data transfer mechanism is
separated into an initial phase and a data phase, which transfer the control signals and the
frame data. The initial phase sets up the control information in the IP core, and the data phase
transfers the input data (time-domain signal) and receives the output data (SMR values,
window shape). Moreover, the wrapper is designed as a memory map to assist software
development. The timing chart of the HW/SW co-design for the encode/decode flow is shown
in figure 5.4.
Fig. 5.4 The timing chart of HW/SW co-design (a) encoder flow (b) decoder flow
In the encoder flow, the time-domain data are transferred to the IP core via the wrapper
interface, and the IP calculates the SMR information. The processor then receives the SMR
data from the IP and runs the other essential coding tools. In the decoder flow, by contrast, the
processor first performs the pre-processing part. After that, the processor sends the
uncompressed frequency-domain data to the IP, which performs the IMDCT to obtain the
time-domain data (PCM out). In this co-design scheme, the processor can be used by other
functions at the same time. In other words, the software and hardware modules work in
parallel to improve the system performance.
The wrapper has another task, which is to transfer data between different clock domains.
Because of the real-time constraint or the IP performance constraint, the clock rate of the IP
may differ from the clock rate of the AMBA bus and the ARM processor. For the IP and the
processor to communicate, the wrapper requires a FIFO mechanism to transfer data. The FIFO
mechanism is implemented with a dual-port SRAM and control logic. Figure 5.5 illustrates
the IP design with the AMBA slave wrapper.
Stage 3. After the wrapper is designed in the hardware part, it has to be verified with the
processor instruction set. This design uses the Synopsys DesignWare AMBA Verification IP
[41] to verify whether the wrapper complies with the AMBA protocol. This stage generates
random commands and patterns to test the wrapper protocol, and observes the responses to
check whether the wrapper functions correctly.
After the hardware and software are developed individually, the design has to be shown to be
practicable for physical implementation. However, verification may not be achievable after
physical implementation. To show that this design is practicable at the system level, our
design uses a system-level CAD tool for co-simulation and co-verification. Using the
system-level CAD tool CoWare Platform Architect [42], we can model the system, including
the processor, bus, and hardware accelerator. This approach provides the required models for
hardware-software co-simulation and co-verification. Based on the co-simulation and
co-verification results, the design can be debugged easily and mismatches in performance
requirements or functional specifications can be avoided.
Chapter 6
Implementations and Results
Design flow: system specification; system-level design (C simulation); module design (Verilog coding); comparison (Verilog vs. C); synthesis (Synopsys Design Compiler); RTL-level and gate-level simulation; comparison (Verilog vs. C).
Frequency =
Based on the evaluated result, the required frequency is 1.4 MHz for the audio decoder,
3.1 MHz for the audio encoder, and 33 MHz for the DAB+ system.
Power consumption is an important issue in portable applications such as mobile phones,
MP3/AAC players, DAB+ receivers, and PDAs. The power consumption of each application
is evaluated to show that our design meets the low-power consideration. After that, we
profile the distribution of the power consumption in the different applications. The power
evaluation is based on the power analysis tool PrimePower. The power consumption of the
different applications is shown in table 6.1.
Table 6.1 The power consumption of different applications
                      MDCT @ 5 MHz          PAM @ 10 MHz        IMDCT @ 5 MHz         FFT @ 40 MHz
Power consumption     1.279 mW              6.13 mW             1.293 mW              11.89 mW
Dynamic/Leakage       1.186 mW / 93 uW      6 mW / 0.13 mW      1.2 mW / 93 uW        11.79 mW / 93 uW
Logic/Memory          0.88 mW / 0.391 mW    5.23 mW / 0.9 mW    0.88 mW / 0.411 mW    8.12 mW / 3.77 mW
After optimization    0.8994 mW (70%)       3.694 mW (60%)      0.912 mW (70%)        8.7 mW (73%)
Dynamic/Leakage       0.849 mW / 50 uW      3.64 mW / 53 uW     0.862 mW / 50 uW      8.653 mW / 50 uW
Logic/Memory          0.517 mW / 0.382 mW   2.88 mW / 0.8 mW    0.52 mW / 0.4 mW      4.98 mW / 3.71 mW
Power consumption can be classified into dynamic power and leakage power. These two kinds
of power consumption can be derived from the following equations.
In dynamic power
power = ( I 0 * e
nVT
* (1 e
V DS
nVT
V DS
nVT
)e
) * voltage
After evaluating the power consumption and deriving the equations, various gate-level and
circuit-level techniques can be applied to reduce the power consumption and meet the
low-power constraint. Logic gate transitions are the major part of dynamic power, so clock
gating and operand isolation have been proposed to save power. In general, the dynamic
power can be reduced by 30%-50% by reducing the transition count. Moreover, the supply
voltage has a large effect on dynamic power. As the formula shows, the power consumption is
proportional to the square of the voltage, so power gating and voltage scaling techniques have
been proposed. Power gating turns off the power supply in the idle state, and voltage scaling is
a trade-off between circuit performance and power consumption at the physical level.
Together, these two methods can save more than 50% of the dynamic power. According to our
profiling, the power consumption of the memory is also significant. Reducing the memory
access count is an efficient way to save memory power, as discussed for the proposed
architecture.
Leakage power is a difficult problem in digital design. It has to be solved at the physical level
or the circuit level. At the circuit level, multi-threshold-voltage logic cells and the body-bias
method have been proposed to reduce leakage power.
To increase the testability of the hardware design, test circuits such as scan chains and
memory BIST are inserted into the hardware design. The logic part employs scan chains as the
DFT circuit to detect stuck-at faults, and the fault coverage of our design reaches 90.59%. The
memory part employs a BIST circuit to verify the correctness of the memory.
6.4 Comparison
In this section, we compare our design with previous work, including MDCT [18][34],
IMDCT [43], and PAM [18][34] for audio applications and FFT [44][45] for the DAB/DAB+
system. Figure 6.2 compares features such as cycle count, ROM table size, and memory
access count. The proposed hardware-shared design provides high performance with a lower
cycle count, low cost with a smaller ROM table, and low power with fewer memory accesses.
In the previous design [45], the processing element uses four multipliers to complete a
complex multiplication in each cycle. That design has to be normalized to the one-multiplier
case for comparison with our design, so the cycle counts in figure 6.2 (a) are already
normalized to the one-multiplier approach. We also compare with the previous PAM designs
[18][34]; tables 6.2-6.5 compare the memory utilization, ROM table utilization, cycle count,
and logic gate count. In our design, the PAM requires 16432 cycles for the long window shape
and 16063 cycles for the eight-short window shape. The ROM size and memory requirement
are 5916 bytes and 6144 bytes, which are about 38% and 47% smaller than in the previous
design [18].
Fig 6.2 (a) The cycle counts compared with previous designs for each application (MDCT, IMDCT, FFT)
Fig 6.2 (b) The ROM table size compared with previous designs for each application
Fig 6.2 (c) The memory access counts compared with previous designs for each application
Table 6.2 Memory utilization
          NCU [18]       Proposed
MDCT      6144 Byte      6144 Byte (shared by MDCT and TG)
TG        5376 Byte
Total     11520 Byte     6144 Byte

Table 6.3 ROM table utilization
          NCU [18]       Proposed
MDCT      5632 Byte      2048 Byte
TG        3868 Byte      3868 Byte
Total     9500 Byte      5916 Byte

Table 6.4 Cycle count
               NCU [18]    Proposed
Long           22480       16432
Eight Short    23572       16063

Table 6.5 Logic gate count
             NCU [18]    Proposed
Total PAM    72K         43K
Chapter 7
Conclusions
Moreover, this approach also improves performance and flexibility. The proposed design
can encode stereo channel data in real time at a sample rate of 48000 Hz with a clock rate
below 10 MHz. Based on these results, the proposed architecture has the advantages of high
efficiency, low power, and low complexity.
In the future, the AAC kernel can be extended with more advanced techniques such as SBR
and PS, as in the MPEG AAC family specifications. Moreover, the proposed design can be
extended to a pure-ASIC design for the AAC encoder, or to a platform architecture for
multi-function applications such as an MPEG-4 system including video and audio. Besides,
future work can focus on the quantization loop, which is another key component of the AAC
encoder, to improve performance and quality. We are convinced that many applications of
these audio coding techniques will appear around us in the near future.
Reference
[1]. MPEG. Coding of moving pictures and associated audio for digital storage media at up to 1.5
Mbit/s, part 3: Audio, International Standard IS 11172-3, ISO/IEC JTC1/SC29 WG11, 1992.
[2]. MPEG. Information Technology generic coding of moving pictures and associated audio, part
3: Audio, International Standard IS 13818-3, ISO/IEC JTC1/SC29 WG11, 1994.
[3]. MPEG. MPEG-2 Advanced Audio Coding, AAC, International Standard IS 13818-7, ISO/IEC
JTC1/SC29 WG11, 1997.
[4]. MPEG. Information technology Coding of audio-visual objects Part 3: Audio, International
Standard IS 14496-3, ISO/IEC JTC1/SC29 WG11, 1999.
[5]. MPEG. Information technology Coding of audio-visual objects Part 3: Audio, Amendment
1: Bandwidth extension. ISO/IEC 14496-3:2001/Amd. 1:2003, Nov. 2003.
[6]. MPEG Information technology Coding of audio-visual objects part 3: Audio, Amendment 2:
Parametric coding for high-quality audio, ISO/IEC 14496-3/Amd. 2: 2004.
[7]. MPEG Information technology Coding of audio-visual objects part 3: Audio, Amendment 2:
Audio Lossless Coding, ISO/IEC 14496-3/Amd. 2: 2005.
[8]. MPEG Information technology Coding of audio-visual objects part 3: Audio, Amendment 3:
Scalable Lossless Coding, ISO/IEC 14496-3/Amd. 3: 2005.
[9]. R. Geiger, T. Sporer, J. Koller, and K. Brandenburg, Audio Coding based on Integer
Transform, in AES 111th Convention, New York, NY, USA Preprint 5471 Sept 2001.
[10]. P. Coussy, A. Baganne, and E. Martin. Virtual component IP re-use in telecommunication
systems design: a case study of MPEG-2/JPEG2000 encoder, IEEE Proc .ICECS2002. vol. 2,
pp.733-736, Sept. 2002
[11]. C.N. Liu, and T.H. Tsai, SoC platform based design of MPEG-2/4 AAC audio decoder, IEEE
Proc .ISCAS2005. vol. 3, pp.2581-2584, May. 2005
[12]. Domazet, D.; Kovac, M.; Advanced software implementation of MPEG-AAC audio encoder,
4th EURASIP Conference focused on Video/Image Processing and Multimedia
[26]. P.S. Wu and Y.T. Hwang, Efficient IMDCT core designs for audio signal processing, IEEE
Workshop on Signal Processing Systems (SIPS 2003), 27-29 Aug. 2003, Page(s): 275-280.
[27]. P. Duhamel, Y. Mahieux, and J.P. Petit, A fast algorithm for the implementation of filter banks
based on time domain aliasing cancellation, International Conference on Acoustics, Speech,
and Signal Processing, Vol. 3, Page(s): 2209-2212, Apr. 1991.
[28]. Britanak, V.; Rao, K.R.; An efficient implementation of the forward and inverse MDCT in
MPEG audio coding, Signal Processing Letters, IEEE Volume 8, Issue 2, Feb. 2001
Page(s):48 - 51 Digital Object Identifier 10.1109/97.895372.
[29]. Y.H. Fan, Madisetti, V.K. and Mersereau, R.M.; On fast algorithms for computing the inverse
modified discrete cosine transform, IEEE Signal Processing Letters, Volume 6, Issue 3, March
1999 Page(s):61 - 64 Digital Object Identifier 10.1109/97.744625
[30]. M.-H. Cheng and Y.-H. Hsu, Fast IMDCT/MDCT algorithms - a matrix approach, IEEE Trans. on
Signal Processing, Jan. 2003, pp. 221-229.
[31]. A. Wenzler and E. Luder, New structures for complex multipliers and their noise analysis, in
Proc. IEEE Int. Symp. Circuits Syst., May 1995, vol. 2, pp. 1432-1435.
[32]. Lau, W. and Chwu, A.; A common transform engine for MPEG and AC3 audio decoder,
IEEE Transactions on Consumer Electronics, Volume 43, Issue 3, Aug. 1997, Page(s): 559-566.
[33]. T.H. Tsai, S.W. Huang, J.H. Luo, Architecture Design of Psychoacoustic Model for
MPEG-2/4 AAC Audio Encoder, The 16th VLSI Design/CAD Symposium (VLSI), 2005.
[34]. T.H. Tsai, J.H. Luo, S.W. Huang, Low Complexity Architecture Design of MDCT-Based
Psychoacoustic Model for MPEG 2/4 AAC Encoder, IEEE Proc. ISCAS 2006, May 2006.
[35]. Abed, K.H.; Siferd, R.E.; CMOS VLSI implementation of a low-power logarithmic converter,
IEEE Transactions on Computers, Volume 52, Issue 11, Nov. 2003, Page(s): 1421-1433.
[36]. H. Chang et al., Surviving the SoC Revolution: A Guide to Platform-based Designs, Kluwer
Academic, Norwell, Mass., 1999
[37]. M. A. Watson and P. Buettner, Design and implementation of AAC decoders, IEEE Trans.
Consumer Electronics, vol. 46, issue 3, pp.819-824, Aug. 2000.
[38]. T. H. Tsai, C. N. Liu and Y. W. Wang, A pure-ASIC design approach for MPEG-2 AAC audio
decoder, in Proc. 4th IEEE Int. Conf. Information, Communications & Signal Processing and
4th Pacific-Rim Conf. Multimedia (ICICS-PCM), vol. 3, pp.1633-1636, Dec. 2003.
[39]. P. Liu, L. Liu, N. Deng, X. Fu, J. Liu, Q. Liu, G. Zhang, and B. He, VLSI Implementation for
Portable Application Oriented MPEG-4 Audio Codec, IEEE International Symposium on
Circuits and Systems (ISCAS 2007), pp. 777-780, May 2007.
[40]. IP Qualification Alliance, IP Qualification Guidelines, Industrial Technology Research
Institute, 2003
[41]. Synopsys Inc., DesignWare AHB Verification IP Databook, Synopsys, May 2006.
[42]. CoWare Inc. http://www.corware.com/
[43]. T.H. Tsai and C.N. Liu, A Configurable Common Filterbank Processor for Multi-Standard
Audio Decoder, IEICE Transactions on Fundamentals of Electronics, Communications and
Computer Sciences, Vol. E90-A, No.9, pp.1913-1923, Sep. 2007.
[44]. C.C. Wang and Y.C. Lin, An Efficient FFT Processor for DAB Receiver Using
Circuit-Sharing Pipeline Design, IEEE Transactions on Broadcasting, Vol. 53, Issue 3,
pp. 670-677, Sep. 2007.
[45]. S.C. Tai, C.C. Wang, and C.Y. Lin, FFT and IMDCT circuit sharing in DAB receiver, IEEE
Transactions on Broadcasting, Vol. 49, Issue 2, pp. 124-131, June 2003.