
Audio Coding
Theory and Applications

Yuli You

Yuli You, Ph.D.
University of Minnesota, Twin Cities

ISBN 978-1-4419-1753-9
e-ISBN 978-1-4419-1754-6
DOI 10.1007/978-1-4419-1754-6
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010931862
© Springer Science+Business Media, LLC 2010

All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

To my parents and Wenjie, Amy and Alan

Preface

Since branching out of speech coding in the early 1970s, audio coding has now
slipped into our daily lives in a variety of applications, such as mobile music/video
players, digital television/audio broadcasting, optical discs, online media streaming,
and electronic games. It has become one of the essential technologies in today's
consumer electronics and broadcasting equipment.
In its more than 30 years of evolution, many audio coding technologies came
into the spotlight and then became obsolete, but only a minority have survived and
are deployed in major modern audio coding algorithms. While covering all the major
turns and branches of this evolution is valuable for technology historians or for
people with an intense interest, it is distracting and even overwhelming for most readers.
Therefore, those historic events will be omitted and this book will, instead, focus on
the current state of this evolution. Such a focus also helps to provide full coverage
of the selected topics in this book.
This state of the art is presented from the perspective of a practicing engineer and
adjunct associate professor, who single-handedly developed the whole DRA audio
coding standard, from algorithm architecture to assembly-code implementation and
to subjective listening tests. This perspective has a clear focus on "why" and "how
to." In particular, many purely theoretical details, such as proofs of the perfect reconstruction property of various filter banks, are omitted. Instead, the emphasis is on
the motivation for a particular technology, why it is useful, what it is, and how it is
integrated into a complete algorithm and implemented in practical products. Consequently, many practical aspects of audio coding technologies normally excluded from
audio coding books, such as transient detection and implementation of decoders on
low-cost microprocessors, are covered in this book.
This book should help readers grasp state-of-the-art audio coding technologies and build a solid foundation for them to either understand and implement
various audio coding standards or develop their own should the need arise. It is,
therefore, a valuable reference for engineers in the consumer electronics and broadcasting industry and for graduate students of electrical engineering.
Audio coding seeks to achieve data compression by removing perceptual irrelevance and statistical redundancy from a source audio signal, and the removal efficiency
is powerfully augmented by data modeling which compacts and/or decorrelates the


source signal. Therefore, the presentation of this book is centered around these three
basic elements and organized into the following five parts.
Part I gives an overview of audio coding, describing the basic ideas, the key challenges, important issues, fundamental approaches, and the basic codec architecture.
Part II is devoted to quantization, the tool for removing perceptual irrelevancy.
Chapter 2 delineates scalar quantization which quantizes a source signal one sample
at a time. Both uniform and nonuniform quantization, including the Lloyd–Max
algorithm, are discussed. Companding is posed as a structured and simple method
to implement nonuniform quantization.
Chapter 3 describes vector quantization which quantizes two or more samples
of a source signal as one block each time. Also included is the Linde–Buzo–Gray
(LBG) or k-means algorithm which builds an optimal VQ codebook from a set of
training data.
Part III is devoted to data modeling which transforms a source signal into a representation that is energy-compact and/or decorrelated. Chapter 4 describes linear
prediction which uses a linear combination of the historic samples of the source
signal as a prediction for the current sample so as to arrive at a prediction error
signal that has lower energy and is decorrelated. It first explains why quantizing
the prediction error signal, instead of the source signal, can dramatically improve
coding efficiency. It then presents open-loop DPCM and DPCM, the two most common forms of linear prediction, derives the normal equation for optimal prediction,
presents the Levinson–Durbin algorithm that iteratively solves the normal equation,
shows that the prediction error signal has a white spectrum and is thus decorrelated, and illustrates that the prediction decoder filter provides an estimate of the
spectrum of the source signal. Finally, a general framework for linear prediction
that can shape the spectrum of quantization noise to desirable shapes, such as that
of the absolute threshold of hearing, is presented.
Chapter 5 deals with transforms which linearly transform a block of source signal
samples into another block of coefficients whose energy is compacted to a minority. It first explains why this compaction of energy leads to dramatically improved
coding efficiency through the AM–GM inequality and the associated optimal bit allocation strategy. It then derives the Karhunen–Loève transform from the search for
the optimal transform. Finally, it presents suboptimal and practical transforms, such
as discrete Fourier transform (DFT) and discrete cosine transform (DCT).
Chapter 6 presents subband filter banks as extended transforms in which historic
blocks of source samples overlap with the current block. It describes various aspects
of subband coding, including reconstruction error and polyphase representation and
illustrates that the dramatically improved coding efficiency is also achieved through
energy compaction.
Chapter 7 is devoted to cosine modulated filter banks (CMFB), whose structure
is amenable to fast implementation. It first builds this filter bank from the DFT and
explains that it has a structure of a prototype filter plus cosine modulation. It then
presents nonperfect reconstruction and perfect reconstruction CMFB and their efficient implementation structures. Finally, it illustrates that modified discrete cosine
transform (MDCT), the most widely used filter bank in audio coding, is a special
and simple case of CMFB.


Part IV is devoted to entropy coding, the tool for removing statistical redundancy.
Chapter 8 establishes that entropy is determined by the probability distribution of the
source signal and is the fundamental lower limit of bit rate reduction. It then shows
that any meaningful entropy codes have to be uniquely decodable and, to be practically implementable, should be instantaneously decodable. Finally, it illustrates that
prefix-free codes are just such codes and further proves Shannon's noiseless coding
theorem, which essentially states that the entropy can be asymptotically approached
by a prefix-free code if source symbols are coded as blocks and the block size goes
to infinity.
Chapter 9 presents Huffman code, an optimal prefix-free code widely used in
audio coding. It first presents Huffman's algorithm, which is an iterative procedure
to build a prefix-free code from the probability distribution of the source signal,
and then proves its optimality. It also addresses some practical issues related to
the application of Huffman coding, emphasizing the importance of coding source
symbols as longer blocks.
While the previous parts can be applied to signal coding in general, Part V is devoted to audio. Chapter 10 covers perceptual models, which determine which part
of the source signal is inaudible (perceptually irrelevant) and thus can be removed. It
starts with the absolute threshold of hearing, which is the absolute sensitivity level
of the human ear. It then illustrates that the human ear processes audio signals in
the frequency domain using nonlinear and analog subband filters and presents Bark
scale and critical bands as tools to describe the nonuniform bandwidth of these subband filters. Next, it covers masking effects which describe the phenomenon that
a weak sound becomes less audible due to the presence of a strong sound nearby.
Both simultaneous and temporal masking are covered, but emphasis is given to the
former because it is more thoroughly studied and extensively used in audio coding.
The rest of the chapter addresses a few practical issues, such as perceptual bit allocation, converting masked threshold to the subband domain, perceptual entropy, and
an example perceptual model.
Chapter 11 addresses the resolution challenge posed by transients. It first illustrates that audio signals are mostly quasistationary, hence need fine frequency resolution to maximize energy compaction but are frequently interrupted by transients,
which require fine time resolution to avoid pre-echo artifacts. The challenge,
therefore, arises: a filter bank cannot have fine frequency and time resolution simultaneously according to the Fourier uncertainty principle. It then states that one
approach to address this challenge is to adapt frequency resolution in time to the
presence and absence of transients and further presents switched-window MDCT as
an embodiment: switching the window length of MDCT in such a way that short
windows are applied to transients and long ones to quasistationary episodes. Two
such examples are given, which can switch between two and three window lengths,
respectively. For the double window length example, two more techniques, temporal noise shaping and transient localization, are given, which can further improve the
temporal resolution of the short windows. Practical methods for transient detection
are finally presented.


Chapter 12 deals with joint channel coding. Only two widely used methods are
covered: joint intensity coding and sum/difference (M/S stereo) coding.
Methods to deal with low-frequency effect (LFE) channels are also included.
Chapter 13 covers a few practical issues frequently encountered in the development of audio coding algorithms, such as how to organize various data, how to
assign entropy codebooks, how to optimally allocate bit resources, how to organize
bits representing various compressed data and control commands into a bit stream
suitable for transmission over various channels, and how to make the algorithm
amenable for implementation on low-cost microprocessors.
Chapter 14 is devoted to performance assessment, which, for a given bit rate,
becomes an issue of how to evaluate coding impairments. It first points out that
objective methods are highly desired, but are generally inadequate, so subjective
listening tests are necessary. The double-blind principle of subjective listening test
is then presented, along with the two methods, namely the ABX test and ITU-R
BS.1116, that implement it.
Finally, Chap. 15 presents the Dynamic Resolution Adaptation (DRA) audio coding standard as an example to illustrate how to integrate the technologies described
in this book to create a practical audio coding algorithm. The DRA algorithm has been
approved by the Blu-ray Disc Association as part of its BD-ROM 2.3 specification
and by the Chinese government as a national standard.
Yuli You
Adjunct Associate Professor
Department of Electrical and Computer Engineering
yuliyou@hotmail.com

Contents

Part I  Prelude

1  Introduction
   1.1  Audio Coding
   1.2  Basic Idea
   1.3  Perceptual Irrelevance
   1.4  Statistical Redundancy
   1.5  Data Modeling
   1.6  Resolution Challenge
   1.7  Perceptual Models
   1.8  Global Bit Allocation
   1.9  Joint Channel Coding
   1.10 Basic Architecture
   1.11 Performance Assessment

Part II  Quantization

2  Scalar Quantization
   2.1  Scalar Quantization
   2.2  Re-Quantization
   2.3  Uniform Quantization
        2.3.1  Formulation
        2.3.2  Midtread and Midrise Quantizers
        2.3.3  Uniformly Distributed Signals
        2.3.4  Nonuniformly Distributed Signals
   2.4  Nonuniform Quantization
        2.4.1  Optimal Quantization and Lloyd–Max Algorithm
        2.4.2  Companding

3  Vector Quantization
   3.1  The VQ Advantage
   3.2  Formulation
   3.3  Optimality Conditions
   3.4  LBG Algorithm
   3.5  Implementation

Part III  Data Model

4  Linear Prediction
   4.1  Linear Prediction Coding
   4.2  Open-Loop DPCM
        4.2.1  Encoder and Decoder
        4.2.2  Quantization Noise Accumulation
   4.3  DPCM
        4.3.1  Quantization Error
        4.3.2  Coding Gain
   4.4  Optimal Prediction
        4.4.1  Optimal Predictor
        4.4.2  Levinson–Durbin Algorithm
        4.4.3  Whitening Filter
        4.4.4  Spectrum Estimator
   4.5  Noise Shaping
        4.5.1  DPCM
        4.5.2  Open-Loop DPCM
        4.5.3  Noise-Feedback Coding

5  Transform Coding
   5.1  Transform Coder
   5.2  Optimal Bit Allocation and Coding Gain
        5.2.1  Quantization Noise
        5.2.2  AM–GM Inequality
        5.2.3  Optimal Conditions
        5.2.4  Coding Gain
        5.2.5  Optimal Bit Allocation
        5.2.6  Practical Bit Allocation
        5.2.7  Energy Compaction
   5.3  Optimal Transform
        5.3.1  Karhunen–Loève Transform
        5.3.2  Maximal Coding Gain
        5.3.3  Spectrum Flatness
   5.4  Suboptimal Transforms
        5.4.1  Discrete Fourier Transform
        5.4.2  DCT

6  Subband Coding
   6.1  Subband Filtering
        6.1.1  Transform Viewed as Filter Bank
        6.1.2  DFT Filter Bank
        6.1.3  General Filter Banks
   6.2  Subband Coder
   6.3  Reconstruction Error
        6.3.1  Decimation Effects
        6.3.2  Expansion Effects
        6.3.3  Reconstruction Error
   6.4  Polyphase Implementation
        6.4.1  Polyphase Representation
        6.4.2  Noble Identities
        6.4.3  Efficient Subband Coder
        6.4.4  Transform Coder
   6.5  Optimal Bit Allocation and Coding Gain
        6.5.1  Ideal Subband Coder
        6.5.2  Optimal Bit Allocation and Coding Gain
        6.5.3  Asymptotic Coding Gain

7  Cosine-Modulated Filter Banks
   7.1  Cosine Modulation
        7.1.1  Extended DFT Bank
        7.1.2  2M-DFT Bank
        7.1.3  Frequency-Shifted DFT Bank
        7.1.4  CMFB
   7.2  Design of NPR Filter Banks
   7.3  Perfect Reconstruction
   7.4  Design of PR Filter Banks
        7.4.1  Lattice Structure
        7.4.2  Linear Phase
        7.4.3  Free Optimization Parameters
   7.5  Efficient Implementation
        7.5.1  Even m
        7.5.2  Odd m
   7.6  Modified Discrete Cosine Transform
        7.6.1  Window Function
        7.6.2  MDCT
        7.6.3  Efficient Implementation

Part IV  Entropy Coding

8  Entropy and Coding
   8.1  Entropy Coding
   8.2  Entropy
        8.2.1  Entropy
        8.2.2  Model Dependency
   8.3  Uniquely and Instantaneously Decodable Codes
        8.3.1  Uniquely Decodable Code
        8.3.2  Instantaneous and Prefix-Free Code
        8.3.3  Prefix-Free Code and Binary Tree
        8.3.4  Optimal Prefix-Free Code
   8.4  Shannon's Noiseless Coding Theorem
        8.4.1  Entropy as the Lower Bound
        8.4.2  Upper Bound
        8.4.3  Shannon's Noiseless Coding Theorem

9  Huffman Coding
   9.1  Huffman's Algorithm
   9.2  Optimality
        9.2.1  Codeword Siblings
        9.2.2  Proof of Optimality
   9.3  Block Huffman Code
        9.3.1  Efficiency Improvement
        9.3.2  Block Encoding and Decoding
   9.4  Recursive Coding
   9.5  A Fast Decoding Algorithm

Part V  Audio Coding

10  Perceptual Model
    10.1  Sound Pressure Level
    10.2  Absolute Threshold of Hearing
    10.3  Auditory Subband Filtering
          10.3.1  Subband Filtering
          10.3.2  Auditory Filters
          10.3.3  Bark Scale
          10.3.4  Critical Bands
          10.3.5  Critical Band Level
          10.3.6  Equivalent Rectangular Bandwidth
    10.4  Simultaneous Masking
          10.4.1  Types of Masking
          10.4.2  Spread of Masking
          10.4.3  Global Masking Threshold
    10.5  Temporal Masking
    10.6  Perceptual Bit Allocation
    10.7  Masked Threshold in Subband Domain
    10.8  Perceptual Entropy
    10.9  A Simple Perceptual Model

11  Transients
    11.1  Resolution Challenge
          11.1.1  Pre-Echo Artifacts
          11.1.2  Fourier Uncertainty Principle
          11.1.3  Adaptation of Resolution with Time
    11.2  Switched-Window MDCT
          11.2.1  Relaxed PR Conditions and Window Switching
          11.2.2  Window Sequencing
    11.3  Double-Resolution Switched MDCT
          11.3.1  Primary and Transitional Windows
          11.3.2  Look-Ahead and Window Sequencing
          11.3.3  Implementation
          11.3.4  Window Size Compromise
    11.4  Temporal Noise Shaping
    11.5  Transient-Localized MDCT
          11.5.1  Brief Window and Pre-Echo Artifacts
          11.5.2  Window Sequencing
          11.5.3  Indication of Window Sequence to Decoder
          11.5.4  Inverse TLM Implementation
    11.6  Triple-Resolution Switched MDCT
    11.7  Transient Detection
          11.7.1  General Procedure
          11.7.2  A Practical Example

12  Joint Channel Coding
    12.1  M/S Stereo Coding
    12.2  Joint Intensity Coding
    12.3  Low-Frequency Effect Channel

13  Implementation Issues
    13.1  Data Structure
          13.1.1  Frame-Based Processing
          13.1.2  Time–Frequency Tiling
    13.2  Entropy Codebook Assignment
          13.2.1  Fixed Assignment
          13.2.2  Statistics-Adaptive Assignment
    13.3  Bit Allocation
          13.3.1  Inter-Frame Allocation
          13.3.2  Intra-Frame Allocation
    13.4  Bit Stream Format
          13.4.1  Frame Header
          13.4.2  Audio Channels
          13.4.3  Error Protection Codes
          13.4.4  Auxiliary Data
    13.5  Implementation on Microprocessors
          13.5.1  Fitting to Low-Cost Microprocessors
          13.5.2  Fixed-Point Arithmetic

14  Quality Evaluation
    14.1  Objective Metrics
    14.2  Subjective Tests
          14.2.1  Double-Blind Principle
          14.2.2  ABX Test
          14.2.3  ITU-R BS.1116

15  DRA Audio Coding Standard
    15.1  Design Considerations
    15.2  Architecture
    15.3  Bit Stream Format
          15.3.1  Frame Synchronization
          15.3.2  Frame Header
          15.3.3  Audio Channels
          15.3.4  Window Sequencing for LFE Channels
          15.3.5  End of Frame Signature
          15.3.6  Auxiliary Data
          15.3.7  Unpacking the Whole Frame
    15.4  Decoding
          15.4.1  Inverse Quantization
          15.4.2  Joint Intensity Decoding
          15.4.3  Sum/Difference Decoding
          15.4.4  De-Interleaving
          15.4.5  Window Sequencing
          15.4.6  Inverse TLM
          15.4.7  Decoding the Whole Frame
    15.5  Formal Listening Tests

Large Tables
    A.1  Quantization Step Size
    A.2  Critical Bands for Short and Long MDCT
    A.3  Huffman Codebooks for Codebook Assignment
    A.4  Huffman Codebooks for Quotient Width of Quantization Indexes
    A.5  Huffman Codebooks for Quantization Indexes in Quasi-Stationary Frames
    A.6  Huffman Codebooks for Quantization Indexes in Frames with Transients
    A.7  Huffman Codebooks for Indexes of Quantization Step Sizes

References

Index

Part I

Prelude

Chapter 1

Introduction

Sounds are physical waves that propagate in the air or other media. Such waves,
which may be expressed as changes in air pressure, may be transformed by an analog
audio system using a transducer, such as a microphone, into continuous electrical
waves in the form of current and/or voltage changes. This transformation of sounds
into an electrical representation, which we call an audio signal, facilitates the storage, transmission, duplication, amplification, and other processing of sounds. To
reproduce the sounds, the electrical representation, or audio signal, is converted
back into physical waves via loudspeakers.
Since electronic circuits and storage/transmission media are inherently noisy and
nonlinear, audio signals are susceptible to noise and distortion, resulting in loss of
sound quality. Consequently, modern audio systems are mostly digital in that the
audio signals obtained above are sampled into discrete-time signals and then digitized into numerical representations, which we call digital audio signals. Once in the
digital domain, many technologies can be deployed to ensure that no inadvertent
loss of audio quality occurs.
Pulse-code modulation (PCM) is usually the standard representation format for
digital audio signals. To obtain a PCM representation, the waveform of an analog
audio signal is sampled regularly at uniform intervals (sampling period) to generate
a sequence of samples (a discrete-time signal), which are then quantized to generate
a sequence of symbols, each represented as a numerical (usually binary) code.
The Nyquist–Shannon sampling theorem states that a sampled analog signal can
be perfectly reconstructed from its samples if the sample rate exceeds
twice the highest frequency in the original analog signal [68]. To ensure this condition is satisfied, the input analog signal is usually filtered with a low-pass filter
whose stopband corner frequency is less than half of the sample rate. Since the
human ear's perceptual range for pure tones is widely believed to be between 20 Hz
and 20 kHz (see Sect. 10.2) [102], such low-pass filters may be designed in such a
way that the cutoff frequency starts at 20 kHz and a few kilohertz are allowed as
the transition band before the stopband. For example, the sample rate is 44.1 kHz
for compact discs (CD) and 48 kHz for sound tracks in DVD-Video. Some people, however, believe that the human ear can perceive frequencies much higher than
20 kHz, especially when transients are present, so sampling rates as high as 192 kHz


are used in some audio systems, such as DVD-Audio. Note that there is still power
in the stopband of any practical low-pass filter, so perfect reconstruction is only
approximately satisfied.
The subsequent quantization process also introduces noise. The more bits are used to represent each audio sample, the smaller the quantization noise
becomes (see Sect. 2.3). The compact disc (CD), for example, uses 16 bits to represent each sample. Due to the limited resolution and dynamic range of the human ear,
16 bits per sample are argued by some to be sufficient to deliver the full dynamics of
almost all music, but higher resolution audio formats are called for in applications
such as soundtracks in feature films, where there is often a very wide dynamic range
between whispered conversations and explosions. The higher resolution format also
enables more headroom to be left for audio processing which may inadvertently or
intentionally introduce noise. Twenty-four bits per sample, used by DVD, are widely
believed to be sufficient for most applications, but 32 bits are not uncommon.
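As a rough rule of thumb (standard PCM analysis, not a figure stated in this book), each additional bit buys about 6 dB of dynamic range, which the following Python sketch tabulates:

import math

def pcm_dynamic_range_db(bits):
    # Approximate dynamic range of b-bit uniform PCM: 20*log10(2^b) dB
    return 20.0 * math.log10(2 ** bits)

for bits in (16, 24, 32):
    print(bits, round(pcm_dynamic_range_db(bits)), "dB")
# 16 -> 96 dB, 24 -> 144 dB, 32 -> 193 dB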
Digital audio signals rarely come as a single channel or monaural sound. The
CD delivers two channels (stereo) and the DVD up to 7.1 channels (surround
sound), which consist of seven normal channels (front left, front right, center,
surround left, surround right, back left, and back right, for example) and one low-frequency effect (LFE) channel. The terminology of 0.1 is used to indicate that
an LFE channel has a very low bandwidth, usually no more than 120 Hz. Apparently, the number of channels is on an increasing trend. For example, Japan's NHK
demonstrated 22.2 channel surround sound in 2005 and China's DRA audio coding
standard (see Chap. 15) allows for 64.3 channel surround sound.

1.1 Audio Coding


Higher audio quality demands a higher sample rate, more bits per sample,
and more channels. But all of these come with a significant cost: a large number of
bits to represent and transfer the digital audio signals.
Let b denote the number of bits used to represent each PCM sample and Fs the sample
rate in samples per second. The bit rate needed to represent and transfer an Nch-channel
audio signal is

B0 = b × Fs × Nch    (1.1)

in bits per second. As an example, let us consider a moderate case which is typically
deployed by DVD-Video: 48 kHz sample rate and 24 bits per sample. This amounts
to a bit rate of 48 × 24 = 1,152 kbps (kilobits per second) for each channel. The
total bit rate becomes 2,304 kbps for stereo, 6,912 kbps for 5.1, 9,216 kbps for 7.1,
and 27,648 kbps for 22.2 surround sound, respectively. And this is not the end of
the story. If the 192 kHz sample rate is used, for example, these bit rates increase by
four times.
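As a quick sanity check of these numbers, the following Python sketch (mine, not the book's; the function name is illustrative) evaluates (1.1) for the channel configurations above:

def pcm_bit_rate_kbps(bits_per_sample, sample_rate_hz, num_channels):
    # Raw PCM bit rate B0 = b * Fs * Nch, reported in kilobits per second
    return bits_per_sample * sample_rate_hz * num_channels / 1000.0

for name, nch in [("stereo", 2), ("5.1", 6), ("7.1", 8), ("22.2", 24)]:
    print(name, pcm_bit_rate_kbps(24, 48000, nch), "kbps")
# stereo 2304.0, 5.1 6912.0, 7.1 9216.0, 22.2 27648.0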
For mass consumption, audio signals need to be delivered to consumers through
some sort of communication/broadcasting or storage channel whose capacity is usually very limited.


Storage channels usually have the best channel capacity. DVD-Video, for
example, is designed to hold at least two hours of film with standard definition
video and 5.1 surround sound. It was given a capacity of 4.7 GB (gigabytes), which
was the state of the art when DVD was standardized. If two hours of 5.1 surround
sound is to be delivered in the standard PCM format (24 bits per sample and 48 kHz
sample rate), it needs about 6.22 GB of storage space. This is more than the
capacity of the whole DVD disc and there is no capacity left for the video, whose
demand for bit rate is usually more than ten times that of audio signals. Apparently,
there is a problem of insufficient channel capacity for the delivery of audio signals.
This problem is much more acute with wireless channels. For example, over-the-air audio and/or television broadcasting usually allocates no more than 64 kbps
channel capacity to deliver stereo audio. If delivered at 24 bits per sample and
48 kHz sample rate PCM, a stereo audio signal needs a bit rate of 2,304 kbps, which
is 36 times the allocated channel capacity.
This problem of insufficient channel capacity for delivering audio signals may
be addressed by either allocating more channel capacity or reducing the demand of
it. Allocating more channel capacity is usually very expensive and even physically
impossible in situations such as wireless communication or broadcasting. It is often
more plausible and effective to pursue the demand reduction route: reducing the
bit rate necessary for delivering audio signals. This is the task of digital audio
(compression) coding.
Audio coding achieves this goal of bit rate reduction through an encoder and
a decoder, as shown in Fig. 1.1. The encoder obtains a compact representation of
the input audio signal, often referred to as the source signal, that demands fewer
bits. The bits for this compact representation are delivered through a communication/broadcasting or storage channel to the decoder, which then reconstructs the
original audio signal from the received compact representation.
Note that the term channel used here is an abstraction or aggregation of channel
coder, modulator, physical channel, channel demodulator, and channel decoder in
communication literature. Since the channel is well known for introducing bit errors,
the compact representation received by the decoder may be different from that at the
output of the encoder. From the viewpoint of audio coding, however, the channel

Fig. 1.1 Audio coding involves an encoder to transform a source audio signal into a compact representation for transmission through a channel and a decoder to decode the compact representation
received from the channel to reconstruct the source audio signal


may be assumed to be error-free, but an audio coding system must be designed in
such a way that it can tolerate a certain degree of channel errors. At the very least, the
decoder must be able to detect and recover from most channel errors.
Let B be the bit rate needed to deliver the compact representation. The performance of an audio coding algorithm may be assessed by the compression ratio
defined below:

r = B0 / B.    (1.2)
For the previous example of over-the-air audio and/or television broadcasting, the
required compression ratio is 36:1.
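As a small illustration (mine, not the book's), the compression ratio of (1.2) for this broadcasting example can be computed directly:

def compression_ratio(b0_kbps, b_kbps):
    # r = B0 / B from (1.2)
    return b0_kbps / b_kbps

print(compression_ratio(2304.0, 64.0))  # 36.0, i.e., a 36:1 compression ratio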
The compact representation obtained by the encoder may allow the decoder to
perfectly reconstruct the original audio signal, i.e., the reconstructed audio signal
at the output of the decoder is an exact or identical copy of the source audio signal
inputted to the encoder, bit by bit. This audio coding process is called lossless.
Otherwise, it is called lossy, meaning that the reconstructed audio signal is just an
approximate copy of the source audio signal: some information is lost in the coding
process and the audio signal is irreversibly distorted (hopefully not perceived).

1.2 Basic Idea


According to information theory [85, 86], the minimal average bit rate that is necessary to transmit a source signal is its entropy, which is determined by the probability
distribution of the source signal (see Sect. 8.2). Let H denote the entropy of the
source signal. The following difference,

R = B0 − H,    (1.3)

is the component in the source signal that is statistically redundant for the purpose of transmitting the source signal to the decoder and is thus called statistical
redundancy.
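As an illustration on a per-sample basis (the distribution here is chosen for illustration and anticipates Table 1.1 in Sect. 1.4), the following Python sketch computes the entropy of a four-symbol source and the corresponding statistical redundancy of (1.3):

import math

def entropy_bits(probs):
    # H = -sum(p * log2(p)): the minimal average number of bits per sample
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [1/2, 1/4, 1/8, 1/8]  # four symbols, so plain PCM needs B0 = 2 bits/sample
H = entropy_bits(probs)       # 1.75 bits/sample
R = 2.0 - H                   # 0.25 bit/sample of statistical redundancy
print(H, R)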
The goal of lossless audio coding is to remove statistical redundancy from the
source signal as much as possible so that it is delivered to the decoder with a bit rate
B as close as possible to the entropy. This is illustrated in Fig. 1.2. Note that, while
entropy coding is a general terminology for coding techniques that remove statistical
redundancy, a lossless audio coding algorithm usually involves sophisticated data
modeling (to be discussed in Sect. 1.3), so the use of entropy coding in Fig. 1.2 is an
oversimplification if the context involves lossless coding algorithms and may imply
that data modeling is part of it.
The ratio of compression achievable by lossless audio coding is usually very limited; an overall compression ratio of 2:1 may be considered high. This level of compression cannot satisfy many practical needs. As stated before, the over-the-air
audio and/or television broadcasting, for example, may require a compression ratio
of 36:1. To achieve this level of compression, some information in the source signal
has to be irreversibly discarded by the encoder.


Fig. 1.2 A lossless audio coder removes, through entropy coding, statistical redundancy from the
source audio signal to arrive at a compact representation

Fig. 1.3 A lossy audio coder removes both perceptual irrelevancy and statistical redundancy from
the source audio signal to achieve a much higher compression ratio

This irreversible loss of information causes distortion in the reconstructed audio


signal at the decoder output. The distortion may be significant if assessed using
objective measures such as mean square error, but is perceived differently by the
human ear, which audio coding serves. Proper coder design can ensure that no distortion can be perceived by the human ear, even if the distortion is outstanding when
assessed by objective measures. Furthermore, even if some distortion can be perceived, it may still be tolerated if it is not annoying. The portion of information
in the source signal that leads to either unperceivable or unannoying distortion is,
therefore, perceptually irrelevant and thus may be removed from the source signal
to significantly reduce bit rate.
After removal of perceptual irrelevance, there is still statistical redundancy in the
remaining signal components, which can be further removed through entropy coding. Therefore, a lossy coder usually consists of two modules as shown in Fig. 1.3.


Note that, while quantization is a general terminology for coding techniques that
remove perceptual irrelevance, a lossy audio coding algorithm usually involves
sophisticated data modeling, so the use of quantization in Fig. 1.3 is an oversimplification if the context involves lossy audio coding algorithms and may imply that
data modeling is part of it.

1.3 Perceptual Irrelevance


The basic approach to removing perceptual irrelevance is quantization, which involves representing the samples of the source signal with lower resolution (see
Chaps. 2 and 3). For example, the integer value of 1,000, which needs 10 bits for
binary representation, may be quantized by a scalar quantizer (SQ) with a quantization step size of 9 as
1,000 / 9 ≈ 111,
which now only needs 7 bits. At the decoder side, the original value may be reconstructed as
111 × 9 = 999
for a quantization error of
1,000 − 999 = 1.
Considering the value of 1,000 above as a sample of a 10-bit PCM signal (no sign),
the above quantization process may be applied to all samples of the PCM signal to
generate another PCM signal of only 7 bits, for a compression ratio of 10:7.
Of course, the original 10-bit PCM signal cannot be perfectly reconstructed from
the 7-bit one due to quantization error. The quantization error is obviously dependent on the quantization step size: the larger the step size, the larger the quantization
error. If the level of quantization error above is considered perceptually irrelevant,
we have effectively compressed the 10-bit PCM signal into a 7-bit one. Otherwise,
the quantization step size needs to be reduced until the quantization error is perceptually irrelevant. To optimize compression performance, the step size can be
adjusted to an optimal value which gives a quantization error that is just not perceivable.
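A minimal Python sketch of this uniform scalar quantization round trip (mine, not the book's; rounding to the nearest quantization index is assumed) is given below:

def quantize(x, step):
    # Encoder side: map a sample to a quantization index
    return round(x / step)

def reconstruct(index, step):
    # Decoder side: map the index back to an approximate sample value
    return index * step

index = quantize(1000, 9)          # 111, representable in 7 bits
value = reconstruct(index, 9)      # 999
print(index, value, 1000 - value)  # quantization error of 1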
The quantization scheme illustrated above is the simplest uniform scalar quantization (see Sect. 2.3) which may be characterized by a constant quantization step
size applied to the whole dynamic range of the input signal. The quantization step
size may be made variable depending on the values of the input signal so as to better adapt to the perceived quality of the quantized and reconstructed signal. This
amounts to nonuniform quantization (see Sect. 2.4).
To exploit the inter-sample structure and correlation among the samples of the
input signal, a block of these samples may be grouped together and quantized as a
vector, amounting to vector quantization (VQ) (see Chap. 3).

1.5 Data Modeling

1.4 Statistical Redundancy


The basic approach to removing statistical redundancy is entropy coding whose
basic idea is to use long codewords to represent less frequent sample values and
short codewords for more frequent ones. As an example, let us consider the four
PCM sample values listed in the first column of Table 1.1, which have the probability
distribution listed in the second column of the same table. Since there are four PCM
sample values, we need to use at least 2 bits to represent a PCM signal that draws
sample values from the sample set above. However, if we use the codewords listed
in the third column of Table 1.1 to represent the PCM sample values, we end up
with the following average bit rate:

1 × (1/2) + 2 × (1/4) + 3 × (1/8) + 4 × (1/8) = 1.875 bits,

which amounts to a compression ratio of 2:1.875.

Table 1.1 The basic idea of entropy coding is to use long codewords to represent less frequent
sample values and short codewords for more frequent ones

PCM sample value    Probability    Entropy code
0                   1/2            1
1                   1/4            01
2                   1/8            001
3                   1/8            0001


The code in Table 1.1 is a variant of the unary code, which is not optimal for the
probability distribution in the table. For an arbitrary probability distribution, if there
is an optimal code which uses the least average number of bits to code the samples
of the source signal, Huffman code is one of such codes [29] (see Chap. 9).
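The average bit rate computed above can be verified with a few lines of Python (a sketch of mine, not part of the book):

table = {0: ("1", 1/2), 1: ("01", 1/4), 2: ("001", 1/8), 3: ("0001", 1/8)}

# Expected codeword length: sum over symbols of probability * code length
avg_len = sum(p * len(code) for code, p in table.values())
print(avg_len)  # 1.875 bits per sample, versus 2 bits for fixed-length coding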

1.5 Data Modeling


If audio coding involved only techniques for removing perceptual irrelevance and
statistical redundancy, it would be a much simpler field of study and coding performance would also be significantly limited. Fortunately, there is another class of
techniques that make audio coding much more effective and also complex. It is data
modeling.
Audio signals, like many other signals, are usually strongly correlated and have
internal structures that can be expressed via data models. As an example, let us consider the 1,000 Hz sinusoidal signal shown at the top of Fig. 1.4, which is represented
using 16-bit PCM with a sample rate of 48 kHz. Its periodicity manifests that it
is strongly correlated. One simple approach to modeling the periodicity so as to

Table 1.1 The basic idea of entropy coding is to use long


codewords to represent less frequent sample values and
short codewords for more frequent ones
PCM sample value
Probability
Entropy code
1
0
1
2
1
1
01
4
1
2
001
8
1
3
0001
8

10

1 Introduction

Amplitude

x 104
2
0
2
0.005

0.01
Time (second)

0.015

0.02

0.005

0.01
Time (second)

0.015

0.02

0
Frequency (Hz)

Magnitude (dB)

Amplitude

x 10
2
0
2

80
60
40
20
0

2
x104

Fig. 1.4 A 1,000 Hz sinusoidal signal represented as 16-bit PCM with a sample rate of 48 kHz
(top), its linear prediction error signal (middle), and its DFT spectrum (bottom)

One simple approach to modeling the periodicity so as to remove correlation is through linear prediction (see Chap. 4). Let $x(n)$ denote the $n$th sample of the sinusoidal signal and $\hat{x}(n)$ its predicted value. An extremely simple prediction scheme is to use the immediately preceding sample value as the prediction for the current sample value:

$$\hat{x}(n) = x(n-1).$$

This prediction is, of course, not perfect, so there is a prediction error, or residue,

$$e(n) = x(n) - \hat{x}(n),$$

which is shown in the middle of Fig. 1.4. If we elect to send this residue signal, instead of the original signal, to the decoder, we will end up with a much smaller number of bits due to its significantly reduced dynamic range. In fact, its dynamic range is $[-4278, 4278]$, which can be represented using 14-bit PCM, resulting in a compression ratio of 16:14.
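This first-order predictor is easy to try out. A minimal sketch, assuming NumPy; the exact extrema depend on the sampling phase, so the printed range only approximately matches the $[-4278, 4278]$ quoted above:

```python
import numpy as np

fs, f = 48000, 1000.0
n = np.arange(2 * fs // 100)  # 20 ms of samples
x = np.round(32767 * np.sin(2 * np.pi * f * n / fs)).astype(int)  # 16-bit PCM

e = x[1:] - x[:-1]            # residue e(n) = x(n) - x(n-1)
print(x.min(), x.max())       # close to [-32767, 32767]: needs 16 bits
print(e.min(), e.max())       # roughly [-4278, 4278]: fits in 14-bit PCM
```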

Another approach to decorrelation is orthogonal transforms (see Chap. 5) which, when properly designed, can transform the input signal into coefficients that are decorrelated and whose energy is compacted into a small number of coefficients. This compaction of energy is illustrated at the bottom of Fig. 1.4, which plots the logarithmic magnitude of the Discrete Fourier Transform (DFT) (see Sect. 5.4.1) of the 1,000 Hz sinusoidal signal at the top of Fig. 1.4. Instead of dealing with the periodically occurring large sample values of the original sinusoidal signal in the time domain, there are only a small number of large DFT coefficients in the frequency domain and the rest are extremely close to zero. A bit allocation strategy can be deployed which allocates bits to the representation of the DFT coefficients based on their respective magnitudes. Due to the energy compaction, only a small number of large DFT coefficients demand a significant number of bits and the remaining majority demand little, if any, so a tremendous degree of compression can be achieved.
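The energy compaction claim is easy to verify numerically. A minimal sketch, assuming NumPy, with the block size chosen so that the 1,000 Hz tone falls exactly on a DFT bin:

```python
import numpy as np

fs, f, N = 48000, 1000.0, 960          # 960 samples = exactly 20 cycles
x = np.sin(2 * np.pi * f * np.arange(N) / fs)

X = np.fft.rfft(x)                     # DFT of one block (real input)
energy = np.abs(X) ** 2
top4 = np.sort(energy)[-4:].sum()
print(top4 / energy.sum())             # ~1.0: almost all energy in a few bins
```

With a block size that is not an integer number of periods, spectral leakage spreads the energy somewhat, but the compaction remains dramatic.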
The DFT is rarely used in practical audio coding algorithms partly because it is a real-to-complex transform: for a block of N real input samples, it generates a block of N complex coefficients, which actually consist of N real and N imaginary coefficients. Discrete cosine transforms (DCT), which are real-to-real transforms, are more widely used in place of the DFT. Note that N is hereafter referred to as the block size or block length.
When blocks of transform coefficients are coded independently of each other, discontinuity occurs at the block boundaries. Referred to as the blocky effect, this discontinuity causes a periodic clicking sound in the reconstructed audio and is usually very annoying. To overcome this blocky effect, lapped transforms, which overlap adjacent blocks, were developed [49]; they may be considered special cases of subband filter banks [93] (see Chap. 6). Another benefit of overlapping between blocks is that the resultant transforms have sharper frequency responses and thus better energy-compacting performance.
To mitigate codec implementation cost, structured filter banks that are amenable to fast algorithms are mostly deployed in practical audio coding algorithms. Prominent among them are cosine modulated filter banks (CMFB), whose implementation cost is essentially that of a prototype FIR filter plus a DCT (see Chap. 7). A special case, the modified discrete cosine transform (MDCT) (see Sect. 7.6), whose prototype filter is only twice as long as the block size, has essentially dominated various audio coding standards.

1.6 Resolution Challenge


The compression achieved through energy compaction is based on two assumptions.
The first is that the input signal be quasistationary, full of fine frequency structures,
such as the one shown at the top of Fig. 1.5. This assumption is mostly correct
because audio signals are quasistationary most of the time. The second assumption
is that the transform or subband filter bank have a good frequency resolution to
resolve these fine frequency structures. Since the frequency resolution of a transform

Fig. 1.5 Audio signals are quasistationary (such as the one shown at the top) most of the time, but are frequently interrupted by dramatic transients (such as the one shown at the bottom)

or filter bank is largely proportional to the block size, this second assumption calls
for the deployment of transforms or filter banks with large block sizes. To achieve
a high degree of energy compaction, the block size should be as large as possible,
limited only by the physical memory of the encoder/decoder as well as the delay
associated with buffering a long block of samples.
Unfortunately, the first assumption above is not correct all the time: quasistationary episodes of audio signals are frequently interrupted by dramatic transients which may rise from absolute quiet to extreme loudness within a few samples. Examples of such transients include sudden gun shots and explosions. A less dramatic transient produced by a musical instrument is shown at the bottom of Fig. 1.5. For such a transient, it is well known that a long transform or filter bank would produce a flat spectrum that corresponds to small, if any, energy compaction, resulting in poor compression performance. To mitigate this problem, a short transform or filter bank should be used that has fine time resolution to localize the transient in the time domain.
To be able to code all audio signals with high coding performance all the time,
a transform/filter bank that would have good resolution in both time and frequency
domains is desired. Unfortunately, the Fourier uncertainty principle [75], which is
related to the Heisenberg uncertainty principle [90], states that this is impossible:
a transform/filter bank can have a good resolution either in the time or frequency
domain, but not both (see Sect. 11.1). This poses one of the most difficult challenges
in audio coding.
This challenge is usually addressed along the line of adapting the resolution of
transforms/filter banks with time to the changing resolution demands of the input

signal: applying long block sizes to quasistationary episodes and short ones to transients. There are a variety of ways to implement this scheme; the most dominant among them seems to be the switched block-size MDCT (see Sect. 11.2).
In order for the resolution adaptation to occur on the fly, a transient detection mechanism which detects the occurrence of transients and identifies their locations (see Sect. 11.7) is needed.
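One common and simple family of detectors compares short-term energies of consecutive sub-blocks within a frame. The sketch below is only illustrative; the sub-block count and threshold are assumptions, not values from any particular standard:

```python
import numpy as np

def detect_transient(frame, subblocks=8, ratio_threshold=8.0):
    """Flag a frame as transient if the short-term energy of any
    sub-block jumps by more than ratio_threshold over its predecessor."""
    sub = np.array_split(np.asarray(frame, dtype=float), subblocks)
    energies = np.array([np.sum(s * s) + 1e-12 for s in sub])  # avoid /0
    jumps = energies[1:] / energies[:-1]
    return bool(np.any(jumps > ratio_threshold))

quiet = np.zeros(1024)
attack = np.concatenate([np.zeros(512), 0.9 * np.ones(512)])
print(detect_transient(quiet), detect_transient(attack))  # False True
```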

1.7 Perceptual Models


While quantization is the tool for removing perceptual irrelevance, the question remains as to what degree of perceptual irrelevance can be safely removed without audible distortion. This question is addressed by perceptual models that mimic the psychoacoustic behaviors of the human ear.
When a source signal is fed to a perceptual model, it provides as output some kind of description as to which parts of the source audio signal are perceptually irrelevant. This description usually comes in the form of a threshold of power, called the masking threshold, below which sound cannot be perceived by the human ear and thus can be removed.
Since the human ear does most of its signal processing in the frequency domain, a perceptual model is best built in the frequency domain, with the masking threshold given as a function of frequency.
Consequently, the data model should desirably be a frequency transform/filter bank so that the results from the perceptual model, such as the masking threshold, can be readily and effectively utilized. It is, therefore, not a surprise that most modern audio coders operate in the frequency domain. It is still possible for an audio coder to operate in other domains, but there should be a mechanism to bridge that domain and the frequency domain in which the human ear mostly processes sound signals.

1.8 Global Bit Allocation


The adjustment of the quantization step size affects the level of quantization noise proportionally and the number of bits needed to represent transform coefficients or subband samples inversely. A small quantization step size can ensure that the quantization noise is not perceivable, but at the expense of consuming a large number of bits. A large quantization step size, on the other hand, demands a small number of bits, but at the expense of a high level of quantization noise. Since a lossy audio coder usually operates under a tight bit rate budget with a limited number of bits that can be used, a global bit allocation mechanism is needed to optimally allocate the limited bit resource so as to minimize the total perceived power of quantization noise.

The basic bit allocation strategy is to allocate bits (by decreasing the quantization step size) iteratively to the group of transform coefficients or subband samples whose quantization noise is most audible, until either the bit pool is exhausted or the quantization noise for all transform coefficients/subband samples is below the masking thresholds.
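This greedy loop can be sketched in a few lines. The bookkeeping below is schematic and assumes that allocating one more bit to a band lowers its noise power by about 6.02 dB (cf. (2.31) in Chap. 2); the function and parameter names are ours:

```python
def allocate_bits(noise_db, mask_db, bit_pool, cost_bits=1):
    """Greedy bit allocation sketch: repeatedly give bits to the band
    whose quantization noise is most audible (largest noise - mask)."""
    noise_db = list(noise_db)
    while bit_pool >= cost_bits:
        margins = [n - m for n, m in zip(noise_db, mask_db)]
        worst = max(range(len(margins)), key=lambda i: margins[i])
        if margins[worst] <= 0.0:   # all bands already below masking threshold
            break
        noise_db[worst] -= 6.02     # one more bit for this band
        bit_pool -= cost_bits
    return noise_db, bit_pool

print(allocate_bits([20.0, 5.0, -3.0], [0.0, 0.0, 0.0], bit_pool=4))
```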

1.9 Joint Channel Coding


The discrete channels of a multichannel audio signal are coordinated, synchronized in particular, to produce dynamic sound imaging, so the inter-channel correlation in a multichannel audio signal is very strong. This statistical redundancy can be exploited through some form of joint channel coding, either in the temporal or transform/subband domain.
The human ear relies on many cues in the audio signal to achieve sound localization, and the processing involved is very complex. However, many psychoacoustic experiments have consistently indicated that some components of the audio signal are insignificant or even irrelevant for sound localization, and thus can be removed for bit rate reduction.
Joint channel coding is the general term for the techniques that exploit inter-channel statistical redundancy and perceptual irrelevance. Unfortunately, this is a less studied area and the existing techniques are rather primitive. The ones primarily used by most audio coding algorithms are sum/difference coding (M/S stereo coding) and joint intensity coding, both of which are discussed in Chap. 12.
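Sum/difference coding itself is a one-line transform. A sketch assuming NumPy; the 1/2 scaling is one common convention, and codecs differ in where they place it:

```python
import numpy as np

def ms_encode(left, right):
    """Sum/difference (M/S) coding: strongly correlated L/R channels
    yield a small-amplitude side signal that is cheap to code."""
    return (left + right) / 2.0, (left - right) / 2.0

def ms_decode(mid, side):
    return mid + side, mid - side   # perfect reconstruction

left = np.array([1.0, 0.8, -0.5])
right = np.array([0.9, 0.7, -0.4])
mid, side = ms_encode(left, right)
print(side)                          # much smaller than left or right
print(ms_decode(mid, side))          # recovers the original channels
```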

1.10 Basic Architecture


The various techniques discussed in the earlier sections can now be put together
to arrive at the basic audio encoder architecture shown in Fig. 1.6. The multiplexer
in the figure is a necessary module that packs all elements of the compressed audio
data into a coherent bit stream adhering to a specific format suitable for transmission
over various communication channels.
The corresponding basic decoder architecture is shown in Fig. 1.7. Each module
in this figure simply performs the inverse, and usually simpler, operation of the corresponding module in the encoder. Perceptual model, transient detection and global
bit allocation are usually complex and computationally expensive, so are not suitable for inclusion in the decoder. In addition, the decoder usually does not have the
relevant information to perform those operations. Therefore, these modules usually
do not have counterparts in the decoder. All that the decoder needs are the results
from these modules and they can be packed into the bit stream as part of the side
information.

Fig. 1.6 The basic audio encoder architecture. The solid lines represent movement of audio data
and the dashed line indicates control information

Fig. 1.7 The basic audio decoder architecture. The solid lines represent movement of audio data
and the dashed line indicates control information

In addition to the modules shown in Figs. 1.6 and 1.7, the development of
an audio coding algorithm also involves many practical and important issues,
including:
Data Structure. How transform coefficients or subband samples from all audio
channels are organized and accessed by the audio coding algorithm.
Bit Stream Format. How entropy codes and bits representing other control information are packed into a coherent bit stream.

Implementation. How to structure the arithmetic of the algorithm to make the encoder and, especially, the decoder amenable to easy implementation on inexpensive hardware such as fixed-point microprocessors.
An important but often overlooked issue is the necessity for frame-based processing. An audio signal may last as long as a few hours, so encoding/decoding it as
a monolithic piece takes a long time and demands tremendous hardware resources.
The resulting monolithic piece of encoded bit stream also makes real-time delivery
and decoding impossible. A practical approach is to segment a source audio signal
into consecutive frames usually ranging from 2 to 50 ms in duration and code each
of them in sequence. Most transforms/filter banks, such as MDCT, are block-based,
so the size of the frames can be conveniently set to be either the same as the block
size or a multiple of it.

1.11 Performance Assessment


Performance evaluation is an essential and necessary part of any algorithm development. The intuitive performance measure for audio coding is the compression ratio
defined in (1.2). Although simple, its effectiveness is very limited, mostly because it
changes with time for a given audio signal and even more dramatically between different audio signals. A rational approach is to use the worst compression ratio for a
set of difficult audio signals as the compression ratio for the audio coding algorithm.
This is usually enough for lossless audio coding.
For lossy audio coding, however, there is another factor that affects the usefulness of compression ratio: the perception or assessment of coding distortion. The compression ratio defined in (1.2) assumes that there is no audible distortion in the decoded audio signal. This is a critical assumption that renders the compression ratio meaningful. If this assumption were removed, any algorithm could achieve the maximal possible compression ratio, which is infinity, by not sending any bits to the decoder. Of course, this results in the maximal distortion: no audio signal is output by the decoder at all. At the other extreme, we could throw an excessive number of bits at the encoder to push the distortion far below the threshold of perception, in the process wasting the precious bit resource. It is, therefore, necessary to establish the level of just inaudible distortion before the compression ratio can be calculated. It is the compression ratio calculated at this point that authentically reflects the coding performance of the underlying audio coding algorithm.
A more widely used approach to performance assessment, especially when different audio coding algorithms are compared, is to perceptually evaluate the level of distortion for a given bit rate and a selected set of critical test signals. The perceptual evaluation of distortion must ultimately be performed by the human ear through listening tests. For the same piece of decoded audio signal, different people are likely to hear differently: some may hear distortion and some may not. Playback equipment and the listening conditions also significantly impact the audibility of distortion. Therefore, a set of procedures and methods for conducting casual and formal listening tests is needed; these are discussed in Chap. 14.

Part II

Quantization

Quantization literally is a process of converting samples of a discrete-time source signal into a digital representation with reduced resolution. It is a necessary step for converting analog signals in the real world to digital signals, which enables digital signal processing. During this process of conversion, quantization also achieves a tremendous deal of compression because an analog sample is considered as having infinite resolution, thus requiring an infinite number of bits to represent, while a digital sample is of limited resolution and is represented using a limited number of bits. This conversion also means that a tremendous amount of information is lost forever. This loss of information might be a serious concern, but can be made imperceptible or tolerable by properly designing the quantization process. The human ear, for example, is widely believed to be unable to perceive resolution higher than 24 bits per sample. Any information or resolution beyond this may be considered irrelevant, hence can be discarded through quantization.
When a digital signal is acquired through quantizing an analog one, the primary
concern is to make sure that the digital signal is obtained at the desired resolution, or all relevant information is not lost. There is little, if any, attempt to seek a
compact representation for the acquired digital signal. A compact representation is
pursued afterwards, usually when the need for storage or transmission arises. Once
the digital signal is inspected under the spotlight of compact representation, one
may be surprised by the amount of unnecessary or irrelevant information that it still
contains. This irrelevance can be removed by re-quantizing the already-quantized
digital signal.
There is essentially no difference in methodology between quantizing an analog
signal and re-quantizing a digital signal, so they will not be distinguished in the
treatment in this book.
Scalar quantization (SQ) quantizes a source signal one sample at a time. It is simple, but its performance is not as good as the more sophisticated vector quantization
(VQ) which quantizes a block of input samples each time.

Chapter 2

Scalar Quantization

An audio signal is a representation of sound waves, usually in the form of a sound pressure level that varies with time. Such a signal is continuous both in value and time, hence carries an infinite amount of information.
The first step of significant compression is accomplished when a continuous-time audio signal is converted into a discrete-time signal using sampling. In what constitutes uniform sampling, the simplest sampling method, the continuous-time signal is sampled at a regular interval T, called the sampling period. According to the Nyquist-Shannon sampling theorem [65, 68], the original continuous-time signal can be perfectly reconstructed from the sampled discrete-time signal if the continuous-time signal is band-limited and its bandwidth is no more than half of the sample rate (1/T). Therefore, sampling accomplishes a tremendous amount of lossless compression if the source signal is ideally band-limited.
After sampling, each sample of the discrete-time signal has a value that is continuous, so the number of possible distinct output values is infinite. Consequently, the
number of bits needed to represent and/or convey such a value exactly to a recipient
is unlimited.
For the human ear, however, an exact continuous sample value is unnecessary because the resolution that the ear can perceive is very limited. Many believe that it is less than 24 bits. So a simple scheme of replacing an analog sample value with the integer value that is closest to it would not only satisfy the perceptual capability of the ear, but also remove a tremendous deal of imperceptible information from a continuously valued signal. For example, the hypothetical analog samples in the left column of Table 2.1 may be represented by the respective integer values in the right column. This process is called quantization.
The underlying mechanism for quantizing the sample values in Table 2.1 is to divide the real number line into intervals and then map each such interval to an integer value. This is shown in Table 2.2, which is called a quantization table.
The quantization process actually involves three steps, as shown in Fig. 2.1 and explained below:
Forward Quantization. A source sample value is used to look up the left column to find the interval, referred to as the decision interval, that it falls into; the corresponding index, referred to as the quantization index, in the center column is then identified. This mapping is referred to as encoder mapping.
Table 2.1 An example of mapping analog sample values to integer values that would take place in a process called quantization

Analog sound pressure level        Integer sound pressure level
-3.4164589759...                   -3
-3.124341...                       -3
-2.14235...                        -2
-1.409086743...                    -1
-0.61341984378562890423...         -1
0.37892458...                      0
0.61308...                         1
1.831401348156...                  2
2.8903219654710...                 3
3.208913064...                     3

Table 2.2 Quantization table that maps source sample intervals in the left column to integer values in the right column

Sample value interval    Index    Integer value
(−∞, −2.5)               0        −3
[−2.5, −1.5)             1        −2
[−1.5, −0.5)             2        −1
[−0.5, 0.5)              3        0
[0.5, 1.5)               4        1
[1.5, 2.5)               5        2
[2.5, ∞)                 6        3

Fig. 2.1 Quantization involves an encoding or forward quantization stage represented by $Q$, which maps an input value to the quantization index, and a decoding or inverse quantization stage represented by $Q^{-1}$, which maps the quantization index to the quantized value

Index Transmission. The quantization index is transmitted to the receiver.
Inverse Quantization. Upon receiving the quantization index, the receiver uses it to read out the integer value, referred to as the quantized value, in the right column. This mapping is referred to as decoder mapping.
The quantization table above maps sound pressure levels with infinite range and resolution into seven integers, which need only 3 bits to represent, thus achieving a great deal of data compression. However, this comes with a price: much of the original resolution is lost forever. This loss of information may be significant, but it was done on purpose: those lost pieces of information are irrelevant to our needs or perception, so we can afford to discard them.
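The three steps (encoder mapping, index transmission, decoder mapping) translate directly from Table 2.2 into code. A minimal sketch using Python's standard bisect module for the interval search:

```python
import bisect

# Table 2.2: interior decision boundaries and the quantized values.
boundaries = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
quantized = [-3, -2, -1, 0, 1, 2, 3]   # one value per decision interval

def forward_quantize(x):
    """Encoder mapping: the index of the decision interval containing x."""
    return bisect.bisect_right(boundaries, x)   # quantization index 0..6

def inverse_quantize(q):
    """Decoder mapping: look up the quantized value for index q."""
    return quantized[q]

q = forward_quantize(1.831401348156)
print(q, inverse_quantize(q))   # 5 2, as in Tables 2.1 and 2.2
```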

2.1 Scalar Quantization


To pose the quantization process outlined above mathematically, let us consider a source random variable $X$ with a probability density function (PDF) of $p(x)$. Suppose that we wish to quantize this source with $M$ decision intervals defined by the following $M+1$ endpoints

$$b_q, \quad q = 0, 1, \ldots, M, \tag{2.1}$$

referred to as decision boundaries, and with the following $M$ quantized values

$$\hat{x}_q, \quad q = 1, 2, \ldots, M, \tag{2.2}$$

which are also called output values or representative values. A source sample value $x$ is quantized to the quantization index $q$ if and only if $x$ falls into the $q$th decision interval

$$[b_{q-1}, b_q), \tag{2.3}$$

so the operation of forward quantization is

$$q = Q(x), \quad \text{if and only if} \quad b_{q-1} \le x < b_q. \tag{2.4}$$

The quantized value can be reconstructed from the quantization index by the following inverse quantization

$$\hat{x}_q = Q^{-1}(q), \tag{2.5}$$

which is also referred to as backward quantization. Since $q$ is a function of $x$ as shown in (2.4), $\hat{x}_q$ is also a function of $x$ and can be written as

$$\hat{x}(x) = \hat{x}_q = Q^{-1}(Q(x)). \tag{2.6}$$

This quantization scheme is called scalar quantization (SQ) because the source signal is quantized one sample at a time.
The function in (2.6) is another approach to describing the input-output map of a quantizer, in addition to the quantization table. Figure 2.2 is such a function that describes the quantization map of Table 2.2.
The quantization operation in (2.4) obviously causes much loss of information: the reconstructed quantized value obtained in (2.5) or (2.6) is different from the input to the quantizer. The difference between them is called quantization error:

$$q(x) = \hat{x}(x) - x. \tag{2.7}$$

It is also referred to as quantization distortion or quantization noise.


Equation (2.7) may be rewritten as
x.x/
O
D x C q.x/;

(2.8)

so the quantization process is often modeled as an additive noise process as shown


in Fig. 2.3.


Fig. 2.2 Input-output map for the quantizer shown in Table 2.2

Fig. 2.3 Additive noise model for quantization

The average loss of information introduced by quantization may be characterized by the average quantization error. Among the many norms that may be used to measure this error, the L-2 norm or Euclidean distance is usually used and is called mean squared quantization error (MSQE):

$$\sigma_q^2 = \int_{-\infty}^{\infty} q^2(x)\, p(x)\, dx = \int_{-\infty}^{\infty} (\hat{x}(x) - x)^2\, p(x)\, dx = \sum_{q=1}^{M} \int_{b_{q-1}}^{b_q} (\hat{x}(x) - x)^2\, p(x)\, dx. \tag{2.9}$$


Since $\hat{x}(x) = \hat{x}_q$ is a constant within the decision interval $[b_{q-1}, b_q)$, we have

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{b_{q-1}}^{b_q} (\hat{x}_q - x)^2\, p(x)\, dx. \tag{2.10}$$

The MSQE may be better appreciated when compared with the power of the source signal. This may be achieved using the signal-to-noise ratio (SNR) defined below:

$$\text{SNR (dB)} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_q^2}\right), \tag{2.11}$$

where $\sigma_x^2$ is the variance of the source signal.
It is obvious that the smaller the decision intervals, the smaller the error term $(\hat{x}_q - x)^2$ in (2.10), and thus the smaller the mean squared quantization error $\sigma_q^2$. This indicates that $\sigma_q^2$ is inversely proportional to the number of decision intervals $M$. The placement of each individual decision boundary and quantized value also plays a major role in the final $\sigma_q^2$. The problem of quantizer design may be posed in a variety of ways, including:

• Given a fixed $M$:
$$M = \text{Constant}, \tag{2.12}$$
find the optimal placement of decision boundaries and quantized values so that $\sigma_q^2$ is minimized. This is the most widely used approach.

• Given a distortion constraint:
$$\sigma_q^2 < \text{Threshold}, \tag{2.13}$$
find the optimal placement of decision boundaries and quantized values so that $M$ is minimized. A minimal $M$ means a minimal number of bits needed to represent the quantized values, hence a minimal bit rate.

2.2 Re-Quantization
The quantization process was presented above with the assumption that the source random variable or sample values are continuous or analog. Quantization by name usually gives the impression that it is only for quantizing analog sample values. When dealing with such analog source sample values, the associated forward quantization is referred to as ADC (analog-to-digital conversion) and the inverse quantization as DAC (digital-to-analog conversion).

Table 2.3 A quantization table for re-quantizing a discrete source

Decision interval    Quantization index    Re-quantized value
[0, 10)              0                     5
[10, 20)             1                     15
[20, 30)             2                     25
[30, 40)             3                     35
[40, 50)             4                     45
[50, 60)             5                     55
[60, 70)             6                     65
[70, 80)             7                     75
[80, 90)             8                     85
[90, 100)            9                     95

Discrete source sample values can also be further quantized. For example, consider a source that takes integer sample values between 0 and 100. If it is decided, for some reason, that this resolution is too much or irrelevant for a particular application and sample values spaced at an interval of 10 are really what are needed, the quantization table shown in Table 2.3 can be established to re-quantize the integer sample values.
With discrete source sample values, the formulation of the quantization process in Sect. 2.1 is still valid with the replacement of the probability density function with the probability distribution function and integration with summation.

2.3 Uniform Quantization


Both quantization Tables 2.2 and 2.3 are the embodiment of uniform quantization,
which is the simplest among all quantization schemes. The decision boundaries of
a uniform quantizer are equally spaced, so its decision intervals are all of the same
length and can be represented by a constant called quantization step size. For example, the quantization step size for Table 2.2 is 1 and for Table 2.3 is 10.
When an analog signal is uniformly sampled and subsequently quantized using
a uniform quantizer, the resulting digital representation is called pulse-code modulation (PCM). It is the default form of representation for many digital signals, such
as speech, audio, and video.

2.3.1 Formulation

Let us consider a uniform quantizer that covers an interval $[X_{\min}, X_{\max}]$ of a random variable $X$ with $M$ decision intervals. Since its quantization step size is

$$\Delta = \frac{X_{\max} - X_{\min}}{M}, \tag{2.14}$$

its decision boundaries can be represented as

$$b_q = X_{\min} + \Delta \cdot q, \quad q = 0, 1, \ldots, M. \tag{2.15}$$

The midpoint of a decision interval is often selected as the quantized value for that interval:

$$\hat{x}_q = X_{\min} + \Delta \cdot (q - 0.5), \quad q = 1, 2, \ldots, M. \tag{2.16}$$

For such a quantization scheme, the MSQE in (2.10) becomes

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{X_{\min}+(q-1)\Delta}^{X_{\min}+q\Delta} (X_{\min} + \Delta(q - 0.5) - x)^2\, p(x)\, dx. \tag{2.17}$$

Let

$$y = X_{\min} + \Delta(q - 0.5) - x;$$

then (2.17) becomes

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} y^2\, p(X_{\min} + \Delta q - (y + 0.5\Delta))\, dy. \tag{2.18}$$

Plugging in (2.15), (2.18) becomes

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} x^2\, p(b_q - (x + 0.5\Delta))\, dx. \tag{2.19}$$

Plugging in (2.16), (2.18) becomes

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} x^2\, p(\hat{x}_q - x)\, dx. \tag{2.20}$$

2.3.2 Midtread and Midrise Quantizers


There are two major types of uniform quantizers. The one shown in Fig. 2.2 is called
midtread because it has zero as one of its quantized values. It is useful for situations
where it is necessary for the zero value to be represented. One such example is
control systems where a zero value needs to be accurately represented. This is also

important for audio signals because the zero value is needed to represent absolute quiet. Due to the midtreading of zero, the number of decision intervals ($M$) is odd if a symmetric sample value range ($X_{\min} = -X_{\max}$) is to be covered.
Since both the decision boundaries and the quantized values can be represented by a single step size, the implementation of the midtread uniform quantizer is simple and straightforward. The forward quantizer may be implemented as

$$q = \mathrm{round}\left(\frac{x}{\Delta}\right), \tag{2.21}$$

where $\mathrm{round}(\cdot)$ is the rounding function which returns the integer that is closest to the input. The corresponding inverse quantizer may be implemented as

$$\hat{x}_q = \Delta \cdot q. \tag{2.22}$$

The other uniform quantizer does not have zero as one of its quantized values, so it is called midrise. This is shown in Fig. 2.4. Its number of decision intervals is even if a symmetric sample value range is to be covered. The forward quantizer may be implemented as

$$q = \begin{cases} \mathrm{truncate}\left(\frac{x}{\Delta}\right) + 1, & \text{if } x > 0; \\ \mathrm{truncate}\left(\frac{x}{\Delta}\right) - 1, & \text{otherwise;} \end{cases} \tag{2.23}$$

Fig. 2.4 An example of midrise quantizer


where $\mathrm{truncate}(\cdot)$ is the truncation function which returns the integer part of the input, without the fractional digits. Note that $q = 0$ is forbidden for a midrise quantizer. The corresponding inverse quantizer is expressed below:

$$\hat{x}_q = \begin{cases} \Delta (q - 0.5), & \text{if } q > 0; \\ \Delta (q + 0.5), & \text{otherwise.} \end{cases} \tag{2.24}$$
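Equations (2.21)-(2.24) translate directly into code. A minimal sketch; round-half-up is used for (2.21), since exact tie-breaking at the decision boundaries is an implementation choice:

```python
import math

def midtread(x, step):
    q = math.floor(x / step + 0.5)   # round(x/step), (2.21)
    return q, step * q               # quantized value, (2.22)

def midrise(x, step):
    q = math.trunc(x / step) + (1 if x > 0 else -1)        # (2.23)
    xq = step * (q - 0.5) if q > 0 else step * (q + 0.5)   # (2.24)
    return q, xq

print(midtread(1.83, 1.0))   # (2, 2.0)
print(midrise(1.83, 1.0))    # (2, 1.5)
```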

2.3.3 Uniformly Distributed Signals


As seen in (2.20), the MSQE of a uniform quantizer depends on the probability density function. When this density function is uniformly distributed over $[X_{\min}, X_{\max}]$:

$$p(x) = \frac{1}{X_{\max} - X_{\min}}, \quad x \in [X_{\min}, X_{\max}], \tag{2.25}$$

(2.20) becomes

$$\sigma_q^2 = \frac{1}{X_{\max} - X_{\min}} \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} y^2\, dy = \frac{1}{X_{\max} - X_{\min}} \sum_{q=1}^{M} \frac{\Delta^3}{12} = \frac{M \Delta^3}{12 (X_{\max} - X_{\min})}.$$

Due to the step size given in (2.14), the above equation becomes

$$\sigma_q^2 = \frac{\Delta^2}{12}. \tag{2.26}$$

For the uniform distribution in (2.25), its variance (signal power) is

$$\sigma_x^2 = \frac{1}{X_{\max} - X_{\min}} \int_{X_{\min}}^{X_{\max}} x^2\, dx = \frac{(X_{\max} - X_{\min})^2}{12}, \tag{2.27}$$


so the signal-to-noise ratio (SNR) of the uniform quantizer is

$$\text{SNR (dB)} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_q^2}\right) = 10 \log_{10}\left(\frac{(X_{\max} - X_{\min})^2}{12} \cdot \frac{12}{\Delta^2}\right) = 20 \log_{10}\left(\frac{X_{\max} - X_{\min}}{\Delta}\right). \tag{2.28}$$

Due to the step size given in (2.14), the above SNR expression becomes

$$\text{SNR (dB)} = 20 \log_{10}(M) = \frac{20}{\log_2(10)} \log_2(M) \approx 6.02 \log_2(M). \tag{2.29}$$

If the quantization indexes are represented using fixed-length codes, each codeword can be represented using

$$R = \mathrm{ceil}\left(\log_2(M)\right) \text{ bits}, \tag{2.30}$$

which is referred to as bits per sample or bit rate. Consequently, (2.29) becomes

$$\text{SNR (dB)} = \frac{20}{\log_2(10)} R \approx 6.02 R \text{ dB}, \tag{2.31}$$

which indicates that, for each additional bit allocated to the quantizer, the SNR is increased by about 6.02 dB.
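The 6.02 dB-per-bit rule is easy to confirm by simulation. A sketch assuming NumPy, quantizing a uniformly distributed source with a midtread quantizer at several bit rates:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 1_000_000)   # uniform source on [-1, 1]

for R in (4, 8, 12):
    M = 2 ** R
    step = 2.0 / M                       # (2.14)
    q = np.clip(np.round(x / step), -M // 2, M // 2 - 1)
    snr = 10 * np.log10(x.var() / np.mean((q * step - x) ** 2))
    print(R, round(snr, 2))              # close to 6.02 * R dB
```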

2.3.4 Nonuniformly Distributed Signals


Most signals, and audio signals in particular, are rarely uniformly distributed. As
indicated by (2.20), the contribution of each quantization error to the MSQE is
weighted by the probability density function. A nonuniform distribution means that
the weighting is different now, so a different MSQE is expected and is discussed in
this section.

2.3.4.1 Granular and Overload Error


A nonuniformly distributed signal, such as Gaussian, is usually not bounded, so the dynamic range $[X_{\min}, X_{\max}]$ of a uniform quantizer cannot cover the whole range of the source signal. This is illustrated in Fig. 2.5. The areas beyond $[X_{\min}, X_{\max}]$ are called overload areas. When a source sample falls into an overload area, the quantizer can only assign either the minimum or the maximum quantized value to it:
Fig. 2.5 Overload and granular quantization errors

$$\hat{x}(x) = \begin{cases} X_{\max} - 0.5\Delta, & \text{if } x > X_{\max}; \\ X_{\min} + 0.5\Delta, & \text{if } x < X_{\min}. \end{cases} \tag{2.32}$$

This introduces additional quantization error, called overload error or overload noise. The mean squared overload error is obviously the following:

$$\sigma_{q(\text{overload})}^2 = \int_{X_{\max}}^{\infty} (x - (X_{\max} - 0.5\Delta))^2\, p(x)\, dx + \int_{-\infty}^{X_{\min}} (x - (X_{\min} + 0.5\Delta))^2\, p(x)\, dx. \tag{2.33}$$

The MSQE given in (2.17) only accounts for quantization error within $[X_{\min}, X_{\max}]$, which is referred to as granular error or granular noise. The total quantization error is

$$\sigma_{q(\text{total})}^2 = \sigma_q^2 + \sigma_{q(\text{overload})}^2. \tag{2.34}$$
q.total/
For a given PDF $p(x)$ and number of decision intervals $M$, (2.17) indicates that the smaller the quantization step size $\Delta$ is, the smaller the granular quantization noise $\sigma_q^2$ becomes. According to (2.14), however, a smaller quantization step size $\Delta$ also translates into smaller $|X_{\min}|$ and $X_{\max}$ for a fixed $M$. Smaller $|X_{\min}|$ and $X_{\max}$ obviously lead to larger overload areas, hence a larger overload quantization error $\sigma_{q(\text{overload})}^2$. Therefore, the choice of $\Delta$, or equivalently the range $[X_{\min}, X_{\max}]$ of the uniform quantizer, represents a trade-off between granular and overload quantization errors.
This trade-off is, of course, relative to the effective width of the given PDF, which may be characterized by its standard deviation $\sigma$. The ratio of the quantization range $[X_{\min}, X_{\max}]$ over the signal standard deviation,

$$F_l = \frac{X_{\max} - X_{\min}}{\sigma}, \tag{2.35}$$

called the loading factor, is apparently a good description of this trade-off. For a Gaussian distribution, a loading factor of 4 means that the probability of input


samples going beyond the range is 0.045. For a loading factor of 6, the probability reduces to 0.0027. For most applications, a loading factor of 4 is sufficient.

2.3.4.2 Optimal SNR and Step Size


To find the optimal quantization step size $\Delta$ that gives the minimum total MSQE $\sigma_{q(\text{total})}^2$, let us drop (2.17) and (2.33) into (2.34) to obtain

$$\begin{aligned} \sigma_{q(\text{total})}^2 = {} & \sum_{q=1}^{M} \int_{X_{\min}+(q-1)\Delta}^{X_{\min}+q\Delta} (x - (X_{\min} + \Delta q - 0.5\Delta))^2\, p(x)\, dx \\ & + \int_{X_{\max}}^{\infty} (x - (X_{\max} - 0.5\Delta))^2\, p(x)\, dx \\ & + \int_{-\infty}^{X_{\min}} (x - (X_{\min} + 0.5\Delta))^2\, p(x)\, dx. \end{aligned} \tag{2.36}$$

Usually, a uniform quantizer is symmetrically designed such that

$$X_{\min} = -X_{\max}. \tag{2.37}$$

Then (2.14) becomes

$$\Delta = \frac{2 X_{\max}}{M}. \tag{2.38}$$

Replacing all $X_{\min}$ and $X_{\max}$ with $\Delta$ using the above equations, we have

$$\begin{aligned} \sigma_{q(\text{total})}^2 = {} & \sum_{q=1}^{M} \int_{(q-1-0.5M)\Delta}^{(q-0.5M)\Delta} ((q - 0.5 - 0.5M)\Delta - x)^2\, p(x)\, dx \\ & + \int_{0.5M\Delta}^{\infty} (x - 0.5(M-1)\Delta)^2\, p(x)\, dx \\ & + \int_{-\infty}^{-0.5M\Delta} (x + 0.5(M-1)\Delta)^2\, p(x)\, dx. \end{aligned} \tag{2.39}$$

Assuming a symmetric PDF,

$$p(x) = p(-x), \tag{2.40}$$

and doing a variable change of $y = -x$ in the last term of (2.39), it turns out that this last term becomes the same as the second term, so (2.39) becomes

$$\sigma_{q(\text{total})}^2 = \sum_{q=1}^{M} \int_{(q-1-0.5M)\Delta}^{(q-0.5M)\Delta} ((q - 0.5 - 0.5M)\Delta - x)^2\, p(x)\, dx + 2 \int_{0.5M\Delta}^{\infty} (x - 0.5(M-1)\Delta)^2\, p(x)\, dx. \tag{2.41}$$

Now that both (2.39) and (2.41) are only a function of $\Delta$, their minimum can be found by setting their respective first-order derivative with respect to $\Delta$ to zero:

$$\frac{\partial}{\partial \Delta} \sigma_{q(\text{total})}^2 = 0. \tag{2.42}$$

This equation can be solved using a variety of numerical methods; see [76], for example.
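In practice the total MSQE can also be estimated by Monte Carlo simulation and minimized by a simple scan over candidate step sizes. A sketch for a unit-variance Gaussian source and an 8-level (3-bit) symmetric quantizer, assuming NumPy; this is just an illustration, not the method of [76]:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500_000)    # unit-variance Gaussian source
M = 8                               # 3 bits per sample

def total_msqe(step):
    # Symmetric M-level midrise quantizer, including overload clipping.
    q = np.clip(np.floor(x / step) + 0.5, -M / 2 + 0.5, M / 2 - 0.5)
    return np.mean((q * step - x) ** 2)

steps = np.linspace(0.1, 2.0, 96)
best = min(steps, key=total_msqe)
print(best)                         # about 0.59 for M = 8
```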
Figure 2.6 shows the optimal SNR achieved by a uniform quantizer at various bits per sample (see (2.30)) for Gaussian, Laplacian, and Gamma distributions [33]. The SNR given in (2.31) for the uniform distribution, which is the best SNR that a uniform quantizer can achieve, is plotted as the benchmark. It is a straight line in the form of

$$\text{SNR}(R) = a + bR \text{ (dB)}, \tag{2.43}$$

with a slope of

$$b = \frac{20}{\log_2(10)} \approx 6.02 \tag{2.44}$$

and an intercept of

$$a = 0. \tag{2.45}$$

Fig. 2.6 Optimal SNR achieved by a uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions

Apparently, the curves for the other PDFs also seem to fit straight lines with different slopes and intercepts. Notice that both the slope $b$ and the intercept $a$ decrease as the peakedness or kurtosis of the PDF increases in the order of uniform, Gaussian, Laplacian, and Gamma, indicating that the overall performance of a uniform quantizer is inversely related to PDF kurtosis. This degradation in performance is mostly reflected in the intercept $a$; the slope $b$ is only moderately affected.
There is, nevertheless, a reduction in slope when compared with the uniform distribution. This reduction indicates that the quantization performance for the other distributions relative to the uniform distribution becomes worse at higher bit rates.
Figure 2.7 shows the optimal step size normalized by the signal standard deviation, $\Delta_{\text{opt}}/\sigma_x$, for Gaussian, Laplacian, and Gamma distributions as a function of the number of bits per sample [33]. The data for the uniform distribution is used as the benchmark. Due to (2.14), (2.27) and (2.30), the normalized quantization step size for the uniform distribution is

$$\log_{10}\left(\frac{\Delta}{\sigma_x}\right) = \log_{10}\left(\frac{2\sqrt{3}}{2^{\log_2(M)}}\right) = \log_{10}\left(2\sqrt{3}\right) - R \log_{10}(2), \tag{2.46}$$

so it is a straight line. Apparently, as the peakedness or kurtosis increases in the order of uniform, Gaussian, Laplacian, and Gamma distributions, the step size also increases. This is necessary for an optimal balance between granular and overload quantization errors: an increased kurtosis means that the probability density is spread more toward the tails, resulting in more overload error, so the step size has to be increased to counteract this increased overload error.

Fig. 2.7 Optimal step size used by a uniform quantizer to achieve optimal SNR for uniform, Gaussian, Laplacian, and Gamma distributions


The empirical formula (2.43) is very useful for estimating the minimal total MSQE for a particular quantizer, given the signal power and bit rate. In particular, dropping the SNR definition in (2.11) into (2.43), we can represent the total MSQE as

$$10 \log_{10} \sigma_q^2 = 10 \log_{10} \sigma_x^2 - a - bR \tag{2.47}$$

or

$$\sigma_q^2 = 10^{-0.1(a+bR)} \sigma_x^2. \tag{2.48}$$

2.4 Nonuniform Quantization


Since the MSQE formula (2.10) indicates that the quantization error incurred by a source sample $x$ is weighted by the PDF $p(x)$, one approach to reducing the MSQE is to reduce quantization error in densely distributed areas where the weight is heavy. Formula (2.10) also indicates that the quantization error incurred by a source sample value $x$ is actually the distance between it and the quantized value $\hat{x}$, so large quantization errors are caused by input samples far away from the quantized value, i.e., those which are near the decision boundaries. Therefore, reducing quantization errors in densely distributed areas necessitates using smaller decision intervals there. For a given number of decision intervals $M$, this also means that larger decision intervals need to be placed over the rest of the PDF support so that the whole input range is covered.
From the perspective of resource allocation, each quantization index is a piece of bit resource that is allocated in the course of quantizer design, and there are only $M$ pieces of resources. A quantization index is one-to-one associated with a quantized value and a decision interval, so a piece of resource is considered as consisting of a set of quantization index, quantized value, and decision interval. The problem of quantizer design may be posed as the optimal allocation of these resources to minimize the total MSQE. To achieve this, each piece of resources should be allocated to carry the same share of quantization error contribution to the total MSQE. In other words, the MSQE contribution carried by individual pieces of resources should be equalized.
For a uniform quantizer, its resources are allocated uniformly, except for the first and last quantized values, which cover the overload areas. As shown at the top of Fig. 2.8, its resources in the tail areas of the PDF are not fully utilized because the low probability density or weight causes them to carry too little MSQE contribution. Similarly, its resources in the head area are over-utilized because the high probability density or weight causes them to carry too much MSQE contribution. To reduce the overall MSQE, those mis-allocated resources need to be re-distributed in such a way that the MSQE produced by individual pieces of resource is equalized. This is shown at the bottom of Fig. 2.8.

Fig. 2.8 Quantization resources are under-utilized by the uniform quantizer (top) in the tail areas and over-utilized in the head area of the PDF. These resources are re-distributed in the nonuniform quantizer (bottom) so that individual pieces of resources carry the same amount of MSQE contribution, leading to smaller MSQE

The above two considerations indicate that the MSQE can be reduced by assigning the size of decision intervals inversely proportional to the probability density. The consequence of this strategy is that the more densely distributed the PDF is, the more densely placed the decision intervals can be, and thus the smaller the MSQE becomes.
One approach to nonuniform quantizer design is to pose it as an optimization problem: finding the quantization intervals and quantized values that minimize the MSQE. This leads to the Lloyd-Max algorithm. Another approach is to transform the source signal through a nonlinear function in such a way that the transformed signal has a PDF that is almost uniform; then a uniform quantizer may be used to deliver improved performance. This leads to companding.


2.4.1 Optimal Quantization and Lloyd-Max Algorithm


Given a PDF $p(x)$ and a number of decision intervals $M$, one approach to the design of a nonuniform quantizer is to find the set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ such that the MSQE in (2.10) is minimized. Towards the solution of this optimization problem, let us first consider the following partial derivative:

$$\frac{\partial \sigma_q^2}{\partial \hat{x}_q} = 2 \int_{b_{q-1}}^{b_q} (\hat{x}_q - x)\, p(x)\, dx = 2 \hat{x}_q \int_{b_{q-1}}^{b_q} p(x)\, dx - 2 \int_{b_{q-1}}^{b_q} x\, p(x)\, dx. \tag{2.49}$$

bq1

Setting it to zero, we have


R bq
b

xO q D R q1
bq

xp.x/dx
;

(2.50)

bq1 p.x/dx

which indicates that the quantized value for each decision interval is the centroid of
the probability mass in the interval.
Let us now consider another partial derivative:

$$\frac{\partial \sigma_q^2}{\partial b_q} = (\hat{x}_q - b_q)^2\, p(b_q) - (\hat{x}_{q+1} - b_q)^2\, p(b_q). \tag{2.51}$$

Setting it to zero, we have

$$b_q = \frac{1}{2}(\hat{x}_q + \hat{x}_{q+1}), \tag{2.52}$$

which indicates that the decision boundary is simply the midpoint of the neighboring quantized values.
Solving (2.50) and (2.52) would give us the optimal set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ that minimizes $\sigma_q^2$. Unfortunately, to solve (2.50) for $\hat{x}_q$ we need $b_{q-1}$ and $b_q$, but to solve (2.52) for $b_q$ we need $\hat{x}_q$ and $\hat{x}_{q+1}$. The problem is a little difficult.

2.4.1.1 Uniform Quantizer as a Special Case


Let us consider a simple case where the probability distribution is uniform as given in (2.25). For such a distribution, (2.50) becomes

$$\hat{x}_q = \frac{b_{q-1} + b_q}{2}. \tag{2.53}$$


Incrementing $q$ in this equation, we have

$$\hat{x}_{q+1} = \frac{b_q + b_{q+1}}{2}. \tag{2.54}$$

Dropping (2.53) and (2.54) into (2.52), we have

$$4 b_q = b_{q-1} + b_q + b_q + b_{q+1}, \tag{2.55}$$

which leads us to

$$b_{q+1} - b_q = b_q - b_{q-1}. \tag{2.56}$$

Let us denote

$$b_q - b_{q-1} = \Delta; \tag{2.57}$$

plugging it into (2.56), we have

$$b_{q+1} - b_q = \Delta. \tag{2.58}$$

Therefore, we can conclude by induction on $q$ that all decision boundaries are uniformly spaced.
For the quantized values, let us subtract (2.53) from (2.54) to give

$$\hat{x}_{q+1} - \hat{x}_q = \frac{b_{q+1} - b_q + b_q - b_{q-1}}{2}. \tag{2.59}$$

Plugging in (2.57) and (2.58), we have

$$\hat{x}_{q+1} - \hat{x}_q = \Delta, \tag{2.60}$$

which indicates that the quantized values are also uniformly spaced. Therefore, the uniform quantizer is optimal for the uniform distribution.

2.4.1.2 Lloyd-Max Algorithm


The Lloyd-Max algorithm is an iterative procedure for solving (2.50) and (2.52) for an arbitrary distribution, so an optimal quantizer is also referred to as a Lloyd-Max quantizer. Note that its convergence is not proven, but only experimentally found.
Before presenting the algorithm, let us first note that we already know the first and last decision boundaries:

$$b_0 = X_{\min} \quad \text{and} \quad b_M = X_{\max}. \tag{2.61}$$

For unbounded inputs, we may set $X_{\min} = -\infty$ and/or $X_{\max} = \infty$. Also, we rearrange (2.52) into

$$\hat{x}_{q+1} = 2 b_q - \hat{x}_q. \tag{2.62}$$


The algorithm involves the following iterative steps:

1. Make a guess for $\hat{x}_1$.
2. Let $q = 1$.
3. Plug $\hat{x}_q$ and $b_{q-1}$ into (2.50) to solve for $b_q$. This may be done by integrating the two integrals in (2.50) forward from $b_{q-1}$ until the equation holds.
4. Plug $\hat{x}_q$ and $b_q$ into (2.62) to get a new $\hat{x}_{q+1}$.
5. Let $q = q + 1$.
6. Go back to step 3 unless $q = M$.
7. When $q = M$, calculate

$$\epsilon = \hat{x}_M - \frac{\int_{b_{M-1}}^{b_M} x\, p(x)\, dx}{\int_{b_{M-1}}^{b_M} p(x)\, dx}. \tag{2.63}$$

8. Stop if

$$|\epsilon| < \text{predetermined threshold}. \tag{2.64}$$

9. Decrease $\hat{x}_1$ if $\epsilon > 0$ and increase $\hat{x}_1$ otherwise.
10. Go back to step 2.
A little explanation is in order for (2.63). The iterative procedure provides us with an $\hat{x}_M$ upon entering step 7, which is used as the first term on the right of (2.63). On the other hand, since we know $b_M$ from (2.61), we can use it with the $b_{M-1}$ provided by the procedure to obtain another estimate of $\hat{x}_M$ using (2.50). This is given as the second term on the right side of (2.63). The two estimates of the same $\hat{x}_M$ should be equal if equations (2.50) and (2.52) are solved. Therefore, we stop the iteration at step 8 when the absolute value of their difference is smaller than some predetermined threshold.
The adjustment procedure for $\hat{x}_1$ at step 9 can also be easily explained. The iterative procedure is started with a guess for $\hat{x}_1$ at step 1. Based on this guess, a whole set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ are obtained from step 2 through step 8. If the guess is off, the whole set derived from it is off. In particular, if the guess is too large, the resulting $\hat{x}_M$ will be too large. This will cause $\epsilon > 0$, so $\hat{x}_1$ needs to be reduced; and vice versa.
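In practice, the centroid condition (2.50) and the midpoint condition (2.52) are often alternated in batch over a discretized PDF (the classic Lloyd iteration) instead of the boundary-sweeping procedure above. A sketch of that variant, assuming NumPy:

```python
import numpy as np

def lloyd_max(pdf, grid, M, iters=200):
    """Batch Lloyd iteration on a discretized PDF: alternate the
    midpoint condition (2.52) and the centroid condition (2.50)."""
    p = pdf / pdf.sum()                           # probability masses
    xq = np.linspace(grid.min(), grid.max(), M)   # initial quantized values
    for _ in range(iters):
        b = 0.5 * (xq[:-1] + xq[1:])              # (2.52): boundaries
        idx = np.searchsorted(b, grid)            # assign grid points to cells
        for q in range(M):
            mass = p[idx == q].sum()
            if mass > 0:                          # (2.50): centroid of cell
                xq[q] = (grid[idx == q] * p[idx == q]).sum() / mass
    return xq

# 8-level Lloyd-Max quantizer for a unit-variance Gaussian PDF.
grid = np.linspace(-5, 5, 4001)
pdf = np.exp(-0.5 * grid * grid)
print(lloyd_max(pdf, grid, 8))   # roughly +/-0.25, +/-0.76, +/-1.34, +/-2.15
```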

2.4.1.3 Performance Gain


Figure 2.9 shows the optimal SNR achieved by the Lloyd-Max algorithm for uniform, Gaussian, Laplacian, and Gamma distributions against the number of bits per sample [33]. Since the uniform quantizer is optimal for the uniform distribution, its optimal SNR curve in Fig. 2.9 is the same as in Fig. 2.6, and thus can serve as the reference. Notice that the optimal SNR curves for the other distributions are closer to this curve in Fig. 2.9 than in Fig. 2.6. This indicates that, for a given number of bits per sample, optimal nonuniform quantization achieves better SNR than optimal uniform quantization.

Fig. 2.9 Optimal SNR versus bits per sample achieved by Lloyd-Max algorithm for uniform, Gaussian, Laplacian, and Gamma distributions

Apparently, the optimal SNR curves in Fig. 2.9 also fit straight lines well, so they can be approximated by the same equation given in (2.43) with improved slope $b$ and intercept $a$. The improved performance of nonuniform quantization results in a better fit to a straight line than in Fig. 2.6.
Similar to uniform quantization in Fig. 2.6, both the slope $b$ and the intercept $a$ decrease as the peakedness or kurtosis of the PDF increases in the order of uniform, Gaussian, Laplacian, and Gamma, indicating that the overall performance of a Lloyd-Max quantizer is inversely related to PDF kurtosis. Compared with the uniform distribution, all other distributions have reduced slopes $b$, indicating that their performance relative to the uniform distribution becomes worse as the bit rate increases. However, the degradations of both $a$ and $b$ are less conspicuous than those in Fig. 2.6.
In order to compare the performance of the Lloyd-Max quantizer with that of the uniform quantizer, Fig. 2.10 shows the optimal SNR gain of the Lloyd-Max quantizer over the uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions:

$$\text{Optimal SNR Gain} = \text{SNR}_{\text{Nonuniform}} - \text{SNR}_{\text{Uniform}},$$

where $\text{SNR}_{\text{Nonuniform}}$ is taken from Fig. 2.9 and $\text{SNR}_{\text{Uniform}}$ from Fig. 2.6. Since the Lloyd-Max quantizer for the uniform distribution is a uniform quantizer, the optimal SNR gain is zero for the uniform distribution. It is obvious that the optimal SNR gain is more profound when the distribution is more peaked or is of larger kurtosis.


Fig. 2.10 Optimal SNR gain of Lloyd-Max quantizer over uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions

2.4.2 Companding

Finding the whole set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ for an optimal nonuniform quantizer using the Lloyd-Max algorithm usually involves a large number of iterations, hence may be computationally intensive, especially for a large $M$. The storage requirement for these decision boundaries and quantized values may also become excessive, especially for the decoder. Companding is an alternative.
Companding is motivated by the observation that a uniform quantizer is simple and effective for a matching uniformly distributed source signal. For a nonuniformly distributed source signal, one could use a nonlinear function $f(x)$ to convert it into another one with a PDF similar to a uniform distribution. Then the simple and effective uniform quantizer could be used. After the quantization indexes are transmitted to and subsequently received by the decoder, they are first inversely quantized to reconstruct the uniformly quantized values and then the inverse function $f^{-1}(x)$ is applied to produce the final quantized values. This process is illustrated in Fig. 2.11.
The nonlinear function in Fig. 2.11 is called a compressor because it usually has a shape similar to that shown in Fig. 2.12, which stretches the source signal when its sample value is small and compresses it otherwise. This shape of compression is to match the typical shape of PDFs, such as Gaussian and Laplacian, which have large probability density for small absolute sample values and tail off towards large absolute sample values, in order to make the converted signal have a PDF similar to a uniform distribution.


Fig. 2.11 The source sample value is first converted by the compressor into another one with a
PDF similar to a uniform distribution. It is then quantized by a uniform quantizer and the quantization index is transmitted to the decoder. After inverse quantization at the decoder, the uniformly
quantized value is converted by the expander to produce the final quantized value
Fig. 2.12 μ-law companding deployed in North American and Japanese telecommunication systems

The inverse function is called an expander because the inverse of compression is expansion. After the compression-expansion, hence companding, the effective decision boundaries, when viewed from the expander output, are nonuniform, so the overall effect is nonuniform quantization.
When companding is actually used in speech and audio applications, additional considerations are given to the perceptual properties of the human ear. Since the perception of loudness by the human ear may be considered logarithmic, logarithmic companding is widely used.
2.4.2.1 Speech Processing

In speech processing, μ-law companding, deployed in North American and Japanese telecommunication systems, has a compression function given by [33]


$$y = f(x) = \mathrm{sign}(x) \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}, \quad -1 \le x \le 1, \tag{2.65}$$

where $\mu = 256$ and $x$ is the normalized sample value to be companded and is limited to 13 magnitude bits. Its corresponding expanding function is

$$x = f^{-1}(y) = \mathrm{sign}(y) \frac{(1 + \mu)^{|y|} - 1}{\mu}, \quad -1 \le y \le 1. \tag{2.66}$$

Both functions are plotted in Fig. 2.12.
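A direct implementation of (2.65)-(2.66), assuming NumPy; note that G.711 specifies μ = 255, while the value quoted above is kept configurable here:

```python
import numpy as np

MU = 255.0   # G.711 value; pass mu=256 to match the text

def mu_compress(x, mu=MU):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)     # (2.65)

def mu_expand(y, mu=MU):
    return np.sign(y) * (np.power(1.0 + mu, np.abs(y)) - 1.0) / mu  # (2.66)

x = np.array([-0.9, -0.1, -0.01, 0.01, 0.1, 0.9])
print(mu_expand(mu_compress(x)))   # recovers x to within rounding error
```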


A similar companding, called A-law companding, is deployed in Europe, whose
compression function is
sign.x/
y D f .x/ D
1 C ln.A/

Ajxj;
0  jxj  A1 I
1 C ln.Ajxj/; A1 < jxj  1I

(2.67)

where A D 87:7 and the normalized sample value x is limited to 12 magnitude bits.
Its corresponding expanding function is
(
x D f 1 .y/ D sign.y/

1Cln.A/
jyj;
A

ejyj.1Cln.A//1
ACA ln.A/

0  jyj 
;

1
1Cln.A/

1
1Cln.A/ I

< jyj  1:

(2.68)

It is usually very difficult to implement both the logarithmic and exponential functions used in the companding schemes above, especially on embedded microprocessors with limited resources. Many such processors do not even have a floating-point unit. Therefore, the companding functions are usually implemented using piece-wise linear approximation. This is adequate due to the fairly low requirement for speech quality in telephone systems.

2.4.2.2 Audio Coding


Companding is not as widely used in audio coding as in speech processing, partly due to the higher quality requirement and wider dynamic range, which render implementation more difficult. However, MPEG 1&2 Layer III [55, 56] and MPEG 2&4 AAC [59, 60] use the following exponential compression function to quantize MDCT coefficients:

$$y = f(x) = \mathrm{sign}(x) |x|^{3/4}, \tag{2.69}$$

which may be considered as an approximation to the logarithmic function. The allowed compressed dynamic range is $-8191 \le y \le 8191$. The corresponding expanding function is obviously

$$x = f^{-1}(y) = \mathrm{sign}(y) |y|^{4/3}. \tag{2.70}$$


The implementation cost of the above exponential function is a notable issue in decoder development. Piece-wise linear approximation may lead to degradation in audio quality, hence may be unacceptable for high-fidelity applications. Another alternative is to store the exponential function as a quantization table. This amounts to about 24 KB ($2^{13}$ entries of 3 bytes each) if each entry in the table is stored using 24 bits.
The most widely used companding in audio coding is the companding of the quantization step sizes of uniform quantizers. Since quantization step sizes are needed in the inverse quantization process in the decoder, they need to be packed into the bit stream and transmitted to the decoder. Transmitting these step sizes with arbitrary resolution is out of the question, so it is necessary that they be quantized.
The perceived loudness of quantization noise is usually considered as logarithmically proportional to the quantization noise power, or linearly proportional to the quantization noise power in decibels. Due to (2.28), this means the perceived loudness is linearly proportional to the quantization step size in decibels. Therefore, almost all audio coding algorithms use logarithmic companding to quantize quantization step sizes:

$$\eta = f(\Delta) = \log_2(\Delta), \tag{2.71}$$

where $\Delta$ is the step size of a uniform quantizer. The corresponding expander is obviously

$$\Delta = f^{-1}(\eta) = 2^{\eta}. \tag{2.72}$$

Another motivation for logarithmic companding is to cope with the wide dynamic range of audio signals, which may amount to more than 24 bits per sample.

Chapter 3

Vector Quantization

The scalar quantization discussed in Chap. 2 quantizes the samples of a source signal one by one in sequence. It is simple because it deals with only one sample at a time, but it can only achieve so much in quantization efficiency. We now consider quantizing two or more samples as one block each time and call this approach vector quantization (VQ).

3.1 The VQ Advantage


Let us suppose that we need to quantize the following source sample sequence:
f1:2; 1:4; 1:7; 1:9; 2:1; 2:4; 2:6; 2:9g:

(3.1)

If we use the scalar midtread quantizer given in Table 2.2 and Fig. 2.2, we get the
following SQ indexes:
f1; 1; 2; 2; 2; 2; 3; 3g;
(3.2)
which is also the sequence for the quantized values since the quantization step size
is one. Since the range of the quantization indexes is 1; 3, we need 2 bits to convey
each index. This amounts to 8  2 D 16 bits for encoding the whole sequence.
If two samples are quantized as a block, or vector, each time using the VQ codebook given in Table 3.1, we end up with the following sequence of indexes:

{0, 1, 1, 2}.

When this sequence is used by the decoder to look up Table 3.1, we obtain exactly the same reconstructed sequence as in (3.2), so the total quantization error is the same. Now 2 bits are still needed to convey each index, but there are only four indexes, so we need 4 × 2 = 8 bits to convey the whole sequence. This is only half the number of bits needed by the SQ while the total quantization error is the same.
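The following Python snippet (mine, for illustration only) reproduces this example with the codebook of Table 3.1:

    codebook = [(1, 1), (2, 2), (3, 3)]                   # Table 3.1
    source = [1.2, 1.4, 1.7, 1.9, 2.1, 2.4, 2.6, 2.9]     # (3.1)
    vectors = [tuple(source[i:i + 2]) for i in range(0, len(source), 2)]

    def nearest(v):
        # index of the representative vector with the smallest squared distance
        return min(range(len(codebook)),
                   key=lambda q: sum((a - b) ** 2 for a, b in zip(v, codebook[q])))

    indexes = [nearest(v) for v in vectors]       # [0, 1, 1, 2]: 4 x 2 = 8 bits
    decoded = [codebook[q] for q in indexes]      # same values as the SQ output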
To explain why VQ can achieve much better performance than SQ, let us view
the sequence in (3.1) as a sequence of two-dimensional vectors:


Table 3.1 An example VQ codebook

Index    Representative vector
0        [1, 1]
1        [2, 2]
2        [3, 3]

Fig. 3.1 A source sequence, viewed as a sequence of two-dimensional vectors, is plotted as dots using the first and second elements of each vector as x and y coordinates, respectively. The solid straight lines represent the decision boundaries for SQ and the solid curved lines for VQ. The dashed lines represent the quantized values for SQ and the diamonds represent the quantized or representative vectors for VQ

{1.2, 1.4, 1.7, 1.9, 2.1, 2.4, 2.6, 2.9},    (3.3)

and plot them in Fig. 3.1 as dots using the first and second elements of each vector as x and y coordinates, respectively. For the first element of each vector, we use solid vertical lines to represent its decision boundaries and dashed vertical lines its quantized values. Consequently, its decision intervals are represented by vertical strips defined by two adjacent vertical lines. For the second element of each vector, we do the same with horizontal solid and dashed lines, respectively, so that its decision intervals are represented by horizontal strips defined by two adjacent horizontal lines.

When the first element is quantized using SQ, a vertical decision strip is activated to produce the quantized value. Similarly, a horizontal decision strip is activated when the second element is quantized using SQ. Since the first and second elements of each vector are quantized separately, the quantization of the whole sequence can be viewed as a process of alternating activation of vertical and horizontal decision strips.
However, if the results of the SQ described above for the two elements of each
vector are viewed jointly, the quantization decision is actually represented by the
squares where the vertical and horizontal decision strips cross. Each decision square
represents the decision intervals for both elements of a vector. The crossing point of
dashed lines in the middle of the decision square represents the quantized values for
both the elements of the vector. Therefore, the decision square and the associated
crossing point inside it are the real decision boundaries and quantized values for
each source vector.


It is now easy to realize that many of the decision squares are never used by SQ. Due to their existence, however, we still need a two-dimensional vector to identify and hence represent each of the decision squares. For the current example, each such vector needs 2 × 2 = 4 bits to be represented, corresponding to 2 bits per sample. So no bits are saved. What a waste those unused decision squares cause!

To avoid this waste, we need to consider the quantization of each source vector as a joint and simultaneous action and forgo the SQ decision squares. Along this line of thinking, we can arbitrarily place decision boundaries in the two-dimensional space. For example, noting that the data points are scattered almost along a straight line, we can redesign the decision boundaries as those depicted by the curved solid lines in Fig. 3.1. With this design, we designate the three crossing points represented by the diamonds in the figure as the quantized or representative vectors. They obviously represent the same quantized values as those obtained by SQ, thus leaving the quantization error unchanged. However, the two-dimensional decision boundaries carve out only three decision regions, or decision cells, with only three representative vectors that need to be indexed and transmitted to the decoder. Therefore, we need to transmit 2 bits per vector, or 1 bit per sample, to the decoder, amounting to a bit reduction of 2 to 1.
Of course, source sequences in the real world are not as simple as those in (3.1) and Fig. 3.1. To look at realistic sequences, Fig. 3.2 plots a correlated Gaussian sequence (mean = 0, variance = 1) in the same way as Fig. 3.1. Also plotted are the decision boundaries (solid lines) and quantized values (dashed lines) of the same uniform scalar quantizer used above. Apparently, the samples are highly concentrated along a straight line and a lot of SQ decision squares are wasted.

One may argue that a nonuniform SQ would do much better. Recall that it was concluded in Sect. 2.4 that the size of a decision interval of a nonuniform SQ should be inversely proportional to the probability density. The translation of this rule into two dimensions is that the size of the SQ decision squares should be inversely proportional to the probability density. So a nonuniform SQ can improve the coding performance by placing small SQ decision squares in densely populated areas and large ones in sparsely populated areas. However, there are still a lot of SQ decision regions placed in areas that are extremely sparsely populated by the source samples, causing a waste of bit resources.
Fig. 3.2 A correlated Gaussian sequence (mean = 0, variance = 1) is plotted as two-dimensional vectors over the decision boundaries (solid lines) and quantized values (dashed lines) of a midtread uniform SQ (step size = 1)

Fig. 3.3 An independent Gaussian sequence (mean = 0, variance = 1) is plotted as two-dimensional vectors over the decision boundaries (solid lines) and quantized values (dashed lines) of a midtread uniform SQ (step size = 1)

To avoid this waste, the restriction that decision regions have to be squares has to be removed. Once this restriction is removed, we arrive at the world of VQ, where decision regions can be of arbitrary shapes and can be arbitrarily allocated to match the probability distribution density of the source sequence, achieving better quantization performance.
Even with uncorrelated sequences, VQ can still achieve better performance than SQ. Figure 3.3 plots an independent Gaussian sequence (mean = 0, variance = 1) over the decision boundaries (solid lines) and quantized values (dashed lines) of the same midtread uniform SQ. Apparently, the samples are still concentrated, even though not as much as the correlated sequence in Fig. 3.2, so a VQ can still achieve better performance than an SQ. At the least, an SQ has to allocate decision squares to cover the four corners of the figure, but a VQ can use arbitrarily shaped decision regions to cover those areas without wasting bits.

3.2 Formulation
Let us consider an N-dimensional random vector

x = [x₀, x₁, ..., x_{N−1}]^T    (3.4)

with a joint PDF p(x) over a vector space or support Ω. Suppose this vector space is divided by a set of M regions {Ω_q}_{q=0}^{M−1} in a mutually exclusive and collectively exhaustive way:

Ω = ⋃_{q=0}^{M−1} Ω_q    (3.5)

and

Ω_p ∩ Ω_q = ∅  for all  p ≠ q,    (3.6)


where ∅ is the null set. These regions, referred to as decision regions, play a role similar to decision intervals in SQ. To each decision region, a representative vector is assigned:

r_q ↦ Ω_q,  for q = 0, 1, ..., M−1.    (3.7)

The source vector x is vector-quantized to VQ index q as follows:

q = Q(x)  if and only if  x ∈ Ω_q.    (3.8)

The corresponding reconstructed vector is the representative vector:

x̂ = r_q = Q^{−1}(q).    (3.9)

Plugging (3.8) into (3.9), we have

x̂(x) = r_{q(x)} = Q^{−1}(Q(x)).    (3.10)

The VQ quantization error is now a vector given below:

q(x) = x̂ − x.    (3.11)

Since this may be rewritten as

x̂ = x + q(x),    (3.12)

the additive quantization noise model in Fig. 2.3 is still valid.


The quantization noise is best measured by a distance, such as the L-2 norm or Euclidean distance defined below:

d(x, x̂) = (x − x̂)^T (x − x̂) = Σ_{k=0}^{N−1} (x_k − x̂_k)².    (3.13)

The average quantization error is then

Err = ∫_Ω d(x, x̂) p(x) dx    (3.14)
    = Σ_{q=0}^{M−1} ∫_{Ω_q} d(x, r_q) p(x) dx.    (3.15)

If the Euclidean distance is used, the average quantization error may again be called the MSQE.

The goal of VQ design is to find a set of decision regions {Ω_q}_{q=0}^{M−1} and representative vectors {r_q}_{q=0}^{M−1}, referred to as a VQ codebook, that minimizes this average quantization error.


3.3 Optimality Conditions


A necessary condition for an optimal solution to the VQ design problem stated above is that, for a given set of representative vectors {r_q}_{q=0}^{M−1}, the corresponding decision regions {Ω_q}_{q=0}^{M−1} should decompose the input space in such a way that each source vector x is always clustered to its nearest representative vector [19]:

Q(x) = r_q  if and only if  d(x, r_q) ≤ d(x, r_k) for all k ≠ q.    (3.16)

Due to this, the decision regions can be defined as

Ω_q = {x | d(x, r_q) ≤ d(x, r_k) for all k ≠ q}.    (3.17)

Such disjoint sets are referred to as Voronoi regions, which rids us of the trouble of literally defining and representing the boundary of each decision region. This also indicates that a whole VQ scheme can be fully described by the VQ codebook, or the set of representative vectors {r_q}_{q=0}^{M−1}.

Another condition for an optimal solution is that, given a decision region Ω_q, the best choice for the representative vector is the conditional mean of all vectors within the decision region [19]:

x_q = ∫_{x∈Ω_q} x p(x | x ∈ Ω_q) dx.    (3.18)

3.4 LBG Algorithm


Let us now consider the problem of VQ design, or finding the set of representative vectors and thus decomposing the input space into a set of Voronoi regions so that the average quantization error in (3.15) is minimized. An immediate obstacle that needs to be addressed is that, other than the multidimensional Gaussian, there is essentially no suitable theoretical joint PDF p(x) to work with. Instead, we can usually draw a large set of input vectors, {x_k}_{k=0}^{L−1}, from the source governed by an unknown PDF. Therefore, a feasible approach is to use such a set of vectors, referred to as the training set, to come up with an optimal VQ codebook.

In the absence of the joint PDF p(x), the average quantization error defined in (3.15) is no longer available. It may be replaced by the following total quantization error:

Err = Σ_{k=0}^{L−1} d(x_k, x̂(x_k)).    (3.19)

For the same reason, the best choice of the representative vector in (3.18) needs to be replaced by

x_q = (1/L_q) Σ_{x_k∈Ω_q} x_k,    (3.20)

where L_q is the number of training vectors in decision region Ω_q.


The following Linde–Buzo–Gray (LBG) algorithm [42], also referred to as the k-means algorithm, has been found to converge to a local minimum of (3.19):

1. Set n = 0.
2. Make a guess for the representative vectors {r_q}_{q=0}^{M−1}. By (3.17), this implicitly builds an initial set of Voronoi regions {Ω_q}_{q=0}^{M−1}.
3. Set n = n + 1.
4. Quantize each training vector using (3.16). Upon completion, the training set has been partitioned into Voronoi regions {Ω_q}_{q=0}^{M−1}.
5. Build a new set of representative vectors using (3.20). This implicitly builds a new set of Voronoi regions {Ω_q}_{q=0}^{M−1}.
6. Calculate the total quantization error Err(n) using (3.19).
7. Go back to step 3 if

   (Err(n−1) − Err(n)) / Err(n) > ε,    (3.21)

   where ε is a predetermined positive threshold.
8. Stop.
Steps 4 and 5 in the iterative procedure above can only cause the total quantization error to decrease, so the LBG algorithm converges to at least a local minimum of the total quantization error (3.19). This also explains the stopping condition in (3.21).

There is a chance that, upon the completion of step 4, a Voronoi region may be empty in the sense that it contains not a single training vector. Suppose this happens to region Ω_q; then step 5 is problematic because L_q = 0 in (3.20). This indicates that representative vector r_q is an outlier that is far away from the training set. A simple approach to fixing this problem is to replace it with a training vector from the most populated Voronoi region.
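As an illustration (my sketch, not the reference implementation from [42]), the whole procedure can be written compactly with NumPy; the random initialization, the threshold, and the reseeding rule for empty regions are arbitrary choices:

    import numpy as np

    def lbg(training, M, eps=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        codebook = training[rng.choice(len(training), size=M, replace=False)].copy()
        prev_err = np.inf
        while True:
            # step 4: nearest-neighbor clustering per (3.16)
            dist = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            err = dist.min(axis=1).sum()        # total quantization error (3.19)
            if prev_err - err <= eps * err:     # stopping condition (3.21)
                return codebook
            prev_err = err
            # step 5: new representatives are the region means per (3.20)
            for q in range(M):
                members = training[labels == q]
                if len(members):
                    codebook[q] = members.mean(axis=0)
                else:                           # empty region: reseed from the data
                    codebook[q] = training[rng.integers(len(training))]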
After the completion of the LBG algorithm, the resulting VQ codebook can be tested against a separate set of source data, referred to as the test set, drawn from the same source.

The LBG algorithm offers an approach to exploiting the real multidimensional PDF directly from the data, without a theoretical multidimensional distribution model. This is an advantage over SQ, which usually relies on a probability model. It also enables VQ to remove nonlinear dependencies in the data, a clear advantage over other technologies, such as transforms and linear prediction, which can only deal with linear dependencies.

3.5 Implementation
Once the VQ codebook is obtained, it can be used to quantize the source signal that
generates the training set. Similar to SQ, this also involves two stages as shown in
Fig. 3.4.


Fig. 3.4 VQ involves an encoding or forward VQ stage represented by VQ in the figure, which maps a source vector to a VQ index, and a decoding or inverse VQ stage represented by VQ^{−1}, which maps a VQ index to its representative vector

Fig. 3.5 The vector quantization of a source vector entails searching through the VQ codebook to find the representative vector that is closest to the source vector

According to the optimality condition (3.16), the vector quantization of a source vector x entails searching through all representative vectors in the VQ codebook to find the one that is closest to the source vector. This is shown in Fig. 3.5.

The decoding is very simple because it only entails using the received VQ index to look up the VQ codebook and retrieve the representative vector.

Since the size of a VQ codebook grows exponentially with the vector dimension, the amount of storage for the codebook may easily become a significant cost, especially for the decoder, which is usually more cost-sensitive. The amount of computation involved in searching through the VQ codebook for each source vector is also a major concern for the encoder. Therefore, VQ with a dimension of more than 20 is rarely deployed in practical applications.
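A minimal sketch (mine) of the two stages in Figs. 3.4 and 3.5, with a full-search encoder and a table-lookup decoder:

    import numpy as np

    def vq_encode(x, codebook):
        # full search: squared Euclidean distance (3.13) to every representative
        return int(((codebook - x) ** 2).sum(axis=1).argmin())

    def vq_decode(q, codebook):
        return codebook[q]       # the decoder is just a table lookup

The asymmetry is apparent: the encoder's cost grows with the codebook size, while the decoder's cost is a single indexed read, which is why VQ is attractive for cost-sensitive decoders as long as the codebook storage stays small.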

Part III

Data Model

Companding, discussed in Sect. 2.4.2, illustrates a basic framework for improving quantization performance. If the compressor is replaced by a general signal transformation built upon a data model, the expander by the corresponding inverse transformation, and the uniform quantizer by a general quantizer, companding can be expanded into a general scheme for quantization performance enhancement: data model plus quantization. The steps involved in such a scheme may be summarized as follows:

1. Transform the source signal into another one using a data model.
2. Quantize the transformed signal.
3. Transmit the quantization indexes to the decoder.
4. Inverse-quantize the received quantization indexes to reconstruct the transformed signal.
5. Inverse-transform the reconstructed signal to reconstruct the original signal.

A necessary condition for this scheme to work is that the transformation must be invertible, either exactly or approximately. Otherwise, the original signal cannot be reconstructed even if no quantization is applied.
The key to the success of such a scheme is that the transformed signal must be compact. Companding achieves this using a nonlinear function to arrive at a PDF that is similar to a uniform distribution. There are other methods that are much more powerful, most prominent among them linear prediction, linear transforms, and subband filter banks.

Linear prediction uses a linear combination of historic samples as a prediction for the current sample. As long as the samples are fairly correlated, the predicted value will be a good estimate of the current sample value, resulting in a small prediction error signal, which may be characterized by a smaller variance. Since the MSQE of an optimal quantizer is proportional to the variance of the source signal (see (2.48)), the reduced variance will result in a reduced MSQE.
A linear transform takes a block of input samples and generates a block of transform coefficients whose energy is compacted into a minority of coefficients. Bit resources can then be concentrated on those high-energy coefficients to arrive at a significantly reduced MSQE.

A filter bank may be considered an extension of the transform that uses samples from multiple blocks to achieve a higher level of energy compaction without changing the block size, thus delivering an even smaller MSQE.
The primary role of data modeling is to exploit the inner structure or correlation of the source signal. As discussed in Chap. 3, VQ can also achieve this. But a major difficulty with VQ is that its complexity grows exponentially with vector dimension, so a vector dimension of more than 20 is usually considered too complex to deploy. However, correlation in most signals usually extends much further than 20 samples. Audio signals, in particular, are well known for strong correlations spanning up to thousands of samples. Therefore, VQ is usually not deployed directly, but rather jointly with a data model.

Chapter 4

Linear Prediction

Let us consider the source signal x(n) shown at the top of Fig. 4.1. A simple approach to linear prediction is to just use the previous sample x(n−1) as the prediction for the current sample:

p(n) = x(n−1).    (4.1)

This prediction is, of course, not perfect, so there is a prediction error or residue

r(n) = x(n) − p(n) = x(n) − x(n−1),    (4.2)

which is shown at the bottom of Fig. 4.1. The dynamic range of the residue is obviously much smaller than that of the source signal. The variance of the residue is 2.0282, which is much smaller than 101.6028, the variance of the source signal. The histograms of the source signal and the residue, both shown in Fig. 4.2, clearly indicate that, if the residue, instead of the source signal itself, is quantized, the quantization error will be much smaller.

4.1 Linear Prediction Coding


More generally, for a source signal x(n), a linear predictor makes an estimate of its sample value at time instance n using a linear combination of its K previous samples:

p(n) = Σ_{k=1}^{K} a_k x(n−k),    (4.3)

where {a_k}_{k=1}^{K} are the prediction coefficients and K is the predictor order. The transfer function of the prediction filter is

A(z) = Σ_{k=1}^{K} a_k z^{−k}.    (4.4)


Fig. 4.1 A source signal (top) and its prediction residue (bottom)


Input Signal

Prediciton Residue
2200

400

2000
1800

350

1600

300

1400
250
1200
200

1000

150

800
600

100
400
50
0

200

20

20

20

Fig. 4.2 Histograms for the source signal and its prediction residue

20


The prediction cannot be perfect, so there is always a prediction error

r(n) = x(n) − p(n),    (4.5)

which is also referred to as the prediction residue. With proper design of the prediction coefficients, this prediction error can be significantly reduced. To assess the performance of this prediction error reduction, the prediction gain is defined:

PG = σ_x² / σ_r²,    (4.6)

where σ_x² and σ_r² denote the variances of the source and residue signals, respectively. For example, the prediction gain for the simple predictor in the last section is

PG = 10 log₁₀ (101.6028 / 2.0282) ≈ 17 dB.

Since the prediction residue is likely to have a much smaller variance than the source signal, the MSQE could be significantly reduced if the residue signal is quantized in place of the source signal. This amounts to linear prediction coding (LPC).

4.2 Open-Loop DPCM


There are many methods for implementing LPC, which differ mostly in how quantization noise is handled in the encoder. Open-loop DPCM, discussed in this section, is one of the simplest, but it suffers from quantization noise accumulation.

4.2.1 Encoder and Decoder


An encoder for implementing LPC is shown in Fig. 4.3 [66], where the quantizer is modeled by an additive noise source (see (2.8) and Fig. 2.3). Since the quantizer is placed outside the prediction loop, this scheme is often called open-loop DPCM [17]. The name DPCM will be explained in Sect. 4.3. The overall transfer function of the LPC encoder before the quantizer is

H(z) = 1 − Σ_{k=1}^{K} a_k z^{−k} = 1 − A(z),    (4.7)

where A(z) is defined in (4.4).


After quantization, the quantization indexes for the residue signal are transmitted to the decoder and are used to reconstruct the quantized residue r̂(n) through inverse

Fig. 4.3 Encoder for open-loop DPCM. The quantizer is attached to the prediction residue and is
modeled by an additive noise source

Fig. 4.4 Decoder for open-loop DPCM

quantization. This process may be viewed as if the quantized residue r̂(n) were received directly by the decoder. This is the convention adopted in Figs. 4.3 and 4.4.

Since the original source signal x(n) is not available at the decoder, the prediction scheme in (4.3) cannot be used by the decoder. Instead, the prediction at the decoder has to use the past reconstructed sample values:

p̂(n) = Σ_{k=1}^{K} a_k x̂(n−k).    (4.8)

According to (4.5), the reconstructed sample itself is obtained by

x̂(n) = p̂(n) + r̂(n) = Σ_{k=1}^{K} a_k x̂(n−k) + r̂(n).    (4.9)

This leads us to the decoder shown in Fig. 4.4, where the overall transfer function of the LPC decoder, or the LPC reconstruction filter, is

D(z) = 1 / (1 − Σ_{k=1}^{K} a_k z^{−k}) = 1 / (1 − A(z)).    (4.10)

If there were no quantizer, the overall transfer function of the encoder (4.7) and
decoder (4.10) is obviously one, so the reconstructed signal at the decoder output
is the same as the encoder input. When the quantizer is deployed, the prediction


residue used by the decoder is different from that used by the encoder, and the difference is the additive quantization noise. This causes a reconstruction error at the decoder output,

e(n) = x̂(n) − x(n),    (4.11)

which is the quantization error of the overall open-loop DPCM.

4.2.2 Quantization Noise Accumulation


As illustrated above, the difference between the prediction residues used by the decoder and the encoder is the additive quantization noise, so this quantization noise is reflected in each reconstructed sample value. Since these sample values are convolved with the prediction filter, it is expected that the quantization noise is accumulated at the decoder output.

To illustrate this quantization noise accumulation, let us first use the additive noise model of (2.8) to write the quantized residue as

r̂(n) = r(n) + q(n),    (4.12)

where q(n) is again the quantization error or noise. The reconstructed sample at the decoder output (4.9) can then be rewritten as

x̂(n) = p̂(n) + r(n) + q(n).    (4.13)

Dropping in the definition of the residue (4.5), we have

x̂(n) = p̂(n) + x(n) − p(n) + q(n).    (4.14)

Therefore, the reconstruction error at the decoder output (4.11) is

e(n) = x̂(n) − x(n) = p̂(n) − p(n) + q(n).    (4.15)

Plugging in (4.3) and (4.8), this may be further expressed as

e(n) = Σ_{k=1}^{K} a_k [x̂(n−k) − x(n−k)] + q(n)
     = Σ_{k=1}^{K} a_k e(n−k) + q(n),    (4.16)

which indicates that the reconstruction error is equal to the current quantization error plus a weighted sum of reconstruction errors in the past.


Moving this relationship backward one step to n−1, the reconstruction error at n−1 is equal to the quantization error at n−1 plus the weighted sum of reconstruction errors before n−1. Repeating this procedure all the way back to the start of the prediction, we conclude that the quantization errors from all prediction steps in the past are accumulated to arrive at the reconstruction error at the current step.
To illustrate this accumulation of quantization noise more clearly, let us consider the simple difference predictor used in Sect. 4.1:

a₁ = 1 and K = 1.    (4.17)

Plugging this into (4.16), we have

e(n) = e(n−1) + q(n).    (4.18)

Let us suppose that x̂(0) = x(0); then the above equation iteratively produces the following reconstruction errors:

e(1) = e(0) + q(1) = q(1)
e(2) = e(1) + q(2) = q(1) + q(2)
e(3) = e(2) + q(3) = q(1) + q(2) + q(3)
...    (4.19)

For the nth step, we have

e(n) = Σ_{k=1}^{n} q(k),    (4.20)

which is the summation of the quantization noise from all previous steps, starting from the beginning of prediction.

To obtain a closed-form representation of this quantization noise accumulation, let E(z) denote the Z-transform of the reconstruction error e(n) and Q(z) the Z-transform of the quantization noise q(n); then the reconstruction error in (4.16) may be expressed as

E(z) = Q(z) / (1 − A(z)),    (4.21)

which indicates that the reconstruction error at the decoder output is the quantization error filtered by the all-pole LPC reconstruction filter (see (4.10)).
An all-pole IIR filter may be unstable and may produce large instances of reconstruction error that are perceptually annoying, so open-loop DPCM is avoided in many applications. But it is sometimes purposely deployed in other applications to exploit the shaping of the quantization noise spectrum by the all-pole reconstruction filter; see Sects. 4.5 and 11.4 for details.
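The accumulation is easy to demonstrate numerically; the following sketch (mine, with an arbitrary step size) quantizes the open-loop residue of the difference predictor (4.17) and reconstructs through the all-pole filter, so the error e(n) follows (4.20):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.cumsum(rng.standard_normal(1000))   # test signal
    r = np.diff(x, prepend=0.0)                # open-loop residue r(n) = x(n) - x(n-1)
    step = 0.5
    r_hat = step * np.round(r / step)          # uniform quantization outside the loop
    x_hat = np.cumsum(r_hat)                   # decoder: 1/(1 - z^-1), see (4.21)
    e = x_hat - x                              # running sum of q(n); grows with n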


4.3 DPCM

The problem of quantization error accumulation can be avoided by forcing the encoder to use the same predictor as the decoder, i.e., the predictor given in (4.8). This entails moving the quantizer in Fig. 4.3 inside the encoding loop, leading to the encoder scheme shown in Fig. 4.5. This LPC scheme is often referred to as differential pulse code modulation (DPCM) [9]. Note that a uniform quantizer is usually deployed.

4.3.1 Quantization Error


To illustrate that quantization noise accumulation is no longer an issue with DPCM, let us first note that the prediction residue is now given by (see Fig. 4.5)

r(n) = x(n) − p̂(n).    (4.22)

Plugging this into the additive noise model (4.12) for quantization, we have

r̂(n) = x(n) − p̂(n) + q(n),    (4.23)

which is the quantized residue at the input to the decoder. Dropping the above equation into (4.9), we obtain the reconstructed value at the decoder:

x̂(n) = p̂(n) + x(n) − p̂(n) + q(n) = x(n) + q(n).    (4.24)

Rearranging this equation, we have

e(n) = x̂(n) − x(n) = q(n),    (4.25)

or equivalently

E(z) = Q(z).    (4.26)

Fig. 4.5 Differential pulse code modulation (DPCM). A full decoder is embedded as part of the encoder. The quantizer is modeled by an additive noise source


Therefore, the reconstruction error at the decoder output is exactly the same as the quantization error of the residue in the encoder; there is no quantization error accumulation.
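A minimal sketch (mine) of a first-order DPCM codec; because the encoder embeds the decoder, the reconstruction error stays within one quantization step, as (4.25) predicts:

    import numpy as np

    def dpcm_encode(x, a1=1.0, step=0.5):
        x = np.asarray(x, dtype=float)
        r_hat, x_prev = np.empty_like(x), 0.0
        for n in range(len(x)):
            p = a1 * x_prev                               # prediction from reconstruction (4.8)
            r_hat[n] = step * np.round((x[n] - p) / step) # quantized residue (4.22)
            x_prev = p + r_hat[n]                         # embedded decoder state
        return r_hat

    def dpcm_decode(r_hat, a1=1.0):
        x_hat, x_prev = np.empty_like(r_hat), 0.0
        for n in range(len(r_hat)):
            x_hat[n] = a1 * x_prev + r_hat[n]             # reconstruction (4.9)
            x_prev = x_hat[n]
        return x_hat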

4.3.2 Coding Gain


As stated for (4.11), the reconstruction error at the decoder output is considered as
2
, is the MSQE
the quantization noise of LPC, so its variance, denoted as q.DPCM/
2
, the
for DPCM. For a given bit rate, this MSQE should be smaller than q.PCM/
MSQE for directly quantizing the source signal, to justify the increased complexity
of linear prediction. This improvement may be assessed by coding gain
GDPCM D

2
q.PCM/
2
q.DPCM/

(4.27)

which is usually evaluated in the context of scalar quantization.


Due to (4.25), σ_q(DPCM)² is the same as the MSQE of the prediction residue, σ_q(Residue)²:

σ_q(DPCM)² = σ_q(Residue)².    (4.28)

Since σ_q(Residue)² is related to the variance of the residue σ_r² by (2.48) for uniform and nonuniform scalar quantization, we have

σ_q(DPCM)² = 10^{−0.1(a+bR)} σ_r²    (4.29)

for a given bit rate of R. Note that the slope b and intercept a in the above equation are determined by the particular quantizer and the PDF of the residue.

On the other hand, if the source signal is quantized directly with the same bit rate R, in what constitutes PCM, (2.48) gives the following MSQE:

σ_q(PCM)² = 10^{−0.1(a+bR)} σ_x²,    (4.30)

where σ_x² is the variance of the source signal. Here it is assumed that both the source signal and the DPCM residue share the same set of parameters a and b. This assumption may not be valid in many practical applications.
Consequently, the coding gain for DPCM is

G_DPCM = σ_q(PCM)² / σ_q(DPCM)² = σ_x² / σ_r² = PG,    (4.31)


which is the same as the prediction gain defined in (4.6). This indicates that the quantization performance of the DPCM system is dependent on the prediction gain, or how well the predictor predicts the source signal. If the predictor in the DPCM is properly designed so that

σ_r² ≪ σ_x²,    (4.32)

then G_DPCM ≫ 1, i.e., a significant reduction in quantization error can be achieved.

4.4 Optimal Prediction


Now that it is established that the quantization performance of a DPCM system
depends on the performance of linear prediction, the next task is to find an optimal
predictor that maximizes the prediction gain.

4.4.1 Optimal Predictor


For a given source signal x(n), the maximization of the prediction gain is equivalent to minimizing the variance of the prediction residue (4.22), which, using (4.8), can be expressed as

r(n) = x(n) − p̂(n) = x(n) − Σ_{k=1}^{K} a_k x̂(n−k).    (4.33)

Therefore, the design problem is to find the set of prediction coefficients {a_k}_{k=1}^{K} that minimizes σ_r²:

min_{a_k} σ_r² = E[(x(n) − Σ_{k=1}^{K} a_k x̂(n−k))²],    (4.34)

where E(·) is the expectation operator defined below:

E(y) = ∫ y p(y) dy.    (4.35)

Since x̂(n) in (4.34) is the reconstructed signal and is related to the source signal by the additive quantization noise (see (4.25)), the minimization problem in (4.34) involves the optimal selection of the prediction coefficients as well as the minimization of quantization error. As discussed in Chaps. 2 and 3, independent minimization of quantization error, or quantizer design itself, is frequently very difficult, so it is highly desirable that the problem be simplified by taking quantization out of the picture. This essentially implies that the DPCM scheme is given up and the open-loop DPCM encoder in Fig. 4.3 is used when it comes to predictor design.


One way to arrive at this simplification is the assumption of fine quantization: the quantization step size is so small that the resulting quantization error is negligible,

x̂(n) ≈ x(n).    (4.36)

This enables the replacement of x̂(n) by x(n) in (4.34).

Due to the arguments above, the prediction residue considered for optimization purposes becomes

r(n) = x(n) − p(n) = x(n) − Σ_{k=1}^{K} a_k x(n−k),    (4.37)

and the predictor design problem (4.34) becomes

min_{a_k} σ_r² = E[(x(n) − Σ_{k=1}^{K} a_k x(n−k))²].    (4.38)

To minimize this error function, we set the derivative of σ_r² with respect to each prediction coefficient a_j to zero:

∂σ_r²/∂a_j = −2 E[(x(n) − Σ_{k=1}^{K} a_k x(n−k)) x(n−j)] = 0,  j = 1, 2, ..., K.    (4.39)

Due to (4.3), the above equation may be written as

E[(x(n) − p(n)) x(n−j)] = 0,  j = 1, 2, ..., K,    (4.40)

and, using (4.5), further as

E[r(n) x(n−j)] = 0,  j = 1, 2, ..., K.    (4.41)

This means that the minimal prediction error or residue must be orthogonal to all data used in the prediction. This is called the orthogonality principle.

By moving the expectation inside the summation, (4.39) may be rewritten as

Σ_{k=1}^{K} a_k E[x(n−k) x(n−j)] = E[x(n) x(n−j)],  j = 1, 2, ..., K.    (4.42)

Now we are ready to make the second assumption: the source signal x(n) is a wide-sense stationary process, so that its autocorrelation function can be defined as

R(k) = E[x(n) x(n−k)]    (4.43)


and has the following property:

R(k) = R(−k).    (4.44)

Consequently, (4.42) can be written as

Σ_{k=1}^{K} a_k R(k−j) = R(j),  j = 1, 2, ..., K.    (4.45)

It can be further written in the following matrix form:

R a = r,    (4.46)

where

a = [a₁, a₂, a₃, ..., a_K]^T,    (4.47)

r = [R(1), R(2), R(3), ..., R(K)]^T,    (4.48)

and

R = [ R(0)     R(1)     R(2)     ...  R(K−1)
      R(1)     R(0)     R(1)     ...  R(K−2)
      R(2)     R(1)     R(0)     ...  R(K−3)
      ...      ...      ...      ...  ...
      R(K−1)   R(K−2)   R(K−3)   ...  R(0)   ].    (4.49)

The equations above are known as the normal equations, Yule–Walker prediction equations, or Wiener–Hopf equations [63].

The matrix R and vector r are all built from the autocorrelation values {R(k)}_{k=0}^{K}. The matrix R is a Toeplitz matrix in that it is symmetric and all elements along a diagonal are equal. Such matrices are known to be positive definite and therefore nonsingular, yielding a unique solution for the linear prediction coefficients:

a = R^{−1} r.    (4.50)
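For illustration (my sketch, not the book's code), the normal equations can be solved directly from autocorrelation estimates; the biased estimator below is one common choice:

    import numpy as np

    def lpc_coefficients(x, K):
        N = len(x)
        R = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(K + 1)])
        Rmat = np.array([[R[abs(i - j)] for j in range(K)]
                         for i in range(K)])      # the Toeplitz matrix (4.49)
        return np.linalg.solve(Rmat, R[1:])       # a = R^{-1} r, (4.50)

In practice the direct solve costs O(K^3) operations, which motivates the O(K^2) recursion of the next section.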

4.4.2 Levinson–Durbin Algorithm

The Levinson–Durbin recursion is a procedure in linear algebra that recursively calculates the solution to an equation involving a Toeplitz matrix [12, 41], thus avoiding an explicit inversion of the matrix R.

The algorithm iterates on the prediction order, so the order of the prediction filter is denoted for each filter coefficient using superscripts:

a_k^n,    (4.51)


where a_k^n is the kth coefficient of the nth iteration. To get the prediction filter of order K, we need to iterate through the following sets of filter coefficients:

n = 1:  {a_1^1}
n = 2:  {a_1^2, a_2^2}
...
n = K:  {a_1^K, a_2^K, ..., a_K^K}

The iteration above is possible because both the matrix R and the vector r are built from the autocorrelation values {R(k)}_{k=0}^{K}. The algorithm proceeds as follows:
1. Set n = 0.
2. Set E^0 = R(0).
3. Set n = n + 1.
4. Calculate

   κ_n = (R(n) − Σ_{k=1}^{n−1} a_k^{n−1} R(n−k)) / E^{n−1}.    (4.52)

5. Calculate

   a_n^n = κ_n.    (4.53)

6. Calculate

   a_k^n = a_k^{n−1} − κ_n a_{n−k}^{n−1},  for k = 1, 2, ..., n−1.    (4.54)

7. Calculate

   E^n = (1 − κ_n²) E^{n−1}.    (4.55)

8. Go to step 3 if n < K.
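A direct transcription of these steps into Python (my sketch; κ is named kappa here):

    import numpy as np

    def levinson_durbin(R, K):
        # R holds the autocorrelation values R(0), ..., R(K)
        a = np.zeros(K + 1)              # a[1..n] hold the nth-order coefficients
        E = R[0]                         # step 2
        for n in range(1, K + 1):        # steps 3-8
            kappa = (R[n] - np.dot(a[1:n], R[n - 1:0:-1])) / E    # (4.52)
            a_prev = a.copy()
            a[n] = kappa                                          # (4.53)
            a[1:n] = a_prev[1:n] - kappa * a_prev[n - 1:0:-1]     # (4.54)
            E = (1.0 - kappa ** 2) * E                            # (4.55)
        return a[1:], E                  # coefficients and final residue power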
For example, to get the prediction filter of order K = 2, two iterations are needed, as follows. For n = 1, we have

E^0 = R(0)    (4.56)

κ₁ = R(1)/E^0 = R(1)/R(0)    (4.57)

a_1^1 = κ₁ = R(1)/R(0)    (4.58)

E^1 = (1 − κ₁²) E^0 = [1 − (R(1)/R(0))²] R(0) = (R²(0) − R²(1)) / R(0).    (4.59)


For n = 2, we have

κ₂ = (R(2) − a_1^1 R(1)) / E^1 = (R(2)R(0) − R²(1)) / (R²(0) − R²(1))    (4.60)

a_2^2 = κ₂ = (R(2)R(0) − R²(1)) / (R²(0) − R²(1))    (4.61)

a_1^2 = a_1^1 (1 − κ₂) = R(1) (R(0) − R(2)) / (R²(0) − R²(1)).    (4.62)

Therefore, the final prediction coefficients for K = 2 are

a₁ = a_1^2  and  a₂ = a_2^2.    (4.63)

4.4.3 Whitening Filter


From the definition of the prediction residue (4.5), we can establish the following relationship:

x(n−j) = r(n−j) + p(n−j).    (4.64)

Dropping it into the orthogonality principle (4.41), we have

E[r(n) r(n−j)] + E[r(n) p(n−j)] = 0,  j = 1, 2, ..., K,    (4.65)

so the autocorrelation function of the prediction residue is

R_r(j) = −E[r(n) p(n−j)],  j = 1, 2, ..., K.    (4.66)

Using (4.3), the right-hand side of the above equation may be further expanded into

R_r(j) = −Σ_{k=1}^{K} a_k E[r(n) x(n−j−k)],  j = 1, 2, ..., K.    (4.67)

4.4.3.1 Infinite Prediction Order

If the predictor order is infinite (K = ∞), the orthogonality principle (4.41) ensures that the right-hand side of (4.67) is zero, so the autocorrelation function of the prediction residue becomes

R_r(j) = { σ_r²,  j = 0;
           0,     otherwise.    (4.68)


It indicates that the prediction residue sequence is a white noise process. Note that
this conclusion is valid only when the predictor has an infinite number of prediction
coefficients.
4.4.3.2 Markov Process

For predictors with a finite number of coefficients, the above condition is generally not true, unless the source signal is a Markov process with an order N ≤ K. Also called an autoregressive process and denoted as AR(N), such a process x(n) is generated by passing a white-noise process w(n) through an Nth-order all-pole filter:

X(z) = W(z) / (1 − Σ_{k=1}^{N} b_k z^{−k}) = W(z) / (1 − B(z)),    (4.69)

where

B(z) = Σ_{k=1}^{N} b_k z^{−k},    (4.70)

and X(z) and W(z) are the Z-transforms of x(n) and w(n), respectively. The corresponding difference equation is

x(n) = w(n) + Σ_{k=1}^{N} b_k x(n−k).    (4.71)

The autocorrelation function of the AR process is

R_x(j) = E[x(n) x(n−j)]
       = E[w(n) x(n−j)] + Σ_{k=1}^{N} b_k E[x(n−k) x(n−j)]
       = E[w(n) x(n−j)] + Σ_{k=1}^{N} b_k R(j−k).    (4.72)

Since w(n) is white,

E[w(n) x(n−j)] = { σ_w²,  j = 0;
                   0,     j > 0.    (4.73)

Consequently,

R_x(j) = { σ_w² + Σ_{k=1}^{N} b_k R(k),  j = 0;
           Σ_{k=1}^{N} b_k R(j−k),      j > 0.    (4.74)


A comparison with the Wiener–Hopf equations (4.45) leads to the following set of optimal prediction coefficients:

a_k = { b_k,  0 < k ≤ N;
        0,    N < k ≤ K;    (4.75)

which essentially sets

A(z) = B(z).    (4.76)

The above result makes intuitive sense because it sets the LPC encoder filter (4.7) to be the inverse of the filter (4.69) that generates the AR process. In particular, the Z-transform of the prediction residue is given by

R(z) = [1 − A(z)] X(z)    (4.77)

according to (4.7). Dropping in (4.69) and using (4.76), we obtain

R(z) = [1 − A(z)] W(z) / (1 − B(z)) = W(z),    (4.78)

which is the unpredictable white noise that drives the AR process.

An important implication of the above equation is that the prediction residue process is, once again, white for an AR process whose order is not larger than the predictor order.

4.4.3.3 Other Cases

When a predictor with a finite order is applied to a general stochastic process, the prediction residue process is generally not white, but may be considered approximately white in practical applications.

As an example, let us consider the signal at the top of Fig. 4.1, which is not an AR process. Applying the Levinson–Durbin procedure in Sect. 4.4.2, we obtain the optimal first-order filter as

a₁ = R(1)/R(0) ≈ 0.99.    (4.79)

The spectrum of the prediction residue using this optimal predictor is shown at the top of Fig. 4.6. It is obviously flat, so the residue may be considered white.

Therefore, the LPC encoder filter (4.7) that produces the prediction residue signal is sometimes called a whitening filter. Note that it is the inverse of the decoder, or the LPC reconstruction filter (4.10):

H(z) = 1/D(z).    (4.80)

Fig. 4.6 Power spectra (magnitude in dB versus frequency in Hz) of the prediction residue (top) and of the source signal together with the spectral envelope estimated by the LPC reconstruction filter (bottom)

4.4.4 Spectrum Estimator


From (4.7), the Z-transform of the source signal may be expressed as

X(z) = R(z)/H(z) = R(z) / (1 − Σ_{k=1}^{K} a_k z^{−k}),    (4.81)

which implies that the power spectrum of the source signal is

S_xx(e^{jω}) = S_rr(e^{jω}) / |1 − Σ_{k=1}^{K} a_k e^{−jkω}|²,    (4.82)

where S_xx(e^{jω}) and S_rr(e^{jω}) are the power spectra of x(n) and r(n), respectively. Since r(n) is white or nearly white, the equation above becomes

S_xx(e^{jω}) = σ_r² / |1 − Σ_{k=1}^{K} a_k e^{−jkω}|².    (4.83)


Therefore, the decoder, or the LPC reconstruction filter (4.10), provides an estimate of the spectrum of the source signal, and linear prediction is sometimes considered a temporal-frequency analysis tool.

Note that this is an all-pole model for the source signal, so it can model spectral peaks well, but it is incapable of modeling zeros (deep valleys) that may exist in the source signal. For this reason, the linear prediction spectrum is sometimes referred to as a spectral envelope. Furthermore, if the source signal cannot be modeled by poles, linear prediction may fail completely.

The spectrum of the source signal and the estimate by the LPC reconstruction filter (4.83) are shown at the bottom of Fig. 4.6. It can be observed that the spectrum estimated by the LPC reconstruction filter matches the signal spectrum envelope very well.
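For illustration (my sketch), (4.83) can be evaluated on a frequency grid to plot such an envelope:

    import numpy as np

    def lpc_envelope_db(a, sigma2_r, n_freq=512):
        w = np.linspace(0.0, np.pi, n_freq)
        k = np.arange(1, len(a) + 1)
        A = 1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(a)  # 1 - sum a_k e^{-jkw}
        return 10 * np.log10(sigma2_r / np.abs(A) ** 2)         # (4.83) in dB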

4.5 Noise Shaping


As discussed in Sect. 4.4, the optimal LPC encoder produces a prediction residue sequence that is white or nearly white. When this white residue is quantized, the quantization noise is often white as well, especially when fine quantization is used. This is a mismatch to the sensitivity curve of the human ear, which is well known for its substantial variation with frequency. It is, therefore, desirable to shape the spectrum of the quantization noise at the decoder output to match the sensitivity curve of the human ear so that more perceptual irrelevancy can be removed. Linear prediction offers a flexible and simple mechanism for achieving this.

4.5.1 DPCM

Let us revisit the DPCM system in Fig. 4.5. Its output given by (4.23) may be expanded using (4.8) to become

r̂(n) = x(n) − Σ_{k=1}^{K} a_k x̂(n−k) + q(n).    (4.84)

Due to (4.25), the above equation may be written as

r̂(n) = x(n) − Σ_{k=1}^{K} a_k [x(n−k) + q(n−k)] + q(n)
     = x(n) − Σ_{k=1}^{K} a_k x(n−k) + q(n) − Σ_{k=1}^{K} a_k q(n−k).    (4.85)


Fig. 4.7 Different schemes of linear prediction for shaping quantization noise at the decoder output: DPCM (top), open-loop DPCM (middle) and general noise feedback coding (bottom)

The Z-transform of the above equation is

R̂(z) = [1 − A(z)] X(z) + [1 − A(z)] Q(z),    (4.86)

where R̂(z) is the Z-transform of r̂(n).

The above equation indicates that the DPCM system in Figs. 4.5 and 4.4 may be implemented using the structure at the top of Fig. 4.7 [5]. Both the source signal and the quantization noise are shaped by the same LPC encoder filter, and the shaping is subsequently reversed by the reconstruction filter in the decoder, so the overall transfer functions for the source signal and the quantization noise are the same and equal


to one. Therefore, the spectrum of the quantization noise at the decoder output is the same as that produced by the quantizer (see (4.26)). In other words, the spectrum of the quantization noise as produced by the quantizer is faithfully duplicated at the decoder output; there is no shaping of the quantization noise.

Many quantizers, including the uniform quantizer, produce a white quantization noise spectrum when the quantization step size is small (fine quantization). This white spectrum is faithfully duplicated at the decoder output by DPCM.

4.5.2 Open-Loop DPCM

Let us now consider the open-loop DPCM in Figs. 4.3 and 4.4, which is redrawn in the middle of Fig. 4.7. Compared with the DPCM scheme at the top, the processing for the source signal is unchanged; the overall transfer function is still one. However, there is no processing for the quantization noise in the encoder; it is shaped only by the LPC reconstruction filter, so the quantization noise at the decoder output is given by (4.21).

As shown in Sect. 4.4.4, the optimal LPC reconstruction filter traces the spectral envelope of the source signal, so the quantization noise is shaped toward that envelope. If the quantizer produces white quantization noise, the quantization noise at the decoder output is shaped to the spectral envelope of the source signal.

4.5.3 Noise-Feedback Coding

While the LPC filter coefficients are determined by the necessity of maximizing the prediction gain, the DPCM and open-loop DPCM schemes imply that the filter processing the quantization noise in the encoder can be altered to shape the quantization noise without any impact on the perfect reconstruction of the source signal. Apparently, the transfer function of such a filter, called the error-feedback function or noise-feedback function, does not have to be either 1 − A(z) or 1; a different noise-feedback function can be used to shape the quantization noise spectrum into other desirable shapes. This gives rise to the noise-feedback coding shown at the bottom of Fig. 4.7, where the noise-feedback function is denoted as

1 − F(z).    (4.87)

Figure 4.7 implies an important advantage of noise-feedback coding: the shaping of quantization noise is accomplished completely in the encoder; there is zero implementation impact or cost at the decoder.
When designing the noise-feedback function, it is important to realize that the noise-feedback function is only half of the overall noise-shaping filter

S(z) = (1 − F(z)) / (1 − A(z));    (4.88)

the other half is the LPC reconstruction filter, which is determined by solving the normal equations (4.46). The Z-transform of the quantization error at the decoder output is given by

E(z) = S(z) Q(z),    (4.89)

so the noise spectrum is

S_ee(e^{jω}) = |S(e^{jω})|² S_qq(e^{jω}).    (4.90)

In audio coding, this spectrum is supposed to be shaped to match the masked threshold for a given source signal. In principle, this matching can be achieved for all masked threshold shapes provided by a perceptual model [38].
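For illustration (my sketch), the magnitude of the overall noise-shaping filter (4.88) can be evaluated on the unit circle; with white quantization noise, the shaped spectrum (4.90) is then just this curve plus a constant:

    import numpy as np

    def shaping_db(a, f, n_freq=512):
        w = np.linspace(0.0, np.pi, n_freq)
        def one_minus(c):                        # 1 - sum_k c_k e^{-jkw}
            k = np.arange(1, len(c) + 1)
            return 1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(c)
        return 20 * np.log10(np.abs(one_minus(f) / one_minus(a)))  # |S| in dB

Setting f equal to a reproduces the flat DPCM case, and setting f to all zeros reproduces the open-loop case shaped by 1/(1 − A(z)).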

Chapter 5

Transform Coding

Transform coding (TC) is a method that transforms a source signal into another one with a more compact representation. The goal is to quantize the transformed signal in such a way that the quantization error in the reconstructed signal is smaller than that from directly quantizing the source signal.

5.1 Transform Coder


Transform coding is block-based, so the source signal x(n) is first grouped into blocks, each of which consists of M samples, and is represented by a vector:

x(n) = [x₀(n), x₁(n), ..., x_{M−1}(n)]^T
     = [x(nM), x(nM−1), ..., x(nM−M+1)]^T.    (5.1)

The dimension M is called the block size or block length.


For a linear transform, the transformed block is obtained as

y(n) = T x(n),    (5.2)

where

y(n) = [y₀(n), y₁(n), ..., y_{M−1}(n)]^T    (5.3)

is called the transform of x(n), or the transform coefficients, and the M × M matrix

T = [ t_{0,0}      t_{0,1}      ...  t_{0,M−1}
      t_{1,0}      t_{1,1}      ...  t_{1,M−1}
      ...          ...               ...
      t_{M−1,0}    t_{M−1,1}    ...  t_{M−1,M−1} ]    (5.4)

is called the transformation matrix or simply the transform. This transform operation is shown on the left of Fig. 5.1.


Fig. 5.1 Flow chart of a transform coder. The coding delay for transforming and inverse-transforming the signals is ignored in this graph

Transform coding is shown in Fig. 5.1. The transform coefficients y(n) are quantized into quantized coefficients ŷ(n) and the resulting quantization indexes are transmitted to the decoder. The decoder reconstructs from the received indexes the quantized coefficients ŷ(n) through inverse quantization. This process may be viewed as if the quantized coefficients ŷ(n) were received directly by the decoder. The decoder then reconstructs an estimate x̂(n) of the source signal vector x(n) from the quantized coefficients ŷ(n) through an inverse transform:

x̂(n) = T^{−1} ŷ(n),    (5.5)

where T^{−1} represents the inverse transform. The reconstructed vector x̂(n) can then be unblocked to rebuild an estimate x̂(n) of the original source signal x(n).

A basic requirement for a transform in the context of transform coding is that it must be invertible,

T^{−1} T = I,    (5.6)

so that a source block can be recovered from its transform coefficients in the absence of quantization:

x(n) = T^{−1} y(n).    (5.7)

Orthogonal transforms are most frequently used in practical applications. To be orthogonal, a transform must satisfy

T^{−1} = T^T,    (5.8)

or

T^T T = I.    (5.9)


Consequently, the inverse transform becomes

x(n) = T^T y(n).    (5.10)

A transform matrix T can be considered as consisting of M row vectors:

T = [ t₀^T
      t₁^T
      ...
      t_{M−1}^T ],    (5.11)

where t_k^T represents the kth row of T:

t_k^T = [t_{k,0}, t_{k,1}, ..., t_{k,M−1}],  k = 0, 1, ..., M−1.    (5.12)

Equation (5.10) becomes

x(n) = [t₀, t₁, ..., t_{M−1}] y(n) = Σ_{k=0}^{M−1} y_k t_k,    (5.13)

which means that the source can be represented as a linear combination of the vectors {t_k}_{k=0}^{M−1}. Therefore, the rows of T are often referred to as basis vectors or basis functions.

One of the advantages of an orthogonal transform is that its inverse transform T^T is immediately available without considering matrix inversion. The fact that the inverse matrix is just the transposition of the transform matrix means that the inverse transform can be implemented by just transposing the transform flow graph or running it backwards.
Another advantage of an orthogonal transform is energy conservation, which means that the transform coefficients have the same total energy as the source signal:

Σ_{k=0}^{M−1} y²(k) = Σ_{k=0}^{M−1} x²(k).    (5.14)

This can be easily proved:

Σ_{k=0}^{M−1} y²(k) = y^T y = x^T T^T T x = x^T x = Σ_{k=0}^{M−1} x²(k),    (5.15)

where the orthogonality condition in (5.9) is used. In the derivation above, the block index n is dropped for convenience. When the context is appropriate, this practice will be followed in the remainder of this book.
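Both properties are easy to verify numerically; the following sketch (mine) uses the orthonormal DCT-II as an example of T:

    import numpy as np

    M = 8
    k = np.arange(M)[:, None]              # row (frequency) index
    n = np.arange(M)[None, :]              # column (time) index
    T = np.sqrt(2.0 / M) * np.cos(np.pi * (2 * n + 1) * k / (2 * M))
    T[0, :] /= np.sqrt(2.0)                # scaling that makes T orthogonal

    x = np.random.default_rng(0).standard_normal(M)
    y = T @ x                                          # forward transform (5.2)
    assert np.allclose(T.T @ T, np.eye(M))             # orthogonality (5.9)
    assert np.isclose((y ** 2).sum(), (x ** 2).sum())  # energy conservation (5.14)
    assert np.allclose(T.T @ y, x)                     # inverse transform (5.10)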


5.2 Optimal Bit Allocation and Coding Gain


The operations involved in the transform and inverse transform are obviously sophisticated and entail a significant amount of computation. This extra burden is carried in the anticipation that, for a given bit rate, the quantization error in the reconstructed signal will be smaller than that from directly quantizing the source signal. This is explained below.

5.2.1 Quantization Noise


The quantization error for the transform coder is the reconstruction error at the decoder output:

q̂_k = x̂(nM−k) − x(nM−k),  k = 0, 1, ..., M−1,    (5.16)

so the MSQE is

σ_q(x̂)² = (1/M) Σ_{k=0}^{M−1} σ_{q̂_k}² = (1/M) Σ_{k=0}^{M−1} E[q̂_k²],    (5.17)

where zero mean is assumed without loss of generality. Letting

q̂ = [q̂₀, q̂₁, ..., q̂_{M−1}]^T,    (5.18)

the above equation becomes

σ_q(x̂)² = (1/M) E[q̂^T q̂].    (5.19)

From Fig. 5.1, we obtain

q̂ = T^T q,    (5.20)

so

σ_q(x̂)² = (1/M) E[q^T T T^T q].    (5.21)

Due to the orthogonality condition in (5.9), T T^T = I, so

σ_q(x̂)² = (1/M) E[q^T q] = (1/M) Σ_{k=0}^{M−1} E[q_k²] = (1/M) Σ_{k=0}^{M−1} σ_{q_k}².    (5.22)

Suppose that the bit rate for the transform coder is R bits per source sample. Since there are M samples in a block, the total number of bits available for coding one block of source samples is MR. These bits are allocated to the quantizers in Fig. 5.1. If the kth quantizer is allocated r_k bits, the total must be MR:

R = (1/M) Σ_{k=0}^{M−1} r_k.    (5.23)

Due to (2.48), the MSQE for the kth quantizer is

σ_{q_k}² = 10^{−0.1(a+b r_k)} σ_{y_k}²,  k = 0, 1, ..., M−1,    (5.24)

where the parameters a and b are dependent on the quantization scheme and the probability distribution of y_k(n). Dropping (5.24) into (5.22), we obtain the following total MSQE:

σ_q(x̂)² = (1/M) Σ_{k=0}^{M−1} 10^{−0.1(a+b r_k)} σ_{y_k}².    (5.25)

Apparently, the MSQE is a function of both the signal variance σ_{y_k}² of each transform coefficient and the number of bits allocated to quantize it. While the former is determined by the source signal and the transform T, the latter is determined by how the bits are allocated to the quantizers, i.e., the bit allocation strategy. For a given bit rate R, a different bit allocation strategy assigns a different set of {r_k}_{k=0}^{M−1}, called a bit allocation, which results in a different MSQE. It is, therefore, imperative to find the optimal one that gives the minimum MSQE.

5.2.2 AM–GM Inequality

Before addressing this question, let us first define the arithmetic mean

AM = (1/M) Σ_{k=0}^{M−1} p_k    (5.26)

and geometric mean

GM = (Π_{k=0}^{M−1} p_k)^{1/M}    (5.27)

for a given set of nonnegative numbers p_k, k = 0, 1, ..., M−1. The AM–GM inequality [89] states that

AM ≥ GM,    (5.28)

with equality if and only if

p₀ = p₁ = ... = p_{M−1}.    (5.29)


For an interpretation of this inequality, let us consider a p₀ × p₁ rectangle. Its perimeter is 2(p₀ + p₁), its area is p₀p₁, and its average side length is (p₀ + p₁)/2. A square with the same area obviously has sides of length √(p₀p₁). The AM–GM inequality states that (p₀ + p₁)/2 ≥ √(p₀p₁), with equality if and only if p₀ = p₁. In other words, among all rectangles with the same area, the square has the shortest average side length.

5.2.3 Optimal Conditions


Applying the AM–GM inequality to (5.22), we have

σ_q(x̂)² = (1/M) Σ_{k=0}^{M−1} σ_{q_k}² ≥ (Π_{k=0}^{M−1} σ_{q_k}²)^{1/M},    (5.30)

with equality if and only if

σ_{q_0}² = σ_{q_1}² = ... = σ_{q_{M−1}}².    (5.31)

Plugging in (5.24), we have

(Π_{k=0}^{M−1} σ_{q_k}²)^{1/M} = (Π_{k=0}^{M−1} 10^{−0.1(a+b r_k)} σ_{y_k}²)^{1/M}
                               = 10^{−0.1a} 10^{−0.1b (1/M) Σ_{k=0}^{M−1} r_k} (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M}
                               = 10^{−0.1(a+bR)} (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M},    (5.32)

where (5.23) is used to arrive at the last equality. Plugging this back into (5.30), we have

σ_q(x̂)² ≥ 10^{−0.1(a+bR)} (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M},    (5.33)

with equality if and only if (5.31) holds.

For a given transform matrix T, the variances of the transform coefficients σ_{y_k}² are determined by the source signal, so the right-hand side of (5.33) is a fixed value for a given bit rate R. Therefore, (5.33) and (5.31) state that:


Optimal MSQE. The MSQE from any bit allocation {r_k}_{k=0}^{M−1} is either equal to or larger than the fixed value on the right-hand side of (5.33), which is thus the minimal MSQE that can be achieved.

Optimal Bit Allocation. This minimal MSQE is achieved, or the equality holds, if and only if (5.31) is satisfied, i.e., the bits are allocated in such a way that the MSQEs from all quantizers are equalized.
Note that this result is obtained without dependence on the actual transform matrix used. As long as the transform matrix is orthogonal, the results in this section
hold.

5.2.4 Coding Gain


Now that it is established that the minimal MSQE of a transform coder is given by the right-hand side of (5.33), we denote it as

σ_q(TC)² = 10^{−0.1(a+bR)} (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M}.    (5.34)

If the source signal is quantized directly as PCM with the same bit rate of R, the MSQE is given by (2.48) and denoted as

σ_q(PCM)² = 10^{−0.1(a+bR)} σ_x².    (5.35)

Then the coding gain of a transform coder over PCM is

G_TC = σ_q(PCM)² / σ_q(TC)² = σ_x² / (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M},    (5.36)

which is the ratio of the source variance σ_x² to the geometric mean of the transform coefficient variances σ_{y_k}².

Due to the energy conservation property of T (5.14), we have

σ_x² = (1/M) Σ_{k=0}^{M−1} σ_{x_k}² = (1/M) Σ_{k=0}^{M−1} σ_{y_k}²,    (5.37)

where σ_{x_k}² is the variance of the kth element of the source vector x. Applying this back into (5.36), we have

G_TC = [(1/M) Σ_{k=0}^{M−1} σ_{y_k}²] / (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M},    (5.38)


which states that the optimal coding gain of a transform coder is the ratio of the arithmetic to the geometric mean of the transform coefficient variances. Due to the AM–GM inequality (5.28), we always have

G_TC ≥ 1.    (5.39)

Note, however, that this is valid if and only if the optimal bit allocation strategy is deployed. Otherwise, the equality in (5.30) will not hold, and the transform coder will not be able to achieve the minimal MSQE given by (5.34).
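For illustration (my sketch), the coding gain (5.38) is straightforward to compute from a set of coefficient variances; note how concentrating the energy drives the gain up:

    import numpy as np

    def coding_gain_db(var_y):
        var_y = np.asarray(var_y, dtype=float)
        am = var_y.mean()                        # arithmetic mean
        gm = np.exp(np.log(var_y).mean())        # geometric mean
        return 10 * np.log10(am / gm)            # (5.38) in dB

    print(coding_gain_db([1.0, 1.0, 1.0, 1.0]))    # 0 dB: no compaction, no gain
    print(coding_gain_db([10.0, 1.0, 0.1, 0.01]))  # ~9.4 dB: compacted energy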

5.2.5 Optimal Bit Allocation


The optimal bit allocation strategy requires that bits be allocated in such a way that the MSQEs from all quantizers are equalized to a constant (see (5.31)). Denoting this constant as σ₀², we obtain from (5.24) that the number of bits allocated to the kth transform coefficient is

r_k = (10/b) log₁₀ σ_{y_k}² − (1/b)(10 log₁₀ σ₀² + a).    (5.40)

Dropping it into (5.23), we obtain

−(1/b)(10 log₁₀ σ₀² + a) = R − (10/b)(1/M) Σ_{k=0}^{M−1} log₁₀ σ_{y_k}².    (5.41)

Dropping this back into (5.40), we obtain the following optimal bit allocation:

r_k = R + (10/b) [log₁₀ σ_{y_k}² − (1/M) Σ_{j=0}^{M−1} log₁₀ σ_{y_j}²]
    = R + (10/(b log₂ 10)) log₂ [σ_{y_k}² / (Π_{j=0}^{M−1} σ_{y_j}²)^{1/M}],  for k = 0, 1, ..., M−1.    (5.42)

If each transform coefficient is considered as satisfying a uniform distribution and is quantized using a uniform quantizer, the parameter b is given by (2.44). Then the above equation becomes

r_k = R + 0.5 log₂ [σ_{y_k}² / (Π_{j=0}^{M−1} σ_{y_j}²)^{1/M}],  for k = 0, 1, ..., M−1.    (5.43)


Both (5.42) and (5.43) state that, aside from a global bias determined by the bit rate
R and the geometric mean of the variances of the transform coefficients, bits should
be assigned to a transform coefficient in proportion to the logarithm of its variance.
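For illustration (my sketch), the rule (5.43) in code; the result can be fractional or even negative, which the practical adjustments discussed next must handle:

    import numpy as np

    def allocate_bits(var_y, R):
        var_y = np.asarray(var_y, dtype=float)
        gm = np.exp(np.log(var_y).mean())      # geometric mean of the variances
        return R + 0.5 * np.log2(var_y / gm)   # (5.43); totals M*R bits overall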

5.2.6 Practical Bit Allocation


It is unlikely that the optimal bit allocation strategy would allocate an integer value r_k to any transform coefficient. When the quantization indexes are directly packed into a bit stream for delivery to the decoder, only an integer number of bits can be packed each time. If entropy coding is subsequently used to code the quantization indexes, an integer number of bits is not necessary, but the number of quantization intervals has to be an integer. In this case, r_k can be a rational number, but this still cannot be guaranteed by the bit allocation strategy.
A simple approach to addressing this problem is to round r_k to its nearest integer or to a value that corresponds to an integer number of quantization intervals. There are, however, more elaborate methods, such as a water-filling procedure which iteratively allocates bits to the transform coefficients with the largest quantization error [4, 51].
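One such iterative procedure can be sketched as follows (a simplified greedy variant for illustration, not the exact algorithm of [4, 51]): one bit at a time is given to the coefficient whose quantizer currently has the largest error, with each additional bit modeled as reducing that error by a factor of 2^{-2}:

    import numpy as np

    def greedy_allocate(var, total_bits):
        # Iteratively allocate integer bits to the coefficient with the
        # largest current quantization error (water-filling-like greedy rule).
        bits = np.zeros(len(var), dtype=int)
        err = np.asarray(var, dtype=float)   # initial error equals the variance
        for _ in range(total_bits):
            k = int(np.argmax(err))          # worst quantizer gets the next bit
            bits[k] += 1
            err[k] /= 4.0                    # one more bit: MSQE scaled by 2^{-2}
        return bits

    print(greedy_allocate([16.0, 4.0, 1.0, 0.25], 12))

By construction this scheme never assigns a negative number of bits, which also addresses the low-rate issue discussed next.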
In addition, the optimal bit allocation strategy assumes an ample supply of bits. Otherwise, some of the transform coefficients with a small variance will be assigned a negative number of bits, as can be seen from both (5.42) and (5.43).
If the bit rate R is not large enough to ensure a positive number of bits allocated to all transform coefficients, the strategy can be modified to include the following clause:

r_k = 0, \quad \text{if } r_k < 0.   (5.44)
Furthermore, if the variance of any transform coefficient becomes zero, zero bits are allocated to it and the zero variance is subsequently taken out of the geometric mean calculation. With these modifications, however, the equalization condition (5.31) of the AM–GM inequality no longer holds, so the quantizer can achieve neither the minimal MSQE in (5.34) nor the coding gain in (5.36).
To illustrate the above method, let us consider an extreme example where

\sigma_{y_0}^2 = M \sigma_x^2, \quad \sigma_{y_k}^2 = 0, \; k = 1, 2, \ldots, M-1.   (5.45)

The zero variances cause the geometric mean to completely break down, so both (5.34) and (5.36) are meaningless. The modified strategy above, however, dictates the following bit allocation:

b_0 = MR, \quad b_k = 0, \; k = 1, 2, \ldots, M-1.   (5.46)


5.2.7 Energy Compaction


Dropping the above two equations for the extreme example into (5.25), we obtain the total MSQE as

\sigma_{q(\hat{x})}^2 = 10^{-0.1(a+bMR)} \sigma_x^2.   (5.47)

Since the MSQE for direct quantization (PCM) is still given by (5.35), the effective coding gain is

G_{TC} = \frac{\sigma_{q(\mathrm{PCM})}^2}{\sigma_{q(\hat{x})}^2} = 10^{0.1(M-1)bR}.   (5.48)

To appreciate this improvement, consider a uniform distribution whose parameter b is given by (2.44); the above coding gain becomes

G_{TC}(\mathrm{dB}) \approx 6.02(M-1)R.   (5.49)

For a scenario of M = 1024 and R = 1 bit per source sample, which is typical in audio coding, this coding gain is 6.02 \times 1023 \approx 6158 dB!
An intuitive explanation for this dramatic improvement is energy compaction combined with the exponential reduction of quantization error with respect to bit rate. With direct quantization, each sample in the source vector has a variance of \sigma_x^2 and is allocated R bits, resulting in a total signal variance of M\sigma_x^2 and a total of MR bits for the whole block. With transform coding, however, this total variance of M\sigma_x^2 for the whole block is compacted into the first coefficient and all MR bits for the whole block are allocated to it. Since MSQE is linearly proportional to the variance (see (5.25)), the MSQE for the first coefficient would increase M times due to the M-fold increase in variance, but this increase is spread over the M samples in the block, resulting in no net change. However, MSQE decreases exponentially with bit rate, so the M-fold increase in bit rate multiplies the MSQE decrease in decibels by M!
In fact, energy compaction is the key to coding gain in transform coding. Since the transform matrix T is orthogonal, the arithmetic mean is constant for a given signal, no matter how its energy is distributed by the transform to the individual coefficients. This means that the numerator of (5.38) remains the same regardless of the transform matrix T. However, if the transform distributes most of the signal's energy (variance) to a minority of transform coefficients and leaves only the balance to the rest, the geometric mean in the denominator of (5.38) becomes extremely small. Consequently, the coding gain becomes extremely large.

5.3 Optimal Transform


Section 5.2 has established that the coding gain is dependent on the degree of energy compaction that the transform matrix T delivers. Is there a transform that is
optimal in terms of having the best energy compaction capability or delivering the
best coding gain?


5.3.1 Karhunen–Loève Transform


To answer this question, let us go back to (5.2) to establish the following equation:

y(n) y^T(n) = T x(n) x^T(n) T^T.   (5.50)

Taking expected values on both sides, we obtain the covariance of the transform coefficients

R_{yy} = T R_{xx} T^T,   (5.51)

where

R_{yy} = E\left[ y(n) y^T(n) \right]   (5.52)

and R_{xx} is the covariance matrix of the source signal defined in (4.49). As noted there, it is symmetric and Toeplitz.
By definition, the kth diagonal element of R_{yy} is the variance of the kth transform coefficient:

\left[ R_{yy} \right]_{kk} = E\left[ y_k(n) y_k(n) \right] = \sigma_{y_k}^2,   (5.53)

so the geometric mean of \{\sigma_{y_k}^2\}_{k=0}^{M-1} is the product of the diagonal elements of R_{yy}:

\prod_{k=0}^{M-1} \sigma_{y_k}^2 = \prod_{k=0}^{M-1} \left[ R_{yy} \right]_{kk}.   (5.54)

It is well known that a covariance matrix is positive semidefinite, i.e., its eigenvalues are all real and nonnegative [33]. For practical signals, R_{xx} may be considered positive definite (no zero eigenvalues). Due to (5.51), R_{yy} may be considered positive definite as well, so the following inequality holds [93]:

\prod_{k=0}^{M-1} \left[ R_{yy} \right]_{kk} \ge \det R_{yy}   (5.55)

with equality if and only if R_{yy} is diagonal.


Due to (5.51), we have

\det R_{yy} = \det T \, \det R_{xx} \, \det T^T.   (5.56)

Taking the determinant of (5.9) gives

\det T^T \det T = \det I = 1,   (5.57)

which leads to

|\det T| = 1.   (5.58)


Dropping this back into (5.56), we obtain

\det R_{yy} = \det R_{xx}.   (5.59)

Consequently, the inequality in (5.55) becomes

\prod_{k=0}^{M-1} \left[ R_{yy} \right]_{kk} \ge \det R_{xx},   (5.60)

again with equality if and only if R_{yy} is diagonal.


Since Rxx is completely determined by statistical properties of the source signal,
the right-hand side of (5.60) is a fixed value. Due to (5.51), however, we can adjust
the transform matrix T to alter the value on the left-hand side of (5.60). The best we
can achieve by doing this is to find a T that makes Ryy a diagonal matrix:
Ryy D TRxx TT D diagfy20 ; y21 ; : : : ; y2M 1 g

(5.61)

so that the equality in (5.60) holds.


It is well known in matrix theory [28] that the matrix T which makes (5.61)
hold is an orthonormal matrix whose rows are the orthonormal eigenvectors of the
M 1
are the eigenvalues of Rxx . Such
matrix Rxx and the diagonal elements fy2k gkD0
a transform matrix is called KarhunenLoeve Transform (KLT) of source signal
x.n/ and the eigenvalues its transform coefficients.
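To make this concrete, the following sketch builds the KLT of a hypothetical first-order autoregressive (AR) source with correlation coefficient \rho: the rows of T are the eigenvectors of R_{xx}, and the resulting R_{yy} is diagonal, as required by (5.61):

    import numpy as np

    M, rho = 8, 0.95
    idx = np.arange(M)
    Rxx = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1) covariance, sigma_x^2 = 1
    lam, V = np.linalg.eigh(Rxx)                      # eigenvalues, orthonormal eigenvectors
    T = V.T                                           # KLT: rows are eigenvectors of Rxx
    Ryy = T @ Rxx @ T.T                               # (5.51); diagonal for the KLT
    assert np.allclose(Ryy, np.diag(lam))
    gain = lam.mean() / np.exp(np.log(lam).mean())    # coding gain (5.38) from eigenvalues
    print(10 * np.log10(gain))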

5.3.2 Maximal Coding Gain


With a Karhunen–Loève transform matrix, the equality in (5.60) holds, which gives the minimum value for the geometric mean:

Minimum: \left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M} = \left( \det R_{xx} \right)^{1/M}.   (5.62)

Dropping this back into (5.36), we establish that the maximum coding gain of the optimal transform coder, for a given source signal x(n), is

Maximum: G_{TC} = \frac{\sigma_x^2}{\left( \det R_{xx} \right)^{1/M}}.   (5.63)

Note that this is made possible by

• Deploying the Karhunen–Loève transform
• Providing ample bits
• Following the optimal bit allocation strategy


The maximal coding gain is achieved by the Karhunen–Loève transform through diagonalizing the covariance matrix R_{xx} into R_{yy}. A diagonal covariance matrix R_{yy} means that the transform coefficients that constitute the vector y are uncorrelated, so maximum coding gain, or energy compaction, is directly linked to decorrelation.

5.3.3 Spectrum Flatness


The maximal coding gain discussed above is optimal for a given block size M. Typically, the maximal coding gain increases with the block size, approaching an upper limit as the block size tends to infinity. It can be shown [33] that transform coding is asymptotically optimal because this upper limit is equal to the theoretical upper limit predicted by rate-distortion theory [33]:

\lim_{M \to \infty} G_{TC} = \frac{1}{\gamma_x^2},   (5.64)

where \gamma_x^2 is the spectral flatness measure

\gamma_x^2 = \frac{\exp\left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega}) \, d\omega \right)}{\frac{1}{2\pi} \int_{-\pi}^{\pi} S_{xx}(e^{j\omega}) \, d\omega}   (5.65)

of the source signal x(n), whose power spectrum is S_{xx}(e^{j\omega}).
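The spectral flatness measure is easy to evaluate numerically. The sketch below approximates the two integrals in (5.65) by averages over a dense frequency grid for a hypothetical AR(1) power spectrum, for which the flatness is known in closed form to be 1 - \rho^2:

    import numpy as np

    rho = 0.9
    w = np.linspace(-np.pi, np.pi, 1 << 16, endpoint=False)
    Sxx = 1.0 / np.abs(1 - rho * np.exp(-1j * w)) ** 2       # AR(1) power spectrum
    flatness = np.exp(np.mean(np.log(Sxx))) / np.mean(Sxx)   # (5.65), integrals as grid means
    print(flatness, 1 - rho ** 2)                            # the two agree closely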

5.4 Suboptimal Transforms


Despite its optimality, the KLT is rarely used in practical applications. The first reason is that the KLT is signal-dependent: it is built from the covariance matrix of the source signal. This has a few ramifications, including:

• The KLT is only as good as the statistical signal model, but a good model is not always available.
• Signal statistics change with time in most applications. This calls for real-time calculation of the covariance matrix, eigenvectors, and eigenvalues, which is seldom plausible in practical applications, especially in the decoder.
• Even if the encoder is assigned to do the calculation, transmission of the eigenvectors to the decoder consumes a large number of bits, so it is not feasible for compression coding.
Even if the signal-dependent issues above are resolved and the associated eigenvectors and eigenvalues are available at our convenience, the calculation of the Karhunen–Loève transform itself is still a big deal, especially for a large M, because both the transform in (5.2) and the inverse transform in (5.10) require on the order of M^2 calculations (multiplications and additions). This is unfavorable when compared with structured transforms, such as the DCT, whose structures are amenable to fast implementation algorithms that require calculations on the order of M \log_2(M).
There are many structured and signal-independent transforms which can be considered suboptimal in the sense that their performance approaches that of the KLT when the block size is large. In fact, all sinusoidal orthogonal transforms are found to approach the performance of the KLT as the block size tends to infinity [74], including the discrete Fourier transform, the DCTs, and the discrete sine transforms.
With such sinusoidal orthogonal transforms, frequency is a key characteristic of the basis functions or vectors, so the transform coefficients are usually indexed by frequency. Consequently, they are often referred to as frequency coefficients and are considered to be in the frequency domain. The transforms may also be referred to as frequency transforms and time–frequency analysis.

5.4.1 Discrete Fourier Transform


DFT (Discrete Fourier Transform) is the most prominent transform in this category of sinusoidal transforms. Its transform matrix is given by

T = W_M = \left[ W_M^{kn} \right],   (5.66)

where

W_M = e^{-j2\pi/M},   (5.67)

so that

T_{k,n} = W_M^{kn} = e^{-j2\pi kn/M}.   (5.68)

The matrix is unitary, meaning

\left( W_M^* \right)^T W_M = M I,   (5.69)

so its inverse is

W_M^{-1} = \frac{1}{M} \left( W_M^* \right)^T.   (5.70)

Note that the scale factor M above is usually disregarded when discussing orthogonal transforms, because it can be adjusted on either the forward or backward transform side within a particular context. The matrix is also symmetric, that is,

W_M^T = W_M,   (5.71)

so (5.70) becomes

W_M^{-1} = \frac{1}{M} W_M^*.   (5.72)


The DFT is more commonly written as

y_k = \sum_{n=0}^{M-1} x(n) W_M^{kn}   (5.73)

and its inverse (IDFT) as

x(n) = \frac{1}{M} \sum_{k=0}^{M-1} y_k W_M^{-kn}.   (5.74)

When the DFT is applied to a block of M source samples, this block of M samples is virtually extended periodically on both sides of the block boundary to infinity, as shown at the top of Fig. 5.2. This introduces sharp discontinuities at both boundaries. In order to accommodate these discontinuities, the DFT needs to incur many large coefficients, especially at high frequencies. This causes a spread of energy, the opposite of energy compaction, so the DFT is not ideal for signal coding.

Fig. 5.2 Periodic boundaries of DFT (top), DCT-II (middle), and DCT-IV (bottom)


In addition, the DFT is a complex transform that produces complex transform coefficients even for real signals. This implies that the number of transform coefficients that need to be quantized and conveyed to the decoder is doubled.
Due to the two drawbacks above, the DFT is seldom directly deployed in practical transform coders. It is, however, frequently deployed in many transform coders as a conduit for fast calculation of other transforms because of its good structure for fast algorithms and the abundance of such algorithms.

5.4.2 DCT
DCT is a family of Fourier-related transforms that use only real transform coefficients [2, 80]. Since the Fourier transform of a real and even signal is real and even,
a DCT operates on real data and is equivalent to a DFT of roughly twice the block
length. Depending on how the beginning and ending block boundaries are handled,
there are eight types of DCTs, and correspondingly eight types of DSTs, but only
two of them are widely used.

5.4.2.1 Type-II DCT


The most prominent member of the DCT family is the type-II DCT given below:

DCT-II: C_{k,n} = c(k) \sqrt{\frac{2}{M}} \cos\left( (n+0.5) \frac{\pi k}{M} \right),   (5.75)

where

c(k) = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } k = 0; \\ 1, & \text{otherwise}. \end{cases}   (5.76)

Its inverse transform is

IDCT-II: \left[ C^T \right]_{n,k} = C_{k,n}.   (5.77)

DCT-II tends to deliver the best energy compaction performance in the DCT and DST family. It achieves this mostly because it uses symmetric boundary conditions on both sides of its period, as shown in the middle of Fig. 5.2. In particular, DCT-II extends its boundaries symmetrically on both sides of a period, so the samples can be considered periodic with a period of 2M and there is essentially no discontinuity at either boundary.
DCT-II for M = 2 was shown to be identical to the KLT for a first-order autoregressive (AR) source [33]. Furthermore, the coding gain of DCT-II for other M is shown to be very close to that of the KLT for such a source with a high correlation coefficient:

\rho = \frac{R(1)}{R(0)} \approx 1.   (5.78)

Similar results with real speech were also observed in [33].


Since many real-world signals can be modeled as such a source, DCT-II is deployed in many signal coding and processing applications and is sometimes simply called the DCT (its inverse is, of course, called the inverse DCT or the IDCT). The two-dimensional DCT-II, which shares these characteristics, has been deployed by many international image and video coding standards, such as JPEG [37], MPEG-1/2/4 [54, 57, 58], and MPEG-4(AVC)/H.264 [61].
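The DCT-II matrix of (5.75) can be constructed and checked for orthonormality in a few lines of NumPy, confirming that its inverse is simply its transpose per (5.77):

    import numpy as np

    M = 8
    n = np.arange(M)
    k = n[:, None]
    c = np.where(n == 0, 1 / np.sqrt(2), 1.0)[:, None]           # c(k) of (5.76)
    C = c * np.sqrt(2 / M) * np.cos((n + 0.5) * np.pi * k / M)   # (5.75): row k, column n
    assert np.allclose(C @ C.T, np.eye(M))                       # orthonormal: inverse = transpose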

5.4.2.2 Type-IV DCT


Type-IV DCT is obtained by shifting the frequencies of the Type-II DCT in (5.75) by \pi/2M, so it has the following form:

DCT-IV: C_{k,n} = \sqrt{\frac{2}{M}} \cos\left[ (n+0.5)(k+0.5) \frac{\pi}{M} \right].   (5.79)

It is also its own inverse transform.


Due to this frequency shifting, its right boundary is no longer smooth, as shown at the bottom of Fig. 5.2. Such a sharp discontinuity requires many large transform coefficients to compensate, significantly degrading its energy compaction ability, so it is not as useful as DCT-II. However, it serves as a valuable building block for fast algorithms of the DCT, the MDCT, and other cosine-modulated filter banks.
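The self-inverse property is easily verified numerically, since the DCT-IV matrix of (5.79) is both symmetric and orthogonal:

    import numpy as np

    M = 8
    n = np.arange(M)
    C4 = np.sqrt(2 / M) * np.cos((n[None, :] + 0.5) * (n[:, None] + 0.5) * np.pi / M)  # (5.79)
    assert np.allclose(C4, C4.T)             # symmetric
    assert np.allclose(C4 @ C4, np.eye(M))   # its own inverse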

Chapter 6

Subband Coding

Transform coders artificially divide a source signal into blocks, then process and code each block independently of the others. This leads to significant variations between blocks, which may become visible or audible as discontinuities at the block boundaries. Referred to as blocking artifacts or blocking effects, these artifacts may appear as tiles in decoded images or videos that were coded at low bit rates. In audio, blocking artifacts sound like periodic clicking, which many people consider annoying. While the human eye can tolerate a large degree of blocking artifacts, the human ear is extremely intolerant of such periodic clicking. Therefore, transform coding is rarely deployed in audio coding.
One approach to avoiding blocking artifacts is to introduce overlapping between
blocks. For example, a transform coder may be structured in such a way that it
advances only 90% of a block, so that there is a 10% overlap between blocks. Reconstructed samples in the overlapped region can be averaged to smooth out the
discontinuities between blocks, thus avoiding the blocking artifacts.
The cost of this overlapping is reduced coding performance. Since the number of transform coefficients is equal to the number of source samples in a block, the number of samples overlapping between blocks is the number of extra transform coefficients that need to be quantized and conveyed to the decoder. They consume extra valuable bit resources, thus degrading coding performance.
What is really needed is overlapping in the temporal domain, but no overlapping in the transform or frequency domain. This means the transform matrix T in (5.2) is no longer M \times M, but M \times N, where N > M. This leads to subband coding (SBC). For example, application of this idea to the DCT leads to the modified discrete cosine transform (MDCT), which has 50% overlap in the temporal domain, i.e., its transform matrix T is M \times 2M.

6.1 Subband Filtering


Subband coding is based on the decomposition of a source signal into subband samples through a filter bank. This decomposition can be considered as an extension of a transform where the filter bank corresponds to the transform matrix and

the subband samples to the transform coefficients. This extension is based on a new perspective on transforms, which views the multiplication of a transform matrix with the source vector as the filtering of the source vector by a bank of subband filters whose impulse responses are the row vectors of the transform matrix. Allowing these subband filters to be longer than the block size of the transform enables dramatic improvement in energy compaction capability, in addition to the elimination of blocking artifacts.

6.1.1 Transform Viewed as Filter Bank


To extend from transform coding to subband coding, let us consider the transform expressed in (5.2). The operations involved in such a transform to obtain the kth component of the transform coefficient vector y(n) from the source vector x(n) may be written as

y_k(n) = t_k^T x(n) = \sum_{m=0}^{M-1} t_{k,m} x_m(n) = \sum_{m=0}^{M-1} t_{k,m} x(nM - m),   (6.1)

where the last step is obtained via (5.1). The above operation can obviously be considered as filtering the source signal x(n) by a filter with an impulse response given by the kth basis vector or basis function t_k. Consequently, the whole transform may be considered as filtering by a bank of filters, called the analysis filter bank, with impulse responses given by the row vectors of the transform matrix T. This is shown in Fig. 6.1.
Similarly, the inverse transform in (5.10) may be considered as the output from a bank of filters, called the synthesis filter bank, with impulse responses given by the row vectors of the inverse matrix T^T. This is also shown in Fig. 6.1.
For the suboptimal sinusoidal transforms discussed in Chap. 5, each of the filters
in either the analysis or synthesis bank is associated with a basis vector or basis

Fig. 6.1 Analysis and synthesis filter banks


function which corresponds to a specific frequency, so it deals with components of the source signal associated with that frequency. Such filters are usually bandpass and decompose the frequency spectrum into small bands, called subbands, so such filters are called subband filters and the decomposed signal components \{y_k\}_{k=0}^{M-1} are called subband samples.

6.1.2 DFT Filter Bank


To illustrate the power of this new perspective on transforms, let us consider the simple analysis filter bank shown in Fig. 6.2, which is built using the inverse DFT matrix W_M^* given in (5.70). The delay chain in the analysis bank consists of M-1 delay units z^{-1} connected in series. As the source signal x(n) passes through it, a bank of signals is extracted:

u_k(n) = x(n-k), \quad k = 0, 1, \ldots, M-1.   (6.2)

This ensures that M samples from the source signal are presented simultaneously to the transform matrix. Due to (5.67), except for the scale factor of 1/M, the subband sample sequence for the kth subband is the output from the kth subband filter and is given as

y_k(n) = \sum_{m=0}^{M-1} u_m(n) W_M^{-km}, \quad k = 0, 1, \ldots, M-1,   (6.3)

which is essentially the inverse DFT in (5.74). Due to (6.2), it becomes

y_k(n) = \sum_{m=0}^{M-1} x(n-m) W_M^{-km}.   (6.4)

Fig. 6.2 DFT analysis filter bank


Its Z-transform is

Y_k(z) = \sum_{m=0}^{M-1} X(z) z^{-m} W_M^{-km} = X(z) \sum_{m=0}^{M-1} \left( z W_M^{k} \right)^{-m},   (6.5)

so the transfer function for the kth subband filter is

H_k(z) = \frac{Y_k(z)}{X(z)} = \sum_{m=0}^{M-1} \left( z W_M^{k} \right)^{-m}.   (6.6)

Since the transfer function for the zeroth subband is

H(z) = \sum_{m=0}^{M-1} z^{-m},   (6.7)

the transfer functions for all other subbands may be represented by it:

H_k(z) = H\left( z W_M^{k} \right).   (6.8)

Its frequency response is

H_k(e^{j\omega}) = H\left( e^{j(\omega - \frac{2\pi k}{M})} \right),   (6.9)

which is H(e^{j\omega}) uniformly shifted by 2\pi k/M in the frequency domain. Therefore, H(z) is called the prototype filter and all other subband filters in the DFT bank are built by uniformly shifting, or modulating, the prototype filter. A filter bank with such a structure is called a modulated filter bank. It is a prominent category of filter banks, most notably amenable to fast implementation.
The magnitude response of the prototype filter (6.7) is

\left| H(e^{j\omega}) \right| = \left| \frac{\sin(M\omega/2)}{\sin(\omega/2)} \right|,   (6.10)

which is shown at the top of Fig. 6.3 for M = 8. According to (6.9), all other subband filters in the bank are shifted or modulated versions of it and are shown at the bottom of Fig. 6.3.
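The closed form (6.10) can be checked against a direct evaluation of (6.7) on the unit circle; the sketch below also reports the peak-sidelobe level, which comes out near the 13 dB mentioned in the next section:

    import numpy as np

    M = 8
    w = np.linspace(1e-3, np.pi, 4096)                        # avoid the removable 0/0 at w = 0
    H = np.exp(-1j * np.outer(w, np.arange(M))).sum(axis=1)   # H(e^{jw}) from (6.7)
    assert np.allclose(np.abs(H), np.abs(np.sin(M * w / 2) / np.sin(w / 2)))  # (6.10)
    side = np.abs(H[w > 2 * np.pi / M]).max()                 # largest sidelobe past the first null
    print(20 * np.log10(M / side))                            # roughly 13 dB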

6.1.3 General Filter Banks


Viewed from the new perspective of subband filtering, the DFT apparently has rather inferior energy compaction capability: the subband filters have wide transition bands and their stopband attenuation is only about 13 dB. A significant amount of energy in one subband spills into other subbands, appearing as the ripples in Fig. 6.3.

Fig. 6.3 Magnitude responses of a DFT analysis filter bank with M = 8. The top shows the prototype filter. All subband filters in the bank are uniformly shifted or modulated versions of it and are shown at the bottom

Fig. 6.4 Ideal bandpass subband filters for M = 8

While other transforms, such as the KLT and DCT, may have better energy compaction capability than the DFT, they are ultimately limited by M, the number of samples in the block. It is well known in filter design that a sharp magnitude response with little energy leakage requires long filters [67], so the fundamental limiting factor is the block size M of these transforms or subband filters.
To obtain even better performance, subband filters longer than M need to be used. To maximize energy compaction, subband filters should have the magnitude response of the ideal bandpass filter shown in Fig. 6.4, which has no energy leakage at all and achieves the maximum coding gain (to be proved later). Unfortunately, such a bandpass filter requires an infinite filter order [68], which is very difficult to implement in a practical system, so the challenge is to design subband filters that optimize the coding gain for a given limited order.


Once the order of the subband filters is extended beyond M, overlapping between blocks occurs, providing the additional benefit of mitigating the blocking artifacts discussed at the beginning of this chapter.
It is clear that transforms are a special type of filter bank whose subband filters have order less than M. Their main characteristic is that there is no overlapping between transform blocks. Therefore, transforms are frequently referred to as filter banks and transform coefficients as subband samples. On the other hand, filter banks with subband filters longer than M are also sometimes referred to as transforms or lapped transforms in the literature [49]. One such example is the modified discrete cosine transform (MDCT), whose subband filters are twice as long as the block size.

6.2 Subband Coder


When the filter bank in Figs. 6.1 and 6.2 is directly used for subband coding, there is an immediate obstacle: an M-fold increase in the number of samples to be coded, because the analysis bank generates M subband samples for each source sample.
This problem may be resolved, as shown in Fig. 6.5, by M-fold decimation in the analysis bank to make the total number of subband samples equal to that of the source block, followed by M-fold expansion in the synthesis bank to recover the sample rate of the subband samples back to the original sample rate of the source signal.
An M-fold decimator, also referred to as a downsampler or sample rate compressor, discards M-1 samples from each block of M input samples and retains only one sample for output:

x_D(n) = x(Mn),   (6.11)

Fig. 6.5 Maximally decimated filter bank and subband coder. The ↓M denotes M-fold decimation and ↑M M-fold expansion. The additive noise model is used to represent quantization in each subband


where x(n) is the source sequence and x_D(n) is the decimated sequence. Due to the loss of the M-1 samples incurred in decimation, it may not be possible to recover x(n) from the decimated x_D(n) due to aliasing [65, 68].
When applied to the analysis bank in Fig. 6.5, the decimator reduces the sample rate of each subband to 1/M of the original. Since there are M subbands, the total sample rate over all subbands is still the same as that of the source signal.
The M-fold expander, also referred to as an upsampler or interpolator, passes each input sample through to the output and, after each, inserts M-1 zeros:

x_E(n) = \begin{cases} x(n/M), & \text{if } n/M \text{ is an integer}; \\ 0, & \text{otherwise}. \end{cases}   (6.12)
Since all samples from the input are passed through to the output, there is obviously no loss of information. For example, the source can be recovered from the expanded output by an M-fold decimator. As explained in Sect. 6.3, however, expansion causes images in the spectrum, which need to be handled accordingly.
When applied to the synthesis bank in Fig. 6.5, the expander for each subband recovers the sample rate of that subband back to the original sample rate of the source signal. It is then possible to output a reconstructed sequence at the same sample rate as the source signal.
Coding in the subband domain, or subband coding, is accomplished by attaching a quantizer to the output of each subband filter in the analysis filter bank and
a corresponding inverse quantizer to the input of each subband filter in the synthesis filter bank. The abstraction of this quantization and inverse quantization is the
additive noise model in Fig. 2.3, which is deployed in Fig. 6.5.
The filter bank above has a special characteristic: its sample rate for each subband is 1/M of that of the source. This happens because the decimation factor is equal to the number of subbands. Such a subband system is called a maximally decimated or critically sampled filter bank.
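The decimator (6.11) and expander (6.12) are one-liners in NumPy; the sketch below also confirms that expansion loses no information, since decimating the expanded sequence returns the original:

    import numpy as np

    M = 4
    x = np.arange(12.0)                 # any source sequence
    xd = x[::M]                         # M-fold decimation (6.11): keep every Mth sample
    xe = np.zeros(len(x) * M)
    xe[::M] = x                         # M-fold expansion (6.12): M-1 zeros after each sample
    assert np.array_equal(xe[::M], x)   # an M-fold decimator recovers the source exactly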

6.3 Reconstruction Error


When a signal moves through a maximally decimated subband system, it is altered by analysis filters, decimators, quantizers, inverse quantizers, expanders, and synthesis filters. The combined effect of these alterations may lead to a reconstruction error at the output of the synthesis bank:

e(n) = \hat{x}(n) - k x(n-d),   (6.13)

where k is a scale factor and d is a delay. For subband coding, this reconstruction error needs to be either exactly or approximately zero in the absence of quantization, so that the reconstructed signal is a delayed and scaled version of the source signal. This section analyzes decimation and expansion effects to arrive at conditions on the analysis and synthesis filters that guarantee zero reconstruction error.


6.3.1 Decimation Effects


Before considering the problem of decimation effects, let us first consider the following geometric series:

p_M(n) = \frac{1}{M} \sum_{m=0}^{M-1} e^{j\frac{2\pi m n}{M}}.   (6.14)

When n is a multiple of M, the above equation becomes

p_M(n) = 1   (6.15)

due to e^{j2\pi m} = 1. For other values of n, the equation becomes

p_M(n) = \frac{1}{M} \, \frac{1 - \left( e^{j\frac{2\pi n}{M}} \right)^M}{1 - e^{j\frac{2\pi n}{M}}} = 0,   (6.16)

due to the formula for geometric series and e^{j2\pi n} = 1. Therefore, the geometric series (6.14) becomes

p_M(n) = \begin{cases} 1, & \text{if } n \text{ is a multiple of } M; \\ 0, & \text{otherwise}. \end{cases}   (6.17)

Now let us consider the Z-transform of the decimated sequence in (6.11):

X_D(z) = \sum_{n=-\infty}^{\infty} x_D(n) z^{-n} = \sum_{n=-\infty}^{\infty} x(Mn) z^{-n}.   (6.18)

Due to the upper half of (6.17), we can multiply the right-hand side of the above equation by p_M(nM) to get

X_D(z) = \sum_{n=-\infty}^{\infty} p_M(nM) x(Mn) z^{-n}.   (6.19)

Due to the lower half of (6.17), we can make a variable change of m = nM to get

X_D(z) = \sum_{m=-\infty}^{\infty} p_M(m) x(m) z^{-m/M},   (6.20)

where m takes on integer values at an increment of one. Dropping in (6.14), we have

X_D(z) = \frac{1}{M} \sum_{m=-\infty}^{\infty} \sum_{k=0}^{M-1} x(m) \left( z^{1/M} e^{-j\frac{2\pi k}{M}} \right)^{-m}.   (6.21)


Due to (5.67), the equation above becomes

X_D(z) = \frac{1}{M} \sum_{m=-\infty}^{\infty} \sum_{k=0}^{M-1} x(m) \left( z^{1/M} W_M^k \right)^{-m}.   (6.22)

Letting X(z) denote the Z-transform of x(m), the equation above can be written as

X_D(z) = \frac{1}{M} \sum_{k=0}^{M-1} X\left( z^{1/M} W_M^k \right).   (6.23)

The Fourier transform of (6.23) is

X_D(e^{j\omega}) = \frac{1}{M} \sum_{k=0}^{M-1} X\left( e^{j\frac{\omega - 2\pi k}{M}} \right),   (6.24)

which can be interpreted as follows:

1. Stretch X(e^{j\omega}) by a factor of M.
2. Create M-1 aliasing copies and shift them by 2\pi k, respectively.
3. Add all shifted aliasing copies obtained in step 2 to the stretched copy in step 1.
4. Divide the sum above by M.

As an example, let us consider the prototype filter H(z) in (6.7) as the Z-transform of a regular signal. Its time-domain representation is obviously as follows:

x(n) = \begin{cases} 1, & 0 \le n < M; \\ 0, & \text{otherwise}. \end{cases}   (6.25)

Let M = 8; we now examine the effect of eightfold decimation on this signal. Its Fourier transform is given in (6.10) and shown at the top of Fig. 6.3. The stretched H(e^{j\omega/M}) and all its shifted aliasing copies are shown at the top of Fig. 6.6. Due to the stretching factor of M = 8, their period is no longer 2\pi, but is stretched to 8 \times 2\pi, which is the frequency range covered by Fig. 6.6. The Fourier transform of the decimated signal is shown at the bottom of Fig. 6.6, whose period is 2\pi as required of the Fourier transform of a sequence. Due to the overlapping of the stretched spectrum with its shifted aliasing copies and the subsequent mutual cancellation, the spectrum of the decimated signal is totally different from that of the original source signal shown in Fig. 6.3, so we cannot recover the original signal from its decimated version.
One approach to avoiding aliasing is to band-limit the source signal to |\omega| < \pi/M, according to the teaching of Nyquist's sampling theorem [65]. Due to the stretching factor of M, the stretched spectrum is then bandlimited to |\omega| < \pi. Since its shifted copies are placed at 2\pi intervals, there is no overlapping between the original and the aliasing copies. The aliasing copies can be removed by an ideal low-pass filter, leaving only the original copy.

Fig. 6.6 Stretched spectrum of the source signal and all its shifted aliasing copies (top). Due to the stretching factor of M = 8, their period is also stretched from 2\pi to 8 \times 2\pi. The spectrum of the decimated signal (bottom) has a period of 2\pi, but is totally different from that of the original signal

The approach above is not the only one for aliasing-free decimation; see [82] for details. However, aliasing-free decimation is not the goal of filter bank design. A certain amount of aliasing is usually allowed in some special ways. As long as the aliasing from all decimators in the filter bank cancels completely at the output of the synthesis bank, the reconstructed signal is still aliasing-free. Even if aliasing cannot be canceled completely, proper filter bank design can still keep it small enough to obtain a reconstruction with tolerable error.
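The aliasing relation (6.24) can also be verified numerically with the FFT: for a length-N sequence (with N a multiple of M), the DFT of the decimated sequence equals the average of the M aliased segments of the DFT of the original, a discrete counterpart of (6.24). A minimal sketch:

    import numpy as np

    M, N = 8, 1024
    x = np.random.randn(N)
    Xd = np.fft.fft(x[::M])                 # spectrum of the decimated sequence
    X = np.fft.fft(x)
    alias = X.reshape(M, -1).mean(axis=0)   # average of M shifted aliasing copies
    assert np.allclose(Xd, alias)           # discrete analog of (6.24)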

6.3.2 Expansion Effects


To see the consequence of expansion, let us consider the Z-transform of the expanded signal x_E(n):

X_E(z) = \sum_{n=-\infty}^{\infty} x_E(n) z^{-n}.   (6.26)

Due to (6.12), x_E(n) is nonzero only when n is a multiple of M: n = kM, where k is an integer. Replacing n with kM in the above equation, we have

X_E(z) = \sum_{k=-\infty}^{\infty} x_E(kM) z^{-kM}.   (6.27)

Due to the upper half of (6.12), x_E(kM) = x(k), so we have

X_E(z) = \sum_{k=-\infty}^{\infty} x(k) z^{-kM} = X\left( z^M \right).   (6.28)

Its Fourier transform is

X_E(e^{j\omega}) = X\left( e^{jM\omega} \right),   (6.29)

which is an M-fold compressed version of X(e^{j\omega}). In other words, the effect of sample rate expansion is frequency compression.
As an example, the signal in (6.25), whose spectrum is shown at the top of Fig. 6.7, is eightfold expanded to give an output signal whose spectrum is shown at the bottom of Fig. 6.7. Due to the compression of frequency by a factor of 8, seven images are shifted into the [0, 2\pi] region from outside.

Fig. 6.7 The effect of sample rate expansion is frequency compression. The spectrum of the source signal on the top is compressed by a factor of 8 to produce the spectrum of the expanded signal at the bottom. Seven images are shifted into the [0, 2\pi] region from outside due to this frequency compression


6.3.3 Reconstruction Error


Let us now consider the reconstruction error of the subband system in Fig. 6.5 in the absence of quantization. Due to (6.23), each subband signal after decimation is

Y_k(z) = \frac{1}{M} \sum_{m=0}^{M-1} H_k\left( z^{1/M} W_M^m \right) X\left( z^{1/M} W_M^m \right).   (6.30)

Due to (6.28), the reconstructed signal is

\hat{X}(z) = \sum_{k=0}^{M-1} F_k(z) Y_k\left( z^M \right)
= \frac{1}{M} \sum_{m=0}^{M-1} X\left( z W_M^m \right) \sum_{k=0}^{M-1} F_k(z) H_k\left( z W_M^m \right)
= \frac{1}{M} X(z) \sum_{k=0}^{M-1} F_k(z) H_k(z) + \frac{1}{M} \sum_{m=1}^{M-1} X\left( z W_M^m \right) \sum_{k=0}^{M-1} F_k(z) H_k\left( z W_M^m \right).   (6.31)

Define the overall transfer function as

T(z) = \frac{1}{M} \sum_{k=0}^{M-1} F_k(z) H_k(z)   (6.32)

and the aliasing transfer functions as

A_m(z) = \frac{1}{M} \sum_{k=0}^{M-1} F_k(z) H_k\left( z W_M^m \right), \quad m = 1, 2, \ldots, M-1;   (6.33)

the reconstructed signal is then

\hat{X}(z) = T(z) X(z) + \sum_{m=1}^{M-1} A_m(z) X\left( z W_M^m \right).   (6.34)

Note that T(z) is also the overall transfer function of the filter bank in the absence of both the decimators and expanders. Since X(zW_M^m) is a shifted version of the source signal, the reconstructed signal may be considered as a linear combination of the source signal X(z) and its shifted aliasing versions.


To set the reconstruction error (6.13) to zero, the overall transfer function should be set to a delay and scale factor:

T(z) = k z^{-d}   (6.35)

and the total aliasing effect to zero:

\sum_{m=1}^{M-1} A_m(z) X\left( z W_M^m \right) = 0.   (6.36)

If a subband system produces no reconstruction error, it is called a perfect


reconstruction (PR) system. If there is reconstruction error, but it is limited and
approximately zero, it is called a near-perfect reconstruction or nonperfect reconstruction (NPR) system. For subband coding, PR is desirable and NPR is the
minimal requirement.
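As a concrete check of (6.35) and (6.36), the sketch below evaluates T(z) and the aliasing transfer function for the classic two-channel Haar filter bank (M = 2, so W_2 = -1 and H_k(zW_2) = H_k(-z)); this particular choice of synthesis filters cancels the aliasing exactly and leaves a pure one-sample delay:

    import numpy as np

    s2 = np.sqrt(2)
    h0, h1 = np.array([1, 1]) / s2, np.array([1, -1]) / s2  # Haar analysis filters
    f0, f1 = np.array([1, 1]) / s2, np.array([-1, 1]) / s2  # synthesis: F0 = H1(-z), F1 = -H0(-z)
    alt = np.array([1, -1])                                 # multiply by (-1)^n to form H(-z)
    t = 0.5 * (np.convolve(f0, h0) + np.convolve(f1, h1))   # overall transfer function (6.32)
    a = 0.5 * (np.convolve(f0, h0 * alt) + np.convolve(f1, h1 * alt))  # aliasing term (6.33)
    print(t)   # [0. 1. 0.] -> T(z) = z^{-1}: a pure delay, so (6.35) holds with k = d = 1
    print(a)   # [0. 0. 0.] -> aliasing canceled, so (6.36) holds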

6.4 Polyphase Implementation


While the total sample rate of all subbands within a maximally decimated filter bank is made equal to that of the source signal through decimation and expansion, wasted computation is still an issue.
To illustrate this issue, let us look at the output of the decimators in Fig. 6.5. Each decimator keeps only one subband sample and discards the other M-1 subband samples, so the subband filtering used to generate the discarded M-1 subband samples is a total waste of computational resources and thus should be eliminated. This is achieved using the polyphase representation of subband filters and the noble identities.

6.4.1 Polyphase Representation


Polyphase representation is an important advancement in the theory of filter banks
that greatly simplifies the implementation structures of both analysis and synthesis
banks [3, 94].

6.4.1.1 Type-I Polyphase Representation


For any given integer M, an FIR or IIR filter given below

H(z) = \sum_{n=-\infty}^{\infty} h(n) z^{-n}   (6.37)


can always be written as

H(z) = \sum_{n=-\infty}^{\infty} h(nM) z^{-nM}
+ z^{-1} \sum_{n=-\infty}^{\infty} h(nM+1) z^{-nM}
+ \cdots
+ z^{-(M-1)} \sum_{n=-\infty}^{\infty} h(nM+M-1) z^{-nM}.   (6.38)

Denoting

p_k(n) = h(nM+k), \quad 0 \le k < M,   (6.39)

which is called a type-I polyphase component of h(n), and its Z-transform

P_k(z) = \sum_{n=-\infty}^{\infty} p_k(n) z^{-n}, \quad 0 \le k < M,   (6.40)

(6.38) may be written as

H(z) = \sum_{k=0}^{M-1} z^{-k} P_k\left( z^M \right).   (6.41)

The equation above is called the type-I polyphase representation of H(z) with respect to M, and its implementation is shown in Fig. 6.8.
The type-I polyphase representation in (6.41) may be further written as

H(z) = p^T\left( z^M \right) d(z),   (6.42)

Fig. 6.8 Type-I polyphase implementation of an arbitrary filter

where

d(z) = \left[ 1, z^{-1}, \ldots, z^{-(M-1)} \right]^T   (6.43)

is the delay chain and

p(z) = \left[ P_0(z), P_1(z), \ldots, P_{M-1}(z) \right]^T   (6.44)

is the type-I polyphase (component) vector.


The type-I polyphase representation of an arbitrary filter may be used to implement the analysis filter bank in Fig. 6.5. Using (6.42), the kth subband filter Hk .z/
may be written as
 
Hk .z/ D hTk zM d.z/; 0  k < M;

(6.45)

hk .z/ D hk;0 ; hk;1 ; : : : ; hk;M 1 T

(6.46)

where
are the type-I polyphase components of H_k(z). The analysis bank may then be represented by

h(z) = \begin{bmatrix} H_0(z) \\ H_1(z) \\ \vdots \\ H_{M-1}(z) \end{bmatrix}
= \begin{bmatrix} h_0^T(z^M) \, d(z) \\ h_1^T(z^M) \, d(z) \\ \vdots \\ h_{M-1}^T(z^M) \, d(z) \end{bmatrix}
= H\left( z^M \right) d(z),   (6.47)

where

H(z) = \begin{bmatrix} h_0^T(z) \\ h_1^T(z) \\ \vdots \\ h_{M-1}^T(z) \end{bmatrix}   (6.48)
is called a polyphase (component) matrix. This leads to the type-I polyphase implementation in Fig. 6.9 for the analysis filter bank in Fig. 6.5.

Fig. 6.9 Type-I polyphase implementation of a maximally decimated analysis filter bank
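Extracting the type-I polyphase components of an arbitrary impulse response is a simple reshape, and identity (6.41) can be spot-checked by evaluating both sides at an arbitrary point z_0. A sketch:

    import numpy as np

    def type1_polyphase(h, M):
        # p_k(n) = h(nM + k) per (6.39); zero-pad h to a multiple of M first
        hp = np.zeros(int(np.ceil(len(h) / M)) * M)
        hp[:len(h)] = h
        return hp.reshape(-1, M).T        # row k holds the kth polyphase component

    def ztrans(c, z):
        # evaluate a causal Z-transform sum c[n] z^{-n} at the point z
        return sum(cn * z ** (-n) for n, cn in enumerate(c))

    M, h = 4, np.random.randn(11)
    P = type1_polyphase(h, M)
    z0 = 1.3 + 0.4j
    lhs = ztrans(h, z0)
    rhs = sum(z0 ** (-k) * ztrans(P[k], z0 ** M) for k in range(M))  # right side of (6.41)
    assert np.isclose(lhs, rhs)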


6.4.1.2 Type-II Polyphase Representation


The type-II polyphase representation of a general filter H(z) with respect to M may be obtained from (6.41) through a variable change of k = M-1-n:

H(z) = \sum_{n=0}^{M-1} z^{-(M-1-n)} P_{M-1-n}\left( z^M \right)   (6.49)

= \sum_{n=0}^{M-1} z^{-(M-1-n)} Q_n\left( z^M \right),   (6.50)

where

Q_n(z) = P_{M-1-n}(z)   (6.51)

is a permutation of P_n(z). Figure 6.10 shows the type-II polyphase implementation of an arbitrary filter.
The type-II polyphase representation in (6.50) may be rewritten as

H(z) = z^{-(M-1)} \sum_{n=0}^{M-1} z^{n} Q_n\left( z^M \right),   (6.52)

so it can be expressed in vector form:

H(z) = z^{-(M-1)} d^T\left( z^{-1} \right) q\left( z^M \right),   (6.53)

where

q(z) = \left[ Q_0(z), Q_1(z), \ldots, Q_{M-1}(z) \right]^T   (6.54)

represents the type-II polyphase components.
Similar to the type-I polyphase representation of the analysis filter bank, a synthesis filter bank may be implemented using the type-II polyphase representation. The type-II polyphase components of the kth synthesis subband filter may be denoted as

f_k(z) = \left[ f_{k,0}(z), f_{k,1}(z), \ldots, f_{k,M-1}(z) \right]^T;   (6.55)

Fig. 6.10 Type-II polyphase implementation of an arbitrary filter


then the kth synthesis subband filter may be written as

F_k(z) = z^{-(M-1)} d^T\left( z^{-1} \right) f_k\left( z^M \right),   (6.56)

so the synthesis filter bank may be written as

f^T(z) = \left[ F_0(z), F_1(z), \ldots, F_{M-1}(z) \right]
= z^{-(M-1)} d^T\left( z^{-1} \right) \left[ f_0\left( z^M \right), f_1\left( z^M \right), \ldots, f_{M-1}\left( z^M \right) \right]
= z^{-(M-1)} d^T\left( z^{-1} \right) F\left( z^M \right),   (6.57)

where

F(z) = \left[ f_0(z), f_1(z), \ldots, f_{M-1}(z) \right].   (6.58)

This leads to the type-II polyphase implementation in Fig. 6.11.

6.4.2 Noble Identities


Now that we have polyphase implementations of both analysis and synthesis filter banks, we can move on to get rid of the M-1 wasteful filtering operations in both filter banks. We achieve this using the noble identities.

6.4.2.1 Decimation
The noble identity for decimation is shown in Fig. 6.12 and is proven below using (6.23):

Fig. 6.11 Type-II polyphase implementation of a maximally decimated synthesis filter bank

Fig. 6.12 Noble identity for decimation


Y_1(z) = \frac{1}{M} \sum_{k=0}^{M-1} U\left( z^{1/M} W_M^k \right)
= \frac{1}{M} \sum_{k=0}^{M-1} H\left( \left( z^{1/M} W_M^k \right)^M \right) X\left( z^{1/M} W_M^k \right)
= \left[ \frac{1}{M} \sum_{k=0}^{M-1} X\left( z^{1/M} W_M^k \right) \right] H(z)
= Y_2(z).   (6.59)

Applying the noble identity given above to the analysis bank in Fig. 6.9, we can move the decimators from the right side of the analysis polyphase matrix to its left side to arrive at the analysis filter bank in Fig. 6.13. With this new structure, the delay chain presents M source samples in correct succession simultaneously to the decimators. The decimators ensure that the subband filters operate only once for each block of M input samples, generating only one block of M subband samples. The sample rate is thus reduced by M times, but the data now move in parallel. The combination of the delay chain and the decimators essentially accomplishes a series-to-parallel conversion.
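The identity is also easy to confirm numerically: filtering by H(z^M) (an impulse response upsampled by M) and then decimating gives the same output as decimating first and filtering by H(z). A sketch:

    import numpy as np

    M = 4
    x = np.random.randn(64)
    h = np.random.randn(5)                # an arbitrary H(z)
    hu = np.zeros((len(h) - 1) * M + 1)
    hu[::M] = h                           # impulse response of H(z^M)
    y1 = np.convolve(x, hu)[::M]          # filter by H(z^M), then M-fold decimate
    y2 = np.convolve(x[::M], h)           # M-fold decimate, then filter by H(z)
    assert np.allclose(y1, y2)            # noble identity for decimation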

6.4.2.2 Interpolation
The noble identity for interpolation is shown in Fig. 6.14 and is easily proven below using (6.28):

Y_1(z) = U\left( z^M \right) = X\left( z^M \right) H\left( z^M \right) = Y_2(z).   (6.60)

Fig. 6.13 Efficient implementation of a maximally decimated analysis filter bank

Fig. 6.14 Noble identity for interpolation


Fig. 6.15 Efficient implementation of a maximally decimated synthesis filter bank

For the synthesis bank in Fig. 6.11, the expanders on the left side of the synthesis polyphase matrix may be moved to its right side to arrive at the filter bank in Fig. 6.15, due to the noble identity just proven. With this structure, the synthesis subband filters operate once for each block of subband samples, whose sample rate was reduced to 1/M of the source sample rate by the analysis filter bank. The expanders increase the sample rate by inserting M-1 zeros after each subband sample, making the sample rate the same as that of the source signal. The delay chain then delays their outputs in succession to align and interlace the M nonzero subband samples in the time domain so as to form an output stream which has the same sample rate as the source. The combination of the expanders and the delay chain essentially accomplishes a parallel-to-series conversion.

6.4.3 Efficient Subband Coder


Replacing the analysis and synthesis filter banks in the subband coder in Fig. 6.5
with the efficient architectures in Figs. 6.13 and 6.15, respectively, we arrive at the
efficient architecture for subband coding shown in Fig. 6.16.

6.4.4 Transform Coder


Compared with the subband coder structure shown in Fig. 6.16, the transform coder
in Fig. 5.1 is obviously a special case with
H(z) = T \quad \text{and} \quad F(z) = T^T.   (6.61)

In other words, the transform matrix is a polyphase matrix of order zero. The delay chain and the decimators in the analysis bank simply serve as a series-to-parallel converter that feeds the source samples to the transform matrix in blocks of M samples. The expanders and the delay chain in the synthesis bank serve as a parallel-to-series converter that interlaces the subband samples output from the transform matrix to form an output sequence. The orthogonality condition (5.9) ensures that the filter bank satisfies the PR condition. For this reason, the transform coefficients can be referred to as subband samples.

Fig. 6.16 An efficient polyphase structure for subband coding
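This equivalence can be checked directly: filtering the source by each row of an orthogonal transform matrix and decimating by M produces exactly the same subband samples as the familiar block-and-multiply transform. A sketch using the DCT-II matrix of (5.75) as T (any orthogonal matrix would do):

    import numpy as np

    M = 8
    n = np.arange(M)
    c = np.where(n == 0, 1 / np.sqrt(2), 1.0)[:, None]
    T = c * np.sqrt(2 / M) * np.cos((n + 0.5) * np.pi * n[:, None] / M)  # DCT-II as T

    x = np.random.randn(10 * M)
    # Filter-bank route: convolve with each row of T, keep one output per block of M
    fb = np.array([np.convolve(x, T[k])[M - 1::M] for k in range(M)])
    # Transform route: series-to-parallel blocks (time-reversed to match impulse responses)
    blocks = x.reshape(-1, M)[:, ::-1]
    tc = T @ blocks.T
    assert np.allclose(fb, tc)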

6.5 Optimal Bit Allocation and Coding Gain


Section 6.4.4 has shown that a transform coder is a special subband coder whose polyphase matrix is of order zero. A subband coder, on the other hand, can have a polyphase matrix of higher order. Other than this, there is no difference between the two, so it can be expected that the optimal bit allocation strategy and the method for calculating the optimal coding gain for subband coding should be similar to those for transform coding. This is shown in this section with the ideal subband coder.

6.5.1 Ideal Subband Coder


An ideal subband coder uses the ideal bandpass filters in Fig. 6.4 as both the analysis and synthesis subband filters. Since the bandwidth of each filter is limited to 2\pi/M, there is no overlapping between these subband filters. This offers optimal band separation between subbands in the sense that no energy in one subband is leaked into another, achieving optimal energy compaction.


Since shifting the frequency of any of these ideal bandpass filters by W_M^m creates a copy that does not overlap with the original one:

H_k(z) H_k\left( z W_M^m \right) = F_k(z) H_k\left( z W_M^m \right) = 0, \quad \text{for } m = 1, 2, \ldots, M-1,   (6.62)

the aliasing transfer functions (6.33) are zero:

A_m(z) = \frac{1}{M} \sum_{k=0}^{M-1} F_k(z) H_k\left( z W_M^m \right) = 0, \quad m = 1, 2, \ldots, M-1.   (6.63)

Therefore, condition (6.36) for zero total aliasing effect is satisfied.


On the other side, each of the bandpass filters has a uniform transfer function

H_k(z) = F_k(z) = \begin{cases} \sqrt{M}, & \text{passband}; \\ 0, & \text{stopband}; \end{cases}   (6.64)

so the overall transfer function defined in (6.32) is

T(z) = 1,   (6.65)

which obviously satisfies (6.35).


Therefore, the ideal subband system satisfies both conditions for perfect reconstruction and is thus a PR system.

6.5.2 Optimal Bit Allocation and Coding Gain


Let us consider the synthesis bank in Fig. 6.5 and assume that the quantization noise q_k(n) from the kth quantizer is zero-mean wide-sense stationary with a variance of \sigma_{q_k}^2 (fine quantization). After it passes through the expander, it is no longer wide-sense stationary because each q_k(n) is periodically interlaced with M-1 zeros. However, after it passes through the ideal passband filter F_k(z), it becomes wide-sense stationary again with a variance of \sigma_{q_k}^2/M [82]. Therefore, the total MSQE of \hat{x}(n) at the output of the synthesis filter bank is

\sigma_{q(\hat{x})}^2 = \frac{1}{M} \sum_{k=0}^{M-1} \sigma_{q_k}^2.   (6.66)

Let us now consider the analysis bank in Fig. 6.5. Since the decimator retains one sample out of every M source samples, its output has the same variance as its input, which, for the kth subband, is given by

\sigma_{y_k}^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| H_k(e^{j\omega}) \right|^2 S_{xx}(e^{j\omega}) \, d\omega.   (6.67)

Using (6.64) and denoting its passband as \Omega_k, we can write the equation above as

\sigma_{y_k}^2 = \frac{M}{2\pi} \int_{\Omega_k} S_{xx}(e^{j\omega}) \, d\omega.   (6.68)

Adding both sides of the equation above over all subbands, we obtain

\sum_{k=0}^{M-1} \sigma_{y_k}^2 = \frac{M}{2\pi} \sum_{k=0}^{M-1} \int_{\Omega_k} S_{xx}(e^{j\omega}) \, d\omega = \frac{M}{2\pi} \int_{-\pi}^{\pi} S_{xx}(e^{j\omega}) \, d\omega = M \sigma_x^2,   (6.69)

which leads to

\sigma_x^2 = \frac{1}{M} \sum_{k=0}^{M-1} \sigma_{y_k}^2.   (6.70)

Since (6.66) and (6.70) are the same as (5.22) and (5.37), respectively, all the derivations in Sect. 5.2 related to coding gain and optimal bit allocation apply to the ideal subband coder as well. In particular, we have the following coding gain:

G_{SBC} = \frac{\sigma_x^2}{\left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M}} = \frac{\frac{1}{M} \sum_{k=0}^{M-1} \sigma_{y_k}^2}{\left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M}},   (6.71)

and the optimal bit allocation strategy

r_k = R + 0.5 \log_2 \frac{\sigma_{y_k}^2}{\left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M}}, \quad \text{for } k = 0, 1, \ldots, M-1.   (6.72)

6.5.3 Asymptotic Coding Gain


When the number of subbands M is sufficiently large, each subband becomes sufficiently narrow, so the variance in (6.68) may be approximated by

\sigma_{y_k}^2 \approx \frac{M}{2\pi} |\Omega_k| \, S_{xx}(e^{j\omega_k}),   (6.73)

where |\Omega_k| denotes the width of \Omega_k and \omega_k is a frequency within \Omega_k. Since

|\Omega_k| = \frac{2\pi}{M},   (6.74)

(6.73) becomes

\sigma_{y_k}^2 \approx S_{xx}(e^{j\omega_k}).   (6.75)


The geometric mean used by the coding gain formula (6.71) may be rewritten as

\left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M} = \exp\left[ \ln \left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M} \right] = \exp\left( \frac{1}{M} \sum_{k=0}^{M-1} \ln \sigma_{y_k}^2 \right).   (6.76)

Dropping (6.75) into the equation above, we have

\left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M} \approx \exp\left[ \frac{1}{M} \sum_{k=0}^{M-1} \ln S_{xx}(e^{j\omega_k}) \right]
= \exp\left[ \frac{1}{2\pi} \sum_{k=0}^{M-1} \frac{2\pi}{M} \ln S_{xx}(e^{j\omega_k}) \right]
= \exp\left[ \frac{1}{2\pi} \sum_{k=0}^{M-1} \ln S_{xx}(e^{j\omega_k}) \, |\Omega_k| \right],   (6.77)

where (6.74) is used to obtain the last equality. As M \to \infty, the equation above becomes

\left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M} = \exp\left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega}) \, d\omega \right).   (6.78)

Dropping it back into the coding gain (6.71), we obtain

\lim_{M \to \infty} G_{SBC} = \frac{\sigma_x^2}{\exp\left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega}) \, d\omega \right)}   (6.79)

= \frac{\frac{1}{2\pi} \int_{-\pi}^{\pi} S_{xx}(e^{j\omega}) \, d\omega}{\exp\left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega}) \, d\omega \right)}   (6.80)

= \frac{1}{\gamma_x^2},   (6.81)

where \gamma_x^2 is the spectral flatness measure defined in (5.65). Therefore, the ideal subband coder approaches the same asymptotic optimal coding gain as the KLT (see (5.64)).

Chapter 7

Cosine-Modulated Filter Banks

Between the KLT transform coder and the ideal subband coder, there are many subband coders which offer great energy compaction capability at a reasonable implementation cost. Prominent among them are the cosine-modulated filter banks (CMFB), whose subband filters are derived from a prototype filter through cosine modulation.
The first advantage of a CMFB is that the implementation cost of both the analysis and synthesis banks is that of the prototype filter plus the overhead associated with the cosine modulation. For a CMFB with M bands and N taps per subband filter, the number of operations for the prototype filter is on the order of N and that for the cosine modulation, when implemented using a fast algorithm, is on the order of M \log_2 M, so the total number of operations is merely on the order of N + M \log_2 M. For comparison, the number of operations for a regular filter bank is on the order of M \times N.
The second advantage is associated with the design of the subband filters. Instead of designing all subband filters in a filter bank independently, which entails optimizing a total of M \times N coefficients, with a CMFB we only need to optimize the prototype filter, which has no more than N coefficients.
Early CMFBs are near-perfect reconstruction systems [6, 8, 50, 64, 81] in which only adjacent-subband aliasing is canceled, so the reconstructed signal at the output of the synthesis filter bank is only approximately equal to a delayed and scaled version of the signal input to the analysis filter bank. The same filter bank structure was later found to be capable of delivering perfect reconstruction if two additional constraints are imposed on the prototype filter [39, 40, 46–48, 77–79].

7.1 Cosine Modulation


The idea of a modulated filter bank was exemplified by the DFT bank discussed in Sect. 6.1.2, whose analysis filters are all derived from the prototype filter in (6.7) using DFT modulation (6.8). This leads to the implementation structure shown in Fig. 6.2, whose implementation cost is the delay line plus the DFT, which can be implemented using an FFT.


While the subband filters of the DFT bank are limited to just M taps (see (6.7)), it illustrates the basic idea of modulated filter banks and can easily be extended to accommodate subband filters with more than M taps through polyphase representation.
Even if the prototype filter of an extended DFT filter bank is real-valued, the modulated subband filters are generally complex-valued because the DFT modulation is complex-valued. Consequently, the subband samples of an extended DFT filter bank are complex-valued and are thus not amenable to subband coding. To obtain a real-valued modulated filter bank, the idea that extends the DFT to the DCT is followed: a 2M DFT is used to modulate a real-valued prototype filter to produce 2M complex subband filters and then subband filters symmetric with respect to the zero frequency are combined to obtain real-valued ones. This leads to the CMFB.
There is a little practical issue, as shown in Fig. 6.3, where the magnitude response of the prototype filter in the DFT bank is split at the zero frequency, leaving half of its bandwidth on the positive-frequency side and the other half on the negative-frequency side. Due to the periodicity of the DFT spectrum, it appears to have two subbands in the [0, 2\pi] frequency range, one starting at the zero frequency and the other ending at 2\pi. This is very different from the other modulated subbands, which are not split. This issue may be easily addressed by shifting the subband filters by \pi/2M. Afterwards, the subband filters whose center frequencies are symmetric with respect to the zero frequency are combined together to construct real subband filters.

7.1.1 Extended DFT Bank


The prototype filter H(z) given in (6.7) for the DFT bank in Sect. 6.1.2 is of length M. This can be extended by obtaining the type-I polyphase representation of the modulated subband filters (6.8). In particular, using the type-I polyphase representation (6.41), the subband filters of the DFT-modulated filter bank (6.8) may be written as

H_k(z) = H\left( z W_M^k \right) = \sum_{m=0}^{M-1} \left( z W_M^k \right)^{-m} P_m\left( \left( z W_M^k \right)^M \right)
= \sum_{m=0}^{M-1} W_M^{-km} z^{-m} P_m\left( z^M \right), \quad \text{for } k = 0, 1, \ldots, M-1,   (7.1)

where P_m(z^M) is the mth type-I polyphase component of H(z) with respect to M. Note that a subscript M is attached to W to emphasize that it is for an M-fold DFT:

W_M = e^{-j2\pi/M}.   (7.2)

Equation (7.1) leads to the implementation structure shown in Fig. 7.1.


Fig. 7.1 Extension of the DFT analysis filter bank to accommodate subband filters with more than M taps. \{P_m(z^M)\}_{m=0}^{M-1} are the type-I polyphase components of the prototype filter H(z) with respect to M. Note that the M subscript for the DFT matrix W_M is included to emphasize that each of its elements is a power of W_M and that it is an M \times M matrix

If the prototype filter used in (7.1) is reduced to (6.7):

P_m\left( z^M \right) = 1, \quad \text{for } m = 0, 1, \ldots, M-1,   (7.3)

then the filter bank in Fig. 7.1 degenerates to the DFT bank in Fig. 6.2.
There is obviously no explicit restriction on the length of the prototype filter in (7.1), so a generic N-tap FIR filter can be assumed:

H(z) = \sum_{n=0}^{N-1} h(n) z^{-n}.   (7.4)

This extension enables longer prototype filters, which can offer much better energy compaction capability than the 13 dB achieved by the DFT bank in Fig. 6.3.

7.1.2 2M-DFT Bank


Even if the coefficients of the prototype filter H(z) in Fig. 7.1 are real-valued, the modulated filters H_k(z) generally do not have real-valued coefficients because the DFT modulation is complex. Consequently, the subband samples output from these filters are complex. Since a complex sample actually consists of a real and an imaginary part, there are now 2M subband samples to be quantized and coded for each block of M real-valued input samples, amounting to a onefold increase. To avoid this problem, a modulation scheme that leads to real-valued subband samples is called for.
This can be achieved using an approach similar to the derivation of the DCT from the DFT: a real-valued prototype filter is modulated by a 2M DFT to produce 2M complex subband filters and then each pair of subband filters symmetric with respect to the zero frequency is combined to form a real-valued subband filter.


Fig. 7.2 2M-DFT analysis filter bank. \{P_m(z^{2M})\}_{m=0}^{2M-1} are the type-I polyphase components of the prototype filter H(z) with respect to 2M and W_{2M} is the 2M \times 2M DFT matrix

To arrive at the 2M DFT modulation, the polyphase representation (7.1) becomes

H_k(z) = H\left( z W_{2M}^k \right) = \sum_{m=0}^{2M-1} W_{2M}^{-km} z^{-m} P_m\left( z^{2M} \right), \quad \text{for } k = 0, 1, \ldots, 2M-1,   (7.5)

where

W_{2M} = e^{-j2\pi/2M} = e^{-j\pi/M},   (7.6)

and P_m(z^{2M}) is the mth type-I polyphase component of H(z) with respect to 2M. The implementation structure for such a filter bank is shown in Fig. 7.2.
The magnitude responses of the above 2M DFT bank are shown in Fig. 7.3 for M = 8. The prototype filter is again given by (6.7), with 2M = 16 taps, and its magnitude response is shown at the top of the figure. There are 2M = 16 subband filters, whose magnitude responses are shown at the bottom of the figure.
Since a filter with only real-valued coefficients has a frequency response that is conjugate-symmetric with respect to the zero frequency [67], each pair of subband filters in the above 2M DFT bank satisfying this condition can be combined to form a subband filter with real-valued coefficients. Since the frequency responses of both H_0(z), which is the prototype filter, and H_8(z) are themselves conjugate-symmetric with respect to the zero frequency, their coefficients are already real-valued and they cannot be combined with any other subband filters. The remaining subband filters, from H_1(z) to H_7(z) and from H_9(z) to H_{15}(z), can be combined to form a total of (2M-2)/2 = 7 real-valued subband filters. These combined subband filters, plus H_0(z) and H_8(z), give us a total of 7 + 2 = 9 combined subband filters.
Since the frequency response of the prototype filter H(z) is split at the zero frequency and that of H_8(z) is split at \pi and -\pi, their bandwidth is only half of that of the other combined subband filters, as shown at the bottom of Fig. 7.3. This results in a situation where two subbands have half bandwidth and the remaining subbands have full bandwidth. While this type of filter bank with unequal subband bandwidths can be made to work (see [77], for example), it is rather awkward for practical subband coding.


Fig. 7.3 Magnitude responses of the prototype filter of a 2M DFT filter bank (top) and of all its subband filters (bottom)

7.1.3 Frequency-Shifted DFT Bank


The problem of unequal subband bandwidth can be addressed by shifting the subband filters to the right by an additional amount of $\pi/2M$, so that (7.5) becomes
$$H_k(z) = H\left(zW_{2M}^{k+0.5}\right) = \sum_{m=0}^{2M-1}\left(zW_{2M}^{k+0.5}\right)^{-m} P_m\left(\left(zW_{2M}^{k+0.5}\right)^{2M}\right) = \sum_{m=0}^{2M-1}\left(zW_{2M}^{0.5}\right)^{-m} W_{2M}^{-km} P_m\left(-z^{2M}\right), \quad k = 0, 1, \ldots, 2M-1, \tag{7.7}$$
where $W_{2M}^{2M} = 1$ and $W_{2M}^{M} = -1$ were used. Equation (7.7) can be implemented using the structure shown in Fig. 7.4.
Figure 7.5 shows the magnitude responses of the filter bank above using the prototype filter given in (6.7) with 2M = 16 taps. They are the same magnitude responses given in Fig. 7.3, except for a frequency shift of $\pi/2M$. Now all subbands have the same bandwidth.
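The modulation in (7.5) and (7.7) is easy to experiment with numerically. The following MATLAB sketch builds both banks from a stand-in lowpass prototype (fir1, from the Signal Processing Toolbox, is used here in place of the book's prototype (6.7); any lowpass filter with cutoff around π/(2M) will do) and plots the shifted bank's magnitude responses, which should resemble Fig. 7.5:

% Sketch: 2M DFT-modulated subband filters, with and without the pi/(2M) shift
M = 8; N = 2*M; n = (0:N-1)';
h = fir1(N-1, 1/(2*M))';               % stand-in lowpass prototype, cutoff pi/(2M)
Hk = zeros(N, 2*M); Hs = zeros(N, 2*M);
for k = 0:2*M-1
    Hk(:,k+1) = h .* exp(1j*pi*k*n/M);        % h(n) W_{2M}^{-kn}, as in (7.5)
    Hs(:,k+1) = h .* exp(1j*pi*(k+0.5)*n/M);  % shifted by pi/(2M), as in (7.7)
end
w = linspace(-pi, pi, 1024);                  % frequency grid over [-pi, pi]
figure; hold on;
for k = 0:2*M-1                               % odd stacking: equal bandwidths
    plot(w/pi, 20*log10(abs(freqz(Hs(:,k+1), 1, w))));
end
xlabel('\omega/\pi'); ylabel('Amplitude (dB)');

Replacing Hs with Hk in the plotting loop shows the even stacking instead, with the two half-bandwidth subbands straddling ω = 0 and ω = ±π.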

Fig. 7.4 2M-DFT analysis filter bank with an additional frequency shift of $\pi/2M$ to the right. $\{P_m(z^{2M})\}_{m=0}^{2M-1}$ are the type-I polyphase components of $H(z)$ with respect to 2M and $\mathbf{W}_{2M}$ is the $2M \times 2M$ DFT matrix

Fig. 7.5 Magnitude response of a prototype filter shifted by $\pi/2M$ to the right (top). Magnitude responses of all subband filters modulated from such a filter using a 2M DFT (bottom)

7.1.4 CMFB
From Fig. 7.5, it is obvious that the frequency response of the $k$th subband filter is conjugate-symmetric to that of the $(2M-1-k)$th subband filter with respect to zero frequency (they are images of each other), so they are candidate pairs for combination into a real filter. Let us substitute (7.4) into the first equation of (7.7) to obtain


$$H_k(z) = \sum_{n=0}^{N-1} h(n)\left(zW_{2M}^{k+0.5}\right)^{-n} = \sum_{n=0}^{N-1} h(n)\, W_{2M}^{-(k+0.5)n}\, z^{-n} \tag{7.8}$$
and
$$H_{2M-1-k}(z) = \sum_{n=0}^{N-1} h(n)\left(zW_{2M}^{2M-1-k+0.5}\right)^{-n} = \sum_{n=0}^{N-1} h(n)\left(zW_{2M}^{-k-0.5}\right)^{-n} = \sum_{n=0}^{N-1} h(n)\, W_{2M}^{(k+0.5)n}\, z^{-n}. \tag{7.9}$$

Comparing the two equations above, we can see that the coefficients of $H_k(z)$ and $H_{2M-1-k}(z)$ are conjugates of each other, so the combined filter will have real coefficients.
When the pair above are actually combined, they are weighted by a unit-magnitude constant $c_k$:
$$A_k(z) = c_k H_k(z) + c_k^* H_{2M-1-k}(z) \tag{7.10}$$
to aid alias cancellation and elimination of phase distortion. In addition, a linear phase condition is imposed on the prototype filter:
$$h(n) = h(N-1-n) \tag{7.11}$$
and a mirror image condition on the synthesis filters:
$$s_k(n) = a_k(N-1-n), \tag{7.12}$$

where $a_k(n)$ and $s_k(n)$ are the impulse responses of the $k$th analysis and synthesis subband filters, respectively. After much derivation [93], we arrive at the following cosine-modulated analysis filter:
$$a_k(n) = 2h(n)\cos\left[\frac{\pi}{M}(k+0.5)\left(n - \frac{N-1}{2}\right) + \theta_k\right], \quad k = 0, 1, \ldots, M-1, \tag{7.13}$$
where the phase
$$\theta_k = (-1)^k\,\frac{\pi}{4}. \tag{7.14}$$


The cosine-modulated synthesis filter can be obtained from the analysis filter (7.13) using the mirror image relation (7.12):
$$s_k(n) = 2h(n)\cos\left[\frac{\pi}{M}(k+0.5)\left(n - \frac{N-1}{2}\right) - \theta_k\right], \quad k = 0, 1, \ldots, M-1. \tag{7.15}$$

The system of analysis and synthesis banks above completely eliminates phase distortion, so the overall transfer function $T(z)$ defined in (6.32) is linear phase. But amplitude distortion remains. Therefore, the CMFB is a nonperfect reconstruction system and is sometimes called a pseudo-QMF (quadrature mirror filter) bank.
Part of the amplitude distortion comes from incomplete alias cancellation: aliasing from only adjacent subbands is canceled, not from all subbands.
Note that, even though the linear phase condition (7.11) is imposed on the prototype filter, the analysis and synthesis filters generally do not have linear phase.
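Equations (7.13)-(7.15) translate directly into code. The following MATLAB sketch (the function name and interface are ours, not the book's) generates the M analysis and synthesis impulse responses from any linear-phase prototype:

% Sketch: pseudo-QMF analysis/synthesis filters per (7.13)-(7.15).
% h is a linear-phase prototype of length N; columns of A and S hold
% a_k(n) and s_k(n) for k = 0..M-1.
function [A, S] = cmfb_filters(h, M)
N = length(h); n = (0:N-1)';
A = zeros(N, M); S = zeros(N, M);
for k = 0:M-1
    theta = (-1)^k * pi/4;                  % phase (7.14)
    arg = (pi/M)*(k+0.5)*(n - (N-1)/2);
    A(:,k+1) = 2*h(:).*cos(arg + theta);    % analysis filter (7.13)
    S(:,k+1) = 2*h(:).*cos(arg - theta);    % synthesis filter (7.15)
end

One can verify numerically that each column of S is the time reverse of the corresponding column of A, i.e., the mirror image relation (7.12).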

7.2 Design of NPR Filter Banks


The CMFB given in (7.13) and (7.15) cannot deliver perfect reconstruction due
to incomplete aliasing cancellation and existence of amplitude distortion. But the
overall reconstruction error can be reduced to an acceptable level if the amplitude
distortion is properly controlled.
Amplitude distortion arises if the magnitude of the overall transfer function
jT .z/j is not exactly flat, so the problem may be posed as designing the prototype
filter in such a way that jT .z/j is flat or close to flat. It turns out that this can be
ensured if the following function is sufficiently flat: [93]:


jH.ej! /j2 C jH ej.!=M / j2  1;

for ! 2 0; =M :

(7.16)

This condition can be enforced through minimizing the following cost function:
$$\Phi_d(H(z)) = \int_0^{\pi/M}\left[|H(e^{j\omega})|^2 + |H(e^{j(\omega-\pi/M)})|^2 - 1\right]^2 d\omega. \tag{7.17}$$

In addition to the concern above over amplitude distortion, energy compaction is also of prominent importance for signal coding and other applications. To ensure this, all subband filters should have good stopband attenuation. Since all subband filters are shifted copies of the prototype filter, they all have the same amplitude shape as the prototype filter. Therefore, the optimization of stopband attenuation for all subband filters can be reduced to that of the prototype filter.
The nominal bandwidth of the prototype filter on the positive frequency axis is $\frac{\pi}{2M}$, so stopband attenuation can be optimized by minimizing the following cost function:


$$\Phi_s(H(z)) = \int_{\frac{\pi}{2M}+\epsilon}^{\pi} |H(e^{j\omega})|^2\, d\omega, \tag{7.18}$$
where $\epsilon$ controls the transition bandwidth and should be adjusted for a particular application.
Now both amplitude distortion and stopband attenuation can be optimized by
$$\min_{h(n)} \Phi(H(z)) = \alpha\,\Phi_d(H(z)) + (1-\alpha)\,\Phi_s(H(z)), \quad \text{subject to (7.11)}, \tag{7.19}$$
where $\alpha$ controls the trade-off between amplitude distortion and stopband attenuation. See [76] for standard optimization procedures that can be applied.
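As a concrete starting point, the composite objective (7.17)-(7.19) can be evaluated on a dense frequency grid and handed to a general-purpose optimizer. The sketch below is one possible formulation (the function name, the grid density, and the optimizer choice are ours, not prescriptions from the text); it optimizes only the first half of the prototype and mirrors it to enforce the linear phase condition (7.11):

% Sketch: NPR prototype design objective, cf. (7.17)-(7.19)
function phi = npr_cost(h_half, M, alpha, eps_t)
h = [h_half(:); flipud(h_half(:))];        % enforce h(n) = h(N-1-n), (7.11)
w = linspace(0, pi, 4096);                 % dense frequency grid
Hw = abs(freqz(h, 1, w)).^2;               % |H(e^{jw})|^2 on the grid
wp = w(w <= pi/M);                         % flatness band [0, pi/M]
H1 = interp1(w, Hw, wp);
H2 = interp1(w, Hw, abs(wp - pi/M));       % |H(e^{j(w-pi/M)})|^2; h is real
flat = trapz(wp, (H1 + H2 - 1).^2);        % amplitude-distortion cost (7.17)
ws = w(w >= pi/(2*M) + eps_t);             % stopband starts at pi/(2M)+eps
stop = trapz(ws, interp1(w, Hw, ws));      % stopband-energy cost (7.18)
phi = alpha*flat + (1 - alpha)*stop;       % weighted trade-off (7.19)

A call such as fminsearch(@(v) npr_cost(v, M, 0.5, 0.2*pi/M), h0) with a windowed-lowpass initial guess h0 is a plausible, if unsophisticated, stand-in for the procedures of [76].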

7.3 Perfect Reconstruction


The CMFB in (7.13) and (7.15) becomes a perfect reconstruction system when aliasing is completely canceled and amplitude distortion eliminated. Toward this end, we first impose the following length constraint on the prototype filter:
$$N = 2mM, \tag{7.20}$$
where $m$ is a positive integer. Then the CMFB is a perfect reconstruction system if and only if the polyphase components of the prototype filter satisfy the following pairwise power complementary conditions [40, 93]:
$$\tilde{P}_k(z)P_k(z) + \tilde{P}_{M+k}(z)P_{M+k}(z) = \beta, \quad k = 0, 1, \ldots, M-1, \tag{7.21}$$
where $\beta$ is a positive number.


The tilde notation applied to a rational Z-transform function $H(z)$ means taking the complex conjugate of all its coefficients and replacing $z$ with $z^{-1}$. For example, if
$$H(z) = a + bz^{-1} + cz^{-2}, \tag{7.22}$$
then
$$\tilde{H}(z) = a^* + b^*z + c^*z^2. \tag{7.23}$$
It is intended to effect complex conjugation when applied to a frequency response function:
$$\tilde{H}(e^{j\omega}) = H^*(e^{j\omega}). \tag{7.24}$$
When applied to a matrix of Z-transform functions $\mathbf{H}(z) = [H_{i,j}(z)]$, a transpose operation is also implied:
$$\tilde{\mathbf{H}}(z) = [\tilde{H}_{i,j}(z)]^T. \tag{7.25}$$


7.4 Design of PR Filter Banks


The method for designing a PR prototype filter is similar to that for the NPR
prototype filter discussed in Sect. 7.2, the difference is that the amplitude distortion is now eliminated by the power complementary conditions (7.21), so the design
problem is focused on energy compaction:
Z 
.H.z// D
jH.ej! /j2 d!;
minh.n/

C
(7.26)
2M
Subject to (7.11) and (7.21):
While the above minimization step may be straight-forward by itself, the difficulty
lies in the imposition of the power-complementary constraints (7.21) and the linear
phase condition (7.11).

7.4.1 Lattice Structure


One approach to impose the power-complementary constraints (7.21) during the
minimization process is to implement the power-complementary pairs of polyphase
components using a cascade of lattice structures.
7.4.1.1 Paraunitary Systems
Toward this end, let us write each power-complementary pair as the following $2 \times 1$ transfer matrix or system:
$$\mathbf{P}_k(z) = \begin{bmatrix} P_k(z) \\ P_{M+k}(z) \end{bmatrix}, \quad k = 0, 1, \ldots, M-1; \tag{7.27}$$
then the power complementary condition (7.21) may be rewritten as
$$\tilde{\mathbf{P}}_k(z)\mathbf{P}_k(z) = \beta, \quad k = 0, 1, \ldots, M-1, \tag{7.28}$$

which means that the $2 \times 1$ system $\mathbf{P}_k(z)$ is paraunitary. Therefore, the power complementary condition (7.21) is equivalent to the condition that $\mathbf{P}_k(z)$ is paraunitary.
In general terms, an $m \times n$ rational transfer matrix or system $\mathbf{H}(z)$ is called paraunitary if
$$\tilde{\mathbf{H}}(z)\mathbf{H}(z) = \beta\,\mathbf{I}_n, \tag{7.29}$$
where $\mathbf{I}_n$ is the $n \times n$ identity matrix. It is obviously necessary that $m \geq n$; otherwise, the rank of $\mathbf{H}(z)$ would be less than $n$. If $m = n$, i.e., the transfer matrix is square, the transfer system is further referred to as unitary.


7.4.1.2 Givens Rotation


As an example, consider the Givens rotation described by the following transfer matrix [20, 22]:
$$\mathbf{G}(\theta) = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}, \tag{7.30}$$
where $\theta$ is a real angle. A flowgraph for this transfer matrix is shown in Fig. 7.6. It can be easily verified that
$$\mathbf{G}^T(\theta)\mathbf{G}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} = \mathbf{I}_2, \tag{7.31}$$
so it is unitary.
A geometric interpretation of Givens rotation is that it rotates an input vector clockwise by $\theta$. In particular, if an input $\mathbf{A} = [r\cos\phi,\ r\sin\phi]^T$ with an angle of $\phi$ is rotated clockwise by an angle of $\theta$, the output vector has an angle of $\phi-\theta$ and is given by
$$\begin{bmatrix} r\cos(\phi-\theta) \\ r\sin(\phi-\theta) \end{bmatrix} = \begin{bmatrix} r\cos\phi\cos\theta + r\sin\phi\sin\theta \\ r\sin\phi\cos\theta - r\cos\phi\sin\theta \end{bmatrix} = \mathbf{G}(\theta)\begin{bmatrix} r\cos\phi \\ r\sin\phi \end{bmatrix}. \tag{7.32}$$

7.4.1.3 Delay Matrix


Let us consider another $2 \times 2$ transfer matrix:
$$\mathbf{D}(z) = \begin{bmatrix} 1 & 0 \\ 0 & z^{-1} \end{bmatrix}, \tag{7.33}$$
which is a simple $2 \times 2$ delay system and is shown in Fig. 7.7. It is unitary because
$$\tilde{\mathbf{D}}(z)\mathbf{D}(z) = \begin{bmatrix} 1 & 0 \\ 0 & z \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & z^{-1} \end{bmatrix} = \mathbf{I}_2. \tag{7.34}$$

Fig. 7.6 The Givens rotation

Fig. 7.7 A simple $2 \times 2$ delay system

7.4.1.4 Rotation Vector


The following simple $2 \times 1$ transfer matrix:
$$\mathbf{R}(\theta) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} \tag{7.35}$$
is a simple rotation vector. It is paraunitary because
$$\mathbf{R}^T(\theta)\mathbf{R}(\theta) = \begin{bmatrix} \cos\theta & \sin\theta \end{bmatrix}\begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} = \mathbf{I}_1. \tag{7.36}$$
Its flowgraph is shown in Fig. 7.8.

7.4.1.5 Cascade of Paraunitary Matrices


An important property of paraunitary matrices is that a cascade of paraunitary matrices is also paraunitary. In particular, if $\mathbf{H}_1(z)$ and $\mathbf{H}_2(z)$ are paraunitary, then $\mathbf{H}(z) = \mathbf{H}_1(z)\mathbf{H}_2(z)$ is also paraunitary. This is because
$$\tilde{\mathbf{H}}(z)\mathbf{H}(z) = \tilde{\mathbf{H}}_2(z)\tilde{\mathbf{H}}_1(z)\mathbf{H}_1(z)\mathbf{H}_2(z) = \beta^2\mathbf{I}. \tag{7.37}$$
The result above can obviously be extended to a cascade of any number of paraunitary systems.
Using this property we can build more complex $2 \times 1$ paraunitary systems of arbitrary order by cascading the elementary paraunitary transfer matrices discussed above. One such example is the lattice structure shown in Fig. 7.9. It has $N-1$ delay subsystems and $N-1$ Givens rotations. Its transfer function may be written as
$$\mathbf{P}(z) = \left[\prod_{n=N-1}^{1}\mathbf{G}(\theta_n)\mathbf{D}(z)\right]\mathbf{R}(\theta_0). \tag{7.38}$$
It has a parameter set $\{\theta_n\}_{n=0}^{N-1}$ and $N-1$ delay units, so it represents a $2 \times 1$ real-coefficient FIR system of order $N-1$. It was shown that any $2 \times 1$ real-coefficient FIR paraunitary system of order $N-1$ may be factorized by such a lattice structure [93].

Fig. 7.8 A simple $2 \times 1$ rotation system

Fig. 7.9 A cascaded $2 \times 1$ paraunitary system
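The paraunitarity of the cascade is easy to check numerically. The following MATLAB sketch (with arbitrary example angles of our own choosing) builds the two polynomial branches of (7.38) and verifies on the unit circle that $|P_0(e^{j\omega})|^2 + |P_1(e^{j\omega})|^2$ is constant:

% Sketch: build the 2x1 lattice (7.38) and verify power complementarity
theta = [0.3, -0.7, 1.1];                 % theta(1) plays the role of theta_0
P = [cos(theta(1)); sin(theta(1))];       % rotation vector R(theta_0), (7.35)
for i = 2:length(theta)
    G = [cos(theta(i)) sin(theta(i)); -sin(theta(i)) cos(theta(i))];
    % D(z) delays the second branch: pad the coefficient rows accordingly
    P = G * [P(1,:) 0; 0 P(2,:)];         % rows hold coefficients of z^{-n}
end
w = linspace(0, 2*pi, 256);
E = exp(-1j*(0:size(P,2)-1)' * w);        % z^{-n} evaluated on the unit circle
pow = abs(P(1,:)*E).^2 + abs(P(2,:)*E).^2;
max(abs(pow - pow(1)))                    % ~0: power complementary, cf. (7.28)

No matter how the angles are perturbed, the property holds by construction, which is exactly why the lattice is a convenient parameterization for constrained optimization.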

7.4.1.6 Power-Complementary Condition


Let us apply the result above to the $2 \times 1$ transfer matrices $\mathbf{P}_k(z)$ defined in (7.27) to enforce the power-complementary condition (7.21). Due to (7.20), both polyphase components $P_k(z)$ and $P_{M+k}(z)$ with respect to 2M have an order of $m-1$, so $\mathbf{P}_k(z)$ has an order of $m-1$. Due to (7.38), it can be factorized as follows:
$$\mathbf{P}_k(z) = \left[\prod_{n=m-1}^{1}\mathbf{G}(\theta_n^k)\mathbf{D}(z)\right]\mathbf{R}(\theta_0^k), \quad k = 0, 1, \ldots, M-1. \tag{7.39}$$
Since this lattice structure is guaranteed to be paraunitary, the parameter set
$$\{\theta_n^k\}_{n=0}^{m-1} \tag{7.40}$$
can be arbitrarily adjusted to minimize (7.26) without violating the power complementary condition. Since there are $M$ such systems, the total number of free parameters $\theta_n^k$ that can be varied for optimization is reduced to $mM$ from $2mM$.

7.4.2 Linear Phase


When the linear phase condition (7.11) is imposed, the number of free parameters above that are allowed to be varied for optimization will be further reduced.


To see this, let us first represent the linear phase condition (7.11) in the Z-transform domain:
$$H(z) = \sum_{n=0}^{N-1} h(n)z^{-n} = z^{-(N-1)}\sum_{m=0}^{N-1} h(N-1-m)z^{m} = z^{-(N-1)}\sum_{m=0}^{N-1} h(m)z^{m} = z^{-(N-1)}\tilde{H}(z), \tag{7.41}$$

where a variable change of $n = N-1-m$ is used to arrive at the second equality, the linear phase condition (7.11) is used for the third, and the assumption that $h(n)$ is real-valued is used for the fourth.
The type-I polyphase representation of the prototype filter with respect to 2M is
$$H(z) = \sum_{k=0}^{2M-1} z^{-k} P_k(z^{2M}). \tag{7.42}$$

Substituting it into (7.41), we obtain
$$\begin{aligned} \tilde{H}(z) &= \sum_{k=0}^{2M-1} z^{N-1-k} P_k(z^{2M}) \\ &\overset{n=2M-1-k}{=} \sum_{k=0}^{2M-1} z^{N+k-2M} P_{2M-1-k}(z^{2M}) \\ &\overset{N=2mM}{=} \sum_{k=0}^{2M-1} z^{k} z^{2M(m-1)} P_{2M-1-k}(z^{2M}). \end{aligned} \tag{7.43}$$
From (7.42) we also have
$$\tilde{H}(z) = \sum_{k=0}^{2M-1} z^{k}\tilde{P}_k(z^{2M}). \tag{7.44}$$
Comparing the last two equations, we have
$$\tilde{P}_k(z^{2M}) = z^{2M(m-1)} P_{2M-1-k}(z^{2M}) \tag{7.45}$$
or
$$\tilde{P}_k(z) = z^{m-1} P_{2M-1-k}(z). \tag{7.46}$$
Therefore, half of the 2M polyphase components are completely determined by the other half due to the linear phase condition.

7.4.3 Free Optimization Parameters


Now there are two sets of constraints on the prototype filter coefficients: the power complementary condition ties polyphase components $P_k(z)$ and $P_{M+k}(z)$ through the lattice structure, while the linear phase condition ties $P_k(z)$ and $P_{2M-1-k}(z)$ through (7.46). Intuitively, this should leave us with roughly one quarter of the polyphase components that can be freely optimized.
7.4.3.1 Even M

Let us first consider the case where M is even, i.e., M/2 is an integer. Each pair drawn from the following two sets of polyphase components:
$$P_0(z), P_1(z), \ldots, P_{M/2-1}(z) \tag{7.47}$$
and
$$P_M(z), P_{M+1}(z), \ldots, P_{3M/2-1}(z) \tag{7.48}$$
can be used to form the $2 \times 1$ system in (7.27), which in turn can be represented by the lattice structure in (7.39) with a parameter set given in (7.40). Since each parameter set has $m$ free parameters, the total number of free parameters is $mM/2$.
The remaining polyphase components can be derived from the two sets above using the linear phase condition (7.46). In particular, the set of polyphase components in (7.47) determines the following set:
$$P_{2M-1}(z), P_{2M-2}(z), \ldots, P_{3M/2}(z) \tag{7.49}$$
and (7.48) determines
$$P_{M-1}(z), P_{M-2}(z), \ldots, P_{M/2}(z), \tag{7.50}$$
respectively.


7.4.3.2 Odd M

When M is odd, i.e., (M-1)/2 is an integer, the scheme above is still valid, except for $P_{(M-1)/2}$ and $P_{M+(M-1)/2}$. In particular, each pair drawn from the following two sets of polyphase components:
$$P_0(z), P_1(z), \ldots, P_{(M-1)/2-1}(z) \tag{7.51}$$
and
$$P_M(z), P_{M+1}(z), \ldots, P_{(3M-1)/2-1}(z) \tag{7.52}$$
can be used to form the $2 \times 1$ system in (7.27), which in turn can be represented by the lattice structure in (7.39) with a parameter set given in (7.40). Since each parameter set has $m$ free parameters, the total number of free parameters is $m(M-1)/2$.
The linear phase condition (7.46) in turn causes (7.51) to determine
$$P_{2M-1}(z), P_{2M-2}(z), \ldots, P_{(3M+1)/2}(z) \tag{7.53}$$
and (7.52) to determine
$$P_{M-1}(z), P_{M-2}(z), \ldots, P_{(M+1)/2}(z), \tag{7.54}$$
respectively.
Apparently, both $P_{(M-1)/2}$ and $P_{M+(M-1)/2}$ are missing from the lists above. The underlying reason is that both the power-complementary and linear-phase conditions apply to them simultaneously. In particular, the linear phase condition (7.46) requires
$$\tilde{P}_{(M-1)/2}(z) = z^{m-1} P_{(3M-1)/2}(z). \tag{7.55}$$
This causes the power complementary condition (7.21) to become
$$2\tilde{P}_{(M-1)/2}(z)P_{(M-1)/2}(z) = \beta, \tag{7.56}$$
which leads to
$$P_{(M-1)/2}(z) = \sqrt{0.5\beta}\, z^{-\Delta}, \tag{7.57}$$
where $\Delta$ is a delay. Since $H(z)$ is low-pass with a cutoff frequency of $\pi/2M$, it can be shown that the only acceptable choice of $\Delta$ is [93]
$$\Delta = \begin{cases} \frac{m-1}{2}, & \text{if } m \text{ is odd;} \\ \frac{m}{2}, & \text{if } m \text{ is even.} \end{cases} \tag{7.58}$$
Since $\beta$ is a scale factor for the whole system, $P_{(M-1)/2}$ is now completely determined. In addition, $P_{(3M-1)/2}(z)$ can be derived from it using (7.55):
$$P_{(3M-1)/2}(z) = z^{-(m-1)}\tilde{P}_{(M-1)/2}(z) = \sqrt{0.5\beta}\, z^{\Delta-(m-1)}. \tag{7.59}$$


7.5 Efficient Implementation


Direct implementation of either the analysis filter bank (7.13) or the synthesis filter bank (7.15) requires operations on the order of $M \cdot N$. This can be significantly reduced to the order of $N + M\log_2 M$ by utilizing the polyphase representation and cosine modulation. There is a little variation in the actual implementation structures, depending on whether $m$ is even or odd, and both cases are presented in Sects. 7.5.1 and 7.5.2 [93].

7.5.1 Even m
When m is even, the analysis filter bank, represented by the following vector:
$$\mathbf{a}(z) = \begin{bmatrix} A_0(z) \\ A_1(z) \\ \vdots \\ A_{M-1}(z) \end{bmatrix}, \tag{7.60}$$
may be written as [93]
$$\mathbf{a}(z) = \sqrt{M}\,\mathbf{D}_c\mathbf{C}\begin{bmatrix} \mathbf{I}-\mathbf{J} & -\mathbf{I}-\mathbf{J} \end{bmatrix}\begin{bmatrix} \mathbf{P}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_1(z^{2M}) \end{bmatrix}\begin{bmatrix} \mathbf{d}(z) \\ z^{-M}\mathbf{d}(z) \end{bmatrix}, \tag{7.61}$$

where
$$[\mathbf{D}_c]_{kk} = \cos[(k+0.5)m\pi], \quad k = 0, 1, \ldots, M-1, \tag{7.62}$$
is an $M \times M$ diagonal matrix, $\mathbf{C}$ is the DCT-IV matrix given in (5.79),
$$[\mathbf{J}]_{k,\,M-1-k} = 1, \quad k = 0, 1, \ldots, M-1, \text{ with all other entries zero,} \tag{7.63}$$
is the reversal or anti-diagonal matrix,
$$[\mathbf{P}_0(z^{2M})]_{kk} = P_k(z^{2M}), \quad k = 0, 1, \ldots, M-1, \tag{7.64}$$
is the diagonal matrix consisting of the first M polyphase components of the prototype filter, and
$$[\mathbf{P}_1(z^{2M})]_{kk} = P_{M+k}(z^{2M}), \quad k = 0, 1, \ldots, M-1, \tag{7.65}$$
is the diagonal matrix consisting of the last M polyphase components of the prototype filter.
For an even m, (7.62) becomes
$$[\mathbf{D}_c]_{kk} = \cos(0.5m\pi) = (-1)^{0.5m}, \quad k = 0, 1, \ldots, M-1, \tag{7.66}$$
so
$$\mathbf{D}_c = (-1)^{0.5m}\mathbf{I}. \tag{7.67}$$
Therefore, the analysis bank may be written as
$$\mathbf{a}(z) = (-1)^{0.5m}\sqrt{M}\,\mathbf{C}\begin{bmatrix} \mathbf{I}-\mathbf{J} & -\mathbf{I}-\mathbf{J} \end{bmatrix}\begin{bmatrix} \mathbf{P}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_1(z^{2M}) \end{bmatrix}\begin{bmatrix} \mathbf{d}(z) \\ z^{-M}\mathbf{d}(z) \end{bmatrix}. \tag{7.68}$$
The filter bank above may be implemented using the structure shown in Fig. 7.10. It is obvious that the major burdens of calculation are the polyphase filtering, which entails operations on the order of N, and the $M \times M$ DCT-IV, which needs operations on the order of $M\log_2 M$ when implemented using a fast algorithm, so the total number of operations is on the order of $N + M\log_2 M$.

Fig. 7.10 Cosine-modulated analysis filter bank implemented using DCT-IV. The prototype filter has a length of N = 2mM with an even m. $P_k(z^{2M})$ is the kth polyphase component of the prototype filter with respect to 2M. A scale factor of $(-1)^{0.5m}\sqrt{M}$ is omitted
The synthesis filter bank is obtained from the analysis bank using (7.12), which becomes
$$S_k(z) = z^{-(N-1)}\tilde{A}_k(z) \tag{7.69}$$
due to (7.41). Denoting
$$\mathbf{s}^T(z) = [S_0(z), S_1(z), \ldots, S_{M-1}(z)], \tag{7.70}$$
the equation above becomes
$$\mathbf{s}^T(z) = z^{-(N-1)}\tilde{\mathbf{a}}^T(z). \tag{7.71}$$
Substituting (7.68) into the above and using (7.20), we obtain
$$\begin{aligned} \mathbf{s}^T(z) &= z^{-(2mM-1)}\left[\tilde{\mathbf{d}}^T(z) \;\; z^{M}\tilde{\mathbf{d}}^T(z)\right]\begin{bmatrix} \tilde{\mathbf{P}}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \tilde{\mathbf{P}}_1(z^{2M}) \end{bmatrix}\begin{bmatrix} \mathbf{I}-\mathbf{J} \\ -\mathbf{I}-\mathbf{J} \end{bmatrix}\mathbf{C}\sqrt{M}(-1)^{0.5m} \\ &= \left[z^{-2M+1}\tilde{\mathbf{d}}^T(z) \;\; z^{-M+1}\tilde{\mathbf{d}}^T(z)\right]z^{-2M(m-1)}\begin{bmatrix} \tilde{\mathbf{P}}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \tilde{\mathbf{P}}_1(z^{2M}) \end{bmatrix}\begin{bmatrix} \mathbf{I}-\mathbf{J} \\ -\mathbf{I}-\mathbf{J} \end{bmatrix}\mathbf{C}\sqrt{M}(-1)^{0.5m}. \end{aligned} \tag{7.72}$$

Due to (7.46), we have
$$z^{-2M(m-1)}\begin{bmatrix} \tilde{\mathbf{P}}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \tilde{\mathbf{P}}_1(z^{2M}) \end{bmatrix} = \operatorname{diag}\left\{P_{2M-1}(z^{2M}), \ldots, P_M(z^{2M}), P_{M-1}(z^{2M}), \ldots, P_0(z^{2M})\right\} = \begin{bmatrix} \mathbf{J}\mathbf{P}_1(z^{2M})\mathbf{J} & \mathbf{0} \\ \mathbf{0} & \mathbf{J}\mathbf{P}_0(z^{2M})\mathbf{J} \end{bmatrix}, \tag{7.73}$$

Fig. 7.11 Cosine-modulated synthesis filter bank implemented using DCT-IV. The prototype filter has a length of N = 2mM with an even m. $P_k(z^{2M})$ is the kth polyphase component of the prototype filter with respect to 2M. A scale factor of $(-1)^{0.5m}\sqrt{M}$ is omitted

where the last step uses the following property of the reversal matrix:
$$\mathbf{J}\operatorname{diag}\{x_0, x_1, \ldots, x_{M-1}\}\mathbf{J} = \operatorname{diag}\{x_{M-1}, \ldots, x_1, x_0\}. \tag{7.74}$$
Therefore, the synthesis bank becomes
$$\mathbf{s}^T(z) = \left[z^{-2M+1}\tilde{\mathbf{d}}^T(z) \;\; z^{-M+1}\tilde{\mathbf{d}}^T(z)\right]\begin{bmatrix} \mathbf{J}\mathbf{P}_1(z^{2M})\mathbf{J} & \mathbf{0} \\ \mathbf{0} & \mathbf{J}\mathbf{P}_0(z^{2M})\mathbf{J} \end{bmatrix}\begin{bmatrix} \mathbf{I}-\mathbf{J} \\ -\mathbf{I}-\mathbf{J} \end{bmatrix}\mathbf{C}\sqrt{M}(-1)^{0.5m}, \tag{7.75}$$
which may be implemented by the structure in Fig. 7.11.

7.5.2 Odd m
When m is odd, both the analysis and synthesis banks are essentially the same as when m is even, with only minor differences. In particular, the analysis bank is given by [93]
$$\mathbf{a}(z) = \sqrt{M}\,\mathbf{D}_s\mathbf{C}\begin{bmatrix} \mathbf{I}+\mathbf{J} & \mathbf{I}-\mathbf{J} \end{bmatrix}\begin{bmatrix} \mathbf{P}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_1(z^{2M}) \end{bmatrix}\begin{bmatrix} \mathbf{d}(z) \\ z^{-M}\mathbf{d}(z) \end{bmatrix}, \tag{7.76}$$
where
$$[\mathbf{D}_s]_{kk} = \sin[(k+0.5)m\pi] = (-1)^{\frac{m-1}{2}}(-1)^k, \quad k = 0, 1, \ldots, M-1, \tag{7.77}$$
is a diagonal matrix with alternating 1 and -1 on the diagonal. Since this matrix only changes the signs of alternating subband samples and these sign changes are reversed upon input to the synthesis bank, the implementation of this matrix can be omitted in both the analysis and synthesis banks.
Similar to the case with even m, the corresponding synthesis filter bank may be obtained as
$$\mathbf{s}^T(z) = \left[z^{-2M+1}\tilde{\mathbf{d}}^T(z) \;\; z^{-M+1}\tilde{\mathbf{d}}^T(z)\right]\begin{bmatrix} \mathbf{J}\mathbf{P}_1(z^{2M})\mathbf{J} & \mathbf{0} \\ \mathbf{0} & \mathbf{J}\mathbf{P}_0(z^{2M})\mathbf{J} \end{bmatrix}\begin{bmatrix} \mathbf{I}+\mathbf{J} \\ \mathbf{I}-\mathbf{J} \end{bmatrix}\mathbf{C}\,\mathbf{D}_s\sqrt{M}. \tag{7.78}$$
The analysis and synthesis banks can be implemented by the structures shown in Figs. 7.12 and 7.13, respectively.

Fig. 7.12 Cosine-modulated analysis filter bank implemented using DCT-IV. The prototype filter has a length of N = 2mM with an odd m. $P_k(z^{2M})$ is the kth polyphase component of the prototype filter with respect to 2M. The diagonal matrix $\mathbf{D}_s$ only negates alternating subbands, hence can be omitted together with that in the synthesis bank. A scale factor of $\sqrt{M}$ is omitted

Fig. 7.13 Cosine-modulated synthesis filter bank implemented using DCT-IV. The prototype filter has a length of N = 2mM with an odd m. $P_k(z^{2M})$ is the kth polyphase component of the prototype filter with respect to 2M. The diagonal matrix $\mathbf{D}_s$ only negates alternating subbands, hence can be omitted together with that in the analysis bank. A scale factor of $\sqrt{M}$ is omitted

7.6 Modified Discrete Cosine Transform


Modified discrete cosine transform (MDCT) is a special case of CMFB when
m D 1. It deserves special discussion here because of its wide application in audio coding. The first PR CMFB is the time-domain aliasing cancellation (TDAC)
which was obtained from the 2M -DFT discussed in Sect. 7.1.2 without the =2M
frequency shifting (even channel stacking) [77]. Referred to as evenly-stacked
TDAC, it has M  1 full-bandwidth channels and 2 half-bandwidth channels, for a
total of M C 1 channels. This issue was latter addressed using the =2M frequency
shifting, which is called odd channel stacking, and the resultant filter bank is called
oddly stacked TDAC [78].

7.6.1 Window Function


With m = 1, the polyphase components of the prototype filter with respect to 2M become the coefficients of the filter itself:
$$P_k(z) = h(k), \quad k = 0, 1, \ldots, 2M-1, \tag{7.79}$$
which makes the filter bank intuitive to understand. For example, the power-complementary condition (7.21) becomes
$$h^2(n) + h^2(M+n) = \beta, \quad n = 0, 1, \ldots, M-1. \tag{7.80}$$
Also, the polyphase filtering stage in both Figs. 7.12 and 7.13 becomes simply applying the prototype filter coefficients, so the prototype filter is often referred to as the window function or simply the window. Since the block size is M and the window size is 2M, there is a half-window or one-block overlap between blocks, as shown in Fig. 7.15.
The window function can be designed using the procedure discussed in Sect. 7.4. Due to m = 1, the lattice structure (7.39) degenerates into the rotation vector (7.35), so the design problem is much simpler.
A window widely used in audio coding is the following half-sine window or simply sine window:
$$h(n) = \sin\left[(n+0.5)\frac{\pi}{2M}\right]. \tag{7.81}$$

It satisfies the power-complementary condition (7.80) because
$$h^2(n) + h^2(M+n) = \sin^2\left[(n+0.5)\frac{\pi}{2M}\right] + \sin^2\left[\frac{\pi}{2} + (n+0.5)\frac{\pi}{2M}\right] = \sin^2\left[(n+0.5)\frac{\pi}{2M}\right] + \cos^2\left[(n+0.5)\frac{\pi}{2M}\right] = 1. \tag{7.82}$$
It is the unique window that allows perfect DC reconstruction using only the low-pass subband, i.e., subband zero [45, 49]. This was shown to be a necessary condition for maximum asymptotic coding gain for an AR(1) signal with the correlation coefficient approaching the value of one.

7.6.2 MDCT
The widely used MDCT is actually given by the following synthesis filter:
$$s_k(n) = 2h(n)\cos\left[\frac{\pi}{M}(k+0.5)\left(n + \frac{M+1}{2}\right)\right], \quad k = 0, 1, \ldots, M-1. \tag{7.83}$$
It is obtained from (7.15) using the following phase:
$$\phi_k = (k+0.5)(2m+1)\frac{\pi}{2}. \tag{7.84}$$

Table 7.1 Phase difference between CMFB and MDCT. Its impact on the subband filters is a sign change when the phase difference is $\pi$

k        $\theta_k$   $\phi_k$         $\phi_k - \theta_k$
4n + 0   $\pi/4$      $\pi/4 + \pi$    $\pi$
4n + 1   $-\pi/4$     $-\pi/4$         0
4n + 2   $\pi/4$      $\pi/4$          0
4n + 3   $-\pi/4$     $-\pi/4 + \pi$   $\pi$
Fig. 7.14 Implementation of forward MDCT as application of window function and then calculation of DCT-IV. The block C represents the DCT-IV matrix

It differs from the $\theta_k$ in (7.14) by $\pi$ for some k, as shown in Table 7.1. A phase difference of $\pi$ causes the cosine function to switch its sign, so some of the analysis and synthesis filters will have negated values when compared with those given by (7.14). This is equivalent to using a different $\mathbf{D}_s$ in the analysis and synthesis banks of Sect. 7.5.2. As stated there, this kind of sign change is insignificant as long as it is complemented in both the analysis and synthesis banks.
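For reference, (7.83) can be used to compute the MDCT of one windowed block directly. The brute-force MATLAB sketch below uses the conventional block form of the transform (the scaling convention here omits the constant factor shared with the synthesis side, so results may differ from a particular fast implementation by a constant):

% Sketch: direct O(M^2) MDCT of one 2M-sample block via the kernel of (7.83)
M = 8; nn = 0:(2*M-1);
h = sin((nn+0.5)*0.5*pi/M);        % sine window (7.81)
x = randn(1, 2*M);                 % one 2M-sample block
y = zeros(M, 1);
for k = 0:M-1
    y(k+1) = sum(x .* h .* cos(pi/M*(k+0.5)*(nn + (M+1)/2)));
end

This form is useful as a check against the fast DCT-IV structure derived next.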

7.6.3 Efficient Implementation


An efficient implementation structure for the analysis bank, which utilizes the linear phase condition (7.11) to represent the second half of the window function, is given in [45]. To prepare for window switching, which is critical for coping with transients in audio signals (to be discussed in Chap. 11), we forgo this use of the linear phase condition and present the structure shown in Fig. 7.14, which uses the second half
Fig. 7.15 Implementation of MDCT as application of an overlapping window function and then calculation of DCT-IV

of the window function directly. In particular, the input to the DCT block may be expressed as
$$u_n = -x(M/2+n)h(3M/2+n) - x(M/2-1-n)h(3M/2-1-n) \tag{7.85}$$
and
$$u_{n+M/2} = x(n-M)h(n) - x(-1-n)h(M-1-n) \tag{7.86}$$
for $n = 0, 1, \ldots, M/2-1$, respectively. For both equations above, the current block is considered as consisting of samples $x(0), x(1), \ldots, x(M-1)$ and the past block as $x(-M), x(-M+1), \ldots, x(-1)$, which essentially amounts to a delay line.
The first half of the inputs to the DCT-IV block, namely $u_n$ in (7.85), is obtained by applying the second half of the window function to the current block of input data. The second half, namely $u_{n+M/2}$ in (7.86), is calculated by applying the first half of the window function to the previous block of data. This constitutes an overlap with the previous block of data. Therefore, the implementation of MDCT may be considered as application of an overlapping window function and then calculation of DCT-IV, as shown in Fig. 7.15.
The following is the Matlab code for implementing MDCT:

function [y] = mdct(x, n0, M, h)
%
% [y] = mdct(x, n0, M, h)
%
% x:  Input array. The M samples before n0 are considered as the
%     delay line
% n0: Start of a block of new data
% M:  Block size
% h:  Window function
% y:  MDCT coefficients
%
% Here is an example for generating the sine window:
% n = 0:(2*M-1);
% h = sin((n+0.5)*0.5*pi/M);
%
% Convert to DCT4 input: fold the current block with the second half
% of the window, (7.85), and the past block with the first half, (7.86)
for n=0:(M/2-1)
    u(n+1) = - x(n0+M/2+n+1)*h(3*M/2+n+1) ...
             - x(n0+M/2-1-n+1)*h(3*M/2-1-n+1);
    u(n+M/2+1) = x(n0+n-M+1)*h(n+1) ...
                 - x(n0-1-n+1)*h(M-1-n+1);
end
%
% DCT4; any DCT-IV subroutine can be used
y = dct4(u);

The inverse of Fig. 7.14 is shown in Fig. 7.16, which does not utilize the symmetry of the window function imposed by the linear phase condition. The output of the synthesis bank may be expressed as
$$x(n) = xd(n) + u_{M/2+n}h(n), \quad xd(n) = -u_{M/2-1-n}h(M+n), \quad \text{for } n = 0, 1, \ldots, M/2-1; \tag{7.87}$$
and
$$x(n) = xd(n) - u_{3M/2-1-n}h(n), \quad xd(n) = -u_{-M/2+n}h(M+n), \quad \text{for } n = M/2, M/2+1, \ldots, M-1; \tag{7.88}$$
where $xd(n)$ is the delay line with a length of M samples.

Fig. 7.16 Implementation of backward MDCT as calculation of DCT-IV and then application of window function. The block C represents the DCT-IV matrix

The following is the Matlab code for implementing inverse MDCT:

function [x, xd] = imdct(y, xd, M, h)
%
% [x, xd] = imdct(y, xd, M, h)
%
% y:  MDCT coefficients
% xd: Delay line (M samples)
% M:  Block size
% h:  Window function
% x:  Reconstructed samples
%
% Here is an example for generating the sine window:
% n = 0:(2*M-1);
% h = sin((n+0.5)*0.5*pi/M);
%
% DCT4; any DCT-IV subroutine can be used
u = dct4(y);
%
% Overlap-add with the delay line per (7.87), then refill it per (7.88)
for n=0:(M/2-1)
    x(n+1) = xd(n+1) + u(M/2+n+1)*h(n+1);
    xd(n+1) = -u(M/2-1-n+1)*h(M+n+1);
end
%
for n=(M/2):(M-1)
    x(n+1) = xd(n+1) - u(3*M/2-1-n+1)*h(n+1);
    xd(n+1) = -u(-M/2+n+1)*h(M+n+1);
end

If Figs. 7.14 and 7.16 were followed strictly, we could end up with half-length delay lines at the cost of increased complexity for swapping variables.
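The code above leaves the DCT-IV subroutine unspecified. For completeness, the following is a minimal sketch of a direct, orthonormal (self-inverse) DCT-IV, together with a block-by-block round trip through mdct and imdct using the sine window; both the dct4 scaling convention and the test harness are our own choices, not part of any standard:

function y = dct4(x)
% Direct O(M^2) orthonormal DCT-IV: dct4(dct4(x)) == x
M = length(x);
[n, k] = meshgrid(0:M-1, 0:M-1);
C = sqrt(2/M) * cos(pi*(n+0.5).*(k+0.5)/M);
y = C * x(:);

With this self-inverse DCT-IV, the mdct/imdct pair reconstructs the input exactly, with a one-block delay in the indexing convention used here:

% Round-trip sketch: reconstruct a random signal through MDCT/IMDCT
M = 64; n = 0:(2*M-1); h = sin((n+0.5)*0.5*pi/M);   % sine window (7.81)
x = randn(1, 16*M); xr = zeros(size(x)); xd = zeros(1, M);
for b = 1:(length(x)/M - 1)
    y = mdct(x, b*M, M, h);          % analyze block starting at offset b*M
    [xb, xd] = imdct(y, xd, M, h);   % returns the block analyzed one call ago
    xr((b-1)*M + (1:M)) = xb;
end
max(abs(xr(M+1:end-M) - x(M+1:end-M)))   % ~1e-15 away from the edges

The first and last blocks are only partially reconstructed because their overlap partners are never processed, which is why the comparison excludes the edges.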

Part IV

Entropy Coding

After the removal of perceptual irrelevancy using quantization aided by data modeling, we are left with a set of quantization indexes. They can be directly packed into a bit stream for transmission to the decoder. However, they can be further compressed through the removal of statistical redundancy via entropy coding.
The basic idea of entropy coding is to represent more probable quantization indexes with shorter codewords and less probable ones with longer codewords so as to achieve a shorter average codeword length. In this way, the set of quantization indexes can be represented with fewer bits.
The theoretical minimum of the average codeword length for a particular quantizer is its entropy, which is a function of the probability distribution of the quantization indexes. The minimum average codeword length achievable by practical codebooks is usually higher than the entropy. The codebook that delivers this practical minimum is called the optimal codebook.
However, this practical minimum can be made to approach the entropy if quantization indexes are grouped into blocks, each block is coded as one block symbol, and the block size is allowed to be arbitrarily large.
Huffman's algorithm is an iterative procedure that always produces an optimal entropy codebook.

Chapter 8

Entropy and Coding

Let us consider a 2-bit quantizer that represents quantized values using the following set of quantization indexes:
$$\{0, 1, 2, 3\}. \tag{8.1}$$
Each quantization index given above is called a source symbol, or simply a symbol, and the set is called a symbol set. When applied to quantize a sequence of input samples, the quantizer produces a sequence of quantization indexes, such as the following:
$$1, 2, 1, 0, 1, 2, 1, 2, 1, 0, 1, 2, 2, 1, 2, 1, 2, 3, 2, 1, 2, 1, 1, 2, 1, 0, 1, 2, 1, 2. \tag{8.2}$$

Called a source sequence, it needs to be represented by or converted to a sequence of codewords or codes that are suitable for transmission over a variety of channels. The primary concern is that the average codeword length be minimized so that the transmission of the source sequence demands a lower bit rate.
An instinctive approach to this coding problem is to use a binary numeral system to represent the symbol set. This may lead to the codebook in Table 8.1. Each codeword in this codebook is of fixed length, namely 2 bits, so the codebook is referred to as a fixed-length codebook or fixed-length code.
Coding each symbol in the source sequence (8.2) using the fixed-length code in Table 8.1 takes two bits, amounting to a total of $2 \times 30 = 60$ bits to code the entire 30 symbols in source sequence (8.2). The average codeword length is obviously
$$L = \frac{60}{30} = 2 \text{ bits/symbol}, \tag{8.3}$$
which is the codeword length of the fixed-length codebook and is independent of the frequency with which each symbol occurs in the source sequence (8.2).
Since the symbols appear in source sequence (8.2) with obviously different frequencies or probabilities, the average codeword length would be reduced if a short codeword were assigned to a symbol with high probability and a long one to a symbol with low probability. This strategy leads to variable-length codebooks or simply variable-length codes.



Table 8.1 Fixed-length codebook for the source sequence in (8.2)

Symbol   Codeword   Codeword length (bits)
0        00         2
1        01         2
2        10         2
3        11         2

Table 8.2 Variable-length unary code for the source sequence in (8.2). Shorter codewords are assigned to more frequently occurring symbols and longer codewords to less frequently occurring ones

Symbol   Codeword   Frequency   Codeword length (bits)
0        001        3           3
1        1          14          1
2        01         12          2
3        0001       1           4

As an example, the first two columns of Table 8.2 give such a variable-length codebook, built using the unary numeral system. Assuming that the sequence is iid and that the number of times a symbol appears in the sequence accurately reflects its frequency of occurrence, the probability distribution may be estimated as
$$p(0) = \frac{3}{30},\ p(1) = \frac{14}{30},\ p(2) = \frac{12}{30},\ p(3) = \frac{1}{30}. \tag{8.4}$$

With this unary codebook, the most frequently occurring symbol 1 is coded with one bit and the least frequently occurring symbol 3 is coded with four bits. The average codeword length is
$$L = \frac{3}{30} \times 3 + \frac{14}{30} \times 1 + \frac{12}{30} \times 2 + \frac{1}{30} \times 4 = 1.7 \text{ bits},$$
which is 0.3 bits better than the fixed-length code in Table 8.1.

8.1 Entropy Coding


To cast the problem above of coding a source sequence to reduce average codeword
length in mathematical terms, let us consider an information source that emits a
sequence of messages or source symbols
X.1/; X.2/; : : :

(8.5)

by drawing from a symbol set or alphabet of


S D fs0 ; s1 ; : : : ; sM 1 g

(8.6)


consisting of M source symbols. The symbols in the source sequence (8.5) are often assumed to be independently and identically distributed (iid) random variables with a probability distribution of
$$p(s_m) = p_m, \quad m = 0, 1, \ldots, M-1. \tag{8.7}$$
The task is to convert this sequence into a compact sequence of codewords
$$Y(1), Y(2), \ldots \tag{8.8}$$
drawn from a set of codewords
$$C = \{c_0, c_1, \ldots, c_{M-1}\}, \tag{8.9}$$

called a codebook or simply a code. The goal is to find a codebook that minimizes the average codeword length without loss of information.
The codewords are often represented using binary numerals, in which case the resultant sequence of codewords, called a codeword sequence or simply a code sequence, is referred to as a bit stream. While other radixes of representation, such as hexadecimal, can also be used, binary radix is assumed in this book without loss of generality.
This coding process has to be lossless in the sense that the complete source sequence can be recovered or decoded from the received codeword sequence without any error or loss of information, so it is called lossless compression coding or simply lossless coding.
In what amounts to a symbol code approach to lossless compression coding, a one-to-one mapping
$$s_m \longleftrightarrow c_m, \quad m = 0, 1, \ldots, M-1, \tag{8.10}$$
is established between each source symbol $s_m$ in the symbol set and a codeword $c_m$ in the codebook and then deployed to encode the source sequence or decode the codeword sequence symbol by symbol. The codebooks in Tables 8.1 and 8.2 are symbol codes.
Let $l(c_m)$ denote the codeword length of codeword $c_m$ in codebook (8.9); then the average codeword length per source symbol of the code sequence (8.8) is
$$L = \sum_{m=0}^{M-1} p(s_m)\, l(c_m). \tag{8.11}$$
mD0

Due to the symbol code mapping (8.10), the equation above becomes
$$L = \sum_{m=0}^{M-1} p(c_m)\, l(c_m), \tag{8.12}$$
which is the average codeword length of the codebook (8.9).

(8.12)


Apparently any symbol set, and consequently any source sequence, can be represented or coded using a binary numeral system with
$$L = \operatorname{ceil}[\log_2(M)] \text{ bits}, \tag{8.13}$$
where the function $\operatorname{ceil}(x)$ returns the smallest integer no less than $x$. This results in a fixed-length codebook or fixed-length code in which each codeword is coded with an $L$-bit binary numeral or is said to have a codeword length of $L$ bits. This fixed-length binary codebook is considered the baseline code for an information source and is used by PCM in (2.30). The performance of a variable-length code may be assessed by the compression ratio:
$$R = \frac{L_0}{L}, \tag{8.14}$$
where $L_0$ and $L$ are the average codeword lengths of the fixed-length code and the variable-length code, respectively. For the example unary code in Table 8.2, the compression ratio is
$$R = \frac{2}{1.7} \approx 1.176.$$

8.2 Entropy
In pursuit of a codebook that delivers an average codeword length as low as possible, it is critical to know whether there exists a minimal average codeword length and, if it exists, what it is. Due to (8.12), the average codeword length is weighted by the probability distribution of the given information source, so it can be expected that the answer depends on this probability model. In fact, it was discovered by Claude E. Shannon, an electrical engineer at Bell Labs, in 1948 that this minimum is the entropy of the information source, which is solely determined by the probability distribution [85, 86].

8.2.1 Entropy
When a message X from an information source is received by the receiver and turns out to be symbol $s_m$, the associated self-information is
$$I(X = s_m) = -\log p(s_m). \tag{8.15}$$
The average information per symbol for all messages emitted by the information source is obviously dependent on the probability that each symbol occurs and is thus given by
$$H(X) = -\sum_{m=0}^{M-1} p(s_m)\log p(s_m). \tag{8.16}$$


This is called entropy and is the minimal average codeword length for the given information source (to be proved later).
The unit of entropy is determined by the logarithmic base. The bit, based on the binary logarithm ($\log_2$), is the most commonly used unit. Other units include the nat, based on the natural logarithm ($\log_e$), and the hartley, based on the common logarithm ($\log_{10}$). Due to
$$\log_a x = \frac{\log_2 x}{\log_2 a},$$
conversion between these units is simple and straightforward, so the binary logarithm ($\log_2$) is always assumed in this book unless stated otherwise.
The use of the logarithm as a measure of information makes sense intuitively. Let us first note that the function in (8.15) is a decreasing function of the probability $p(s_m)$ and equals zero when $p(s_m) = 1$. This means that:
- A less likely event carries more information because the amount of surprise is larger. When the receiver knows that an event is sure to happen, i.e., $p(X = s_m) = 1$, before receiving the message X, the event of receiving X to discover that $X = s_m$ carries no information at all. So the self-information (8.15) is zero.
- More bits need to be allocated to encode less likely events because they carry more information. This is consistent with our strategy for codeword assignment: assigning longer codewords to less frequently occurring symbols. Ideally, the length of the codeword assigned to encode a symbol should be its self-information. If this were done, the average codeword length would be the same as the entropy. Partly because of this, variable-length coding is often referred to as entropy coding.
To view another intuitive perspective on entropy, let us suppose that we received two messages (symbols) from the source: $X_i$ and $X_j$. Since the source is iid, we have $p(X_i, X_j) = p(X_i)p(X_j)$. Consequently, the self-information carried by the two messages is
$$I(X_i, X_j) = -\log p(X_i, X_j) = -\log p(X_i) - \log p(X_j) = I(X_i) + I(X_j). \tag{8.17}$$
This is exactly what we expect:
- The information for two messages should be the sum of the information that each message carries.
- The number of bits to code two messages should be the sum of the bits for coding each message individually.
In addition to the intuitive perspectives outlined above, there are other considerations which ensure that the choice of using the logarithm for entropy is not arbitrary. See [85, 86] or [83] for details.


As an example, let us calculate the entropy for the source sequence (8.2):
$$H(X) = -\frac{3}{30}\log_2\frac{3}{30} - \frac{14}{30}\log_2\frac{14}{30} - \frac{12}{30}\log_2\frac{12}{30} - \frac{1}{30}\log_2\frac{1}{30} \approx 1.5376 \text{ bits}.$$
Compared with this value of entropy, the average codeword length of 1.7 bits achieved by the unary code in Table 8.2 is quite impressive.
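Such calculations take only a few lines in MATLAB. The sketch below estimates the probabilities of (8.4) empirically from sequence (8.2), evaluates the entropy (8.16), and evaluates the average codeword length (8.11) of the unary code in Table 8.2:

% Sketch: entropy and average codeword length for sequence (8.2)
seq = [1 2 1 0 1 2 1 2 1 0 1 2 2 1 2 1 2 3 2 1 2 1 1 2 1 0 1 2 1 2];
symbols = 0:3;
p = arrayfun(@(s) mean(seq == s), symbols);   % [3 14 12 1]/30, cf. (8.4)
H = -sum(p .* log2(p))                        % ~1.5376 bits/symbol, (8.16)
len = [3 1 2 4];                              % unary lengths from Table 8.2
L = sum(p .* len)                             % 1.7 bits/symbol, (8.11)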

8.2.2 Model Dependency


The definition of entropy in (8.16) assumes that messages emitted from an information source be iid, so the source can be completely characterized by the
one-dimensional probability distribution. Since entropy is completely determined
by this distribution, its value as the minimal average codeword length is apparently
as good as the probability model, especially the iid assumption.
Although simple, the iid assumption usually do not reflect the real probability
structure of the source sequence. In fact, most information source, and audio signals
in particular, are strongly correlated. The violation of this iid assumption significantly skew the calculated entropy toward a value larger than the real entropy.
This is shown by two examples below.
In the first example, we notice that each symbol in the example sequence (8.2) can be predicted from its predecessor using (4.1) to give the following residual sequence:
$$1, 1, -1, -1, 1, 1, -1, 1, -1, -1, 1, 1, 0, -1, 1, -1, 1, 1, -1, -1, 1, -1, 0, 1, -1, -1, 1, 1, -1, 1.$$
Now the alphabet is reduced to
$$\{-1, 0, 1\}$$
with the following probabilities:
$$p(-1) = \frac{13}{30},\ p(0) = \frac{2}{30},\ p(1) = \frac{15}{30}.$$

Its entropy is
$$H(X) = -\frac{13}{30}\log_2\frac{13}{30} - \frac{2}{30}\log_2\frac{2}{30} - \frac{15}{30}\log_2\frac{15}{30} \approx 1.2833 \text{ bits},$$


which is $1.5376 - 1.2833 = 0.2543$ bits less than the entropy achieved with the false iid assumption.
Another approach to exploiting the correlation in example sequence (8.2) is to borrow the idea from vector quantization and consider the symbol sequence as a sequence of vectors or block symbols:
$$(1,2), (1,0), (1,2), (1,2), (1,0), (1,2), (2,1), (2,1), (2,3), (2,1), (2,1), (1,2), (1,0), (1,2), (1,2)$$
with an alphabet of
$$\{(2,3), (1,0), (2,1), (1,2)\}.$$
The probabilities of occurrence for all block symbols or vectors are
$$p((2,3)) = \frac{1}{15},\ p((1,0)) = \frac{3}{15},\ p((2,1)) = \frac{4}{15},\ p((1,2)) = \frac{7}{15}.$$

The entropy for the vector symbols is
$$H(X) = -\frac{1}{15}\log_2\frac{1}{15} - \frac{3}{15}\log_2\frac{3}{15} - \frac{4}{15}\log_2\frac{4}{15} - \frac{7}{15}\log_2\frac{7}{15} \approx 1.7465 \text{ bits per block symbol}.$$
Since there are two symbols per block symbol, the entropy is actually $1.7465/2 \approx 0.8733$ bits per symbol. This is $1.5376 - 0.8733 = 0.6643$ bits less than the entropy achieved with the false iid assumption.
Table 8.3 summarizes the entropies achieved using three data models to code the same source sequence (8.2). It is obvious that the entropy for a source is a moving entity, depending on how good the data model is. Since it is generally not possible to completely know a physical source and hence build the perfect model for it, it is generally impossible to know the real entropy of the source. The entropy is only as good as the data model used. This is similar to quantization, where data models play a critical role.

Table 8.3 Entropies obtained using three data models for the same source sequence (8.2)

Data model     Entropy (bits per symbol)
iid            1.5376
Prediction     1.2833
Block coding   0.87325
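The three entries of Table 8.3 can be reproduced with a short MATLAB sketch; the empirical-pmf helper and the encoding of symbol pairs as two-digit numbers are our own conveniences:

% Sketch: the three model-dependent entropies of Table 8.3
seq = [1 2 1 0 1 2 1 2 1 0 1 2 2 1 2 1 2 3 2 1 2 1 1 2 1 0 1 2 1 2];
Hent = @(v) -sum(v .* log2(v));                      % entropy of a pmf
pmf  = @(s) arrayfun(@(u) mean(s == u), unique(s));  % empirical pmf
H_iid   = Hent(pmf(seq))                             % ~1.5376 bits/symbol
res     = [seq(1), diff(seq)];                       % residual e(n) = x(n) - x(n-1)
H_pred  = Hent(pmf(res))                             % ~1.2833 bits/symbol
pairs   = 10*seq(1:2:end) + seq(2:2:end);            % tag each pair uniquely
H_block = Hent(pmf(pairs)) / 2                       % ~0.8733 bits/symbol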


8.3 Uniquely and Instantaneously Decodable Codes


A necessary requirement for entropy coding is that the source sequence can be
reconstructed without any loss of information from the the codeword sequence received by the decoder. While the one-to-one mapping (8.10) is the first step toward
this end, it is not sufficient because the codeword sequence generated by concatenating the codewords from such a codebook can become undecodable. There is,
therefore, an implicit requirement for a codebook that all of its codewords must be
uniquely decodable when concatenated together in any order.
Among all the uniquely decodable codebooks for a given information source, the
one or ones with the least average codeword length is called the optimal codebook.
Uniquely decodable codes vary significantly in terms of computational complexity, especially when decoding is involved. There is a subset, called prefix-free codes,
which are instantaneous decodable in the sense that each of its codewords can be
decoded as soon as the last bit of it is received. It will be shown that, if there is an
optimal codebook, at least one of them is a prefix-free code.

8.3.1 Uniquely Decodable Code


Looking at the unary code in Table 8.2, one notices that the 1 does not appear anywhere other than the end of the codewords. One would wonder why this is needed?
To answer this question, let us consider the old unary numeral system shown in
Table 8.4 which uses the number of 0s to represent the corresponding number and
thus establishes a one-to-one mapping. It is obvious that it takes the same number
of bits to code the sequence in (8.2) as the unary code given in Table 8.2. Therefore,
the codebook in Table 8.4 seems to be equally adequate.
The problem lies in that the codeword sequence generated from the codebook
in Table 8.4 cannot be uniquely decoded. For example, the first three symbols in
(8.2) are f1, 2, 1g, so will be coded into f0 00 0g. Once the receiver received this
sequence of f0000g, the decoder cannot uniquely decode the sequence: it cannot
determine whether the received codeword sequence is either f3g, f2,2g, or f1,1,1,1g.
To ensure unique decodability of the unary code, the 1 is used to signal the
end of a codeword in Table 8.2. Now, the three symbols f1, 2, 1g will be coded via
the codebook in Table 8.2 into f1 01 1g. Once the receiver received the sequence
f1011g, it can uniquely determines that the symbols are f1,2,1g.

Table 8.4 A not uniquely decodable codebook

Symbol   Codeword   Frequency   Codeword length
0        000        3           3
1        0          14          1
2        00         12          2
3        0000       1           4


Fixed-length codes such as the one in Table 8.1 are also uniquely decodable because all codewords are of the same length and unique in the codebook. To decode a sequence of symbols coded with a fixed-length code of n bits, one can cut the sequence into blocks of n bits each, extract the bits in each block, and look up the codebook to find the symbol it represents.
Unique decodability imposes a limit on codeword lengths. In particular, McMillan's inequality states that there is a uniquely decodable binary codebook (8.9) with codeword lengths $l(c_m)$ if and only if
$$\sum_{m=0}^{M-1} 2^{-l(c_m)} \leq 1, \tag{8.18}$$
where $l(c_m)$ is again the length of codeword $c_m$. See [43] for proof.
Given the source sequence (8.5) and the probability distribution (8.7), there are many uniquely decodable codes satisfying the requirement for decodability (8.18). But these codes are not equal because their average codeword lengths may be different. Among all of them, the one that produces the minimum average codeword length
$$L_{\text{opt}} = \min_{\{l(c_m)\}} \sum_{m=0}^{M-1} p(c_m)\, l(c_m) \tag{8.19}$$
is referred to as the optimal codebook $C_{\text{opt}}$. It is the target of codebook design.

8.3.2 Instantaneous and Prefix-Free Code


The uniquely decodable fixed-length code in Table 8.1 and unary code in Table 8.2
are also instantaneously decodable in the sense that each codeword can be decoded
as soon as its last bit is received. For the fixed-length code, decoding is possible as
soon as the fixed number of bits are received, or the last bit is received. For the unary
code, decoding is possible as soon as the 1 is received, which signals the end of a
codeword.
There are codes that are uniquely decodable, but cannot be instantaneously decoded. Two such examples are shown in Table 8.5. For Codebook A, the codeword
f0g is a prefix of codewords f01g and f011g. When f0g is received, the receiver
cannot decide whether it is codeword f0g or the first bit of codewords f01g or f011g,
so it has to wait. If f1g is subsequently received, the receiver cannot decide whether
the received f01g is codeword f01g or the first two bits of codeword f011g because
f01g is a prefix of codeword 011. So the receiver has to wait again. Except for the

Table 8.5 Two examples of not instantaneously decodable codebooks

Symbol   Codebook A   Codebook B
0        0            0
1        01           01
2        011          11


reception of {011}, which the receiver can immediately decide is codeword {011}, the decoder has to wait until the reception of {0}, the start of the next source symbol. Therefore, the decoding delay may be more than one codeword.
For Codebook B, the decoding delay may be as long as the whole sequence. For example, the codeword sequence {01111} may decode as {0, 2, 2}. However, when another {1} is subsequently received to give a codeword sequence of {011111}, the interpretation of the initial five bits ({01111}) becomes totally different because the codeword sequence {011111} now decodes as {1, 2, 2}. The decoder cannot make the decision until it sees the end of the sequence. The primary reason for the delayed decision is that the codeword {0} is a prefix of codeword {01}. When the receiver sees {0}, it cannot decide whether it is codeword {0} or just the first bit of codeword {01}. To resolve this, it has to wait until the end of the sequence and work backward.
Apparently, whether or not a codeword is a prefix of another codeword is critical to whether it is instantaneously decodable. A codebook in which no codeword is a prefix of any other codeword is referred to as a prefix code, or more precisely a prefix-free code. The unary code in Table 8.2 and the fixed-length code in Table 8.1 are both prefix codes and are instantaneously decodable. The two uniquely decodable codes in Table 8.5 are not prefix codes and are not instantaneously decodable.
In fact, this association is not a coincidence: a codebook is instantaneously decodable if and only if it is a prefix-free code. To see this, let us assume that there is a codeword in an instantaneously decodable code that is a prefix of at least one other codeword. Because of this, this codeword is obviously not instantaneously decodable, as shown by Codebooks A and B in the above example. Therefore, an instantaneously decodable code has to be a prefix-free code.
On the other hand, all codewords of a prefix-free code can be decoded instantaneously upon reception because there is no ambiguity with any other codewords in the codebook.

8.3.3 Prefix-Free Code and Binary Tree


A codebook can be viewed as a binary tree or code tree. This is illustrated in Fig. 8.1 for the unary codebook in Table 8.2. The tree starts from a root node of NULL and can grow no more than two branches at each node. Each branch represents either 0 or 1. Each node contains the codeword that represents all the branches connecting from the root node all the way through to the current node. If a node does not grow any more branches, it is called an external node or leaf; otherwise, it is called an internal node.
Since an internal node grows at least one branch, the codeword it represents is a prefix of whatever codeword or node grows from it. On the other hand, since a leaf or external node does not grow any more branches, the codeword it represents is not a prefix of any other node or codeword. Therefore, the codewords of a prefix-free code are taken only from the leaves. For example, the code in Fig. 8.1 is a prefix-free code since only its leaves are taken as codewords.


Fig. 8.1 Binary tree for an unary codebook. It is a prefix-free code because only its leaves are taken as codewords

Fig. 8.2 Binary trees for two noninstantaneous codes. Since the left tree (for Codebook A) takes codewords from internal nodes {0} and {01}, and the right tree (for Codebook B) from {0}, respectively, both are not prefix-free codes. The branches are not labeled due to the convention that left branches represent 0 and right branches represent 1

Figure 8.2 shows the trees for the two noninstantaneous codes given in Table 8.5. Since Codebook A takes codewords from internal nodes {0} and {01}, and Codebook B from {0}, respectively, both are not prefix-free codes. Note that the branches are not labeled because both trees follow the convention that left branches represent 0 and right branches represent 1. This convention is followed throughout this book unless stated otherwise.

8.3.4 Optimal Prefix-Free Code


Instantaneous/prefix-free codes are obviously desirable for easy and efficient decoding. However, prefix-free codes are only a subset of uniquely decodable codes, so
there is a legitimate concern that the optimal code may not be a prefix-free code for


a given information source. Fortunately, this concern turns out to be unwarranted due to Kraft's inequality, which states that there is a prefix-free codebook (8.9) if and only if
$$\sum_{m=0}^{M-1} 2^{-l(c_m)} \leq 1. \tag{8.20}$$
See [43] for proof.
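Both sums are trivial to check numerically. The sketch below evaluates the Kraft/McMillan sum for the codeword lengths of the codebooks seen so far:

% Sketch: Kraft/McMillan sums for Tables 8.1, 8.2, and 8.4
kraft = @(len) sum(2.^(-len));
kraft([2 2 2 2])   % Table 8.1 (fixed-length): 1.0
kraft([3 1 2 4])   % Table 8.2 (unary): 0.9375
kraft([3 1 2 4])   % Table 8.4 has the same lengths, yet is not
                   % uniquely decodable (see below)

Table 8.4 is a useful caution here: the inequalities guarantee that some code with the given lengths is decodable, while a particular assignment of codewords, such as the all-zeros codewords of Table 8.4, can still fail.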


Since Krafts inequality (8.20) is the same as McMillans inequality (8.18),
we conclude that there is a uniquely decodable code if and only if there is an
instantaneous/prefix-free code with the same set of codeword lengths.
In terms of the optimal codebooks, there is an optimal codebook if and only if
there is a prefix-free code with the same set of codeword lengths. In other words,
there is always a prefix-free codebook that is optimal.

8.4 Shannon's Noiseless Coding Theorem

Although prefix-free codes are instantaneously decodable and there is always a prefix-free codebook that is optimal, there is still a question as to how closely the average codeword length of such an optimal prefix-free code can approach the entropy of the information source. Shannon's noiseless coding theorem states that the entropy is the absolute minimal average codeword length of any uniquely decodable code and that the entropy can be asymptotically approached by a prefix-free code if source symbols are coded as blocks and the block size goes to infinity.

8.4.1 Entropy as the Lower Bound


To reduce the average codeword length, it is desired that only short codewords be
used. But McMillans inequality (8.18) states that, to ensure unique decodability, the
use of some short codewords requires the other codewords to be long. Consequently,
the overall or average codeword length cannot be arbitrarily low, there is an absolute
lower bound. It turns out that the entropy is this lower bound.
To prove this, let
$$K = \sum_{m=0}^{M-1} 2^{-l(c_m)}.$$
Due to McMillan's inequality (8.18), we have
$$\log(K) \leq 0. \tag{8.21}$$


Consequently, we have
$$\begin{aligned} L &= \sum_{m=0}^{M-1} p(c_m)\, l(c_m) \\ &\geq \sum_{m=0}^{M-1} p(c_m)\left[l(c_m) + \log_2 K\right] \\ &= \sum_{m=0}^{M-1} p(c_m)\log_2\left[2^{l(c_m)}K\right] \\ &= \sum_{m=0}^{M-1} p(c_m)\log_2\left[\frac{1}{p(c_m)}\, p(c_m)2^{l(c_m)}K\right] \\ &= \sum_{m=0}^{M-1} p(c_m)\log_2\frac{1}{p(c_m)} + \sum_{m=0}^{M-1} p(c_m)\log_2\left[p(c_m)2^{l(c_m)}K\right] \\ &= H - \sum_{m=0}^{M-1} p(c_m)\log_2\frac{1}{p(c_m)2^{l(c_m)}K}. \end{aligned}$$

Due to
$$\log_2 x \leq \frac{1}{\ln 2}(x - 1), \quad \forall x > 0,$$
the subtracted term on the right-hand side is always nonpositive because
$$\begin{aligned} \sum_{m=0}^{M-1} p(c_m)\log_2\frac{1}{p(c_m)2^{l(c_m)}K} &\leq \frac{1}{\ln 2}\sum_{m=0}^{M-1} p(c_m)\left[\frac{1}{p(c_m)2^{l(c_m)}K} - 1\right] \\ &= \frac{1}{\ln 2}\left[\frac{1}{K}\sum_{m=0}^{M-1} 2^{-l(c_m)} - \sum_{m=0}^{M-1} p(c_m)\right] \\ &= \frac{1}{\ln 2}(1 - 1) = 0. \end{aligned}$$
Therefore, we have
$$L \geq H. \tag{8.22}$$


8.4.2 Upper Bound


Since the entropy is the absolute lower bound on average codeword length, an
intuitive approach to the construction of an optimal codebook is to set the length
of the codeword assigned to a source symbol to its self-information (8.15), then the
average codeword length would be equal to the entropy. This is, unfortunately, unworkable because the self-information is most likely not an integer. But we can get
close to this by setting the codeword length to the next smallest integer:
l.cm / D ceil log2 p.cm /:

(8.23)

Such a codebook is called a ShannonFano code.


A ShannonFano code is uniquely decodable because it satisfies the McMillan
inequality (8.18):
M
1
X

2l.cm / D

mD0

M
1
X

2ceil log2 p.cm /

mD0

M
1
X

2log2 p.cm /

mD0

M
1
X

p.cm /

mD0

D 1:

(8.24)

The average codeword length of the Shannon-Fano code is
$$L = \sum_{m=0}^{M-1} p(c_m)\operatorname{ceil}[-\log_2 p(c_m)] \leq \sum_{m=0}^{M-1} p(c_m)\left[1 - \log_2 p(c_m)\right] = 1 - \sum_{m=0}^{M-1} p(c_m)\log_2 p(c_m) = 1 + H(X), \tag{8.25}$$
where the inequality is obtained due to
$$\operatorname{ceil}(x) \leq 1 + x.$$
A Shannon-Fano code may or may not be optimal, although it sometimes is. But the above inequality constitutes an upper bound on the optimal codeword length.


Combining this and the lower entropy bound (8.22), we obtain the following bounds for the optimal codeword length:
$$H(X) \leq L_{\text{opt}} < 1 + H(X). \tag{8.26}$$

8.4.3 Shannon's Noiseless Coding Theorem

Let us group n source symbols in source sequence (8.5) as a block, called a block symbol,
$$\mathbf{X}(k) = [X_{kn}, X_{kn+1}, \ldots, X_{kn+n-1}], \tag{8.27}$$
thus converting source sequence (8.5) into a sequence of block symbols:
$$\mathbf{X}(0), \mathbf{X}(1), \ldots. \tag{8.28}$$
Since source sequence (8.5) is assumed to be iid, the probability distribution for a block symbol is
$$p(s_{m_0}, s_{m_1}, \ldots, s_{m_{n-1}}) = p(s_{m_0})p(s_{m_1})\cdots p(s_{m_{n-1}}). \tag{8.29}$$
Using this equation, the entropy for the block symbols is
$$\begin{aligned} H^n(\mathbf{X}) &= -\sum_{m_0=0}^{M-1}\sum_{m_1=0}^{M-1}\cdots\sum_{m_{n-1}=0}^{M-1} p(s_{m_0}, s_{m_1}, \ldots, s_{m_{n-1}})\log p(s_{m_0}, s_{m_1}, \ldots, s_{m_{n-1}}) \\ &= -n\sum_{m_0=0}^{M-1} p(s_{m_0})\log p(s_{m_0}) \\ &= nH(X). \end{aligned} \tag{8.30}$$

Applying the bounds in (8.26) to the block symbols, we have
$$nH(X) \leq L_{\text{opt}}^n < 1 + nH(X), \tag{8.31}$$
where $L_{\text{opt}}^n$ is the optimal codeword length for the block codebook that is used to code the sequence of block symbols. The average codeword length per source symbol is obviously
$$L_{\text{opt}} = \frac{L_{\text{opt}}^n}{n}. \tag{8.32}$$


Therefore, the optimal codeword length per source symbol is bounded by
$$H(X) \le L_{\mathrm{opt}} < \frac{1}{n} + H(X). \qquad (8.33)$$
As $n \to \infty$, the optimal codeword length $L_{\mathrm{opt}}$ approaches the entropy. In other words, by choosing a large enough block size n, $L_{\mathrm{opt}}$ can be made as close to the entropy as desired. And this can always be delivered by a prefix-free code because the inequality in (8.33) was derived within the context that both McMillan's inequality (8.18) and the Kraft inequality (8.20) are satisfied.
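This convergence is easy to observe numerically. The following Python sketch (illustrative, not from the book; the two-symbol source is hypothetical) assigns Shannon–Fano lengths (8.23) to block symbols of growing size n and prints the resulting bits per source symbol:

```python
import math
from itertools import product

p = {"a": 0.9, "b": 0.1}                        # hypothetical iid source
H = -sum(q * math.log2(q) for q in p.values())  # source entropy

for n in (1, 2, 4, 8):
    L = 0.0
    for block in product(p, repeat=n):          # every block symbol (8.27)
        q = math.prod(p[s] for s in block)      # iid probability (8.29)
        L += q * math.ceil(-math.log2(q))       # Shannon-Fano length (8.23)
    print(f"n={n}: {L/n:.4f} bits/symbol, entropy {H:.4f}")
```

The per-symbol average length decreases toward the entropy as (8.33) predicts, although an actual system would use an optimal (e.g., Huffman) codebook rather than Shannon–Fano lengths.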

Chapter 9

Huffman Coding

Now that we are assured that, given a probability distribution, if there is an optimal uniquely decodable code, there is a prefix-free code with the same average codeword length, the next step is the construction of such optimal prefix-free codes. Huffman's algorithm [29], developed by David A. Huffman in 1952 when he was a Ph.D. student at MIT, is just such a simple algorithm.

9.1 Huffman's Algorithm


Huffman's algorithm is a recursive procedure that merges the two least probable symbols into a new meta-symbol until only one meta-symbol is left. This procedure can be best illustrated through an example.
Let us consider the probability distribution in (8.4). Its Huffman code is constructed using the following steps (see Fig. 9.1):
1. The two least probable symbols 1/30 and 3/30 are merged into a new meta-symbol with a probability of 4/30, which becomes the new least probable symbol.
2. The next two least probable meta-symbol 4/30 and symbol 12/30 are merged into a new meta-symbol with a probability of 16/30.
3. The next two least probable meta-symbol 16/30 and symbol 14/30 are merged into a new meta-symbol with a probability of 30/30, which signals the end of the recursion.
The Huffman codeword for each symbol is listed on the left side of the symbol in Fig. 9.1. These codewords are generated by assigning 1 to top branches and 0 to lower branches, starting from the right-most meta-symbol, which has a probability of 30/30. The average length for this Huffman codebook is
$$L = 3\times\frac{1}{30} + 3\times\frac{3}{30} + 2\times\frac{12}{30} + 1\times\frac{14}{30} = \frac{50}{30} \approx 1.6667 \text{ bits.}$$



Fig. 9.1 Steps involved in constructing a Huffman codebook for the probability distribution in
(8.4)

This is better than the 1.7 bits achieved using the unary code given in Table 8.2, but still larger than the entropy of 1.5376 bits.
To present Huffman's algorithm for an arbitrary information source, let us consider the symbol set (8.6) with a probability distribution (8.7). Huffman's algorithm generates the codebook (8.9) with the following recursive procedure:
1. Let $m_0, m_1, \ldots, m_{M-1}$ be a permutation of $0, 1, \ldots, M-1$ for which
$$p_{m_0} \ge p_{m_1} \ge \cdots \ge p_{m_{M-1}}; \qquad (9.1)$$
then the symbol set (8.6) is permutated as
$$\{s_{m_0}, s_{m_1}, \ldots, s_{m_{M-3}}, s_{m_{M-2}}, s_{m_{M-1}}\}. \qquad (9.2)$$
2. Merge the two least probable symbols $s_{m_{M-2}}$ and $s_{m_{M-1}}$ into a new meta-symbol $s'$ with the associated probability of
$$p' = p_{m_{M-2}} + p_{m_{M-1}}. \qquad (9.3)$$
This gives rise to the following new symbol set:
$$\{s_{m_0}, s_{m_1}, \ldots, s_{m_{M-3}}, s'\} \qquad (9.4)$$
with the associated probability distribution of
$$p_{m_0}, p_{m_1}, \ldots, p_{m_{M-3}}, p'. \qquad (9.5)$$
3. Repeat from step 1 until the number of symbols is two. Then return with the following codebook:
$$\{0, 1\}. \qquad (9.6)$$


4. If the codebook for symbol set (9.4) is
$$\{c_0, c_1, \ldots, c_{M-3}, c'\}, \qquad (9.7)$$
return with the following codebook:
$$\{c_0, c_1, \ldots, c_{M-3}, c'0, c'1\} \qquad (9.8)$$
for symbol set (9.2).


Apparently, for the special case of M D 2, the iteration above ends at step 3, so
Huffman codebook f0; 1g is returned.
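To make the recursion concrete, here is a minimal Python sketch (illustrative, not from the book) that builds a Huffman codebook with a priority queue. The bit assigned to each branch is arbitrary here, so the codewords may differ from Fig. 9.1, but the codeword lengths, and hence the average length, are the same:

```python
import heapq

def huffman_codebook(probs):
    """Build a Huffman codebook for a {symbol: probability} map by
    repeatedly merging the two least probable (meta-)symbols."""
    # Heap entry: (probability, tie-breaker, list of (symbol, codeword)).
    heap = [(p, i, [(s, "")]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, lo = heapq.heappop(heap)    # least probable
        p2, _, hi = heapq.heappop(heap)    # second least probable
        # Grow one bit at the front of every codeword in each branch (9.8).
        merged = [(s, "0" + c) for s, c in lo] + [(s, "1" + c) for s, c in hi]
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return dict(heap[0][2])

probs = {"s0": 14/30, "s1": 12/30, "s2": 3/30, "s3": 1/30}  # as in (8.4)
book = huffman_codebook(probs)
L = sum(p * len(book[s]) for s, p in probs.items())
print(book, f"average length = {L:.4f} bits")   # about 1.6667 bits
```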

9.2 Optimality

Huffman's algorithm produces optimal codes. This can be proved by induction on the number of source symbols M. In particular, for M = 2, the codebook produced by Huffman's algorithm is obviously optimal: we cannot do any better than using one bit to code each symbol. For M > 2, let us assume that Huffman's algorithm produces an optimal code for a symbol set of size M − 1; we then prove that Huffman's algorithm produces an optimal code for a symbol set of size M.

9.2.1 Codeword Siblings


From (9.8) we notice that the Huffman codewords for the two least probable symbols have the forms $c'0$ and $c'1$, i.e., they have the same length and differ only in the last bit (see Fig. 9.1). Such a pair of codewords are called siblings. In fact, any instantaneous codebook can always be re-arranged in such a way that the codewords for the two least probable symbols are siblings while keeping the average codeword length the same or less.
To show this, let us first note that, for two symbols with probabilities $p_1 < p_2$, if a longer codeword was assigned to the more probable symbol, i.e., $l_1 < l_2$, the codewords can always be swapped without any topological change to the tree, but with reduced average codeword length. One such example is shown in Fig. 9.2, where the codewords for symbols with probabilities 0.3 and 0.6 are swapped. Repetitive application of this topologically constant procedure to a codebook always ends up with a new one which has the same topology as the original one, but whose two least probable symbols are assigned the two longest codewords and whose average codeword length is the same as, if not shorter than, the original one.
Second, if an internal node in a codebook tree does not grow two branches, it can always be removed to generate shorter codewords. This is shown in Fig. 9.3, where node 1 in the tree shown on the top is removed to give the tree at the bottom with a shorter average codeword length.


Fig. 9.2 Codewords for symbols with probabilities 0.3 and 0.6 in the codebook tree on the top are swapped to give the codebook tree in the bottom. There is no change to the tree topology; the average codeword length becomes shorter because the shorter codeword is weighted by the higher probability

Fig. 9.3 Removal of the internal node 1 in the tree on the top produces the tree at the bottom, which has a shorter average codeword length

Application of this procedure to the two least probable symbols in a codebook ensures that they always have the same codeword length. Otherwise, the longest codeword would have to grow from at least one internal node which grows only one branch.
Third, if the codewords for the two least probable symbols do not grow from the same last internal node, the last internal node for the least probable symbol must grow another codeword whose length is the same as that of the second least probable symbol. Due to the same codeword length, these two codewords can be swapped with no impact on the average codeword length. Consequently, the codewords for the two least probable symbols can always grow from the same last internal node. In other words, the two codewords can always be of the forms $c'0$ and $c'1$.

9.2.2 Proof of Optimality


Let $L_M$ be the average codeword length for the codebook produced by Huffman's algorithm for the symbol set of (8.6) with the probability distribution of (8.7). Without loss of generality, the symbol set in (8.6) can always be permutated to give the following symbol set:
$$\{s_0, s_1, \ldots, s_{M-3}, s_{M-2}, s_{M-1}\} \qquad (9.9)$$
with a probability distribution satisfying
$$p_m \ge p_{M-2} \ge p_{M-1} \quad \text{for all } 0 \le m \le M-3. \qquad (9.10)$$

Then the last two symbols can be merged to give the following symbol set:
$$\{s_0, s_1, \ldots, s_{M-3}, s'\}, \qquad (9.11)$$
which has M − 1 symbols and a probability distribution of
$$p_0, p_1, \ldots, p_{M-3}, p', \qquad (9.12)$$
where
$$p' = p_{M-2} + p_{M-1}. \qquad (9.13)$$

Applying Huffman's recursive procedure in Sect. 9.1 to symbol set (9.11) produces a Huffman codebook with an average codeword length of $L_{M-1}$. By the induction hypothesis, this Huffman codebook is optimal.
The last step of the Huffman procedure grows the last codeword in symbol set (9.11) into two codewords by attaching one bit (0 and 1) to its end to produce a codebook of size M for symbol set (9.9). This additional bit is added with a probability of $p_{M-2} + p_{M-1}$, so the average codeword length for the new Huffman codebook is
$$L_M = L_{M-1} + p_{M-2} + p_{M-1}. \qquad (9.14)$$

Suppose that there were another instantaneous codebook for the symbol set in (8.6) with an average codeword length of $\hat{L}_M$ that is less than $L_M$:
$$\hat{L}_M < L_M. \qquad (9.15)$$


As shown in Sect. 9.2.1, this codebook can be modified so that the codewords for the two least probable symbols $s_{M-2}$ and $s_{M-1}$ have the forms $c'0$ and $c'1$, while keeping the average codeword length the same or less. This means the symbol set is permutated to have the form given in (9.9) with the corresponding probability distribution given by (9.10). This codebook can be used to produce another codebook of size M − 1 for symbol set (9.11) by keeping the codewords for $\{s_0, s_1, \ldots, s_{M-3}\}$ the same and encoding the last symbol $s'$ using $c'$. Let us denote its average codeword length as $\hat{L}_{M-1}$. Following the same argument as that which leads to (9.14), we can establish
$$\hat{L}_M = \hat{L}_{M-1} + p_{M-2} + p_{M-1}. \qquad (9.16)$$
Subtracting (9.16) from (9.14), we have
$$L_M - \hat{L}_M = L_{M-1} - \hat{L}_{M-1}. \qquad (9.17)$$

By the induction hypothesis, $L_{M-1}$ is the average codeword length for an optimal codebook for the symbol set in (9.11), so we have
$$L_{M-1} \le \hat{L}_{M-1}. \qquad (9.18)$$
Plugging this into (9.17), we have
$$\hat{L}_M - L_M \ge 0, \qquad (9.19)$$
or
$$\hat{L}_M \ge L_M, \qquad (9.20)$$
which contradicts the supposition in (9.15). Therefore, Huffman's algorithm produces an optimal codebook for M as well.
To summarize, it was proven above that, if Huffman's algorithm produces an optimal codebook for a symbol set of size M − 1, it produces an optimal codebook for a symbol set of size M. Since it produces the optimal codebook for M = 2, it produces optimal codebooks for any M.

9.3 Block Huffman Code


Although the Huffman code is optimal for symbol sets of any size, the optimal average codeword length that it achieves is often much larger than the entropy when the symbol set is small. To see this, let us consider the extreme case of M = 2. The only possible codebook, which is also the Huffman code, is
$$\{0, 1\}.$$


It obviously has an average codeword length of one bit, regardless of the underlying probability distribution and entropy. The more skewed the probability distribution is, the smaller the entropy is, hence the less efficient the Huffman code is. As an example, let us consider the probability distribution of $p_1 = 0.1$ and $p_2 = 0.9$. It results in an entropy of
$$H = -0.1 \log_2(0.1) - 0.9 \log_2(0.9) \approx 0.4690 \text{ bits},$$
which is obviously much smaller than the one bit delivered by the Huffman code.

9.3.1 Efficiency Improvement


By Shannon's noiseless coding theorem (8.33), however, the entropy can be approached by grouping more symbols together into block symbols. To illustrate this, let us first block-code two symbols together as one block symbol. This gives the probability distribution in Table 9.1 and the corresponding Huffman code in Fig. 9.4. Its average codeword length is
$$L = \frac{1}{2}(0.01 \times 3 + 0.09 \times 3 + 0.09 \times 2 + 0.81 \times 1) = 0.645 \text{ bits},$$
which is significantly smaller than the one bit achieved by the M = 2 Huffman code.
Table 9.1 Probability distribution when two symbols are coded together as one block symbol

Symbol   Probability
00       0.01
01       0.09
10       0.09
11       0.81

Fig. 9.4 Huffman code when two symbols are coded together as one block symbol


Table 9.2 Probability distribution when three symbols are coded together as one block symbol

Symbol   Probability
000      0.001
001      0.009
010      0.009
011      0.081
100      0.009
101      0.081
110      0.081
111      0.729

Fig. 9.5 Huffman code when three symbols are coded together as one block symbol

This can be further improved by coding three symbols as one block symbol. This gives us the probability distribution in Table 9.2 and the Huffman code in Fig. 9.5. The average codeword length is
$$L = \frac{1}{3}(0.001 \times 7 + 0.009 \times 7 + 0.009 \times 6 + 0.081 \times 4 + 0.009 \times 5 + 0.081 \times 3 + 0.081 \times 2 + 0.729 \times 1) \approx 0.5423 \text{ bits},$$
which is more than 0.1 bit better than coding two symbols as a block symbol and closer to the entropy.
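These numbers can be reproduced with the hypothetical huffman_codebook helper from the sketch in Sect. 9.1, applied to block symbols:

```python
import math
from itertools import product

p = {"0": 0.1, "1": 0.9}
for n in (1, 2, 3):
    blocks = {"".join(b): math.prod(p[s] for s in b)
              for b in product(p, repeat=n)}
    book = huffman_codebook(blocks)          # sketch from Sect. 9.1
    L = sum(q * len(book[b]) for b, q in blocks.items()) / n
    print(f"n={n}: {L:.4f} bits/symbol")     # about 1.0, 0.645, 0.5423
```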


9.3.2 Block Encoding and Decoding


In the examples above, the block symbols are constructed from the original symbols by stacking the bits of the original symbols together. For example, a block symbol in Table 9.2 is constructed as
$$B = \{s_2 s_1 s_0\},$$
where $s_0$, $s_1$, and $s_2$ are the bits representing the original symbols. This is possible because the number of symbols in the original symbol set is M = 2. In general, for any finite M, a block symbol B consisting of n original symbols may be expressed as
$$B = s_{n-1} M^{n-1} + s_{n-2} M^{n-2} + \cdots + s_1 M + s_0, \qquad (9.21)$$
where $s_i$, $i = 0, 1, \ldots, n-1$, represents each original symbol. It is, therefore, obvious that block encoding just consists of a series of multiplication and accumulation operations.
Equation (9.21) also indicates that the original symbols may be decoded from the block symbol through the following iterative procedure:
$$\begin{aligned}
B_1 &= B / M,           & s_0 &= B - B_1 M; \\
B_2 &= B_1 / M,         & s_1 &= B_1 - B_2 M; \\
    &\;\;\vdots \\
B_{n-1} &= B_{n-2} / M, & s_{n-2} &= B_{n-2} - B_{n-1} M; \\
B_n &= B_{n-1} / M,     & s_{n-1} &= B_{n-1} - B_n M, \qquad (9.22)
\end{aligned}$$
where the $/$ operation represents integer division. When M is a power of two, it may be implemented as right shifting. The step to obtain $s_i$ in each iteration is actually the operation to get the remainder.
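Assuming hypothetical helper names, the multiply-accumulate encoder and the integer-division decoder might look like this:

```python
def block_encode(symbols, M):
    """Pack n symbol indexes (each in 0..M-1) into one block symbol (9.21)."""
    B = 0
    for s in reversed(symbols):   # Horner evaluation: multiply-accumulate
        B = B * M + s
    return B

def block_decode(B, M, n):
    """Unpack a block symbol back into its n symbol indexes, per (9.22)."""
    symbols = []
    for _ in range(n):
        symbols.append(B % M)     # remainder: s_i = B_i - B_{i+1} * M
        B //= M                   # integer division: B_{i+1} = B_i / M
    return symbols

assert block_decode(block_encode([2, 0, 1], 3), 3, 3) == [2, 0, 1]
```

When M is a power of two, the `% M` and `// M` operations reduce to bit masking and right shifting, as noted above.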

9.4 Recursive Coding


Huffman encoding is straightforward and simple because it only involves looking
up the codebook. Huffman decoding, however, is rather complex because it entails
searching through the tree until a matching leaf is found. If the codebook is too large,
consisting of more than 300 codewords, for example, the decoding complexity can
be excessive.
Recursive indexing is a simple method for representing an excessively large symbol set by a moderate one so that a moderate Huffman codebook can be used to
encode the excessively large symbol set.


Without loss of generality, let us represent a symbol set by its indexes starting from zero; thus each symbol in the symbol set corresponds to a nonnegative integer x. This x can be represented as
$$x = q \times M + r, \qquad (9.23)$$
where M is the maximum value of the reduced symbol set $\{0, 1, \ldots, M\}$, q is the quotient, and r is the remainder. Once M is agreed upon by the encoder and decoder, only q and r need to be conveyed to the decoder.
Usually r is encoded using a Huffman codebook and q by other means. One simple approach to encoding q is to represent it by repeating the symbol M q times. In this way a single Huffman codebook can be used to encode a large symbol set, no matter how large it is.
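One way to realize this scheme (illustrative helper names; the escape-symbol repetition is the simple approach just described):

```python
def recursive_index_encode(x, M):
    """Split x per (9.23) into q repetitions of the escape symbol M
    followed by the remainder r; every output symbol is in {0, ..., M}."""
    q, r = divmod(x, M)
    return [M] * q + [r]

def recursive_index_decode(symbols, M):
    """Count leading escape symbols (q), then read the remainder (r)."""
    q = 0
    for s in symbols:
        if s != M:
            return q * M + s
        q += 1
    raise ValueError("missing remainder symbol")

assert recursive_index_decode(recursive_index_encode(757, 300), 300) == 757
```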

9.5 A Fast Decoding Algorithm


Due to the need to search the tree to find the leaf that matches bits from the input stream, Huffman decoding is computationally expensive. While fast decoding algorithms are usually tied to specific hardware (computer) architectures, a generic algorithm is provided here to illustrate the steps involved in Huffman decoding:
1. n = 1;
2. Unpack one bit from the bit stream;
3. Concatenate the bit to the previously unpacked bits to form a word with n bits;
4. Search the codewords of n bits in the Huffman codebook;
5. Stop if a codeword is found to be equal to the unpacked word;
6. n = n + 1 and go back to step 2.
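A direct transcription of these six steps (a sketch; codewords are represented as bit strings, and the codebook layout is hypothetical):

```python
def huffman_decode(bits, codebook):
    """Bit-serial Huffman decoding. `codebook` maps each symbol to its
    codeword as a '0'/'1' string; `bits` is the packed bit stream as a
    string. Returns the list of decoded symbols."""
    inverse = {code: sym for sym, code in codebook.items()}
    symbols, word = [], ""
    for bit in bits:                # steps 2-3: unpack and concatenate
        word += bit
        if word in inverse:         # steps 4-5: search and match
            symbols.append(inverse[word])
            word = ""               # restart at step 1 for the next symbol
    return symbols
```

Because a Huffman code is prefix-free, the first codebook match is always the correct one; practical decoders replace the linear search with hash or table lookups indexed by several bits at a time.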

Part V

Audio Coding

While presented from the perspective of audio coding, the chapters in the previous parts cover theoretical aspects of coding technology that can be applied to the coding of signals in general. The chapters in this part are devoted to the coding of audio signals. In particular, Chap. 10 covers perceptual models, which determine which part of the source signal is inaudible (perceptually irrelevant) and thus can be removed. Chapter 11 addresses the resolution challenge posed by the frequent interruption by transients of the otherwise quasi-stationary audio signals. Chapter 12 deals with widely used methods for joint channel coding as well as the coding of low-frequency effect (LFE) channels. Chapter 13 covers a few practical issues frequently encountered in the development of audio coding algorithms. Chapter 14 is devoted to performance assessment of audio coding algorithms, and Chap. 15 presents the dynamic resolution adaptation (DRA) audio coding standard as an example to illustrate how to integrate the technologies described in this book to create a practical audio coding algorithm.

Chapter 10

Perceptual Model

Although data model and quantization have been discussed in detail in the earlier
chapters as the tool for effectively removing perceptual irrelevance, a question still
remains as to which part of the source signal is perceptually irrelevant. Feasible
answers to this question obviously depend on the underlying application. For audio
coding, perceptual irrelevance is ultimately determined by the human ear, so perceptual models need to be built that mimic the human auditory system so as to indicate
to an audio coder which parts of the source audio signal are perceptually irrelevant,
hence can be removed without audible artifacts.
When a quantizer removes perceptual irrelevance, it essentially substitutes quantization noise for perceptually irrelevant parts of the source signal, so the quantization process should be properly controlled to ensure that quantization noise is not
audible.
Quantization noise is not audible if its power is below the sensitivity threshold
of the human ear. This threshold is very low in an absolutely quiet environment
(threshold in quiet), but becomes significantly elevated in the presence of other
sounds due to masking. Masking is a phenomenon where a strong sound makes
a weak sound less audible or even completely inaudible when the power of the
weak sound is below a certain threshold jointly determined by the characteristics of
both sounds.
Quantization noise may be masked by signal components that occur simultaneously with the signal component being quantized. This is called simultaneous
masking and is exploited most extensively in audio coding. Quantization noise may
also be masked by signal components that are ahead of and/or behind it. This is
called temporal masking.
The task of the perceptual model is to explore the threshold in quiet and the
simultaneous/temporal masking to come up with an estimate of global masking
threshold which is a function of frequency and time. The audio coder can then adjust its quantization process in such a way that all quantization noises are below this
threshold to ensure that they are not audible.



10.1 Sound Pressure Level


Sound waves traveling in the air or other transmission media can be described by a time-varying atmospheric pressure change p(t), called sound pressure. Sound pressure is measured by the sound force per unit area and its unit is Newton per square meter (N/m²), which is also known as a Pascal (Pa). The human ear can perceive sound pressure as low as $10^{-5}$ Pa, and a sound pressure of 100 Pa is considered as the threshold of pain. These two values establish a dramatic dynamic range of roughly $10^7$.
When compared with the atmospheric pressure, which is 101,325 Pa, the absolute values of sound pressure perceivable by the human ear are obviously very small. To cope with this situation, the sound pressure level (SPL) is introduced:
$$l = 20 \log_{10} \frac{p}{p_0} \text{ dB}, \qquad (10.1)$$

where $p_0$ is a reference level of $2 \times 10^{-5}$ Pa. This reference level corresponds to the best hearing sensitivity of an average listener at around 1,000 Hz.
Another description of sound waves is the sound intensity I, which is defined as the sound power per unit area. For a spherical or plane progressive wave, sound intensity is proportional to the square of sound pressure, so the sound intensity level is related to the sound pressure level by
$$l = 20 \log_{10} \frac{p}{p_0} = 10 \log_{10} \frac{I}{I_0} \text{ dB}, \qquad (10.2)$$
where $I_0 = 10^{-12}$ W/m² is the reference level. Due to this relationship, the SPL and the sound intensity level are identical on the logarithmic scale.
When a sound signal is considered as a wide-sense stationary random process, its spectrum or intensity density level is defined as
$$L(f) = \frac{P(f)}{I_0}, \qquad (10.3)$$
where P(f) is the power spectrum density of the sound wave.

10.2 Absolute Threshold of Hearing


The absolute threshold of hearing (ATH) or threshold in quiet (THQ) is the minimum sound pressure level of a pure tone that an average listener with normal hearing capability can hear in an absolutely quiet environment. This SPL threshold varies with the frequency of the test tone; an empirical equation that describes this relationship is [91]


Fig. 10.1 Absolute threshold of hearing (threshold in quiet in dB SPL versus frequency in Hz)


$$T_q(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\, e^{-0.6\,(f/1000 - 3.3)^2} + 0.001\left(\frac{f}{1000}\right)^4 \text{ dB} \qquad (10.4)$$
and is plotted in Fig. 10.1.
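As a quick sketch (not from the book), (10.4) can be evaluated directly:

```python
import math

def threshold_in_quiet(f_hz):
    """Absolute threshold of hearing in dB SPL per the empirical
    formula (10.4), for a pure tone at frequency f_hz."""
    k = f_hz / 1000.0
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 0.001 * k ** 4)

print(threshold_in_quiet(3300.0))   # near the minimum, about -5 dB SPL
```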


As can be seen in Fig. 10.1, the human ear is very sensitive in frequencies
from 1,000 to 5,000 Hz and is most sensitive around 3,300 Hz. Beyond this region, the sensitivity of hearing degrades rapidly, especially below 100 Hz and above
10,000 Hz. Below 20 Hz and above 18,000 Hz, the human ear can hardly perceive
sounds. The formula in (10.4) and hence Fig. 10.1 does not fully reflect the rapid
degradation of hearing sensitivity below 20 Hz. When people age, the hearing sensitivity degrades mostly at high frequencies and there is little change in the low
frequencies.
It is rather difficult to apply the threshold in quiet to audio coding mostly because
there is no way to know the playback SPL that an audio signal is presented to a
listener. A safe bet is to equate the minimum in Fig. 10.1 around 3,300 Hz to the
lowest bit in the audio coder. This ensures that quantization noise is not audible
even if the audio signal is played back at the maximum volume, but is usually too
pessimistic because listeners rarely playback sound at the maximum volume.
Another difficulty with applying the threshold in quiet to audio coding is that
quantization noise is complex and is unlikely sinusoidal. The actual threshold in
quiet for complex quantization noise is definitely different than pure tones. But there
is not much research reported on this regard.


10.3 Auditory Subband Filtering


As shown in Fig. 10.2, when sound is perceived by the human ear, it is first
preprocessed by the human body, including the head and shoulder, and then the
outer ear canal before it reaches the ear drum. The vibration of the ear drum is
transferred by the ossicular bones in the middle ear to the oval window which is
the entrance to or the start of the cochlea in the inner ear. The cochlea is a spiral
structure filled with almost incompressible fluids, whose start at the oval window is
known as the base and whose end as the apex. The vibrations at the oval window
induce traveling waves in the fluids which in turn transfer the waves to the basilar
membrane that lies along the length of cochlea. These traveling waves are converted into electrical signals by neural receptors that are connected along the length
of the basilar membrane [53].

10.3.1 Subband Filtering


Different frequency components of an input sound wave are sorted out while traveling along the basilar membrane from the start (base) toward the end (apex). This is schematically illustrated in Fig. 10.3 for an example signal consisting of three tones (400, 1,600, and 6,400 Hz) presented to the base of the basilar membrane [102]. For each sinusoidal component in the input sound wave, the amplitude of basilar membrane displacement increases at first, reaches a maximum, and then decreases rather abruptly. The position where the amplitude peak occurs depends on the frequency of the sinusoidal component. In other words, a sinusoidal signal resonates strongly at a position on the basilar membrane that corresponds to its frequency. Equivalently, different frequency components of an input sound wave resonate at different locations on the basilar membrane. This allows different groups of neural receptors connected along the length of the basilar membrane to process different frequency components of the input signal. From a signal processing perspective, this frequency-selective processing of sound signals may be viewed as subband filtering, and the basilar membrane may be considered as a bank of bandpass auditory filters.

Fig. 10.2 Major steps involved in the conversion of sound waves into neural signals in the
human ear


Fig. 10.3 Instantaneous displacement of basilar membrane for an input sound wave consisting of
three tones (400, 1,600, and 6,400 Hz) that is presented to the oval window of basilar membrane.
Note that the three excitations do not appear simultaneously because the wave needs time to travel
along the basilar membrane

An observation from Fig. 10.3 is that the auditory filters are continuously placed along the length of the basilar membrane and are activated in response to the frequency components of the input sound wave. If the frequency components of the sound wave are close to each other, these auditory filters overlap significantly. There is, of course, no decimation in the continuous-time world of neurons. As will be discussed later in this chapter, the frequency responses of these auditory filters are asymmetric, nonlinear, level-dependent, and of nonuniform bandwidth that increases with frequency. Therefore, auditory filters are very different from the discretely placed, almost nonoverlapping, and often maximally decimated subband filters that we are familiar with.

10.3.2 Auditory Filters


A simple model for auditory filters is the gammatone filter, whose impulse response is given by [1, 71]
$$h(t) = A t^{n-1} e^{-2\pi B t} \cos(2\pi f_c t + \phi), \qquad (10.5)$$
where $f_c$ is the center frequency, $\phi$ the phase, A the amplitude, n the filter's order, t the time, and B the filter's bandwidth. Figure 10.4 shows the magnitude response of the gammatone filter with $f_c = 1000$ Hz, n = 4, and B determined by (10.7).
A widely used model that captures the asymmetric nature of auditory filters is the rounded exponential filter, denoted as roex($p_l$, $p_u$), whose power spectrum is given by [73]
$$W(f) = \begin{cases} \left(1 + p_l \dfrac{f_c - f}{f_c}\right) e^{-p_l \frac{f_c - f}{f_c}}, & f \le f_c, \\[2mm] \left(1 + p_u \dfrac{f - f_c}{f_c}\right) e^{-p_u \frac{f - f_c}{f_c}}, & f > f_c, \end{cases} \qquad (10.6)$$
where $f_c$ represents the center frequency, $p_l$ determines the slope of the filter below $f_c$, and $p_u$ determines the slope of the filter above $f_c$. Figure 10.5 shows the power spectrum of this filter for $f_c = 1000$ Hz, $p_l = 40$, and $p_u = 10$.

Fig. 10.4 Magnitude response of a gammatone filter that models the auditory filters (magnitude in dB versus frequency in Hz)

Fig. 10.5 Power spectrum of the roex($p_l$, $p_u$) filter ($f_c = 1000$ Hz, $p_l = 40$, and $p_u = 10$) that models auditory filters (power in dB versus frequency in Hz)


The roex($p_l$, $p_u$) filter with $p_l = p_u$ was used to estimate the critical bandwidth (ERB) of the auditory filters [53, 70]. When the order of the gammatone filter is in the range 3–5, the shape of its magnitude characteristic is very similar to that of the roex($p_l$, $p_u$) filter with $p_l = p_u$ [72]. In particular, when the order n = 4, the suggested bandwidth for the gammatone filter is
$$B = 1.019\,\mathrm{ERB}, \qquad (10.7)$$
where ERB is given in (10.15).
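The roex model (10.6) is simple enough to evaluate directly; the following sketch (not from the book) uses the parameters of Fig. 10.5 as defaults:

```python
import math

def roex_power(f_hz, fc_hz, p_lower=40.0, p_upper=10.0):
    """Power spectrum of the roex(pl, pu) auditory-filter model (10.6)."""
    g = abs(f_hz - fc_hz) / fc_hz          # normalized deviation from fc
    p = p_lower if f_hz <= fc_hz else p_upper
    return (1.0 + p * g) * math.exp(-p * g)

# The upper skirt falls off more slowly than the lower one (pl > pu):
print(10 * math.log10(roex_power(1200.0, 1000.0)))   # about -3.9 dB
print(10 * math.log10(roex_power(800.0, 1000.0)))    # about -25.2 dB
```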

10.3.3 Bark Scale


From a subband filtering perspective, the location on the basilar membrane where the amplitude maximum occurs for a sinusoidal component may be considered as the point that represents the frequency of the sinusoidal component and as the center frequency of the auditory filter that processes this component. Consequently, the distance from the base of the cochlea, or the oval window, along the basilar membrane represents a new frequency scale that is different from the linear frequency scale (in Hz) that we are familiar with. A frequency scale that seeks to linearly approximate the frequency scale represented by the basilar membrane length is the critical band rate or Bark scale. Its relationship with the linear frequency scale f (Hz) is empirically determined and may be analytically expressed as follows [102]:
$$z = 13 \arctan(0.00076 f) + 3.5 \arctan\!\left[(f / 7500)^2\right] \text{ Bark} \qquad (10.8)$$
and is shown in Fig. 10.6. In fact, one Bark approximately corresponds to a distance of about 1.3 mm along the basilar membrane [102]. The Bark scale is apparently neither linear nor logarithmic with respect to the linear frequency scale.
The Bark scale was proposed by Eberhard Zwicker in 1961 [101] and named in memory of Heinrich Barkhausen, who introduced the phon, a scale for loudness as perceived by the human ear.
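A one-line implementation of (10.8) (a sketch, not from the book):

```python
import math

def hz_to_bark(f_hz):
    """Critical band rate in Bark for a frequency in Hz, per (10.8)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

for f in (100, 1000, 4000, 15500):
    print(f, round(hz_to_bark(f), 2))   # about 1.0, 8.5, 17.3, and 24 Bark
```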

10.3.4 Critical Bands


While complex, the auditory filters exhibit strong frequency selectivity: the loudness
of a signal remains constant as long as its energy is within the passband, referred
to as the critical band (CB), of an auditory filter and decreases dramatically as the
energy moves out of the critical band. The critical bandwidth is a parameter that
quantifies the bandwidth of the auditory filter passband.
There are a variety of methods for estimating the critical bandwidth. A simple
approach is to present a set of uniformly spaced tones with equal power to the

Fig. 10.6 Relationships of the Bark scale with respect to the linear frequency scale (top) and the logarithmic frequency scale (bottom)

listeners and measure the threshold in quiet [102]. For example, to estimate the critical bandwidth near 1,000 Hz, the threshold in quiet is measured by placing more and more equally powered tones starting from 920 Hz with a 20 Hz frequency increment. Figure 10.7 shows that the measured threshold in quiet, which is the total power of all tones, remains at about 3 dB when the number of tones increases from one to eight and begins to increase afterwards. This indicates that the critical bandwidth near 1,000 Hz is about eight tones, or 160 Hz: it starts at 920 Hz and ends at 920 + 160 = 1,080 Hz.
To see this, let us denote the power of each tone as $\sigma^2$ and the total number of tones as n. Before n reaches eight, the power of each tone is
$$\sigma^2 = \frac{10^{0.3}}{n} \qquad (10.9)$$
to maintain a total power of 3 dB in the passband. When n reaches eight, the power of each tone is
$$\sigma^2 = \frac{10^{0.3}}{8}. \qquad (10.10)$$
When n is more than eight, only the first eight tones fall into the passband and the others are filtered out by the auditory subband filter. Therefore, the power for each tone given in (10.10) needs to be maintained to keep a total power of 3 dB in the passband. Consequently, the total power of all tones is
$$P = \frac{10^{0.3}}{8}\, n. \qquad (10.11)$$


Fig. 10.7 The measurement of critical bandwidth near 1,000 Hz by placing more and more uniformly spaced and equally powered tones starting from 920 Hz with a 20 Hz increment. The total power of all tones remains constant when the number of tones increases from one to eight and begins to increase afterwards, indicating that the critical bandwidth near 1,000 Hz is about eight tones, or 160 Hz, starting at 920 Hz and ending at 920 + 160 = 1,080 Hz. This is so because tones after the eighth are filtered out by the auditory filter, so they do not contribute to the power in the passband: the addition of more tones causes an increase in total power, but the power perceived in the passband remains constant

To summarize, the total power of all tones as a function of the number of tones is
$$P = \begin{cases} 10^{0.3}, & n \le 8, \\ \dfrac{10^{0.3}}{8}\, n, & \text{otherwise}. \end{cases} \qquad (10.12)$$
This is the curve shown in Fig. 10.7.


The method above is valid only in the frequency range between 500 and 2,000 Hz
where the threshold in quiet is approximately independent of frequency. For other
frequency ranges, more sophisticated methods are called for.
One such method, called masking in frequency gap, is shown in Fig. 10.8 [102].
It places a test signal, called a maskee, at the center frequency where the critical
bandwidth is to be estimated and then places two masking signals, called maskers,
of equal power at equal distance in the linear frequency scale from the test signal.
If the power of the test signal is weak relative to the total power of the maskers, the
test signal is not audible. When this happens, the test signal is said to be masked


Fig. 10.8 Measurement of critical bandwidth with masking in frequency gap. A test signal is
placed at the center frequency f0 where the critical bandwidth is to be measured and two masking
signals of equal power are placed at equal distance from the test signal. In (a), the test signal is
a narrow-band noise and the two maskers are tones. In (b), two narrow-band noises are used as
maskers to mask the tone in the center. The masked threshold versus the frequency separation
between the maskers is shown in (c). The frequency separation where the masked threshold begins
to drop off may be considered as the critical bandwidth

by the maskers. In order for the test signal to become audible, its power has to be
raised to above a certain level, denoted as masked threshold or sometimes masking
threshold.
When the frequency separation between the maskers is within the critical bandwidth, all of their powers fall into the critical band where the test signal resides and
the total masking power is the summation of that of the two maskers. In this case,
no matter how wide the frequency separation is, the total power in the critical band
is constant, so a constant masked threshold is expected. As the separation becomes
wider than the critical bandwidth, part of the powers of the maskers begin to fall
out of the critical band and are hence filtered out by the auditory filter. This causes
the total masking power in the critical band to become less, so their ability to mask
the test signal becomes weaker, causing the masked threshold to decrease accordingly. This is summarized in the curve of the masked threshold versus the frequency
separation shown in Fig. 10.8c. This curve is flat or constant for small frequency
separations and begins to fall off when the separation becomes larger than a certain
value, which may be considered as the critical bandwidth.
Data from many subjective tests were collected to produce the critical bandwidth
listed in Table 10.1 and shown in Fig. 10.9 [101, 102]. Here the lowest critical band
is considered to have a frequency range between 0 and 100 Hz, which includes the
inaudible frequency range between 0 and 20 Hz. Some authors may choose to assume the lowest critical band to have a frequency range between 20 and 100 Hz.

Table 10.1 Critical bandwidth as proposed by Zwicker

Band number   Upper frequency boundary (Hz)   Critical bandwidth (Hz)
1             100                             100
2             200                             100
3             300                             100
4             400                             100
5             510                             110
6             630                             120
7             770                             140
8             920                             150
9             1,080                           160
10            1,270                           190
11            1,480                           210
12            1,720                           240
13            2,000                           280
14            2,320                           320
15            2,700                           380
16            3,150                           450
17            3,700                           550
18            4,400                           700
19            5,300                           900
20            6,400                           1,100
21            7,700                           1,300
22            9,500                           1,800
23            12,000                          2,500
24            15,500                          3,500

Fig. 10.9 Critical bandwidth as proposed by Zwicker (critical bandwidth in Hz versus critical band number)


Many audio applications, such as CD and DVD, deploy sample rates that allow for a frequency range higher than the maximum 15,500 Hz given in Table 10.1. For these applications, one more critical band may be added, which starts from 15,500 Hz and ends at half of the sample rate.
The critical bandwidth listed in Table 10.1 might give the wrong impression that critical bands are discrete and nonoverlapping. To emphasize that critical bands are continuously placed on the frequency scale, an analytic expression is usually more useful. One such approximation is given below:
$$\Delta f_G = 25 + 75\left[1 + 1.4\left(\frac{f}{1000}\right)^2\right]^{0.69} \text{ Hz}, \qquad (10.13)$$
where f is the center frequency in hertz [102].


In terms of the number of critical bands, Table 10.1 may be approximated by
(10.8) where the nth Bark represents the nth critical band. Therefore, the Bark scale
is also called the critical band rate scale in the sense that one Bark is equal to one
critical bandwidth.
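Again as a quick sketch (not from the book), (10.13) is a one-liner:

```python
def critical_bandwidth(f_hz):
    """Analytic critical bandwidth in Hz at center frequency f, per (10.13)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

print(critical_bandwidth(1000.0))   # about 162 Hz, consistent with Table 10.1
```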

10.3.5 Critical Band Level


Similar to the sound intensity level, the critical band level of a sound in a critical band z is defined as the total sound intensity level within the critical band:
$$L(z) = \int_{f \in z} L(f)\, df. \qquad (10.14)$$
For tones, the critical band level is obviously the same as the tone's intensity level. For noise, the critical band level is the total of the noise intensity density levels within the critical band.

10.3.6 Equivalent Rectangular Bandwidth


An alternative measure of critical bandwidth is the equivalent rectangular bandwidth (ERB). For a given auditory filter, its ERB is the bandwidth of an ideal rectangular filter which has a passband magnitude equal to the maximum passband gain
of the auditory filter and passes the same amount of energy as the auditory filter [53].
A notched-noise method is used to estimate the shape of the roex filter in (10.6),
from which the ERB of the auditory filter is obtained [53, 70]. A formula that fits
many experimental data well is given below [21, 53]


Fig. 10.10 Comparison of ERB with traditional critical bandwidth (CB) (bandwidth in Hz versus center frequency in Hz, both on logarithmic scales)

$$\mathrm{ERB} = 24.7\,(0.00437 f + 1) \text{ Hz}, \qquad (10.15)$$
where f is the center frequency in hertz.


The ERB formula indicates that the ERB is linear with respect to the center
frequency. This is significantly different from the critical bandwidth in (10.13), as
shown in Fig. 10.10, especially for frequency below 500 Hz. It was argued that ERB
given in (10.15) is a better approximation than the traditional critical bands discussed in Sect. 10.3.4 because it is based on new data that were obtained using direct
measurement of critical bands using the notched-noise method by a few different
laboratories [53].
One ERB obviously represent one frequency unit in the auditory system, so the
number of ERBs corresponds to a frequency scale and is conceptually similar to
Bark scale. A formula for calculating this ERB scale or the number of ERBs for a
center frequency f in hertz is given below [21, 53]
Number of ERBs D 21:4 log10 .0:00437f C 1/:

(10.16)
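The pair (10.15)–(10.16) translates directly into code (a sketch, not from the book):

```python
import math

def erb_hz(f_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency f (10.15)."""
    return 24.7 * (0.00437 * f_hz + 1.0)

def erb_scale(f_hz):
    """Number of ERBs below frequency f, the ERB-rate scale (10.16)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

print(erb_hz(1000.0), erb_scale(1000.0))   # about 132.6 Hz and 15.6 ERBs
```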

10.4 Simultaneous Masking


It is very easy to hear a quiet conversation when the background is quiet. When there is sound in the background, which may be another conversation, music, or noise, the speaker has to raise his/her volume to be heard. This is a simple example of simultaneous masking.


Fig. 10.11 Masking of a weak sound (dashed lines) by a strong one (solid lines) when they are presented simultaneously and their frequencies are close to each other

While the mechanism behind simultaneous masking is very complex, involving


at least the nonlinear basilar membrane and the complex auditory neural system, a
simple explanation is offered in Fig. 10.11 where two sound waves are presented to a
listener simultaneously. Since their frequencies are close to each other, the excitation
pattern of the weaker sound may be completely shadowed by the stronger one. If the
basilar membrane is assumed to be linear, the weaker sound cannot be perceived by
auditory neurons, hence is completely masked by the stronger one.
Apparently, the masking effect is, to a large extent, dependent on the power of the masker relative to the maskee. For audio coding, the masker is considered as the signal and the maskee is the quantization noise that is to be masked, so this relative value is expressed by the signal-to-mask ratio (SMR):
$$\mathrm{SMR} = 10 \log_{10} \frac{\sigma^2_{\mathrm{Masker}}}{\sigma^2_{\mathrm{Maskee}}}. \qquad (10.17)$$
In order for a masker to completely mask a maskee, the SMR has to pass a certain threshold, called the SMR threshold:
$$T_{\mathrm{SMR}} = \min\{\mathrm{SMR} \mid \text{the maskee is inaudible}\}. \qquad (10.18)$$
The negative of this threshold in decibels is called the masking index [102]:
$$I = -T_{\mathrm{SMR}} \text{ dB}. \qquad (10.19)$$

10.4.1 Types of Masking


The frequency components of an audio signal may be considered as either noise-like or tone-like. Consequently, both the masker and the maskee may be either tones or noise, so there are four types of masking, as shown in Table 10.2. Note that the noise here is usually considered as narrow-band, with a bandwidth no more than the critical bandwidth. Since the removal of perceptual irrelevance is all about the masking of quantization noise, which is complex and rarely tone-like, the cases of TMN and NMN are most relevant for audio coding.

Table 10.2 Four types of masking

Masker   Maskee: Tone            Maskee: Noise
Tone     Tone masks tone (TMT)   Tone masks noise (TMN)
Noise    Noise masks tone (NMT)  Noise masks noise (NMN)

10.4.1.1 Tone Masking Tone

For the case of a pure tone masking another pure tone (TMT), both the masker and the maskee are simple, but it has turned out to be very difficult to conduct masking experiments and to build good models, mostly due to the beating phenomenon that occurs when two pure tones with frequencies close to each other are presented to a listener. For example, two pure tones of 990 and 1,000 Hz, respectively, produce a beating of 10 Hz, which causes the listeners to hear something different from the steady-state tone (masker). They then believe that the maskee has been heard, but this is, in fact, different from having actually heard another tone (maskee). Fortunately, quantization noise is rarely tone-like, so the lack of good models in this regard is less of a problem for audio coding. A large number of experiments have, nevertheless, indicated an SMR threshold of about 15 dB [102].

10.4.1.2 Tone Masking Noise

In this case, a pure tone masks a narrow-band noise whose spectrum falls within the critical band in which the tone resides. Since quantization noise is more noise-like, this case is very relevant to audio coding. Unfortunately, there exist relatively few studies to provide a good model for this useful case. A couple of studies, however, do indicate an SMR threshold between 21 and 28 dB [24, 84].
10.4.1.3 Noise Masking Noise

This case of a narrow-band noise masking another one is very relevant to audio coding, but is very difficult to study because of the phase correlations between the masker and the maskee. So it is not surprising that there are few experimental results addressing this important issue. The limited data, however, do suggest an SMR threshold of about 26 dB [23, 52].
10.4.1.4 Noise Masking Tone

A narrow-band noise masking a tone (NMT) is the most widely studied case in psychoacoustics. This type of experiment was deployed to estimate the critical bandwidth and the excitation patterns of the auditory filters. The masking spreading function to be discussed later in this section is largely based on this kind of study. There are plenty of experimental data and models for NMT. The SMR threshold is generally considered to vary from about 2 dB at low frequencies to 6 dB at high frequencies [102].


Table 10.3 Empirical SMR thresholds for the four masking types

Masking type   SMR threshold (dB)
TMT            15
TMN            21-28
NMN            26
NMT            2-6

10.4.1.5 Practical Masking Index

Table 10.3 summarizes the SMR thresholds for the four types of masking discussed above. From this table, we observe that tones have much weaker masking capability than noise and that noise is much more difficult to mask than tones.
For audio coding, only TMN and NMT are usually considered, with the following masking index formulas:
$$I_{\mathrm{TMN}} = -14.5 - z \text{ dB} \qquad (10.20)$$
and
$$I_{\mathrm{NMT}} = -K \text{ dB}, \qquad (10.21)$$
where K is a parameter between 3 and 5 dB [32].

10.4.2 Spread of Masking


The discussion above addresses the situation where a masker masks maskee(s) within the masker's critical band. The masking effect is no doubt the strongest in this case. However, masking also exists when the maskee is not in the same critical band as the masker. An essential contributing factor for this effect is that the auditory filters are not ideal bandpass filters, so they do not completely attenuate frequency components outside the passband. In fact, the roll-off beyond the passband is rather gradual, as shown in Figs. 10.4 and 10.5. This means that a significant chunk of the masker's power is picked up by the auditory filter of the critical band where the maskee resides, making the maskee less audible. This effect is called the spread of masking.
This explanation is apparently very simplistic in light of the nonlinear basilar membrane and the complex auditory neural system. As discussed in Sect. 10.4.1, it is very difficult to study the masking behavior even when both masker and maskees are within the same critical band, especially for the important cases of TMN and NMN, so it is no surprise that it is even more difficult to deal with the spread of masking effects.
For simplification, the masking spread function $SF(z_r, z_e)$ is introduced to express the masking effect of a masker at critical band $z_r$ on maskees at critical band $z_e$. If the masker at critical band $z_r$ has a critical band level of $L(z_r)$, the power leaked to critical band $z_e$, or the critical band level that the maskee's auditory filter picks up from the masker, is
$$L(z_r)\, SF(z_r, z_e). \qquad (10.22)$$


Fig. 10.12 Spreading of masking into neighboring critical bands. The left maskee is audible while the right one is completely masked because it is completely below the masked threshold

If the masking index at critical band $z_e$ due to the masker is $I(z_e)$, the masked threshold at critical band $z_e$ is
$$L_T(z_r, z_e) = I(z_e)\, L(z_r)\, SF(z_r, z_e). \qquad (10.23)$$
This relationship is shown in Fig. 10.12.


The basic masking spread function is also shown in Fig. 10.12; it is mostly extracted from data obtained from NMT experiments [102]. It is a triangular function with a slope of about 25 dB per Bark below the masker and 10 dB per Bark above the masker. The slope of 25 dB for the lower half remains almost constant for all frequencies. The slope of 10 dB for the upper half is also almost constant for all frequencies higher than 200 Hz. Therefore, the spread function may be considered as shift-invariant across the frequency scale.
As shown in Fig. 10.13, the simple spread function in Fig. 10.12 is captured by Schroeder in the following analytic form [84]:
$$SF(\Delta z) = 15.81 + 7.5(\Delta z + 0.474) - 17.5\sqrt{1 + (\Delta z + 0.474)^2} \text{ dB}, \qquad (10.24)$$
where
$$\Delta z = z_e - z_r \text{ Bark}, \qquad (10.25)$$
which signifies that the spread function is frequency shift-invariant.


A modified version of the above spreading function is given below:
$$\begin{aligned}
SF(\Delta z) = {}& 15.8111389 + 7.5(K\Delta z + 0.474) - 17.5\sqrt{1 + (K\Delta z + 0.474)^2} \\
&+ 8 \min\!\left[0,\, (K\Delta z - 0.5)^2 - 2(K\Delta z - 0.5)\right] \text{ dB}, \qquad (10.26)
\end{aligned}$$
where K is a tunable parameter. This spreading function is essentially the same as the Schroeder spreading function when K = 1, except for the last term, which introduces a dip near the top (see Fig. 10.13) that is intended to model additional nonlinear effects in the auditory system as reported in [102].

Fig. 10.13 Comparison of Schroeder's, modified Schroeder's, and MPEG Psychoacoustic Model 2 spreading functions (critical band level in dB versus Bark scale). The dips in the modified Schroeder and MPEG Model 2 curves near the top are intended to model additional nonlinear effects in the auditory system

This function is used in MPEG Psychoacoustic Model 2 [60] with the following parameter:
$$K = \begin{cases} 3, & \Delta z < 0, \\ 1.5, & \text{otherwise}. \end{cases} \qquad (10.27)$$
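The Schroeder family of spreading functions is easy to implement; the sketch below (not from the book) covers both (10.24) and, with dip=True and the K of (10.27), the modified form (10.26):

```python
import math

def schroeder_sf(dz_bark, K=1.0, dip=False):
    """Spreading in dB at a maskee-masker distance of dz_bark Barks,
    per (10.24); dip=True adds the last term of (10.26)."""
    x = K * dz_bark
    sf = 15.81 + 7.5 * (x + 0.474) - 17.5 * math.sqrt(1.0 + (x + 0.474) ** 2)
    if dip:
        sf += 8.0 * min(0.0, (x - 0.5) ** 2 - 2.0 * (x - 0.5))
    return sf

for dz in (-2, -1, 0, 1, 2):
    print(dz, round(schroeder_sf(dz), 1))   # peak near 0 dB at dz = 0
```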

The three models above are independent of the SPL or critical band level of the masker. While simple, this is not a realistic reflection of the auditory system. A model that accounts for level dependency is the spreading function used in MPEG Psychoacoustic Model 1, given below:
$$SF(\Delta z, L_r) = \begin{cases} 17(\Delta z + 1) - (0.4 L_r + 6), & -3 \le \Delta z < -1, \\ (0.4 L_r + 6)\Delta z, & -1 \le \Delta z < 0, \\ -17\Delta z, & 0 \le \Delta z < 1, \\ -(\Delta z - 1)(17 - 0.15 L_r) - 17, & 1 \le \Delta z < 8, \\ 0, & \text{otherwise}, \end{cases} \text{ dB} \qquad (10.28)$$
where $L_r$ is the critical band level of the masker [55]. This spreading function is shown in Fig. 10.14 for masker critical band levels of 20, 40, 60, 80, and 100 dB, respectively. It apparently delivers increased masking for higher masker SPL on both sides of the masking curve to match the nonlinear masking properties of the auditory system [102].


Fig. 10.14 The spreading function of MPEG Psychoacoustic Model 1 for masker critical band levels at 20, 40, 60, 80, and 100 dB, respectively (critical band level in dB versus Bark scale). Increased masking is provided for higher masker critical band levels on both sides of the masking curve

The masking characteristics of the auditory system are also frequency dependent: the masking slope decreases as the masker frequency increases. This dependency is captured by Terhardt in the following model [91]:
$$SF(\Delta z, L_r, f) = \begin{cases} (0.2 L_r - 230/f - 24)\Delta z, & \Delta z \ge 0, \\ 24\Delta z, & \text{otherwise,} \end{cases} \text{ dB}, \qquad (10.29)$$
where f is the masker frequency in hertz. Figure 10.15 shows this model at $L_r = 60$ dB and for f = 100, 200, and 10,000 Hz, respectively.

10.4.3 Global Masking Threshold


The masking spread function helps to estimate the masked threshold of a masker in one critical band over maskees in the same or a different critical band. From a maskee's perspective, it is masked by all maskers in all critical bands, including the critical band in which it resides. A question arises as to how those masked thresholds add up for maskees in a particular critical band.
To answer this question, let us consider two maskers, one at critical band $z_{r1}$ and the other at critical band $z_{r2}$, and denote their respective masking spreads at critical band $z_e$ as $L_T(z_{r1}, z_e)$ and $L_T(z_{r2}, z_e)$, respectively. If one of the masking effects is much stronger than the other, the total masking effect would obviously

Fig. 10.15 Terhardt's spreading function at a masker critical band level of $L_r = 60$ dB and masker frequencies of f = 100, 200, and 10,000 Hz, respectively (critical band level in dB versus Bark scale). It shows a reduced masking slope as the masker frequency increases

be dominated by the stronger one. When they are equal, how do the two masking effects add up? If it were intensity addition, a 3 dB gain would be expected. If it were sound pressure addition, a gain of 6 dB would be expected.
An experiment using a tone masker placed at a low critical band and a critical-band-wide noise masker placed at a high critical band to mask a tone maskee at a critical band in between (they are not in the same critical band) indicates a masking effect gain of 12 dB when the two maskers are of equal power. Even when one is much weaker than the other, a gain between 6 and 8 dB is still observed. Therefore, the addition of masking effects is stronger than sound pressure addition and much stronger than intensity addition [102].
When the experiment is performed within the same critical band, however, the gain of the masking effect is only 3 dB. This corresponds well to intensity addition.
In practical audio coding systems, intensity addition is often performed for simplicity, so the total masked threshold is calculated using the following formula:
$$L_T(z_e) = I(z_e) \sum_{\text{all } z_r} L(z_r)\, SF(z_r, z_e), \quad \text{for all } z_e. \qquad (10.30)$$
Since the threshold in quiet establishes the absolute minimal masked threshold, the global masked threshold curve is
$$L_G(z_e) = \max\{L_T(z_e),\, L_Q(z_e)\}, \qquad (10.31)$$


where $L_Q(z_e)$ is the critical band level that represents the threshold in quiet. A conservative approach to establishing this critical band level is to use the minimal threshold in the whole critical band:
$$L_Q(z_e) = \Delta f(z_e) \min_{f \in z_e} T_q(f), \qquad (10.32)$$
where $\Delta f(z_e)$ is the critical bandwidth in hertz for critical band $z_e$.
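Putting (10.30)–(10.31) together, a minimal sketch of the intensity-addition computation might look as follows (all quantities are assumed to be linear powers per critical band; the argument names are hypothetical):

```python
def global_masked_threshold(band_levels, masking_index, spread, quiet):
    """Global masked threshold per critical band, per (10.30)-(10.31).
    spread[zr][ze] is the linear spreading gain from band zr to band ze."""
    Z = len(band_levels)
    threshold = []
    for ze in range(Z):
        lt = masking_index[ze] * sum(
            band_levels[zr] * spread[zr][ze] for zr in range(Z))   # (10.30)
        threshold.append(max(lt, quiet[ze]))   # floor at threshold in quiet
    return threshold
```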

10.5 Temporal Masking


The simultaneous masking discussed in Sect. 10.4 is under the condition of steady state, i.e., both the masker and the maskee are long-lasting and in steady state. This steady-state assumption is true most of the time because audio signals may be characterized as consisting of quasistationary episodes which are frequently interrupted by strong transients. Transients bring on masking effects that vary with time. This type of masking is called temporal masking.
Temporal masking may be exemplified by postmasking (post-masker masking), which occurs after a loud transient, such as a gun shot. Immediately after such an event, there are a few moments when most people cannot hear much. In addition to postmasking, there are premasking (pre-masker masking) and simultaneous masking, as illustrated in Fig. 10.16.
The time period over which premasking can be measured is about 20 ms. During this period, the masked threshold gradually increases with time and reaches the level of simultaneous masking when the masker switches on. The period of strong premasking may be considered as long as 5 ms.
Although premasking occurs before the masker is switched on, it does not mean that the auditory system can listen to the future. Instead, it is believed to be caused by the build-up time of the auditory system, which is shorter for strong signals and longer for weak signals. The shorter build-up time of the strong masker enables parts of the masker to build up quickly, which then mask parts of the weak maskee which build up slowly.

Fig. 10.16 Schematic drawing illustrating temporal masking. The masked threshold is indicated
by the solid line


Postmasking is much stronger than premasking. It kicks in immediately after the masker is switched off and shows almost no decay for the first 5 ms. Afterwards, it decreases gradually with time over about 200 ms, and this decay cannot be considered as exponential.
The auditory system integrates sound intensity over a period of 200 ms [102], so the simultaneous masking in Fig. 10.16 may be described by the steady-state models described in the last section.
However, if the maskee is switched on shortly after the masker is switched on, there is an overshoot effect which boosts the masked threshold about 10 dB above the threshold for steady-state simultaneous masking. This effect may last as long as 10 ms.

10.6 Perceptual Bit Allocation


The optimal bit allocation strategy discussed in Chaps. 5 and 6 stipulates that the minimal overall MSQE is achieved when the MSQEs for all subbands are equalized. This is based on the assumption that quantization noises in all frequency bands are equal in terms of their contribution to the overall MSQE, as seen in (5.22) and (6.66).
From the perspective of perceptual irrelevancy, quantization noise in each critical band is not equal in terms of perceived distortion, and thus its contribution to the total perceived distortion is not equal. Only quantization noise in those critical bands whose power is above the masked threshold is of perceptual importance. Therefore, the MSQE for each critical band should be normalized by the masked threshold of that critical band in order to assess its contribution to perceptual distortion.
Toward this end, let us define the critical band level of quantization noise in the subband context by rewriting the critical band level defined in (10.14) as
$$\sigma_q^2(z) = \sum_{k \in z} \sigma_{e_k}^2, \qquad (10.33)$$

where $\sigma_{e_k}^2$ is again the MSQE for subband k. This critical band level of quantization noise may be normalized by the masked threshold of the same critical band using the following noise-to-mask ratio (NMR):
$$\mathrm{NMR}(z) = \frac{\sigma_q^2(z)}{L_G(z)}. \qquad (10.34)$$
Quantization noise for each critical band normalized in this way may be considered as equal in terms of its contribution to the perceptually meaningful total distortion. For this reason, the NMR can be viewed as the variance of perceptual quantization error. Consequently, the total average perceptual quantization error becomes
$$\sigma_p^2 = \frac{1}{Z} \sum_{\text{all } z} \mathrm{NMR}(z), \qquad (10.35)$$


where Z is the number of critical bands. Note that only critical bands with NMR
1
need to be considered because the other ones are completely inaudible and thus have
no contribution to p2 .
Comparing the formula above with (5.22) and (6.66), we see that, if subbands are replaced by critical bands and the quantization noise by the perceptual quantization noise \mathrm{NMR}(z), the derivation of optimal bit allocation and coding gain in Sect. 5.2 applies. So the optimal bit allocation strategy becomes:

Allocate bits to individual critical bands so that the NMR for all critical bands is equalized.
Since \mathrm{NMR}(z) \le 1 for a critical band z means that the quantization noise in that critical band is completely masked, the bit allocation strategy should ensure that \mathrm{NMR}(z) \le 1 for all critical bands whenever the bit resources are abundant enough.
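To make this strategy concrete, the following C++ sketch allocates a bit budget greedily, always giving the next bit to the critical band with the largest NMR. The 6-dB-per-bit noise reduction and all names are illustrative assumptions of this sketch, not part of any particular coding standard.

#include <cstddef>
#include <vector>

struct CriticalBand {
    double noise;  // sigma_q^2(z): quantization noise power at the current allocation
    double mask;   // L_G(z): masked threshold of the band
    int    bits;   // bits allocated to the band so far
};

// Greedy NMR equalization: each bit goes to the band with the largest
// NMR(z) = noise/mask and is assumed to lower its noise power by about 6 dB.
void allocateBits(std::vector<CriticalBand>& bands, int budget) {
    for (int b = 0; b < budget; ++b) {
        std::size_t worst = 0;
        double worstNMR = -1.0;
        for (std::size_t z = 0; z < bands.size(); ++z) {
            double nmr = bands[z].noise / bands[z].mask;
            if (nmr > worstNMR) { worstNMR = nmr; worst = z; }
        }
        bands[worst].bits += 1;
        bands[worst].noise *= 0.25;  // one more bit: noise power drops by a factor of 4
    }
}

Repeatedly serving the band with the largest NMR drives all NMR values toward a common level, which is precisely the equalization called for above.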

10.7 Masked Threshold in Subband Domain


The psychoacoustic experiments and theory regarding masking in the frequency domain are mostly based on the Fourier transform, so the DFT is the most likely frequency transform in perceptual models built for audio coding. On the other hand, subband filters are the preferred method for data modeling, so there is a need to translate the masked threshold from the DFT domain into the subband domain.
While the frequency scale correspondence between DFT bins and subbands is straightforward, the correspondence between the magnitude scales is not obvious. One general approach to addressing this issue is to use the relative value between signal power and masked threshold, which is essentially the SMR threshold defined in (10.17). Since the masked threshold is given by (10.31), the SMR threshold may be written as
T_{\mathrm{SMR}}(z) = \frac{\sigma_{\mathrm{Masker}}^2(z)}{L_G(z)},    (10.36)

where \sigma_{\mathrm{Masker}}^2(z) is the power of the masker in critical band z. This ratio should be the same in either the DFT or the subband domain, so it can be used to obtain the masked threshold in the subband domain:

L'_G(z) = \frac{\sigma'^2_{\mathrm{Masker}}(z)}{T_{\mathrm{SMR}}(z)},    (10.37)

where \sigma'^2_{\mathrm{Masker}}(z) is the signal power in the subband domain within critical band z.

10.8 Perceptual Entropy


Suppose that the \mathrm{NMR}(z) for all critical bands is equalized to \mathrm{NMR}_0; then the total MSQE for all subbands within critical band z is

\sigma_q^2(z) = L_G(z)\,\mathrm{NMR}_0,    (10.38)

according to (10.34). Since the normalization of the quantization error of all subbands within a critical band by the masked threshold also means that all these subbands are quantized with the same quantization step size, the MSQE for each subband is the same:

\sigma_{e_k}^2 = \frac{\sigma_q^2(z)}{k(z)} = \frac{L_G(z)\,\mathrm{NMR}_0}{k(z)}, \quad \text{for all } k \in z,    (10.39)

where k(z) denotes the number of subbands within critical band z. Therefore, the SNR for a subband in critical band z is
\mathrm{SNR}(k) = \frac{\sigma_{y_k}^2}{\sigma_{e_k}^2} = \frac{\sigma_{y_k}^2\, k(z)}{L_G(z)\,\mathrm{NMR}_0}, \quad \text{for all } k \in z.    (10.40)

Due to (2.43), the number of bits assigned to subband k is

r_k = \frac{10}{b \log_2 10} \log_2\!\left( \frac{\sigma_{y_k}^2\, k(z)}{L_G(z)\,\mathrm{NMR}_0} \right) - \frac{a}{b}, \quad \text{for all } k \in z,    (10.41)

so the total number of bits assigned to all subbands within critical band z is

r(z) = \sum_{k \in z} \left[ \frac{10}{b \log_2 10} \log_2\!\left( \frac{\sigma_{y_k}^2\, k(z)}{L_G(z)\,\mathrm{NMR}_0} \right) - \frac{a}{b} \right].    (10.42)

The average bit rate is then

R = \frac{1}{M} \sum_{\text{all } z} \sum_{k \in z} \left[ \frac{10}{b \log_2 10} \log_2\!\left( \frac{\sigma_{y_k}^2\, k(z)}{L_G(z)\,\mathrm{NMR}_0} \right) - \frac{a}{b} \right].    (10.43)

For perceptual transparency, the quantization noise in each critical band must be below the masked threshold, i.e.,

\mathrm{NMR}_0 \le 1.    (10.44)

A smaller \mathrm{NMR}_0 means a lower quantization noise level relative to the masked threshold and thus requires more bits to be allocated to the critical bands. Therefore, setting \mathrm{NMR}_0 = 1 requires the least number of bits and at the same time ensures that the quantization noise is just at the masked threshold. This leads to the following minimum average bit rate:

R = \frac{1}{M} \sum_{\text{all } z} \sum_{k \in z} \left[ \frac{10}{b \log_2 10} \log_2\!\left( \frac{\sigma_{y_k}^2\, k(z)}{L_G(z)} \right) - \frac{a}{b} \right].    (10.45)


The derivation above does not assume any particular quantization scheme because a general set of parameters a and b has been used. If uniform quantization is used and the subband samples are assumed to have the matching uniform distribution, a and b are determined by (2.45) and (2.44), respectively, so the above minimum average bit rate becomes

R = \frac{1}{M} \sum_{\text{all } z} \sum_{k \in z} 0.5 \log_2\!\left( \frac{\sigma_{y_k}^2\, k(z)}{L_G(z)} \right).    (10.46)
To avoid negative values that the logarithmic function may produce, we add one to its argument to arrive at

R = \frac{1}{M} \sum_{\text{all } z} \sum_{k \in z} 0.5 \log_2\!\left( 1 + \frac{\sigma_{y_k}^2\, k(z)}{L_G(z)} \right),    (10.47)

which is the perceptual entropy proposed by Johnston [34].
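As a worked illustration, the following C++ sketch evaluates (10.47) from a DFT power spectrum and a per-band global masked threshold; the band-edge table and all names are assumptions made for illustration.

#include <cmath>
#include <cstddef>
#include <vector>

double perceptualEntropy(const std::vector<double>& P,        // |X(k)|^2, M bins
                         const std::vector<int>& bandEdge,    // bins bandEdge[z]..bandEdge[z+1]-1 form band z
                         const std::vector<double>& mask) {   // global masked threshold L_G(z)
    const double M = static_cast<double>(P.size());
    double bits = 0.0;
    for (std::size_t z = 0; z + 1 < bandEdge.size(); ++z) {
        const double kz = bandEdge[z + 1] - bandEdge[z];      // k(z): subbands in band z
        for (int k = bandEdge[z]; k < bandEdge[z + 1]; ++k)
            bits += 0.5 * std::log2(1.0 + P[k] * kz / mask[z]); // summand of (10.47)
    }
    return bits / M;  // bits per sample
}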

10.9 A Simple Perceptual Model


To calculate perceptual entropy, Johnston proposed a simple perceptual model
which has influenced many perceptual models deployed in practical audio coding
systems [34]. This model is described here.
A Hann window [67] is first applied to a chunk of 2,048 input audio samples, and the windowed samples are transformed into frequency coefficients using a 2,048-point DFT. Since a real input signal produces a DFT spectrum that is symmetric with respect to the zero frequency, only the first half of the DFT coefficients needs to be considered. Therefore, the number of subbands M is 1,024.
The magnitude squares of the DFT coefficients, P(k), k = 0, 1, \ldots, M-1, may be considered as the power spectrum S_{xx}(e^{j\omega}) of the input signal, so they are used to calculate the spectral flatness measure defined in (5.65) for each critical band, which is denoted as \gamma_x^2(z) for critical band z.
If the input signal is noise-like, its spectrum is flat, so the spectral flatness measure should be close to one. If the input signal is tone-like, its spectrum is full of peaks, so the spectral flatness measure should be close to zero. Therefore, the spectral flatness measure is a good inverse measure of the tonal quality of the input signal and can thus be used to derive a tonality index such as the following:

T(z) = \min\!\left( \frac{\gamma_x^2(z)\ \text{(dB)}}{-60},\ 1 \right).    (10.48)

Note that the \gamma_x^2(z) above is in decibels. Since the spectral flatness measure is always positive and less than one, its decibel value is always negative. Therefore, the tonality index is always positive. It is also limited to no more than one by the equation above.
The tonality index indicates that the spectral components in critical band z are tone-like to degree T(z) and noise-like to degree 1 - T(z), so it is used to weigh the masking indexes given in (10.20) and (10.21), respectively, to give a total masking index of

I(z) = T(z) I_{\mathrm{TMN}}(z) + (1 - T(z)) I_{\mathrm{NMT}}(z) = -(14.5 + z) T(z) - K (1 - T(z))\ \mathrm{dB},    (10.49)

where K is set to a value of 5.5 dB.


The magnitude squares of the DFT coefficients, P(k), k = 0, 1, \ldots, M-1, may also be considered as the signal variances \sigma_{y_k}^2, k = 0, 1, \ldots, M-1, so the critical band level defined in (10.14) may be written in the current context as

\sigma_y^2(z) = \sum_{k \in z} P(k).    (10.50)

Using (10.30) with a masking spread function, the cumulative masking effect for each critical band can be obtained as

L_T(z) = I(z) \sum_{\text{all } z_r} \sigma_y^2(z_r)\, SF(z_r, z), \quad \text{for all } z.    (10.51)

Denoting

E(z) = \sum_{\text{all } z_r} \sigma_y^2(z_r)\, SF(z_r, z),    (10.52)
we obtain the total masked threshold

L_T(z) = 10 \log_{10} E(z) + I(z) = 10 \log_{10} E(z) - (14.5 + z) T(z) - K (1 - T(z))\ \mathrm{dB}    (10.53)

for all critical bands z.


As usual, this threshold is combined with the threshold in quiet using (10.31) to produce a global threshold, which is then substituted into (10.47) to give the perceptual entropy.
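The following C++ sketch strings the above steps together, starting from a power spectrum P(k). The two-slope spreading function used here (25 dB per critical band toward lower bands, 10 dB per critical band toward higher bands) is a common textbook approximation and an assumption of this sketch, as are all names; it is not the exact spreading function of [34].

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> maskedThreshold(const std::vector<double>& P,
                                    const std::vector<int>& bandEdge,    // bin edges of critical bands
                                    const std::vector<double>& quietThr) // threshold in quiet per band
{
    const std::size_t Z = bandEdge.size() - 1;
    std::vector<double> energy(Z), tonality(Z), thr(Z);
    for (std::size_t z = 0; z < Z; ++z) {
        double sum = 0.0, logSum = 0.0;
        const int n = bandEdge[z + 1] - bandEdge[z];
        for (int k = bandEdge[z]; k < bandEdge[z + 1]; ++k) {
            sum += P[k];
            logSum += std::log(P[k] + 1e-30);
        }
        energy[z] = sum;                                                          // sigma_y^2(z), (10.50)
        double sfmDB = 10.0 / std::log(10.0) * (logSum / n - std::log(sum / n));  // SFM in dB (<= 0)
        tonality[z] = std::min(sfmDB / -60.0, 1.0);                               // T(z), (10.48)
    }
    for (std::size_t z = 0; z < Z; ++z) {
        double e = 0.0;                                                           // E(z), (10.52)
        for (std::size_t zr = 0; zr < Z; ++zr) {
            double d = static_cast<double>(z) - static_cast<double>(zr);          // maskee minus masker
            double spreadDB = (d >= 0.0) ? -10.0 * d : 25.0 * d;                  // assumed two-slope spread
            e += energy[zr] * std::pow(10.0, spreadDB / 10.0);
        }
        double offsetDB = (14.5 + z) * tonality[z] + 5.5 * (1.0 - tonality[z]);   // -I(z), (10.49)
        double ltDB = 10.0 * std::log10(e + 1e-30) - offsetDB;                    // L_T(z), (10.53)
        thr[z] = std::max(std::pow(10.0, ltDB / 10.0), quietThr[z]);              // combine with quiet
    }
    return thr;
}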

Chapter 11

Transients

An audio signal often consists of quasistationary episodes, each including a number of tonal frequency components, which are frequently interrupted by dramatic transients. To achieve optimal energy compaction, and thus coding gain, a filter bank with fine frequency resolution is necessary to resolve the tonal components or fine frequency structures in quasistationary episodes. But such a filter bank is an ill fit for transients, which often last for no more than a few samples and hence require fine time resolution for optimal energy compaction. Therefore, filter banks with both good time and good frequency resolution are needed to effectively code audio signals.
According to the Fourier uncertainty principle, however, a filter bank cannot have fine frequency resolution and fine time resolution simultaneously. One approach to mitigating this problem is to adapt the resolution of a filter bank in time to match the conflicting resolution requirements posed by transients and quasistationary episodes.
This chapter presents a variety of contemporary schemes for switching the time–frequency resolution of the MDCT, the preferred filter bank for audio coding. Also presented are practical methods for mitigating the pre-echo artifacts that sometimes occur when the time resolution is not good enough to effectively deal with transients. Finally, switching the time–frequency resolution of a filter bank requires knowledge of the occurrence and location of transients. Practical methods for detecting and locating transients are presented at the end of this chapter.

11.1 Resolution Challenge


Audio signals mostly consist of quasistationary episodes, such as the one shown at the top of Fig. 11.1, which often include a number of tonal frequency components. For effective energy compaction to maximize coding gain, these tonal frequency components may be resolved or separated using filter banks, as shown by the 1,024-subband and 128-subband MDCT magnitude spectra in the middle and bottom of Fig. 11.1, respectively. Apparently, the 1,024-subband MDCT is able to resolve the frequency components much better than the 128-subband MDCT, thus having a clear advantage in energy compaction.


Fig. 11.1 An episode of quasistationary audio signal (top) and its MDCT spectra of 1,024 (middle)
and 128 (bottom) subbands

It can be expected that the closer together the frequency components are, the more subbands are needed to resolve them.
A filter bank with a large number of subbands, conveniently referred to as a long filter bank in this book, is able to deliver this advantage because using a large number of subbands to represent the full frequency range, as determined by the sample rate, means that each subband is allocated a small frequency range, hence the frequency resolution is finer. Also, such a filter bank has a long prototype filter that covers a large number of time samples, hence it is able to resolve minute signal variations in frequency.



Fig. 11.2 A transient that interrupts quasistationary episodes of an audio signal (top) and its
MDCT spectra of 1,024 (middle) and 128 (bottom) subbands, respectively

Unfortunately, the quasistationary episodes of an audio signal are intermittently interrupted by dramatic transients, as shown at the top of Fig. 11.2. Applying the same 1,024-subband and 128-subband MDCTs produces the spectra shown in the middle and bottom of Fig. 11.2, respectively. Now the short MDCT resolves the spectral valleys better than the long MDCT, so it is more effective in energy compaction.
There is another reason for the improved overall energy compaction performance of a short filter bank. Transients are well known for causing flat spectra, hence a large spectral flatness measure and poor coding gain, according to (5.65) and (6.81). As long as the prototype filter covers a transient attack, this spectral flattening effect is reflected in the whole block of subband samples. For overlapping filter banks, it affects multiple blocks of subband samples whose prototype filters cover the transient. For a long filter bank with a long prototype filter, this flattening effect thus touches a large number of subband samples, resulting in low coding gain for a large number of subband samples. Using a short filter bank with a short prototype filter, however, helps to isolate this spectral flattening effect to a small number of subband samples. Before and after those affected subband samples, the coding gain may go back to normal. Therefore, applying a short filter bank to cover the transients improves the overall coding gain.

11.1.1 Pre-Echo Artifacts


Another reason that favors short filter banks when dealing with transients is pre-echo artifacts. Quantization is the step in an audio coder that compresses the signal most effectively, but it also introduces quantization noise. Under a filter bank scheme, the quantization noise introduced in the subband or frequency domain becomes almost uniformly distributed in the time domain after the audio signal is reconstructed from the quantized subband samples. This quantization noise is shown at the top of Fig. 11.3, which is the difference between the reconstructed signal and the original signal.


Fig. 11.3 Pre-echo artifacts produced by a long 1024-subband MDCT. The top figure shows the error between the reconstructed signal and the original, which is, therefore, the quantization noise. The bottom shows the reconstructed signal alone; the quantization noise before the transient attack is clearly visible and audible


When looking at the reconstructed signal alone (see the bottom of Fig. 11.3), however, the quantization noise is not visible after the transient attack because it is visually masked by the signal. To the ear, it is also inaudible due to simultaneous masking and postmasking. Before the transient attack, however, it is clearly visible. To the ear, it is also very audible, and frequently very annoying, because the signal is supposed to be quiet before the transient attack (see the original signal at the top of Fig. 11.2). This frequently annoying quantization noise that occurs before the transient attack is called pre-echo artifacts.
One approach to mitigating pre-echo artifacts is to use a short filter bank whose fine time localization, or resolution, helps at least limit the extent to which the artifacts appear. For example, a short 128-subband MDCT is used to process the same transient signal to give the reconstructed signal and the quantization noise in Fig. 11.4. The quantization noise that occurs before the transient attack is still visible, but it is much shorter, less than 5 ms in fact. As discussed in Sect. 10.5, the period of strong premasking may be considered to be as long as 5 ms, so the short pretransient quantization noise is unlikely to be audible. For a 128-subband MDCT, the window size is 256 samples, which covers a period of 256/44.1 ≈ 5.8 ms for an audio signal with a 44.1-kHz sample rate.


Fig. 11.4 Pre-echo artifacts produced by a short 128-subband MDCT. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but may be inaudible because it could be shorter than premasking, which may last as long as 5 ms


In order for significant quantization noise to build up, the MDCT window must cover a significant amount of signal energy, so the number of input samples that are after the transient attack and still covered by the MDCT window must be significant. This means that the number of input samples before the transient attack is much smaller than 256, so the period of pre-echo artifacts is much shorter than 5.8 ms and thus very likely to be masked by premasking. Therefore, a 128-subband MDCT is likely to suppress most pre-echo artifacts.

11.1.2 Fourier Uncertainty Principle


Now it is clear that a filter bank needs to have both fine frequency resolution and fine time resolution to effectively encode both the transients and the quasistationary episodes of audio signals. The time–frequency resolution of a filter bank is largely determined by its number of subbands. For modulated filter banks, this is often reflected in the length of the prototype filter. A long prototype filter has better frequency resolution but poor time resolution. A short prototype filter (said to be compactly supported) has good time resolution but poor frequency resolution. No filter can provide both good time and good frequency resolution at the same time, due to the Fourier uncertainty principle, which is related to the Heisenberg uncertainty principle [90].
Without loss of generality, let h(t) denote a signal that is normalized as follows:

\int_{-\infty}^{\infty} |h(t)|^2\, dt = 1    (11.1)

and let H(f) be its Fourier transform. The dispersions about zero in the time and frequency domains may be defined by

D_t = \int_{-\infty}^{\infty} t^2 |h(t)|^2\, dt    (11.2)

and

D_f = \int_{-\infty}^{\infty} f^2 |H(f)|^2\, df,    (11.3)

respectively.
It is obvious that they represent the energy concentration of h(t) and H(f) toward zero in the time and frequency domains, respectively, and hence their respective time and frequency resolutions. The Fourier uncertainty principle states that [75]

D_t D_f \ge \frac{1}{16\pi^2}.    (11.4)

The equality is attained only in the case that h(t) is a Gaussian function.
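This bound is easy to verify numerically. The following self-contained C++ program integrates (11.2) and (11.3) for a unit-energy Gaussian and confirms that D_t D_f comes out at 1/(16\pi^2); the step sizes and integration ranges are arbitrary choices of this sketch.

#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.14159265358979323846, a = 1.0, dt = 1e-4, df = 1e-4;
    double Dt = 0.0, Df = 0.0;
    for (double t = -20.0; t <= 20.0; t += dt) {
        double h = std::pow(2.0 * a / pi, 0.25) * std::exp(-a * t * t);  // unit-energy Gaussian
        Dt += t * t * h * h * dt;                                        // (11.2)
    }
    for (double f = -20.0; f <= 20.0; f += df) {
        // Fourier transform of the Gaussian above (exp(-j 2 pi f t) convention)
        double H = std::pow(2.0 * a / pi, 0.25) * std::sqrt(pi / a)
                   * std::exp(-pi * pi * f * f / a);
        Df += f * f * H * H * df;                                        // (11.3)
    }
    std::printf("Dt*Df = %.6e, bound 1/(16 pi^2) = %.6e\n",
                Dt * Df, 1.0 / (16.0 * pi * pi));
    return 0;
}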


Although the Gaussian function provides the optimal simultaneous time–frequency resolution under the uncertainty principle, it is still within the limit stipulated by that principle, and thus not at the level of time–frequency resolution desired for audio coding. In addition, it has infinite support and clearly does not satisfy the power-complementary condition for use in a CMFB, so it is not suitable for practical audio coding systems.
Providing simultaneous time–frequency resolution is one of the motivations behind the creation of the wavelet transform, which may be viewed as a nonuniform filter bank. Its high-frequency basis functions are short to give good time resolution for high-frequency components, and its low-frequency basis functions are long to provide good frequency resolution for low-frequency components, so it does not violate the Fourier uncertainty principle. This approach can address the time–frequency resolution problems in many areas, such as images and videos, but it is not very suitable for audio. Audio signals often contain tones at high frequencies, which require fine frequency resolution at high frequencies and thus cannot be effectively handled by a wavelet transform.

11.1.3 Adaptation of Resolution with Time


A general approach to mitigating this limitation on time–frequency resolution is to adapt the time–frequency resolution of a filter bank with time: deploy fine frequency resolution to code quasistationary episodes and fine time resolution to localize transients.
This may be implemented using a hybrid filter bank in which each subband of the first-stage filter bank is cascaded with a transform as the second stage, as shown in Fig. 11.5. Around a transient attack, only the first stage is deployed.

Fig. 11.5 A hybrid filter bank for adaptation of time–frequency resolution with time. For transients, only the first stage is deployed to provide limited frequency resolution but better time localization. For quasistationary episodes, the second stage is cascaded to each subband of the first stage to boost frequency resolution

Its good time resolution helps to isolate the transient attack and limit pre-echo artifacts.
For quasistationary episodes, the second stage is deployed to further decompose the
subband samples from the first stage so that a much better frequency resolution is
delivered. It is desirable that the first stage filter banks have good time resolution
while the second stage transforms have good frequency resolution.
A variation of this scheme is to replace the transform in the second stage with linear prediction, as shown in Fig. 11.6. The linear prediction in each subband is switched on whenever the prediction gain is large enough and off otherwise. This approach is deployed by DTS Coherent Acoustics, where the first stage is a 32-band CMFB with a 512-tap prototype filter [88]. The DTS scheme suffers from the poor time resolution of the first filter bank stage because 512 taps translate into 512/44.1 ≈ 11.6 ms for a sample rate of 44.1 kHz, far longer than the effective premasking period of no more than 5 ms.
A more involved but computationally inexpensive scheme is to switch the number of subbands of an MDCT in such a way that a small number of subbands is deployed to code transients and a large number of subbands to code quasistationary episodes. This seems to have become the dominant scheme in audio coding for the adaptation of time–frequency resolution with time, as can be seen in Table 11.1. In some audio coding algorithms, this switched MDCT is cascaded to the output of a CMFB. For example, it is deployed by MPEG-1&2 Layer III [55, 56], whose first stage is a 32-band CMFB with a prototype filter of 512 taps and whose second stage is an MDCT that switches between 6 and 18 subbands.

Fig. 11.6 Cascading linear predictors with a filter bank to adapt time–frequency resolution with time. Linear prediction is optionally applied to the subband samples in each subband if the resultant prediction gain is sufficiently large

Table 11.1 Switched-window MDCT used by various audio coding algorithms

Audio coder                 Number of subbands
Dolby AC-2A [10]            128/512
Dolby AC-3 [11]             128/256
Sony ATRAC [92]             32/128 and 32/256
Lucent PAC [36]             128/1,024
MPEG 1&2 Layer 3 [55, 56]   6/18
MPEG 2&4 AAC [59, 60]       128/1,024
Xiph.Org Vorbis [96]        64, 128, 256, 512, 1,024, 2,048, 4,096 or 8,192
Microsoft WMA [95]          64, 128, 256, 512, 1,024 or 2,048
Digirise DRA [98]           128/1,024


In this configuration, the resolution adaptation is actually achieved through the switched MDCT. This scheme suffers from poor time resolution because the combined prototype filter length is 512 + 2 × 6 × 32 = 896 taps even when the MDCT is in short mode. This amounts to 896/44.1 ≈ 20.3 ms for a sample rate of 44.1 kHz, which is far longer than the effective premasking period of no more than 5 ms.
A similar scheme is used by MPEG-2&4 AAC in its gain control tool box [59, 60]. It deploys as the first stage a 4-subband CMFB with a 96-tap prototype filter and as the second stage an MDCT that switches between 32 and 256 subbands. This scheme seems barely able to avoid pre-echo artifacts thanks to its short combined prototype filter of 96 + 2 × 32 × 4 = 352 taps, which amounts to 352/44.1 ≈ 8.0 ms for a sample rate of 44.1 kHz.
A more sophisticated scheme is used by Sony ATRAC, deployed in its MiniDisc and Sony Dynamic Digital Sound (SDDS) cinematic sound systems, which involves a cascade of three stages of filter banks [92]. The first stage is a quadrature mirror filter bank (QMF) with two subbands. Its low-frequency subband is connected to another two-subband QMF, the outputs of which are connected to MDCTs that switch between 32 and 128 subbands. The high-frequency subband from the first-stage QMF is connected to an MDCT that switches between 32 and 256 subbands. The combined short prototype filter lengths are 2.9 ms for the low-frequency subbands and 1.45 ms for the high-frequency subband, so they are within the safe zone of premasking.
In place of the short MDCT, Lucent EPAC deploys a wavelet transform to
handle transients. It still uses a 1024-subband MDCT to process quasistationary
episodes [87].

11.2 Switched-Window MDCT


A switched-window MDCT, or switched MDCT, operates in long filter bank mode (with a long window function) to handle quasistationary episodes of an audio signal, switches to short filter bank mode (with a short window function) around a transient attack, and reverts back to the long mode afterwards. A widely used scheme is 1,024 subbands for the long mode and 128 subbands for the short mode.

11.2.1 Relaxed PR Conditions and Window Switching


To ensure perfect reconstruction, the linear phase condition (7.11) and the power-complementary condition (7.80) must be satisfied. Since these conditions impose symmetric and power-complementary constraints on each half of the window function, it seems impossible to change either the window shape or the number of subbands.


Fig. 11.7 The PR conditions only apply to two window halves that operate on the current block
of input samples, so the second half of the right window can have other shapes that satisfy the PR
conditions on the next block

An inspection of the MDCT operation in Fig. 7.15, however, indicates that each block of the input samples is operated on by only two window functions. This is illustrated in Fig. 11.7, where the middle block marked by the dotted lines is considered the current block. The first half of the left window, denoted as h_L(n), operates on the previous block and is thus not related to the current block. Similarly, the second half of the right window, denoted as h_R(n), operates on the next block and is thus not related to the current block, either. Only the second half of the left window h_L(n) and the first half of the right window h_R(n) operate on the current block. To ensure perfect reconstruction for the current block of samples, only the window halves covering it need to be constrained by the PR conditions (7.11) and (7.80). Therefore, they can be rewritten as

h_R(n) = h_L(2M - 1 - n)    (11.5)

and

h_R^2(n) + h_L^2(M + n) = 1,    (11.6)

for n = 0, 1, \ldots, M - 1, respectively.


Let us apply the variable change n = M + m to the index of the left window so that m = 0 corresponds to the middle of the left window, i.e., the start of its second half:

\bar{h}_L(m) = h_L(M + m), \quad -M \le m \le M - 1.    (11.7)

This enables us to rewrite the PR conditions as

h_R(n) = \bar{h}_L(M - 1 - n)    (11.8)

and

h_R^2(n) + \bar{h}_L^2(n) = 1,    (11.9)

for n = 0, 1, \ldots, M - 1, respectively. Therefore, the linear phase condition (11.8)


states that the second half of the left window and the first half of the right window are symmetric to each other with respect to the center of the current block, and the power-complementary condition states that these two window halves are power complementary at each point within the current block.
Since these PR conditions are imposed on window halves that apply only to the current block of samples, the window halves in the next block, or in any other block, are totally independent of those of the current block. Therefore, different blocks can have totally different sets of window halves, and these can change in either shape or length from block to block. In other words, windows can be switched from block to block.

11.2.2 Window Sequencing


While the selection of window halves is independent from block to block, both halves of a window are applied together to two blocks of samples to generate one block of MDCT coefficients. This imposes a small constraint on how window halves transition from one block to the next.
To see this, let us consider the current block of input samples denoted by the dashed line in Fig. 11.7. It is used together with the previous block by the left window to produce the current block of MDCT coefficients. Once this block of MDCT coefficients is generated and transferred to the decoder, the second half of the left window is determined and cannot be changed.
When it is time to produce the next block of MDCT coefficients, the second half of the left window was already determined when the current block of MDCT coefficients was generated. The PR conditions then dictate that its symmetric and power-complementary window half must be used as the first half of the right window. Therefore, the first half of the right window is also completely determined; there is no flexibility for change.


This, however, does not restrict the selection of the second half of the right window, which can be totally different from the first half, thus enabling switching to a different window.

11.3 Double-Resolution Switched MDCT


The simplest form of window switching is obviously the case of using a long window to process quasistationary episodes and a short window to deal with transients. This pair of window lengths represents two modes of time–frequency resolution.
11.3.1 Primary and Transitional Windows


The first step toward this simple case of window switching is to build a pair of long and short primary windows using a window design procedure. Conveniently denoted as h_M(n) and h_m(n), respectively, where M designates the block length, or the number of subbands, enabled by the long window and m that enabled by the short window, they must satisfy the original PR conditions (7.11) and (7.80), i.e., they are symmetric and power-complementary with respect to their respective middle points. Without loss of generality, the sine window can always be used for this purpose; examples are shown as the first and second windows in Fig. 11.8, where m = 128 and M = 1,024.
Since the long and short windows are totally different in length, corresponding to different block lengths, transitional windows are needed to bridge this transition of block sizes. To maintain a fairly good frequency response, a smooth transition in window shape is highly desirable and abrupt changes should be avoided. An example set of such transitional windows is given in Fig. 11.8 as the last three windows. In particular, window WL_L2S is a long window for the transition from a long window to a short window, WL_S2L for the transition from a short window to a long window, and WL_S2S for the transition from a short window to a short window, respectively.
A moniker of the form WX_Y2Z is used to identify the different windows above, where X designates the total window length, Y the left half, and Z the right half of the window. For example, WL_L2S designates a long window for the transition from a long window to a short window, and WS_S2S designates a short primary window, which may be considered as a short window for the transition from a short window to a short window.
Mathematically, let us denote the left and right halves of the long window as h_L^M(n) and h_R^M(n) and those of the short window as h_L^m(n) and h_R^m(n), respectively. Then the two primary windows can be rewritten as shown in (11.10) and (11.11) below.


Fig. 11.8 Window functions produced from a set of 2048-tap and 256-tap sine windows (from top to bottom: WS_S2S, WL_L2L, WL_L2S, WL_S2L, WL_S2S). Note that the length of WS_S2S is only 256 taps

WS_S2S:\quad h_{S\_S2S}(n) = \begin{cases} h_L^m(n), & 0 \le n < m \\ h_R^m(n), & m \le n < 2m \end{cases}    (11.10)

and

WL_L2L:\quad h_{L\_L2L}(n) = \begin{cases} h_L^M(n), & 0 \le n < M \\ h_R^M(n), & M \le n < 2M. \end{cases}    (11.11)

The transitional windows can then be expressed in terms of the long and short window halves as follows, the short window halves being understood as time-shifted into the indicated intervals:

WL_L2S:\quad h_{L\_L2S}(n) = \begin{cases} h_L^M(n), & 0 \le n < M \\ 1, & M \le n < 3M/2 - m/2 \\ h_R^m(n), & 3M/2 - m/2 \le n < 3M/2 + m/2 \\ 0, & 3M/2 + m/2 \le n < 2M \end{cases}    (11.12)

WL_S2L:\quad h_{L\_S2L}(n) = \begin{cases} 0, & 0 \le n < M/2 - m/2 \\ h_L^m(n), & M/2 - m/2 \le n < M/2 + m/2 \\ 1, & M/2 + m/2 \le n < M \\ h_R^M(n), & M \le n < 2M \end{cases}    (11.13)

WL_S2S:\quad h_{L\_S2S}(n) = \begin{cases} 0, & 0 \le n < M/2 - m/2 \\ h_L^m(n), & M/2 - m/2 \le n < M/2 + m/2 \\ 1, & M/2 + m/2 \le n < 3M/2 - m/2 \\ h_R^m(n), & 3M/2 - m/2 \le n < 3M/2 + m/2 \\ 0, & 3M/2 + m/2 \le n < 2M \end{cases}    (11.14)

Figure 11.9 shows some window switching examples using the primary and transitional windows of Fig. 11.8. Transitional windows are placed back to back in the second and third rows, and eight short windows are placed in the last row.
As shown in Fig. 11.8, the long-to-short transitional windows WL_L2S and WL_S2S provide for a short window half to be placed in the middle of their second halves. In the coordinates of the long transitional window, the short window must be placed within [3M/2 - m/2, 3M/2 + m/2) (see (11.12) and (11.14), respectively).


Fig. 11.9 Some possible window sequence examples


After 3M/2 + m/2, the long transitional window does not impose any constraint because its window values have become zero, so other methods can be used to represent the samples beyond 3M/2 + m/2.
The same argument applies to the first halves of the short-to-long transitional windows WL_S2L and WL_S2S as well (see Fig. 11.8), leading to no constraint placed on samples before M/2 - m/2, so other signal representation methods can be accommodated for the samples before M/2 - m/2.
MDCT with the short window function WS_S2S is a simple method for representing those samples. Due to the overlap of m samples between the short windows and the transitional windows, M/m short windows need to be used to represent the samples between the long-to-short (WL_L2S and WL_S2S) and short-to-long (WL_S2L and WL_S2S) transitional windows. See the last row of Fig. 11.9, for example.
These M/m short windows amount to a total of M samples, which is the same as the block length of the long windows, so this window switching scheme is amenable to maintaining a constant frame size, which is highly desirable for convenient real-time processing of signals. For the sake of convenience, a long block may sometimes be referred to as a frame in the remainder of this book.
Under such a scheme, the short window represents the fine time resolution mode and the long windows, including the transitional ones, represent the fine frequency resolution mode, thus amounting to a double-resolution switched MDCT.
If a constant frame size is forgone, other numbers of short windows can be used. This typically involves more sophisticated window sequencing and buffer management.
If the window size of the short window is set to zero, i.e., m = 0, there is no need for any form of short window. The unconstrained regions before M/2 and after 3M/2 are open for any kind of representation method. For example, a DCT-IV or a wavelet transform may be deployed to represent the samples in those regions.

11.3.2 Look-Ahead and Window Sequencing


It is clear now that the interval for possible short window placement is [M/2 + m/2, 3M/2 + m/2) (see the last row of Fig. 11.9). Referred to as the transient detection interval, it obviously straddles two frames: about half of the short windows are placed in the current frame and the other half in the next frame. If there is a transient in this interval, the short windows should be placed to cover it; otherwise, one of the long windows should be used.
In preparation for placing the short windows in this interval, a long-to-short transitional window WL_X2S needs to be used in the current frame, where the X can be either L or S and is determined in the previous frame. Therefore, to determine the window for the current frame, we need to look ahead into the transient detection interval for the presence or absence of transients.
Since this interval ends in the middle of the next frame, we need a look-ahead interval of up to half a frame, which causes additional coding delay. This half-frame look-ahead interval is also necessary for other window switching situations, such as the transition from short windows to long windows (see the last row of Fig. 11.9 again).
If there is a transient in the transient detection interval, the short windows have to be placed to cover it. Since the short windows only cover the second half of the current frame, the first half of the current frame needs to be covered by a WL_X2S long window. The short windows also cover the first half of the next frame, whose second half is covered by a WL_S2X long window. This completes the transition from a long window to short windows and back to a long window. Table 11.2 summarizes the possible window transition scenarios between two frames.

Table 11.2 Possible window switching scenarios

Current window half   Next window half
WL_X2S                WS_S2S
WL_X2L                WL_L2X
WS_S2S                WS_S2S
WS_S2S                WL_S2X
The decoder needs to know exactly which window the encoder used to generate the current block of MDCT coefficients in order to use the same window to perform the inverse MDCT, so information about window sequencing needs to be included in the bit stream. This can be done using three bits to convey a window index that identifies the windows in (11.10) to (11.14) and shown in Fig. 11.8. If window WL_S2S is forbidden, only two bits are needed to convey the window index.

11.3.3 Implementation
The double-resolution switched IMDCT is widely used in a variety of audio coding standards; its implementation is straightforward and is fully illustrated by the third row of Fig. 11.9. In particular, starting from the second window in that row (the WL_L2S window), the IMDCT may be implemented in the following steps:
1. Copy the (M - m)/2 samples where the WL_L2S window equals one from the delay line directly to the output.
2. Perform M/2m short IMDCTs and put all of their results into the output, because all of them belong to the current frame.
3. Perform another short IMDCT; put the first m/2 samples of its result into the output and store the remaining m/2 samples in the delay line, because the first half belongs to the current frame and the rest to the next frame.
4. Perform M/2m - 1 short IMDCTs and put all of their results into the delay line, because all of them belong to the next frame.
5. Clear the remaining (M + m)/2 samples of the delay line to zero, because the last WS_S2S window has ended.

A code sketch of this delay-line bookkeeping is given below.


11.3.4 Window Size Compromise


To localize transient attacks, the shorter the window, the better the localization and thus the better the coding gain. This is true for the group of samples around the transient attack, but a short window causes poor coding gain for the other samples in the frame, which do not contain transient attacks and are quasistationary. Therefore, there is a trade-off, or compromise, when choosing the size of the short window. Too short a window size means good coding gain for the transient attack but poor coding gain for the quasistationary remainder, and vice versa for a window size that is too long. The compromise eventually reached is, of course, optimal neither for the transients nor for the quasistationary remainder.
This problem is further compounded by the need for longer long windows to
better encode tonal components in quasistationary episodes. If the short window
size is fixed, longer long windows mean more short windows in a frame, thus more
short blocks of audio samples coded with poor coding gain. Therefore, the long
windows cannot be too long.
Apparently, there is a consensus on using 256 taps for the short window, as shown in Table 11.1. This seems to be a good compromise because a window size of 256 taps is equivalent to 256/44.1 ≈ 5.8 ms, which is barely longer than the 5 ms of premasking. In other words, it is the longest acceptable size: barely short enough, but not unnecessarily short. With this longest acceptable size for the short window, 2,048 taps for the long window is also a widely accepted option.
However, pre-echo artifacts are still frequently audible with such a window size arrangement, especially for audio pieces with significant transient attacks. This calls for techniques to improve the time resolution of the short window for enhanced control of pre-echo artifacts.

11.4 Temporal Noise Shaping


Temporal noise shaping (TNS) [26], used by AAC [60], is one of the pre-echo control methods [69]. It deploys an open-loop DPCM on the block of short MDCT coefficients that covers the transient attack and leaves the other blocks of short MDCT coefficients in the frame untouched.
In particular, it deploys the open-loop DPCM encoder shown in Fig. 4.3 on the block of short MDCT coefficients that covers the transient in the following steps (a code sketch follows the list):
1. Estimate the autocorrelation matrix of the MDCT coefficients.
2. Solve the normal equations (4.46) using a method such as the Levinson–Durbin algorithm to produce the prediction coefficients.
3. Produce the prediction residue of the MDCT coefficients using the prediction coefficients obtained in the last step.
4. Quantize the prediction residue. Note that the quantizer is placed outside the prediction loop, as shown in Fig. 4.3.


On the decoder side, the regular decoder shown in Fig. 4.4 is used to reconstruct the MDCT coefficients for the block with a transient.
The first theoretical justification for TNS is the spectral flattening effect of transients. For the short window that covers a transient attack, the resultant MDCT coefficients are close to flat, or essentially flat, and thus amenable to linear prediction. As discussed in Chap. 4, the resultant prediction gain may be considered the same as the coding gain.
The second theoretical justification is that an open-loop DPCM shapes the spectrum of the quantization noise toward the spectral envelope of the input signal (see Sect. 4.5.2). For TNS, the input signal is the MDCT coefficients, whose "spectrum" is the block of time-domain samples covered by the short window, so the quantization noise of the MDCT coefficients in the time domain is shaped toward the envelope of the time-domain samples. This means that more quantization noise is placed after the transient attack and less before it. Therefore, pre-echo artifacts are less likely.
As an example, let us apply the TNS method to the MDCT block that covers the transient attack of the audio signal in Fig. 11.2 with a predictor of order 2. The prediction coefficients are obtained from the autocorrelation matrix using (4.63). The quantization noise and the reconstructed signal are shown in Fig. 11.10. While the quantization noise before the transient attack is still visible, it is significantly shorter than that of the regular short window shown in Fig. 11.4.


Fig. 11.10 Pre-echo artifacts for TNS. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but is significantly shorter than that of the regular short window. However, the concentration of quantization noise in a short period of time (top) elevates the noise intensity significantly and hence may become audible


However, the concentration of quantization noise in a short period of time, as shown at the top of the figure, elevates the noise intensity significantly and hence may become audible. More sophisticated noise shaping methods may be deployed to shape the noise in such a way that it is more uniformly distributed behind the transient attack.
Since linear prediction needs to be performed for each MDCT coefficient, TNS is computationally intensive, even on the decoder side. The overhead for transferring the description of the predictor, including the prediction filter coefficients, is also considerable.

11.5 Transient-Localized MDCT


Another approach to improving the window size compromise is to leave the short window size unchanged but use a narrower window shape to cope with transients better. This allows better transient localization with minimal impact on the coding gain for the quasistationary remainder of the frame and on the complexity of both the encoder and the decoder.

11.5.1 Brief Window and Pre-Echo Artifacts


Let us look at the window function WL_S2S, which is the last one in Fig. 11.8. It is a long window, but its window shape is much narrower than that of the regular long window WL_L2L. This is achieved by shifting the short window halves outward and properly padding zeros to make the window as long as a long window. The same idea may be applied to the short window using a model window whose length, denoted as 2B, is even shorter than that of the short window, i.e., B < m. Denoting its left and right halves as h_L^B(n) and h_R^B(n), respectively, this model window is not directly used in the switched MDCT, other than for building other windows, so it may be referred to as a virtual window. As an example, a 64-tap sine window, shown at the top of Fig. 11.11 as WB_B2B, may be used for such a purpose. It is plotted using a dashed line to emphasize that it is a virtual window.
Based on this virtual window, the following narrow short window, called a brief window, may be built:
WS_B2B:\quad h_{S\_B2B}(n) = \begin{cases} 0, & 0 \le n < m/2 - B/2 \\ h_L^B(n), & m/2 - B/2 \le n < m/2 + B/2 \\ 1, & m/2 + B/2 \le n < 3m/2 - B/2 \\ h_R^B(n), & 3m/2 - B/2 \le n < 3m/2 + B/2 \\ 0, & 3m/2 + B/2 \le n < 2m. \end{cases}    (11.15)

Fig. 11.11 Window functions for a transient-localized 1024/128-subband MDCT (from top to bottom: WB_B2B, WS_S2S, WS_S2B, WS_B2S, WS_B2B, WL_L2L, WL_L2S, WL_S2L, WL_S2S, WL_L2B, WL_B2L, WL_B2B, WL_S2B, WL_B2S). The first window is plotted using a dashed line to emphasize that it is a virtual window (not directly used). Its length is 64 taps. All windows are built using the sine window as the model window

Its nominal length is the same as that of the regular short window WS_S2S, but its effective length is only

(3m/2 + B/2) - (m/2 - B/2) = m + B    (11.16)

because its leading and trailing (m - B)/2 taps are zero, respectively. Its overlap with each of its neighboring windows is only B.
As an example, the WB_B2B virtual window at the top of Fig. 11.11 may be used in combination with the short window WS_S2S in the same figure to build the brief window shown as the fifth window in Fig. 11.11. Its effective length of 128 + 32 = 160 taps is significantly shorter than the 256 taps of the short window, so it should provide better transient localization.
For a sample rate of 44.1 kHz, this corresponds to 160/44.1 ≈ 3.6 ms. Compared with the 5.8-ms length of the regular short window, this amounts to an improvement of 2.2 ms, or about 38%. This improvement is critical for pre-echo control because the 3.6-ms length of the brief window is well within the 5-ms range of premasking, while the 5.8-ms length of the regular short window is not.
Figure 11.12 shows the quantization noise achieved by this brief window for a piece of real audio signal. Its pre-echo artifacts are obviously shorter and weaker than those with the regular short window (see Fig. 11.4) but longer and weaker than those delivered by TNS (see Fig. 11.10). One may argue that TNS might deliver better pre-echo control due to its significantly shorter, though more powerful, pre-echo artifacts. Because the premasking effect often lasts up to 5 ms, however, pre-echo artifacts significantly shorter than the premasking period are most likely inaudible and thus irrelevant. Therefore, the simple brief window approach to pre-echo control serves the purpose well.


Fig. 11.12 Pre-echo artifacts for the transient-localized MDCT. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but is remarkably shorter than that of the regular short window


11.5.2 Window Sequencing


To switch between this brief window (WS_B2B), the long window (WL_L2L), and the short window (WS_S2S), the PR conditions (11.8) and (11.9) call for the addition of various transitional windows, which are illustrated in Fig. 11.11 along with the primary windows. Since the brief window provides much better transient localization, the new switched-window MDCT scheme may be referred to as the transient-localized MDCT (TLM).
Due to the increased number of windows as compared with the conventional approach, the determination of the appropriate window sequence is more involved, but still fairly simple. The addition to the usual placement of long and short windows discussed in Sect. 11.3 is the placement of the brief window within a frame with transients. Within such a frame, the brief window is applied only to the block of samples containing a transient, while the short and/or the appropriate short transitional windows are applied to the quasistationary samples in the remainder of the frame. Some window sequence examples are shown in Fig. 11.13.


Fig. 11.13 Window sequence examples. The top sequence is for the conventional method, which does not use the brief window. The brief window is used to cover blocks with transients in the sequences in the rest of the figure. The second and third sequences are for a transient occurring in the first and the third block, respectively. Two transients occur in the first and sixth blocks in the last sequence


11.5.2.1 Long Windows


If there is no transient within the current frame, a long window should be selected, the specific shape of which depends on the shapes of the immediately preceding and subsequent window halves, respectively. This is summarized in Table 11.3.

11.5.2.2 Short Windows


If there is a transient in the current frame, a sequence of eight short windows should be used, the specific shape of each depending on the transient locations. This is summarized as follows:

• WS_B2B is placed on each short block within which there is a transient, to improve the time resolution of the short MDCT.
• The window for the block immediately before this transient block has a designation of the form WS_X2B.
• The window for the block immediately after this transient block has a designation of the form WS_B2X.

The moniker X in the above designations can be either S or B. The allowed placements of short windows are summarized in Table 11.4.
For the remainder of the frame (away from the transients), the short window WS_S2S should be deployed, except for the first and last blocks of the frame, whose window assignments depend on the immediately adjacent window halves in the previous and subsequent frames. These are listed in Tables 11.5 and 11.6, respectively.
Table 11.3 Determination of long window shape for a frame without detected transient

Previous window half   Current window   Subsequent window half
WL_X2L                 WL_L2L           WL_L2X
WL_X2L                 WL_L2S           WS_S2X
WL_X2L                 WL_L2B           WS_B2X
WS_X2S                 WL_S2L           WL_L2X
WS_X2S                 WL_S2S           WS_S2X
WS_X2S                 WL_S2B           WS_B2X
WS_X2B                 WL_B2L           WL_L2X
WS_X2B                 WL_B2S           WS_S2X
WS_X2B                 WL_B2B           WS_B2X

Table 11.4 Allowed placement of short windows around a block with a detected transient

Pretransient   Transient   Posttransient
WL_L2B         WS_B2B      WL_B2L
WL_S2B         WS_B2B      WL_B2S
WL_B2B         WS_B2B      WL_B2B
WS_S2B         WS_B2B      WS_B2S
WS_B2B         WS_B2B      WS_B2B

Table 11.5 Determination of the first half of the first window in a frame with detected transients

Last window in previous frame   First window in current frame
WL_X2S                          WS_S2X
WS_X2S                          WS_S2X
WS_B2B                          WS_B2X

Table 11.6 Determination of the second half of the last window in a frame with detected transients

Last window in current frame   First window in subsequent frame
WS_X2S                         WL_S2X
WS_X2S                         WS_S2X
WS_B2B                         WS_B2X

11.5.3 Indication of Window Sequence to Decoder


The encoder needs to indicate to the decoder the window(s) it used to encode the current frame so that the decoder can use the same window(s) to decode the frame. This can again be accomplished using a window index.
For a frame without a detected transient, one label from the middle column of Table 11.3 is all that is needed.
For a frame with transients, the window sequencing procedure outlined in Sect. 11.5.2 can be used to determine the sequence of short window shapes based on the knowledge of the transient locations in the current frame. This procedure also needs to know whether there is a transient in the first block of the subsequent frame, due to the need for look-ahead.
A value of 0 or 1 may be used to indicate the absence or presence of a transient in a block. For example, 00100010 indicates that there is a transient in the third and in the seventh block, respectively. This sequence may be reduced to block counts, each counted from the previous transient block (or from the frame start): the previous sequence may be coded as 2 and 3.
Note that the particular method above cannot indicate whether there is a transient in the first block of the current frame. This piece of information, together with the absence or presence of a transient in the first block of the subsequent frame, may be conveyed by the nomenclature WS_Curr2Subs, where:
1. Curr (S = no, B = yes) identifies whether there is a transient in the first block of the current frame, and
2. Subs (S = no, B = yes) identifies whether there is a transient in the first block of the subsequent frame.
This is summarized in Table 11.7.
The first column of Table 11.7 is obviously the same set of labels used for the short windows. Combining the labels in Tables 11.7 and 11.3, we arrive at the complete set of window labels shown in Table 11.8.

Table 11.7 Encoding of the existence or absence of a transient in the first block of the current and subsequent frames

Label     Transient in first block of   Transient in first block of
          current frame                 subsequent frame
WS_B2B    Yes                           Yes
WS_B2S    Yes                           No
WS_S2B    No                            Yes
WS_S2S    No                            No

Table 11.8 Window indexes and their corresponding labels

Window index   Window label
0              WS_S2S
1              WL_L2L
2              WL_L2S
3              WL_S2L
4              WL_S2S
5              WS_B2B
6              WS_S2B
7              WS_B2S
8              WL_L2B
9              WL_B2L
10             WL_B2B
11             WL_S2B
12             WL_B2S

The total number of window labels is now 13, requiring 4 bits to transmit the index to the decoder.
A pseudo C++ function that illustrates how to obtain the short window sequencing from this window index and the transient location information is given in Sect. 15.4.5.
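The following C++ sketch (not the book's Sect. 15.4.5 function; all names are illustrative) shows how the eight short-window labels of a transient frame can be derived from per-block transient flags following the rules of Sect. 11.5.2:

#include <cstddef>
#include <string>
#include <vector>

// transient[b]  : true if short block b of the current frame contains a transient
// nextFrameFirst: true if the first block of the subsequent frame does
// Note: the left half of the first window additionally depends on the previous
// frame (Table 11.5); that dependency is omitted here for brevity.
std::vector<std::string> shortWindowSequence(const std::vector<bool>& transient,
                                             bool nextFrameFirst) {
    const std::size_t N = transient.size();  // normally 8
    std::vector<std::string> seq(N);
    for (std::size_t b = 0; b < N; ++b) {
        const bool leftB  = transient[b] || (b > 0 && transient[b - 1]);
        const bool rightB = transient[b] ||
                            (b + 1 < N ? transient[b + 1] : nextFrameFirst);
        seq[b] = std::string("WS_") + (leftB ? "B" : "S") + "2" + (rightB ? "B" : "S");
    }
    return seq;
}

For the example sequence 00100010 with no transient in the subsequent frame, this yields WS_S2S, WS_S2B, WS_B2B, WS_B2S, WS_S2S, WS_S2B, WS_B2B, WS_B2S, in agreement with Table 11.4.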

11.5.4 Inverse TLM Implementation


Both TLM and the regular double-resolution switched MDCT discussed in Sect. 11.3 involve switching between long and short MDCTs. The difference is that the regular one uses a small set of short and long windows, while TLM deals with a more complex set of short and long windows that involves the brief window WS_B2B. Note that the brief window and all of its related transitional windows are either short or long windows, just like the windows used by the regular switched MDCT. They differ simply because the window functions have different values. These simple differences in values do not change the procedure that calculates the switched IMDCT, so the same procedure given in Sect. 11.3.3 for the regular switched IMDCT can be applied to calculate the inverse TLM.


11.6 Triple-Resolution Switched MDCT


Another approach to improving the window size compromise is to introduce a third window size, called the medium window size, between the short and long window sizes. The primary purpose is to provide better frequency resolution for the stationary segments within a frame with detected transients, thus allowing a much shorter window size to be used to deal with transient attacks. There are, therefore, two window sizes within such a frame: short and medium.
In addition, the medium window size can also be used to handle a frame with smooth transients. In this case, there are only medium windows and no short windows within such a frame.
The three kinds of frames are summarized in Table 11.9. There are obviously three resolution modes, represented by the three window sizes, under such a switched MDCT architecture. To maintain a constant frame size, the long window size must be a multiple of the medium window size, which is in turn a multiple of the short window size.
As an example, let us reconsider the 1024/128 switched MDCT discussed in Sect. 11.3. To mitigate the pre-echo artifacts encountered with its short window size of 256 taps, 128 taps may be selected as the new short window size, corresponding to 64 MDCT subbands. To achieve better coding gain for the remainder of a transient frame, a medium window size of 512 taps may be selected, corresponding to 256 MDCT subbands. Keeping the long window size of 2,048 taps, we end up with a 1024/256/64 switched MDCT, or triple-resolution switched MDCT, whose adjacent window sizes differ by factors of 4 and 4, respectively.
Since there are three window sizes that can be switched between each
other, three sets of transitional windows are needed. Each set of these windows can
be built using the formulas from (11.10) to (11.14). Figure 11.14 shows all these
windows built based on the sine window.
The advantage of this new architecture is illustrated in Fig. 11.15, where a few
example window sequences are shown. Now the much shorter short window can be
used to better localize the transients and the medium window to achieve more coding
gain for the remainder of the frame. The all-medium-window frame is suitable for
handling slow transient frames.
Comparing Fig. 11.14 with Fig. 11.11, we notice that they are essentially the same,
except for the first window. Window WB B2B in Fig. 11.11 is virtual and not actually
used; instead, its dilated version, WS B2B, is used to deal with transients.
The window equivalent to WB B2B is labeled WS S2S in Fig. 11.14 and is used to
cope with transients.
Table 11.9 Three types of frames in a triple-resolution switched MDCT

    Frame type          Windows
    Quasistationary     A long window
    Smooth transient    A multiple of medium windows
    Transient           A multiple of short and medium windows

Fig. 11.14 Window functions produced for a 1024/256/64 triple-resolution switched MDCT using
the sine window as the model window. (The figure plots the fourteen windows WS_S2S, WM_M2M,
WM_M2S, WM_S2M, WM_S2S, WL_L2L, WL_L2M, WL_M2L, WL_M2M, WL_L2S, WL_S2L, WL_S2S,
WL_M2S, and WL_S2M.)

Since WS S2S can be much shorter than WS B2B, it can deliver much
better time localization. It is also smoother than WS B2B, so it should have better
frequency resolution properties.



Fig. 11.15 Window sequence examples for a 1024/256/64 triple-resolution switched MDCT. The
all-medium-window frame at the top is suitable for handling slow transient frames. In the rest of
the figure, the short window is used to better localize transients and the medium window to
achieve better coding gains for the remainder of a frame with detected transients

Comparing Figs. 11.15 and 11.13, we notice that they are also very similar; the
only essential difference is that the WS B2B window in Fig. 11.13 is replaced by
four WS S2S windows in Fig. 11.15. The window sequencing procedures are therefore
similar and are not discussed here.
However, the addition of another resolution mode means that the resultant audio
coding algorithm becomes much more complex, because each resolution mode
usually requires its own sets of critical bands, quantizers, and entropy codes. See
Chap. 13 for more explanation.

11.7 Transient Detection

The adaptation of the time-frequency resolution of a filter bank hinges on whether
there are transients in a frame as well as on their locations, so the proper detection
and localization of transients are critical to the success of audio coding.


11.7.1 General Procedure

Since transients consist mostly of high-frequency components, an input audio
signal x(n) is usually preprocessed by a high-pass filter h(n) to extract its
high-frequency components:

    y(n) = \sum_k h(k) x(n - k).                                (11.17)

The Laplacian, whose impulse response is given below,

    h(n) = {1, -2, 1},                                          (11.18)

is an example of such a high-pass filter.


The high-pass filtered samples y(n) within a transient detection interval (see
Sect. 11.3.2) are then divided into blocks of equal size, referred to as
transient-detection blocks. Note that a transient-detection block is different from the
block used by filter banks or MDCT. Let L denote the number of samples in such a
transient-detection block; then there are K = N/L transient-detection blocks in
each transient detection interval, assuming that it has a size of N samples. The short
block size of the filter bank should be a multiple of this transient-detection block size.
Next, some kind of metric or energy for each transient-detection block is calculated.
The most straightforward is the following L2 metric:

    E(k) = \sum_{i=0}^{L-1} |y(kL + i)|^2,  for k = 0, 1, ..., K-1.    (11.19)

For reduced computational load, the following L1 metric is also a good choice:

    E(k) = \sum_{i=0}^{L-1} |y(kL + i)|,  for k = 0, 1, ..., K-1.      (11.20)

Other, more sophisticated norms, such as perceptual entropy [34], can also be used.
At this point, a transient detection decision can be made based on the variations of
the metric or energy among the blocks. As a simple example, let us first calculate

    Emax = \max_{0 \le k < K} E(k)                              (11.21)

and

    Emin = \min_{0 \le k < K} E(k).                             (11.22)


Then the existence of a transient may be declared if

    (Emax - Emin) / (Emax + Emin) > T,                          (11.23)

where T is a threshold, which may be set to 0.5.
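The procedure up to this point is simple enough to sketch in a few lines of C.
The sketch below is an illustration, not production code: it uses the Laplacian
of (11.18), the L1 metric of (11.20), and the test of (11.23) with T = 0.5, with
the inequality rearranged to avoid division:

#include <math.h>
#include <stdbool.h>

/* frame[] holds one transient detection interval of N samples,
 * divided into K = N/L transient-detection blocks of L samples. */
bool DetectTransient(const double *frame, int N, int L)
{
    int K = N / L;
    double Emax = 0.0, Emin = INFINITY;

    for (int k = 0; k < K; k++) {
        double E = 0.0;
        for (int i = 0; i < L; i++) {
            int n = k * L + i;
            /* High-pass filter with h = {1, -2, 1} (11.18). */
            double y = -2.0 * frame[n];
            if (n > 0)     y += frame[n - 1];
            if (n < N - 1) y += frame[n + 1];
            E += fabs(y);                    /* L1 metric (11.20) */
        }
        if (E > Emax) Emax = E;
        if (E < Emin) Emin = E;
    }
    /* (Emax - Emin) / (Emax + Emin) > 0.5, rearranged (11.23). */
    return (Emax - Emin) > 0.5 * (Emax + Emin);
}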


After a frame is declared as containing transients, the next step is to identify the
locations of the transients in the frame. Let us use the following transient function
to designate the existence or absence of a transient in transient-detection block k:

    T(k) = 1 if there is a transient in block k, and T(k) = 0 otherwise.    (11.24)

A simple method for transient localization is

    T(k) = 1 if E(k) - E(k-1) > T, and T(k) = 0 otherwise,                  (11.25)

where T is a threshold. It may be set as

    T = \kappa (Emax + Emin) / 2,                                           (11.26)

where \kappa is an adjustable constant.


Since a short MDCT or subband block may contain multiple transient-detection
blocks, the transient locations obtained above need to be converted into
MDCT blocks. This is easily done by declaring that an MDCT block contains a
transient if any of its transient-detection blocks contains a transient, as sketched below.
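A minimal C sketch of this localization and conversion, assuming the block
metrics E[] and their extremes have already been computed, might look as follows
(the constant kappa of (11.26) and the names used here are illustrative):

/* detPerMdct is the number of transient-detection blocks per short
 * MDCT block; mdctHasTransient[] receives K/detPerMdct flags. */
void LocalizeTransients(const double *E, int K, int detPerMdct,
                        double kappa, double Emax, double Emin,
                        int *mdctHasTransient)
{
    double T = kappa * 0.5 * (Emax + Emin);      /* threshold (11.26) */

    for (int b = 0; b < K / detPerMdct; b++)
        mdctHasTransient[b] = 0;

    for (int k = 1; k < K; k++)
        if (E[k] - E[k - 1] > T)                 /* T(k) = 1 in (11.25) */
            mdctHasTransient[k / detPerMdct] = 1;
}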

11.7.2 A Practical Example

A practical example is provided here to illustrate how transient detection is done in
practical audio coding systems. It entails two stages of decision.
In the preliminary stage, no transient in the current frame is declared if any of the
following conditions is true:

1. Emax < k1 Emin, where k1 is a tunable parameter.
2. k2 Dmax < Emax - Emin, where k2 is a tunable parameter.
3. Emax < T1, where T1 is a tunable threshold.
4. Emin > T2, where T2 is a tunable threshold.

The Dmax used above is the maximum absolute metric difference defined below:

    Dmax = \max_{0 < k < K} |E(k) - E(k-1)|.                    (11.27)


If none of the conditions above is satisfied, a decision cannot be made for now.
Instead, we need to move on to the secondary stage of decision, which focuses on
eliminating large preattack and postattack peaks.
In particular, let us denote

    kmax = { k | E(k) = Emax, 0 \le k < K },                    (11.28)

which is the index of the block that has the maximum metric or energy. It may be
considered as the block where the transient attack occurs.
To find the preattack peak, search backward from kmax for the block where the
metric begins to increase:

for (k = kmax - 1; k > 0; k--) {
    if (E[k-1] > E[k]) {
        break;              /* metric starts to rise again */
    }
}
PreK = k - 1;

The preattack peak is

    PreEmax = \max_{0 \le k < PreK} E(k).                       (11.29)

A similar procedure is deployed to find the postattack peak. In particular, search
forward from kmax for the last block where the metric begins to increase and whose
metric is also no more than EX = 0.5 Emax:

k = kmax;
do {
    k++;
    /* Advance to the next local minimum of the metric. */
    for (; k < K-1; k++) {
        if (E[k+1] > E[k])
            break;
    }
    if (k+1 >= K)
        break;
} while (E[k] > EX);        /* keep going while still above EX */
PostK = k + 1;

The postattack peak is

    PostEmax = \max_{PostK \le k < K} E(k).                     (11.30)

For the final decision, declare that the current frame contains transients if

    Emax > k3 max{PreEmax, PostEmax},                           (11.31)

where k3 is a tunable parameter.

Chapter 12

Joint Channel Coding

Multichannel audio signals or programs, including the most widely used stereo and
5.1 surround sounds, are considered as consisting of discrete channels. Since a
multichannel signal is intended for the reproduction of a coherent sound field, there is
strong correlation between its discrete channels. This inter-channel correlation can
obviously be exploited to reduce bit rate.
On the receiving end, the human auditory system relies on many cues in the
audio signal to achieve sound localization, and the processing involved is very
complex. However, many psychoacoustic experiments have consistently indicated that
some components of the audio signal are either insignificant or even irrelevant for
sound localization, and thus can be removed for bit rate reduction.
Surround sounds usually include one or more special channels, called low
frequency effect or LFE channels, which are specifically intended for deep and
low-pitched sounds with a frequency range from 3 to 120 Hz. The significantly reduced
bandwidth presents a great opportunity for reducing bit rate.
It is obvious that a great deal of bit rate reduction can be achieved by jointly
coding all channels of an audio signal through exploitation of inter-channel redundancy
and irrelevancy. Unfortunately, joint channel coding has not reached the same level
of sophistication and effectiveness as intra-channel coding. Therefore, only a
few widely used and simple methods are covered in this chapter. See [25] to explore
further.

12.1 M/S Stereo Coding

M/S stereo coding, or sum/difference coding, is an old technology that was
deployed in stereo FM radio [15] and stereo TV broadcasting [27] to extend
from monaural or monophonic sound reproduction (often shortened to mono) to
stereophonic sound reproduction (shortened to stereo) while maintaining backward
compatibility with old mono receivers. Toward this end, the left (L) and right (R)
channels of a stereo program are encoded into sum

    S = 0.5 (L + R)                                             (12.1)

and difference

    D = 0.5 (L - R)                                             (12.2)

channels. A mono receiver can process the sum signal only, so the listener can hear
both left and right channels in a single loudspeaker. A stereo receiver, however, can
decode the left channel by

    L = S + D                                                   (12.3)

and the right channel by

    R = S - D,                                                  (12.4)

respectively, so the listener can enjoy stereo sound reproduction. The sum signal is
also referred to as the main signal and the difference signal as the side signal, so this
technology is often called main/side stereo coding.
From the perspective of audio coding, this old technology obviously provides a
means for exploiting the strong correlation between stereo channels. In particular, if
the correlation between the left and right channels is strongly positive, the difference
channel becomes very weak, thus needing fewer bits to encode. If the correlation is
strongly negative, the sum channel becomes weak, thus needing fewer bits to encode.
However, if the correlation between the left and right channels is weak, both sum and
difference signals are strong, so there is not much coding gain. Also, if the left or
right channel is much stronger than the other one, sum/difference coding is unlikely
to provide any coding gain either. Therefore, the encoder needs to dynamically decide
whether or not sum/difference coding is deployed and indicate the decision to the decoder.
Instead of the time domain approach used by stereo FM radio and TV broadcasting,
sum/difference coding in audio coding is mostly performed in the subband
domain. In addition, the decision as to whether sum/difference coding is deployed
for a frame is tailored to each critical band. In other words, there is a sum/difference
coding decision for each critical band [35].
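As an illustration of such a per-band decision, the following C sketch compares
the energies of the two channel pairs for one critical band. This is a crude
stand-in for a real bit-cost comparison, not the method of any particular coder;
the factor of 4 compensates for the 0.5 scaling in (12.1) and (12.2):

#include <stdbool.h>

/* left[] and right[] hold the n subband samples of one critical band. */
bool UseSumDifferenceCoding(const double *left, const double *right, int n)
{
    double eL = 0.0, eR = 0.0, eS = 0.0, eD = 0.0;
    for (int i = 0; i < n; i++) {
        double s = 0.5 * (left[i] + right[i]);   /* sum (12.1) */
        double d = 0.5 * (left[i] - right[i]);   /* difference (12.2) */
        eL += left[i] * left[i];
        eR += right[i] * right[i];
        eS += s * s;
        eD += d * d;
    }
    /* At a given distortion, the bits for a band grow with the log of
     * its energy, so compare the energy products of the two pairs:
     * strong positive or negative correlation makes S or D very weak. */
    return 4.0 * eS * eD < eL * eR;
}

With identical channels (D = 0) the test selects sum/difference coding; with
uncorrelated channels, or with one channel much stronger than the other, it
selects left/right coding, matching the qualitative discussion above.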

12.2 Joint Intensity Coding

The perception of sound localization by the human auditory system is
frequency-dependent [16, 18, 102]. At low frequencies, the human ear localizes sound
dominantly through inter-aural time differences (ITD). At high frequencies (higher than
4–5 kHz, for example), however, sound localization is dominated by inter-aural
amplitude differences (IAD). This latter property presents a great opportunity for
significant bit rate reduction.
The basic idea of joint intensity coding (JIC) is to merge subbands at high
frequencies into just one channel (thus significantly reducing the number of samples
to be coded) and to transmit instead a smaller number of bits that describe the
amplitude differences between channels.


Joint intensity coding is, of course, performed on a critical band basis, so only
critical band z is considered in the following discussion. Since sound localization is
dominated by inter-aural amplitude differences only at high frequencies, higher than
4–5 kHz, the critical band z should be selected in such a way that its lower frequency
bound is higher than 4–5 kHz. All critical bands higher than this can be considered
for joint intensity coding.
To illustrate the basic idea of and the steps typically involved in joint intensity
coding, let us suppose that there are K channels that can be jointly coded and denote
the nth subband sample of the kth channel as X(k, n). The first step of joint
intensity coding is to calculate the power or intensity of all subband samples in
critical band z for each channel:

    \sigma_k^2 = \sum_{n \in z} X^2(k, n),  0 \le k < K.        (12.5)

In the second step, all subband samples in critical band z are joined together to
form a joint channel:

    J(n) = \sum_{0 \le k < K} X(k, n),  n \in z.                (12.6)

Without loss of generality, let us suppose that these joint subband samples are
embedded in the zth critical band of the zeroth channel. These joint subband samples
need to be adjusted so that the power of the zeroth channel in critical band z is
unchanged:

    X(0, n) = \sqrt{ \sigma_0^2 / \sum_{n \in z} J^2(n) } J(n),  n \in z.    (12.7)

At this moment, these joint subband samples can be coded as a normal critical band
of the zeroth channel. All subband samples in critical band z in the other channels are
discarded, thus achieving significant bit savings.
In the third step, the steering vector for the jointed channels is calculated:

    \gamma_k = \sqrt{ \sigma_k^2 / \sigma_0^2 },  0 < k < K.                 (12.8)

Its elements are subsequently quantized and embedded into the bit stream as side
information. This completes the encoder side of joint intensity coding.
At the decoder side, the joint subband samples for critical band z embedded
as part of the zeroth channel are first decoded and reconstructed as \hat{X}(0, n).
Then the steering vector is unpacked from the bit stream and reconstructed as
\hat{\gamma}_k, k = 1, 2, ..., K-1. Finally, the subband samples of all jointed channels
can be reconstructed from the joint channel using the steering vector as follows:

    \hat{X}(k, n) = \hat{\gamma}_k \hat{X}(0, n),  n \in z and 0 < k < K.    (12.9)


Since a large chunk of subband samples is discarded, joint intensity coding
introduces significant distortion. When properly designed and not aggressively
deployed, the distortion may appear as a mild collapse of the stereo image; a listener
may complain that the sound field is narrower. When aggressively deployed, a
significant collapse of the stereo image is possible.
After subband samples are joined together and then reconstructed using the
steering vector at the decoder side, they can no longer cancel out aliasing from other
subbands. In other words, the aliasing cancellation property necessary for perfect
reconstruction is destroyed forever. If the stopband attenuation of the prototype filter is
high, this may not be a significant problem. For subband coders whose stopband
attenuation is not high, however, aliasing may become easily audible. This is the
case for MDCT, whose stopband attenuation is usually less than 20 dB.

12.3 Low-Frequency Effect Channel

Low-frequency effect (LFE) channels are for sound tracks that are specifically
intended for deep and low-pitched sounds, with bandwidth limited to
3–120 Hz. Such a channel is normally sent to a speaker that is specially designed
for low-pitched sounds, called a subwoofer or low frequency emitter.
Due to the extremely low bandwidth of an LFE channel, it can be very easily
coded. A simple approach is to down-sample it and then quantize it directly. Another
simple approach is to code it using the same method as the other channels,
but with a specific restriction that disallows transient mode for the filter bank.
For MDCT-based coders, for example, this entails that only the long window is
allowed.

Chapter 13

Implementation Issues

The various audio coding technologies presented in earlier chapters need to be
stitched together to form a complete audio coding algorithm, which in turn needs
to be implemented on a physical platform such as a DSP or other microprocessors.
Toward this end, many practical issues need to be effectively addressed, including
but not limited to the following:
Data Structure. The audio samples need to be properly organized and managed in
a way that is suitable for both encoding and decoding processing.
Entropy Codebooks. To maximize the efficiency of entropy coding, a variety of
entropy codebooks need to be adaptively deployed to match the statistical properties
of the various sources being coded.
Bit Allocation. Bits are the precious resource that needs to be optimally allocated to
deliver the best overall coding efficiency.
Bit Stream Format. All bits that represent the coded audio data need to be properly
formatted into a structure that facilitates easy synchronization, unpacking, decoding,
and error handling.
Implementation on Microprocessors. Audio encoders and decoders are mostly
implemented on DSPs or other low-cost microprocessors, so an algorithm needs
to be designed in such a way that it can be conveniently implemented on these
microprocessors.

13.1 Data Structure

To facilitate real-time audio coding, audio data need to be organized as frames so
that they can be encoded, delivered, and decoded piece by piece in real time. In
addition, the frequent interruption of mostly quasistationary audio signals by transients
necessitates the segmentation of audio data into quasistationary segments based on
transient locations, so that adaptation in time of a filter bank's resolution can be
properly deployed. In the frequency domain, the human ear processes audio signals
in critical bands, so subband samples need to be organized in a similar fashion to
maximally exploit perceptual irrelevance. The three requirements above constitute
the basic constraints in designing data structures for audio coding.


13.1.1 Frame-Based Processing

Real-time delivery of media content, such as live broadcasting, is often a necessary
requirement for many applications that involve audio. In these applications, one
cannot wait until all samples are received and then begin to process and deliver
the samples. In addition, waiting for completion would make the encoding and
decoding equipment extremely expensive due to the need to store the entire signal,
which may last hours, even days. A basic solution to this problem is frame-based
processing, in which the input signal is processed and delivered frame by frame.
To avoid data corruption, this frame-based processing is usually implemented
using double buffering or ping-pong buffering. Under such a mechanism, two sets
of input and output buffer pairs are deployed. While one set of buffers is handling
input and output operations (the input buffer is receiving and the output buffer is
transmitting data), the data in the other set are being processed (the input buffer is
being read and the output buffer is being filled). Once the input/output
handling set is full/empty and the data in the processing set are processed, the buffer
sets are switched, letting the processed buffer set handle new input/output
data and the data in the former input/output set be processed. The processing has to be
fast enough to ensure that the whole frame of data is processed
before the input/output buffer set is full/empty; otherwise, the processing has to
be abandoned to ensure real-time operation. For easy frame buffer management,
especially the synchronized switching of buffer sets, a constant frame size is highly
desirable.
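The switching logic of such a ping-pong scheme can be sketched as follows. The
I/O synchronization routine and the frame decoder are hypothetical placeholders;
only the buffer-set bookkeeping is the point here:

#define FRAME_SIZE 1024

static short inBuf[2][FRAME_SIZE];    /* two input/output buffer pairs */
static short outBuf[2][FRAME_SIZE];

extern void DecodeFrame(const short *in, short *out, int n);  /* hypothetical */
extern void WaitUntilIoPairDone(int ioSet);                   /* hypothetical */

void DecodeLoop(void)
{
    int io = 0, proc = 1;   /* which pair does I/O, which is processed */
    for (;;) {
        /* Decode one frame from the processing pair while the I/O
         * pair is receiving new data and transmitting decoded data. */
        DecodeFrame(inBuf[proc], outBuf[proc], FRAME_SIZE);
        WaitUntilIoPairDone(io);   /* input full, output empty */
        io ^= 1;                   /* switch the two buffer-pair sets */
        proc ^= 1;
    }
}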
The block-based nature of filter banks, and MDCT in particular, is obviously very
amenable to frame-based processing and to maintaining a constant frame size. One
can simply set the frame size to a multiple of the block size of the filter bank. With
a switched filter bank, whose block size varies over time, a constant frame size can
still be easily achieved by setting the frame size to a multiple of the longest block
size and ensuring that the longer block sizes are multiples of the shorter block sizes.
For example, the block sizes for the double-resolution switched MDCT in Sect. 11.3
are 1,024 and 128, respectively, and those for the triple-resolution switched MDCT
in Sect. 11.6 are 1,024, 256, and 64, respectively, so a constant frame size may be
maintained by setting it to a multiple of 1,024, such as 1,024, 2,048, ....
While a frame containing a multiple of the longest blocks may save a little side
information in the final bit stream, this structure is not essential as far as coding is
concerned. Without loss of generality, therefore, the frame size is assumed to be
the same as the longest block size of a switched filter bank in the remainder of
this book.

13.1.2 Time-Frequency Tiling

A frame with transients contains a multiple of short blocks, and transients may
occur in one or more of those blocks. The statistical properties of the blocks after a
transient are usually similar to those of the transient block, so a transient may be
considered as the start of a quasistationary segment, called a transient segment, each
consisting of one or more short blocks. For this reason, transient localization may also
be called transient segmentation. Under such a scheme, the first transient segment of a
frame starts from the first block of the frame and ends before the first transient block.
The second transient segment starts from the first transient block and ends either
before the next transient block or with the end of the frame. This procedure continues
until the end of the frame. This segmentation of subband samples by blocks based on
transient locations and frame boundaries is illustrated in Fig. 13.1.
Within each block, either short or long, the subband samples are segmented along
the frequency axis in such a way that critical bands are approximated, in order to
fully utilize results from perceptual models.

Fig. 13.1 Segmentation of blocks of subband samples within a frame based on transient locations
and frame boundaries



Fig. 13.2 Time-frequency tiling of subband samples in a frame by transient segments in the time
domain and critical bands in the frequency (subband) domain. Subband samples in each tile share
similar statistical properties along the time axis and are within one critical band along the frequency
axis, so they may be considered as one unit for quantization and bit allocation purposes

Subband segments thus obtained may be conveniently referred to as critical band
segments or simply critical bands.
The subband samples of a frame with detected transients are, therefore,
segmented along the time axis into transient segments and along the frequency axis
into critical band segments. Combined together, they segment all subband samples
in a frame into a time-frequency tiling as shown in Fig. 13.2. All subband samples
within each tile share similar statistical properties along the time axis and are
within one critical band along the frequency axis, so they should be considered as
one group or unit for quantization and bit allocation purposes. Such a tile is called a
quantization unit.
For a frame with a long block, the time-frequency tiling is obviously
reduced to a tiling of critical bands along the frequency axis only, so a quantization
unit consists of the subband samples in one critical band.

13.2 Entropy Codebook Assignment

After quantization, entropy coding may be applied to the quantization indexes to
remove statistical redundancy. As discussed in Chaps. 8 and 9, an entropy codebook
is designed based on a specific estimate of the probability model of the source
sequence, which consists of the quantization indexes in this case. When quantization
indexes are coded by an entropy codebook, it is imperative that they follow the
probability model for which the entropy codebook is designed. Otherwise, less than
optimal and even very poor coding performance is likely to occur.


Due to the dynamic nature of audio signals, quantization indexes usually do
not follow a single probability model. Accordingly, a library of entropy codebooks
should be designed and assigned to different segments of quantization indexes to
match their respective probability characteristics.

13.2.1 Fixed Assignment

For the sake of simplicity, the quantization indexes for the subband samples in one
quantization unit may be considered as statistically similar, so they can be considered
as sharing a common probability model and thus the same entropy codebook. In such a
case, the quantization unit is also the unit for codebook assignment: all quantization
indexes within a quantization unit are coded using the same entropy codebook. This
is shown in Fig. 13.3. In other words, the application scopes of entropy codebooks
overlap with those of quantization units. This is the scheme adopted by almost all
audio coding algorithms.
Under such a scheme, the codebook assigned to a quantization unit is usually the
smallest one that can accommodate the largest quantization index (in absolute value)
within the quantization unit. Since this largest value is completely determined once
the quantization step size is selected by bit allocation, the entropy codebook is also
completely determined; there is no room for optimization.

Fig. 13.3 All quantization indexes within a quantization unit are coded using the same entropy
codebook. The application scopes of entropy codebooks overlap with those of quantization units


13.2.2 Statistics-Adaptive Assignment

Although the boundaries of a quantization unit along the time axis are based on the
statistical properties of the subband samples, its boundaries along the frequency
axis are based on critical bands and are thus unrelated to the statistical properties of the
quantization indexes. Therefore, the fixed codebook assignment approach discussed
above does not provide a good match between the statistical properties of the entropy
codebooks and those of the quantization indexes.
To arrive at a better match, a statistics-adaptive approach may be adopted which
ignores the boundaries of the quantization units and, instead, adaptively matches
codebooks to the local statistical properties of the quantization indexes. This is
illustrated in Fig. 13.4 and outlined below (a minimal code sketch follows the list):
1. Assign to each quantization index the smallest codebook that can accommodate it,
thereby converting quantization indexes into codebook indexes.
2. Segment these codebook indexes into larger segments based on their local
statistical properties.
3. Select the largest codebook index within each segment as the codebook index for the
segment.
The advantage of this statistics-adaptive approach to codebook assignment over
the fixed assignment approach can be seen by comparing Fig. 13.3 with Fig. 13.4.
Since the largest quantization index falls into quantization unit d in Fig. 13.3, a large
codebook needs to be assigned to this unit to handle this large quantization index
if the fixed assignment scheme is deployed. This is obviously not a good match,
because most of the indexes in the unit are much smaller and thus are not suitable
for being coded using a large codebook.

Fig. 13.4 Statistics-adaptive approach to entropy codebook assignment, which ignores the boundaries
of the quantization units and, instead, adaptively matches codebooks to the local statistical
properties of quantization indexes


Using the statistics-adaptive approach, however, the largest quantization index
is segmented into codebook segment C as shown in Fig. 13.4, so it shares a codebook
with other large quantization indexes in the segment. Also, all quantization
indexes in codebook segment D are small, so a small codebook can be selected for
them, which results in fewer bits for coding these quantization indexes.
With the fixed assignment approach, only the codebook indexes need to be
transferred to the decoder as side information, because the codebook application scopes
are the same as the quantization units, which are usually determined by transient
segments and critical bands. The statistics-adaptive approach, however, needs to
transfer the codebook application scopes to the decoder as side information, in
addition to the codebook indexes, since they are dynamic and independent of the
quantization units. This overhead must be contained in order for the statistics-adaptive
approach to offer any advantage. This can be accomplished by imposing a
limit on the number of segments, thus restricting the amount of side information.

13.3 Bit Allocation

The goal of audio coding is to reduce the bit rate required to carry audio data to the
decoder with inaudible distortion. Under a normal configuration, a particular
application would specify a target bit rate, which may be a maximal rate that the audio
stream can never exceed, an average rate that the audio stream should maintain over
a certain period of time, or a variable rate that varies with channel condition (such
as the Internet). Depending on the type of bit rate, an inter-frame bit allocation
algorithm is needed to optimally allocate a certain number of bits to encode each
audio frame. Within each frame, an intra-frame bit allocation algorithm is needed
to optimally allocate these bits to encode each quantization index and other side
information to ensure minimal distortion.

13.3.1 Inter-Frame Allocation

There are many algorithms for allocating bits to frames. The easiest approach is to
allocate the same number of bits to each frame, since an equal frame size is usually
deployed for easy buffer handling:

    Bits Per Frame = (Bit Rate / Sample Rate) \times Samples Per Frame.    (13.1)

As an example, let us consider a bit rate of 128 kbps and a frame size of 1,024
samples. For a sample rate of 48 kHz, the number of bits assigned to code each
frame is


    128/48 \times 1,024 \approx 2,730.
While simple, this approach is not a good match to the dynamic nature of audio
signals. For example, frames whose audio samples are either silent or of small
magnitude demand a small number of bits. Allocating the same number of bits to
such frames means that a significant number of bits are not used and thus wasted. On
the other hand, frames with transients tend to demand a large number of bits;
allocating the same number of bits to such frames results in an inadequate number of bits,
and distortion is likely to become audible. Therefore, a better approach is to
adaptively allocate bits to frames in proportion to their individual demand while
maintaining full conformance to the target bit rate.
Adaptive inter-frame bit allocation is a very complex problem that is affected
by a variety of factors, such as the type of bit rate constraint (maximal or average
bit rate), the size of decoder buffers, and the bit demands of individual audio frames.
Audio decoders usually have very limited buffer size, which leads to the
possibility of buffer underflow and overflow. Buffer underflow, also called buffer
underrun, is a condition that occurs when the input buffer is fed with data at a
lower speed than the data is being read from it for a prolonged period of time, so that
there is not enough data for the decoder to decode. When this happens, real-time
audio playback has to be paused until enough data is fed to the input buffer. Buffer
overflow is the opposite condition, where the input buffer is fed with data at a higher
speed than the data is being read from it for a prolonged period of time, so that the
input buffer is full and can no longer take more data. When this occurs, some of the
input data have to be discarded, causing decoding errors.
The bit demands of individual audio frames vary with time. Since it is impossible
to predict the bit demand of future frames, optimal bit allocation is very
problematic for real-time encoding. For applications where real-time encoding is not
required, one can usually make a pilot run of encoding to estimate the bit demand
for each frame, thus making inter-frame bit allocation easier.

13.3.2 Intra-Frame Allocation

Once the number of bits allocated to a frame is known, there is still a need to
optimally allocate these bits, referred to as a bit pool, to each individual quantization
unit in such a way that quantization noise is either completely inaudible or least
audible. According to the optimal bit allocation strategy discussed in Sects. 5.2, 6.5,
and 10.6, the basic principle is to equalize MSQE or NMR (if a perceptual model
is used) across all quantization units. While there are many methods to achieve this,
the following water-filling algorithm, sketched in code after the list, illustrates the
basic idea:
1. Find the quantization unit whose quantization noise is most audible, i.e., whose NMR
is the highest.
2. Decrease the quantization step size for this unit.


3. Recalculate the number of bits consumed.
4. Repeat steps 1 through 3 until either
   • the bit pool is exhausted, or
   • the quantization noise for all quantization units is below the respective
     masking thresholds, i.e., the NMR for all quantization units is less than one.
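A minimal C sketch of this loop is given below; Nmr(), DecreaseStepSize(),
IncreaseStepSize(), and BitsConsumed() are hypothetical stand-ins for the real
quantizer and perceptual model:

extern double Nmr(int unit);              /* hypothetical: current NMR of a unit */
extern void   DecreaseStepSize(int unit); /* hypothetical */
extern void   IncreaseStepSize(int unit); /* hypothetical */
extern int    BitsConsumed(void);         /* hypothetical */

void WaterFilling(int numUnits, int bitPool)
{
    for (;;) {
        /* Step 1: the unit whose quantization noise is most audible. */
        int worst = 0;
        for (int u = 1; u < numUnits; u++)
            if (Nmr(u) > Nmr(worst))
                worst = u;

        /* Exit condition: all NMRs are below one. */
        if (Nmr(worst) <= 1.0)
            break;

        /* Step 2: refine the quantization of this unit. */
        DecreaseStepSize(worst);

        /* Steps 3-4: recount bits; exit when the bit pool is exhausted. */
        if (BitsConsumed() > bitPool) {
            IncreaseStepSize(worst);   /* undo the last refinement */
            break;
        }
    }
}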

13.4 Bit Stream Format

All elements of compressed audio data that represent a frame of audio samples need
to be packed into a structured frame for transmission to the decoder and subsequent
unpacking by the decoder, so that the original audio signal can be reconstructed and
played back. A sequence of such frames forms an audio bit stream.
The payloads of an audio frame obviously consist of bits that represent audio
samples, but supplementary data that provide a description of the audio signal,
such as sample rate and speaker configuration, are needed to assist the
synchronization, decoding, and playback of audio signals. Error protection codes may be
inserted to help deal with difficult channels. Auxiliary data, such as time codes and
information about the band, may also be attached to enhance the audio playback
experience. Therefore, a compressed audio frame may have a typical frame structure
as shown in Table 13.1, in which the payloads are supplemented by a frame header,
error protection codes, auxiliary data, and an end of frame signature.

13.4.1 Frame Header

A basic frame header structure is shown in Table 13.2. Its first role is to assist the
decoder in synchronizing with the audio frames so that it knows where a frame starts
and ends. This is usually achieved through the synchronization codeword, the
compressed data size codeword, and the end of frame signature, which is the last
codeword of a frame.
Table 13.1 A basic frame structure for compressed audio data. The payloads are the bits
representing the audio data for all channels

    Frame Header
    Payloads:
        Audio Data for Normal Channel 0
        Audio Data for Normal Channel 1
        ...
        Audio Data for LFE Channel 0
        Audio Data for LFE Channel 1
        ...
    Error Protection Codes
    Auxiliary Data
    End of Frame

Table 13.2 A basic frame header that assists the synchronization of audio frames and provides
a description of the audio signal carried by the audio bit stream

    Synchronization word
    Compressed data size
    Sample rate
    Number of normal channels
    Number of LFE channels
    Speaker configuration

The synchronization word signals the start of the frame. Some audio coding
algorithms impose a constraint that the synchronization word cannot appear
anywhere inside a frame for easy synchronization with the frame: once the
synchronization word is found, it is the start of the frame and there is no ambiguity. In such
a case, it is unnecessary to use the compressed data size codeword and end of frame
signature.
Forbidding the synchronization word to appear inside a frame imposes a certain
degree of inflexibility on the encoding process. To lessen this inflexibility, it is
imperative to use a long synchronization word. A good example is 0xffffffff
(hexadecimal), which requires only that 32 consecutive binary 1s do not appear anywhere
inside the frame.
Without the restriction above, a synchronization word might appear in the middle
of a frame. If the search for the synchronization word begins in the middle of such a
frame, an identification of the synchronization word inside the frame triggers false
synchronization. However, the compressed data size codeword, which comes right
after the synchronization word, would point to a position in the bit stream where the
end of frame signature is absent, so synchronization cannot be confirmed.
Consequently, a new round of the synchronization loop is launched.
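The resulting synchronization loop may be sketched as follows; the BitStream
type and all of its helper functions are hypothetical, and only the consistency
check among the three codewords is the point:

#include <stdbool.h>
#include <stdint.h>

#define SYNC_WORD 0xFFFFFFFFu

typedef struct BitStream BitStream;   /* hypothetical reader type */

extern bool     AtEndOfStream(BitStream *bs);                      /* hypothetical */
extern uint32_t PeekWord32(BitStream *bs);                         /* hypothetical */
extern uint32_t PeekDataSize(BitStream *bs);                       /* hypothetical */
extern bool     EndOfFrameSignatureAt(BitStream *bs, uint32_t sz); /* hypothetical */
extern void     SkipBits(BitStream *bs, int nbits);                /* hypothetical */

bool Resynchronize(BitStream *bs)
{
    while (!AtEndOfStream(bs)) {
        if (PeekWord32(bs) != SYNC_WORD) {
            SkipBits(bs, 8);                  /* advance byte by byte */
            continue;
        }
        uint32_t size = PeekDataSize(bs);     /* codeword after the sync word */
        if (EndOfFrameSignatureAt(bs, size))  /* consistency check */
            return true;                      /* synchronization confirmed */
        SkipBits(bs, 8);                      /* false sync: keep looking */
    }
    return false;
}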
The second role of the frame header is to provide a description of the audio
signal carried by the bit stream. This facilitates the decoding of the compressed
audio frames as well as the playback of the decoded audio. A minimal description
must include the sample rate, the number of normal and LFE channels, and the
speaker configuration.

13.4.2 Audio Channels

The payloads of an audio frame are the bits representing all audio samples, which
are usually packed one channel after another as shown in Table 13.1. Within each
channel is a large chunk of ancillary data, called side information, and the bits
representing the quantization indexes, as shown in Table 13.3.
The side information aids the unpacking of the quantization indexes and the
reconstruction of audio samples from these indexes. It may include bits that represent
the MDCT window index, transient locations, control bits for joint channel coding,
and indexes for quantization step sizes and bit allocation. Note that the side information
does not necessarily have to be packed as a whole chunk. Instead, it can be interleaved
with the bits representing the quantization indexes.

Table 13.3 Basic data format for packing one audio channel consisting of M entropy codebook
application scopes

    Side Information:
        Window index
        Transient locations
        Joint channel coding
        Quantization step sizes
        Bit allocation
    Quantization Indexes:
        Entropy codebook application scope 0
        Entropy codebook application scope 1
        ...
        Entropy codebook application scope M-1

The bits representing the quantization indexes for all coded subband samples
may be packed sequentially from low frequency to high frequency. For frames with
short windows, they are usually laid out one transient segment after another. When
entropy coding is used for the quantization indexes, these bits are obviously placed
according to the application scopes of the respective entropy codebooks.

13.4.3 Error Protection Codes

Error protection codes, such as the Reed-Solomon code [44], may be inserted
between data elements in a frame to detect bit stream errors and even correct minor
ones. To reduce the associated overhead, they may be selectively placed to protect
critical segments of data in a frame, such as the entropy codebook indexes, within
which any error is likely to cause the whole decoding process to crash. This amounts
to unequal error protection.
In addition, bits in a frame may be arranged in such a way that most burst bit
errors would cause a graceful degradation of audio quality, instead of a total crash.
The end result is a certain degree of error resilience, which is valuable when audio
streams are transmitted or broadcast over wireless channels.
However, error protection codes are extensively deployed in modern digital
communication protocols and systems to ensure that the bit error rate is below a certain
threshold. In addition, the trio of synchronization word, compressed data size word,
and end of frame signature provides a basic mechanism for error detection: they
must be consistent with each other. Therefore, it is unnecessary to insert additional
error protection capability into audio frames for normal applications. Error protection
should be considered as part of channel coding, unless the channel condition is so
bad that joint source-channel coding becomes necessary.

13.4.4 Auxiliary Data

Many data elements that are not part of the audio data and are user-defined, such as
the name of the band that played the piece of music being encoded or a number that
identifies the encoder, may be attached to the audio frame as auxiliary data.


The decoder can choose to ignore all of them without interrupting the decoding and
playback of the audio signal.
The auxiliary data may even provide a mechanism for future extension of the
audio coding algorithm. Data for a coder extension may be attached to the original
frame as auxiliary data for the corresponding new decoders to decode. This extension
in the auxiliary data is simply ignored by older decoders, which do not
recognize the extension.

13.5 Implementation on Microprocessors

An important consideration when designing an audio coding algorithm
is implementation cost, especially on the decoder side, which is directly linked to
encoder/decoder cost and their energy consumption. Decoder cost is usually a major
factor affecting a consumer's purchase decision. Energy consumption has become
an important issue in the global warming trend and is a critical factor affecting the
viability of mobile devices.
Algorithm complexity has a significant impact on implementation cost. In the
pursuit of coding performance, there is a tendency to add more and more compression
components to the algorithm, usually at mediocre performance gain. An alternative
is to maximize the performance of basic components to rein in a bloated algorithm.
The synthesis filter bank and entropy decoding are the two most computationally
expensive components in an audio decoder. The cosine modulated filter bank, and
MDCT in particular, is widely used in audio coding largely because it has a
structure amenable to fast algorithms. Huffman coding is considered a good
balance between coding gain and decoder complexity. If companding is deployed,
the expanding on the decoder side may become a computationally expensive
operation.

13.5.1 Fitting to Low-Cost Microprocessors

Another implementation aspect that might be overlooked is the suitability of the
decoding algorithm for low-cost microprocessors that have no floating-point units
and small on-chip memory. For improved accuracy and performance on such systems,
the audio coding algorithm must be designed in such a way that the decoder
can be implemented using simple fixed-point arithmetic without resorting to
sophisticated techniques to manage the magnitudes of the variables involved in the
decoding process.
The first step toward this end is to limit the dynamic range of all signal
components that may be used by the decoding algorithm to within the range
that can be represented by the integers of a typical microprocessor, while maintaining
the precision required by audio signals. While 16 bits per sample have been


widely used for delivering audio signals in various digital audio systems, most
notably the compact disc (CD), the representation of high-fidelity sounds may need
more than 20 bits per sample, so 24 bits per sample has been considered more
appropriate.
The 24 bits are also amenable to microprocessors that do not implement fixed-point
arithmetic in hardware and usually have at least a 32-bit data width. For such
processors, the 24 bits for audio signals may be placed in the lower 24 bits
of a 32-bit word, leaving the upper 8 bits as a buffer for coping with overflow,
which may occur during a series of accumulation operations.
Any noninteger parameters that the decoder may use, such as the quantization
step sizes, must be converted to a 24-bit integer representation.
Division, which is extremely expensive on such systems, should be strictly
disallowed.
Modern microprocessors usually come with a small amount of fast on-chip internal
memory and a large amount of slow off-chip or external memory. If the footprint
of the decoding algorithm fits within the fast internal memory, decoding is
significantly accelerated. Software development time and reliability may also be
improved if managing various accesses to the external memory is avoided. The memory
footprint may be reduced by using small entropy and quantization codebooks, as
well as a small audio frame size.

13.5.2 Fixed-Point Arithmetic

A real number on a computer may be considered as consisting of an integer and a
scale factor. For example, the real number 0.123 can be considered as 123/1,000,
where 123 is the integer and 1,000 is the scale factor. If both the integer value and
the scale factor are stored, the number is said to be represented by a floating-point
number, because the scale factor actually determines the radix point or decimal
point. The stored scale factors allow floating-point numbers to have a wide range of
values.
If the scale factor is not stored but assumed during the whole course of
computation, the radix point is then fixed, so the real number becomes a fixed-point
number.
The computer uses 2's complement to represent integers, so let us consider a
binary fixed-point type with f fractional bits and a total of b bits. The corresponding
number of integer bits is m = b - f - 1 (the leftmost bit is the sign bit). If the
integer value represented by the b bits is x, its fixed-point value is

    x / 2^f.                                                    (13.2)

Since the leftmost bit is the sign bit, the dynamic range of this fixed-point number is

    [ -2^{b-1} / 2^f,  (2^{b-1} - 1) / 2^f ].                   (13.3)


There are various notations for representing the word length and radix point of a
binary fixed-point type. The one most widely used in the signal processing
community is the Q notation Qm.f and its abbreviation Qf, which assumes a default
value of m = 0. For example, Q0.31 describes a type with 0 integer bits and 31
fractional bits stored as a 32-bit 2's complement integer, which may be further
abbreviated to Q31. The 24 bits per sample suggested above for representing variables
and parameters in an audio coder may be designated as Q8.23 if implemented using
a 32-bit integer (b = 32). Using a 24-bit integer, it is Q0.23, or simply Q23.
To add or subtract two fixed-point numbers, it is sufficient to add or subtract the
underlying integers:

    x/2^f \pm y/2^f = (x \pm y)/2^f,                            (13.4)

which is exactly the same as integer addition or subtraction. Both operations may
overflow, but do not underflow.
For multiplication, however, the result needs to be rescaled,

    ( x/2^f \cdot y/2^f ) \cdot 2^f = (xy/2^f) / 2^f \cdot 2^f = xy / 2^f,    (13.5)

to maintain the same fixed-point representation: the underlying integer of the
product is the integer product xy scaled by 2^{-f}. The scaling can be conveniently
done through binary shifting to the right. While the multiplication of two Qm.f
numbers with m > 0 may overflow, this is not true when m = 0, because the
multiplication of two fractional numbers remains a fractional number. Multiplication of
such numbers may underflow.
Similar to multiplication, the result from the division of two fixed-point numbers
needs to be rescaled,

    ( x/2^f \div y/2^f ) \cdot (1/2^f) = (x/y) / 2^f,           (13.6)

to maintain the same fixed-point representation: the dividend x is shifted left by f
bits before the integer division, so the scaling can be conveniently done through
binary shifting to the left. Underflow may occur in division.
When implementing fixed-point multiplication and division using integer arithmetic,
the intermediate multiplication and division results must be stored in double
precision. To maintain accuracy, the intermediate results must be properly rounded
before they are converted back to the desired fixed-point format. The C code for
implementing fixed-point multiplication is given below (a 64-bit long long holds
the double-precision intermediate result):

#define f 23                   // Number of fractional bits (Q23)
#define R (1 << (f-1))         // Rounding factor

int FixedPointMultiply(int x, int y)
{
    long long temp;            // Double precision for intermediate results
    //
    // Integer multiplication
    temp = (long long)x * (long long)y;
    //
    // Rounding
    temp = (temp > 0) ? temp + R : temp - R;
    //
    // Scaling
    return (int)(temp >> f);
}

The C code for implementing fixed-point division is given below:

int FixedPointDivide(int x, int y)     // Returns x/y in fixed point
{
    long long temp;                    // Double precision for intermediate results
    //
    // Scaling
    temp = (long long)x << f;
    //
    // Rounding (assumes y > 0)
    temp = (temp > 0) ? temp + (y >> 1) : temp - (y >> 1);
    //
    // Integer division
    return (int)(temp / y);
}

Chapter 14

Quality Evaluation

Since the removal of perceptual irrelevance through quantization contributes the
largest part of compression for a lossy audio coder, significant distortion or impairment
is inherent in the reconstructed audio signal that the decoder outputs. The whole
promise of perceptual audio coding is based on the assumption that quantization
noise can be hidden behind, or masked by, the prominent signal components in such
a way that it is inaudible.
This promise is easily delivered when the compression ratio is low. As the
compression ratio increases to a certain level, distortion or impairment begins to
become audible. The nature of the distortion and the conditions under which it becomes
audible depend on a variety of factors, including the audio coding algorithm, the
input audio signal, the rendering equipment, and the listening environment. The
threshold of inaudible impairment for a million-dollar listening room is dramatically
different from that for a portable music player.
One of the goals of audio coding algorithm development is to maximize the
compression ratio subject to the constraint that distortion is inaudible or tolerable for
the most difficult pieces of audio signals, using the best rendering equipment, and
under the optimal listening environment. This is usually approached by setting the
algorithm to work at various bit rates and then minimizing the corresponding coding
impairment. This necessitates frequent evaluation of coding impairment.
The development process of an audio coder inevitably involves many iterations
of trial-and-error to test the performance of the algorithm structure and its subsystems
as well as to optimize various parameters of the algorithm. Each of these steps
entails the evaluation of the distortion level to identify the right direction for algorithm
improvement. Even after the algorithm development is completed at the engineering
department, a formal evaluation or characterization of the algorithm is in order
before it is launched to the market.
During the course of market promotion, suspicious customers may request a
formal evaluation. Competitors may make various self-serving claims about audio
coding algorithms on the market, which may be arguably based on their internal
evaluations.
Audio coding algorithms are often adopted as industry, national, or
international standards to facilitate their mass adoption. During the standardization process,
many competing algorithms are likely to be proposed, so the standardization

committee needs to order a set of formal evaluations to help determine which
algorithm is to be adopted or how to combine good features from various algorithms
into a joint one.
All of the various situations above call for an effective and simple measure of
coding distortion or impairment. Since the ultimate judge is the human ear, the
listening test, or subjective evaluation, is the necessary and final tool for such impairment
assessment. Listening tests are, however, usually time-consuming and expensive, so
simple objective metrics may serve as intermediate tools for fast and cheap
impairment assessment, so that algorithm development can evolve on a fast track and
the performance of various algorithms can be conveniently compared.

14.1 Objective Metrics

The first requirement for an objective metric is that it correlate well with how the
human ear perceives. In addition, it must be simple enough to enable cheap and
rapid evaluation.
The first objective metric that comes to mind is the venerable signal-to-noise
ratio (SNR). When the distortion is at the same level as the masked threshold,
SNR becomes SMR, which was discussed in Sect. 10.4. Depending on the types of
masker and maskee, the SMR threshold may differ from 2 to 28 dB, as shown in
Table 10.3. This means that, for some masker-maskee pairs, an SNR as low as 2 dB
may be inaudible, but for some others, an SNR as high as 28 dB is necessary to
ensure inaudibility. This large variation of up to 26 dB means that SNR does not
correlate well with how the human ear perceives, so it is unsuitable for measuring
coding distortion.
Apparently, a psychoacoustic model must be deployed to account for how
quantization noise is masked by signal components adjacent in both time and frequency.
This has led to various algorithms that try to objectively measure perceived audio
quality or distortion, including PEAQ (Perceptual Evaluation of Audio Quality),
released as ITU-R Recommendation BS.1387 [31] in 1998. Unfortunately, such
objective measures still provide impairment values that rarely correlate well with how
the human ear perceives.

14.2 Subjective Tests

Given that neither SNR nor the current perceptual objective measures are
considered satisfactory in terms of effectiveness, the tedious and often expensive
subjective listening test has remained the only effective and final tool. Informal
listening tests are conducted by developers in various stages of algorithm
development, and formal listening tests are usually ordered as the authoritative
assessment of an audio coder's performance.


14.2.1 Double-Blind Principle

A subjective listening test usually involves comparing a coded piece of audio with the
original one. A major challenge with such a test is the bias that either the listener
or the test administrator may have toward different types of distortion, music, and
coders. This problem is often exacerbated by the subjective nature of human perception: a
listener may even perceive the same piece of music differently at different times, even
in the same environment.
A common approach to mitigating these problems is the double-blind principle
given below:
Neither the listener nor the test administrator knows whether the piece of
audio being listened to is the coded signal or the original one during
the whole test process.

14.2.2 ABX Test


A simple subjective listening test that is favored by many algorithm developers is the ABX test [7]. Under such a test, the original audio signal is usually designated as A and the coded audio signal as B. Then either A or B is randomly presented to the listener as X, thus enforcing the double-blind principle. The listener is asked to identify X as either A or B. The listener has the liberty of listening to A, B, and X at any time before he or she makes the decision. Once the decision is made, the next round of testing begins.
After this procedure is repeated many times, the rate of correct identification is calculated. A rate around 50% clearly indicates that the listener cannot reliably perceive the distortion, while a significantly biased rate indicates perceived distortion.
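To judge whether a rate is significantly biased, one may compute the probability of obtaining at least the observed number of correct answers by pure guessing. The following sketch illustrates this (the function name is hypothetical; a threshold such as 0.05 is a common but arbitrary choice):

#include <cmath>

// Probability of observing nCorrect or more correct identifications
// in nTrials ABX trials if the listener is purely guessing (p = 0.5).
// A small value (e.g., below 0.05) suggests the distortion was heard.
double AbxGuessingProbability(int nTrials, int nCorrect)
{
    double dProb = 0.0;
    for (int k = nCorrect; k <= nTrials; k++) {
        // log of the binomial coefficient C(nTrials, k)
        double dLogComb = std::lgamma(nTrials + 1.0)
                        - std::lgamma(k + 1.0)
                        - std::lgamma(nTrials - k + 1.0);
        dProb += std::exp(dLogComb + nTrials * std::log(0.5));
    }
    return dProb;
}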

14.2.3 ITU-R BS.1116


For formal evaluation of small impairments in coded audio signals, ITU-R Recommendation BS.1116 is usually the choice [30]. It uses a continuous five-grade impairment scale, the quantized version of which is shown in Table 14.1.

Table 14.1 Five-grade impairment scale used by ITU-R BS.1116

Impairment description           Grade
Imperceptible                    5.0
Perceptible, but not annoying    4.0
Slightly annoying                3.0
Annoying                         2.0
Very annoying                    1.0


The test method may be characterized as double-blind, triple-stimulus with hidden reference. The listener is presented with three stimuli: the reference signal first, then the reference and the coded signal in random order, thus enforcing the double-blind principle. The listener is asked to assess the small impairment in the latter two and to assign a grade to each of them. Since one of the latter two is the reference signal, one of the grades has to be five.
A higher grade assigned to the coded signal indicates a false identification. A rate of false identification around 50% clearly indicates that the listener cannot perceive the impairment, hence his or her grading results should be excluded from the final average grade. This procedure may exclude the majority of the listeners, leaving a group of expert listeners. The number of expert listeners in a formal test may vary between 20 and 60. If even the expert listeners cannot perceive any impairment, the audio coding algorithm is called transparent.
A training session should precede the formal testing to familiarize the listeners with the test procedure, the grading method, the environment, and the coder impairment. To give the listeners an opportunity to learn to appreciate the impairment and the grading scale, coded audio signals with easily perceived impairments may be presented to the listeners.
The listening conditions and equipment used are critical to the final grade value of the test. ITU-R BS.1116 stipulates strict requirements on the test environment, including the speaker configuration, background noise, reverberation time, etc. See [30] for details.
Since audio coders perform differently with audio signals of different characteristics, the selection of test signals or materials is also critical to the final grade value. When a comparison test is performed between competing audio coders, this issue may become highly contentious.
The EBU SQAM CD [13] contains a set of audio signals recommended by the EBU for subjective listening tests. It is a selection of signals specifically chosen to reveal the impairments of various audio systems. The castanets and glockenspiel tracks in SQAM are particularly famous for breaking many audio coders.

Chapter 15

DRA Audio Coding Standard

This chapter presents the DRA (Dynamic Resolution Adaptation) audio coding standard [97] as an example to illustrate how to integrate the technologies described in this book to create a practical audio coding algorithm. The DRA algorithm has been approved by the Blu-ray Disc Association as part of its BD-ROM 2.3 specification [99] and by the Chinese government as a national standard [100].

15.1 Design Considerations


The core competence of an audio coding algorithm is its coding efficiency, or the compression ratio that it can deliver without audible impairment. While the compression ratio given in (1.2) is a hard metric, audible impairment is a subjective metric affected by a variety of factors, including the sound material, the playback device, the listening environment, and the listeners. In addition, for many consumer applications, a certain degree of audible impairment can be tolerated and is thus allowed. Therefore, coding efficiency is eventually a subjective term that varies with the desired or tolerable level of audible impairment, which is usually determined by the application.
To deliver a high level of fidelity (a low level of impairment), an audio coding algorithm needs to maintain a signal path of high resolution in both its encoder and decoder. This usually requires fine quantization step sizes and consequently a large number of quantized values, thus demanding more bits to convey both to the decoder. Therefore, it is more difficult to deliver high coding efficiency when the fidelity requirement is high.
To achieve high coding efficiency, an audio coding algorithm needs to consider more complex situations so that it can adapt better to changing signal characteristics. It also needs to deploy more complex and sometimes redundant technologies to squeeze out a little bit of performance gain in some special situations. Therefore, high coding efficiency usually means complex algorithms.
However, high algorithmic complexity, especially on the decoder side, results in expensive devices and high power consumption and is, therefore, an impediment


Fig. 15.1 DRA encoder. The solid lines indicate the flow of audio data and the dashed lines the flow of control parameters. The dashed boxes indicate optional modules that do not have to be invoked

to the adoption of the algorithm in the marketplace. Decoder cost is a key factor affecting consumer purchase decisions, and power consumption is becoming a prominent issue in an era of mobile devices and global warming.
When the DRA audio coding algorithm was conceived, the design goal was to deliver high coding efficiency and high audio quality (if the bit rate allows) with minimal decoder complexity. To accomplish this, a principle was set that only necessary modules would be used, that each module would be implemented with minimal impact on decoder complexity, and that a signal path of 24 bits would be maintained when the bit rate allows. The result is the encoder and decoder shown in Figs. 15.1 and 15.2, respectively.

15.2 Architecture

The transient localized MDCT (TLM) discussed in Sect. 11.5 is deployed as a simple approach to optimizing the coding gain for both quasistationary and transient episodes of audio signals. The numbers of taps for the long, short and virtual windows are 2,048, 256 and 64, respectively. According to (11.16), the effective length of

Fig. 15.2 DRA decoder. The solid lines indicate the flow of audio data and the dashed lines the flow of control parameters. The dashed boxes indicate optional modules that do not have to be invoked

the resultant brief window is 160 taps, corresponding to a period of merely 3.6 ms at a sample rate of 44.1 kHz. This is well within the 5 ms range that strong premasking typically lasts.
The implementation cost of the inverse TLM is essentially the same as that of the regular switched IMDCT because both can be accomplished using the same software or hardware implementation. A pseudo C++ implementation example is given in Sect. 11.5.4. The only additional cost of the inverse TLM over the regular switched IMDCT is more complex window sequencing when in short MDCT mode. Fortunately, this window sequencing involves a very limited number of assignment and logic operations, as can be seen in the pseudo C++ code in Sect. 15.4.5.
Only the midtread uniform quantization discussed in Sect. 2.3.2 is deployed to quantize all MDCT coefficients, so the implementation code for inverse quantization is minimal, involving only an integer multiplication of the quantization index with the quantization step size (see (2.22)).
The logarithmic companding discussed in Sect. 2.4.2.2 is used to quantize the quantization step sizes with a step size of 0.2, resulting in a total of 116 quantized values that cover a dynamic range of 24 bits for the quantization indexes of MDCT coefficients. The quantized step size values are further converted into Q.23 fractional format for convenient and exact implementation on fixed-point microprocessors. The resultant quantization table is given in Table A.1.
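Although the normative step sizes are those listed in Table A.1, the structure just described suggests how such a table can be generated. The following sketch is an illustration inferred from the parameters above (116 values spaced 0.2 apart in the log2 domain, reaching about 2^23 for a 24-bit range), not the normative definition:

#include <cmath>

// Illustrative generation of a 116-entry step size table with
// logarithmic companding: entries are spaced 0.2 apart in the
// log2 domain, so index 115 reaches about 2^23, the top of a
// 24-bit dynamic range. The normative values are in Table A.1.
void GenerateStepSizeTable(unsigned int aunStepSize[116])
{
    for (int k = 0; k < 116; k++) {
        aunStepSize[k] = (unsigned int)(std::pow(2.0, 0.2 * k) + 0.5);
    }
}

Under this reading, index 57 yields a step size of about 2^11.4, or 2,702, which is consistent with the JIC scale factor bias given in Sect. 15.4.2.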
All MDCT coefficients in one quantization unit (see Sect. 13.1.2) share one quantization step size. The critical bands that define the quantization units along the frequency axis are given in various tables in Sect. A.2 for long and short windows and for a variety of sample rates.
The statistical codebook assignment discussed in Sect. 13.2.2 is used to improve the efficiency of Huffman coding with a small impact on decoding complexity, but with a potentially significant overhead due to the bits needed to transfer the codebook application scopes. This overhead can, however, be properly managed by the encoder so that the efficiency gain significantly exceeds it.
The module in the decoder that reconstructs the number of quantization units involves a simple search for the lowest critical band that accommodates the highest (in frequency) nonzero quantization index (see Sect. 15.3.3.4 for details), so its implementation cost is very low.
The optional sum/difference coding and joint intensity coding modules implement the simplest methods discussed in Chap. 12 so as to incur the least decoding cost while reaping the gains of joint channel coding. The decisions for sum/difference coding and the steering vectors for joint intensity coding are made individually for each quantization unit. The quantization of the steering vectors for joint intensity coding reuses the quantization step size table given in Table A.1 with a bias to provide for the need that the elements of these steering vectors may be larger as well as smaller than one, corresponding to the possibility that the intensities of the joint channels may be stronger or weaker than that of the source channel.
As usual, the transient detection, perceptual model and global bit allocation modules are not specified by the DRA standard and can be implemented by any method that provides data in formats suitable for DRA encoding.

15.3 Bit Stream Format


The DRA frame structure is shown in Table 15.1. It is designed to carry up to 64 normal channels and up to 3 LFE channels, or up to 64.3 surround sound. This capacity far exceeds the 7.1 surround sound enabled by Blu-ray Disc and even the Hamasaki 20.2 surround sound demonstrated by NHK Science and Technical Research Laboratories [62].
The bit stream structure for normal channels is shown in Table 15.2, which basically consists of five sections: window sequencing, codebook assignment, quantization indexes, quantization step sizes, and joint channel coding.
Since no transient should be declared in an LFE channel due to its limited bandwidth (less than 120 Hz), only the long MDCT window WL_L2L or the short MDCT window WS_S2S is allowed to be used, so there is no need to convey window sequencing information. The limited bandwidth of LFE channels also disqualifies

Table 15.1 Bit stream structure for a DRA frame

Synchronization codeword (0x7FFF)
Frame header
Audio data   Normal channel(s): from 1 up to 64
             LFE channel(s): from 0 up to 3
End of frame signature
Auxiliary data

Table 15.2 Bit stream structure for a normal channel

Window sequencing         MDCT window index
                          Number of transient segments
                          Lengths of each transient segment
Codebook assignment       Number of Huffman codebooks
                          Application scopes for all Huffman codebooks
                          Selection indexes identifying the corresponding Huffman codebooks
Quantization indexes      Huffman codes for the quantization indexes of all MDCT
                          coefficients, arranged based on the application scopes of the
                          Huffman codebooks. Quantization indexes in one application
                          scope are coded using the corresponding Huffman codebook
                          for that scope
Quantization step sizes   Huffman codes for all quantization step sizes, one for each
                          quantization unit
Joint channel coding      One sum/difference coding decision for each quantization
                          unit for each channel pair
                          One joint intensity coding scale factor for each joint
                          high-frequency quantization unit in the joint channel

Table 15.3 Bit stream structure for an LFE channel

Codebook assignment       Number of Huffman codebooks
                          Application scopes for all Huffman codebooks
                          Selection indexes identifying the corresponding Huffman codebooks
Quantization indexes      Huffman codes for the quantization indexes of all MDCT
                          coefficients, arranged based on the application scopes of the
                          Huffman codebooks. Quantization indexes in one application
                          scope are coded using the corresponding Huffman codebook
                          for that scope
Quantization step sizes   Huffman codes for all quantization step sizes, one for each
                          quantization unit

them from carrying sound imaging information, so joint channel coding does not need to be deployed. Therefore, an LFE channel has the simpler bit stream structure shown in Table 15.3.
The end of frame signature is placed right after the audio data and before the auxiliary data so that the decoder can check immediately whether the end of frame signature can be confirmed once all audio data are decoded. This enables error checking without the burden of handling the auxiliary data.

15.3.1 Frame Synchronization


A DRA frame starts with the following 16-bit synchronization codeword:
nSyncWord = 0x7FFF;

so the decoding of a DRA bit stream should start with a search for this synchronization codeword, as illustrated by the following pseudo C++ code:
SeekSyncWord()
{
while ( Unpack(16) != 0x7FFF ) {
continue;
}
}

where

Unpack(nNumBits)

is a function that unpacks nNumBits bits from the bit stream; for the code above, nNumBits = 16. Note that the seeking loop above may become an infinite loop if the input stream does not contain nSyncWord, which may cause the decoder to hang. This must be properly taken care of by aborting the loop and relaunching the SeekSyncWord() routine.
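The DRA standard specifies only the behavior of Unpack(), not its implementation. One possible realization is sketched below as a minimal big-endian bit reader; the buffer state variables are illustrative assumptions:

static const unsigned char *g_pBuf;  // Current byte in the bit stream
static int g_nBitPos;                // Bits already consumed in *g_pBuf

unsigned int Unpack(int nNumBits)
{
    unsigned int nValue = 0;
    for (int i = 0; i < nNumBits; i++) {
        // Extract the next bit, most significant bit first
        nValue = (nValue << 1) | ((*g_pBuf >> (7 - g_nBitPos)) & 1);
        if (++g_nBitPos == 8) {
            // Move on to the next byte
            g_nBitPos = 0;
            g_pBuf++;
        }
    }
    return nValue;
}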
Since nSyncWord=0x7FFF is not prohibited from appearing in the middle of a DRA frame, the SeekSyncWord() function listed above may end up with a false nSyncWord. A function that properly handles this situation should skip nNumWord (see Table 15.4) 32-bit codewords after this nSyncWord and then seek and confirm the next nSyncWord before declaring synchronization. The following pseudo C++ code illustrates this procedure:
Sync()
{
// Seek the first nSyncWord
while ( Unpack(16) != 0x7FFF ) {
continue;
}
// Unpack frame header type (0=short, 1=long)
nFrameHeaderType = Unpack(1);
// Number of 32-bit codewords in the frame
nBits4NumWord = (nFrameHeaderType==0) ? 10 : 13;
nNumWord = Unpack(nBits4NumWord);
// Skip all audio data in the frame
Unpack(nNumWord*32-nBits4NumWord-1);
// Seek and confirm the second nSyncWord
while ( Unpack(16) != 0x7FFF ) {
continue;
}
}

In case the input stream is not a valid DRA stream, the nSyncWord may never appear, causing the two search loops to hang forever. This situation must be taken care of by aborting the loops and relaunching the Sync() routine.
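One simple safeguard is to bound the search, as in the following sketch; the search budget and the bit-by-bit advance are illustrative choices, not part of the standard:

// A bounded variant of the sync search: gives up after examining
// nMaxBits bit positions so that an invalid stream cannot hang the
// decoder. The 16-bit window is advanced one bit at a time so that
// sync words at arbitrary bit offsets are also found.
bool SeekSyncWordBounded(int nMaxBits)
{
    unsigned int nWindow = Unpack(16);
    for (int n = 0; n < nMaxBits; n++) {
        if (nWindow == 0x7FFF) {
            return true;    // Synchronization codeword found
        }
        // Shift in the next bit, keeping only the lowest 16 bits
        nWindow = ((nWindow << 1) | Unpack(1)) & 0xFFFF;
    }
    return false;           // No sync word within the search budget
}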

Table 15.4 DRA frame header. If there is a slash in the column under Number of bits, such as
L/R, L is the number of bits for the short frame header and R is for the long frame header. A
value of zero means that the codeword is absent in the frame

DRA codeword       Number of bits  Explanation
nSyncWord          16              Synchronization codeword = 0x7FFF
nFrameHeaderType   1               Indicates the type of DRA frame header
nNumWord           10/13           Number of 32-bit codewords starting from the start
                                   of the frame (nSyncWord) until the end of the
                                   frame (end of frame signature)
nNumBlocksPerFrm   2               Number of short MDCT blocks of PCM audio
                                   samples encoded in the frame for each audio
                                   channel. The value in the bit stream represents
                                   the power of two (2^nNumBlocksPerFrm), so
                                   the valid values of 0, 1, 2 and 3 correspond to 1,
                                   2, 4 and 8 short MDCT blocks. Since the block
                                   length of the short MDCT is 128 samples, these
                                   values are equivalent to 128, 256, 512 and 1,024
                                   PCM samples for each channel, respectively
nSampleRateIndex   4               Sample rate index, which is used to look up
                                   Table 15.5 to get the sample rate of the audio
                                   signal carried in the frame
nNumNormalCh       3/6             nNumNormalCh+1 is the number of normal
                                   channels, so the supported number of normal
                                   channels is between 1 and 8 for short frames and
                                   between 1 and 64 for long frames. After its
                                   unpacking, it is immediately updated to reflect
                                   the value of nNumNormalCh+1 so that
                                   nNumNormalCh represents the actual number of
                                   normal channels. This rule applies to all other
                                   DRA codewords as well
nNumLfeCh          1/2             Number of LFE channels. The supported number
                                   of LFE channels is either 0 or 1 for short frames
                                   and between 0 and 3 for long frames
bAuxChCfg          1               Indicates if additional speaker configuration
                                   information is attached to the end of the frame
                                   as the first part of the auxiliary data
bUseSumDiff        1/0             Indicates whether sum/difference coding is
                                   deployed: 0, not deployed; 1, deployed.
                                   Transferred only for the short frame header and
                                   when nNumNormalCh > 1
bUseJIC            1/0             Indicates whether joint intensity coding is
                                   deployed: 0, not deployed; 1, deployed.
                                   Transferred only for the short frame header and
                                   when nNumNormalCh > 1
nJicCb             5/0             Joint intensity coding is deployed for all critical
                                   bands starting from nJicCb+1 and ending with
                                   the last active critical band. nJicCb is
                                   transferred only for the short frame header and
                                   if bUseJIC == 1


Table 15.5 Supported sample rates

Index   Sample rate (Hz)
0       8,000
1       11,025
2       12,000
3       16,000
4       22,050
5       24,000
6       32,000
7       44,100
8       48,000
9       88,200
10      96,000
11      176,400
12      192,000
13      Reserved
14      Reserved
15      Reserved

15.3.2 Frame Header

There are two types of DRA frame header: the short one is for short frames and the long one for long frames. The type of frame header is indicated by the 1-bit nFrameHeaderType, which comes right after nSyncWord:

nFrameHeaderType = 0: short frame header;
nFrameHeaderType = 1: long frame header.

The short frames are intended for lower bit rate DRA streams whose number of channels is limited to no more than 8.1 surround sound and for which joint channel coding may be optionally enabled. The long frames are intended for higher bit rate DRA streams, which may support up to 64.3 surround sound and for which joint channel coding is disallowed.
The DRA frame header is described in Table 15.4 and its unpacking from the bit stream is delineated by the following pseudo C++ code:
UnpackFrameHeader()
{
// Verify sync codeword (16 bits)
nSyncWord = Unpack(16);
if ( nSyncWord != 0x7FFF ) {
throw("Sync codeword not verified.");
}
// Type of frame header (0=short, 1=long)
nFrameHeaderType = Unpack(1);
// Number of 32-bit codewords in the frame
if ( nFrameHeaderType == 0 ) {
// 10 bits for short frame header
nNumWord = Unpack(10);
}
else {
// 13 bits for long frame header
nNumWord = Unpack(13);
}
// Number of MDCT blocks per frame
nNumBlocksPerFrm = 1<<Unpack(2);
// Sample rate index
nSampleRateIndex = Unpack(4);
// Number of normal and LFE channels
if ( nFrameHeaderType == 0 ) {
// Short frame header
nNumNormalCh = Unpack(3)+1;
nNumLfeCh = Unpack(1);
}
else {
// Long frame header
nNumNormalCh = Unpack(6)+1;
nNumLfeCh = Unpack(2);
}
// Flag indicating if speaker configuration is attached to the
// end of the frame as the first part of auxiliary data
bAuxChCfg = Unpack(1);
// Joint channel coding
if ( nFrameHeaderType == 0 ) {
// Present only for short frame headers
if ( nNumNormalCh>1 ) {
// Present only for multichannel audio
// Decision for sum/difference coding
bUseSumDiff = Unpack(1);
// Decision for joint intensity coding
bUseJIC = Unpack(1);
}
else {
// Not present if there is only one channel
bUseSumDiff = 0;
bUseJIC = 0;
}
// The start critical band that JIC is deployed
if ( bUseJIC == 1 ) {
// Present only when JIC is enabled
nJicCb = Unpack(5)+1;
}
else {
nJicCb = 0;
}
}
else {
// Joint channel coding is not deployed for long frame
// headers
bUseSumDiff = 0;
bUseJIC = 0;
nJicCb = 0;
}
}

Table 15.6 Default speaker configuration (F=front, L=left, R=right, S=surround, C=center,
B=back)

nNumNormalCh   Speaker configuration
1              F
2              FL, FR
3              FL, FR, FC
4              FL, FR, SL, SR
5              FL, FR, SL, SR, FC
6              FL, FR, SL, SR, BC, FC
7              FL, FR, SL, SR, BL, BR, FC
8              FL, FR, SL, SR, BL, BR, FC, overhead

Table 15.7 Representation of a few widely used speaker configurations using the default DRA
speaker configuration scheme

Speaker configuration   nNumNormalCh   nNumLfeCh
Mono                    1              0
Stereo                  2              0
2.1                     2              1
3.1                     3              1
5.1                     5              1
6.1                     6              1
7.1                     7              1

Table 15.8 Default placement of normal channels in a DRA frame

Ordinal number   Normal channel
0                Front left
1                Front right
2                Surround left
3                Surround right
4                Back left
5                Back right
6                Front center
7                Overhead

There is no specific information about speaker configuration in the DRA frame header. Instead, a default set of speaker configurations is used, as laid out in Table 15.6. Table 15.7 shows how a number of widely used speaker configurations are represented under this scheme.
If the default scheme is not sufficient or suitable for representing a specific speaker configuration, the bAuxChCfg flag in the frame header should be set to one. This indicates to the decoder that information for a user-defined speaker configuration is placed as the first part of the auxiliary data.

15.3.3 Audio Channels


Right after the frame header come the payloads of an audio frame, which are the bits representing the normal and then the LFE audio channels. The default placement of normal channels is shown in Table 15.8 for the currently widely used speaker
configurations. The placement of the center channel toward the tail is to facilitate joint channel coding by letting left and right channel pairs start from the first channel. If any of these channels are not present, the channels below are moved upward to take their place. Any speaker configuration beyond Table 15.8 needs to be specified in the auxiliary data by setting bAuxChCfg to one.
Inside each audio channel, data components are placed as shown in Table 15.2 for normal channels and in Table 15.3 for LFE channels. All these data components are delineated below.

15.3.3.1 Window Sequencing


Window sequencing is indicated by a trio of window index, number of transient segments, and number of short MDCT blocks in each transient segment, as shown in Table 15.9.
Each normal channel should have its own independent set of window sequencing codewords. When joint channel coding is deployed, however, all normal channels share the same set of window sequencing codewords, which are transferred in the first channel. Consequently, there are no window sequencing codewords for the other normal channels; they are copied from the first channel.
The codeword nWinIndex is the window index for the current frame, which is used to look up Table 11.8 for the corresponding window label. The interpretation of these labels is discussed in Sect. 11.5.3. Since there are only 13 entries in Table 11.8, nWinIndex>12 indicates an error.
If nWinIndex indicates a long window (0 ≤ nWinIndex ≤ 8), there is no transient in the current frame, so there is only one transient segment,

nNumSegments = 1,

Table 15.9 DRA codewords for window sequencing. The number of bits is represented as L/S,
where L is the number of bits for long MDCT windows and S is for short windows. A value of
zero means that the codeword is absent in the frame

DRA codeword          Number of bits   Explanation
nWinIndex             4/4              Window index
nNumSegments          0/2              Number of transient segments. Not transferred
                                       if a long window is used
nNumBlocksInSegment   0/varied         Number of short MDCT blocks in each transient
                                       segment. There are (nNumSegments-1) such
                                       codewords in the bit stream. The length of the
                                       last transient segment is not transferred but
                                       inferred from nNumBlocksPerFrm, the number
                                       of short MDCT blocks in a frame. Not
                                       transferred if a long window is used or
                                       nNumSegments=1

Table 15.10 Huffman codebook for encoding the number of short MDCT blocks in a transient
segment

Index   Length   Codeword
0       2        0
1       2        3
2       3        3
3       3        2
4       4        9
5       3        5
6       4        8

and it is as long as the whole frame:

nNumBlocksInSegment = nNumBlocksPerFrm.

Therefore, neither codeword is transferred.
If nWinIndex indicates a short window (9 ≤ nWinIndex ≤ 12), the codeword nNumSegment is always transferred. If nNumSegment=1, there is only one transient segment in the whole frame, whose length is again nNumBlocksPerFrm, so there is no need to waste bits transferring it.
If nNumSegment>1, only the first (nNumSegment-1) nNumBlocksInSegment codeword(s) is (are) transferred, because the length of the last transient segment can be inferred from the fact that the total number of short MDCT blocks of all transient segments in a frame is equal to nNumBlocksPerFrm, the number of short MDCT blocks in the whole frame. The nNumBlocksInSegment codewords are encoded using the Huffman codebook listed in Table 15.10.
The decoding of the codewords above is delineated by the following pseudo C++ code:
UnpackWinSequencing(pSegmentBook, Ch0)
// pSegmentBook: Pointer to the Huffman codebook for encoding
//               the number of blocks per transient segment.
// Ch0:          A reference to audio channel zero, which is the
//               source channel if joint channel coding is deployed.
{
// The left edge of the first transient segment is zero
anEdgeSegment[0] = 0;
if ( nCh==0
// Always appear for the first channel
|| (bUseJIC==false && bUseSumDiff==false)
// For other normal channels, appear only if no joint
// channel coding is deployed
)
{
// Unpack window index
nWinIndex = Unpack(4);
// Transient segments
if ( nWinIndex != ANY_LONG_WIN) {
// Transferred only for short windows

15.3 Bit Stream Format

267

// First, unpack the number of transient segments


nNumSegment = Unpack(2)+1;
// Number of short MDCT blocks for each transient segment
if ( nNumSegment >= 2 ) {
// Transferred only if there are at least two transient
// segments
// Loop through the transient segments, except for the
// last one
for (nSegment=0; nSegment<nNumSegment-1; nSegment++) {
// Decode the number of short MDCT blocks in a
// transient segment
nNumBlocksInSegment = HuffDec(pSegmentBook) + 1;
// Add it to the right edge of the last transient
// segment to get the right edge of the current
// transient segment
anEdgeSegment[nSegment+1] = anEdgeSegment[nSegment]
+ nNumBlocksInSegment;
}
// The right edge of the last transient segment is
// obviously the end of the frame
anEdgeSegment[nSegment+1] = nNumBlocksPerFrm;
}
else {
// Just one transient segment in the frame, its length
// is the same as the frame size nNumBlocksPerFrm, so
// nNumBlocksInSegment is not transferred.
anEdgeSegment[1] = nNumBlocksPerFrm;
}
}
else {
// For long windows, there is only one transient segment,
nNumSegment = 1;
// whose length is one long MDCT block
anEdgeSegment[1] = 1;
}
}
else {
// This is not the first channel and joint channel coding is
// deployed, window sequencing information for the first
// channel (Ch0) is copied to the other normal channels
nWinIndex = Ch0.nWinIndex;
nNumSegment = Ch0.nNumSegment;
for (n=0; n<nNumSegment+1; n++) {
anEdgeSegment[n] = Ch0.anEdgeSegment[n];
}
}
}

When the function above is called, it should be passed a pointer to the Huffman codebook given in Table 15.10, which is further passed on to the Huffman decoding function HuffDec().

15.3.3.2 Codebook Assignment


Since DRA deploys statistically adaptive codebook assignment, the bit stream needs to transfer both the application scope and the selection index for each Huffman codebook that is used to code quantization indexes (see Sect. 13.2.2). In other words, for each Huffman codebook that is deployed to encode a group of quantization indexes, the encoder needs to convey to the decoder the following pair of parameters:

(application scope, selection index)

where the application scope defines the boundary of the group of quantization indexes that are coded by the Huffman codebook and the selection index identifies the Huffman codebook itself.
The quantization indexes are coded based on transient segments. For a transient segment identified by nSegment, the number of Huffman codebooks used to code all of its quantization indexes is anHSNumBands[nSegment]. Each of these Huffman codebooks is identified by the following pair of application scope and codebook selection index:

(mnHSBandEdge[nSegment][nBand], mnHS[nSegment][nBand]),

where nBand goes from 0 to anHSNumBands[nSegment]-1 (see Table 15.11). This is repeated for all transient segments.
Since the statistical properties of quantization indexes are obviously different for frames with declared transients than for those without, both the application scopes and the selection indexes are coded with different sets of Huffman codebooks, as specified in Table 15.12.
Since the application scope could theoretically be as large as the whole frame, which may contain as many as 1,024 quantization indexes, recursive indexing is deployed to avoid using large Huffman codebooks (see Sect. 9.4). Therefore, the small

Table 15.11 DRA codewords used to represent Huffman codebook assignment for a transient
segment identified by nSegment

DRA codeword                    Number of bits   Explanation
anHSNumBands[nSegment]          4                Number of Huffman codebooks used to
                                                 encode the quantization indexes in the
                                                 transient segment identified by nSegment
mnHSBandEdge[nSegment][nBand]   Variable         Application scope for the nBand-th
                                                 Huffman codebook. nBand goes from 0
                                                 to anHSNumBands[nSegment]-1
mnHS[nSegment][nBand]           Variable         Selection index for the corresponding
                                                 nBand-th Huffman codebook. nBand goes
                                                 from 0 to anHSNumBands[nSegment]-1

Table 15.12 Huffman codebooks for encoding codebook assignments

                      Frame type
Codebook pointer      Quasistationary   Transient    Explanation
pCodebook4Scope       Table A.28        Table A.29   Huffman codebook for encoding
                                                     application scopes. Recursive
                                                     indexing is deployed to handle large
                                                     scope values with a small codebook
pCodebook4Selection   Table A.30        Table A.31   Huffman codebook for encoding the
                                                     corresponding selection indexes.
                                                     Difference coding is used to improve
                                                     coding efficiency

Huffman codebooks in Tables A.28 and A.29 are enough to cover any possible application scope values used by the DRA standard.
Each unit of application scope encoded in the bit stream corresponds to four quantization indexes, so the decoded value should be multiplied by four to get the application scope in terms of quantization indexes.
The difference between the codebook selection index for the nBand-th application scope and its predecessor,

nDiff = mnHS[nSegment][nBand] - mnHS[nSegment][nBand-1],

is encoded using one of the Huffman codebooks given in the last row of Table 15.12, depending on the presence or absence of transients in the frame, except for the first one in a transient segment, which is packed directly into the bit stream using four bits. Therefore, on the decoder side, the selection index is obtained as

mnHS[nSegment][nBand] = nDiff + mnHS[nSegment][nBand-1].
The unpacking of codebook assignment is illustrated by the pseudo C++ code given below. The proper Huffman codebooks specified in Table 15.12 should be passed to this function depending on the presence or absence of transients in the frame. The HuffDecRecursive() function used in the code implements the recursive procedure given in the last paragraph of Sect. 9.4.
UnpackCodebookAssignment(
pCodebook4Scope,    // Pointer to a Huffman codebook for
                    // decoding application scopes
pCodebook4Selection // Pointer to a Huffman codebook for
                    // decoding selection indexes
)
{
// The left boundary of the first application scope is
// always zero
nLeftBoundary = 0;
// Loop through transient segments to unpack application
// scopes


for (nSegment=0; nSegment<nNumSegment; nSegment++) {
// Unpack the number of application scopes for the current
// transient segment
anHSNumBands[nSegment] = Unpack(4);
// Loop through application scopes
for ( nBand=0; nBand<anHSNumBands[nSegment]; nBand++ ) {
// Unpack the length of application scope. Recursive
// indexing is used to handle values larger than the
// range that the Huffman codebook can handle.
k = HuffDecRecursive(pCodebook4Scope) + 1;
// Each unit of application scope thus decoded corresponds
// to four quantization indexes
k *= 4;
// Add it to the left boundary to get the right boundary
// of the current application scope
k += nLeftBoundary;
// Save the right boundary
mnHSBandEdge[nSegment][nBand] = k;
// It becomes the left boundary for the next application
// scope
nLeftBoundary = k;
}
}
// Loop through transient segments to unpack selection indexes
for (nSegment=0; nSegment<nNumSegment; nSegment++) {
if ( anHSNumBands[nSegment]>0 ) {
// The first selection index is transferred directly using
// four bits
nLast = Unpack(4);
mnHS[nSegment][0] = nLast;
// The rest are difference coded
for (nBand=1; nBand<anHSNumBands[nSegment]; nBand++) {
// Decode from the bit stream
k = HuffDec(pCodebook4Selection);
// Add it to the predecessor to get the current
// selection index
k += nLast;
mnHS[nSegment][nBand] = k;
// It becomes the predecessor for the next selection
// index
nLast = k;
}
}
}
}

15.3.3.3 Quantization Indexes


Two libraries of Huffman codebooks, shown in Table 15.13, are deployed to code
the quantization indexes, one for frames with declared transients and the other for

Table 15.13 Huffman codebooks for encoding quantization indexes

                                                 Codebook
Selection index   Dimension   Covered range   Signed   Quasistationary   Transient
0                 0           0               N/A      N/A               N/A
1                 4           [-1,1]          Yes      Table A.34        Table A.43
2                 2           [-2,2]          Yes      Table A.35        Table A.44
3                 2           [-4,4]          Yes      Table A.36        Table A.45
4                 2           [-8,8]          Yes      Table A.37        Table A.46
5                 1           [-15,15]        Yes      Table A.38        Table A.47
6                 1           [-31,31]        Yes      Table A.39        Table A.48
7                 1           [-63,63]        Yes      Table A.40        Table A.49
8                 1           [-127,127]      Yes      Table A.41        Table A.50
9                 1           [0,255)         No       Table A.42        Table A.51

quasistationary frames. Depending on the absence or presence of transients in the current frame, either the fifth or the sixth column of this table is copied to the anQIndexBooks[9] array in the pseudo C++ code given at the end of this section so that the pQIndexBook pointer can be pointed at the proper Huffman codebook.
When the range of quantization indexes covered by a Huffman codebook is small, block coding is deployed to enhance coding efficiency (see Sects. 8.4.3 and 9.3). The number of quantization indexes that are coded together as a block is indicated as Dimension in the table. The encoding and decoding procedures for block codes are delineated in Sect. 9.3. The quantization index range covered by each Huffman codebook is indicated by Covered range in Table 15.13.
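As an illustration of the block coding convention implied by the block-decoding loop in the pseudo C++ code later in this section, the following sketch (the function name is hypothetical) combines one block of signed quantization indexes into a single block code, with the first index of the block as the least significant digit in base nNumCodes:

// Combine nDim signed quantization indexes into one block code.
// Each index is biased by nMaxIndex into [0, nNumCodes-1] and the
// digits are packed in base nNumCodes; the resulting block code is
// then Huffman-encoded as a single symbol.
int EncodeBlockCode(const int anQIndex[], int nDim,
                    int nNumCodes, int nMaxIndex)
{
    int nB = 0;
    for (int k = nDim - 1; k >= 0; k--) {
        nB = nB * nNumCodes + (anQIndex[k] + nMaxIndex);
    }
    return nB;
}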
Most of the Huffman codebooks handle signed quantization indexes, except the last two, namely Tables A.42 and A.51, which are unsigned and cover a range of [0,255). This range is extended to (-255,255) by the transmission of a sign bit for each nonzero quantization index. The exclusion of 255 from the covered range is explained next.
To accommodate the target signal path of 24 bits, recursive indexing is extensively used to handle quantization indexes whose absolute values are beyond [0,255). In particular, the absolute value of a quantization index is decomposed using (9.23) with M = 255, where the remainder r is coded with either Table A.42 or Table A.51 as indicated in Table 15.13 and the quotients q, if any of them in the whole codebook application scope is nonzero, are packed directly into the bit stream.
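On the encoder side, the decomposition can be sketched as follows (an illustration of (9.23) with M = 255; the helper name is hypothetical):

// Decompose the absolute value of a quantization index into a
// quotient q and a remainder r with M = 255, as in (9.23). The
// decoder reverses this as |index| = q*255 + r. If q > 0, the
// codeword 255 (Escape) is sent in place of r, followed by the
// quotient and then the real remainder r.
void DecomposeQIndex(int nAbsQIndex, int &q, int &r)
{
    const int nMaxIndex = 255;   // Escape value of Tables A.42/A.51
    q = nAbsQIndex / nMaxIndex;  // Quotient: zero for small indexes
    r = nAbsQIndex % nMaxIndex;  // Remainder: always within [0,255)
}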
For a quantization index whose quotient is not zero, the residue encoded by either Table A.42 or Table A.51 must have the value of 255, which actually represents an Escape indicating that the quotient is not zero.
If no Escape appears in the whole application scope, no quotients are packed into the bit stream. Otherwise, each Escape is followed by a nonzero quotient and a real residue. The real residue is again coded by either Table A.42 or Table A.51.
Denote the number of bits used to pack the nonzero quotients as nBits4Quotient. nBits4Quotient-1 is difference-coded using the Huffman codebook indicated in Table 15.14. The valid range for nBits4Quotient-1 is between 0 and 15. Reconstructing it from the values packed in the bit stream may generate values


Table 15.14 Determination of the Huffman codebook for encoding the number of bits used to
pack the quotients generated in recursive indexing of large quantization indexes

Codebook pointer     Quasistationary frame   Transient frame
pQuotientWidthBook   Table A.32              Table A.33

out of this range. Any value out of this range must be moved back by taking the remainder against 16.
Sign bits for all nonzero quantization indexes coded above follow immediately. One bit is packed for each nonzero quantization index, and a negative sign is indicated by the value of zero.
The unpacking of all quantization indexes is delineated by the following pseudo C++ code:
UnpackQIndex(
int anQIndexBooks[9],     // An array that holds pointers to
                          // Huffman codebooks for coding
                          // quantization indexes
int *pQuotientWidthBook   // Pointer to the Huffman codebook
                          // for encoding the number of bits
                          // for packing quotients
)
{
// Reset the history value for the number of bits for packing
// quotients
nLastQuotientWidth = 0;
// Loop through all transient segments
for (nSegment=0; nSegment<nNumSegment; nSegment++) {
// Initialize the start of the application scope to the
// first quantization index in the current transient
// segment. Since each transient segment starts at the
// anEdgeSegment[nSegment]-th short MDCT block and each
// short MDCT block consists of 128 samples, we have
nStart = anEdgeSegment[nSegment]*128;
// Loop through all Huffman codebooks in the transient
// segment
for (nBand=0; nBand<anHSNumBands[nSegment]; nBand++) {
// Establish the end of the application scope by adding
// the width of the application scope to its start
nEnd = nStart + mnHSBandEdge[nSegment][nBand];
// Load the selection index for the Huffman codebook
// corresponding to the application scope
nHSelect = mnHS[nSegment][nBand];
// Consider different types of Huffman codebooks
if ( nHSelect==0 ) {
// No Huffman codebook is assigned. All quantization
// indexes are zero, so set them as such
for (nBin=nStart; nBin<nEnd; nBin++) {
anQIndex[nBin] = 0;
}
}
else {
// The pointers to the Huffman codebooks are stored in
// array anQIndexBooks[9], so this is the right way to

// load the codebook indexed by nHSelect


pQIndexBook = anQIndexBooks[nHSelect];
// This function returns with the number of the
// codewords per dimension in the codebook. The total
// number of codewords in the codebook is
// nNumCodes^nDim
nNumCodes = GetNumCodewords(pQIndexBook);
// Unpack the quantization indexes
if ( nHSelect == 9 ) {
// The Last Huffman codebook
// Recursive indexing is used with the last Huffman
// codebook, so the largest quantization index in this
// codebook is 255 and represents the Escape
nMaxIndex = nNumCodes-1; // nMaxIndex=255
// Decode the residue and count the times that
// nMaxIndex (Escape) appears.
nCtr = 0; // The counter for nMaxIndex appearance
for (nBin=nStart; nBin<nEnd; nBin++) {
// Decode the residue
nQIndex = HuffDec(pQIndexBook);
// Count the times that nMaxIndex appears
if ( nQIndex == nMaxIndex ) {
nCtr++;
}
// Save the decoded residue
anQIndex[nBin] = nQIndex;
}
// Unpack the quotient
if ( nCtr>0 ) {
// This step is necessary only if nMaxIndex has
// appeared
// Decode the number of bits used to pack the
// quotient
nBits4Quotient = HuffDec(pQuotientWidthBook);
// Add it to the history
nLastQuotientWidth += nBits4Quotient;
// If out of [0,15], take the remainder to move it
// back
nLastQuotientWidth = nLastQuotientWidth % 16;
// The number of bits used to pack the quotients is
nBits4Quotient = nLastQuotientWidth + 1;
// Loop through all quantization indexes to unpack
// the non-zero quotients and then recover the
// absolute values of the quantization indexes
for (nBin=nStart; nBin<nEnd; nBin++) {
// Reload the unpacked residue
nQIndex = anQIndex[nBin];
// If the residue is equal to nMaxIndex, the
// quotient is packed
if ( nQIndex == nMaxIndex ) {
// Unpack the quotient
q = Unpack(nBits4Quotient) + 1;
// Unpack the real residue
r = HuffDec(pQIndexBook);
// Put them together to get the real absolute
// value of the quantization index
anQIndex[nBin] = q*nMaxIndex + r;
}
}
}


// Add signs
for (nBin=nStart; nBin<nEnd; nBin++) {
// Reload the absolute quantization indexes
nQIndex = anQIndex[nBin];
//
if ( nQIndex != 0 ) {
// The sign bit presents only for non-zero values
// Unpack the sign bit
nSign = Unpack(1);
// A zero signals a negative value
if ( nSign==0 ) {
anQIndex[nBin] = - nQIndex;
}
}
}
}
else {
// For all other Huffman codebooks
// Since they are signed, the largest quantization
// index value in these codebooks is
nMaxIndex = nNumCodes/2-1;
// This function returns the dimension of the codebook
nDim = GetDim(pQIndexBook);
//
if ( nDim>1 ) {
// Block coding is deployed
for (nBin=nStart; nBin<nEnd; nBin+=nDim) {
// Unpack the block code
nB = HuffDec(pQIndexBook);
// Decode the block code
for (k=0; k<nDim; k++) {
// The quantization index is the remainder
nQIndex = nB % nNumCodes;
// Substract the bias to add the sign
anQIndex[nBin+k] = nQIndex - nMaxIndex;
// Update block code for the next quantization
// index
nB = nB / nNumCodes;
}
}
}
else {
// No block coding
for (nBin=nStart; nBin<nEnd; nBin++) {
// Decode the quantization indexes
nQIndex = HuffDec(pQIndexBook);
// Substract the bias to add the sign
anQIndex[nBin] = nQIndex - nMaxIndex;
}
}
}
}
// The end of the current application scope is the start
// of the next one
nStart = nEnd;
}
}
}

15.3.3.4 Quantization Step Sizes

There is one quantization step size for each quantization unit, which is defined along the time axis by the boundaries of transient segments and along the frequency axis by the boundaries of critical bands. The critical bands for all supported sample rates are listed in Sect. A.2.
Since audio signals tend to have lower energy at high frequencies, aggressive quantization may cause the quantization indexes for critical bands at high frequencies to become all zero. When this happens, there is no need to transfer quantization step size indexes for those critical bands. For transient segment nSegment, the highest critical band with nonzero quantization indexes is referred to as the maximal active critical band and is denoted as anMaxActCb[nSegment]. This number may be inferred from the cumulative application scopes of the Huffman codebooks that are used to encode the quantization indexes. One such method is illustrated by the following pseudo C++ code:
ReconstructMaxActCb()
{
// Loop through transient segments
for (nSegment=0; nSegment<nNumSegment; nSegment++) {
// Load the number of application scopes in the current
// transient segment
nMaxBand = anHSNumBands[nSegment];
// Get the right edge of the last application scope, which
// represents the location of the non-zero quantization
// index with the highest frequency
nMaxBin = mnHSBandEdge[nSegment][nMaxBand-1];
// Since the number of MDCT blocks in the transient segment
// is (anEdgeSegment[nSegment+1]-anEdgeSegment[nSegment]),
// the non-zero quantization index with the highest
// frequency in each MDCT block is
nMaxBin = ceil(nMaxBin/(anEdgeSegment[nSegment+1]
-anEdgeSegment[nSegment]));
// Loop through the critical bands to find the
// lowest-frequency critical band whose right boundary
// covers this quantization index with the highest frequency
nCb = 0;
while ( pnCBEdge[nCb] < nMaxBin ) {
nCb++;
}
// This is the maximal active critical band
anMaxActCb[nSegment] = nCb;
}
}

It is obvious that there are anMaxActCb[nSegment] quantization units in the transient segment indexed by nSegment, each of which has a quantization step size that needs to be conveyed to the decoder, so the number of corresponding quantization step size indexes is also anMaxActCb[nSegment]. The differences of these indexes between successive quantization units are encoded using one of the Huffman codebooks indicated in Table 15.15.


Table 15.15 Determination of Huffman codebooks for encoding the indexes of quantization
step sizes

Codebook pointer   Quasistationary frame   Transient frame
pQStepBook         Table A.52              Table A.53

Reconstructing the indexes from the difference values encoded in the bit stream may generate values out of the valid range of [0,115]. Any value out of this range must be moved back within the range by taking the remainder. The unpacking operation is delineated by the following pseudo C++ code:

UnpackQStepIndex(pQStepBook)
// pQStepBook: Huffman codebook used to encode the differences
//             between quantization step size indexes
{
// Reset the quantization step size index history
nLastQStepIndex = 0;
// Loop through the transient segments
for (nSegment=0; nSegment<nNumSegment; nSegment++) {
// Loop through the active critical bands
for (nBand=0; nBand<anMaxActCb[nSegment]; nBand++) {
// Decode the difference from the bit stream
nQStepIndex = HuffDec(pQStepBook);
// Add it to the history
nLastQStepIndex += nQStepIndex;
// Take the remainder if out of [0,115]
nLastQStepIndex = nLastQStepIndex % 116;
// Store the index
mnQStepIndex[nSegment][nBand] = nLastQStepIndex;
}
}
}

15.3.3.5 Sum/Difference Coding Decisions

Whether sum/difference coding is deployed for the current frame is indicated in the frame header by the flag bUseSumDiff (see Table 15.4). If it is deployed, the only information that sum/difference decoding needs is the decision of whether sum/difference coding is deployed for each paired quantization unit. This decision applies to the corresponding quantization units of both the sum and difference channels, so only one decision is packed into the bit stream for both channels. All these bits are packed as part of the bits for the difference channel, so the unpacking needs to be run only for the difference channel.
After a pair of left and right channels are sum/difference-encoded, the left channel is replaced with the sum channel and the right channel is replaced with the difference channel (see Sect. 12.1), so the bits representing the decisions of sum/difference coding are actually carried in the right channel. Due to the default placement scheme of normal channels shown in Table 15.8, the right channels are placed as channels 1, 3, 5, ... in the bit stream.

If joint intensity coding is not deployed (nJicCb=0, see Table 15.4), sum/difference coding may be deployed up to the larger of the maximal active critical bands of the sum and difference channels. If joint intensity coding is deployed (nJicCb>0), this number is limited by nJicCb.
Sum/difference coding may also be turned off for a whole transient segment, and this decision is conveyed to the decoder by one bit per transient segment. If it is turned on, there is one decision bit for each critical band or quantization unit in the transient segment. The unpacking of these decisions is delineated by the following pseudo C++ code:
UnpackSumDff(anMaxActCb4Sum[])
// anMaxActCb4Sum: Maximal active critical bands for the sum
//                 (left) channel
// Note:           This function is run only for the difference
//                 (right) channel
{
// Loop through the transient segments
for (nSegment=0; nSegment<nNumSegment; nSegment++) {
// Get the larger of the maximal active critical bands of
// both sum and difference channels
nMaxCb =
__max(anMaxActCb4Sum[nSegment], anMaxActCb[nSegment]);
// nMaxCb identifies the highest critical band that
// sum/difference coding is deployed only if joint intensity
// coding is not used (nJicCb=0).
if ( nJicCb>0 ) {
// Otherwise, sum/difference coding is deployed to
// critical bands before joint intensity coding, so we
// need to get the smaller of nMaxCb and nJicCb
nMaxCb = __min(nJicCb, nMaxCb);
}
// If it is deployed, unpack the relevant decisions
if ( nMaxCb>0 ) {
// Sum/Difference coding may be turned off for the whole
// transient segment, so need to unpack this decision
anSumDffAllOff[nSegment] = Unpack(1);
//
if ( anSumDffAllOff[nSegment] == 0 ) {
// Not turned off completely, unpack the decision for
// each quantization unit
for (nBand=0; nBand<nMaxCb; nBand++) {
mnSumDffOn[nSegment][nBand] = Unpack(1);
}
}
else {
// Turned off completely, so turn off the decision for
// each quantization unit
for (nBand=0; nBand<nMaxCb; nBand++) {
mnSumDffOn[nSegment][nBand] = 0;
}
}
}
}
}

15.3.3.6 Steering Vector for Joint Intensity Coding

If the frame header indicates that joint intensity coding is deployed by setting bUseJIC=1 (see Table 15.4), there is one scale factor for each quantization unit in each joint channel, starting from nJicCb until the maximal active critical band of the source channel for each transient segment (anMaxActCb[nSegment]). This applies to all normal channels except the first one, which is the source channel. The indexes of these scale factors are packed into the bit stream in exactly the same way as those for the quantization step sizes (see Sect. 15.3.3.4), as delineated by the following pseudo C++ code:
UnpackJicScaleIndex(
pQStepBook, // The same Huffman codebook used to code the
// quantization step sizes
anMaxCbCh0[] // An array that stores the maximal active
// critical bands of the source channel
// (channel 0).
)
{
// Reset the scale factor index history
nLastScaleIndex = 0;
// Loop through transient segments
for (nSegment=0; nSegment<nNumSegment; nSegment++) {
// Loop through JIC encoded quantization units
for (nBand=nJicCb; nBand<anMaxCbCh0[nSegment]; nBand++) {
// Decode the index
nScaleIndex = HuffDec(pQStepBook);
// Add it to the history
nLastScaleIndex += nScaleIndex;
// Take the remainder if out of [0,115]
nLastScaleIndex = nLastScaleIndex % 116;
// Store the index
mnQStepIndex[nSegment][nBand] = nLastScaleIndex;
}
}
}

15.3.4 Window Sequencing for LFE Channels

Since LFE channels are bandlimited to 120 Hz, it is impossible for transients to occur in such channels. Therefore, an LFE channel always consists of one transient segment whose length is nNumBlocksPerFrm, the total number of short MDCT blocks in the whole frame.
If the frame is as long as the long MDCT block, the long MDCT window (WL_L2L) is always deployed. Otherwise, nNumBlocksPerFrm short windows (WS_S2S) are always deployed to cover the samples in the frame. The setup of

window sequencing for LFE channels is, therefore, very easy and can be delineated by the following pseudo C++ code:

SetupLfeWinSequencing()
{
// The left edge of the first transient segment is zero
anEdgeSegment[0] = 0;
//
if ( nNumBlocksPerFrm == 8 ) {
// The frame size is equal to the long MDCT block size, so
// the long window is always used
nWinIndex = WL_L2L;
// It is always a quasi-stationary frame so there is only
// one transient segment
nNumSegment = 1;
// whose length is equal to one long MDCT block
anEdgeSegment[1] = 1;
}
else {
// The frame is shorter than the long MDCT block, so the
// short window is always used
nWinIndex = WS_S2S;
// It is always a quasi-stationary frame so there is only
// one transient segment
nNumSegment = 1;
// whose length is equal to the frame size
anEdgeSegment[1] = nNumBlocksPerFrm;
}
}
15.3.5 End of Frame Signature


To assist in checking for possible errors that may occur during the decoding process, the encoder packs 1s into all unused bits of the last 32-bit codeword. Upon completing decoding, the decoder should check whether the unused bits in the last 32-bit codeword are all 1s. If any of them is not, there is at least one error in the decoding process and an error handling procedure should be deployed. This is illustrated below:
CheckError()
{
if ( Unused Bits in the last codeword != all 1s ) {
ErrorHandling();
}
}

15.3.6 Auxiliary Data


Auxiliary data are optional and user-defined. They are attached to the end of the 32-bit codeword that contains the end of frame signature. Therefore, a decoder can ignore these bits without any impact on normal decoding.
However, if bAuxChCfg=1 (see Table 15.4), the frame header mandates that additional information for speaker configuration be attached as the very first entry of the auxiliary data. The specific format for speaker configuration is user-defined.

15.3.7 Unpacking the Whole Frame


Now that all components of a DRA frame listed in Tables 15.1, 15.2, and 15.3 have been delineated, it is fairly easy to put together the unpacking of a whole DRA frame as follows:
UnpackDraFrame()
{
// Frame synchronization
Sync();
// Unpack the frame header
UnpackFrameHeader();
// Unpack the normal channels
for (nCh=0; nCh<nNumNormalCh; nCh++) {
// Unpack window sequencing information
UnpackWinSequencing();
// Unpack codebook assignment
UnpackCodebookAssignment();
// Unpack quantization indexes
UnpackQIndex();
// Reconstruct the maximal active critical bands
ReconstructMaxActCb();
// Unpack indexes for quantization step sizes
UnpackQStepIndex();
// Unpack sum/difference coding decisions
if ( bUseSumDiff==true && (nCh%2)==1 ) {
// Bits for the decisions are in the right channels only,
// which are nCh = 1, 3, 5, ...
UnpackSumDff();
}
// Unpack steering vectors for joint intensity coding
if ( bUseJIC==true && nCh>0 ) {
// Only appear in the joint channels, nCh=0 is the
// source channel
UnpackJicScaleIndex();
}
}
// Unpack the LFE channels
for (nCh=nNumNormalCh; nCh<nNumNormalCh+nNumLfeCh; nCh++) {
// Set up window sequencing for LFE channels

SetupLfeWinSequencing();
// Unpack codebook assignment
UnpackCodebookAssignment();
// Unpack quantization indexes
UnpackQIndex();
// Reconstruct the maximal active critical bands
ReconstructMaxActCb();
// Unpack indexes for quantization step sizes
UnpackQStepIndex();
}
// Check for error (end of frame signature)
CheckError();
// Unpack auxiliary data
UnpackAuxiliaryData();
}

15.4 Decoding

After a DRA frame is unpacked, all components are ready for reconstructing the PCM samples of all channels in the frame. The reconstruction involves the following steps, each of which is described in this section:

- Inverse quantization
- Sum/difference decoding
- Joint intensity decoding
- De-interleaving
- Inverse MDCT

15.4.1 Inverse Quantization


Since DRA deploys only midtread uniform quantization (see Sect. 2.3.2), inverse quantization simply involves the multiplication of the quantization indexes with the corresponding quantization step size (see (2.22)). Since MDCT coefficients are quantized based on quantization units and all coefficients in one unit share one quantization step size, inverse quantization is organized around quantization units and may be delineated as follows:
InverseQuantization(
aunStepSize[116], // Quantization step size table
*pnCBEdge         // Pointer to a critical band table
)
{
// Loop through all transient segments
for (nSegment=0; nSegment<nNumSegment; nSegment++) {
// Point to the first quantization index in the current


// transient segment.
nBin0 = anEdgeSegment[nSegment]*128;
// The number of short MDCT blocks in a transient
// segment is
nNumBlocks =
anEdgeSegment[nSegment+1] - anEdgeSegment[nSegment];
// Loops through all active critical bands
for (nBand=0; nBand<anMaxActCb[nSegment]; nBand++) {
// The quantization unit starts from
nStart = nBin0 + nNumBlocks * pnCBEdge[nBand-1];
// and end with
nEnd = nBin0 + nNumBlocks * pnCBEdge[nBand];
// Load the index to the quantization step size
nQStepSelect = mnQStepIndex[nSegment][nBand];
// Look up the quantization step size table to
// get the quantization step size
nStepSize = aunStepSize[nQStepSelect];
// Loop through all quantization indexes in the
// quantization unit
for (nBin=nStart; nBin<nEnd; nBin++) {
// Inverse quantization to reconstruct MDCT coefficient
afBinReconst[nBin] = anQIndex[nBin] * nStepSize;
}
}
}
}

When the function above is called, aunStepSize should be passed a pointer to Table A.1 and pnCBEdge a pointer to a critical band table in Sect. A.2, selected based on the sample rate and window size.

15.4.2 Joint Intensity Decoding


When joint intensity coding is deployed, the source channel is always the first channel and all other normal channels are the joint ones. There is one scale factor for
each quantization unit for each of such joint channels, so the decoding simply involves multiplying the scale factor with the reconstructed MDCT coefficients in the
source channel (see (12.9)).
The DRA standard reuses the quantization step size table (Table A.1) for JIC
scale factors with a bias of

    rJicScaleBias = 1 / aunStepSize[57] = 1 / 2702

so that JIC scale factors can be either larger or smaller than one.
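To see the effect of the bias, consider the effective scale factor produced by
a given step-size index. The following is a minimal sketch (the helper name is
hypothetical, not from the standard text) that simply combines a Table A.1
entry with the bias:

#include <cstdint>

// Hypothetical helper: effective JIC scale factor for a quantization
// unit whose step-size index is nQStepSelect, given Table A.1.
double JicScaleFactor(const std::uint32_t aunStepSize[116], int nQStepSelect)
{
    const double rJicScaleBias = 1.0 / 2702.0;  // = 1/aunStepSize[57]
    return aunStepSize[nQStepSelect] * rJicScaleBias;
}

An index of 57 yields a scale factor of exactly one; larger indexes yield
factors above one and smaller indexes factors below one, which is what joint
intensity coding requires.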

Joint intensity decoding may be delineated by the following pseudo C++ code:
DecJIC(
    srcCh,             // A reference to the source channel object
                       // (channel 0)
    aunStepSize[116],  // Quantization step size table
    *pnCBEdge          // Pointer to a critical band table
)
{
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Point to the first quantization index in the current
        // transient segment
        nBin0 = anEdgeSegment[nSegment]*128;
        // The number of short MDCT blocks in a transient
        // segment is
        nNumBlocks =
            anEdgeSegment[nSegment+1] - anEdgeSegment[nSegment];
        // Loop through the active critical bands that are JIC coded
        for (nBand=nJicCb; nBand<srcCh.anMaxActCb[nSegment]; nBand++) {
            // The quantization unit starts at
            nStart = nBin0 + nNumBlocks * pnCBEdge[nBand-1];
            // and ends at
            nEnd = nBin0 + nNumBlocks * pnCBEdge[nBand];
            // Load the index to the scale factor
            nQStepSelect = mnQStepIndex[nSegment][nBand];
            // Look up the step size table for the JIC scale factor
            rStepSize = aunStepSize[nQStepSelect];
            // Loop through all quantization indexes in the
            // quantization unit
            for (nBin=nStart; nBin<nEnd; nBin++) {
                // The reconstructed MDCT coefficient in the source
                // channel is multiplied by the scale factor and then
                // by the bias to generate the MDCT coefficient in the
                // joint channel. The order of multiplication is
                // critical to prevent overflow in integer
                // implementations.
                afBinReconst[nBin] =
                    (rStepSize*srcCh.afBinReconst[nBin]) * rJicScaleBias;
            }
        }
    }
}

When the function above is called, aunStepSize should be passed a pointer to
Table A.1 and pnCBEdge a pointer to a critical band table in Sect. A.2 selected
according to the sample rate and window size.

15.4.3 Sum/Difference Decoding


Sum/difference decoding simply involves the proper application of (12.3) and
(12.4) to the corresponding pair of reconstructed MDCT coefficients. The data
structure for sum/difference coding is described in Sect. 15.3.3.5. The
following pseudo C++ code describes the complete decoding steps:
DecSumDff(
    LeftCh,    // A reference to the sum (left) channel object.
               // Note: this function is run only for the
               // difference (right) channels.
    *pnCBEdge  // Pointer to a critical band table
)
{
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Point to the first quantization index in the current
        // transient segment
        nBin0 = anEdgeSegment[nSegment]*128;
        // The number of short MDCT blocks in a transient
        // segment is
        nNumBlocks =
            anEdgeSegment[nSegment+1] - anEdgeSegment[nSegment];
        // Get the larger of the maximal active critical bands of
        // the sum and difference channels
        nMaxCb =
            __max(LeftCh.anMaxActCb[nSegment], anMaxActCb[nSegment]);
        // nMaxCb identifies the highest critical band to which
        // sum/difference coding is deployed only if joint intensity
        // coding is not used (nJicCb=0).
        if ( nJicCb>0 ) {
            // Otherwise, sum/difference coding is deployed to
            // critical bands before joint intensity coding, so we
            // need to take the smaller of nMaxCb and nJicCb
            nMaxCb = __min(nJicCb, nMaxCb);
        }
        // Loop through the active critical bands that are
        // sum/difference coded
        for (nBand=0; nBand<nMaxCb; nBand++) {
            if ( mnSumDffOn[nSegment][nBand] == 1 ) {
                // Sum/difference coding is turned on.
                // The quantization unit starts at
                nStart = nBin0 + nNumBlocks * pnCBEdge[nBand-1];
                // and ends at
                nEnd = nBin0 + nNumBlocks * pnCBEdge[nBand];
                // Loop through all quantization indexes
                for (nBin=nStart; nBin<nEnd; nBin++) {
                    // Left channel: sum plus difference
                    rL = LeftCh.afBinReconst[nBin] + afBinReconst[nBin];
                    // Right channel: sum minus difference
                    rR = LeftCh.afBinReconst[nBin] - afBinReconst[nBin];
                    // Store the decoded left and right channels
                    LeftCh.afBinReconst[nBin] = rL;
                    afBinReconst[nBin] = rR;
                }
            }
        }
    }
}
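As a quick worked example (illustrative numbers only): if a bin of the sum
channel holds 5 and the corresponding bin of the difference channel holds 2,
the decoded bins are L = 5 + 2 = 7 and R = 5 - 2 = 3, consistent with the
encoder having transmitted the half-sum and half-difference of the original
pair.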

15.4.4 De-Interleaving
For a frame in which the long MDCT is deployed, the 1,024 MDCT coefficients are
naturally placed from low frequency to high frequency.
For a frame in which the short MDCT is deployed, the MDCT coefficients are
said to be naturally placed if the 128 coefficients from the first short MDCT
block are placed from low frequency to high frequency, followed by the 128
coefficients from the second block, the third block, and so on until the last
block of the frame. Under this placement scheme, MDCT coefficients
corresponding to the same frequency in succeeding MDCT blocks are 128
coefficients apart.
To facilitate quantization and entropy coding, however, the DRA encoder
interleaves these coefficients within each transient segment based on
frequency: the MDCT coefficients of all MDCT blocks within the transient
segment that correspond to the same frequency are placed next to each other in
their respective time order, and these groups are then laid out from low
frequency to high frequency. This interleaving operation is applied to each
transient segment in the frame. Under this placement scheme, MDCT coefficients
from the same MDCT block with successive frequencies are nNumBlocks
coefficients apart within each transient segment, where nNumBlocks is the
number of short MDCT blocks in the transient segment.
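As a worked illustration of the index mapping (not from the standard text):
within a transient segment that starts at coefficient nBin0 and contains
nNumBlocks short blocks, coefficient k of block b sits at the natural index
p = nBin0 + 128*b + k, but at the interleaved index q = nBin0 + nNumBlocks*k + b.
With nNumBlocks = 4, for instance, the first four interleaved coefficients are
coefficient 0 of blocks 0, 1, 2, and 3.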
When it is time to perform the inverse short MDCT, this interleaving must be
reversed. The following pseudo C++ code illustrates how this is accomplished:
DeInterleave()
{
    // Index to the naturally ordered coefficient
    p = 0;
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Point to the first coefficient in the current transient
        // segment
        nBin0 = anEdgeSegment[nSegment]*128;
        // The number of short MDCT blocks in a transient segment is
        nNumBlocks =
            anEdgeSegment[nSegment+1] - anEdgeSegment[nSegment];
        // Loop through all blocks in the transient segment
        for (nBlock=0; nBlock<nNumBlocks; nBlock++) {
            // Index to the interleaved coefficient is set to point to
            // the first coefficient in each short MDCT block
            q = nBin0;
            // Loop through all coefficients in one short MDCT block
            for (n=0; n<128; n++) {
                // Copy the coefficient
                afBinNatural[p] = afBinInterleaved[q];
                // The next interleaved coefficient is nNumBlocks apart
                q += nNumBlocks;
                // Increment the index for the natural order
                p++;
            }
            // Move on to the first interleaved coefficient in the
            // next short MDCT block
            nBin0++;
        }
    }
}

15.4.5 Window Sequencing


As discussed in Sect. 11.5.3, the window index unpacked from the bit stream
points to the actual window for the long MDCT, so it can be used directly to
perform the inverse long MDCT. For frames processed with the short MDCT, the
window index unpacked from the bit stream encodes the presence or absence of a
transient in the first block of the current and subsequent frames, as shown in
Table 11.7. The particular window that should be applied to each short MDCT
block must be determined using the method outlined in Sect. 11.5.2.2, which is
further delineated by the following pseudo C++ code:
SetupShortWinType(
int nWinLast,     // Window index for the last MDCT block of
                  // the preceding frame
int nWinIndex,    // Window index unpacked from the bit stream
int anWinShort[]  // An array that holds the window index for
                  // each MDCT block of the current frame. It
                  // holds the result of this function.
)
{
int nSegment, nBlock;
// First MDCT block
if ( nWinIndex==WS_S2S || nWinIndex==WS_S2B ) {
// The first half of nWinIndex unpacked from the bit stream
// indicates that there is no transient in the first MDCT
// block, so set its window index to WS_S2S
anWinShort[0] = WS_S2S;
// Make sure that the first half of this window matches the
// second half of the last window in the preceding frame
switch ( nWinLast ) {
case WS_B2B:
// The second half of the last window is "brief", so
// the first half must be changed to "brief" also
anWinShort[0] = WS_B2S;
break;
case WL_L2S:
case WL_S2S:
case WL_B2S:
case WS_S2S:
case WS_B2S:
// The second half of the last window is "short",
// which matches the first half of the current one, so

// nothing needs to be done


break;
default:
// Any other window shape for the second half of the last
// window is disallowed, so flag error
throw("The second half of last window in the preceding
frame must be of either the short or brief window
shape");
}
}
else {
// The first half of nWinIndex indicates that there is a
// transient in the first MDCT block, so set its window
// index to WS_B2B
anWinShort[0] = WS_B2B;
// The first half of this window must match the second half
// of the last window in the preceding frame
switch ( nWinLast ) {
case WS_B2B:
case WS_S2B:
case WL_L2B:
case WL_B2B:
case WL_S2B:
// The second half of the last window is "brief",
// which matches the first half of the current one, so
// nothing needs to be done
break;
default:
// Any other window shape for the second half of the last
// window is disallowed, so flag error
throw("The second half of last window in the preceding
frame must be of the brief window shape");
}
}
// Initialize windows for the other MDCT blocks to WS_S2S
for (nBlock=1; nBlock<nNumBlocksPerFrm; nBlock++) {
anWinShort[nBlock] = WS_S2S;
}
// Place WS_B2B at the start of each transient segment, which
// is where the transient occurs, excluding the start of the
// first segment which is the start of the frame.
for (nSegment=1; nSegment<nNumSegment; nSegment++) {
// Load transient location
nBlock = anEdgeSegment[nSegment];
// Set the window to a WS_B2B
anWinShort[nBlock] = WS_B2B;
}
// The first half of the second window must match the second
// half of the first window
if ( anWinShort[0] == WS_B2B ) {
// The second half of the first window is "brief";
if ( anWinShort[1]==WS_S2S ) {
// if the first half of the second window is "short", it
// must be set to "brief" to match
anWinShort[1] = WS_B2S;
}
}
// Set window halves next to a WS_B2B window to "brief"
for (nSegment=1; nSegment<nNumSegment; nSegment++) {
// Load transient location that is where WS_B2B is placed
nBlock = anEdgeSegment[nSegment];
// Left side
if ( anWinShort[nBlock-1] == WS_S2S ) {
// If it is WS_S2S, change its second half to "brief"
anWinShort[nBlock-1] = WS_S2B;
}
if ( anWinShort[nBlock-1] == WS_B2S ) {
// If it is WS_B2S, change its second half to "brief"
anWinShort[nBlock-1] = WS_B2B;
}
// Right side
if ( anWinShort[nBlock+1] == WS_S2S ) {
// If it is WS_S2S, change its first half to "brief"
anWinShort[nBlock+1] = WS_B2S;
}
}
// Last block in the frame.
nBlock = nNumBlocksPerFrm-1;
// Consider different window types
switch ( anWinShort[nBlock] ) {
case WS_B2B:
// If it is WS_B2B, the second half of the preceding window
// needs to be changed to "brief":
// WS_S2S to WS_S2B
if ( anWinShort[nBlock-1] == WS_S2S ) {
anWinShort[nBlock-1] = WS_S2B;
}
// WS_B2S to WS_B2B
if ( anWinShort[nBlock-1] == WS_B2S ) {
anWinShort[nBlock-1] = WS_B2B;
}
break;
case WS_S2S:
// If it is WS_S2S,
if ( nWinIndex==WS_S2B || nWinIndex==WS_B2B ) {
// and the second half of nWinIndex indicates that there
// is a transient in the first block of the next frame,
// its second half needs to be changed to "brief":
anWinShort[nBlock] = WS_S2B;
}
break;
case WS_B2S:
// If it is WS_B2S,
if ( nWinIndex==WS_S2B || nWinIndex==WS_B2B ) {
// and the second half of nWinIndex indicates that there
// is a transient in the first block of the next frame,
// its second half needs to be changed to "brief":
anWinShort[nBlock] = WS_B2B;

}
break;
default:
// The only other case, which is WS_S2B, should not happen
// by itself, so flag error
throw("WS_S2B should not happen by itself.");
}
}
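To make the procedure concrete, here is a worked trace (illustrative numbers
only) for a frame of eight short blocks with a transient segment boundary at
block 3 (anEdgeSegment = {0, 3, 8}), nWinLast = WS_S2S and nWinIndex = WS_S2S:
the first block becomes WS_S2S, blocks 1 through 7 are initialized to WS_S2S,
block 3 is set to WS_B2B, its left neighbor (block 2) becomes WS_S2B, its
right neighbor (block 4) becomes WS_B2S, and the last block stays WS_S2S
because nWinIndex signals no transient in the first block of the next frame.
The resulting window sequence is WS_S2S, WS_S2S, WS_S2B, WS_B2B, WS_B2S,
WS_S2S, WS_S2S, WS_S2S.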

15.4.6 Inverse TLM


The implementation of the inverse TLM was discussed in Sect. 11.5.4; it
essentially uses the same procedure as that given in Sect. 11.3.3.
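For orientation only, the following is a minimal sketch of a generic inverse
MDCT with 50% overlap-add, the operation at the heart of the inverse TLM. It
is not the DRA-specific procedure of Sects. 11.3.3 and 11.5.4; window tables,
block switching, and scaling conventions differ between implementations:

#include <cmath>
#include <cstddef>
#include <vector>

// Generic inverse MDCT of one block with windowing and overlap-add.
// X holds N spectral coefficients; win is the 2N-point synthesis
// window; overlap carries N samples saved from the previous block.
void InverseMdctBlock(const std::vector<double>& X,
                      const std::vector<double>& win,
                      std::vector<double>& overlap,
                      std::vector<double>& out)
{
    const std::size_t N = X.size();
    const double pi = 3.14159265358979323846;
    std::vector<double> y(2 * N);
    for (std::size_t n = 0; n < 2 * N; n++) {
        double acc = 0.0;
        for (std::size_t k = 0; k < N; k++)
            acc += X[k] * std::cos(pi / N * (n + 0.5 + N / 2.0) * (k + 0.5));
        y[n] = (2.0 / N) * acc * win[n];  // windowed, time-aliased block
    }
    out.resize(N);
    for (std::size_t n = 0; n < N; n++) {
        out[n] = overlap[n] + y[n];  // overlap-add with the previous block
        overlap[n] = y[N + n];       // save the second half for the next block
    }
}

In practice the double loop would be replaced by an FFT-based fast algorithm;
in the frame decoder below, the SwitchMDCTI() call performs the DRA-specific
inverse TLM for the long or short MDCT as appropriate.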

15.4.7 Decoding the Whole Frame


Now that all components for decoding a DRA frame have been described, the
following pseudo C++ code illustrates how to put them together to decode a
complete DRA frame:
DecodeDraFrame()
{
    // Unpack the whole DRA frame
    UnpackDraFrame();
    // Inverse quantization for both normal and LFE channels
    for (nCh=0; nCh<nNumNormalCh+nNumLfeCh; nCh++) {
        m_pCh[nCh].InverseQuantization();
    }
    // JIC decoding
    if ( bUseJIC==true ) {
        // nCh=0 is the source channel and the other normal channels
        // are the joint channels
        for (nCh=1; nCh<nNumNormalCh; nCh++) {
            m_pCh[nCh].DecJIC();
        }
    }
    // Sum/difference decoding
    if ( bUseSumDiff==true ) {
        // The decisions are placed in the difference (right)
        // channels, which are indexed by nCh=1,3,5,...
        for (nCh=1; nCh<nNumNormalCh; nCh+=2) {
            m_pCh[nCh].DecSumDff();
        }
    }
    // Inverse TLM for all channels
    for (nCh=0; nCh<nNumNormalCh+nNumLfeCh; nCh++) {
        // De-interleaving if the short MDCT is deployed
        m_pCh[nCh].DeInterleave();
        // Inverse TLM
        m_pCh[nCh].SwitchMDCTI();
    }
}

15.5 Formal Listening Tests


During its standardization process, the DRA audio coding algorithm went through
five rounds of ITU-R BS.1116 [30] compliant subjective listening tests (see
Sect. 14.2.3). The test grades are shown in Table 15.16.
To stress the algorithm, only the constant inter-frame bit allocation scheme in
(13.1) was allowed in all tests: no frame could use more than the constant
number of bits determined by the formula, and unused bits in all frames were
discarded, so no bit reservoir was possible.
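For a sense of scale (an illustration assuming (13.1) simply allots the bit
rate times the frame duration): a stereo stream at 128 kbps with 1,024-sample
frames at 44.1 kHz would receive about 128,000 x 1,024/44,100, or roughly
2,972 bits per frame, with no carry-over between frames.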
The first test was conducted in August 2004 by the National Testing and
Inspection Center for Radio and TV Products (NTICRT) under the Ministry of
Information Industry, China. Ten stereo sound tracks selected mostly from the
SQAM CD [14] and five 5.1 surround sound tracks were used in the test. The test
subjects were all expert listeners, consisting of conductors, musicians,
recording engineers, and audio engineers.
The other four tests were performed by the State Lab for DTV System Testing
(SLDST) under the State Administration for Radio, Film and TV, China. Other
than a few Chinese sound tracks, most of the test materials were selected from
the SQAM CD and a group of surround sound tracks used by EBU and MPEG,
including Pitch pipe, Harpsichord, and Elloit1. Some of the most famous Chinese
conductors, musicians, and recording engineers were among the test subjects,
but senior and graduate students from the School of Recording Arts,
Communication University of China, and the Department of Sound Recording,
Beijing Film Academy, were found to possess better acuity owing to their age
and training, so they became the majority in later tests.
The last test, while conducted by SLDST, was actually ordered and supervised by
China Central TV (CCTV) as part of its DTV standard evaluation program.

Table 15.16 Grades for ITU-R BS.1116 compliant subjective listening tests

Lab      Date      Stereo @ 128 kbps   5.1 @ 320 kbps   5.1 @ 384 kbps
NTICRT   08/2004   4.6                 -                4.7
SLDST    10/2004   4.2                 -                4.0
SLDST    01/2005   4.1                 -                4.2
SLDST    07/2005   4.7                 -                4.5
SLDST    08/2006   -                   4.6              4.9

CCTV was only interested in surround sound, so DRA was tested at 384 kbps as
well as at 320 kbps. This test was conducted in comparison with two major
international codecs, and DRA came out as the clear winner.
The test results from SLDST are more consistent because the same group of
engineers performed the tests in the same lab and the listener pool was largely
unchanged. The later the test, the more difficult it became, because the test
administrators and many listeners had learned the characteristics of
DRA-processed audio and knew where to look for distortions. The gradually
increasing test grades therefore reflect continuous improvement of the encoder,
especially of its perceptual model and transient detection unit.

Appendix A

Large Tables

A.1 Quantization Step Size

Table A.1 Quantization step sizes
(Each row gives the step sizes for indexes Index + 0 through Index + 4.)

Index   +0        +1        +2        +3        +4
0       1         1         1         2         2
5       2         2         3         3         3
10      4         5         5         6         7
15      8         9         11        12        14
20      16        18        21        24        28
25      32        37        42        49        56
30      64        74        84        97        111
35      128       147       169       194       223
40      256       294       338       388       446
45      512       588       676       776       891
50      1024      1176      1351      1552      1783
55      2048      2353      2702      3104      3566
60      4096      4705      5405      6208      7132
65      8192      9410      10809     12417     14263
70      16384     18820     21619     24834     28526
75      32768     37641     43238     49667     57052
80      65536     75281     86475     99334     114105
85      131072    150562    172951    198668    228210
90      262144    301124    345901    397336    456419
95      524288    602249    691802    794672    912838
100     1048576   1204498   1383604   1589344   1825677
105     2097152   2408995   2767209   3178688   3651354
110     4194304   4817990   5534417   6357376   7302707
115     8388607   -         -         -         -
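A useful observation (not stated in the standard text, but easily checked
against the table): the step sizes approximately follow 2^(Index/5), i.e., they
double every five indexes. For example, aunStepSize[57] = 2702 is close to
2^(57/5), which is the entry from which the JIC scale bias of Sect. 15.4.2 is
derived, and aunStepSize[115] = 8388607 = 2^23 - 1, the largest magnitude
representable with 24-bit PCM.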


A.2 Critical Bands for Short and Long MDCT

Table A.2 Critical bands for long MDCT at 8000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               26                      101.563
1               53                      207.031
2               80                      312.5
3               108                     421.875
4               138                     539.063
5               170                     664.063
6               204                     796.875
7               241                     941.406
8               282                     1101.56
9               328                     1281.25
10              381                     1488.28
11              443                     1730.47
12              517                     2019.53
13              606                     2367.19
14              715                     2792.97
15              848                     3312.5
16              1010                    3945.31
17              1024                    4000

Table A.3 Critical bands for short MDCT at 8000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               4                       125
1               8                       250
2               12                      375
3               16                      500
4               20                      625
5               25                      781.25
6               30                      937.5
7               36                      1125
8               42                      1312.5
9               49                      1531.25
10              57                      1781.25
11              67                      2093.75
12              79                      2468.75
13              94                      2937.5
14              112                     3500
15              128                     4000

Table A.4 Critical bands for long MDCT at 11025 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               19                      102.283
1               39                      209.949
2               59                      317.615
3               80                      430.664
4               102                     549.097
5               125                     672.913
6               150                     807.495
7               177                     952.844
8               207                     1114.34
9               241                     1297.38
10              280                     1507.32
11              326                     1754.96
12              380                     2045.65
13              446                     2400.95
14              526                     2831.62
15              625                     3364.56
16              745                     4010.56
17              888                     4780.37
18              1024                    5512.5

Table A.5 Critical bands for short MDCT at 11025 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               4                       172.266
1               8                       344.531
2               12                      516.797
3               16                      689.063
4               20                      861.328
5               24                      1033.59
6               28                      1205.86
7               33                      1421.19
8               39                      1679.59
9               46                      1981.05
10              54                      2325.59
11              64                      2756.25
12              76                      3273.05
13              91                      3919.04
14              109                     4694.24
15              128                     5512.5


Table A.6 Critical bands for long MDCT at 12000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
18
105:469
1
36
210:938
2
54
316:406
3
73
427:734
4
93
544:922
5
114
667:969
6
137
802:734
7
162
949:219
8
190
1113:28
9
221
1294:92
10
257
1505:86
11
299
1751:95
12
349
2044:92
13
409
2396:48
14
483
2830:08
15
574
3363:28
16
684
4007:81
17
815
4775:39
18
968
5671:88
19
1024
6000

Table A.8 Critical bands for long MDCT at 16000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
13
101:563
1
27
210:938
2
41
320:313
3
55
429:688
4
70
546:875
5
86
671:875
6
103
804:688
7
122
953:125
8
143
1117:19
9
167
1304:69
10
194
1515:63
11
226
1765:63
12
264
2062:5
13
310
2421:88
14
366
2859:38
15
435
3398:44
16
519
4054:69
17
618
4828:13
18
734
5734:38
19
870
6796:88
20
1024
8000

Table A.7 Critical bands for short MDCT at 12000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
187:5
1
8
375
2
12
562:5
3
16
750
4
20
937:5
5
24
1125
6
28
1312:5
7
33
1546:88
8
39
1828:13
9
46
2156:25
10
55
2578:13
11
66
3093:75
12
79
3703:13
13
95
4453:13
14
113
5296:88
15
128
6000

Table A.9 Critical bands for short MDCT at 16000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
250
1
8
500
2
12
750
3
16
1000
4
20
1250
5
24
1500
6
28
1750
7
33
2062:5
8
39
2437:5
9
47
2937:5
10
56
3500
11
67
4187:5
12
80
5000
13
95
5937:5
14
113
7062:5
15
128
8000

Table A.10 Critical bands for long MDCT at 22050 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
10
107:666
1
20
215:332
2
30
322:998
3
41
441:431
4
52
559:863
5
64
689:063
6
77
829:028
7
91
979:761
8
107
1152:03
9
125
1345:83
10
146
1571:92
11
170
1830:32
12
199
2142:55
13
234
2519:38
14
277
2982:35
15
330
3552:98
16
394
4242:04
17
469
5049:54
18
557
5997
19
661
7116:72
20
790
8505:62
21
968
10422:1
22
1024
11025

Table A.12 Critical bands for long MDCT at 24000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
9
105:469
1
18
210:938
2
27
316:406
3
37
433:594
4
47
550:781
5
58
679:688
6
70
820:313
7
83
972:656
8
97
1136:72
9
113
1324:22
10
132
1546:88
11
154
1804:69
12
180
2109:38
13
212
2484:38
14
251
2941:41
15
299
3503:91
16
357
4183:59
17
425
4980:47
18
505
5917:97
19
599
7019:53
20
716
8390:63
21
875
10253:9
22
1024
12000

Table A.11 Critical bands for short MDCT at 22050 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               4                       344.531
1               8                       689.063
2               12                      1033.59
3               16                      1378.13
4               20                      1722.66
5               24                      2067.19
6               29                      2497.85
7               35                      3014.65
8               42                      3617.58
9               51                      4392.77
10              61                      5254.1
11              73                      6287.7
12              87                      7493.55
13              105                     9043.95
14              128                     11025

Table A.13 Critical bands for short MDCT at 24000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               4                       375
1               8                       750
2               12                      1125
3               16                      1500
4               20                      1875
5               24                      2250
6               29                      2718.75
7               35                      3281.25
8               42                      3937.5
9               51                      4781.25
10              61                      5718.75
11              73                      6843.75
12              87                      8156.25
13              106                     9937.5
14              128                     12000


Table A.14 Critical bands for long MDCT at 32000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
7
109:375
1
14
218:75
2
21
328:125
3
28
437:5
4
36
562:5
5
44
687:5
6
53
828:125
7
63
984:375
8
74
1156:25
9
86
1343:75
10
100
1562:5
11
117
1828:13
12
137
2140:63
13
161
2515:63
14
191
2984:38
15
227
3546:88
16
271
4234:38
17
323
5046:88
18
384
6000
19
456
7125
20
545
8515:63
21
668
10437:5
22
868
13562:5
23
1024
16000

Table A.16 Critical bands for long MDCT at 44100 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
5
107:666
1
10
215:332
2
15
322:998
3
21
452:197
4
27
581:396
5
33
710:596
6
40
861:328
7
47
1012:06
8
55
1184:33
9
64
1378:13
10
75
1614:99
11
88
1894:92
12
103
2217:92
13
122
2627:05
14
145
3122:31
15
173
3725:24
16
207
4457:37
17
247
5318:7
18
293
6309:23
19
348
7493:55
20
418
9000:88
21
519
11175:7
22
695
14965:6
23
1024
22050

Table A.15 Critical bands for short MDCT at 32000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
500
1
8
1000
2
12
1500
3
16
2000
4
20
2500
5
24
3000
6
29
3625
7
35
4375
8
42
5250
9
50
6250
10
60
7500
11
73
9125
12
91
11375
13
123
15375
14
128
16000

Table A.17 Critical bands for short MDCT at 44100 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
689:063
1
8
1378:13
2
12
2067:19
3
16
2756:25
4
20
3445:31
5
24
4134:38
6
29
4995:7
7
35
6029:3
8
42
7235:16
9
51
8785:55
10
63
10852:7
11
84
14470:3
12
128
22050

Table A.18 Critical bands for long MDCT at 48000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
5
117:188
1
10
234:375
2
15
351:563
3
20
468:75
4
25
585:938
5
31
726:563
6
37
867:188
7
44
1031:25
8
52
1218:75
9
61
1429:69
10
71
1664:06
11
83
1945:31
12
98
2296:88
13
116
2718:75
14
138
3234:38
15
165
3867:19
16
197
4617:19
17
235
5507:81
18
279
6539:06
19
332
7781:25
20
401
9398:44
21
503
11789:1
22
692
16218:8
23
1024
24000

Table A.20 Critical bands for long MDCT at 88200 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
172:266
1
8
344:531
2
12
516:797
3
16
689:063
4
20
861:328
5
24
1033:59
6
28
1205:86
7
33
1421:19
8
39
1679:59
9
46
1981:05
10
54
2325:59
11
64
2756:25
12
76
3273:05
13
91
3919:04
14
109
4694:24
15
130
5598:63
16
155
6675:29
17
185
7967:29
18
224
9646:88
19
283
12187:8
20
397
17097:4
21
799
34410:1
22
1024
44100

Table A.19 Critical bands for short MDCT at 48000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               4                       750
1               8                       1500
2               12                      2250
3               16                      3000
4               20                      3750
5               24                      4500
6               29                      5437.5
7               35                      6562.5
8               42                      7875
9               51                      9562.5
10              65                      12187.5
11              92                      17250
12              128                     24000

Table A.21 Critical bands for short MDCT at 88200 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               4                       1378.13
1               8                       2756.25
2               12                      4134.38
3               16                      5512.5
4               20                      6890.63
5               24                      8268.75
6               30                      10335.9
7               39                      13436.7
8               59                      20327.3
9               128                     44100


Table A.22 Critical bands for long MDCT at 96000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
187:5
1
8
375
2
12
562:5
3
16
750
4
20
937:5
5
24
1125
6
28
1312:5
7
33
1546:88
8
39
1828:13
9
46
2156:25
10
55
2578:13
11
66
3093:75
12
79
3703:13
13
95
4453:13
14
113
5296:88
15
134
6281:25
16
160
7500
17
193
9046:88
18
240
11250
19
323
15140:6
20
546
25593:8
21
1024
48000

Table A.24 Critical bands for long MDCT at 174600 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
341:016
1
8
682:031
2
12
1023:05
3
16
1364:06
4
20
1705:08
5
24
2046:09
6
29
2472:36
7
35
2983:89
8
42
3580:66
9
51
4347:95
10
61
5200:49
11
73
6223:54
12
87
7417:09
13
105
8951:66
14
130
11083
15
174
14834:2
16
288
24553:1
17
1024
87300

Table A.23 Critical bands for short MDCT at 96000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
1500
1
8
3000
2
12
4500
3
16
6000
4
20
7500
5
25
9375
6
32
12000
7
45
16875
8
89
33375
9
128
48000

Table A.25 Critical bands for short MDCT at 174600 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               4                       2728.13
1               8                       5456.25
2               12                      8184.38
3               16                      10912.5
4               22                      15004.7
5               37                      25235.2
6               128                     87300

Table A.26 Critical bands for long MDCT at 192000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0
4
375
1
8
750
2
12
1125
3
16
1500
4
20
1875
5
24
2250
6
29
2718:75
7
35
3281:25
8
42
3937:5
9
51
4781:25
10
61
5718:75
11
73
6843:75
12
87
8156:25
13
106
9937:5
14
136
12750
15
197
18468:8
16
465
43593:8
17
1024
96000

Table A.27 Critical bands for short MDCT at 192000 Hz sample rate
Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
0               4                       3000
1               8                       6000
2               12                      9000
3               16                      12000
4               23                      17250
5               47                      35250
6               128                     96000


A.3 Huffman Codebooks for Codebook Assignment


Table A.28 Huffman codebook for application scopes for quasi-stationary frames
Index   Length   Codeword
0
3
7
1
3
0
2
3
4
3
3
5
4
4
6
5
4
4
6
4
3
7
4
D
8
5
F
9
5
A
10
5
4
11
5
18
12
6
17
13
6
B
14
6
33
15
7
3A
16
7
38
17
7
15
18
8
77
19
8
73
20
8
5B
21
8
5A
22
8
29
23
8
C9
24
8
CA
25
9
ED
26
9
B1
27
9
B0
28
9
50
29
9
191

Table A.28 (continued)


30
9
196
31
10
1C 9
32
10
165
33
10
164
34
10
1CB
35
10
166
36
10
167
37
10
321
38
11
391
39
11
394
40
11
3B2
41
11
3B3
42
10
320
43
11
145
44
10
A3
45
11
144
46
11
65F
47
12
760
48
13
EC 6
49
12
761
50
13
EC 7
51
13
EC 4
52
12
72A
53
12
72B
54
12
720
55
12
CBC
56
13
197B
57
14
1C 85
58
13
EC 5
59
15
3909
60
15
3908
61
13
197A
62
13
E43
63
10
32E

Table A.29 Huffman codebook for application scopes for frames with detected transients
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

4
3
3
3
4
4
4
4
4
4
5
5
6
6
6
6
6
7
7
7
8
8
9
8
10
9
9
11
10
10
11
9

5
1
0
6
7
E
F
A
9
8
9
17
1B
19
18
10
2C
5A
34
5B
6A
6B
8D
45
11D
89
8C
223
110
11C
222
8F

Table A.30 Huffman codebook for selection indexes for quasi-stationary frames
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

11
10
8
6
5
5
5
3
2
2
3
4
5
6
6
7
9
11

B0
59
17
A
3
6
7
5
1
3
4
0
4
B
4
A
2D
B1

Table A.31 Huffman codebook for selection indexes for frames with detected transients
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

12
12
12
9
7
5
4
3
1
3
4
5
5
6
8
11
11
12

CA
CB
C8
18
7
6
1
3
1
2
2
7
0
2
D
67
66
C9


A.4 Huffman Codebooks for Quotient Width of Quantization Indexes
Table A.32 For quasi-stationary frames
Index   Length   Codeword
0
2
2
1
3
6
2
3
0
3
3
1
4
3
2
5
4
7
6
5
1F
7
9
C9
8
10
190
9
10
191
10
8
65
11
7
33
12
6
18
13
5
1E
14
5
D
15
4
E

Table A.33 For frames with transients
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
15

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0


A.5 Huffman Codebooks for Quantization Indexes in Quasi-Stationary Frames
Table A.34 Codebook 1^4
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

10
9
10
9
7
8
10
9
10
9
7
9
7
5
7
8
7
8
10
8
9
8
7
8
10
8
10
8
7
8
7
4
7
8
7
8
7
4
7
4

17F
D5
3E0
D6
29
DC
3E1
D7
3E6
96
7E
BD
2B
F
23
F7
2D
DE
3E5
F5
1B6
F0
34
D0
12B
DF
129
D9
26
DA
20
9
2C
F6
30
D3
31
A
37
E

Table A.34 (continued)


40
2
41
4
42
7
43
4
44
7
45
8
46
7
47
8
48
7
49
5
50
7
51
9
52
7
53
8
54
10
55
8
56
10
57
9
58
7
59
8
60
9
61
8
62
10
63
8
64
7
65
8
66
7
67
4
68
7
69
8
70
7
71
9
72
10
73
8
74
10
75
8
76
7
77
9
78
10
79
8
80
10

0
C
36
B
27
D4
32
F2
2A
E
22
BE
28
D7
3E3
D1
17E
D4
33
F1
1B7
F4
3E7
D2
2E
F3
7F
8
24
D8
7D
BC
128
D6
3E2
DD
21
97
3E4
D5
12A



Table A.35 Codebook 2^2
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

7
6
5
6
7
6
5
4
5
6
5
3
2
4
5
6
4
4
5
6
7
6
5
6
7

59
3
17
B
5A
9
8
7
9
A
2
4
3
5
3
D
A
6
7
8
5B
C
0
2
58

Table A.36 Codebook 4^2
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

11
10
9
9
8
9
9
10
11
10
9
8
7
6
7
8
9

31E
17D
15E
EF
41
FA
81
1D9
3CB
17C
ED
74
21
28
34
7C
F0

Table A.36 (continued)


17
10
1E4
18
9
15F
19
8
6C
20
7
30
21
6
3C
22
5
13
23
6
A
24
7
38
25
8
75
26
9
BF
27
9
F3
28
7
17
29
6
3D
30
5
4
31
4
E
32
5
9
33
6
11
34
7
32
35
8
AD
36
8
62
37
6
2A
38
5
11
39
4
B
40
3
0
41
4
C
42
5
10
43
6
29
44
8
2D
45
8
AE
46
7
35
47
6
14
48
5
7
49
4
D
50
5
6
51
6
3E
52
7
2A
53
9
EE
54
9
C6
55
8
7B
56
7
39
57
6
16
58
5
12
59
6
3F
60
7
2E
61
8
6E
62
9
59
63
10
1D8

Table A.36 (continued)
64
9
F1
65
8
7A
66
7
33
67
7
3F
68
7
2B
69
8
6D
70
9
DF
71
10
1BC
72
11
31F
73
10
1BD
74
9
80
75
8
AC
76
8
5E
77
9
FB
78
9
58
79
10
18E
80
11
3CA
Table A.37 Codebook 8^2
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

13
13
13
12
12
11
11
10
10
11
11
11
12
12
12
13
14
13
12
12
12
11
10
10
10
9

62D
A4
BCA
53
50
6E0
753
307
3A8
328
2F 5
7B1
7B7
51
CEE
19DB
33B4
E0B
CEC
F 61
552
77D
210
368
1F
13C

Table A.37 (continued)


26
10
195
27
10
15
28
10
211
29
11
756
30
11
604
31
12
5E8
32
12
FAC
33
13
F 0B
34
12
9D9
35
11
46C
36
11
549
37
10
269
38
11
3D8
39
10
11
40
10
1C C
41
9
1E0
42
9
E7
43
9
5
44
9
19A
45
10
1B5
46
10
3BC
47
11
3D9
48
11
6B4
49
12
5E9
50
12
9D8
51
12
6D0
52
11
13
53
11
386
54
10
369
55
9
109
56
9
3A
57
8
AA
58
8
F7
59
7
44
60
8
6
61
8
9F
62
9
79
63
9
149
64
10
37E
65
11
383
66
11
757
67
12
317
68
12
782
69
11
3DA
70
10
37F
71
9
148



Table A.37 (continued)
72
9
180
73
9
BE
74
9
F7
75
8
1
76
7
4C
77
8
FE
78
8
D8
79
9
C8
80
9
138
81
9
135
82
10
35B
83
11
EF
84
11
6FA
85
11
3C
86
11
7D7
87
9
10B
88
9
61
89
9
63
90
8
E0
91
7
43
92
7
61
93
7
31
94
7
73
95
8
7A
96
8
F1
97
9
CE
98
9
DB
99
9
14B
100
11
3C 3
101
11
67A
102
11
20
103
10
3BF
104
9
153
105
8
A7
106
8
DE
107
7
47
108
7
D
109
6
32
110
6
E
111
6
2E
112
7
F
113
7
41
114
8
D7
115
8
D4
116
9
1AC
117
10
17

Table A.37 (continued)


118
11
182
119
10
2A5
120
10
C0
121
9
E
122
8
3
123
8
1C
124
7
74
125
6
2C
126
6
1D
127
5
2
128
6
1F
129
6
34
130
7
76
131
8
32
132
8
54
133
9
32
134
10
67
135
10
237
136
10
8
137
9
118
138
8
8B
139
8
71
140
8
72
141
7
2E
142
6
16
143
5
4
144
4
4
145
5
5
146
6
14
147
7
30
148
8
6C
149
8
6F
150
9
C9
151
9
19F
152
10
37C
153
10
303
154
9
13D
155
9
78
156
8
FB
157
8
33
158
7
7A
159
6
31
160
5
12
161
5
1
162
6
1A
163
6
2B

Table A.37 (continued)
164
7
72
165
8
EB
166
8
FF
167
9
1F 4
168
10
66
169
11
3C 4
170
11
2A8
171
10
12
172
9
10A
173
8
D9
174
8
CC
175
7
51
176
7
1F
177
6
2F
178
6
D
179
6
2D
180
7
7E
181
8
6E
182
8
A6
183
8
9B
184
9
182
185
10
306
186
11
12
187
11
EE
188
11
329
189
9
139
190
9
7A
191
9
BF
192
8
EE
193
7
50
194
7
7C
195
7
2B
196
7
71
197
8
79
198
8
DB
199
9
CF
200
9
1E4
201
10
178
202
10
234
203
11
67B
204
11
4ED
205
11
183
206
10
13
207
9
19B
208
9
1B5
209
9
CB
210
8
AB

Table A.37 (continued)


211
8
0
212
7
40
213
8
F3
214
8
A8
215
9
7B
216
9
19C
217
9
14A
218
10
3D9
219
11
369
220
11
678
221
12
7B6
222
11
679
223
11
380
224
10
76
225
9
13A
226
9
AB
227
8
E1
228
8
18
229
8
66
230
8
DD
231
8
D5
232
9
1E5
233
10
1CD
234
10
3BD
235
11
387
236
12
783
237
12
5E6
238
12
F 60
239
12
5E7
240
11
6B5
241
11
381
242
10
33A
243
10
155
244
9
1B9
245
9
1ED
246
8
8A
247
9
1E1
248
10
1C 2
249
10
1E3
250
10
371
251
11
21
252
11
6E1
253
12
EA4
254
14
1C15
255
13
F 0A
256
12
CEF
257
12
553



Table A.37
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288

(continued)
11
46D
11
18A
10
235
10
16
10
17B
9
119
10
C4
10
3EA
11
77C
11
6FB
11
4D0
12
6D1
13
A5
14
1C14
13
62C
13
BCB
12
FAD
12
5E4
12
784
11
605
11
3C 5
11
3C 0
10
3AA
10
277
11
3D
11
548
12
704
11
4D1
13
1D4A
13
1D4B
14
33B5

Table A.38 Codebook 15^1
Index   Length   Codeword
0
8
21
1
8
22
2
7
49
3
7
4F
4
7
2F
5
6
1
6
6
9
7
6
26
8
6
6
9
5
10

Table A.38
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

(continued)
5
2
5
5
4
B
4
4
3
6
3
3
3
7
4
3
4
A
5
A
5
1
5
11
6
7
6
25
6
16
6
0
7
2E
7
4E
7
48
8
23
8
20

Table A.39 Codebook 31^1
Index   Length   Codeword
0
10
BA
1
10
1EF
2
9
1B8
3
9
20
4
9
23
5
9
DA
6
9
79
7
9
F6
8
9
FE
9
8
DD
10
8
26
11
8
3E
12
7
72
13
7
6F
14
7
12
15
7
1B
16
8
6C
17
8
7A
18
7
73
19
7
16

Table A.39 (continued)
20
7
18
21
7
3E
22
7
3C
23
6
36
24
6
A
25
6
E
26
5
1A
27
5
3
28
5
C
29
4
0
30
3
5
31
3
2
32
3
4
33
4
F
34
5
E
35
5
1D
36
5
18
37
6
1A
38
6
8
39
6
38
40
6
32
41
7
37
42
7
19
43
7
9
44
7
A
45
8
7E
46
8
3D
47
7
1A
48
7
B
49
7
66
50
7
67
51
8
3F
52
8
27
53
8
2F
54
9
FF
55
9
DB
56
9
5C
57
9
78
58
9
22
59
9
1B9
60
10
1EE
61
9
21
62
10
BB

Table A.40 Codebook 63^1
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41

10
11
10
10
10
10
10
10
10
10
9
10
9
10
10
9
9
9
9
9
9
9
9
9
9
8
8
8
8
8
8
7
9
9
9
8
9
8
9
9
8
8

1
F8
1E2
59
1CF
62
7D
24
2AA
1C 0
1B5
1E3
130
1C1
1D6
15A
15B
152
1B6
1C 6
1A
1B4
2D
1
32
9E
9A
E1
F7
17
3
40
1B7
1C 9
33
9F
EA
DE
38
F0
FC
A8

A.5 Huffman Codebooks for Quantization Indexes in Quasi-Stationary Frames


Table A.40
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84

(continued)
8
AB
8
C
8
74
8
A
8
1D
7
45
8
71
7
6E
7
6C
7
7F
7
7
6
21
7
3B
7
3F
6
3E
6
4
5
12
5
1A
5
C
4
C
4
5
3
1
4
4
4
B
5
D
5
1D
5
14
6
1
6
3C
7
3D
6
23
7
3E
7
D
7
73
7
7A
7
57
8
79
7
44
8
72
7
41
8
14
8
FD
8
E2

Table A.40 (continued)


85
8
1
86
8
9C
87
8
AC
88
9
4
89
9
39
90
9
E1
91
9
1B
92
9
153
93
9
2A
94
9
1C 0
95
8
E5
96
8
1E
97
8
B
98
8
8
99
8
DF
100
8
9B
101
9
3F
102
8
9D
103
9
2B
104
9
5
105
9
1C 7
106
9
E6
107
9
1C 8
108
9
30
109
9
1EC
110
9
131
111
10
1CE
112
9
132
113
9
133
114
10
0
115
9
1ED
116
10
25
117
10
27
118
10
1D7
119
10
58
120
10
2AB
121
10
63
122
9
154
123
10
382
124
10
383
125
11
F9
126
10
26

Table A.41 Codebook 127^1
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

11
13
10
11
11
10
11
10
10
11
11
10
10
10
11
10
10
10
10
11
11
10
10
10
10
9
11
10
10
10
10
10
10
10
11
10
10
9
11
10
9
11
10
10
10
9

5DA
B8E
2E2
5DB
5D8
38
5D9
39
3E
74D
2B6
3A7
3F
158
76
159
2E3
2E0
2E1
3D4
2B7
3C
3D
32
3A4
177
2B4
33
15E
1EB
3A5
174
3BA
43
77
15F
15C
121
3D5
10A
1EA
3EA
15D
10B
3BB
142

Table A.41 (continued)


46
9
143
47
9
1D0
48
9
13
49
10
1E8
50
9
174
51
10
1E9
52
10
152
53
9
175
54
10
3B8
55
9
140
56
9
1D1
57
9
1D6
58
9
16A
59
9
94
60
9
26
61
9
27
62
9
1EB
63
9
BB
64
11
74
65
11
2B5
66
10
3B9
67
10
30
68
10
31
69
10
108
70
10
36
71
10
175
72
11
75
73
10
2E6
74
9
24
75
9
1D7
76
9
141
77
10
1EE
78
10
1A2
79
9
146
80
10
3BE
81
10
37
82
10
109
83
10
1A3
84
9
95
85
9
1E8
86
10
153
87
9
1D4
88
9
16B
89
9
168
90
9
1D5
91
10
1A0
92
9
B2



Table A.41
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139

(continued)
9
E8
9
1E2
9
B9
9
10
9
25
9
11
9
F8
9
16
9
9A
8
92
8
E0
9
FC
8
E4
8
EC
8
F2
8
48
8
1E
8
44
8
5F
8
11
8
7F
8
75
7
5F
7
20
7
E
7
37
7
35
6
25
6
6
6
14
5
15
5
1F
4
8
4
2
3
6
4
3
5
C
5
0
5
13
6
1C
6
5
6
2C
7
3B
7
36
7
23
7
2D
7
52

Table A.41 (continued)


140
7
71
141
7
5E
142
8
79
143
8
4B
144
8
4F
145
8
E7
146
8
F3
147
8
F7
148
8
E5
149
8
E1
150
8
ED
151
9
9B
152
8
93
153
8
A7
154
9
FD
155
9
F9
156
9
1E9
157
9
17
158
9
14
159
9
98
160
9
99
161
9
3E
162
9
92
163
9
15
164
9
93
165
10
1A1
166
9
147
167
9
169
168
10
150
169
9
1E3
170
9
16E
171
10
1EF
172
10
2E7
173
9
16F
174
10
1EC
175
10
151
176
10
1ED
177
10
1E2
178
10
1A6
179
10
156
180
9
144
181
10
2E4
182
13
B8F
183
10
1E3
184
9
16C
185
11
4A
186
10
2E5

Table A.41 (continued)
187
10
157
188
11
4B
189
10
3BF
190
10
10E
191
9
B3
192
9
9C
193
9
20
194
9
3F
195
9
1E0
196
9
B0
197
10
154
198
9
1C C
199
10
1A7
200
9
145
201
9
122
202
10
1A4
203
9
16D
204
9
B1
205
9
8A
206
9
1E1
207
9
1EC
208
10
1A5
209
10
17A
210
9
123
211
11
3EB
212
10
1E0
213
10
155
214
10
10F
215
10
10C
216
10
13A
217
10
34
218
11
3E8
219
10
1E1
220
10
13B

Table A.41 (continued)


221
10
35
222
11
3E9
223
10
29A
224
10
3DA
225
10
10D
226
10
170
227
10
29B
228
10
298
229
11
3EE
230
11
3EF
231
10
3BC
232
11
48
233
11
3EC
234
10
3DB
235
10
17B
236
9
120
237
10
299
238
10
178
239
10
179
240
10
3BD
241
11
3ED
242
10
39A
243
11
22E
244
11
22F
245
11
49
246
11
74C
247
11
2E2
248
10
116
249
10
1D2
250
11
3A6
251
12
5C 6
252
10
42
253
10
39B
254
11
3A7



Table A.42 Codebook 255^1
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

5
4
5
5
5
5
5
5
6
6
6
6
6
6
6
6
6
6
6
7
7
7
7
7
7
7
7
7
7
7
7
8
8
7
8
8
8
8
8
8
8
8
8
8
8
8

2
A
D
7
4
0
16
11
17
18
13
15
B
2
32
25
34
2E
20
3E
3B
22
15
14
1B
E
6E
5F
66
61
4E
7E
71
49
66
64
5A
43
49
4B
1E
19
35
D8
CF
DB

Table A.42 (continued)


46
8
F
47
8
D6
48
8
9F
49
8
D
50
8
BD
51
9
F7
52
8
86
53
8
91
54
8
DF
55
9
EB
56
9
F4
57
9
E8
58
9
CB
59
9
85
60
9
E9
61
9
60
62
9
B7
63
9
A5
64
9
A0
65
9
3E
66
9
E0
67
9
E4
68
9
E7
69
9
B0
70
9
68
71
9
69
72
9
8D
73
9
61
74
9
8C
75
9
62
76
9
65
77
9
36
78
9
91
79
9
94
80
9
19D
81
9
180
82
9
8E
83
9
34
84
9
121
85
9
18C
86
10
1E5
87
9
134
88
9
135
89
10
1E7
90
9
30
91
9
1AA
92
10
1EC

Table A.42 (continued)
93
9
1BC
94
9
189
95
9
18D
96
10
166
97
10
194
98
9
182
99
9
1AB
100
9
183
101
10
167
102
10
164
103
9
18A
104
10
1D5
105
9
1AE
106
10
1ED
107
10
1FF
108
9
13D
109
10
162
110
10
120
111
10
19E
112
9
132
113
10
148
114
10
149
115
10
1E0
116
10
121
117
10
165
118
10
1EA
119
10
12A
120
10
105
121
10
2F2
122
10
19F
123
10
C8
124
10
19C
125
10
145
126
10
1E2
127
10
108
128
10
19D
129
9
133
130
10
1CA
131
9
108
132
10
14E
133
9
13C
134
10
1CB
135
9
10E
136
10
C9
137
9
109
138
10
1C2
139
10
1D4

Table A.42
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186

(continued)
10
14F
10
102
10
14C
10
1E1
9
10F
10
14D
10
142
10
103
9
10A
10
16C
10
1EB
10
CE
10
1E3
10
1E6
10
109
10
33
10
62
10
146
10
100
10
6F
10
317
10
143
10
101
10
7E
10
106
10
147
10
338
10
107
10
CF
10
3A
10
339
10
7F
10
366
10
CC
10
C6
10
367
11
3FA
10
37B
10
3B
10
1FE
11
3FB
10
240
10
63
10
CD
10
364
10
6A
10
352



Table A.42
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221

(continued)
10
261
10
26E
10
26F
10
104
10
30
10
31
11
32B
10
365
10
353
10
350
10
31E
11
39A
10
2F3
10
11E
10
310
10
241
11
3C8
11
3C9
10
6B
11
2C6
10
6E
10
36A
10
32
10
36B
10
26C
10
311
10
26D
11
3F8
10
216
10
368
11
39B
10
217
10
369
11
2C7
11
23F

Table A.42 (continued)


222
11
398
223
11
288
224
10
35E
225
10
2F0
226
10
31F
227
10
262
228
10
351
229
10
263
230
10
31C
231
11
2DA
232
11
6F5
233
10
2F1
234
11
3F9
235
11
399
236
11
289
237
11
256
238
10
38
239
10
31D
240
10
302
241
10
39
242
11
2DB
243
11
18F
244
10
260
245
11
32A
246
10
35F
247
10
303
248
11
23E
249
11
257
250
10
316
251
11
18E
252
11
6F4
253
11
386
254
11
387
255
3
7


A.6 Huffman Codebooks for Quantization Indexes in Frames with Transients

Table A.43 Codebook 1^4
Index   Length   Codeword
0
7
6D
1
8
1D
2
10
363
3
9
B
4
7
F
5
9
1A5
6
11
74
7
9
1E
8
11
75
9
9
184
10
7
67
11
9
185
12
7
6F
13
5
7
14
7
8
15
8
DD
16
7
6A
17
9
1B8
18
10
10
19
9
9
20
11
6C 4
21
9
19A
22
7
0
23
9
12
24
11
171
25
9
39
26
9
19B
27
8
2B
28
7
A
29
8
D3
30
7
11
31
4
A
32
7
12
33
9
5D
34
7
5
35
9
13
36
6
31
37
4
8
38
7
6
39
4
F

Table A.43 (continued)


40
2
1
41
4
E
42
7
B
43
4
B
44
6
32
45
9
1C
46
7
D
47
9
5E
48
7
3
49
4
9
50
7
C
51
9
54
52
7
1
53
7
60
54
9
1A4
55
8
C3
56
10
11
57
9
1B9
58
7
13
59
9
55
60
10
3B
61
9
1B0
62
11
170
63
9
1F
64
7
10
65
9
1B2
66
7
16
67
5
6
68
7
14
69
9
1B3
70
7
68
71
8
CC
72
11
172
73
9
A
74
11
6C 5
75
9
5F
76
7
9
77
9
38
78
11
173
79
8
8
80
7
6B



Table A.44 Codebook 2^2
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

8
6
5
6
8
6
5
4
5
6
5
3
2
3
5
6
5
4
5
6
8
6
6
6
8

13
22
12
20
11
39
1
3
1F
21
1E
5
1
6
1D
27
0
2
3
38
10
23
5
26
12

Table A.45 Codebook 4^2
Index   Length   Codeword
0
10
71
1
10
E8
2
9
16D
3
9
166
4
8
B7
5
9
39
6
11
340
7
11
341
8
12
686
9
11
342
10
9
FF
11
9
FA
12
7
5A
13
7
35
14
7
49
15
8
38

Table A.45 (continued)


16
10
1A4
17
11
3FA
18
9
140
19
9
E4
20
7
51
21
7
32
22
6
18
23
7
36
24
8
71
25
8
A1
26
10
2D8
27
8
A4
28
7
53
29
6
20
30
5
6
31
4
2
32
5
15
33
7
33
34
7
A
35
9
E5
36
8
7C
37
6
6
38
6
1E
39
4
5
40
2
3
41
4
4
42
5
11
43
6
4
44
8
70
45
9
75
46
7
F
47
7
3A
48
5
17
49
4
0
50
5
13
51
6
21
52
7
B
53
8
3B
54
10
1CF
55
9
FB
56
8
39
57
7
3B
58
6
F
59
7
37
60
7
48
61
8
B2
62
10
E9

Table A.45 (continued)
63
10
2D9
64
9
141
65
9
E6
66
7
58
67
6
25
68
8
7E
69
8
A5
70
9
D1
71
10
70
72
12
687
73
11
34A
74
10
1F C
75
9
D3
76
8
1D
77
9
167
78
11
3FB
79
10
1CE
80
11
34B
Table A.46 Codebook 8^2
Index   Length   Codeword
0
12
413
1
13
824
2
13
825
3
13
83A
4
11
203
5
11
7BE
6
12
410
7
10
3DC
8
9
112
9
10
2A0
10
10
3DD
11
10
2A1
12
13
83B
13
12
411
14
13
838
15
12
416
16
13
839
17
12
417
18
11
7BF
19
12
414
20
12
415
21
11
7BC
22
12
46A
23
10
3D2
24
10
3D3
25
8
A0

Table A.46 (continued)


26
9
151
27
11
200
28
11
201
29
10
3D0
30
10
2A6
31
12
46B
32
13
83E
33
13
83F
34
11
7BD
35
13
83C
36
10
3D1
37
12
468
38
11
206
39
10
3D6
40
10
11F
41
9
113
42
8
16
43
9
1E1
44
10
55
45
10
7E
46
11
7B2
47
10
2A7
48
12
469
49
11
7B3
50
11
7B0
51
13
83D
52
11
7B1
53
10
2A4
54
9
110
55
10
7F
56
9
C
57
9
1E6
58
9
85
59
8
23
60
8
AD
61
9
172
62
10
11C
63
10
11D
64
11
7B6
65
13
832
66
12
46E
67
12
46F
68
12
46C
69
11
207
70
11
204
71
10
112
72
10
3D7



Table A.46 (continued)
73
10
2A5
74
10
2BA
75
8
A1
76
8
17
77
9
3C
78
9
1E7
79
9
D
80
10
3D4
81
10
113
82
10
3D5
83
10
3EA
84
10
3EB
85
11
7B7
86
10
2BB
87
10
110
88
10
7C
89
9
111
90
10
111
91
8
D0
92
8
24
93
7
6E
94
8
29
95
8
10
96
9
116
97
10
3E8
98
9
117
99
11
205
100
10
3E9
101
11
7B4
102
11
21A
103
10
116
104
10
7D
105
8
D8
106
9
173
107
8
B5
108
8
2A
109
7
26
110
6
0
111
7
7
112
7
5B
113
9
1A
114
10
72
115
9
3D
116
10
2B8
117
9
114
118
10
2B9
119
9
156
120
10
73

Table A.46 (continued)


121
9
7A
122
8
FD
123
8
F8
124
8
25
125
6
2F
126
6
B
127
5
B
128
5
13
129
6
2C
130
7
52
131
8
FE
132
8
FF
133
9
115
134
9
1B
135
10
13E
136
9
1AA
137
9
1E4
138
8
11
139
8
4E
140
7
42
141
7
1A
142
6
E
143
4
E
144
3
3
145
4
C
146
6
C
147
7
9
148
8
37
149
7
6B
150
8
4
151
9
95
152
10
3EE
153
10
117
154
11
21B
155
9
42
156
9
96
157
8
7
158
7
5D
159
7
13
160
5
12
161
5
A
162
6
6
163
6
20
164
7
6F
165
9
97
166
8
2B
167
9
9E
168
9
157

Table A.46 (continued)
169
10
70
170
10
3EF
171
10
13F
172
9
18
173
8
86
174
8
8E
175
8
D9
176
7
51
177
7
1F
178
6
2
179
7
24
180
8
28
181
8
D1
182
9
7B
183
8
F9
184
10
114
185
10
3EC
186
11
218
187
12
46D
188
11
219
189
10
115
190
10
DA
191
9
154
192
10
DB
193
8
14
194
8
B8
195
7
69
196
8
D4
197
8
DA
198
9
78
199
9
43
200
9
19
201
9
155
202
10
71
203
11
21E
204
11
7B5
205
11
21F
206
11
78A
207
10
2BE
208
10
76
209
10
3ED
210
9
A
211
8
A6
212
8
A7
213
9
40
214
9
1E5
215
10
D8

Table A.46 (continued)


216
9
1B6
217
9
168
218
11
78B
219
11
788
220
11
789
221
13
833
222
13
830
223
12
462
224
10
77
225
11
21C
226
10
74
227
9
79
228
9
41
229
7
46
230
8
87
231
8
8F
232
10
75
233
10
D9
234
10
2BF
235
10
2BC
236
11
21D
237
12
463
238
12
460
239
11
78E
240
13
831
241
11
212
242
10
2BD
243
11
78F
244
11
78C
245
9
1B7
246
8
22
247
9
94
248
11
213
249
10
356
250
10
2B2
251
10
357
252
12
461
253
10
2B3
254
13
836
255
12
466
256
11
78D
257
12
467
258
13
837
259
12
464
260
11
782
261
10
2B0
262
10
2B1



Table A.46 (continued)
263
8
FC
264
11
783
265
10
2D2
266
11
780
267
10
56
268
13
834
269
12
465
270
13
835
271
13
80A
272
13
80B
273
12
47A
274
13
808
275
12
47B
276
11
210
277
11
211
278
12
478
279
12
479
280
9
B
281
10
57
282
11
781
283
11
5A6
284
13
809
285
12
152
286
11
5A7
287
11
A8
288
12
153


Table A.47 Codebook 15^1
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

9
9
8
7
7
7
6
7
6
6
6
5
5
4
3
2
3
4
4
5
6
6
6
7
6
6
7
7
7
8
8

7E
7F
3E
65
12
1E
27
19
24
8
D
1E
5
D
5
1
0
E
8
18
E
3F
26
18
3E
25
13
66
64
CE
CF

Table A.48 Codebook 31^1
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44

10
10
9
10
9
9
8
8
8
9
9
9
7
8
8
7
8
7
7
7
7
7
7
7
6
6
5
5
5
4
3
3
3
4
5
5
5
6
6
6
7
6
7
7
7

3BC
170
4F
3BD
42
64
20
CE
CF
77
65
B9
64
5D
3A
24
C
62
65
72
18
2F
2C
2D
2
13
1B
0
A
1
5
3
4
F
8
5
1A
F
D
38
25
30
1C
73
7

Table A.48 (continued)


45
7
75
46
8
26
47
7
11
48
7
66
49
7
12
50
7
76
51
7
63
52
8
33
53
8
E8
54
8
D
55
9
43
56
8
E9
57
10
EC
58
9
4E
59
9
1DF
60
8
EE
61
10
171
62
10
ED
Table A.49 Codebook 63^1
Index   Length   Codeword
0
10
1A8
1
10
1A9
2
10
341
3
10
346
4
11
680
5
9
1A1
6
10
347
7
10
1AE
8
10
1AF
9
11
681
10
10
344
11
9
1A6
12
12
48E
13
10
1AC
14
10
1AD
15
9
1A7
16
10
1A2
17
10
345
18
9
3E
19
9
3F
20
10
1A3
21
8
1C
22
9
D5
23
10
1A0
24
8
D4



Table A.49 (continued)
25
9
CA
26
8
BA
27
8
BB
28
8
49
29
8
D5
30
8
DF
31
9
CB
32
9
C8
33
9
C9
34
8
CE
35
10
1A1
36
8
1D
37
8
8
38
8
42
39
8
B8
40
9
3C
41
8
4C
42
8
B9
43
7
56
44
8
5A
45
7
D
46
8
5B
47
7
6C
48
7
66
49
7
6D
50
7
6
51
7
30
52
7
27
53
6
2C
54
6
21
55
6
32
56
6
11
57
6
17
58
6
1B
59
5
18
60
5
2
61
5
E
62
4
2
63
3
7
64
4
3
65
4
9
66
5
A
67
5
0
68
5
11
69
6
1F
70
5
14

Table A.49 (continued)


71
6
20
72
7
31
73
6
2D
74
6
2A
75
7
3D
76
7
57
77
7
6E
78
7
5
79
8
43
80
7
7
81
8
4D
82
8
4A
83
8
BE
84
8
40
85
9
CE
86
8
4B
87
8
CF
88
8
BF
89
9
CF
90
8
9
91
9
CC
92
9
3D
93
10
1A6
94
9
32
95
8
58
96
8
78
97
8
DE
98
8
BC
99
8
18
100
10
35A
101
9
CD
102
9
33
103
9
B2
104
10
1A7
105
8
41
106
9
B3
107
8
BD
108
10
35B
109
9
1A4
110
10
1A4
111
10
358
112
10
359
113
9
1A5
114
9
F2
115
9
90
116
10
35E

Table A.49 (continued)
117
10
35F
118
10
35C
119
10
35D
120
11
3CE
121
12
48F
122
11
3CF
123
10
1A5
124
11
246
125
10
122
126
10
1E6
Table A.50 Codebook 127^1
Index   Length   Codeword
0
10
52
1
9
2E
2
10
53
3
10
50
4
10
51
5
10
56
6
10
57
7
10
54
8
10
55
9
10
3AA
10
10
3AB
11
10
3A8
12
9
2F
13
10
3A9
14
9
2C
15
9
2D
16
10
3AE
17
10
3AF
18
9
22
19
10
3AC
20
10
3AD
21
10
3A2
22
10
3A3
23
9
23
24
10
3A0
25
10
3A1
26
9
20
27
10
3A6
28
10
3A7
29
10
3A4
30
9
F4

Table A.50 (continued)


31
9
21
32
10
3A5
33
10
3BA
34
10
3BB
35
9
F5
36
8
8A
37
10
3B8
38
8
F9
39
10
3B9
40
9
26
41
9
27
42
10
3BE
43
9
24
44
9
25
45
9
3A
46
8
8B
47
10
3BF
48
9
3B
49
8
88
50
10
3BC
51
8
89
52
8
FE
53
9
38
54
8
8E
55
9
39
56
9
3E
57
8
8F
58
8
8C
59
9
3F
60
10
3BD
61
8
7B
62
8
4D
63
8
8D
64
9
3C
65
9
3D
66
7
5
67
9
32
68
10
3B2
69
10
3B3
70
9
33
71
8
82
72
8
FF
73
9
30
74
10
3B0



Table A.50 (continued)
75
8
FC
76
10
3B1
77
9
31
78
9
36
79
8
83
80
8
78
81
8
80
82
9
37
83
9
34
84
8
81
85
9
35
86
10
3B6
87
9
A
88
8
5E
89
8
86
90
8
5F
91
8
FD
92
10
3B7
93
8
87
94
8
F2
95
8
84
96
8
85
97
8
5C
98
8
79
99
8
9A
100
8
7E
101
9
B
102
8
7F
103
7
27
104
8
7C
105
8
F3
106
7
24
107
7
25
108
7
36
109
7
32
110
8
F0
111
8
9B
112
8
F1
113
8
7D
114
7
12
115
7
66
116
7
13
117
7
37
118
7
67
119
6
30
120
6
10
121
7
2A
122
7
2B

Table A.50 (continued)


123
5
15
124
6
1C
125
6
A
126
5
1A
127
4
3
128
4
B
129
6
B
130
6
11
131
7
28
132
6
31
133
7
34
134
6
16
135
7
52
136
7
10
137
7
64
138
7
35
139
8
F6
140
7
3A
141
7
29
142
8
5D
143
7
53
144
8
62
145
7
50
146
7
11
147
8
F7
148
8
63
149
7
65
150
7
51
151
8
F4
152
8
60
153
8
F5
154
8
76
155
8
DA
156
8
61
157
8
98
158
8
DB
159
9
8
160
9
9
161
8
99
162
8
9E
163
10
3B4
164
8
9F
165
10
3B5
166
9
E
167
10
38A
168
8
9C
169
9
F
170
8
D8

Table A.50 (continued)
171
9
C
172
9
D
173
10
38B
174
8
77
175
10
388
176
9
2
177
8
D9
178
10
389
179
8
DE
180
8
9D
181
9
3
182
8
92
183
8
66
184
10
38E
185
8
93
186
10
38F
187
10
38C
188
10
38D
189
9
0
190
10
382
191
8
DF
192
8
90
193
10
383
194
10
380
195
10
381
196
8
DC
197
8
DD
198
8
91
199
9
1
200
9
6
201
8
96
202
9
7
203
8
4C
204
8
97
205
9
4
206
8
94
207
10
386
208
9
5
209
10
387
210
10
384
211
9
1A
212
9
1B

Table A.50 (continued)


213
10
385
214
10
39A
215
9
18
216
10
39B
217
10
398
218
10
399
219
8
95
220
9
19
221
9
1E
222
9
1F
223
9
1C
224
10
39E
225
9
CE
226
10
39F
227
10
39C
228
10
39D
229
10
392
230
10
393
231
10
390
232
10
391
233
9
1D
234
10
396
235
10
397
236
10
394
237
10
395
238
10
3EA
239
10
3EB
240
9
12
241
10
3E8
242
10
3E9
243
10
3EE
244
10
3EF
245
9
13
246
10
3EC
247
10
3ED
248
10
3E2
249
10
3E3
250
10
3E0
251
9
CF
252
9
10
253
10
3E1
254
9
11



Table A.51 Codebook 255^1
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

5
5
6
6
6
7
7
6
7
7
7
7
7
7
7
8
7
7
7
7
8
7
6
7
8
7
8
8
8
7
7
9
7
7
8
7
8
7
8
8
8
9
8
8
8
8

3
F
1A
8
B
23
14
9
33
15
36
20
21
26
1A
D4
37
4A
4B
27
D5
48
20
24
44
25
45
AA
5A
1B
49
1AE
4E
18
5B
32
58
4F
59
5E
AB
1AF
A8
A9
5F
AE

Table A.51
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92

(continued)
8
5C
8
5D
7
4C
8
AF
9
1AC
8
AC
9
1AD
9
1A2
9
1A3
8
AD
9
1A0
9
1A1
7
19
8
52
9
1A6
8
A2
8
53
9
1A7
8
A3
8
A0
8
50
8
A1
9
1A4
8
A6
8
A7
8
A4
8
51
9
1A5
9
1BA
9
1BB
9
1B8
8
56
9
1B9
9
1BE
8
57
8
A5
8
BA
8
BB
9
1BF
8
54
9
1BC
9
1BD
9
1B2
8
B8
9
1B3
8
55
7
4D

Table A.51
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139

(continued)
8
B9
8
3A
9
1B0
9
1B1
8
BE
8
BF
8
3B
8
BC
9
1B6
9
1B7
8
BD
9
1B4
9
1B5
8
B2
9
18A
9
18B
9
188
9
189
8
38
9
18E
8
B3
9
18F
9
18C
7
42
9
18D
8
B0
9
182
9
183
9
180
9
181
8
B1
9
186
8
B6
8
39
9
187
9
184
9
185
8
B7
8
3E
7
43
8
3F
9
19A
8
B4
8
3C
9
19B
9
198
9
199

Table A.51
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186

(continued)
9
19E
8
B5
9
19F
9
19C
8
A
8
B
8
3D
9
19D
9
192
9
193
9
190
8
8
9
191
9
196
9
197
8
62
9
194
9
195
8
9
9
1EA
9
1EB
8
63
9
1E8
9
1E9
8
E
9
1EE
9
1EF
9
1EC
9
1ED
9
1E2
9
1E3
8
F
9
1E0
9
1E1
7
A
9
1E6
8
C
8
D
9
1E7
9
1E4
8
2
8
3
8
0
9
1E5
9
1FA
8
1
9
1FB



Table A.51
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221

(continued)
9
1F 8
8
6
8
7
9
1F 9
9
1FE
9
1FF
9
1F C
9
1FD
9
1F 2
9
1F 3
9
1F 0
8
4
9
1F 1
9
1F 6
9
1F 7
9
1F 4
8
5
9
1F 5
8
8A
9
1CA
9
1CB
9
1C 8
9
1C 9
9
1CE
9
1CF
8
8B
9
1C C
8
88
9
1CD
9
1C 2
9
1C 3
8
89
8
8E
9
1C 0
9
1C1

Table A.51 (continued)


222
8
60
223
9
1C 6
224
9
1C 7
225
9
1C 4
226
9
1C 5
227
8
8F
228
9
1DA
229
9
1DB
230
9
1D8
231
9
1D9
232
9
1DE
233
8
8C
234
9
1DF
235
9
1DC
236
9
1DD
237
8
8D
238
9
1D2
239
9
1D3
240
9
1D0
241
9
1D1
242
9
1D6
243
9
1D7
244
9
1D4
245
9
1D5
246
8
12
247
8
13
248
8
61
249
8
10
250
8
11
251
8
16
252
10
5E
253
10
5F
254
9
2E
255
5
E


A.7 Huffman Codebooks for Indexes of Quantization Step Sizes
Table A.52 For quasi-stationary frames
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

3
3
4
4
5
5
6
6
6
7
7
7
8
8
9
9
9
10
10
10
11
11
11
11
11
12
12
11
9
9
9
9
9
9
9
10
10
10
11
11

1
5
6
F
9
1C
1F
2
32
3A
76
66
47
CF
E1
19C
138
1CF
1CB
3AD
394
39D
7D
4B2
4B3
733
F8
395
12E
18
EE
1C
E4
1D4
129
3F
3C
251
4F 3
759

Table A.52 (continued)


40
11
75D
41
12
F9
42
11
7A
43
11
398
44
11
39C
45
10
25E
46
10
34
47
10
1C1
48
10
1C 0
49
9
12B
50
9
13D
51
9
139
52
9
12D
53
10
3A
54
10
32
55
10
33B
56
10
3B
57
10
33
58
10
25F
59
10
33A
60
10
258
61
10
35
62
10
36
63
10
278
64
10
250
65
10
3AF
66
11
6F
67
11
75C
68
11
4F 2
69
11
4A8
70
12
DC
71
12
952
72
13
1D60
73
14
1C C 9
74
13
1D61
75
15
7588
76
14
3AC 6
77
16
EB16
78
16
EB17
79
16
EB14
80
16
EB15
81
14
3AC 7



Table A.52 (continued)
82
17
E646
83
17
E647
84
16
7322
85
15
7589
86
15
3990
87
15
3996
88
14
1C CA
89
15
3997
90
13
12A6
91
13
12A7
92
12
DD
93
11
7B
94
10
255
95
10
1CD
96
9
1D5
97
9
EF
98
8
9D
99
8
9F
100
8
46
101
8
71
102
8
76
103
7
74
104
7
77
105
7
22
106
6
24
107
6
26
108
6
10
109
6
1E
110
5
18
111
5
0
112
4
8
113
4
D
114
4
1
115
4
5


Table A.53 For frames with transients
Index   Length   Codeword
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

3
4
4
5
5
5
5
5
6
6
6
6
6
7
7
7
7
7
8
8
8
8
8
10
9
11
11
10
10
12
12
12

0
D
A
E
9
8
6
19
F
17
B
31
38
14
10
12
5D
5F
63
26
27
B9
BD
39D
47
3E4
2DB
8D
16B
7C 3
60E
60F

Table A.53 (continued)
32
12
60C
33
12
60D
34
12
602
35
11
738
36
13
C12
37
12
603
38
13
C13
39
12
600
40
11
739
41
13
C10
42
13
C11
43
12
601
44
11
73E
45
12
606
46
10
8A
47
11
73F
48
11
2D8
49
11
2D9
50
10
8B
51
10
1F 3
52
10
88
53
9
B1
54
9
B2
55
9
B7
56
9
B0
57
9
B3
58
9
FA
59
9
C2
60
9
1C C
61
9
1CA
62
10
1F 1
63
10
168
64
11
73C
65
11
73D
66
12
607
67
11
2D2
68
11
3E5
69
12
604
70
12
605
71
13
C16
72
12
5AA
73
12
5AB

Table A.53 (continued)


74
13
C17
75
13
C14
76
13
C15
77
13
B6A
78
13
B6B
79
13
B68
80
12
5A8
81
12
5A9
82
11
72E
83
11
72F
84
12
7C 2
85
13
B69
86
10
2E2
87
11
2D3
88
10
2E3
89
10
2E0
90
10
89
91
10
2E1
92
10
8C
93
10
396
94
11
3E0
95
9
1CD
96
9
C3
97
9
FB
98
8
BC
99
8
E4
100
8
7A
101
8
62
102
8
7B
103
7
15
104
7
3C
105
7
3F
106
6
30
107
6
E
108
6
19
109
5
16
110
5
1D
111
5
A
112
5
D
113
4
9
114
4
8
115
4
F


Index

2 × 2 delay system (defined), 125

absolute threshold of hearing (defined), 174


ADC (defined), 23
Additive noise model for quantization
(defined), 22
additive quantization noise model, 47
aliasing (defined), 99
aliasing transfer function (defined), 102
alphabet (defined), 146
AMGM inequality (defined), 77
analog-to-digital conversion (defined), 23
analysis filter banks (defined), 92
apex (defined), 176
AR, 137
AR (defined), 66
arithmetic mean (defined), 77
ATH (defined), 174
audible impairment, 255
auditory filters (defined), 176
autoregression, 88
autoregressive process (defined), 66
auxiliary data (defined), 246
average codeword length, 145
average codeword length (defined), 147
average perceptual quantization error (defined), 194
average quantization error (defined), 47

backward quantization (defined), 21


Bark scale, 185
Bark scale (defined), 179
base (defined), 176
baseline code (defined), 148
basilar membrane (defined), 176
basis functions (defined), 75
basis vectors (defined), 75

beating phenomenon (defined), 187


binary tree (defined), 154
bit allocation strategy (defined), 77
bit allocation (defined), 77
bit allocation, 11
bit pool (defined), 242
bit rate (defined), 4, 28
bit stream (defined), 147
bits per sample (defined), 28
block codebook (defined), 159
block length, 11
block length (defined), 73
block size, 11
block size (defined), 73
block symbol (defined), 159
blocking artifacts (defined), 91
blocking effects (defined), 91
blocky effect, 11
brief window, 257
brief window (defined), 217
Buffer overflow (defined), 242
Buffer underflow (defined), 242
buffer underrun (defined), 242

cascade of paraunitary matrices (defined), 126


CB (defined), 179
CMFB, 11
code (defined), 147
code sequence (defined), 147
code tree (defined), 154
codebook, 145
codebook (defined), 147
codes, 145
codeword length (defined), 147
codeword sequence (defined), 147
codewords, 145
codewords (defined), 147
coding gain (defined), 60

compact representation, 5
Companding (defined), 39
companding (defined), 40
compressed data size codeword (defined), 243
compression ratio, 148, 255
compression ratio (defined), 6
compressor (defined), 39
cosine modulated analysis filter (defined), 121
cosine modulated filter banks, 11
cosine modulated synthesis filter (defined), 122
critical band (defined), 179
critical band level (defined), 184
critical band rate (defined), 179
critical band rate scale (defined), 184
critical band segments (defined), 238
critical bands (defined), 238
critical bandwidth, 179, 182
critical bandwidth (defined), 179
critically sampled (defined), 97

end of frame signature (defined), 243


energy compaction, 11
energy conservation (defined), 75
entropy, 6
entropy (defined), 149
entropy coding, 9
entropy coding (defined), 149
equivalent rectangular bandwidth (defined), 184
ERB, 179
ERB (defined), 184
ERB scale (defined), 185
Error protection codes (defined), 245
error resilience (defined), 245
error-feedback function (defined), 71
even channel stacking (defined), 136
evenly-stacked TDAC (defined), 136
expander (defined), 40, 97
external node (defined), 154

DAC (defined), 23
data modeling, 9
decimator (defined), 96
decision boundaries (defined), 21
decision cells, 45
decision interval, 19
decision intervals, 47
decision intervals (defined), 21
decision regions, 45
decision regions (defined), 47
decoding delay (defined), 154
decorrelation (defined), 85
delay chain (defined), 93
DFT, 11
DFT (defined), 86
differential pulse code modulation (defined), 59
digital audio (compression) coding (defined), 5
digital-to-analog conversion (defined), 23
Discrete Fourier Transform, 11
Discrete Fourier Transform (defined), 86
Double blind principle (defined), 253
double buffering (defined), 236
double-blind, triple-stimulus with hidden reference (defined), 254
double-resolution switched MDCT, 223
double-resolution switched MDCT (defined), 213
downsampler (defined), 96
DPCM, 69
DPCM (defined), 59
DRA (defined), 255
Dynamic Resolution Adaptation (defined), 255

filter banks, 11
fine quantization, 69, 71, 111
fine quantization (defined), 62
fixed length, 145
fixed-length code (defined), 145, 148
fixed-length codebook (defined), 145, 148
fixed-length codes, 28
fixed-point (defined), 247
fixed-point arithmetic (defined), 246
flattening effect, 202
forward quantization (defined), 21
Forward Quantization, 19
Fourier uncertainty principle, 12
Fourier uncertainty principle (defined), 204
frame, 213
frame-based processing, 236
frequency coefficients (defined), 86
frequency domain (defined), 86
frequency resolution (defined), 200
frequency transforms (defined), 86
gammatone filter (defined), 177
geometric mean (defined), 77
Givens rotation (defined), 125
global bit allocation, 13
granular error (defined), 29
granular noise (defined), 29
half-sine window (defined), 137
Heisenberg uncertainty principle, 12
Heisenberg uncertainty principle (defined), 204

Huffman code, 9
Huffman code (defined), 161
hybrid filter bank (defined), 205

IDCT, 89
ideal bandpass filter (defined), 95
ideal subband coder (defined), 110
iid, 146
iid (defined), 147
images (defined), 101
independently and identically distributed (defined), 147
input–output map of a quantizer (defined), 21
instantaneously decodable (defined), 153
intensity density level, 184
intensity density level (defined), 174
inter-aural amplitude differences (defined), 232
inter-aural time differences (defined), 232
inter-frame bit allocation, 290
inter-frame bit allocation (defined), 241
internal node (defined), 154
interpolator (defined), 97
intra-frame bit allocation (defined), 241
Inverse Quantization, 20
inverse quantization, 281
inverse quantization (defined), 21
inverse transform (defined), 74

JIC, 282
JIC (defined), 232
joint channel coding (defined), 231
joint intensity coding, 14, 258, 259, 278, 282
joint intensity coding (defined), 232

k-means algorithm (defined), 49


Karhunen–Loève Transform (defined), 84
KLT (defined), 84
kurtosis, 38
kurtosis (defined), 32

lapped transforms, 11
lattice structure (defined), 126
LBG algorithm (defined), 49
leaf (defined), 154
Levinson–Durbin recursion (defined), 63
LFE, 231
LFE (defined), 4, 234
linear phase condition, 207
linear phase condition (defined), 121
Linear prediction, 51

linear prediction, 10
linear prediction coding (defined), 55
linear predictor (defined), 53
linear transform (defined), 73
Lloyd-Max quantizer (defined), 36
loading factor (defined), 29
Logarithmic companding, 257
long filter bank (defined), 200
look-ahead, 213
lossless (defined), 6, 147
lossless coding (defined), 147
lossless compression coding (defined), 147
lossy (defined), 6
low frequency effect, 231
low-frequency effect (defined), 4
Low-frequency effects (defined), 234
LPC (defined), 55
LPC decoder (defined), 56
LPC encoder (defined), 55
LPC reconstruction filter, 58, 69, 71
LPC reconstruction filter (defined), 56

M/S stereo coding, 14


M/S stereo coding (defined), 231
Markov process (defined), 66
masked (defined), 181
masked threshold (defined), 182, 189
maskee (defined), 181
maskers (defined), 181
masking in frequency gap (defined), 181
masking index (defined), 186
masking signals (defined), 181
masking spread function (defined), 188
masking threshold, 13
masking threshold (defined), 182
maximal active critical band (defined), 275
maximally decimated (defined), 97
MDCT, 11, 91, 223
MDCT (defined), 136
mean squared quantization error (defined), 22
medium window (defined), 224
messages (defined), 146
meta-symbol (defined), 161
midrise (defined), 26
midtread (defined), 25
midtread uniform quantization, 257, 281
minimum average codeword length (defined), 153
mirror image condition (defined), 121
modified discrete cosine transform, 11, 91
Modified discrete cosine transform (defined), 136
modulated filter bank (defined), 94

modulating (defined), 94
monaural sound (defined), 4
MSQE, 47
MSQE (defined), 22
multiplexer, 14

near-perfect reconstruction (defined), 103


NMR, 242
NMR (defined), 194
noble identity for decimation (defined), 107
noble identity for interpolation (defined), 108
noise to mask ratio (defined), 194
noise-feedback coding (defined), 71
noise-feedback function (defined), 71
noise-shaping filter (defined), 71
nonperfect reconstruction, 122
nonperfect reconstruction (defined), 103
nonuniform quantization, 8
normal channels (defined), 4
normal equations, 215
normal equations (defined), 63
NPR (defined), 103
number of ERBs (defined), 185

odd channel stacking (defined), 136


oddly stacked TDAC (defined), 136
open-loop DPCM, 71, 215
open-loop DPCM (defined), 55
optimal codebook (defined), 152, 153
optimal coding gain, 113
Orthogonal transforms (defined), 74
orthogonality principle (defined), 62
output values (defined), 21
oval window (defined), 176
overall transfer function (defined), 102
overload areas (defined), 28
overload error (defined), 29
overload noise (defined), 29
overshoot effect (defined), 194

paraunitary (defined), 124


PCM (defined), 3, 24
PDF (defined), 21
peakedness, 38
peakedness (defined), 32
perceptual entropy (defined), 197
perceptual irrelevance, 8
perceptual quantization error (defined), 194
perceptually irrelevant (defined), 7
perfect reconstruction (defined), 103
perfect reconstruction, 123, 207

ping-pong buffering (defined), 236
polyphase (component) matrix (defined), 105
polyphase components, 106
polyphase implementation, 107
polyphase representation, 106
postmasking (defined), 193
power-complementary conditions, 207
PR (defined), 103
pre-echo artifacts (defined), 203
prediction coefficients (defined), 53
prediction error, 10, 51, 53
prediction error (defined), 55
prediction filter (defined), 53
prediction gain (defined), 55
prediction residue (defined), 55
predictor order (defined), 53
prefix (defined), 153
prefix code (defined), 154
prefix-free code (defined), 154
premasking, 257
premasking (defined), 193
primary windows (defined), 210
probability density function (defined), 21
prototype filter (defined), 94
pseudo QMF (quadrature mirror filter) (defined), 122
Pulse-code modulation, 3
pulse-code modulation (defined), 24

quantization, 8, 19
quantization distortion (defined), 21
quantization error, 8
quantization error (defined), 21
quantization error accumulation, 59
quantization index, 19
quantization index (defined), 21
quantization noise (defined), 21
quantization noise accumulation (defined), 57
quantization step size, 8, 24, 275
quantization step size (defined), 24
quantization table, 21
quantization table (defined), 19
quantization unit, 257, 275, 281
quantization unit (defined), 238
quantized value, 20
quantized values (defined), 21
quasistationary (defined), 199

re-quantize (defined), 24
re-quantizing (defined), 17
reconstructed vector (defined), 47
reconstruction error (defined), 57

representative values (defined), 21
representative vector (defined), 47
representative vectors, 45
residue, 10, 53
reversal or anti-diagonal matrix (defined), 131
rotation vector (defined), 126
rounded exponential filter (defined), 177
rounding function (defined), 26

sample rate, 3
sample rate (defined), 19
sample rate compressor (defined), 96
sampling period (defined), 19
sampling theorem, 3
sampling theorem (defined), 19
SBC (defined), 91
scalar quantization (defined), 21
scalar quantizer, 8
self-information (defined), 148
Shannons noiseless coding theorem, 167
siblings (defined), 163
side information (defined), 244
signal-to-mask ratio (defined), 186
signal-to-noise ratio (defined), 23, 252
simultaneous masking (defined), 193
sine window (defined), 137
SMR, 252
SMR (defined), 186
SMR threshold, 195
SMR threshold (defined), 186
SNR (defined), 252
sound intensity level, 184
sound intensity level (defined), 174
sound pressure level (defined), 174
source sequence, 145
source sequence (defined), 147
source signal, 5
source symbol, 145
source symbols (defined), 146
spectral envelope (defined), 69
spectral flatness measure, 113
spectrum flatness measure (defined), 85
SPL (defined), 174
spread of masking (defined), 188
SQ, 8
SQ (defined), 17, 21
SNR (defined), 23
statistic-adaptive codebook assignment, 268
statistical redundancy (defined), 6
statistical redundancy, 9
steering vector (defined), 233
stereo (defined), 4
stopband attenuation, 94

subband (defined), 93
subband coding, 91
subband coding (defined), 97
subband filter (defined), 93
subband filters (defined), 93
subband samples (defined), 93
subbands (defined), 93
subjective listening test, 290
sum/difference coding, 14, 258, 259, 276
sum/difference coding (defined), 231
Sum/difference decoding, 283
surround sound (defined), 4
Switched MDCT (defined), 207
switched-window MDCT (defined), 207
symbol, 145
symbol code (defined), 147
symbol set, 145
symbol set (defined), 146
synchronization codeword (defined), 243
synthesis filter banks (defined), 92

TC (defined), 73
TDAC (defined), 136
temporal masking (defined), 193
Temporal noise shaping (defined), 215
temporal-frequency analysis, 69
test set (defined), 49
test signal (defined), 181
the DCT, 89
the inverse DCT, 89
THQ (defined), 174
threshold in quiet (defined), 174
time–frequency analysis (defined), 86
time–frequency resolution (defined), 204
time–frequency tiling (defined), 238
time-domain aliasing cancellation (defined), 136
TLM, 256
TLM (defined), 220
TNS (defined), 215
tonal frequency components, 199
training set (defined), 48
transform (defined), 73
Transform coding (defined), 73
transform coefficients (defined), 73
transformation matrix (defined), 73
transformed block (defined), 73
transient attack (defined), 202
transient detection interval, 227
transient detection interval (defined), 213
transient function (defined), 228
Transient localized MDCT, 256
transient segment (defined), 237

transient segmentation (defined), 237
transient-detection blocks (defined), 227
transient-localized MDCT (defined), 220
transitional windows (defined), 210
transparent (defined), 254
triple-resolution switched MDCT (defined), 224
truncate function (defined), 27
type-I polyphase (component) vector (defined), 105
type-I polyphase component (defined), 104
type-I polyphase representation (defined), 104
type-II DCT (defined), 88
type-II polyphase components (defined), 106
Type-II polyphase representation (defined), 106
Type-IV DCT (defined), 89
unary code, 9, 152, 162
unary codebook, 146
unary numeral system, 146, 152
unequal error protection (defined), 245
uniform quantization, 24
uniform quantizer (defined), 24
uniform scalar quantization, 8
unique decodability (defined), 152
unitary (defined), 86, 124
upsampler (defined), 97

variable-length codebooks (defined), 145
variable-length codes (defined), 145
vector quantization, 8
vector quantization (defined), 43
vector-quantized, 47
virtual window (defined), 217
virtual windows, 256
Voronoi regions (defined), 48
VQ, 8
VQ (defined), 17, 43
VQ codebook, 43
VQ codebook (defined), 47
VQ quantization error (defined), 47

water-filling, 81
water-filling algorithm (defined), 242
whitening filter (defined), 67
Wiener–Hopf equations, 67
Wiener–Hopf equations (defined), 63
window (defined), 137
window function (defined), 137

Yule–Walker prediction equations (defined), 63
