
Audio Coding Techniques (I)

 Introduction
 How is audio different from speech?
 Human auditory system
 Lossless audio coding
 Reversibility of closed-loop DPCM
 Inter-channel decorrelation
 Perceptual audio coding
 Psychoacoustics
 Perceptual entropy

EE493Q: Digital Speech Processing


Introduction to Audio
 What is audio?
 “Of or relating to high-fidelity sound reproduction”
 How is audio different from speech?
 Higher sampling rate
 CD-quality music: bandwidth up to 20 kHz (44.1 kHz sampling)
 Wideband speech in video conferencing: 7 kHz bandwidth (16 kHz sampling)
 Higher accuracy
 12-16 bits per sample
 Multi-channel
 Mono, stereo, 6-channel (5.1 surround)
 Requires much more bandwidth
 Raw data rate ~700 kbps per channel

EE493Q: Digital Speech Processing


Stereo-Audio

EE493Q: Digital Speech Processing


Audio Compression
 How is it different from speech
compression?
 Requirement
 Lossless compression is important in some
applications (e.g., archiving and mixing of high-
quality recordings in professional environments)
 Perceptually lossless compression (e.g., MP3 music)

 Principles
 No physical model exists for audio production
 Instead, more emphasis is put on the human auditory system, in particular the psychoacoustic masking effect

EE493Q: Digital Speech Processing


Sound Quality Requirements

EE493Q: Digital Speech Processing


Review Question (I)

Q: Given audio sampled at 32 kHz, its 25th subband (12-12.5 kHz) has an SPL of 10 dB. Can the human ear hear it?

A: NO – at 12-12.5 kHz the absolute threshold of hearing is above 10 dB SPL, so this subband is inaudible and need not be coded.
EE493Q: Digital Speech Processing


Review Question (II)

Q: Given a masker tone at 2 kHz and 60 dB SPL, if a test tone is played in the 15th critical band (CB) with an SPL of 50 dB, is it masked?

A: NO – the 15th CB (roughly 2320-2700 Hz) lies about two critical bands above the masker, where the spread masking threshold has dropped well below 50 dB.

EE493Q: Digital Speech Processing


Review Question (III)

Q: Consider the echo-hiding scheme for audio watermarking. Do we want to insert echoes before or after the masker?

A: AFTER – temporal post-masking (forward masking) lasts much longer than pre-masking, so echoes placed after the masker are far less audible.

EE493Q: Digital Speech Processing


Audio Coding Techniques (I)
 Introduction
 How is audio different from speech?
 Human auditory system
 Lossless audio coding
 Reversibility of closed-loop DPCM
 Inter-channel decorrelation
 Perceptual audio coding
 Psychoacoustics
 Perceptual entropy

EE493Q: Digital Speech Processing


Overview

Hans and Schafer, “Lossless Compression of Digital Audio,” IEEE Signal Processing Magazine, July 2001

Note: this approach does not take inter-channel correlation into account, so it is unlikely to be optimal
EE493Q: Digital Speech Processing
Intra-Channel Decorrelation

(rounding)

Notes: the prediction residues e(n) are integers due to rounding; A(z) is the autoregressive (AR) model and B(z) the moving-average (MA) model

EE493Q: Digital Speech Processing


Justification of Reversibility
Recall: quantization is not invertible, so how can we achieve lossless
compression despite the rounding operation?

Encoder:  x̂(n) = Σ_{k=1..K} a_k x(n−k),   e(n) = x(n) − Q[x̂(n)]

Decoder:  x̂(n) = Σ_{k=1..K} a_k x(n−k),   x(n) = e(n) + Q[x̂(n)]

(Q[·] denotes rounding to the nearest integer)

Answer: closed-loop DPCM guarantees the reversibility
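
A minimal Python sketch of this argument (the predictor order and coefficients are illustrative assumptions, not values from the lecture): the decoder regenerates exactly the same rounded prediction Q[x̂(n)] from the already-reconstructed past samples, so adding back the integer residue e(n) restores x(n) bit-exactly.

```python
import random

def predict(past, a):
    # x_hat(n) = sum_k a_k * x(n - k); past[-1] is x(n-1), past[-2] is x(n-2), ...
    return sum(ak * past[-k] for k, ak in enumerate(a, start=1))

def dpcm_encode(x, a):
    K, e = len(a), []
    for n, xn in enumerate(x):
        if n < K:
            e.append(xn)                                   # warm-up samples sent verbatim
        else:
            e.append(xn - round(predict(x[n - K:n], a)))   # e(n) = x(n) - Q[x_hat(n)]
    return e

def dpcm_decode(e, a):
    K, x = len(a), []
    for n, en in enumerate(e):
        if n < K:
            x.append(en)
        else:
            x.append(en + round(predict(x[n - K:n], a)))   # same Q[.], same past samples
    return x

a = [1.5, -0.6]                                            # illustrative 2nd-order predictor
x = [random.randint(-32768, 32767) for _ in range(1000)]   # 16-bit integer samples
assert dpcm_decode(dpcm_encode(x, a), a) == x              # bit-exact despite the rounding
```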

EE493Q: Digital Speech Processing


Inter-Channel Decorrelation

(figure: left/right channels L, R mapped to average s and difference d)

Average: s = (L + R)/2    Difference: d = R − L
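
A small sketch of one common lossless integer variant of this average/difference idea (illustrative, not necessarily the exact mapping intended here): the floor in the average appears to lose one bit, but that bit equals the parity of the difference d, so the pair (s, d) remains perfectly reversible.

```python
def ms_encode(L, R):
    s = (L + R) >> 1          # "average": floor((L + R) / 2), seems to drop one bit...
    d = R - L                 # ...but the dropped bit equals the parity of d
    return s, d

def ms_decode(s, d):
    L = s + ((d & 1) - d) // 2    # recover L exactly using d's parity
    R = L + d
    return L, R

# exhaustive check over a small range of integer sample values
for L in range(-64, 65):
    for R in range(-64, 65):
        assert ms_decode(*ms_encode(L, R)) == (L, R)
```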

EE493Q: Digital Speech Processing


Stereo Recording
Techniques*
 X-Y technique: two directional microphones are placed coincidentally (at nearly the same point), typically at an angle of 90 degrees or more to each other
 Mono-compatible
 A-B technique: two omni-directional microphones are placed at some distance from each other (from about 20 centimeters up to several meters)
 Adding a third microphone at the center turns this into the “Decca Tree” arrangement
EE493Q: Digital Speech Processing
Audio Coding Techniques (II)
 MP3 Audio Compression
 Filter bank/Modified DCT
 Psychoacoustic Models
 Bit Allocation
 Advanced Audio Coding (AAC)
Techniques
 MPEG-1,2,4
 SONY ATRAC
 Lucent PAC
 Dolby AC-3
EE493Q: Digital Speech Processing
Introduction
 What does ISO MPEG-1 Audio provide?
A perceptually transparent, lossy audio compression system that exploits the weaknesses of the human ear.
 Can provide compression by a factor of 6 and retain
sound quality.
 One part of a three-part standard that includes audio, video, and audio/video synchronization
 MPEG-2 and MPEG-4 have advanced audio
coding (AAC) options
 ITU-T has its own standardized algorithm for
wideband speech (audio)
EE493Q: Digital Speech Processing
MPEG-I Audio Features
 PCM sampling rate of 32, 44.1, or 48 kHz
 Four channel modes:
 Monophonic and Dual-monophonic
 Stereo and Joint-stereo
 Three modes (layers in MPEG-I speak):
 Layer I: Computationally cheapest, bit rates > 128 kbps
 Layer II: Bit rate ~128 kbps, used in VCD
 Layer III: Most complicated encoding/decoding, bit rates ~64 kbps, originally intended for streaming audio

EE493Q: Digital Speech Processing


MPEG-I Encoder Architecture

EE493Q: Digital Speech Processing


MPEG-I Encoder
Architecture
 Polyphase Filter Bank: Transforms PCM samples
to frequency domain signals in 32 subbands
 Psychoacoustic Model: Identifies the acoustically irrelevant parts of the signal
 Bit Allocation: Allots bits to subbands according to the output of the psychoacoustic calculation.
 Frame Creation: Generates an MPEG-I compliant
bit stream.

EE493Q: Digital Speech Processing


What is a Filter Bank?

(figure: analysis filter bank and synthesis filter bank)
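
A toy illustration of the analysis/synthesis structure, assuming the simplest possible bank: a critically sampled block DFT (the MPEG-1 polyphase filter bank uses a long prototype filter instead, but the analysis-then-synthesis idea is the same).

```python
import numpy as np

M = 32                                                    # number of subbands
x = np.random.default_rng(0).standard_normal(10 * M)     # test signal, length a multiple of M

# analysis: each block of M samples becomes one sample in each of M subbands
subbands = np.fft.fft(x.reshape(-1, M), axis=1)

# synthesis: the inverse transform and reassembly recover the signal
x_rec = np.fft.ifft(subbands, axis=1).real.reshape(-1)

print(np.max(np.abs(x - x_rec)))                          # ~1e-16: perfect reconstruction
```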
EE493Q: Digital Speech Processing
Filter Bank Illustration

EE493Q: Digital Speech Processing


Modified Discrete Cosine
Transform

Forward Transform

Inverse Transform
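
For reference, a minimal numerical sketch of the standard MDCT/IMDCT pair with a sine window and 50% overlap-add (the transform length N and the window choice are illustrative; this is not the MPEG Layer III implementation).

```python
import numpy as np

def mdct_basis(N):
    n, k = np.arange(2 * N)[:, None], np.arange(N)[None, :]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))     # 2N x N cosine basis

def mdct(frame, N):
    return frame @ mdct_basis(N)                  # 2N windowed samples -> N coefficients

def imdct(coeffs, N):
    return (2.0 / N) * (mdct_basis(N) @ coeffs)   # N coefficients -> 2N aliased samples

N = 16
win = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))   # sine window (Princen-Bradley)

rng = np.random.default_rng(0)
x = rng.standard_normal(8 * N)
y = np.zeros_like(x)

# 50%-overlapped frames: window -> MDCT -> IMDCT -> window -> overlap-add
for start in range(0, len(x) - 2 * N + 1, N):
    X = mdct(win * x[start:start + 2 * N], N)        # only N coefficients kept per frame
    y[start:start + 2 * N] += win * imdct(X, N)      # time-domain aliasing cancels on overlap

print(np.max(np.abs(x[N:-N] - y[N:-N])))             # ~1e-15 for the fully overlapped samples
```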
EE493Q: Digital Speech Processing
Pre-Echo Distortion

EE493Q: Digital Speech Processing


MPEG-I Psychoacoustic
Models
 MPEG-I standard defines two models:
 Psychoacoustic Model 1:
 Less computationally expensive
 Makes some serious compromises in what it
assumes a listener cannot hear
 Psychoacoustic Model 2:
 Provides more features suited to Layer III coding, assuming, of course, increased processor bandwidth.

EE493Q: Digital Speech Processing


Step 1: Spectral Analysis and
SPL Normalization
 Convert samples to the frequency domain
 Use a Hann weighting and then a DFT
 This simply gives a frequency-domain representation free of edge artifacts (caused by the finite window size).
 Model 1 uses a 512-sample (Layer I) or 1024-sample (Layers II and III) window.
 Model 2 uses a 1024-sample window and two calculations per frame.
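
A rough sketch of this step, assuming one 1024-sample frame and a crude normalization that pins the frame maximum at 96 dB SPL (the standard instead references full-scale 16-bit PCM; the constants here are illustrative).

```python
import numpy as np

fs, N = 44100, 1024
x = np.random.default_rng(0).standard_normal(N)       # one analysis frame

w = np.hanning(N)                                     # Hann weighting suppresses edge artifacts
X = np.fft.rfft(w * x)
power_db = 10 * np.log10(np.abs(X) ** 2 + 1e-12)      # power spectrum in dB

spl = power_db + (96.0 - power_db.max())              # crude SPL normalization (max -> 96 dB)
freqs = np.fft.rfftfreq(N, 1 / fs)                    # frequency of each spectral line in Hz
```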

EE493Q: Digital Speech Processing


Step 2: Identification of Tonal
and Noise Maskers
 Need to separate sound into “tones” and “noise”
components
 Model 1:
 Local peaks are treated as tonal maskers; the remaining spectrum in each critical band is lumped into a single noise masker at a representative frequency (see the sketch below).

Example: (figure)

 Model 2:
 Calculates a “tonality” index to estimate the likelihood that each spectral line is a tone
 Based on the previous two analysis windows
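
A simplified sketch of the Model 1 rule: spectral lines that are local maxima and stand well above their neighborhood are labeled tonal, and the rest of each critical band is lumped into one noise masker. The 7 dB margin and the ±2-bin neighborhood are illustrative stand-ins for the standard's exact peak-picking rules.

```python
import numpy as np

def find_tonal_maskers(spl, margin_db=7.0):
    # a line is "tonal" if it is a local peak and exceeds its +-2-bin neighbors by margin_db
    tonal = []
    for k in range(2, len(spl) - 2):
        is_peak = spl[k] > spl[k - 1] and spl[k] >= spl[k + 1]
        stands_out = all(spl[k] - spl[k + j] >= margin_db for j in (-2, 2))
        if is_peak and stands_out:
            tonal.append(k)
    return tonal

def noise_maskers(spl, band_edges, tonal_bins):
    # lump the remaining (non-tonal) energy of each critical band into one noise masker
    out = []
    for lo, hi in band_edges:
        bins = [k for k in range(lo, hi) if k not in tonal_bins]
        energy = np.sum(10.0 ** (np.asarray(spl)[bins] / 10)) if bins else 0.0
        out.append(10 * np.log10(energy + 1e-12))
    return out

spl = [20, 25, 60, 24, 22, 30, 31, 29, 28, 27]
print(find_tonal_maskers(spl))                    # -> [2]: only the 60 dB line is tonal
print(noise_maskers(spl, [(0, 5), (5, 10)], {2}))
```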

EE493Q: Digital Speech Processing


Graphic Illustration

X: tonal
O: noise

EE493Q: Digital Speech Processing


Three Types of Frequency
Masking
 Noise-Masking-Tone (NMT): SMR = 4 dB
 Tone-Masking-Noise (TMN): SMR = 24 dB
 Noise-Masking-Noise (NMN): SMR = 26 dB

Asymmetry: noise is a far stronger masker than a tone. For example, a 60 dB tone masks noise in its critical band only up to about 60 − 24 = 36 dB, whereas a 60 dB narrow-band noise masks a tone up to about 60 − 4 = 56 dB.
(figure: NMT vs. TMN masking curves illustrating the asymmetry)
EE493Q: Digital Speech Processing
Step 3: Decimation and
Reorganization of Maskers
 “Smear” each masker across its critical band
 Use either a masking function (Model 1) or a spreading function (Model 2).
 Adjust the calculated threshold by incorporating a “quiet” mask – the masking threshold at each frequency when no other sound is present (the absolute threshold of hearing).
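
Two widely used closed-form curves that can play the roles described above (a sketch; the MPEG psychoacoustic models use their own tabulated functions): Terhardt's approximation of the threshold in quiet, and Schroeder's spreading function for smearing a masker across neighboring critical bands.

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    # Terhardt's approximation of the absolute threshold of hearing (the "quiet" mask)
    f = np.asarray(f_hz, dtype=float) / 1000.0
    return 3.64 * f**-0.8 - 6.5 * np.exp(-0.6 * (f - 3.3)**2) + 1e-3 * f**4

def spreading_db(delta_bark):
    # Schroeder's spreading function: attenuation (in dB) at a distance of delta_bark
    # critical bands from the masker; approximately 0 dB at the masker itself
    dz = np.asarray(delta_bark, dtype=float)
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474)**2)

print(threshold_in_quiet_db(12000.0))                           # ~21 dB SPL at 12 kHz
print(60 + spreading_db(np.array([-1.0, 0.0, 1.0, 2.0, 3.0])))  # a 60 dB masker smeared over nearby barks
# (the tonal/noise offsets from the NMT/TMN slide are then subtracted to get the threshold)
```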

EE493Q: Digital Speech Processing


Step 4: Calculation of
Individual Masking Thresholds
 Calculate a masking threshold for each subband in the
polyphase filter bank
 Model 1:
 Selects the minimum of the masking threshold values within the range of each subband
 Inaccurate at higher frequencies – recall that the subbands are linearly (uniformly) distributed, while critical bands are NOT!
 Model 2:
 If the subband is wider than the critical band:
 Use the minimum masking threshold within the subband
 If the critical band is wider than the subband:
 Use the average masking threshold within the subband
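
A small sketch of the Model 2 rule above, mapping per-line thresholds onto the 32 uniform subbands; the band widths used here are illustrative, not the standard's tables.

```python
import numpy as np

def subband_thresholds(line_thresh_db, n_subbands, cb_width_bins):
    # cb_width_bins[sb]: approximate width (in spectral lines) of the critical band
    # that covers subband sb
    lines_per_sb = len(line_thresh_db) // n_subbands
    out = []
    for sb in range(n_subbands):
        lines = line_thresh_db[sb * lines_per_sb:(sb + 1) * lines_per_sb]
        if lines_per_sb > cb_width_bins[sb]:   # subband wider than the critical band: be safe
            out.append(np.min(lines))
        else:                                  # critical band wider than the subband
            out.append(np.mean(lines))
    return np.array(out)

thr = np.linspace(10, 40, 512)                 # dummy per-line masking thresholds (dB)
cbw = np.r_[np.full(8, 4), np.full(24, 64)]    # narrow critical bands at low f, wide at high f
print(subband_thresholds(thr, 32, cbw))
```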

EE493Q: Digital Speech Processing


Graphic Illustration

Tonal components Noise components

EE493Q: Digital Speech Processing


Step 5: Calculating Global
Masking Thresholds
 The hard work is done – now we just calculate the signal-to-mask ratio (SMR) per subband
 SMR = signal energy / masking threshold (i.e., in dB, SMR = signal level − masking threshold)
 The calculated SMR is used by the audio codec to determine how many bits to spend on each subband
 This is where most of the compression occurs – if a coefficient is below the masking threshold, it needs no bits at all!
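
A toy end-to-end illustration of this step: compute the per-subband SMR and let it drive a greedy bit allocation. The 6 dB-per-bit rule of thumb, the bit budget, and the subband values below are illustrative assumptions, not the MPEG-1 allocation tables.

```python
import numpy as np

signal_db = np.array([70.0, 62.0, 55.0, 40.0, 30.0])   # per-subband signal energy (dB)
mask_db   = np.array([45.0, 50.0, 58.0, 35.0, 42.0])   # per-subband masking threshold (dB)

smr = signal_db - mask_db            # SMR in dB (energy ratio expressed as a difference in dB)
bits = np.zeros(len(smr), dtype=int)
mnr = bits * 6.0 - smr               # mask-to-noise ratio: each extra bit buys ~6 dB of SNR
budget = 12

# keep giving one bit to the subband whose quantization noise is currently most audible,
# until the budget runs out or all noise has dropped below the mask
while budget > 0 and np.min(mnr) < 0:
    sb = int(np.argmin(mnr))
    bits[sb] += 1
    mnr[sb] += 6.0
    budget -= 1

print(smr)     # [ 25.  12.  -3.   5. -12.]
print(bits)    # [5 2 0 1 0]: subbands already below the mask (SMR <= 0) get no bits
```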
EE493Q: Digital Speech Processing
Graphic Illustration

EE493Q: Digital Speech Processing


Psychoacoustic Model
Summary
input audio frame

Spectral Analysis and SPL Normalization

Identification of Tonal and Noise Maskers

Decimation and Reorganization of Maskers

Calculation of Individual Masking Thresholds

Calculating Global Masking Thresholds

Signal-to-Masking Ratios (SMR)


EE493Q: Digital Speech Processing
Example: Calculating Signal
Energy

EE493Q: Digital Speech Processing


Calculating Masking
Thresholds

EE493Q: Digital Speech Processing


SMR Results

EE493Q: Digital Speech Processing


How Is Perceptually Lossless
Compression Achieved?

(figure: spectral coefficients A, B, C and D plotted against the global masking threshold)

Coefficient A requires bits; coefficient B does not, because it is masked


Question: how about coefficients C and D?

EE493Q: Digital Speech Processing


Summary of Perceptual Audio
Coding
 Psychoacoustics
 Frequency dependency: Human ears are most sensitive to 2-4 kHz
 Masking: A tone could be inaudible because of the
presence of another one (close in frequency or time)
 Asymmetry: Noise-masking-tone is easier than tone-
masking-noise
 MP3
 Time-to-frequency transformation by filter bank or
modified Discrete Cosine Transform
 Psychoacoustic Model I or II produces Signal-to-Masking
Ratio (SMR) that guides the bit allocation process for
each subband
 Perceptually lossless at bit rates of 64-128 kbps
EE493Q: Digital Speech Processing
Headphone Technology

http://www.technologyreview.com/read_article.aspx?id=17642&ch=infotech

EE493Q: Digital Speech Processing


Audio Coding Techniques (II)
 MP3 Audio Compression
 Filter bank/Modified DCT
 Psychoacoustic Models
 Bit Allocation
 Beyond technical issues
 Legal, practical, and ethical issues
 Open discussions

EE493Q: Digital Speech Processing


Legal Issues Surrounding
MP3
 It's a civil offense, punishable by fine, if
you distribute music that you don't own
the rights to.
 It's a criminal offense to copy music
illegally and then redistribute it for
financial gain.
 There is a great deal of uncertainty about
how copyright laws should function in the
digital world, but the laws themselves are
clear

EE493Q: Digital Speech Processing


The Story of Home-Taping
Nightmare
 In the 1970s, tapes became easy to duplicate at home – yet nobody was caught as a copyright violator, right?
 The economics of the entire system actually collapsed, and it was only revived by the forced introduction of an entirely new audio format, the compact disc (CD).
 A tax on blank tapes and taping equipment was created under the Audio Home Recording Act of 1992 to offset lost revenues

EE493Q: Digital Speech Processing


Now Comes the MP3
Nightmare
 The internet makes downloading and sharing MP3s easy
 Those mammoth companies are still going to sue every college student they find with MP3s on their site.
 On 9 October 1998, the RIAA filed for a temporary restraining order to prevent San Jose-based Diamond Multimedia from selling its new MP3 player. Called the "Rio," the player retailed for $199 and was essentially a Walkman for MP3 files.

EE493Q: Digital Speech Processing


What is Legal?
 Most MP3 files on the internet are illegal, except for
 Recorded works to which you personally own the copyrights.
 Recorded works in the public domain.
 As long as you keep your MP3s in the privacy of your own hard drive and not on the Web, you are very hard to catch and relatively harmless.

EE493Q: Digital Speech Processing


MP3: The Transformation of
Recording Industry
 Why didn't the ISO/IEC address the copyright issue when developing MPEG-1?
 Its members weren't necessarily thinking about the legal ramifications but instead focused on creating an effective technology.
 Unsuccessful fight-back strategy by the RIAA: search and destroy
 Only through systematic, consistent, massive legal action can the record companies possibly hope to win this war.

EE493Q: Digital Speech Processing


Watermarking: a Technical
Savior?
 The RIAA's Secure Digital Music Initiative (SDMI) came to nothing around 2000
 Your midterm project might have shown this
 None of the existing watermarking techniques is robust enough to win this war against piracy
 Ultimately, it's all about a workable revenue model. Once that's been established, perhaps the quality and convenience of the MP3 format can be seen as a boon to the industry instead of a threat.

EE493Q: Digital Speech Processing


Ethic Issues
 Copying music for your own private use is
cool, but posting that music to a Web site
or distributing it in any way is not. By
doing this you are robbing people who
worked very hard to create the music you
like.
 Think about it: is there any difference between downloading an album from the internet for free and stealing an album from the local store?
EE493Q: Digital Speech Processing
Open Discussions
 Who should get paid?
 Is there a better business model?
 Is there any better technical solution than
watermarking to fight against piracy?
 Which side will you take? A defender of
RIAA or a hacker?

EE493Q: Digital Speech Processing
