
MPEG-x AUDIO STANDARDS

1. ng B Vit 11DT3
2. Ng t 11DT3
3. Bi nh Phc 11DT2

OUTLINE

INTRODUCTION
FUNDAMENTALS
MPEG-x AUDIO STANDARDS
PERFORMANCE MEASURES
EVALUATION
CONCLUSION

INTRODUCTION
MPEG-x Standards: evolving set of standards for video and audio
compression developed by the Moving Picture Experts Group.
MPEG-x Audio:
General Audio (GA) coding
Takes PCM audio streams and encodes them efficiently for
transmission and storage
Synthetic audio
Text-to-Speech, how to generate and play virtual instruments

We focus on General Audio coding

INTRODUCTION

(Block diagram: MPEG audio encoder, with an optional ancillary data input)

FUNDAMENTALS
Psycho-acoustic model:
Hearing characteristics
Threshold of hearing
Frequency masking
Critical bands
Bark units
Temporal masking
Time to Frequency Transformation:
Filter banks
Bit allocation:
Bitstream Formatting
MPEG Audio Algorithm

PSYCHO-ACOUSTIC
How humans perceive sound
In the compression context, its main use is to tell us which parts of
the signal can be removed without audible effect

HEARING CHARACTERISTICS

Range of human hearing: 20 Hz to 20 kHz


Voice: 500 Hz to 4 kHz
Sensitivity of hearing depends upon frequency
Fletcher-Munson equal-loudness curves
Describe the relationship between perceived loudness and stimulus
sound level as a function of frequency

FLETCHER-MUNSON EQUAL-LOUDNESS
CURVES

THRESHOLD OF HEARING

THRESHOLD OF HEARING

An approximate analytical expression for the threshold (in dB):
Threshold(f) = 3.64 (f/1000)^-0.8 - 6.5 e^(-0.6 (f/1000 - 3.3)^2) + 10^-3 (f/1000)^4
The origin is at the frequency of 2 kHz since Threshold(f) = 0 at f = 2 kHz
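The threshold curve can be sketched numerically. The code below uses Terhardt's commonly cited approximation (the formula above); it is an illustrative model, not the exact curve used by any particular MPEG psychoacoustic model.

```python
import math

def threshold_of_hearing_db(f_hz):
    """Approximate threshold of hearing in dB (Terhardt's formula)."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The ear is most sensitive around 2-5 kHz; the threshold rises
# steeply toward both ends of the audible range.
for f in (50, 2000, 15000):
    print(f, round(threshold_of_hearing_db(f), 1))
```

Evaluating at 2 kHz gives a value close to 0 dB, which is why the curve is anchored there.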

FREQUENCY MASKING
A sound can make another sound difficult to hear when the two are
close enough in frequency.
A lower frequency can effectively mask a higher frequency.
A higher frequency does not mask a lower frequency well.
The greater the power in the masking frequency, the broader the
range of frequencies it can mask
If two sounds are widely separated in frequency, little masking occurs.
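The asymmetry in the bullets above (lower frequencies mask upward far more effectively than higher frequencies mask downward) can be sketched with the classic Schroeder spreading function. This is an illustrative model, not the exact curves used in the MPEG psychoacoustic models.

```python
import math

def spread_db(dz):
    """Schroeder spreading function: attenuation (dB) of masking at a
    distance dz (in Bark) from the masker; dz > 0 means the masked
    frequency lies above the masker."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)

# Masking spreads more readily upward in frequency than downward:
print(round(spread_db(0.0), 2))   # at the masker itself: ~0 dB attenuation
print(round(spread_db(2.0), 1))   # 2 Bark above the masker
print(round(spread_db(-2.0), 1))  # 2 Bark below the masker: much weaker
```

The upward slope is roughly -10 dB/Bark while the downward slope is about -25 dB/Bark, matching the observation that a low tone masks a higher one more effectively than vice versa.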

FREQUENCY MASKING CURVES

Effect on threshold for 1 kHz masking tone

FREQUENCY MASKING CURVES

Effect of masking tone at three different frequencies

CRITICAL BANDS
Because of frequency masking, we can divide human hearing range
into critical bands.
Human auditory system cannot resolve sounds better than within
about one critical band when other sounds are present
Critical bandwidth corresponds to the smallest frequency difference
between two partials such that each can still be heard separately
Critical bandwidth:
Nearly constant (just under 100 Hz) for f < 500 Hz
For f ≥ 500 Hz, increases roughly linearly with frequency

The audible frequency range can be partitioned into about 24
critical bands (25 are typically used for coding applications)

CRITICAL BANDS

BARK UNITS
The range of frequencies affected by masking is broader for higher
frequencies
It is useful to define a new frequency unit
In terms of this new unit, each of the masking curves has about the
same width
The new unit defined is called the Bark, named after Heinrich
Barkhausen (1881-1956)
One Bark unit corresponds to the width of one critical band, for any
masking frequency

BARK UNITS
The conversion between a frequency f and its corresponding critical

band number b, expressed in Bark units

Another formula:
where f is in kHz, b is in Barks
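The two conversion formulas can be checked against each other in a few lines (the function names here are mine):

```python
import math

def bark_zwicker(f_khz):
    """Critical-band number b (Bark) from frequency f in kHz."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def bark_traunmuller(f_khz):
    """Alternative Bark conversion, f in kHz."""
    return 26.81 * f_khz / (1.96 + f_khz) - 0.53

# The two formulas agree to within a fraction of a Bark across the
# audible range, and 20 kHz maps to roughly band 24.
for f in (0.1, 1.0, 4.0, 20.0):
    print(f, round(bark_zwicker(f), 2), round(bark_traunmuller(f), 2))
```

Note that 1 kHz lands near 8.5 Bark under both formulas, consistent with about 24-25 bands covering the full hearing range.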

BARK UNITS

TEMPORAL MASKING

Any loud tone causes the hearing receptors in the inner ear to
become saturated, and they require time to recover

It may take up to 500 ms for us to discern a quiet test tone after a
60 dB masking tone has been played

TEMPORAL MASKING

The effect of temporal and frequency masking depends on both
time and closeness in frequency.

TEMPORAL MASKING

The phenomenon of saturation also depends upon how long the
masking tone is applied

Solid curve: masking tone played for 200 ms
Dashed curve: masking tone played for 100 ms

TEMPORAL MASKING
A signal can mask other signals that occur just before or just after
it sounds (pre-masking and post-masking)

Time to Frequency
Transformation
Filter banks:
A parallel bank of bandpass filters covering the entire spectrum
Used to break input signal into frequency components- subbands

The subband samples are normalized by a scaling factor such that
the maximum sample amplitude in each block is unity
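A minimal sketch of that normalization step (real encoders quantize the scale factor to entries of a standardized table; this sketch just keeps the raw maximum):

```python
def normalize_block(samples):
    """Normalize one subband block by its scale factor (the maximum
    absolute amplitude), so the largest normalized sample is unity."""
    scale_factor = max(abs(s) for s in samples)
    if scale_factor == 0:
        return 0.0, list(samples)  # silent block: nothing to scale
    return scale_factor, [s / scale_factor for s in samples]

block = [0.2, -0.8, 0.5, -0.1]
sf, normalized = normalize_block(block)
print(sf, normalized)
```

The scale factor travels in the bitstream alongside the quantized samples so the decoder can undo the scaling.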

BIT-ALLOCATION

The bit-allocation module decides the quantizer (how many bits are
used for quantization) for each subband.
The bit-allocation method is not part of the standard, and it can
therefore be done in many possible ways
The goal is to ensure that all the quantization noise is below the
masking threshold
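Since the standard leaves the allocation method open, one common textbook approach is a greedy loop that repeatedly gives one more bit to the subband whose quantization noise is furthest above its masking threshold. The names and the ~6 dB-per-bit rule of thumb below are illustrative, not mandated by any MPEG layer.

```python
def allocate_bits(smr_db, total_bits, max_bits=15):
    """Greedy bit allocation sketch: each extra bit lowers quantization
    noise by roughly 6 dB, so keep giving one bit to the subband with
    the highest remaining noise-to-mask ratio until the budget runs out."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        eligible = [k for k in range(len(bits)) if bits[k] < max_bits]
        if not eligible:
            break
        worst = max(eligible, key=lambda k: smr_db[k] - 6.02 * bits[k])
        bits[worst] += 1
    return bits

# Subbands with a high signal-to-mask ratio (SMR) attract the most
# bits; the subband already at its masking threshold loses out here.
print(allocate_bits([30.0, 12.0, 0.0], total_bits=5))
```

With this small budget the allocation comes out [4, 1, 0]: the loud, poorly masked subband absorbs most of the bits.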

BIT-ALLOCATION

Bitstream Formatting

SBS: Sub-band samples

Header contains:
synchronization code (twelve 1s: 111111111111)
sampling rate used
bit-rate
stereo information

Ancillary data: e.g. multilingual data and surround-sound data
SBS format: quantized scaling factors and code-words

MPEG audio compression takes advantage of these considerations:
it uses few bits to transmit frequency components that are masked
by frequency masking, temporal masking, or both.

MPEG AUDIO ALGORITHM


Employs a bank of filters that first analyzes the frequency
components of the audio signal
Divides the input into 32 frequency sub-bands
Determines the scaling factor for each sub-band
Passes the scaling factors along with the sub-band samples
(SBS) to the bit-allocation block

MPEG AUDIO ALGORITHM


Decides whether each frequency sub-band is tone-like or
noise-like
Based on that decision and the scaling factors, calculates the
masking threshold for each band and compares it with the
threshold of hearing

MPEG AUDIO ALGORITHM


Determines the number of code bits used to quantize each subband
so as to minimize the audibility of quantization noise
Bits are allocated where they are most needed to lower the
quantization noise below an audible level.
The number of bits allocated is then used to quantize the
information from the filter bank

MPEG AUDIO ALGORITHM


Balances the masking behavior and the available number of
bits by discarding inaudible frequencies and scaling
quantization according to the sound level left over, above the
masking levels

MPEG AUDIO ALGORITHM


Formats the bitstream into suitable blocks

MPEG-x AUDIO STANDARDS


MPEG-1
MPEG-1 Layer 1
MPEG-1 Layer 2
MPEG-1 Layer 3
MPEG-2
MPEG-2 Advanced Audio Coding (AAC)
MPEG-4
MPEG-4 Advanced Audio Coding (AAC)
MPEG-4 High Efficiency Advanced Audio Coding (HE-AAC)
MPEG-7
MPEG-21

MPEG-1
Three downward-compatible layers of audio compression
Each successive layer applies a more complex psychoacoustic model
and offers correspondingly better compression for a given level of
audio quality
Layer 1 quality can be quite good, provided a comparatively high
bitrate is available
Layer 2 has more complexity and was proposed for use in digital
audio broadcasting
Layer 3 is most complex and was originally aimed at audio
transmission over ISDN lines
Each of the layers uses a different frequency transform

MPEG-1 Layers
In the Layer 1 encoder, the sets of 32 subband samples are first
assembled into a set of 12 groups of 32 each (384 samples per frame)

MPEG-1 Layers
A Layer 2 or Layer 3 frame actually accumulates more than 12
samples for each subband: a frame includes 1,152 samples
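The frame sizes follow directly from the grouping: 32 subbands times the number of 12-sample groups per frame.

```python
SUBBANDS = 32
GROUP = 12  # subband samples per group

layer1_frame = SUBBANDS * GROUP        # Layer 1: one group per subband
layer2_3_frame = SUBBANDS * GROUP * 3  # Layers 2 and 3: three groups
print(layer1_frame, layer2_3_frame)
```

This gives 384 samples for a Layer 1 frame and 1,152 for Layers 2 and 3.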

MPEG-1 Layer 1 & Layer 2


Main difference:
Three groups of 12 samples are encoded in each frame and
temporal masking is brought into play, as well as frequency
masking
Bit allocation is applied to window lengths of 36 samples instead
of 12
The resolution of the quantizers is increased from 15 bits to 16
Advantage:
A single scaling factor can be used for all three groups
Reduces the bitrate at the expense of higher complexity and delay

MPEG-1 Layer 1 & Layer 2

MPEG-1 Audio Layers 1 and 2

MPEG-1 Layer 3
Main difference:
Employs a similar filter bank to that used in Layer 2, except using
a set of filters with non-equal frequency widths
Takes into account stereo redundancy (Mid/Side Coding)
Uses the Modified Discrete Cosine Transform (MDCT)
Sophisticated bit allocation and quantization strategies rely on
non-uniform quantization.
Uses Huffman coding (lossless coding).

MPEG-1 Layer 3

MPEG-1 Layer 3

MPEG-Audio Layer 3 Coding

MPEG-2 Advanced Audio Coding


MPEG-2 Advanced Audio Coding
Temporal Noise Shaping:
Shapes the distribution of quantization noise in time by prediction in
the frequency domain
Voice signals in particular experience considerable improvement
through TNS
Prediction(in time domain):
A technique commonly established in the area of speech coding
systems
It benefits from the fact that stationary audio signals are predictable
to a certain extent, and amplitudes at given frequencies do not
change significantly from block to block.
However, it requires higher computational complexity

MPEG-2 Advanced Audio Coding
Mid-Side Coding(MS):
For dual-channel audio(stereo signal).
Transform the left(L) and right(R) channels into a mid(M) channel
and a side(S) channel.
M=L+R
S=L-R
Then gives more bits to the mid channel than to the side channel
(as the side channel is usually less complex)
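A sketch of the transform (the slides give M = L + R, S = L - R; scaling by 1/2, as below, is a common variant that keeps the transform invertible with no amplitude growth):

```python
def ms_encode(left, right):
    """Mid carries the correlated part of the stereo pair; side carries
    the (usually much smaller) difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Exact inverse of ms_encode: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

L = [0.5, 0.4, -0.2]
R = [0.5, 0.3, -0.25]        # nearly identical channels
mid, side = ms_encode(L, R)  # side values stay small, so they need few bits
L2, R2 = ms_decode(mid, side)
```

For typical stereo material the channels are highly correlated, so the side channel carries little energy and can be coded with far fewer bits.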


MPEG-2 Advanced Audio Coding
Intensity Coding:
For multi-channel audio
Human hearing is less acute at perceiving the direction of certain
audio frequencies.
Certain subbands from the right and left channels are merged into
one channel to reduce the bit rate. Some directional information
may be conveyed as a scale factor.


MPEG-4 ADVANCED AUDIO CODING

Built around the coder kernel provided by MPEG-2 Advanced
Audio Coding (AAC)
With some additional coding tools and coder configurations
The perceptual noise substitution (PNS) tool and the Long-Term
Prediction (LTP) tool are available to enhance coding
performance for noise-like and very tonal signals, respectively
A special coder kernel (Twin VQ) is provided to cover extremely low
bitrates
A flexible bitrate scalable coding system is defined including a variety
of possible coder configurations

MPEG-4 ADVANCED AUDIO CODING

LONG TERM PREDICTION

Long-term prediction (LTP) is an efficient tool, newly introduced in
MPEG-4, for reducing the redundancy of a signal between
successive coding frames
Predicts the incoming input signal based on preceding signals.
LTP tool provides optimal coding gain for stationary harmonic signals
as well as some gain for non-harmonic tonal signals
Compared with the rather complex MPEG-2 AAC predictor tool, the
LTP tool shows a saving of roughly one-half in both computational
complexity and memory requirements

LONG TERM PREDICTION

PERCEPTUAL NOISE SUBSTITUTION (PNS)

A feature newly introduced in MPEG-4
Aims at further optimizing the bit-rate efficiency of AAC at
lower bit-rates
The technique of PNS is based on the observation that one noise
sounds much like another
The additional decoder complexity associated with the PNS coding
tool is very low in terms of both computational and memory
requirements

PERCEPTUAL NOISE SUBSTITUTION

MPEG-4 SCALABLE CODING


Enables transmission and decoding of a bit-stream with a bit-rate
that can be adapted to dynamically varying requirements
Offers significant advantages when transmitting content over
channels with a variable channel capacity, or over connections
whose available capacity is unknown at the time of encoding

MPEG-4 SCALABLE CODING


Base layer: transmits the most relevant components of the signal at
a basic level of quality
Enhancement layers: enhance the coding precision delivered by the
preceding layers

MPEG-4 HIGH EFFICIENCY ADVANCED AUDIO CODING

Spectral Band Replication:
A technology to enhance audio codecs, especially at
low bit rates.
The human brain tends to analyse higher frequencies with
less accuracy
Efficiently reconstructs the high-frequency content of an
audio signal from the low-frequency data
Only the lower and mid frequencies, plus some guidance
information for reconstruction of the high-frequency
spectrum, need to be transmitted

WHAT IS MPEG-7 ?
"Multimedia Content Description Interface"
Provides meta-data for multimedia.
MPEG-7 makes content accessible, retrievable,
filterable, and manageable (via device / computer).
Supports multiple degrees of interpretation of the
information's meaning
Supports as broad a range of applications as possible.
A standard that is compatible with existing technology
and extensible.

MPEG-7 OBJECTIVES
Standardize content-based description for various
types of audiovisual information
Independent from media support (encoding and
storage)

Different granularity

Low-level features: shape, size, key, tempo changes
High-level semantic info: a scene with a barking brown dog
on the left and the sound of passing cars in the
background.

Meaningful in the context of the application
The same material can yield different types of features and
combinations

MPEG-7 AUDIO

MPEG-7 Audio provides structures, building upon some
basic structures from the MDS (Multimedia Description
Schemes), for describing audio content.
Low-level Descriptors:
audio features that cut across many applications
High-level Description Tools:
more specific to a set of applications.

LOW-LEVEL FEATURES
MPEG-7 Audio Framework:
Two low-level descriptor types: (for sample and
segment)
Scalar : (e.g. power or fundamental frequency)
Vector : (e.g. spectra)

Hierarchical, consistent interface

Any descriptor inheriting from these types can be
instantiated, describing a segment with a single summary
value or a series of sampled values, as the application
requires.
Scalable Series (hierarchical re-sampling):
Progressively down-samples the data contained in a series
LOW-LEVEL FEATURE TYPES

Basic: instantaneous waveform and power values.
Basic spectral: log-frequency power spectrum and
spectral features (e.g. spectral centroid, spectral spread,
spectral flatness).
Signal parameters: fundamental frequency and
harmonicity of signals.
Temporal Timbral: log attack time and temporal centroid
Spectral Timbral: specialized spectral features in a linear
frequency space
Spectral basis representations: a number of features
used in conjunction for sound recognition, projecting into
a low-dimensional space.

HIGH-LEVEL AUDIO DESCRIPTION TOOLS (Ds and DSs)

Exchange some generality for descriptive richness:
a smaller set of audio features (as compared to visual
features) that may canonically represent a sound without
domain-specific knowledge.
Audio Signature DS
Musical Instrument Timbre
Melody
General Sound Recognition and Indexing
Spoken Content

MPEG-21 (ISO/IEC 21000)
What?

A multimedia framework for multimedia delivery and consumption
Content creators and content consumers as focal points

Why?

Many elements (standards) exist for the delivery and consumption
of multimedia content
Absence of a 'big picture' describing how these elements relate to
each other
Increase interoperability, by filling gaps, to allow existing
components to be used together

Why now?

HW building blocks and infrastructure are in place
Compression, transmission, and description standards are ready

MPEG-21
OBJECTIVES
Vision

To define a multimedia framework that enables transparent use of
multimedia resources across a wide range of networks and
devices used by different communities

Purpose

Enable electronic creation, delivery, and trade of digital multimedia
content

Goals

Provide access to information and services from almost
anywhere, at any time, with ubiquitous terminals and networks
Identify, describe, manage, and protect multimedia content to
support the delivery chain of content creation, production, delivery,
and consumption

FUNDAMENTAL CONCEPTS
A structured digital object with a standard
representation, identification and meta-data
The fundamental unit of distribution and transaction
in the MPEG-21 framework
Digital Item = resource + metadata + structure
Resource: individual asset, e.g., MPEG-2 video
Metadata: descriptive information, e.g., MPEG-7
Structure: relationships among parts of the item

DIGITAL ITEM

(Diagram: a Digital Item ties together resources such as MPEG-1,
MPEG-2, and MPEG-4 streams, metadata such as MPEG-7
descriptions, new metadata and resource forms, and structure,
all within the MPEG-21 framework)

PERFORMANCE MEASURES

Two criteria:
1. Compression ratio
2. Hearing perception

FFmpeg software
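For the compression-ratio criterion, a quick back-of-envelope baseline helps interpret the results (the bitrates here are illustrative assumptions: CD-quality PCM versus a 128 kbit/s encode):

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Ratio of uncompressed size to compressed size; higher is better."""
    return original_bytes / compressed_bytes

# Per second of audio: 44.1 kHz, 16-bit (2 bytes), stereo PCM
pcm_bytes = 44100 * 2 * 2
# Per second of a 128 kbit/s compressed stream
coded_bytes = 128_000 // 8
print(round(compression_ratio(pcm_bytes, coded_bytes), 2))
```

Any of the three encoders below can then be compared by measuring actual output file sizes at the same target bitrate.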

EVALUATION

Three main libraries:


1. fdkaac
2. libmp3lame
3. libtwolame

Three corresponding statements:
1. fdkaac [options] input_file
2. ffmpeg -i input_file -codec:a libmp3lame -b:a [bitrate] output_file
3. ffmpeg -i input_file -codec:a libtwolame -b:a [bitrate] output_file

DEMONSTRATION

CONCLUSION

Based in part on MPEG-2 AAC, in part on conventional speech
coding technology, and in part on new methods, the MPEG-4
General Audio coder provides a rich set of tools and features to
deliver both enhanced coding performance and provisions for
various types of scalability. MPEG-4 GA coding defines the current
state of the art in perceptual audio coding.

THANK YOU FOR LISTENING!
