
MPEG-x AUDIO STANDARDS

1. ng B Vit 11DT3
2. Ng t 11DT3
3. Bi nh Phc 11DT2

OUTLINE

INTRODUCTION
FUNDAMENTALS
MPEG-x AUDIO STANDARDS
PERFORMANCE MEASURES
EVALUATION
CONCLUSION

INTRODUCTION
MPEG-x Standards: evolving set of standards for video and audio
compression developed by the Moving Picture Experts Group.
MPEG-x Audio:
General Audio (GA) coding
Takes PCM audio streams and encodes them efficiently for
transmission and storage
Synthetic audio
Text-to-Speech, how to generate and play virtual instruments

We focus on General Audio coding

INTRODUCTION

(Block diagram: MPEG audio encoder, with an optional ancillary data input)

FUNDAMENTALS
Psycho-acoustic model:
Hearing characteristics
Threshold of hearing
Frequency masking
Critical bands
Bark units
Temporal masking
Time to Frequency Transformation:
Filter banks
Bit allocation:
Bitstream Formatting
MPEG Audio Algorithm

PSYCHO-ACOUSTIC
How humans perceive sound
In the compression context, its main use is to tell us which parts of
the signal can be removed without audible effect

HEARING CHARACTERISTICS

Range of human hearing: 20 Hz to 20 kHz


Voice: 500 Hz to 4 kHz
Sensitivity of hearing depends upon frequency
Fletcher-Munson equal-loudness curves
Describe the relationship between perceived loudness and stimulus
sound level as a function of frequency

FLETCHER-MUNSON EQUAL-LOUDNESS
CURVES

THRESHOLD OF HEARING

THRESHOLD OF HEARING

An approximate analytical expression for the threshold (in dB):
Threshold(f) = 3.64 (f/1000)^-0.8 - 6.5 e^(-0.6 (f/1000 - 3.3)^2) + 10^-3 (f/1000)^4
The origin is at the frequency of 2 kHz since Threshold(f) = 0 at f = 2 kHz
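The threshold curve can be sketched numerically. The code below uses Terhardt's commonly cited approximation (the formula above); it is an illustrative model, not the exact curve used by any particular MPEG psychoacoustic model.

```python
import math

def threshold_of_hearing_db(f_hz):
    """Approximate threshold of hearing in dB (Terhardt's formula)."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The ear is most sensitive around 2-5 kHz; the threshold rises
# steeply toward both ends of the audible range.
for f in (50, 2000, 15000):
    print(f, round(threshold_of_hearing_db(f), 1))
```

Evaluating at 2 kHz gives a value close to 0 dB, which is why the curve is anchored there.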

FREQUENCY MASKING
A sound can make another sound difficult to hear when the two are
close enough in frequency.
A lower frequency can effectively mask a higher frequency.
A higher frequency does not mask a lower frequency well.
The greater the power in the masking frequency, the broader the
range of frequencies it can mask
If two sounds are widely separated in frequency, little masking occurs.
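The asymmetry in the bullets above (lower frequencies mask upward far more effectively than higher frequencies mask downward) can be sketched with the classic Schroeder spreading function. This is an illustrative model, not the exact curves used in the MPEG psychoacoustic models.

```python
import math

def spread_db(dz):
    """Schroeder spreading function: attenuation (dB) of masking at a
    distance dz (in Bark) from the masker; dz > 0 means the masked
    frequency lies above the masker."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)

# Masking spreads more readily upward in frequency than downward:
print(round(spread_db(0.0), 2))   # at the masker itself: ~0 dB attenuation
print(round(spread_db(2.0), 1))   # 2 Bark above the masker
print(round(spread_db(-2.0), 1))  # 2 Bark below the masker: much weaker
```

The upward slope is roughly -10 dB/Bark while the downward slope is about -25 dB/Bark, matching the observation that a low tone masks a higher one more effectively than vice versa.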

FREQUENCY MASKING CURVES

Effect on threshold for 1 kHz masking tone

FREQUENCY MASKING CURVES

Effect of masking tone at three different frequencies

CRITICAL BANDS
Because of frequency masking, we can divide human hearing range
into critical bands.
Human auditory system cannot resolve sounds better than within
about one critical band when other sounds are present
Critical bandwidth corresponds to the smallest frequency difference
between two partials such that each can still be heard separately
Critical bandwidth:
Nearly constant (just under 100 Hz) for f < 500 Hz
For f ≥ 500 Hz, increases roughly linearly with frequency

The audible frequency range can be partitioned into about 24
critical bands (25 are typically used for coding applications)

CRITICAL BANDS

BARK UNITS
The range of frequencies affected by masking is broader for higher
frequencies
It is useful to define a new frequency unit
In terms of this new unit, each of the masking curves has about the
same width
The new unit defined is called the Bark, named after Heinrich
Barkhausen (1881-1956)
One Bark unit corresponds to the width of one critical band, for any
masking frequency

BARK UNITS
The conversion between a frequency f and its corresponding critical

band number b, expressed in Bark units

Another formula:
where f is in kHz, b is in Barks
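The two conversion formulas can be checked against each other in a few lines (the function names here are mine):

```python
import math

def bark_zwicker(f_khz):
    """Critical-band number b (Bark) from frequency f in kHz."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def bark_traunmuller(f_khz):
    """Alternative Bark conversion, f in kHz."""
    return 26.81 * f_khz / (1.96 + f_khz) - 0.53

# The two formulas agree to within a fraction of a Bark across the
# audible range, and 20 kHz maps to roughly band 24.
for f in (0.1, 1.0, 4.0, 20.0):
    print(f, round(bark_zwicker(f), 2), round(bark_traunmuller(f), 2))
```

Note that 1 kHz lands near 8.5 Bark under both formulas, consistent with about 24-25 bands covering the full hearing range.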

BARK UNITS

TEMPORAL MASKING

Any loud tone causes the hearing receptors in the inner ear to
become saturated, and they require time to recover

It may take up to 500 ms for us to discern a quiet test tone after a
60 dB masking tone has been played

TEMPORAL MASKING

The effect of temporal and frequency masking depends on both
time and closeness in frequency.

TEMPORAL MASKING

The phenomenon of saturation also depends upon how long the
masking tone is applied

Solid curve: masking tone played for 200 ms
Dashed curve: masking tone played for 100 ms

TEMPORAL MASKING
A signal can mask other signals that occur just before or just after
it sounds (pre-masking and post-masking)

Time to Frequency
Transformation
Filter banks:
A parallel bank of bandpass filters covering the entire spectrum
Used to break input signal into frequency components- subbands

The subband samples are normalized by a scaling factor such that
the maximum sample amplitude in each block is unity
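A minimal sketch of that normalization step (real encoders quantize the scale factor to entries of a standardized table; this sketch just keeps the raw maximum):

```python
def normalize_block(samples):
    """Normalize one subband block by its scale factor (the maximum
    absolute amplitude), so the largest normalized sample is unity."""
    scale_factor = max(abs(s) for s in samples)
    if scale_factor == 0:
        return 0.0, list(samples)  # silent block: nothing to scale
    return scale_factor, [s / scale_factor for s in samples]

block = [0.2, -0.8, 0.5, -0.1]
sf, normalized = normalize_block(block)
print(sf, normalized)
```

The scale factor travels in the bitstream alongside the quantized samples so the decoder can undo the scaling.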

BIT-ALLOCATION

The bit-allocation module decides the quantizer (how many bits are
used for quantization) for each subband.
The bit-allocation method is not part of the standard, and it can
therefore be done in many possible ways
The goal is to ensure that all the quantization noise is below the
masking threshold
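Since the standard leaves the allocation method open, one common textbook approach is a greedy loop that repeatedly gives one more bit to the subband whose quantization noise is furthest above its masking threshold. The names and the ~6 dB-per-bit rule of thumb below are illustrative, not mandated by any MPEG layer.

```python
def allocate_bits(smr_db, total_bits, max_bits=15):
    """Greedy bit allocation sketch: each extra bit lowers quantization
    noise by roughly 6 dB, so keep giving one bit to the subband with
    the highest remaining noise-to-mask ratio until the budget runs out."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        eligible = [k for k in range(len(bits)) if bits[k] < max_bits]
        if not eligible:
            break
        worst = max(eligible, key=lambda k: smr_db[k] - 6.02 * bits[k])
        bits[worst] += 1
    return bits

# Subbands with a high signal-to-mask ratio (SMR) attract the most
# bits; the subband already at its masking threshold loses out here.
print(allocate_bits([30.0, 12.0, 0.0], total_bits=5))
```

With this small budget the allocation comes out [4, 1, 0]: the loud, poorly masked subband absorbs most of the bits.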

BIT-ALLOCATION

Bitstream Formatting

SBS: Sub-band samples

Header contains:
synchronization code (twelve 1s: 111111111111)
sampling rate used
bit-rate
stereo information

Ancillary data: e.g. multilingual data and surround-sound data
SBS format: quantized scaling factors and code-words

MPEG audio compression takes advantage of these considerations:
it uses few bits to transmit frequency components that are masked
by frequency masking, temporal masking, or both.

MPEG AUDIO ALGORITHM


Employs a bank of filters that first analyzes the frequency
components of the audio signal
Divides the input into 32 frequency sub-bands
Determines the scaling factor for each sub-band
Passes the scaling factors along with the sub-band samples
(SBS) to the bit-allocation block

MPEG AUDIO ALGORITHM


Decides whether each frequency sub-band is tone-like or
noise-like
Based on that decision and the scaling factors, calculates the
masking threshold for each band and compares it with the
threshold of hearing

MPEG AUDIO ALGORITHM


Determines the number of code bits used to quantize each subband
so as to minimize the audibility of quantization noise
Bits are allocated where they are most needed to lower the
quantization noise below an audible level.
The number of bits allocated is then used to quantize the
information from the filter bank

MPEG AUDIO ALGORITHM


Balances the masking behavior and the available number of
bits by discarding inaudible frequencies and scaling
quantization according to the sound level left over, above the
masking levels

MPEG AUDIO ALGORITHM


Formats the bitstream into suitable blocks

MPEG-x AUDIO STANDARDS


MPEG-1
MPEG-1 Layer 1
MPEG-1 Layer 2
MPEG-1 Layer 3
MPEG-2
MPEG-2 Advanced Audio Coding (AAC)
MPEG-4
MPEG-4 Advanced Audio Coding (AAC)
MPEG-4 High Efficiency Advanced Audio Coding (HE-AAC)
MPEG-7
MPEG-21

MPEG-1
Three downward-compatible layers of audio compression
Each successive layer applies a more complex psychoacoustic model
and offers correspondingly better compression for a given level of
audio quality
Layer 1 quality can be quite good, provided a comparatively high
bitrate is available
Layer 2 has more complexity and was proposed for use in digital
audio broadcasting
Layer 3 is most complex and was originally aimed at audio
transmission over ISDN lines
Each of the layers uses a different frequency transform

MPEG-1 Layers
In the Layer 1 encoder, the sets of 32 subband samples are first
assembled into a set of 12 groups of 32 each (384 samples per frame)

MPEG-1 Layers
A Layer 2 or Layer 3 frame actually accumulates more than 12
samples for each subband: a frame includes 1,152 samples
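The frame sizes follow directly from the grouping: 32 subbands times the number of 12-sample groups per frame.

```python
SUBBANDS = 32
GROUP = 12  # subband samples per group

layer1_frame = SUBBANDS * GROUP        # Layer 1: one group per subband
layer2_3_frame = SUBBANDS * GROUP * 3  # Layers 2 and 3: three groups
print(layer1_frame, layer2_3_frame)
```

This gives 384 samples for a Layer 1 frame and 1,152 for Layers 2 and 3.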

MPEG-1 Layer 1 & Layer 2


Main difference:
Three groups of 12 samples are encoded in each frame and
temporal masking is brought into play, as well as frequency
masking
Bit allocation is applied to window lengths of 36 samples instead
of 12
The resolution of the quantizers is increased from 15 bits to 16
Advantage:
A single scaling factor can be used for all three groups
Reduces the bitrate at the expense of higher complexity and delay

MPEG-1 Layer 1 & Layer 2

MPEG-1 Audio Layers 1 and 2

MPEG-1 Layer 3
Main difference:
Employs a similar filter bank to that used in Layer 2, except using
a set of filters with non-equal frequency widths
Takes into account stereo redundancy (Mid/Side Coding)
Uses the Modified Discrete Cosine Transform (MDCT)
Sophisticated bit allocation and quantization strategies rely on
non-uniform quantization.
Uses Huffman coding (lossless coding).

MPEG-1 Layer 3

MPEG-1 Layer 3

MPEG-Audio Layer 3 Coding

MPEG-2 Advanced Audio Coding


MPEG-2 Advanced Audio Coding
Temporal Noise Shaping:
Shapes the distribution of quantization noise in time by prediction in
the frequency domain
Voice signals in particular experience considerable improvement
through TNS
Prediction(in time domain):
A technique commonly established in the area of speech coding
systems
It benefits from the fact that stationary audio signals are predictable
to a certain extent, and amplitudes at given frequencies do not
change significantly from block to block.
However, it requires higher computational complexity

MPEG-2 Advanced Audio Coding
Mid-Side Coding(MS):
For dual-channel audio(stereo signal).
Transform the left(L) and right(R) channels into a mid(M) channel
and a side(S) channel.
M=L+R
S=L-R
Then gives more bits to the mid channel than to the side channel
(as the side channel is usually less complex)
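A sketch of the transform (the slides give M = L + R, S = L - R; scaling by 1/2, as below, is a common variant that keeps the transform invertible with no amplitude growth):

```python
def ms_encode(left, right):
    """Mid carries the correlated part of the stereo pair; side carries
    the (usually much smaller) difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Exact inverse of ms_encode: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

L = [0.5, 0.4, -0.2]
R = [0.5, 0.3, -0.25]        # nearly identical channels
mid, side = ms_encode(L, R)  # side values stay small, so they need few bits
L2, R2 = ms_decode(mid, side)
```

For typical stereo material the channels are highly correlated, so the side channel carries little energy and can be coded with far fewer bits.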


MPEG-2 Advanced Audio Coding
Intensity Coding:
For multi-channel audio
Human hearing is less acute at perceiving the direction of certain
audio frequencies.
Certain subbands from the right and left channels are merged into
one channel to reduce the bit rate. Some directional information
may be conveyed as a scale factor.


MPEG-4 ADVANCED AUDIO CODING

Built around the coder kernel provided by MPEG-2 Advanced
Audio Coding (AAC)
With some additional coding tools and coder configurations
The perceptual noise substitution (PNS) tool and the Long-Term
Prediction (LTP) tool are available to enhance coding
performance for noise-like and very tonal signals, respectively
A special coder kernel (Twin VQ) is provided to cover extremely low
bitrates
A flexible bitrate scalable coding system is defined including a variety
of possible coder configurations

MPEG-4 ADVANCED AUDIO CODING

LONG TERM PREDICTION

Long-term prediction (LTP) is an efficient tool, newly introduced in
MPEG-4, for reducing the redundancy of a signal between
successive coding frames
Predicts the incoming input signal based on preceding signals.
LTP tool provides optimal coding gain for stationary harmonic signals
as well as some gain for non-harmonic tonal signals
Compared with the rather complex MPEG-2 AAC predictor tool, the
LTP tool shows a saving of roughly one-half in both computational
complexity and memory requirements

LONG TERM PREDICTION

PERCEPTUAL NOISE SUBSTITUTION (PNS)

A feature newly introduced in MPEG-4
Aims at further optimizing the bit-rate efficiency of AAC at
lower bit-rates
The technique of PNS is based on the observation that one noise
sounds much like another
The additional decoder complexity associated with the PNS coding
tool is very low in terms of both computational and memory
requirements

PERCEPTUAL NOISE SUBSTITUTION

MPEG-4 SCALABLE CODING


Enables transmission and decoding of a bit-stream with a bit-rate
that can be adapted to dynamically varying requirements
Offers significant advantages when transmitting content over
channels with a variable channel capacity, or over connections
whose available capacity is unknown at the time of encoding

MPEG-4 SCALABLE CODING


Base layer: transmits the most relevant components of the signal at
a basic level of quality
Enhancement layers: enhance the coding precision delivered by the
preceding layers

MPEG-4 HIGH EFFICIENCY ADVANCED AUDIO CODING

Spectral Band Replication:
A technology to enhance audio codecs, especially at
low bit rates.
The human brain tends to analyse higher frequencies with
less accuracy
Efficiently reconstructs the high-frequency content of an
audio signal from the low-frequency data
Only the lower and mid frequencies, plus some guidance
information for reconstruction of the high-frequency
spectrum, need to be transmitted

WHAT IS MPEG-7 ?
"Multimedia Content Description Interface"
Provides meta-data for multimedia.
MPEG-7 makes content accessible, retrievable,
filterable, and manageable (via device / computer).
Supports multiple degrees of interpretation of the
information's meaning
Supports as broad a range of applications as possible.
A standard that is compatible with existing technology
and extensible.

MPEG-7 OBJECTIVES
Standardize content-based description for various
types of audiovisual information
Independent from media support (encoding and
storage)

Different granularity

Low-level features: shape, size, key, tempo changes
High-level semantic info: a scene with a barking brown dog
on the left and the sound of passing cars in the
background.

Meaningful in the context of the application
The same material can yield different types of features and
combinations

MPEG-7 AUDIO

MPEG-7 Audio provides structures, building upon some
basic structures from the MDS (Multimedia Description
Schemes), for describing audio content.
Low-level Descriptors:
audio features that cut across many applications
High-level Description Tools:
more specific to a set of applications.

LOW-LEVEL FEATURES
MPEG-7 Audio Framework:
Two low-level descriptor types: (for sample and
segment)
Scalar : (e.g. power or fundamental frequency)
Vector : (e.g. spectra)

Hierarchical, consistent interface

Any descriptor inheriting from these types can be
instantiated, describing a segment with a single summary
value or a series of sampled values, as the application
requires.
Scalable Series (hierarchical re-sampling):
Progressively down-samples the data contained in a series
LOW-LEVEL FEATURE TYPES

Basic: instantaneous waveform and power values.
Basic spectral: log-frequency power spectrum and
spectral features (e.g. spectral centroid, spectral spread,
spectral flatness).
Signal parameters: fundamental frequency and
harmonicity of signals.
Temporal Timbral: log attack time and temporal centroid
Spectral Timbral: specialized spectral features in a linear
frequency space
Spectral basis representations: a number of features
used in conjunction for sound recognition, projecting into
a low-dimensional space.

HIGH-LEVEL AUDIO DESCRIPTION TOOLS (Ds and DSs)

Exchange some generality for descriptive richness:
a smaller set of audio features (as compared to visual
features) that may canonically represent a sound without
domain-specific knowledge.
Audio Signature DS
Musical Instrument Timbre
Melody
General Sound Recognition and Indexing
Spoken Content

MPEG-21 (ISO/IEC 21000)
What?

A multimedia framework for multimedia delivery and consumption
Content creators and content consumers as focal points

Why?

Many elements (standards) exist for the delivery and consumption
of multimedia content
Absence of a 'big picture' describing how these elements relate to
each other
Increase interoperability, by filling gaps, to allow existing
components to be used together

Why now?

HW building blocks and infrastructure are in place
Compression, transmission, and description standards are ready

MPEG-21
OBJECTIVES
Vision

To define a multimedia framework that enables transparent use of
multimedia resources across a wide range of networks and
devices used by different communities

Purpose

Enable electronic creation, delivery, and trade of digital multimedia
content

Goals

Provide access to information and services from almost
anywhere, at any time, with ubiquitous terminals and networks
Identify, describe, manage, and protect multimedia content to
support the delivery chain of content creation, production, delivery,
and consumption

FUNDAMENTAL CONCEPTS
A structured digital object with a standard
representation, identification and meta-data
The fundamental unit of distribution and transaction
in the MPEG-21 framework
Digital Item = resource + metadata + structure
Resource: individual asset, e.g., MPEG-2 video
Metadata: descriptive information, e.g., MPEG-7
Structure: relationships among parts of the item

DIGITAL ITEM

(Diagram: a Digital Item ties together resources such as MPEG-1,
MPEG-2, and MPEG-4 streams, metadata such as MPEG-7
descriptions, new metadata and resource forms, and structure,
all within the MPEG-21 framework)

PERFORMANCE MEASURES

Two criteria:
1. Compression ratio
2. Hearing perception

FFmpeg software
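For the compression-ratio criterion, a quick back-of-envelope baseline helps interpret the results (the bitrates here are illustrative assumptions: CD-quality PCM versus a 128 kbit/s encode):

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Ratio of uncompressed size to compressed size; higher is better."""
    return original_bytes / compressed_bytes

# Per second of audio: 44.1 kHz, 16-bit (2 bytes), stereo PCM
pcm_bytes = 44100 * 2 * 2
# Per second of a 128 kbit/s compressed stream
coded_bytes = 128_000 // 8
print(round(compression_ratio(pcm_bytes, coded_bytes), 2))
```

Any of the three encoders below can then be compared by measuring actual output file sizes at the same target bitrate.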

EVALUATION

Three main libraries:


1. fdkaac
2. libmp3lame
3. libtwolame

Three corresponding statements:
1. fdkaac [options] input_file
2. ffmpeg -i input_file -codec:a libmp3lame -b:a [bitrate] output_file
3. ffmpeg -i input_file -codec:a libtwolame -b:a [bitrate] output_file

DEMONSTRATION

CONCLUSION

Based in part on MPEG-2 AAC, in part on conventional speech
coding technology, and in part on new methods, the MPEG-4
General Audio coder provides a rich set of tools and features to
deliver both enhanced coding performance and provisions for
various types of scalability. MPEG-4 GA coding defines the current
state of the art in perceptual audio coding.

THANK YOU FOR LISTENING!
