Psychoacoustics
1. Introduction
This paper introduces digital audio signal compression, a technique essential to the
implementation of many digital audio applications. Digital audio signal compression is
the removal of redundant or otherwise irrelevant information from a digital audio signal,
a process that is useful for conserving both transmission bandwidth and storage space.
We begin by defining some useful terminology. We then present a typical "encoder" (as
compression algorithms are often called) and explain how it functions. Finally, we
consider some standards that employ digital audio signal compression and discuss the
future of the field.
www.seminarstopics.com 1
Seminar Report
Psychoacoustics
Psychoacoustics is the study of subjective human perception of sound: effectively, the
study of acoustical perception. Psychoacoustic modeling has long been an integral part of
audio compression. It exploits properties of the human auditory system to remove the
parts of an audio signal that the human ear cannot perceive.
More powerful signals at certain frequencies 'mask' less powerful signals at nearby
frequencies by desensitizing the human ear's basilar membrane (which is responsible for
resolving the frequency components of a signal). The entire MP3 phenomenon is made
possible by the confluence of several distinct but interrelated elements: a few simple
insights into the nature of human psychoacoustics, a whole lot of number crunching, and
conformance to a tightly specified format for encoding and decoding audio into compact
bit streams.
2. Terminology
reconstructed signals only matter if they are detectable by the human ear. As we will
explore shortly, audio compression employs both lossy and lossless techniques.
Figure 1 shows a generic encoder or "compressor" that takes blocks of sampled audio
signal as its input. These blocks typically consist of between 500 and 1500 samples per
channel, depending on the encoder specification. For example, the MPEG-1 layer III
(MP3) specification takes 576 samples per channel per input block. The output is a
compressed representation of the input block (a "frame") that can be transmitted or stored
for subsequent decoding.
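The blocking step described above can be sketched in a few lines of Python. The 576-sample block size follows the MP3 figure quoted above; zero-padding the final partial block is an implementation choice for this sketch, not something taken from any specification:

```python
def split_into_blocks(samples, block_size=576):
    """Split one channel's samples into fixed-size blocks for encoding.

    The final partial block is zero-padded so every block has the same length.
    """
    blocks = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        block += [0.0] * (block_size - len(block))  # pad the last block
        blocks.append(block)
    return blocks

# 1300 samples become two full blocks plus one padded block.
blocks = split_into_blocks([0.1] * 1300, block_size=576)
```

Each block then becomes the input to one pass of the encoder, which emits one compressed frame per block.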
No matter what you do, your ears are always working. They are constantly detecting,
deciphering and analyzing sounds and communicating them to the brain. In a
comparatively tiny area of our body the ear is performing many highly technical and
intricate functions. There are three distinct portions to the ear: the outer ear, containing
the visible fleshy part and the canal that leads to the eardrum; the middle ear, containing
the three smallest bones in the human body, the malleus, incus and stapes (commonly
called the hammer, anvil and stirrup); and the inner ear, made up of a cluster of three
semicircular canals and the snail-shaped cochlea. Let's take a look at them one at a time...
motion by the stirrup in the middle ear. Moving in and out it sets up hydraulic pressure in
the fluid. As these waves travel to and from the apex of the spirals, they cause the walls
separating the canals to undulate. Along one of these walls lies the organ of Corti,
made up of many thousands of sensory hair cells. From here thousands of
nerve fibers carry information about the frequency, intensity and timbre of all these
sounds to the brain, where the sensation of hearing occurs.
Scientists cannot fully explain just how the signals are transmitted to the brain. They do
know that the signals sent by all the hair cells are about the same in duration and
strength. This has led them to believe that it is not the content of the signals but rather the
signals themselves that convey some sort of message to the brain.
Our ears, so often taken for granted, are thus a marvel of intricacy and design that
makes anything man-made look like a cheap imitation. Your hearing can never
be replaced. Don't take it for granted.
5. Psychoacoustics
How do we reduce the size of the input data? The basic idea is to eliminate information
that is inaudible to the ear. This type of compression is often referred to as perceptual
encoding. To help determine what can and cannot be heard, compression algorithms rely
on the field of psychoacoustics, i.e., the study of human sound perception. Waves
vibrating at different frequencies manifest themselves differently, all the way from the
astronomically slow pulsations of the universe itself to the inconceivably fast vibration of
matter (and beyond). Somewhere in between these extremes are wavelengths that are
perceptible to human beings as light and sound. Just beyond the realms of light and
sound are sub- and ultrasonic vibration, the infrared and ultraviolet light spectra, and
zillions of other frequencies imperceptible to humans (such as radio and microwave).
Our sense organs are tuned only to very narrow bandwidths of vibration in the overall
picture. In fact, even our own musical instruments create many vibrational frequencies
that are imperceptible to our ears. Frequencies are typically described in units called
Hertz (Hz), which translates simply as "cycles per second." In general, humans cannot
hear frequencies below 20Hz (20 cycles per second), nor above 20kHz (20,000 cycles
per second), as shown in Figure 2.
While hearing capacities vary from one individual to the next, it is generally true that
humans perceive midrange frequencies more strongly than high and low frequencies [2],
and that sensitivity to higher frequencies diminishes with age and prolonged exposure to
loud volumes. In fact, by the time we're adults, most of us can't hear much of anything
above 16kHz (although women tend to preserve the ability to hear higher frequencies
later into life than men do). The most sensitive range of hearing for most people hovers
between 2kHz and 4kHz, a range probably related by evolution to the normal range of the
human voice, which runs roughly from 500Hz to 2kHz.
Specifically, audio compression algorithms exploit the conditions under which signal
characteristics obscure or mask each other. This phenomenon occurs in three different
ways: threshold cut-off, frequency masking and temporal masking. The remainder of this
section explains the nature of these concepts; subsequent sections explain how they are
typically applied to audio signal compression.
Threshold Cut-off
The human ear detects sounds as a local variation in air pressure measured as the Sound
Pressure Level (SPL). If variations in the SPL are below a certain threshold in amplitude,
the ear cannot detect them. This threshold, shown in Figure 3, is a function of the sound's
frequency. Notice in Figure 3 that because the lowest-frequency component is below the
threshold, it will not be heard.
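The threshold curve sketched in Figure 3 is often approximated in closed form. The sketch below uses Terhardt's well-known approximation of the threshold in quiet (in dB SPL, with frequency expressed in kHz); the exact coefficients vary between published models, so treat the numbers as representative rather than definitive:

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Terhardt's approximation of the absolute hearing threshold (dB SPL)."""
    f = freq_hz / 1000.0  # work in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

t_100 = threshold_in_quiet_db(100)    # low frequency: high threshold
t_3k = threshold_in_quiet_db(3300)    # most sensitive region: lowest threshold
t_15k = threshold_in_quiet_db(15000)  # high frequency: threshold rises sharply
```

An encoder can simply discard any spectral component whose level falls below this curve at its frequency.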
Frequency Masking
Even if a signal component exceeds the hearing threshold, it may still be masked by
louder components that are near it in frequency. This phenomenon is known as frequency
masking or simultaneous masking. Each component in a signal can cast a "shadow" over
neighboring components. If the neighboring components are covered by this shadow,
they will not be heard. The effective result is that one component, the masker, shifts the
hearing threshold. Figure 4 shows a situation in which this occurs.
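The "shadow" cast by a masker can be sketched as a simple triangular curve. The 25 dB drop and the per-Hz slope below are invented for illustration only; real psychoacoustic models work in Bark (critical-band) units with level-dependent slopes:

```python
def masked_threshold_db(freq_hz, masker_freq_hz, masker_level_db):
    """Illustrative triangular masking curve: the shifted threshold is highest
    at the masker's frequency and falls off with distance from it.

    The drop and slope values are hypothetical, chosen only to show the shape.
    """
    distance = abs(freq_hz - masker_freq_hz)
    drop = 25.0   # dB by which the shadow's peak sits below the masker
    slope = 0.05  # dB of masking lost per Hz away from the masker
    return masker_level_db - drop - slope * distance

# A 60 dB masker at 1 kHz: is a 20 dB tone at 1.1 kHz audible?
threshold = masked_threshold_db(1100, 1000, 60.0)
tone_masked = 20.0 < threshold  # below the shifted threshold, so inaudible
```

A component that falls under the shifted threshold, like the 20 dB tone here, can be dropped without audible effect.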
Temporal Masking
Just as tones cast shadows on their neighbors in the frequency domain, a sudden increase
in volume can mask quieter sounds that are temporally close. This phenomenon is known
as temporal masking. Interestingly, sounds that occur both after and before the volume
increase can be masked! Figure 5 illustrates a typical temporal masking scenario: events
below the indicated threshold will not be heard. The idea behind temporal masking is that
humans also have trouble hearing distinct sounds that are close to one another in time.
For example, if a quiet sound follows immediately after a loud sound, you won't be
able to hear the quiet sound. If, however, there is sufficient delay between the two
sounds, you will hear the second, quieter sound. The key to exploiting temporal
masking is determining (quantifying) the length of time between the two tones at
which the second tone becomes audible, i.e., significant enough to keep it in the
bitstream rather than throwing it away. This distance, or threshold, turns out to be around
five milliseconds when working with pure tones, though it varies up and down in
accordance with different audio passages.
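As a minimal sketch, an encoder's temporal-masking test reduces to a window comparison. The 5 ms window below is the pure-tone figure from the text; a real model varies it with signal content:

```python
def temporally_masked(loud_time_ms, quiet_time_ms, window_ms=5.0):
    """Treat a quiet event within ~5 ms of a loud one, before or after it,
    as inaudible (masked). The window width follows the pure-tone figure
    quoted in the text."""
    return abs(quiet_time_ms - loud_time_ms) < window_ms

masked = temporally_masked(100.0, 103.0)       # 3 ms apart: masked
audible = not temporally_masked(100.0, 110.0)  # 10 ms apart: audible
```

Note that the check is symmetric in time, matching the observation that sounds both after and before the loud event can be masked.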
6. Spectral Analysis
Of the three masking phenomena explained above, two are best described in the
frequency domain. Thus, a frequency domain representation, also called the "spectrum"
of a signal, is a useful tool for analyzing the signal's frequency characteristics and
determining thresholds. There are several different techniques for converting a finite time
sequence into its spectral representation, and these typically fall into one of two
categories: transforms and filter banks. Transforms calculate the spectrum of their inputs
in terms of a set of basis sequences; e.g., the Fourier Transform uses basis sequences that
are complex exponentials. Filter banks apply several different band pass filters to the
input. Typically the result is several time sequences, each of which corresponds to a
particular frequency band. Taking the spectrum of a signal has two purposes:
> To derive the masking thresholds in order to determine which portion of the signal
can be dropped.
> To generate a representation of the signal to which the masking threshold can be
applied.
Some compression schemes use different techniques for these two tasks.
The most popular transform in signal processing is the Fast Fourier Transform (FFT).
Given a finite time sequence, the FFT produces a complex-valued frequency domain
representation. Encoders often use FFTs as a first step toward determining masking
thresholds. Another popular transform is the Discrete Cosine Transform (DCT), which
outputs a real-valued frequency domain representation. Both the FFT and the DCT suffer
from distortion when transforms are taken from contiguous blocks of time data. To solve
this problem, inputs and outputs can be overlapped and windowed in such a way that, in
the absence of lossy compression techniques, entire time signals can be perfectly
reconstructed. For this reason, most transform-based encoding schemes employ an
overlapped and windowed DCT known as the Modified Discrete Cosine Transform
(MDCT).
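To make the overlap-and-window idea concrete, the sketch below uses a plain DFT in place of the MDCT: blocks are windowed with a Hann window at 50% overlap, transformed, inverse-transformed, and overlap-added. Away from the edges the shifted Hann windows sum to one, so in the absence of any lossy step the time signal is reconstructed exactly:

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stands in for an FFT)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT; the input blocks are real, so keep the real part."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

N, hop = 8, 4  # block length and 50% overlap
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # periodic Hann
signal = [math.sin(0.3 * n) for n in range(32)]

out = [0.0] * len(signal)
for start in range(0, len(signal) - N + 1, hop):
    block = [signal[start + n] * window[n] for n in range(N)]
    spectrum = dft(block)   # analysis: one spectral frame per block
    recon = idft(spectrum)  # synthesis (nothing discarded in this demo)
    for n in range(N):
        out[start + n] += recon[n]  # overlap-add

# Interior samples, where the shifted windows fully overlap and sum to one,
# are reconstructed exactly (up to floating-point error).
max_err = max(abs(out[n] - signal[n]) for n in range(N, len(signal) - N))
```

The MDCT achieves the same perfect-reconstruction property with half as many output coefficients per block, which is why encoders prefer it.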
Some compression algorithms that use the MDCT are MPEG-1 Layer-III, MPEG-2 AAC,
and Dolby AC-3. Filter banks pass a block of time samples through several band
pass filters to generate different signals corresponding to different sub-bands in
frequency. After filtering, masking thresholds can be applied to each sub-band. Two
popular filter bank structures are the poly-phase filter bank and the wavelet filter bank.
The poly-phase filter bank uses parallel band pass filters of equal width whose outputs
are down-sampled to create one (shorter) signal per sub-band. In the absence of lossy
compression techniques, a decoder can achieve perfect reconstruction by up-sampling,
filtering, and adding each sub-band. This type of structure is used in all of the MPEG-1
audio encoders.
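The filter/down-sample/up-sample/recombine cycle can be illustrated with a two-band Haar-style bank, a drastic simplification of MPEG's 32-band poly-phase bank, but one that shows the perfect-reconstruction property directly:

```python
def analyze(x):
    """Two-band analysis: a running average (low band) and difference
    (high band), each down-sampled by two. A toy stand-in for the
    32-band poly-phase bank used in MPEG-1 audio."""
    low = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(len(x) // 2)]
    high = [(x[2 * n] - x[2 * n + 1]) / 2 for n in range(len(x) // 2)]
    return low, high

def synthesize(low, high):
    """Up-sample and recombine the sub-bands; reconstruction is exact."""
    x = []
    for l, h in zip(low, high):
        x.extend([l + h, l - h])
    return x

signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
low, high = analyze(signal)
restored = synthesize(low, high)
```

In a real encoder the masking thresholds are applied to the sub-band signals between `analyze` and `synthesize`; only then does the reconstruction stop being perfect.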
7.1.1 History
In 1987, the Fraunhofer IIS started to work on perceptual audio coding in the framework
of the EUREKA project EU147, Digital Audio Broadcasting (DAB). In a joint
cooperation with the University of Erlangen (Prof. Dieter Seitzer), the Fraunhofer IIS
finally devised a very powerful algorithm that is standardized as ISO-MPEG Audio
Layer-3 (IS 11172-3 and IS 13818-3).
Without data reduction, digital audio signals typically consist of 16-bit samples recorded
at a sampling rate more than twice the actual audio bandwidth (e.g. 44.1kHz for
Compact Discs). So you end up with more than 1.4 Mbit to represent just one second of
stereo music in CD quality. By using MPEG audio coding, you may shrink the
original sound data from a CD by a factor of 12 without losing sound quality. Basically,
this is realized by perceptual coding techniques addressing the perception of sound waves
by the human ear.
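The 1.4 Mbit figure follows directly from the CD parameters, and dividing by twelve lands near the common 128 kbps MP3 rate:

```python
sample_rate = 44100    # samples per second, per channel
bits_per_sample = 16
channels = 2

# Uncompressed CD audio: bits needed for one second of stereo sound.
cd_bitrate = sample_rate * bits_per_sample * channels

# The factor-of-12 reduction quoted above.
mp3_bitrate = cd_bitrate / 12
```

That is 1,411,200 bits per second uncompressed, and 117,600 bits per second after a twelvefold reduction, close to the 128 kbps rate commonly used for MP3.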
By exploiting stereo effects and by limiting the audio bandwidth, the coding schemes
may achieve an acceptable sound quality at even lower bit rates. MPEG Layer-3 is the
most powerful member of the MPEG audio coding family. For a given sound quality
level, it requires the lowest bit rate or for a given bit rate, it achieves the highest sound
quality.
Perceptual codecs are highly complex beasts, and all of them work a little differently.
However, the general principles of perceptual coding remain the same from one codec to
the next. In brief, the MP3 encoding process can be subdivided into a handful of discrete
tasks (not necessarily in this order):
Break the signal into smaller component pieces called "frames," each typically
lasting a fraction of a second. You can think of frames much as you would the frames in a
movie film.
Analyze the signal to determine its "spectral energy distribution." In other words, on
the entire spectrum of audible frequencies, find out how the bits will need to be
distributed to best account for the audio to be encoded. Because different portions of the
frequency spectrum are most efficiently encoded via slight variants of the same
algorithm, this step breaks the signal into sub-bands, which can be processed
independently for optimal results (but note that all sub-bands use the same algorithm;
they just allocate the number of bits differently, as determined by the encoder).
The encoding bitrate is taken into account, and the maximum number of bits that can
be allocated to each frame is calculated. For instance, if you're encoding at 128 kbps, you
have an upper limit on how much data can be stored in each frame (unless you're
encoding with variable bitrates, but we'll get to that later). This step determines how
much of the available audio data will be stored, and how much will be left on the cutting
room floor.
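The per-frame bit budget can be worked out directly. The figures below assume a 128 kbps stream at 44.1kHz and the MPEG-1 Layer-3 frame size of 1152 samples (two granules of 576); the 144 x bitrate / samplerate byte formula is the standard Layer-3 frame-size calculation, ignoring the optional padding byte:

```python
bitrate = 128_000         # bits per second for the encoded stream
sample_rate = 44100       # samples per second
samples_per_frame = 1152  # one MPEG-1 Layer-3 frame: two granules of 576

# How much audio time one frame covers, and how many bits that buys.
frame_duration = samples_per_frame / sample_rate
bits_per_frame = bitrate * frame_duration

# The standard Layer-3 frame-size formula in bytes (padding byte ignored).
frame_bytes = int(144 * bitrate / sample_rate)
```

At 128 kbps each frame thus has roughly 3,344 bits (about 418 bytes) to spend; everything that does not fit ends up on the cutting room floor.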
The frequency spread for each frame is compared to mathematical models of human
psychoacoustics, which are stored in the codec as a reference table. From this model, it
can be determined which frequencies need to be rendered accurately, since they'll be
perceptible to humans, and which ones can be dropped or allocated fewer bits, since we
wouldn't be able to hear them anyway. Why store data that can't be heard?
The bitstream is run through the process of "Huffman coding," which compresses
redundant information throughout the sample. The Huffman coding does not
work with a psychoacoustic model, but achieves additional compression via more
traditional means. Thus, you can see the entire MP3 encoding process as a two-pass
system: first you run all of the psychoacoustic models, discarding data in the process,
and then you compress what's left to shrink the storage space required by any
redundancies. This second step, the Huffman coding, does not discard any data; it just
lets you store what's left in a smaller amount of space.
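A minimal Huffman coder makes the "lossless second pass" concrete: symbols that occur often get short codewords, so a block dominated by small quantized values shrinks, and nothing is discarded. This sketch builds the code from the data itself, whereas MP3 selects from fixed, pre-defined code tables:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code: frequent symbols receive short codewords."""
    heap = [[count, i, {sym: ""}]
            for i, (sym, count) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    tick = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        lo = heapq.heappop(heap)  # two least-frequent subtrees...
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}  # ...merge them,
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], tick, merged])
        tick += 1
    return heap[0][2]

data = [0, 0, 0, 0, 1, 1, 2, 3]  # small quantized values dominate
codes = huffman_codes(data)
encoded_bits = sum(len(codes[s]) for s in data)  # vs. 16 bits at 2 bits/symbol
```

Here eight symbols that would need 16 bits at a fixed 2 bits each fit in 14 bits, and decoding recovers them exactly.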
The collection of frames is assembled into a serial bitstream, with header information
preceding each data frame. The headers contain instructional "meta-data" specific to that
frame.
Along the way, many other factors enter into the equation, often as the result of options
chosen prior to beginning the encoding. In addition, algorithms for the encoding of an
individual frame often rely on the results of an encoding for the frames that precede or
follow it. The entire process usually includes some degree of simultaneity; the preceding
steps are not necessarily run in order.
* Fraunhofer IIS uses a non-ISO extension of MPEG Layer-3 for enhanced performance
("MPEG 2.5")
7.1.5 MP3 Encoding (Block Diagram)
Filter Bank
The filter bank used in MPEG Layer-3 is a hybrid filter bank consisting of a poly-phase
filter bank and a Modified Discrete Cosine Transform (MDCT). This hybrid form
was chosen for reasons of compatibility with its predecessors, Layer-1 and Layer-2.
Perceptual Model
The perceptual model mainly determines the quality of a given encoder implementation.
It uses either a separate filter bank or combines the calculation of energy values (for the
masking calculations) and the main filter bank. The output of the perceptual model
consists of values for the masking threshold or the allowed noise for each coder partition.
If the quantization noise can be kept below the masking threshold, then the compression
results should be indistinguishable from the original signal.
Joint Stereo
Joint stereo coding takes advantage of the fact that both channels of a stereo channel pair
contain largely the same information. These stereophonic irrelevancies and redundancies are
exploited to reduce the total bit rate. Joint stereo is used in cases where only low bit rates
are available but stereo signals are desired.
Quantization
Quantization is done via a power-law quantizer. In this way, larger values are
automatically coded with less accuracy and some noise shaping is already built into the
quantization process.
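A simplified power-law quantizer in the spirit of Layer-3 looks as follows. The real encoder also subtracts a small constant before rounding and folds in the scale factors; both are omitted here to show only the x^0.75 shaping:

```python
def quantize(x, step):
    """Power-law quantizer: compress the magnitude with an exponent of 0.75,
    then round. Larger inputs land on coarser effective steps."""
    sign = -1 if x < 0 else 1
    return sign * round((abs(x) / step) ** 0.75)

def dequantize(q, step):
    """Invert the power law (exponent 4/3) to recover an approximate value."""
    sign = -1 if q < 0 else 1
    return sign * (abs(q) ** (4 / 3)) * step

step = 1.0
small_err = abs(dequantize(quantize(2.0, step), step) - 2.0)
large_err = abs(dequantize(quantize(200.0, step), step) - 200.0)
```

The round trip shows the intended behavior: the absolute error for the large value exceeds that for the small one, i.e. louder components are coded less precisely, which is exactly where the extra noise is least audible.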
The quantized values are coded by Huffman coding. As a specific method for entropy
coding, Huffman coding is lossless. This is called noiseless coding because no noise is
added to the audio signal.
The process to find the optimum gain and scale factors for a given block, bit-rate and
output from the perceptual model is usually done by two nested iteration loops in an
analysis-by-synthesis way:
> Inner iteration loop (rate loop)
> Outer iteration loop (noise control loop)
The Huffman code tables assign shorter code words to (more frequent) smaller quantized
values. If the number of bits resulting from the coding operation exceeds the number of
bits available to code a given block of data, this can be corrected by adjusting the global
gain to produce larger quantization step sizes until the resulting bit demand for
Huffman coding is small enough.
To shape the quantization noise according to the masking threshold, scale factors are
applied to each scale factor band. The system starts with a default factor of 1.0 for each
band. If the quantization noise in a given band is found to exceed the masking threshold
(allowed noise) as supplied by the perceptual model, the scale factor for this band is
adjusted to reduce the quantization noise. Since achieving a smaller quantization noise
requires a larger number of quantization steps and thus a higher bit rate, the rate
adjustment loop has to be repeated every time new scale factors are used. In other words,
the rate loop is nested within the noise control loop. The outer (noise control) loop is
executed until the actual noise (computed from the difference of the original spectral
values minus the quantized spectral values) is below the masking threshold for every
scale factor band (i.e. critical band).
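The two nested loops can be sketched as a toy analysis-by-synthesis routine. The bit-count function and the step multipliers below are illustrative stand-ins, not the real Huffman tables or Layer-3 gain steps, and the band values are invented:

```python
def bits_needed(q):
    """Toy stand-in for the Huffman bit count of one quantized value."""
    return abs(q).bit_length() + 1

def encode_block(bands, bit_budget, allowed_noise):
    """Nested loops: the inner (rate) loop grows the global step until the
    block fits its bit budget; the outer (noise control) loop amplifies any
    band whose error exceeds its allowed noise, then re-runs the rate loop."""
    scale = [1.0] * len(bands)
    for _ in range(32):  # outer noise-control loop (bounded for safety)
        step = 1.0
        while True:  # inner rate loop
            q = [round(b * s / step) for b, s in zip(bands, scale)]
            if sum(bits_needed(v) for v in q) <= bit_budget:
                break
            step *= 1.26  # enlarge the global quantization step
        noise = [abs(v * step / s - b) for v, s, b in zip(q, scale, bands)]
        bad = [i for i, n in enumerate(noise) if n > allowed_noise[i]]
        if not bad:
            return q, step, scale  # every band meets its masking threshold
        for i in bad:
            scale[i] *= 1.19  # amplify the too-noisy band (a scale factor step)
    return q, step, scale  # give up after a bounded number of passes

bands = [10.0, 3.0, 0.5, 7.0]          # invented spectral band values
q, step, scale = encode_block(bands, bit_budget=8, allowed_noise=[2.0] * 4)
```

With these numbers the first rate-loop pass fits the budget but leaves one band too noisy, so the outer loop raises that band's scale factor and the rate loop runs again, mirroring the rate-loop-inside-noise-loop structure described above.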
The great bulk of the work in the MP3 system as a whole is placed on the encoding
process. Since one typically plays files more frequently than one encodes them, this
makes sense. Decoders do not need to store or work with a model of human
psychoacoustic principles, nor do they require a bit allocation procedure. All the MP3
player has to worry about is examining the bitstream of header and data frames for
spectral components and the side information stored alongside them, and then
reconstructing this information to create an audio signal. The player is nothing but an
(often) fancy interface onto your collection of MP3 files and playlists and your sound
card, encapsulating the relatively straightforward rules of decoding the MP3 bitstream
format.
While there are measurable differences in the efficiency, and audible differences in the
quality, of various MP3 decoders, the differences are largely negligible on computer
hardware manufactured in the last few years. That's not to say that decoders just sit in the
background consuming no resources. In fact, on some machines and some operating
systems you'll notice a slight (or even pronounced) sluggishness in other operations
while your player is running. This is particularly true on operating systems that don't
feature a finely grained threading model, such as MacOS and most versions of Windows.
Linux and, to an even greater extent, BeOS are largely exempt from MP3 skipping
problems, given decent hardware. And of course, if you're listening to MP3 audio
streamed over the Internet, you'll get skipping problems if you don't have enough
bandwidth to handle the bitrate/sampling frequency of the stream.
Some MP3 decoders chew up more CPU time than others, but the differences between
them in terms of efficiency are not as great as the differences between their feature sets,
or between the efficiency of various encoders. Choosing an MP3 player becomes a
question of cost, extensibility, audio quality, and appearance.
Conclusion
By eliminating audio information that the human ear cannot detect, modern audio coding
standards are able to compress a typical 1.4 Mbps signal by a factor of about twelve. This
is done by employing several different methodologies, including noise allocation
techniques based on psychoacoustic models. Future goals for the field of audio
compression are quite broad. Several initiatives are focused on establishing a format for
digital encryption (watermarking) to protect copyrighted audio content. Improvements in
psychoacoustic models are expected to drive bit rates lower. Finally, entirely new
avenues are being explored in an effort to compress audio based on how it is produced
rather than how it is perceived. This last approach was integral in the development of the
MPEG-4 standard.
References
> B. Cavagnolo and J. Bier, "Introduction to Digital Audio Compression," Berkeley
Design Technology, Inc.
> V. K. Madisetti and D. B. Williams, "The Digital Signal Processing Handbook,"
Section IX. CRC Press LLC, 1998.
> D. Massie, J. Strawn, et al., "Digital Audio: Applications, Algorithms and
Implementation." Seminar presented at Embedded Processor Forum by Berkeley Design
Technology, Inc., 12 June 2000.
> D. T. F. Carline, R. Edwards, and P. Coulton, "Psychoacoustic Properties of
Multichannel Audio Signals," Lancaster University.