2. Representation of signals
Source detection, localization, and identification begin with the realization that the
sources are by their nature probabilistic. The source, the coupling of source to medium,
the medium itself and the noise processes are each variable to some degree. Indeed, it is
this very randomness that presents the fundamental barrier to rapid and accurate
estimation of signals. As such, applied probability is essential to the study of
communications and other detection and estimation problems. This chapter is concerned
with the description of signals as random processes, how signals are transformed from
continuous (real world) signals into discrete representations, and the fundamental limits
on the loss of information that result from such transformations. Three broad topics will
be touched upon: basic probability theory, representation of stochastic processes, and
information theory.
2.1 Probability
Discrete Random Variables
A discrete random variable (RV) X takes on values in a finite set X = {x1, x2, …, xm}. The
probability of any instance x being xi, written P(x = xi), is pi. Any probability distribution
must satisfy the following axioms:
1. P(xi) ≥ 0
2. The probability of an event which is certain is 1.
3. If xi and xj are mutually exclusive events, then P(xi or xj) = P(xi) + P(xj).
That is, probabilities are never negative, the maximum probability is 1, and probabilities
are additive if one event excludes the occurrence of another. Throughout the book, a
random variable is denoted by a capital letter, while any given sample of the distribution
is denoted by a lower case letter.
Example 2.1 Binomial distribution
An experiment is conducted in which an outcome of 1 has probability p and the outcome of a zero has
probability 1-p. In Bernoulli trials, each outcome is independent, that is, it does not depend on the results
of the prior trials. What is the probability that exactly k 1s are observed in n trials?
Solution: Consider the number of ways there can be k 1s in a vector of length n, and the probability of
each instance. Each 1 has probability p, and each 0 has probability (1−p), so that each vector with k 1s has
probability p^k (1−p)^(n−k). The number of possible arrangements of k 1s is (n choose k) = n!/(k!(n−k)!), thus
P(k) = (n choose k) p^k (1−p)^(n−k). This is known as the binomial distribution. The mean (average) value is np and the
standard deviation is √(np(1−p)). In performing experiments, error bars of the size of the standard
deviation are sometimes drawn. Thus, in Bernoulli trials where p is small, the uncertainty is about the square
root of the number of observed 1s. For large n, the binomial distribution approximates the Gaussian or
normal distribution in the body, but is generally a poor approximation to the tails (low probability regions)
due to insufficient density of samples. The trials that occur within the standard deviation represent roughly
2/3 of the possible outcomes.
Note that if the trials are not independent, then little credibility can be attached to this uncertainty measure.
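The binomial pmf derived in Example 2.1, and the claim that roughly 2/3 of the outcomes fall within one standard deviation of the mean, can be checked numerically. The following is a sketch in plain Python; the parameter values and seed are arbitrary choices for illustration.

```python
import math
import random

def binomial_pmf(k, n, p):
    """P(k) = C(n, k) * p^k * (1-p)^(n-k), as derived in Example 2.1."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.1
mean = n * p                       # np = 10
std = math.sqrt(n * p * (1 - p))   # sqrt(np(1-p)) = 3

# Probability mass within one standard deviation of the mean;
# the text notes this is roughly 2/3 of the outcomes.
within = sum(binomial_pmf(k, n, p)
             for k in range(math.ceil(mean - std), math.floor(mean + std) + 1))

# Simulated Bernoulli trials should have a sample mean near np.
random.seed(1)
counts = [sum(random.random() < p for _ in range(n)) for _ in range(2000)]
sample_mean = sum(counts) / len(counts)
```

Running this shows `within` close to 0.7 and `sample_mean` close to np = 10, consistent with the discussion above.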
Consider now two RVs X and Y. The joint probability distribution P(X,Y) is the
probability that both x and y occur. In the event that outcomes x do not depend in any
way on outcomes y (and vice versa) then X and Y are said to be statistically independent
and thus P(X,Y)=P(X)P(Y). For example let x be the outcome of a fair coin toss and y the
outcome of a second fair coin toss. The probability of a head or tail on either toss is 1/2,
independently of what will happen in the future or what has happened in the past. Thus
the probability of each of the four possible outcomes of (head, head), (head, tail), (tail,
head), (tail, tail) is 1/4.
The conditional distribution P(X|Y) is the probability distribution for X given that Y=y has
occurred. When X and Y are statistically independent, this is simply P(X). More
generally,
P(X|Y)P(Y) = P(X,Y) = P(Y|X)P(X).   (2.1)
Continuous random variables are described by a probability density function (pdf) f(x). The value
of a density function is not a probability; the pdf must be integrated over some finite
interval (a,b) to obtain the probability that x lies in the interval.
The expected value or mean of a random variable X is denoted by μ = E[X], given by
E[X] = ∫ x f(x) dx.   (2.2)
More generally, the expectation operator involves an integration of the product of the
argument of the expectation and the corresponding pdf. The nth moment is given by
E[Xⁿ] = ∫ xⁿ f(x) dx.   (2.3)
Of particular importance are the mean and variance. The variance is given by
σ² = E[(X − μ)²] = E[X²] − μ².   (2.4)
The square root of the variance is known as the standard deviation.
Two distributions of immense importance are the uniform and Gaussian (normal)
distributions. In the uniform distribution, a random variable X takes on values with equal
probability in the interval (a,b). Thus f(x)=1/(b-a). Notice that the integral of f(x) over
(a,b) will be equal to 1. Many physical phenomena are approximately described by a
Gaussian distribution. The ubiquity of the Gaussian distribution is a consequence of the
Central Limit Theorem:
For n independent random variables Xᵢ, the RV X = (1/n) Σᵢ₌₁ⁿ Xᵢ tends to the Gaussian
distribution for n sufficiently large. That is,
f(x) → (1/√(2πσ²)) e^(−(x−μ)²/2σ²)   (2.5)
The probability that a zero mean, unit variance Gaussian RV exceeds a value x is given by the Q function:
Q(x) = (1/√(2π)) ∫ₓ^∞ e^(−z²/2) dz   (2.6)
This integral is the area under the tail of a zero mean Gaussian with unit standard
deviation as sketched in Figure 2.1, and may only be solved numerically.
Figure 2.1: The Gaussian Q function, the area under the tail of f(x) to the right of x.
A good approximation for x larger than 4 is
Q(x) ≈ (1/(x√(2π))) e^(−x²/2).   (2.7)
The Q function is tabulated in Appendix A.
The tail of a distribution is the region of low probability. For a Gaussian, the tails are bounded by
exponentials. Using the approximation to the Q function, compute the probability that y is between 2 and
2.1 where f(y) is a N(0,0.25) distribution, i.e., a normal distribution with zero mean and a variance of 0.25
(standard deviation=0.5).
Solution: Since the Q function is defined for a unit standard deviation normal distribution, one must
perform a change of variables, scaling by the ratio of the standard deviations (and adjusting for a
nonzero mean when required). Let x = y/σ = 2y. Then P(2 < y < 2.1) = Q(4) − Q(4.2), which
using the approximation is 3.35×10⁻⁵ − 1.44×10⁻⁵ = 1.91×10⁻⁵.
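The change of variables and the tail approximation can be verified numerically. The following sketch uses the standard identity Q(x) = erfc(x/√2)/2 for the exact value; the function names are illustrative.

```python
import math

def Q(x):
    """Exact Gaussian tail probability: Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def Q_approx(x):
    """The approximation of Eq. (2.7), accurate for x > 4."""
    return math.exp(-x**2 / 2) / (x * math.sqrt(2 * math.pi))

# P(2 < y < 2.1) for y ~ N(0, 0.25): scale by sigma = 0.5, so x = 2y.
p_exact = Q(4.0) - Q(4.2)
p_approx = Q_approx(4.0) - Q_approx(4.2)
```

Both values come out near 1.9×10⁻⁵, and the approximation agrees with the exact tail to within a few percent at x = 4.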
If instead the means of X1 and X2 are m1 and m2 respectively, the Rice distribution is
obtained:
f(z) = (z/σ²) e^(−(z²+s²)/2σ²) I₀(zs/σ²),  z > 0   (2.9)
where s² = m₁² + m₂², and I₀ is the modified Bessel function of the first kind, of order zero. As
s goes to zero the Rice distribution approaches the Rayleigh distribution.
Approximations and series solutions for I0(x) are
I₀(x) ≈ 1 for x → 0
I₀(x) ≈ eˣ/√(2πx) for x → ∞   (2.10)
I₀(x) = Σₘ₌₀^∞ (x/2)²ᵐ/(m! m!)
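The series in Eq. (2.10) is straightforward to implement, and the Rice density itself can be checked by simulating the envelope √(X₁² + X₂²) of two Gaussians with nonzero means. This is a sketch; the truncation depth, means, and sample count are arbitrary choices.

```python
import math
import random

def I0(x, terms=40):
    """Series of Eq. (2.10): I0(x) = sum over m of (x/2)^(2m) / (m! m!)."""
    return sum((x / 2)**(2 * m) / math.factorial(m)**2 for m in range(terms))

# The envelope sqrt(X1^2 + X2^2) of independent N(m1, sigma^2) and
# N(m2, sigma^2) variables is Rice distributed with s^2 = m1^2 + m2^2.
random.seed(2)
m1, m2, sigma = 3.0, 4.0, 1.0          # s = 5
envelope = [math.hypot(random.gauss(m1, sigma), random.gauss(m2, sigma))
            for _ in range(20000)]
mean_envelope = sum(envelope) / len(envelope)  # ~ s + sigma^2/(2s) for large s
```

The sample mean of the envelope lands near s = 5, as expected when the line-of-sight component dominates, and the series agrees with the large-x asymptote e^x/√(2πx) to within a couple of percent at x = 10.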
The lognormal distribution results from the application of the central limit theorem to a
multiplicative process, that is, to a product of independent random variables. Let f(x) be
N(0, σ²) where all quantities are expressed in decibels (dB). Then the lognormal
distribution is what results after converting back to absolute units, i.e., Y = 10^(X/10).
Frequency domain
Signals of interest in communication or detection problems are frequently characterized
in multiple domains, according to a representation that makes a particular problem easier
to solve. Each of these domains provides a complete representation of the signal, unless
the representation is deliberately truncated to reduce the implementation complexity of a
system. These domains are related by orthogonal transformations. The classic example is
representation of signals passing through linear
systems in both the time and frequency domains. The Fourier transform X(f) of signal
x(t) is the frequency domain representation, given by
X(f) = ∫ x(t) e^(−j2πft) dt   (2.11)
while the inverse transform recovers the time domain signal:
x(t) = ∫ X(f) e^(j2πft) df.   (2.12)
The functions x(t) and X(f) are said to form a Fourier transform pair, denoted
x(t) ↔ X(f). Some operations are more convenient in the time domain, while others are
more convenient in the frequency domain. In solving these problems, resort is typically
made to a number of tabulated transform pairs, as well as the following properties:
1. Linearity: if x₁(t) ↔ X₁(f) and x₂(t) ↔ X₂(f), then c₁x₁(t) + c₂x₂(t) ↔ c₁X₁(f) + c₂X₂(f)
2. Time scaling: x(at) ↔ (1/|a|) X(f/a)
3. Duality: X(t) ↔ x(−f)
4. Time shifting: x(t − t₀) ↔ X(f) e^(−j2πft₀)
5. Frequency shifting: e^(j2πf_c t) x(t) ↔ X(f − f_c)
6. Differentiation: dⁿx(t)/dtⁿ ↔ (j2πf)ⁿ X(f)
7. Multiplication: x₁(t)x₂(t) ↔ ∫ X₁(λ)X₂(f − λ) dλ
8. Convolution: ∫ x₁(τ)x₂(t − τ) dτ ↔ X₁(f)X₂(f)
The output of a linear time-invariant system with input x(t) and impulse response h(t)
is given by the convolution integral ∫ x(τ)h(t − τ) dτ. Suppose h(t) represents the lowpass filter
A sinc(2Wt) and x(t) is the sinusoid cos(2πf_c t), where the carrier frequency is less than W. Compute y(t).
Solution: This integral is rather difficult to evaluate directly. Instead, one may use the Fourier transform
pairs A sinc(2Wt) ↔ (A/2W) rect(f/2W) and cos(2πf_c t) ↔ (1/2)(δ(f − f_c) + δ(f + f_c)) together with the
convolution property above to obtain Y(f) = (A/4W)(δ(f − f_c) + δ(f + f_c)). Taking the inverse transform,
y(t) = (A/2W) cos(2πf_c t). Avoidance of difficult integrals in dealing with linear systems is a major reason for
use of the frequency domain. Another is that many of these problems are more easily visualized in the
frequency domain.
The frequency domain is convenient for dealing with some aspects of linear systems (e.g., filters)
while the time domain is useful for other aspects (e.g., description of pulses). The problem solver
needs facility in multiple abstract models, and indeed many orthogonal transformations have been
developed for various signal processing purposes.
As an example, the energy of a pulse can be computed in either time or frequency using
the Parseval relation:
∫ |x(t)|² dt = ∫ |X(f)|² df.   (2.13)
Chances are that the integration is much less difficult in one domain compared to the
other.
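Parseval's relation has a discrete counterpart, Σ|x[n]|² = (1/N)Σ|X[k]|², which can be checked directly with a naive DFT. A small sketch; the decaying pulse is an arbitrary test signal.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform, for illustration only."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

# Discrete Parseval: sum |x[n]|^2 = (1/N) sum |X[k]|^2.
x = [math.exp(-0.1 * n) for n in range(64)]
X = dft(x)
energy_time = sum(v * v for v in x)
energy_freq = sum(abs(v)**2 for v in X) / len(x)
```

The two energies agree to floating-point precision, whichever domain is more convenient for a given problem.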
Representation of Stochastic Processes
Thus equipped, the representation of random processes can now be addressed. Let X be a
continuous random variable with probability density function f (x). A random or
stochastic process consists of a sequence of random variables X1, X2, X3, and their
associated probability density functions. These random variables may for example be the
samples in time of a continuous process (e.g. speech), or simply the sample variables of a
discrete-time process (e.g. a bit stream). In either case, the random process may be
denoted by X(t).
Random processes are typically classified as to how their statistical properties change
with time, and the correlation of sample values over time, as well as according to which
probability distribution(s) the samples are drawn from. Consider for example a Gaussian
random process, in which f(x) is N(0, σ²) and does not change with time. This is a
stationary random process. However, this is not a complete description of the random
process, since it neglects any measure of how adjacent samples are correlated in their
values. Now it would in principle be possible for a discrete process to have the complete
set of joint distributions to fully characterize the random process. However, this is
unwieldy and in any case could never be measured to any reasonable accuracy. Instead
this correlation in time is usually captured in the autocorrelation function, which for a
complex random process X(t) of finite power is given by
φ_x(τ) = lim_{T→∞} (1/T) ∫₋T/2^T/2 x*(t) x(t + τ) dt   (2.14)
where τ is the delay between two versions of the signal. Properties of the
autocorrelation function include:
1. φ_x(τ) = φ_x*(−τ) (conjugate symmetry)
2. |φ_x(τ)| ≤ φ_x(0) (maximum magnitude at zero delay)
3. φ_x(τ) ↔ S_x(f), the power spectral density
4. φ_x(0) is the power of the process
5. If for a periodic function x(t + T) = x(t) for all t, then φ_x(τ + T) = φ_x(τ) for all τ
(autocorrelation function also periodic).
The first two properties give an indication of the general shape of an autocorrelation
function. The magnitude is symmetric about a peak, and the function generally decays as
one moves away from the peak. This corresponds to the fact that the samples of the
random process are generally less correlated with one another the further apart they are in
time. Property 3 actually defines the power spectral density (psd). The integral of Sx(f )
over some interval (f1,f2) is the power of the process in this interval. As for linear
systems and deterministic signals, the Fourier relationships between autocorrelation
functions and the psd enable convenient calculation of the outputs of filters when the
input is a random process. If X(t) is the input process, Y(t) the output process, and h(t)
the impulse response of the time-invariant linear filter, then S_Y(f) = |H(f)|² S_X(f).
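The relation S_Y(f) = |H(f)|² S_X(f) implies, in particular, that unit-variance white noise passed through a filter has output power Σ h[n]² (by Parseval), and that the output samples become correlated. A simulation sketch; the filter taps, seed, and sample count are arbitrary.

```python
import random

random.seed(3)
h = [0.25, 0.25, 0.25, 0.25]   # 4-tap moving-average filter
x = [random.gauss(0, 1) for _ in range(50000)]   # white Gaussian noise
y = [sum(h[k] * x[n - k] for k in range(len(h)))
     for n in range(len(h), len(x))]

power_out = sum(v * v for v in y) / len(y)
power_theory = sum(c * c for c in h)             # sum h^2 = 0.25

# The white input is uncorrelated sample to sample; the output is not:
# the lag-1 correlation should be sum h[n] h[n+1] = 0.1875.
lag1 = sum(y[n] * y[n + 1] for n in range(len(y) - 1)) / (len(y) - 1)
```

The estimated output power lands near 0.25 and the lag-1 correlation near 0.1875, illustrating how the filter colors the noise.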
Example 2.6 Colored noise
White Gaussian noise has psd SX(f )=No/2, extending over all frequencies. What is its autocorrelation
function?
[Solution figure: the resulting autocorrelation, with peak value T²N₀/2, extending from −T to T.]
The above discussion assumed the random process to be time invariant. More generally,
for a real random process the autocorrelation function is given by
φ_x(t₁, t₂) = E[X(t₁)X(t₂)].   (2.15)
For a random process to be termed strict-sense stationary (SSS), it is required that none of
its statistics are affected by a time shift of the origin. This is impractical to measure, and
usually one settles for showing that a process is wide sense stationary (WSS). Two
requirements must be met:
1. E[X(t)] = μ_X = constant
2. φ_x(t₁, t₂) = φ_x(t₁ − t₂) = φ_x(τ)
WSS Gaussian random processes are also SSS, but this is not generally true of other
random processes. Note that the second property also implies that the second moment is
stationary, since the autocorrelation function at τ = 0 is the power of the process. In
practice, many signals have slowly drifting statistics, and so the assumption is often made
that they are quasi-static for purposes of calculations over some short time interval.
Gaussian random processes have other special properties. The convolution of a Gaussian
process with a fixed linear filter produces another Gaussian process. If the input process
is zero mean, so is the output process. However, convolution does change the
autocorrelation function of the output process. To take a simple example, suppose the
input process is white Gaussian noise. That is, the time samples follow a Gaussian
distribution, and power spectral density is flat (equal) over all frequencies. This results in
the autocorrelation function being a delta function. In consequence, the output random
process will continue to be Gaussian, but the output process psd will be colored by the
frequency response of the filter, as discussed in Example 2.6.
Sampling theorem
While the physical world is reasonably modeled as being a (rather complicated) set of
continuous real- or complex-valued random processes, digital computers in the end can
only deal with binary strings. Thus understanding is required of the process of
transformation of analog signals into digital form and back if computations are to be
coupled to the physical world. There are two aspects to these transformations:
discretization in time by sampling, and quantizing the samples to a finite number of
levels.
The Shannon sampling theorem states that any random process strictly limited to
a range W Hz in frequency can be completely represented by 2W samples per second.
That is, it is possible to reconstruct the original process with complete fidelity from this
set of samples. Viewed another way, this means that a random process strictly
bandlimited to W Hz represents a resource of 2W independent dimensions per second.
This viewpoint makes the connection of the sampling theorem to earlier work by Nyquist,
who proved that the maximum rate at which impulses could be communicated over a
channel bandlimited to W Hz without interference among them in the receiver is 2W
pulses per second (known subsequently as the Nyquist rate). Thus in sampling, at least
2W samples per second are needed to be able to reproduce the process, while in
communicating at most the 2W independent dimensions per second are available.
Achieving reproduction or communication at these limits demands extremely long filters,
and so in practice the signals are sampled more quickly or communicated more slowly.
The sampling theorem is illustrated in Figure 2.3, where on the left is illustrated the
picture in the time domain, and on the right the picture in the frequency domain. The
sampling function s(t) is a train of equally spaced delta functions with spacing T, known
as a Dirac comb. The Fourier transform of a Dirac comb is another Dirac comb S(f ) with
spacing 1/T in the frequency domain. Time multiplication corresponds to convolution in
the frequency domain, resulting in multiple copies of the spectral image of the
bandlimited process. If T is small enough, the spectral copies will not overlap and the
original process can be reproduced using a low-pass filter. Otherwise, there is no means
for reproduction. Clearly 1/T > 2W.
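Aliasing is easy to demonstrate numerically: sampled at 10 Hz, a 7 Hz tone is indistinguishable from a 3 Hz tone, because their sample values coincide exactly. A sketch (the frequencies and sample count are arbitrary illustrative choices):

```python
import math

fs = 10.0                      # sample rate in Hz
t = [n / fs for n in range(50)]

# cos(2*pi*7*t) sampled below the Nyquist rate folds onto |7 - 10| = 3 Hz:
# cos(2*pi*7*n/10) = cos(2*pi*n - 2*pi*3*n/10) = cos(2*pi*3*n/10).
tone_hi = [math.cos(2 * math.pi * 7.0 * tt) for tt in t]
tone_lo = [math.cos(2 * math.pi * 3.0 * tt) for tt in t]
max_difference = max(abs(a - b) for a, b in zip(tone_hi, tone_lo))
```

The maximum difference between the two sampled tones is zero to floating-point precision; once sampled, no processing can tell them apart, which is why the anti-aliasing filter must precede the sampler.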
Figure 2.3: Sampling in the time domain (left: x(t) and the sampled x(t)s(t)) and in the
frequency domain (right: X(f) and the periodically repeated X(f)*S(f)).
g(t) = Σₙ₌₁^∞ aₙ cos(2πnt/T), where aₙ = (1/T) ∫₋T/2^T/2 g(t) cos(2πnt/T) dt.
Integration yields aₙ = (2/πn) sin(πn/2), which is 0 for even n.
Sampling at rate 5/T implies that frequency components above 2.5/T will be aliased. Thus, since the
waveform has no frequency content between 1/T and 3/T, any cutoff frequency in this region will do. The
filtering will produce a sinusoid at the fundamental frequency. Sampling and reconstruction thus result in
g_r(t) = (2/π) cos(2πt/T).
Quantization
Quantization of the samples into discrete levels introduces distortion that cannot be
reversed. The difference between the input (real) value and the quantized value is known
as quantization noise. Using more finely spaced levels is of course one way to reduce the
quantization noise. Over a very broad class of quantization schemes, each time the
number of levels is doubled the quantization noise is reduced by 6 dB. That is, for each
extra bit of quantization the mean squared error is reduced by a factor of 4.
A quantizer acts by dividing up the space into bins, with every signal lying within the bin
assigned the value of a reproducer level somewhere within the bin. In the case of a one-dimensional
linear quantizer the bins are evenly spaced and the reproducer levels are in
the centers of the bins. Associated with each bin is a bit label. Thus if there are 8 bins
each reproducer level requires a 3-bit label, as illustrated in Figure 2.4 for a uniform
quantizer. In this example, the bins have a size of 1, and the reproducer values are placed
in the bin centers (denoted by arrows), taking on values {−3.5, −2.5, …, 3.5}. Thus if
x = 1.75, the corresponding reproducer level is 1.5 and the bit label is 101. The error or
quantization noise is 0.25. The analog to digital (A/D) conversion process thus consists
of anti-aliasing filtering followed by sampling, and then quantization, with the input
process represented as a sequence of binary labels, one per sample.
Figure 2.4: A uniform quantizer with eight unit-width bins spanning (−4, 4), bit labels 000
through 111 from left to right, and reproducer levels (arrows) at the bin centers; the input
x = 1.75 falls in the bin labeled 101.
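A uniform quantizer of this kind, and the 6 dB-per-bit rule stated above, can be sketched as follows. The bin range, sample count, and seed are arbitrary illustrative choices.

```python
import math
import random

def uniform_quantize(x, n_bits, lo=-1.0, hi=1.0):
    """Map x to the center of one of 2^n_bits equal bins over [lo, hi)."""
    levels = 2**n_bits
    step = (hi - lo) / levels
    idx = min(levels - 1, max(0, int((x - lo) / step)))
    return lo + (idx + 0.5) * step

# The text's example: unit bins over (-4, 4), so x = 1.75 -> reproducer 1.5.
assert uniform_quantize(1.75, 3, -4.0, 4.0) == 1.5

random.seed(4)
samples = [random.uniform(-1, 1) for _ in range(20000)]

def mse(n_bits):
    return sum((s - uniform_quantize(s, n_bits))**2 for s in samples) / len(samples)

# One extra bit should buy ~6 dB (a factor of 4) in mean squared error:
gain_db = 10 * math.log10(mse(4) / mse(5))
```

The measured gain comes out very close to 6 dB, matching the step-size argument: halving the bin width quarters the mean squared error Δ²/12.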
h(t)
Low pass
filter
Binary
Labels
quantizer
Binary
Labels
Sampler
Reproducer
h(t)
Low pass
filter
Y(t)
A variety of means exist to get superior performance to uniform linear quantizers. These
are particularly important when the distortion measure of interest is something other than
mean squared error, as is often the case with visual, auditory or other perceptual signals.
Example 2.8 The μ-law quantizer
The amplitude levels of ordinary speech follow a Gaussian distribution. In the auditory system it is the
relative distortion that is perceptible rather than the absolute level. That is, the fractional distortion from
the waveform is what is important. Consequently larger absolute errors can be made on the loud signals
compared to the quiet signals. In telephone network local offices, signals from the local loop are strictly
bandlimited to 3.5 kHz, sampled at 8 kHz, and then quantized into 256 levels (8 bits) to produce a 64 kb/s
stream. This stream is then what is transmitted by various digital modulation techniques over the long
distance network. At the local office on the other side, the stream is converted back to an analog signal for
transmission over the local loop. To achieve approximately equal fractional distortion at each amplitude,
since the Gaussian distribution has exponential tails one could take the logarithm and then apply a uniform
quantizer. Instead, this is combined into one step with a set of piece-wise linear uniform quantizers with
increased spacing for the larger amplitudes. Thus finely spaced levels are used for low-amplitude signals
and widely spaced ones for large amplitude signals. This allows easily constructed linear quantizers to be
used on signals with a broad dynamic range while limiting the number of bits required. An unintended
consequence is increased difficulty for the design of voice-band modems due to the amplitude-dependent
quantization noise this introduces.
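The companding idea in Example 2.8 can be sketched as follows, using the standard μ-law compressor with μ = 255 as in North American telephony; the function names and the 8-bit uniform quantizer are illustrative choices. The point is that, unlike a plain uniform quantizer, the companded system keeps the fractional distortion small for quiet and loud signals alike.

```python
import math

MU = 255.0

def mu_compress(x):
    """mu-law compressor for x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(y, n_bits=8):
    """Uniform quantizer over [-1, 1) with reproducers at bin centers."""
    levels = 2**n_bits
    idx = min(levels - 1, max(0, int((y + 1) / 2 * levels)))
    return -1 + (idx + 0.5) * (2 / levels)

def codec(x):
    """Compress, quantize uniformly, expand: a piecewise view of mu-law."""
    return mu_expand(quantize(mu_compress(x)))

rel_err_quiet = abs(codec(0.01) - 0.01) / 0.01       # quiet signal
rel_err_loud = abs(codec(0.9) - 0.9) / 0.9           # loud signal
rel_err_uniform = abs(quantize(0.01) - 0.01) / 0.01  # no companding
```

For the quiet signal the companded relative error stays around a percent or two, while the plain uniform quantizer's relative error is an order of magnitude worse at the same bit budget.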
Entropy
The entropy of a discrete random variable X with probabilities pᵢ is defined as
H(X) = −Σᵢ₌₁ᵐ pᵢ log pᵢ   (2.16)
where the logarithm is base 2. Since the probabilities are between 0 and 1, and the
logarithm is negative over that range, the entropy is a nonnegative quantity. In the particular
case that X takes on binary values, the entropy achieves its maximum value of 1 bit when
the outcomes are equally likely, and a minimum of 0 when either outcome is certain.
Consider two successive instances of fair coin tosses. The probabilities for each of the
four possible outcomes are each 1/4, and the entropy is 2 bits as would be expected.
Thus it is a plausible information measure.
The noiseless source coding theorem is to the effect that the entropy represents the
minimum number of bits required to represent a sequence of discrete-valued RVs with
complete fidelity. Thus ASCII or English text, or gray-levels on a fax transmission, all have
a minimum representation, and that is the entropy of the sequence. Indeed, practical
source coding techniques have been developed that get very close to the entropy limit.
Shannon suggested a method that illuminates some very important facts about correlated
sequences.
Asymptotic Equipartition Property
The weak law of large numbers states that for iid RVs X₁, X₂, …, Xₙ,
(1/n) Σᵢ₌₁ⁿ Xᵢ → E[X] as n → ∞.   (2.17)
This is easily applied to prove an important idea in information theory, the asymptotic
equipartition property (AEP), which states that if P(x₁, x₂, …, xₙ) is the probability of
observing the sequence x₁, x₂, …, xₙ, then −(1/n) log P(X₁, X₂, …, Xₙ) is close to H(X) for
large n. According to the AEP, a small set of the sequences, the typical set, will consume
almost all of the observations, the members of the typical set are equally likely, and the
number of typical sequences is approximately 2^nH. The remaining sequences, although
large in number, are collectively only rarely observed.
Example 2.9 Typical binary sequences
Consider a binary source with probabilities of the two outcomes being (1/4,3/4). Compute (a) H(X) (b) the
number of sequences of length 100 (c) the size of the typical set and (d) the number of bits required to
uniquely identify each member of the typical set.
Solution: H(X) = −(1/4) log(1/4) − (3/4) log(3/4) = 0.811. The number of sequences of length 100 is simply 2¹⁰⁰ =
1.27×10³⁰, while the number of typical sequences is 2⁸¹ = 2.42×10²⁴, six orders of magnitude less. 82 bits are
needed to index the members of the typical set (rounding up).
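The numbers in Example 2.9 follow directly from the entropy formula; a quick check in Python:

```python
import math

def entropy(probs):
    """H(X) = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h = entropy([0.25, 0.75])         # 0.811 bits per symbol
n = 100
total = 2**n                      # all length-100 sequences: ~1.27e30
typical = 2**(n * h)              # roughly 2^81 typical sequences
bits_to_index = math.ceil(n * h)  # 82 bits
```

The same `entropy` helper confirms the binary extremes discussed above: 1 bit for a fair coin, 0 bits for a certain outcome.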
A data compression method due to Shannon which exploits the typical set is as follows.
Allocate one bit to labeling whether a sequence is a member of the typical set or not.
Then enumerate the typical set members using their indices as the codewords. Since
there are 2^nH typical set members, the index length will be at most the integer above nH,
plus 1 for the indication of typical set membership. For the atypical set members, at most
n log|X| + 1 bits will be required; simply send them in raw form. The expected number of
bits required to describe a message will however only be nH+1 bits, since the atypical set
has vanishing probability as n gets large. Thus, the number of bits per symbol required
will approach the entropy H, the smallest possible according to the noiseless source
coding theorem. Note that the coding scheme is fully reversible, there being a one to one
correspondence between codewords and the original data.
Thus, effort is spent on the typical set, and it is a matter of little importance how the
vastly more numerous atypical set members are dealt with. Given that the conclusion
depends only upon the weak law of large numbers, this is a design principle with broad
implications. In practice, the source statistics may be time-varying or unknown, or may
be such that more highly structured solutions will converge to the entropy bound with
shorter sequences. Consequently, there exist a wide variety of compression algorithms.
Mutual information
Just as there are joint and conditional probability distributions, so there are joint and
conditional entropy functions. The joint entropy of two variables X and Y is
H(X,Y) = −Σ_{x∈X} Σ_{y∈Y} P(x,y) log P(x,y)   (2.18)
while the conditional entropy is
H(X|Y) = −Σ_{x∈X} Σ_{y∈Y} P(x,y) log P(x|y).   (2.19)
A question that naturally arises is the extent to which knowledge of X assists in knowing
Y. This is measured by the mutual information I(X;Y) = H(X) − H(X|Y) or
equivalently, I(X;Y) = H(Y) − H(Y|X). The mutual information can thus be thought of
as either the reduction in uncertainty of X due to knowledge of Y, or the reduction in the
uncertainty of Y due to knowledge of X. Note that I(X;X) = H(X) − H(X|X) = H(X),
since H(X|X) must be zero. Also note that I(X;Y) = H(X) + H(Y) − H(X,Y). Thus, if X
and Y are statistically independent, I(X;Y) = 0.
It comes as no surprise that the question of ultimate limits in communication systems will
involve the mutual information between the transmitted sequence, and the received
sequence. Less obvious is that the problem of lossy compression is actually the dual of
the communications problem, and relates directly to the problems of detection and source
identification. The mutual information thus plays a central role in these problems as well.
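These identities can be checked on small discrete distributions. A sketch computing I(X;Y) = H(X) + H(Y) − H(X,Y) from a joint pmf; the two example distributions are illustrative.

```python
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint pmf given as a 2-D list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return entropy(px) + entropy(py) - entropy([p for row in joint for p in row])

# Two independent fair coins: knowing Y tells nothing about X.
i_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])   # 0 bits
# Y is a copy of X: I(X;Y) = H(X) = 1 bit.
i_copy = mutual_information([[0.5, 0.0], [0.0, 0.5]])              # 1 bit
```

The two cases bracket the possibilities: mutual information is zero for independent variables and equals the full entropy when one variable determines the other.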
The problem of communicating over a single one-way communications link is depicted in
Figure 2.6. A message sequence is encoded to produce the vector Xⁿ, which is passed
through a channel described by the transition probabilities P(y|x) to produce the received
vector Yⁿ; the decoder forms an estimate of the message from Yⁿ.
[Figure 2.6: message → encoder → Xⁿ → channel P(y|x) → Yⁿ → decoder → message estimate.]
The capacity of the channel is the maximum of the mutual information over input distributions,
C = max_{P(x)} I(X;Y).   (2.21)
The effect of the channel is to produce a number of different possible images of the input
sequences in the output (typically as the result of noise). For each input sequence, there
will be approximately 2^nH(Y|X) output sequences. To avoid errors, it must be assured that
no two X sequences generate the same output sequence. Otherwise there will be
ambiguity in decoding, rather than a clear mapping. The total number of typical Y
sequences is 2^nH(Y), and thus the maximum number of disjoint sets is 2^nH(Y)/2^nH(Y|X) = 2^nI(X;Y).
Omitting the proof that this is indeed both achievable and the best that can be done, it can
be noticed that the notion of typicality again is important. Here the number of typical
output sequences per input is determined by the noise process, and thus it is the noise that
fundamentally limits the throughput.
Capacity can also be defined for channels with discrete inputs but with output vectors
drawn from the reals. In such cases, the differential entropy is used, defined by
h(X) = −∫_S f(x) log f(x) dx   (2.22)
where S is the support set of the random variable. An analogous set of information
theoretic identities can be constructed as for the entropy. Here as well it is found that the
capacity is fundamentally limited by the noise. The most important example is the
Gaussian channel, in which signals of power P and bandwidth W are passed through a
channel which adds white Gaussian noise of power spectral density N₀/2. Then
C = W log₂(1 + P/WN₀) bits/s.   (2.23)
That is, capacity (not reliability) increases linearly with bandwidth (the dimensions
available) and logarithmically with signal to noise ratio (SNR). Viewed another way,
given fixed bandwidth and data rate requirements, there is some minimum SNR that is
required for reliable transmission. Below this SNR, reliable communication is not
possible at the desired transmission rate; the error rate very sharply escalates to unusable
levels. The aim of channel coding is to approach this limit with reasonable
computational delay and complexity. Codes have in fact been developed that very
closely approach capacity.
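Eq. (2.23) is easy to explore numerically. The sketch below (parameter values arbitrary) shows capacity growing when bandwidth is doubled even though the SNR halves, and approaching the wideband limit (P/N₀) log₂e as W → ∞.

```python
import math

def capacity(P, W, N0):
    """Eq. (2.23): C = W log2(1 + P / (W * N0)) bits/s."""
    return W * math.log2(1 + P / (W * N0))

P, N0 = 1.0, 1e-4
c_narrow = capacity(P, 1000.0, N0)      # SNR = 10
c_wide = capacity(P, 2000.0, N0)        # twice the bandwidth, SNR = 5
c_limit = (P / N0) * math.log2(math.e)  # wideband limit as W -> infinity
```

Here `c_wide` exceeds `c_narrow`: extra dimensions more than offset the reduced SNR per dimension, though the total can never pass the wideband limit.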
Example 2.10 Water-filling and multiple Gaussian channels
[Figure: water-filling over three parallel Gaussian channels with noise levels N₁, N₂, N₃;
powers P₁ and P₃ are allocated to channels 1 and 3, while channel 2, whose noise level lies
above the water line, receives no power.]
Consider a source sequence X (drawn from the reals) and the reconstructed sequence Y
(some finite set). Let D be the maximum permitted distortion. The rate distortion theorem states
R(D) = min_{f(y|x): E[(Y−X)²] ≤ D} I(X;Y)   (2.24)
Consider the particular instance of using the mean squared error E[(Y−X)²] as the
distortion measure, and let the source generate a sequence of iid Gaussian RVs with zero
mean and variance σ². Rate distortion problems are generally considered from the point
of view of a test channel, that is, like a communications problem where Y is
communicated over a noisy channel that adds a random variable Z, to produce the source
X. For the Gaussian source, the channel along with the probability distributions for the
variables are illustrated in Figure 2.8.
Figure 2.8: The test channel for the Gaussian source: Y ~ N(0, σ² − D) plus an
independent Z ~ N(0, D) produces X ~ N(0, σ²).
Another way of looking at this is distortion as a function of rate, D(R) = σ² 2^(−2R). Thus,
as noted earlier with respect to the discussion on quantizers, each additional bit reduces
the distortion by a factor of 4, or 6 dB. This is in keeping with the general trend for
quantizers: the distortion in dB is of the form D(n) = k − 6n dB, where n is the number
of bits. Because the optimal rate distortion coder operates on sequences rather than
performing operations symbol by symbol, it achieves a lower value for the constant k.
Vector quantizers can closely approach the rate distortion bound.
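The 6 dB-per-bit rule follows directly from D(R) = σ²2^(−2R); a one-line check:

```python
import math

def distortion(R, sigma2=1.0):
    """Gaussian rate-distortion tradeoff: D(R) = sigma^2 * 2^(-2R)."""
    return sigma2 * 2**(-2 * R)

# Each additional bit per sample reduces distortion by a factor of 4 (~6 dB):
step_db = 10 * math.log10(distortion(3) / distortion(4))
```

Here `step_db` is 10 log₁₀ 4 ≈ 6.02 dB regardless of where on the curve the extra bit is spent, matching the quantizer discussion above.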
In lossy source coding, in practice there are generally a number of steps. First some
heuristic is applied to remove most of the correlation of the source, using knowledge of
the source statistics, and in the case of video or audio signals, knowledge of which
features have the largest perceptual impact. Then rate-distortion coding or lossless
source coding is applied to remove the remaining correlation in the sequence.
In sensor networks, it is generally not feasible for reasons of network scalability or
energy consumption to send all the data collected by sensors to some central location for
processing. Thus sensor nodes must perform some form of lossy compression. In the
extreme, decisions must be made locally, for example to identify a source and thus simply
describe it by a few bits. A research question is the trade between network resource
consumption (energy, bandwidth) and the fidelity of the final decisions fused from the
(processed) observations of a set of nodes. The former may be identified with rate, and the
latter with the distortion, as a network rate distortion problem.
Data processing inequality
Another information theoretic result of considerable importance in design of receivers
(whether for sensing or communications) is the data processing inequality. Consider the
information processing system depicted in Figure 2.9: a cascade X → Y → Z, in which Y
is generated from X through P(y|x) and Z from Y through P(z|y). The data processing
inequality states that I(X;Z) ≤ I(X;Y): no processing of Y, however clever, can increase
the information it carries about X.
Stochastic processes and linear systems are treated at an introductory level but with
sufficient mathematical rigor in
S. Haykin. Communication Systems 4th Ed. Wiley, 2001.
The discussion on information theory was abstracted from the highly readable textbook
T.M. Cover and J.A. Thomas, Elements of Information Theory. Wiley, 1991.
An excellent book covering much of the content of this chapter in more detail (but sadly
out of print) is:
P.M. Woodward. Probability and Information Theory, with Applications to Radar. Pergamon Press, 1953.
2.7 Problems
2.1 The Rayleigh Distribution
Let z be Rayleigh distributed with pdf:
f(z) = (z/σ²) e^(−z²/2σ²),  z > 0.
Find the mean and the variance of z.
[Figure: a binary channel with input and output alphabets {0, 1} and crossover probability p.]
Let X = 1 with probability p, and X = 0 with probability 1 − p.
Plot H(X) as a function of p. Prove the maximum occurs when the likelihoods of the two
outcomes are equal.
Hints: It is suggested to solve this problem using moment generating functions. The
moment generating function of X is defined by
M_X(t) = E[e^(tX)].
Since Y can be expressed as the exponential of X, the mean and the variance can easily be
obtained by substituting specific values for t. The generating function is also a
powerful tool for other distributions.
(a) Consider a Gaussian distributed random variable X ~ N(0, σ²). Express its cdf
F(x) in terms of Q(x).
(b) Another useful function is the error function, defined as
erf(x) = (2/√π) ∫₀ˣ e^(−t²) dt.
Express F(x) in terms of the error function and the complementary error function,
respectively.
(c) Express Q(x) in terms of the complementary error function.
X(ω) = ∫ x(t) e^(−jωt) dt.
Derive the formula of the inverse Fourier transform from X(ω) to x(t), and the relationship
between X(ω) and X_f(f).
There are four persons. They write their names on individual slips of paper and deposit
Compute the Fourier transform of the following signals without resorting to the Fourier
transform pair table:
(a) rect(t/2T₁) = 1 for |t| < T₁, 0 for |t| > T₁
(b) δ(t)
(c) e^(−at) u(t), Re{a} > 0
Compute the Fourier transform of each of the following signals, using the results in
problem 2.7.12 and the properties described in this chapter.
(a) cos(2πf₀t).
(b) The periodic square wave x(t) = 1 for |t| < T₁, 0 for T₁ < |t| < T₀/2, with period T₀
and x(t + T₀) = x(t).
where s² = m₁² + m₂². The value of s depends on the line of sight signal path. If this path
is present and the path gain is high, s is large and it is easier to communicate information.
To recognize this, plot the pdf for s = 0, 0.5, …, 5 and compute the probabilities of z > 5
for each s.
2.16 A/D and D/A conversion
In this problem, A/D and D/A conversion are practiced and two simulations are
compared. Consider the steps in Figure 2.11
[Figure 2.11: x(t) → anti-aliasing filter h₁(t) → sampler → quantizer → binary labels →
reproducer → lowpass filter h₂(t) → y(t).]
Figure 2.12: A triangle wave.
The input is the triangle wave of Figure 2.12, with period T₀ = 1/f₀ = 1 ms. The sample
rate is 20k samples/s. In the first experiment, compute the mean square error. Notice, the
A/D converter has no information about the input phase. In the second experiment, the
anti-aliasing filter h₁(t) is removed. Compute the mean square error in this situation.