
Pottie and Kaiser: Principles of Embedded Network Systems Design

2. Representation of signals
Source detection, localization, and identification begin with the realization that the
sources are by their nature probabilistic. The source, the coupling of source to medium,
the medium itself and the noise processes are each variable to some degree. Indeed, it is
this very randomness that presents the fundamental barrier to rapid and accurate
estimation of signals. As such applied probability is essential to the study of
communications and other detection and estimation problems. This chapter is concerned
with the description of signals as random processes, how signals are transformed from
continuous (real world) signals into discrete representations, and the fundamental limits
on the loss of information that result from such transformations. Three broad topics will
be touched upon: basic probability theory, representation of stochastic processes, and
information theory.
2.1 Probability
Discrete Random Variables
A discrete random variable (RV) X takes on values in a finite set X = {x1, x2, …, xm}. The
probability of any instance x being xi, written P(x = xi), is pi. Any probability distribution
must satisfy the following axioms:
1. P(xi) ≥ 0
2. The probability of an event which is certain is 1.
3. If xi and xj are mutually exclusive events, then P(xi + xj) = P(xi) + P(xj).
That is, probabilities are never negative, the maximum probability is 1, and probabilities
are additive if one event excludes the occurrence of another. Throughout the book, a
random variable is denoted by a capital letter, while any given sample of the distribution
is denoted by a lower case letter.
Example 2.1 Binomial distribution
An experiment is conducted in which an outcome of 1 has probability p and the outcome of a zero has
probability 1-p. In Bernoulli trials, each outcome is independent, that is, it does not depend on the results
of the prior trials. What is the probability that exactly k 1s are observed in n trials?
Solution: Consider the number of ways there can be k 1s in a vector of length n, and the probability of
each instance. Each 1 has probability p, and each 0 has probability (1-p), so that each vector with k 1s has
probability $p^k (1-p)^{n-k}$. The number of possible combinations of k 1s is $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$, thus
$P(k) = \binom{n}{k} p^k (1-p)^{n-k}$. This is known as the binomial distribution. The mean (average) value is np and the
standard deviation is $\sqrt{np(1-p)}$. In performing experiments, error bars of the size of the standard
deviation are sometimes drawn. Thus, in Bernoulli trials where p is small, the uncertainty is about the square
root of the number of observed 1s. For large n, the binomial distribution approximates the Gaussian or
normal distribution in the body, but is generally a poor approximation to the tails (low probability regions)
due to insufficient density of samples. The trials that occur within one standard deviation of the mean
represent roughly 2/3 of the possible outcomes.


Note that if the trials are not independent, then little credibility can be attached to this uncertainty measure.
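The behavior described in this example is easy to check numerically. The following sketch (an illustration only, assuming NumPy is available; the parameter values and variable names are our own) simulates repeated runs of n Bernoulli trials and compares the empirical count statistics with the binomial formula, its mean np, and its standard deviation sqrt(np(1-p)).

```python
# Sketch: empirical check of the binomial mean and standard deviation (assumes NumPy).
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, p, trials = 100, 0.1, 200_000

# Each row is one experiment of n Bernoulli trials; count the 1s in each row.
outcomes = rng.random((trials, n)) < p
k = outcomes.sum(axis=1)

print("empirical mean", k.mean(), "theory", n * p)
print("empirical std ", k.std(),  "theory", np.sqrt(n * p * (1 - p)))

# Compare a few empirical frequencies against the binomial formula P(k).
for kk in (5, 10, 15):
    p_theory = comb(n, kk) * p**kk * (1 - p)**(n - kk)
    p_emp = np.mean(k == kk)
    print(kk, round(p_emp, 4), round(p_theory, 4))

# Fraction of experiments falling within one standard deviation of the mean
# (roughly 2/3, as noted above).
sd = np.sqrt(n * p * (1 - p))
print("within one std:", np.mean(np.abs(k - n * p) <= sd))
```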

Consider now two RVs X and Y. The joint probability distribution P(X,Y) is the
probability that both x and y occur. In the event that outcomes x do not depend in any
way on outcomes y (and vice versa) then X and Y are said to be statistically independent
and thus P(X,Y)=P(X)P(Y). For example let x be the outcome of a fair coin toss and y the
outcome of a second fair coin toss. The probability of a head or tail on either toss is 1/2,
independently of what will happen in the future or what has happened in the past. Thus
the probability of each of the four possible outcomes of (head, head), (head, tail), (tail,
head), (tail, tail) is 1/4.
The conditional distribution P(X|Y) is the probability distribution for X given that Y=y has
occurred. When X and Y are statistically independent, this is simply P(X). More
generally,
$$P(X|Y)\,P(Y) = P(X,Y) = P(Y|X)\,P(X).$$   (2.1)

Example 2.2 Campaign contributions and votes in Dystopia.


In the corrupt political system of the town of Dystopia, campaign contributions Y influence votes on
legislation X. Let the two outcomes of X be {yea, nay} and let the sum of campaign contributions by
persons favorable to the legislation, less the sum of contributions by persons unfavorable to the legislation,
be drawn from the set {-100, 100}. These outcomes are equally likely, as are the voting outcomes, and
P(yea|100)=1, P(nay|-100)=1. Determine P(X,Y) and P(Y|X).
Solution: Clearly P(yea|-100)=P(nay|100)=0. Thus P(yea,100)=P(yea|100)P(100)=1/2, and similarly
P(nay,-100)=1/2, P(yea,-100)=P(nay,100)=0. Checking, the sum of all the possibilities is 1. Working from
this, P(Y|X)=P(X,Y)/P(X), i.e., P(100|yea)=0.5/0.5=1=P(-100|nay); P(100|nay)=P(-100|yea)=0.
Notice that while the causal relationship is that campaign contributions lead to a particular voting outcome,
P(Y|X) still has meaning as an inference from the observations. It is often beneficial to use a form of this
relationship known as Bayes Rule to rearrange problems in terms of probability distributions that are
known or easily observed:
P(X|Y)=P(Y|X)P(X)/P(Y).
In particular this relation is invoked in the derivation of the optimal detectors for signals in noise.
Exercise: In the neighboring town of Utopia, campaign contributions have no influence on legislative
outcome. Write P(X,Y), P(Y|X) and P(X|Y).
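Bayes' rule rearrangements like the one in Example 2.2 are mechanical enough to script. A minimal sketch (plain Python, with dictionary keys and variable names of our own choosing) that recovers P(Y|X) and P(X|Y) from the joint distribution of the Dystopia example:

```python
# Sketch: Bayes' rule on the Dystopia example using plain dictionaries.
# Joint distribution P(X, Y) from Example 2.2.
joint = {("yea", 100): 0.5, ("nay", -100): 0.5,
         ("yea", -100): 0.0, ("nay", 100): 0.0}

# Marginals P(X) and P(Y).
P_X = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in ("yea", "nay")}
P_Y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (100, -100)}

# P(Y|X) = P(X,Y)/P(X), and Bayes' rule P(X|Y) = P(Y|X)P(X)/P(Y).
P_Y_given_X = {(y, x): joint[(x, y)] / P_X[x] for (x, y) in joint}
P_X_given_Y = {(x, y): P_Y_given_X[(y, x)] * P_X[x] / P_Y[y] for (x, y) in joint}

print(P_Y_given_X)   # e.g. P(100|yea) = 1.0
print(P_X_given_Y)   # e.g. P(yea|100) = 1.0
```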

Continuous random variables


Discrete distributions arise whenever there is a finite alphabet of outcomes, as for
example in communicating binary strings or addressing a digital visual display. Because
digital signal processors always represent the world as a discretized and quantized
approximation to continuous processes, computers exclusively deal with finite sets.
However, many problems are much more naturally thought about in the context of
random variables X spanning portions of the real numbers, with an associated probability
density function (pdf) fX(x), which we will usually abbreviate as f(x). Note that the value


of a density function is not a probability; the pdf must be integrated over some finite
interval (a,b) to obtain the probability that x lies in the interval.
The expected value or mean of a random variable X is denoted by μ = E[X], given by

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$$   (2.2)

More generally, the expectation operator involves an integration of the product of the
argument of the expectation and the corresponding pdf. The nth moment is given by

$$E[X^n] = \int_{-\infty}^{\infty} x^n f(x)\,dx.$$   (2.3)

Of particular importance are the mean and variance. The variance is given by

$$\mathrm{var}(X) = \sigma^2 = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = E[X^2] - \mu^2.$$   (2.4)

The square root of the variance is known as the standard deviation; for a Gaussian
distribution it coincides with the parameter σ.

Two distributions of immense importance are the uniform and Gaussian (normal)
distributions. In the uniform distribution, a random variable X takes on values with equal
probability in the interval (a,b). Thus f(x) = 1/(b-a). Notice that the integral of f(x) over
(a,b) will be equal to 1. Many physical phenomena are approximately described by a
Gaussian distribution. The ubiquity of the Gaussian distribution is a consequence of the
Central Limit Theorem:

For n independent random variables Xi, the RV $X = \frac{1}{n}\sum_{i=1}^{n} X_i$ tends to the Gaussian
distribution for n sufficiently large. That is,

$$f(x) \approx \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$$   (2.5)

where μ is the mean of the distribution and σ the standard deviation.
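The Central Limit Theorem is easy to visualize numerically. A short sketch (assuming NumPy; the choice of n, the number of repetitions, and the uniform source are our own) averages n uniform random variables and compares the histogram of the result with the Gaussian of equation (2.5):

```python
# Sketch: averages of n uniform RVs approach a Gaussian (Central Limit Theorem).
import numpy as np

rng = np.random.default_rng(1)
n, reps = 30, 100_000

# Uniform(0,1) has mean 1/2 and variance 1/12; the average of n of them
# has mean 1/2 and variance 1/(12 n).
x_bar = rng.random((reps, n)).mean(axis=1)
mu, sigma = 0.5, np.sqrt(1.0 / (12 * n))

# Compare the empirical histogram with the Gaussian density of (2.5).
edges = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 41)
hist, _ = np.histogram(x_bar, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
gauss = np.exp(-(centers - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

for c, h, g in zip(centers[::8], hist[::8], gauss[::8]):
    print(f"x={c:+.3f}  empirical={h:.2f}  gaussian={g:.2f}")
```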


Associated with the Gaussian distribution are two functions which relate to the area under
its tails. The Gaussian Q function is defined as

$$Q(x) = \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz.$$   (2.6)

This integral is the area under the tail of a zero mean Gaussian with unit standard
deviation as sketched in Figure 2.1, and may only be solved numerically.


[Figure 2.1: The Gaussian Q function. The shaded tail area of the zero-mean, unit-variance Gaussian pdf f(x) to the right of x equals Q(x).]
A good approximation for x larger than 4 is

$$Q(x) \approx \frac{1}{x\sqrt{2\pi}}\, e^{-x^2/2}.$$   (2.7)

The Q function is tabulated in Appendix A.
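There is no closed form for (2.6), but it is straightforward to evaluate numerically. A sketch (assuming SciPy's complementary error function is available; the test points are arbitrary) relating Q(x) to erfc and checking the approximation (2.7):

```python
# Sketch: evaluating Q(x) via the complementary error function and checking (2.7).
import numpy as np
from scipy.special import erfc  # assumes SciPy is installed

def q_function(x):
    # Q(x) = (1/2) erfc(x / sqrt(2)) for a zero-mean, unit-variance Gaussian.
    return 0.5 * erfc(x / np.sqrt(2.0))

def q_approx(x):
    # Approximation (2.7), good for x larger than about 4.
    return np.exp(-x**2 / 2.0) / (x * np.sqrt(2.0 * np.pi))

for x in (1.0, 2.0, 4.0, 4.2, 6.0):
    print(f"x={x:.1f}  Q={q_function(x):.3e}  approx={q_approx(x):.3e}")
```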

Example 2.3 Gaussian tails

The tail of a distribution is the region of low probability. For a Gaussian, the tails are bounded by
exponentials. Using the approximation to the Q function, compute the probability that y is between 2 and
2.1 where f(y) is a N(0,0.25) distribution, i.e., a normal distribution with zero mean and a variance of 0.25
(standard deviation=0.5).
Solution: Since the Q function is defined for a unit standard deviation normal distribution, one must
perform a change of variables, scaling by the ratio of the standard deviations (and adjusting for a nonzero
mean as might be required in some cases). Let x = y/σ = 2y. Then P(2 < y < 2.1) = Q(4) - Q(4.2), which
using the approximation is 3.35×10⁻⁵ − 1.44×10⁻⁵ = 1.91×10⁻⁵.

The Gaussian distribution may be used to construct a number of other important
distributions for signal detection and communication. Consider two Gaussian RVs X1
and X2. Suppose they represent the distributions for the real and imaginary parts of a
complex random variable Y (e.g., a signal transmitted over some channel). That is,
Y = X1 + jX2. Let the envelope of Y be denoted by $Z = \sqrt{X_1^2 + X_2^2}$. Then if the two
components have zero mean and variance σ², the result is the Rayleigh distribution:

$$f(z) = \frac{z}{\sigma^2}\, e^{-z^2/2\sigma^2}, \quad z > 0.$$   (2.8)

If instead the means of X1 and X2 are m1 and m2 respectively, the Rice distribution is
obtained:

$$f(z) = \frac{z}{\sigma^2}\, e^{-(z^2+s^2)/2\sigma^2}\, I_0\!\left(\frac{zs}{\sigma^2}\right), \quad z > 0$$   (2.9)

where s² = m1² + m2², and I0 is the modified Bessel function of the first kind, of order zero. As
s goes to zero the Rice distribution approaches the Rayleigh distribution.
Approximations and series solutions for I0(x) are


$$I_0(x) \approx 1 \ \text{ for } x \to 0, \qquad
I_0(x) \approx \frac{e^x}{\sqrt{2\pi x}} \ \text{ for } x \to \infty, \qquad
I_0(x) = \sum_{m=0}^{\infty} \frac{\left(\tfrac{1}{2}x\right)^{2m}}{m!\,m!}.$$   (2.10)
The lognormal distribution results from the application of the central limit theorem to a
multiplicative process, that is, to a product of independent random variables. Let f (x) be
N(0, σ²) where all quantities are expressed in decibels (dB). Then the lognormal
distribution is what results after converting back to absolute units, i.e., Y = 10^{X/10}.

Example 2.4 The Poisson distribution


Suppose the probability of an event happening in a short time interval Δt is λΔt, with the probability of two
or more events occurring in this short interval being very low. This event might be the beginning of a
telephone call or the reception of a photon in a detector. Such events are usually termed arrivals. The
probability of no arrivals in this interval is then 1 - λΔt. Suppose further that the probability of an arrival in
any given interval is independent of the probability of an arrival in any prior interval. Determine the
probability of k arrivals in a longer interval T.
Solution: This is of course related to the binomial distribution. Suppose T = nΔt, and let p = λΔt. Then the
probability of k arrivals is simply

$$\frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} \approx e^{-np}\,\frac{(np)^k}{k!} = e^{-\lambda T}\,\frac{(\lambda T)^k}{k!}$$

where the approximation is valid if p << 1. The RHS is the Poisson distribution.
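The limiting step in this example can be checked directly. A sketch (standard library only; the rate, interval, and number of sub-intervals are arbitrary choices of our own) comparing binomial probabilities with the Poisson limit when p is small:

```python
# Sketch: the Poisson distribution as the small-p limit of the binomial.
from math import comb, exp, factorial

lam, T = 2.0, 1.0            # arrival rate and observation interval
n = 1000                     # number of short sub-intervals
p = lam * T / n              # probability of an arrival per sub-interval

for k in range(6):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = exp(-lam * T) * (lam * T)**k / factorial(k)
    print(k, round(binom, 5), round(poisson, 5))
```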

2.2 Stochastic processes

Frequency domain
Signals of interest in communication or detection problems are frequently characterized
in multiple domains, according to a representation that makes a particular problem easier
to solve. Each of these domains provides a complete representation of the signal, unless
the representation is deliberately truncated to reduce the complexity of implementing a
system. The domains are related by orthogonal transformations. The classic example is
representation of signals passing through linear systems in both the time and frequency
domains. The Fourier transform X(f) of signal x(t) is the frequency domain representation,
given by

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j2\pi f t}\,dt.$$   (2.11)

To go back to the time domain, the inverse transform is employed:

$$x(t) = \int_{-\infty}^{\infty} X(f)\, e^{j2\pi f t}\,df.$$   (2.12)

The functions x(t) and X(f) are said to form a Fourier transform pair, denoted
x(t) ↔ X(f). Some operations are more convenient in the time domain, while others are


more convenient in the frequency domain. In solving these problems, resort is typically
made to a number of tabulated transform pairs, as well as the following properties:
1. Linearity: if x1(t) ↔ X1(f) and x2(t) ↔ X2(f) then
   c1 x1(t) + c2 x2(t) ↔ c1 X1(f) + c2 X2(f)
2. Time scaling: $x(at) \leftrightarrow \frac{1}{|a|} X\!\left(\frac{f}{a}\right)$
3. Duality: X(t) ↔ x(-f)
4. Time shifting: $x(t - t_0) \leftrightarrow X(f)\, e^{-j2\pi f t_0}$
5. Frequency shifting: $e^{j2\pi f_c t}\, x(t) \leftrightarrow X(f - f_c)$
6. Differentiation in time domain: $\frac{d^n}{dt^n} x(t) \leftrightarrow (j2\pi f)^n X(f)$
7. Integration in time domain: $\int_{-\infty}^{t} x(\tau)\,d\tau \leftrightarrow \frac{1}{j2\pi f} X(f)$
8. Conjugate functions: x*(t) ↔ X*(-f)
9. Multiplication in time domain (convolution in frequency domain):
   $x_1(t)\,x_2(t) \leftrightarrow \int_{-\infty}^{\infty} X_1(\lambda)\, X_2(f - \lambda)\,d\lambda$
10. Convolution in time domain (multiplication in frequency domain):
   $\int_{-\infty}^{\infty} x_1(\tau)\, x_2(t - \tau)\,d\tau \leftrightarrow X_1(f)\, X_2(f)$
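Property (10) has a direct discrete-time counterpart that can be checked with the FFT: circular convolution of two finite sequences corresponds to multiplication of their discrete Fourier transforms. A sketch (assuming NumPy; this is the discrete analogue used for illustration, not the continuous-time property itself):

```python
# Sketch: discrete analogue of property (10) -- circular convolution in time
# equals multiplication of DFTs in frequency.
import numpy as np

rng = np.random.default_rng(2)
N = 64
x1 = rng.standard_normal(N)
x2 = rng.standard_normal(N)

# Circular convolution computed directly in the time domain.
direct = np.array([sum(x1[m] * x2[(n - m) % N] for m in range(N)) for n in range(N)])

# The same result via the frequency domain: IFFT of the product of FFTs.
via_fft = np.fft.ifft(np.fft.fft(x1) * np.fft.fft(x2)).real

print("max difference:", np.max(np.abs(direct - via_fft)))  # ~1e-13
```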

Example 2.5 Linear system


A signal x(t) passes through a linear filter with impulse response h(t) to produce the output y(t). Then y(t)
is given by the convolution integral $\int_{-\infty}^{\infty} x(\tau)\,h(t-\tau)\,d\tau$. Suppose h(t) represents the lowpass filter
A sinc(2Wt) and x(t) is the sinusoid cos(2πf_c t), where the carrier frequency is less than W. Compute y(t).
Solution: This integral is rather difficult to directly evaluate. Instead, one may use the Fourier transform
pairs $A\,\mathrm{sinc}(2Wt) \leftrightarrow \frac{A}{2W}\,\mathrm{rect}\!\left(\frac{f}{2W}\right)$ and $\cos(2\pi f_c t) \leftrightarrow \frac{1}{2}\big(\delta(f-f_c) + \delta(f+f_c)\big)$ together with property
(10) above to obtain $Y(f) = \frac{A}{4W}\big(\delta(f-f_c) + \delta(f+f_c)\big)$. Taking the inverse transform,
$y(t) = \frac{A}{2W}\cos(2\pi f_c t)$. Avoidance of difficult integrals in dealing with linear systems is a major reason for
use of the frequency domain. Another is that many of these problems are more easily visualized in the
frequency domain.

A question sometimes asked is: what is the meaning of the frequency domain? It is


perhaps best visualized from the point of view of Fourier series, whereby functions are
composed of a weighted sum of harmonic components. The time and frequency domains
are merely abstract models of the behavior of systems, and since both can potentially be
complete representations they have equal value. The frequency domain is convenient in


dealing with some aspects of linear systems (e.g., filters) while the time domain is useful
for other aspects (e.g., description of pulses). The problem solver needs facility in
multiple abstract models, and indeed many orthogonal transformations have been
developed for various signal processing purposes.
As an example, the energy of a pulse can be computed in either time or frequency using
the Parseval relation:

$$\int_{-\infty}^{\infty} |x(t)|^2\,dt = \int_{-\infty}^{\infty} |X(f)|^2\,df.$$   (2.13)

Often the integration is much less difficult in one domain than in the other.

Representation of Stochastic Processes
Thus equipped, the representation of random processes can now be addressed. Let X be a
continuous random variable with probability density function f (x). A random or
stochastic process consists of a sequence of random variables X1, X2, X3, …, and their
associated probability density functions. These random variables may for example be the
samples in time of a continuous process (e.g. speech), or simply the sample variables of a
discrete-time process (e.g. a bit stream). In either case, the random process may be
denoted by X(t).
Random processes are typically classified as to how their statistical properties change
with time, and the correlation of sample values over time, as well as according to which
probability distribution(s) the samples are drawn from. Consider for example a Gaussian
random process, in which f(x) is N(0, σ²) and does not change with time. This is a
stationary random process. However, this is not a complete description of the random
process, since it neglects any measure of how adjacent samples are correlated in their
values. Now it would in principle be possible for a discrete process to have the complete
set of joint distributions to fully characterize the random process. However, this is
unwieldy and in any case could never be measured to any reasonable accuracy. Instead
this correlation in time is usually captured in the autocorrelation function, which for a
complex random process X(t) of finite power is given by
$$\phi_x(\tau) = \lim_{T\to\infty} \frac{1}{T} \int_{-T/2}^{T/2} x^*(t)\, x(t+\tau)\,dt$$   (2.14)
where τ is the delay between two versions of the signals. Properties of the
autocorrelation function include:
1. $\phi_x(\tau) = \phi_x^*(-\tau)$ (conjugate symmetry)
2. $|\phi_x(\tau)| \le \phi_x(0)$ (maximum for zero delay)
3. $\phi_x(\tau) \leftrightarrow S_X(f)$ (Fourier transform pair with power spectral density)
4. $\phi_x(0) = E[|X(t)|^2]$ (average power)
5. If for a periodic function x(t+T) = x(t) for all t, then $\phi_x(\tau + T) = \phi_x(\tau)$ for all τ
   (autocorrelation function also periodic).


The first two properties give an indication of the general shape of an autocorrelation
function. The magnitude is symmetric about a peak, and the function generally decays as
one moves away from the peak. This corresponds to the fact that the samples of the
random process are generally less correlated with one another the further apart they are in
time. Property 3 actually defines the power spectral density (psd). The integral of Sx(f )
over some interval (f1,f2) is the power of the process in this interval. As for linear
systems and deterministic signals, the Fourier relationships between autocorrelation
functions and the psd enables convenient calculation of the outputs of filters when the
input is a random process. If X(t) is the input process, Y(t) the output process, and h(t)
the impulse response of the time-invariant linear filter, then SY ( f ) =| H( f ) |2 SX ( f ) .
Example 2.6 Colored noise
White Gaussian noise has psd SX(f )=No/2, extending over all frequencies. What is its autocorrelation
function?

Solution: Simply take the inverse Fourier transform, so that φ_x(τ) = (No/2)δ(τ).


Now suppose the white noise process passes through a linear filter with response h(t)=1 over (0,T) and 0
otherwise. What is the autocorrelation function of the output process y(t)?
Solution: S_Y(f) = |H(f)|² No/2. The Fourier transform of a rectangular pulse is a sinc function, and the
inverse transform of the sinc² function is a triangle function. Hence the autocorrelation function is simply
the autocorrelation of h(τ) with itself multiplied by No/2, i.e., φ_y(τ) is the triangle plotted below.

[Figure 2.2: Autocorrelation function of band-limited noise, a triangle φ_y(τ) peaking at T·No/2 for τ = 0 and falling to zero at τ = ±T.]
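Example 2.6 can be reproduced by simulation. A sketch (assuming NumPy; a discrete-time stand-in for the continuous filter, with T taken as an integer number of samples and unit-variance input noise) that passes white Gaussian noise through a length-T boxcar filter and estimates the output autocorrelation, which should be approximately triangular and vanish for lags at or beyond T:

```python
# Sketch: white Gaussian noise through a boxcar filter; the estimated output
# autocorrelation is approximately a triangle that vanishes for |lag| >= T.
import numpy as np

rng = np.random.default_rng(3)
N, T = 200_000, 20                 # samples of noise, filter length (in samples)
noise = rng.standard_normal(N)     # discrete-time white noise, unit variance

h = np.ones(T)                     # h(t) = 1 over (0, T), 0 otherwise
y = np.convolve(noise, h, mode="valid")

def autocorr(x, max_lag):
    # Biased estimate of E[x(t) x(t+lag)].
    return np.array([np.mean(x[:len(x) - lag] * x[lag:]) for lag in range(max_lag)])

phi = autocorr(y, 2 * T)
for lag in (0, 5, 10, 19, 25):
    triangle = max(T - lag, 0)     # theoretical value for unit-variance input noise
    print(lag, round(phi[lag], 2), triangle)
```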

The above discussion assumed the random process to be time invariant. More generally,
for a real random process the autocorrelation function is given by
$$\phi_x(t_1, t_2) = E[X(t_1)\,X(t_2)].$$   (2.15)

For a random process to be termed strict-sense stationary (SSS), it is required that none of
its statistics are affected by a time shift of the origin. This is impractical to measure, and
usually one settles for showing that a process is wide sense stationary (WSS). Two
requirements must be met:
1. E[X(t)] = μ_X = constant
2. $\phi_x(t_1, t_2) = \phi_x(t_1 - t_2) = \phi_x(\tau)$
WSS Gaussian random processes are also SSS, but this is not generally true of other
random processes. Note that the second property also implies the second moment is
stationary, since the autocorrelation function at τ = 0 is the power of the process. In
practice, many signals have slowly drifting statistics, and so the assumption is often made
that they are quasi-static for purposes of calculations meant for some short time interval.


Gaussian random processes have other special properties. The convolution of a Gaussian
process with a fixed linear filter produces another Gaussian process. If the input process
is zero mean, so is the output process. However, convolution does change the
autocorrelation function of the output process. To take a simple example, suppose the
input process is white Gaussian noise. That is, the time samples follow a Gaussian
distribution, and power spectral density is flat (equal) over all frequencies. This results in
the autocorrelation function being a delta function. In consequence, the output random
process will continue to be Gaussian, but the output process psd will be colored by the
frequency response of the filter, as discussed in Example 2.6.
Sampling theorem
While the physical world is reasonably modeled as being a (rather complicated) set of
continuous real- or complex-valued random processes, digital computers in the end can
only deal with binary strings. Thus understanding is required of the process of
transformation of analog signals into digital form and back if computations are to be
coupled to the physical world. There are two aspects to these transformations:
discretization in time by sampling, and quantizing the samples to a finite number of
levels.
The Shannon sampling theorem states that any random process strictly limited to
a range W Hz in frequency can be completely represented by 2W samples per second.
That is, it is possible to reconstruct the original process with complete fidelity from this
set of samples. Viewed another way, this means that a random process strictly
bandlimited to W Hz represents a resource of 2W independent dimensions per second.
This viewpoint makes the connection of the sampling theorem to earlier work by Nyquist,
who proved that the maximum rate at which impulses could be communicated over a
channel bandlimited to W Hz without interference among them in the receiver is 2W
pulses per second (known subsequently as the Nyquist rate). Thus in sampling, at least
2W samples per second are needed to be able to reproduce the process, while in
communicating at most the 2W independent dimensions per second are available.
Achieving reproduction or communication at these limits demands extremely long filters,
and so in practice the signals are sampled more quickly or communicated more slowly.
The sampling theorem is illustrated in Figure 2.3, where on the left is illustrated the
picture in the time domain, and on the right the picture in the frequency domain. The
sampling function s(t) is a train of equally spaced delta functions with spacing T, known
as a Dirac comb. The Fourier transform of a Dirac comb is another Dirac comb S(f ) with
spacing 1/T in the frequency domain. Time multiplication corresponds to convolution in
the frequency domain, resulting in multiple copies of the spectral image of the
bandlimited process. If T is small enough, the spectral copies will not overlap and the
original process can be reproduced using a low-pass filter. Otherwise, there is no means
for reproduction. Clearly 1/T > 2W.


[Figure 2.3: The sampling theorem. Left (time domain): x(t) is multiplied by the sampling function s(t) to give x(t)s(t). Right (frequency domain): X(f) is convolved with the sampling function S(f) to give X(f)*S(f), a set of spectral copies; a low-pass filter recovers X(f).]


Actual filters are not strictly bandlimited, and neither usually is the process being
sampled. Over-sampling results in reproduction filters with lower required order for a
given level of reproduction fidelity, but of course will result in more samples per second
being required. Under-sampling results in aliasing: overlap of the spectral copies. Usual
practice is to cut off the upper bandedges using sharp anti-aliasing filters prior to
sampling. While this eliminates the high frequency content, the result is usually less
damaging than permitting aliasing.
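Aliasing can be seen directly in a short simulation. A sketch (assuming NumPy; the sample rate and tone frequencies are arbitrary illustrative choices): a 30 Hz cosine sampled at 100 Hz lies below the 50 Hz Nyquist frequency and is preserved, while a 70 Hz cosine produces exactly the same sample values, i.e., it aliases to 100 − 70 = 30 Hz.

```python
# Sketch: a 70 Hz cosine sampled at 100 Hz is indistinguishable from a 30 Hz cosine.
import numpy as np

fs = 100.0                       # sample rate (Hz); Nyquist frequency is fs/2 = 50 Hz
n = np.arange(50)                # sample indices (0.5 s of data)
t = n / fs

x_30 = np.cos(2 * np.pi * 30.0 * t)   # below Nyquist: recoverable
x_70 = np.cos(2 * np.pi * 70.0 * t)   # above Nyquist: aliases to 100 - 70 = 30 Hz

print("max difference between sample sets:", np.max(np.abs(x_30 - x_70)))  # ~1e-15

# The DFT shows both sample sets concentrating energy at 30 Hz.
freqs = np.fft.rfftfreq(len(n), d=1 / fs)
print("peak bin, 30 Hz input:", freqs[np.argmax(np.abs(np.fft.rfft(x_30)))])
print("peak bin, 70 Hz input:", freqs[np.argmax(np.abs(np.fft.rfft(x_70)))])
```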
Example 2.7: Sampled square wave
A square wave g(t) symmetric about the origin with amplitude 1 and period T is to be sampled at rate 5/T.
What must be the cutoff frequency of the anti-aliasing filter, and what waveform will be reproduced from
the samples by a low-pass reconstruction filter?
Solution: Since the square wave is periodic, the Fourier series may be used. For an even function, only the
cosine terms need be considered. It has zero mean and thus

$$g(t) = \sum_{n=1}^{\infty} a_n \cos(2\pi n t/T) \quad \text{where} \quad a_n = \frac{1}{T}\int_{-T/2}^{T/2} g(t)\cos(2\pi n t/T)\,dt$$

Integration yields $a_n = \frac{2}{n\pi}\sin\!\left(\frac{n\pi}{2}\right)$, which is 0 for even n.

Sampling at rate 5/T implies that frequency components above 2.5/T will be aliased. Thus since the
waveform has no frequency content between 1/T and 3/T, any cutoff frequency in this region will do. The
filtering will produce a sinusoid at the fundamental frequency. Sampling and reconstruction thus result in
$g_r(t) = \frac{2}{\pi}\cos(2\pi t/T)$.


Quantization
Quantization of the samples into discrete levels introduces distortion that cannot be
reversed. The difference between the input (real) value and the quantized value is known
as quantization noise. Using more finely spaced levels is of course one way to reduce the
quantization noise. Over a very broad class of quantization schemes, each time the
number of levels is doubled the quantization noise is reduced by 6 dB. That is, for each
extra bit of quantization the mean squared error is reduced by a factor of 4.
A quantizer acts by dividing up the space into bins, with every signal lying within the bin
assigned the value of a reproducer level somewhere within the bin. In the case of a
one-dimensional linear quantizer the bins are evenly spaced and the reproducer levels are in
the centers of the bins. Associated with each bin is a bit label. Thus if there are 8 bins
each reproducer level requires a 3-bit label, as illustrated in Figure 2.4 for a uniform
quantizer. In this example, the bins have a size of 1, and the reproducer values are placed
in the bin centers (denoted by arrows), taking on values {-3.5, -2.5, …, 3.5}. Thus if
x=1.75, the corresponding reproducer level is 1.5 and the bit label is 101. The error or
quantization noise is 0.25. The analog to digital (A/D) conversion process thus consists
of anti-aliasing filtering followed by sampling, and then quantization with the input
process represented as a sequence of binary labels, one per sample.
[Figure 2.4: 3-bit uniform quantizer. Bin boundaries at -3, -2, -1, 0, 1, 2, 3 on the x-axis; the eight bins are labeled 000, 001, 010, 011, 100, 101, 110, 111 from left to right, with x = 1.75 marked inside the bin labeled 101.]
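A minimal implementation of the 3-bit uniform quantizer of Figure 2.4 (a sketch assuming NumPy; the clipping behavior at the extremes is our own choice, since the figure does not specify it):

```python
# Sketch: the 3-bit uniform quantizer of Figure 2.4 (bins of width 1 on (-4, 4),
# reproducer levels at bin centers, 3-bit labels 000..111 from left to right).
import numpy as np

def quantize(x):
    # Bin index 0..7; values outside (-4, 4) are clipped to the end bins (our choice).
    index = np.clip(np.floor(x + 4.0), 0, 7).astype(int)
    level = index - 3.5            # reproducer values -3.5, -2.5, ..., 3.5
    label = [format(int(i), "03b") for i in np.atleast_1d(index)]
    return level, label

x = np.array([1.75, -0.2, 3.9])
level, label = quantize(x)
print(level)                        # [ 1.5 -0.5  3.5]
print(label)                        # ['101', '011', '111']
print("quantization noise:", x - level)

# Mean squared error for a uniform input on (-4, 4): bin width 1 gives MSE = 1/12.
u = np.random.default_rng(4).uniform(-4, 4, 100_000)
print("empirical MSE:", np.mean((u - quantize(u)[0]) ** 2))  # ~0.0833
```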


In the digital to analog conversion process, the voltage levels corresponding to these
reproducer levels may be read out of memory (indexed by the labels), an impulse
generated, and then the impulse sequence passed through a low-pass filter to
approximately reproduce the analog signal. The complete A/D and D/A sequence is
illustrated in Figure 2.5.
[Figure 2.5: A/D and D/A conversion steps. X(t) → low-pass filter h(t) → sampler → quantizer → binary labels → reproducer → low-pass filter h(t) → Y(t).]

A variety of means exist to get superior performance to uniform linear quantizers. These
are particularly important when the distortion measure of interest is something other than
mean squared error, as is often the case with visual, auditory or other perceptual signals.
Example 2.8 The μ-law quantizer
The amplitude levels of ordinary speech follow a Gaussian distribution. In the auditory system it is the
relative distortion that is perceptible rather than the absolute level. That is, the fractional distortion from
the waveform is what is important. Consequently larger absolute errors can be made on the loud signals
compared to the quiet signals. In telephone network local offices, signals from the local loop are strictly
bandlimited to 3.5 kHz, sampled at 8 kHz, and then quantized into 256 levels (8 bits) to produce a 64 kb/s
stream. This stream is then what is transmitted by various digital modulation techniques over the long
distance network. At the local office on the other side, the stream is converted back to an analog signal for
transmission over the local loop. To achieve approximately equal fractional distortion at each amplitude,
since the Gaussian distribution has exponential tails one could take the logarithm and then apply a uniform
quantizer. Instead, this is combined into one step with a set of piece-wise linear uniform quantizers with
increased spacing for the larger amplitudes. Thus finely spaced levels are used for low-amplitude signals
and widely spaced ones for large amplitude signals. This allows easily constructed linear quantizers to be
used on signals with a broad dynamic range while limiting the number of bits required. An unintended
consequence is increased difficulty for the design of voice-band modems due to the amplitude-dependent
quantization noise this introduces.
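The companding idea in this example is easy to illustrate. A sketch (assuming NumPy; it uses the smooth μ-law curve with μ = 255 rather than the piecewise-linear segments actually standardized, a simple uniform quantizer, and a Gaussian stand-in for speech amplitudes, all our own illustrative choices):

```python
# Sketch: mu-law companding followed by a uniform quantizer (mu = 255, smooth
# curve rather than the standardized piecewise-linear segments).
import numpy as np

MU = 255.0

def compress(x):
    # x in [-1, 1] -> y in [-1, 1], expanding small amplitudes.
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def uniform_quantize(y, bits=8):
    step = 2.0 / (2 ** bits)
    return (np.floor((y + 1.0) / step) + 0.5) * step - 1.0

rng = np.random.default_rng(5)
x = np.clip(rng.normal(0, 0.2, 100_000), -1, 1)     # stand-in "speech" amplitudes

direct = uniform_quantize(x)                          # uniform quantizer alone
companded = expand(uniform_quantize(compress(x)))     # mu-law companding

# Relative (fractional) error is far more even across amplitudes with companding.
for lo, hi in [(0.01, 0.05), (0.5, 1.0)]:
    band = (np.abs(x) > lo) & (np.abs(x) < hi)
    rel = lambda q: np.mean(np.abs((q[band] - x[band]) / x[band]))
    print(f"|x| in ({lo},{hi}): uniform {rel(direct):.3f}, mu-law {rel(companded):.3f}")
```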

2.3 Introduction to information theory


The sampling theorem provides one example of a fundamental limit in digital signal
processing (with the converse of the Nyquist rate providing a limit in digital
communications). The quantization example raises the question of what might be the
ultimate limits in terms of the least distortion that can be achieved with a given number of
bits in representing a random process. Information theory provides a powerful set of
tools for studying the questions of fundamental limits, producing guideposts as to what is
possible, and sometimes on the means for approaching these limits. This section will
introduce three basic domains of information theory and the associated fundamental
limits: noiseless source coding, rate distortion coding, and channel coding. Remarkably,
the topics themselves and the early theorems were all the work of one man, Claude
Shannon. In the fifty years since the publication of "A Mathematical Theory of
Communication," practical methods have been devised that approach a number of these
limits, and information theory has expanded to broader domains.
Entropy and information
Shannon postulated that a useful information measure must meet a number of
requirements. First, independent pieces of information should add. Second, information
should be measured according to the extent that data is unexpected. Hearing that it will
be sunny in Los Angeles has less information content than the statement in January that it
will be sunny in Ithaca, New York. The most important insight however is that
information content of sequences is more interesting and fundamental than consideration
of individual messages. The particular measure he adopted is termed the entropy, since it


expresses the randomness of a source. Let X be a RV taking values in the set
X = {x1, x2, …, xn} with the probability of element i being P(xi). Then the entropy in bits is
given by

$$H(X) = -\sum_{i=1}^{n} P(x_i)\log P(x_i)$$   (2.16)

where the logarithm is base 2. Since the probabilities are between 0 and 1, and the
logarithm is negative over that range, the entropy is a positive quantity. In the particular
case that X takes on binary values, the entropy achieves its maximum value of 1 bit when
the outcomes are equally likely, and a minimum of 0 when either outcome is certain.
Consider two successive instances of fair coin tosses. The probabilities for each of the
four possible outcomes are each 1/4, and the entropy is 2 bits as would be expected.
Thus it is a plausible information measure.
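A direct implementation of (2.16) (a sketch, assuming NumPy) reproduces the values quoted above for a fair coin and for two fair coin tosses, as well as the value for the (1/4, 3/4) source used in Example 2.9 below:

```python
# Sketch: entropy in bits, equation (2.16).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 log 0 is taken as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))                     # 1 bit: fair coin
print(entropy([0.25, 0.25, 0.25, 0.25]))       # 2 bits: two fair coin tosses
print(entropy([0.25, 0.75]))                   # 0.811 bits: the (1/4, 3/4) source
print(entropy([1.0, 0.0]))                     # 0 bits: a certain outcome
```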
The noiseless source coding theorem is to the effect that the entropy represents the
minimum number of bits required to represent a sequence of discrete-valued RVs with
complete fidelity. Thus ascii or English text, or gray-levels on a fax transmission all have
a minimum representation, and that is the entropy of the sequence. Indeed practical
source coding techniques have been developed that get very close to the entropy limit.
Shannon suggested a method that illuminates some very important facts about correlated
sequences.
Asymptotic Equipartition Property
The weak law of large numbers states that for iid RVs X1, X2, …, Xn,

$$\frac{1}{n}\sum_{i=1}^{n} X_i \to E[X] \ \text{ as } n \to \infty.$$   (2.17)

This is easily applied to prove an important idea in information theory, the asymptotic
equipartition property (AEP), which states that if P(x1, x2, …, xn) is the probability of
observing the sequence x1, x2, …, xn, then -(1/n) log P(X1, X2, …, Xn) is close to H(X) for
large n. According to the AEP, a small set of the sequences, the typical set, will consume
almost all of the observations, the members of the typical set are (approximately) equally likely, and the
number of typical sequences is approximately 2^{nH}. The remaining sequences, although
large in number, are collectively only rarely observed.
Example 2.9 Typical binary sequences
Consider a binary source with probabilities of the two outcomes being (1/4,3/4). Compute (a) H(X) (b) the
number of sequences of length 100 (c) the size of the typical set and (d) the number of bits required to
uniquely identify each member of the typical set.
Solution: H(X) = -(1/4) log(1/4) - (3/4) log(3/4) = 0.811. The number of sequences of length 100 is simply 2^100 =
1.27×10^30, while the number of typical sequences is 2^81 = 2.42×10^24, six orders of magnitude less. 82 bits are
needed to index the members of the typical set (rounding up).

A data compression method due to Shannon which exploits the typical set is as follows.
Allocate one bit to labeling whether a sequence is a member of the typical set or not.
Then enumerate the typical set members using their indices as the codewords. Since
there are 2nH typical set members, the index length will be at most the integer above nH,

plus 1 for the indication of typical set membership. For the atypical set members, at most
nlog|X| +1 bits will be required; simply send them in raw form. The expected number of
bits required to describe a message will however only be nH+1 bits, since the atypical set
has vanishing probability as n gets large. Thus, the number of bits per symbol required
will approach the entropy H, the smallest possible according to the noiseless source
coding theorem. Note that the coding scheme is fully reversible, there being a one to one
correspondence between codewords and the original data.
Thus, effort is spent on the typical set, and it is a matter of little importance how the
vastly more numerous atypical set members are dealt with. Given that the conclusion
depends only upon the weak law of large numbers, this is a design principle with broad
implications. In practice, the source statistics may be time-varying or unknown, or may
be such that more highly structured solutions will converge to the entropy bound with
shorter sequences. Consequently, there exist a wide variety of compression algorithms.
Mutual information
Just as there are joint and conditional probability distributions, so there are joint and
conditional entropy functions. The joint entropy between two variables X and Y is
$$H(X,Y) = -\sum_{x \in X}\sum_{y \in Y} P(x,y)\log P(x,y)$$   (2.18)

while the conditional entropy is defined as

$$H(Y|X) = \sum_{x \in X} P(x)\,H(Y|X=x) = -\sum_{x \in X}\sum_{y \in Y} P(x,y)\log P(y|x).$$   (2.19)

The joint and conditional entropies are related by the chain rule:

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).$$   (2.20)

Notice that it is not generally the case that H(X|Y) is the same as H(Y|X).

A question that naturally arises is the extent to which knowledge of X assists in knowing
Y. This is measured by the mutual information I(X;Y) = H(X) - H(X|Y) or
equivalently, I(X;Y) = H(Y) - H(Y|X). The mutual information can thus be thought of
as either the reduction in uncertainty of X due to knowledge of Y, or the reduction in the
uncertainty of Y due to knowledge of X. Note that I(X;X) = H(X) - H(X|X) = H(X),
since H(X|X) must be zero. Also note that I(X;Y) = H(X) + H(Y) - H(X,Y). Thus, if X
and Y are independent, the mutual information is zero.
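These identities are straightforward to verify numerically from a joint distribution. A sketch (assuming NumPy; the row/column conventions are our own) computing H(X), H(Y), H(X,Y) and I(X;Y) = H(X) + H(Y) − H(X,Y) for the Dystopia example and for an independent pair:

```python
# Sketch: entropies and mutual information from a joint probability matrix.
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), with marginals taken along rows and columns.
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    return H(px) + H(py) - H(joint)

dystopia = np.array([[0.5, 0.0],      # rows: X = yea, nay; columns: Y = 100, -100
                     [0.0, 0.5]])
independent = np.outer([0.5, 0.5], [0.5, 0.5])

print(mutual_information(dystopia))      # 1.0 bit: Y determines X completely
print(mutual_information(independent))   # 0.0 bits: knowledge of Y says nothing about X
```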


Capacity

It comes as no surprise that the question of ultimate limits in communication systems will
involve the mutual information between the transmitted sequence, and the received
sequence. Less obvious is that the problem of lossy compression is actually the dual of
the communications problem, and relates directly to the problems of detection and source
identification. The mutual information thus plays a central role in these problems as well.
The problem of communicating over a single one-way communications link is depicted in
Figure 2.6. A message sequence is encoded to produce the vector X^n, and then a channel


produces output sequence Y^n according to probability transition matrix P(y|x). A decoder


then estimates the original sequence. The aim is to maximize the rate of transmission
while exactly reproducing the original sequence.
[Figure 2.6: Discrete communication problem. Message → Encoder → X^n → Channel P(y|x) → Y^n → Decoder → Message estimate.]


The Shannon channel capacity is then given by

$$C = \max_{P(x)} I(X;Y)$$   (2.21)

where the maximum is over all possible input distributions P(x).

The effect of the channel is to produce a number of different possible images of the input
sequences in the output (typically as the result of noise). For each input sequence, there
will be approximately 2^{nH(Y|X)} output sequences. To avoid errors, it must be assured that
no two X sequences generate the same output sequence. Otherwise there will be
ambiguity in decoding, rather than a clear mapping. The total number of typical Y
sequences is 2^{nH(Y)}, and thus the maximum number of disjoint sets is
2^{n(H(Y)-H(Y|X))} = 2^{nI(X;Y)}. This is the maximum number of disjoint sequences of length n
that can be sent, or in other words, the capacity is I(X;Y) bits per symbol.

Omitting the proof that this is indeed both achievable and the best that can be done, it can
be noticed that the notion of typicality again is important. Here the number of typical
output sequences per input is determined by the noise process, and thus it is the noise that
fundamentally limits the throughput.

Capacity can also be defined for channels with discrete inputs but with output vectors
drawn from the reals. In such cases, the differential entropy is used, defined by
$$h(X) = -\int_{S} f(x)\log f(x)\,dx$$   (2.22)

where S is the support set of the random variable. An analogous set of information
theoretic identities can be constructed as for the entropy. Here as well it is found that the
capacity is fundamentally limited by the noise. The most important example is the
Gaussian channel, in which signals of power P and bandwidth W are passed through a
channel which adds white Gaussian noise of power spectral density No/2. Then

$$C = W \log\!\left(1 + \frac{P}{W N_o}\right) \ \text{bits/s}.$$   (2.23)

That is, capacity (not reliability) increases linearly with bandwidth (the dimensions
available) and logarithmically with signal to noise ratio (SNR). Viewed another way,
given fixed bandwidth and data rate requirements, there is some minimum SNR that is
required for reliable transmission. Below this SNR, reliable communication is not


possible at the desired transmission rate; the error rate very sharply escalates to unusable
levels. The aim of channel coding is to approach this limit with reasonable
computational delay and complexity. Codes have in fact been developed that very
closely approach capacity.
Example 2.10 Water-filling and multiple Gaussian channels
[Figure 2.7: Three parallel Gaussian channels with noise levels N1, N2, N3. Power P1 is poured into channel 1 and P3 into channel 3 up to a common water level; the noise N2 of channel 2 lies above the water level, so it receives no power.]


Consider the transmission over a set of Gaussian channels, in which the noise power may be changing with
time or may be different according to the frequency used. It might at first be tempting to increase the
transmit power when the channel is poor, and decrease it when the noise is small to maintain a constant
information rate. However, this is the wrong approach when attempting to achieve maximum rate. The optimal
way to allocate transmit power is according to the water-filling distribution, illustrated above. Begin by
pouring power into the channel with the lowest noise. When it reaches the level of the next best channel,
continue pouring into both. Continue until all the power is exhausted. The depth of the power in the
channel is then the power allocated. In Figure 2.7, the noise in channel 2 is so large that it is not used.
However, if the total available power were to increase eventually power would be allocated to all channels.
In the limit of high SNR, the optimal power allocation is approximately equal over all channels. In any
case, the capacity is computed using the usual formula for Gaussian channel capacity for each of the
channels and summed (if they are parallel) or averaged (if they are accessed serially).
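The pouring procedure described above amounts to a few lines of code. A sketch (assuming NumPy; the bisection search on the water level and the example noise values are our own choices) allocating a total power over parallel unit-bandwidth Gaussian channels and summing the resulting capacities:

```python
# Sketch: water-filling power allocation over parallel Gaussian channels.
import numpy as np

def water_fill(noise, total_power, iters=100):
    noise = np.asarray(noise, dtype=float)
    lo, hi = noise.min(), noise.max() + total_power
    for _ in range(iters):                 # bisection on the water level mu
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - noise, 0.0).sum() > total_power:
            hi = mu
        else:
            lo = mu
    return np.maximum(mu - noise, 0.0)

noise = np.array([1.0, 8.0, 2.0])          # N1, N2, N3 as in Figure 2.7 (values arbitrary)
power = water_fill(noise, total_power=4.0)
print("allocated power:", power)            # channel 2 gets nothing at this power budget

# Sum-capacity in bits per channel use, using W = 1 in the Gaussian capacity formula.
print("capacity:", np.sum(np.log2(1.0 + power / noise)))
```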

In network information theory (to be discussed in Chapter 8), various problems of


communication among groups of communications transceivers are considered, including
broadcasting, relaying of messages, multiple access, and hybrid broadcast/relay
transmission schemes. As might be expected, the general problem of capacity of a
network while considering all the ways the elements may interact is as yet unsolved,
although solutions exist for a number of special cases.
Rate distortion coding
Quantization is an example of noisy source coding: the conversion of a sequence of
random variables into a more compact representation of rate R, where there is some
distortion D between the original and reconstructed sequence (e.g., the quantization
noise). The aim may be either to minimize R subject to a constraint on the maximum
distortion D, or minimize D subject to a constraint on the maximum rate R. The question
naturally arises as to which (R,D) pairs are actually achievable. Suppose the distortion
measure is the squared error (Y-X)2 between the input sequence of RVs X (e.g., drawn


from the reals) and the reconstructed sequence Y (some finite set). Let D be the
maximum permitted distortion. The rate distortion theorem states
$$R(D) = \min_{f(y|x)\,:\,E[(Y-X)^2] \le D} I(X;Y)$$   (2.24)
Consider the particular instance of using the mean squared error E[(Y-X)2] as the
distortion measure, and let the source generate a sequence of iid Gaussian RVs with zero
mean and variance σ². Rate distortion problems are generally considered from the point
of view of a test channel, that is, like a communications problem where Y is
communicated over a noisy channel that adds a random variable Z, to produce the source
X. For the Gaussian source, the channel along with the probability distributions for the
variables are illustrated in Figure 2.8.
[Figure 2.8: Test channel for Gaussian source. Y ~ N(0, σ²-D) is added to independent noise Z ~ N(0, D) to produce the source X ~ N(0, σ²).]


This particular choice of the distribution produces a conditional density f(X|Y) that
achieves the rate distortion bound, so that I(X;Y) = (1/2) log(σ²/D) and the rate distortion
function is given by

$$R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}, \quad 0 \le D \le \sigma^2, \ \text{and 0 otherwise}.$$   (2.25)

Another way of looking at this is distortion as a function of rate, D(R) = σ² 2^{-2R}. Thus,
as noted earlier with respect to the discussion on quantizers, each additional bit reduces
the distortion by a factor of 4, or 6 dB. This is in keeping with the general trend for
quantizers: that the distortion in dB is of the form D(n) = k - 6n dB, where n is the number
of bits. Because the optimal rate distortion coder operates on sequences rather than
performing operations symbol by symbol, it achieves a lower value for the constant k.
Vector quantizers can closely approach the rate distortion bound.
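The 6 dB per bit rule and the Gaussian rate distortion bound are easy to tabulate. A sketch (assuming NumPy; the unit-variance normalization is our own choice) listing D(R) = σ²2^(-2R) in dB:

```python
# Sketch: the Gaussian rate distortion function D(R) = sigma^2 2^(-2R) and the
# ~6 dB-per-bit rule, for a unit-variance source.
import numpy as np

sigma2 = 1.0
for R in range(1, 6):                       # rate in bits per sample
    D = sigma2 * 2.0 ** (-2 * R)
    D_dB = 10 * np.log10(D / sigma2)        # distortion relative to the source power
    print(f"R = {R} bits  D(R) = {D:.5f}  ({D_dB:.1f} dB)")
# Each extra bit lowers the bound by 10*log10(4) = 6.02 dB.
```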
In lossy source coding, in practice there are generally a number of steps. First some
heuristic is applied to remove most of the correlation of the source, using knowledge of
the source statistics, and in the case of video or audio signals, knowledge of which
features have the largest perceptual impact. Then rate-distortion coding or lossless
source coding is applied to remove the remaining correlation in the sequence.
In sensor networks, it is generally not feasible for reasons of network scalability or
energy consumption to send all the data collected by sensors to some central location for
processing. Thus sensor nodes must perform some form of lossy compression. In the
extreme, decisions must be made locally for example to identify a source and thus simply
describe it by a few bits. A research question is the trade between network resource
consumption (energy, bandwidth) and the fidelity of the final decisions fused from the

(processed) observations of a set of nodes. The former may be identified with rate, and the
latter with the distortion as a network rate distortion problem.
Data processing inequality
Another information theoretic result of considerable importance in design of receivers
(whether for sensing or communications) is the data processing inequality. Consider the
information processing system depicted in Figure 2.9.

[Figure 2.9: An information processing system. X → P(Y|X) → Y → P(Z|Y) → Z.]


Here a source X passes through a block (e.g. a channel) which produces Y according to
the conditional distribution P(Y |X). A processing step produces Z according to the
conditional distribution P(Z |Y). Consequently the sequence X to Y to Z forms a Markov
chain. In this situation, the data processing inequality (proved using the chain rule for
mutual information) states that I(X;Y) ≥ I(X;Z). That is, processing cannot increase the
mutual information between the received sequence Y and the source sequence X; it can at
best keep it constant and may destroy information. This is also true in the particular case
that Z = g(Y) where g is an arbitrary function. One does not get something from nothing
through the magic of computer processing; all the information is in the original signal,
and it is likely all downhill from there. What processing can do is apply information-
preserving transformations that present the data in a more convenient form.
In an optimal receiver, the processing steps should thus be designed to preserve the
mutual information between the end result and the original source. If a more compact
representation is required due to constraints, then a reasonable goal is to maximize the
mutual information subject to rate or distortion constraints.
2.5 Summary
This chapter has laid out the basic description of signals as random processes. The
interface between continuous time signals and representations suitable for computers was
described. The sampling theorem indicates the minimum rate at which a random process
must be sampled in order for the possibility of accurate signal reconstruction to be
preserved. Information theory lays out fundamental limits on the compressibility of the
resulting data stream, and the minimum error that results from the transformation to a set of
discrete values. It further supplies fundamental limits on the resources required for
communication of data.
2.6 To inquire further


Stochastic processes and linear systems are treated at an introductory level but with
sufficient mathematical rigor in
S. Haykin. Communication Systems 4th Ed. Wiley, 2001.

A standard comprehensive reference for probability is


A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 1965.

The discussion on information theory was abstracted from the highly readable textbook
T.M. Cover and J.A. Thomas, Elements of Information Theory. Wiley, 1991.

It uses an economy of mathematical techniques while discussing a very broad range of


topics in information theory from the basics to gambling on the stock market.
A very accessible overview of the implications of information theory is found in
R.W. Lucky. Silicon Dreams: Information, Man, and Machine. St. Martin's Press, 1989.

An excellent book covering much of the content of this chapter in more detail (but sadly
out of print) is:
P.M. Woodward. Probability and Information Theory, with Applications to Radar. Pergamon Press, 1953.

Of particular note is the connection drawn between maximization of mutual information


and optimal receiver design.
The sampling theorem as stated here in the context of the Fourier transform does not
suggest efficient procedures for dealing with discontinuities (e.g., edges, as arise in
images). Theory can alternatively be constructed for other transforms or on the basis of
the information content of a random process. See for example
A. Ridolfi, I. Maravic, J. Kusuma and M. Vetterli, A Bernoulli-Gaussian approach to the reconstruction of
noisy signals with finite rate of innovation, submitted to IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 2004.

2.7 Problems
2.1 The Rayleigh Distribution
Let z be Rayleigh distributed with pdf:

$$f(z) = \frac{z}{\sigma^2}\, e^{-z^2/2\sigma^2}, \quad z > 0.$$

Find the mean and the variance of z.

2.2 The Binary Symmetric Channel

Consider the binary symmetric channel (BSC), which is shown in Figure 2.10. This is a
binary channel in which the input symbols are complemented with probability p. The
probability associated with each input-output pair is also shown in the figure. For
example, the probability that input 0 is received as output 0 is 1-p. Compute the channel
capacity.


[Figure 2.10: The binary symmetric channel. Input 0 maps to output 0 with probability 1-p and to output 1 with probability p; input 1 maps to output 1 with probability 1-p and to output 0 with probability p.]


2.3 The binary entropy function
Let

$$X = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1-p. \end{cases}$$
Plot H(X) as a function of p. Prove the maximum occurs when the likelihoods of the two
outcomes are equal.

2.4 The Max quantizer


Let X ~ N(0, σ²) (that is, X takes on a normal distribution) and let the distortion measure
be the squared error. Consider a 1-bit Max quantizer for this Gaussian source. One bit is
used to represent the random variable X. If the random variable is greater than or equal to
0, the value is represented by the digital "1". Otherwise, it is represented by the digital
"0". The "1" and "0" correspond to two analog reproducer values used to produce the
reconstruction X̂(X). Determine these two analog values such that the mean square error
E[(X - X̂(X))²] is minimized. Also compute the expected distortion for this quantizer, and
then compare this with the rate distortion limit.
2.5 The binomial distribution
Compute the mean and standard deviation of the binomial distribution shown in Example 2.1.
2.6 Comparison of Gaussian and binomial distributions
To demonstrate the similarity between the binomial and the Gaussian distributions,
consider a binomial distribution with n=150 and p=0.5. First, plot the probability density
functions of both distributions with the same variance. Also plot the absolute values of the
difference between the two distributions, without and with normalization to the values of the
binomial distribution. Finally, for the binomial distribution compute the probability that
the trial outcome is within the standard deviation.
2.7 The lognormal distribution
Let X ~ N(μ, σ²) and Y = 10^{X/10}. Therefore Y is lognormally distributed. Compute the mean
and the variance of this lognormal distribution.


Hints: It is suggested to solve this problem using moment generating functions. The
moment generating function of X is defined by

$$M_X(t) \equiv E[e^{tX}].$$

Since Y can be expressed as an exponential of X, the mean and the variance can easily be
obtained by substituting t with specific values. The generating function is also a
powerful tool for other distributions.

2.8 The cumulative density function


In addition to the probability density function (pdf), another function that is also
commonly used is the probability distribution function. It is also called cumulative
distribution function (cdf). It is defined as
$$F(x) = P(X \le x), \quad -\infty < x < \infty.$$

Its relationship with the pdf is

$$f(x) = \frac{dF(x)}{dx}, \quad -\infty < x < \infty.$$

(a) Consider a Gaussian distributed random variable X ~ N(0, σ²). Express its cdf
F(x) in terms of Q(x).
(b) Another useful function is the error function, defined as

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_{0}^{x} e^{-t^2}\,dt.$$

The complementary error function is

$$\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}}\int_{x}^{\infty} e^{-t^2}\,dt = 1 - \mathrm{erf}(x).$$

Express F(x) in terms of the error function and the complementary error function,
respectively.
(c) Express Q(x) in terms of the complementary error function.

2.9 Fourier transform for angular frequency


In section 2.2, the Fourier transform is defined in terms of frequency f. To avoid
confusion here, it is denoted as X_f(f). The Fourier transform is often defined in terms of
radians per second, i.e. the angular frequency ω:

$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\,dt.$$

Derive the formula of the inverse Fourier transform from X(ω) to x(t) and the relationship
between X(ω) and X_f(f).

2.10 The name game

There are four persons. They write their names on individual slips of paper and deposit
the slips in a common box. Each of the four draws at random a slip from the box in
sequence. If a person retrieves one of his own name slips, he will duplicate all slips
remaining in the box and deposit them in the box. Determine the probability of each
person retrieving his own name slip.
2.11 Some handy transforms

Compute the Fourier transform of the following signals without resorting to the Fourier
transform pair table:
(a) $\mathrm{rect}\!\left(\frac{t}{2T_1}\right) = \begin{cases} 1 & |t| < T_1 \\ 0 & |t| > T_1 \end{cases}$
(b) $\delta(t)$
(c) $e^{-at}u(t), \ \mathrm{Re}\{a\} > 0$

2.12 Some more challenging transforms

Compute the Fourier transform of each of the following signals, using the results in
problem 2.11 and the properties described in this chapter.
(a) $\cos(2\pi f_0 t)$.
(b) Periodic square wave $x(t) = \begin{cases} 1, & |t| < T_1 \\ 0, & T_1 < |t| < \frac{T_0}{2} \end{cases}$ and $x(t + T_0) = x(t)$.

2.13 Filtered noise


(a) Suppose white Gaussian noise with PSD of No/2 passes through a linear filter with
response $h_1(t) = e^{-at}u(t), \ \mathrm{Re}\{a\} > 0$. Determine the PSD and autocorrelation function of
the output process y1(t).
(b) y1(t) passes through another linear filter with response

$$h_2(t) = \delta(t - T_0) + \frac{1}{3}\delta(t - 3T_0), \quad T_0 > 0.$$

Determine the PSD and autocorrelation of the output process y2(t).
2.14 Uniform quantizer
The uniform quantizer of 8 bins shown in Figure 2.4 is used to quantize two kinds of
input. Compute the mean square error (MSE) for each input.
(a) Random process with uniform distribution between -4 and 4.
(b) $4\sin(2\pi f_0 t)$. It is assumed the sampler has no information about the phase and the
frequency of this sinusoidal input.
2.15 The Rice Distribution
In wireless communications, the amplitude of the received signal is usually modeled by
the Rice distribution:
2
2
$$f(z) = \frac{z}{\sigma^2}\, e^{-(z^2+s^2)/2\sigma^2}\, I_0\!\left(\frac{zs}{\sigma^2}\right), \quad z > 0,$$

where s² = m1² + m2². The value of s depends on the line of sight signal path. If this path
is present and the path gain is high, s is large and it is easier to communicate information.
To recognize this, plot the pdf for s = 0, 0.5, …, 5 and compute the probabilities of z > 5
for each s.
2.16 A/D and D/A conversion


In this problem, A/D and D/A conversion are practiced and two simulations are
compared. Consider the steps in Figure 2.11

[Figure 2.11: A/D and D/A conversion. x(t) → anti-aliasing filter h1(t) → quantizer → binary labels → reproducer → low-pass filter h2(t) → y(t).]


The quantizer is the 3-bit uniform quantizer shown in Figure 2.4. The reproducer is a D/A
converter. It converts the binary labels into analog values and holds the analog value until
the next new binary label is available. h1 (t ) and h2 (t ) are both low pass filters, where
h1 (t ) is called the anti-aliasing filter. The transfer function of both filters is a Butterworth
low pass filter:
$$H_1(s) = H_2(s) = \frac{1.984 \times 10^{12}}{s^3 + 2.513 \times 10^4 s^2 + 3.158 \times 10^8 s + 1.984 \times 10^{12}}$$
The input is a triangle wave as depicted in Figure 2.12.
[Figure 2.12: A triangle wave swinging between -4 and 4 with period T0.]

T0 = 1/f0 = 1 ms. The sample rate is 20k samples/s. In the first experiment, compute the
mean square error. Note that the A/D converter has no information about the input phase.
In the second experiment, the anti-aliasing filter h1(t) is removed. Compute the mean
square error in this situation.
