in Information Theory
Lecture Notes
Stefan M. Moser
Contents
Preface
1 Mathematical Preliminaries
1.1 Review of some Definitions
1.2 Some Important Inequalities
1.3 Fourier–Motzkin Elimination
1.4 Law of Large Numbers
1.5 Additional Tools
2 Method of Types
2.1 Types
2.2 Properties of Types
2.3 Joint Types
2.4 Conditional Types
2.5 Remarks on Notation
4 Strong Typicality
4.1 Strongly Typical Sets
4.2 Jointly Strongly Typical Sets
4.3 Conditionally Strongly Typical Sets
4.4 Accidental Typicality
4.A Appendix: Alternative Definition of Conditionally Strongly Typical Sets
5 Rate Distortion Theory
5.1 Motivation: Quantization of a Continuous RV
5.2 Definitions and Assumptions
5.3 The Information Rate Distortion Function
5.4 Rate Distortion Coding Theorem
5.4.1 Converse
5.4.2 Achievability
5.4.3 Discussion
5.5 Characterization of R(D)
5.6 Further Properties of R(D)
5.7 Joint Source and Channel Coding Scheme
5.8 Information Transmission System: Transmitting above Capacity
5.9 Rate Distortion for the Gaussian Source
5.9.1 Rate Distortion Coding Theorem
5.9.2 Parallels to Channel Coding
5.9.3 Simultaneous Description of m Independent Gaussians
10 The Multiple-Access Channel
10.1 Problem Setup
10.2 Time-Sharing: Convexity of Capacity Region
10.3 Some Illustrative Examples for the MAC
10.4 The MAC Capacity Region
10.4.1 Achievability of C1
10.4.2 Capacity Region C2 Being a Subset of C1
10.4.3 Converse of C2
10.5 Some Observations and Discussion
10.5.1 C1 with Fixed Distribution $Q_{X^{(1)}} \cdot Q_{X^{(2)}}$
10.5.2 Convex Hull of two Pentagons
10.5.3 General Shape of the MAC Capacity Region
10.6 Multiple-User MAC
10.7 Gaussian MAC
10.7.1 Capacity Region
10.7.2 Discussion
10.7.3 CDMA versus TDMA or FDMA
10.8 Historical Remarks
12.2.4 Case 4
12.2.5 Analysis Put Together
12.3 The Gelfand–Pinsker Rate
12.4 Converse
12.5 Summary
12.6 Writing on Dirty Paper
12.7 Different Types of Side-Information
12.A Appendix: Concavity of Gelfand–Pinsker Rate in Cost Constraint
13 The Broadcast Channel
13.1 Problem Setup
13.2 Some Important Observations
13.3 Some Special Classes of Broadcast Channels
13.3.1 Degraded Broadcast Channel
13.3.2 Broadcast Channel with Less Noisy Output
13.3.3 The Broadcast Channel with More Capable Output
13.4 Superposition Coding
13.5 Nair–El Gamal Outer Bound
13.6 Capacity Regions of Some Special Cases of BCs
13.7 Achievability based on Binning
13.8 Best Known Achievable Region: Marton's Region
13.9 Some More Outer Bounds
13.10 Gaussian BC
16.4 Strong and Very Strong Interference
16.5 Han–Kobayashi Region
16.5.1 Superposition Coding with Rate Splitting
16.5.2 Fourier–Motzkin Elimination
16.5.3 Best Known Achievable Rate Region
16.6 Gaussian IC
16.6.1 Channel Model
16.6.2 Outer Bound
16.6.3 Basic Communication Strategies
16.6.4 Strong and Very Strong Interference
16.6.5 Han–Kobayashi Region for Gaussian IC
16.6.6 Symmetric Degrees of Freedom
Bibliography
List of Figures
List of Tables
Index
Preface
As its title indicates, this course covers some more advanced topics in information theory. The script can be split into roughly three parts:
• Chapters 1–4 cover results and tools that will be needed in the proofs later on;
• Chapters 5–9 deal with advanced topics related to data compression; and
• Chapters 10–16 cover advanced topics of data transmission.
In more detail, after a chapter reviewing some definitions and basic mathematical and information-theoretic facts, in Chapter 2 we introduce the foundation on which the rest of this script is built: types. We study types in detail, but at first without giving any information-theoretic motivation. We simply ask the reader at that stage to enjoy the beauty of these mathematical ideas and not to worry too much about their relation to communications. As types are the most important tool throughout the course, we only talk about our notation in Section 2.5, after we have introduced types. More comments about notation can be found in Remark 4.2 in Section 4.1.
Chapter 3 then makes a detour into a fancy area of probability theory: large deviation theory. Again, the connection to communications will not be directly visible, but we will rely on it later on in some proofs. And once more, the results by themselves are beautiful! In Chapter 4, we then mold the basic ideas of Chapter 2 into a form that will serve us as the main tool in most of the proofs in the remainder of this course: We define strongly typical sets and discuss their properties.
After these lengthy preparations, we then finally start with information
theory. In Chapters 5 and 6, we discuss lossy compression: the rate distortion
theory. Rate distortion coding can be seen as a dual to channel coding: While
in channel coding we ask at what rates information can be transmitted for a
given available power such that the error probability is arbitrarily small, rate
distortion coding deals with the minimum necessary description rate needed to
compress a source such that it can be reconstructed within a given distortion
with arbitrarily small error probability. Chapter 6 further deepens the results
of Chapter 5 by looking at how quickly the error probability tends to zero
when the blocklength tends to infinity.
The book's main strength is its intuitive description of the problems and their proofs. This intuitive approach, however, sometimes comes at the cost of a loss in accuracy.
For some topics, there exist quite accessible journal papers presenting the newest research results alongside a good summary of the basic results up to that point. For example, an easy-to-read introduction to the multiple description problem can be found in [VKG03]. It is also worth having a look at the extensive list of references to other literature that is given in this paper.
The following works could also be included in our list of readings.
Imre Csiszár and Paul Shields discuss the relations between information theory and statistics in [CS04].
Abbas El Gamal and Young-Han Kim have recently published a book about network information theory [EGK11] [EGK10].
Another book on network information theory has been written by Raymond Yeung [Yeu08].
I will keep working on these notes and try to improve them continually.
So if you find typos, errors, or if you have any comments about these notes, I
would be very happy to hear them! Write to
stefan.moser@alumni.ethz.ch
Thanks!
Finally and once again I must express my deepest gratitude to YinTai and
Matthias who kept encouraging me during the whole project and particularly
towards the end phase when the schedule got tight.
Stefan M. Moser
Chapter 1
Mathematical Preliminaries
In this chapter, we prepare some mathematical tools that we need in the derivations of this course. We start in Section 1.1 with a very brief review of the main definitions in information theory. Then Section 1.2 summarizes some important inequalities that stem partly from information theory itself and partly from probability theory. Section 1.3 reviews the tool of Fourier–Motzkin elimination. In Section 1.4 we repeat the laws of large numbers that will be the foundation of many of our proofs. And Section 1.5 presents some further statements and tools that are quite far away from the core topics of this course, but that are still important for our analysis.
1.1 Review of some Definitions
The following definitions and results are all stated without motivation or
proofs. The details can be found in [Mos14].
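Definition 1.1. The entropy of a discrete random variable (RV) $X$ is defined as
\begin{align}
H(X) &\triangleq -\sum_{x \in \mathrm{supp}(P_X)} P_X(x) \log P_X(x) \tag{1.1} \\
&= \mathrm{E}[-\log P_X(X)], \tag{1.2}
\end{align}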
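The entropy is bounded by
\[ 0 \le H(X) \le \log |\mathcal{X}|, \tag{1.3} \]
where we use $|\mathcal{X}|$ to denote the size of the set $\mathcal{X}$. Conditioning reduces entropy (or at least it does not increase it):
\[ H(X) \ge H(X|Y). \tag{1.4} \]
The chain rule reads
\[ H(X_1, \ldots, X_n) = \sum_{k=1}^{n} H(X_k \,|\, X_1, \ldots, X_{k-1}). \tag{1.5} \]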
Definition 1.3. Let $X$ be a continuous random variable with probability density function (PDF) $f_X(\cdot)$. Then we define the differential entropy $h(X)$ as follows:
\begin{align}
h(X) &\triangleq -\int f_X(x) \log f_X(x) \, dx \tag{1.6} \\
&= \mathrm{E}[-\log f_X(X)]. \tag{1.7}
\end{align}
(1.10)
or
\[ D(f_1 \,\|\, f_2) \ge 0. \tag{1.11} \]
Note that if the set $G \triangleq \{x \in \mathrm{supp}(f_1) : f_2(x) = 0\}$ has positive Lebesgue measure, then $D(f_1 \,\|\, f_2) = \infty$. Otherwise, the points in $G$ are ignored in the integration (1.9).
Definition 1.6. The mutual information between the discrete RVs $X$ and $Y$ with joint PMF $P_{X,Y}$ is defined as
\[ I(X;Y) \triangleq D(P_{X,Y} \,\|\, P_X P_Y) = H(X) - H(X|Y). \tag{1.12} \]
Similarly, the mutual information between two continuous RVs $X$ and $Y$ with joint PDF $f_{X,Y}$ is defined as
\[ I(X;Y) \triangleq D(f_{X,Y} \,\|\, f_X f_Y) = h(X) - h(X|Y). \tag{1.13} \]
The further generalization of the mutual information functional to arguments being a mixture of discrete and continuous random variables is slightly more subtle, but basically well-behaved. We refer to [Mos14, Section 15.3] for a brief discussion.
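Proposition 1.7. Mutual information is nonnegative,
\[ I(X;Y) \ge 0, \tag{1.14} \]
and satisfies the chain rule
\[ I(X_1, \ldots, X_n; Y) = \sum_{k=1}^{n} I(X_k; Y \,|\, X_1, \ldots, X_{k-1}). \tag{1.15} \]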
Remark 1.8. Note that we will usually omit mentioning that the sum (or integral) is only over the corresponding support. Instead we use the notational convention that
\[ 0 \log 0 \triangleq 0. \tag{1.16} \]
1.2 Some Important Inequalities
Some of the following inequalities are well-known, some maybe less so, but all of them are going to be important in various places of this script. It therefore makes sense to summarize them here already.
The first inequality is taken over from [Mos14]. Following the suggestion of Prof. James L. Massey, retired professor at ETH in Zurich, we call it the Information Theory Inequality or the IT Inequality.
Theorem 1.9 (IT Inequality). For any base $b > 1$ and any $\xi > 0$,
\[ \left(1 - \frac{1}{\xi}\right) \log_b e \;\le\; \log_b \xi \;\le\; (\xi - 1) \log_b e \tag{1.17} \]
(1.18)
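To prove the upper bound, note that
\[ (\xi - 1) \log_b e \,\Big|_{\xi = 1} = 0 = \log_b \xi \,\Big|_{\xi = 1} \tag{1.19} \]
and
\[ \frac{d}{d\xi} \log_b \xi = \frac{1}{\xi} \log_b e \;
\begin{cases}
> \log_b e & \text{if } 0 < \xi < 1, \\
< \log_b e & \text{if } \xi > 1.
\end{cases} \tag{1.20} \]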
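Hence, the two functions coincide at $\xi = 1$, and the linear function is above the logarithm for all other values.
To prove the lower bound again note that
\[ \left(1 - \frac{1}{\xi}\right) \log_b e \,\Big|_{\xi = 1} = 0 = \log_b \xi \,\Big|_{\xi = 1} \tag{1.21} \]
and
\[ \frac{d}{d\xi} \left(1 - \frac{1}{\xi}\right) \log_b e = \frac{1}{\xi^2} \log_b e \;
\begin{cases}
> \frac{d}{d\xi} \log_b \xi = \frac{1}{\xi} \log_b e & \text{if } 0 < \xi < 1, \\
< \frac{d}{d\xi} \log_b \xi = \frac{1}{\xi} \log_b e & \text{if } \xi > 1,
\end{cases} \tag{1.22} \]
similarly to above.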
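As a quick numerical sanity check of the IT Inequality, the following minimal sketch evaluates the hyperbolic lower bound, the logarithm, and the linear upper bound at a few points. The function name `it_inequality_terms` and the variable name `xi` are our own choices for illustration, and we assume a base larger than 1.

```python
import math


def it_inequality_terms(xi: float, b: float = 2.0):
    """Return (lower, log_b(xi), upper) for xi > 0 and an assumed base b > 1."""
    log_b_e = 1.0 / math.log(b)            # log_b(e) = 1 / ln(b)
    lower = (1.0 - 1.0 / xi) * log_b_e     # hyperbolic lower bound
    middle = math.log(xi) * log_b_e        # log_b(xi)
    upper = (xi - 1.0) * log_b_e           # linear upper bound
    return lower, middle, upper


if __name__ == "__main__":
    for xi in (0.1, 0.5, 1.0, 2.0, 10.0):
        lower, middle, upper = it_inequality_terms(xi)
        print(f"xi = {xi:5.1f}: {lower:9.4f} <= {middle:9.4f} <= {upper:9.4f}")
```

All three terms coincide (and are zero) at the argument 1, in line with the proof above.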
Corollary 1.10 (Exponentiated IT Inequality). For any > 0, we have
(1 ) e ,
1,
(1.23)
(1.24)
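\[ \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \;\ge\; \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \tag{1.25} \]
\[ A \triangleq \sum_{i=1}^{n} a_i, \qquad B \triangleq \sum_{i=1}^{n} b_i. \tag{1.26} \]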
If A = 0, then (by Remark 1.8) both sides of the inequality are zero (thereby
achieving equality).
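So, assume that $A > 0$ ($B$ is positive by assumption). We define the two PMFs
\begin{align}
Q_a(i) &\triangleq \frac{a_i}{A}, \quad i = 1, \ldots, n, \tag{1.27} \\
Q_b(i) &\triangleq \frac{b_i}{B}, \quad i = 1, \ldots, n. \tag{1.28}
\end{align}
By the nonnegativity of relative entropy (Proposition 1.5) it now follows that
\begin{align}
0 &\le D(Q_a \,\|\, Q_b) \tag{1.29} \\
&= \sum_{i=1}^{n} \frac{a_i}{A} \log \frac{a_i / A}{b_i / B} \tag{1.30} \\
&= \frac{1}{A} \log \frac{B}{A} \sum_{i=1}^{n} a_i + \frac{1}{A} \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \tag{1.31} \\
&= \log \frac{B}{A} + \frac{1}{A} \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i}. \tag{1.32}
\end{align}
Hence,
\[ \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge A \log \frac{A}{B}. \tag{1.33} \]
In this case of positive $A$, equality can only be reached if $D(Q_a \,\|\, Q_b) = 0$, i.e., if $a_i / A = b_i / B$ for all $i = 1, \ldots, n$.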
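The log-sum inequality derived above is easy to check numerically. The following minimal sketch (the function name `log_sum_sides` is our own) evaluates both sides for nonnegative $a_i$ and positive $b_i$, using the convention $0 \log 0 = 0$ of Remark 1.8.

```python
import math


def log_sum_sides(a, b):
    """Return (lhs, rhs) of the log-sum inequality.

    Assumes all a_i >= 0 and all b_i > 0; terms with a_i = 0 contribute 0.
    """
    lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    A, B = sum(a), sum(b)
    rhs = A * math.log(A / B) if A > 0 else 0.0
    return lhs, rhs


if __name__ == "__main__":
    print(log_sum_sides([1, 2, 3], [2, 2, 2]))   # strict inequality
    print(log_sum_sides([1, 2, 3], [2, 4, 6]))   # proportional: equality
```

In the second call the two sequences are proportional, so both sides agree up to rounding.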
The first part of the following proposition has again been taken over from
[Mos14].
Proposition 1.12 (Data Processing Inequalities (DPI)). Assume that
\[ X \multimap Y \multimap Z \]
forms a Markov chain.
(1.34)
(1.35)
(1.36)
A similar result can be stated for the case of continuous RVs with densities.
Proof: We only prove the case of discrete RVs. Since $I(X;Z|Y) = 0$, we have $H(X|Y) = H(X|Y,Z)$. Hence,
\begin{align}
I(X;Z) &= H(X) - H(X|Z) \tag{1.37} \\
&\le H(X) - H(X|Y,Z) \tag{1.38} \\
&= H(X) - H(X|Y) \tag{1.39} \\
&= I(X;Y), \tag{1.40}
\end{align}
(1.41)
!
X X
y
XX
(1.42)
(1.43)
(1.44)
(1.45)
(1.46)
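\[ U \multimap V \multimap \hat{U}. \]
(1.47)
(1.48)
Then
\begin{align}
H(U|V) \le H(U|\hat{U}) &\le H_b(P_e) + P_e \log(|\mathcal{U}| - 1) \tag{1.49} \\
&\le \log 2 + P_e \log |\mathcal{U}|. \tag{1.50}
\end{align}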
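such that
\begin{align}
P_Z(1) &= P_e, \tag{1.51} \\
P_Z(0) &= 1 - P_e, \tag{1.52} \\
H(Z) &= H_b(P_e). \tag{1.53}
\end{align}
\[ H(U, Z | \hat{U}) = H(U|\hat{U}) + \underbrace{H(Z | U, \hat{U})}_{=\,0} \tag{1.54} \]
and
\begin{align}
H(U, Z | \hat{U}) &= H(Z|\hat{U}) + H(U | \hat{U}, Z) \tag{1.55} \\
&\le H(Z) + H(U | \hat{U}, Z) \tag{1.56} \\
&= H_b(P_e) + (1 - P_e) \underbrace{H(U | \hat{U}, Z = 0)}_{=\,0 \text{ because } U = \hat{U}} + P_e \underbrace{H(U | \hat{U}, Z = 1)}_{\le\, \log(|\mathcal{U}| - 1) \text{ because } U \ne \hat{U}} \tag{1.57} \\
&\le H_b(P_e) + P_e \log(|\mathcal{U}| - 1), \tag{1.58}
\end{align}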
where the first inequality follows because conditioning cannot increase entropy. This proves the inner (second) inequality.
The first inequality follows from the data processing inequality (1.35):
\begin{align}
I(U; \hat{U}) &\le I(U; V) \tag{1.59} \\
\Longrightarrow \quad H(U) - H(U|\hat{U}) &\le H(U) - H(U|V) \tag{1.60} \\
\Longrightarrow \quad H(U|\hat{U}) &\ge H(U|V). \tag{1.61}
\end{align}
(1.62)
(1.63)
Theorem 1.14 (Union Bound on Total Expectation). Let $X$ be a random variable taking value in the alphabet $\mathcal{X}$ and let $f \colon \mathcal{X} \to \mathbb{R}_0^+$ be a nonnegative function. Consider some events $E_i \subseteq \mathcal{X}$ that are not necessarily disjoint, but whose union covers $\mathcal{X}$:
\[ \bigcup_i E_i = \mathcal{X}. \tag{1.64} \]
Then
\[ \mathrm{E}[f(X)] \le \sum_i \Pr(E_i) \, \mathrm{E}[f(X) \,|\, X \in E_i]. \tag{1.65} \]
Proof: We again only prove the case of a discrete RV. We split all sets $E_i$ up into disjoint subsets $B_j$, $B_j \cap B_{j'} = \emptyset$, $j \ne j'$, such that
\[ E_i = \bigcup_{\text{some } k} B_k \tag{1.66} \]
for the right choice of union over $k$. As an example, see Figure 1.1. There,
\[ E_3 = B_4 \cup B_5 \cup B_6 \cup B_7. \tag{1.67} \]
Figure 1.1: Example of overlapping sets $E_i$ (here $E_1$, $E_2$, $E_3$) that are split up into disjoint subsets $B_j$ (here $B_1, \ldots, B_7$).
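Obviously, we still have
\[ \bigcup_j B_j = \mathcal{X}. \tag{1.68} \]
Hence,
\begin{align}
\mathrm{E}[f(X)] &= \sum_{x \in \mathcal{X}} Q_X(x) f(x) \tag{1.69} \\
&= \sum_j \sum_{x \in B_j} Q_X(x) f(x) \tag{1.70} \\
&\le \sum_j \sum_{x \in B_j} Q_X(x) f(x) + \underbrace{\sum_{\text{some } j'} \sum_{x \in B_{j'}} Q_X(x) f(x)}_{\ge\, 0 \text{ because } f(\cdot) \ge 0} \tag{1.71} \\
&= \sum_i \sum_{x \in E_i} Q_X(x) f(x) \tag{1.72} \\
&= \sum_i \Pr(E_i) \sum_{x \in E_i} \frac{Q_X(x)}{\Pr(E_i)} f(x) \tag{1.73} \\
&= \sum_i \Pr(E_i) \, \mathrm{E}[f(X) \,|\, X \in E_i]. \tag{1.74}
\end{align}
Here, (1.70) follows because the sets $B_j$ form a partition of $\mathcal{X}$ (i.e., they are disjoint and exactly cover $\mathcal{X}$); and in (1.71) we add some sets $B_{j'}$ once again to make sure that we can account for all $E_i$ (note that some $B_j$ are members of several $E_i$!).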
The following inequality is slightly different in quality from the previous
ones because it explicitly only holds for continuous random variables (with a
PDF!) and their differential entropy.
Theorem 1.15 (Entropy Power Inequality (EPI)). Let $\mathbf{X}$ and $\mathbf{Y}$ be two independent random $n$-vectors with PDF. Then
\[ e^{\frac{2}{n} h(\mathbf{X} + \mathbf{Y})} \ge e^{\frac{2}{n} h(\mathbf{X})} + e^{\frac{2}{n} h(\mathbf{Y})} \tag{1.75} \]
where the differential entropies have to be measured in nats. Equality holds if, and only if, $\mathbf{X}$ and $\mathbf{Y}$ are Gaussian with proportional covariance matrices.
Proof: This inequality was introduced by Shannon in [Sha48]. The first rigorous proof was given in [Sta59]. In [CT06] the proof relies on the Rényi entropy [CK81]. In [VG06] a proof based on a relationship between mutual information and minimum mean-square error in Gaussian channels is provided. And a proof based on basic properties of mutual information and on Taylor expansions is given in [Rio07].
1.3 Fourier–Motzkin Elimination
We are all familiar with the Gaussian elimination procedure that is used to eliminate unwanted variables in a system of linear equations. In information theory, however, we more often encounter a system of linear inequalities. In particular, in multi-terminal problems we often see rate regions that are described by a set of inequalities. In general these rate regions can be described as
\[ \mathsf{A} \mathbf{R} \le \mathbf{I} \tag{1.76} \]
where
\[ \mathbf{R} = \left( R^{(1)}, \ldots, R^{(L)} \right)^{\mathsf{T}} \tag{1.77} \]
is a vector that contains $L$ different rates,
\[ \mathbf{I} = \left( I^{(1)}, \ldots, I^{(N)} \right)^{\mathsf{T}} \tag{1.78} \]
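is a vector with $N$ entries that consists of sums of entropies and mutual information terms, and where $\mathsf{A}$ is an $N \times L$ matrix describing the coupling between the different rates. Hence, (1.76) represents $N$ inequalities involving $L$ different rates describing a rate region that is called a polytope.
Now consider the situation that we would like to eliminate one rate, say $R^{(L)}$, from the vector $\mathbf{R}$, i.e., we are interested in the projection
\[ \left\{ \tilde{\mathbf{R}} = \left( R^{(1)}, \ldots, R^{(L-1)} \right)^{\mathsf{T}} : \exists\, R^{(L)} \text{ s.t. } \mathbf{R} = \left( \tilde{\mathbf{R}}^{\mathsf{T}}, R^{(L)} \right)^{\mathsf{T}} \text{ satisfies } \mathsf{A} \mathbf{R} \le \mathbf{I} \right\}. \tag{1.79} \]
This projection is again described by a system
\[ \tilde{\mathsf{A}} \tilde{\mathbf{R}} \le \tilde{\mathbf{I}} \tag{1.80} \]
with $\tilde{N}$ inequalities. Note that in contrast to Gaussian elimination, the number $\tilde{N}$ of inequalities after an elimination step might be larger than it was before.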
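Example 1.16. As an example we consider an inequality system that we will encounter in the case of a broadcast channel. Consider the inequalities (13.246)–(13.248) in Section 13.7 (together with the implicitly given constraints that the four rates cannot be negative):
\[
\begin{bmatrix}
0 & 0 & -1 & -1 \\
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
-1 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 \\
0 & 0 & -1 & 0 \\
0 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix}
R^{(1)} \\ R^{(2)} \\ \tilde{R}^{(1)} \\ \tilde{R}^{(2)}
\end{bmatrix}
\le
\begin{bmatrix}
-I(U^{(1)}; U^{(2)}) \\
I(U^{(1)}; Y^{(1)}) \\
I(U^{(2)}; Y^{(2)}) \\
0 \\ 0 \\ 0 \\ 0
\end{bmatrix}.
\tag{1.81}
\]
Eliminating $\tilde{R}^{(2)}$ yields
\[
\begin{bmatrix}
0 & 1 & -1 \\
0 & 1 & 0 \\
1 & 0 & 1 \\
-1 & 0 & 0 \\
0 & -1 & 0 \\
0 & 0 & -1
\end{bmatrix}
\begin{bmatrix}
R^{(1)} \\ R^{(2)} \\ \tilde{R}^{(1)}
\end{bmatrix}
\le
\begin{bmatrix}
I(U^{(2)}; Y^{(2)}) - I(U^{(1)}; U^{(2)}) \\
I(U^{(2)}; Y^{(2)}) \\
I(U^{(1)}; Y^{(1)}) \\
0 \\ 0 \\ 0
\end{bmatrix}.
\tag{1.82}
\]
Note that the first two inequalities are generated in Step 1 of the Fourier–Motzkin elimination procedure, while the remaining four inequalities are taken over without change.
We continue and eliminate $\tilde{R}^{(1)}$. Again we see that we have one positive and two negative components, yielding again two pairing possibilities:
\[
\begin{bmatrix}
1 & 1 \\
1 & 0 \\
0 & 1 \\
-1 & 0 \\
0 & -1
\end{bmatrix}
\begin{bmatrix}
R^{(1)} \\ R^{(2)}
\end{bmatrix}
\le
\begin{bmatrix}
I(U^{(1)}; Y^{(1)}) + I(U^{(2)}; Y^{(2)}) - I(U^{(1)}; U^{(2)}) \\
I(U^{(1)}; Y^{(1)}) \\
I(U^{(2)}; Y^{(2)}) \\
0 \\ 0
\end{bmatrix}.
\tag{1.83}
\]
This yields exactly the region given in Theorem 13.33.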
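The elimination step used in this example can be sketched in a few lines of code. The following is a minimal illustration, not an optimized implementation: the function name `fm_eliminate`, the ordering of the four rates, and the numerical values standing in for the three mutual information terms are all assumptions made for this sketch. Each inequality with a positive coefficient in the eliminated column is paired with each inequality with a negative coefficient, and inequalities with a zero coefficient are taken over unchanged. No removal of redundant inequalities is attempted.

```python
from fractions import Fraction
from itertools import product


def fm_eliminate(rows, col):
    """Eliminate variable `col` from a system of linear inequalities.

    Each row is a pair (coeffs, rhs) encoding sum_j coeffs[j] * x_j <= rhs.
    The returned rows no longer contain column `col`.
    """
    pos = [r for r in rows if r[0][col] > 0]
    neg = [r for r in rows if r[0][col] < 0]
    zero = [r for r in rows if r[0][col] == 0]

    def drop(coeffs):
        return tuple(c for j, c in enumerate(coeffs) if j != col)

    out = []
    for (cp, bp), (cn, bn) in product(pos, neg):
        # Scale both rows so that the coefficients in `col` become +1 and -1,
        # then add them: the variable cancels.
        sp = Fraction(1) / cp[col]
        sn = Fraction(1) / -cn[col]
        coeffs = tuple(sp * x + sn * y for x, y in zip(cp, cn))
        out.append((drop(coeffs), sp * bp + sn * bn))
    out.extend((drop(c), b) for c, b in zero)
    return out


# Hypothetical numerical stand-ins for I(U1;U2), I(U1;Y1), I(U2;Y2).
I12, I1, I2 = Fraction(1, 4), Fraction(1), Fraction(3, 4)

# Assumed variable order: (R1, R2, R1_tilde, R2_tilde).
rows = [
    ((0, 0, -1, -1), -I12),
    ((1, 0, 1, 0), I1),
    ((0, 1, 0, 1), I2),
    ((-1, 0, 0, 0), 0),
    ((0, -1, 0, 0), 0),
    ((0, 0, -1, 0), 0),
    ((0, 0, 0, -1), 0),
]
step1 = fm_eliminate(rows, 3)    # eliminate R2_tilde: 6 inequalities remain
step2 = fm_eliminate(step1, 2)   # eliminate R1_tilde: 5 inequalities remain
```

With these stand-in values, `step2` contains the sum-rate constraint with right-hand side $1 + 3/4 - 1/4 = 3/2$, the two individual rate constraints, and the two nonnegativity constraints, matching the structure of the final system in the example.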
1.4 Law of Large Numbers
The core behind almost all proofs and results discussed in these lecture notes is the law of large numbers. We therefore quickly review the two different types of laws of large numbers.
We start with the weak law.
Theorem 1.17 (Weak Law of Large Numbers).
Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed (IID) RVs of mean $\mu$ and variance $\sigma^2$. Then for any $\epsilon > 0$,
\[ \lim_{n \to \infty} \Pr\left[ \left| \frac{1}{n} \sum_{k=1}^{n} X_k - \mu \right| < \epsilon \right] = 1. \tag{1.84} \]
This means that the probability that the sample mean gets very close to the statistical mean tends to 1. We have here a convergence in probability.
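Proof: From independence, it follows
\[ \mathrm{Var}\left[ \frac{1}{n} \sum_{k=1}^{n} X_k \right] = \sum_{k=1}^{n} \mathrm{Var}\left[ \frac{X_k}{n} \right] = \sum_{k=1}^{n} \frac{\sigma^2}{n^2} = n \cdot \frac{\sigma^2}{n^2} = \frac{\sigma^2}{n}, \tag{1.85} \]
and we have
\[ \mathrm{E}\left[ \frac{1}{n} \sum_{k=1}^{n} X_k \right] = \frac{1}{n} \sum_{k=1}^{n} \mathrm{E}[X_k] = \mu. \tag{1.86} \]
Now, we use the Chebyshev Inequality [Mos14, Section 19.2], which says that for any $\epsilon > 0$,
\[ \Pr\big[ |Y - \mathrm{E}[Y]| \ge \epsilon \big] \le \frac{\mathrm{Var}[Y]}{\epsilon^2}, \tag{1.87} \]
and get
\[ \Pr\left[ \left| \frac{1}{n} \sum_{k=1}^{n} X_k - \mu \right| \ge \epsilon \right] \le \frac{\sigma^2}{n \epsilon^2}, \tag{1.88} \]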
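To see how the Chebyshev bound used in this proof compares with the true deviation probability, the following sketch computes the latter exactly for IID fair coin flips (mean 1/2, variance 1/4). The choice of a fair coin and the function names are assumptions made purely for illustration.

```python
from fractions import Fraction
from math import comb


def deviation_probability(n: int, eps: Fraction) -> Fraction:
    """Exact Pr[|S_n / n - 1/2| >= eps] for n fair coin flips."""
    hits = sum(
        comb(n, k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - Fraction(1, 2)) >= eps
    )
    return Fraction(hits, 2 ** n)


def chebyshev_bound(n: int, eps: Fraction) -> Fraction:
    """sigma^2 / (n * eps^2) with sigma^2 = 1/4 for a fair coin."""
    return Fraction(1, 4) / (n * eps ** 2)


if __name__ == "__main__":
    eps = Fraction(1, 10)
    for n in (10, 100, 1000):
        exact = float(deviation_probability(n, eps))
        bound = float(chebyshev_bound(n, eps))
        print(f"n = {n:4d}: exact = {exact:.3e}, Chebyshev bound = {bound:.3e}")
```

The bound is loose (for small blocklengths it can even exceed 1), but it tends to zero as the blocklength grows, which is all the weak law needs.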
1X
=
log Q(Xk )
n
(1.90)
k=1
in prob.
(1.91)
This means that the probability that the sample mean is equal to the statistical mean is 1. We have here a convergence with probability 1 or almost-sure convergence.
1.5 Additional Tools
The following theorem shows that in the $d$-dimensional Euclidean space any convex combination can be written using at most $d + 1$ vectors. This will be useful for us when we try to limit the necessary size of certain finite alphabets.
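Theorem 1.20 (Carathéodory's Theorem). If $\mathbf{v} \in \mathbb{R}^d$ is a convex combination of $n$ points $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n \in \mathbb{R}^d$, then there exists a subset of $d + 1$ of these points $\{\mathbf{v}_{k_1}, \mathbf{v}_{k_2}, \ldots, \mathbf{v}_{k_{d+1}}\}$ such that
\[ \mathbf{v} = \sum_{j=1}^{d+1} \lambda_j \mathbf{v}_{k_j} \tag{1.93} \]
where
\[ \lambda_j \ge 0, \qquad \sum_{j=1}^{d+1} \lambda_j = 1. \tag{1.94} \]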
k
X
j vk j
(1.95)
j=1
Pk
j=1 j
d + 2 k n.
(1.96)
Figure 1.2: The real plane with six points $\mathbf{v}_1, \ldots, \mathbf{v}_6$ and a shaded point that is a convex combination of the six others. This shaded point can also be described as a convex combination of only three of the six points. Two possible choices of such a subset are depicted with the triangles.
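\[ \sum_{j=2}^{k} \mu_j (\mathbf{v}_j - \mathbf{v}_1) = \mathbf{0}. \tag{1.97} \]
We define
\[ \mu_1 \triangleq -\sum_{j=2}^{k} \mu_j \tag{1.98} \]
so that
\[ \sum_{j=1}^{k} \mu_j = -\sum_{j=2}^{k} \mu_j + \sum_{j=2}^{k} \mu_j = 0 \tag{1.99} \]
and by (1.97)
\[ \sum_{j=1}^{k} \mu_j \mathbf{v}_j = -\sum_{j=2}^{k} \mu_j \mathbf{v}_1 + \sum_{j=2}^{k} \mu_j \mathbf{v}_j = \sum_{j=2}^{k} \mu_j (\mathbf{v}_j - \mathbf{v}_1) = \mathbf{0}. \tag{1.100} \]
Therefore,
\begin{align}
\mathbf{v} &= \sum_{j=1}^{k} \lambda_j \mathbf{v}_j - \alpha \cdot \mathbf{0} \tag{1.101} \\
&= \sum_{j=1}^{k} \lambda_j \mathbf{v}_j - \alpha \sum_{j=1}^{k} \mu_j \mathbf{v}_j \tag{1.102} \\
&= \sum_{j=1}^{k} (\lambda_j - \alpha \mu_j) \mathbf{v}_j. \tag{1.103}
\end{align}
Note that at least one $\mu_j > 0$ because not all $\mu_j$ are zero, but their sum is equal to zero. We choose
\[ \alpha \triangleq \min_{j \colon \mu_j > 0} \frac{\lambda_j}{\mu_j} = \frac{\lambda_i}{\mu_i} \tag{1.104} \]
(where the second equality should be read as a definition for $i$). Note that $\alpha > 0$ and that
\[ \lambda_j - \alpha \mu_j = \lambda_j - \underbrace{\frac{\lambda_i}{\mu_i}}_{\le\, \frac{\lambda_j}{\mu_j} \text{ if } \mu_j > 0} \mu_j \ge 0 \quad \forall\, j. \tag{1.105} \]
In particular, we have
\[ \lambda_i - \alpha \mu_i = 0. \tag{1.106} \]
Hence,
\[ \mathbf{v} = \sum_{j=1}^{k} (\lambda_j - \alpha \mu_j) \mathbf{v}_j = \sum_{j=1}^{k} \tilde{\lambda}_j \mathbf{v}_j \tag{1.107} \]
where
\begin{align}
\tilde{\lambda}_j &= \lambda_j - \alpha \mu_j \ge 0, \tag{1.108} \\
\tilde{\lambda}_i &= \lambda_i - \alpha \mu_i = 0, \tag{1.109} \\
\sum_{j=1}^{k} \tilde{\lambda}_j &= \sum_{j=1}^{k} \lambda_j - \alpha \sum_{j=1}^{k} \mu_j = 1 - \alpha \cdot 0 = 1. \tag{1.110}
\end{align}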
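The reduction argument of this proof can be turned into a small procedure: as long as more than $d + 1$ points carry positive weight, find a linear dependence among the difference vectors, and shift the weights along it until one of them becomes zero. The following is a minimal sketch in exact rational arithmetic. The function names `null_vector` and `caratheodory` and the six example points are our own assumptions; the example points are not those of Figure 1.2.

```python
from fractions import Fraction


def null_vector(cols):
    """Return a nonzero mu with sum_j mu[j] * cols[j] = 0.

    `cols` holds m vectors of dimension d with m > d, so a nontrivial
    linear dependence exists. Uses Gauss-Jordan elimination.
    """
    m, d = len(cols), len(cols[0])
    rows = [[Fraction(cols[j][i]) for j in range(m)] for i in range(d)]
    pivots = []                       # pivot column of each reduced row
    r = 0
    for c in range(m):
        p = next((i for i in range(r, d) if rows[i][c] != 0), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        rows[r] = [x / rows[r][c] for x in rows[r]]
        for i in range(d):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == d:
            break
    free = next(c for c in range(m) if c not in pivots)
    mu = [Fraction(0)] * m
    mu[free] = Fraction(1)
    for i, c in enumerate(pivots):
        mu[c] = -rows[i][free]
    return mu


def caratheodory(points, weights):
    """Reduce a convex combination to at most d + 1 points of positive weight.

    Returns a dict {index: weight} describing the same point.
    """
    d = len(points[0])
    pts = [tuple(Fraction(x) for x in p) for p in points]
    lam = [Fraction(w) for w in weights]
    active = [j for j in range(len(pts)) if lam[j] > 0]
    while len(active) > d + 1:
        base = pts[active[0]]
        diffs = [[a - b for a, b in zip(pts[j], base)] for j in active[1:]]
        mu_rest = null_vector(diffs)
        mu = [-sum(mu_rest)] + mu_rest          # the mu_j now sum to zero
        alpha = min(lam[j] / m for j, m in zip(active, mu) if m > 0)
        for j, m in zip(active, mu):
            lam[j] -= alpha * m                 # at least one weight hits zero
        active = [j for j in active if lam[j] > 0]
    return {j: lam[j] for j in active}


# Hypothetical example: six points in the plane with equal weights.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 5), (2, -1)]
weights = [Fraction(1, 6)] * 6
reduced = caratheodory(points, weights)
```

Each pass through the loop removes at least one point, so the procedure terminates after at most $n - d - 1$ passes. Which subset is returned depends on the chosen linear dependence; as Figure 1.2 indicates, the subset is in general not unique.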
The following theorem is only stated for completeness and we omit a proof as it is quite far away from the main topics of this course. However, it explains why we will often be able to avoid the use of infima and suprema, and can directly resort to the simpler minima and maxima.
Theorem 1.23 (Extreme Value Theorem (Karl Weierstrass)). Let the function $f(\cdot)$ be continuous and let $\mathcal{X}$ be a compact set (in the Euclidean space this is equivalent to closed and bounded with respect to the Euclidean distance). Then
\[ \inf_{x \in \mathcal{X}} f(x) = \min_{x \in \mathcal{X}} f(x) \tag{1.111} \]
and
\[ \sup_{x \in \mathcal{X}} f(x) = \max_{x \in \mathcal{X}} f(x). \tag{1.112} \]
Chapter 2
Method of Types
Types and typical sets are extremely important tools of information theory. We have seen part of this already in [Mos14, Chapter 19]. However, while the weak typicality introduced there has the advantage of being easily extendable to continuous-alphabet random variables, it is not very intuitive.
In this course we are going to talk about types and strongly typical sets. Note that while Shannon did have a fundamental understanding of the principal concept of types, and while the theoretical foundations of types go back to Sanov [San57] and Hoeffding [Hoe56], it was the work of Imre Csiszár [Csi98], [CK81] jointly with János Körner and Katalin Marton that formalized it and made it a main tool of information theory.
2.1
Types
(2.1)
(2.2)
N(ax) components
Example 2.2. Let $\mathbf{x} = (11, 15, 11, 17)$ and $\mathbf{y} = (21, 22, 23, 24)$. Then we have $N(11|\mathbf{x}) = 2$, $I(11|\mathbf{x}) = \{1, 3\}$, and $\mathbf{y}_{I(11|\mathbf{x})} = (y_1, y_3) = (21, 23)$.
\[ P_{\mathbf{x}}(a) \triangleq \frac{N(a|\mathbf{x})}{n}, \quad a \in \mathcal{X}. \tag{2.3} \]
\[ P_{\mathbf{x}}(\mathrm{T}) = \frac{2}{5}. \tag{2.4} \]
Definition 2.5. We use $\mathcal{P}(\mathcal{X})$ to denote the set of all probability distributions on $\mathcal{X}$. Moreover, $\mathcal{P}_n(\mathcal{X})$ denotes the set of all types with denominator $n$ and with respect to the alphabet $\mathcal{X}$.
Obviously, $\mathcal{P}_n(\mathcal{X})$ is a set with each member being a PMF, and
\[ \mathcal{P}_n(\mathcal{X}) \subset \mathcal{P}(\mathcal{X}). \tag{2.5} \]
(2.6)
Next we turn the way of thinking around: So far we had a given sequence $\mathbf{x}$ and described its type $P_{\mathbf{x}}$. Now we would like to fix a type $P \in \mathcal{P}_n(\mathcal{X})$ and ask how many sequences $\mathbf{x}$ have this type.
Definition 2.7. Let $P \in \mathcal{P}(\mathcal{X})$ be a distribution on the alphabet $\mathcal{X}$ (not necessarily a type!). Then the set of all length-$n$ sequences $\mathbf{x}$ having a type $P_{\mathbf{x}} = P$ is called the type class of $P$ and is denoted by $\mathcal{T}^n(P)$:
\[ \mathcal{T}^n(P) \triangleq \left\{ \mathbf{x} \in \mathcal{X}^n : P_{\mathbf{x}} = P \right\}. \tag{2.7} \]
1 2
3, 3
. Then
1
,1
(2.8)
. Then
(2.9)
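From Example 2.9 we see that we can redefine $\mathcal{P}_n(\mathcal{X})$ as the set of all probability distributions $P$ on $\mathcal{X}$ such that $\mathcal{T}^n(P) \ne \emptyset$.
\[ P_{\mathbf{x}}(1) = \frac{3}{5}, \quad P_{\mathbf{x}}(2) = \frac{1}{5}, \quad P_{\mathbf{x}}(3) = \frac{1}{5}; \tag{2.10} \]
and
\[ \mathcal{T}^5(P_{\mathbf{x}}) = \{11123, 11132, 11213, 11231, \ldots, 32111\}. \tag{2.11} \]
How many sequences are members of $\mathcal{T}^5(P_{\mathbf{x}})$? Well, simply count all permutations without repetition:
\[ \left| \mathcal{T}^5(P_{\mathbf{x}}) \right| = \frac{5!}{3! \, 1! \, 1!} = 20. \tag{2.12} \]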
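For small blocklengths, a type class can simply be enumerated by brute force, which gives a direct check of the counting argument above. The following is a minimal sketch with our own function names `type_of`, `type_class`, and `type_class_size`; it is only practical for short sequences because it scans all $|\mathcal{X}|^n$ sequences.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial, prod


def type_of(x, alphabet):
    """Type (empirical distribution) of the sequence x, as a tuple over `alphabet`."""
    counts = Counter(x)
    return tuple(Fraction(counts[a], len(x)) for a in alphabet)


def type_class(P, alphabet, n):
    """All length-n sequences over `alphabet` whose type equals P (brute force)."""
    return [x for x in product(alphabet, repeat=n) if type_of(x, alphabet) == P]


def type_class_size(P, n):
    """Multinomial coefficient n! / prod_a (n P(a))!, valid when P is a type."""
    return factorial(n) // prod(factorial(int(n * p)) for p in P)


alphabet = (1, 2, 3)
P = type_of((1, 1, 1, 2, 3), alphabet)       # (3/5, 1/5, 1/5)
members = type_class(P, alphabet, 5)
print(len(members), type_class_size(P, 5))   # both counts agree
```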
2.2
Properties of Types
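The number of types with denominator $n$ is bounded as
\[ |\mathcal{P}_n(\mathcal{X})| \le (n + 1)^{|\mathcal{X}|}, \tag{2.13} \]
or, more precisely,
\[ |\mathcal{P}_n(\mathcal{X})| \le (n + 1)^{|\mathcal{X}| - 1}. \tag{2.14} \]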
Proof: The additional reduction of the exponent by 1 follows from the fact that one component of each probability vector is uniquely determined by all the other components.
Note that we will rarely use this improved bound. It is almost only used in situations where the alphabet is binary. In this case the improvement is quite significant: from $(n + 1)^2$ to $n + 1$.
\[ Q^n(\mathbf{x}) = e^{-n \left( H(P_{\mathbf{x}}) + D(P_{\mathbf{x}} \| Q) \right)}, \tag{2.15} \]
where we assume that entropy and relative entropy are given in nats.
Hence, the probability that $\mathbf{X} = \mathbf{x}$ only depends on the type of $\mathbf{x}$ and not on $\mathbf{x}$ directly!
Remark 2.14. Note that we can rewrite this expression using a power of 2
instead of e, but then need to specify the entropy and relative entropy in bits.
In the remainder of this class, unless explicitly marked, we will stick to nats
and e. Also, note that log denotes the natural logarithm.
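Proof: Recalling our notation $N(a|\mathbf{x})$ from Definition 2.1 and using that the sequences are generated IID, we have:
\begin{align}
Q^n(\mathbf{x}) &= \prod_{k=1}^{n} Q(x_k) \tag{2.16} \\
&= \prod_{a \in \mathrm{supp}(P_{\mathbf{x}})} Q(a)^{N(a|\mathbf{x})} \tag{2.17} \\
&= \prod_{a \in \mathrm{supp}(P_{\mathbf{x}})} Q(a)^{n P_{\mathbf{x}}(a)} \tag{2.18} \\
&= \prod_{a \in \mathrm{supp}(P_{\mathbf{x}})} \exp\big( n P_{\mathbf{x}}(a) \log Q(a) \big) \tag{2.19} \\
&= \exp\left( n \sum_{a \in \mathrm{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a) \log Q(a) \right) \tag{2.20} \\
&= \exp\left( -n \sum_{a \in \mathrm{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a) \log \left( \frac{1}{Q(a)} \cdot \frac{P_{\mathbf{x}}(a)}{P_{\mathbf{x}}(a)} \right) \right) \tag{2.21} \\
&= \exp\left( -n \left( \sum_{a \in \mathrm{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a) \log \frac{P_{\mathbf{x}}(a)}{Q(a)} + \sum_{a \in \mathrm{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a) \log \frac{1}{P_{\mathbf{x}}(a)} \right) \right) \tag{2.22} \\
&= \exp\Big( -n \big( D(P_{\mathbf{x}} \| Q) + H(P_{\mathbf{x}}) \big) \Big). \tag{2.23}
\end{align}
Here, for (2.17) and (2.18) recall the Definition 2.3 of a type and recall that the support $\mathrm{supp}(P_{\mathbf{x}}) \subseteq \mathcal{X}$ denotes the set of all symbols $a$ for which $P_{\mathbf{x}}(a) > 0$.
Let's again have some examples.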
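Example 2.15. Let $\mathcal{X} = \{0, 1\}$, $Q(0) = 1 - Q(1) = \frac{1}{3}$, and $n = 4$. We want to compute the probability of the sequence $\mathbf{x} = 0010$ under the IID law $Q$:
\[ Q^4(0010) = \left( \frac{1}{3} \right)^3 \cdot \frac{2}{3} = \frac{2}{81}. \tag{2.24} \]
We can see already here that this probability only depends on the count of zeros and ones in $\mathbf{x}$.
Let's now compute the same result using TT2, where for convenience we work in bits and with powers of 2 (see Remark 2.14). First note that
\[ P_{\mathbf{x}}(0) = \frac{3}{4}, \qquad P_{\mathbf{x}}(1) = \frac{1}{4}, \tag{2.25} \]
such that
\begin{align}
H(P_{\mathbf{x}}) &= -\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} = 2 - \frac{3}{4} \log_2 3, \tag{2.26} \\
D(P_{\mathbf{x}} \| Q) &= \frac{1}{4} \log_2 \frac{1/4}{2/3} + \frac{3}{4} \log_2 \frac{3/4}{1/3} = \frac{7}{4} \log_2 3 - \frac{9}{4}. \tag{2.27}
\end{align}
Hence,
\[ H(P_{\mathbf{x}}) + D(P_{\mathbf{x}} \| Q) = -\frac{1}{4} + \log_2 3 \tag{2.28} \]
and
\[ 2^{-4 \left( -\frac{1}{4} + \log_2 3 \right)} = 2^{1 - 4 \log_2 3} = \frac{2}{81}, \tag{2.29} \]
as expected.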
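The computation of this example can be replayed numerically. The sketch below (function names are our own) computes the probability of a sequence once directly as a product of symbol probabilities and once through the type of the sequence, using natural logarithms; it assumes that $Q(a) > 0$ for every symbol that occurs in the sequence.

```python
import math
from collections import Counter
from fractions import Fraction


def iid_probability(x, Q):
    """Exact Q^n(x) as a product of per-symbol probabilities."""
    p = Fraction(1)
    for symbol in x:
        p *= Q[symbol]
    return p


def probability_via_type(x, Q):
    """exp(-n (H(P_x) + D(P_x || Q))), computed in nats from the type of x."""
    n = len(x)
    P = {a: c / n for a, c in Counter(x).items()}     # type of x on its support
    H = -sum(p * math.log(p) for p in P.values())
    D = sum(p * math.log(p / float(Q[a])) for a, p in P.items())
    return math.exp(-n * (H + D))


Q = {"0": Fraction(1, 3), "1": Fraction(2, 3)}
print(iid_probability("0010", Q), probability_via_type("0010", Q))
```

Any other sequence with three zeros and one 1, such as 1000, gives the same value, since only the type enters the second function.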
1
(2.31)
1 n
6 .
But what if the die is not fair? Consider
\[ Q = \left( \frac{1}{3}, \frac{1}{3}, \frac{1}{6}, \frac{1}{12}, \frac{1}{12}, 0 \right). \tag{2.32} \]
(2.33)
(2.34)
(2.35)
\[ |\mathcal{T}^n(P)| = \frac{n!}{\prod_{a \in \mathcal{X}} \big( n P(a) \big)!}. \tag{2.36} \]
However, this value is very hard to manipulate because of the many factorials inside. So, its bounds turn out to be more useful for our purposes.
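\begin{align}
1 &= \sum_{\mathbf{x} \in \mathcal{X}^n} P^n(\mathbf{x}) \tag{2.37} \\
&\ge \sum_{\mathbf{x} \in \mathcal{T}^n(P)} P^n(\mathbf{x}) \tag{2.38} \\
&= \sum_{\mathbf{x} \in \mathcal{T}^n(P)} e^{-n H(P)} \qquad \text{(Corollary 2.17)} \tag{2.39} \\
&= |\mathcal{T}^n(P)| \, e^{-n H(P)}, \tag{2.40}
\end{align}
i.e.,
\[ |\mathcal{T}^n(P)| \le e^{n H(P)}. \tag{2.41} \]
\[ P^n\big( \mathcal{T}^n(P) \big) \ge P^n\big( \mathcal{T}^n(\tilde{P}) \big), \quad \forall\, \tilde{P} \in \mathcal{P}_n(\mathcal{X}). \tag{2.42} \]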
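\[ P^n(\mathbf{x}) = \prod_{k=1}^{n} P(x_k) = \prod_{a \in \mathrm{supp}(P)} P(a)^{N(a|\mathbf{x})} = \prod_{a \in \mathrm{supp}(P)} P(a)^{n P(a)}. \tag{2.43} \]
Note that the right-hand side (RHS) of (2.43) is independent of $\mathbf{x}$, and holds for all sequences of type $P$. Hence,
\begin{align}
P^n\big( \mathcal{T}^n(P) \big) &= \sum_{\mathbf{x} \in \mathcal{T}^n(P)} P^n(\mathbf{x}) \tag{2.44} \\
&= |\mathcal{T}^n(P)| \prod_{a \in \mathrm{supp}(P)} P(a)^{n P(a)}. \tag{2.45}
\end{align}
Since this holds for every $P$, it must also hold for $\tilde{P}$, i.e., we also have
\[ P^n\big( \mathcal{T}^n(\tilde{P}) \big) = |\mathcal{T}^n(\tilde{P})| \prod_{a \in \mathrm{supp}(\tilde{P})} P(a)^{n \tilde{P}(a)}. \tag{2.46} \]
Now if there exists some $a \in \mathcal{X}$ such that $\tilde{P}(a) > 0$, but $P(a) = 0$, then it follows immediately from (2.46) that $P^n(\mathcal{T}^n(\tilde{P})) = 0$ and (2.42) is trivially satisfied. So we assume that $\mathrm{supp}(\tilde{P}) \subseteq \mathrm{supp}(P)$. Note that for all $a \in \mathrm{supp}(P)$ but $a \notin \mathrm{supp}(\tilde{P})$, we have $P(a) > 0$ and $\tilde{P}(a) = 0$, i.e.,
\[ P(a)^{n \tilde{P}(a)} = 1, \tag{2.47} \]
and therefore
\[ \prod_{a \in \mathrm{supp}(\tilde{P})} P(a)^{n \tilde{P}(a)} = \prod_{a \in \mathrm{supp}(P)} P(a)^{n \tilde{P}(a)}, \qquad \mathrm{supp}(\tilde{P}) \subseteq \mathrm{supp}(P). \tag{2.48} \]
Footnote 1: Note that this proof is almost identical to the proof about the size of a weakly typical set $\mathcal{A}_\epsilon^{(n)}$ in [Mos14, Chapter 19]. There we got $|\mathcal{A}_\epsilon^{(n)}| \le e^{n(H(X) + \epsilon)}$, while here we get a much more precise result without $\epsilon$. So we see that this method is stronger!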
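\begin{align}
\frac{P^n\big( \mathcal{T}^n(\tilde{P}) \big)}{P^n\big( \mathcal{T}^n(P) \big)}
&= \frac{|\mathcal{T}^n(\tilde{P})| \prod_{a \in \mathrm{supp}(\tilde{P})} P(a)^{n \tilde{P}(a)}}{|\mathcal{T}^n(P)| \prod_{a \in \mathrm{supp}(P)} P(a)^{n P(a)}} \tag{2.49} \\
&= \frac{|\mathcal{T}^n(\tilde{P})| \prod_{a \in \mathrm{supp}(P)} P(a)^{n \tilde{P}(a)}}{|\mathcal{T}^n(P)| \prod_{a \in \mathrm{supp}(P)} P(a)^{n P(a)}} \tag{2.50} \\
&= \frac{|\mathcal{T}^n(\tilde{P})|}{|\mathcal{T}^n(P)|} \prod_{a \in \mathrm{supp}(P)} P(a)^{n (\tilde{P}(a) - P(a))} \tag{2.51} \\
&= \frac{n! \,\big/ \prod_{a \in \mathrm{supp}(\tilde{P})} \big( n \tilde{P}(a) \big)!}{n! \,\big/ \prod_{a \in \mathrm{supp}(P)} \big( n P(a) \big)!} \prod_{a \in \mathrm{supp}(P)} P(a)^{n (\tilde{P}(a) - P(a))} \tag{2.52} \\
&= \frac{\prod_{a \in \mathrm{supp}(P)} \big( n P(a) \big)!}{\prod_{a \in \mathrm{supp}(\tilde{P})} \big( n \tilde{P}(a) \big)!} \prod_{a \in \mathrm{supp}(P)} P(a)^{n (\tilde{P}(a) - P(a))} \tag{2.53} \\
&= \frac{\prod_{a \in \mathrm{supp}(P)} \big( n P(a) \big)!}{\prod_{a \in \mathrm{supp}(P)} \big( n \tilde{P}(a) \big)!} \prod_{a \in \mathrm{supp}(P)} P(a)^{n (\tilde{P}(a) - P(a))} \tag{2.54} \\
&= \prod_{a \in \mathrm{supp}(P)} \frac{\big( n P(a) \big)!}{\big( n \tilde{P}(a) \big)!} \, P(a)^{n (\tilde{P}(a) - P(a))}. \tag{2.55}
\end{align}
Here, in (2.50) we have used (2.48); (2.52) follows from the exact formula (2.36) for the size of the type class; and in (2.54) we enlarge the range of the product in the denominator without changing its value because for every added term $a$ we have $(n \tilde{P}(a))! = 0! = 1$.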
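Next we need a small lemma.
Lemma 2.19. For any $m \in \mathbb{N}_0$ and any $n \in \mathbb{N}$, we have
\[ \frac{m!}{n!} \ge n^{m - n}. \tag{2.56} \]
Proof: If $m \ge n$, then
\[ \frac{m!}{n!} = \underbrace{m \cdot (m - 1) \cdot (m - 2) \cdots (n + 1)}_{m - n \text{ terms, each term} \,\ge\, n} \ge \underbrace{n \cdot n \cdots n}_{m - n \text{ terms}} = n^{m - n}. \tag{2.57} \]
If $m < n$, then
\[ \frac{m!}{n!} = \frac{1}{\underbrace{n \cdot (n - 1) \cdot (n - 2) \cdots (m + 1)}_{n - m \text{ terms}}} \ge \frac{1}{\underbrace{n \cdot n \cdots n}_{n - m \text{ terms}}} = \frac{1}{n^{n - m}} = n^{m - n}. \tag{2.58} \]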
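\begin{align}
\frac{P^n\big( \mathcal{T}^n(\tilde{P}) \big)}{P^n\big( \mathcal{T}^n(P) \big)}
&\le \prod_{a \in \mathrm{supp}(P)} \big( n P(a) \big)^{n (P(a) - \tilde{P}(a))} \, P(a)^{n (\tilde{P}(a) - P(a))} \tag{2.59} \\
&= \prod_{a \in \mathrm{supp}(P)} n^{n (P(a) - \tilde{P}(a))} \tag{2.60} \\
&= n^{n \sum_{a \in \mathrm{supp}(P)} (P(a) - \tilde{P}(a))} \tag{2.61} \\
&= n^{n (1 - 1)} = n^0 = 1. \tag{2.62}
\end{align}
Here, we have again made use of our assumption that $\mathrm{supp}(\tilde{P}) \subseteq \mathrm{supp}(P)$. This proves (2.42).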
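Since every sequence $\mathbf{x} \in \mathcal{X}^n$ has exactly one type, summing over all type classes is equivalent to summing over all sequences. Hence,
\begin{align}
1 &= \sum_{\tilde{P} \in \mathcal{P}_n(\mathcal{X})} P^n\big( \mathcal{T}^n(\tilde{P}) \big) \tag{2.63} \\
&\le \sum_{\tilde{P} \in \mathcal{P}_n(\mathcal{X})} P^n\big( \mathcal{T}^n(P) \big) \qquad \text{(by (2.42))} \tag{2.64} \\
&= P^n\big( \mathcal{T}^n(P) \big) \sum_{\tilde{P} \in \mathcal{P}_n(\mathcal{X})} 1 \tag{2.65} \\
&= P^n\big( \mathcal{T}^n(P) \big) \, |\mathcal{P}_n(\mathcal{X})| \tag{2.66} \\
&\le P^n\big( \mathcal{T}^n(P) \big) \, (n + 1)^{|\mathcal{X}|} \qquad \text{(by TT1)} \tag{2.67} \\
&= (n + 1)^{|\mathcal{X}|} \sum_{\mathbf{x} \in \mathcal{T}^n(P)} P^n(\mathbf{x}) \tag{2.68} \\
&= (n + 1)^{|\mathcal{X}|} \sum_{\mathbf{x} \in \mathcal{T}^n(P)} e^{-n H(P)} \tag{2.69} \\
&= (n + 1)^{|\mathcal{X}|} \, |\mathcal{T}^n(P)| \, e^{-n H(P)}, \tag{2.70}
\end{align}
i.e.,
\[ |\mathcal{T}^n(P)| \ge \frac{1}{(n + 1)^{|\mathcal{X}|}} \, e^{n H(P)}. \tag{2.71} \]
With the improved bound (2.14), we get
\[ |\mathcal{T}^n(P)| \ge \frac{1}{(n + 1)^{|\mathcal{X}| - 1}} \, e^{n H(P)}. \tag{2.72} \]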
\[ \frac{1}{n+1} \, e^{n H_b\left(\frac{k}{n}\right)} \le \binom{n}{k} \le e^{n H_b\left(\frac{k}{n}\right)}. \tag{2.74} \]
This gives a very good estimate of the growth rate of the binomial coefficient in $n$.
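The binomial bound can be checked numerically. The following is a minimal sketch, assuming natural logarithms (so that $H_b$ is measured in nats); the helper names `binary_entropy` and `binomial_bounds` are illustrative and do not appear in the notes.

```python
from math import comb, exp, log

def binary_entropy(p):
    """Binary entropy H_b(p) in nats, using the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log(p) - (1 - p) * log(1 - p)

def binomial_bounds(n, k):
    """Return (lower, exact, upper) for the binomial coefficient C(n, k)."""
    upper = exp(n * binary_entropy(k / n))
    lower = upper / (n + 1)
    return lower, comb(n, k), upper

for n, k in [(10, 3), (100, 30), (1000, 300)]:
    lower, exact, upper = binomial_bounds(n, k)
    growth_rate = log(exact) / n  # approaches H_b(k/n) as n grows
    print(n, k, lower <= exact <= upper,
          round(growth_rate, 4), round(binary_entropy(k / n), 4))
```

The last two printed columns illustrate how $\frac{1}{n} \log \binom{n}{k}$ approaches $H_b(k/n)$ when the ratio $k/n$ is kept fixed.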
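Finally, we arrive at a fourth type theorem.
Theorem 2.21 (Type Theorem 4 (TT4)).
For any $Q \in \mathcal{P}(\mathcal{X})$ and a type $P \in \mathcal{P}_n(\mathcal{X})$, we have
\[ \frac{1}{(n+1)^{|\mathcal{X}|}} \, e^{-n D(P \,\|\, Q)} \le Q^n(\mathcal{T}^n(P)) \le e^{-n D(P \,\|\, Q)}. \tag{2.75} \]
Proof: This follows readily from the first three type theorems:
\begin{align}
Q^n(\mathcal{T}^n(P)) &= \sum_{\mathbf{x} \in \mathcal{T}^n(P)} Q^n(\mathbf{x}) \tag{2.76} \\
&= \sum_{\mathbf{x} \in \mathcal{T}^n(P)} e^{-n (H(P) + D(P \,\|\, Q))} \qquad \text{(by TT2)} \tag{2.77} \\
&= |\mathcal{T}^n(P)| \, e^{-n (H(P) + D(P \,\|\, Q))} \tag{2.78} \\
&\le e^{n H(P)} \, e^{-n (H(P) + D(P \,\|\, Q))} \qquad \text{(by TT3)} \tag{2.79} \\
&= e^{-n D(P \,\|\, Q)} \tag{2.80}
\end{align}
and
\begin{align}
Q^n(\mathcal{T}^n(P)) &= \sum_{\mathbf{x} \in \mathcal{T}^n(P)} e^{-n (H(P) + D(P \,\|\, Q))} \tag{2.81} \\
&= |\mathcal{T}^n(P)| \, e^{-n (H(P) + D(P \,\|\, Q))} \tag{2.82} \\
&\ge \frac{1}{(n+1)^{|\mathcal{X}|}} \, e^{n H(P)} \, e^{-n (H(P) + D(P \,\|\, Q))} \qquad \text{(by TT3)} \tag{2.83} \\
&= \frac{1}{(n+1)^{|\mathcal{X}|}} \, e^{-n D(P \,\|\, Q)}. \tag{2.84}
\end{align}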
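For a binary alphabet, the probability of a type class can be computed exactly, so TT4 can be checked directly. The sketch below assumes natural logarithms and a binary PMF $Q = (q, 1-q)$; the function names are hypothetical helpers, not notation from the notes.

```python
from math import comb, exp, log

def kl(p, q):
    """Relative entropy D((p, 1-p) || (q, 1-q)) in nats."""
    total = 0.0
    for pi, qi in ((p, q), (1 - p, 1 - q)):
        if pi > 0:
            total += pi * log(pi / qi)
    return total

def type_class_probability(n, k, q):
    """Exact Q^n(T^n(P)) for the binary type P = (k/n, 1 - k/n)."""
    return comb(n, k) * q**k * (1 - q)**(n - k)

def tt4_bounds(n, k, q, alphabet_size=2):
    """Lower and upper bound of TT4 for the type P = (k/n, 1 - k/n)."""
    upper = exp(-n * kl(k / n, q))
    lower = upper / (n + 1) ** alphabet_size
    return lower, upper

n, q = 50, 0.3
for k in (5, 15, 25, 40):
    lower, upper = tt4_bounds(n, k, q)
    exact = type_class_probability(n, k, q)
    print(k, lower <= exact <= upper, f"{exact:.3e}", f"{upper:.3e}")
```

Types far from $Q$ (for example $k = 40$) have a probability that is dominated by the exponential term, which is exactly the message of TT4.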
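denotes that
\[ \lim_{n \to \infty} \frac{1}{n} \log \frac{f(n)}{g(n)} = 0, \tag{2.85} \]
i.e., $f$ and $g$ have the same exponential growth rate, we can write:
\begin{align}
|\mathcal{P}_n(\mathcal{X})| &\doteq 1; \tag{2.86} \\
Q^n(\mathbf{x}) &= \exp\bigl(-n \bigl(H(P_{\mathbf{x}}) + D(P_{\mathbf{x}} \,\|\, Q)\bigr)\bigr); \tag{2.87} \\
|\mathcal{T}^n(P)| &\doteq \exp\bigl(n H(P)\bigr); \tag{2.88} \\
Q^n(\mathcal{T}^n(P)) &\doteq \exp\bigl(-n D(P \,\|\, Q)\bigr). \tag{2.89}
\end{align}
2.3 Joint Types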
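\[ Q_Z(9) = \frac{1}{2}, \quad Q_Z(15) = \frac{1}{12}, \quad Q_Z(19) = \frac{1}{12} \tag{2.90} \]
and
\[ \mathbf{W} \in \left\{ \begin{pmatrix} 2 \\ 6 \end{pmatrix}, \begin{pmatrix} 3 \\ 5 \end{pmatrix}, \begin{pmatrix} 15 \\ 0 \end{pmatrix}, \begin{pmatrix} 300 \\ 1 \end{pmatrix} \right\} \tag{2.91} \]
with
\[ Q_{\mathbf{W}}(2, 6) = \frac{1}{3}, \quad Q_{\mathbf{W}}(3, 5) = \frac{1}{2}, \quad Q_{\mathbf{W}}(15, 0) = \frac{1}{12}, \quad Q_{\mathbf{W}}(300, 1) = \frac{1}{12}\,? \tag{2.92} \]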
Well, they do take on different values, but the probabilities are identical, and
therefore the uncertainty of Z and W are also identical!
(2.93)
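Example 2.24. We continue with Example 2.2. We have $\mathcal{I}(11, 23 \,|\, \mathbf{x}, \mathbf{y}) = \{3\}$ such that $N(11, 23 \,|\, \mathbf{x}, \mathbf{y}) = 1$. On the other hand, $\mathcal{I}(15, 24 \,|\, \mathbf{x}, \mathbf{y}) = \emptyset$ and $N(15, 24 \,|\, \mathbf{x}, \mathbf{y}) = 0$.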
(2.96)
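$(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n.$
(2.97)
Note that $Q^n$ still denotes the product distribution: The pair of sequences $(\mathbf{X}, \mathbf{Y})$ are assumed to be pairwise IID, i.e., while the components $X_k$ and $Y_k$ depend on each other, they are independent of the past and the future.
TT3 states that for any $P \in \mathcal{P}_n(\mathcal{X} \times \mathcal{Y})$,
\[ \frac{1}{(n+1)^{|\mathcal{X}| |\mathcal{Y}|}} \, e^{n H(P)} \le |\mathcal{T}^n(P)| \le e^{n H(P)}. \tag{2.98} \]
Finally, TT4 now reads as follows: For any joint distribution $Q \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$ and for a joint type $P \in \mathcal{P}_n(\mathcal{X} \times \mathcal{Y})$, we have
\[ \frac{1}{(n+1)^{|\mathcal{X}| |\mathcal{Y}|}} \, e^{-n D(P \,\|\, Q)} \le Q^n(\mathcal{T}^n(P)) \le e^{-n D(P \,\|\, Q)}. \tag{2.99} \]
2.4 Conditional Types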
=
= Px,y (a, b).
n
N(ax)
n
(2.101)
(2.102)
\[ \mathbf{x} = (2, 2, 2, 3, 3, 3, 4, 3, 2, 3). \tag{2.103} \]
Then we see that $P_{\mathbf{y}|\mathbf{x}}(1|2) = \frac{3}{4}$ because there are 4 positions where $x_k = 2$, three of which have a corresponding $y_k = 1$. On the other hand, $P_{\mathbf{y}|\mathbf{x}}(\cdot|1)$ is not defined, because the symbol 1 does not show up in the given sequence $\mathbf{x}$.
We also generalize the definitions of set of types and type class to this conditional situation.
Definition 2.28. The set of conditional probability distributions that can be conditional types for a length-$n$ sequence from alphabet $\mathcal{Y}$ given a length-$n$ sequence from alphabet $\mathcal{X}$ is denoted by $\mathcal{P}_n(\mathcal{Y}|\mathcal{X})$. The set that contains all such conditional distributions (type or not) is denoted by $\mathcal{P}(\mathcal{Y}|\mathcal{X})$.
For a conditional distribution $P_{Y|X} \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ and a given sequence $\mathbf{x} \in \mathcal{X}^n$, the conditional type class of $P_{Y|X}$ is defined as
\[ \mathcal{T}^n(P_{Y|X} \,|\, \mathbf{x}) \triangleq \bigl\{ \mathbf{y} \in \mathcal{Y}^n : P_{\mathbf{y}|\mathbf{x}} = P_{Y|X} \bigr\}. \tag{2.104} \]
Be aware of the notation used: $P_{Y|X}$ and $P_{\mathbf{y}|\mathbf{x}}$ are both conditional distributions, but the latter is defined via the occurrences of symbols in $(\mathbf{x}, \mathbf{y})$.
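Theorem 2.29 (Conditional Type Theorem 1 (CTT1)).
The number of conditional types is bounded as follows:
\[ |\mathcal{P}_n(\mathcal{Y}|\mathcal{X})| \le (n+1)^{|\mathcal{X}| |\mathcal{Y}|}. \tag{2.105} \]
Proof: For a given $a \in \mathcal{X}$ we already know that there are at most $(n+1)^{|\mathcal{Y}|}$ ways of choosing $P_{\mathbf{y}|\mathbf{x}}(\cdot|a)$. Now we have $|\mathcal{X}|$ ways of choosing $a$, i.e., think of $P_{\mathbf{y}|\mathbf{x}}(\cdot|\cdot)$ as a matrix with at most
\[ \bigl( (n+1)^{|\mathcal{Y}|} \bigr)^{|\mathcal{X}|} \tag{2.106} \]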
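where
\[ H_{P_{\mathbf{x}}}(P_{\mathbf{y}|\mathbf{x}}) \triangleq \sum_{a \in \mathcal{X}} P_{\mathbf{x}}(a) \, H\bigl(P_{\mathbf{y}|\mathbf{x}}(\cdot|a)\bigr), \tag{2.107} \]
\[ D_{P_{\mathbf{x}}}\bigl(P_{\mathbf{y}|\mathbf{x}} \,\big\|\, Q_{Y|X}\bigr) \triangleq \sum_{a \in \mathcal{X}} P_{\mathbf{x}}(a) \, D\bigl(P_{\mathbf{y}|\mathbf{x}}(\cdot|a) \,\big\|\, Q_{Y|X}(\cdot|a)\bigr). \tag{2.108} \]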
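\begin{align}
Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) &= \prod_{k=1}^n Q_{Y|X}(y_k|x_k) \tag{2.109} \\
&= \prod_{(a,b) \in \operatorname{supp}(P_{\mathbf{x},\mathbf{y}})} Q_{Y|X}(b|a)^{N(a,b|\mathbf{x},\mathbf{y})} \tag{2.110} \\
&= \prod_{(a,b) \in \operatorname{supp}(P_{\mathbf{x},\mathbf{y}})} Q_{Y|X}(b|a)^{n P_{\mathbf{x},\mathbf{y}}(a,b)} \tag{2.111} \\
&= \prod_{(a,b) \in \operatorname{supp}(P_{\mathbf{x},\mathbf{y}})} \exp\bigl( n P_{\mathbf{x},\mathbf{y}}(a,b) \log Q_{Y|X}(b|a) \bigr) \tag{2.112} \\
&= \exp\Biggl( n \sum_{(a,b) \in \operatorname{supp}(P_{\mathbf{x},\mathbf{y}})} P_{\mathbf{x},\mathbf{y}}(a,b) \log Q_{Y|X}(b|a) \Biggr) \tag{2.113} \\
&= \exp\Biggl( -n \sum_{(a,b) \in \operatorname{supp}(P_{\mathbf{x},\mathbf{y}})} P_{\mathbf{x},\mathbf{y}}(a,b) \biggl[ \log \frac{P_{\mathbf{y}|\mathbf{x}}(b|a)}{Q_{Y|X}(b|a)} + \log \frac{1}{P_{\mathbf{y}|\mathbf{x}}(b|a)} \biggr] \Biggr) \tag{2.114} \\
&= \exp\Biggl( -n \sum_{a \in \operatorname{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a) \sum_{b \in \operatorname{supp}(P_{\mathbf{y}|\mathbf{x}}(\cdot|a))} P_{\mathbf{y}|\mathbf{x}}(b|a) \log \frac{P_{\mathbf{y}|\mathbf{x}}(b|a)}{Q_{Y|X}(b|a)} \\
&\qquad\qquad - n \sum_{a \in \operatorname{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a) \sum_{b \in \operatorname{supp}(P_{\mathbf{y}|\mathbf{x}}(\cdot|a))} P_{\mathbf{y}|\mathbf{x}}(b|a) \log \frac{1}{P_{\mathbf{y}|\mathbf{x}}(b|a)} \Biggr) \tag{2.115} \\
&= \exp\Bigl( -n \bigl( D_{P_{\mathbf{x}}}(P_{\mathbf{y}|\mathbf{x}} \,\|\, Q_{Y|X}) + H_{P_{\mathbf{x}}}(P_{\mathbf{y}|\mathbf{x}}) \bigr) \Bigr). \tag{2.116}
\end{align}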
where as above
\[ H_{P_{\mathbf{x}}}(P_{Y|X}) \triangleq \sum_{a \in \mathcal{X}} P_{\mathbf{x}}(a) \, H\bigl(P_{Y|X}(\cdot|a)\bigr). \tag{2.117} \]
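Proof: The proof relies heavily on TT3. Fix $a \in \mathcal{X}$ with $N(a|\mathbf{x}) > 0$, take a sequence $\mathbf{y} \in \mathcal{T}^n(P_{Y|X} \,|\, \mathbf{x})$, and only consider those components of $\mathbf{y}$ that have a corresponding component $a$ in the sequence $\mathbf{x}$, i.e., consider $\mathbf{y}_{\mathcal{I}(a|\mathbf{x})}$. This subsequence has length $N(a|\mathbf{x})$ and its type is by definition
\[ P_{\mathbf{y}_{\mathcal{I}(a|\mathbf{x})}}(\cdot) = P_{\mathbf{y}|\mathbf{x}}(\cdot|a) = P_{Y|X}(\cdot|a). \tag{2.118} \]
Hence, we look at the type class of all length-$N(a|\mathbf{x})$ sequences with type $P_{Y|X}(\cdot|a)$. From TT3 we know that
\[ \frac{1}{(N(a|\mathbf{x}) + 1)^{|\mathcal{Y}|}} \, e^{N(a|\mathbf{x}) H(P_{Y|X}(\cdot|a))} \le \bigl| \mathcal{T}^{N(a|\mathbf{x})}\bigl(P_{Y|X}(\cdot|a)\bigr) \bigr| \le e^{N(a|\mathbf{x}) H(P_{Y|X}(\cdot|a))}. \tag{2.119} \]
To get the size of the total type class, we have to run through all possible $a \in \mathcal{X}$ and generate every possible sequence $\mathbf{y}$ by taking all possible combinations of components of each subtype class, i.e., we have to compute the product of the sizes of each subtype class.²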
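Example 2.32. Let $\mathcal{X} = \{0, 1, 2\}$, $\mathcal{Y} = \{3, 4\}$, and
\[ \mathbf{y} = (3, 4, 4, 4, 3, 4), \tag{2.120} \]
\[ \mathbf{x} = (0, 1, 1, 0, 1, 2). \tag{2.121} \]
Then we have (rows indexed by $b \in \{3, 4\}$, columns by $a \in \{0, 1, 2\}$)
\[ P_{\mathbf{y}|\mathbf{x}} = \begin{pmatrix} 1/2 & 1/3 & 0 \\ 1/2 & 2/3 & 1 \end{pmatrix}. \tag{2.122} \]
For $a = 0$: $\mathbf{y}_{\mathcal{I}(0|\mathbf{x})} = (3, 4)$, which is of type $\bigl(\frac{1}{2}, \frac{1}{2}\bigr)$ (compare with first column in (2.122)). In total there are 2 sequences of this type.
For $a = 1$: $\mathbf{y}_{\mathcal{I}(1|\mathbf{x})} = (4, 4, 3)$, which is of type $\bigl(\frac{1}{3}, \frac{2}{3}\bigr)$ (compare with second column in (2.122)). In total there are 3 sequences of this type.
For $a = 2$: $\mathbf{y}_{\mathcal{I}(2|\mathbf{x})} = (4)$, which is of type $(0, 1)$ (compare with third column in (2.122)). In total there is only 1 sequence of this type.
So, in total there are $2 \cdot 3 \cdot 1 = 6$ different choices for $\mathbf{y}$ having the same conditional type.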
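The count in this example can be confirmed by brute force. The sketch below enumerates all $\mathbf{y}' \in \mathcal{Y}^6$ and keeps those with the same conditional type given $\mathbf{x}$; the function names are illustrative, and the exhaustive search is only feasible for very short sequences.

```python
from collections import Counter
from itertools import product

def conditional_type(x, y):
    """Return {a: {b: P_{y|x}(b|a)}} for every symbol a that occurs in x."""
    joint = Counter(zip(x, y))
    counts_x = Counter(x)
    cond = {}
    for (a, b), n_ab in joint.items():
        cond.setdefault(a, {})[b] = n_ab / counts_x[a]
    return cond

def conditional_type_class(x, y, alphabet_y):
    """Brute-force list of all sequences with the same conditional type given x."""
    target = conditional_type(x, y)
    return [candidate for candidate in product(alphabet_y, repeat=len(x))
            if conditional_type(x, candidate) == target]

x = (0, 1, 1, 0, 1, 2)
y = (3, 4, 4, 4, 3, 4)
print(conditional_type(x, y))
print(len(conditional_type_class(x, y, (3, 4))))
```

The size of the class factorizes over the symbols of $\mathbf{x}$, which is the product structure used in the proof above.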
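\begin{align}
\bigl| \mathcal{T}^n(P_{Y|X} \,|\, \mathbf{x}) \bigr|
&= \prod_{\substack{a \in \mathcal{X} \\ N(a|\mathbf{x}) > 0}} \bigl| \mathcal{T}^{N(a|\mathbf{x})}\bigl(P_{Y|X}(\cdot|a)\bigr) \bigr| \tag{2.123} \\
&\ge \prod_{\substack{a \in \mathcal{X} \\ N(a|\mathbf{x}) > 0}} \frac{1}{(\underbrace{N(a|\mathbf{x})}_{\le n} + 1)^{|\mathcal{Y}|}} \, e^{N(a|\mathbf{x}) H(P_{Y|X}(\cdot|a))} \qquad \text{(by (2.119))} \tag{2.124} \\
&\ge \prod_{\substack{a \in \mathcal{X} \\ N(a|\mathbf{x}) > 0}} \frac{1}{(n+1)^{|\mathcal{Y}|}} \, e^{N(a|\mathbf{x}) H(P_{Y|X}(\cdot|a))} \tag{2.125} \\
&\ge \prod_{a \in \mathcal{X}} \frac{1}{(n+1)^{|\mathcal{Y}|}} \, e^{N(a|\mathbf{x}) H(P_{Y|X}(\cdot|a))} \tag{2.126} \\
&= \left( \frac{1}{(n+1)^{|\mathcal{Y}|}} \right)^{|\mathcal{X}|} e^{n \sum_{a \in \mathcal{X}} \frac{N(a|\mathbf{x})}{n} H(P_{Y|X}(\cdot|a))} \tag{2.127} \\
&= \frac{1}{(n+1)^{|\mathcal{X}| |\mathcal{Y}|}} \, e^{n \sum_{a \in \mathcal{X}} P_{\mathbf{x}}(a) H(P_{Y|X}(\cdot|a))}. \tag{2.128}
\end{align}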
(2.129)
(2.130)
yT n (PY X x)
(2.131)
=e
n DPx (PY X k QY X )
(2.132)
(2.133)
2.5 Remarks on Notation
We all know that notation always is messy. And since we are now trying to bring some kind of logic into the notation used in this script, we merely increase the chance to mess everything up even more... In the end, we won't get around trying to understand what the statements actually mean!
We try to clearly distinguish between constant and random quantities. The basic rule here is
capital letter $X$: random,
small letter $x$: deterministic. (2.134)
Vectors are set in bold:
bold capital letter $\mathbf{X}$: random vector,
bold small letter $\mathbf{x}$: deterministic vector. (2.135)
(In handwriting bold is usually replaced by underline: $\underline{X}$ and $\underline{x}$.) There are a few exceptions to this rule. Certain deterministic quantities are very standard in capital letters, so, to distinguish them from random variables, we use a different font. For example, the capacity is denoted by $\mathsf{C}$ (in contrast to a random variable $C$). Sets are denoted using a calligraphic font: $\mathcal{F}$. So, if $X$ is a random variable (RV), then the alphabet of $X$ is denoted by $\mathcal{X}$:
\[ X \in \mathcal{X}. \tag{2.136} \]
I like this notation very much; unfortunately, however, it clashes with the notation used for types. So we will try to follow some slightly adapted rules:
• $P$ and $Q$ both can denote some specific PMF. This means that here we think of $P$ and $Q$ to be fixed distributions and then we define random variables having this distribution, e.g., $X, Y \sim Q$ and $Z \sim P$. This is in contrast to the $P$ in $P_X$ and $P_Y$ that is only generic and means something different depending on the particular subscript!
• When referring to the PMF of a RV $X$, we will exclusively use $Q_X$ (and not $P_X$!).
• $P_{\mathbf{x}}$ denotes the type of sequence $\mathbf{x}$. It is particularly important to note that $P_{\mathbf{X}}$ is the type of the random sequence $\mathbf{X}$, i.e., it is a random empirical distribution and not the PMF of the random vector $\mathbf{X}$ (which would be stated as $Q_{\mathbf{X}}$).
• Note that we try to use $P$ for PMFs in the situation where the PMF actually can be seen as a type for some sequence:
  $P$: a PMF that could also be a type for some sequence,
  $Q$: a general PMF.
So, since $\mathcal{P}(\mathcal{X})$ denotes the set of all possible PMFs on a finite alphabet $\mathcal{X}$ and $\mathcal{P}_n(\mathcal{X})$ denotes the set of all possible types with denominator $n$, we usually have
\[ Q \in \mathcal{P}(\mathcal{X}), \tag{2.137} \]
\[ P \in \mathcal{P}_n(\mathcal{X}). \tag{2.138} \]
(2.139)
Qn (x) =
xF
n
XY
xF k=1
Q(xk ).
(2.140)
Chapter 3
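Now, it turns out that this question can be posed in very compact form using types. Note that
\[ \frac{1}{n} \sum_{k=1}^n x_k = \frac{1}{n} \sum_{a \in \mathcal{X}} a \, N(a|\mathbf{x}) = \sum_{a \in \mathcal{X}} a \, P_{\mathbf{x}}(a) = \mathsf{E}_{P_{\mathbf{x}}}[X]. \tag{3.3} \]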
aX
(3.4)
k=1
F , Q P(X ) : EQ [X]
.
4
(3.6)
So, the problem under consideration is to find the probability that a random sequence $\mathbf{X}$ has a type far away from the expected type, or in other words, what is the probability that $P_{\mathbf{X}} \in \mathcal{F}$, where $\mathcal{F}$ denotes a set of nontypical types.
3.1 Sanov's Theorem
(3.7)
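If in addition the set $\mathcal{F}$ is "nice" in the sense that there exists a sequence $\{P_n \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})\}$ of types in $\mathcal{F}$ such that
\[ \lim_{n \to \infty} D(P_n \,\|\, Q) = \inf_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q), \tag{3.8} \]
then
\[ \varlimsup_{n \to \infty} -\frac{1}{n} \log Q^n\bigl(\mathcal{T}^n(\mathcal{F})\bigr) \le \inf_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q), \tag{3.9} \]
i.e., we have
\[ \lim_{n \to \infty} -\frac{1}{n} \log Q^n\bigl(\mathcal{T}^n(\mathcal{F})\bigr) = \inf_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q). \tag{3.10} \]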
(3.11)
min
P F Pn (X )
D(P kQ).
(3.12)
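Now, since the type class of any PMF that is not a type is empty by definition, we have
\begin{align}
Q^n\bigl(\mathcal{T}^n(\mathcal{F})\bigr) &= Q^n\bigl(\mathcal{T}^n(\mathcal{F} \cap \mathcal{P}_n(\mathcal{X}))\bigr) \tag{3.13} \\
&= \sum_{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} Q^n\bigl(\mathcal{T}^n(P)\bigr) \tag{3.14} \\
&\le \sum_{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} e^{-n D(P \,\|\, Q)} \qquad \text{(by TT4)} \tag{3.15} \\
&\le \sum_{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} \max_{P' \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} e^{-n D(P' \,\|\, Q)} \tag{3.16} \\
&= \sum_{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} e^{-n \min_{P' \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} D(P' \,\|\, Q)} \tag{3.17} \\
&= e^{-n \min_{P' \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} D(P' \,\|\, Q)} \, |\mathcal{F} \cap \mathcal{P}_n(\mathcal{X})| \tag{3.18} \\
&\le e^{-n \min_{P' \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} D(P' \,\|\, Q)} \, |\mathcal{P}_n(\mathcal{X})| \qquad \text{(enlarging set)} \tag{3.19} \\
&\le (n+1)^{|\mathcal{X}|} \, e^{-n \min_{P' \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} D(P' \,\|\, Q)} \qquad \text{(by TT1)}. \tag{3.20}
\end{align}
Figure 3.1: Sanov's Theorem. The triangle depicts the set $\mathcal{P}(\mathcal{X})$ of all PMFs, the shaded area is the subset $\mathcal{F}$, and $Q$ is a given PMF. By $Q^*$, we denote the PMF in $\mathcal{F}$ (or at least on the boundary of $\mathcal{F}$) that is closest to $Q$, where relative entropy is used as a "distance measure."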
Since
D(P kQ) =
min
P F Pn (X )
QF
(3.21)
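\begin{align}
Q^n\bigl(\mathcal{T}^n(\mathcal{F})\bigr) = \sum_{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} Q^n\bigl(\mathcal{T}^n(P)\bigr)
&\ge \sum_{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})} \frac{1}{(n+1)^{|\mathcal{X}|}} \, e^{-n D(P \,\|\, Q)} \qquad \text{(by TT4)} \tag{3.24} \\
&\ge \frac{1}{(n+1)^{|\mathcal{X}|}} \, e^{-n D(P_n \,\|\, Q)}, \tag{3.25}
\end{align}
where in the last inequality we have dropped all but one term in the sum.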
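Hence,
\begin{align}
\varlimsup_{n \to \infty} -\frac{1}{n} \log Q^n\bigl(\mathcal{T}^n(\mathcal{F})\bigr)
&\le \varlimsup_{n \to \infty} \left( \frac{|\mathcal{X}| \log(n+1)}{n} + D(P_n \,\|\, Q) \right) \tag{3.26} \\
&= \lim_{n \to \infty} D(P_n \,\|\, Q) \tag{3.27} \\
&= \inf_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q), \tag{3.28}
\end{align}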
(3.29)
QF
lim
(3.30)
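and therefore
\begin{align}
\inf_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q)
&\le \varliminf_{n \to \infty} -\frac{1}{n} \log Q^n\bigl(\mathcal{T}^n(\mathcal{F})\bigr) \tag{3.31} \\
&\le \varlimsup_{n \to \infty} -\frac{1}{n} \log Q^n\bigl(\mathcal{T}^n(\mathcal{F})\bigr) \tag{3.32} \\
&\le \inf_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q), \tag{3.33}
\end{align}
where we have also made use of the fact that $\varliminf \le \varlimsup$ by definition. Hence, the limit exists and is equal to $\inf_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q)$.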
\[ Q^* \triangleq \operatorname*{argmin}_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q). \tag{3.34} \]
However, this is strictly speaking not possible since we do not assume that $\mathcal{F}$ is finite and it therefore might be that the argmin does not exist.
Moreover, the second half of the theorem is claimed to be true not for "nice" sets $\mathcal{F}$, but rather for sets $\mathcal{F}$ that are open subsets of $\mathcal{P}(\mathcal{X})$. However, it is nowhere mentioned how such an open subset of $\mathcal{P}(\mathcal{X})$ is supposed to be defined. Note that the term "open" only makes sense if we can define a neighborhood around any member of the set. To do so, we need a distance measure, but unfortunately $D(\cdot \,\|\, \cdot)$ is not a distance measure! The situation might be saved if one considers the normal Euclidean distance and every PMF as a $|\mathcal{X}|$-dimensional vector. However, in any case, the statement and its proof are not clean.
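Example 3.4. Suppose we have a fair coin and want to estimate the probability of observing 700 or more heads in a series of 1000 tosses. So we want to know the probability of the set of all sequences with 700 or more heads, i.e., all sequences with a type
\[ P = \left( \frac{k}{1000}, \frac{1000 - k}{1000} \right) \tag{3.35} \]
with $k \in \{700, 701, \ldots, 1000\}$. Let
\[ \mathcal{F} \triangleq \bigl\{ \tilde{Q} \in \mathcal{P}(\mathcal{X}) : \tilde{Q} = (p, 1-p) \text{ with } 0.7 \le p \le 1 \bigr\} \tag{3.36} \]
and $Q = \bigl( \frac{1}{2}, \frac{1}{2} \bigr)$. Now,
\begin{align}
\inf_{\tilde{Q} \in \mathcal{F}} D(\tilde{Q} \,\|\, Q)
&= \inf_{0.7 \le p \le 1} \left\{ p \log \frac{p}{1/2} + (1-p) \log \frac{1-p}{1/2} \right\} \tag{3.37} \\
&= \inf_{0.7 \le p \le 1} \bigl\{ -H_b(p) + \log 2 \bigr\} \tag{3.38} \\
&= \log 2 - H_b(0.7),
\end{align}
where the last step holds because $H_b(p)$ is decreasing in $p$ on $[0.7, 1]$.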
Note that in this example $\mathcal{F}$ is very decent and it is easy to find a sequence $\{P_n \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X})\}$ that achieves the infimum, because
\[ Q^* = (0.7, 0.3) \in \mathcal{P}_n(\mathcal{X}). \tag{3.42} \]
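The exponent of this coin example can be compared with the exact tail probability. The following is a small sketch that works in bits (base-2 logarithms, an assumption made here for convenience) and uses exact integer arithmetic for the binomial tail; the helper name `binary_kl_bits` is illustrative.

```python
from math import comb, log

def binary_kl_bits(p, q):
    """Relative entropy D((p, 1-p) || (q, 1-q)) in bits."""
    total = 0.0
    for pi, qi in ((p, q), (1 - p, 1 - q)):
        if pi > 0:
            total += pi * log(pi / qi, 2)
    return total

n, threshold = 1000, 700

# Sanov exponent: the infimum over F is attained at p = 0.7.
exponent = binary_kl_bits(0.7, 0.5)

# Exact probability of 700 or more heads is tail / 2^n; the integer sum
# avoids floating-point underflow.
tail = sum(comb(n, k) for k in range(threshold, n + 1))
exact_rate = 1 - log(tail, 2) / n  # equals -(1/n) log2 Pr[700 or more heads]

print(round(exponent, 4), round(exact_rate, 4))
```

For finite $n$ the exact rate is slightly larger than the Sanov exponent; the gap is of order $\frac{\log n}{n}$ and vanishes as $n$ grows.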
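Example 3.5. Suppose we toss a fair die $n$ times. What is the probability that the average of the tosses is greater than or equal to 4?
We recall that the average $\frac{1}{n} \sum_{k=1}^n x_k$ is the same as the expectation of the type $\mathsf{E}_{P_{\mathbf{x}}}[X]$. For example, if $\mathbf{x} = (4, 5, 1, 6, 5, 6, 6, 5, 4, 5, 6)$ ($n = 11$), then the average is about 4.82, the type is
\[ P_{\mathbf{x}} = \left( \frac{1}{11}, \frac{0}{11}, \frac{0}{11}, \frac{2}{11}, \frac{4}{11}, \frac{4}{11} \right), \tag{3.43} \]
and hence
\[ \mathsf{E}_{P_{\mathbf{x}}}[X] = \frac{1}{11} \cdot 1 + \frac{2}{11} \cdot 4 + \frac{4}{11} \cdot 5 + \frac{4}{11} \cdot 6 \approx 4.82. \tag{3.44} \]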
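The identity between the sample average and the expectation under the type is easy to reproduce. The sketch below uses exact fractions; `empirical_type` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

def empirical_type(x, alphabet):
    """Type P_x of the sequence x, returned as exact fractions N(a|x)/n."""
    counts = Counter(x)
    n = len(x)
    return {a: Fraction(counts[a], n) for a in alphabet}

x = (4, 5, 1, 6, 5, 6, 6, 5, 4, 5, 6)
P_x = empirical_type(x, range(1, 7))

sample_mean = Fraction(sum(x), len(x))
type_mean = sum(a * P_x[a] for a in P_x)

print(P_x)
print(sample_mean, type_mean, float(type_mean))
```

Both ways of computing the average give the same rational number, because the type only records how often each symbol occurs and the average does not depend on the order of the tosses.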
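So, we define
\[ \mathcal{F} \triangleq \left\{ \tilde{Q} : \sum_{i=1}^6 i \, \tilde{Q}(i) \ge 4 \right\} \tag{3.45} \]
and we need to minimize $D(\tilde{Q} \,\|\, Q)$ over all $\tilde{Q} \in \mathcal{F}$ for the given distribution $Q = (1/6, \ldots, 1/6)$.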
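Since this problem shows up so often, we will now generalize it and solve it once in general form so that in future we can directly refer to this solution.
We are interested in the event that the sample average of $g(X)$ for some function $g(\cdot)$ is greater than some value $\alpha$:
\[ \frac{1}{n} \sum_{k=1}^n g(X_k) \ge \alpha. \tag{3.46} \]
From the discussion above, we know that this event is equivalent to the event $\{P_{\mathbf{X}} \in \mathcal{F}\}$ where
\[ \mathcal{F} \triangleq \left\{ \tilde{Q} \in \mathcal{P}(\mathcal{X}) : \sum_{a \in \mathcal{X}} g(a) \tilde{Q}(a) \ge \alpha \right\}. \tag{3.47} \]
Even more general, we may be interested in the event that $J$ different such sample averages are larger than some given thresholds:
\[ \left\{ \frac{1}{n} \sum_{k=1}^n g_j(X_k) \ge \alpha_j, \; j = 1, 2, \ldots, J \right\}, \tag{3.48} \]
\[ \mathcal{F} \triangleq \left\{ \tilde{Q} : \sum_{a \in \mathcal{X}} g_j(a) \tilde{Q}(a) \ge \alpha_j, \; j = 1, 2, \ldots, J \right\}. \tag{3.49} \]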
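\[ L(\tilde{Q}) = \sum_a \tilde{Q}(a) \log \frac{\tilde{Q}(a)}{Q(a)} + \sum_{j=1}^J \lambda_j \left( \sum_a \tilde{Q}(a) g_j(a) - \alpha_j \right) + \mu \left( \sum_a \tilde{Q}(a) - 1 \right) \tag{3.50} \]
\begin{align}
\frac{\partial L}{\partial \tilde{Q}(a)} &= \log \frac{\tilde{Q}(a)}{Q(a)} + \tilde{Q}(a) \cdot \frac{Q(a)}{\tilde{Q}(a)} \cdot \frac{1}{Q(a)} + \sum_{j=1}^J \lambda_j g_j(a) + \mu \tag{3.51} \\
&= \log \tilde{Q}(a) - \log Q(a) + 1 + \mu + \sum_{j=1}^J \lambda_j g_j(a) = 0, \tag{3.52}
\end{align}
\[ \tilde{Q}(a) = Q(a) \, e^{-1 - \mu - \sum_{j=1}^J \lambda_j g_j(a)}, \tag{3.53} \]
i.e.,
\[ Q^*(a) = \frac{Q(a) \, e^{-\sum_{j=1}^J \lambda_j g_j(a)}}{\sum_{a'} Q(a') \, e^{-\sum_{j=1}^J \lambda_j g_j(a')}} \tag{3.55} \]
where the $\lambda_j$ are chosen such that
\[ \sum_{a \in \mathcal{X}} Q^*(a) g_j(a) = \alpha_j \tag{3.56} \]
for all $j = 1, \ldots, J$.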
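In our example (3.45) this reads
\[ Q^*(x) = \frac{\frac{1}{6} e^{-\lambda x}}{\sum_{a=1}^6 \frac{1}{6} e^{-\lambda a}} = \frac{e^{-\lambda x}}{\sum_{a=1}^6 e^{-\lambda a}} \tag{3.57} \]
and
\[ \sum_{x=1}^6 x \, Q^*(x) = 4. \tag{3.58} \]
(3.59)
(3.60)
Hence, if $n = 10\,000$, then the probability that the average is larger than or equal to 4 is about
\[ 2^{-624} \approx 1.4 \cdot 10^{-188}. \tag{3.61} \]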
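The minimizing distribution for the die example can be found numerically. The sketch below parametrizes the exponentially tilted PMF as proportional to $e^{s x}$, where the real parameter `s` is a name chosen here for illustration (it plays the role of the Lagrange multiplier, up to the sign convention), and solves the mean constraint (3.58) by bisection. The relative entropy is computed in bits.

```python
from math import exp, log

FACES = range(1, 7)

def tilted(s):
    """Exponentially tilted version of the uniform die PMF, proportional to exp(s * x)."""
    weights = [exp(s * x) for x in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def mean(pmf):
    return sum(x * p for x, p in zip(FACES, pmf))

def solve_tilt(target_mean, lo=0.0, hi=5.0, iterations=100):
    """Bisection on s; the mean of the tilted PMF is increasing in s."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean(tilted(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

s = solve_tilt(4.0)
q_star = tilted(s)
d_bits = sum(p * log(6 * p, 2) for p in q_star)  # D(Q* || uniform) in bits

print([round(p, 4) for p in q_star])
print(round(d_bits, 4), round(10_000 * d_bits))
```

Multiplying the resulting relative entropy by the number of tosses gives the exponent of the probability estimate to first order in the exponent.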
Example 3.6. Let's reconsider Example 3.5 and only make a very tiny change: Instead of asking what the probability is of seeing a sample average larger than or equal to 4, we now ask what is the probability of seeing a sample average of exactly 4:
\[ \tilde{\mathcal{F}} \triangleq \left\{ \tilde{Q} : \sum_{i=1}^6 i \, \tilde{Q}(i) = 4 \right\}. \tag{3.62} \]
Note that if this choice is not possible, then some of the values $\alpha_{j'}$ are too loose in the sense that if all constraints $\frac{1}{n} \sum_{k=1}^n g_j(X_k) \ge \alpha_j$, $j \ne j'$, are satisfied, then the $j'$th constraint is automatically satisfied. These superfluous constraints must then be removed from the problem and the problem solved again without them.
It's quite straightforward to see that the answer is completely identical to the answer given in Example 3.5. How can that be?!?
First of all, let's be clear that the probability of $\{P_{\mathbf{X}} \in \mathcal{F}\}$ given in (3.45) is not the same as the probability of $\{P_{\mathbf{X}} \in \tilde{\mathcal{F}}\}$ given in (3.62). It only could be the same if
\[ \Pr\left[ P_{\mathbf{X}} \in \left\{ \tilde{Q} : \sum_{i=1}^6 i \, \tilde{Q}(i) > 4 \right\} \right] = 0, \tag{3.63} \]
3.2
We know that the relative entropy is not a distance measure. We will see now that it actually behaves like a squared distance and thereby satisfies the Pythagorean Theorem for triangles.
Before we state the theorem, let's quickly recall the definition of convexity.
Definition 3.7. A set $\mathcal{F} \subseteq \mathcal{P}(\mathcal{X})$ is convex if from $Q_1, Q_2 \in \mathcal{F}$ it follows that
\[ Q_\lambda \triangleq \lambda Q_1 + (1 - \lambda) Q_2 \in \mathcal{F}, \qquad \forall \lambda \in [0, 1]. \tag{3.65} \]
(3.66)
QF
F.
Q
(3.67)
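way from $\tilde{Q}$ via the detour of $Q^*$ to $Q$. What we have here is the behavior of squared distances.
Recall the simple example of vectors in the three-dimensional Euclidean space. We project an arbitrary vector $\mathbf{q} \in \mathbb{R}^3$ onto a given plane. Then the projection point $\mathbf{q}^*$ is that point of the plane that has shortest Euclidean distance to $\mathbf{q}$:
\[ \mathbf{q}^* = \operatorname*{argmin}_{\tilde{\mathbf{q}} \in \text{plane}} \|\tilde{\mathbf{q}} - \mathbf{q}\|. \tag{3.68} \]
See Figure 3.2 for an illustration. In this case we have a triangle with a 90-degree angle and therefore we know that the Pythagorean identity holds:
\[ \|\tilde{\mathbf{q}} - \mathbf{q}^*\|^2 + \|\mathbf{q}^* - \mathbf{q}\|^2 = \|\tilde{\mathbf{q}} - \mathbf{q}\|^2, \qquad \forall \tilde{\mathbf{q}} \in \text{plane}. \tag{3.69} \]
(3.70)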
Think of open versus closed sets, even though these terms are not well defined because $D(\cdot \,\|\, \cdot)$ is not a distance measure.
3 Compare this definition to the definition of a plane in $\mathbb{R}^3$:
\[ \mathcal{F}_{\text{plane}} \triangleq \bigl\{ \tilde{\mathbf{q}} \in \mathbb{R}^3 : \langle \tilde{\mathbf{q}}, \mathbf{v} \rangle = \alpha \bigr\} \]
for some vector $\mathbf{v}$ and some fixed value $\alpha$. Here, $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors.
Figure 3.3: A point below the projection plane.
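for some function $f(\cdot)$ and some fixed value $\alpha$. It can be shown that the following holds:
\[ D(\tilde{Q} \,\|\, Q^*) + D(Q^* \,\|\, Q) = D(\tilde{Q} \,\|\, Q), \qquad \forall \tilde{Q} \in \mathcal{F}_{\text{plane}}. \tag{3.72} \]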
D(Q
F.
Q
(3.73)
define $\mathcal{F}' \triangleq \mathcal{F} \cup \{Q^*\}$. Note that $\mathcal{F}'$ still is convex because $\mathcal{F}$ is convex and $Q^*$ must be on the boundary of $\mathcal{F}$.
(3.74)
(3.75)
inf QF
0 D(QkQ)!). Hence,
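\begin{align}
0 &\le \left. \frac{\partial D(\lambda)}{\partial \lambda} \right|_{\lambda = 0} \tag{3.76} \\
&= \left. \frac{\partial}{\partial \lambda} \sum_a \bigl( \lambda \tilde{Q}(a) + (1 - \lambda) Q^*(a) \bigr) \log \frac{\lambda \tilde{Q}(a) + (1 - \lambda) Q^*(a)}{Q(a)} \right|_{\lambda = 0} \tag{3.77} \\
&= \left. \sum_a \bigl( \tilde{Q}(a) - Q^*(a) \bigr) \log \frac{\lambda \tilde{Q}(a) + (1 - \lambda) Q^*(a)}{Q(a)} \right|_{\lambda = 0} \\
&\quad + \left. \sum_a \bigl( \lambda \tilde{Q}(a) + (1 - \lambda) Q^*(a) \bigr) \cdot \frac{Q(a)}{\lambda \tilde{Q}(a) + (1 - \lambda) Q^*(a)} \cdot \frac{\tilde{Q}(a) - Q^*(a)}{Q(a)} \right|_{\lambda = 0} \tag{3.78} \\
&= \sum_a \tilde{Q}(a) \log \left( \frac{Q^*(a)}{Q(a)} \cdot \frac{\tilde{Q}(a)}{\tilde{Q}(a)} \right) - \sum_a Q^*(a) \log \frac{Q^*(a)}{Q(a)} + \underbrace{\sum_a \tilde{Q}(a)}_{= 1} - \underbrace{\sum_a Q^*(a)}_{= 1} \tag{3.79} \\
&= D(\tilde{Q} \,\|\, Q) - D(\tilde{Q} \,\|\, Q^*) - D(Q^* \,\|\, Q). \tag{3.80}
\end{align}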
Also note that one-sidedness makes sure that $Q^*$ is unique! The reason is that if there exists a $Q^{*\prime}$ that also achieves the infimum in (3.66), then $Q^{*\prime}$ must be on the "wrong" side of the tangential plane through $Q^*$, i.e., some part of $\mathcal{F}$ will be on the same side of the tangential plane as $Q$. Hence, this set cannot be one-sided. See Figure 3.6 for an illustration.
While the intuitive meaning of one-sidedness is quite clear, it turns out to be quite difficult to describe this family of sets in a mathematically clean way.
Figure 3.5: A one-sided set $\mathcal{F}$ with respect to $Q$. We also show its tangential plane. Note that this set is not one-sided with respect to $Q'$.
+ Q Q2 QQ
.
QQ
(3.81)
QF
(3.82)
We say that $\mathcal{F}$ is one-sided with respect to $Q$ if for all $\tilde{Q} \in \mathcal{F}$ it is true that
\[ D(\tilde{Q} \,\|\, Q^*) + D(Q^* \,\|\, Q) \le D(\tilde{Q} \,\|\, Q). \tag{3.83} \]
3.3
Sometimes, one needs a proper distance measure between PMFs and therefore cannot rely on the relative entropy. One possible such measure is the variational distance. This measure was introduced in [Mos14, Section 1.3] and some relations to entropy were discussed in [Mos14, Appendix 1.B].
Definition 3.10. The variational distance (sometimes also called $L_1$-distance) between any two probability distributions $Q_1$ and $Q_2$ is defined as
\[ V(Q_1, Q_2) \triangleq \sum_{a \in \mathcal{X}} |Q_1(a) - Q_2(a)|. \tag{3.84} \]
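We will now investigate the variational distance further and deepen our understanding. Let
\[ \mathcal{M} \triangleq \{ a \in \mathcal{X} : Q_1(a) > Q_2(a) \} \tag{3.85} \]
such that
\[ \max_{\mathcal{A} \subseteq \mathcal{X}} \bigl( Q_1(\mathcal{A}) - Q_2(\mathcal{A}) \bigr) = Q_1(\mathcal{M}) - Q_2(\mathcal{M}) \tag{3.86} \]
(because to maximize $Q_1(\mathcal{A}) - Q_2(\mathcal{A})$ you choose all $a \in \mathcal{X}$ for which $Q_1(a) > Q_2(a)$). Now, by definition we have
\begin{align}
V(Q_1, Q_2) &= \sum_{a \in \mathcal{X}} |Q_1(a) - Q_2(a)| \tag{3.87} \\
&= \sum_{a \in \mathcal{M}} \bigl( Q_1(a) - Q_2(a) \bigr) + \sum_{a \in \mathcal{M}^c} \bigl( Q_2(a) - Q_1(a) \bigr) \tag{3.88} \\
&= Q_1(\mathcal{M}) - Q_2(\mathcal{M}) + Q_2(\mathcal{M}^c) - Q_1(\mathcal{M}^c) \tag{3.89} \\
&= Q_1(\mathcal{M}) - Q_2(\mathcal{M}) + \bigl( 1 - Q_2(\mathcal{M}) \bigr) - \bigl( 1 - Q_1(\mathcal{M}) \bigr) \tag{3.90} \\
&= 2 \bigl( Q_1(\mathcal{M}) - Q_2(\mathcal{M}) \bigr). \tag{3.91}
\end{align}
Hence we have
\[ Q_1(\mathcal{M}) - Q_2(\mathcal{M}) = \frac{1}{2} V(Q_1, Q_2). \tag{3.92} \]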
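The relation (3.92) can be verified by brute force on a small alphabet. The sketch below represents PMFs as dictionaries and maximizes $Q_1(\mathcal{A}) - Q_2(\mathcal{A})$ over all subsets $\mathcal{A}$; the two example PMFs are arbitrary choices made for illustration.

```python
from itertools import combinations

def variational_distance(q1, q2):
    """V(Q1, Q2) = sum over a of |Q1(a) - Q2(a)| for PMFs on the same alphabet."""
    return sum(abs(q1[a] - q2[a]) for a in q1)

def max_set_difference(q1, q2):
    """Brute-force maximum of Q1(A) - Q2(A) over all subsets A of the alphabet."""
    symbols = list(q1)
    best = 0.0  # the empty set gives 0
    for size in range(1, len(symbols) + 1):
        for subset in combinations(symbols, size):
            best = max(best, sum(q1[a] - q2[a] for a in subset))
    return best

q1 = {"a": 0.5, "b": 0.2, "c": 0.2, "d": 0.1}
q2 = {"a": 0.1, "b": 0.3, "c": 0.2, "d": 0.4}

M = [a for a in q1 if q1[a] > q2[a]]  # the maximizing set from (3.85)
print(M, variational_distance(q1, q2), 2 * max_set_difference(q1, q2))
```

The exhaustive search over subsets is exponential in the alphabet size, whereas the set $\mathcal{M}$ gives the maximizer directly.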
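Figure 3.7: Representation of two PMFs $Q_1$ and $Q_2$: the width of the columns is chosen to be 1 so that area corresponds to probability. The set $\mathcal{M}$ collects all $a \in \mathcal{X}$ for which $Q_1(a) > Q_2(a)$. The figure marks the areas I, II, and III and the set $\mathcal{M}^c$.
To understand this relationship better consider the example shown in Figure 3.7. Here all $a \in \mathcal{M}$ (i.e., those $a$ for which $Q_1(a) > Q_2(a)$) are on the left, and it holds that
\[ \text{area I} = Q_1(\mathcal{M}) - Q_2(\mathcal{M}). \tag{3.93} \]
But since the areas I and II also can be written as
\begin{align}
\text{area I} &= \int_{\mathcal{X}} \mathrm{d}Q_1 - \text{area III} \tag{3.94} \\
&= 1 - \text{area III}, \tag{3.95} \\
\text{area II} &= \int_{\mathcal{X}} \mathrm{d}Q_2 - \text{area III} \tag{3.96} \\
&= 1 - \text{area III}, \tag{3.97}
\end{align}
we see that area I and area II must always be equally large. In other words,
\[ Q_1(\mathcal{M}) - Q_2(\mathcal{M}) = Q_2(\mathcal{M}^c) - Q_1(\mathcal{M}^c). \tag{3.98} \]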
Since this argument does not use any fact particular to this example, (3.98)
actually holds in general.
In the literature, the quantity (3.86) is known as total variation distance, even though at first sight its definition looks quite different:
Definition 3.11. The total variation distance between any two probability distributions $Q_1$ and $Q_2$ is defined as
\[ V_{\text{tot}}(Q_1, Q_2) \triangleq \max_{\mathcal{A} \subseteq \mathcal{X}} |Q_1(\mathcal{A}) - Q_2(\mathcal{A})|. \tag{3.99} \]
(3.102)
\[ \frac{1}{2} V(Q_1, Q_2). \tag{3.103} \]
We now come to the main result of this section: a relation between the variational distance and the relative entropy.
Theorem 3.13 (Pinsker Inequality [Pin60] [Csi84]).
For two PMFs $Q_1, Q_2 \in \mathcal{P}(\mathcal{X})$, we have
\[ D(Q_1 \,\|\, Q_2) \ge \frac{1}{2} V^2(Q_1, Q_2) \log e. \tag{3.104} \]
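A quick numerical sanity check of the Pinsker Inequality, assuming natural logarithms so that $\log e = 1$. The PMFs are drawn at random with a fixed seed; the helper names and the alphabet size 4 are arbitrary choices for this sketch.

```python
import random
from math import log

def kl_nats(q1, q2):
    """D(Q1 || Q2) in nats; assumes Q2(a) > 0 wherever Q1(a) > 0."""
    return sum(p * log(p / q) for p, q in zip(q1, q2) if p > 0)

def variational_distance(q1, q2):
    return sum(abs(p - q) for p, q in zip(q1, q2))

def random_pmf(size, rng):
    """Random PMF with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
pairs = [(random_pmf(4, rng), random_pmf(4, rng)) for _ in range(10_000)]

# Smallest observed value of D(Q1 || Q2) - (1/2) V^2(Q1, Q2).
worst_gap = min(kl_nats(q1, q2) - 0.5 * variational_distance(q1, q2) ** 2
                for q1, q2 in pairs)
print(worst_gap)
```

A random search of this kind cannot prove the inequality, but a negative gap (beyond rounding error) would immediately expose a mistake in the constants.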
P1 (0) = 1 p,
P2 (0) = 1 q,
(3.105)
(3.106)
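\begin{align}
V^2(P_1, P_2) &= \left( \sum_{a \in \{0, 1\}} |P_1(a) - P_2(a)| \right)^2 \tag{3.107} \\
&= \bigl( |p - q| + |(1 - p) - (1 - q)| \bigr)^2 \tag{3.108} \\
&= \bigl( |p - q| + |p - q| \bigr)^2 \tag{3.109} \\
&= 4 (p - q)^2. \tag{3.110}
\end{align}
We define
\begin{align}
g(p, q) &\triangleq D(P_1 \,\|\, P_2) - \frac{1}{2} V^2(P_1, P_2) \log e \tag{3.111} \\
&= p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} - 2 (p - q)^2 \log e \tag{3.112}
\end{align}
and note that
\[ g(p, p) = 0 \tag{3.113} \]
and that
\[ \frac{\partial g(p, q)}{\partial q} = \frac{\overbrace{(q - p)}^{\le 0} \, \overbrace{\bigl( 1 - 4 q (1 - q) \bigr)}^{\ge 1 - 4 \cdot \frac{1}{2} \cdot \frac{1}{2} = 0}}{q (1 - q)} \log e \le 0, \qquad \text{for } q \le p. \tag{3.114} \]
Hence, we have
\[ g(p, q) \ge 0, \qquad q \le p, \tag{3.115} \]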
(3.116)
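\[ P_2(1) = \sum_{a \in \mathcal{M}} Q_2(a) = Q_2(\mathcal{M}) \triangleq q, \qquad P_2(0) = 1 - q. \tag{3.119} \]
\begin{align}
&= P_1(\{1\}) - P_2(\{1\}) \tag{3.121} \\
&= Q_1(\mathcal{M}) - Q_2(\mathcal{M}) \tag{3.122} \\
&= V_{\text{tot}}(Q_1, Q_2), \tag{3.123}
\end{align}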
3.4
(DPI)
(3.124)
(3.125)
(3.126)
(by (3.123))
(3.127)
(3.128)
a X.
(3.129)
Proof: We know from TT2 that all sequences of a given type have the same probability. Hence, conditionally on $P_{\mathbf{X}} = P$, all possible sequences are equally likely. Moreover, since $P_{\mathbf{X}} = P$, we also know that there are $nP(a)$ positions $k$ where $X_k = a$. Hence, the probability that the first position is $a$ is exactly $\frac{nP(a)}{n} = P(a)$.
Example 3.15. Consider $\mathcal{X} = \{a, b\}$, $n = 3$, and $P = \bigl( \frac{1}{3}, \frac{2}{3} \bigr)$. Note that $Q$ does not matter. Given that $P_{\mathbf{X}} = P$, it follows that
\[ \mathbf{X} = a\,b\,b \quad \text{or} \quad \mathbf{X} = b\,a\,b \quad \text{or} \quad \mathbf{X} = b\,b\,a \tag{3.130} \]
where all three possibilities are equally likely. Hence, we have a $\frac{1}{3}$ chance that $X_1 = a$ and a $\frac{2}{3}$ chance that $X_1 = b$.
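The claim that $Q$ does not matter can be checked by enumerating all sequences of the given type. The following sketch uses exact fractions; the function name and the two choices of $Q$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def first_symbol_given_type(n, type_counts, q):
    """Pr[X_1 = a | P_X = P] by brute force for IID X_k ~ q.

    type_counts maps each symbol a to N(a|x) = n * P(a);
    q maps each symbol to its probability (as a Fraction).
    """
    symbols = list(type_counts)
    total = Fraction(0)
    by_first = {a: Fraction(0) for a in symbols}
    for seq in product(symbols, repeat=n):
        if all(seq.count(a) == type_counts[a] for a in symbols):
            prob = Fraction(1)
            for s in seq:
                prob *= q[s]
            total += prob
            by_first[seq[0]] += prob
    return {a: by_first[a] / total for a in symbols}

# Example 3.15: P = (1/3, 2/3) with n = 3, for two different choices of Q.
counts = {"a": 1, "b": 2}
for q in ({"a": Fraction(1, 2), "b": Fraction(1, 2)},
          {"a": Fraction(9, 10), "b": Fraction(1, 10)}):
    print(first_symbol_given_type(3, counts, q))
```

Both choices of $Q$ lead to the same conditional distribution of $X_1$, because every sequence in the type class has the same probability and this common factor cancels in the conditional probability.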
QF
(3.131)
(3.132)
where
\[ \delta(\epsilon, n) \triangleq 2 \sqrt{\frac{\epsilon}{\log e}} + (n+1)^{2|\mathcal{X}|} \, e^{-n\epsilon}. \]
(3.133)
(3.134)
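\[ \mathcal{A} \triangleq \mathcal{S}_{D^* + 2\epsilon} \cap \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}), \tag{3.135} \]
and
\[ \mathcal{B} \triangleq \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}) \setminus \mathcal{A}, \tag{3.136} \]
i.e., $\mathcal{A}$ is the set of types in $\mathcal{F}$ that achieve a relative entropy close to $D^*$, and $\mathcal{B}$ are all other types in $\mathcal{F}$, see Figure 3.8.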
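We start by showing that with high probability the type of $\mathbf{X}$ is in $\mathcal{A}$. We have
\begin{align}
Q^n(\mathcal{T}^n(\mathcal{B})) &= \sum_{\substack{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}) \\ D(P \| Q) > D^* + 2\epsilon}} Q^n(\mathcal{T}^n(P)) \tag{3.137} \\
&\le \sum_{\substack{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}) \\ D(P \| Q) > D^* + 2\epsilon}} e^{-n D(P \,\|\, Q)} \qquad \text{(by TT4)} \tag{3.138} \\
&< \sum_{\substack{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}) \\ D(P \| Q) > D^* + 2\epsilon}} e^{-n (D^* + 2\epsilon)} \tag{3.139} \\
&\le \sum_{P \in \mathcal{P}_n(\mathcal{X})} e^{-n (D^* + 2\epsilon)} \tag{3.140} \\
&\le (n+1)^{|\mathcal{X}|} \, e^{-n (D^* + 2\epsilon)} \qquad \text{(by TT1)} \tag{3.141}
\end{align}
Figure 3.8: Illustration of the set $\mathcal{A}$ defined in (3.135). The triangle depicts the set $\mathcal{P}(\mathcal{X})$ of all PMFs, the lightly shaded area is the subset $\mathcal{F}$, and the darkly shaded area is $\mathcal{A}$.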
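and
\begin{align}
Q^n(\mathcal{T}^n(\mathcal{A})) &\ge Q^n(\mathcal{T}^n(\mathcal{A} \cap \mathcal{S}_{D^* + \epsilon})) \qquad \text{(reduce set)} \tag{3.142} \\
&= Q^n(\mathcal{T}^n(\mathcal{S}_{D^* + \epsilon} \cap \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}))) \tag{3.143} \\
&= \sum_{\substack{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}) \\ D(P \| Q) \le D^* + \epsilon}} Q^n(\mathcal{T}^n(P)) \tag{3.144} \\
&\ge \sum_{\substack{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}) \\ D(P \| Q) \le D^* + \epsilon}} \frac{1}{(n+1)^{|\mathcal{X}|}} \, e^{-n D(P \,\|\, Q)} \qquad \text{(by TT4)} \tag{3.145} \\
&\ge \sum_{\substack{P \in \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}) \\ D(P \| Q) \le D^* + \epsilon}} \frac{1}{(n+1)^{|\mathcal{X}|}} \, e^{-n (D^* + \epsilon)} \qquad (D(P \,\|\, Q) \le D^* + \epsilon) \tag{3.146} \\
&\ge \frac{e^{-n (D^* + \epsilon)}}{(n+1)^{|\mathcal{X}|}}. \tag{3.147}
\end{align}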
Hence,
Qn (T n (B F))
Qn (T n (F))
n
Q (T n (B))
= n n
Q (T (F))
Pr[PX B  PX F] =
(3.149)
Qn (T n (B))
Qn (T n (A))
(A F)
(3.150)
(3.151)
= (n + 1)2X  en .
(3.152)
Therefore, since A and B are disjoint and their union contains all types in F,
we have
Pr[PX A PX F] = 1 Pr[PX B  PX F]
> 1 (n + 1)
2X  n
(3.153)
(3.154)
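Next we show that all types in $\mathcal{A}$ are close to $Q^*$. For this we need to rely on the Pythagorean Theorem (Theorem 3.8). For any $P \in \mathcal{A}$ we have
\begin{align}
D^* + 2\epsilon &\ge D(P \,\|\, Q) \qquad \text{(by definition of } \mathcal{A}\text{)} \tag{3.155} \\
&\ge D(P \,\|\, Q^*) + D(Q^* \,\|\, Q) \tag{3.156} \\
&= D(P \,\|\, Q^*) + D^*, \tag{3.157}
\end{align}
i.e.,
\[ D(P \,\|\, Q^*) \le 2\epsilon. \tag{3.158} \]
By the Pinsker Inequality (Theorem 3.13), we therefore have
\[ \frac{1}{2} V^2(P, Q^*) \log e \le D(P \,\|\, Q^*) \le 2\epsilon, \tag{3.159} \]
i.e.,
\[ V(P, Q^*) \le 2 \sqrt{\frac{\epsilon}{\log e}}. \tag{3.160} \]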
2
PX A = V (PX , Q )
,
log e
(3.161)
Hence we have
PX F Pr[PX A PX F]
Pr V (PX , Q )
log e
> 1 (n + 1)2X  en .
Next we note that by definition (3.160) can be written as
X
2
P (a) Q (a)
,
log e
aX
(3.162)
(3.163)
(3.164)
2
,
P (a) Q (a)
log e
Hence, if V (P, Q )
2 ,
log e
a X.
(3.165)
then
2
2
Q (a)
P (a) Q (a) +
,
log e
log e
a X.
(3.166)
P F Pn (X )
X
P F Pn (X )
(3.168)
P (a) Pr[PX = P  PX F]
(3.169)
P F Pn (X)
V (P,Q ) 2loge
X
P F Pn (X)
V (P,Q ) 2loge
P (a) Pr[PX = P  PX F]
(3.170)
2
Q (a)
Pr[PX = P  PX F]
log e
(3.171)
2
= Q (a)
log e
X
P F Pn (X)
V (P,Q ) 2loge
Pr[PX = P PX F]
(3.172)
2
2
Pr V (PX , Q )
P
F
(3.173)
X
log e
log e
2
(3.174)
1 (n + 1)2X  en
log e
2
2
Q (a) (n + 1)2X  en +
(n + 1)2X  en
= Q (a)
log e  {z }
log e

{z
}
1
= Q (a)
> Q (a)
2
Q (a)
(n + 1)2X  en .
log e
(3.175)
(3.176)
Here, (3.167) follows from the Total Probability Theorem; in (3.169) we use
Lemma 3.14; the inequality (3.170) follows by constraining the sum; (3.171)
follows from (3.166); and in (3.174) we use (3.163).
The derivation of the upper bound is very similar. We start with (3.169):
X
P (a) Pr[PX = P  PX F]
(3.177)
Pr[X1 = a PX F] =
P F Pn (X )
P F Pn (X)
V (P,Q ) 2loge
P F Pn (X)
V (P,Q )> 2loge
2
Q (a) +
log e
X
P F Pn (X)
V (P,Q ) 2loge
2
,
Q (a) +
log e
Pr[PX = P PX F]
{z
(3.179)
}
(3.180)
X
P F Pn (X)
V (P,Q )> 2loge
Pr[PX = P PX F]
2
= Pr V (PX , Q ) >
P
F
X
log e
2
PX F
= 1 Pr V (PX , Q )
log e
< (n + 1)2X  en .
(3.181)
(3.182)
(3.183)
(3.184)
Figure 3.9: An example of a locally one-sided set $\mathcal{F}$.
that not every set is locally one-sided, as can be seen from the counterexample shown in Figure 3.10.
To make sure that we do not get into trouble, we in addition also require that $Q^*$ is unique (this might not be the case anymore if $\mathcal{F}$ is only locally one-sided and not properly one-sided!).
We now reformulate the Conditional Limit Theorem and get the following more general version.
Figure 3.10: An example of a set $\mathcal{F}$ that is not locally one-sided with respect to $Q$. The shaded area shows the part of $\mathcal{A}$ that is on the wrong side of the tangential plane.
QF
(3.185)
(3.186)
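and for an arbitrary $\epsilon > 0$, we define the set $\mathcal{A}$ of types in $\mathcal{F}$ that achieve a relative entropy close to $D^*$,
\[ \mathcal{A} \triangleq \mathcal{S}_{D^* + 2\epsilon} \cap \mathcal{F} \cap \mathcal{P}_n(\mathcal{X}), \tag{3.187} \]
(3.188)
(3.189)
where
\[ \delta(\epsilon, n) \triangleq 2 \sqrt{\frac{\epsilon}{\log e}} + (n+1)^{2|\mathcal{X}|} \, e^{-n\epsilon}. \tag{3.190} \]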
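Proof: The proof is identical to the proof of Theorem 3.16 with the only difference that instead of relying on the Pythagorean Theorem (Theorem 3.8), we invoke the assumption (3.188).
Example 3.19. Consider $\{X_k\}$ IID $\sim Q$ and $\alpha > \mathsf{E}_Q[X^2]$ fixed. Then it follows from Theorem 3.18 that
\[ \Pr\left[ X_1 = a \,\middle|\, \frac{1}{n} \sum_{k=1}^n X_k^2 \ge \alpha \right] \xrightarrow{n \to \infty} Q^*(a) \tag{3.191} \]
where $Q^*$ minimizes $D(\tilde{Q} \,\|\, Q)$ over all $\tilde{Q}$ that satisfy $\mathsf{E}_{\tilde{Q}}[X^2] \ge \alpha$ (note that we have $\mathcal{F} = \{ \tilde{Q} \in \mathcal{P}(\mathcal{X}) : \mathsf{E}_{\tilde{Q}}[X^2] \ge \alpha \}$). From (3.55) we know that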
2
(3.192)
Gaussian!
\[ \Pr[X_1 = a_1, X_2 = a_2, \ldots, X_m = a_m \,|\, P_{\mathbf{X}} \in \mathcal{F}] \xrightarrow{n \to \infty} \prod_{k=1}^m Q^*(a_k), \tag{3.193} \]
where $m \in \mathbb{N}$ is fixed.
We omit the proof, but hope that this result is quite intuitive. Let's just discuss the case $m = 2$. Obviously, the Conditional Limit Theorem does not depend on whether we regard $X_1$ or $X_2$. So the only new part in Corollary 3.20 is the conditional independence between $X_1$ and $X_2$ when $n$ tends to infinity. This can be understood by noting that $X_1$ and $X_2$ are dependent due to the given structure of the type of the sequence. However, the longer we make the sequence, the weaker this dependence becomes.
Note that Corollary 3.20 strongly relies on the assumption that $m$ is fixed. In particular, $m$ must not grow with $n$. Once again, this is obvious because given the type of a sequence, the last component can be determined from the previous components, i.e., they are not independent and can therefore not have a product distribution.
Example 3.21. Let's continue with Example 3.5. We know that
\[ Q^*(a) = Q(a) \cdot \text{max-entropy PMF for given mean} = \frac{1}{c} \, e^{\lambda x}. \tag{3.194} \]
So, given that the average of a series of dice throws is larger than 4, the first couple of throws look as if they were IID according to an exponential distribution.
Chapter 4
Strong Typicality
We have introduced types in Chapter 2 and already seen in Chapter 3 how
the concept can be very useful in proofs. We will further rely on types in
this class, for example in the study of error exponents or universal source
coding, however, most of the time we will use a slightly simpler tool: typical
sets. The idea of typical sets is to merge several nearby types and their type classes together, so that we do not need to bother about the exact form of the distributions, i.e., we do not need to worry whether a PMF is a type ($\in \mathcal{P}_n(\mathcal{X})$) or not. Since types are "dense" in the set of PMFs¹ (similar to $\mathbb{Q}$ being dense in $\mathbb{R}$), for any PMF there is always a type close by.
There are two types of typical sets: weakly typical sets and strongly typical sets. We have defined weakly typical sets in the first course [Mos14, Chapter 19]. The biggest advantage of weakly typical sets is that they are easily generalized to continuous-alphabet random variables, while this is not possible for strongly typical sets and types, which require a finite alphabet. However, weakly typical sets are difficult to teach because they are not really very intuitive.
In this course we are going to talk about strongly typical sets that turn
out to be much more intuitive. They provide the main tool that we will
need to prove almost all results discussed in this course. Actually, strongly
typical sets are a more powerful tool than weakly typical sets because they
merge fewer types together (therefore the names!). Note, however, that the
strongest results are proven using types and not typical sets.
We would like to point out an unfortunate misunderstanding: Due to their
name it is tempting to think that strongly typical sets rely on the strong law
of large numbers (Theorem 1.18), while weakly typical sets only require the
weak law of large numbers (Theorem 1.17). This is not true: Both concepts,
strong and weak typicality, only require the weak law of large numbers.
As mentioned, strong typicality only works with finite discrete alphabets,
so all our results are restricted to such cases. One exception, however, shall be
1 We put "dense" in quotation marks here because we have not properly defined what we mean by it. Since we will never need to rely on this concept in any proof, we will leave it undefined, but hope that it illuminates the basic idea anyway.
4.1
i 0 if 0,
Note that the order matters. We always let $n$ go to infinity first and only afterwards make $\epsilon$ small.
Moreover, one particular $\epsilon$ and one particular $\delta$ show up so often that we name them. We define
\[ \epsilon_m(Q) \triangleq -\epsilon \log Q_{\min} \tag{4.2} \]
where $Q_{\min}$ denotes the smallest positive value of $Q$. Here "m" stands for "minimum" and we do not especially point out the implicit dependence of $\epsilon_m$ on $\epsilon$. Note that $\epsilon_m(Q) > 0$ because $\log Q_{\min} < 0$.
We also define
\[ \delta_t(n, \epsilon, \mathcal{X}) \triangleq (n+1)^{|\mathcal{X}|} \exp\left( -n \frac{\epsilon^2}{2 |\mathcal{X}|^2} \log e \right) \tag{4.3} \]
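a) \[ e^{-n (H(Q) + D(Q \,\|\, \tilde{Q}) + \epsilon_m(\tilde{Q}))} < \tilde{Q}^n(\mathbf{x}) < e^{-n (H(Q) + D(Q \,\|\, \tilde{Q}) - \epsilon_m(\tilde{Q}))}, \tag{4.4} \]
b) \[ e^{-n (H(Q) + \epsilon_m(Q))} < Q^n(\mathbf{x}) < e^{-n (H(Q) - \epsilon_m(Q))}. \tag{4.5} \]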
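\[ \epsilon_1 \triangleq \epsilon_m(Q) + \epsilon_m(\tilde{Q}). \tag{4.7} \]
\[ < \tilde{Q}^n\bigl( \mathcal{A}_\epsilon^{(n)}(Q) \bigr) < e^{-n (D(Q \,\|\, \tilde{Q}) - \epsilon_1)}. \tag{4.9} \]
$\mathcal{A}_\epsilon^{(n)}(Q)$, (4.10)
is bounded as
b) \[ 1 - \delta_t(n, \epsilon, \mathcal{X}) \le Q^n\bigl( \mathcal{A}_\epsilon^{(n)}(Q) \bigr) \le 1. \tag{4.11} \]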
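Remark 4.4. We would like to point out that those bounds that contain a factor $(1 - \delta_t)$ in front of the exponential term can be rewritten with the $(1 - \delta_t)$-factor incorporated into the exponent and combined with the $\epsilon$. However, since the $\delta_t$ needs to have $n$ tending to infinity before we can make $\epsilon$ small, the $\epsilon$ in the exponent must be changed to a $\delta$. For example, in TA2 we have
\begin{align}
\bigl( 1 - \delta_t(n, \epsilon, \mathcal{X}) \bigr) \, e^{n (H(Q) - \epsilon_m(Q))} &= e^{n \left( H(Q) - \epsilon_m(Q) + \frac{1}{n} \log(1 - \delta_t(n, \epsilon, \mathcal{X})) \right)} \tag{4.12} \\
&= e^{n (H(Q) - \delta)} \tag{4.13}
\end{align}
with
\[ \delta \triangleq \epsilon_m(Q) - \frac{1}{n} \log\bigl( 1 - \delta_t(n, \epsilon, \mathcal{X}) \bigr). \tag{4.14} \]
Note that $\delta \to \epsilon_m(Q)$ for $n \to \infty$.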
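Proof of Theorem 4.3: We start with TA1a: Let
\[ \mathcal{X}' \triangleq \mathcal{X} \setminus \bigl\{ a \in \mathcal{X} : \tilde{Q}(a) = 0 \bigr\}. \tag{4.15} \]
Since $\mathbf{x}$ is strongly typical with respect to $Q$, we have for every $a$
\[ P_{\mathbf{x}}(a) > Q(a) - \frac{\epsilon}{|\mathcal{X}|}. \tag{4.16} \]
Then
\begin{align}
\tilde{Q}^n(\mathbf{x}) &= \prod_{k=1}^n \tilde{Q}(x_k) \tag{4.17} \\
&= \prod_{a \in \mathcal{X}'} \tilde{Q}(a)^{N(a|\mathbf{x})} \tag{4.18} \\
&= \prod_{a \in \mathcal{X}'} \tilde{Q}(a)^{n P_{\mathbf{x}}(a)} \tag{4.19} \\
&< \prod_{a \in \mathcal{X}'} \tilde{Q}(a)^{n \left( Q(a) - \frac{\epsilon}{|\mathcal{X}|} \right)} \qquad \text{(by (4.16))} \tag{4.20} \\
&= \exp\left( \sum_{a \in \mathcal{X}'} n \left( Q(a) - \frac{\epsilon}{|\mathcal{X}|} \right) \log \tilde{Q}(a) \right) \tag{4.21} \\
&= \exp\left( -n \left( \sum_{a \in \mathcal{X}'} Q(a) \log \frac{Q(a)}{\tilde{Q}(a)} + \sum_{a \in \mathcal{X}'} Q(a) \log \frac{1}{Q(a)} + \frac{\epsilon}{|\mathcal{X}|} \sum_{a \in \mathcal{X}'} \log \tilde{Q}(a) \right) \right) \tag{4.22} \\
&\le \exp\left( -n \left( D(Q \,\|\, \tilde{Q}) + H(Q) + \epsilon \frac{|\mathcal{X}'|}{|\mathcal{X}|} \log \tilde{Q}_{\min} \right) \right) \tag{4.23} \\
&\le \exp\Bigl( -n \bigl( D(Q \,\|\, \tilde{Q}) + H(Q) - \epsilon_m(\tilde{Q}) \bigr) \Bigr). \tag{4.24}
\end{align}
Here, in the last inequality, we used that $|\mathcal{X}'| \le |\mathcal{X}|$ and applied (4.2). Note that we have to use $\mathcal{X}'$ in (4.18) because otherwise we might get to an expression like $0^0$, which is not defined.
The lower bound follows similarly.
TA1b follows from TA1a by choosing $\tilde{Q} = Q$.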
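Next we turn to TA3b and define
\[ \mathcal{F} \triangleq \bigl\{ P_{\mathbf{x}} \in \mathcal{P}_n(\mathcal{X}) : \mathbf{x} \notin \mathcal{A}_\epsilon^{(n)}(Q) \bigr\} \tag{4.25} \]
to be the set of all types of all nontypical sequences. Then for any $P \in \mathcal{F}$ there must exist some $a \in \mathcal{X}$ such that
\[ |P(a) - Q(a)| \ge \frac{\epsilon}{|\mathcal{X}|} \tag{4.26} \]
or
\[ P(a) > 0 \quad \text{but} \quad Q(a) = 0, \tag{4.27} \]
because otherwise the conditions in (4.1) are satisfied and the corresponding sequence $\mathbf{x}$ with this type $P_{\mathbf{x}} = P$ would be typical. Hence,
\begin{align}
D(P \,\|\, Q) &\ge \frac{1}{2} V^2(P, Q) \log e \tag{4.28} \\
&= \frac{1}{2} \left( \sum_{a \in \mathcal{X}} |P(a) - Q(a)| \right)^2 \log e \tag{4.29} \\
&\ge \frac{1}{2} \bigl( P(a) - Q(a) \bigr)^2 \log e \qquad \text{(drop terms)} \tag{4.30} \\
&\ge \frac{\epsilon^2}{2 |\mathcal{X}|^2} \log e. \qquad \text{(by (4.26))} \tag{4.31}
\end{align}
Here in (4.30) we drop all terms in the sum apart from that particular $a$ that satisfies (4.26). Note that if (4.27) holds instead of (4.26), then $D(P \,\|\, Q) = \infty$ and (4.31) is true trivially.
Now, it follows from Sanov's Theorem (Theorem 3.1) that
\begin{align}
Q^n(\mathcal{T}^n(\mathcal{F})) &\le (n+1)^{|\mathcal{X}|} \exp\left( -n \min_{P \in \mathcal{F}} D(P \,\|\, Q) \right) \tag{4.32} \\
&\le (n+1)^{|\mathcal{X}|} \exp\left( -n \frac{\epsilon^2}{2 |\mathcal{X}|^2} \log e \right) \qquad \text{(by (4.31))} \tag{4.33} \\
&= \delta_t(n, \epsilon, \mathcal{X}). \tag{4.34}
\end{align}
Hence,
\[ Q^n\bigl( \mathcal{A}_\epsilon^{(n)}(Q) \bigr) = 1 - Q^n(\mathcal{T}^n(\mathcal{F})) \ge 1 - \delta_t(n, \epsilon, \mathcal{X}). \tag{4.35} \]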
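The bound (4.35) can be compared with the exact probability of the typical set for a binary source. The sketch below makes two assumptions that are not spelled out in this part of the text: the typicality condition is taken to be $|P_{\mathbf{x}}(a) - Q(a)| < \epsilon / |\mathcal{X}|$ for all $a$ (as suggested by (4.26)), and logarithms are natural, so $\log e = 1$. Probabilities are evaluated in the log domain via `lgamma` to avoid underflow for large $n$.

```python
from math import exp, lgamma, log

def typical_set_probability(n, q, eps):
    """Exact Q^n of the strongly typical set for the binary PMF Q = (q, 1-q), 0 < q < 1."""
    total = 0.0
    for k in range(n + 1):              # k = number of occurrences of the first symbol
        if abs(k / n - q) < eps / 2:    # |P_x(a) - Q(a)| < eps / |X| with |X| = 2
            log_prob = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                        + k * log(q) + (n - k) * log(1 - q))
            total += exp(log_prob)
    return total

def delta_t(n, eps, alphabet_size):
    """delta_t(n, eps, X) from (4.3), assuming natural logarithms (log e = 1)."""
    return (n + 1) ** alphabet_size * exp(-n * eps**2 / (2 * alphabet_size**2))

q, eps = 0.3, 0.2
for n in (100, 1000, 5000):
    prob = typical_set_probability(n, q, eps)
    lower_bound = 1 - delta_t(n, eps, 2)
    print(n, round(prob, 6), lower_bound, lower_bound <= prob)
```

For moderate $n$ the lower bound is negative and therefore vacuous, while the exact probability is already close to 1. This illustrates why $n$ must tend to infinity before $\epsilon$ is made small.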
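The claim TA2 is proven as follows:
\begin{align}
1 &= \sum_{\mathbf{x} \in \mathcal{X}^n} Q^n(\mathbf{x}) \tag{4.36} \\
&= \sum_{\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q)} Q^n(\mathbf{x}) + \sum_{\mathbf{x} \notin \mathcal{A}_\epsilon^{(n)}(Q)} Q^n(\mathbf{x}) \tag{4.37} \\
&= \sum_{\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q)} Q^n(\mathbf{x}) + Q^n(\mathcal{T}^n(\mathcal{F})) \tag{4.38} \\
&< \sum_{\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q)} e^{-n (H(Q) - \epsilon_m(Q))} + \delta_t(n, \epsilon, \mathcal{X}) \tag{4.39} \\
&= \bigl| \mathcal{A}_\epsilon^{(n)}(Q) \bigr| \, e^{-n (H(Q) - \epsilon_m(Q))} + \delta_t(n, \epsilon, \mathcal{X}), \tag{4.40}
\end{align}
where the inequality follows from TA1b and (4.34). This proves the lower bound. For the upper bound we have
\begin{align}
1 &= \sum_{\mathbf{x} \in \mathcal{X}^n} Q^n(\mathbf{x}) \tag{4.41} \\
&\ge \sum_{\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q)} Q^n(\mathbf{x}) \tag{4.42} \\
&> \sum_{\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q)} e^{-n (H(Q) + \epsilon_m(Q))} \qquad \text{(by TA1b)} \tag{4.43} \\
&= \bigl| \mathcal{A}_\epsilon^{(n)}(Q) \bigr| \, e^{-n (H(Q) + \epsilon_m(Q))}. \tag{4.44}
\end{align}
For TA3a we have
\begin{align}
\tilde{Q}^n\bigl( \mathcal{A}_\epsilon^{(n)}(Q) \bigr) &= \sum_{\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q)} \tilde{Q}^n(\mathbf{x}) \tag{4.45} \\
&< \sum_{\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q)} e^{-n (H(Q) + D(Q \,\|\, \tilde{Q}) - \epsilon_m(\tilde{Q}))} \tag{4.46} \\
&= \bigl| \mathcal{A}_\epsilon^{(n)}(Q) \bigr| \, e^{-n (H(Q) + D(Q \,\|\, \tilde{Q}) - \epsilon_m(\tilde{Q}))} \tag{4.47} \\
&< e^{n (H(Q) + \epsilon_m(Q))} \, e^{-n (H(Q) + D(Q \,\|\, \tilde{Q}) - \epsilon_m(\tilde{Q}))} \qquad \text{(by TA2)} \tag{4.48} \\
&= e^{-n (D(Q \,\|\, \tilde{Q}) - \epsilon_1)}. \tag{4.49}
\end{align}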
4.2
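is defined as
\begin{align}
\mathcal{A}_\epsilon^{(n)}(Q_{X,Y}) \triangleq \Bigl\{ (\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n : \;
& |P_{\mathbf{x},\mathbf{y}}(a, b) - Q_{X,Y}(a, b)| < \frac{\epsilon}{|\mathcal{X}| |\mathcal{Y}|}, \; \forall (a, b) \in \mathcal{X} \times \mathcal{Y}, \text{ and} \\
& P_{\mathbf{x},\mathbf{y}}(a, b) = 0, \; \forall (a, b) \in \mathcal{X} \times \mathcal{Y} \text{ with } Q_{X,Y}(a, b) = 0 \Bigr\}. \tag{4.50}
\end{align}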
In contrast to the weakly typical set, where the individual typicality had
to be taken as a condition into the definition of jointly weakly typical sets,
here this implicitly follows from the definitions! We have the following lemma.
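Lemma 4.6 (Joint Typicality Implies Individual Typicality). Let $Q_X$ and $Q_Y$ be the marginal distributions of $Q_{X,Y}$. Then,
\[ (\mathbf{x}, \mathbf{y}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}) \implies \mathbf{x} \in \mathcal{A}_{\epsilon}^{(n)}(Q_X) \text{ and } \mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_Y), \tag{4.51} \]
i.e., if a pair of sequences is jointly typical, then each sequence is automatically typical with respect to its marginal distribution.
Note that the converse is not true, i.e., from $\mathbf{x} \in \mathcal{A}_{\epsilon}^{(n)}(Q_X)$ and $\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_Y)$ we cannot conclude that $(\mathbf{x}, \mathbf{y}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y})$.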
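Proof: Recall from our discussion in Section 2.4 that types are probability distributions and behave as such. Hence, if $(\mathbf{x}, \mathbf{y})$ has type $P_{\mathbf{x},\mathbf{y}}$, then $\mathbf{x}$ has type $P_{\mathbf{x}}$ where
\[ P_{\mathbf{x}}(a) = \sum_{b \in \mathcal{Y}} P_{\mathbf{x},\mathbf{y}}(a, b). \tag{4.52} \]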
(4.53)
y = 0 1 0 1 0 0 0 1 0 1.
(4.54)
Then
00
Px,y
01
10
11
(4.55)
and
= 1+2 = 4+3
z}{ z}{
3
7
Px =
,
.
10
10
(4.56)
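Hence, if by definition we have
\[ \bigl|P_{\mathbf{x},\mathbf{y}}(a,b) - Q_{X,Y}(a,b)\bigr| < \frac{\epsilon}{|\mathcal{X}|\,|\mathcal{Y}|}, \tag{4.57} \]
then
\[ \Biggl| \sum_{b \in \mathcal{Y}} \bigl(P_{\mathbf{x},\mathbf{y}}(a,b) - Q_{X,Y}(a,b)\bigr) \Biggr| < \sum_{b \in \mathcal{Y}} \frac{\epsilon}{|\mathcal{X}|\,|\mathcal{Y}|}, \tag{4.58} \]
i.e.,
\[ \bigl|P_{\mathbf{x}}(a) - Q_X(a)\bigr| < \frac{\epsilon}{|\mathcal{X}|} \tag{4.59} \]
and one condition for strong typicality of $\mathbf{x}$ is satisfied. All other conditions can be checked similarly.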
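Lemma 4.6 and the remark that its converse fails can be checked on a small example. The sketch below is only an illustration: the joint PMF, the sequences, and the helper names (`joint_type`, `marginals`, `is_strongly_typical`) are assumptions chosen for this example and do not appear in the notes.

```python
from collections import Counter
from itertools import product

def joint_type(x, y, X, Y):
    """Empirical joint distribution (joint type) of the sequence pair (x, y)."""
    counts = Counter(zip(x, y))
    return {(a, b): counts[(a, b)] / len(x) for a, b in product(X, Y)}

def marginals(P, X, Y):
    """Marginals obtained by summing the joint distribution, as in (4.52)."""
    PX = {a: sum(P[(a, b)] for b in Y) for a in X}
    PY = {b: sum(P[(a, b)] for a in X) for b in Y}
    return PX, PY

def is_strongly_typical(P, Q, eps):
    """Check |P(s) - Q(s)| < eps / (alphabet size) and P(s) = 0 wherever Q(s) = 0."""
    return all(abs(P[s] - Q[s]) < eps / len(Q) and (Q[s] > 0 or P[s] == 0) for s in Q)

X = Y = (0, 1)
Q_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # assumed example PMF
Q_X, Q_Y = marginals(Q_XY, X, Y)
eps = 0.2

x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
P = joint_type(x, y, X, Y)
P_X, P_Y = marginals(P, X, Y)
jointly = is_strongly_typical(P, Q_XY, eps)
individually = is_strongly_typical(P_X, Q_X, eps) and is_strongly_typical(P_Y, Q_Y, eps)

# Counterexample to the converse: both marginal types are fine, the joint type is not.
y_flipped = [1 - a for a in x]
P_bad = joint_type(x, y_flipped, X, Y)
P_bad_X, P_bad_Y = marginals(P_bad, X, Y)
print(jointly, individually, is_strongly_typical(P_bad, Q_XY, eps))
```

Here the first pair has exactly the joint type $Q_{X,Y}$, while the second pair has typical marginals but a joint type concentrated on the off-diagonal cells.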
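The properties of $\mathcal{A}_{\epsilon}^{(n)}(Q)$ given in TA generalize directly to the jointly strongly typical set.
Corollary 4.8 (Generalized Theorem A (TA)).
Let $Q_{X,Y}, \tilde{Q}_{X,Y} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$ and let $(\mathbf{x}, \mathbf{y}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y})$ be a strongly typical pair of sequences. Define $\epsilon_m(\cdot)$ as given in (4.2) and $\delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y})$ according to (4.3) as
\[ \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y}) \triangleq (n+1)^{|\mathcal{X}|\,|\mathcal{Y}|} \exp\Bigl( -n \frac{\epsilon^2}{2|\mathcal{X}|^2 |\mathcal{Y}|^2} \log e \Bigr). \tag{4.60} \]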
1. The joint probability of the jointly typical sequences (x, y) is bounded
as follows:
(4.62)
(4.64)
(4.66)
$1 - \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y}) \le Q^n_{X,Y}\bigl(\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y})\bigr) \le 1.$
(4.68)
Note that we have used the notation H(X, Y ) instead of the more precise
H(QX,Y ) simply out of habit.
4.3
Conditionally Strongly Typical Sets
We can also extend our definition of strongly typical sets to conditional distributions.
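Definition 4.9. For some fixed $a \in \mathcal{X}$ with $Q_X(a) > 0$ we define the strongly typical set conditional on the letter $a$ and with respect to $Q_{Y|X}$ as
\[ \mathcal{A}_{\epsilon}^{(n)}\bigl(Q_{Y|X}(\cdot|a)\bigr) \triangleq \Bigl\{ \mathbf{y} \in \mathcal{Y}^n \colon \bigl|P_{\mathbf{y}}(b) - Q_{Y|X}(b|a)\bigr| < \frac{\epsilon}{|\mathcal{Y}|},\ \forall b \in \mathcal{Y}, \text{ and } P_{\mathbf{y}}(b) = 0,\ \forall b \in \mathcal{Y} \text{ with } Q_{Y|X}(b|a) = 0 \Bigr\}. \tag{4.69} \]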
This definition contains nothing new because it simply conditions everything on the event that X = a. A much more interesting generalization is
when we condition on a given sequence!
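Definition 4.10. For some joint distribution $Q_{X,Y}$ with marginal $Q_X$ and for some fixed strongly typical sequence $\mathbf{x} \in \mathcal{A}_{\epsilon}^{(n)}(Q_X)$, we define the conditionally strongly typical set with respect to $Q_{X,Y}$ as
\[ \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \triangleq \bigl\{ \mathbf{y} \in \mathcal{Y}^n \colon (\mathbf{x}, \mathbf{y}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}) \bigr\}, \tag{4.70} \]
i.e., conditionally on $\mathbf{x}$, $\mathbf{y}$ is conditionally strongly typical if the pair $(\mathbf{x}, \mathbf{y})$ is jointly typical. Note that for $\mathbf{x} \notin \mathcal{A}_{\epsilon}^{(n)}(Q_X)$, we have $\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) = \emptyset$.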
1. For every y A
as follows:
$e^{-n(H(Y|X) + \epsilon_m(Q_{X,Y}))} < Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) < e^{-n(H(Y|X) - \epsilon_m(Q_{X,Y}))}.$
(4.71)
is bounded as
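$1 - \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y}) \le Q^n_{Y|X}\bigl(\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \,\big|\, \mathbf{x}\bigr) \le 1.$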
(4.74)
.
X  Y
(4.75)
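Hence,
\begin{align}
Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) &= \prod_{k=1}^{n} Q_{Y|X}(y_k|x_k) \tag{4.76} \\
&= \prod_{(a,b) \in \operatorname{supp}(Q_{X,Y})} Q_{Y|X}(b|a)^{N(a,b|\mathbf{x},\mathbf{y})} \tag{4.77} \\
&= \prod_{(a,b) \in \operatorname{supp}(Q_{X,Y})} Q_{Y|X}(b|a)^{n P_{\mathbf{x},\mathbf{y}}(a,b)} \tag{4.78} \\
&= \exp\Biggl( n \sum_{(a,b) \in \operatorname{supp}(Q_{X,Y})} P_{\mathbf{x},\mathbf{y}}(a,b) \log Q_{Y|X}(b|a) \Biggr) \tag{4.79} \\
&< \exp\Biggl( n \sum_{(a,b) \in \operatorname{supp}(Q_{X,Y})} \Bigl( Q_{X,Y}(a,b) - \frac{\epsilon}{|\mathcal{X}|\,|\mathcal{Y}|} \Bigr) \log Q_{Y|X}(b|a) \Biggr) \tag{4.80} \\
&= \exp\Biggl( -n H(Y|X) - n \frac{\epsilon}{|\mathcal{X}|\,|\mathcal{Y}|} \sum_{(a,b) \in \operatorname{supp}(Q_{X,Y})} \log \frac{Q_{X,Y}(a,b)}{Q_X(a)} \Biggr) \tag{4.81} \\
&\le \exp\bigl( -n H(Y|X) - n \epsilon \log (Q_{X,Y})_{\min} \bigr), \tag{4.82}
\end{align}
where (4.80) follows from (4.75), and where in (4.82) we use that $\frac{Q_{X,Y}(a,b)}{Q_X(a)} \ge Q_{X,Y}(a,b) \ge (Q_{X,Y})_{\min}$. The lower bound is analogous. This proves TB1.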
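Next we use the lower bound in TB1 to prove the upper bound on the size of the conditionally strongly typical set (TB2):
\begin{align}
1 &= \sum_{\mathbf{y} \in \mathcal{Y}^n} Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) \tag{4.83} \\
&\ge \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) && \text{(drop terms)} \tag{4.84} \\
&> \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} e^{-n(H(Y|X) + \epsilon_m(Q_{X,Y}))} && \text{(by TB1)} \tag{4.85} \\
&= \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})\bigr| \, e^{-n(H(Y|X) + \epsilon_m(Q_{X,Y}))}. \tag{4.86}
\end{align}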
(4.86)
(4.87)
(4.88)
(4.89)
(4.90)
(4.91)
for all (a, b) X Y. Here the first inequality follows because x and y are
jointly typical, and the second inequality follows because x is typical.
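Hence, for all $a$ such that $P_{\mathbf{x}}(a) > 0$, we have
\[ \bigl|P_{\mathbf{y}|\mathbf{x}}(b|a) - Q_{Y|X}(b|a)\bigr| < \frac{\epsilon}{|\mathcal{X}|} \Bigl( 1 + \frac{1}{|\mathcal{Y}|} \Bigr) \frac{1}{P_{\mathbf{x}}(a)}. \tag{4.92} \]
(4.93)
Hence, any $\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})$ satisfies for all $a$ with $P_{\mathbf{x}}(a) > 0$ and for all $b \in \mathcal{Y}$
\[ \bigl|P_{\mathbf{y}|\mathbf{x}}(b|a) - Q_{Y|X}(b|a)\bigr| < \frac{\epsilon}{|\mathcal{X}|} \Bigl( 1 + \frac{1}{|\mathcal{Y}|} \Bigr) \frac{1}{P_{\mathbf{x}}(a)}. \tag{4.94} \]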
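From (4.70) and the second condition in (4.50) it also follows that any $\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})$ satisfies for all $a$ with $P_{\mathbf{x}}(a) > 0$ and for all $b \in \mathcal{Y}$ with $Q_{Y|X}(b|a) = 0$
\[ P_{\mathbf{y}|\mathbf{x}}(b|a) = 0. \tag{4.95} \]
We now define
\[ \mathcal{F}_{\mathbf{x}} \triangleq \bigl\{ P_{Y|X} \in \mathcal{P}_n(\mathcal{Y}|\mathcal{X}) \colon \exists\, \mathbf{y} \notin \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \text{ with } P_{\mathbf{y}|\mathbf{x}} = P_{Y|X} \bigr\} \tag{4.96} \]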
(4.98)
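because otherwise (4.94) and (4.95) are satisfied and the corresponding sequence $\mathbf{y}$ with this conditional type $P_{\mathbf{y}|\mathbf{x}} = P_{Y|X}$ would be typical. Hence, considering this pair $(a, b)$, we have
\begin{align}
D_{P_{\mathbf{x}}}\bigl(P_{Y|X} \,\big\|\, Q_{Y|X}\bigr) &= \sum_{\tilde{a} \in \mathcal{X} \text{ s.t. } P_{\mathbf{x}}(\tilde{a}) > 0} P_{\mathbf{x}}(\tilde{a}) \, D\bigl( P_{Y|X}(\cdot|\tilde{a}) \,\big\|\, Q_{Y|X}(\cdot|\tilde{a}) \bigr) \tag{4.99} \\
&\ge \sum_{\tilde{a} \in \mathcal{X} \text{ s.t. } P_{\mathbf{x}}(\tilde{a}) > 0} P_{\mathbf{x}}(\tilde{a}) \, \frac{1}{2} V^2\bigl( P_{Y|X}(\cdot|\tilde{a}), Q_{Y|X}(\cdot|\tilde{a}) \bigr) \log e \tag{4.100} \\
&= \sum_{\tilde{a} \in \mathcal{X} \text{ s.t. } P_{\mathbf{x}}(\tilde{a}) > 0} P_{\mathbf{x}}(\tilde{a}) \, \frac{1}{2} \Biggl( \sum_{\tilde{b} \in \mathcal{Y}} \bigl| P_{Y|X}(\tilde{b}|\tilde{a}) - Q_{Y|X}(\tilde{b}|\tilde{a}) \bigr| \Biggr)^{2} \log e \tag{4.101} \\
&\ge \frac{1}{2} P_{\mathbf{x}}(a) \bigl( P_{Y|X}(b|a) - Q_{Y|X}(b|a) \bigr)^2 \log e \tag{4.102} \\
&\ge \frac{1}{2} P_{\mathbf{x}}(a) \, \frac{\epsilon^2}{|\mathcal{X}|^2} \Bigl( 1 + \frac{1}{|\mathcal{Y}|} \Bigr)^{2} \frac{1}{P_{\mathbf{x}}(a)^2} \log e \tag{4.103} \\
&= \frac{\epsilon^2}{2|\mathcal{X}|^2} \Bigl( 1 + \frac{1}{|\mathcal{Y}|} \Bigr)^{2} \frac{1}{P_{\mathbf{x}}(a)} \log e \tag{4.104} \\
&\ge \frac{\epsilon^2}{2|\mathcal{X}|^2 |\mathcal{Y}|^2} \log e, \tag{4.105}
\end{align}
where in the last step we use that $\bigl(1 + \frac{1}{|\mathcal{Y}|}\bigr)^2 \ge \frac{1}{|\mathcal{Y}|^2}$ and $\frac{1}{P_{\mathbf{x}}(a)} \ge 1$.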
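\begin{align}
1 &= \sum_{\mathbf{y} \in \mathcal{Y}^n} Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) \tag{4.106} \\
&= \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) + \sum_{\mathbf{y} \notin \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) \tag{4.107} \\
&= \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) + Q^n_{Y|X}\bigl( \mathcal{T}^n(\mathcal{F}_{\mathbf{x}}|\mathbf{x}) \,\big|\, \mathbf{x} \bigr) \tag{4.108} \\
&= \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} Q^n_{Y|X}(\mathbf{y}|\mathbf{x}) + \sum_{P_{Y|X} \in \mathcal{F}_{\mathbf{x}}} Q^n_{Y|X}\bigl( \mathcal{T}^n(P_{Y|X}|\mathbf{x}) \,\big|\, \mathbf{x} \bigr) \tag{4.109} \\
&< \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} e^{-n(H(Y|X) - \epsilon_m(Q_{X,Y}))} + \sum_{P_{Y|X} \in \mathcal{F}_{\mathbf{x}}} e^{-n D_{P_{\mathbf{x}}}(P_{Y|X} \| Q_{Y|X})} \tag{4.110} \\
&\le \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} e^{-n(H(Y|X) - \epsilon_m(Q_{X,Y}))} + \sum_{P_{Y|X} \in \mathcal{F}_{\mathbf{x}}} e^{-n \frac{\epsilon^2}{2|\mathcal{X}|^2|\mathcal{Y}|^2} \log e} \tag{4.111} \\
&= \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})\bigr| \, e^{-n(H(Y|X) - \epsilon_m(Q_{X,Y}))} + |\mathcal{F}_{\mathbf{x}}| \, e^{-n \frac{\epsilon^2}{2|\mathcal{X}|^2|\mathcal{Y}|^2} \log e} \tag{4.112} \\
&\le \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})\bigr| \, e^{-n(H(Y|X) - \epsilon_m(Q_{X,Y}))} + |\mathcal{P}_n(\mathcal{Y}|\mathcal{X})| \, e^{-n \frac{\epsilon^2}{2|\mathcal{X}|^2|\mathcal{Y}|^2} \log e} \tag{4.113} \\
&\le \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})\bigr| \, e^{-n(H(Y|X) - \epsilon_m(Q_{X,Y}))} + (n+1)^{|\mathcal{X}|\,|\mathcal{Y}|} \, e^{-n \frac{\epsilon^2}{2|\mathcal{X}|^2|\mathcal{Y}|^2} \log e} \tag{4.114} \\
&= \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})\bigr| \, e^{-n(H(Y|X) - \epsilon_m(Q_{X,Y}))} + \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y}). \tag{4.115}
\end{align}
Here, the first inequality (4.110) follows from TB1 and from CTT4; in the subsequent inequality (4.111) we make use of (4.105); in (4.113) we upper-bound the size of $\mathcal{F}_{\mathbf{x}}$ by the number of conditional types; in (4.114) we then apply CTT1; and the final step (4.115) follows from the definition of $\delta_t$ given in (4.3).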
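It only remains to prove TB3. The upper bound is trivial. We again use the definition (4.96) and write:
\begin{align}
Q^n_{Y|X}\bigl( \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \,\big|\, \mathbf{x} \bigr) &= 1 - Q^n_{Y|X}\bigl( \mathcal{T}^n(\mathcal{F}_{\mathbf{x}}|\mathbf{x}) \,\big|\, \mathbf{x} \bigr) \tag{4.116} \\
&= 1 - \sum_{P_{Y|X} \in \mathcal{F}_{\mathbf{x}}} Q^n_{Y|X}\bigl( \mathcal{T}^n(P_{Y|X}|\mathbf{x}) \,\big|\, \mathbf{x} \bigr) \tag{4.117} \\
&\ge 1 - \sum_{P_{Y|X} \in \mathcal{F}_{\mathbf{x}}} e^{-n D_{P_{\mathbf{x}}}(P_{Y|X} \| Q_{Y|X})} \tag{4.118} \\
&\ge 1 - \sum_{P_{Y|X} \in \mathcal{F}_{\mathbf{x}}} e^{-n \frac{\epsilon^2}{2|\mathcal{X}|^2|\mathcal{Y}|^2} \log e} \tag{4.119} \\
&= 1 - |\mathcal{F}_{\mathbf{x}}| \, e^{-n \frac{\epsilon^2}{2|\mathcal{X}|^2|\mathcal{Y}|^2} \log e} \tag{4.120} \\
&\ge 1 - |\mathcal{P}_n(\mathcal{Y}|\mathcal{X})| \, e^{-n \frac{\epsilon^2}{2|\mathcal{X}|^2|\mathcal{Y}|^2} \log e} \tag{4.121} \\
&\ge 1 - (n+1)^{|\mathcal{X}|\,|\mathcal{Y}|} \, e^{-n \frac{\epsilon^2}{2|\mathcal{X}|^2|\mathcal{Y}|^2} \log e} \tag{4.122} \\
&= 1 - \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y}). \tag{4.123}
\end{align}
Here, in (4.118) we have applied CTT4; (4.119) follows from (4.105); and (4.122) follows from CTT1.
Note that (4.117)–(4.123) basically is a proof of a conditional version of Sanov's Theorem.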
4.4
Accidental Typicality
To us, the most important circumstances with regard to typical sequences are
situations when two or more sequences are generated not according to the
joint PMF that is used to define the typical set, but rather independently (or
partially independently) based on marginal distributions of the joint PMF.
We know from our discussion of types and the large deviation theory that
in such a case these sequences are very likely to be typical, but not jointly
typical! Concretely, we will next compute bounds on the probability that an
independent pair of sequences accidentally looks like it had been generated
jointly using the joint PMF.
Theorem 4.12 (Theorem C (TC)).
Let QX,Y P(X Y) be a joint PMF with marginals QX and QY . Let
the pair of sequences (X, Y) be generated IID not according to QX,Y , but
independently according to the marginals:
{(Xk , Yk )}nk=1 IID QX QY .
(4.124)
(4.125)
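where
\[ \epsilon_2 \triangleq \epsilon_m(Q_{X,Y}) + \epsilon_m(Q_X) + \epsilon_m(Q_Y) \le 3\epsilon_m(Q_{X,Y}). \tag{4.126} \]
2. For any $\mathbf{x} \in \mathcal{A}_{\epsilon}^{(n)}(Q_X)$, the probability that $\mathbf{Y}$ happens to be conditionally strongly typical given $\mathbf{x}$,
\[ \Pr\bigl[ \mathbf{Y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \bigr] = Q^n_Y\bigl( \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \bigr), \tag{4.127} \]
is bounded as
\[ \bigl(1 - \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y})\bigr) \, e^{-n(I(X;Y) + \epsilon_3)} < Q^n_Y\bigl( \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \bigr) < e^{-n(I(X;Y) - \epsilon_3)}, \tag{4.128} \]
where
\[ \epsilon_3 \triangleq \epsilon_m(Q_{X,Y}) + \epsilon_m(Q_Y) \le 2\epsilon_m(Q_{X,Y}). \tag{4.129} \]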
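\begin{align}
\Pr\bigl[ (\mathbf{X}, \mathbf{Y}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}) \bigr] &= \sum_{(\mathbf{x},\mathbf{y}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y})} Q^n_X(\mathbf{x}) \, Q^n_Y(\mathbf{y}) \tag{4.130} \\
&> \sum_{(\mathbf{x},\mathbf{y}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y})} e^{-n(H(X) + \epsilon_m(Q_X))} \, e^{-n(H(Y) + \epsilon_m(Q_Y))} \tag{4.131} \\
&= \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y})\bigr| \, e^{-n(H(X) + H(Y) + \epsilon_m(Q_X) + \epsilon_m(Q_Y))} \tag{4.132} \\
&> \bigl(1 - \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y})\bigr) \, e^{n(H(X,Y) - \epsilon_m(Q_{X,Y}))} \, e^{-n(H(X) + H(Y) + \epsilon_m(Q_X) + \epsilon_m(Q_Y))} \tag{4.133} \\
&= \bigl(1 - \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y})\bigr) \, e^{-n(I(X;Y) + \epsilon_2)}, \tag{4.134}
\end{align}
where the first inequality (4.131) follows from TA1b (note that because $(\mathbf{x}, \mathbf{y}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y})$ we also have $\mathbf{x} \in \mathcal{A}_{\epsilon}^{(n)}(Q_X)$ and $\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_Y)$) and the second inequality (4.133) follows from TA2.
Note that the steps (4.130)–(4.134) are typical of many of the proofs that we will encounter later in this course.
Moreover, we see that because
\[ Q_Y(b) = \sum_{a \in \mathcal{X}} Q_{X,Y}(a, b) \ge Q_{X,Y}(a, b) \ge (Q_{X,Y})_{\min} \tag{4.135} \]
we have
\[ (Q_Y)_{\min} \ge (Q_{X,Y})_{\min} \tag{4.136} \]
and therefore
\[ \epsilon_m(Q_Y) \le \epsilon_m(Q_{X,Y}). \tag{4.137} \]
The same is also true for $\epsilon_m(Q_X)$. Hence, the bound in (4.134) can be further bounded by replacing $\epsilon_2$ by $3\epsilon_m(Q_{X,Y})$.
The upper bound is analogous.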
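To prove TC2, we only need to slightly adapt the proof of TC1:
\begin{align}
Q^n_Y\bigl( \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \bigr) &= \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} Q^n_Y(\mathbf{y}) \tag{4.138} \\
&> \sum_{\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})} e^{-n(H(Y) + \epsilon_m(Q_Y))} \tag{4.139} \\
&= \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})\bigr| \, e^{-n(H(Y) + \epsilon_m(Q_Y))} \tag{4.140} \\
&> \bigl(1 - \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y})\bigr) \, e^{n(H(Y|X) - \epsilon_m(Q_{X,Y}))} \, e^{-n(H(Y) + \epsilon_m(Q_Y))} \tag{4.141} \\
&= \bigl(1 - \delta_t(n, \epsilon, \mathcal{X} \times \mathcal{Y})\bigr) \, e^{-n(I(X;Y) + \epsilon_3)}, \tag{4.142}
\end{align}
where the first inequality (4.139) follows from TA1b (note that because $\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x})$ we also have $\mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_Y)$) and the second inequality (4.141) follows from TB2.
The upper bound is analogous.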
We see from TC that if the sequence $\mathbf{Y}$ is generated completely independently of a sequence $\mathbf{X}$ (i.e., their joint distribution is $Q_X Q_Y$ instead of $Q_{X,Y}$), then the probability that they accidentally look jointly typical (i.e., that they accidentally look like they have been generated jointly according to $Q_{X,Y}$) tends to zero exponentially fast in $n$, with the decay rate being $I(X;Y)$. Hence, if $X$ and $Y$ are strongly dependent and therefore have high mutual information, the chance that independently generated versions of them happen to look jointly typical decreases to zero very fast, while for small $I(X;Y)$ this decay is slower. Obviously, if $I(X;Y) = 0$, i.e., $X \perp\!\!\!\perp Y$ in the first place, then the argument breaks down and TC only states trivial, uninteresting relations.
This type of observation will be fundamental for our proofs in the following chapters.
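The exponential decay stated in TC1 can be checked exactly for small binary examples by summing over joint types instead of over sequences. The following is a minimal sketch under stated assumptions: binary alphabets, natural logarithms, $\epsilon_m(Q) = -\epsilon \log Q_{\min}$ (as suggested by (4.82)), and an example PMF and function names chosen only for illustration.

```python
import math
from itertools import product

CELLS = list(product(range(2), repeat=2))

def mutual_information(Q):
    """I(X;Y) in nats for a 2x2 joint PMF given as nested lists Q[a][b]."""
    QX = [sum(Q[a]) for a in range(2)]
    QY = [Q[0][b] + Q[1][b] for b in range(2)]
    return sum(Q[a][b] * math.log(Q[a][b] / (QX[a] * QY[b]))
               for a, b in CELLS if Q[a][b] > 0)

def prob_accidentally_typical(Q, n, eps):
    """Exact Pr[(X,Y) is jointly strongly typical w.r.t. Q] for (X_k,Y_k) IID ~ Q_X x Q_Y."""
    QX = [sum(Q[a]) for a in range(2)]
    QY = [Q[0][b] + Q[1][b] for b in range(2)]
    tol = eps / 4  # eps / (|X| |Y|) for binary alphabets
    total = 0.0
    # Enumerate all joint types through their counts (n00, n01, n10, n11).
    for n00 in range(n + 1):
        for n01 in range(n + 1 - n00):
            for n10 in range(n + 1 - n00 - n01):
                counts = [[n00, n01], [n10, n - n00 - n01 - n10]]
                if not all(abs(counts[a][b] / n - Q[a][b]) < tol
                           and (Q[a][b] > 0 or counts[a][b] == 0) for a, b in CELLS):
                    continue
                # Number of sequence pairs with this joint type (multinomial coefficient).
                size = (math.comb(n, n00) * math.comb(n - n00, n01)
                        * math.comb(n - n00 - n01, n10))
                # Log-probability of each such pair under the product of the marginals.
                log_p = sum(counts[a][b] * math.log(QX[a] * QY[b]) for a, b in CELLS)
                total += size * math.exp(log_p)
    return total

Q = [[0.4, 0.1], [0.1, 0.4]]  # assumed example PMF
n, eps = 100, 0.05
p = prob_accidentally_typical(Q, n, eps)
I = mutual_information(Q)
eps_2 = -eps * (math.log(0.1) + 2 * math.log(0.5))  # eps_m(Q_XY) + eps_m(Q_X) + eps_m(Q_Y)
print(p, -math.log(p) / n, I)
```

For this example the empirical exponent $-\frac{1}{n}\log p$ is of the same order as $I(X;Y)$, and $p$ stays below the upper bound $e^{-n(I(X;Y) - \epsilon_2)}$ of TC1.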
The argumentation shown in TC can be taken a step further to a situation
of three RVs that are in a Markov relation.
Theorem 4.13 (Theorem D (TD)).
Let $Q_{U,V,W}$ be a general joint PMF with marginals $Q_U$, $Q_{V|U}$ and $Q_{W|U}$. Let the triple of sequences $(\mathbf{U}, \mathbf{V}, \mathbf{W})$ be generated IID not according to $Q_{U,V,W}$, but according to marginal distributions forming a Markov chain $V \multimap U \multimap W$:
\[ \{(U_k, V_k, W_k)\}_{k=1}^{n} \ \text{IID} \sim Q_U \, Q_{V|U} \, Q_{W|U}. \tag{4.143} \]
(4.145)
(4.146)
(n)
(4.148)
3m (QU,V,W ).
(4.149)
Figure 4.1: Markov setup of three RVs that satisfies the assumptions of TD.
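Proof: We start with the lower bound of TD1:
\begin{align}
\Pr\bigl[ (\mathbf{U}, \mathbf{V}, \mathbf{W}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W}) \bigr] &= \sum_{(\mathbf{u},\mathbf{v},\mathbf{w}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W})} Q^n_U(\mathbf{u}) \, Q^n_{V|U}(\mathbf{v}|\mathbf{u}) \, Q^n_{W|U}(\mathbf{w}|\mathbf{u}) \tag{4.150} \\
&> \sum_{(\mathbf{u},\mathbf{v},\mathbf{w}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W})} e^{-n(H(U) + \epsilon_m(Q_U))} \, e^{-n(H(V|U) + \epsilon_m(Q_{U,V}))} \, e^{-n(H(W|U) + \epsilon_m(Q_{U,W}))} \tag{4.151} \\
&= \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W})\bigr| \, e^{-n(H(U,V) + H(W|U) + \epsilon_m(Q_U) + \epsilon_m(Q_{U,V}) + \epsilon_m(Q_{U,W}))} \tag{4.152} \\
&> \bigl(1 - \delta_t(n, \epsilon, \mathcal{U} \times \mathcal{V} \times \mathcal{W})\bigr) \, e^{n(H(U,V,W) - \epsilon_m(Q_{U,V,W}))} \, e^{-n(H(U,V) + H(W|U) + \epsilon_m(Q_U) + \epsilon_m(Q_{U,V}) + \epsilon_m(Q_{U,W}))} \tag{4.153} \\
&= \bigl(1 - \delta_t(n, \epsilon, \mathcal{U} \times \mathcal{V} \times \mathcal{W})\bigr) \, e^{n(H(W|U,V) - H(W|U) - \epsilon_m(Q_{U,V,W}) - \epsilon_m(Q_U) - \epsilon_m(Q_{U,V}) - \epsilon_m(Q_{U,W}))} \tag{4.154} \\
&= \bigl(1 - \delta_t(n, \epsilon, \mathcal{U} \times \mathcal{V} \times \mathcal{W})\bigr) \, e^{-n(I(V;W|U) + \epsilon_4)}. \tag{4.155}
\end{align}
Here, in (4.151) we use TA1b once and TB1 twice. Note that because $(\mathbf{u}, \mathbf{v}, \mathbf{w}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W})$ we know from Lemma 4.6 that $\mathbf{u} \in \mathcal{A}_{\epsilon}^{(n)}(Q_U)$, that $(\mathbf{u}, \mathbf{v}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,V})$, and that $(\mathbf{u}, \mathbf{w}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,W})$. In (4.153) we use TA2. Furthermore, we see that similarly to the derivation shown in (4.135)–(4.137), we can bound
\[ \epsilon_4 \le 4\epsilon_m(Q_{U,V,W}). \tag{4.156} \]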
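\begin{align}
\Pr\bigl[ (\mathbf{V}, \mathbf{W}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W}|\mathbf{u}) \bigr] &= \sum_{(\mathbf{v},\mathbf{w}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W}|\mathbf{u})} Q^n_{V|U}(\mathbf{v}|\mathbf{u}) \, Q^n_{W|U}(\mathbf{w}|\mathbf{u}) \tag{4.157} \\
&> \sum_{(\mathbf{v},\mathbf{w}) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W}|\mathbf{u})} e^{-n(H(V|U) + \epsilon_m(Q_{U,V}))} \, e^{-n(H(W|U) + \epsilon_m(Q_{U,W}))} \tag{4.158} \\
&= \bigl|\mathcal{A}_{\epsilon}^{(n)}(Q_{U,V,W}|\mathbf{u})\bigr| \, e^{-n(H(V|U) + H(W|U) + \epsilon_m(Q_{U,V}) + \epsilon_m(Q_{U,W}))} \tag{4.159} \\
&> \bigl(1 - \delta_t(n, \epsilon, \mathcal{U} \times \mathcal{V} \times \mathcal{W})\bigr) \, e^{n(H(V,W|U) - \epsilon_m(Q_{U,V,W}))} \, e^{-n(H(V|U) + H(W|U) + \epsilon_m(Q_{U,V}) + \epsilon_m(Q_{U,W}))} \tag{4.160} \\
&= \bigl(1 - \delta_t(n, \epsilon, \mathcal{U} \times \mathcal{V} \times \mathcal{W})\bigr) \, e^{n(H(W|U,V) - H(W|U) - \epsilon_m(Q_{U,V,W}) - \epsilon_m(Q_{U,V}) - \epsilon_m(Q_{U,W}))} \tag{4.161} \\
&= \bigl(1 - \delta_t(n, \epsilon, \mathcal{U} \times \mathcal{V} \times \mathcal{W})\bigr) \, e^{-n(I(V;W|U) + \epsilon_5)}. \tag{4.162}
\end{align}
Here, in (4.158) we use TB1 twice; and in (4.160) we use TB2.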
4.A
Appendix: Alternative Definition of Conditionally Strongly Typical Sets
In Section 4.3 we have introduced the conditionally strongly typical sets conditional on a given sequence. The definition given there actually differs from
the style of definitions given before for the typical sets (Definition 4.1), jointly
typical sets (Definition 4.5), and the conditionally typical sets conditioned on
an event (Definition 4.9): Instead of specifying boundaries on the conditional
type it simply refers to the definition of jointly typical sets. The reason for
this is that the derivations turn out to be easier with this definition.
It is, however, possible to define the conditionally strongly typical sets conditional on a given sequence using the normal approach of specifying boundaries on the conditional type.
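Definition 4.14. Fix an $\epsilon > 0$ and a conditional distribution $Q_{Y|X} \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$. The conditionally strongly typical set $\mathcal{A}_{\epsilon}^{(n)}(Q_{Y|X}|\mathbf{x})$ conditional on a fixed sequence $\mathbf{x} \in \mathcal{X}^n$ and with respect to the conditional distribution $Q_{Y|X}$ is defined as
\[ \mathcal{A}_{\epsilon}^{(n)}(Q_{Y|X}|\mathbf{x}) \triangleq \Bigl\{ \mathbf{y} \in \mathcal{Y}^n \colon \bigl| P_{\mathbf{x},\mathbf{y}}(a,b) - Q_{Y|X}(b|a) P_{\mathbf{x}}(a) \bigr| < \frac{\epsilon}{|\mathcal{Y}|},\ \forall (a,b) \in \mathcal{X} \times \mathcal{Y}, \text{ and } P_{\mathbf{x},\mathbf{y}}(a,b) = 0,\ \forall (a,b) \in \mathcal{X} \times \mathcal{Y} \text{ with } P_{\mathbf{x}}(a) > 0 \text{ and } Q_{Y|X}(b|a) = 0 \Bigr\}. \tag{4.163} \]
Note that we again have the second condition to make sure that for zero probability in $Q_{Y|X}$ we do not have any occurrences in the sequences. We do not need to worry about $P_{\mathbf{x}}(a) = 0$ because then $P_{\mathbf{x},\mathbf{y}}(a,b) = 0$ for sure, but if $P_{\mathbf{x}}(a) > 0$ and $Q_{Y|X}(b|a) = 0$ we do require $P_{\mathbf{x},\mathbf{y}}(a,b) = 0$, too.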
We would like to point out that Definition 4.14 actually is more general than Definition 4.10:
• While our original Definition 4.10 only works for a given sequence that is typical, $\mathbf{x} \in \mathcal{A}_{\epsilon}^{(n)}(Q_X)$, the alternative Definition 4.14 works for any sequence $\mathbf{x} \in \mathcal{X}^n$.
• Our original Definition 4.10 requires the specification of a joint distribution $Q_{X,Y}$, while the alternative Definition 4.14 only needs a conditional distribution $Q_{Y|X}$.
However, apart from that, it turns out that both versions of conditionally strongly typical sets are equivalent. This is shown in the following proposition.
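\[ \mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{X,Y}|\mathbf{x}) \implies \mathbf{y} \in \mathcal{A}_{\epsilon'}^{(n)}(Q_{Y|X}|\mathbf{x}), \tag{4.164} \]
where
\[ \epsilon' \triangleq \epsilon \, \frac{|\mathcal{Y}| + 1}{|\mathcal{X}|}; \tag{4.165} \]
and
\[ \mathbf{y} \in \mathcal{A}_{\epsilon}^{(n)}(Q_{Y|X}|\mathbf{x}) \implies \mathbf{y} \in \mathcal{A}_{\epsilon''}^{(n)}(Q_{X,Y}|\mathbf{x}), \tag{4.166} \]
where
\[ \epsilon'' \triangleq \epsilon \bigl( |\mathcal{X}| + |\mathcal{Y}| \bigr). \tag{4.167} \]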
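\begin{align}
P_{\mathbf{x},\mathbf{y}}(a,b) &< Q_{X,Y}(a,b) + \frac{\epsilon}{|\mathcal{X}|\,|\mathcal{Y}|} \tag{4.168} \\
&= Q_X(a) Q_{Y|X}(b|a) + \frac{\epsilon}{|\mathcal{X}|\,|\mathcal{Y}|} \tag{4.169} \\
&< \Bigl( P_{\mathbf{x}}(a) + \frac{\epsilon}{|\mathcal{X}|} \Bigr) Q_{Y|X}(b|a) + \frac{\epsilon}{|\mathcal{X}|\,|\mathcal{Y}|} && \bigl(\mathbf{x} \in \mathcal{A}_{\epsilon}^{(n)}(Q_X)\bigr) \tag{4.170} \\
&\le P_{\mathbf{x}}(a) Q_{Y|X}(b|a) + \frac{\epsilon}{|\mathcal{X}|} + \frac{\epsilon}{|\mathcal{X}|\,|\mathcal{Y}|} && (Q_{Y|X} \le 1) \tag{4.171} \\
&= P_{\mathbf{x}}(a) Q_{Y|X}(b|a) + \frac{\epsilon'}{|\mathcal{Y}|} && \text{(by (4.165))}. \tag{4.172}
\end{align}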
0
.
Y
(4.173)
(x A
(QX ))
(4.174)
(4.175)
(QY X 1)
(4.176)
(by (4.167)).
(4.177)
Chapter 5
Rate Distortion Theory
It is a very interesting fact that joint descriptions are more efficient than individual descriptions. This is even true for independent random variables: Even if $X_1 \perp\!\!\!\perp X_2$, the description of $(X_1, X_2)$ is shorter than the description for $X_1$ and the description for $X_2$ together!
In slightly fancy wording, it is simpler to describe an elephant and a chicken together with one description rather than to describe each separately.
So why do independent problems not have independent solutions? The answer lies in the geometry: Rectangular grid points resulting from independent descriptions are not space efficient! If we want to get something more densely packed, we need to make the description dependent. See Figure 5.1 for a qualitative explanation.
Rate distortion theory goes back to Shannon's seminal paper [Sha48]. Shannon dealt with it in more detail in [Sha59], and at the same time the Russian group around Kolmogorov also worked intensively on this problem [Kol56]. Berger published a comprehensive book about the topic in 1971
[Ber71]. For the characterization of the rate distortion function (Sections 5.5
Figure 5.1 (left panel: independent description, 49 points; right panel: dependent description, 45 points): Quantization of a square: Every point in the square will be represented by the closest grid point. In the left version we use a rectangular grid generated by an independent description of the two dimensions of the square. In the right version, we use a shifted grid where the two dimensions are dependent. As an example a shaded point with its nearest grid point is shown where we have chosen this example point such that it demonstrates the maximum distance between any point of the square and its closest grid point. Note that even though we use more grid points in the left version, the maximum distance is larger than in the right version.
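5.1
Motivation: Quantization of a Continuous RV
\[ g'(\hat{x}) = E\bigl[ 2(X - \hat{x})(-1) \,\big|\, X > 0 \bigr] = -2\,E[X \,|\, X > 0] + 2\hat{x} = 0, \tag{5.3} \]
i.e.,
\[ \hat{x} = E[X \,|\, X > 0]. \tag{5.4} \]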
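\[ f_{X|X>0}(x) = \frac{2}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{x^2}{2\sigma^2}}, \quad x > 0, \tag{5.5} \]
so that
\[ \hat{x} = E[X \,|\, X > 0] = \int_{0}^{\infty} x \, \frac{2}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{x^2}{2\sigma^2}} \, \mathrm{d}x = \Biggl[ -\frac{2\sigma^2}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{x^2}{2\sigma^2}} \Biggr]_{x=0}^{\infty} = \sqrt{\frac{2}{\pi}} \, \sigma. \tag{5.6} \]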
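\[ \hat{x}(x) = \begin{cases} \sqrt{\frac{2}{\pi}} \, \sigma & \text{if } x > 0, \\ -\sqrt{\frac{2}{\pi}} \, \sigma & \text{if } x \le 0. \end{cases} \tag{5.7} \]
See Figure 5.2 for these reconstruction points in the situation when $\sigma^2 = 1$. This description will cause a distortion of
\begin{align}
E\bigl[ (X - \hat{x}(X))^2 \bigr] &= \frac{1}{2} E\bigl[ (X - \hat{x}(X))^2 \,\big|\, X > 0 \bigr] + \frac{1}{2} E\bigl[ (X - \hat{x}(X))^2 \,\big|\, X \le 0 \bigr] \tag{5.8} \\
&= E\bigl[ (X - \hat{x}(X))^2 \,\big|\, X > 0 \bigr] \tag{5.9} \\
&= \Bigl( 1 - \frac{2}{\pi} \Bigr) \sigma^2. \tag{5.10}
\end{align}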
But what shall we do if we have two bits available to represent X? Obviously, we want to divide the real line into 4 regions and use a point within each
region to represent all values of X in this region. However, it is not obvious
how to choose the regions and their representation points.
Luckily, we do have some knowledge: An optimal choice of regions and representation points should have the following two properties:
• Given a set of reconstruction points, the regions should be chosen such that the distortion is minimized. This is the case if every value $x$ is mapped to its closest representation point (in the sense of the given distortion measure), i.e., the regions should be the nearest neighbor regions around the reconstruction points. Such a partition is called Voronoi or Dirichlet partition.
• Given a set of regions, the reconstruction points should be chosen such that the distortion is minimized. This is the case if for each region the corresponding reconstruction point minimizes the conditional expected distortion over this region.
These two properties can now be used as a basis for an iterative algorithm that should find a good (if not the optimal) quantization system:
Lloyd's Algorithm for the Design of a Quantization System [Llo82]:
Step 1: Start with a set of (manually chosen) reconstruction points.
Step 2: Find the Voronoi regions for the given reconstruction points.
Step 3: Find the optimal reconstruction points for the derived regions.
Step 4: Return to Step 2 until the algorithm has reached some local
minimum.
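The four steps can be sketched in a few lines. Note the assumptions: instead of a known PDF, the sketch below works on a finite list of samples (so expectations become sample means), it uses the squared error distortion, and the function name `lloyd_1d` and the toy data are illustrative only.

```python
def lloyd_1d(samples, points, max_iter=100):
    """Lloyd's algorithm on a finite sample with squared error distortion."""
    points = sorted(points)  # Step 1: manually chosen starting points
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest reconstruction point (Voronoi regions).
        regions = [[] for _ in points]
        for s in samples:
            nearest = min(range(len(points)), key=lambda i: (s - points[i]) ** 2)
            regions[nearest].append(s)
        # Step 3: under squared error the best point of a region is the region mean
        # (an empty region simply keeps its old point).
        new_points = [sum(r) / len(r) if r else p for r, p in zip(regions, points)]
        # Step 4: stop once nothing changes, i.e., a local minimum has been reached.
        if new_points == points:
            break
        points = new_points
    return points

print(lloyd_1d([0, 1, 2, 10, 11, 12], [0.0, 1.0]))  # → [1.0, 11.0]
```

Even with the poor starting points 0 and 1, the two reconstruction points move to the centers of the two clusters within three iterations.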
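distance; and given some region, the optimal reconstruction point in this region is $E[X \,|\, X \text{ is in region}]$.
Let us start with the two reconstruction points $\hat{x}_1 = 0$ and $\hat{x}_2 = 1$.
• Given $\hat{x}_1 = 0$ and $\hat{x}_2 = 1$, the optimal regions are divided by the threshold $\theta = \frac{\hat{x}_1 + \hat{x}_2}{2} = \frac{1}{2}$.
• Given $\theta = \frac{1}{2}$, the optimal reconstruction points are
\begin{align}
\hat{x}_1 &= E\Bigl[ X \,\Big|\, X > \frac{1}{2} \Bigr] = \frac{1}{Q\bigl(\frac{1}{2}\bigr)} \int_{\frac{1}{2}}^{\infty} x \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}} \, \mathrm{d}x = \frac{e^{-\frac{1}{8}}}{Q\bigl(\frac{1}{2}\bigr) \sqrt{2\pi}} \approx 1.14, \tag{5.11} \\
\hat{x}_2 &= E\Bigl[ X \,\Big|\, X \le \frac{1}{2} \Bigr] = \frac{1}{1 - Q\bigl(\frac{1}{2}\bigr)} \int_{-\infty}^{\frac{1}{2}} x \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}} \, \mathrm{d}x = -\frac{e^{-\frac{1}{8}}}{\bigl(1 - Q\bigl(\frac{1}{2}\bigr)\bigr) \sqrt{2\pi}} \approx -0.509. \tag{5.12}
\end{align}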
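• Given $\hat{x}_1 = 1.14$ and $\hat{x}_2 = -0.509$, the optimal regions are divided by the threshold $\theta = \frac{\hat{x}_1 + \hat{x}_2}{2} \approx 0.316$.
• Given $\theta = 0.316$, the optimal reconstruction points are
\begin{align}
\hat{x}_1 &= E[X \,|\, X > 0.316] = \frac{e^{-\frac{0.316^2}{2}}}{Q(0.316) \sqrt{2\pi}} \approx 1.009, \tag{5.13} \\
\hat{x}_2 &= E[X \,|\, X \le 0.316] = -\frac{e^{-\frac{0.316^2}{2}}}{\bigl(1 - Q(0.316)\bigr) \sqrt{2\pi}} \approx -0.608. \tag{5.14}
\end{align}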
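\begin{align}
\hat{x}_1 \approx 0.930, \quad \hat{x}_2 \approx -0.675 \quad &\Longrightarrow \quad \theta \approx 0.128. \tag{5.15} \\
\hat{x}_1 \approx 0.881, \quad \hat{x}_2 \approx -0.719 \quad &\Longrightarrow \quad \theta \approx 0.081. \tag{5.16} \\
\hat{x}_1 \approx 0.850, \quad \hat{x}_2 \approx -0.750 \quad &\Longrightarrow \quad \theta \approx 0.052. \tag{5.17} \\
\hat{x}_1 \approx 0.831, \quad \hat{x}_2 \approx -0.765 \quad &\Longrightarrow \quad \theta \approx 0.033. \tag{5.18}
\end{align}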
We see how we slowly approach the optimal solution given in (5.7) and Figure 5.2.
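The iteration of this example is easy to reproduce numerically. The sketch below assumes $X \sim \mathcal{N}(0,1)$ and uses the closed-form conditional means $E[X \,|\, X > \theta] = \varphi(\theta)/Q(\theta)$ and $E[X \,|\, X \le \theta] = -\varphi(\theta)/(1 - Q(\theta))$, where $\varphi$ denotes the standard Gaussian PDF; the function names are illustrative.

```python
import math

def Q(x):
    """Gaussian tail probability Pr[N(0,1) > x]."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def phi(x):
    """Standard Gaussian PDF."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def lloyd_two_points(x_upper, x_lower, iterations):
    """Alternate the threshold update and the conditional-mean update."""
    for _ in range(iterations):
        theta = (x_upper + x_lower) / 2          # nearest-neighbor threshold
        x_upper = phi(theta) / Q(theta)          # E[X | X > theta]
        x_lower = -phi(theta) / (1 - Q(theta))   # E[X | X <= theta]
    return x_upper, x_lower

for k in (1, 2, 3, 50):
    print(k, lloyd_two_points(1.0, 0.0, k))
```

After one and two iterations this gives the values of (5.11) to (5.14), and after many iterations both points approach $\pm\sqrt{2/\pi} \approx \pm 0.798$ as in (5.7).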
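Example 5.2. Let us also consider the case of $X \sim \mathcal{N}(0, 1)$ with the squared error distortion measure and with $R = 2$ bits. For symmetry reasons it is clear that we must have a threshold at $X = 0$ with two regions and symmetric reconstruction points on either side of it. So we only concentrate on the positive side and try to find $\hat{x}_1$ and $\hat{x}_2$ with a threshold $\theta$ in between. The other two reconstruction points will then be $\hat{x}_3 = -\hat{x}_1$ and $\hat{x}_4 = -\hat{x}_2$ with corresponding threshold $-\theta$.
The recursion formulas for Lloyd's algorithm are then as follows:
\begin{align}
\theta &= \frac{\hat{x}_1 + \hat{x}_2}{2}, \tag{5.19} \\
\hat{x}_1 &= \frac{1}{\frac{1}{2} - Q(\theta)} \int_{0}^{\theta} x \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}} \, \mathrm{d}x = \frac{1 - e^{-\frac{\theta^2}{2}}}{\sqrt{2\pi} \bigl( \frac{1}{2} - Q(\theta) \bigr)}, \tag{5.20} \\
\hat{x}_2 &= \frac{1}{Q(\theta)} \int_{\theta}^{\infty} x \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}} \, \mathrm{d}x = \frac{e^{-\frac{\theta^2}{2}}}{\sqrt{2\pi} \, Q(\theta)}. \tag{5.21}
\end{align}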
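The recursion (5.19) to (5.21) can be iterated directly. The following sketch assumes the starting values $\hat{x}_1 = 0.5$ and $\hat{x}_2 = 1$ (chosen here for illustration) and writes $\varphi$ for the standard Gaussian PDF, so that (5.20) reads $(\varphi(0) - \varphi(\theta)) / (\frac{1}{2} - Q(\theta))$ and (5.21) reads $\varphi(\theta)/Q(\theta)$.

```python
import math

def Q(x):
    """Gaussian tail probability Pr[N(0,1) > x]."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def phi(x):
    """Standard Gaussian PDF."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def lloyd_four_points(x1, x2, iterations):
    """Iterate (5.19)-(5.21) on the positive half-line; the negative side is the mirror image."""
    for _ in range(iterations):
        theta = (x1 + x2) / 2                              # (5.19)
        x1 = (phi(0.0) - phi(theta)) / (0.5 - Q(theta))    # (5.20): E[X | 0 < X <= theta]
        x2 = phi(theta) / Q(theta)                         # (5.21): E[X | X > theta]
    return x1, x2, (x1 + x2) / 2

for k in (1, 2, 3, 200):
    print(k, lloyd_four_points(0.5, 1.0, k))
```

The iteration settles at roughly $\hat{x}_1 \approx 0.4528$, $\hat{x}_2 \approx 1.5104$ and $\theta \approx 0.9816$.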
5.2
Definitions and Assumptions
\[ \frac{\log(\#\text{ of indices})}{n} \quad \text{nats/symbol}. \tag{5.22} \]
The rate describes how many nats we need on average to describe one source letter.
Table 5.3: Recursion given by Lloyd's algorithm for the case of a Gaussian RV represented by four values.

  x̂_1      x̂_2      θ
  0.5      1        0.75
  0.3578   1.3288   0.8433
  0.3973   1.4011   0.8992
  0.4202   1.4450   0.9326
  0.4336   1.4714   0.9525
  0.4414   1.4872   0.9643
  0.4461   1.4966   0.9714
  0.4488   1.5022   0.9755
  ...
  0.4528   1.5104   0.9816

Figure 5.4: Reconstruction areas and points of $X \sim \mathcal{N}(0, 1)$ according to Example 5.2.
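Decoder: The decoder is a mapping $\psi_n \colon \{1, 2, \ldots, e^{nR}\} \to \hat{\mathcal{X}}^n$, i.e., the decoder receives an index from the encoder and represents the corresponding $\mathbf{X}$ by an estimate $\hat{\mathbf{X}} \in \hat{\mathcal{X}}^n$.
Distortion measure: We first define a per-letter distortion measure:
\[ d \colon \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}_0^{+} \tag{5.23} \]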
(5.24)
\[ d(\mathbf{x}, \hat{\mathbf{x}}) \triangleq \frac{1}{n} \sum_{k=1}^{n} d(x_k, \hat{x}_k). \tag{5.25} \]
Note that this choice is bad for many practical applications like image quality or sound quality. A good practical sequence distortion measure is very likely far from being an average per-letter distortion! However, we make this assumption anyway here in order to simplify our system and to make its analysis tractable.
We also assume that the per-letter distortion $d(\cdot, \cdot)$ is bounded, i.e.,
\[ d_{\max} \triangleq \max_{x \in \mathcal{X},\, \hat{x} \in \hat{\mathcal{X}}} d(x, \hat{x}) < \infty. \tag{5.26} \]
Moreover, we assume that
\[ \min_{\hat{x} \in \hat{\mathcal{X}}} d(x, \hat{x}) = 0, \quad \forall x \in \mathcal{X}. \tag{5.27} \]
This means that we require that for every possible value $x \in \mathcal{X}$ there must exist (at least) one best representation with zero distortion. As we will see in the exercises, this assumption causes no loss in generality.
Example 5.3. Two of the most important distortion measures are as follows:
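• The Hamming distortion is defined as
\[ d(x, \hat{x}) \triangleq \begin{cases} 0 & \text{if } x = \hat{x}, \\ 1 & \text{if } x \ne \hat{x}. \end{cases} \tag{5.28} \]
It results in the expected distortion
\begin{align}
E\bigl[ d(X, \hat{X}) \bigr] &= 0 \cdot \Pr[X = \hat{X}] + 1 \cdot \Pr[X \ne \hat{X}] \tag{5.29} \\
&= \Pr[X \ne \hat{X}]. \tag{5.30}
\end{align}
• The squared error distortion is defined as
\[ d(x, \hat{x}) \triangleq (x - \hat{x})^2. \tag{5.31} \]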
This distortion measure is in particular popular for continuous (especially Gaussian) RVs. Unfortunately, it is a bad choice for a quality
criterion for images or speech.
\[ d(\mathbf{x}, \hat{\mathbf{x}}) = \frac{1}{n} \sum_{k=1}^{n} d(x_k, \hat{x}_k). \tag{5.32} \]
\[ \psi_n(1), \ldots, \psi_n\bigl(e^{nR}\bigr) = \hat{\mathbf{X}}(1), \ldots, \hat{\mathbf{X}}\bigl(e^{nR}\bigr) \tag{5.33} \]
is called the codebook, and $\varphi_n^{-1}(1), \ldots, \varphi_n^{-1}\bigl(e^{nR}\bigr)$ are the associated assignment regions.
Note that the codeword $\hat{\mathbf{X}}$ is often also called vector quantization, reproduction, reconstruction, representation, source codeword, source code, or estimate of $\mathbf{X}$.
As usual for a coding scheme, we define its rate. Moreover, since our
coding scheme is a rate distortion coding scheme, we also define the achieved
distortion.
Definition 5.5. The rate $R$ of a rate distortion coding scheme is defined¹ as
\[ R \triangleq \frac{\log(\#\text{ of indices})}{n} = \frac{\log e^{nR}}{n}. \tag{5.34} \]
Note that unless we choose several identical codewords (which is a rather inefficient
thing to do!), the number of indices is equal to the number of codewords.
The rate distortion region for a source $Q$ and a distortion measure $d(\cdot, \cdot)$ is the closure of the set of all achievable rate distortion pairs $(R, D)$.
Note that by definition we specify the rate distortion region to contain also its boundaries. This is similar to the definition of capacity that is the supremum of all achievable transmission rates, without worrying whether a rate $R = C$ is actually achievable or not.²
We next define two functions that describe the boundary of the rate distortion region.
Definition 5.7. The rate distortion function $R(D)$ is the smallest rate (actually, the infimum of rates) such that $\bigl(R(D), D\bigr)$ is in the rate distortion region of the source for a given average distortion $D$.
The distortion rate function $D(R)$ is the smallest average distortion (actually, the infimum of average distortions) such that $\bigl(R, D(R)\bigr)$ is in the rate distortion region of the source for a given rate $R$.
Note that the rate distortion function R(D) is comparable to the capacity
cost function C(Es ), which describes the maximum achievable transmission
rate for a certain given cost (like power).
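5.3
The Information Rate Distortion Function
\begin{align}
R^{\mathrm{I}}(D) &\triangleq \inf_{q(\hat{x}|x) \colon \sum_{x, \hat{x}} Q(x) q(\hat{x}|x) d(x, \hat{x}) \le D} I(X; \hat{X}) \tag{5.37} \\
&= \inf_{q(\hat{x}|x) \colon E_{Qq}[d(X, \hat{X})] \le D} I(X; \hat{X}). \tag{5.38}
\end{align}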
In the case of the capacity, it depends on the channel whether the capacity itself is
achievable or not.
(5.39)
(5.40)
(5.41)
We only need to analyze the case where $p \le \frac{1}{2}$, because for $p > \frac{1}{2}$ we can simply redefine the source letter 0 as 1 and vice versa.
We assume first that $D \le p$ ($\le \frac{1}{2}$) and derive the following lower bound on $I(X; \hat{X})$:
\begin{align}
I(X; \hat{X}) &= H(X) - H(X | \hat{X}) \tag{5.42} \\
&= H_b(p) - H(X \oplus \hat{X} \,|\, \hat{X}) \tag{5.43} \\
&\ge H_b(p) - H(X \oplus \hat{X}) \tag{5.44} \\
&= H_b(p) - H_b\bigl( \Pr[X \ne \hat{X}] \bigr) \tag{5.45} \\
&\ge H_b(p) - H_b(D). \tag{5.46}
\end{align}
Here, in (5.44) we use that conditioning reduces entropy; and (5.46) follows because of (5.41) and because $H_b(\cdot)$ is increasing for arguments less than $\frac{1}{2}$.
We next show that this general lower bound actually can be achieved by an appropriate choice of $q$. To do so, we consider the inverse test channel with input $\hat{X}$ and output $X$. We choose it to be a binary symmetric channel (BSC) with error probability $D$, see Figure 5.5. Now we need to choose the input $\hat{X}$ such that the output $X$ has the correct Bernoulli distribution: Let $r \triangleq \Pr[\hat{X} = 1]$, and compute
\[ \Pr[X = 1] = p = D(1 - r) + (1 - D) r. \tag{5.47} \]
Hence,
\[ r = \frac{p - D}{1 - 2D}, \tag{5.48} \]
(5.49)
and
\[ I(X; \hat{X}) = H(X) - H(X | \hat{X}) = H_b(p) - H_b(D). \tag{5.50} \]
$X \perp\!\!\!\perp \hat{X}$). This must be the minimum because the mutual information can never become negative. This choice causes a distortion
\[ E\bigl[ d(X, \hat{X}) \bigr] = \Pr[X \ne \hat{X}] = \Pr[X \ne 0] = p < D \tag{5.51} \]
by assumption.
So we have derived the information rate distortion function:
\[ R^{\mathrm{I}}(D) = \begin{cases} H_b(p) - H_b(D) & 0 \le D \le \min\{p, 1-p\}, \\ 0 & D > \min\{p, 1-p\}. \end{cases} \tag{5.52} \]
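A short numerical sketch of (5.52) and of the test channel input probability (5.48) follows. It assumes natural logarithms (rates in nats); the function names are illustrative and not part of the notes.

```python
import math

def hb(p):
    """Binary entropy function in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def rate_distortion_bernoulli(p, D):
    """Information rate distortion function (5.52), Bernoulli(p) source, Hamming distortion."""
    if D >= min(p, 1 - p):
        return 0.0
    return hb(p) - hb(D)

def reverse_channel_input(p, D):
    """r = Pr[X_hat = 1] from (5.48), for p <= 1/2 and D <= p, D < 1/2."""
    return (p - D) / (1 - 2 * D)

p, D = 0.3, 0.1
r = reverse_channel_input(p, D)
print(rate_distortion_bernoulli(p, D), r)
# Passing X_hat ~ Bernoulli(r) through a BSC with error probability D
# must reproduce the source distribution, see (5.47):
print(D * (1 - r) + (1 - D) * r)
```

For $p = 0.3$ and $D = 0.1$ the sketch gives $r = 0.25$, and the BSC output probability is again $0.3$, as required.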
Remark 5.10. Note that from the Extreme Value Theorem (Theorem 1.23) we realize that the infimum in the definition of the rate distortion function (Definition 5.7) can actually be replaced by a minimum. The reason is that the mutual information is continuous, for any finite alphabets $\mathcal{X}$ and $\hat{\mathcal{X}}$ the set $\mathcal{P}(\hat{\mathcal{X}}|\mathcal{X})$ is bounded, and as long as the constraints are closed, i.e., we have
\[ E_{Qq}\bigl[ d(X, \hat{X}) \bigr] \le D \tag{5.53} \]
instead of
\[ E_{Qq}\bigl[ d(X, \hat{X}) \bigr] < D, \tag{5.54} \]
\[ \min_{q(\hat{x}|x) \colon E_{Qq}[d(X, \hat{X})] \le D} I(X; \hat{X}) \tag{5.55} \]
and
\[ q^{*} \triangleq \operatorname*{argmin}_{q(\hat{x}|x) \colon E_{Qq}[d(X, \hat{X})] \le D} I(X; \hat{X}). \tag{5.56} \]
D 0.
(5.57)
Proof: To see why Part 1 holds, note that by definition $R^{\mathrm{I}}(D)$ is a minimum over a candidate set that is enlarged if $D$ is increased. If the candidate set is enlarged, the minimum can only remain unchanged or decrease. Hence, $R^{\mathrm{I}}(D)$ is nonincreasing.
To prove Part 2, take two rate distortion pairs $(R_1, D_1)$ and $(R_2, D_2)$ that lie on the curve described by $R^{\mathrm{I}}(D)$, and let $q_1(\hat{x}|x)$ and $q_2(\hat{x}|x)$ be the PMFs that achieve these two points, respectively. Now define
\[ q_\lambda(\hat{x}|x) \triangleq \lambda q_1(\hat{x}|x) + (1 - \lambda) q_2(\hat{x}|x). \tag{5.58} \]
Then
\begin{align}
D_\lambda &\triangleq E_{Q q_\lambda}\bigl[ d(X, \hat{X}) \bigr] \tag{5.59} \\
&= \lambda E_{Q q_1}\bigl[ d(X, \hat{X}) \bigr] + (1 - \lambda) E_{Q q_2}\bigl[ d(X, \hat{X}) \bigr] \tag{5.60} \\
&= \lambda D_1 + (1 - \lambda) D_2. \tag{5.61}
\end{align}
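Now recall that the mutual information over a channel is convex in the channel law:
\[ I(X; \hat{X}) = I(Q, q) \tag{5.62} \]
\begin{align}
R^{\mathrm{I}}\bigl( \lambda D_1 + (1 - \lambda) D_2 \bigr) &= R^{\mathrm{I}}(D_\lambda) && \text{(by (5.61))} \tag{5.64} \\
&\le I(Q, q_\lambda) && \text{(drop minimization)} \tag{5.65} \\
&\le \lambda I(Q, q_1) + (1 - \lambda) I(Q, q_2) && \text{(by (5.63))} \tag{5.66} \\
&= \lambda R^{\mathrm{I}}(D_1) + (1 - \lambda) R^{\mathrm{I}}(D_2). && \text{(by definition)} \tag{5.67}
\end{align}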
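\begin{align}
R^{\mathrm{I}}(D) &= \min_{q(\hat{x}|x) \colon E_{Qq}[d(X, \hat{X})] \le D} I(X; \hat{X}) \tag{5.68} \\
&= \min_{q(\hat{x}|x) \colon E_{Qq}[d(X, \hat{X})] \le D} \bigl\{ H(X) - \underbrace{H(X | \hat{X})}_{\ge 0} \bigr\} \tag{5.69} \\
&\le \min_{q(\hat{x}|x) \colon E_{Qq}[d(X, \hat{X})] \le D} H(X) \tag{5.70} \\
&= H(X) = H(Q). \tag{5.71}
\end{align}
5.4
Rate Distortion Coding Theorem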
min
q(
xx) : EQq [d(X,X)]D
I(X; X)
(5.72)
(5.73)
is achievable, and any achievable rate distortion coding scheme with rate
(5.74)
5.4.1
Converse
(5.76)
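\begin{align}
nR &\ge H(\hat{\mathbf{X}}) \tag{5.77} \\
&= H(\hat{\mathbf{X}}) - H(\hat{\mathbf{X}} | \mathbf{X}) && \text{($\hat{\mathbf{X}}$ is a function of $\mathbf{X}$)} \tag{5.78} \\
&= I(\mathbf{X}; \hat{\mathbf{X}}) \tag{5.79} \\
&= H(\mathbf{X}) - H(\mathbf{X} | \hat{\mathbf{X}}) \tag{5.80} \\
&= \sum_{k=1}^{n} \Bigl( H(X_k) - H\bigl(X_k \,\big|\, X_1^{k-1}, \hat{\mathbf{X}}\bigr) \Bigr) \tag{5.81} \\
&\ge \sum_{k=1}^{n} \Bigl( H(X_k) - H(X_k | \hat{X}_k) \Bigr) \tag{5.82} \\
&= \sum_{k=1}^{n} I(X_k; \hat{X}_k) \tag{5.83} \\
&\ge \sum_{k=1}^{n} R^{\mathrm{I}}\bigl( E[d(X_k; \hat{X}_k)] \bigr) && \text{(Def. 5.8)} \tag{5.84} \\
&= n \, \frac{1}{n} \sum_{k=1}^{n} R^{\mathrm{I}}\bigl( E[d(X_k; \hat{X}_k)] \bigr) \tag{5.85} \\
&\ge n R^{\mathrm{I}}\Biggl( \frac{1}{n} \sum_{k=1}^{n} E[d(X_k; \hat{X}_k)] \Biggr) && \text{(Jensen Inequality)} \tag{5.86} \\
&= n R^{\mathrm{I}}\Biggl( E\Biggl[ \frac{1}{n} \sum_{k=1}^{n} d(X_k; \hat{X}_k) \Biggr] \Biggr) \tag{5.87} \\
&= n R^{\mathrm{I}}\Bigl( E\bigl[ d\bigl(X_1^n; \hat{X}_1^n\bigr) \bigr] \Bigr) && \text{(by (5.25))} \tag{5.88} \\
&\ge n R^{\mathrm{I}}(D) && \text{(by (5.75))}. \tag{5.89}
\end{align}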
Here, (5.84) follows because the information rate distortion function by definition is a minimization of mutual information for a given maximum expected distortion. We hence pick this maximum expected distortion to be $E[d(X_k; \hat{X}_k)]$, i.e., the value that is implicitly specified by $q(\hat{x}|x)$ of our coding scheme. In (5.86) we rely on the convexity of $R^{\mathrm{I}}(\cdot)$ as shown in Lemma 5.11-2; and in the last inequality (5.89) we use that, according to Lemma 5.11-1, $R^{\mathrm{I}}(\cdot)$ is nonincreasing. This proves the converse.
One immediate consequence of the converse is a corollary that states that
no DMS can be compressed losslessly below its entropy. We have proven this
already in [Mos14, Theorem 4.14] under the assumption that we use a proper
message set. Here, we now generalize the statement for any coding scheme.
Corollary 5.13. There exists no coding scheme that can compress a DMS
losslessly below its entropy.
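Proof: Choose $\hat{\mathcal{X}} = \mathcal{X}$ and consider the Hamming distortion, such that
\[ E\bigl[ d(X, \hat{X}) \bigr] = \Pr[\hat{X} \ne X]. \tag{5.90} \]
If we require $D = 0$, i.e., $\Pr[\hat{X} \ne X] = 0$, we have
\begin{align}
R &\ge R^{\mathrm{I}}(0) \tag{5.91} \\
&= \min_{q \colon \Pr[X \ne \hat{X}] = 0} I(X; \hat{X}) \tag{5.92} \\
&= H(X) - \underbrace{\max_{q \colon \Pr[X \ne \hat{X}] = 0} H(X | \hat{X})}_{= 0} \tag{5.93} \\
&= H(X) = H(Q). \tag{5.94}
\end{align}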
Note the beauty of the converse (5.89): There are no epsilons involved nor
any limits! So, what we actually have proven is that for any coding scheme
that satisfies (5.75) and for any finite n, it must hold that
$R \ge R^{\mathrm{I}}(D).$
(5.95)
(5.97)
(5.98)
but with a rate smaller than $R^{\mathrm{I}}(D)$? The answer to this question is no. The reason is as follows. Since $D_n$ is decreasing, for every $\epsilon > 0$ there exists an $n_0$ such that for all $n \ge n_0$ we have
\[ E\bigl[ d\bigl(X_1^n, \hat{X}_1^n\bigr) \bigr] \le D + \epsilon. \tag{5.99} \]
Hence, these coding schemes cannot have a rate smaller than $R^{\mathrm{I}}(D + \epsilon)$ and, since $\epsilon$ is arbitrary, we can approach $R^{\mathrm{I}}(D)$ arbitrarily closely, as long as $R^{\mathrm{I}}(\cdot)$ is continuous. So the only question that remains is whether $R^{\mathrm{I}}(\cdot)$ is continuous. This is indeed the case and will be proven later in Section 5.6. For a graphical explanation of this discussion, see Figure 5.6.
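Figure 5.6 (axes: rate versus distortion; the plotted curve is an information rate distortion function with a discontinuity, with $R^{\mathrm{I}}(D)$ and $R^{\mathrm{I}}(D + \epsilon)$ marked at the distortions $D$ and $D + \epsilon$): Our proof of the converse would fail if the information rate distortion function were not continuous.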
5.4.2
Achievability
(5.100)
\[ \bigl( \mathbf{x}, \hat{\mathbf{X}}(w) \bigr) \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \bigr). \tag{5.101} \]
If it finds several possible choices of w, it picks one. If it finds none, it
chooses w = 1.
The encoder then puts out w.
4: Decoder Design: For a given index w, the decoder puts out X(w).
5: Performance Analysis: We partition the sample space into three disjoint cases:
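1. The source sequence is not typical:
\[ \mathbf{X} \notin \mathcal{A}_{\epsilon}^{(n)}(Q) \tag{5.102} \]
(in which case we for sure cannot find a $w$ such that (5.101) is satisfied!).
2. The source sequence is typical, but there exists no codeword that is jointly typical with it:
\[ \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q), \quad \nexists\, w \colon \bigl( \mathbf{X}, \hat{\mathbf{X}}(w) \bigr) \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \bigr). \tag{5.103} \]
3. The source sequence is typical and there exists a codeword that is jointly typical with it:
\[ \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q), \quad \exists\, w \colon \bigl( \mathbf{X}, \hat{\mathbf{X}}(w) \bigr) \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \bigr). \tag{5.104} \]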
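We now apply the Total Expectation Theorem to compute the expected achieved distortion of our random system:
\begin{align}
E\bigl[ d(\mathbf{X}, \hat{\mathbf{X}}) \bigr] ={} & \underbrace{E\bigl[ d\bigl(X_1^n, \hat{X}_1^n\bigr) \,\big|\, \text{Case 1} \bigr]}_{\le d_{\max}} \Pr(\text{Case 1}) + \underbrace{E\bigl[ d\bigl(X_1^n, \hat{X}_1^n\bigr) \,\big|\, \text{Case 2} \bigr]}_{\le d_{\max}} \Pr(\text{Case 2}) \\
& + E\bigl[ d\bigl(X_1^n, \hat{X}_1^n\bigr) \,\big|\, \text{Case 3} \bigr] \underbrace{\Pr(\text{Case 3})}_{\le 1} \tag{5.105}
\end{align}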
(5.106)
(5.107)
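To bound the probability of Case 2, note that since each codeword $\hat{\mathbf{X}}(w)$ has been generated IID without considering any other codeword, the probability that there exists no codeword that is jointly typical with the source sequence is simply the product (over all $w$) of the probabilities that the $w$-th codeword is not jointly typical with the source sequence. Moreover, each of the probabilities in this product is the same, independent of $w$. Hence, we get
\begin{align}
\Pr(\text{Case 2}) &= \underbrace{\Pr\bigl[ \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q) \bigr]}_{\le 1} \cdot \Pr\Bigl[ \nexists\, w \colon \bigl( \mathbf{X}, \hat{\mathbf{X}}(w) \bigr) \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \bigr) \,\Big|\, \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q) \Bigr] \tag{5.109} \\
&\le \Pr\Bigl[ \nexists\, w \colon \bigl( \mathbf{X}, \hat{\mathbf{X}}(w) \bigr) \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \bigr) \,\Big|\, \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q) \Bigr] \tag{5.110} \\
&= \prod_{w=1}^{e^{nR}} \Pr\Bigl[ \bigl( \mathbf{X}, \hat{\mathbf{X}}(w) \bigr) \notin \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \bigr) \,\Big|\, \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q) \Bigr] \tag{5.111} \\
&= \prod_{w=1}^{e^{nR}} \Pr\Bigl[ \hat{\mathbf{X}}(w) \notin \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \big| \mathbf{X} \bigr) \,\Big|\, \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q) \Bigr] \tag{5.112} \\
&= \prod_{w=1}^{e^{nR}} \Bigl( 1 - \Pr\Bigl[ \hat{\mathbf{X}}(w) \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \big| \mathbf{X} \bigr) \,\Big|\, \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q) \Bigr] \Bigr) \tag{5.113} \\
&= \Bigl( 1 - \Pr\Bigl[ \hat{\mathbf{X}} \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \big| \mathbf{X} \bigr) \,\Big|\, \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q) \Bigr] \Bigr)^{e^{nR}} \tag{5.114} \\
&\le \exp\Bigl( -e^{nR} \Pr\Bigl[ \hat{\mathbf{X}} \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \big| \mathbf{X} \bigr) \,\Big|\, \mathbf{X} \in \mathcal{A}_{\epsilon}^{(n)}(Q) \Bigr] \Bigr) \tag{5.115} \\
&< \exp\Bigl( -e^{nR} \bigl( 1 - \delta_t(n, \epsilon, \mathcal{X} \times \hat{\mathcal{X}}) \bigr) e^{-n(I(X;\hat{X}) + \epsilon_m(Q_{X,\hat{X}}) + \epsilon_m(Q_{\hat{X}}))} \Bigr) \tag{5.116} \\
&= \exp\Bigl( -e^{n(R - I(X;\hat{X}) - \delta)} \Bigr), \tag{5.117}
\end{align}
where
\[ \delta = \epsilon_m\bigl( Q_{X,\hat{X}} \bigr) + \epsilon_m\bigl( Q_{\hat{X}} \bigr) - \frac{1}{n} \log\bigl( 1 - \delta_t(n, \epsilon, \mathcal{X} \times \hat{\mathcal{X}}) \bigr). \tag{5.118} \]
Here, in (5.112) we use the definition of conditionally typical sets (Definition 4.10); (5.114) follows from the independence of the probability expression on $w$; the inequality (5.115) is due to the Exponentiated IT Inequality (Corollary 1.10); and (5.116) follows from TC.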
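The doubly exponential form of the bound exp(-e^{n(R - I - δ)}) explains why the probability of Case 2 vanishes so quickly once the rate exceeds the mutual information. The sketch below only illustrates this behaviour; the values chosen for the mutual information and for δ are hypothetical, and the function name `case2_bound` is not from the notes.

```python
import math

def case2_bound(n, R, I, delta):
    """Evaluate exp(-e^{n (R - I - delta)}), the upper bound on Pr(Case 2)."""
    exponent = min(n * (R - I - delta), 700.0)  # clip so that math.exp cannot overflow
    return math.exp(-math.exp(exponent))

# The step (1 - q)^M <= exp(-M q) used in the derivation, checked for one pair (q, M):
q, M = 0.01, 100
print((1 - q) ** M, math.exp(-M * q))

# Hypothetical numbers: I(X; X_hat) = 0.2 nats, delta = 0.05.
for n in (10, 50, 200):
    print(n, case2_bound(n, R=0.3, I=0.2, delta=0.05), case2_bound(n, R=0.2, I=0.2, delta=0.05))
```

With $R = 0.3 > I + \delta$ the bound collapses to essentially zero already for moderate $n$, whereas with $R = 0.2 < I + \delta$ it tends to 1 and becomes useless.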
So we see that as long as
\[ R > I(X; \hat{X}) + \delta \tag{5.119} \]
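\begin{align}
d(\mathbf{x}, \hat{\mathbf{x}}) &= \frac{1}{n} \sum_{k=1}^{n} d(x_k, \hat{x}_k) \tag{5.120} \\
&= \frac{1}{n} \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}} N(a, b | \mathbf{x}, \hat{\mathbf{x}}) \, d(a, b) \tag{5.121} \\
&= \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}} P_{\mathbf{x},\hat{\mathbf{x}}}(a, b) \, d(a, b) \tag{5.122} \\
&\le \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}} \Bigl( Q_{X,\hat{X}}(a, b) + \frac{\epsilon}{|\mathcal{X}|\,|\hat{\mathcal{X}}|} \Bigr) d(a, b) \tag{5.123} \\
&\le \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}} Q_{X,\hat{X}}(a, b) \, d(a, b) + \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}} \frac{\epsilon}{|\mathcal{X}|\,|\hat{\mathcal{X}}|} \, d_{\max} \tag{5.124} \\
&= E\bigl[ d(X, \hat{X}) \bigr] + \epsilon \, d_{\max}. \tag{5.125}
\end{align}
Here in (5.122) we used the definition of joint types, and in the subsequent inequality (5.123) we relied on the assumption that $(\mathbf{x}, \hat{\mathbf{x}}) \in \mathcal{A}_{\epsilon}^{(n)}\bigl( Q_{X,\hat{X}} \bigr)$.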
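The step from the per-letter average to the joint type, and the resulting bound by the expected distortion plus a term of size ε times the maximum distortion, can be verified on a toy example. The joint PMF, the sequences and the function names below are assumptions made only for this illustration; the pair of sequences is chosen such that its joint type equals the PMF exactly, so it is jointly typical for every ε > 0.

```python
from collections import Counter

def average_distortion(x, x_hat, d):
    """Per-letter average distortion (1/n) * sum_k d(x_k, x_hat_k)."""
    return sum(d(a, b) for a, b in zip(x, x_hat)) / len(x)

def distortion_from_type(x, x_hat, d):
    """The same quantity computed from the joint type: sum_{a,b} P(a,b) d(a,b)."""
    counts = Counter(zip(x, x_hat))
    return sum(count / len(x) * d(a, b) for (a, b), count in counts.items())

def hamming(a, b):
    return 0 if a == b else 1

Q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # assumed joint PMF
x     = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
eps, d_max = 0.2, 1

d_seq = average_distortion(x, x_hat, hamming)
d_type = distortion_from_type(x, x_hat, hamming)
expected = sum(q * hamming(a, b) for (a, b), q in Q.items())
print(d_seq, d_type, expected + eps * d_max)
```

Both ways of computing the distortion agree, and the value stays below the expected distortion plus the slack term.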
Hence,
)
< dmax t (n, , X ) + dmax exp en(RI(X;X)
E d(X, X)
+ dmax
+ E d(X, X)
(5.126)
!
+ 0 D.
(5.127)
= E d(X, X)
So we see that as long as
\[ E\bigl[ d(X, \hat{X}) \bigr] \le D \tag{5.128} \]
and
\[ R > I(X; \hat{X}) \tag{5.129} \]
the random coding scheme works. We can now optimize our choice of $q(\hat{x}|x)$ in Step 1 such that it minimizes the mutual information under the constraint (5.128). Then, for a given $D \ge 0$ our random coding scheme works as long as
\[ R > \min_{q(\hat{x}|x) \colon E[d(X, \hat{X})] \le D} I(X; \hat{X}). \tag{5.130} \]
5.4.3 Discussion
We would like to discuss this result and highlight a few interesting points.
Firstly, note that we have not made any explicit choice for the distortion
measure $d(\cdot,\cdot)$. We only assumed that it is an average per-letter distortion
and that it is bounded (where the latter assumption could be relaxed if one is
careful with limits: note that the probability of all cases where the distortion
becomes large tends to zero exponentially fast!). The explicit choice for d
is only needed once we want to evaluate the minimization in (5.130).
Secondly, we would like to point out that Theorem 5.12 does not specify
what happens on the border when
\[ R = \min_{q(\hat{x}|x) \colon \mathrm{E}[d(X,\hat{X})] \le D} I(X;\hat{X}). \tag{5.131} \]
The converse does not exclude this case, but the achievability part does not
include it. This is identical to the situation of channel capacity, where one
cannot in general state whether a rate equal to capacity can be achieved or
not. By definition, these boundary cases are included in the rate distortion
region (see Definition 5.6).
Finally, note that we have actually proven a quite strong statement: We
have shown that the probability that our randomly designed system will not
work is very small and tends to zero exponentially fast! Had we relied in our
proof on weak typicality (as defined in [Mos14, Chapter 19]), then our proof
would have become much less direct and less strong: With weak typicality
one can only show that for any $\epsilon > 0$ one can find an n and $\epsilon_1$ such that the
expected distortion averaged over all codes of length n is less than $D + \epsilon_1$, and
that therefore there must exist at least one code with an average distortion
less than $D + \epsilon_1$.
As a matter of fact, if we go back to our achievability proof and think
about it, then we realize that we have shown that for our coding scheme the
probability of all sequences that are not well represented (i.e., yield a distortion
larger than D) is tending to zero exponentially fast in n! Concretely, we can
compute
\begin{align}
\Pr\bigl\{\mathbf{X} \colon \text{there is no good } \hat{\mathbf{X}}\bigr\}
&\le \delta_t(n, \epsilon, |\mathcal{X}|) + \exp\Bigl(-e^{n(R - I(X;\hat{X}) - \delta)}\Bigr) \tag{5.132} \\
&= e^{-n\bigl(\frac{\epsilon^2}{2|\mathcal{X}|^2} - \frac{|\mathcal{X}|}{n}\log(n+1)\bigr)} + e^{-e^{n(R - I(X;\hat{X}) - \delta)}}. \tag{5.133}
\end{align}
(Here the first exponent with factor $\frac{\epsilon^2}{2|\mathcal{X}|^2}$ is dominating.) Note that in the
proof that relies on weak typicality nothing is said about the probability of
bad representation. For example, it could be that 10% of all source sequences
result in a very bad representation with a distortion of 2D, while the
remaining 90% of sequences are so well represented that, on average, the
requirement $\mathrm{E}[d(\mathbf{X}, \hat{\mathbf{X}})] \le D$ is satisfied. We have proven that with our
scheme such a situation cannot occur! We will come back to this exponential
growth in Chapter 6.
\[ D(R) = \min_{q(\hat{x}|x) \colon I(X;\hat{X}) \le R} \mathrm{E}\bigl[d(X, \hat{X})\bigr] \tag{5.134} \]
is the minimum achievable distortion at a given rate R, i.e., any rate distortion
pair (R, D) with
\[ D > D(R) \tag{5.135} \]
is achievable, and any achievable rate distortion coding scheme with rate R
and distortion D must satisfy
\[ D \ge D(R). \tag{5.136} \]
5.5 Characterization of R(D)
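\begin{align}
\sum_{\hat{x}} q(\hat{x}|x) &= 1, \quad \forall\, x, \tag{5.138} \\
q(\hat{x}|x) &\ge 0, \quad \forall\, x, \hat{x}. \tag{5.139}
\end{align}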
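\begin{align}
&\underbrace{\sum_{x,\hat{x}} Q(x) q(\hat{x}|x) \log \frac{q(\hat{x}|x)}{\sum_{x'} Q(x') q(\hat{x}|x')}}_{=\, I(X;\hat{X})}
+ \lambda \Bigl(\sum_{x,\hat{x}} Q(x) q(\hat{x}|x) d(x, \hat{x}) - D\Bigr)
+ \sum_{x} \nu(x) \Bigl(\sum_{\hat{x}} q(\hat{x}|x) - 1\Bigr) \tag{5.140} \\
&\quad = \sum_{x,\hat{x}} Q(x) q(\hat{x}|x) \Bigl[\log \frac{q(\hat{x}|x)}{\sum_{x'} Q(x') q(\hat{x}|x')} + \lambda\, d(x, \hat{x}) + \frac{\nu(x)}{Q(x)}\Bigr]
\underbrace{{} - \lambda D - \sum_{x} \nu(x)}_{\text{constant}}. \tag{5.141}
\end{align}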
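Here the last two terms are constant and not really interesting. So we drop
them and additionally replace $\nu(\cdot)$ by $\mu(\cdot)$ in such a way that
\[ \frac{\nu(x)}{Q(x)} = -\log \frac{\mu(x)}{Q(x)}. \tag{5.142} \]
We get
\[ L(q, \lambda, \mu) \triangleq \sum_{x,\hat{x}} Q(x) q(\hat{x}|x) \Bigl[\log \frac{q(\hat{x}|x)}{\sum_{x'} Q(x') q(\hat{x}|x')} + \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)}\Bigr] \tag{5.143} \]
or, renaming the dummy summation variables,
\[ L(q, \lambda, \mu) \triangleq \sum_{a,b} Q(a) q(b|a) \Bigl[\log \frac{q(b|a)}{\sum_{a'} Q(a') q(b|a')} + \lambda\, d(a, b) - \log \frac{\mu(a)}{Q(a)}\Bigr]. \tag{5.144} \]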
Now fix an $x$ and an $\hat{x}$ and take the derivative with respect to $q(\hat{x}|x)$. Be
very careful because $q(\hat{x}|x)$ shows up in three places! Note that if the main
sum is at $b = \hat{x}$, but $a \ne x$, then $q(\hat{x}|x)$ still shows up in the sum inside the
logarithm!
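\begin{align}
\frac{\partial L(q, \lambda, \mu)}{\partial q(\hat{x}|x)}
&= Q(x) \Bigl[\log \frac{q(\hat{x}|x)}{\sum_{a'} Q(a') q(\hat{x}|a')} + \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)}\Bigr] \notag \\
&\quad + Q(x) q(\hat{x}|x) \cdot \frac{\sum_{a'} Q(a') q(\hat{x}|a')}{q(\hat{x}|x)} \cdot \frac{\sum_{a'} Q(a') q(\hat{x}|a') - q(\hat{x}|x) Q(x)}{\bigl(\sum_{a'} Q(a') q(\hat{x}|a')\bigr)^2} \notag \\
&\quad + \sum_{a \ne x} Q(a) q(\hat{x}|a) \cdot \frac{\sum_{a'} Q(a') q(\hat{x}|a')}{q(\hat{x}|a)} \cdot \frac{-q(\hat{x}|a) Q(x)}{\bigl(\sum_{a'} Q(a') q(\hat{x}|a')\bigr)^2} \tag{5.145} \\
&= Q(x) \Bigl[\log \frac{q(\hat{x}|x)}{\sum_{a'} Q(a') q(\hat{x}|a')} + \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)}\Bigr] \notag \\
&\quad + Q(x)\, \frac{\sum_{a'} Q(a') q(\hat{x}|a') - q(\hat{x}|x) Q(x)}{\sum_{a'} Q(a') q(\hat{x}|a')}
- \sum_{a \ne x} Q(a)\, \frac{q(\hat{x}|a) Q(x)}{\sum_{a'} Q(a') q(\hat{x}|a')} \tag{5.146} \\
&= Q(x) \Bigl[\log \frac{q(\hat{x}|x)}{p(\hat{x})} + \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)}\Bigr]
+ Q(x) - \underbrace{\sum_{a} Q(a)\, \frac{q(\hat{x}|a) Q(x)}{p(\hat{x})}}_{=\, \frac{Q(x)}{p(\hat{x})} \sum_a Q(a) q(\hat{x}|a) \,=\, Q(x)} \tag{5.147} \\
&= Q(x) \Bigl[\log \frac{q(\hat{x}|x)}{p(\hat{x})} + \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)}\Bigr], \tag{5.148}
\end{align}
where
\[ p(\hat{x}) \triangleq \sum_{x} Q(x) q(\hat{x}|x). \]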
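Since $Q(x) > 0$ we hence now get from the KKT conditions that
\begin{align}
\log \frac{q(\hat{x}|x)}{p(\hat{x})} + \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)} &\overset{!}{=} 0 \quad \text{if } q(\hat{x}|x) > 0, \tag{5.149} \\
\log \frac{q(\hat{x}|x)}{p(\hat{x})} + \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)} &\overset{!}{\ge} 0 \quad \text{if } q(\hat{x}|x) = 0. \tag{5.150}
\end{align}
For $q(\hat{x}|x) > 0$, (5.149) yields
\[ \log \frac{q(\hat{x}|x)}{p(\hat{x})} = \log \frac{\mu(x)}{Q(x)} - \lambda\, d(x, \hat{x}) \tag{5.151} \]
or
\[ q(\hat{x}|x) = \frac{p(\hat{x}) \mu(x)}{Q(x)}\, e^{-\lambda d(x,\hat{x})}. \tag{5.152} \]
\begin{align}
p(\hat{x}) = \sum_{x} Q(x) q(\hat{x}|x) &= \sum_{x} Q(x)\, \frac{p(\hat{x}) \mu(x)}{Q(x)}\, e^{-\lambda d(x,\hat{x})} \tag{5.153} \\
&= p(\hat{x}) \sum_{x} \mu(x)\, e^{-\lambda d(x,\hat{x})}, \tag{5.154}
\end{align}
i.e., for $p(\hat{x}) > 0$,
\[ \sum_{x} \mu(x)\, e^{-\lambda d(x,\hat{x})} = 1. \tag{5.155} \]
\begin{align}
1 = \sum_{\hat{x}} q(\hat{x}|x) &= \sum_{\hat{x}} \frac{p(\hat{x}) \mu(x)}{Q(x)}\, e^{-\lambda d(x,\hat{x})} \tag{5.156} \\
&= \frac{\mu(x)}{Q(x)} \sum_{\hat{x}} p(\hat{x})\, e^{-\lambda d(x,\hat{x})}, \tag{5.157}
\end{align}
i.e.,
\[ \mu(x) = \frac{Q(x)}{\sum_{\hat{x}} p(\hat{x})\, e^{-\lambda d(x,\hat{x})}}. \tag{5.158} \]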
(5.159)
q(
xx)
q(
xx)
= lim P
= lim P
= 1.
0 )q(
0)
0 )
0
0
p(
x)
Q(x
x
x
Q(x
0
0
x
x
(5.160)
q(
xx)
= log 1 = 0.
p(
x)
(5.161)
This is only handwaving, however, the claim can be shown properly using
some sophisticated variation argument: Fiddle around a little with one component of q and check its impact on L.
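Then we get from (5.150)
\[ \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)} \ge 0 \tag{5.162} \]
or
\[ \mu(x)\, e^{-\lambda d(x,\hat{x})} \le Q(x). \tag{5.163} \]
From this now follows that for all $\hat{x}$ with $p(\hat{x}) = 0$ we have
\begin{align}
\sum_{x} \mu(x)\, e^{-\lambda d(x,\hat{x})} &\le \sum_{x} Q(x) \tag{5.164} \\
&= 1. \tag{5.165}
\end{align}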
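We combine (5.165) and (5.159) to yield the KKT conditions for the rate
distortion function.
Theorem 5.15 (Karush–Kuhn–Tucker Conditions for the Rate
Distortion Function).
A PMF $p(\hat{x})$ is the solution to the rate distortion minimization if
\[ \sum_{x} \frac{Q(x)\, e^{-\lambda d(x,\hat{x})}}{\sum_{b} p(b)\, e^{-\lambda d(x,b)}}
\begin{cases} = 1 & \text{if } p(\hat{x}) > 0, \\ \le 1 & \text{if } p(\hat{x}) = 0, \end{cases} \tag{5.166} \]
where $\lambda$ is such that
\[ \sum_{x,\hat{x}} Q(x) q(\hat{x}|x) d(x, \hat{x}) = D \tag{5.167} \]
with
\[ q(\hat{x}|x) = \frac{p(\hat{x})\, e^{-\lambda d(x,\hat{x})}}{\sum_{b} p(b)\, e^{-\lambda d(x,b)}}. \tag{5.168} \]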
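Here for (5.168) we have plugged (5.158) into (5.152). Note that $I(X;\hat{X})$ is
given by
\[ I(X;\hat{X}) = \sum_{x,\hat{x}} Q(x)\, \frac{p(\hat{x})\, e^{-\lambda d(x,\hat{x})}}{\sum_{b} p(b)\, e^{-\lambda d(x,b)}} \log\Biggl(\frac{e^{-\lambda d(x,\hat{x})}}{\sum_{b} p(b)\, e^{-\lambda d(x,b)}}\Biggr). \tag{5.169} \]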
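The fixed-point structure of the KKT conditions suggests a simple numerical procedure. The sketch below is not taken from these notes: it is the classical Blahut–Arimoto alternating iteration, named here plainly as a swapped-in technique. It alternates between the test channel of the form (5.168) and the output marginal p(x_hat) = sum_x Q(x) q(x_hat|x). The symbol `lam` for the multiplier of the distortion constraint, the natural-logarithm units (nats), and the example source are assumptions made for illustration; the code also assumes Q(x) > 0 for all x.

```python
import math

def blahut_arimoto(Q, d, lam, iterations=200):
    """Alternate between q(x_hat|x) as in (5.168) and its output marginal p(x_hat).

    Q   : list of source probabilities Q(x), all assumed positive
    d   : distortion matrix, d[x][x_hat]
    lam : assumed multiplier of the distortion constraint, lam > 0
    Returns (rate in nats, expected distortion, p, q).
    """
    nx, nxh = len(Q), len(d[0])
    p = [1.0 / nxh] * nxh  # start from the uniform output distribution
    q = None
    for _ in range(iterations):
        q = []
        for x in range(nx):
            weights = [p[b] * math.exp(-lam * d[x][b]) for b in range(nxh)]
            total = sum(weights)
            q.append([w / total for w in weights])
        p = [sum(Q[x] * q[x][b] for x in range(nx)) for b in range(nxh)]
    distortion = sum(Q[x] * q[x][b] * d[x][b]
                     for x in range(nx) for b in range(nxh))
    rate = sum(Q[x] * q[x][b] * math.log(q[x][b] / p[b])
               for x in range(nx) for b in range(nxh) if q[x][b] > 0.0)
    return rate, distortion, p, q

if __name__ == "__main__":
    # Hypothetical example: uniform binary source with Hamming distortion.
    rate, dist, p, q = blahut_arimoto([0.5, 0.5], [[0, 1], [1, 0]], math.log(3))
    print(f"D = {dist:.4f}, R = {rate:.4f} nats")
```

Each value of `lam` yields one point on the rate distortion curve, so sweeping `lam` over a grid traces out the curve point by point.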
5.6
It actually turns out that the really interesting part of the Lagrangian defined
in (5.143) is the expression without the technical constraint of
\[ \sum_{\hat{x}} q(\hat{x}|x) = 1. \tag{5.170} \]
We define
\begin{align}
R'(q, \lambda) &\triangleq I(X;\hat{X}) + \lambda\, \mathrm{E}\bigl[d(X, \hat{X})\bigr] \tag{5.171} \\
&= \sum_{x,\hat{x}} Q(x) q(\hat{x}|x) \Bigl[\log \frac{q(\hat{x}|x)}{\sum_{x'} Q(x') q(\hat{x}|x')} + \lambda\, d(x, \hat{x})\Bigr]. \tag{5.172}
\end{align}
We will next show that by varying $\lambda \ge 0$ we can find all values of R(D) for
every $D \ge 0$.
Actually, we claim that $\lambda$ has more meaning than simply being the Lagrangian
multiplier to find the solution of the minimization in R(D). To see
this, fix $\lambda > 0$ and some $q(\cdot|\cdot)$. The latter defines a certain mutual information
$I(X;\hat{X})$ and a certain expected distortion $\mathrm{E}[d(X, \hat{X})]$.
We draw this rate distortion pair as a point in the distortion-rate plane and
then add a line of slope $-\lambda$ through this point. See Figure 5.7.
Recalling our definition of $R'$ in (5.171), we realize that the line of slope
$-\lambda$ crosses the rate-axis at $R'(q, \lambda)$.
We now repeat the same game, but this time we choose $q(\cdot|\cdot)$ to be
\[ q^* \triangleq \operatorname*{argmin}_{q} R'(q, \lambda) \tag{5.173} \]
Figure 5.7: Distortion-rate plane with a certain rate distortion pair
$\bigl(\mathrm{E}[d(X, \hat{X})], I(X;\hat{X})\bigr)$ and a line of slope $-\lambda$.
Figure 5.8: For every $\lambda > 0$, the line with slope $-\lambda$ through $R'(q^*, \lambda)$ is a
tangent to the rate distortion function $R(\cdot)$.
[Figure 5.9: labels "achievable by $q^*$", $R'(q^*, \lambda)$, $R(\cdot)$; axis: distortion. Caption not recoverable.]
Figure 5.10: A contradiction: a value $R'(q', \lambda)$ below the minimum $R'(q^*, \lambda)$.
Lemma 5.16. The rate distortion pair induced by $q^*$ and $\lambda$ lies on the rate
distortion curve $R(\cdot)$, and the line with slope $-\lambda$ through this point is a tangent
to $R(\cdot)$; see Figure 5.8.
Proof: Assume first that $R(\cdot)$ does not intersect with the line, i.e., it lies
strictly above the line, see Figure 5.9. Then we have found a rate distortion
pair $(R, D) = \bigl(I_{q^*}(X;\hat{X}), \mathrm{E}_{q^*}[d(X, \hat{X})]\bigr)$ that is achievable (our choice $q^*$!),
but that lies below the rate distortion function. This is a contradiction to the
definition of $R(\cdot)$ being the minimum.
So assume that the line cuts $R(\cdot)$ either in or below our point,⁵ see Figure 5.10.
Then we can find some point on the rate distortion curve that is
induced by some other choice $q'$ and that is below the line. If we now draw
a second line through this new point with the same slope $-\lambda$, then this new
line will intersect the rate-axis in $R'(q', \lambda)$ (this can be argued the same way
as shown in Figure 5.7), which is below $R'(q^*, \lambda)$. However, this is a contradiction
to the fact that $R'(q^*, \lambda)$ is the minimum among all q for the given $\lambda$!
Hence, we see that $R(\cdot)$ must touch the line and that therefore the line
must be tangential.
⁵ Note that since $R(\cdot)$ is convex and nonnegative, any line of negative slope that cuts $R(\cdot)$
above our point will cut $R(\cdot)$ once more below our point!
From this lemma and its corresponding Figure 5.8 it now immediately follows
that for any $\lambda > 0$ and any $D \ge 0$,
\[ R'(q^*, \lambda) - \lambda D \le R(D), \tag{5.174} \]
D 0, > 0.
(5.175)
(5.177)
max
>0 s.t.
(5.177) is satisfied x
(5.178)
Also note that the maximization on the RHS is achieved by some $\mu(\cdot)$ only if
(5.177) is satisfied with equality for all $\hat{x}$ with $p(\hat{x}) > 0$.
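Proof: We start with the proof of (5.176). We assume that (5.177) holds
and recall the definition (5.143):
\begin{align}
L(q, \lambda, \mu) &= \sum_{x,\hat{x}} Q(x) q(\hat{x}|x) \Bigl[\log \frac{q(\hat{x}|x)}{p(\hat{x})} + \lambda\, d(x, \hat{x}) - \log \frac{\mu(x)}{Q(x)}\Bigr] \tag{5.179} \\
&= R'(q, \lambda) - \sum_{x} Q(x) \log \frac{\mu(x)}{Q(x)} \sum_{\hat{x}} q(\hat{x}|x) \tag{5.180} \\
&= R'(q, \lambda) - \sum_{x} Q(x) \log \mu(x) - \sum_{x} Q(x) \log \frac{1}{Q(x)} \tag{5.181} \\
&= R'(q, \lambda) - \sum_{x} Q(x) \log \mu(x) - H(Q). \tag{5.182}
\end{align}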
(5.182)
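\begin{align}
-L(q, \lambda, \mu)
&= \sum_{x,\hat{x}} Q(x) q(\hat{x}|x) \log \frac{p(\hat{x}) \mu(x)\, e^{-\lambda d(x,\hat{x})}}{q(\hat{x}|x) Q(x)} \tag{5.183} \\
&\le \sum_{x,\hat{x}} Q(x) q(\hat{x}|x) \Bigl(\frac{p(\hat{x}) \mu(x)\, e^{-\lambda d(x,\hat{x})}}{q(\hat{x}|x) Q(x)} - 1\Bigr) \log e \tag{5.184} \\
&= \Bigl(\sum_{x,\hat{x}} p(\hat{x}) \mu(x)\, e^{-\lambda d(x,\hat{x})} - \sum_{x,\hat{x}} Q(x) q(\hat{x}|x)\Bigr) \log e \tag{5.185} \\
&= \Bigl(\sum_{\hat{x}} p(\hat{x}) \underbrace{\sum_{x} \mu(x)\, e^{-\lambda d(x,\hat{x})}}_{\le 1 \text{ by (5.177)}} - 1\Bigr) \log e \tag{5.186} \\
&\le \Bigl(\sum_{\hat{x}} p(\hat{x}) - 1\Bigr) \log e \tag{5.187} \\
&= (1 - 1) \log e = 0, \tag{5.188}
\end{align}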
as long as the maximization is over those $\mu$ that satisfy (5.177). To show that
this can be achieved with equality, we investigate the inequalities (5.184) and
(5.187): The former is achieved with equality if, and only if,
\[ \frac{p(\hat{x}) \mu(x)}{q(\hat{x}|x) Q(x)}\, e^{-\lambda d(x,\hat{x})} = 1, \quad \forall\, q(\hat{x}|x) > 0. \tag{5.190} \]
From this follows (5.152) and therefore, analogously to the derivations shown
in (5.152)–(5.158), also (5.158).
The latter inequality (5.187) is achieved with equality if, and only if,
\[ \sum_{x} \mu(x)\, e^{-\lambda d(x,\hat{x})} = 1, \quad \forall\, \hat{x} \text{ with } p(\hat{x}) > 0. \tag{5.191} \]
This condition is identical to (5.155). Hence, the situation is completely analogous
to the derivations shown for the KKT conditions.
We know from the KKT conditions that such a choice of q and $\mu$ exists
(details omitted), and therefore equality can be achieved in (5.188), but only
if (5.191) holds.
Now we can combine Corollary 5.17 and Lemma 5.18 to find a lower bound for
the rate distortion function.
Theorem 5.19 (Lower Bound on $R(\cdot)$). For a given DMS Q and per-letter
distortion measure $d(\cdot,\cdot)$, we have for any $D \ge 0$
\[ R(D) \ge H(Q) + \sum_{x} Q(x) \log \mu(x) - \lambda D \tag{5.192} \]
for any $\lambda > 0$ and any $\mu(\cdot) > 0$ satisfying
\[ \sum_{x} \mu(x)\, e^{-\lambda d(x,\hat{x})} \le 1. \tag{5.193} \]
Note that this lower bound even contains free parameters that we can
choose. From the derivations of Corollary 5.17 and Lemma 5.18 we also see
that for every D there exists a particular choice of $\lambda$ and $\mu$ that will make the
lower bound tight.
Another important consequence of these derivations is as follows.
Corollary 5.20 (Continuity of $R(\cdot)$). The rate distortion function R(D) is
continuous for $D \ge 0$.
Proof: The continuity of $R(\cdot)$ for $D > 0$ follows directly from its convexity.
To see this, assume by contradiction that $R(\cdot)$ is convex, but contains a
discontinuity inside of the convexity interval. Then, we can find around this
discontinuity two points on $R(\cdot)$ such that their connecting line lies partially
below $R(\cdot)$, which contradicts the definition of convexity. See Figure 5.11
for a graphical picture of this situation.
On the boundary, on the other hand, a convex function could be discontinuous;
again see Figure 5.11. Since $R(\cdot)$ is convex for $D \ge 0$, we need to
check whether R(D) is continuous also for D = 0:
?
(5.194)
(5.195)
(5.196)
D0
(5.197)
R(D)
= R(0),
D 0,
(5.198)
(5.199)
min R
X)
q
q
{z
}

=
min
q : E[d(X,X)]=0
= or 0
I(X; X)
= R(0).
(5.200)
(5.201)
max
>0
P
x
x) 1
(x) ed(x,
x
with p(
x)>0 x
max
>0
x : d(x,
x)=0
x
with p(
x)>0
(5.202)
x) 1 x
(x) ed(x,
with p(
x)>0 x
max
P >0
(5.203)
(x)1 x
x : d(x,
x)=0
x
with p(
x)>0
where the inequality follows because on the RHS of (5.202) we only restrict
some values of () and are free to choose the others. On the other hand, also
note that
X
lim
max
Q(x) log (x)
P
x
>0
x) 1 x
(x) ed(x,
with p(
x)>0 x
=
lim
max
>0
x) 1 x
(x) ed(x,
with p(
x)>0 x
max
P >0
(x)1 x
x : d(x,
x)=0
x
with p(
x)>0
(5.204)
(5.205)
i.e., for any > 0 we can find a big enough such that
X
max
Q(x) log (x)
P
x
>0
x) 1 x
(x) ed(x,
with p(
x)>0 x
max
P >0
(x)1 x
(5.206)
x : d(x,
x)=0
x
with p(
x)>0
D0
= min R0 (q, )
(5.208)
= H(Q) +
max
>0
P
x
H(Q) +
(5.207)
x) 1
(x) ed(x,
max
P >0
(5.209)
x
with p(
x)>0 x
(x)1 x
(5.210)
x : d(x,
x)=0
x
with p(
x)>0
= H(Q) +
max
>0
P
x
x)
(x) ed(x,
1
x
with p(
x)>0
X
x
0 (q, )
= min R
(5.212)
= R(0) .
(5.213)
Here, the first inequality (5.207) follows from Corollary 5.17 and holds for
any $\lambda > 0$; in (5.209) we apply Lemma 5.18; the subsequent inequality (5.210)
then follows from (5.206); in (5.211) we reformulate the condition in the
maximization using our definition of $\tilde{d}(\cdot,\cdot)$ given in (5.197); the subsequent equality
(5.212) follows again from Corollary 5.17, this time applied to the case of $\tilde{d}(\cdot,\cdot)$;
and in the final step we use (5.201).
Since (5.213) holds for an arbitrary $\epsilon$, this proves the claim.
We can show even more.
Corollary 5.21. For any finite distortion measure, the slope of R(D) is continuous
for $0 < D < d_{\max}$ and approaches $-\infty$ as $D \downarrow 0$.
Proof: We have seen that $\min_q R'(q, \lambda)$ is the rate-axis crossing point of a
tangent to $R(\cdot)$ with slope $-\lambda$. If now $R(\cdot)$ had a point with a discontinuous
slope (or if the slope approached a finite limit as $D \downarrow 0$), then there would exist several
different tangents with different slopes to that point; see Figure 5.12. This, in
turn, means that any $q(\cdot|\cdot)$ achieving this point on the $R(\cdot)$-curve
must minimize $R'(q, \lambda)$ for several different values $\lambda$.
Hence, we complete our proof if we can show that any q that minimizes
$R'(q, \lambda)$ for two different $\lambda$, $\lambda_1 \ne \lambda_2$, will cause $I(X;\hat{X}) = 0$.
[Figure 5.12: labels "slope discontinuities" and $R'(q, \lambda_2)$; caption not recoverable.]
x)1 (x) 1 d(x,x) ! p(
x)2 (x) 2 d(x,x)
e
=
e
.
Q(x)
Q(x)
(5.214)
Hence, for p(
x) > 0 we have that6
1 (x)
= e(1 2 )d(x,x) .
2 (x)
(5.215)
Note that the lefthand side (LHS) of (5.215) does not depend on x
, which
means that the RHS cannot either. Therefore we must have that d(x, x
) is
independent of x
for all x
with p(
x) > 0. This now means that the joint
distribution is a product distribution:
Q(x, x
) = Q(x)q(
xx)
(5.214)
p(
x) 1 (x) e1 d(x,x) = p(
x) (x),

{z
}
(5.216)
independent of x
(5.217)
Figure 5.13: A typical shape of a rate distortion function (axes: distortion D, rate R(D)).
5.7
Similarly to [Mos14, Chapter 14], we can now combine source and channel
coding. Consider a DMS that shall be transmitted over a DMC with an
expected distortion of at most $D \ge 0$, see Figure 5.14. As before, we use $T_s$ to
denote the source clocking and $T_c$ to denote the channel clocking. We assume
that the encoder accepts K source symbols as inputs and then generates a
codeword of length n to be transmitted over the DMC. For synchronization
reasons, we need to have
\[ K T_s \overset{!}{=} n T_c. \tag{5.218} \]
[Figure 5.14 block diagram: DMS, $U_1, \ldots, U_K$, encoder, $X_1, \ldots, X_n$, DMC, $Y_1, \ldots, Y_n$, decoder, $\hat{U}_1, \ldots, \hat{U}_K$, dest.]
Figure 5.14: Joint source and channel coding system: a DMS is transmitted
over a DMC with distortion at most D.
An obvious way to design our system is to split the joint source and channel
coding scheme into a rate distortion coding scheme and a channel coding
scheme. The rate distortion coding scheme will work as long as its rate $R_{rd}$
(in bits per source symbol) satisfies $R_{rd} > R(D)$, and the channel coding
scheme will work as long as its rate $R_{ch}$ (in bits per channel use) satisfies
$R_{ch} < C$. Hence, this approach using a source-channel separation will work as
long as the rates of both systems, measured in bits per second, are in accord:
\[ \frac{R(D)}{T_s} < \frac{R_{rd}}{T_s} = \frac{R_{ch}}{T_c} < \frac{C}{T_c}, \tag{5.219} \]
i.e., as long as
\[ \frac{R(D)}{T_s} < \frac{C}{T_c}. \tag{5.220} \]
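The separation condition (5.220) is a simple comparison of two rates in bits per second. The following sketch only illustrates the bookkeeping; the function name and all numerical values are hypothetical and not taken from the notes.

```python
def separation_works(r_of_d, capacity, t_source, t_channel):
    """Check the strict condition R(D)/Ts < C/Tc of (5.220).

    r_of_d    : value of the rate distortion function R(D), bits per source symbol
    capacity  : channel capacity C, bits per channel use
    t_source  : source clocking Ts, seconds per source symbol
    t_channel : channel clocking Tc, seconds per channel use
    """
    return r_of_d / t_source < capacity / t_channel

if __name__ == "__main__":
    # Hypothetical numbers: the channel is clocked twice as fast as the source ...
    print(separation_works(0.5, 0.4, t_source=1.0, t_channel=0.5))
    # ... or half as fast as the source.
    print(separation_works(0.5, 0.4, t_source=1.0, t_channel=2.0))
```

By (5.218), the ratio Ts/Tc equals n/K, the number of channel uses available per source symbol, so the same check can be read as R(D) < (n/K) C.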
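\begin{align}
&\cdots \notag \\
&\ge \sum_{k=1}^{K} \bigl(H(U_k) - H(U_k | \hat{U}_k)\bigr) \tag{5.223} \\
&= \sum_{k=1}^{K} I(U_k; \hat{U}_k) \tag{5.224} \\
&\ge K\, \frac{1}{K} \sum_{k=1}^{K} R\bigl(\mathrm{E}[d(U_k; \hat{U}_k)]\bigr) \tag{5.225} \\
&\ge K\, R\Biggl(\frac{1}{K} \sum_{k=1}^{K} \mathrm{E}[d(U_k; \hat{U}_k)]\Biggr) \tag{5.226} \\
&\ge K\, R(D), \tag{5.227}
\end{align}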
where (5.222) holds because a DMS is memoryless. Moreover, using the fact
that we have a DMC used without feedback, we get
\begin{align}
I\bigl(U_1^K; \hat{U}_1^K\bigr) &\le I(X_1^n; Y_1^n) \tag{5.228} \\
&= H(Y_1^n) - H(Y_1^n | X_1^n) \tag{5.229} \\
&= \sum_{k=1}^{n} \Bigl(H\bigl(Y_k \big| Y_1^{k-1}\bigr) - H\bigl(Y_k \big| Y_1^{k-1}, X_1^n\bigr)\Bigr) \tag{5.230} \\
&= \sum_{k=1}^{n} \Bigl(H\bigl(Y_k \big| Y_1^{k-1}\bigr) - H(Y_k | X_k)\Bigr) \tag{5.231} \\
&\le \sum_{k=1}^{n} \bigl(H(Y_k) - H(Y_k | X_k)\bigr) \tag{5.232} \\
&= \sum_{k=1}^{n} I(X_k; Y_k) \tag{5.233} \\
&\le \sum_{k=1}^{n} C \tag{5.234} \\
&= nC, \tag{5.235}
\end{align}
where the first inequality (5.228) follows from the data processing inequality
(Proposition 1.12); in (5.231) we use the fact that the channel is a DMC
without feedback; (5.232) follows because conditioning reduces entropy; and
in (5.234) we apply the definition of channel capacity.
Hence, we see that any working joint source channel coding scheme must
satisfy
\[ K R(D) \le nC \tag{5.236} \]
or, by (5.218),
\[ \frac{R(D)}{T_s} \le \frac{C}{T_c}. \tag{5.237} \]
5.8
In [Mos14, Section 14.6] we have introduced a particular information transmission
scheme that tries to convey a binary DMS over a discrete memoryless
channel (DMC), where unfortunately the entropy of the source is larger than
the available capacity of the channel so that no lossless transmission is possible.
We have shown there that in this situation there exists an ultimate lower
bound on the bit error probability $P_b$:
\[ P_b > H_b^{-1}\Bigl(1 - \frac{T_s}{T_c}\, C\Bigr). \tag{5.238} \]
We have then proposed a system that includes between the DMS and the
channel encoder a lossy compression scheme that will reduce the entropy of
the source sequence H({Uk }) to H({Vk }), which is matched to the available
capacity. See Figure 5.15.
The question that we could not answer in [Mos14, Section 14.6] is whether
there exists a system that actually can achieve the lower bound (5.238) (arbitrarily closely). Using our newly acquired knowledge about rate distortion
systems, this question can now be answered.
[Figure 5.15: labels binary DMS, $U_1, \ldots, U_K$, lossy compressor, $V_1, \ldots, V_K$, encoder, $X_1, \ldots, X_n$, DMC, $Y_1, \ldots, Y_n$, decoder, destination; caption not recoverable.]
[Figure 5.16 labels: binary DMS, $U_1, \ldots, U_K$, RD encoder, channel encoder, $X_1, \ldots, X_n$, DMC, $Y_1, \ldots, Y_n$, channel decoder, RD decoder, $V_1, \ldots, V_K$, destination.]
Figure 5.16: Rate distortion combined with channel transmission: The rate
distortion system compresses the source sequence to make sure
that the entropy of W is below the channel capacity.
Let's quickly repeat the setup. We implement a rate distortion system
as the lossy compressor, see Figure 5.16. As before, we use $T_s$ to denote the
source clocking, and $T_c$ to denote the channel clocking, where (5.218) needs to
hold. Moreover, we assume that the binary DMS is uniform, i.e., $H(\{U_k\}) = 1$
bit/symbol. The capacity of the DMC is too small, i.e., we have
\[ \frac{1}{T_s} \text{ bits/s} > \frac{C}{T_c} \text{ bits/s}. \tag{5.239} \]
The rate distortion system has a rate⁷ R, i.e., we will have $e^{KR}$ different possible
indices W. Hence,
\[ \frac{H(W)}{K T_s} \le \frac{\log e^{KR}}{K T_s} = \frac{R}{T_s}. \tag{5.240} \]
And in order to make sure that the channel transmission of W will be reliable,
we need that
\[ \frac{R}{T_s} \overset{!}{\le} \frac{C}{T_c}. \tag{5.241} \]
\begin{align}
\cdots &= \frac{1}{K} \sum_{k=1}^{K} \mathrm{E}[d(U_k, V_k)] \tag{5.243} \\
&= \frac{1}{K} \sum_{k=1}^{K} \Pr[U_k \ne V_k] = P_b. \tag{5.244}
\end{align}
(5.246)
(since in our case we have $p = \frac{1}{2}$). Plugging (5.245) into (5.246) now yields a
lower bound on the necessary rate for our rate distortion coding scheme:
\[ R \ge R\Bigl(H_b^{-1}\Bigl(1 - \frac{T_s}{T_c}\, C\Bigr)\Bigr) = 1 - H_b\Bigl(H_b^{-1}\Bigl(1 - \frac{T_s}{T_c}\, C\Bigr)\Bigr) = \frac{T_s}{T_c}\, C. \tag{5.247} \]
This is exactly the maximum that is allowed in (5.241), i.e., the general lower
bound that we had derived in [Mos14, Section 14.6] coincides with the lower
bound in the rate distortion coding theorem (Theorem 5.12) and is therefore
achievable!
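The coincidence of the two bounds can be checked numerically. The sketch below is an illustration under stated assumptions: rates are in bits, the inverse of the binary entropy function on [0, 1/2] is computed by bisection (a choice made here, not taken from the notes), and the values of C, Ts, and Tc are hypothetical.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def hb_inv(h, tol=1e-12):
    """Inverse of hb restricted to [0, 1/2], found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if hb(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def min_bit_error(capacity, t_source, t_channel):
    """Right-hand side of (5.238): Hb^{-1}(1 - (Ts/Tc) C), or 0 if the capacity suffices."""
    ratio = t_source / t_channel * capacity
    if ratio >= 1.0:
        return 0.0
    return hb_inv(1.0 - ratio)

if __name__ == "__main__":
    pb = min_bit_error(capacity=0.5, t_source=1.0, t_channel=1.0)  # hypothetical values
    # For the uniform binary source, R(D) = 1 - Hb(D); evaluate it at D = pb.
    print(f"Pb bound = {pb:.6f}, required rate = {1.0 - hb(pb):.6f} bits")
```

Evaluating 1 - Hb(D) at the bound on the bit error probability returns (Ts/Tc) C up to the bisection tolerance, which is the equality stated in (5.247).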
5.9
We now extend our main result Theorem 5.12 to the important situation of a
Gaussian source.
5.9.1
\[ R(D) = \Bigl(\frac{1}{2} \log \frac{\sigma^2}{D}\Bigr)^{+} \tag{5.249} \]
where
\[ (\xi)^{+} \triangleq \max\{\xi, 0\}. \tag{5.250} \]
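Proof: As a matter of fact, we have only proven Theorem 5.12 for finite
alphabets and since our proofs relied on strong typicality, we cannot generalize
it to continuous RVs in a straightforward manner. However, it is not too hard
to show that Theorem 5.12 also holds for Gaussian sources with the squared
error distortion, i.e., it can be shown that
\[ R(D) = \inf_{f(\hat{x}|x) \colon \mathrm{E}[(X - \hat{X})^2] \le D} I(X;\hat{X}). \tag{5.251} \]
Here, $x, \hat{x} \in \mathbb{R}$, and $f(\cdot|\cdot)$ describes a conditional probability density function
(PDF). It remains to evaluate the minimization in (5.251).
\begin{align}
I(X;\hat{X}) &= h(X) - h(X | \hat{X}) \tag{5.252} \\
&= \frac{1}{2} \log 2\pi e \sigma^2 - h(X - \hat{X} | \hat{X}) \tag{5.253} \\
&\ge \frac{1}{2} \log 2\pi e \sigma^2 - h(X - \hat{X}) \tag{5.254} \\
&\ge \frac{1}{2} \log 2\pi e \sigma^2 - h\bigl(\mathcal{N}\bigl(0, \mathrm{E}[(X - \hat{X})^2]\bigr)\bigr) \tag{5.255} \\
&= \frac{1}{2} \log 2\pi e \sigma^2 - \frac{1}{2} \log 2\pi e\, \mathrm{E}\bigl[(X - \hat{X})^2\bigr] \tag{5.256} \\
&\ge \frac{1}{2} \log 2\pi e \sigma^2 - \frac{1}{2} \log 2\pi e D \tag{5.257} \\
&= \frac{1}{2} \log \frac{\sigma^2}{D}, \tag{5.258}
\end{align}
i.e.,
\[ R(D) \ge \frac{1}{2} \log \frac{\sigma^2}{D}. \tag{5.259} \]
(5.260)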
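with $\hat{X} \sim \mathcal{N}(0, \sigma^2 - D)$, $Z \sim \mathcal{N}(0, D)$, where $\hat{X}$ and $Z$ are independent. This yields the correct
output $X \sim \mathcal{N}(0, \sigma^2)$ and also satisfies
\[ \mathrm{E}\bigl[(X - \hat{X})^2\bigr] = \mathrm{E}\bigl[Z^2\bigr] = D. \tag{5.261} \]
The mutual information achieved by this scheme is
\[ I(X;\hat{X}) = \frac{1}{2} \log\Bigl(1 + \frac{\sigma^2 - D}{D}\Bigr) = \frac{1}{2} \log \frac{\sigma^2}{D}, \tag{5.262} \]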
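The two halves of the proof can be compared numerically. The sketch below evaluates the Gaussian rate distortion function of (5.249) and the mutual information of the test channel in (5.262); it uses natural logarithms (nats), and the function names and the numerical values are assumptions made for illustration.

```python
import math

def gaussian_rd(sigma2, D):
    """R(D) = (1/2 log(sigma^2 / D))^+ in nats, for variance sigma2 and distortion D > 0."""
    if D <= 0.0:
        raise ValueError("D must be positive")
    return max(0.5 * math.log(sigma2 / D), 0.0)

def mi_of_test_channel(sigma2, D):
    """I(X; X_hat) for X = X_hat + Z with X_hat ~ N(0, sigma2 - D) and Z ~ N(0, D)
    independent; only meaningful for 0 < D <= sigma2."""
    if not 0.0 < D <= sigma2:
        raise ValueError("need 0 < D <= sigma2")
    return 0.5 * math.log(1.0 + (sigma2 - D) / D)

if __name__ == "__main__":
    sigma2, D = 4.0, 1.0  # hypothetical values
    print(gaussian_rd(sigma2, D), mi_of_test_channel(sigma2, D))
    print(gaussian_rd(1.0, 2.0))  # D above the variance: rate 0
```

For 0 < D <= sigma2 the two functions agree, so the test channel meets the lower bound (5.259) with equality.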
5.9.2
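In the situation of Gaussian sources and Gaussian channels, the parallels between
rate distortion theory and channel coding theory are extremely pronounced.
To see this, first recall the situation of channel coding for a Gaussian
channel (see [Mos14, Section 16.3]). We have
\[ Y_k = X_k + Z_k \tag{5.264} \]
where $\{Z_k\}$ is IID $\sim \mathcal{N}(0, \sigma^2)$, and
\[ \frac{1}{n} \sum_{k=1}^{n} \mathrm{E}\bigl[X_k^2\bigr] \le E. \tag{5.265} \]
The received sequence $\mathbf{Y}$ lies with very high probability in a sphere of radius
\[ r_{tot} = \sqrt{\mathrm{E}\bigl[\|\mathbf{Y}\|^2\bigr]} = \sqrt{\sum_{k=1}^{n} \mathrm{E}\bigl[Y_k^2\bigr]} = \sqrt{\sum_{k=1}^{n} \bigl(\mathrm{E}\bigl[X_k^2\bigr] + \mathrm{E}\bigl[Z_k^2\bigr]\bigr)} = \sqrt{n(E + \sigma^2)}, \tag{5.266} \]
and for every codeword $\mathbf{x}$, the received vector $\mathbf{Y}$ lies with high probability in
a sphere around $\mathbf{x}$ with radius
\[ r = \sqrt{\mathrm{E}\bigl[\|\mathbf{Z}\|^2\bigr]} = \sqrt{\sum_{k=1}^{n} \mathrm{E}\bigl[Z_k^2\bigr]} = \sqrt{n \sigma^2}. \tag{5.267} \]
\[ \text{\# of codewords } M = e^{nR} \le \frac{A_n r_{tot}^n}{A_n r^n} = \frac{A_n \bigl(\sqrt{n(E + \sigma^2)}\bigr)^n}{A_n \bigl(\sqrt{n \sigma^2}\bigr)^n} = \Bigl(\frac{E + \sigma^2}{\sigma^2}\Bigr)^{\frac{n}{2}}, \tag{5.268} \]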
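i.e.,
\[ R \le \frac{1}{2} \log\Bigl(1 + \frac{E}{\sigma^2}\Bigr). \tag{5.269} \]
\[ \frac{1}{n} \sum_{k=1}^{n} \mathrm{E}\bigl[(X_k - \hat{X}_k)^2\bigr] \le D. \tag{5.270} \]
Here, every source sequence $\mathbf{X}$ lies with high probability in a sphere with
radius
\[ r_{tot} = \sqrt{\mathrm{E}\bigl[\|\mathbf{X}\|^2\bigr]} = \sqrt{\sum_{k=1}^{n} \mathrm{E}\bigl[X_k^2\bigr]} = \sqrt{n \sigma^2}, \tag{5.271} \]
and for every source sequence $\mathbf{x}$ there should exist a codeword $\hat{\mathbf{X}}$ that with
high probability lies in a sphere around $\mathbf{x}$ with radius
\[ r = \sqrt{\mathrm{E}\bigl[\|\hat{\mathbf{X}} - \mathbf{X}\|^2\bigr]} = \sqrt{\sum_{k=1}^{n} \mathrm{E}\bigl[(\hat{X}_k - X_k)^2\bigr]} = \sqrt{nD}. \tag{5.272} \]
\[ \text{\# of codewords } M = e^{nR} \ge \frac{A_n r_{tot}^n}{A_n r^n} = \frac{A_n \bigl(\sqrt{n \sigma^2}\bigr)^n}{A_n \bigl(\sqrt{nD}\bigr)^n} = \Bigl(\frac{\sigma^2}{D}\Bigr)^{\frac{n}{2}}, \tag{5.273} \]
i.e.,
\[ R \ge \frac{1}{2} \log \frac{\sigma^2}{D}. \tag{5.274} \]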
5.9.3
i = 1, . . . , m,
(5.275)
where $X_i$ and $X_j$ are independent for $i \ne j$. Assume we are given a certain amount D of total
allowed distortion (again assuming squared error distortion) and ask what rate
R is required to represent $(X_1, \ldots, X_m)$ within this allowed total distortion.
Again, we actually need to derive a coding theorem. But the proof is very
similar to what we have seen so far and we omit the details. The result is as
follows:
\[ R(D) = \min_{f(\hat{x}_1, \ldots, \hat{x}_m | x_1, \ldots, x_m) \colon \mathrm{E}\bigl[\sum_{i=1}^{m} (X_i - \hat{X}_i)^2\bigr] \le D} I(X_1, \ldots, X_m; \hat{X}_1, \ldots, \hat{X}_m). \tag{5.276} \]
Figure 5.17: The large sphere depicts an n-dimensional sphere of radius $r_{tot}$
and the small spheres all have radius r. In the case of channel
coding, the small spheres depict the codewords with some noise
around them. We try to put as many small spheres as possible into
the large sphere, but without having overlap such as to make sure
that the receiver will not confuse two codewords due to the noise.
In the case of rate distortion coding, the small spheres depict the
reconstruction vectors with the range of maximum allowed distortion
around them. We try to put as few small spheres as possible
into the large sphere, but making sure that the complete large
sphere is covered so that for any source sequence we find at least
one reconstruction vector within the allowed distance.
(5.277)
(5.278)
i=1
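\begin{align}
&\cdots \notag \\
&\ge \sum_{i=1}^{m} h(X_i) - \sum_{i=1}^{m} h(X_i | \hat{X}_i) \tag{5.279} \\
&= \sum_{i=1}^{m} I(X_i; \hat{X}_i) \tag{5.280} \\
&\ge \sum_{i=1}^{m} R\bigl(\mathrm{E}\bigl[(X_i - \hat{X}_i)^2\bigr]\bigr) \tag{5.281} \\
&= \sum_{i=1}^{m} R(D_i) \tag{5.282} \\
&= \sum_{i=1}^{m} \Bigl(\frac{1}{2} \log \frac{\sigma_i^2}{D_i}\Bigr)^{+}. \tag{5.283}
\end{align}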
Here, in (5.278) we use the chain rule and the fact that the m sources $X_i$ are
independent of each other; (5.279) follows because conditioning cannot increase
entropy; in (5.281) we use the definition of the rate distortion function of a
particular source (it is the minimum for a given average distortion!); (5.282)
should be read as a definition
\[ D_i \triangleq \mathrm{E}\bigl[(X_i - \hat{X}_i)^2\bigr], \tag{5.284} \]
\[ \sum_{i=1}^{m} D_i \le D; \tag{5.285} \]
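\[ f(\hat{x}_1^m | x_1^m) = \prod_{i=1}^{m} f(\hat{x}_i | x_i) \tag{5.286} \]
such that
\[ h\bigl(X_1^m \big| \hat{X}_1^m\bigr) = \sum_{i=1}^{m} h(X_i | \hat{X}_i); \tag{5.287} \]
and if we choose $f(\hat{x}_i | x_i)$ such that $\hat{X}_i \sim \mathcal{N}(0, \sigma_i^2 - D_i)$ if $\sigma_i^2 > D_i$, and $\hat{X}_i = 0$
otherwise, which then makes sure that
\[ I(X_i; \hat{X}_i) = \Bigl(\frac{1}{2} \log \frac{\sigma_i^2}{D_i}\Bigr)^{+}. \tag{5.288} \]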
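\begin{align}
R(D) &= \min_{\{D_i\} \colon \sum_{i=1}^{m} D_i = D}\ \sum_{i=1}^{m} \Bigl(\frac{1}{2} \log \frac{\sigma_i^2}{D_i}\Bigr)^{+} \tag{5.290} \\
&= \min_{\substack{\{D_i\} \text{ s.t. } \sum_{i=1}^{m} D_i = D \\ D_i \le \sigma_i^2}}\ \sum_{i=1}^{m} \frac{1}{2} \log \frac{\sigma_i^2}{D_i}. \tag{5.291}
\end{align}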
It only remains to figure out how to choose $D_i$. Note that this minimization
problem actually looks very much like the capacity problem of parallel
Gaussian channels described in [Mos14, Section 18.1]. While there we have
a concave function that is maximized with boundary constraints on the left,
$0 \le E_j$, here we have a convex function that is minimized with boundary
constraints on the right, $D_i \le \sigma_i^2$. So, we only need to adapt the KKT conditions
accordingly: we define
!
m
m
X
1
i2 X
L(D) ,
log
+
(5.292)
Di D ,
2
Di
i=1
i=1
=
+
Di
2 Di
= 0 if Di i2 ,
0 if Di > i2 .
(5.293)
m
X
1
i=1
log
i2
Di
(5.295)
where
(
Di =
i2
if i2 ,
if > i2
(5.296)
Di = D.
(5.297)
i=1
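The resulting allocation (often called reverse waterfilling) can be sketched in a few lines. This is an illustration under stated assumptions: the water level is called `beta` here because the symbol used in the notes is not recoverable, the level is found by bisection (a choice made here), rates are in nats, and the variances in the example are hypothetical.

```python
import math

def reverse_waterfilling(variances, D, iterations=200):
    """Allocate a total distortion D > 0 over independent Gaussian components.

    Each component gets D_i = min(beta, sigma_i^2), where the assumed water level
    beta is chosen such that sum_i D_i = D. Returns (D_i list, beta, total rate in nats).
    """
    if D <= 0.0:
        raise ValueError("D must be positive")
    if D >= sum(variances):
        # Every component is described with zero rate.
        return list(variances), max(variances), 0.0
    lo, hi = 0.0, max(variances)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        # sum_i min(mid, sigma_i^2) is nondecreasing in mid, so bisection applies.
        if sum(min(mid, s) for s in variances) < D:
            lo = mid
        else:
            hi = mid
    beta = (lo + hi) / 2.0
    allocation = [min(beta, s) for s in variances]
    rate = sum(0.5 * math.log(s / di) for s, di in zip(variances, allocation))
    return allocation, beta, rate

if __name__ == "__main__":
    alloc, beta, rate = reverse_waterfilling([4.0, 1.0, 0.25], D=1.5)  # hypothetical values
    print(alloc, beta, rate)
```

Components whose variance lies below the water level receive distortion equal to their variance and therefore contribute zero rate; all other components receive the same distortion `beta`.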
(5.298)
i2
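[Figure: seven independent Gaussian sources $X_1, \ldots, X_7$ with variances $\sigma_1^2, \ldots, \sigma_7^2$ and allotted distortions $D_1, \ldots, D_7$; caption not recoverable.]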
Chapter 6
Introduction
So far we have seen two versions of the rate distortion coding theorem:
• The first version we only mentioned without showing a proper proof. Its
proof is based on weak typicality, and it states that as long as R > R(D),
there exists a sequence of coding schemes such that
\[ \lim_{n \to \infty} \mathrm{E}\bigl[d(\mathbf{X}, \hat{\mathbf{X}})\bigr] \le D. \tag{6.1} \]
However, it does not say anything about the probability that a particular
source sequence is well represented; only the average
distortion is fine.
• The version shown in Chapter 5 (see Theorem 5.12 and the discussion
in Section 5.4.3), on the other hand, states that as long as R > R(D),
there exists a sequence of coding schemes such that
\[ \lim_{n \to \infty} \Pr\bigl[d(\mathbf{X}, \hat{\mathbf{X}}) > D\bigr] = 0. \tag{6.2} \]
For this chapter we must slightly adapt our notation from Chapter 5:
Instead of R(D) for the rate distortion function, we will write R(Q, D) to
explicitly show the dependence of the rate distortion function on the PMF of
the DMS.
Definition 6.1. For a given per-letter distortion measure $d(\cdot,\cdot)$, the rate
distortion function R(Q, D) of a discrete memoryless source Q and for a certain
allowed distortion $D \ge 0$ is defined as
\[ R(Q, D) \triangleq \min_{q(\hat{x}|x) \colon \mathrm{E}_Q[d(X,\hat{X})] \le D} I_Q(X;\hat{X}). \tag{6.3} \]
q(
xx) : IQ (X;X)R
xX , x
X
d(x, x
) = dmax < .
(6.5)
x X.
(6.6)
(6.7)
6.2
In Chapter 5 we have proven that any rate distortion coding scheme that
satisfies the average distortion
\[ \mathrm{E}\bigl[d\bigl(X_1^n, \hat{X}_1^n\bigr)\bigr] \le D \tag{6.9} \]
must have a rate
\[ R \ge R(Q, D). \tag{6.10} \]
1
log kn k R(Q, D) 0
n
(6.12)
1
log n
(6.13)
(
(x))
D
n n
n
n
o
= Pr x X n : x A(n)
(Q),
d
x,
(
(x))
D
n
n
n
n
o
n
+ Pr x X : x
/ A(n)
n (Q), d x, n (n (x)) D
n
o
(n)
Pr Bn ,D + Pr x X n : x
/ A(n)
(Q)
n
(n)
Pr Bn ,D + t (n, n , X ).
(6.14)
(6.15)
(6.16)
(6.17)
(6.18)
(6.19)
Here, (6.15) follows from (6.11); the subsequent equalities (6.16) and (6.17)
follow from total probability and because the two sets are disjoint; in (6.18)
(n)
we use the definition of Bn ,D and we enlarge the second set by dropping one
condition; and the final inequality (6.19) follows from the basic properties of
strongly typical sets (TA3b).
Hence,
(n)
Pr Bn ,D 1 t (n, n , X ).
(6.20)
(n)
(n)
(6.21)
(n)
xBn ,D
(6.22)
(n)
= Bn ,D en(H(Q)+n log Qmin ) .
(6.23)
<
(n)
xBn ,D
(6.24)
(6.25)
(6.26)
(6.27)
(6.28)
n log Qmin
(6.29)
X
X,
Noting that
n
1X
d(xk , x
k )
n
k=1
1X
) d(a, b)
=
N(a, bx, x
n
) =
d(x, x
(6.32)
(6.33)
aX
bX
aX
bX
,
= EPx,x d(X, X)
(6.34)
(6.35)
(6.36)
C
x
o
X n
(Q);
E
d(X,
X)
D
x X n : x A(n)
Px,
n
x
C
x
X
=
C
x
n
o
n
x X : Pxx = qXX
[
qXX
s.t. EPx
q
XX
(6.37)
(6.38)
[d(X,X)]D
n
aX
satisfies QX (a)Q(a)< X

X
C
x
n
o
x X n : Pxx = qXX (6.39)
X
qXX
s.t. EPx
q
XX
[d(X,X)]D
n
satisfies QX (a)Q(a)< X
aX

sup
C qXX
x
Pn (X X )
qXX
s.t. EPx
q
XX
n
o
x X n : Pxx = qXX
[d(X,X)]D
n
satisfies QX (a)Q(a)< X
aX

(6.40)
Figure 6.1: Every source sequence $\mathbf{x}$ is mapped to exactly one codeword $\hat{\mathbf{x}}$, but
for a source sequence there might exist more than one codeword
that is close enough to satisfy the distortion constraint (e.g., $\mathbf{x}'$
is mapped to $\hat{\mathbf{x}}'$, but could also be mapped to $\hat{\mathbf{x}}$). Hence, when
considering all source sequences around every codeword, we might
count some source sequences several times.
C qXX
x
Pn (X X )
n
T qXX x
sup
qXX
(6.41)
[d(X,X)]D
s.t. EPx
q
XX
n
satisfies QX (a)Q(a)< X
aX

(n + 1)X X 
sup
qXX
C
x
s.t. EPx
q
XX
en HPx (XX)
(6.42)
[d(X,X)]D
n
satisfies QX (a)Q(a)< X
aX

X
C
x
= kn k en H(XX)+ .
(6.43)
(6.44)
(6.45)
The inequality (6.36) follows because in the sum we possibly count some $\mathbf{x}$
several times as it is possible that a certain source sequence $\mathbf{x}$ is close to several
codewords $\hat{\mathbf{x}}$, see Figure 6.1. In (6.37) we use (6.35). In (6.38) we rewrite the
set as a union of sets of sequences with the same conditional type, where the
type is restricted such that the same two conditions as in (6.37) are satisfied.
Then (6.39) follows by the Union Bound. In (6.40) we upper-bound the value
by adding a supremum over all given conditional types and by enlarging the
sum to be over all possible conditional types. In (6.41) we then rename the set
by its proper name: the conditional type class given $\hat{\mathbf{x}}$. The size of this type
class is then upper-bounded in (6.42) by applying CTT3, and also the number
of conditional types is upper-bounded by applying CTT1. In (6.43) we make
use of the fact that the entropy in the exponent by definition is maximized
n H(Q) 4
(n)
e
< B,D kn k en H(XX)+
(6.46)
which leads to
kn k > en
0
X)
H(Q)H(X
en
Q(a) log
aX
0
4.
20
X)
H(Q)H(X
4
(6.47)
1
Q(a)
(6.48)
aX
aX
{z
X
1
n
log
>
QX (a)
n
X 
QX (a) + X

aX
X
n
n
=
QX (a)
log QX (a) +
X 
X 
aX
!
n
X
QX (a) + X

=
QX (a) log QX (a)
QX (a)
aX
X n
n
+
log QX (a) +
X 
X 
aX
X
X
QX (a) log 1 +
=
QX (a) log QX (a)

= H(X)
(6.49)
(6.50)
(6.51)
n
X  QX (a)
 {z }
pmin
n X
n
+
log QX (a) +
 {z } X 
X 
aX
0
X
n
n X
n
H(X)
QX (a) log 1 +
+
log
X  pmin
X 
X 
aX
aX
n
n
log 1 +
= H(X)
+ n log
X  pmin
X 
0
, for n large enough.
H(X)
4
(6.52)
(6.53)
(6.54)
(6.55)
(6.57)
0
.
4
(6.58)
that
< D ,
EQ qXX
d(X, X)
(6.59)
I(X; X)
< R(Q , D ) + .
(6.60)
Q q
XX
Note that because of (6.59) (with strict inequality!) and because E[d(X, X)]
is continuous, we can choose k large enough such that
Dk ,
EQk qXX
d(X, X)
(6.61)
i.e., qXX
is among the choices q in the minimization of R(Qk , Dk ):
R(Qk , Dk ) ,
min
q: EQk
q [d(X,X)]Dk
I(X; X)
I(X; X)
Q
k qXX
(6.62)
XX
XX
< R(Q , D ) + .
(6.64)
(k)
(k )
lim q j
j XX
k
()
q
XX
and
(6.65)
(6.66)
for some q
EQ
()
XX
= lim E
d(X, X)
j
(k )
Qkj q j
XX
= lim Dk = D
d(X, X)
j
j
(6.67)
(k )
XX
(the second equality follows because q j achieves R(Qkj , Dkj )) and therefore
R(Q , D ) I(X; X)
Q
()
q
XX
= lim I(X; X)
(k )
Qkj q j
(R(, ) is a minimization)
(6.68)
(6.69)
XX
(k )
XX
(6.70)
= lim R(Qk , Dk )
(by (6.65)).
(6.71)
(6.72)
(6.73)
i.e.,
k
6.3
Recall that denotes the probability that the source sequence X is not reproduced within the required distortion D (see Definition 6.3). We now state
the main result of this chapter.
Theorem 6.5 (Rate Distortion Error Exponent).
Fix a perletter distortion measure d(, ) with source alphabet X and reproduction alphabet X according to Remark 6.2. Then for every R 0
and every D 0, there exists a sequence of lengthn rate distortion coding
schemes (n , n ) such that
the number of indices tends to at most enR :
lim
1
log kn k R;
n
(6.74)
(n , n , Q, d, D) e
R(Q,D)>R
k Q)n
D(Q
(6.75)
1
X  log(n + 1) 0
n
(6.76)
as n .
Furthermore, for every R 0, every D 0, every sequence of coding
schemes satisfying (6.74), and for every source Q P(X ),
1
k Q).
log (n , n , Q, d, D)
inf
D(Q
: R(Q,D)>R
n n
Q
lim
(6.77)
have = 0.
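To understand the meaning of Theorem 6.5, first note that if R < R(Q, D),
then the set of distributions
\[ \bigl\{\tilde{Q} \colon R(\tilde{Q}, D) > R\bigr\} \tag{6.78} \]
also contains the distribution Q. This means that
\[ 0 \le \inf_{\tilde{Q} \colon R(\tilde{Q}, D) > R} D(\tilde{Q} \,\|\, Q) \le D(\tilde{Q} \,\|\, Q)\Big|_{\tilde{Q} = Q} = D(Q \,\|\, Q) = 0, \tag{6.79} \]
i.e.,
\[ \inf_{\tilde{Q} \colon R(\tilde{Q}, D) > R} D(\tilde{Q} \,\|\, Q) = 0, \tag{6.80} \]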
(6.81)
lim (n , n , Q, d, D) 1.
(6.82)
lim
or
n
inf
: R(Q,D)>R
Q).
D(Qk
(6.83)
The really cool thing, however, is that the theorem guarantees the existence
of one coding scheme $(\phi_n, \psi_n)$ that works for any $Q$ (as long as $R \ge R(Q, D)$)!
So, actually, what we have here is a universal compression scheme!
The performance of this universal compression scheme is not the same for
different sources: The further away the PMF $Q$ is from those $\tilde{Q}$ that have a
too large rate distortion function value $R(\tilde{Q}, D) > R$, the better the system will
perform, i.e., the quicker the error probability will tend to zero. See Figure 6.2 for a graphical
explanation.
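[Figure 6.2, labels only: the triangle $\mathcal{P}(\mathcal{X})$ of all sources; the shaded region of sources $\tilde{Q}$ that do not work because, for the given $R$ and $D$, $R(\tilde{Q}, D) > R$; a source $Q$ in the white area; and the distance $\inf D(\tilde{Q} \,\|\, Q)$ from $Q$ to the shaded region.]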
Figure 6.2: Graphical explanation of Theorem 6.5. The triangle depicts the
set of all sources, and the shaded area is the subset of sources for
which there exists no rate distortion coding scheme that works for
the given parameters R and D. There exists a coding scheme that
can compress all sources in the white area, for example the source
Q is a source that actually works. Its performance depends on
that do not work.
D , the distance to the closest of the sources Q
In short one can say that for every system (and for n ) we have
enD , and for all Q in the white area enD . Here, D depends on
the particular source Q and the parameters R and D.
In the following sections we are going to prove Theorem 6.5. In Section 6.3.1 an important lemma is proven that shows that any source sequence
in the type class of the source will for sure be reconstructable within the given
distortion $D$ and rate $R(D)$. This lemma is then used to prove the achievability in Section 6.3.2. The converse is proven in Section 6.3.3 and is based on
the strong converse of Section 6.2.
6.3.1
) D,
min d(x, x
B
x
(6.84)
and
0
B en(R(P,D)+ ) ,
(6.85)
(6.86)
(6.89)
(6.90)
(n)
D
dmax .
) A
Then for all (x, x
QX,X we have
1X
d(xk , x
k )
n
k=1
1 X
)d(a, b)
=
N(a, bx, x
n
) =
d(x, x
aX ,bX
QX,X (a, b) +
aX ,bX
X  X 
(6.91)
(6.92)
!
d(a, b)
(6.93)
+ dmax
E d(X, X)
(6.94)
D.
(6.96)
D + dmax
(6.95)
(6.97)
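The distortion bound for jointly typical pairs can be checked by brute force on a toy example. The following Python sketch is only an illustration under stated assumptions: it assumes that joint strong typicality means $|P_{\mathbf{x},\hat{\mathbf{x}}}(a,b) - Q(a,b)| < \epsilon / (|\mathcal{X}||\hat{\mathcal{X}}|)$ for all $(a,b)$ (with $P = 0$ wherever $Q = 0$), and the PMF `Q`, the Hamming distortion, and the values of `eps` and `n` are hypothetical choices, not taken from these notes. It verifies that every jointly typical pair has average distortion at most $\mathsf{E}[d(X,\hat{X})] + \epsilon\, d_{\max}$.

```python
from collections import Counter
from itertools import product


def joint_type(x, y):
    """Empirical joint distribution (type) of the pair of sequences (x, y)."""
    n = len(x)
    return {ab: c / n for ab, c in Counter(zip(x, y)).items()}


def is_jointly_typical(x, y, Q, eps):
    """Assumed definition: |P(a,b) - Q(a,b)| < eps / (|X||Xhat|) for all (a,b),
    and P(a,b) = 0 whenever Q(a,b) = 0. Q must list every pair (a,b)."""
    P = joint_type(x, y)
    tol = eps / len(Q)
    for ab, q in Q.items():
        p = P.get(ab, 0.0)
        if (q == 0.0 and p > 0.0) or abs(p - q) >= tol:
            return False
    return True


def avg_distortion(x, y, d):
    """Average per-letter distortion between the sequences x and y."""
    return sum(d(a, b) for a, b in zip(x, y)) / len(x)


def hamming(a, b):
    return 0.0 if a == b else 1.0


# Toy parameters (illustrative assumptions only).
Q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
eps, n, d_max = 0.5, 6, 1.0
expected_d = sum(q * hamming(a, b) for (a, b), q in Q.items())

# Enumerate all pairs of binary length-n sequences and keep the typical ones.
typical_pairs = [
    (x, y)
    for x in product((0, 1), repeat=n)
    for y in product((0, 1), repeat=n)
    if is_jointly_typical(x, y, Q, eps)
]
worst = max(avg_distortion(x, y, hamming) for x, y in typical_pairs)
print(len(typical_pairs), worst, expected_d + eps * d_max)
```

For such a short blocklength the bound is loose; the point of the sketch is only that the worst typical pair never exceeds it.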
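Now consider the random set $\mathcal{U}(Z_m)$, i.e., the set of all those $\mathbf{x} \in \mathcal{T}^n(P)$ for
which
\[ d(\mathbf{x}, \mathbf{Z}_i) > D, \quad i = 1, \ldots, m. \tag{6.98} \]
If we can show that $\mathsf{E}[|\mathcal{U}(Z_m)|] < 1$, then there must exist a set $\mathcal{B} \subseteq \mathcal{A}_\epsilon^{(n)}(Q_{\hat{X}})$
with $|\mathcal{U}(\mathcal{B})| < 1$, i.e., $|\mathcal{U}(\mathcal{B})| = 0$, i.e., $\mathcal{U}(\mathcal{B}) = \emptyset$, where$^1$
\[ |\mathcal{B}| \le \|Z_m\| = m. \tag{6.99} \]
$^1$ Note that in our random choice of $Z_m$ we do not prevent cases where the same vector
$\hat{\mathbf{x}} \in \mathcal{A}_\epsilon^{(n)}(Q_{\hat{X}})$ is picked several times. Hence, when regarding $Z_m$ as a set instead of a
matrix, the number of different vectors $\mathbf{Z}_i$ might be less than $m$.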
X
= E
xT n (P )
X
xT n (P )
X
xT n (P )
X
xT
X
xT
=
=
n (P )
n (P )
(6.100)
(6.101)
Pr[x U(Zm )]
m
Y
i=1
(6.102)
(6.103)
Pr[d(x, Zi ) > D, i = 1, . . . , m]
xT n (P ) i=1
m
X Y
xT
n (P )
I {x U(Zm )}
(6.104)
Pr[d(x, Zi ) > D]
(6.105)
(1 Pr[d(x, Zi ) D])
(6.106)
m
Y
1
xT n (P ) i=1
h
i h
i
(n)
x
Pr d(x, Zi ) D Zi A(n)
Q
Pr
Z
A
Q
i
X,X
X,X x

{z
}
= 1 by (6.96)
h
i
i h
(n)
(n)
/ A
QX,X x
Pr d(x, Zi ) D Zi
/ A
QX,X x Pr Zi

{z
}
0
m
Y
xT n (P ) i=1
h
i
.
1 Pr Zi A(n)
QX,X x
(6.107)
(6.108)
X,X
(6.109)
Pr Zi A(n)
Q
x
=
(n)
X,X
A (Q )
X
en(H(X)+m (QX ))
= en(H(XX)H(X))
(6.111)
= en(I(X;X)+) ,
(6.112)
where the inequality (6.110) follows from TA2 and TB2, and where
1
, m QX,X + m QX log 1 t (n, , X X ) .
n
(6.113)
m
Y
X
xT
n (P )
i=1
1 en(I(X;X)+)
m
= T n (P ) 1 en(I(X;X)+)
(6.114)
(6.115)
(6.116)
log X 
(6.117)
Here, in (6.116) we use TT3 and the Exponentiated IT Inequality (Corollary 1.10).
Now choose m as some integer satisfying
en(I(X;X)+2) m en(I(X;X)+3) .
(6.118)
For n large enough, this is always possible. Then it follows from (6.117)
(6.121)
for n large enough. Hence, there must exist a set B X n with U(B) = .
Moreover, from (6.118) this set satisfies
B m en(I(X;X)+3) .
(6.122)
(6.123)
(6.124)
The lemma now follows once we make n large enough and small enough such
that 4 0 .
6.3.2
Achievability
The proof of the achievability of Theorem 6.5 is strongly based on the Type
Covering Lemma (Lemma 6.6).
By Lemma 6.6 we know that for any 0 > 0 and for every type P Pn (X ),
there exists a set BP X n that satisfies for n large enough
0
BP  en(R+ )
(6.125)
and
) D(P, R),
min d(x, x
BP
x
x T n (P ).
(6.126)
Note that we have reformulated the Type Covering Lemma: While in Lemma 6.6 we fixed D and then computed R from D using the rate distortion
function, here we fix R and then compute D from R using the distortion rate
function:
.
(6.127)
D(P, R) =
min
EP d(X, X)
q(
xx) : IP q (X;X)R
Set
B,
[
P Pn (X )
en(R+ )
P Pn (X )
=e
X  n(R+0 )
n(R+ 0 )
(6.128)
(Union Bound)
(6.129)
(by (6.125))
(6.130)
= Pn (X ) en(R+ )
(n + 1)
BP
(6.131)
(by TT1)
(6.132)
(6.133)
Hence, we can choose a code (n , n ) using all codewords from B such that
1
1
log kn k lim log B lim {R + 0 } = R + 0 .
n n
n
n n
lim
(6.134)
Since this holds for an arbitrary 0 > 0, this proves (6.74). So lets check the
distortion caused by this code. Fix some D 0 (R is already fixed!) and
define
P(X ) : R(Q,
D) > R .
F, Q
(6.135)
Note that F basically denotes the set of those sources for which the rate
distortion theorem is not satisfied, i.e., for which we get into troubles. Then
by definition of F, for all x T n (F), Px is such that R(Px , D) > R. Hence,
(6.136)
BPx
x
D(Px , R)
D
(weakening minimum)
(6.137)
(by (6.126))
(6.138)
(by (6.136)).
(6.139)
(6.140)
D(Q k Q)
(n + 1)X  en inf QF
,
(6.141)
6.3.3
Converse
Q(a)
>
X 
(6.143)
(n)
> 0,
(6.144)
X 
(n)
where the first inequality follows from the definition of A (Q)
and the
second from our choice of . Hence, N(ax) > 0 and therefore
n
Y
n
Q (x) =
Q(xk )
(6.145)
k=1
N(bx)
Q(b)
(6.146)
bX
>0
z } {
N(ax)
= Q(a)
 {z }
=0
N(bx)
Q(b)
= 0,
(6.147)
bX \{a}
Qn A
= 0.
(Q)
(6.148)
(6.149)
(n)
for n large enough. Hence, B , A (Q)
meets the condition in the minimization on the LHS of (6.142) and definitively achieves the minimum
because of (6.148). The LHS will be equal to in this case.
Since Q(a)
> Q(a) = 0, the RHS also will yield
Q) = ,
D(Qk
(6.150)
Q(X)
.
Q(X)
(6.151)
Note that Y is finite with probability 1 because Q() > 0 and because the
Now,
event {Q(X)
= 0} has zero probability because X Q.
X
Q(X)
Q(x)
Q) , D, (6.152)
E[Y ] = E log
=
Q(x)
log
= D(Qk
Q(X)
Q(x)
xX
where the last equality has to be understood as definition of D.
Now define for some 0 > 0,
1
Qn (x)
n
0
0
A0 , x X : D log
D +
n (x)
n
Q
(6.153)
k=1
(6.154)
(6.155)
(6.156)
(6.157)
151
!2
n
X
1
= 1 Pr
Yk E[Y ]
> 02
n
k=1
h P
2 i
E n1 nk=1 Yk E[Y ]
1
02
h P
P
2 i
n
1
E n k=1 Yk E n1 nk=1 Yk
=1
02
1 Pn
Var n k=1 Yk
=1
02
Pn
Var[Yk ]
= 1 k=12 02
n
n Var[Y ]
=1
n2 02
Var[Y ]
1 for n large enough.
=1
n02
(6.158)
(6.159)
(6.160)
(6.161)
(6.162)
(6.163)
(6.164)
Here, the inequality follows from the Markov Inequality that states that for
any nonnegative $Z$ and any $t > 0$,
\[ \Pr[Z \ge t] \le \frac{\mathsf{E}[Z]}{t}. \tag{6.165} \]
n (B)1
BX n : Q
Qn (B) Qn (A0 )
X
=
Qn (x)
(6.166)
(6.167)
xA0
n (x)
en(D+ ) Q
(6.168)
xA0
0
n (A 0 )
= en(D+ ) Q
 {z }
n(D+0 )
(6.169)
(6.170)
Here, the inequality (6.168) follows because of the definition of A0 that guarantees that
Qn (x)
0
en(D+ ) .
n (x)
Q
(6.171)
Hence,
lim
min
n BX n : Q
n (B)1
1
log Qn (B) D + 0 .
n
(6.172)
1
1 + 1 1 = 1 2.
(6.173)
(6.174)
Q (x)
(6.175)
we get
Qn (B) Qn (B A0 )
X
=
Qn (x)
(6.176)
(6.177)
xBA0
0
en(D )
n (x)
Q
n (B A0 )
= en(D ) Q
e
n(D0 )
(6.178)
xBA0
(1 2),
(6.179)
(6.180)
(6.182)
(6.183)
1
log kn k R + 0
n
(6.184)
(note that this is possible because we assume that (6.74) is satisfied). Hence,
1
D) 20 + 0 = R(Q,
D) 0 .
log kn k R + 0 < R(Q,
n
(6.185)
does not exist, then the converse is void and (6.77) trivially
Recall that if such a Q
claims that 0.
2
For
Now recall the strong converse (Theorem 6.4) applied to the source Q.
0
any 0 < < 1 and > 0, if
n d X, n (n (X)) D 1 ,
Q
(6.186)
then
1
D) 0 ,
log kn k R(Q,
n
(6.187)
for n large enough. The contrapositive of this statement is as follows: For any
0 < < 1 and 0 > 0, if
1
D) 0 ,
log kn k < R(Q,
n
then, for n large enough,
n d X, n (n (X)) D < 1 ,
Q
(6.188)
(6.189)
or, equivalently,
n
Q
d X, n (n (X)) > D .
3
4
and define
B , x X n : d x, n (n (x)) > D .
Lets choose =
(6.190)
(6.191)
Now we know from (6.185) and (6.190) that for n large enough,
n (B) = 3 .
Q
4
(6.192)
with R(Q,
D) > R! Therefore, we have, for n
Note that this holds for any Q
large enough and an arbitrary > 0,
1
1
log (n , n , Q, d, D) = log Qn (B)
n
n
(6.193)
log Qn (B)
:
n
B
Q : R(Q,D)>R
n
o
kQ)
sup
D(Q
sup
min
n (B)
3
Q
4
(6.194)
(6.195)
: R(Q,D)>R
inf
: R(Q,D)>R
Q) .
D(Qk
(6.196)
(6.192), while the supremum has no influence since (6.192) holds for any Q
satisfying R(Q, D) > R. The subsequent inequality (6.195) then follows from
Lemma 6.8 for the choice of = 41 and for n large enough such that the value
Q).
is within
of the limiting value D(Qk
Since is arbitrary, this proves the converse.
Chapter 7
Problem Description
So far in rate distortion theory we have thought of one single source description
(the index that is generated from the source encoder) that will then be used
to produce an estimate of the source sequence (the output sequence of the
source decoder). But what happens if there are two or more compressors that
all can provide a description (i.e., an index)?
Of course, such a scheme is basically the same as a normal rate distortion
system because the different compressors all see the same source sequence and
therefore can cooperate with each other to achieve the same compression as if
they all were united into one big compressor. Hence, what happens in such a
setup is that the original single source index $w$ is split up into several indices
$w^{(1)}, \ldots, w^{(L)}$ that then are used to create a source description by the decoder.
So far this is not interesting. However, let's now ask what happens if one of
these indices somehow gets lost on its way. A standard rate distortion system
will fail to work: No proper index means no description! Here, however, we
still have $L - 1$ other indices available. So, we still should be able to get some
kind of description! This description, of course, will be slightly less accurate,
i.e., cause a higher distortion, since the total number of possible indices is
reduced.
Such a system could be very useful in practice. Consider, for example, a
network where due to packet loss some part of the message does not arrive
at the receiver. A traditional rate distortion system will fail, while a multiple
description system still can reproduce the source, just with slightly lower
accuracy.
So, let us define our setup more formally. For simplicity we concentrate
here on the case with two indices $w^{(1)}$ and $w^{(2)}$. As shown in Figure 7.1, we
have a source that generates an IID random sequence $X_1, \ldots, X_n$,
\[ \{X_k\} \ \text{IID} \sim Q. \tag{7.1} \]
[Figure 7.1: block diagram. The source sequence $X_1, \ldots, X_n$ enters Enc. (1) and Enc. (2), which put out $W^{(1)}$ and $W^{(2)}$; the decoder Dec. (i) delivers $\hat{X}_1, \ldots, \hat{X}_n$ to the destination.]
This source sequence is then fed into two encoders, which will generate indices
\begin{align}
W^{(1)} &= \phi^{(1)}(\mathbf{X}), \tag{7.2} \\
W^{(2)} &= \phi^{(2)}(\mathbf{X}), \tag{7.3}
\end{align}
with
\[ W^{(i)} \in \bigl\{1, 2, \ldots, e^{nR^{(i)}}\bigr\}, \quad i = 1, 2, \tag{7.4} \]
for two values $R^{(1)}$ and $R^{(2)}$. Note that it is irrelevant whether the two encoders are actually physically separate entities or whether they are combined in a
single source encoder. As mentioned above already, the reason for this is that
both encoders see exactly the same input and therefore perfectly know what
the other encoder does.
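The decoder will receive either $W^{(1)}$, or $W^{(2)}$, or both $(W^{(1)}, W^{(2)})$. Depending on what it receives, it will generate a description
\begin{align}
\hat{\mathbf{X}}^{(1)} &= \psi^{(1)}\bigl(W^{(1)}\bigr), \tag{7.5} \\
\text{or} \quad \hat{\mathbf{X}}^{(2)} &= \psi^{(2)}\bigl(W^{(2)}\bigr), \tag{7.6} \\
\text{or} \quad \hat{\mathbf{X}}^{(12)} &= \psi^{(12)}\bigl(W^{(1)}, W^{(2)}\bigr), \tag{7.7}
\end{align}
respectively. The decoding functions are deterministic mappings from the set
of possible indices to a sequence in the corresponding reconstruction alphabet
$\hat{\mathcal{X}}^{(i)}$:
\[ \psi^{(i)} \colon \bigl\{1, 2, \ldots, e^{nR^{(i)}}\bigr\} \to \bigl(\hat{\mathcal{X}}^{(i)}\bigr)^n, \quad i = 1, 2, \tag{7.8} \]
and
\[ \psi^{(12)} \colon \bigl\{1, 2, \ldots, e^{nR^{(1)}}\bigr\} \times \bigl\{1, 2, \ldots, e^{nR^{(2)}}\bigr\} \to \bigl(\hat{\mathcal{X}}^{(12)}\bigr)^n. \tag{7.9} \]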
Note that the case when the decoder receives no index is uninteresting and
therefore ignored.
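Depending on whether the first, the second, or both indices arrive at the
decoder, we ask for a different maximum allowed distortion:
\begin{align}
\mathsf{E}\Bigl[d^{(1)}\bigl(\mathbf{X}, \psi^{(1)}(\phi^{(1)}(\mathbf{X}))\bigr)\Bigr] &\le D^{(1)}, \tag{7.10} \\
\mathsf{E}\Bigl[d^{(2)}\bigl(\mathbf{X}, \psi^{(2)}(\phi^{(2)}(\mathbf{X}))\bigr)\Bigr] &\le D^{(2)}, \tag{7.11} \\
\mathsf{E}\Bigl[d^{(12)}\bigl(\mathbf{X}, \psi^{(12)}(\phi^{(1)}(\mathbf{X}), \phi^{(2)}(\mathbf{X}))\bigr)\Bigr] &\le D^{(12)}, \tag{7.12}
\end{align}
for some given values $D^{(1)}$, $D^{(2)}$, $D^{(12)}$. To keep things as general as possible,
we even allow a different distortion measure and different reproduction alphabets for the three different cases. However, in all cases we will stick to our old
(and poor!) assumption of a distortion measure that is an average per-letter
distortion:
\[ d^{(i)}(\mathbf{x}, \hat{\mathbf{x}}) \triangleq \frac{1}{n} \sum_{k=1}^{n} d^{(i)}(x_k, \hat{x}_k), \quad i = 1, 2, 12. \tag{7.13} \]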
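So the question now is what parameters $R^{(1)}$, $R^{(2)}$, $D^{(1)}$, $D^{(2)}$, and $D^{(12)}$ can
be chosen for some given source $Q$ and some per-letter distortion functions,
such that we can find a multiple description rate distortion coding scheme
that works.
Definition 7.2. A multiple description rate distortion quintuple
\[ \bigl(R^{(1)}, R^{(2)}, D^{(1)}, D^{(2)}, D^{(12)}\bigr) \]
is said to be achievable for a source $Q$ and for some distortion measures $d^{(i)}(\cdot,\cdot)$
if there exists a sequence of $\bigl(e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ multiple description rate distortion coding schemes $\bigl(\phi_n^{(1)}, \phi_n^{(2)}, \psi_n^{(1)}, \psi_n^{(2)}, \psi_n^{(12)}\bigr)$ satisfying the distortion constraints
\begin{align}
\lim_{n\to\infty} \mathsf{E}\Bigl[d^{(1)}\bigl(\mathbf{X}, \psi_n^{(1)}(\phi_n^{(1)}(\mathbf{X}))\bigr)\Bigr] &\le D^{(1)}, \tag{7.14} \\
\lim_{n\to\infty} \mathsf{E}\Bigl[d^{(2)}\bigl(\mathbf{X}, \psi_n^{(2)}(\phi_n^{(2)}(\mathbf{X}))\bigr)\Bigr] &\le D^{(2)}, \tag{7.15} \\
\lim_{n\to\infty} \mathsf{E}\Bigl[d^{(12)}\bigl(\mathbf{X}, \psi_n^{(12)}(\phi_n^{(1)}(\mathbf{X}), \phi_n^{(2)}(\mathbf{X}))\bigr)\Bigr] &\le D^{(12)}. \tag{7.16}
\end{align}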
The multiple description rate distortion region for a source $Q$ and some distortion measures $d^{(i)}(\cdot,\cdot)$ is the closure of the set of all achievable multiple
description rate distortion quintuples.
Note the main problem that we have here. Since a good description of the
source must be similar to the source, two individual good descriptions are in
general quite similar and therefore dependent. This, however, means that the
second description cannot contribute much more new information in addition
to the first.
If, on the other hand, two descriptions are independent of each other such
that they together yield a far better description than they do alone, then they
usually will not be very good individually.
7.2
An Example
1
2
(7.17)
if a = b,
if a =
6 b.
(7.18)
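Suppose we require that $D^{(12)} = 0$, i.e., if both indices arrive we would like to
have perfect reconstruction. Let's consider a channel splitting approach where
the even numbered bits are transmitted over the first channel and the odd bits
over the second channel, i.e., we have rates $R^{(1)} = R^{(2)} = \frac{1}{2}$ bits. Then we get
\begin{align}
D^{(1)} &= \mathsf{E}\bigl[d\bigl(\mathbf{X}, \hat{\mathbf{X}}^{(1)}\bigr)\bigr] \tag{7.19} \\
&= \frac{1}{n} \sum_{k=1}^{n} \Pr\bigl[X_k \ne \hat{X}_k^{(1)}\bigr] \tag{7.20} \\
&= \frac{1}{n} \sum_{k=1}^{n} \Bigl( \tfrac{1}{2} \Pr\bigl[X_k \ne \hat{X}_k^{(1)} \,\big|\, k \text{ is even}\bigr] + \tfrac{1}{2} \Pr\bigl[X_k \ne \hat{X}_k^{(1)} \,\big|\, k \text{ is odd}\bigr] \Bigr) \tag{7.21} \\
&= \frac{1}{n} \sum_{k=1}^{n} \Bigl( \tfrac{1}{2} \cdot 0 + \tfrac{1}{2} \cdot \tfrac{1}{2} \Bigr) = \frac{1}{4}, \tag{7.22}
\end{align}
and
\begin{align}
D^{(2)} &= \mathsf{E}\bigl[d\bigl(\mathbf{X}, \hat{\mathbf{X}}^{(2)}\bigr)\bigr] \tag{7.23} \\
&= \frac{1}{n} \sum_{k=1}^{n} \Pr\bigl[X_k \ne \hat{X}_k^{(2)}\bigr] \tag{7.24} \\
&= \frac{1}{n} \sum_{k=1}^{n} \Bigl( \tfrac{1}{2} \Pr\bigl[X_k \ne \hat{X}_k^{(2)} \,\big|\, k \text{ is even}\bigr] + \tfrac{1}{2} \Pr\bigl[X_k \ne \hat{X}_k^{(2)} \,\big|\, k \text{ is odd}\bigr] \Bigr) \tag{7.25} \\
&= \frac{1}{n} \sum_{k=1}^{n} \Bigl( \tfrac{1}{2} \cdot \tfrac{1}{2} + \tfrac{1}{2} \cdot 0 \Bigr) = \frac{1}{4}. \tag{7.26}
\end{align}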
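Hence,
\[ \bigl(R^{(1)}, R^{(2)}, D^{(1)}, D^{(2)}, D^{(12)}\bigr) = \Bigl(\tfrac{1}{2} \text{ bits}, \tfrac{1}{2} \text{ bits}, \tfrac{1}{4}, \tfrac{1}{4}, 0\Bigr) \tag{7.27} \]
is achievable.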
However, as we will see below in Section 7.7, we can do better: For $R^{(1)} = R^{(2)} = \frac{1}{2}$ bits and $D^{(12)} = 0$ it is possible to achieve
\[ D^{(1)} = D^{(2)} = \frac{\sqrt{2} - 1}{2} \approx 0.207 < \frac{1}{4}. \tag{7.28} \]
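The channel splitting numbers of this example are easy to reproduce. The following Python sketch is an illustration only; it assumes a uniform binary source with Hamming distortion and a decoder that guesses every missing bit uniformly at random (the function names and the sample size are hypothetical). It computes the individual distortion $\frac{1}{4}$ both exactly and by simulation and compares it with the improved value $(\sqrt{2}-1)/2$ quoted in (7.28).

```python
import math
import random


def split_distortion_exact() -> float:
    """Expected Hamming distortion of decoder 1 under channel splitting:
    half of the positions are received (error probability 0), the other
    half must be guessed (error probability 1/2)."""
    return 0.5 * 0.0 + 0.5 * 0.5


def split_distortion_mc(n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity for a uniform binary source."""
    rng = random.Random(seed)
    errors = 0
    for k in range(1, n + 1):
        x = rng.getrandbits(1)
        # Decoder 1 knows the even-numbered bits and guesses the odd ones.
        x_hat = x if k % 2 == 0 else rng.getrandbits(1)
        errors += x != x_hat
    return errors / n


# Value quoted in (7.28) for R1 = R2 = 1/2 bits and D12 = 0.
IMPROVED = (math.sqrt(2) - 1) / 2

print(split_distortion_exact(), split_distortion_mc(), IMPROVED)
```

The simulation only confirms the simple splitting scheme; it does not construct a code that attains the improved value.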
7.3
(7.29)
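Then we fix some rates $R^{(1)}$ and $R^{(2)}$ and some blocklength $n$.
2: Codebook Design: We generate $e^{nR^{(1)}}$ length-$n$ codewords
\[ \hat{\mathbf{X}}^{(1)}\bigl(w^{(1)}\bigr), \quad w^{(1)} = 1, \ldots, e^{nR^{(1)}}, \]
by choosing each of the $n \cdot e^{nR^{(1)}}$ symbols independently at
random according to $Q_{\hat{X}^{(1)}}$.
Similarly, we generate $e^{nR^{(2)}}$ length-$n$ codewords
\[ \hat{\mathbf{X}}^{(2)}\bigl(w^{(2)}\bigr), \quad w^{(2)} = 1, \ldots, e^{nR^{(2)}}, \]
by choosing each of the $n \cdot e^{nR^{(2)}}$ symbols independently at
random according to $Q_{\hat{X}^{(2)}}$.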
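3: Encoder Design: For a given source sequence $\mathbf{x}$, the encoders try to
find a pair $(w^{(1)}, w^{(2)})$ such that
\[ \Bigl(\mathbf{x}, \hat{\mathbf{X}}^{(1)}\bigl(w^{(1)}\bigr), \hat{\mathbf{X}}^{(2)}\bigl(w^{(2)}\bigr), \hat{\mathbf{X}}^{(12)}\bigl(w^{(1)}, w^{(2)}\bigr)\Bigr) \in \mathcal{A}_\epsilon^{(n)}\bigl(Q_{X,\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}}\bigr). \tag{7.30} \]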
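If they find several possible choices, they pick one. If they find none,
they choose $w^{(1)} = w^{(2)} = 1$.
The first encoder $\phi^{(1)}$ puts out $w^{(1)}$, and the second encoder $\phi^{(2)}$ puts
out $w^{(2)}$.
4: Decoder Design: The decoder $(\psi^{(1)}, \psi^{(2)}, \psi^{(12)})$ consists of three different decoding functions, depending on whether $w^{(1)}$, $w^{(2)}$, or both
$(w^{(1)}, w^{(2)})$ are received. It puts out
\begin{align}
&\hat{\mathbf{X}}^{(1)}\bigl(w^{(1)}\bigr) && \text{if only } w^{(1)} \text{ is received}, \tag{7.31} \\
&\hat{\mathbf{X}}^{(2)}\bigl(w^{(2)}\bigr) && \text{if only } w^{(2)} \text{ is received}, \tag{7.32} \\
&\hat{\mathbf{X}}^{(12)}\bigl(w^{(1)}, w^{(2)}\bigr) && \text{if both } (w^{(1)}, w^{(2)}) \text{ are received}. \tag{7.33}
\end{align}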
5: Performance Analysis: We partition the sample space into three disjoint cases:
1. The source sequence is not typical:
\[ \mathbf{X} \notin \mathcal{A}_\epsilon^{(n)}(Q) \tag{7.34} \]
(in which case we for sure cannot find a pair $(w^{(1)}, w^{(2)})$ such that
(7.30) is satisfied!).
2. The source sequence is typical, but there exists no codeword triple
that is jointly typical with the source sequence:
\[ \mathbf{X} \in \mathcal{A}_\epsilon^{(n)}(Q), \quad \nexists\, \bigl(w^{(1)}, w^{(2)}\bigr) \colon \text{(7.30) is satisfied}. \tag{7.35} \]
3. The source sequence is typical and there exists a codeword triple
that is jointly typical with the source sequence:
\[ \mathbf{X} \in \mathcal{A}_\epsilon^{(n)}(Q), \quad \exists\, \bigl(w^{(1)}, w^{(2)}\bigr) \colon \text{(7.30) is satisfied}. \tag{7.36} \]
Then we compute the achieved expected distortion of our system, averaged both over the source and over the random code generation. If
this expected distortion is within tolerance, then our randomly generated coding scheme works.
The details of this analysis are given in the following Section 7.4.
7.4
(i)
(i)
+E d

+E d
dmax
(i)
X, X
{z
Case 2 Pr(Case 2)
}
X, X
Case 3 Pr(Case 3)
 {z }
dmax
(i)
(7.37)
(7.38)
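7.4.1
Case 1
(7.39)
7.4.2
Case 3
If the tuple $\bigl(\mathbf{X}, \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \hat{\mathbf{X}}^{(12)}\bigr)$ is jointly typical, then also each pair $\bigl(\mathbf{X}, \hat{\mathbf{X}}^{(1)}\bigr)$, $\bigl(\mathbf{X}, \hat{\mathbf{X}}^{(2)}\bigr)$, and $\bigl(\mathbf{X}, \hat{\mathbf{X}}^{(12)}\bigr)$ is
jointly typical. Hence, completely analogously to (5.120)–(5.125) we have for
$i = 1, 2, 12$:
\begin{align}
d^{(i)}\bigl(\mathbf{X}, \hat{\mathbf{X}}^{(i)}\bigr) &= \frac{1}{n} \sum_{k=1}^{n} d^{(i)}\bigl(X_k, \hat{X}_k^{(i)}\bigr) \tag{7.40} \\
&= \frac{1}{n} \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}^{(i)}} N\bigl(a, b \,\big|\, \mathbf{X}, \hat{\mathbf{X}}^{(i)}\bigr)\, d^{(i)}(a, b) \tag{7.41} \\
&= \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}^{(i)}} P_{\mathbf{X},\hat{\mathbf{X}}^{(i)}}(a, b)\, d^{(i)}(a, b) \tag{7.42} \\
&\le \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}^{(i)}} \Bigl( Q_{X,\hat{X}^{(i)}}(a, b) + \frac{\epsilon}{|\mathcal{X}|\,|\hat{\mathcal{X}}^{(i)}|} \Bigr) d^{(i)}(a, b) \tag{7.43} \\
&\le \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}^{(i)}} Q_{X,\hat{X}^{(i)}}(a, b)\, d^{(i)}(a, b) + \sum_{a \in \mathcal{X}} \sum_{b \in \hat{\mathcal{X}}^{(i)}} \frac{\epsilon}{|\mathcal{X}|\,|\hat{\mathcal{X}}^{(i)}|}\, d_{\max} \tag{7.44} \\
&= \mathsf{E}\bigl[d^{(i)}\bigl(X, \hat{X}^{(i)}\bigr)\bigr] + \epsilon\, d_{\max}. \tag{7.45}
\end{align}
7.4.3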
Case 2
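By far the most complicated part of this derivation is to find a bound on the
probability of Case 2. The problem is that even if $(w^{(1)}, w^{(2)}) \ne (v^{(1)}, v^{(2)})$,
\[ \Bigl(\hat{\mathbf{X}}^{(1)}\bigl(w^{(1)}\bigr), \hat{\mathbf{X}}^{(2)}\bigl(w^{(2)}\bigr), \hat{\mathbf{X}}^{(12)}\bigl(w^{(1)}, w^{(2)}\bigr)\Bigr) \]
and
\[ \Bigl(\hat{\mathbf{X}}^{(1)}\bigl(v^{(1)}\bigr), \hat{\mathbf{X}}^{(2)}\bigl(v^{(2)}\bigr), \hat{\mathbf{X}}^{(12)}\bigl(v^{(1)}, v^{(2)}\bigr)\Bigr) \]
are not necessarily independent because we might have that $w^{(1)} \ne v^{(1)}$, but
$w^{(2)} = v^{(2)}$, or vice versa. So we need trickery.
Given some $\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q)$, let $\mathcal{F}(w^{(1)}, w^{(2)})$ be the event that $w^{(1)}$ and $w^{(2)}$
give a good choice of codewords:
\[ \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \triangleq \Bigl\{ \Bigl(\hat{\mathbf{X}}^{(1)}\bigl(w^{(1)}\bigr), \hat{\mathbf{X}}^{(2)}\bigl(w^{(2)}\bigr), \hat{\mathbf{X}}^{(12)}\bigl(w^{(1)}, w^{(2)}\bigr)\Bigr) \in \mathcal{A}_\epsilon^{(n)}\bigl(Q_{X,\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}} \,\big|\, \mathbf{x}\bigr) \Bigr\}. \tag{7.46} \]
Note that $w^{(1)}$, $w^{(2)}$ are fixed! The randomness comes from the random generation of the codebook.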
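Now, noting that Case 2 can only occur if $\mathcal{F}$ does not occur for all possible
choices of $w^{(1)}$, $w^{(2)}$, we can write
\begin{align}
\Pr(\text{Case 2}) &= \Pr\Bigl[\, \bigcap_{w^{(1)}, w^{(2)}} \mathcal{F}^{\mathrm{c}}\bigl(w^{(1)}, w^{(2)}\bigr) \Bigr] \tag{7.47} \\
&= \Pr[K = 0] \tag{7.48}
\end{align}
with
\[ K \triangleq \sum_{w^{(1)}, w^{(2)}} I\bigl\{\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \text{ occurs}\bigr\}. \tag{7.49} \]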
(7.50)
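\begin{align}
\Pr(\text{Case 2}) &= \Pr[K = 0] \tag{7.52} \\
&\le \Pr\Bigl[ \bigl|K - \mathsf{E}[K]\bigr| \ge \frac{\mathsf{E}[K]}{2} \Bigr] \tag{7.53} \\
&\le \frac{\mathsf{E}\bigl[(K - \mathsf{E}[K])^2\bigr]}{\bigl(\mathsf{E}[K]/2\bigr)^2} \tag{7.54} \\
&= \frac{4 \operatorname{Var}[K]}{(\mathsf{E}[K])^2} \tag{7.55} \\
&= \frac{4 \bigl(\mathsf{E}[K^2] - (\mathsf{E}[K])^2\bigr)}{(\mathsf{E}[K])^2}, \tag{7.56}
\end{align}
where the inequality (7.54) follows from the Chebyshev Inequality (1.87). So
it remains to derive some bounds on $\mathsf{E}[K]$ and $\operatorname{Var}[K]$.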
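The second-moment argument behind (7.52)–(7.56) can be sanity-checked on a case where everything is known in closed form. The following Python sketch is a toy illustration only: it assumes that $K$ is binomial, i.e., a sum of independent indicators, which is not the situation in the notes (there the indicators are dependent, which is exactly why $\mathsf{E}[K^2]$ needs the case distinction that follows). The function names are hypothetical.

```python
def second_moment_bound(mean: float, variance: float) -> float:
    """Right-hand side 4 Var[K] / (E[K])^2 of the bound on Pr[K = 0]."""
    return 4.0 * variance / mean ** 2


def binomial_check(m: int, p: float):
    """Exact Pr[K = 0] and the bound for K ~ Binomial(m, p) (toy assumption)."""
    prob_zero = (1.0 - p) ** m
    bound = second_moment_bound(m * p, m * p * (1.0 - p))
    return prob_zero, bound


for m in (10, 100, 1000):
    for p in (0.01, 0.1, 0.5):
        prob_zero, bound = binomial_check(m, p)
        # The bound must dominate the exact probability in every case.
        assert prob_zero <= bound
        print(f"m={m:5d} p={p:4.2f}  Pr[K=0]={prob_zero:.3e}  bound={bound:.3e}")
```

The bound becomes useful as soon as $\mathsf{E}[K]$ is large compared with the standard deviation of $K$, which is what the rate conditions derived below guarantee.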
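Firstly, $\mathsf{E}[K]$:
\begin{align}
\mathsf{E}[K] &= \mathsf{E}\Bigl[ \sum_{w^{(1)}, w^{(2)}} I\bigl\{\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \text{ occurs}\bigr\} \Bigr] \tag{7.57} \\
&= \sum_{w^{(1)}, w^{(2)}} \mathsf{E}\Bigl[ I\bigl\{\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \text{ occurs}\bigr\} \Bigr] \tag{7.58} \\
&= \sum_{w^{(1)}, w^{(2)}} \Bigl( 1 \cdot \Pr\bigl[\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr)\bigr] + 0 \cdot \Pr\bigl[\mathcal{F}^{\mathrm{c}}\bigl(w^{(1)}, w^{(2)}\bigr)\bigr] \Bigr) \tag{7.59} \\
&= \sum_{w^{(1)}, w^{(2)}} \Pr\bigl[\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr)\bigr] \tag{7.60} \\
&= \sum_{w^{(1)}, w^{(2)}} \; \sum_{(\hat{\mathbf{x}}^{(1)}, \hat{\mathbf{x}}^{(2)}, \hat{\mathbf{x}}^{(12)}) \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x})} Q^n_{\hat{X}^{(1)}}\bigl(\hat{\mathbf{x}}^{(1)}\bigr)\, Q^n_{\hat{X}^{(2)}}\bigl(\hat{\mathbf{x}}^{(2)}\bigr)\, Q^n_{\hat{X}^{(12)}|\hat{X}^{(1)},\hat{X}^{(2)}}\bigl(\hat{\mathbf{x}}^{(12)} \,\big|\, \hat{\mathbf{x}}^{(1)}, \hat{\mathbf{x}}^{(2)}\bigr). \tag{7.61}
\end{align}
Here we introduce the shorthand $\mathcal{A}_\epsilon^{(n)}(\mathbf{x})$ for $\mathcal{A}_\epsilon^{(n)}\bigl(Q_{X,\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}} \,\big|\, \mathbf{x}\bigr)$, i.e.,
for simplicity we will drop the exact statement of the joint distribution that
is the basis of the typical set.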
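Using twice TA1b and once TB1, we bound (7.61) as follows:
\begin{align}
\mathsf{E}[K] &\ge \sum_{w^{(1)}, w^{(2)}} \; \sum_{(\hat{\mathbf{x}}^{(1)}, \hat{\mathbf{x}}^{(2)}, \hat{\mathbf{x}}^{(12)}) \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x})} \exp\bigl(-n\bigl(H(\hat{X}^{(1)}) + \epsilon\bigr)\bigr) \notag \\
&\qquad \cdot \exp\bigl(-n\bigl(H(\hat{X}^{(2)}) + \epsilon\bigr)\bigr) \exp\bigl(-n\bigl(H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) + \epsilon\bigr)\bigr) \tag{7.62} \\
&= \sum_{w^{(1)}, w^{(2)}} \Bigl| \mathcal{A}_\epsilon^{(n)}\bigl(Q_{X,\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}} \,\big|\, \mathbf{x}\bigr) \Bigr| \notag \\
&\qquad \cdot \exp\bigl(-n\bigl(H(\hat{X}^{(1)}) + H(\hat{X}^{(2)}) + H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) + \epsilon\bigr)\bigr) \tag{7.63} \\
&\ge \sum_{w^{(1)}, w^{(2)}} \exp\bigl(n\bigl(H(\hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)} | X) - \epsilon\bigr)\bigr) \notag \\
&\qquad \cdot \exp\bigl(-n\bigl(H(\hat{X}^{(1)}) + H(\hat{X}^{(2)}) + H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) + \epsilon\bigr)\bigr) \tag{7.64} \\
&= \exp\bigl(n\bigl(R^{(1)} + R^{(2)}\bigr)\bigr) \notag \\
&\qquad \cdot \exp\Bigl(n\Bigl(H(\hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)} | X) - H(\hat{X}^{(1)}) - H(\hat{X}^{(2)} | \hat{X}^{(1)}) \notag \\
&\qquad\qquad + H(\hat{X}^{(2)} | \hat{X}^{(1)}) - H(\hat{X}^{(2)}) - H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) - \epsilon\Bigr)\Bigr) \tag{7.65} \\
&= \exp\Bigl(n\Bigl(R^{(1)} + R^{(2)} - I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - I(\hat{X}^{(1)}; \hat{X}^{(2)}) - \epsilon\Bigr)\Bigr). \tag{7.66}
\end{align}
Here, the inequality (7.64) follows from TB2, and in the subsequent equality
(7.65) we used the fact that there are $e^{nR^{(i)}}$ choices for $w^{(i)}$. Note that we
have stopped keeping track of the different $\epsilon$s and $\delta$s, but simply summarize
them all together.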
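The entropy bookkeeping in the last two steps rests on the identity $H(\hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)} | X) - H(\hat{X}^{(1)}) - H(\hat{X}^{(2)}) - H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) = -I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - I(\hat{X}^{(1)}; \hat{X}^{(2)})$. The following Python sketch checks it numerically for a randomly drawn joint PMF on four binary variables; the PMF, the coordinate ordering, and the helper names are assumptions made for illustration only.

```python
import math
import random
from itertools import product


def entropy(pmf, keep):
    """Entropy (in nats) of the marginal of `pmf` on the coordinates in `keep`."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marginal.values() if p > 0.0)


def cond_entropy(pmf, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return entropy(pmf, target + given) - entropy(pmf, given)


def mutual_info(pmf, a, b):
    """I(a; b) = H(a) + H(b) - H(a, b)."""
    return entropy(pmf, a) + entropy(pmf, b) - entropy(pmf, a + b)


# Coordinates: 0 = X, 1 = Xhat1, 2 = Xhat2, 3 = Xhat12 (all binary, toy PMF).
rng = random.Random(1)
weights = {o: rng.random() for o in product((0, 1), repeat=4)}
total = sum(weights.values())
pmf = {o: w / total for o, w in weights.items()}

lhs = (cond_entropy(pmf, (1, 2, 3), (0,))
       - entropy(pmf, (1,)) - entropy(pmf, (2,))
       - cond_entropy(pmf, (3,), (1, 2)))
rhs = -mutual_info(pmf, (0,), (1, 2, 3)) - mutual_info(pmf, (1,), (2,))
print(lhs, rhs)
```

Since the identity is purely algebraic, the two printed numbers agree up to floating-point rounding for any joint PMF.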
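Hence, we get
\[ (\mathsf{E}[K])^2 \ge \exp\Bigl(n\Bigl(2R^{(1)} + 2R^{(2)} - 2 I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - 2 I(\hat{X}^{(1)}; \hat{X}^{(2)}) - \epsilon\Bigr)\Bigr). \tag{7.67} \]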
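Secondly, we tackle $\mathsf{E}[K^2]$:
\begin{align}
\mathsf{E}[K^2] &= \mathsf{E}\Bigl[ \sum_{w^{(1)}, w^{(2)}} I\bigl\{\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \text{ occurs}\bigr\} \sum_{v^{(1)}, v^{(2)}} I\bigl\{\mathcal{F}\bigl(v^{(1)}, v^{(2)}\bigr) \text{ occurs}\bigr\} \Bigr] \tag{7.68} \\
&= \sum_{w^{(1)}, w^{(2)}} \sum_{v^{(1)}, v^{(2)}} \mathsf{E}\Bigl[ I\bigl\{\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \text{ occurs}\bigr\} \cdot I\bigl\{\mathcal{F}\bigl(v^{(1)}, v^{(2)}\bigr) \text{ occurs}\bigr\} \Bigr] \tag{7.69} \\
&= \sum_{w^{(1)}, w^{(2)}} \sum_{v^{(1)}, v^{(2)}} \Pr\Bigl[ \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \cap \mathcal{F}\bigl(v^{(1)}, v^{(2)}\bigr) \Bigr] \tag{7.70} \\
&= \sum_{\mathcal{Q} \subseteq \{1,2\}} \; \sum_{\substack{w^{(1)}, w^{(2)}, v^{(1)}, v^{(2)} \\ \text{with overlap } \mathcal{Q}}} \Pr\Bigl[ \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \cap \mathcal{F}\bigl(v^{(1)}, v^{(2)}\bigr) \Bigr], \tag{7.71}
\end{align}
where in the last step we distinguish four cases of whether $w^{(i)} = v^{(i)}$ or not.
These cases are described by the four possible subsets of $\{1, 2\}$: $\mathcal{Q} = \{1, 2\}$, $\mathcal{Q} = \{1\}$, $\mathcal{Q} = \{2\}$, and $\mathcal{Q} = \emptyset$.
Case $\mathcal{Q} = \{1, 2\}$: In this case we have $w^{(1)} = v^{(1)}$ and $w^{(2)} = v^{(2)}$, so that
\begin{align}
\sum_{w^{(1)}, w^{(2)}} \Pr\Bigl[ \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \cap \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \Bigr] &= \sum_{w^{(1)}, w^{(2)}} \Pr\bigl[\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr)\bigr] \tag{7.72} \\
&\le \sum_{w^{(1)}, w^{(2)}} \exp\Bigl(-n\Bigl(I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) + I(\hat{X}^{(1)}; \hat{X}^{(2)}) - \epsilon\Bigr)\Bigr) \tag{7.73} \\
&= \exp\Bigl(n\Bigl(R^{(1)} + R^{(2)} - I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - I(\hat{X}^{(1)}; \hat{X}^{(2)}) + \epsilon\Bigr)\Bigr). \tag{7.74}
\end{align}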
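Case $\mathcal{Q} = \{1\}$: In this case we have $w^{(1)} = v^{(1)}$, but $w^{(2)} \ne v^{(2)}$, i.e., we
have a partial overlap that is more difficult to handle properly. We use the
following lemma.
Lemma 7.4 (Chain Rule for Typical Sets). The event
\[ \bigl\{ (\mathbf{X}, \mathbf{Y}) \in \mathcal{A}_\epsilon^{(n)}(Q_{X,Y}) \bigr\} \tag{7.75} \]
is equivalent to the event
\[ \bigl\{ \mathbf{X} \in \mathcal{A}_\epsilon^{(n)}(Q_X) \bigr\} \cap \bigl\{ \mathbf{Y} \in \mathcal{A}_\epsilon^{(n)}(Q_{X,Y} | \mathbf{X}) \bigr\}. \tag{7.76} \]
Proof: This lemma follows directly from the definitions of the typical and
the conditionally typical sets: If $(\mathbf{x}, \mathbf{y}) \in \mathcal{A}_\epsilon^{(n)}(Q_{X,Y})$, then we know from
Lemma 4.6 that $\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q_X)$, and from Definition 4.10 we know that $\mathbf{y} \in \mathcal{A}_\epsilon^{(n)}(Q_{X,Y} | \mathbf{x})$. On the other hand, it directly follows from Definition 4.10
that if $\mathbf{x} \in \mathcal{A}_\epsilon^{(n)}(Q_X)$ and $\mathbf{y} \in \mathcal{A}_\epsilon^{(n)}(Q_{X,Y} | \mathbf{x})$, then $(\mathbf{x}, \mathbf{y}) \in \mathcal{A}_\epsilon^{(n)}(Q_{X,Y})$.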
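Lemma 7.4 can be checked exhaustively for a small blocklength. The following Python sketch makes two assumptions that are not confirmed by the text above: strong typicality is taken to mean $|P(a) - Q(a)| < \epsilon / |\text{alphabet}|$ for every letter (with $P(a) = 0$ whenever $Q(a) = 0$), and the conditionally typical set $\mathcal{A}_\epsilon^{(n)}(Q_{X,Y} | \mathbf{x})$ is taken to be the set of all $\mathbf{y}$ such that $(\mathbf{x}, \mathbf{y})$ is jointly typical. The PMF and the parameters are hypothetical.

```python
from collections import Counter
from itertools import product


def typical(seq, Q, eps):
    """Assumed strong typicality of `seq` with respect to the PMF `Q`."""
    n = len(seq)
    counts = Counter(seq)
    tol = eps / len(Q)
    return all(
        not (q == 0.0 and counts[a] > 0) and abs(counts[a] / n - q) < tol
        for a, q in Q.items()
    )


# Toy joint PMF on {0,1} x {0,1} and its X-marginal (illustrative values).
Q_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Q_X = {a: sum(q for (a2, _), q in Q_XY.items() if a2 == a) for a in (0, 1)}
eps, n = 0.5, 6


def jointly_typical(x, y):
    return typical(list(zip(x, y)), Q_XY, eps)


def cond_typical(y, x):
    # Assumed definition: y is conditionally typical given x
    # if and only if the pair (x, y) is jointly typical.
    return jointly_typical(x, y)


mismatches = 0
num_typical = 0
for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        left = jointly_typical(x, y)                          # event (7.75)
        right = typical(x, Q_X, eps) and cond_typical(y, x)   # event (7.76)
        num_typical += left
        mismatches += left != right
print(num_typical, mismatches)
```

Under these assumptions the only nontrivial direction is that joint typicality implies typicality of $\mathbf{x}$ alone, which is what the loop effectively tests.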
(n)
Using the shorthands A (x) for
A(n)
QX,X (1) ,X (2) ,X (12) x
or A(n) QX,X (1) x ,
(n)
respectively,1 and A
(1) for
x, x
(1) ,
A(n)
QX,X (1) ,X (2) ,X (12) x, x
1
It should be clear from the context, which distribution needs to be plugged in. We
will keep using this type of shorthands for remainder of these notes whenever the context is
clear.
166
we get
Pr F w(1) , w(2) F w(1) , v (2)
n
o
(2) (2) (12) (1) (2)
(1) w(1) , X
= Pr
X
w
,X
w ,w
A(n)
(x)
n
o
(1) w(1) , X
(2) v (2) , X
(12) w(1) , v (2) A(n) (x)
X
(7.77)
n
o
(1) w(1) A(n) (x)
= Pr X
n
o
(1) (w(1) )
(2) w(2) , X
(12) w(1) , w(2) A(n) x, X
X
n
o
(1) (w(1) )
(2) v (2) , X
(12) w(1) , v (2) A(n) x, X
(7.78)
X
= Pr(Ea Eb Ec )
(7.79)
(7.80)
where in (7.78) we have used Lemma 7.4; (7.79) must be understood as the
definitions of the events $\mathcal{E}_a$, $\mathcal{E}_b$, and $\mathcal{E}_c$; and where (7.80) follows from the
chain rule.
Now note that conditionally on $\mathcal{E}_a$, the events $\mathcal{E}_b$ and $\mathcal{E}_c$ are independent
of each other, i.e., in (7.80) we have the two terms $\Pr(\mathcal{E}_b | \mathcal{E}_a)$ and $\Pr(\mathcal{E}_c | \mathcal{E}_a)$
that are basically the same. Let's investigate them more closely. We have
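\begin{align}
\Pr(\mathcal{E}_b | \mathcal{E}_a) &= \sum_{\hat{\mathbf{x}}^{(1)} \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x})} \Pr\bigl[\hat{\mathbf{X}}^{(1)}(w^{(1)}) = \hat{\mathbf{x}}^{(1)} \,\big|\, \mathcal{E}_a\bigr] \notag \\
&\qquad \cdot \Pr\Bigl[ \bigl(\hat{\mathbf{X}}^{(2)}(w^{(2)}), \hat{\mathbf{X}}^{(12)}(w^{(1)}, w^{(2)})\bigr) \in \mathcal{A}_\epsilon^{(n)}\bigl(\mathbf{x}, \hat{\mathbf{x}}^{(1)}\bigr) \,\Big|\, \hat{\mathbf{X}}^{(1)}(w^{(1)}) = \hat{\mathbf{x}}^{(1)} \Bigr] \tag{7.81} \\
&\le \underbrace{\sum_{\hat{\mathbf{x}}^{(1)} \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x})} \Pr\bigl[\hat{\mathbf{X}}^{(1)}(w^{(1)}) = \hat{\mathbf{x}}^{(1)} \,\big|\, \mathcal{E}_a\bigr]}_{=\,1} \notag \\
&\qquad \cdot \max_{\hat{\mathbf{x}}^{(1)} \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x})} \Pr\Bigl[ \bigl(\hat{\mathbf{X}}^{(2)}(w^{(2)}), \hat{\mathbf{X}}^{(12)}(w^{(1)}, w^{(2)})\bigr) \in \mathcal{A}_\epsilon^{(n)}\bigl(\mathbf{x}, \hat{\mathbf{x}}^{(1)}\bigr) \,\Big|\, \hat{\mathbf{X}}^{(1)}(w^{(1)}) = \hat{\mathbf{x}}^{(1)} \Bigr] \tag{7.82} \\
&= \max_{\hat{\mathbf{x}}^{(1)} \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x})} \Pr\Bigl[ \bigl(\hat{\mathbf{X}}^{(2)}(w^{(2)}), \hat{\mathbf{X}}^{(12)}(w^{(1)}, w^{(2)})\bigr) \in \mathcal{A}_\epsilon^{(n)}\bigl(\mathbf{x}, \hat{\mathbf{x}}^{(1)}\bigr) \,\Big|\, \hat{\mathbf{X}}^{(1)}(w^{(1)}) = \hat{\mathbf{x}}^{(1)} \Bigr] \tag{7.83} \\
&= \max_{\hat{\mathbf{x}}^{(1)} \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x})} \; \sum_{(\hat{\mathbf{x}}^{(2)}, \hat{\mathbf{x}}^{(12)}) \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x}, \hat{\mathbf{x}}^{(1)})} Q^n_{\hat{X}^{(2)}}\bigl(\hat{\mathbf{x}}^{(2)}\bigr)\, Q^n_{\hat{X}^{(12)}|\hat{X}^{(1)},\hat{X}^{(2)}}\bigl(\hat{\mathbf{x}}^{(12)} \,\big|\, \hat{\mathbf{x}}^{(1)}, \hat{\mathbf{x}}^{(2)}\bigr) \tag{7.84} \\
&\le \max_{\hat{\mathbf{x}}^{(1)} \in \mathcal{A}_\epsilon^{(n)}(\mathbf{x})} \Bigl| \mathcal{A}_\epsilon^{(n)}\bigl(Q_{X,\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}} \,\big|\, \mathbf{x}, \hat{\mathbf{x}}^{(1)}\bigr) \Bigr| \notag \\
&\qquad \cdot \exp\bigl(-n\bigl(H(\hat{X}^{(2)}) - \epsilon\bigr)\bigr) \exp\bigl(-n\bigl(H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) - \epsilon\bigr)\bigr) \tag{7.85} \\
&\le \exp\bigl(n\bigl(H(\hat{X}^{(2)}, \hat{X}^{(12)} | X, \hat{X}^{(1)}) + \epsilon\bigr)\bigr) \notag \\
&\qquad \cdot \exp\bigl(-n\bigl(H(\hat{X}^{(2)}) + H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) - \epsilon\bigr)\bigr) \tag{7.86} \\
&= \exp\Bigl(n\Bigl(H(\hat{X}^{(2)}, \hat{X}^{(12)} | X, \hat{X}^{(1)}) - H(\hat{X}^{(2)}) - H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) + \epsilon\Bigr)\Bigr). \tag{7.87}
\end{align}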
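The same bound holds for $\Pr(\mathcal{E}_c | \mathcal{E}_a)$. Since, analogously, $\Pr(\mathcal{E}_a) \le \exp\bigl(-n\bigl(I(X; \hat{X}^{(1)}) - \epsilon\bigr)\bigr)$, we obtain from (7.80)
\begin{align}
&\Pr\Bigl[ \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \cap \mathcal{F}\bigl(w^{(1)}, v^{(2)}\bigr) \Bigr] \notag \\
&\quad \le \exp\bigl(-n\bigl(I(X; \hat{X}^{(1)}) - \epsilon\bigr)\bigr) \cdot \exp\Bigl(2n\Bigl(H(\hat{X}^{(2)}, \hat{X}^{(12)} | X, \hat{X}^{(1)}) - H(\hat{X}^{(2)}) - H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) + \epsilon\Bigr)\Bigr) \tag{7.88} \\
&\quad = \exp\Bigl(-n\Bigl(I(X; \hat{X}^{(1)}) - 2 H(\hat{X}^{(2)}, \hat{X}^{(12)} | X, \hat{X}^{(1)}) + 2 H(\hat{X}^{(2)}) + 2 H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) - \epsilon\Bigr)\Bigr) \tag{7.89} \\
&\quad = \exp\Bigl(-n\Bigl(I(X; \hat{X}^{(1)}) - 2 H(\hat{X}^{(2)}, \hat{X}^{(12)} | X, \hat{X}^{(1)}) - 2 H(\hat{X}^{(1)} | X) + 2 H(\hat{X}^{(2)}) + 2 H(\hat{X}^{(12)} | \hat{X}^{(1)}, \hat{X}^{(2)}) \notag \\
&\qquad\qquad + 2 H(\hat{X}^{(1)}, \hat{X}^{(2)}) + 2 H(\hat{X}^{(1)} | X) - 2 H(\hat{X}^{(1)}, \hat{X}^{(2)}) - \epsilon\Bigr)\Bigr) \tag{7.90} \\
&\quad = \exp\Bigl(-n\Bigl(I(X; \hat{X}^{(1)}) - 2 H(\hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)} | X) + 2 H(\hat{X}^{(2)}) + 2 H(\hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) \notag \\
&\qquad\qquad + 2 H(\hat{X}^{(1)} | X) - 2 H(\hat{X}^{(1)}, \hat{X}^{(2)}) - \epsilon\Bigr)\Bigr) \tag{7.91} \\
&\quad = \exp\Bigl(-n\Bigl(I(X; \hat{X}^{(1)}) + 2 I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) + 2 H(\hat{X}^{(1)} | X) \notag \\
&\qquad\qquad + 2 H(\hat{X}^{(2)}) + 2 H(\hat{X}^{(1)}) - 2 H(\hat{X}^{(1)}, \hat{X}^{(2)}) - 2 H(\hat{X}^{(1)}) - \epsilon\Bigr)\Bigr) \tag{7.92} \\
&\quad = \exp\Bigl(-n\Bigl(2 I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) + 2 I(\hat{X}^{(1)}; \hat{X}^{(2)}) - I(X; \hat{X}^{(1)}) - \epsilon\Bigr)\Bigr). \tag{7.93}
\end{align}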
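Hence, we get the following bound:
\begin{align}
&\sum_{\substack{w^{(1)}, w^{(2)} \\ v^{(2)} \ne w^{(2)}}} \Pr\Bigl[ \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \cap \mathcal{F}\bigl(w^{(1)}, v^{(2)}\bigr) \Bigr] \notag \\
&\quad \le e^{nR^{(1)}} e^{nR^{(2)}} \bigl(e^{nR^{(2)}} - 1\bigr) \exp\Bigl(-n\Bigl(2 I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) + 2 I(\hat{X}^{(1)}; \hat{X}^{(2)}) - I(X; \hat{X}^{(1)}) - \epsilon\Bigr)\Bigr) \tag{7.94} \\
&\quad \le \exp\Bigl(n\Bigl(R^{(1)} + 2R^{(2)} - 2 I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - 2 I(\hat{X}^{(1)}; \hat{X}^{(2)}) + I(X; \hat{X}^{(1)}) + \epsilon\Bigr)\Bigr). \tag{7.95}
\end{align}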
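Case $\mathcal{Q} = \{2\}$: This is the same as the case $\mathcal{Q} = \{1\}$, but with exchanged
roles of $\hat{X}^{(1)}$ and $\hat{X}^{(2)}$:
\begin{align}
&\sum_{\substack{w^{(1)}, w^{(2)} \\ v^{(1)} \ne w^{(1)}}} \Pr\Bigl[ \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \cap \mathcal{F}\bigl(v^{(1)}, w^{(2)}\bigr) \Bigr] \notag \\
&\quad \le \exp\Bigl(n\Bigl(2R^{(1)} + R^{(2)} - 2 I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - 2 I(\hat{X}^{(1)}; \hat{X}^{(2)}) + I(X; \hat{X}^{(2)}) + \epsilon\Bigr)\Bigr). \tag{7.96}
\end{align}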
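Case $\mathcal{Q} = \emptyset$: In this case we have $w^{(1)} \ne v^{(1)}$ and $w^{(2)} \ne v^{(2)}$, so that the two events are independent:
\begin{align}
&\sum_{w^{(1)}, w^{(2)}} \; \sum_{\substack{v^{(1)} \ne w^{(1)} \\ v^{(2)} \ne w^{(2)}}} \Pr\Bigl[ \mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr) \cap \mathcal{F}\bigl(v^{(1)}, v^{(2)}\bigr) \Bigr] \tag{7.97} \\
&\quad = \sum_{w^{(1)}, w^{(2)}} \Pr\bigl[\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr)\bigr] \sum_{\substack{v^{(1)} \ne w^{(1)} \\ v^{(2)} \ne w^{(2)}}} \Pr\bigl[\mathcal{F}\bigl(v^{(1)}, v^{(2)}\bigr)\bigr] \tag{7.98} \\
&\quad \le \sum_{w^{(1)}, w^{(2)}} \Pr\bigl[\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr)\bigr] \sum_{v^{(1)}, v^{(2)}} \Pr\bigl[\mathcal{F}\bigl(v^{(1)}, v^{(2)}\bigr)\bigr] \tag{7.99} \\
&\quad = \Bigl( \sum_{w^{(1)}, w^{(2)}} \Pr\bigl[\mathcal{F}\bigl(w^{(1)}, w^{(2)}\bigr)\bigr] \Bigr)^2 \tag{7.100} \\
&\quad = (\mathsf{E}[K])^2. \tag{7.101}
\end{align}
Here, in (7.99) we increase the number of terms in the second sum; and in
(7.101) we use (7.60).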
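Hence, plugging these four bounds (7.74), (7.95), (7.96), and (7.101) into
(7.71), we get
\begin{align}
\mathsf{E}[K^2] - (\mathsf{E}[K])^2 &\le \exp\Bigl(n\Bigl(R^{(1)} + R^{(2)} - I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - I(\hat{X}^{(1)}; \hat{X}^{(2)}) + \epsilon\Bigr)\Bigr) \notag \\
&\quad + \exp\Bigl(n\Bigl(R^{(1)} + 2R^{(2)} - 2 I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - 2 I(\hat{X}^{(1)}; \hat{X}^{(2)}) + I(X; \hat{X}^{(1)}) + \epsilon\Bigr)\Bigr) \notag \\
&\quad + \exp\Bigl(n\Bigl(2R^{(1)} + R^{(2)} - 2 I(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}) - 2 I(\hat{X}^{(1)}; \hat{X}^{(2)}) + I(X; \hat{X}^{(2)}) + \epsilon\Bigr)\Bigr). \tag{7.102}
\end{align}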
(7.103)
(7.104)
(E[K])2
(1) ; X
(2) +
(1) , X
(2) , X
(12) + I X
4 exp n R(1) R(2) + I X; X
(1) +
+ 4 exp n R(1) + I X; X
(2) +
+ 4 exp n R(2) + I X; X
(7.105)
, 2 .
(7.106)
(1) + ,
R(1) > I X; X
(7.107)
(2) + ,
R(2) > I X; X
(7.108)
7.4.4
Putting all three cases from Sections 7.4.1, 7.4.2, and 7.4.3 back into (7.38)
now shows that
(i) dmax t (n, , X ) + dmax 2 + E d(i) X, X
(i) + dmax
E d(i) X, X
(7.110)
(7.111)
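and the constraints (7.14)–(7.16) are satisfied if $Q_{\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}|X}$ is such that
\begin{align}
\mathsf{E}\bigl[d^{(1)}\bigl(X, \hat{X}^{(1)}\bigr)\bigr] &\le D^{(1)} - \epsilon\, d_{\max}, \tag{7.112} \\
\mathsf{E}\bigl[d^{(2)}\bigl(X, \hat{X}^{(2)}\bigr)\bigr] &\le D^{(2)} - \epsilon\, d_{\max}, \tag{7.113} \\
\mathsf{E}\bigl[d^{(12)}\bigl(X, \hat{X}^{(12)}\bigr)\bigr] &\le D^{(12)} - \epsilon\, d_{\max}. \tag{7.114}
\end{align}
We have shown that any multiple description rate distortion quintuple is
achievable for which a distribution $Q_{\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}|X}$ can be found such that
(7.107)–(7.109) and (7.112)–(7.114) are satisfied. Note that since $\epsilon$ is arbitrary,
we can omit the $\epsilon$ terms in (7.107)–(7.109) and the terms $\epsilon\, d_{\max}$ in (7.112)–(7.114).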
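(7.116)
where
\begin{align}
\mathcal{R}\bigl(Q_{X,\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}}\bigr) \triangleq \Bigl\{ \bigl(R^{(1)}, R^{(2)}, D^{(1)}, D^{(2)}, D^{(12)}\bigr) \colon\;
& R^{(1)} \ge I\bigl(X; \hat{X}^{(1)}\bigr), \notag \\
& R^{(2)} \ge I\bigl(X; \hat{X}^{(2)}\bigr), \notag \\
& R^{(1)} + R^{(2)} \ge I\bigl(X; \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}\bigr) + I\bigl(\hat{X}^{(1)}; \hat{X}^{(2)}\bigr), \notag \\
& D^{(1)} \ge \mathsf{E}\bigl[d^{(1)}\bigl(X, \hat{X}^{(1)}\bigr)\bigr], \notag \\
& D^{(2)} \ge \mathsf{E}\bigl[d^{(2)}\bigl(X, \hat{X}^{(2)}\bigr)\bigr], \notag \\
& D^{(12)} \ge \mathsf{E}\bigl[d^{(12)}\bigl(X, \hat{X}^{(12)}\bigr)\bigr] \Bigr\}. \tag{7.117}
\end{align}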
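Whether a given quintuple satisfies the six conditions of this region for one particular joint PMF is a finite computation. The following Python sketch is a hedged illustration: it assumes finite alphabets, rates measured in nats, and, for simplicity, one and the same distortion function `d` for all three reconstructions; the function names and the toy PMF (a uniform bit that all three reconstructions copy perfectly) are hypothetical.

```python
import math


def entropy(pmf, keep):
    """Entropy (in nats) of the marginal of `pmf` on the coordinates in `keep`."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marginal.values() if p > 0.0)


def mutual_info(pmf, a, b):
    """I(a; b) = H(a) + H(b) - H(a, b)."""
    return entropy(pmf, a) + entropy(pmf, b) - entropy(pmf, a + b)


def expected_distortion(pmf, coord, d):
    """E[d(X, Xhat)] where X is coordinate 0 and Xhat is coordinate `coord`."""
    return sum(p * d(o[0], o[coord]) for o, p in pmf.items())


def in_region(pmf, d, R1, R2, D1, D2, D12):
    """Check the six conditions for a PMF over tuples (x, xhat1, xhat2, xhat12)."""
    X, X1, X2, X12 = (0,), (1,), (2,), (3,)
    return (
        R1 >= mutual_info(pmf, X, X1)
        and R2 >= mutual_info(pmf, X, X2)
        and R1 + R2 >= mutual_info(pmf, X, X1 + X2 + X12) + mutual_info(pmf, X1, X2)
        and D1 >= expected_distortion(pmf, 1, d)
        and D2 >= expected_distortion(pmf, 2, d)
        and D12 >= expected_distortion(pmf, 3, d)
    )


def hamming(a, b):
    return 0.0 if a == b else 1.0


# Toy PMF: X is a uniform bit and every reconstruction equals X.
pmf = {(0, 0, 0, 0): 0.5, (1, 1, 1, 1): 0.5}
print(in_region(pmf, hamming, 0.7, 0.7, 0.0, 0.0, 0.0))
print(in_region(pmf, hamming, 0.7, 0.6, 0.0, 0.0, 0.0))
```

In this toy case both descriptions are identical, so the penalty term $I(\hat{X}^{(1)}; \hat{X}^{(2)})$ equals $\log 2$ and the sum rate must be at least $2 \log 2$ nats, which illustrates the remark that two individually good descriptions contribute little new information when combined.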
7.5
There is a nice trick by which we can actually enlarge the achievable rate distortion
region derived in Sections 7.3 and 7.4. Assume for the moment that there
exists a third encoder whose index always safely arrives at the decoder, i.e.,
the third encoder sees a noise-free channel.$^2$ We assign the rate $R^{(0)}$ to this
third encoder and repeat the derivation of our random coding scheme.
$^2$ Of course, we do not have such a noise-free channel, but we can simulate it, see the
discussion in Section 7.5.3.
7.5.1
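1: Setup: We choose a PMF $Q_{\hat{X}^{(0)},\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}|X}$ and then compute
$Q_{\hat{X}^{(1)}|\hat{X}^{(0)}}$, $Q_{\hat{X}^{(2)}|\hat{X}^{(0)}}$, and $Q_{\hat{X}^{(12)}|\hat{X}^{(0)},\hat{X}^{(1)},\hat{X}^{(2)}}$ as marginal distributions
of $Q \cdot Q_{\hat{X}^{(0)},\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}|X}$.
Then we fix some rates $R^{(0)}$, $R^{(1)}$, and $R^{(2)}$ and some blocklength $n$.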
(0)
X
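For every $w^{(0)}$, we independently generate $e^{nR^{(1)}}$ length-$n$ codewords
\[ \hat{\mathbf{X}}^{(1)}\bigl(w^{(0)}, w^{(1)}\bigr) \sim Q^n_{\hat{X}^{(1)}|\hat{X}^{(0)}}\Bigl(\cdot \,\Big|\, \hat{\mathbf{X}}^{(0)}\bigl(w^{(0)}\bigr)\Bigr), \quad w^{(1)} = 1, \ldots, e^{nR^{(1)}}, \tag{7.119} \]
and $e^{nR^{(2)}}$ length-$n$ codewords
\[ \hat{\mathbf{X}}^{(2)}\bigl(w^{(0)}, w^{(2)}\bigr) \sim Q^n_{\hat{X}^{(2)}|\hat{X}^{(0)}}\Bigl(\cdot \,\Big|\, \hat{\mathbf{X}}^{(0)}\bigl(w^{(0)}\bigr)\Bigr), \quad w^{(2)} = 1, \ldots, e^{nR^{(2)}}. \tag{7.120} \]
(Note that this means that we have $e^{n(R^{(0)} + R^{(1)})}$ codewords $\hat{\mathbf{X}}^{(1)}$ and
$e^{n(R^{(0)} + R^{(2)})}$ codewords $\hat{\mathbf{X}}^{(2)}$.)
Finally, for each triple $(w^{(0)}, w^{(1)}, w^{(2)})$, we generate one length-$n$ codeword
\[ \hat{\mathbf{X}}^{(12)}\bigl(w^{(0)}, w^{(1)}, w^{(2)}\bigr) \sim Q^n_{\hat{X}^{(12)}|\hat{X}^{(0)},\hat{X}^{(1)},\hat{X}^{(2)}}\Bigl(\cdot \,\Big|\, \hat{\mathbf{X}}^{(0)}\bigl(w^{(0)}\bigr), \hat{\mathbf{X}}^{(1)}\bigl(w^{(0)}, w^{(1)}\bigr), \hat{\mathbf{X}}^{(2)}\bigl(w^{(0)}, w^{(2)}\bigr)\Bigr). \tag{7.121} \]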
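3: Encoder Design: For a given source sequence $\mathbf{x}$, the encoders try to
find a triple $(w^{(0)}, w^{(1)}, w^{(2)})$ such that
\begin{align}
&\Bigl(\mathbf{x}, \hat{\mathbf{X}}^{(0)}\bigl(w^{(0)}\bigr), \hat{\mathbf{X}}^{(1)}\bigl(w^{(0)}, w^{(1)}\bigr), \hat{\mathbf{X}}^{(2)}\bigl(w^{(0)}, w^{(2)}\bigr), \hat{\mathbf{X}}^{(12)}\bigl(w^{(0)}, w^{(1)}, w^{(2)}\bigr)\Bigr) \notag \\
&\qquad \in \mathcal{A}_\epsilon^{(n)}\bigl(Q_{X,\hat{X}^{(0)},\hat{X}^{(1)},\hat{X}^{(2)},\hat{X}^{(12)}}\bigr). \tag{7.122}
\end{align}
If they find several possible choices, they pick one. If they find none,
they choose $w^{(0)} = w^{(1)} = w^{(2)} = 1$.
The first encoder $\phi^{(1)}$ puts out $w^{(1)}$, the second encoder $\phi^{(2)}$ puts out
$w^{(2)}$, and the third encoder $\phi^{(0)}$ puts out $w^{(0)}$.
4: Decoder Design: The decoder still consists of only three different decoding functions $(\psi^{(1)}, \psi^{(2)}, \psi^{(12)})$, because we know that $w^{(0)}$ does arrive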
(w(0) , w(2) )
is received,
(7.123)
(7.124)
(7.125)
(7.126)
2. The source sequence is typical, but there exists no codeword quadruple that is jointly typical with the source sequence:
\[ \mathbf{X} \in \mathcal{A}_\epsilon^{(n)}(Q), \quad \nexists\, \bigl(w^{(0)}, w^{(1)}, w^{(2)}\bigr) \colon \text{(7.122) is satisfied}. \tag{7.127} \]
3. The source sequence is typical and there exists a codeword quadruple that is jointly typical with the source sequence:
\[ \mathbf{X} \in \mathcal{A}_\epsilon^{(n)}(Q), \quad \exists\, \bigl(w^{(0)}, w^{(1)}, w^{(2)}\bigr) \colon \text{(7.122) is satisfied}. \tag{7.128} \]
The analysis of the first and third case is identical to the analysis shown
in Section 7.4. We only need to have a closer look at $\Pr(\text{Case 2})$.
7.5.2
Analysis of Case 2
We define
n (0) (0) (1) (0) (1) (2) (0) (2)
Pr(Case 2)
4 E K 2 (E[K])2
(E[K])2
(7.130)
with
X
E[K] =
Pr F w(0) , w(1) , w(2)
A
(7.131)
(0)
(1) x
(0) QnX (1) X (0) x
QnX (0) x
(x)
(2) (0)
(7.132)
(n)
A
QX,X (0) ,X (1) ,X (2) ,X (12) x
(0)
(0)
(2) X
(1) X
(0) + H X
+H X
exp n H X
(0) (1) (2)
(12) X
,X
,X
+H X
+
exp n R(0) + R(1) + R(2)
(0) , X
(1) , X
(2) , X
(12) X
exp n H X
(0)
(0)
(0) + H X
(1) X
(2) X
exp n H X
+H X
(0) (1) (2)
,X
,X
(12) X
+
+H X
(0) , X
(1) , X
(2) , X
(12)
= exp n R(0) + R(1) + R(2) I X; X
(0)
(1) ; X
(2) X
I X
,
and with
E K2 =
(7.133)
(7.134)
(7.135)
Pr F w(0) , w(1) , w(2) F v (0) , v (1) , v (2) .
(7.136)
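Before we start with the case distinction according to the different values
of $\mathcal{Q}$, note that if $w^{(0)} \ne v^{(0)}$ then the two events $\mathcal{F}\bigl(w^{(0)}, w^{(1)}, w^{(2)}\bigr)$ and
$\mathcal{F}\bigl(v^{(0)}, v^{(1)}, v^{(2)}\bigr)$ are independent because $w^{(0)}$ is a counter that was used in the
generation of all codewords simultaneously. Hence, the cases $\mathcal{Q} = \{1, 2\}$,
$\mathcal{Q} = \{1\}$, $\mathcal{Q} = \{2\}$, and $\mathcal{Q} = \emptyset$ can be treated jointly.
Cases $\mathcal{Q} = \{1, 2\}$, $\mathcal{Q} = \{1\}$, $\mathcal{Q} = \{2\}$, and $\mathcal{Q} = \emptyset$: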
X
Pr F w(0) , w(1) , w(2) F v (0) , v (1) , v (2)
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with no overlap in w(0)
X
w(0) ,w(1) ,w(2)
Pr F w(0) , w(1) , w(2)
X
v (0) 6=w(0) ,
v (1) ,v (2)
(7.137)
(7.138)
= (E[K])2 .
(7.139)
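Case $\mathcal{Q} = \{0, 1, 2\}$: Using a derivation that is (apart from the $\epsilon$s and $\delta$s)
identical to (7.131)–(7.135) we get
\begin{align}
&\sum_{\substack{w^{(0)}, w^{(1)}, w^{(2)}, \\ v^{(0)}, v^{(1)}, v^{(2)} \\ \text{with overlap } \{0,1,2\}}} \Pr\Bigl[ \mathcal{F}\bigl(w^{(0)}, w^{(1)}, w^{(2)}\bigr) \cap \mathcal{F}\bigl(v^{(0)}, v^{(1)}, v^{(2)}\bigr) \Bigr] \notag \\
&\quad = \sum_{w^{(0)}, w^{(1)}, w^{(2)}} \Pr\bigl[\mathcal{F}\bigl(w^{(0)}, w^{(1)}, w^{(2)}\bigr)\bigr] \tag{7.140} \\
&\quad \le \exp\Bigl(n\Bigl(R^{(0)} + R^{(1)} + R^{(2)} - I\bigl(X; \hat{X}^{(0)}, \hat{X}^{(1)}, \hat{X}^{(2)}, \hat{X}^{(12)}\bigr) - I\bigl(\hat{X}^{(1)}; \hat{X}^{(2)} \,\big|\, \hat{X}^{(0)}\bigr) + \epsilon\Bigr)\Bigr). \tag{7.141}
\end{align}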
Case Q = {0, 1}: Again using Lemma 7.4, we get the following (using a
sloppy notation where we omit the arguments w(i) ):
X
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with overlap {0,1}
(0)
(1)
(2)
(1) (1) (2)
Pr F w , w , w
F v ,v ,v
X
w(0) ,w(1) ,w(2) ,
v (2) 6=w(2)
h
i
(0) , X
(1) A(n) (x)
Pr X
h
(0) (1)
,x
(2) , X
(12) A(n) X
,X
Pr X
i2
(0) (1)
(n)
X ,X
A (x)
(7.142)
(0) , X
(1)
exp n R(0) + R(1) + 2R(2) I X; X
(0) (1)
(0)
(2) , X
(12) X
,X
,X 2H X
(2) X
+ 2H X
(0) (1) (2)
(12) X
,X
,X
2H X
+
(0) , X
(1)
= exp n R(0) + R(1) + 2R(2) + I X; X
(0) , X
(1) , X
(2) , X
(12)
2 I X; X
(0)
(1) ; X
(2) X
2I X
+ .
(7.143)
(7.144)
Case Q = {0, 2}: This is identical to Case Q = {0, 1} with exchanged roles
(1) and X
(2) :
of X
X
Pr F w(0) , w(1) , w(2) F v (1) , v (1) , v (2)
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with overlap {0,2}
(0) , X
(2)
exp n R(0) + 2R(1) + R(2) + I X; X
(0) , X
(1) , X
(2) , X
(12)
2 I X; X
(0)
(1) ; X
(2) X
2I X
+ .
Case Q = {0}:
X
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with overlap {0}
(7.145)
Finally:
Pr F w(0) , w(1) , w(2) F v (1) , v (1) , v (2)
X
w(0) ,w(1) ,w(2) ,
v (1) 6=w(1) ,v (2) 6=w(2)
h
i
(0) A(n) (x)
Pr X
h
(0)
(1) , X
(2) , X
(12) A(n) X
,x
Pr X
i2
(0)
(n)
X A (x)
(0)
exp n R(0) + 2R(1) + 2R(2) I X; X
(0)
(1) , X
(2) , X
(12) X
,X
+ 2H X
(0)
(0)
(2) X
(1) X
2H X
2H X
(0) (1) (2)
(12) X
,X
,X
2H X
+
(0)
= exp n R(0) + 2R(1) + 2R(2) + I X; X
(0) , X
(1) , X
(2) , X
(12)
2 I X; X
(0)
(2) X
(1) ; X
2I X
+ .
(7.146)
(7.147)
(7.148)
(E[K])2
(0) , X
(1) , X
(2) , X
(12)
4 exp n R(0) R(1) R(2) + I X; X
(0)
(1) ; X
(2) X
+I X
+
(7.151)
(0) + ,
R(0) > I X; X
(0) , X
(1) + ,
(0) , X
(2) + ,
R(0) + R(2) > I X; X
(0)
(1) ; X
(2) X
+ .
+I X
7.5.3
(7.150)
(7.152)
(7.153)
(7.154)
(7.155)
Now, according to Figure 7.1 we do not actually have access to such a third
guaranteed channel. However, we can simulate it by adding the $nR^{(0)}$ nats to
both of the two other channels. Then in all three interesting cases (only $w^{(1)}$
arrives, only $w^{(2)}$ arrives, and $(w^{(1)}, w^{(2)})$ arrives) we do have these nats and
they act as if they had come over the virtual third channel! This now means
that we adapt our rates $R^{(1)}$ and $R^{(2)}$:
\begin{align*}
\tilde R^{(1)} &\triangleq R^{(1)} + R^{(0)}, \tag{7.156}\\
\tilde R^{(2)} &\triangleq R^{(2)} + R^{(0)}. \tag{7.157}
\end{align*}
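If $R^{(0)} > I\bigl(X; \hat X^{(0)}\bigr)$, then
\begin{align*}
\tilde R^{(1)} &> I\bigl(X; \hat X^{(0)}, \hat X^{(1)}\bigr), \tag{7.158}\\
\tilde R^{(2)} &> I\bigl(X; \hat X^{(0)}, \hat X^{(2)}\bigr), \tag{7.159}\\
\tilde R^{(1)} + \tilde R^{(2)} &> R^{(0)} + I\bigl(X; \hat X^{(0)}, \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)}\bigr) + I\bigl(\hat X^{(1)}; \hat X^{(2)} \big| \hat X^{(0)}\bigr)\\
&> I\bigl(X; \hat X^{(0)}\bigr) + I\bigl(X; \hat X^{(0)}, \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)}\bigr) + I\bigl(\hat X^{(1)}; \hat X^{(2)} \big| \hat X^{(0)}\bigr). \tag{7.160}
\end{align*}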
a bit scary, and in particular, it might prevent us from a numerical search for
an optimal choice of the PMF $Q_{U,\hat X^{(1)},\hat X^{(2)},\hat X^{(12)}|X}$. Luckily, we will be able to
prove that without loss of generality the alphabet size $|\mathcal U|$ can be restricted to
a finite value (see Lemma 7.7 ahead).
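We also point out that we could have guessed the rate region (7.158)–(7.160)
directly from (7.107)–(7.109). Simply rewrite (7.158)–(7.160) as follows:
\begin{align*}
\tilde R^{(1)} &> I\bigl(X; \hat X^{(0)}\bigr) + I\bigl(X; \hat X^{(1)} \big| \hat X^{(0)}\bigr), \tag{7.161}\\
\tilde R^{(2)} &> I\bigl(X; \hat X^{(0)}\bigr) + I\bigl(X; \hat X^{(2)} \big| \hat X^{(0)}\bigr), \tag{7.162}\\
\tilde R^{(1)} + \tilde R^{(2)} &> 2\, I\bigl(X; \hat X^{(0)}\bigr) + I\bigl(X; \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)} \big| \hat X^{(0)}\bigr) + I\bigl(\hat X^{(1)}; \hat X^{(2)} \big| \hat X^{(0)}\bigr). \tag{7.163}
\end{align*}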
We note that all terms are conditioned on $\hat X^{(0)}$ (since $w^{(0)}$ is known in all
cases), but we have to add the additional term $I\bigl(X; \hat X^{(0)}\bigr)$ for each rate to
make sure that the description $w^{(0)}$ is accurate enough.
7.6
Convexity
It is not difficult to show that the multiple description rate distortion region is
convex. The argument goes as follows. Assume that two multiple description
rate distortion quintuples
\[
\bigl(R^{(1)}, R^{(2)}, D^{(1)}, D^{(2)}, D^{(12)}\bigr) \quad\text{and}\quad \bigl(\tilde R^{(1)}, \tilde R^{(2)}, \tilde D^{(1)}, \tilde D^{(2)}, \tilde D^{(12)}\bigr)
\]
are achievable. Fix some $0 \le \lambda \le 1$ and some blocklength $n$. Let $n_1 \triangleq \lfloor \lambda n \rfloor$
and $n_2 \triangleq n - n_1$. Now use the first $n_1$ symbols from the source sequence
and encode them with a coding scheme achieving the first quintuple and then
take the remaining $n_2$ symbols and encode them with a coding scheme that
achieves the second quintuple. Note that we can choose $n$ large enough such
that both $n_1$ and $n_2$ become big enough for this to be possible.
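On average, we have now created a new coding scheme with rates
\begin{align*}
R_\lambda^{(1)} &= \lambda R^{(1)} + (1-\lambda)\tilde R^{(1)}, \tag{7.164}\\
R_\lambda^{(2)} &= \lambda R^{(2)} + (1-\lambda)\tilde R^{(2)} \tag{7.165}
\end{align*}
and distortions
\begin{align*}
D_\lambda^{(1)} &= \lambda D^{(1)} + (1-\lambda)\tilde D^{(1)}, \tag{7.166}\\
D_\lambda^{(2)} &= \lambda D^{(2)} + (1-\lambda)\tilde D^{(2)}, \tag{7.167}\\
D_\lambda^{(12)} &= \lambda D^{(12)} + (1-\lambda)\tilde D^{(12)}. \tag{7.168}
\end{align*}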
This proves that any point in the convex hull of an achievable rate distortion region is achievable, too.
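The time-sharing argument can be illustrated numerically. The following is only a minimal sketch: the two quintuples are hypothetical placeholder values (not taken from these notes), the time-sharing fraction is called `lam`, and the function name `time_share` is chosen here purely for illustration.

```python
import math

def time_share(q1, q2, lam, n):
    """Combine two (R1, R2, D1, D2, D12) quintuples by time-sharing.

    The first n1 = floor(lam * n) source symbols use the scheme for q1,
    the remaining n2 = n - n1 symbols use the scheme for q2.
    """
    n1 = math.floor(lam * n)
    n2 = n - n1
    # Per-symbol averages over the whole block of length n.
    return tuple((n1 * a + n2 * b) / n for a, b in zip(q1, q2))

q1 = (0.5, 0.5, 0.2, 0.2, 0.0)  # hypothetical achievable quintuple
q2 = (1.0, 0.0, 0.0, 0.5, 0.0)  # hypothetical achievable quintuple
combined = time_share(q1, q2, lam=0.25, n=1000)
print(combined)
```

Every component of the result is the same weighted average of the two inputs, which is why the new point lies on the line segment between the two original quintuples.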
7.7
Main Result
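We are now finally ready to summarize all results that we have derived so far
in this chapter.
Theorem 7.6 (Improved Achievable Multiple Description Rate
Distortion Region [VKG03]).
Consider a DMS $Q$ with finite alphabet $\mathcal X$ and three average per-letter
distortion measures $d^{(i)}(\cdot,\cdot)$ with corresponding reconstruction alphabets
$\hat{\mathcal X}^{(i)}$, $i = 1, 2, 12$. Let $U$ be an auxiliary RV on some finite alphabet
$\mathcal U$. Then the following region is an achievable multiple description rate
distortion region:
\[
\bigcup_{Q_{U,\hat X^{(1)},\hat X^{(2)},\hat X^{(12)}|X}} \mathcal R\bigl(Q_{X,U,\hat X^{(1)},\hat X^{(2)},\hat X^{(12)}}\bigr) \tag{7.169}
\]
where
\begin{align*}
\mathcal R\bigl(Q_{X,U,\hat X^{(1)},\hat X^{(2)},\hat X^{(12)}}\bigr) \triangleq \Bigl\{ \bigl(R^{(1)}, R^{(2)}, D^{(1)}, D^{(2)}, D^{(12)}\bigr) \colon\;
& R^{(1)} \ge I\bigl(X; \hat X^{(1)}, U\bigr),\\
& R^{(2)} \ge I\bigl(X; \hat X^{(2)}, U\bigr),\\
& R^{(1)} + R^{(2)} \ge I(X; U) + I\bigl(X; \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)}, U\bigr) + I\bigl(\hat X^{(1)}; \hat X^{(2)} \big| U\bigr),\\
& D^{(1)} \ge E\bigl[d^{(1)}\bigl(X, \hat X^{(1)}\bigr)\bigr],\\
& D^{(2)} \ge E\bigl[d^{(2)}\bigl(X, \hat X^{(2)}\bigr)\bigr],\\
& D^{(12)} \ge E\bigl[d^{(12)}\bigl(X, \hat X^{(12)}\bigr)\bigr] \Bigr\}. \tag{7.170}
\end{align*}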
Lemma 7.7. Without loss of optimality we can restrict the size of $\mathcal U$ in Theorem 7.6 to
\[
|\mathcal U| \le |\mathcal X| + 6. \tag{7.171}
\]
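\begin{align*}
R^{(2)} &\ge I(X; U) + I\bigl(X; \hat X^{(2)} \big| U\bigr), \tag{7.173}\\
R^{(1)} + R^{(2)} &\ge 2\, I(X; U) + I\bigl(X; \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)} \big| U\bigr) + I\bigl(\hat X^{(1)}; \hat X^{(2)} \big| U\bigr). \tag{7.174}
\end{align*}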
(7.175)
(7.176)
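\begin{align*}
I\bigl(X; \hat X^{(i)} \big| U\bigr) &= \sum_{u \in \mathcal U} Q_U(u)\, I\bigl(X; \hat X^{(i)} \big| U = u\bigr), \quad i = 1, 2, \tag{7.177}\\
I\bigl(X; \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)} \big| U\bigr) &= \sum_{u \in \mathcal U} Q_U(u)\, I\bigl(X; \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)} \big| U = u\bigr), \tag{7.178}\\
I\bigl(\hat X^{(1)}; \hat X^{(2)} \big| U\bigr) &= \sum_{u \in \mathcal U} Q_U(u)\, I\bigl(\hat X^{(1)}; \hat X^{(2)} \big| U = u\bigr), \tag{7.179}\\
E\bigl[d^{(i)}\bigl(X, \hat X^{(i)}\bigr)\bigr] &= \sum_{u \in \mathcal U} Q_U(u) \sum_{x \in \mathcal X} \sum_{\hat x^{(i)} \in \hat{\mathcal X}^{(i)}} Q_{X,\hat X^{(i)}|U}\bigl(x, \hat x^{(i)} \big| u\bigr)\, d^{(i)}\bigl(x, \hat x^{(i)}\bigr), \quad i = 1, 2, 12, \tag{7.180}
\end{align*}
where $Q_U$ is the marginal PMF coming from the chosen $Q_{U,\hat X^{(1)},\hat X^{(2)},\hat X^{(12)}|X}$
and from the given $Q_X$. Furthermore, note that
\[
Q_X(x) = \sum_{u \in \mathcal U} Q_U(u)\, Q_{X|U}(x|u), \quad x \in \mathcal X. \tag{7.181}
\]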
xX x
(1) X (1)
xX x
(2) X (2)
(2) ,
QX,X (2) U x, x
(2) u d(2) x, x
xX x
(12) X (12)
(12) ,
QX,X (12) U x, x
(12) u d(12) x, x
X  1u ,
(7.183)
QU (u)vu .
(7.184)
uU
We see that $\mathbf v$ is a convex combination of $|\mathcal U|$ vectors $\mathbf v_u$. From Carathéodory's Theorem (Theorem 1.20) it now follows that we can reduce the size
of $\mathcal U$ to at most $|\mathcal X| + 6$ values (note that $\mathbf v$ contains $|\mathcal X| + 5$ components!)
without changing $\mathbf v$, i.e., without changing the values of all right-hand side
terms in (7.170) and without changing the (given!) value of $Q_X(x)$ for any
$x \in \{1, \dots, |\mathcal X| - 1\}$ (and therefore also for $x = |\mathcal X|$; recall that since
$\sum_x Q_X(x) = 1$, the value of $Q_X(|\mathcal X|)$ is determined by the other values). This
proves the claim.
Note that we need to incorporate $Q_X$ into $\mathbf v$ because we do not choose $Q_U$
directly, but $Q_{U|X}$. Hence, when changing the alphabet $\mathcal U$ and $Q_U$, we have
to make sure that $Q_X$ still remains as given by the source.
Also note that the bound (7.171) can actually be reduced by 1 if we use
Theorem 1.22 instead of Theorem 1.20.
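Example 7.8. Let's continue with Example 7.3 and apply Theorem 7.6 to
the situation of the BSS with the Hamming distortion measure. We choose
$Q_{U,\hat X^{(1)},\hat X^{(2)},\hat X^{(12)}|X}$ such that
\begin{align*}
U &= \text{constant}, \tag{7.185}\\
\hat X^{(12)} &= X, \tag{7.186}
\end{align*}
and such that $\hat X^{(1)}$ and $\hat X^{(2)}$ have the joint conditional PMF given in Table 7.2.
Note that from this table we can see that
\begin{align*}
\hat X^{(1)} \cdot \hat X^{(2)} &= X, \tag{7.187}\\
\hat X^{(1)} &\perp \hat X^{(2)}, \tag{7.188}\\
\hat X^{(1)}, \hat X^{(2)} &\sim \text{Bernoulli}\bigl(1/\sqrt{2}\bigr). \tag{7.189}
\end{align*}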
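Table 7.2: The joint conditional PMF $Q_{\hat X^{(1)},\hat X^{(2)}|X}$.
\[
\begin{array}{c|cc|c}
Q_{\hat X^{(1)},\hat X^{(2)}|X}(\cdot,\cdot|0) & \hat X^{(2)} = 0 & \hat X^{(2)} = 1 & Q_{\hat X^{(1)}|X}(\cdot|0)\\ \hline
\hat X^{(1)} = 0 & 3 - 2\sqrt2 & \sqrt2 - 1 & 2 - \sqrt2\\
\hat X^{(1)} = 1 & \sqrt2 - 1 & 0 & \sqrt2 - 1\\ \hline
Q_{\hat X^{(2)}|X}(\cdot|0) & 2 - \sqrt2 & \sqrt2 - 1 &
\end{array}
\qquad
\begin{array}{c|cc|c}
Q_{\hat X^{(1)},\hat X^{(2)}|X}(\cdot,\cdot|1) & \hat X^{(2)} = 0 & \hat X^{(2)} = 1 & Q_{\hat X^{(1)}|X}(\cdot|1)\\ \hline
\hat X^{(1)} = 0 & 0 & 0 & 0\\
\hat X^{(1)} = 1 & 0 & 1 & 1\\ \hline
Q_{\hat X^{(2)}|X}(\cdot|1) & 0 & 1 &
\end{array}
\]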
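Table 7.3: The PMF $Q_{\hat X^{(1)},\hat X^{(2)}}$ derived from the PMF given in Table 7.2.
\[
\begin{array}{c|cc|c}
Q_{\hat X^{(1)},\hat X^{(2)}}(\cdot,\cdot) & \hat X^{(2)} = 0 & \hat X^{(2)} = 1 & Q_{\hat X^{(1)}}(\cdot)\\ \hline
\hat X^{(1)} = 0 & \frac{3 - 2\sqrt2}{2} & \frac{\sqrt2 - 1}{2} & \frac{2 - \sqrt2}{2}\\
\hat X^{(1)} = 1 & \frac{\sqrt2 - 1}{2} & \frac{1}{2} & \frac{\sqrt2}{2}\\ \hline
Q_{\hat X^{(2)}}(\cdot) & \frac{2 - \sqrt2}{2} & \frac{\sqrt2}{2} &
\end{array}
\]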
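With this choice we obtain
\begin{align*}
I\bigl(X; \hat X^{(1)}, U\bigr) &= H_b\Bigl(\tfrac{1}{\sqrt2}\Bigr) - \tfrac12 H_b\bigl(\sqrt2 - 1\bigr) - \tfrac12 \cdot 0\\
&= H_b\Bigl(\tfrac{1}{\sqrt2}\Bigr) - \tfrac12 H_b\bigl(\sqrt2 - 1\bigr)\\
&\approx 0.383 \text{ bits} \le \tfrac12 \text{ bits},\\
I\bigl(X; \hat X^{(2)}, U\bigr) &= H_b\Bigl(\tfrac{1}{\sqrt2}\Bigr) - \tfrac12 H_b\bigl(\sqrt2 - 1\bigr)\\
&\approx 0.383 \text{ bits} \le \tfrac12 \text{ bits},\\
I(X; U) + I\bigl(X; \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)}, U\bigr) + I\bigl(\hat X^{(1)}; \hat X^{(2)} \big| U\bigr)
&= I\bigl(X; \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)}\bigr) + I\bigl(\hat X^{(1)}; \hat X^{(2)}\bigr)\\
&= H(X) - H\bigl(X \big| \hat X^{(1)}, \hat X^{(2)}, \hat X^{(12)}\bigr) + 0\\
&= H(X)\\
&= 1 \text{ bit} \le \tfrac12 + \tfrac12 \text{ bits} \tag{7.190–7.199}
\end{align*}
and
\begin{align*}
E\bigl[d^{(1)}\bigl(X, \hat X^{(1)}\bigr)\bigr] &= \Pr\bigl[X \ne \hat X^{(1)}\bigr] = \frac{\sqrt2 - 1}{2}, \tag{7.200}\\
E\bigl[d^{(2)}\bigl(X, \hat X^{(2)}\bigr)\bigr] &= \Pr\bigl[X \ne \hat X^{(2)}\bigr] = \frac{\sqrt2 - 1}{2}, \tag{7.201}\\
E\bigl[d^{(12)}\bigl(X, \hat X^{(12)}\bigr)\bigr] &= \Pr\bigl[X \ne \hat X^{(12)}\bigr] = 0. \tag{7.202}
\end{align*}
Hence the quintuple
\[
\Bigl(\tfrac12 \text{ bits},\; \tfrac12 \text{ bits},\; \tfrac{\sqrt2 - 1}{2},\; \tfrac{\sqrt2 - 1}{2},\; 0\Bigr) \tag{7.203}
\]
satisfies all conditions of Theorem 7.6, where $\frac{\sqrt2 - 1}{2} \approx 0.207$.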
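The numbers of this example can be checked numerically. The sketch below assumes that $\hat X^{(1)}$ and $\hat X^{(2)}$ are independent Bernoulli with $\Pr[\hat X^{(i)} = 1] = 1/\sqrt2$ and that $X = \hat X^{(1)} \cdot \hat X^{(2)}$; the variable names are illustrative only.

```python
import math
from itertools import product

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

q = 1 / math.sqrt(2)  # assumed Pr[Xhat_i = 1]

# Joint PMF of (x, xhat1, xhat2) with x = xhat1 * xhat2.
joint = {}
for a, b in product((0, 1), repeat=2):
    pa = q if a else 1 - q
    pb = q if b else 1 - q
    joint[(a * b, a, b)] = pa * pb

pr_x1 = sum(p for (x, a, b), p in joint.items() if x == 1)
# Hamming distortion of the first description: Pr[X != Xhat1].
d1 = sum(p for (x, a, b), p in joint.items() if x != a)
# I(X; Xhat1) = H(Xhat1) - H(Xhat1 | X); given X = 1 we know Xhat1 = 1.
p_a1_given_x0 = sum(p for (x, a, b), p in joint.items() if x == 0 and a == 1) / (1 - pr_x1)
mi1 = hb(q) - (1 - pr_x1) * hb(p_a1_given_x0)

print(round(pr_x1, 3), round(d1, 3), round(mi1, 3))
```

The source $X$ comes out uniform, as required for a BSS, and the distortion and mutual information agree with the values quoted in the example.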
Chapter 8
Introduction
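[Figure 8.1: block diagram. The source $Q_{X,Y}$ emits $X_1, \dots, X_n$ to the encoder and $Y_1, \dots, Y_n$ to the decoder; the decoder outputs $\hat X_1, \dots, \hat X_n$.]
Figure 8.1: The Wyner–Ziv problem: a rate distortion system where the decoder has access to side-information.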
At first thought, this might seem a strange problem. Why should the
decoder have side-information, but the encoder not? However, there exist
some important practical situations where we have exactly this constellation.
For example:
In a wireless relay channel, the relay passes on its noisy observation $Y$
about the message $X$ to the destination. The receiver then has access
to both $X$ (directly from the transmitter) and $Y$ (from the relay), while
the encoder has no idea about $Y$ at the time of transmission.
(8.1)
\[
d(\mathbf x, \hat{\mathbf x}) = \frac{1}{n} \sum_{k=1}^{n} d(x_k, \hat x_k). \tag{8.2}
\]
Definition 8.2. A Wyner–Ziv rate distortion pair $(R, D)$ is said to be achievable for a DMS $Q_{X,Y}$ and a distortion measure $d(\cdot,\cdot)$ if there exists a sequence
of $(e^{nR}, n)$ Wyner–Ziv rate distortion coding schemes $(\phi_n, \psi_n)$ with
\[
\lim_{n \to \infty} E\bigl[d\bigl(\mathbf X, \psi_n(\phi_n(\mathbf X), \mathbf Y)\bigr)\bigr] \le D. \tag{8.3}
\]
The Wyner–Ziv rate distortion region for a DMS $Q_{X,Y}$ and a distortion measure $d(\cdot,\cdot)$ is the closure of the set of all achievable Wyner–Ziv rate distortion
pairs $(R, D)$.
Definition 8.3. The Wyner–Ziv rate distortion function $R_{\text{WZ}}(D)$ is the infimum of rates $R$ such that $(R, D)$ is in the Wyner–Ziv rate distortion region
for a given distortion $D$.
1
Note that X and Y are dependent, but that (Xk , Yk ) is IID over time k.
8.2
[Figure 8.2: the codewords are drawn grouped into bins labelled bin 1, bin 2, bin 3, ..., bin $(e^{nR} - 1)$, bin $e^{nR}$.]
Figure 8.2: The idea of binning: The codewords are grouped into bins. Instead
of transmitting the codeword index, the decoder is only informed
about the bin number.
The encoder will not transmit the index of the codeword, but instead only
the bin number. The decoder should then be able to figure out the correct
codeword within the bin with the help of the side-information. This idea is
called binning.
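The idea of binning can be tried out on a very small toy problem. The sketch below is not the random binning used in the proof that follows: it is a deterministic, lossless illustration with hypothetical choices (all length-3 binary words as codewords, four bins labelled by two parity bits, and side-information that differs from the source word in at most one position).

```python
from itertools import product

def hamming(a, b):
    """Number of positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))

# All length-3 binary words act as "codewords". They are grouped into
# 4 bins, each holding two words at Hamming distance 3.
words = list(product((0, 1), repeat=3))

def bin_number(x):
    # Two parity bits identify the bin (hypothetical labelling).
    return (x[0] ^ x[1], x[1] ^ x[2])

bins = {}
for w in words:
    bins.setdefault(bin_number(w), []).append(w)

def decode(b, y):
    # Pick the word in bin b that is closest to the side-information y.
    return min(bins[b], key=lambda w: hamming(w, y))

x = (1, 0, 1)
y = (1, 1, 1)  # side-information: differs from x in one position
print(bin_number(x), decode(bin_number(x), y))
```

Only two bits (the bin number) are sent instead of three, and the decoder resolves the remaining ambiguity inside the bin with the help of $\mathbf y$.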
1: Setup: We need an auxiliary random variable $U$ with some alphabet $\mathcal U$.
This RV basically represents the codeword at the encoder. So, we choose
$\mathcal U$ and a PMF $Q_{U|X}$, and then compute $Q_U$ as marginal distribution of
$Q_X Q_{U|X}$.
4: Decoder Design: For a given bin number $w$ and a received side-information sequence $\mathbf y$, the decoder tries to find an index $\hat v$ such that
\[
\bigl(\mathbf y, \mathbf U(w, \hat v)\bigr) \in A_\epsilon^{(n)}(Q_{Y,U}). \tag{8.5}
\]
If there are several choices for $\hat v$, the decoder simply picks one. If there
is no such $\hat v$, it sets $\hat v \triangleq 1$. Then the decoder puts out
\[
\hat{\mathbf X} = f^n\bigl(\mathbf U(w, \hat v), \mathbf y\bigr) \tag{8.6}
\]
where we use the notation $f^n$ to denote that each component of $\hat{\mathbf X}$ is
created using the function $f(\cdot,\cdot)$, i.e.,
\[
\hat X_k = f\bigl(U_k(w, \hat v), y_k\bigr), \quad k = 1, \dots, n. \tag{8.7}
\]
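5: Performance Analysis: For the analysis we distinguish five different
cases that are not necessarily disjoint, but that together cover the entire
sample space:
1. The source and side-information sequences are not jointly typical:
\[
(\mathbf X, \mathbf Y) \notin A_\epsilon^{(n)}(Q_{X,Y}). \tag{8.8}
\]
2. The source and side-information sequences are jointly typical,
\[
(\mathbf X, \mathbf Y) \in A_\epsilon^{(n)}(Q_{X,Y}), \tag{8.9}
\]
but the encoder cannot find a pair $(w, v)$ such that (8.4) is satisfied.
3. The source and side-information sequences are jointly typical and
there exists a good choice $(w, v)$ at the encoder,
\begin{align*}
(\mathbf X, \mathbf Y) &\in A_\epsilon^{(n)}(Q_{X,Y}), \tag{8.10}\\
\bigl(\mathbf X, \mathbf U(w, v)\bigr) &\in A_\epsilon^{(n)}(Q_{X,U}), \tag{8.11}
\end{align*}
but there exists some $\tilde v \ne v$ such that
\[
\bigl(\mathbf Y, \mathbf U(w, \tilde v)\bigr) \in A_\epsilon^{(n)}(Q_{Y,U}). \tag{8.12}
\]
4. The source and side-information sequences are jointly typical and
there exists a good choice $(w, v)$ at the encoder,
\begin{align*}
(\mathbf X, \mathbf Y) &\in A_\epsilon^{(n)}(Q_{X,Y}), \tag{8.13}\\
\bigl(\mathbf X, \mathbf U(w, v)\bigr) &\in A_\epsilon^{(n)}(Q_{X,U}), \tag{8.14}
\end{align*}
but
\[
\bigl(\mathbf Y, \mathbf U(w, v)\bigr) \notin A_\epsilon^{(n)}(Q_{Y,U}). \tag{8.15}
\]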
(8.16)
8.2.1
Case 1
8.2.2
(8.17)
Case 2
This is very similar to Case 2 of the analysis of the rate distortion theorem,
see (5.108)–(5.117). We have
Pr(Case 2)
(n)
= Pr (X, Y) A(n)
(Q
)
@
(w,
v)
:
X,
U(w,
v)
A
(Q
)
X,Y
X,U
(8.18)
= Pr (X, Y) A(n)
(QX,Y )
h
i
Pr @ (w, v) : X, U(w, v) A(n) (QX,U ) (X, Y) A(n)
(Q
)
X,Y
(8.19)
= Pr (X, Y) A(n)
(QX,Y )

{z
}
enR
0
enR
Y Y
w=1 v=1
(QX )
Pr X, U(w, v)
/ A(n)
(QX,U ) X A(n)
(8.20)
nR
enR eY
Y
w=1 v=1
nR
w=1 v=1
(8.21)
Pr U(w, v)
/ A(n) (QX,U X) X A(n) (QX )
(8.22)
nR0
e
eY
Y
nR
Pr X, U(w, v)
/ A(n)
(QX,U ) X A(n)
(QX )
nR0
e
eY
Y
w=1 v=1
nR
<
(8.23)
nR0
e
eY
Y
w=1 v=1
1 Pr U(w, v) A(n)
(QX,U X) X A(n) (QX )
1 en(I(X;U )+)
n(I(X;U )+)
= 1e
en(R+R0 )
(8.24)
(8.25)
0
exp en(R+R ) en(I(X;U )+)
0
= exp en(R+R I(X;U )) .
(8.26)
(8.27)
Here, in (8.22) we use the definition of conditionally typical sets (Definition 4.10); (8.24) follows from TC; and the inequality (8.26) is due to the
Exponentiated IT Inequality (Corollary 1.10).
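The last step of the Case 2 bound rests on the elementary inequality $(1 - a)^m \le e^{-ma}$ for $0 \le a \le 1$. A small numerical sketch follows; the values chosen for $n$, $I(X;U)$ and $R + R'$ are hypothetical, the slack term of the typicality bound is ignored, and all quantities are in nats.

```python
import math

def product_bound(a, m):
    """Return ((1 - a)^m, exp(-m * a)) for 0 <= a <= 1."""
    return (1.0 - a) ** m, math.exp(-m * a)

n = 20
I_XU = 0.5      # hypothetical value of I(X;U) in nats
rate_sum = 0.8  # hypothetical value of R + R' in nats

a = math.exp(-n * I_XU)            # probability that one codeword is a good match
m = round(math.exp(n * rate_sum))  # total number of codewords
lhs, rhs = product_bound(a, m)
print(lhs <= rhs, rhs)
```

Because the rate sum exceeds $I(X;U)$, the exponent $m a = e^{n(R + R' - I(X;U))}$ grows with $n$ and the bound decays doubly exponentially.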
So as long as
R + R0 > I(X; U ) + ,
(8.28)
8.2.3
Case 3
We have
Pr(Case 3) = Pr (X, Y) A(n)
(QX,Y ) X, U(w, v) A(n) (QX,U )
(8.29)
v 6= v : Y, U(w, v) A(n)
(Q
)
Y,U
[
Pr
Y, U(w, v) A(n)
(QY,U )
(8.30)
v,
v 6=v
v,
v 6=v
Pr Y, U(w, v) A(n)
(QY,U )
(8.31)
en(I(Y ;U ))
(8.32)
v,
v 6=v
0
= enR 1 en(I(Y ;U ))
0
(8.33)
(8.34)
Here, in (8.30) we enlarge the set; in (8.31) we apply the Union Bound; and
(8.32) follows from TC.
So as long as
R0 < I(Y ; U )
(8.35)
8.2.4
Case 4
Note that by the definition of jointly typical sets, if (X, Y) is jointly typical
and (X, U) is jointly typical, but (Y, U) is not, then (X, Y, U) cannot be
(8.36)
(8.37)
(8.38)
Pr Y, X, U(w, v)
/ A(n) (QY,X,U ) X, U(w, v) A(n) (QX,U )
(8.39)
(n)
(n)
Pr Y, X, U(w, v)
/ A (QY,X,U ) X, U(w, v) A (QX,U ) (8.40)
= 1 Pr Y, X, U(w, v) A(n) (QY,X,U ) X, U(w, v) A(n) (QX,U )
X
=1
(x,u)
(QX,U )
(8.41)
(n)
Pr X = x, U(w, v) = u X, U(w, v) A (QX,U )
(n)
A
=1
Pr (Y, x, u) A(n) (QY,X,U ) X = x, U(w, v) = u
(8.42)
X
(n)
Pr X = x, U(w, v) = u X, U(w, v) A (QX,U )
(x,u)
(QX,U )
(n)
A
Pr Y A(n) (QY,X,U x, u) X = x, U = u ,
(8.43)
where the first inequality (8.37) follows because we enlarge the event (the
event $(\mathbf X, \mathbf Y, \mathbf U) \notin A_\epsilon^{(n)}$ follows from the event $(\mathbf Y, \mathbf U) \notin A_\epsilon^{(n)}$); and where
in (8.38) we enlarge the event once more by dropping one intersecting event.
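So we see that we need a lower bound on
\[
\Pr\bigl[\mathbf Y \in A_\epsilon^{(n)}(Q_{Y,X,U} | \mathbf x, \mathbf u) \,\big|\, \mathbf X = \mathbf x, \mathbf U = \mathbf u\bigr]
= Q^n_{Y|X}\bigl(A_\epsilon^{(n)}(Q_{Y,X,U} | \mathbf x, \mathbf u) \,\big|\, \mathbf x\bigr). \tag{8.44}
\]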
(n)
by TB3, but that we cannot apply this here because (U, X, Y) is not generated according to QU,X,Y , but U is independent of (X, Y). However, we do
3
However, note that the opposite direction of this argument does not necessarily hold!
not need to worry about how U was generated, because in (8.44) (x, u) are
already given as being jointly typical. Hence, we expect that the lower bound
in (8.45) in principle also holds for (8.44).
To prove this, we need to adapt the proof of TB3. Similarly to (4.96) we
define
n
Fx,u , PY X,U Pn (YX U) :
o
y
/ A(n)
(Q
x,
u)
with
P
=
P
(8.46)
U,X,Y
yx,u
Y X,U
to be the set of all conditional types of all conditionally nontypical sequences
and argue identically to (4.96)(4.105) to show4 that for any PY X,U Fx,u ,
DPx,u PY X,U
QY X
X
,
Px,u (
a, u
) D PY X,U (
a, u
)
QY X (
a)
(8.47)
(
a,
u)X U
s.t. Px,u (
a,
u)>0
2
2U2 X 2 Y2
log e.
(8.48)
Using an adapted version of CTT2 where we make use of the fact that
QY X (yk xk ) = QY X,U (yk xk , uk )
(8.49)
(8.50)
1
1
PY X,U Fx,u
2
2U 2 X 2 Y2
log e
(8.51)
(8.52)
(8.53)
(8.54)
(8.55)
PY X,U Fx,u
Basically we use the fact that PY X,U deviates notably from QY X because the sequence
is nontypical, and we then apply the Pinsker Inequality.
= 1 Fx,u  e
2
2U 2 X 2 Y2
log e
2
n
2U 2 X 2 Y2
2
n
2U 2 X 2 Y2
(8.56)
log e
log e
= 1 t (n, , U X Y).
(8.57)
(8.58)
(8.59)
(8.60)
Remark 8.4. Note that we have proven here the following statement: Consider a joint distribution $Q_{X,Y,Z}$ forming a Markov chain $X \multimap Y \multimap Z$.
Let two sequences $(\mathbf x, \mathbf y) \in A_\epsilon^{(n)}(Q_{X,Y})$ be jointly typical and assume that
the sequence $\mathbf Z$ is generated according to $Q^n_{Z|Y}(\cdot|\mathbf y)$, ignoring $Q_X$ or $\mathbf x$. Then
(n)
8.2.5
Case 5
(8.61)
QX,Y,U (a, b, c) +
d a, f (c, b)
X YU
(a,b,c)X YU
EQX,Y,U d X, f (U, Y ) + dmax .
8.2.6
(8.62)
(8.63)
(8.64)
(8.65)
(8.66)
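We are now ready to combine all these cases. Using the fact that
all five cases combined cover the entire probability space, we use the Union
Bound on total expectation (Theorem 1.14) to get
\begin{align*}
E\bigl[d(\mathbf X, \hat{\mathbf X})\bigr]
&\le \sum_{i=1}^{5} E\bigl[d(\mathbf X, \hat{\mathbf X}) \,\big|\, \text{Case } i\bigr] \Pr(\text{Case } i) \tag{8.67}\\
&= \sum_{i=1}^{4} \underbrace{E\bigl[d(\mathbf X, \hat{\mathbf X}) \,\big|\, \text{Case } i\bigr]}_{\le\, d_{\max}} \Pr(\text{Case } i)
+ E\bigl[d(\mathbf X, \hat{\mathbf X}) \,\big|\, \text{Case } 5\bigr] \underbrace{\Pr(\text{Case } 5)}_{\le\, 1} \tag{8.68}
\end{align*}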
0
dmax t (n, , X Y) + dmax exp en(R+R I(X;U ))
I(X; U ) < R + R ,
0
I(Y ; U ) > R ,
(8.69)
(8.70)
(8.71)
(8.72)
we are able to achieve the rate distortion pair $(R, D)$. Note that we are not
interested in $R'$, i.e., we can actually combine (8.71) and (8.72) to the condition
\[
R > I(X; U) - I(Y; U). \tag{8.73}
\]
Since we are trying to make this condition as loose as possible, we will then
decide to choose $Q_{U|X}$ and $f(\cdot,\cdot)$ such that the RHS of (8.73) is minimized.
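The right-hand side of (8.73) is easy to evaluate for a concrete choice of $Q_{U|X}$. The sketch below uses hypothetical binary toy distributions (not taken from these notes), builds the joint PMF with the Markov structure $Q_X Q_{Y|X} Q_{U|X}$, and compares $I(X;U) - I(Y;U)$ with the conditional mutual information $I(X;U|Y)$.

```python
import math
from itertools import product

def mutual_information(pab):
    """I(A;B) in bits from a dict {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in pab.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

# Hypothetical toy distributions (binary alphabets).
q_x = {0: 0.5, 1: 0.5}
q_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
q_u_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}

# Joint PMF of (u, x, y): U depends on X only, not on Y.
joint = {(u, x, y): q_x[x] * q_y_given_x[x][y] * q_u_given_x[x][u]
         for u, x, y in product((0, 1), repeat=3)}

def marginal(keep):
    """Marginal PMF over the coordinates listed in keep (0 = u, 1 = x, 2 = y)."""
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

i_xu = mutual_information(marginal((1, 0)))
i_yu = mutual_information(marginal((2, 0)))

# I(X;U|Y) = sum_y Q_Y(y) I(X;U | Y = y)
i_xu_given_y = 0.0
for (y,), py in marginal((2,)).items():
    cond = {(x, u): p / py for (u, x, yy), p in joint.items() if yy == y}
    i_xu_given_y += py * mutual_information(cond)

print(i_xu - i_yu, i_xu_given_y)
```

For this Markov structure the two printed values coincide, so the required rate can equivalently be read as the information that $U$ carries about $X$ beyond what $Y$ already provides.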
8.3
Based on (8.70) and (8.73), we define the following rate distortion function.
Definition 8.5. The Wyner–Ziv rate distortion function is defined as
\[
R_{\text{WZ}}(D) \triangleq \min_{Q_{U|X},\, f \colon E[d(X, f(U,Y))] \le D} \bigl\{ I(X; U) - I(Y; U) \bigr\}. \tag{8.74}
\]
From Section 8.2 we know that any rate distortion pair larger than the
Wyner–Ziv rate distortion function is achievable, i.e., $R_{\text{WZ}}(D)$ is an upper
bound on the rate distortion function of the Wyner–Ziv problem. In Section 8.5 we will prove that it actually also is a lower bound, i.e., that $R_{\text{WZ}}(D)$
constitutes the rate distortion function of the rate distortion problem with
side-information at the decoder.
The form of $R_{\text{WZ}}(D)$ is interesting. First, there is the auxiliary random
variable $U$ that is difficult to understand intuitively. Be aware that $U$ does not
represent what is transmitted (we transmit the bin number $W$!). It is better
to think of $U$ as the description of the codebook used at the encoder. In the
standard rate distortion problem we only use one codebook that is described
by $\hat X$. Here we use two: one (described by $U$) that does not consider the
side-information $Y$ and one (described by $\hat X$) that does take $Y$ into account.
Depending on the correlation between $X$ and $Y$, the encoder must use a very
detailed codebook with many bins and only very few codewords per bin, or it
can use a coarse codebook with only a few bins, but many codewords per bin.
In the extreme case of $X \perp Y$, the encoder must use $U = \hat X$, i.e., each bin
contains exactly one codeword.
Note that the function $f(\cdot,\cdot)$ at the decoder is actually a degenerate conditional probability distribution $Q_{\hat X|U,Y}$. We could replace the choice of $f$ by
a choice of $Q_{\hat X|U,Y}$, but an optimal
choice always only contains probability values of 1 and 0, i.e., given a value of
$U$ and $Y$, the value of $\hat X$ is deterministic.
Due to the Markov nature of $Q_{U,X,Y}$, i.e., $U$ depends only on $X$, not on
$(X, Y)$:
\[
Q_{U,X,Y} = Q_X\, Q_{Y|X}\, Q_{U|X}, \tag{8.75}
\]
(8.76)
(8.77)
= I(X; U Y ).
(8.79)
(8.78)
= f (U, Y ))
(because X
(8.80)
(8.81)
The latter mutual information corresponds to the situation when both encoder
and decoder know $Y$, i.e., it describes the rate distortion region of the rate
distortion problem with global side-information:
\[
R_{X|Y}(D) = \min_{Q_{\hat X|X,Y} \colon E[d(X,\hat X)] \le D} I(X; \hat X | Y). \tag{8.82}
\]
In general, for most sources and distortion measures, $R_{X|Y}(D)$ is strictly
smaller than $R_{\text{WZ}}(D)$, i.e., we usually have
\[
R(D) > R_{\text{WZ}}(D) > R_{X|Y}(D). \tag{8.83}
\]
Before we show two examples on how one can try to evaluate the Wyner–Ziv
rate distortion function (8.74), we would like to point out that, similar
to our discussion at the end of Section 7.5.3, also here we have the problem
that in (8.74) we not only optimize over the best choice of $Q_{U|X}$, but that
we even have the freedom to select an optimal alphabet $\mathcal U$ for $U$. Luckily,
we reduce the dimensionality of the problem by proving that without loss of
generality we can restrict the size of the freely choosable alphabet $\mathcal U$.
(Footnote: This is because $I(X; U) - I(Y; U)$ does not directly depend on $\hat X$, such that for every
pair $(u, y)$ the value $\hat x$ can be chosen in an optimal fashion to minimize $E[d(X, \hat X)]$.)
Lemma 8.6. Without loss of optimality we can restrict the size of $\mathcal U$ in the
definition of the Wyner–Ziv rate distortion function in (8.74) to
\[
|\mathcal U| \le |\mathcal X| + 2. \tag{8.84}
\]
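Proof: The proof is very similar to the proof of Lemma 7.7. Consider a
given choice of $\mathcal U$, $Q_{U|X}$, and $f(\cdot,\cdot)$, and recall from (8.79) that
\begin{align*}
I(X; U) - I(Y; U) &= I(X; U | Y) \tag{8.85}\\
&= H(X|Y) - H(X|Y, U) \tag{8.86}\\
&= \sum_{u \in \mathcal U} Q_U(u) \bigl( H(X|Y) - H(X|Y, U = u) \bigr), \tag{8.87}
\end{align*}
where $Q_U$ is the marginal PMF coming from the chosen $Q_{U|X}$ and from the
given $Q_X$. Furthermore, note that
\[
E[d(X, f(U, Y))] = \sum_{u \in \mathcal U} Q_U(u) \sum_{x \in \mathcal X} \sum_{y \in \mathcal Y} Q_{X,Y|U}(x, y|u)\, d(x, f(u, y)) \tag{8.88}
\]
and that
\[
Q_X(x) = \sum_{u \in \mathcal U} Q_U(u)\, Q_{X|U}(x|u), \quad x \in \mathcal X. \tag{8.89}
\]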
(8.90)
QXU (1u), . . . , QXU (X  1u) ,
(8.91)
QU (u)vu .
uU
(8.92)
We see that $\mathbf v$ is a convex combination of $|\mathcal U|$ vectors $\mathbf v_u$. From Carathéodory's Theorem (Theorem 1.20) it now follows that we can reduce the size
of $\mathcal U$ to at most $|\mathcal X| + 2$ values (note that $\mathbf v$ contains $|\mathcal X| + 1$ components!)
without changing $\mathbf v$, i.e., without changing the values of $I(X; U) - I(Y; U)$ and
of $E[d(X, f(U, Y))]$, and without changing the (given!) value of $Q_X(x)$ for any
$x \in \{1, \dots, |\mathcal X| - 1\}$ (and therefore also for $x = |\mathcal X|$). This proves the claim.
Note that we need to incorporate $Q_X$ into $\mathbf v$ because we do not choose $Q_U$
directly, but $Q_{U|X}$. Hence, when changing the alphabet $\mathcal U$ and $Q_U$, we have
to make sure that $Q_X$ still remains as given by the source.
Also note that the bound (8.84) can actually be reduced by 1 if we use
Theorem 1.22 instead of Theorem 1.20.
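Example 8.7. Let's consider the example of a binary symmetric source (BSS)
with the Hamming distortion measure. Suppose that the side-information $Y_k$
that is available at the decoder is the output of a BSC that has input $X_k$ and
crossover probability $p$. See Figure 8.3. The task is to compute $R_{\text{WZ}}(\cdot)$.
[Figure 8.3: block diagram. A BSS emits $X_1, \dots, X_n$ to the encoder; the decoder receives the encoder output together with $Y_1, \dots, Y_n$, the output of a BSC with crossover probability $p$ and input $X_1, \dots, X_n$, and delivers $\hat X_1, \dots, \hat X_n$ to the destination.]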
(8.93)
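and
\[
E\bigl[d(X, \hat X)\bigr] = E\bigl[d(X, f(U, Y))\bigr] = E[d(X, Y)] = \Pr[X \ne Y] = p. \tag{8.94}
\]
Hence, if $D \ge p$, we know that $R_{\text{WZ}}(D) \le 0$. However, since $R_{\text{WZ}}(D) \ge 0$ by
definition, we see that we have found
\[
R_{\text{WZ}}(D) = 0, \quad \text{for } D \ge p. \tag{8.95}
\]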
(8.96)
Hence,
I(X; U ) I(Y ; U ) = 1 Hb () 1 Hb (p ? )
(8.97)
= Pr[X 6= U ] = .
E d(X, X)
(8.99)
= Hb (p ? ) Hb ()
(8.98)
and
One can show that the RHS of (8.100) actually achieves the minimum in
(8.74), i.e., we have found the exact value of the Wyner–Ziv rate distortion
function.
Once more, we would like to remind the reader that U is not what is
transmitted over the channel! As an example take D = 0. In this case the
optimal choices for and in (8.100) are = 1 and = 0, i.e., we only use
the second strategy with U = X. To transmit X we needed a rate of 1 bit,
however, (8.100) shows that we can do with 1 Hb (p) Hb (0) = Hb (p) bits.
The rest is provided by Y !
Recall that the standard rate distortion function for the setting of this
example is
\[
R(D) = \bigl(1 - H_b(D)\bigr)^{+}. \tag{8.101}
\]
One can check that this is always strictly larger than (8.100) unless $D \ge \frac12$ or
$p = \frac12$, i.e., unless either $R(D) = 0$ or $X \perp Y$.
If the encoder also has access to $Y$ we get
\[
R_{X|Y}(D) = \bigl(H_b(p) - H_b(D)\bigr)^{+}. \tag{8.102}
\]
This in turn is always strictly smaller than $R_{\text{WZ}}(D)$ unless $D = 0$, $D \ge p$, or
$p = \frac12$.
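These comparisons can be explored numerically. The sketch below evaluates (8.101), (8.102) and the expression $H_b(p \star \beta) - H_b(\beta)$ from (8.98), where the name `beta` for the crossover probability between $X$ and $U$ is an assumption made here for illustration. Note that the last expression is only the rate of the pure binning strategy, not the full minimization over time-sharing.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def conv(p, b):
    """Binary convolution p * b = p(1 - b) + (1 - p)b."""
    return p * (1 - b) + (1 - p) * b

def rate_no_side_info(d):
    """Standard rate distortion function of the BSS, cf. (8.101)."""
    return max(1.0 - hb(d), 0.0)

def rate_side_info_everywhere(p, d):
    """Side-information at encoder and decoder, cf. (8.102)."""
    return max(hb(p) - hb(d), 0.0)

def rate_binning_only(p, beta):
    """I(X;U) - I(Y;U) when U is X observed through a BSC(beta), cf. (8.98)."""
    return hb(conv(p, beta)) - hb(beta)

p = 0.1
for d in (0.0, 0.02, 0.05):
    print(d, rate_side_info_everywhere(p, d), rate_binning_only(p, d), rate_no_side_info(d))
```

At $D = 0$ the binning rate equals $H_b(p)$, matching the discussion above, and for the small distortions shown it sits between the two other curves.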
Strictly speaking we cannot do so, as we have only proven (8.74) for the situation of
finite alphabets. But it can be generalized to the Gaussian case, too.
We again start by dropping the minimization in (8.74) and pick some PDF
U X and some function f (, ), which leads to an upper bound to RWZ (D).
If the decoder only uses $Y$ to make an estimate of $X$ (i.e., $U$ is a constant
and ignored), then the best estimator is the linear estimator $E[X|Y]$, i.e., we
choose
\[
\hat X = f(U, Y) = E[X|Y]
\]
which yields the following average distortion
\begin{align*}
E\bigl[d(X, \hat X)\bigr] &= E\bigl[(X - \hat X)^2\bigr] \tag{8.104}\\
&= E\bigl[(X - E[X|Y])^2\bigr] = \sigma^2_{\text{pred}}, \tag{8.105}
\end{align*}
where $\sigma^2_{\text{pred}}$ denotes the variance of $X$ when knowing $Y$, i.e., the prediction
error of $X$ when observing $Y$:
\[
\sigma^2_{\text{pred}} = E\bigl[(X - E[X|Y])^2\bigr] = \sigma^2 (1 - \rho^2). \tag{8.106}
\]
(For details on how to compute the prediction error, go back to conditional
Gaussian random variables [Mos14, Appendices A & B].) Hence, if $D \ge \sigma^2 (1 - \rho^2)$,
we have $R_{\text{WZ}}(D) \le 0$ and, since the rate distortion function cannot be
negative,
\[
R_{\text{WZ}}(D) = 0, \quad \text{for } D \ge \sigma^2 (1 - \rho^2). \tag{8.107}
\]
If $D < \sigma^2 (1 - \rho^2)$, we choose $U = X + Z$ where $Z$ is chosen as $Z \sim \mathcal N\bigl(0, \sigma_Z^2\bigr)$,
$Z \perp X$, and the decoder makes the best estimate of $X$ given both $Y$ and $U$:
\[
\hat X = f(U, Y) = E[X | Y, U]. \tag{8.108}
\]
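This choice yields
\begin{align*}
I(X; U) - I(Y; U) &= I(X; U | Y) && \text{(by (8.81))}\\
&= h(X|Y) - h(X|Y, U)\\
&= h(X|Y) - h(X - \hat X | Y, U) && \text{(because } \hat X = f(U, Y)\text{)}\\
&= h(X|Y) - h(X - \hat X)\\
&= \frac12 \log\bigl(2\pi e\, \sigma^2_{\text{pred}}\bigr) - \frac12 \log\bigl(2\pi e\, \sigma^2_{\text{MMSE}}\bigr)\\
&= \frac12 \log\left(\frac{\sigma^2_{\text{pred}}}{\sigma^2_{\text{MMSE}}}\right), \tag{8.109–8.114}
\end{align*}
where $\sigma^2_{\text{pred}}$ is again the prediction error of $X$ when observing $Y$, and
where $\sigma^2_{\text{MMSE}}$ is the variance of the error when optimally estimating $X$ given
$Y$ and $U$, i.e., the minimum mean squared error (MMSE). Since our distortion
measure is exactly this MMSE and we therefore require that $\sigma^2_{\text{MMSE}} \le D$, we
will choose $\sigma_Z^2$ such that we have
\[
\sigma^2_{\text{MMSE}} = D. \tag{8.115}
\]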
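(Note that this is possible because $D < \sigma^2 (1 - \rho^2)$.) Hence, we have found
the following bound:
\[
R_{\text{WZ}}(D) \le \frac12 \log\left(\frac{\sigma^2 (1 - \rho^2)}{D}\right), \quad \text{for } D < \sigma^2 (1 - \rho^2). \tag{8.116}
\]
Combined with (8.107) this now gives
\[
R_{\text{WZ}}(D) \le \left(\frac12 \log \frac{\sigma^2 (1 - \rho^2)}{D}\right)^{+}. \tag{8.117}
\]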
\[
R(D) = \left(\frac12 \log \frac{\sigma^2}{D}\right)^{+}. \tag{8.118}
\]
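From this then follows that if $Y$ is known both at encoder and decoder, we
have
\[
R_{X|Y}(D) = \left(\frac12 \log \frac{\sigma^2 (1 - \rho^2)}{D}\right)^{+}, \tag{8.119}
\]
because given $Y$, the variance of $X$ is $\sigma^2 (1 - \rho^2)$. Comparing (8.119) with
(8.117) and remembering that $R_{X|Y}(D) \le R_{\text{WZ}}(D)$ by definition, we then see
that
\begin{align*}
\left(\frac12 \log \frac{\sigma^2 (1 - \rho^2)}{D}\right)^{+} &= R_{X|Y}(D) \tag{8.120}\\
&\le R_{\text{WZ}}(D) \tag{8.121}\\
&\le \left(\frac12 \log \frac{\sigma^2 (1 - \rho^2)}{D}\right)^{+}, \tag{8.122}
\end{align*}
i.e.,
\[
R_{\text{WZ}}(D) = \left(\frac12 \log \frac{\sigma^2 (1 - \rho^2)}{D}\right)^{+}. \tag{8.123}
\]
This shows that the Gaussian source is special: The side-information is not
required at the encoder, but it is sufficient to have it available at the decoder
only.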
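A small numerical sketch of the Gaussian example follows. It assumes, beyond what is stated above, that $Z$ is independent of the pair $(X, Y)$, so that observing $U = X + Z$ in addition to $Y$ adds the precision $1/\sigma_Z^2$ to the precision $1/\sigma^2_{\text{pred}}$; the function names and the numerical values are illustrative only, and rates are in nats.

```python
import math

def prediction_variance(sigma2, rho):
    """Variance of X given Y: sigma^2 (1 - rho^2), cf. (8.106)."""
    return sigma2 * (1.0 - rho ** 2)

def noise_variance_for_target(sigma2, rho, d):
    """Variance of Z in U = X + Z such that the MMSE of X given (Y, U) equals d.

    Assumes Z is independent of (X, Y), so that
    1 / MMSE = 1 / sigma_pred^2 + 1 / var_z.
    """
    s_pred = prediction_variance(sigma2, rho)
    if not 0.0 < d < s_pred:
        raise ValueError("need 0 < D < sigma^2 (1 - rho^2)")
    return d * s_pred / (s_pred - d)

def rate_wz_gaussian(sigma2, rho, d):
    """(1/2 log(sigma^2 (1 - rho^2) / D))^+ in nats, cf. (8.123)."""
    return max(0.5 * math.log(prediction_variance(sigma2, rho) / d), 0.0)

sigma2, rho, d = 1.0, 0.8, 0.09
var_z = noise_variance_for_target(sigma2, rho, d)
mmse = 1.0 / (1.0 / prediction_variance(sigma2, rho) + 1.0 / var_z)
print(var_z, mmse, rate_wz_gaussian(sigma2, rho, d))
```

The check on `mmse` confirms that the chosen noise variance indeed meets the target distortion with equality, as required in (8.115).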
8.4
Properties of RWZ ()
Any rate distortion function $R(D)$ must have the properties that it is nonincreasing and convex in $D$. The former follows from the fact that if we allow
a higher distortion, we definitely do not need to increase our rate. The latter
can be shown by a time-sharing argument. Assume two rate distortion pairs
that are achievable using some given two schemes. If we now use the first scheme
for a fraction $\lambda$ of the time and the second scheme for the remaining fraction $(1 - \lambda)$,
we will achieve a rate distortion pair that lies on the straight line
between the two rate distortion pairs. Hence, the rate distortion function can
only lie on this line or below, i.e., it must be convex.
Unfortunately, however, we cannot apply this insight to $R_{\text{WZ}}(D)$ because
we have not yet proven that it really is the correct rate-distortion function.
So we will prove directly that $R_{\text{WZ}}(D)$ is nonincreasing and convex in $D$.
Lemma 8.9. The Wyner–Ziv rate distortion function $R_{\text{WZ}}(\cdot)$ as specified in
Definition 8.5 is nonincreasing and convex in $D$.
Proof: The former is quite obvious from Definition 8.5: If we increase
the value of D, we relax the constraint in the minimization, which can only
decrease the value achieved in the minimization.
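We will next prove that $R_{\text{WZ}}(D)$ is convex in $D$. Consider two points
$(R_0, D_0)$ and $(R_1, D_1)$ on the $(R_{\text{WZ}}(D), D)$-curve and suppose that
\begin{align*}
& f_0 \colon \mathcal U_0 \times \mathcal Y \to \hat{\mathcal X}, \quad Q_{U_0|X}, \tag{8.124}\\
& f_1 \colon \mathcal U_1 \times \mathcal Y \to \hat{\mathcal X}, \quad Q_{U_1|X} \tag{8.125}
\end{align*}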
(8.126)
and define the auxiliary random variable (or, rather, random vector, but for a
finite alphabet there is no mathematical difference between a random variable
and a random vector as both take on a finite number of possible values):
\[
\mathbf U \triangleq [Z, U_Z]. \tag{8.127}
\]
Note that for all $(z, u_z, x) \in \{0, 1\} \times (\mathcal U_0 \cup \mathcal U_1) \times \mathcal X$,
\[
Q_{\mathbf U|X}(\mathbf u|x) = Q_{[Z,U_Z]|X}\bigl([z, u_z] \big| x\bigr) = Q_Z(z)\, Q_{U_z|X}(u_z|x). \tag{8.128}
\]
(8.129)
(8.134)
(8.135)
(8.136)
(8.137)
(8.139)
(8.142)
= R0 + (1 )R1
(8.138)
(8.140)
(8.141)
= RWZ (D)
=
(8.143)
min
I(X; U) I(Y ; U)
(8.144)
I(X; U) I(Y ; U)
(8.145)
(8.146)
Here, in (8.143) we use (8.134); (8.144) follows from Definition 8.5; in (8.145)
we drop the minimization and choose U and f as given in (8.127) and (8.129);
and the final equality (8.146) follows from (8.142).
This concludes the proof.
8.5
Converse
We are now ready to show that there exists no rate distortion system with
side-information at the decoder side that has a rate distortion pair below the
Wyner–Ziv rate distortion function.
Consider an arbitrary coding scheme where the encoder $\phi_n$ maps a length-$n$
source sequence $\mathbf x$ to an index $w \in \{1, \dots, e^{nR}\}$ for some given $R$, and
where the decoder $\psi_n$ maps a received index $w$ and a length-$n$ side-information
sequence $\mathbf y$ into a source representation sequence $\hat{\mathbf x}$. Assume that the scheme
works, i.e.,
\[
E\bigl[d(\mathbf X, \hat{\mathbf X})\bigr] \le D \tag{8.147}
\]
for some given $D$.
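\begin{align*}
R &= \frac1n \log e^{nR} \tag{8.148}\\
&\ge \frac1n H(W) \tag{8.149}\\
&\ge \frac1n H(W | \mathbf Y) \tag{8.150}\\
&\ge \frac1n H(W | \mathbf Y) - \frac1n H(W | \mathbf X, \mathbf Y) \tag{8.151}\\
&= \frac1n I(W; \mathbf X | \mathbf Y) \tag{8.152}\\
&= \frac1n \sum_{k=1}^{n} I\bigl(W; X_k \big| X_1^{k-1}, Y_1^n\bigr) \tag{8.153}\\
&= \frac1n \sum_{k=1}^{n} \Bigl( H\bigl(X_k \big| X_1^{k-1}, Y_1^n\bigr) - H\bigl(X_k \big| W, X_1^{k-1}, Y_1^n\bigr) \Bigr) \tag{8.154}\\
&= \frac1n \sum_{k=1}^{n} \Bigl( H(X_k | Y_k) - H\bigl(X_k \big| W, X_1^{k-1}, Y_1^n\bigr) \Bigr) \tag{8.155}\\
&\ge \frac1n \sum_{k=1}^{n} \Bigl( H(X_k | Y_k) - H\bigl(X_k \big| W, Y_1^{k-1}, Y_k, Y_{k+1}^n\bigr) \Bigr) \tag{8.156}\\
&= \frac1n \sum_{k=1}^{n} \bigl( H(X_k | Y_k) - H(X_k | U_k, Y_k) \bigr) \tag{8.157}\\
&= \frac1n \sum_{k=1}^{n} I(X_k; U_k | Y_k) \tag{8.158}\\
&= \frac1n \sum_{k=1}^{n} \bigl( H(U_k | Y_k) - H(U_k | X_k, Y_k) \bigr) \tag{8.159}\\
&= \frac1n \sum_{k=1}^{n} \bigl( H(U_k | Y_k) - H(U_k | X_k) \bigr) \tag{8.160}\\
&= \frac1n \sum_{k=1}^{n} \bigl( I(X_k; U_k) - I(Y_k; U_k) \bigr). \tag{8.161}
\end{align*}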
Here, (8.150) and (8.151) follow because conditioning reduces entropy and because entropy is nonnegative; in (8.155) we use that the source is IID over
time; in the subsequent inequality (8.156) we again rely on conditioning reducing entropy (note that we cannot apply this step with equality as in (8.155)
because $W$ depends on the past!); in (8.157) we define a new random variable
(or random vector)
\[
U_k \triangleq \bigl(W, Y_1^{k-1}, Y_{k+1}^n\bigr) \tag{8.162}
\]
and the final step (8.161) follows by adding and subtracting $H(U_k)$.
1X
I(Xk ; Uk ) I(Yk ; Uk )
n
k=1
n
X
1
= RWZ E d(X, X)
RWZ (D).
(8.164)
(8.165)
(8.166)
(8.167)
(8.168)
(8.169)
8.6
Summary
(8.171)
Chapter 9
Distributed Lossless Data-Compression: Slepian–Wolf Problem
The rate distortion problem and the Wyner–Ziv problem can both be considered to be a special case of the general setup shown in Figure 9.1.
[Figure 9.1: block diagram. The source $Q_{X,Y}$ emits $X_1, \dots, X_n$ to Encoder (1) and $Y_1, \dots, Y_n$ to Encoder (2); the encoders send $W^{(1)}$ and $W^{(2)}$ to a joint decoder, which delivers $\hat X_1, \dots, \hat X_n$ and $\hat Y_1, \dots, \hat Y_n$ to the destination.]
Figure 9.1: A general source compression problem with two joint sources, two
distributed encoders and one joint decoder.
Here, a source jointly generates two IID sequences $\mathbf X$ and $\mathbf Y$ that are
then encoded in a distributed fashion by two encoders that cannot directly
cooperate. The decoder then receives both indices $W^{(1)}$ and $W^{(2)}$ (with corresponding rates $R^{(1)}$ and $R^{(2)}$, respectively) and needs to reconstruct both
source sequences up to some given distortions $D^{(1)}$ and $D^{(2)}$.
This general problem is not solved, i.e., the optimal four-dimensional rate
distortion region is not known. However, some special cases are known:
• If $Y_k = \text{constant}$, we are in the case of a standard rate distortion problem
as discussed in Chapter 5.
• If $R^{(2)} \ge H(Y)$, the decoder can recover $\mathbf Y$ perfectly first and then use
this as side-information to gain $\mathbf X$ back within the required distortion.
This is the situation of Wyner–Ziv as discussed in Chapter 8.
• If $R^{(2)} \ge H(Y)$, if we have $D^{(1)} = 0$, and if we are only interested in $\mathbf X$
at the destination, then we are in the case of lossless source coding with
side-information.
9.1
(9.1)
Definition 9.2. A rate pair (R(1) , R(2) ) is said to be achievable for a dis
(2)
(1)
tributed source if there exists a sequence of enR , enR , n distributed cod(n)
[Figure 9.2: rate region plot. Horizontal axis $R^{(1)}$ with marks $H(X|Y)$, $H(X)$, $H(X,Y)$; vertical axis $R^{(2)}$ with marks $H(Y|X)$, $H(Y)$, $H(X,Y)$; regions labelled "separate compression and decompression" and "joint encoding".]
Figure 9.2: Slepian–Wolf rate region for distributed source coding. Note that
we do lose some rate pairs in comparison to the situation of joint
source coding: The corners with $R^{(1)} < H(X|Y)$ or $R^{(2)} < H(Y|X)$
are not achievable in distributed fashion, even though they are
achievable for joint source compression.
\begin{align*}
R^{(1)} &\ge H(X|Y), \tag{9.2}\\
R^{(2)} &\ge H(Y|X), \tag{9.3}\\
R^{(1)} + R^{(2)} &\ge H(X, Y). \tag{9.4}
\end{align*}
This rate region is depicted in Figure 9.2.
This result and [SW73b] was very important not only because it proved
the surprising fact that the joint entropy H(X, Y ) can be achieved, but also
because the technique of binning was introduced, which subsequently was
successfully applied to many other problems.
Example 9.5. Consider the weather in Hsinchu and in Taichung. Obviously,
it is correlated, i.e., if it is rainy in Hsinchu, it probably also rains in Taichung,
and vice versa. Let's assume that the weather of every day is independent and
identically distributed following the joint distribution given in Table 9.3.
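Table 9.3: A joint weather PMF of Hsinchu and Taichung.
\[
\begin{array}{l|cc|c}
Q_{X,Y}(\cdot,\cdot) & \text{Taichung } Y = \text{rain} & \text{Taichung } Y = \text{sun} & \text{Hsinchu total}\\ \hline
\text{Hsinchu } X = \text{rain} & 0.445 & 0.055 & 0.5\\
\text{Hsinchu } X = \text{sun} & 0.055 & 0.445 & 0.5\\ \hline
\text{Taichung total} & 0.5 & 0.5 &
\end{array}
\]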
Suppose two weather stations in Hsinchu and Taichung need to send the
local weather data of 100 days to the Taipei National Weather Service headquarters. They could each send all 100 bits of data, which would
mean that in total 200 bits of data are transmitted.
If we try to compress the data to reduce the necessary number of bits,
then an individual data compression both at Hsinchu and Taichung will not
help at all because both $X$ and $Y$ are uniformly binary distributed and can
therefore not be compressed.
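However, if we apply a Slepian–Wolf scheme, then we get
\begin{align*}
R^{(1)} &\ge H(X|Y) = H_b(0.89) \approx 0.5 \text{ bits}, \tag{9.5}\\
R^{(2)} &\ge H(Y|X) = H_b(0.89) \approx 0.5 \text{ bits}, \tag{9.6}\\
R^{(1)} + R^{(2)} &\ge H(X, Y) = 1 + H_b(0.89) \approx 1.5 \text{ bits}. \tag{9.7}
\end{align*}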
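The entropies of this example can be recomputed directly from Table 9.3. The following sketch also checks whether a given rate pair satisfies the three constraints (9.2)–(9.4); the helper names are chosen here for illustration only.

```python
import math

# Joint PMF from Table 9.3: keys are (Hsinchu X, Taichung Y).
q_xy = {("rain", "rain"): 0.445, ("rain", "sun"): 0.055,
        ("sun", "rain"): 0.055, ("sun", "sun"): 0.445}

def entropy(pmf):
    """Entropy in bits of a dict {outcome: prob}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(pmf, index):
    """Marginal PMF of the coordinate with the given index."""
    out = {}
    for key, p in pmf.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out

h_xy = entropy(q_xy)
h_x = entropy(marginal(q_xy, 0))
h_y = entropy(marginal(q_xy, 1))
h_x_given_y = h_xy - h_y
h_y_given_x = h_xy - h_x

def in_slepian_wolf_region(r1, r2):
    """Check the three constraints (9.2)-(9.4) for rates in bits."""
    return r1 >= h_x_given_y and r2 >= h_y_given_x and r1 + r2 >= h_xy

print(round(h_x_given_y, 3), round(h_xy, 3))
print(in_slepian_wolf_region(1.0, 0.5), in_slepian_wolf_region(0.5, 0.5))
```

For 100 days this means roughly 150 bits in total suffice instead of 200, for instance 100 bits from one station and about 50 bits from the other.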
9.2
Before we prove Theorem 9.4, we would like to introduce a new lossless data
compression scheme for some DMS QX (). The insights of this new scheme
and its analysis can then directly be applied to the distributed coding problem,
too.
Our coding scheme is based on bins.
xA
t (n, , X ) +
X
(n)
xA
= t (n, , X ) +
t (n, , X ) +
(QX )
(n)
(QX )
QnX (x)
(n)
xA (QX )
X
(n)
xA (QX )
QnX (x)
x0 A (QX )
x0 6=x
(n)
x0 A (QX )
x0 6=x
(9.10)
Pr (x0 ) = (x) (9.11)

{z
}
= enR
enR
(9.12)
enR
(9.13)
(n)
x0 A (QX )
This is equivalent to having enR bins and randomly throwing all possible sequences into
one of them.
X
(n)
xA (QX )
QnX (x) A(n) (QX ) enR
{z
(9.14)
(QX ) enR
t (n, , X ) + A(n)
(9.15)
(9.17)
(9.16)
if
R > H(QX ) + m
(9.18)
and $n$ is sufficiently large. Here, (9.11) follows from the Union Bound:
We upper-bound the event that at least some $\mathbf x'$ . . . by a sum over all
$\mathbf x'$. Note that the encoder is random here: Each source sequence is
assigned a random index, i.e., the probability that two get the same
index is $\frac{1}{\text{\# of bins}} = 1/e^{nR}$. And (9.16) follows from TA2.
Remark 9.6. This shows that there are many ways of constructing coding
schemes with low error probability as long as R > H(X). The advantage of
this scheme is that we do not need the typical set at the encoder, but only at
the decoder, i.e., it will also work for a distributed source!
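To get a feeling for the Union Bound step in the analysis above, the following minimal Python sketch compares the exact probability that a given sequence shares its bin with at least one competing sequence against the union bound. It assumes independent and uniform bin assignment, and it counts in powers of 2 instead of powers of e purely for readability; the function names are illustrative and not taken from the notes.

```python
def exact_collision_probability(num_competitors, num_bins):
    """Probability that at least one of `num_competitors` other sequences
    lands in the same bin as a given sequence, when every sequence is
    thrown independently and uniformly into one of `num_bins` bins."""
    return 1.0 - (1.0 - 1.0 / num_bins) ** num_competitors


def union_bound(num_competitors, num_bins):
    """Union bound on the same event: (# of competitors) x (1 / # of bins)."""
    return num_competitors / num_bins


# Illustrative numbers: about 2^10 competing typical sequences and
# a growing number of bins.
num_competitors = 2 ** 10
for log2_bins in (8, 12, 16, 20):
    num_bins = 2 ** log2_bins
    exact = exact_collision_probability(num_competitors, num_bins)
    bound = union_bound(num_competitors, num_bins)
    print(f"bins = 2^{log2_bins:2d}: exact = {exact:.6f}, union bound = {bound:.6f}")
```

Once the number of bins clearly exceeds the number of competing typical sequences, both quantities become small and the bound is nearly tight, which is the mechanism behind the condition on R.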
9.3
Achievability
(2)
(
y) = w
)
(
x, y
(2)
A(n) (QX,Y ).
(9.19)
(9.20)
(9.21)
(9.25)
Then,
$$\Pr(\text{error}) = \Pr(F_0 \cup F_1 \cup F_2 \cup F_{12}) \tag{9.26}$$
(9.27)
(9.28)
QnX,Y (x, y) Pr x0 6= x : (1) (x0 ) = (1) (x),
(n)
(x,y)A (QX,Y )
x0 A(n) (QX,Y y)
(9.29)
X
X
(1) 0
n
(1)
QX,Y (x, y)
Pr (x ) = (x)
{z
}

(n)
(n)
(x,y)A
QnX,Y (x, y)
(n)
(x,y)A (QX,Y
(n)
(x,y)A (QX,Y
en(R
(1)
enR
(1)
(1)
(9.30)
(9.31)
y)
(9.32)
(1)
(9.33)
{z
= enR
(1)
QnX,Y (x, y) A(n)
(QX,Y y) enR
X
(n)
(x,y)A (QX,Y
(QX,Y y)
x0 6=x
(n)
x0 A (QX,Y
x0 6=x
x0 A
(QX,Y )
H(XY )m )
(9.34)
(9.35)
if
R(1) > H(XY ) + m
(9.36)
214
(9.37)
(9.38)
if
(n)
(x,y)A
QnX,Y (x, y) Pr (x0 , y0 ) : x0 6= x, y0 6= y,
(QX,Y )
(n)
(x,y)A
QnX,Y (x, y)
(QX,Y )
X
(n)
(n)
(x,y)A
QnX,Y (x, y)
(QX,Y )
X
(n)
(n)
(x,y)A (QX,Y
en(R
enR
(1)
enR
(9.41)
(2)
(9.42)
(1)
(2)
(QX,Y ) en(R +R )
QnX,Y (x, y) A(n)
(9.43)
(9.44)
(1)
+R(2) )
{z
(1)
(2)
= enR
X
(n)
(x,y)A (QX,Y
(1)
(n)
(x0 ,y0 )A (QX,Y
x0 6=x, y0 6=y
= enR
QnX,Y (x, y)
(n)
(x,y)A (QX,Y
Pr (1) (x0 ) = (1) (x) Pr (2) (y0 ) = (2) (y)

{z
} 
{z
}
Pr (1) (x0 ) = (1) (x), (2) (y0 ) = (2) (y) (9.40)
(9.45)
(9.46)
if
R(1) + R(2) > H(X, Y ) + m
(9.47)
and n is sufficiently large. Here, (9.41) follows from our assumption that
the assignments are independent; and (9.44) follows from TA-2.
This proves the achievability of the region given in Theorem 9.4.
9.4
Converse
(n)
(9.52)
nn
(9.53)
and, analogously,
$$H\bigl(\mathbf{Y} \big| \mathbf{X}, W^{(1)}, W^{(2)}\bigr) \le n\epsilon_n. \tag{9.54}$$
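Since $W^{(1)}$ takes on $e^{nR^{(1)}}$ different values and $W^{(2)}$ takes on $e^{nR^{(2)}}$ different
values, we now have
\begin{align}
n\bigl(R^{(1)} + R^{(2)}\bigr) &= \log\bigl(e^{nR^{(1)}} e^{nR^{(2)}}\bigr) \tag{9.55} \\
&\ge H\bigl(W^{(1)}, W^{(2)}\bigr) \tag{9.56} \\
&= I\bigl(\mathbf{X}, \mathbf{Y}; W^{(1)}, W^{(2)}\bigr) + H\bigl(W^{(1)}, W^{(2)} \big| \mathbf{X}, \mathbf{Y}\bigr) \tag{9.57} \\
&= I\bigl(\mathbf{X}, \mathbf{Y}; W^{(1)}, W^{(2)}\bigr) \tag{9.58} \\
&= H(\mathbf{X}, \mathbf{Y}) - H\bigl(\mathbf{X}, \mathbf{Y} \big| W^{(1)}, W^{(2)}\bigr) \tag{9.59} \\
&\ge H(\mathbf{X}, \mathbf{Y}) - n\epsilon_n \tag{9.60} \\
&= n H(X, Y) - n\epsilon_n, \tag{9.61}
\end{align}
i.e.,
$$R^{(1)} + R^{(2)} \ge H(X, Y) - \epsilon_n. \tag{9.62}$$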
Here, (9.58) follows because $W^{(1)} = \phi^{(1)}(\mathbf{X})$ and $W^{(2)} = \phi^{(2)}(\mathbf{Y})$; in (9.60) we
have used (9.50); and the last equality (9.61) holds because $\{(X_k, Y_k)\}$ are IID
$\sim Q_{X,Y}$.
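\begin{align}
n R^{(1)} &= \log e^{nR^{(1)}} \tag{9.63} \\
&\ge H\bigl(W^{(1)}\bigr) \tag{9.64} \\
&\ge H\bigl(W^{(1)} \big| \mathbf{Y}\bigr) \tag{9.65} \\
&= I\bigl(\mathbf{X}; W^{(1)} \big| \mathbf{Y}\bigr) + H\bigl(W^{(1)} \big| \mathbf{X}, \mathbf{Y}\bigr) \tag{9.66} \\
&= I\bigl(\mathbf{X}; W^{(1)} \big| \mathbf{Y}\bigr) \tag{9.67} \\
&= I\bigl(\mathbf{X}; W^{(1)}, W^{(2)} \big| \mathbf{Y}\bigr) \tag{9.68} \\
&= H(\mathbf{X} | \mathbf{Y}) - H\bigl(\mathbf{X} \big| \mathbf{Y}, W^{(1)}, W^{(2)}\bigr) \tag{9.69} \\
&\ge H(\mathbf{X} | \mathbf{Y}) - n\epsilon_n \tag{9.70} \\
&= n H(X|Y) - n\epsilon_n. \tag{9.71}
\end{align}
Hence,
$$R^{(1)} \ge H(X|Y) - \epsilon_n \tag{9.72}$$
and, analogously,
$$R^{(2)} \ge H(Y|X) - \epsilon_n. \tag{9.73}$$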
This proves that no working coding scheme can be outside the region defined
in Theorem 9.4.
9.5
To understand the idea of the coding scheme used in the achievability proof,
consider the corner point
\begin{align}
R^{(1)} &= H(X), \tag{9.74} \\
R^{(2)} &= H(Y|X). \tag{9.75}
\end{align}
We know that using $n H(X)$ bits we can encode $\mathbf{X}$ in a way that
makes sure that the decoder can reconstruct it with arbitrarily small error probability.
But how do we encode $\mathbf{Y}$ using only $n H(Y|X)$ bits?
Recall that for every sequence $\mathbf{X}$ there is only a small set of sequences $\mathbf{Y}$ that are jointly typical
with $\mathbf{X}$. So, if the encoder knew $\mathbf{X}$, it could simply send the index of $\mathbf{Y}$ within
this small set. But our encoder does not know $\mathbf{X}$! Hence, instead of finding
this small typical set, it colors all $\mathbf{Y}$ sequences with $e^{nR^{(2)}}$ different colors.
If the number of colors is large enough, then the $\mathbf{Y}$ sequences in the small
set that is jointly typical with $\mathbf{X}$ will all have a different color, i.e., the color
"uniquely"^2 defines the correct $\mathbf{Y}$.
2
We put "uniquely" in quotation marks here because strictly speaking it is not unique: We
will always have a nonzero error probability that only vanishes once n tends to infinity.
9.6
Generalizations
(9.76)
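Then all rate $L$-tuples $\bigl(R^{(1)}, \ldots, R^{(L)}\bigr)$ are achievable if, and only if,
$$R[\mathcal{L}] \ge H\bigl(X[\mathcal{L}] \,\big|\, X[\mathcal{L}^c]\bigr), \qquad \forall\, \mathcal{L} \subseteq \{1, \ldots, L\}, \tag{9.77}$$
where
$$R[\mathcal{L}] \triangleq \sum_{i \in \mathcal{L}} R^{(i)} \tag{9.78}$$
and
$$X[\mathcal{L}] \triangleq \bigl\{ X^{(i)} : i \in \mathcal{L} \bigr\}. \tag{9.79}$$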
The theorem has also been extended to stationary and ergodic sources
[Cov75]. In that case the entropies have to be replaced by entropy rates.
9.7
Zero-Error Compression
instead of Pe
(9.80)
(2)
H(Y )
(9.81)
(9.82)
(where we have assumed that QX,Y (x, y) > 0 for all (x, y)).
Chapter 10
10.1
Problem Setup
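[Figure: Uniform Source 1 and Uniform Source 2 produce the messages $M^{(1)}$ and $M^{(2)}$, which are fed to Enc. (1) and Enc. (2), respectively. Both encoder outputs enter the channel $Q^n_{\mathbf{Y}|\mathbf{X}^{(1)},\mathbf{X}^{(2)}}$, whose output is passed to the decoder (Dec.) that produces $\hat{M}^{(1)}, \hat{M}^{(2)}$.]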
Figure 10.1: A channel coding problem with two sources that independently
try to transmit a message $M^{(i)}$, $i = 1, 2$, to the same destination.
Such a channel model is called a multiple-access channel (MAC).
Here we have a discrete memoryless channel that simultaneously accepts
two inputs from two independent transmitters and that generates an output Y
that is random with a distribution conditional on both inputs. The decoder's
task is to simultaneously recover both messages based on the received channel
output sequence $\mathbf{Y}$.
More formally, we have the following definitions.
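Definition 10.1. A discrete memoryless multiple-access channel (DM-MAC)
consists of three alphabets $\mathcal{X}^{(1)}$, $\mathcal{X}^{(2)}$, $\mathcal{Y}$ and a conditional probability distribution $Q_{Y|X^{(1)},X^{(2)}}$ such that
$$Q_{Y_k | Y_1^{k-1}, \{X_\ell^{(1)}\}_{\ell=1}^{k}, \{X_\ell^{(2)}\}_{\ell=1}^{k}}\Bigl(y_k \Bigm| y_1^{k-1}, \bigl\{x_\ell^{(1)}\bigr\}_{\ell=1}^{k}, \bigl\{x_\ell^{(2)}\bigr\}_{\ell=1}^{k}\Bigr) = Q_{Y|X^{(1)},X^{(2)}}\bigl(y_k \bigm| x_k^{(1)}, x_k^{(2)}\bigr). \tag{10.1}$$
If a DM-MAC is used without feedback, we have
$$Q_{\mathbf{Y}|\mathbf{X}^{(1)},\mathbf{X}^{(2)}}\bigl(\mathbf{y} \bigm| \mathbf{x}^{(1)}, \mathbf{x}^{(2)}\bigr) = \prod_{k=1}^{n} Q_{Y|X^{(1)},X^{(2)}}\bigl(y_k \bigm| x_k^{(1)}, x_k^{(2)}\bigr). \tag{10.2}$$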
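Definition 10.2. An $\bigl(e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ coding scheme for a DM-MAC consists
of two sets of indices
\begin{align}
\mathcal{M}^{(1)} &= \bigl\{1, 2, \ldots, e^{nR^{(1)}}\bigr\}, \tag{10.3} \\
\mathcal{M}^{(2)} &= \bigl\{1, 2, \ldots, e^{nR^{(2)}}\bigr\}, \tag{10.4}
\end{align}
two encoding functions
\begin{align}
\phi^{(1)} &: \mathcal{M}^{(1)} \to \bigl(\mathcal{X}^{(1)}\bigr)^n, \tag{10.5} \\
\phi^{(2)} &: \mathcal{M}^{(2)} \to \bigl(\mathcal{X}^{(2)}\bigr)^n, \tag{10.6}
\end{align}
and a decoding function
$$\psi : \mathcal{Y}^n \to \mathcal{M}^{(1)} \times \mathcal{M}^{(2)}. \tag{10.7}$$
The average error probability of an $\bigl(e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ coding scheme for a
DM-MAC is given as
$$P_e^{(n)} \triangleq \frac{1}{e^{n(R^{(1)} + R^{(2)})}} \sum_{(m^{(1)}, m^{(2)}) \in \mathcal{M}^{(1)} \times \mathcal{M}^{(2)}} \Pr\Bigl[\psi(Y_1^n) \neq \bigl(m^{(1)}, m^{(2)}\bigr) \Bigm| \bigl(M^{(1)}, M^{(2)}\bigr) = \bigl(m^{(1)}, m^{(2)}\bigr)\Bigr]. \tag{10.8}$$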
Note that if the encoders worked together, either via a link or because each knew
the current input message of the other encoder, then we would have a multiple-input
single-output (MISO) channel, which in the case of discrete alphabets simply leads to a
normal DMC capacity problem.
10.2
221
10.3
222
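[Figure: two independent BSCs, one from $X^{(1)}$ to $Y^{(1)}$ and one from $X^{(2)}$ to $Y^{(2)}$, forming the output $Y$; and the resulting rate region in the $(R^{(1)}, R^{(2)})$ plane, with $C^{(1)}$ marked on the $R^{(1)}$ axis.]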
Figure 10.3: The capacity region of the MAC consisting of two independent
BSCs.
Note that even if we allowed for cooperation between the two encoders,
we could not get higher rates. Hence, the region of Figure 10.3 is the capacity
region.
Example 10.7 (Binary Multiplier MAC). Consider a MAC with binary input
and output alphabets $\mathcal{X}^{(1)} = \mathcal{X}^{(2)} = \mathcal{Y} = \{0, 1\}$ where
$$Y = X^{(1)} \cdot X^{(2)}. \tag{10.12}$$
Example 10.8 (Binary Erasure MAC). Consider a MAC with binary input
alphabet X (1) = X (2) = {0, 1} and ternary output alphabet Y = {0, 1, 2}
where
Y = X (1) + X (2)
(10.13)
10.4
We will derive two equivalent forms of the MAC capacity region: $\mathcal{C}_1$ and $\mathcal{C}_2$.
The proof works as follows: We will first prove that $\mathcal{C}_1$ is achievable, then we
will show that $\mathcal{C}_2 \subseteq \mathcal{C}_1$ and that $\mathcal{C}_2$ is therefore also achievable, and finally we derive a
converse on $\mathcal{C}_2$. Hence, at that stage we will have the situation shown
in Figure 10.7. Here, all rate pairs in $\mathcal{C}_1$ are achievable, all rate pairs outside
of $\mathcal{C}_2$ are not achievable, and $\mathcal{C}_2 \subseteq \mathcal{C}_1$. This is impossible unless $\mathcal{C}_2 = \mathcal{C}_1$, i.e.,
in Figure 10.7 the dark-shaded area disappears and both regions are identical.
10.4.1
Achievability of C1
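\begin{align}
R^{(1)} &< I\bigl(X^{(1)}; Y \big| X^{(2)}\bigr), \tag{10.14} \\
R^{(2)} &< I\bigl(X^{(2)}; Y \big| X^{(1)}\bigr), \tag{10.15} \\
R^{(1)} + R^{(2)} &< I\bigl(X^{(1)}, X^{(2)}; Y\bigr), \tag{10.16}
\end{align}
$$\mathcal{C}_1 = \text{convex closure} \Biggl( \bigcup_{Q_{X^{(1)}} Q_{X^{(2)}}} \mathcal{R}\bigl(Q_{X^{(1)}}, Q_{X^{(2)}}\bigr) \Biggr). \tag{10.17}$$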
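such that
$$\Bigl(\mathbf{X}^{(1)}\bigl(\hat{m}^{(1)}\bigr), \mathbf{X}^{(2)}\bigl(\hat{m}^{(2)}\bigr), \mathbf{Y}\Bigr) \in A_\epsilon^{(n)}\bigl(Q_{X^{(1)},X^{(2)},Y}\bigr). \tag{10.18}$$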
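[Figure: the regions $\mathcal{C}_1$ and $\mathcal{C}_2$ in the $(R^{(1)}, R^{(2)})$ plane; the area between them is marked "contradiction".]
Figure 10.7: Two different capacity regions $\mathcal{C}_1$ and $\mathcal{C}_2$ of a MAC. We have a
contradiction unless the dark-shaded area disappears, i.e., actually $\mathcal{C}_2 = \mathcal{C}_1$.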
If there is a unique such pair m
(1) , m
(2) , the decoder puts out
m
(1) , m
(2) , m
(1) , m
(2) .
(10.19)
Otherwise the decoder declares an error.
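5: Performance Analysis: We define the following events:
$$F_{m^{(1)},m^{(2)}} \triangleq \Bigl\{ \Bigl(\mathbf{X}^{(1)}\bigl(m^{(1)}\bigr), \mathbf{X}^{(2)}\bigl(m^{(2)}\bigr), \mathbf{Y}\Bigr) \in A_\epsilon^{(n)}\bigl(Q_{X^{(1)},X^{(2)},Y}\bigr) \Bigr\}. \tag{10.20}$$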
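$$\Pr(\text{error}) = \sum_{m^{(1)}=1}^{e^{nR^{(1)}}} \sum_{m^{(2)}=1}^{e^{nR^{(2)}}} \frac{1}{e^{n(R^{(1)}+R^{(2)})}} \Pr\Bigl(\text{error} \Bigm| \bigl(M^{(1)}, M^{(2)}\bigr) = \bigl(m^{(1)}, m^{(2)}\bigr)\Bigr) \tag{10.21}$$
where
\begin{align}
&\Pr\Bigl(\text{error} \Bigm| \bigl(M^{(1)}, M^{(2)}\bigr) = \bigl(m^{(1)}, m^{(2)}\bigr)\Bigr) \nonumber \\
&\quad = \Pr\Biggl( F^c_{m^{(1)},m^{(2)}} \cup \bigcup_{(\hat{m}^{(1)},\hat{m}^{(2)}) \neq (m^{(1)},m^{(2)})} F_{\hat{m}^{(1)},\hat{m}^{(2)}} \Biggm| m^{(1)}, m^{(2)} \Biggr) \tag{10.22} \\
&\quad \le \Pr\bigl( F^c_{m^{(1)},m^{(2)}} \bigm| m^{(1)}, m^{(2)} \bigr)
 + \sum_{\substack{\hat{m}^{(1)}=1 \\ \hat{m}^{(1)} \neq m^{(1)}}}^{e^{nR^{(1)}}} \Pr\bigl( F_{\hat{m}^{(1)},m^{(2)}} \bigm| m^{(1)}, m^{(2)} \bigr)
 + \sum_{\substack{\hat{m}^{(2)}=1 \\ \hat{m}^{(2)} \neq m^{(2)}}}^{e^{nR^{(2)}}} \Pr\bigl( F_{m^{(1)},\hat{m}^{(2)}} \bigm| m^{(1)}, m^{(2)} \bigr) \nonumber \\
&\qquad + \sum_{\substack{\hat{m}^{(1)}=1 \\ \hat{m}^{(1)} \neq m^{(1)}}}^{e^{nR^{(1)}}} \sum_{\substack{\hat{m}^{(2)}=1 \\ \hat{m}^{(2)} \neq m^{(2)}}}^{e^{nR^{(2)}}} \Pr\bigl( F_{\hat{m}^{(1)},\hat{m}^{(2)}} \bigm| m^{(1)}, m^{(2)} \bigr). \tag{10.23}
\end{align}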
228
(10.25)
(n)
en(H(X
(1) )
(n)
(x(1) ,x(2) ,y)A
(10.26)
(1)
(2)
QX (1) ,X (2) ,Y en(H(X )+H(X ,Y ))
= A(n)
(1)
(2)
(1)
(2)
en(H(X ,X ,Y )+) en(H(X )+H(X ,Y ))
(10.27)
))
= en(
(1)
(2)
(1)
(2)
= en(I(X ;X )+I(X ;Y X ))
(10.30)
=e
n( H(X (1) )H(X (2) ,Y X (1) )+H(X (1) )+H(X (2) ,Y ))
I(X (1) ;X (2) ,Y
=e
(10.28)
(10.29)
(10.31)
(10.32)
Here, in (10.26) we use TA-1b based on the fact that all sequences in the
sum are typical; in (10.28) we use TA-2; and in the final step (10.32)
we rely on the independence between $X^{(1)}$ and $X^{(2)}$.
Completely analogously, we derive for m
(2) 6= m(2) :
(2)
(1)
(1)
(2)
en(I(X ;Y X )) ,
Pr Fm(1) ,m
m
,
m
(2)
and similarly, we get for m
(1) 6= m(1) , m
(2) 6= m(2) :
(1)
(2)
Pr Fm
(1) ,m
(2) m , m
X
QnX (1) x(1) QnX (2) x(2) QnY (y)
=
(10.33)
(10.34)
(n)
(x(1) ,x(2) ,y)A
en(H(X
X
(n)
(1) )
(1)
(2)
= A(n)
QX (1) ,X (2) ,Y en(H(X )+H(X )+H(Y ))
(1)
(2)
(1)
(2)
en(H(X ,X ,Y )+) en(H(X )+H(X )+H(Y ))
= en(
(10.36)
(10.37)
))
(10.38)
=e
= en(
))
(10.39)
(10.40)
Plugging these results back into (10.23) and (10.21) now yields
Pr(error)
(1)
(1)
(2)
t n, , X (1) X (2) Y + enR 1 en(I(X ;Y X ))
229
(2)
(2)
(1)
+ enR 1 en(I(X ;Y X ))
(1)
(2)
(1)
(2)
+ enR 1 enR 1 en(I(X ,X ;Y ))
(1)
(1)
(2)
t n, , X (1) X (2) Y + en(R I(X ;Y X )+)
(2)
(1)
(2)
(2)
(1)
(1)
(2)
+ en(R I(X ;Y X )+) + en(R +R I(X ,X ;Y )+) .
(10.41)
(10.42)
Note that this error probability will tend to zero for $n \to \infty$ as long
as the three conditions (10.14)–(10.16) are satisfied. This proves the
achievability for a fixed distribution $Q_{X^{(1)}} Q_{X^{(2)}}$. We can now freely
choose $Q_{X^{(1)}} Q_{X^{(2)}}$, apply time-sharing to get the convex hull, and finally
take the closure because by definition the capacity region includes its
boundaries.
This concludes the achievability proof for C1 .
10.4.2
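\begin{align}
R^{(1)} &\le I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr), \tag{10.43} \\
R^{(2)} &\le I\bigl(X^{(2)}; Y \big| X^{(1)}, T\bigr), \tag{10.44} \\
R^{(1)} + R^{(2)} &\le I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr), \tag{10.45}
\end{align}
for some choice of the joint distribution
$$Q_{T,X^{(1)},X^{(2)},Y} = Q_T \, Q_{X^{(1)}|T} \, Q_{X^{(2)}|T} \, Q_{Y|X^{(1)},X^{(2)}}. \tag{10.46}$$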
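\begin{align}
I\bigl(X^{(1)}; Y \big| X^{(2)}\bigr) &= H\bigl(Y \big| X^{(2)}\bigr) - H\bigl(Y \big| X^{(1)}, X^{(2)}\bigr) \tag{10.47} \\
&\ge H\bigl(Y \big| X^{(2)}, T\bigr) - H\bigl(Y \big| X^{(1)}, X^{(2)}\bigr) \tag{10.48} \\
&= H\bigl(Y \big| X^{(2)}, T\bigr) - H\bigl(Y \big| X^{(1)}, X^{(2)}, T\bigr) \tag{10.49} \\
&= I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr), \tag{10.50}
\end{align}
where the inequality (10.48) follows from conditioning that reduces entropy,
and where the equality (10.49) holds because given $X^{(1)}$ and $X^{(2)}$, we have
$Y \perp T$.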
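Similarly we can show
$$I\bigl(X^{(2)}; Y \big| X^{(1)}\bigr) \ge I\bigl(X^{(2)}; Y \big| X^{(1)}, T\bigr) \tag{10.51}$$
and
\begin{align}
I\bigl(X^{(1)}, X^{(2)}; Y\bigr) &= H(Y) - H\bigl(Y \big| X^{(1)}, X^{(2)}\bigr) \tag{10.52} \\
&\ge H(Y|T) - H\bigl(Y \big| X^{(1)}, X^{(2)}\bigr) \tag{10.53} \\
&= H(Y|T) - H\bigl(Y \big| X^{(1)}, X^{(2)}, T\bigr) \tag{10.54} \\
&= I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr). \tag{10.55}
\end{align}
Hence, if $\bigl(R^{(1)}, R^{(2)}\bigr) \in \mathcal{C}_2$, then
\begin{align}
R^{(1)} &\le I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr) \le I\bigl(X^{(1)}; Y \big| X^{(2)}\bigr), \tag{10.56} \\
R^{(2)} &\le I\bigl(X^{(2)}; Y \big| X^{(1)}, T\bigr) \le I\bigl(X^{(2)}; Y \big| X^{(1)}\bigr), \tag{10.57} \\
R^{(1)} + R^{(2)} &\le I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr) \le I\bigl(X^{(1)}, X^{(2)}; Y\bigr) \tag{10.58}
\end{align}
and therefore, comparing with (10.14)–(10.16), we see that $\bigl(R^{(1)}, R^{(2)}\bigr) \in \mathcal{C}_1$,
too. Hence, $\mathcal{C}_2 \subseteq \mathcal{C}_1$ and all rate pairs in $\mathcal{C}_2$ must be achievable.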
The only missing point is the bound on the alphabet size of the auxiliary random variable T. A first bound follows from Carathéodory's Theorem
(Theorem 1.20): We write the 3-dimensional tuple
$$\Bigl( I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr),\ I\bigl(X^{(2)}; Y \big| X^{(1)}, T\bigr),\ I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr) \Bigr)$$
as convex combination
$$\sum_{t \in \mathcal{T}} Q_T(t) \Bigl( I\bigl(X^{(1)}; Y \big| X^{(2)}, T = t\bigr),\ I\bigl(X^{(2)}; Y \big| X^{(1)}, T = t\bigr),\ I\bigl(X^{(1)}, X^{(2)}; Y \big| T = t\bigr) \Bigr). \tag{10.59}$$
(i)
10.4.3
Converse of C2
We will next show that any sequence of $\bigl(e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ coding schemes with
$P_e^{(n)} \to 0$ must have a rate pair $\bigl(R^{(1)}, R^{(2)}\bigr) \in \mathcal{C}_2$.
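Recall the Fano Inequality (Proposition 1.13) with an observation $\mathbf{Y}$ about
$\bigl(M^{(1)}, M^{(2)}\bigr)$:
\begin{align}
H\bigl(M^{(1)}, M^{(2)} \big| Y_1^n\bigr) &\le n \biggl( \frac{\log 2}{n} + P_e^{(n)} \bigl(R^{(1)} + R^{(2)}\bigr) \biggr) \tag{10.60} \\
&\triangleq n\epsilon_n, \tag{10.61}
\end{align}
where $\epsilon_n \to 0$ as $n \to \infty$ because $P_e^{(n)} \to 0$.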
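Hence, we have
\begin{align}
nR^{(1)} &= H\bigl(M^{(1)}\bigr) \tag{10.62} \\
&= I\bigl(M^{(1)}; Y_1^n\bigr) + H\bigl(M^{(1)} \big| Y_1^n\bigr) \tag{10.63} \\
&\le I\bigl(M^{(1)}; Y_1^n\bigr) + H\bigl(M^{(1)}, M^{(2)} \big| Y_1^n\bigr) \tag{10.64} \\
&\le I\bigl(M^{(1)}; Y_1^n\bigr) + n\epsilon_n \tag{10.65} \\
&\le I\bigl(\mathbf{x}^{(1)}\bigl(M^{(1)}\bigr); Y_1^n\bigr) + n\epsilon_n \tag{10.66} \\
&= I\bigl(\mathbf{X}^{(1)}; \mathbf{Y}\bigr) + n\epsilon_n \tag{10.67} \\
&\le I\bigl(\mathbf{X}^{(1)}; \mathbf{Y}, \mathbf{X}^{(2)}\bigr) + n\epsilon_n \tag{10.68} \\
&= \underbrace{I\bigl(\mathbf{X}^{(1)}; \mathbf{X}^{(2)}\bigr)}_{=\,0} + I\bigl(\mathbf{X}^{(1)}; \mathbf{Y} \big| \mathbf{X}^{(2)}\bigr) + n\epsilon_n \tag{10.69} \\
&= I\bigl(\mathbf{X}^{(1)}; \mathbf{Y} \big| \mathbf{X}^{(2)}\bigr) + n\epsilon_n \tag{10.70} \\
&= H\bigl(\mathbf{Y} \big| \mathbf{X}^{(2)}\bigr) - H\bigl(\mathbf{Y} \big| \mathbf{X}^{(1)}, \mathbf{X}^{(2)}\bigr) + n\epsilon_n \tag{10.71} \\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \big| \mathbf{X}^{(2)}, Y_1^{k-1}\bigr) - H\bigl(Y_k \big| \mathbf{X}^{(1)}, \mathbf{X}^{(2)}, Y_1^{k-1}\bigr) \Bigr) + n\epsilon_n \tag{10.72} \\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \big| \mathbf{X}^{(2)}, Y_1^{k-1}\bigr) - H\bigl(Y_k \big| X_k^{(1)}, X_k^{(2)}\bigr) \Bigr) + n\epsilon_n \tag{10.73} \\
&\le \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \big| X_k^{(2)}\bigr) - H\bigl(Y_k \big| X_k^{(1)}, X_k^{(2)}\bigr) \Bigr) + n\epsilon_n \tag{10.74} \\
&= \sum_{k=1}^{n} I\bigl(X_k^{(1)}; Y_k \big| X_k^{(2)}\bigr) + n\epsilon_n. \tag{10.75}
\end{align}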
Here, (10.62) follows from the assumption that $M^{(1)}$ is uniformly distributed
over $\{1, \ldots, e^{nR^{(1)}}\}$; (10.65) follows from (10.61); in the next step (10.66) we
apply the Data Processing Inequality (Proposition 1.12) where $\mathbf{x}^{(1)}\bigl(M^{(1)}\bigr)$
denotes the codeword that is transmitted if the message is $M^{(1)}$; in (10.67)
we write $\mathbf{X}^{(1)}$ for $\mathbf{x}^{(1)}\bigl(M^{(1)}\bigr)$; in (10.68) we add a random variable to the
arguments of the mutual information, which cannot decrease its value; in the
subsequent (10.69) we use that since $M^{(1)} \perp M^{(2)}$ we also have $\mathbf{X}^{(1)} \perp \mathbf{X}^{(2)}$; in
(10.73) we use the assumption that our DM-MAC is memoryless and used
without feedback; and (10.74) follows from conditioning that reduces entropy.
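Hence,
$$R^{(1)} \le \frac{1}{n} \sum_{k=1}^{n} I\bigl(X_k^{(1)}; Y_k \big| X_k^{(2)}\bigr) + \epsilon_n. \tag{10.76}$$
Similarly,
$$R^{(2)} \le \frac{1}{n} \sum_{k=1}^{n} I\bigl(X_k^{(2)}; Y_k \big| X_k^{(1)}\bigr) + \epsilon_n \tag{10.77}$$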
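and
\begin{align}
n\bigl(R^{(1)} + R^{(2)}\bigr) &= H\bigl(M^{(1)}, M^{(2)}\bigr) \tag{10.78} \\
&= I\bigl(M^{(1)}, M^{(2)}; Y_1^n\bigr) + H\bigl(M^{(1)}, M^{(2)} \big| Y_1^n\bigr) \tag{10.79} \\
&\le I\bigl(M^{(1)}, M^{(2)}; Y_1^n\bigr) + n\epsilon_n \tag{10.80} \\
&\le I\bigl(\mathbf{x}^{(1)}\bigl(M^{(1)}\bigr), \mathbf{x}^{(2)}\bigl(M^{(2)}\bigr); Y_1^n\bigr) + n\epsilon_n \tag{10.81} \\
&= I\bigl(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}; \mathbf{Y}\bigr) + n\epsilon_n \tag{10.82} \\
&= H(\mathbf{Y}) - H\bigl(\mathbf{Y} \big| \mathbf{X}^{(1)}, \mathbf{X}^{(2)}\bigr) + n\epsilon_n \tag{10.83} \\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \big| Y_1^{k-1}\bigr) - H\bigl(Y_k \big| \mathbf{X}^{(1)}, \mathbf{X}^{(2)}, Y_1^{k-1}\bigr) \Bigr) + n\epsilon_n \tag{10.84} \\
&\le \sum_{k=1}^{n} \Bigl( H(Y_k) - H\bigl(Y_k \big| X_k^{(1)}, X_k^{(2)}\bigr) \Bigr) + n\epsilon_n \tag{10.85} \\
&= \sum_{k=1}^{n} I\bigl(X_k^{(1)}, X_k^{(2)}; Y_k\bigr) + n\epsilon_n, \tag{10.86}
\end{align}
i.e.,
$$R^{(1)} + R^{(2)} \le \frac{1}{n} \sum_{k=1}^{n} I\bigl(X_k^{(1)}, X_k^{(2)}; Y_k\bigr) + \epsilon_n. \tag{10.87}$$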
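$$X^{(1)} \triangleq X_T^{(1)}. \tag{10.88}$$
Similarly, define $X^{(2)} \triangleq X_T^{(2)}$ and $Y \triangleq Y_T$. Now we can write the first term
on the RHS of (10.76) as
\begin{align}
\frac{1}{n} \sum_{k=1}^{n} I\bigl(X_k^{(1)}; Y_k \big| X_k^{(2)}\bigr) &= \sum_{k=1}^{n} Q_T(k)\, I\bigl(X_k^{(1)}; Y_k \big| X_k^{(2)}, T = k\bigr) \tag{10.89} \\
&= \sum_{k=1}^{n} Q_T(k)\, I\bigl(X_T^{(1)}; Y_T \big| X_T^{(2)}, T = k\bigr) \tag{10.90} \\
&= I\bigl(X_T^{(1)}; Y_T \big| X_T^{(2)}, T\bigr) \tag{10.91} \\
&= I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr). \tag{10.92}
\end{align}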
(10.93)
(10.94)
(10.95)
for some distribution $Q_T \, Q_{X^{(1)}|T} \, Q_{X^{(2)}|T} \, Q_{Y|X^{(1)},X^{(2)}}$. Note that this distribution is defined by our choice of T being uniform, the given coding scheme
with its set of codewords, the uniformly distributed messages $M^{(1)}$, $M^{(2)}$, and
the given MAC.
Using the arguments shown in the discussion after (10.59), we know that
we can reduce the alphabet of T to a size $|\mathcal{T}| = 2$.
10.5
10.5.1
Lets consider C1 . For every fixed choice of QX (1) QX (2) , we have given three
fixed numbers:
(10.96)
I1 , I X (1) ; Y X (2) ,
(1)
(2)
,
(10.97)
I2 , I X ; Y X
I3 , I X (1) , X (2) ; Y .
(10.98)
These three numbers together with the constraints R(1) 0 and R(2) 0
specify a pentagon of achievable rate pairs:
R(1) 0
R(2) 0
(1)
(2)
(10.99)
R I1
R I2
R(1) + R(2) I3
as shown in Figure 10.8, where we have named the corner points A to E.
The coordinates of point A are obviously R(1) , R(2) = (I1 , 0). To find the
coordinates of B, note that in B simultaneously we have R(1) = I1 and R(1) +
R(2) = I3 , i.e.,
R(2) = R(1) + I3
= I1 + I3
=I X
(1)
=I X
(2)
(10.100)
(10.101)
(2)
,X ;Y I X
;Y .
(1)
; Y X (2)
(10.102)
(10.103)
(10.104)
So this means that the pentagon of Figure 10.8 more precisely looks as shown
in Figure 10.9.
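The computation of the corner points can be summarized in a few lines of code. The following Python sketch takes the three numbers $I_1$, $I_2$, $I_3$ and returns the five corner points; the coordinates of A and B are those derived above, while the labeling of the remaining corners as C, D, E is an assumption made here for illustration. The function name is hypothetical.

```python
def pentagon_corners(i1, i2, i3):
    """Corner points of the region
        R1 >= 0, R2 >= 0, R1 <= i1, R2 <= i2, R1 + R2 <= i3.
    Assumes max(i1, i2) <= i3 <= i1 + i2, so that the region really is a
    pentagon (possibly degenerate)."""
    if not (max(i1, i2) <= i3 <= i1 + i2):
        raise ValueError("need max(i1, i2) <= i3 <= i1 + i2")
    return {
        "A": (i1, 0.0),       # user 2 silent, user 1 at its maximum rate
        "B": (i1, i3 - i1),   # R1 = I1 and R1 + R2 = I3
        "C": (i3 - i2, i2),   # R2 = I2 and R1 + R2 = I3 (label assumed)
        "D": (0.0, i2),       # user 1 silent (label assumed)
        "E": (0.0, 0.0),      # origin (label assumed)
    }


# Illustrative numbers in bits: I1 = I2 = 1 and I3 = 1.5.
for name, point in pentagon_corners(1.0, 1.0, 1.5).items():
    print(name, point)
```

For the illustrative numbers, B has coordinates (1, 0.5): user 1 transmits at its maximum rate while user 2 still gets the remaining half bit of the sum rate.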
Let us discuss Figure 10.9 in more detail. First of all, recall that at the
moment we keep $Q_{X^{(1)}} Q_{X^{(2)}}$ fixed. So, in order to walk along the border
of this achievable region, we need to vary our rates $R^{(1)}$, $R^{(2)}$.
For example, if we choose $R^{(2)} = 0$, i.e., user 2 only has one codeword, then
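[Figure 10.8: Pentagon of achievable rate pairs, with the values $I_1$, $I_2$, $I_3$ marked on the axes and corner points labeled.]
[Figure: the same pentagon in the $(R^{(1)}, R^{(2)})$ plane, with the values $I(X^{(1)}; Y)$ and $I(X^{(1)}; Y | X^{(2)})$ marked on the $R^{(1)}$ axis and $I(X^{(2)}; Y)$ and $I(X^{(2)}; Y | X^{(1)})$ marked on the $R^{(2)}$ axis.]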
the decoder knows $\mathbf{X}^{(2)}$ in advance and can use this knowledge for the decoding of $\mathbf{X}^{(1)}$. Hence, we understand (using our knowledge of single-user data
transmission) that the decoder will be able to decode reliably as long as
$$R^{(1)} < I\bigl(X^{(1)}; Y \big| X^{(2)}\bigr). \tag{10.105}$$
Remark 10.11. Actually, for $X^{(2)} = \xi$ we could do even better:
$$R^{(1)} < \max_{\xi} I\bigl(X^{(1)}; Y \big| X^{(2)} = \xi\bigr). \tag{10.106}$$
But the codeword $\mathbf{X}^{(2)}$ is generated $\sim Q^n_{X^{(2)}}$, i.e., all different values of $\xi$ will
show up with probability $Q_{X^{(2)}}(\xi)$, which then gives
$$R^{(1)} < \mathrm{E}_{Q_{X^{(2)}}}\Bigl[ I\bigl(X^{(1)}; Y \big| X^{(2)} = \xi\bigr) \Bigr]. \tag{10.107}$$
The maximum choice (10.106) only occurs if we choose $Q_{X^{(2)}}$ such that $X^{(2)}$
is constant equal to $\xi$. This we do not include for the moment, as we keep
the distributions fixed.
So we see that in point A the decoder knows $\mathbf{X}^{(2)}$ when it decodes $\mathbf{X}^{(1)}$.
However, it is not necessary that $R^{(2)} = 0$ in order to make sure that the
decoder knows $\mathbf{X}^{(2)}$! As long as we decode $\mathbf{X}^{(2)}$ first and are sure we can do
this reliably, the system still works. So how large can we choose $R^{(2)}$?
Well, we know from standard single-user transmission that we are OK as long
as
$$R^{(2)} < I\bigl(X^{(2)}; Y\bigr). \tag{10.108}$$
This explains point B! Note that this is again the idea called successive cancellation as already introduced in Example 10.8. The principle is easy: The
decoder decodes the message of one user (usually the one with the smaller rate) first, ignoring the other user completely, i.e., treating the other user as if it were noise.
Then, using the knowledge of this first message, it cancels the influence of
this user from the received sequence and decodes the (usually high-rate) message of the second user.
10.5.2
The capacity region $\mathcal{C}_1$ is defined as the convex hull of all different pentagons
given by some $Q_{X^{(1)}} Q_{X^{(2)}}$. Let's investigate such a convex hull using the
example of two different choices of $Q_{X^{(1)}} Q_{X^{(2)}}$: $Q^a_{X^{(1)}} Q^a_{X^{(2)}}$ with corresponding pentagon $\mathcal{C}^a$, and $Q^b_{X^{(1)}} Q^b_{X^{(2)}}$ with corresponding pentagon $\mathcal{C}^b$. These
two pentagons are depicted in Figure 10.10 together with the convex hull of
$\mathcal{C}^a \cup \mathcal{C}^b$. The reader will note that this convex hull is no longer a pentagon,
but rather a heptagon.
On the other hand, one could define a family of pentagons $\mathcal{C}^\lambda$ defined
by a convex combination of the five bordering lines of $\mathcal{C}^a$ and $\mathcal{C}^b$, see Figure 10.11. The idea of $\mathcal{C}^\lambda$ is that the five boundaries are convex combinations
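[Figure: the pentagons $\mathcal{C}^a$ and $\mathcal{C}^b$ in the $(R^{(1)}, R^{(2)})$ plane together with the convex hull of $\mathcal{C}^a \cup \mathcal{C}^b$.]
Figure 10.10: Two pentagons $\mathcal{C}^a$ and $\mathcal{C}^b$ and the convex hull of their union.
Figure 10.11: Definition of a convex combination of $\mathcal{C}^a$ and $\mathcal{C}^b$.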
(1)
(1)
Ia1
+ (1
)Ib1
(2)
(2)
Ia2 + (1 )Ib2
(10.109)
(10.110)
[0,1]
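Unfortunately, this is not true in general as can be seen from the following
example.
Example 10.12. Consider the following two pentagons:
\begin{align}
\mathcal{C}^a &\triangleq \Bigl\{ \bigl(R^{(1)}, R^{(2)}\bigr) : R^{(1)} \ge 0,\ R^{(2)} \ge 0,\ R^{(1)} \le 10,\ R^{(2)} \le 10,\ R^{(1)} + R^{(2)} \le 100 \Bigr\}, \tag{10.111} \\
\mathcal{C}^b &\triangleq \Bigl\{ \bigl(R^{(1)}, R^{(2)}\bigr) : R^{(1)} \ge 0,\ R^{(2)} \ge 0,\ R^{(1)} \le 20,\ R^{(2)} \le 20,\ R^{(1)} + R^{(2)} \le 20 \Bigr\}. \tag{10.112}
\end{align}
These two pentagons and their boundaries are depicted in Figure 10.12.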
[Figure 10.12: the regions $\mathcal{C}^a$ and $\mathcal{C}^b$ in the $(R^{(1)}, R^{(2)})$ plane, with axis ticks at 10 and 20 and one constraint marked as inactive.]
(10.113)
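[Figure: the region $\mathcal{C}^{1/2}$ in the $(R^{(1)}, R^{(2)})$ plane, with axis ticks at 10 and 20.]
Figure 10.13: New pentagon derived as a convex combination of the two pentagons of Figure 10.12.
We now realize that there are points in $\mathcal{C}^{1/2}$ that are not elements of the
convex hull of $\mathcal{C}^a \cup \mathcal{C}^b$! For example, note that $(15, 15) \in \mathcal{C}^{1/2}$, but $(15, 15) \notin \mathcal{C}^b$.
(Here $\mathcal{C}^a \subseteq \mathcal{C}^b$, so the convex hull of $\mathcal{C}^a \cup \mathcal{C}^b$ is $\mathcal{C}^b$ itself.)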
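The claim of Example 10.12 can be verified numerically. The following Python sketch checks that (15, 15) satisfies the averaged constraints for a mixing weight of 1/2, and exhibits a separating direction showing that (15, 15) cannot lie in the convex hull of the union of the two regions. The vertex lists are derived from (10.111) and (10.112); the names used are illustrative only.

```python
def in_pentagon(r1, r2, i1, i2, i3):
    """Membership test for {R1, R2 >= 0, R1 <= i1, R2 <= i2, R1 + R2 <= i3}."""
    return r1 >= 0 and r2 >= 0 and r1 <= i1 and r2 <= i2 and r1 + r2 <= i3


I_a = (10, 10, 100)  # constraints of C^a; the sum-rate constraint is inactive
I_b = (20, 20, 20)   # constraints of C^b; the individual constraints are inactive

lam = 0.5
I_mix = tuple(lam * a + (1 - lam) * b for a, b in zip(I_a, I_b))

point = (15, 15)
point_in_mix = in_pentagon(*point, *I_mix)

# Vertices of the two regions (C^a is a square, C^b is a triangle).
vertices_a = [(0, 0), (10, 0), (10, 10), (0, 10)]
vertices_b = [(0, 0), (20, 0), (0, 20)]

# A linear function attains its maximum over a convex hull at one of the
# generating vertices. Along the direction w = (1, 1) the point exceeds
# that maximum, so it lies outside the convex hull of the union.
w = (1, 1)
max_over_hull = max(w[0] * x + w[1] * y for x, y in vertices_a + vertices_b)
point_value = w[0] * point[0] + w[1] * point[1]
point_outside_hull = point_value > max_over_hull

print("averaged constraints:", I_mix)
print("(15, 15) satisfies averaged constraints:", point_in_mix)
print("(15, 15) outside convex hull of the union:", point_outside_hull)
```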
So, Example 10.12 shows that (10.110) is not true in general. However, we
can rescue the situation: The reason why (10.110) does not hold in the above
example is that some of the constraints are not active! Luckily, it is easy
to see that in our case this cannot happen. Because
$$I\bigl(X^{(2)}; Y \big| X^{(1)}\bigr) = I\bigl(X^{(2)}; Y, X^{(1)}\bigr) \ge I\bigl(X^{(2)}; Y\bigr), \tag{10.115}$$
(where we have used that $X^{(1)} \perp X^{(2)}$), we have
\begin{align}
I_1 + I_2 &= I\bigl(X^{(1)}; Y \big| X^{(2)}\bigr) + I\bigl(X^{(2)}; Y \big| X^{(1)}\bigr) \tag{10.116} \\
&\ge I\bigl(X^{(1)}; Y \big| X^{(2)}\bigr) + I\bigl(X^{(2)}; Y\bigr) \tag{10.117} \\
&= I\bigl(X^{(1)}, X^{(2)}; Y\bigr) \tag{10.118} \\
&= I_3, \tag{10.119}
\end{align}
i.e., $I_1 + I_2 \ge I_3$. Hence, the third constraint $I_3$ is always active!
Let's make this more formal.
(10.120)
(10.122)
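for $0 \le \lambda \le 1$.
Then, the rate region defined by $\mathbf{I}^\lambda$ is given by
$$\mathcal{C}_{\mathbf{I}^\lambda} = \lambda\, \mathcal{C}_{\mathbf{I}^a} + (1 - \lambda)\, \mathcal{C}_{\mathbf{I}^b}. \tag{10.123}$$
(10.124)
and therefore
$$\lambda\, \mathcal{C}_{\mathbf{I}^a} + (1 - \lambda)\, \mathcal{C}_{\mathbf{I}^b} \subseteq \mathcal{C}_{\mathbf{I}^\lambda}. \tag{10.125}$$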
To prove the reverse, we consider the five extreme points of the pentagonal
region $\mathcal{C}_{\mathbf{I}^\lambda}$, see Figure 10.14.
By definition, any of these extremal points can be written as a convex combination of the corresponding extremal points of $\mathcal{C}_{\mathbf{I}^a}$ and $\mathcal{C}_{\mathbf{I}^b}$, respectively.
But since this holds true for these extremal points, it must also be true for
any point in $\mathcal{C}_{\mathbf{I}^\lambda}$, and hence
$$\mathcal{C}_{\mathbf{I}^\lambda} \subseteq \lambda\, \mathcal{C}_{\mathbf{I}^a} + (1 - \lambda)\, \mathcal{C}_{\mathbf{I}^b}. \tag{10.126}$$
Note that here we rely fundamentally on the fact that $I_1 + I_2 \ge I_3$ and that
therefore the pentagon in Figure 10.14 really is a pentagon and not a square
such as shown in Figure 10.15. If we could not rely on this fact, our argument
would break down.
The following corollary is a direct consequence of Proposition 10.13.
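[Figure 10.14: pentagon in the $(R^{(1)}, R^{(2)})$ plane with extremal points $(0, 0)$, $(I_1, 0)$, $(I_1, I_3 - I_1)$, $(I_3 - I_2, I_2)$, and $(0, I_2)$.]
Figure 10.15: The five extremal points do not actually define the corner points
of a pentagon. This situation cannot happen because $I_1 + I_2 \ge I_3$.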
Corollary 10.14. The convex hull of the union of all rate regions defined by
some $\mathbf{I}$ is equal to the rate region defined by the convex combination of all $\mathbf{I}$
vectors.
In particular this shows once again that $\mathcal{C}_1 = \mathcal{C}_2$.
Note that in (10.110) the convex hull of $\mathcal{C}^a \cup \mathcal{C}^b$ can be achieved by time-sharing: For a certain fraction $\lambda \in [0, 1]$ of the time, we use a coding scheme
achieving $\mathcal{C}^a$, while for the rest of the time we use another coding scheme achieving
$\mathcal{C}^b$.
The RHS of (10.110), $\bigcup_{\lambda \in [0,1]} \mathcal{C}^\lambda$, corresponds to a scheme that usually is
called coded time-sharing: there we choose the input distribution as a random
mixture with a probability $\lambda \in [0, 1]$ of picking $Q^a_{X^{(i)}}$ and probability $1 - \lambda$ of
picking $Q^b_{X^{(i)}}$.
10.5.3
As we have seen from Theorem 10.9, the MAC capacity region is a convex
hull of the union of many different pentagons. In general this region will look
as shown in Figure 10.16.
[Figure: a convex region in the $(R^{(1)}, R^{(2)})$ plane.]
Figure 10.16: General shape of the MAC capacity region.
However, there are some cases where the MAC region is described by a
single pentagon. As an example, we continue with Example 10.8.
Example 10.15 (Continuation of Example 10.8). Recall the binary erasure
MAC from Example 10.8 with binary inputs and a ternary output given by
$$Y = X^{(1)} + X^{(2)} \tag{10.127}$$
(normal addition!). We have already argued that the pentagon given in Figure 10.6 is achievable. We will now show that it actually is the capacity region.
To do so, we will prove the following three statements:
1. If $\bigl(R^{(1)}, R^{(2)}\bigr)$ is achievable, then $R^{(1)} \le 1$ bit.
Proof: From Theorem 10.10 we know that for some choice of $Q_T \, Q_{X^{(1)}|T} \, Q_{X^{(2)}|T}$ we have
\begin{align}
R^{(1)} &\le I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr) \tag{10.128} \\
&= H\bigl(Y \big| X^{(2)}, T\bigr) - \underbrace{H\bigl(Y \big| X^{(1)}, X^{(2)}, T\bigr)}_{=\,0} \tag{10.129}
\end{align}
242
(10.130)
log 2 = 1 bit,
(10.132)
(10.131)
3
4
bits.
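Proof: From Theorem 10.10 we know that for some choice of $Q_T \, Q_{X^{(1)}|T} \, Q_{X^{(2)}|T}$ we have
\begin{align}
R^{(1)} + R^{(2)} = 2R &\le I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr) \tag{10.133} \\
&= H(Y|T) - \underbrace{H\bigl(Y \big| X^{(1)}, X^{(2)}, T\bigr)}_{=\,0} \tag{10.134} \\
&= H(Y|T) \tag{10.135} \\
&= H\bigl(X^{(1)} + X^{(2)} \big| T\bigr). \tag{10.136}
\end{align}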
Now note that from symmetry we can assume that $Q_{X^{(1)}|T} = Q_{X^{(2)}|T}$.
(If $Q_{X^{(1)}|T}$ and $Q_{X^{(2)}|T}$ were not the same, we could use time-sharing
between this asymmetric choice and its flipped version and thereby make the distribution symmetric. Note that the value of the entropy in
(10.136) for any choice of $Q_T \, Q_{X^{(1)}|T} \, Q_{X^{(2)}|T}$ is identical to the entropy
of the flipped version, and therefore also the time-sharing between these
two versions will result in the same entropy.)
Then
H(Y T ) H(Y )
2
(10.137)
2
(10.138)
2
2
1
1
1 1
1 1
log
2 log 2
2
2
2 2
2 2
2
2
1
1
log 1
1
2
2
3
= bits,
2
(10.139)
(10.140)
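[Figure: the pentagon of Figure 10.6 in the $(R^{(1)}, R^{(2)})$ plane with axis ticks at 1/2 and 1, together with the outer bounds and light-shaded areas between them.]
Figure 10.17: An achievable rate region of the binary erasure MAC with (partial) outer bounds. Note that we have not yet proven that the
light-shaded area is not achievable.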
Note that all these bounds are actually boundary points of the achievable
region given in Figure 10.6. Hence, we now have the situation shown in Figure 10.17. We have drawn arbitrarily shaped light-shaded areas which we have
not yet proven to be outside of the capacity region. However, it is straightforward to argue why these light-shaded areas cannot be achievable: Suppose
for the moment that they were achievable. Then, using the time-sharing convexity
argument, we could also achieve a rate pair
$$\bigl(R^{(1)}, R^{(2)}\bigr) = (R, R) > \Bigl(\frac{3}{4}, \frac{3}{4}\Bigr), \tag{10.141}$$
which is a contradiction to the proven outer bound. Hence, we conclude that
the rate region given in Figure 10.6 must be the capacity region.
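The three numbers that describe the pentagon of the binary erasure MAC can be reproduced numerically. The following Python sketch assumes independent Bernoulli inputs (no time-sharing variable) and uses the fact that Y is a deterministic function of the two inputs, so that the sum-rate term equals H(Y) and each single-user term equals the entropy of the corresponding input. The function names are illustrative.

```python
import math
from itertools import product


def entropy_bits(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def be_mac_rates(p1, p2):
    """Return (I(X1;Y|X2), I(X2;Y|X1), I(X1,X2;Y)) in bits for Y = X1 + X2
    with independent inputs, P[X1 = 1] = p1 and P[X2 = 1] = p2."""
    q_y = [0.0, 0.0, 0.0]
    for x1, x2 in product((0, 1), repeat=2):
        prob = (p1 if x1 else 1 - p1) * (p2 if x2 else 1 - p2)
        q_y[x1 + x2] += prob
    # H(Y | X1, X2) = 0 because Y is determined by the inputs, hence
    # I(X1;Y|X2) = H(Y|X2) = H(X1), and similarly for user 2.
    i1 = entropy_bits([p1, 1 - p1])
    i2 = entropy_bits([p2, 1 - p2])
    i3 = entropy_bits(q_y)
    return i1, i2, i3


i1, i2, i3 = be_mac_rates(0.5, 0.5)
print(f"I1 = {i1} bits, I2 = {i2} bits, I3 = {i3} bits")
print(f"largest symmetric rate R = I3 / 2 = {i3 / 2} bits")
```

For uniform inputs this gives 1 bit for each individual bound and 1.5 bits for the sum rate, in agreement with the symmetric-rate bound of 3/4 bits discussed above.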
We will see below that also the Gaussian MAC has a capacity region in
the shape of a pentagon.
10.6
Multiple-User MAC
R(i)
(10.143)
iL
and
X [L] , X (i) : i L .
(10.144)
244
10.7
Gaussian MAC
Even though strictly speaking our proofs are not extendable to the Gaussian
case, we simply believe that the corresponding results hold anyway.
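So assume two users independently transmitting codewords $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$,
respectively, and a receiver that gets a sequence $\mathbf{Y}$ where
$$Y_k = X_k^{(1)} + X_k^{(2)} + Z_k \tag{10.145}$$
with $\{Z_k\}$ being IID $\sim \mathcal{N}(0, \sigma^2)$. We assume an average-power constraint for
each user $i$:
$$\frac{1}{n} \sum_{k=1}^{n} \Bigl( x_k^{(i)}\bigl(m^{(i)}\bigr) \Bigr)^2 \le \mathsf{E}^{(i)}, \qquad m^{(i)} \in \bigl\{1, 2, \ldots, e^{nR^{(i)}}\bigr\}. \tag{10.146}$$
10.7.1
Capacity Region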
(1)
(2)
(2)
R < I X ; Y X
,
(10.148)
(10.150)
(10.151)
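The Gaussian MAC would not be Gaussian if we could not actually derive
this capacity region explicitly. . . ! So note that
\begin{align}
I\bigl(X^{(1)}; Y \big| X^{(2)}\bigr) &= h\bigl(Y \big| X^{(2)}\bigr) - h\bigl(Y \big| X^{(1)}, X^{(2)}\bigr) \tag{10.152} \\
&= h\bigl(X^{(1)} + X^{(2)} + Z \big| X^{(2)}\bigr) - h\bigl(X^{(1)} + X^{(2)} + Z \big| X^{(1)}, X^{(2)}\bigr) \tag{10.153} \\
&= h\bigl(X^{(1)} + Z\bigr) - h(Z) \tag{10.154} \\
&\le \frac{1}{2} \log\Bigl( 2\pi e \bigl(\mathsf{E}^{(1)} + \sigma^2\bigr) \Bigr) - \frac{1}{2} \log\bigl( 2\pi e \sigma^2 \bigr) \tag{10.155} \\
&= \frac{1}{2} \log\biggl( 1 + \frac{\mathsf{E}^{(1)}}{\sigma^2} \biggr), \tag{10.156}
\end{align}
and, similarly,
\begin{align}
I\bigl(X^{(2)}; Y \big| X^{(1)}\bigr) &\le \frac{1}{2} \log\biggl( 1 + \frac{\mathsf{E}^{(2)}}{\sigma^2} \biggr), \tag{10.157} \\
I\bigl(X^{(1)}, X^{(2)}; Y\bigr) &\le \frac{1}{2} \log\biggl( 1 + \frac{\mathsf{E}^{(1)} + \mathsf{E}^{(2)}}{\sigma^2} \biggr). \tag{10.158}
\end{align}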
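\begin{align}
R^{(1)} &\le C\biggl( \frac{\mathsf{E}^{(1)}}{\sigma^2} \biggr), \tag{10.161} \\
R^{(2)} &\le C\biggl( \frac{\mathsf{E}^{(2)}}{\sigma^2} \biggr), \tag{10.162} \\
R^{(1)} + R^{(2)} &\le C\biggl( \frac{\mathsf{E}^{(1)} + \mathsf{E}^{(2)}}{\sigma^2} \biggr), \tag{10.163}
\end{align}
where
$$C(t) \triangleq \frac{1}{2} \log(1 + t). \tag{10.164}$$
Note that we have here the excellent situation that the maximum possible
rate if full cooperation between the users were allowed, $C\Bigl( \frac{\mathsf{E}^{(1)} + \mathsf{E}^{(2)}}{\sigma^2} \Bigr)$, is also
achievable without cooperation! (However, without cooperation the maximum
rate is not available at the corners, where one of the two users gets the large
majority of the available sum rate.)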
10.7.2
Discussion
E + 2
(1)
(2)
E
E
= C (1)
+C
.
2
2
E +
(10.165)
(10.166)
(10.167)
(10.168)
246
E(2)
E + 2
(1)
(1) (2)
R(1) + R(2) = C E +E
2
B
C
E(1)
E(2) + 2
R(1)
(1)
C E2
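$$R^{(2)} < C\biggl( \frac{\mathsf{E}^{(2)}}{\mathsf{E}^{(1)} + \sigma^2} \biggr). \tag{10.169}$$
Then, it cancels the codeword from user 2 from the received sequence and
decodes user 1, which works fine as long as
$$R^{(1)} < C\biggl( \frac{\mathsf{E}^{(1)}}{\sigma^2} \biggr). \tag{10.170}$$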
Note that we have assumed so far that $\mathsf{E}^{(1)}$ and $\mathsf{E}^{(2)}$ are fixed! (To walk
around in the capacity region of Figure 10.18 we need to vary the rates!)
If we also allow changing the power constraints subject to a given overall total power
$\mathsf{E}$:
$$\mathsf{E}^{(1)} + \mathsf{E}^{(2)} \le \mathsf{E}, \tag{10.171}$$
then the capacity region becomes a triangle identical to the capacity of cooperative communication, see Figure 10.19.
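The three bounds of the Gaussian MAC and the successive-cancellation corner point are easy to evaluate numerically. The following Python sketch uses logarithms to base 2 (so rates are in bits per channel use), which is a choice made here for readability; the example powers and the function names are illustrative assumptions.

```python
import math


def C(t):
    """C(t) = 1/2 log(1 + t), here evaluated with the logarithm to base 2."""
    return 0.5 * math.log2(1 + t)


def gaussian_mac_bounds(e1, e2, sigma2):
    """Bounds on R1, on R2, and on R1 + R2 for powers e1, e2 and noise variance sigma2."""
    return C(e1 / sigma2), C(e2 / sigma2), C((e1 + e2) / sigma2)


def successive_cancellation_corner(e1, e2, sigma2):
    """Decode user 2 first while treating user 1 as additional noise,
    cancel its codeword, then decode user 1 without interference."""
    r2 = C(e2 / (e1 + sigma2))
    r1 = C(e1 / sigma2)
    return r1, r2


e1, e2, sigma2 = 10.0, 5.0, 1.0
bound_1, bound_2, bound_sum = gaussian_mac_bounds(e1, e2, sigma2)
r1, r2 = successive_cancellation_corner(e1, e2, sigma2)

print(f"R1 <= {bound_1:.4f}, R2 <= {bound_2:.4f}, R1 + R2 <= {bound_sum:.4f}")
print(f"corner point: R1 = {r1:.4f}, R2 = {r2:.4f}, sum = {r1 + r2:.4f}")
```

The sum of the two corner rates coincides with the sum-rate bound, which is the algebraic reason why successive cancellation reaches the dominant face of the pentagon.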
10.7.3
Figure 10.19: Capacity region of the Gaussian MAC when we allow changing
the power of the users subject to a total power constraint E.
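$$R^{(2)} = \frac{n^{(2)}}{n} \cdot \frac{1}{2} \log\biggl( 1 + \frac{\mathsf{E}^{(2)}}{\sigma^2} \biggr) = (1 - \alpha)\, \frac{1}{2} \log\biggl( 1 + \frac{\mathsf{E}^{(2)}}{\sigma^2} \biggr), \tag{10.175}$$
\begin{align}
R^{(1)} &= \alpha\, \frac{1}{2} \log\biggl( 1 + \frac{\mathsf{E}^{(1)}}{\alpha \sigma^2} \biggr), \tag{10.176} \\
R^{(2)} &= (1 - \alpha)\, \frac{1}{2} \log\biggl( 1 + \frac{\mathsf{E}^{(2)}}{(1 - \alpha) \sigma^2} \biggr). \tag{10.177}
\end{align}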
(1)
+R
(2)
1
1
E(1)
E(2)
+ (1 ) log 1 +
(10.178)
= log 1 +
2
2
2
(1 ) 2
E(1) + E(2)
! 1
= log 1 +
,
(10.179)
2
2
i.e.,
(1 ) E(1) + E(2) + 2
E(2) + (1 ) 2
(1 ) 2 + (1 )E(1)
!
.
(1 ) 2 + E(2)
(10.180)
This is hard to solve. But considering that it is very unlikely to find solutions
unless both sides are 1, we guess that both sides are equal to 1. We check and see that this
actually is possible:
(1 ) E(1) + E(2) + 2
E(2) + (1 ) 2
(1 ) 2 + (1 )E(1)
(1
) 2
(2)
+ E
E(1)
E(1) + E(2)
E(1)
= 1 = =
= 1 = =
E(1) + E(2)
(10.181)
(10.182)
It turns out that this solution really is the only solution to (10.180) apart from
the trivial solutions where the time-sharing ratio equals 0 or 1, which both are sum-rate suboptimal.
So we see that TDMA indeed can achieve the same maximum sum rate as
CDMA, however, only in one particular point where the time-sharing ratio is
fixed by the given power distribution. See Figure 10.20.
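The statement about TDMA can be checked numerically. The following Python sketch assumes the usual TDMA model in which user 1 transmits during a fraction of the time (called `alpha` here, a name chosen for this illustration) with its power boosted by the inverse of that fraction, and user 2 does the same during the remaining time. Rates are computed in bits; the example powers are arbitrary.

```python
import math


def C(t):
    """C(t) = 1/2 log(1 + t), here with the logarithm to base 2."""
    return 0.5 * math.log2(1 + t)


def tdma_rates(alpha, e1, e2, sigma2):
    """Rates of the two users when user 1 is active during a fraction alpha
    of the time with power e1 / alpha, and user 2 during the remaining
    fraction with power e2 / (1 - alpha)."""
    r1 = alpha * C(e1 / (alpha * sigma2)) if alpha > 0 else 0.0
    r2 = (1 - alpha) * C(e2 / ((1 - alpha) * sigma2)) if alpha < 1 else 0.0
    return r1, r2


e1, e2, sigma2 = 10.0, 5.0, 1.0
max_sum_rate = C((e1 + e2) / sigma2)

alpha_star = e1 / (e1 + e2)  # time-sharing ratio matched to the powers
sum_at_alpha_star = sum(tdma_rates(alpha_star, e1, e2, sigma2))
sum_at_half = sum(tdma_rates(0.5, e1, e2, sigma2))

print(f"maximum sum rate:        {max_sum_rate:.4f}")
print(f"TDMA at alpha = {alpha_star:.4f}: {sum_at_alpha_star:.4f}")
print(f"TDMA at alpha = 0.5:     {sum_at_half:.4f}")
```

Only the matched time-sharing ratio reaches the maximum sum rate; any other ratio, such as 0.5 for these powers, falls short.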
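[Figure: the TDMA rate pairs in the $(R^{(1)}, R^{(2)})$ plane as the time-sharing ratio runs from 0 to 1; the curve touches the sum-rate boundary at the ratio $\frac{\mathsf{E}^{(1)}}{\mathsf{E}^{(1)} + \mathsf{E}^{(2)}}$.]
Figure 10.20: TDMA can achieve the maximum sum-rate capacity only in one
particular point that is specified by the power distribution.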
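For FDMA a similar investigation can be made. There we need to use the
continuous-time Gaussian capacity formula:
\begin{align}
R^{(1)} &= \mathsf{B}^{(1)} \log\biggl( 1 + \frac{\mathsf{P}^{(1)}}{\mathsf{N}_0 \mathsf{B}^{(1)}} \biggr), \tag{10.183} \\
R^{(2)} &= \mathsf{B}^{(2)} \log\biggl( 1 + \frac{\mathsf{P}^{(2)}}{\mathsf{N}_0 \mathsf{B}^{(2)}} \biggr), \tag{10.184}
\end{align}
where $\mathsf{P}^{(i)}$ is the $i$th user's power and where $\mathsf{B}^{(i)}$ is the $i$th user's available
bandwidth. Assuming that we have a total bandwidth
$$\mathsf{B}^{(1)} + \mathsf{B}^{(2)} = \mathsf{B} \tag{10.185}$$
and solving
$$R^{(1)} + R^{(2)} \overset{!}{=} \mathsf{B} \log\biggl( 1 + \frac{\mathsf{P}^{(1)} + \mathsf{P}^{(2)}}{\mathsf{N}_0 \mathsf{B}} \biggr), \tag{10.186}$$
we find that
\begin{align}
\mathsf{B}^{(1)} &= \frac{\mathsf{P}^{(1)}}{\mathsf{P}^{(1)} + \mathsf{P}^{(2)}}\, \mathsf{B}, \tag{10.187} \\
\mathsf{B}^{(2)} &= \frac{\mathsf{P}^{(2)}}{\mathsf{P}^{(1)} + \mathsf{P}^{(2)}}\, \mathsf{B}, \tag{10.188}
\end{align}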
Historical Remarks
Chapter 11
Transmission of Correlated
Sources over a MAC
In Chapter 9 we considered distributed source compression where the two
encoders were working independently from each other, and in Chapter 10
we considered two independent users transmitting information to the same
receiver. It is therefore quite natural to ask the question of what happens if
we combine these two setups.
11.1
Problem Setup
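[Figure: the source $Q_{U,V}$ emits $U_1^n$ and $V_1^n$, which are fed to Enc. (1) and Enc. (2), respectively. The encoder outputs $\{X_k^{(1)}\}_1^n$ and $\{X_k^{(2)}\}_1^n$ enter the MAC $Q_{Y|X^{(1)},X^{(2)}}$, whose output $Y_1^n$ is passed to the decoder (Dec.), which delivers $\hat{U}_1^n, \hat{V}_1^n$ to the destination (Dest.).]
Figure 11.1: A general system for transmitting a correlated source $Q_{U,V}$ over
a multiple-access channel $Q_{Y|X^{(1)},X^{(2)}}$.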
Note that we have simplified the system by assuming that both the length
of the source sequences $\mathbf{U}$, $\mathbf{V}$ and the length of the transmitted codewords
$\mathbf{X}^{(1)}$, $\mathbf{X}^{(2)}$ are equal to n. So throughout the whole system there is only one
clock.
Combining our knowledge about lossless compression from [Mos14] (i.e.,
lossless compression is possible up to the entropy of the IID source sequence)
and about the MAC (see Theorem 10.10), we immediately see that
reliable transmission is possible if
\begin{align}
H(U) &< I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr), \tag{11.1} \\
H(V) &< I\bigl(X^{(2)}; Y \big| X^{(1)}, T\bigr), \tag{11.2} \\
H(U) + H(V) &< I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr) \tag{11.3}
\end{align}
for some choice of $Q_T \, Q_{X^{(1)}|T} \, Q_{X^{(2)}|T}$. In this case we simply use a lossless
compressor to compress the source sequences individually to their most efficient representation and then apply a standard MAC coding scheme according
to Chapter 10.
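However, considering our discussion from Chapter 9, we can do better: If
\begin{align}
H(U|V) &< I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr), \tag{11.4} \\
H(V|U) &< I\bigl(X^{(2)}; Y \big| X^{(1)}, T\bigr), \tag{11.5} \\
H(U, V) &< I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr), \tag{11.6}
\end{align}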
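[Figure: the source $Q_{U,V}$ emits $\mathbf{U}$ and $\mathbf{V}$, which are compressed by SW-Enc. 1 and SW-Enc. 2 into the indices $W^{(1)}$ and $W^{(2)}$. These are mapped by MAC-Enc. 1 and MAC-Enc. 2 to $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ and sent over the MAC $Q_{Y|X^{(1)},X^{(2)}}$. The decoder consists of a MAC-Dec. that recovers $\hat{W}^{(1)}, \hat{W}^{(2)}$ from $\mathbf{Y}$, followed by an SW-Dec. that delivers $\hat{\mathbf{U}}, \hat{\mathbf{V}}$ to the destination (Dest.).]
Figure 11.2: The information transmission system with source-channel separation: The joint DMS is first compressed in a distributed manner
according to Slepian-Wolf, then a MAC coding scheme is applied
for the transmission of the data over the channel.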
Note that in both (11.1)–(11.3) and (11.4)–(11.6) we apply a source-channel separation. Is this optimal? Does a source-channel separation theorem
exist in this context? Unfortunately, it does not. We can prove this by the
following counterexample.
Example 11.1. Let U, V be two binary random variables with the following
joint distribution:
\begin{align}
Q_{U,V}(0, 0) = Q_{U,V}(1, 0) = Q_{U,V}(1, 1) &= \frac{1}{3}, \tag{11.7} \\
Q_{U,V}(0, 1) &= 0. \tag{11.8}
\end{align}
(with normal addition). We already know that the capacity region of this
MAC is a pentagon with a maximum sum rate of 1.5 bits.
Hence,
$$H(U, V) = \log 3 \approx 1.58 \text{ bits} > I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr) = 1.5 \text{ bits} \tag{11.9}$$
and according to (11.4)–(11.6) this source cannot be transmitted reliably over
the given channel.
However, consider the following coding scheme: Choose n = 1, two encoders
\begin{align}
X_k^{(1)} &= U_k, \tag{11.10} \\
X_k^{(2)} &= V_k, \tag{11.11}
\end{align}
and a decoder
$$Y_k = X_k^{(1)} + X_k^{(2)} = U_k + V_k = \begin{cases} 0 & \Longrightarrow\ \hat{U}_k = 0,\ \hat{V}_k = 0, \\ 1 & \Longrightarrow\ \hat{U}_k = 1,\ \hat{V}_k = 0, \\ 2 & \Longrightarrow\ \hat{U}_k = 1,\ \hat{V}_k = 1. \end{cases} \tag{11.12}$$
This coding scheme (apart from being very simple) works perfectly, i.e., the
probability of error is equal to zero!
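The two key facts of this example, namely that the joint entropy exceeds 1.5 bits and that uncoded transmission is nevertheless error-free, can be confirmed with a short Python sketch. It assumes the channel with output equal to the normal sum of the two binary inputs, as used in the example; all names are illustrative.

```python
import math

# Joint source PMF from (11.7)-(11.8); the pair (0, 1) has probability zero.
Q_UV = {(0, 0): 1 / 3, (1, 0): 1 / 3, (1, 1): 1 / 3}

# Joint entropy H(U, V) in bits.
H_UV = -sum(p * math.log2(p) for p in Q_UV.values())

# Uncoded transmission: X1 = U, X2 = V, so the channel output is Y = U + V.
# Collect, for every output value, the source pairs that can produce it.
preimages = {}
for (u, v) in Q_UV:
    preimages.setdefault(u + v, []).append((u, v))

# The decoder can invert the channel if every output has exactly one preimage.
is_invertible = all(len(pairs) == 1 for pairs in preimages.values())

# Probability mass of source pairs that the decoder cannot resolve.
error_probability = sum(
    p for (u, v), p in Q_UV.items() if preimages[u + v] != [(u, v)]
)

print(f"H(U, V) = {H_UV:.4f} bits")
print("decoder table:", {y: pairs[0] for y, pairs in preimages.items()})
print("error probability:", error_probability)
```

The decoder table produced by the sketch is exactly the mapping in (11.12): because the pair (0, 1) never occurs, the output value 1 is unambiguous.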
The reason why a combination of Slepian-Wolf and MAC coding schemes is not optimal lies in the basic assumptions of the setup: For MAC coding we have always assumed that the two users are transmitting completely independent messages, while in the Slepian-Wolf situation the crucial point is that the source has strong correlation. In other words, our MAC coding scheme is no good at dealing with information that is contained in both messages and therefore does not need to be transmitted by both users.
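This explains why the third constraint in (11.4)-(11.6) is too restrictive. Note that the first two constraints are satisfied:
\begin{align}
H(U|V) &= \frac{2}{3} \text{ bits} < I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr) = 1 \text{ bit}, \tag{11.13} \\
H(V|U) &= \frac{2}{3} \text{ bits} < I\bigl(X^{(2)}; Y \big| X^{(1)}, T\bigr) = 1 \text{ bit}. \tag{11.14}
\end{align}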
This must be the case because otherwise we would violate the limitations of the well-understood single-user information transmission: Even if we inform the first encoder and the decoder about the values of V and X(2), we cannot beat constraint (11.4) (and similarly for U and X(1) and the second encoder).
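The conditional entropies in (11.13) and (11.14) follow from the chain rule H(U|V) = H(U,V) - H(V). A small sketch under the PMF of Example 11.1 (the helper names `entropy_bits` and `marginal` are illustrative assumptions):

```python
import math

# Joint PMF of (U, V) from Example 11.1.
Q_UV = {(0, 0): 1 / 3, (1, 0): 1 / 3, (1, 1): 1 / 3, (0, 1): 0.0}


def entropy_bits(pmf):
    """Entropy in bits of a PMF given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginal(pmf, axis):
    """Marginal PMF of component `axis` (0 for U, 1 for V)."""
    out = {}
    for pair, p in pmf.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out


H_UV = entropy_bits(Q_UV)
H_U_GIVEN_V = H_UV - entropy_bits(marginal(Q_UV, 1))  # chain rule
H_V_GIVEN_U = H_UV - entropy_bits(marginal(Q_UV, 0))
```

Both values come out as 2/3 bits, below the single-user limit of 1 bit, so only the sum-rate constraint of the separation-based scheme is violated.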
11.2
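\begin{align}
H(U|V) &< I\bigl(X^{(1)}; Y \big| X^{(2)}, V, T\bigr), \tag{11.15} \\
H(V|U) &< I\bigl(X^{(2)}; Y \big| X^{(1)}, U, T\bigr), \tag{11.16} \\
H(U,V) &< I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr) \tag{11.17}
\end{align}
for some
\[ Q_{T,U,V,X^{(1)},X^{(2)},Y} = \underbrace{Q_T}_{\text{time-sharing}} \cdot \underbrace{Q_{U,V}}_{\text{source}} \cdot \underbrace{Q_{X^{(1)}|U,T}}_{\text{encoder 1}} \cdot \underbrace{Q_{X^{(2)}|V,T}}_{\text{encoder 2}} \cdot \underbrace{Q_{Y|X^{(1)},X^{(2)}}}_{\text{channel}}. \tag{11.18} \]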
Be aware that in (11.18) the source and the channel are given, while we can try to find some optimal choice of the encoders and the time-sharing distribution.
Note that this theorem includes both Theorem 9.4 (Slepian-Wolf) and Theorem 10.10 (MAC) as special cases. To see this, first note that if we
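choose a dummy channel
\[ Y = \bigl(X^{(1)}, X^{(2)}\bigr), \tag{11.19} \]
set T = 0, and choose X^{(1)} and X^{(2)} independent of (U, V) so that
\[ Q_{U,V,X^{(1)},X^{(2)}} = Q_{U,V}\, Q_{X^{(1)}}\, Q_{X^{(2)}}, \tag{11.20} \]
where Q_{X^{(i)}} must be such that H(X^{(i)}) = R^{(i)}, then we get from the first condition in (11.15)-(11.17):
\begin{align}
H(U|V) &< I\bigl(X^{(1)}; Y \big| X^{(2)}, V, T\bigr) \tag{11.21} \\
&= H\bigl(X^{(1)} \big| X^{(2)}, V, T\bigr) - \underbrace{H\bigl(X^{(1)} \big| Y, X^{(2)}, V, T\bigr)}_{=\,0 \text{ (because of (11.19))}} \tag{11.22} \\
&= H\bigl(X^{(1)} \big| X^{(2)}, V, T\bigr) \tag{11.23} \\
&= H\bigl(X^{(1)}\bigr) \qquad \text{(because of (11.20))} \tag{11.24} \\
&= R^{(1)}; \tag{11.25} \\
H(V|U) &< I\bigl(X^{(2)}; Y \big| X^{(1)}, U, T\bigr) \tag{11.26} \\
&= R^{(2)}; \tag{11.27} \\
H(U,V) &< I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr) \tag{11.28} \\
&= H\bigl(X^{(1)}, X^{(2)} \big| T\bigr) - \underbrace{H\bigl(X^{(1)}, X^{(2)} \big| Y, T\bigr)}_{=\,0} \tag{11.29} \\
&= H\bigl(X^{(1)}, X^{(2)}\bigr) \tag{11.30} \\
&= H\bigl(X^{(1)}\bigr) + H\bigl(X^{(2)}\bigr) \tag{11.31} \\
&= R^{(1)} + R^{(2)}. \tag{11.32}
\end{align}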
(11.33)
(11.34)
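Then
\begin{align}
H(U|V) &= H(U) \tag{11.35} \\
&= R^{(1)} \tag{11.36} \\
&< I\bigl(X^{(1)}; Y \big| X^{(2)}, V, T\bigr) \tag{11.37} \\
&= I\bigl(X^{(1)}; Y \big| X^{(2)}, T\bigr); \tag{11.38} \\
H(V|U) &= H(V) \tag{11.39} \\
&= R^{(2)} \tag{11.40} \\
&< I\bigl(X^{(2)}; Y \big| X^{(1)}, U, T\bigr) \tag{11.41} \\
&= I\bigl(X^{(2)}; Y \big| X^{(1)}, T\bigr); \tag{11.42} \\
H(U,V) &= H(U) + H(V) \tag{11.43} \\
&= R^{(1)} + R^{(2)} \tag{11.44} \\
&< I\bigl(X^{(1)}, X^{(2)}; Y \big| T\bigr). \tag{11.45}
\end{align}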
3: Encoder Design: Upon observing u, encoder 1 transmits X(1) (u). Similarly, upon observing v, encoder 2 transmits X(2) (v).
(11.46)
If there is a unique such pair, then the decoder decides $(\hat{\mathbf{U}}, \hat{\mathbf{V}}) \triangleq (\hat{\mathbf{u}}, \hat{\mathbf{v}})$, otherwise it declares an error.
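5: Performance Analysis: We have
\begin{align}
\Pr(\text{error}) &= \Pr\bigl(\text{error} \,\big|\, (\mathbf{U},\mathbf{V}) \in A_\epsilon^{(n)}(Q_{U,V})\bigr) \cdot \underbrace{\Pr\bigl[(\mathbf{U},\mathbf{V}) \in A_\epsilon^{(n)}(Q_{U,V})\bigr]}_{\le\, 1} \nonumber \\
&\quad + \underbrace{\Pr\bigl(\text{error} \,\big|\, (\mathbf{U},\mathbf{V}) \notin A_\epsilon^{(n)}(Q_{U,V})\bigr)}_{=\, 1} \cdot \underbrace{\Pr\bigl[(\mathbf{U},\mathbf{V}) \notin A_\epsilon^{(n)}(Q_{U,V})\bigr]}_{\le\, \delta_t(n,\epsilon,\mathcal{U}\times\mathcal{V})} \nonumber \\
&\le \Pr\bigl(\text{error} \,\big|\, (\mathbf{U},\mathbf{V}) \in A_\epsilon^{(n)}(Q_{U,V})\bigr) + \delta_t(n,\epsilon,\mathcal{U}\times\mathcal{V}). \tag{11.47}
\end{align}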
(n)
For (u, v) A
Fu,v ,
(11.48)
o
(11.49)
X(1) (u), X(2) (v), Y A(n) QU,V,X (1) ,X (2) ,Y u, v
and write
(QU,V )
Pr error (U, V) A(n)
X
=
Pr U = u, V = v (U, V) A(n) (QU,V )
(n)
(u,v)A
(QU,V )
Pr(errorU = u, V = v)
(11.50)
(n)
Pr U = u, V = v (U, V) A (QU,V )
(n)
(u,v)A
(QU,V )
c
PrFu,v
[
(n)
(
u,
v)A (QU,V
Fu ,v
(11.51)
(
u,
v)6=(u,v)
(n)
(u,v)A
Pr U = u, V = v (U, V) A(n)
(QU,V )
(QU,V )
{z
=1
c
Pr Fu,v
+
X
(n)
A
u
(QU,V v)
6=u
u
Pr(Fu ,v )
257
(n)
(n)
A
v
Pr(Fu,v ) +
(QU,V u)
6=v
v
(
u,
v)A (QU,V )
(
u,
v)6=(u,v)
Pr(Fu ,v )
(11.52)
where we have used the Union Bound. We investigate each term on the
RHS of (11.52) separately:
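\[ \Pr\bigl(F_{\mathbf{u},\mathbf{v}}^{c}\bigr) \le \delta_t\bigl(n, \epsilon, \mathcal{X}^{(1)} \times \mathcal{X}^{(2)} \times \mathcal{Y}\bigr). \tag{11.53} \]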
Then,
X
Pr(Fu ,v )
(n)
A (QU,V
u
6=u
u
v)
(n)
A
u
(1) (2)
(QU,V v) (x ,x ,y)
(n)
6=u
u
A (Q
u,v)
<
<
(11.54)
(n)
(1)
(2)
A (Q
u, v) en(H(X U )+H(X ,Y V ))
(n)
A (QU,V
u
6=u
u
QnX (2) ,Y V x(2) , yv
QnX (1) U x(1) u
v)
en(H(X
(1) ,X (2) ,Y
U,V )+)
(n)
A
u
(11.55)
(QU,V v)
6=u
u
en(H(X
V ))
(11.56)
(1)
(2)
(1)
(2)
< A(n)
(QU,V v) en(H(X U )+H(X V )+H(Y X ,X )+)
(1)
(2)
(2)
en( H(X U )H(X V )H(Y X ,V )+)
)+)
< en(H(U V )+) en(
(1)
(2)
= en(H(U V )I(X ;Y X ,V )+) 0
I(X (1) ;Y
X (2) ,V
(11.57)
(11.58)
(11.59)
if
\[ H(U|V) < I\bigl(X^{(1)}; Y \big| X^{(2)}, V\bigr). \tag{11.60} \]
Here we have made use of our assumptions that X^{(1)} only depends on U, X^{(2)} only on V, and Y only on (X^{(1)}, X^{(2)}). However, note that when Y is conditional on X^{(2)} and V, we cannot drop V, because V is via U related to X^{(1)}.
(11.61)
(n)
A
v
(QU,V u)
6=v
v
258
(11.62)
Pr(Fu ,v )
(
u,
v)6=(u,v)
QnX (2) V x(2) v
QnY (y)
QnX (1) U x(1) u
(1) (2)
(n)
(
u,
v)A (QU,V ) (x ,x ,y)
(n)
(
u,
v)6=(u,v)
u,
v)
A (Q
<
X
(n)
(
u,
v)A (QU,V )
(
u,
v)6=(u,v)
<
(11.63)
(1)
(n)
, v
en(H(X U ))
QU,V,X (1) ,X (2) ,Y u
A
en(H(X
(2) V
en(
))
en(H(Y ))
(11.64)
U,V )+)
(n)
(
u,
v)A (QU,V )
(
u,
v)6=(u,v)
en(H(X
)+H(Y ))
(11.65)
(1)
(2)
(1)
(2)
(QU,V ) en(H(X U )+H(X V )+H(Y X ,X )+)
< A(n)
en( H(X
)H(Y )+)
(11.66)
)+)
< en(H(U,V )+) en(
(1)
(2)
= en(H(U,V )I(X ,X ;Y )+) 0
I(X (1) ,X (2) ;Y
(11.67)
(11.68)
if
\[ H(U,V) < I\bigl(X^{(1)}, X^{(2)}; Y\bigr). \tag{11.69} \]
(1) ,X (2) ;Y
)+)
+ t (n, , U V),
(11.70)
11.3
The region given in Theorem 11.2 is strictly suboptimal. Consider for example
the case U = V . Theorem 11.2 gives
\[ H(U,V) = H(U) < \max_{Q_U Q_{X^{(1)}|U} Q_{X^{(2)}|U}} I\bigl(X^{(1)}, X^{(2)}; Y\bigr). \tag{11.71} \]
However, since U = V, both encoders know the message of the other and therefore they can cooperate! So, we definitely can achieve
\[ H(U) < \max_{Q_{X^{(1)},X^{(2)}}} I\bigl(X^{(1)}, X^{(2)}; Y\bigr), \tag{11.72} \]
\begin{align}
H(U|V) &< I\bigl(X^{(1)}; Y \big| X^{(2)}, V, S\bigr), \tag{11.73} \\
H(V|U) &< I\bigl(X^{(2)}; Y \big| X^{(1)}, U, S\bigr), \tag{11.74} \\
H(U,V|W) &< I\bigl(X^{(1)}, X^{(2)}; Y \big| W, S\bigr), \tag{11.75}
\end{align}
auxiliary
RV
encoder 1
encoder 2
channel
Be aware that in (11.77) the source with common part and the channel
are given, while we can try to find some optimal choice of the encoders and
the auxiliary distribution.
Proof: We omit the proof. It can be found in [CEGS80].
Note that it can be shown that this region is already convex, i.e., we do not need a time-sharing variable T. Also it can be shown that
\[ |\mathcal{S}| \le \min\bigl\{ |\mathcal{X}^{(1)}| \cdot |\mathcal{X}^{(2)}|,\ |\mathcal{Y}| \bigr\} \tag{11.78} \]
is sufficient.
The region given by (11.73)-(11.76) is better than (11.15)-(11.17) because here we allow a dependence between X^{(1)} and X^{(2)} via S. (To be able to properly compare (11.73)-(11.76) and (11.15)-(11.17), it is best to remove the time-sharing variable in (11.15)-(11.17), as this only gives convexification. Without T (and conditionally on (U, V)) we see that X^{(1)} and X^{(2)} in (11.15)-(11.17) are indeed independent.)
Unfortunately, this region still is strictly too small. This can be seen very easily when realizing that our counterexample in Example 11.1 still is not included in this region: For the source given in Example 11.1 we do not have any common part!
Chapter 12
12.1
Introduction
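[Figure 12.1, block diagram: Uniform Source, Encoder, channel input X, channel Q_{Y|X,S}, Decoder, Dest.; the interference S enters the channel and is also available to the encoder.]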
At first thought and similar to Chapter 8, this might again seem strange. Why should the encoder have access to side-information, but the decoder not? However, here too there exist some important practical situations where we have exactly this constellation:
• In a broadcast channel, two messages are intended for two receivers, where the message for receiver 1 can be regarded as unwanted interference for receiver 2 and vice versa. This interference is known noncausally to the encoder in advance, but is not known to the decoders.
• Consider the situation of burning a rewritable CD. If the CD has been burned before, it already contains data that might not be completely removable and that will cause distortion later on when the CD is read again. Before reburning the CD, the encoder can first read the contents of the CD and then take the existing noise into account for the encoding of the given data. The reader, on the other hand, will have no way of knowing what the original, but now overwritten, contents of the CD were. This problem is usually known as dirty paper coding. We will discuss this in more detail in Section 12.6.
More formally, we have the following definitions.
Definition 12.1. A discrete memoryless channel (DMC) with interference consists of an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, an interference alphabet $\mathcal{S}$, and a conditional probability distribution $Q_{Y|X,S}$ such that for any value of interference $s_k$, the channel output $Y_k$ depends only on the current channel input $x_k$ via $Q_{Y|X,S}(\cdot|x_k, s_k)$.
Definition 12.2. An $(e^{nR}, n)$ coding scheme for a DMC with interference consists of a set of indices
\[ \mathcal{M} = \bigl\{1, 2, \ldots, e^{nR}\bigr\}, \tag{12.1} \]
(12.2)
: Y n M.
(12.3)
The average error probability of an enR , n coding scheme for a DMC
with interference is given as
Pe(n) ,
1 X
Pr[(Y1n ) 6= m M = m].
enR
mM
(12.4)
263
12.2
(12.6)
1
Note that again we include the boundary into the capacity region, or rather, we define
the capacity as supremum rather than maximum without bothering whether the value of
R = C actually is achievable or not.
264
(12.9)
(12.10)
(12.11)
(12.12)
(12.13)
(12.14)
U(m, v), Y
/ A(n) (QU,Y ).
(12.16)
(12.15)
but
Note that here we ignore the possibility that there might exist another v such that
U(m, v), Y A(n) (QU,Y ),
(12.17)
i.e., our bound on the error probability is definitely too big.
The details of this analysis are given in the following Sections 12.2.1-12.2.5.
12.2.1
265
Case 1
12.2.2
(12.18)
Case 2
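\begin{align}
\Pr(\text{Case 2}) &= \Pr\Bigl[ \bigl\{\mathbf{S} \in A_\epsilon^{(n)}(Q_S)\bigr\} \cap \bigl\{\nexists\, v\colon \bigl(\mathbf{U}(m,v), \mathbf{S}\bigr) \in A_\epsilon^{(n)}(Q_{U,S})\bigr\} \Bigr] \tag{12.19} \\
&= \Pr\bigl[\mathbf{S} \in A_\epsilon^{(n)}(Q_S)\bigr] \cdot \Pr\Bigl[\nexists\, v\colon \mathbf{U}(m,v) \in A_\epsilon^{(n)}(Q_{U,S}|\mathbf{S}) \,\Big|\, \mathbf{S} \in A_\epsilon^{(n)}(Q_S)\Bigr] \tag{12.20} \\
&= \underbrace{\Pr\bigl[\mathbf{S} \in A_\epsilon^{(n)}(Q_S)\bigr]}_{\le\, 1} \cdot \prod_{v=1}^{e^{nR'}} \Pr\Bigl[\mathbf{U}(m,v) \notin A_\epsilon^{(n)}(Q_{U,S}|\mathbf{S}) \,\Big|\, \mathbf{S} \in A_\epsilon^{(n)}(Q_S)\Bigr] \tag{12.21} \\
&\le \prod_{v=1}^{e^{nR'}} \Pr\Bigl[\mathbf{U}(m,v) \notin A_\epsilon^{(n)}(Q_{U,S}|\mathbf{S}) \,\Big|\, \mathbf{S} \in A_\epsilon^{(n)}(Q_S)\Bigr] \tag{12.22} \\
&= \prod_{v=1}^{e^{nR'}} \Bigl(1 - \Pr\Bigl[\mathbf{U}(m,v) \in A_\epsilon^{(n)}(Q_{U,S}|\mathbf{S}) \,\Big|\, \mathbf{S} \in A_\epsilon^{(n)}(Q_S)\Bigr]\Bigr) \tag{12.23} \\
&< \prod_{v=1}^{e^{nR'}} \Bigl(1 - e^{-n(I(U;S)+\epsilon)}\Bigr) \tag{12.24} \\
&= \Bigl(1 - e^{-n(I(U;S)+\epsilon)}\Bigr)^{e^{nR'}} \tag{12.25} \\
&\le \exp\Bigl(-e^{nR'} e^{-n(I(U;S)+\epsilon)}\Bigr) \tag{12.26} \\
&= \exp\Bigl(-e^{n(R' - I(S;U) - \epsilon)}\Bigr). \tag{12.27}
\end{align}
Here, (12.24) follows from TC2, and the inequality (12.26) is due to the Exponentiated IT Inequality (Corollary 1.10).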
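The step from (12.25) to (12.26) uses the bound (1 - a)^N <= exp(-N a) for 0 <= a <= 1. The following sketch evaluates both sides for illustrative parameter values (rates in nats; the function name `case2_bounds` and all numbers are assumptions made for this example, not taken from the notes).

```python
import math


def case2_bounds(n, rate_prime, mutual_info, eps):
    """Return ((1 - a)^N, exp(-N a)) with a = e^{-n(I + eps)} and N = e^{n R'}."""
    a = math.exp(-n * (mutual_info + eps))
    big_n = math.exp(n * rate_prime)
    exact = math.exp(big_n * math.log1p(-a))  # (1 - a)^N, computed stably
    upper = math.exp(-big_n * a)              # exponentiated IT inequality
    return exact, upper


# Hypothetical values with R' > I(U;S) + eps, so the bound decays doubly
# exponentially in n.
RESULTS = {n: case2_bounds(n, 0.5, 0.3, 0.1) for n in (10, 50, 100)}
```

With R' - I(U;S) - eps = 0.1 nats, the upper bound equals exp(-e^{0.1 n}), which is already numerically zero for n = 100.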
So, as long as
\[ R' > I(U;S) + \epsilon \tag{12.28} \]
12.2.3
Case 3
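We have
\begin{align}
\Pr(\text{Case 3}) &= \Pr\Biggl[ \bigl\{\mathbf{S} \in A_\epsilon^{(n)}(Q_S)\bigr\} \cap \bigl\{\bigl(\mathbf{U}(m,v), \mathbf{S}\bigr) \in A_\epsilon^{(n)}(Q_{U,S})\bigr\} \cap \bigcup_{\hat{m},\hat{v}\colon \hat{m} \ne m} \bigl\{\bigl(\mathbf{U}(\hat{m},\hat{v}), \mathbf{Y}\bigr) \in A_\epsilon^{(n)}(Q_{U,Y})\bigr\} \Biggr] \tag{12.29} \\
&\le \Pr\Biggl[ \bigcup_{\hat{m},\hat{v}\colon \hat{m} \ne m} \bigl\{\bigl(\mathbf{U}(\hat{m},\hat{v}), \mathbf{Y}\bigr) \in A_\epsilon^{(n)}(Q_{U,Y})\bigr\} \Biggr] \tag{12.30} \\
&\le \sum_{\hat{m},\hat{v}\colon \hat{m} \ne m} \Pr\Bigl[\bigl(\mathbf{U}(\hat{m},\hat{v}), \mathbf{Y}\bigr) \in A_\epsilon^{(n)}(Q_{U,Y})\Bigr] \tag{12.31} \\
&\le \sum_{\hat{m},\hat{v}\colon \hat{m} \ne m} e^{-n(I(U;Y)-\epsilon)} \tag{12.32} \\
&= e^{nR'} \bigl(e^{nR} - 1\bigr)\, e^{-n(I(U;Y)-\epsilon)} \tag{12.33} \\
&\le e^{n(R + R' - I(U;Y) + \epsilon)}. \tag{12.34}
\end{align}
Here, in (12.30) we enlarge the set; in (12.31) we apply the Union Bound; and (12.32) follows from TC.
So, as long as
\[ R + R' < I(U;Y) - \epsilon \tag{12.35} \]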
12.2.4
Case 4
Note that by the definition of jointly typical sets, if (U, Y) is not jointly
typical, then (U, S, X, Y) cannot be jointly typical either. Hence, always
taking into account that X = f n (U, S),
Pr(Case 4)
= Pr
U(m, v), S A(n)
(QU,S ) U(w, v), Y
/ A(n)
(QU,Y )
(12.36)
Pr (U(m, v), S) A(n)
(QU,S )
U(m, v), S, X, Y
/ A(n)
(Q
)
U,S,X,Y
(n)
= Pr U(m, v), S A (QU,S )

{z
}
(12.37)
Pr U(m, v), S, X, Y
/ A(n)
(QU,S,X,Y ) U(m, v), S A(n) (QU,S )
(12.38)
(n)
(n)
Pr U(m, v), S, X, Y
/ A (QU,S,X,Y ) U(m, v), S A (QU,S )
(12.39)
h
= 1 Pr U(m, v), S, X, Y A(n)
(Q
)
U,S,X,Y
i
U(m, v), S A(n) (QU,S ) (12.40)
=1
(u,s)
(QU,S )
267
Pr U(m, v) = u, S = s U(m, v), S A(n) (QU,S )
(n)
A
Pr (u, s, x, Y) A(n) (QU,S,X,Y ) U = u, S = s, x = f n (u, s)
X
=1
(u,s)
(QU,S )
(12.41)
(n)
Pr U(m, v) = u, S = s U(m, v), S A (QU,S )
(n)
A
Pr Y A(n) (QU,S,X,Y u, s, x) U = u, S = s, x = f n (u, s) .
(12.42)
Here the first inequality (12.37) follows because we enlarge the event (the
(n)
(n)
event (U, S, X, Y)
/ A
follows from the event (U, Y)
/ A ).
So we see that we need a lower bound on
Pr Y A(n)
(QU,S,X,Y u, s, x) U = u, S = s, x = f n (u, s)
(12.43)
= Qn
A(n)
(QU,S,X,Y u, s, x) x
Y X
(n)
where we know that (u, s) A (QU,S ) and x = f n (u, s). Note that, as in
Section 8.2.4, we cannot apply TB3 here because (U, S, X, Y) is not generated
according to QU,S,X,Y , but U is independent of the rest.
Basically, the situation here is identical to Section 8.2.4 and the Markov
Lemma that is proven there (see Remark 8.4). The only difference is that
X is not generated according to a distribution, but rather as a deterministic
function X = f n (U, S). However, since a deterministic function can be viewed
as a special distribution function (that only contains probability values 1 or
0), we can adapt the proof in a straightforward manner.
(n)
We start by noting that since (u, s) A (QU,S ) and since x = f n (u, s),
(n)
we have for any y A (QU,S,X,Y u, s, x) the following:
U S X  Y
> Pu,s,x,y (a, b, c, d) QU,S,X,Y (a, b, c, d)
(12.44)
(12.45)
Q
(da, b, c) I {c = f (a, b)}
(12.47)
U S  Y U,S,X
{z
}
1
Pu,s (a, b) I {c = f (a, b)} Pyu,s,x (da, b, c) QY U,S,X (da, b, c)
(12.48)
U S
268
for all (a, b, c, d) U S X Y. Here, I {} denotes the indicator function defined in (7.50), the first inequality follows because (u, s, x, y) is jointly
typical, and the second inequality follows because (u, s) is jointly typical.
The other direction can be shown accordingly, i.e., we have that any y
(n)
A (QU,S,X,Y u, s, x) satisfies for all (a, b, c) with Pu,s (a, b)I {c = f (a, b)} >
0 and for all d Y
Pyu,s,x (da, b, c) QY U,S,X (da, b, c)
1
1
1+
<
.
U S
X  Y Pu,s (a, b) I {c = f (a, b)}
(12.49)
2
log e.
2U2 S2 X 2 Y2
(12.51)
This then corresponds to (8.48) in the proof of Wyner-Ziv. Since also here we have a Markov structure $(U, S) \multimap\!\!- X \multimap\!\!- Y$, the remainder of the proof follows then exactly along the lines of (8.50)-(8.59) (which is an adapted version of (4.116)-(4.123) of the derivation for TB3). We hence are able to show that
\[ \Pr(\text{Case 4}) \le \delta_t(n, \epsilon, \mathcal{U} \times \mathcal{S} \times \mathcal{X} \times \mathcal{Y}). \tag{12.52} \]
12.2.5
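We are now ready to combine all these results together. Using the fact that all four cases combined cover the entire probability space, we use the Union Bound to get
\begin{align}
\Pr(\text{error}) &\le \Pr(\text{Case 1}) + \Pr(\text{Case 2}) + \Pr(\text{Case 3}) + \Pr(\text{Case 4}) \tag{12.53} \\
&\le \delta_t(n, \epsilon, \mathcal{S}) + \exp\Bigl(-e^{n(R' - I(S;U) - \epsilon)}\Bigr) + e^{n(R + R' - I(U;Y) + \epsilon)} + \delta_t(n, \epsilon, \mathcal{U} \times \mathcal{S} \times \mathcal{X} \times \mathcal{Y}). \tag{12.54}
\end{align}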
(12.55)
(12.56)
(12.57)
269
Note that since we are not interested in R', we can actually combine (12.56) and (12.57) to the condition
\[ R < I(U;Y) - R' < I(U;Y) - I(U;S). \tag{12.58} \]
(Note that we also omitted the $\epsilon$'s and $\delta$'s here, as they can be chosen arbitrarily small anyway.) Since we are trying to make this condition as loose as possible,
we will then decide to choose QU X and f (, ) such that the RHS of (12.58)
is maximized.
12.3
(12.59)
(12.60)
Note that in the factoring (12.59) $Q_S$ and $Q_{Y|X,S}$ are given, while we can choose $Q_{U|S}$ and $Q_{X|U,S}$. Also note the usual problem that we also need to choose the alphabet of the auxiliary RV U. So, we start with a standard argument based on Carathéodory's Theorem (Theorem 1.20) that limits the size of $\mathcal{U}$.
Lemma 12.5. Without loss of optimality we can restrict the size of $\mathcal{U}$ in the definition of the Gelfand-Pinsker rate in (12.60) to
\[ |\mathcal{U}| \le |\mathcal{S}| \cdot |\mathcal{X}| + 1. \tag{12.61} \]
Proof: The proof is again very similar to the proof of Lemma 7.7. Consider a given choice of $\mathcal{U}$, $Q_{U|S}$, and $Q_{X|U,S}$, and note that
\begin{align}
I(U;Y) - I(U;S) &= H(Y) - H(Y|U) - H(S) + H(S|U) \tag{12.62} \\
&= \sum_{u \in \mathcal{U}} Q_U(u) \bigl( H(Y) - H(Y|U=u) - H(S) + H(S|U=u) \bigr), \tag{12.63}
\end{align}
QS,X (s, x) =
X
uU
s S, x X .
(12.64)
270
(12.66)
QU (u)vu .
(12.67)
uU
s,x
s,x
X
s,x
X
s,x
(12.68)
(12.69)
(12.70)
(12.71)
where in (12.70) we use the factoring (12.59), and in (12.71) we define the
u,s,x,y , QSU (su)QY X,S (yx, s). Hence, QY U is a linear function of QXU,S ,
which means that RGP () is a convex function of a linear function of QXU,S .
But this means that RGP () is convex in QXU,S , as can be seen as follows.
271
(12.75)
where (12.73) follows from the linearity of g2 () and (12.74) from the convexity
of g1 (). So we see that a convex function of a linear function is convex.
Recall that we have shown in Section 12.2 that the Gelfand-Pinsker rate $R_{GP}$ is achievable. Since we are allowed to choose $Q_{U|S}$ and $Q_{X|U,S}$ it is clear that we would like to maximize $R_{GP}$ with an appropriate choice of $Q_{U|S}$ and $Q_{X|U,S}$. Due to the convexity of $R_{GP}$ in $Q_{X|U,S}$ (Lemma 12.6), however, such a maximization will lead to a conditional distribution with all probability values being either 1 or 0, i.e., $Q_{X|U,S}$ will become a deterministic relation that maps (U, S) to X.
Remark 12.7. To explain why the maximization of a convex function always will result in a boundary point, we consider the example of a function f(t) that is convex in $t \in [t_0, t_1]$. From the definition of convexity we have for any $0 \le \lambda \le 1$,
\[ f\bigl(\lambda t_0 + (1-\lambda) t_1\bigr) \le \lambda f(t_0) + (1-\lambda) f(t_1) \le \max\bigl\{f(t_0), f(t_1)\bigr\}. \tag{12.76} \]
Since any point $t \in [t_0, t_1]$ can be expressed as $t = \lambda t_0 + (1-\lambda) t_1$, we hence have
\[ \max_{t \in [t_0, t_1]} f(t) \le \max\bigl\{f(t_0), f(t_1)\bigr\}, \tag{12.77} \]
where the inequality actually is equality because the upper bound can be achieved.
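The statement of Remark 12.7 can be illustrated on a grid. This is only a sketch: the convex function `f` below is a hypothetical example, and `max_on_grid` is a helper introduced here for illustration.

```python
def max_on_grid(func, t0, t1, num=1001):
    """Maximum of func over an evenly spaced grid on [t0, t1], endpoints included."""
    grid = [t0 + (t1 - t0) * i / (num - 1) for i in range(num)]
    return max(func(t) for t in grid)


def f(t):
    """A convex function on [0, 1] whose minimum lies in the interior."""
    return (t - 0.3) ** 2


GRID_MAX = max_on_grid(f, 0.0, 1.0)
ENDPOINT_MAX = max(f(0.0), f(1.0))
```

The grid maximum coincides with the larger of the two endpoint values, as (12.77) predicts.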
This motivates the following definition.
Definition 12.8. The Gelfand-Pinsker capacity is defined as
\[ C_{GP} \triangleq \max_{Q_{U|S},\; f\colon \mathcal{U} \times \mathcal{S} \to \mathcal{X}} \bigl\{ I(U;Y) - I(U;S) \bigr\}. \tag{12.78} \]
From Section 12.2 we know that any rate below the Gelfand-Pinsker capacity is achievable. In Section 12.4 below we will prove the corresponding converse, i.e., no rate larger than the Gelfand-Pinsker capacity is achievable.
We remark that from Lemma 12.5 we know that it is sufficient to choose an alphabet $\mathcal{U}$ of the random variable U having size of at most
\[ |\mathcal{U}| \le |\mathcal{S}| \cdot |\mathcal{X}| + 1. \tag{12.79} \]
272
f : U SX
max
QU S
f : U SX
=0
I(U ; Y )
(12.81)
= max I(U ; Y )
(12.82)
= max I(X; Y ) = C,
(12.83)
QU
f : U X
QX
(12.84)
QXS
Because conditioning does not increase entropy and by the Data Processing Inequality (Proposition 1.12) we know that
\begin{align}
I(U;Y) - I(U;S) &= H(U|S) - H(U|Y) \tag{12.85} \\
&\le H(U|S) - H(U|Y,S) \tag{12.86} \\
&= I(U;Y|S) \tag{12.87} \\
&\le I(X;Y|S). \tag{12.88}
\end{align}
Hence, we see that $C_{GP}$ is between the capacity without side-information and the capacity with side-information both at transmitter and receiver.
Before we prove that $C_{GP}$ indeed is the maximum achievable rate, we give an example of what $C_{GP}$ can look like.
Example 12.9. We consider a binary channel with a ternary state: X = Y =
{0, 1} and S = {0, 1, 2}. The conditional channel law is given for some given
0 p, q 1 as follows.
For S = 0, we have
QY X,S (1x, 0) = 1 QY X,S (0x, 0) = q,
x = 0, 1,
(12.89)
x = 0, 1,
(12.90)
273
1q
0
1q
X
1
q
q
0
q
X
1q
1
1q
1p
0
p
X
1
1p
274
if x 6= y,
if x = y,
(12.91)
(12.92)
(12.93)
= q + (1 q) + (1 2)(1 p)
= + (1 2)(1 p).
(12.94)
(12.95)
(12.96)
QU S (01) = 1 QU S (11) = 1 ,
1
QU S (02) = QU S (12) = ,
2
(12.97)
(12.98)
(12.99)
(12.100)
Given these choices, lets compute the corresponding RGP . We start with
I(U ; Y ): From the symmetry of the channel and our choice of distributions, we
must have that the channel from U to Y is a BSC with a crossover probability
= Pr[Y 6= U ]:
1 = Pr[Y = U ]
(12.101)
X
=
QS (s) QU S (us) QXU,S (xu, s) QY X,S (yx, s) I {y = u}

{z
}
u,s,x,y
= I {x=u}
(12.102)
275
1
2
1
2
1
S
1 2
2
1
1
2
1
2
X
u,s
(12.103)
= (1 q) + (1 )q + (1 )q + (1 q)

{z
} 
{z
}
for s=0
for s=1
1
1
+ (1 2) (1 p) + (1 2) (1 p)
2
2

{z
}
(12.104)
for s=2
(12.105)
Hence, using that the capacity of a BSC with crossover probability is log 2
Hb (), we get
I(U ; Y ) = log 2 Hb ()
(12.106)
= log 2 Hb (1 )
(12.107)
= log 2 Hb 2(1 q) + 2(1 )q + (1 2)(1 p) . (12.108)
= log 2 Hb () (1 2) log 2 Hb ()
= 2 log 2 2 Hb (),
(12.109)
(12.110)
(12.111)
(12.112)
and therefore
n
log 2 Hb 2(1 q) + 2(1 )q + (1 2)(1 p)
01
o
2 log 2 + 2 Hb () .
(12.113)
CGP sup
Note that with respect to the first Hb term we should choose to be small,
but with respect to the second Hb term should be close to 12 . Also note
276
12.4
Converse
(12.114)
where n 0 as n .
Since we assume that M is uniform we have H(M ) = log enR , i.e.,
nR = H(M )
=
I(M ; Y1n )
I(M ; Y1n )
n
X
k=1
(12.115)
+
H(M Y1n )
(12.116)
+ nn
k
(12.117)
k1
n
; Y0 I M, Skn ; Y0
I M, Sk+1
+ nn .
(12.118)
n
+ I M, Sn ; Y0n1 I M, Sn1
; Y0n2
+ I(M ; Y0n )
I M, Sn ; Y0n1
= I(M ; Y0n ) = I(M ; Y1n ).
(12.119)
Applying the chain rule twice, once with respect to Yk and once with respect
to Sk , we continue with (12.118) as follows.
n
1 X
n
I M, Sk+1
; Y0k I M, Skn ; Y0k1 + n
n
k=1
n
1 X
n
n
=
I M, Sk+1
; Y0k1 + I M, Sk+1
; Yk Y0k1
n
k=1
n
n
I M, Sk+1
; Y0k1 I Sk ; Y0k1 M, Sk+1
+ n
(12.120)
(12.121)
12.5. Summary
277
n
k1
1 X
k1
n
n
=
I M, Sk+1 ; Yk Y0
I Sk ; Y0
M, Sk+1 + n
n
k=1
n
1 X
n
H Yk Y0k1 H Yk Y0k1 , M, Sk+1
=
n

{z
}
{z
}

k=1
H(Yk )
n
H Sk M, Sk+1

{z
}
= H(Sk ) because
M
{Sk } and {Sk } IID
(12.122)
, Uk
n
+ n (12.123)
+ H Sk Y0k1 , M, Sk+1
{z
}

, Uk
n
1X
H(Yk ) H(Yk Uk ) H(Sk ) + H(Sk Uk ) + n
n
(12.124)
1
n
I(Uk ; Yk ) I(Uk ; Sk ) + n
(12.125)
max I(U ; Y ) I(U ; S) + n
(12.126)
1
n
k=1
n
X
k=1
n
X
k=1
QU,XS
= max I(U ; Y ) I(U ; S) + n
QU,XS
=
max
I(U ; Y ) I(U ; S) + n
QU S ,QXU,S
=
max
I(U ; Y ) I(U ; S) + n
QU S ,f : U SX
= CGP + n .
(12.127)
(12.128)
(12.129)
(12.130)
12.5
Summary
Note that we use here a vector notation for Uk , even though there is no fundamental
difference between a random variable and a random vector for finite alphabets.
278
(12.131)
(12.132)
12.6
Possibly the most famous application of Gelfand-Pinsker's result is its application to a Gaussian setup.
Consider a sequence $\{S_k\}$ of general (not necessarily Gaussian) IID random variables of finite variance,⁴ and a memoryless channel where for a given channel input $x_k \in \mathbb{R}$, the channel output $Y_k \in \mathbb{R}$ at time k is given as
\[ Y_k = x_k + S_k + Z_k, \tag{12.133} \]
with
\[ \{Z_k\} \text{ IID} \sim \mathcal{N}(0, \sigma^2), \qquad \{Z_k\} \perp\!\!\!\perp \{S_k\}. \tag{12.134} \]
The transmitter has no knowledge of the realization of $\{Z_k\}$ (i.e., input and noise are independent, $\{X_k\} \perp\!\!\!\perp \{Z_k\}$), but the realization of the interference $\{S_k\}$ is known to the transmitter noncausally before transmission starts. Moreover, the transmitter is subject to an average-power constraint, i.e., a codeword of length n must satisfy
\[ \frac{1}{n} \sum_{k=1}^{n} x_k^2 \le \mathsf{E}. \tag{12.135} \]
There exist two different spellings: Gelfand or Gelfand. We use the spelling that seems
to be more common in information theory.
4
Note that Sk does not even need to be continuous, i.e., discrete, continuous or a mixture
of both is all fine, as long as it is decent enough that the mutual information terms below
make sense. One has to be careful with the differential entropy though! So, for simplicity,
we assume here that Sk is continuous with a proper PDF.
279
Max Costa who has introduced this system model [Cos83] called it writing
on dirty paper. The idea is that the transmitter writes its message on a piece
of paper that is pretty dirty so that the written message will be difficult to
read. However, to help transmission, the transmitter can first scan the paper
to learn about the noise on the paper and then adapt the writing to it. The
receiver, on the other hand, has no knowledge about the original dirt on the
paper before the message was written onto it. Additionally, the receiver will
introduce noise when reading the paper.
As a first thought one might think that an easy way of getting rid of the interference $S_k$ is to simply subtract the known $S_k$ at the transmitter:
\begin{align}
X_k &= \tilde{X}_k - S_k, \tag{12.136} \\
Y_k &= \tilde{X}_k - S_k + S_k + Z_k = \tilde{X}_k + Z_k, \tag{12.137}
\end{align}
i.e., we have reduced the channel model to a Gaussian channel. However, this approach does not work because of the power constraint (12.135). Note that we have made no assumption about the interference apart from being IID and having finite variance. This variance, however, might be far bigger than $\mathsf{E}$ such that (12.136) violates (12.135).
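The power problem of the naive presubtraction (12.136) can be seen in a short simulation. This is a sketch under assumptions made only for the illustration: the interference is taken to be zero-mean Gaussian (the notes allow a general distribution), the intended signal is independent of it, and the function name, sample size, and seed are arbitrary.

```python
import random


def presubtraction_power(power_e, var_s, n=200_000, seed=0):
    """Empirical average power of X_k = X~_k - S_k for independent zero-mean
    Gaussian X~_k (variance power_e) and S_k (variance var_s)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x_tilde = rng.gauss(0.0, power_e ** 0.5)  # intended channel input
        s = rng.gauss(0.0, var_s ** 0.5)          # known interference
        x = x_tilde - s                           # transmitted symbol (12.136)
        total += x * x
    return total / n


# Hypothetical numbers: allowed power 1, interference variance 100.
EMPIRICAL_POWER = presubtraction_power(1.0, 100.0)
```

Since the two terms are independent, the transmitted power is the sum of the signal power and the interference variance, which here lies far above the allowed power of 1.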
So, we go back to Figure 12.1 and try to adapt the derivation of the previous sections that were based on our finite alphabet assumption to this Gaussian setup. This can be done and it is not that difficult to show that the maximum achievable rate is given as
\[ C_{GP}(\mathsf{E}) = \sup_{Q_{U,X|S}\colon \mathrm{E}[X^2] \le \mathsf{E}} \bigl\{ I(U;Y) - I(U;S) \bigr\} \tag{12.138} \]
for some given random variable S, and for Y = X + S + Z with $Z \sim \mathcal{N}(0, \sigma^2)$. In the remainder of this section we will now derive the explicit value of $C_{GP}(\mathsf{E})$.
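We start with a lower bound. We choose $Q_{U,X|S} = Q_{U|S}\, Q_{X|U,S}$ as follows: For some independent $\tilde{U} \sim \mathcal{N}(0, \mathsf{E})$ and for some $\alpha \in \mathbb{R}$ we set
\begin{align}
U &\triangleq \tilde{U} + \alpha S, \tag{12.139} \\
X &\triangleq U - \alpha S = \tilde{U} + \alpha S - \alpha S = \tilde{U}. \tag{12.140}
\end{align}
Hence,
\begin{align}
X &\sim \mathcal{N}(0, \mathsf{E}), \tag{12.141} \\
X &\perp\!\!\!\perp S, \tag{12.142} \\
U &= X + \alpha S. \tag{12.143}
\end{align}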
280
(12.144)
(12.146)
(12.148)
= h(X) h S + X (X + S + Z)X + S + Z
= h(X) h (1 )X Z X + S + Z .
(12.145)
(12.147)
(12.149)
(12.150)
= (1 )E = 0,
(12.151)
(12.152)
i.e.,
=
E
.
E + 2
(12.153)
(12.154)
= I(X; X + Z)
1
E
= log 1 + 2 ,
2
(12.155)
(12.156)
(12.157)
(12.158)
(12.159)
(12.160)
(12.161)
where in the last step we made use of our knowledge of the optimal Gaussian input of a Gaussian channel.
On the other hand, note that $C_{GP}$ is trivially upper-bounded by a situation where the receiver also knows the realization of the side-information. In this scenario, the receiver simply subtracts the value of $\{S_k\}$ (since the receiver is not restricted by any type of power constraint, this is always possible) and thereby reduces the problem to the standard Gaussian channel. Hence,
\[ C_{GP}(\mathsf{E}) \le \frac{1}{2} \log\Bigl(1 + \frac{\mathsf{E}}{\sigma^2}\Bigr). \tag{12.162} \]
This gives the astonishing result that for the dirty paper channel (12.133) the interference can be eliminated without loss of rate independently of the type of interference and the value of its variance!
Theorem 12.11 (Dirty Paper Coding Theorem [Cos83]).
The capacity of the dirty paper channel (12.133) is given by
\[ C_{GP}(\mathsf{E}) = C_{\text{Gaussian}}(\mathsf{E}) = \frac{1}{2} \log\Bigl(1 + \frac{\mathsf{E}}{\sigma^2}\Bigr) \tag{12.163} \]
irrespective of S.
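The theorem can be sanity-checked in closed form for the special case of Gaussian interference (an assumption made only for this sketch; the theorem itself does not need it). With U = X + alpha S and jointly Gaussian variables, both mutual informations follow from variances and covariances. The function names and numerical values below are illustrative.

```python
import math


def costa_rate(alpha, power_e, var_s, var_z):
    """I(U;Y) - I(U;S) in nats for U = X + alpha*S, Y = X + S + Z, with
    independent X ~ N(0, power_e), S ~ N(0, var_s), Z ~ N(0, var_z)."""
    var_u = power_e + alpha ** 2 * var_s
    var_y = power_e + var_s + var_z
    cov_uy = power_e + alpha * var_s
    cov_us = alpha * var_s
    # For jointly Gaussian (A, B): I(A;B) = 0.5 * log(var_A var_B / det Cov).
    i_uy = 0.5 * math.log(var_u * var_y / (var_u * var_y - cov_uy ** 2))
    i_us = 0.5 * math.log(var_u * var_s / (var_u * var_s - cov_us ** 2))
    return i_uy - i_us


def gaussian_capacity(power_e, var_z):
    """Capacity in nats of the interference-free Gaussian channel."""
    return 0.5 * math.log(1.0 + power_e / var_z)


# Hypothetical numbers: power 1, interference variance 10, noise variance 0.5.
E, Q, N = 1.0, 10.0, 0.5
ALPHA_STAR = E / (E + N)                  # the choice (12.153)
RATE_STAR = costa_rate(ALPHA_STAR, E, Q, N)
```

At alpha = E/(E + sigma^2) the rate equals the interference-free capacity, whereas alpha = 1 (full presubtraction inside U) or alpha = 0 (ignoring S) gives strictly less for these numbers.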
12.7
We finish this chapter by quickly summarizing different types of side-information without proofs.
Consider a DMC with interference, and distinguish where the interference is known.
No side-information: If neither transmitter nor receiver has knowledge of the interference, they simply experience a DMC with averaged channel law:
\[ C = \max_{Q_X} I(X;Y) \tag{12.164} \]
where
\[ Q_{Y|X}(y|x) = \sum_{s} Q_S(s)\, Q_{Y|X,S}(y|x,s), \qquad \forall\, x, y. \tag{12.165} \]
Noncausal side-information:
Only at encoder: This is the case discussed in this chapter:
\[ C = \max_{Q_{U|S},\; f\colon \mathcal{U} \times \mathcal{S} \to \mathcal{X}} \bigl\{ I(U;Y) - I(U;S) \bigr\}. \tag{12.166} \]
(12.167)
282
(12.168)
(12.169)
Causal side-information:
$S_1^{k-1}$ known at encoder: Since $\{S_k\}$ is IID and the DMC is memoryless, knowledge of past realizations of $\{S_k\}$ is useless, i.e.,
\[ C = \max_{Q_X} I(X;Y). \tag{12.170} \]
max
QU ,f : U SX
I(U ; Y ).
(12.171)
max
QU ,f : U SX
I(U ; Y S)
(12.172)
(12.173)
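For the case without any side-information, the averaged channel law (12.165) is a plain mixture over the interference. A minimal sketch follows; the dictionary layout, the function name `averaged_channel`, and the toy channel (noiseless for s = 0, inverting for s = 1) are assumptions made for this illustration.

```python
def averaged_channel(q_s, q_y_given_xs):
    """Average Q_{Y|X,S} over Q_S.

    q_s:           dict s -> probability
    q_y_given_xs:  dict (x, s) -> dict y -> probability
    returns:       dict x -> dict y -> probability
    """
    averaged = {}
    for (x, s), row in q_y_given_xs.items():
        out = averaged.setdefault(x, {})
        for y, p in row.items():
            out[y] = out.get(y, 0.0) + q_s[s] * p
    return averaged


# Toy channel: the state decides whether the input bit is flipped.
Q_S = {0: 0.5, 1: 0.5}
Q_Y_GIVEN_XS = {
    (0, 0): {0: 1.0, 1: 0.0},
    (1, 0): {0: 0.0, 1: 1.0},
    (0, 1): {0: 0.0, 1: 1.0},
    (1, 1): {0: 1.0, 1: 0.0},
}
Q_AVG = averaged_channel(Q_S, Q_Y_GIVEN_XS)
```

In this toy case the averaged channel is a BSC with crossover probability 1/2, i.e., without side-information nothing can be transmitted, although the channel is noiseless for every fixed state.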
12.A
We have seen in Lemma 12.6 that (for fixed $Q_S$, $Q_{U|S}$, and $Q_{Y|X,S}$) the Gelfand-Pinsker rate $R_{GP}(Q_{U,S,X,Y})$ is convex in $Q_{X|U,S}$. It is tempting to claim that (for fixed $Q_S$, $Q_{X|U,S}$, and $Q_{Y|X,S}$) $R_{GP}(Q_{U,S,X,Y})$ is concave in $Q_{U|S}$ because we know that I(U;S) is convex in the channel law $Q_{U|S}$, that I(U;Y) is concave in the channel input distribution $Q_U$, and that $Q_U$ is linear in $Q_{U|S}$. Unfortunately, this argument is wrong because for it to hold we need the channel $Q_{Y|U}$ to be fixed, which is not the case as it also depends on $Q_{U|S}$. It turns out that in general $R_{GP}(Q_{U,S,X,Y})$ is not concave in $Q_{U|S}$!⁵
However, under the additional assumption of a cost constraint, we can prove that the Gelfand-Pinsker rate is concave in the cost. This result is crucial for the derivation of a converse for the Gelfand-Pinsker capacity with a cost constraint, see (12.138). In the following we will quickly give a proof of this latter claim.
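For simplicity of notation, we will only consider the case of a DMC. So, given some $Q_S$ and $Q_{Y|X,S}$ and some $\mathsf{E} > 0$, we define
\[ C_{GP}(\mathsf{E}) \triangleq \max_{Q_{U,X|S}\colon \mathrm{E}[X^2] \le \mathsf{E}} \bigl\{ I(U;Y) - I(U;S) \bigr\}. \tag{12.174} \]
For some two values $\mathsf{E}^{(1)}$ and $\mathsf{E}^{(2)}$, let $Q^{(1)}_{U,X|S}$ and $Q^{(2)}_{U,X|S}$ be the PMFs that achieve $C_{GP}(\mathsf{E}^{(1)})$ and $C_{GP}(\mathsf{E}^{(2)})$, respectively, and let $(U^{(i)}, X^{(i)})$ be the corresponding RVs, i.e.,
\[ \bigl(S, U^{(i)}, X^{(i)}\bigr) \sim Q_S\, Q^{(i)}_{U,X|S}, \qquad i = 1, 2. \tag{12.175} \]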
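Now let Z be a binary RV that is independent of all other random variables and that takes the value 1 with probability $\lambda$ and the value 2 with probability $1 - \lambda$, and define a new pair of RVs $(U_\lambda, X_\lambda)$ as
\begin{align}
U_\lambda &\triangleq \bigl(Z, U^{(Z)}\bigr), \tag{12.176} \\
X_\lambda &\triangleq X^{(Z)}. \tag{12.177}
\end{align}
Note that
\begin{align}
\mathrm{E}\bigl[X_\lambda^2\bigr] &= \mathrm{E}\Bigl[ \mathrm{E}\bigl[ \bigl(X^{(Z)}\bigr)^2 \,\big|\, Z \bigr] \Bigr] \tag{12.178} \\
&= \lambda\, \mathrm{E}\bigl[ \bigl(X^{(1)}\bigr)^2 \bigr] + (1-\lambda)\, \mathrm{E}\bigl[ \bigl(X^{(2)}\bigr)^2 \bigr] \tag{12.179} \\
&= \lambda \mathsf{E}^{(1)} + (1-\lambda) \mathsf{E}^{(2)} \triangleq \mathsf{E}_\lambda. \tag{12.180}
\end{align}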
(12.179)
(12.180)
Hence, we have
CGP E(1) + (1 )E(2)
= CGP (E )
max
QU,XS : E[X 2 ]E
(12.181)
I(U ; Y ) I(U ; S)
I(U ; Y ) I(U ; S)
= I Z, U (Z) ; Y I Z, U (Z) ; S
(12.182)
(12.183)
(12.184)
Interestingly, Gelfand and Pinsker actually wrongly claim concavity in their original
paper [GP80]!
284
(12.185)
(12.186)
(12.187)
(12.188)
(12.189)
Here, (12.183) follows by dropping the maximization and choosing one particular input of corresponding cost $\mathsf{E}_\lambda$, i.e., by choosing $(U_\lambda, X_\lambda)$; in (12.186) we add conditioning on Z to H(Y) (which reduces entropy) and to H(S) (which remains unchanged because $S \perp\!\!\!\perp Z$); and the last equality (12.189) holds because $(U^{(i)}, X^{(i)})$ achieves $C_{GP}(\mathsf{E}^{(i)})$, i = 1, 2.
This proves that $C_{GP}(\mathsf{E})$ indeed is concave in $\mathsf{E}$.
Chapter 13
Problem Setup
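[Figure 13.1, block diagram: Uniform Source 0, Uniform Source 1 and Uniform Source 2 emit M^{(0)}, M^{(1)} and M^{(2)} into a common encoder (Enc.); the Broadcast Channel produces Y^{(1)} and Y^{(2)}; Dec. (1) delivers estimates of (M^{(0)}, M^{(1)}) to Dest. 1 and Dec. (2) delivers estimates of (M^{(0)}, M^{(2)}) to Dest. 2.]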
Figure 13.1: A channel coding problem with three independent sources and
two independent destinations: The common message M (0) is intended for both destinations, while the messages M (i) are only
for the corresponding destination i, i = 1, 2. The three sources
are encoded by one common encoder, while each destination has
its own independent decoder. Such a channel model is called
broadcast channel (BC).
A single encoder needs to transmit three messages to two independent destinations. The common message $M^{(0)}$ must arrive at both receivers, while the private¹ message $M^{(1)}$ is only for destination 1, and the private message $M^{(2)}$ only for destination 2. The transmission takes place via a so-called broadcast channel that produces for a single input x two outputs $Y^{(1)}$ and $Y^{(2)}$. Such a communication setup was first described in [Cov72].
More formally, we have the following definitions.
1
Note that we do not consider privacy in a cryptographic context here: We do not care
if a private message can be (or even actually is) decoded by a wrong receiver as long as it
does arrive at its intended receiver!
285
286
(1)
,y
n
Y
(1) (2)
x =
QY (1) ,Y (2) X yk , yk xk .
(2)
(13.1)
k=1
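Definition 13.2. An $\bigl(e^{nR^{(0)}}, e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ coding scheme for a DMBC consists of three sets of indices
\begin{align}
\mathcal{M}^{(0)} &= \bigl\{1, 2, \ldots, e^{nR^{(0)}}\bigr\}, \tag{13.2} \\
\mathcal{M}^{(1)} &= \bigl\{1, 2, \ldots, e^{nR^{(1)}}\bigr\}, \tag{13.3} \\
\mathcal{M}^{(2)} &= \bigl\{1, 2, \ldots, e^{nR^{(2)}}\bigr\} \tag{13.4}
\end{align}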
called message sets, an encoding function
: M(0) M(1) M(2) X n ,
(13.5)
n
(2) n
M(0) M(1) ,
M(0) M(2) .
(13.6)
(13.7)
(0)
(1)
(2)
The error probability of an enR , enR , enR , n coding scheme for a
DMBC is given as
Pe(n) , Pr (1) Y(1) 6= M (0) , M (1) or (2) Y(2) 6= M (0) , M (2) . (13.8)
Definition 13.3. A rate triple $\bigl(R^{(0)}, R^{(1)}, R^{(2)}\bigr)$ is said to be achievable for the BC if there exists a sequence of $\bigl(e^{nR^{(0)}}, e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ coding schemes with $P_e^{(n)} \to 0$ as $n \to \infty$.
The capacity region of the BC is defined to be the closure of the set of all achievable rate triples.
Note that this usually is less than the capacity of the worse of the two channels:
Since we need to convey on both channels at the same time with only one input
distribution, a distribution QX that is good for the worse channel might not
be good for the good channel.
287
Example 13.5. A lecturer in a classroom: Not every student gets all information. If the lecturer is successful, then good students get more information
than less good students, but the poorer students still should receive enough
to be able to follow. Only a bad lecturer will teach at a pace that corresponds
to the worst student!
Example 13.6. If X is a vectoralphabet with the first component only connected with Y (1) and the second component only connected with Y (2) , then we
have an orthogonal BC with two independent channels. The capacity region
obviously is
(
(13.10)
(13.11)
where C(i) are the corresponding singleuser capacities of the two independent
channels.
(13.12)
(13.13)
(Another way to see this is that for every word we always have a choice between
two languages, i.e., we get an additional 1 bit.) This then yields a total of 13
bits/channel use. This is more than single timesharing!
Note that if we do not apply a 50%-50% time-sharing, but for example a 25%-75%, then we only get an additional $H_b\bigl(\tfrac{3}{4}\bigr) \approx 0.81$ bits per channel use.
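The two numbers quoted for the extra information carried by the time-sharing pattern are values of the binary entropy function. A short sketch (the function name is chosen here for illustration):

```python
import math


def binary_entropy_bits(p):
    """Binary entropy function H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


EXTRA_50_50 = binary_entropy_bits(0.5)    # 50%-50% split
EXTRA_25_75 = binary_entropy_bits(0.75)   # 25%-75% split
```

The even split gives the full additional bit, while the 25%-75% split gives about 0.81 bits.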
288
13.2
The first important observation concerns shifting of bits between common message and private messages. Since all R(0) bits are available at both receivers,
we can easily define some portion of them to become a private message of any
of the two receivers. These bits will still be decodable at the other receiver,
but once they are not common message anymore, they are simply discarded
at the wrong receiver.
We have the following theorem.
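Theorem 13.8. If $\bigl(R^{(0)}, R^{(1)}, R^{(2)}\bigr)$ is achievable, then
$$\bigl(\alpha_0 R^{(0)},\; R^{(1)} + \alpha_1 R^{(0)},\; R^{(2)} + \alpha_2 R^{(0)}\bigr) \tag{13.14}$$
for $\alpha_0, \alpha_1, \alpha_2 \ge 0$ and $\alpha_0 + \alpha_1 + \alpha_2 = 1$ is also achievable.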
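The bookkeeping behind Theorem 13.8 can be sketched in a few lines. The function name `shift_common_rate` and the numerical rates below are hypothetical; the three weights are the nonnegative fractions summing to one that appear in the theorem.

```python
def shift_common_rate(rates, weights):
    """Redistribute the common rate R0 according to Theorem 13.8.

    rates   = (R0, R1, R2): an achievable rate triple (assumed given).
    weights = (a0, a1, a2): nonnegative fractions that sum to one.
    Returns (a0*R0, R1 + a1*R0, R2 + a2*R0).
    """
    r0, r1, r2 = rates
    a0, a1, a2 = weights
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to one")
    return (a0 * r0, r1 + a1 * r0, r2 + a2 * r0)


# Hypothetical example: half of the common bits stay common,
# a quarter each is declared private for receiver 1 and receiver 2.
shifted = shift_common_rate((1.0, 0.5, 0.25), (0.5, 0.25, 0.25))
print(shifted)
```

Note that the sum rate is unchanged by the shift: only the labeling of the bits changes.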
The second observation is even more important. It is the foundation of
the degradedness property that will be introduced in Section 13.3.1.
Theorem 13.9. The capacity region of a BC depends only on the conditional
marginal distributions $Q_{Y^{(1)}|X}$ and $Q_{Y^{(2)}|X}$ and not on the joint conditional
channel law $Q_{Y^{(1)},Y^{(2)}|X}$.
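Proof: Define
$$P_e^{(n)} \triangleq \Pr\Bigl[\bigl\{\phi^{(1)}\bigl(\mathbf Y^{(1)}\bigr) \ne \bigl(M^{(0)}, M^{(1)}\bigr)\bigr\} \cup \bigl\{\phi^{(2)}\bigl(\mathbf Y^{(2)}\bigr) \ne \bigl(M^{(0)}, M^{(2)}\bigr)\bigr\}\Bigr], \tag{13.15}$$
$$P_e^{(n),(1)} \triangleq \Pr\Bigl[\phi^{(1)}\bigl(\mathbf Y^{(1)}\bigr) \ne \bigl(M^{(0)}, M^{(1)}\bigr)\Bigr], \tag{13.16}$$
$$P_e^{(n),(2)} \triangleq \Pr\Bigl[\phi^{(2)}\bigl(\mathbf Y^{(2)}\bigr) \ne \bigl(M^{(0)}, M^{(2)}\bigr)\Bigr]. \tag{13.17}$$
Since each of the two individual error events is contained in the union
$$\bigl\{\phi^{(1)}\bigl(\mathbf Y^{(1)}\bigr) \ne \bigl(M^{(0)}, M^{(1)}\bigr)\bigr\} \cup \bigl\{\phi^{(2)}\bigl(\mathbf Y^{(2)}\bigr) \ne \bigl(M^{(0)}, M^{(2)}\bigr)\bigr\}, \tag{13.18}$$
we have
$$P_e^{(n)} \ge \max\bigl\{P_e^{(n),(1)}, P_e^{(n),(2)}\bigr\}. \tag{13.19}$$
Together with the Union Bound this yields
$$\max\bigl\{P_e^{(n),(1)}, P_e^{(n),(2)}\bigr\} \le P_e^{(n)} \le P_e^{(n),(1)} + P_e^{(n),(2)}. \tag{13.20}$$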
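Hence, $P_e^{(n)} \to 0$ if, and only if, both $P_e^{(n),(1)} \to 0$ and $P_e^{(n),(2)} \to 0$. Since $P_e^{(n),(i)}$ depends only on the conditional marginal $Q_{Y^{(i)}|X}$, this proves the claim.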
Remark 13.10. Be aware that the error probability does depend on the
channel law $Q_{Y^{(1)},Y^{(2)}|X}$, but whether $P_e^{(n)}$ can be made arbitrarily small or
not does not depend on $Q_{Y^{(1)},Y^{(2)}|X}$ except through $Q_{Y^{(1)}|X}$ and $Q_{Y^{(2)}|X}$.
The main consequence of Theorem 13.9 is as follows.
Corollary 13.11. If two different BCs $Q_{Y^{(1)},Y^{(2)}|X}$ and $\tilde Q_{Y^{(1)},Y^{(2)}|X}$ have the
same conditional marginal distributions $Q_{Y^{(1)}|X}$ and $Q_{Y^{(2)}|X}$, then these two
BCs have the same capacity region.
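As an illustration of the corollary, the following sketch builds two binary BCs with identical conditional marginals but different joint channel laws. The binary symmetric setting, the crossover probabilities 0.1 and 0.2, and all function names are assumptions made for this example only.

```python
from itertools import product


def bsc(p):
    """Transition probabilities of a binary symmetric channel with crossover p."""
    return {(a, b): (1.0 - p if a == b else p) for a, b in product((0, 1), repeat=2)}


def independent_bc(p1, p2):
    """Joint law Q(y1, y2 | x) with independent noise on both outputs."""
    w1, w2 = bsc(p1), bsc(p2)
    return {(x, y1, y2): w1[x, y1] * w2[x, y2]
            for x, y1, y2 in product((0, 1), repeat=3)}


def degraded_bc(p1, p2):
    """Joint law where Y2 is obtained from Y1 via a second BSC (X -> Y1 -> Y2)."""
    q = (p2 - p1) / (1.0 - 2.0 * p1)  # chosen such that the cascade has crossover p2
    w1, w12 = bsc(p1), bsc(q)
    return {(x, y1, y2): w1[x, y1] * w12[y1, y2]
            for x, y1, y2 in product((0, 1), repeat=3)}


def conditional_marginals(joint):
    """Return Q(y1 | x) and Q(y2 | x) computed from Q(y1, y2 | x)."""
    m1 = {(x, y1): sum(joint[x, y1, y2] for y2 in (0, 1))
          for x, y1 in product((0, 1), repeat=2)}
    m2 = {(x, y2): sum(joint[x, y1, y2] for y1 in (0, 1))
          for x, y2 in product((0, 1), repeat=2)}
    return m1, m2


bc_a = independent_bc(0.1, 0.2)
bc_b = degraded_bc(0.1, 0.2)
marg_a, marg_b = conditional_marginals(bc_a), conditional_marginals(bc_b)

same_marginals = all(abs(ma[k] - mb[k]) < 1e-12
                     for ma, mb in zip(marg_a, marg_b) for k in ma)
same_joint = all(abs(bc_a[k] - bc_b[k]) < 1e-12 for k in bc_a)
print(same_marginals, same_joint)
```

The two joint laws differ, yet by the corollary both BCs have the same capacity region.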
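Also note that by the chain rule
$$Q_{Y^{(1)},Y^{(2)}|X} = Q_{Y^{(1)}|X}\, Q_{Y^{(2)}|Y^{(1)},X}, \tag{13.21}$$
i.e.,
$$Q_{Y^{(2)}|X}\bigl(y^{(2)}\big|x\bigr) = \sum_{y^{(1)}} Q_{Y^{(1)}|X}\bigl(y^{(1)}\big|x\bigr)\, Q_{Y^{(2)}|Y^{(1)},X}\bigl(y^{(2)}\big|y^{(1)}, x\bigr). \tag{13.22}$$
$$\sum_{y^{(1)}} Q_{Y^{(1)}|X}\bigl(y^{(1)}\big|x\bigr)\, Q_{Y^{(2)}|Y^{(1)},X}\bigl(y^{(2)}\big|y^{(1)}, x\bigr). \tag{13.23}$$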
Finally, we would like to point out that by the usual timesharing argument, i.e., the transmitter talks for a certain percentage of the time only to one
receiver and the rest of the time only to the second, it is clear again that the
capacity region must be convex. However, as we have seen in Example 13.7,
timesharing usually is not efficient and normally the capacity region cannot
be achieved by timesharing.
13.3
13.3.1
(13.24)
Figure 13.3: A physically degraded Gaussian BC.
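$$\tilde Q_{Y^{(2)}|X}\bigl(y^{(2)}\big|x\bigr) = \sum_{y^{(1)}} \tilde Q_{Y^{(1)}|X}\bigl(y^{(1)}\big|x\bigr)\, \tilde Q_{Y^{(2)}|Y^{(1)}}\bigl(y^{(2)}\big|y^{(1)}\bigr) \tag{13.29}$$
$$\sum_{y^{(1)}} Q_{Y^{(1)}|X}\bigl(y^{(1)}\big|x\bigr)\, \tilde Q_{Y^{(2)}|Y^{(1)}}\bigl(y^{(2)}\big|y^{(1)}\bigr). \tag{13.30}$$
$$Q_{Y^{(2)}|X}\bigl(y^{(2)}\big|x\bigr) = \sum_{y^{(1)}} Q_{Y^{(1)}|X}\bigl(y^{(1)}\big|x\bigr)\, \tilde Q_{Y^{(2)}|Y^{(1)}}\bigl(y^{(2)}\big|y^{(1)}\bigr). \tag{13.31}$$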
(13.35)
(13.36)
(1)
+ V,
(13.37)
we see that conditionally on $X = x$, $\tilde Y^{(2)}$ has the same distribution as $Y^{(2)}$
(both are conditionally Gaussian with mean $x$ and variance $\sigma_{(2)}^2$), but $\tilde Y^{(2)}$ depends
only on $Y^{(1)}$ and not on $X$, i.e., we have a Markov structure! This shows
that the Gaussian BC is stochastically degraded and has therefore the same
capacity region as the physically degraded Gaussian BC
$$Y^{(1)} = x + Z^{(1)}, \tag{13.38}$$
$$\tilde Y^{(2)} = x + Z^{(1)} + V. \tag{13.39}$$
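Note that the two Gaussian BCs are not the same! In the original BC (13.32)–(13.33), we have $Z^{(1)} \perp\!\!\!\perp Z^{(2)}$, but in (13.38)–(13.39), $Z^{(1)} \not\perp\!\!\!\perp \tilde Z^{(2)}$ because
$\tilde Z^{(2)} = Z^{(1)} + V$, i.e.,
$$\mathsf{Cov}\bigl[Z^{(1)}, \tilde Z^{(2)}\bigr] = \mathsf E\bigl[Z^{(1)}\bigl(Z^{(1)} + V\bigr)\bigr] \tag{13.40}$$
$$= \mathsf E\Bigl[\bigl(Z^{(1)}\bigr)^2\Bigr] + \mathsf E\bigl[Z^{(1)}\bigr]\,\mathsf E[V] \tag{13.41}$$
$$= \sigma_{(1)}^2 \ne 0. \tag{13.42}$$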
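The covariance computation can be checked with a small Monte Carlo experiment. The sketch below assumes zero-mean noise with variances $\sigma_{(1)}^2 = 1$ and $\sigma_{(2)}^2 = 4$ (hypothetical values) and an independent $V$ with variance $\sigma_{(2)}^2 - \sigma_{(1)}^2$; the function name is illustrative.

```python
import math
import random


def simulate_degraded_noise(var1, var2, num_samples, seed=0):
    """Estimate Cov[Z1, Z1 + V] and Var[Z1 + V] for independent Z1 and V.

    Z1 ~ N(0, var1) and V ~ N(0, var2 - var1), so that Z1 + V ~ N(0, var2).
    """
    rng = random.Random(seed)
    std1, std_v = math.sqrt(var1), math.sqrt(var2 - var1)
    z1 = [rng.gauss(0.0, std1) for _ in range(num_samples)]
    z2 = [z + rng.gauss(0.0, std_v) for z in z1]
    mean1, mean2 = sum(z1) / num_samples, sum(z2) / num_samples
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(z1, z2)) / num_samples
    var_z2 = sum((b - mean2) ** 2 for b in z2) / num_samples
    return cov, var_z2


cov_estimate, var_estimate = simulate_degraded_noise(1.0, 4.0, 200_000)
print(f"Cov ~ {cov_estimate:.3f}, Var ~ {var_estimate:.3f}")
```

Up to sampling error, the estimated covariance should be close to $\sigma_{(1)}^2$ and the estimated variance of the constructed noise close to $\sigma_{(2)}^2$.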
13.3.2
1
2
2
2
p
1
2 + 2 + p
p
2 + 2 + p +
1
2
p
1
2 2 p
p
2 2 p
!
(13.46)
Note that this BC is not degraded because the matrix in (13.45) is not a
stochastic matrix (the rows do sum to 1, but there is a negative entry!).
However, it can be shown that (13.43) is satisfied for this choice.
Exercise 13.20. Finish the details in the derivation of the counterexample in
the proof of Lemma 13.19. Hint: This is not straightforward!
13.3.3
1 0
1 0
QY (1) X = 0 1 ,
QY (2) X = 12 21 .
(13.48)
1
2
1
2
1
2
1
2
(13.49)
QY (2) = (1 p, p),
(13.50)
and
I X; Y (1) = Hb (p + ) 2p + 2,
(13.51)
(2)
I X; Y
= Hb (p) 2p.
(13.52)
A quick calculation now shows that I X; Y (2) I X; Y (1) is convex in , i.e.,
the maximum is achieved for = 0 or = p. In the former the difference is 0,
in the latter the difference is Hb (p) Hb (2p) 2p, which in turn is convex in
p and always nonpositive. Hence, we see that
I X; Y (2) I X; Y (1) 0
(13.53)
proving that this BC has a more capable output.
On the other hand, if we choose U to be uniform over {0, 1} and
1
!
2 if x = 0, u = 0 or x = 1, u = 0,
1
1
0
2
2
QXU (xu) = 1 if x = 2, u = 1,
=
, (13.54)
0 0 1
0 otherwise
then
QY (1) U =
1
2
1
2
1
2
1
2
!
,
3
4
1
2
QY (2) U =
1
4
1
2
!
,
(13.55)
and
QY (1) =
1 1
,
,
2 2
QY (2) =
5 3
,
.
8 8
(13.56)
Hence,
I U ; Y (1) = 0 < I X; Y (2) ,
(13.57)
13.4
Superposition Coding
Next, we will present an achievable coding scheme that is based on a new idea:
superposition coding. In superposition coding, the codewords are arranged in
several separate clouds, see Figure 13.4. The decoder with a bad channel will
cloud center U
codeword X
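$$R^{(0)} < I\bigl(U; Y^{(2)}\bigr), \tag{13.58}$$
$$R^{(1)} < I\bigl(X; Y^{(1)} \big| U\bigr), \tag{13.59}$$
$$R^{(0)} + R^{(1)} < I\bigl(X; Y^{(1)}\bigr) \tag{13.60}$$
for some joint distribution $Q_{U,X}$ such that $U \multimap\!\!- X \multimap\!\!- \bigl(Y^{(1)}, Y^{(2)}\bigr)$.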
Note that as already mentioned in Section 10.2 we assume perfect time
synchronization and therefore can apply timesharing between two different
coding schemes. Hence, the capacity region must be convex. It can be shown
that the region defined by (13.58)(13.60) is already convex, i.e., here no
additional timesharing is necessary.
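To get a feeling for the region (13.58)–(13.60), it can be evaluated for a simple special case. The sketch below assumes a binary symmetric BC with crossover probabilities `p1` and `p2`, a uniform binary cloud center $U$, and $X = U \oplus V$ with $V \sim \text{Bernoulli}(\beta)$; this setting and all names are assumptions for illustration, and rates are in bits.

```python
import math


def hb(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def conv(a, b):
    """Crossover probability of two cascaded binary symmetric channels."""
    return a * (1.0 - b) + (1.0 - a) * b


def superposition_bounds(beta, p1, p2):
    """Right-hand sides of (13.58), (13.59), (13.60) for the assumed setting."""
    i_u_y2 = 1.0 - hb(conv(beta, p2))              # I(U; Y2)
    i_x_y1_given_u = hb(conv(beta, p1)) - hb(p1)   # I(X; Y1 | U)
    i_x_y1 = 1.0 - hb(p1)                          # I(X; Y1)
    return i_u_y2, i_x_y1_given_u, i_x_y1


for beta in (0.0, 0.1, 0.25, 0.5):
    r0, r1, total = superposition_bounds(beta, p1=0.1, p2=0.2)
    print(f"beta={beta:.2f}: R0 < {r0:.3f}, R1 < {r1:.3f}, R0+R1 < {total:.3f}")
```

For $\beta = 0$ all rate goes to the common message, for $\beta = 1/2$ the cloud center carries no information and all rate goes to the private message of the better receiver.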
Proof: We prove Theorem 13.23 by creating a random coding scheme.
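1: Setup: Fix $R^{(0)}$, $R^{(1)}$, $Q_U$, $Q_{X|U}$, and some blocklength $n$.
2: Codebook Design: We generate $e^{nR^{(0)}}$ independent length-$n$ codewords
$$\mathbf U\bigl(m^{(0)}\bigr) \sim Q_U^n, \qquad m^{(0)} = 1, \ldots, e^{nR^{(0)}}. \tag{13.61}$$
For each $\mathbf U\bigl(m^{(0)}\bigr)$, we generate $e^{nR^{(1)}}$ independent length-$n$ codewords
$$\mathbf X\bigl(m^{(0)}, m^{(1)}\bigr) \sim Q_{X|U}^n\bigl(\cdot \,\big|\, \mathbf U\bigl(m^{(0)}\bigr)\bigr), \qquad m^{(1)} = 1, \ldots, e^{nR^{(1)}}. \tag{13.62}$$
We reveal both codebooks to encoder and decoders.
Note that $\mathbf U\bigl(m^{(0)}\bigr)$ represents the cloud center of the $m^{(0)}$th cloud, and
$\mathbf X\bigl(m^{(0)}, m^{(1)}\bigr)$ is the $m^{(1)}$th codeword of the $m^{(0)}$th cloud.
3: Encoder Design: To send message $\bigl(m^{(0)}, m^{(1)}\bigr)$, the encoder transmits the codeword $\mathbf X\bigl(m^{(0)}, m^{(1)}\bigr)$.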
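4: Decoder Design: Upon receiving $\mathbf Y^{(2)}$, decoder $\phi^{(2)}$ looks for an $\tilde m^{(0)}$ such that
$$\bigl(\mathbf U\bigl(\tilde m^{(0)}\bigr), \mathbf Y^{(2)}\bigr) \in \mathcal A_\epsilon^{(n)}\bigl(Q_{U,Y^{(2)}}\bigr). \tag{13.63}$$
If there is exactly one such $\tilde m^{(0)}$, the decoder $\phi^{(2)}$ puts out $\hat m^{(0)} \triangleq \tilde m^{(0)}$.
Otherwise it declares an error.
Upon receiving $\mathbf Y^{(1)}$, decoder $\phi^{(1)}$ looks for a pair $\bigl(\tilde m^{(0)}, \tilde m^{(1)}\bigr)$ such that
$$\bigl(\mathbf U\bigl(\tilde m^{(0)}\bigr), \mathbf X\bigl(\tilde m^{(0)}, \tilde m^{(1)}\bigr), \mathbf Y^{(1)}\bigr) \in \mathcal A_\epsilon^{(n)}\bigl(Q_{U,X,Y^{(1)}}\bigr). \tag{13.64}$$
If there is exactly one such pair $\bigl(\tilde m^{(0)}, \tilde m^{(1)}\bigr)$, the decoder $\phi^{(1)}$ puts out
$\bigl(\hat m^{(0)}, \hat m^{(1)}\bigr) \triangleq \bigl(\tilde m^{(0)}, \tilde m^{(1)}\bigr)$. Otherwise it declares an error.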
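5: Performance Analysis: We start with decoder $\phi^{(2)}$:
$$P_e^{(n),(2)} = \sum_{m^{(0)}=1}^{e^{nR^{(0)}}} \sum_{m^{(1)}=1}^{e^{nR^{(1)}}} \frac{1}{e^{n(R^{(0)}+R^{(1)})}} \Pr\Bigl(\text{error}^{(2)} \,\Big|\, \bigl(M^{(0)}, M^{(1)}\bigr) = \bigl(m^{(0)}, m^{(1)}\bigr)\Bigr). \tag{13.65}$$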
(13.66)
and, using the Union Bound, TB3, and TC1, we bound as follows:
Pr error(2) M (0) , M (1) = m(0) , m(1)
(0)
enR
[
(2) c
(2) (0)
(1)
Fm
m
,
m
(13.67)
= Pr Fm(0)
(0)
m
(0) =1
m
(0) 6=m(0)
(0)
c
(2)
Pr Fm(0) m(0) , m(1) +
nR
eX
m
(0) =1
m
(0) 6=m(0)
(2) (0)
(1)
Pr Fm
m
,
m
(13.68)
(0)
t n, , U Y
(2)
nR
eX
en(I(U ;Y
(2) )
(13.69)
m
(0) =1
m
(0) 6=m(0)
(0)
(2)
t n, , U Y (2) + enR en(I(U ;Y ))
(13.70)
(13.71)
and thus
Pe(n),(2)
(13.72)
(13.73)
nR
eX
(1)
nR
eX
m(0) =1 m(1) =1
en(R
(0)
+R(1) )
Pr error(1) M (0) , M (1) = m(0) , m(1)
(13.74)
and we define for each m(0) , m(1)
n
o
(1)
Fm(0) ,m(1) ,
U m(0) , X m(0) , m(1) , Y(1) A(n) QU,X,Y (1) .
(13.75)
c
[
(1)
(0) (1)
(1)
m , m
= Pr
Fm
Fm(0) ,m(1)
(0) ,m
(1)
(0)
(1)
,m
)
(m
(0)
(1)
(0)
(1)
,m
)6=(m ,m )
(m
(13.76)
(1)
c
(1)
Pr Fm(0) ,m(1) m(0) , m(1) +
(0)
nR
eX
(1)
nR
X
m
(0) =1
m
(0) 6=m(0)
m
(1) =1
nR
eX
m
(1) =1
m
(1) 6=m(1)
(0) (1)
(1)
Pr Fm(0) ,m
m
,
m
(1)
(0) (1)
(1)
Pr Fm
m
,
m
.
(0) ,m
(1)
(13.77)
The first term corresponds to the case where the jointly generated codewords
and the received sequence are not jointly typical. This can be bounded by $\delta_t$
as usual. In the second term, a wrong codeword inside the correct cloud
is decoded: for $\tilde m^{(1)} \ne m^{(1)}$,
(0) (1)
(1)
Pr Fm(0) ,m
m
,
m
(1)
X
QnU (u) QnXU (xu) QnY (1) U y(1) u
=
(13.78)

{z
}
(u,x,y(1) )
(n)
A
wrong codeword,
correct cloud!
(QU,X,Y (1) )
(u,x,y(1) )
(QU,X,Y (1) )
) (13.79)
(1) U )
(n)
A
(1)
= A(n)
QU,X,Y (1) en(H(U )+H(XU )+H(Y U ))
(1)
(1)
en(H(U,X,Y )+) en(H(U,X)+H(Y U ))
= en(H(Y
(1)
= en(I(X;Y U )) .
(13.80)
(13.81)
(13.82)
(13.83)
Here the most important step is (13.78), where we need to realize that
Y(1) is generated based on the transmitted X, but not on the wrong
codeword considered here. However, since we do consider the correct
cloud, the cloud center U is related to the received Y(1) .
In the third term, some codeword inside the wrong cloud is decoded: for
$\tilde m^{(0)} \ne m^{(0)}$ and any $\tilde m^{(1)}$,
(0) (1)
(1)
Pr Fm
m
,
m
(0) ,m
(1)
X
=
QnU (u) QnXU (xu) QnY (1) y(1)
(13.84)
(u,x,y(1) )
(QU,X,Y (1) )
(n)
A
(1)
A(n)
QU,X,Y (1) en(H(U )) en(H(XU )) en(H(Y ))
(13.85)
= en(H(Y
(1)
= en(I(U,X;Y ))
= en(I(X;Y
(13.86)
(13.87)
(13.88)
),
(1) )
(13.89)
(13.91)
(13.92)
(13.93)
nR(0)
e[
[
(0)
(1)
(1)
M
,
M
Fm(0) ,m
(1)
(1)
m
=1
(1)
(1)
m
6=m
(1)
enR
(1)
Fm
(0)
m
(0) =1
m
(0) 6=m(0)
c
i
(1)
E Pr Fm(0) ,m(1) M (0) , M (1)
(13.102)
+ E
(0)
enR
m
(0) =1
m
(0) 6=m(0)
(1)
(0)
(1)
Pr Fm
M
,
M
(0)
nR(1)
+ E
eX
m
(1) =1
m
(1) 6=m(1)
(0)
(1)
(1)
M
,
M
Pr Fm(0) ,m
(1)
(13.103)
(0)
(1)
(1)
(1)
t n, , U X Y (1) + enR en(I(U ;Y )) + enR en(I(X;Y U ))
(13.104)
(13.105)
(13.106)
(13.107)
Here in (13.102) the first event corresponds to the case that the correct codeword is not recognized, the first union of events corresponds to the case where
any codeword from a wrong cloud is (wrongly) recognized, and the second
union of events corresponds to the case where a wrong codeword from the correct cloud is recognized. Note that we have an inequality in front of (13.102)
because we only check whether the cloud center of a wrong cloud happens to
be typical with the received sequence, and do not bother to check whether or
not there actually exists a codeword in that wrong cloud that is jointly typical
with the cloud center and the received sequence (which is the reason why this
analysis leads to a weaker result).
Hence, we have shown that a rate triple $\bigl(R^{(0)}, R^{(1)}, 0\bigr)$ is achievable if
$$R^{(0)} < \min\bigl\{I\bigl(U; Y^{(1)}\bigr), I\bigl(U; Y^{(2)}\bigr)\bigr\}, \tag{13.108}$$
$$R^{(1)} < I\bigl(X; Y^{(1)} \big| U\bigr) \tag{13.109}$$
for some joint distribution $Q_{U,X}$ such that $U \multimap\!\!- X \multimap\!\!- \bigl(Y^{(1)}, Y^{(2)}\bigr)$.
For the case of less noisy BCs, this achievable region is identical to the
region given in Corollary 13.24. However, in general, (13.108)–(13.109) is
smaller than (13.58)–(13.60).
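The cloud structure of the random superposition codebook used in the proof can be sketched as follows. This is a toy illustration only: the function name, the tiny codebook sizes, and the example distributions are assumptions, and in the actual scheme the numbers of clouds and codewords grow exponentially in $n$.

```python
import random


def generate_superposition_codebook(num_clouds, words_per_cloud, n,
                                    q_u, q_x_given_u, seed=0):
    """Draw cloud centers U(m0) ~ Q_U^n and codewords X(m0, m1) ~ Q_{X|U}^n(.|U(m0)).

    q_u:          dict mapping symbol u to its probability Q_U(u).
    q_x_given_u:  dict mapping u to a dict {x: Q_{X|U}(x|u)}.
    """
    rng = random.Random(seed)
    u_symbols = list(q_u)
    u_weights = [q_u[u] for u in u_symbols]
    centers, codewords = [], []
    for _ in range(num_clouds):
        center = rng.choices(u_symbols, weights=u_weights, k=n)
        cloud = []
        for _ in range(words_per_cloud):
            # Each component X_k is drawn according to Q_{X|U}(.|u_k).
            word = [rng.choices(list(q_x_given_u[u]),
                                weights=list(q_x_given_u[u].values()))[0]
                    for u in center]
            cloud.append(word)
        centers.append(center)
        codewords.append(cloud)
    return centers, codewords


centers, codewords = generate_superposition_codebook(
    num_clouds=4, words_per_cloud=3, n=8,
    q_u={0: 0.5, 1: 0.5},
    q_x_given_u={0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},
)
print(centers[0], codewords[0][0])
```

The encoder then simply transmits `codewords[m0][m1]` (with zero-based indices) for the message pair $(m^{(0)}, m^{(1)})$.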
So far, all these results apply to the case when R(2) = 0. However, recalling the observation given in Theorem 13.8, we can generalize them: We can
convert some of the R(0) bits to R(2) bits, i.e., the original R(0) will become
R(0) + R(2) . This then gives the following first main result.
Theorem 13.26 (Achievability based on Superposition).
For a general BC, a rate triple R(0) , R(1) , R(2) is achievable if
(1)
(1)
R < I X; Y
U ,
(13.111)
(13.113)
(13.116)
(13.117)
QU (u)vu .
(13.118)
uU
From Carathéodory's Theorem (Theorem 1.20) it now follows that we can reduce the size of $\mathcal U$ to at most $|\mathcal X| + 2$ values (note that $\mathbf v$ contains $|\mathcal X| + 1$ components!) without changing $\mathbf v$, i.e., without changing the values of $I\bigl(U; Y^{(2)}\bigr)$
and $I\bigl(X; Y^{(1)} \big| U\bigr)$, and without changing the value of $Q_X(\cdot)$. Note that if $Q_X$
is fixed, also $I\bigl(X; Y^{(1)}\bigr)$ remains fixed. This proves the claim.
13.5
In [EG79], El Gamal derived the capacity region of BCs with a more capable
output. The main contribution was a new outer bound, as the inner bound
came from the already known superposition coding. In the following we will
now present a generalization of the main idea in [EG79] to general BCs.
$$R^{(0)} + R^{(2)} \le I\bigl(V, W; Y^{(2)}\bigr), \tag{13.121}$$
$$R^{(0)} + R^{(1)} + R^{(2)} \le I\bigl(U, W; Y^{(1)}\bigr) + I\bigl(V; Y^{(2)} \big| U, W\bigr), \tag{13.122}$$
(2)
(13.125)
(13.126)
(i)
(13.129)
k=1
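$$= \sum_{k=1}^{n} \Bigl[ H\bigl(Y_k^{(1)} \big| Y_1^{(1)}, \ldots, Y_{k-1}^{(1)}\bigr) - H\bigl(Y_k^{(1)} \big| M^{(0)}, Y_1^{(1)}, \ldots, Y_{k-1}^{(1)}\bigr) \Bigr] + n\epsilon_n^{(1)} \tag{13.130}$$
$$\le \sum_{k=1}^{n} \Bigl[ H\bigl(Y_k^{(1)}\bigr) - H\bigl(Y_k^{(1)} \big| M^{(0)}, Y_1^{(1)}, \ldots, Y_{k-1}^{(1)}, Y_{k+1}^{(2)}, \ldots, Y_n^{(2)}\bigr) \Bigr] + n\epsilon_n^{(1)} \tag{13.131}$$
$$= \sum_{k=1}^{n} \Bigl[ H\bigl(Y_k^{(1)}\bigr) - H\bigl(Y_k^{(1)} \big| W_k\bigr) \Bigr] + n\epsilon_n^{(1)} \tag{13.132}$$
$$= \sum_{k=1}^{n} I\bigl(W_k; Y_k^{(1)}\bigr) + n\epsilon_n^{(1)} \tag{13.133}$$
$$= n \sum_{k=1}^{n} \frac{1}{n}\, I\bigl(W_Z; Y_Z^{(1)} \big| Z = k\bigr) + n\epsilon_n^{(1)} \tag{13.134}$$
$$= n\, I\bigl(W_Z; Y_Z^{(1)} \big| Z\bigr) + n\epsilon_n^{(1)} \tag{13.135}$$
$$\le n\, I\bigl(Z, W_Z; Y_Z^{(1)}\bigr) + n\epsilon_n^{(1)}. \tag{13.136}$$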
Here, (13.129) follows from the Fano Inequality (Proposition 1.13); (13.131)
from conditioning that reduces entropy; and in (13.136) we move Z from the
conditioning into the main argument of the mutual information functional.
In a similar fashion, we bound
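$$nR^{(0)} = H\bigl(M^{(0)}\bigr) \tag{13.137}$$
$$= I\bigl(M^{(0)}; \mathbf Y^{(2)}\bigr) + H\bigl(M^{(0)} \big| \mathbf Y^{(2)}\bigr) \tag{13.138}$$
$$\le \sum_{k=1}^{n} I\bigl(M^{(0)}; Y_k^{(2)} \big| Y_{k+1}^{(2)}, \ldots, Y_n^{(2)}\bigr) + n\epsilon_n^{(2)} \tag{13.139}$$
$$= \sum_{k=1}^{n} \Bigl[ H\bigl(Y_k^{(2)} \big| Y_{k+1}^{(2)}, \ldots, Y_n^{(2)}\bigr) - H\bigl(Y_k^{(2)} \big| M^{(0)}, Y_{k+1}^{(2)}, \ldots, Y_n^{(2)}\bigr) \Bigr] + n\epsilon_n^{(2)} \tag{13.140}$$
$$\le \sum_{k=1}^{n} \Bigl[ H\bigl(Y_k^{(2)}\bigr) - H\bigl(Y_k^{(2)} \big| M^{(0)}, Y_1^{(1)}, \ldots, Y_{k-1}^{(1)}, Y_{k+1}^{(2)}, \ldots, Y_n^{(2)}\bigr) \Bigr] + n\epsilon_n^{(2)} \tag{13.141}$$
$$= \sum_{k=1}^{n} \Bigl[ H\bigl(Y_k^{(2)}\bigr) - H\bigl(Y_k^{(2)} \big| W_k\bigr) \Bigr] + n\epsilon_n^{(2)} \tag{13.142}$$
$$= \sum_{k=1}^{n} I\bigl(W_k; Y_k^{(2)}\bigr) + n\epsilon_n^{(2)} \tag{13.143}$$
$$= n\, I\bigl(W_Z; Y_Z^{(2)} \big| Z\bigr) + n\epsilon_n^{(2)} \tag{13.144}$$
$$\le n\, I\bigl(Z, W_Z; Y_Z^{(2)}\bigr) + n\epsilon_n^{(2)}, \tag{13.145}$$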
= H M (0) , M (1)
(13.146)
= I M (0) , M (1) ; Y
+ H M (0) , M (1) Y(1)
(13.147)
n
X
(1) (1)
(1)
(1)
(1)
(1)
k=1
+ nn(3)
(13.148)
n
X
(1)
(1)
(1)
(1)
(2)
H Yk
H Yk M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2)
k=1
+ nn(3)
n
X
(1)
(1)
=
H Yk
H Yk Wk , Uk + nn(3)
k=1
(13.149)
(13.150)
(13.151)
k=1
(1)
= n I WZ , UZ ; YZ Z + nn(3)
(1)
n I Z, WZ , UZ ; YZ
+ nn(3)
(13.152)
(13.153)
and similarly,
(2)
n R(0) + R(2) n I Z, WZ , VZ ; YZ
+ nn(4) .
(13.154)
Finally, we obtain
n R(0) + R(1) + R(2)
= H M (0) , M (1) , M (2)
= H M (0) , M (1) + H M (2) M (0) , M (1)
I M (0) , M (1) ; Y(1) + I M (2) ; Y(2) M (0) , M (1) + nn(5)
n
X
(1) (1)
(1)
=
I M (0) , M (1) ; Yk Y1 , . . . , Yk1
(13.155)
(13.156)
(13.157)
k=1
(2)
(2)
+ I M (2) ; Yk M (0) , M (1) , Yk+1 , . . . , Yn(2) + nn(5)
n
X
(1)
(1)
(1)
I M (0) , M (1) , Y1 , . . . , Yk1 ; Yk
(13.158)
k=1
(1)
(1)
(2)
(2)
+ I M (2) , Y1 , . . . , Yk1 ; Yk M (0) , M (1) , Yk+1 , . . . , Yn(2)
+ nn(5)
n
X
(1)
(1)
(2)
(1)
I M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2) ; Yk
=
(13.159)
k=1
(0)
(1)
(1)
I
M , M (1) , Y1 , . . . , Yk1
(1)
(1)
(2)
(2)
+ I Y1 , . . . , Yk1 ; Yk M (0) , M (1) , Yk+1 , . . . , Yn(2)
(2)
(1)
(1)
(2)
+ I M (2) ; Yk M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2)
(2)
(1)
Yk+1 , . . . , Yn(2) ; Yk
+ nn(5)
n
X
(1)
(1)
(2)
(1)
=
I M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2) ; Yk
(13.160)
k=1
(1)
(1)
(2)
(2)
+ I M (2) ; Yk M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2)
+ nn(5)
n
X
(1)
(2)
=
I Wk , Uk ; Yk
+ I Vk ; Yk Uk , Wk + nn(5)
(13.161)
(13.162)
k=1
(1)
(2)
= n I WZ , UZ ; YZ Z + n I VZ ; YZ UZ , WZ , Z + nn(5)
(1)
(2)
n I Z, WZ , UZ ; YZ
+ n I VZ ; YZ UZ , Z, WZ + nn(5) .
(13.163)
(13.164)
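$$\sum_{k=1}^{n} I\bigl(S_1^{k-1}; T_k \big| T_{k+1}^{n}, R\bigr) = \sum_{k=1}^{n} I\bigl(T_{k+1}^{n}; S_k \big| S_1^{k-1}, R\bigr). \tag{13.165}$$
$$I\bigl(S_1^{k-1}; T_k \big| T_{k+1}^{n}, R\bigr) = \sum_{j=1}^{k-1} I\bigl(S_j; T_k \big| S_1^{j-1}, T_{k+1}^{n}, R\bigr), \tag{13.166}$$
$$I\bigl(T_{k+1}^{n}; S_k \big| S_1^{k-1}, R\bigr) = \sum_{j=k+1}^{n} I\bigl(T_j; S_k \big| S_1^{k-1}, T_{j+1}^{n}, R\bigr). \tag{13.167}$$
Hence,
$$\sum_{k=1}^{n} I\bigl(S_1^{k-1}; T_k \big| T_{k+1}^{n}, R\bigr) = \sum_{k=1}^{n} \sum_{j=1}^{k-1} I\bigl(S_j; T_k \big| S_1^{j-1}, T_{k+1}^{n}, R\bigr) \tag{13.168}$$
and
$$\sum_{k=1}^{n} I\bigl(T_{k+1}^{n}; S_k \big| S_1^{k-1}, R\bigr) = \sum_{k=1}^{n} \sum_{j=k+1}^{n} I\bigl(S_k; T_j \big| S_1^{k-1}, T_{j+1}^{n}, R\bigr) \tag{13.169}$$
$$= \sum_{j=1}^{n} \sum_{k=1}^{j-1} I\bigl(S_k; T_j \big| S_1^{k-1}, T_{j+1}^{n}, R\bigr) \tag{13.170}$$
$$= \sum_{k=1}^{n} \sum_{j=1}^{k-1} I\bigl(S_j; T_k \big| S_1^{j-1}, T_{k+1}^{n}, R\bigr) \tag{13.171}$$
$$= \sum_{k=1}^{n} I\bigl(S_1^{k-1}; T_k \big| T_{k+1}^{n}, R\bigr). \tag{13.172}$$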
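The sum identity (13.165) can also be verified numerically for a randomly chosen joint distribution. The sketch below assumes $n = 3$, binary random variables, and a trivial (omitted) $R$; all helper names are illustrative.

```python
import itertools
import math
import random


def random_pmf(num_vars, seed=1):
    """Random joint PMF over num_vars binary variables."""
    rng = random.Random(seed)
    outcomes = list(itertools.product((0, 1), repeat=num_vars))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}


def entropy(pmf, idx):
    """Joint entropy (in nats) of the variables with the given indices."""
    marginal = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in idx)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)


def cond_mi(pmf, a, b, c):
    """Conditional mutual information I(A; B | C) for index lists a, b, c."""
    a, b, c = list(a), list(b), list(c)
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))


def csiszar_sides(pmf, n):
    """Both sides of (13.165); indices 0..n-1 are S_1..S_n, n..2n-1 are T_1..T_n."""
    s, t = list(range(n)), list(range(n, 2 * n))
    lhs = sum(cond_mi(pmf, s[:k], [t[k]], t[k + 1:]) for k in range(n))
    rhs = sum(cond_mi(pmf, t[k + 1:], [s[k]], s[:k]) for k in range(n))
    return lhs, rhs


lhs, rhs = csiszar_sides(random_pmf(6), n=3)
print(lhs, rhs)
```

Both sides agree up to floating-point error, as the double-sum argument above predicts.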
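Finally, by defining
$$W \triangleq (Z, W_Z), \qquad U \triangleq U_Z, \qquad V \triangleq V_Z, \tag{13.173}$$
$$Y^{(1)} \triangleq Y_Z^{(1)}, \qquad Y^{(2)} \triangleq Y_Z^{(2)}, \tag{13.174}$$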
(2)
(13.175)
Remark 13.29. If we specialize Theorem 13.27 for the case R(0) = 0, it looks
as follows:
$$R^{(1)} \le I\bigl(U, W; Y^{(1)}\bigr), \tag{13.176}$$
$$R^{(2)} \le I\bigl(V, W; Y^{(2)}\bigr), \tag{13.177}$$
(13.178)
(13.179)
It has been shown in [NW08] that this outer bound actually is identical to
$$R^{(1)} \le I\bigl(U; Y^{(1)}\bigr), \tag{13.180}$$
$$R^{(2)} \le I\bigl(V; Y^{(2)}\bigr), \tag{13.181}$$
(13.182)
(13.183)
without the need of the auxiliary RV W . This shows how difficult it actually
is to really understand these bounds or even to evaluate them! Note that the
latter outer bound was implicitly given in [EG79] already, although it was not
explicitly stated because the paper directly specialized it for the case of BCs
with a more capable output.
13.6
(1)
(1)
R I X; Y
U ,
(13.185)
Proof: The achievability follows directly from superposition coding, Theorem 13.26. For the converse, consider Theorem 13.27 and choose $V = 0$ and $X = U$:
$$R^{(0)} \le \min\bigl\{I\bigl(W; Y^{(1)}\bigr), I\bigl(W; Y^{(2)}\bigr)\bigr\}, \tag{13.187}$$
$$R^{(0)} + R^{(1)} \le I\bigl(X, W; Y^{(1)}\bigr), \tag{13.188}$$
$$R^{(0)} + R^{(2)} \le I\bigl(W; Y^{(2)}\bigr), \tag{13.189}$$
(13.190)
(13.191)
(13.192)
R(0) I W ; Y (2)
(13.193)
(13.194)
(0)
(1)
(2)
(2)
(1)
U ,
(13.196)
R + R + R I U; Y
+ I X; Y
(13.198)
(13.199)
R(1) = R(0) + I U ; Y (2) + I X; Y (1) U
I X; Y (1) U
R(0)
I U ; Y (2)
(13.200)
(13.201)
(13.202)
(0)
(2)
I
U
;
Y
,
(13.205)
(13.206)
R(1) I X; Y (1) U ,
(13.208)
R(0) I W ; Y (1) ,
(0)
(2)
,
(13.209)
R I W;Y
(0)
(1)
(1)
R + R I X; Y
,
(13.210)
R(0) I U ; Y (2) ,
R(0) + R(1) I U ; Y (2) + I X; Y (1) U
(13.214)
(13.215)
R(0) I U ; Y (1) ,
R(0) I U ; Y (2) ,
R(1) I X; Y (1) U ,
(13.216)
I X; Y (1) = I U, X; Y (1)
= I U ; Y (1) + I X; Y (1) U ,
(13.220)
(13.217)
(13.218)
(13.219)
(13.221)
(13.222)
(13.223)
and
(
R(1) I X; Y (1) U ,
R(0) + R(1) I U ; Y (1) + I X; Y (1) U .
(13.224)
(13.225)
13.7
Note that by its implicit construction, superposition coding works well if one
user has a much better channel than the other and we have only a common
message for the worse user (or the BC is degraded). However, if both users
have similar channels and/or we have no common message, but only private
messages, binning turns out to be better!
The following scheme is closely related to Gelfand–Pinsker coding, where we have
noncausal side-information at the transmitter. We will assume that there is
no common message.
1: Setup: We need two auxiliary random variables $U^{(1)}$ and $U^{(2)}$ with
some alphabets $\mathcal U^{(1)}$ and $\mathcal U^{(2)}$. Then we choose a PMF $Q_{U^{(1)},U^{(2)}}$ and
compute its marginals $Q_{U^{(1)}}$ and $Q_{U^{(2)}}$. We further choose a function
$f\colon \mathcal U^{(1)} \times \mathcal U^{(2)} \to \mathcal X$ that will be used in the encoder to create the channel
input sequence.
Then we fix some rates $R^{(1)}$, $R^{(2)}$, $\tilde R^{(1)}$ and $\tilde R^{(2)}$, and some blocklength $n$.
2: Codebook Design: We generate $e^{nR^{(i)}} \cdot e^{n\tilde R^{(i)}}$ codewords $\mathbf U^{(i)}\bigl(m^{(i)}, v^{(i)}\bigr)$
of length $n$, $m^{(i)} = 1, \ldots, e^{nR^{(i)}}$ and $v^{(i)} = 1, \ldots, e^{n\tilde R^{(i)}}$, by choosing
all $n \cdot e^{n(R^{(i)} + \tilde R^{(i)})}$ components $U_k^{(i)}\bigl(m^{(i)}, v^{(i)}\bigr)$ independently at random
according to $Q_{U^{(i)}}$, for both $i = 1, 2$. Here $m^{(i)}$ describes the bin and $v^{(i)}$
describes the index of the codeword in this bin.
3: Encoder Design: For a message pair $\bigl(m^{(1)}, m^{(2)}\bigr)$, the encoder tries to
find a pair $\bigl(v^{(1)}, v^{(2)}\bigr)$ such that
$$\Bigl(\mathbf U^{(1)}\bigl(m^{(1)}, v^{(1)}\bigr), \mathbf U^{(2)}\bigl(m^{(2)}, v^{(2)}\bigr)\Bigr) \in \mathcal A_\epsilon^{(n)}\bigl(Q_{U^{(1)},U^{(2)}}\bigr). \tag{13.226}$$
If it finds several possible choices, it picks one. If it finds none, it chooses
$\bigl(v^{(1)}, v^{(2)}\bigr) = (1, 1)$. Note that these choices can be decided in advance,
i.e., $\bigl(v^{(1)}, v^{(2)}\bigr)$ becomes a function of $\bigl(m^{(1)}, m^{(2)}\bigr)$. However, also note
that the choice which codeword is picked in bin $m^{(1)}$ also depends on
$m^{(2)}$ and vice versa.
(13.227)
R
enY
(2)
R
enY
h
(2) (2) (2)
(1)
(1) (1)
Pr U m , v
,U m ,v
v (1) =1 v (2) =1
R
enY
i
(2)
R
enY
v (1) =1 v (2) =1
h
1 Pr U(1) m(1) , v (1) , U(2) m(2) , v (2)
A(n) QU (1) ,U (2)
(1)
<
R
enY
i
(13.231)
(2)
R
enY
v (1) =1 v (2) =1
(13.230)
1 en(I(U
(1) ;U (2) )+
= 1e
(1)
(2)
(1) (2)
exp en(R +R ) en(I(U ;U )+)
(1)
(2)
(1) (2)
= exp en(R +R I(U ;U )) .
(13.232)
(13.233)
(13.234)
(13.235)
So, as long as
(1) + R
(2) > I U (1) ; U (2) +
R
(13.236)
Pr
m
(i) ,
v (i)
m
(i) 6=m(i)
X
m
(i) ,
v (i)
m
(i) 6=m(i)
n
o
U(i) m
(i) , v(i) , Y(i) A(n)
Q
(i)
(i)
U ,Y
(13.238)
h
i
(i)
(n)
(i)
(i) (i)
A
QU (i) ,Y (i)
Pr U m
, v , Y
(13.239)
n(I(U (i) ;Y (i) ))
(13.240)
m
(i) ,
v (i)
m
(i) 6=m(i)
= enR
(i)
en(R
(i)
(i)
(i)
(i)
(13.241)
(13.242)
So, as long as
(i) < I U (i) ; Y (i)
R(i) + R
(13.243)
(13.244)
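Note that here we ignore the possibility that there might exist another $\tilde v^{(i)}$ such that
$$\Bigl(\mathbf U^{(i)}\bigl(m^{(i)}, \tilde v^{(i)}\bigr), \mathbf Y^{(i)}\Bigr) \in \mathcal A_\epsilon^{(n)}\bigl(Q_{U^{(i)},Y^{(i)}}\bigr), \tag{13.245}$$
i.e., our bound on the error probability is too big. This derivation
is identical to the Case 4 in Section 12.2.4. There it is shown that
this probability is upper-bounded by $\delta_t\bigl(n, \epsilon, \mathcal U^{(1)} \times \mathcal U^{(2)} \times \mathcal X \times \mathcal Y^{(i)}\bigr)$.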
(13.246)
(13.247)
(13.248)
(13.250)
(13.251)
(13.252)
(13.253)
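$$R^{(1)} < I\bigl(U^{(1)}; Y^{(1)}\bigr), \tag{13.253}$$
$$R^{(2)} < I\bigl(U^{(2)}; Y^{(2)}\bigr), \tag{13.254}$$
$$R^{(1)} + R^{(2)} < I\bigl(U^{(1)}; Y^{(1)}\bigr) + I\bigl(U^{(2)}; Y^{(2)}\bigr) - I\bigl(U^{(1)}; U^{(2)}\bigr) \tag{13.255}$$
for some $Q_{U^{(1)},U^{(2)}} \cdot Q_{X|U^{(1)},U^{(2)}} \cdot Q_{Y^{(1)},Y^{(2)}|X}$.
Note that an optimal choice for $Q_{X|U^{(1)},U^{(2)}}$ actually degenerates into
a function $f\colon \mathcal U^{(1)} \times \mathcal U^{(2)} \to \mathcal X$.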
If we fix a certain $Q_{U^{(1)},U^{(2)}} \cdot Q_{X|U^{(1)},U^{(2)}}$ and a given BC $Q_{Y^{(1)},Y^{(2)}|X}$, then
the achievable rate region of Theorem 13.33 is given by a pentagon shown in
Figure 13.6. In this pentagon, points A and B are of particular interest: For
example in point A we have
$$R^{(1)} = I\bigl(U^{(1)}; Y^{(1)}\bigr) - I\bigl(U^{(1)}; U^{(2)}\bigr) \tag{13.256}$$
which is identical to the Gelfand–Pinsker rate $R_{\text{GP}}$ if we consider $\mathbf U^{(2)}$ as
interference that is noncausally known to the transmitter (compare with Definition 12.4)! Hence, we can actually use Gelfand–Pinsker coding here. A
corresponding encoder is shown in Figure 13.7. Note that in general the decoders will only be able to decode their own message.
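The two corner points of the pentagon can be computed directly from the three mutual information values. The sketch below uses hypothetical numbers and illustrative function names; it also assumes that point B is the mirror image of point A (the roles of the two users exchanged) and that $I(U^{(1)};U^{(2)})$ does not exceed either of the two single-user terms.

```python
def marton_corner_points(i_u1_y1, i_u2_y2, i_u1_u2):
    """Corner points A and B of the pentagon (13.253)-(13.255).

    Point A: R1 = I(U1;Y1) - I(U1;U2), R2 = I(U2;Y2)   (see (13.256)).
    Point B: R1 = I(U1;Y1), R2 = I(U2;Y2) - I(U1;U2)   (assumed by symmetry).
    """
    point_a = (i_u1_y1 - i_u1_u2, i_u2_y2)
    point_b = (i_u1_y1, i_u2_y2 - i_u1_u2)
    return point_a, point_b


def in_marton_region(r1, r2, i_u1_y1, i_u2_y2, i_u1_u2):
    """Check the three strict inequalities (13.253)-(13.255)."""
    return (0 <= r1 < i_u1_y1 and 0 <= r2 < i_u2_y2
            and r1 + r2 < i_u1_y1 + i_u2_y2 - i_u1_u2)


# Hypothetical mutual information values (in nats or bits, consistently).
point_a, point_b = marton_corner_points(0.6, 0.5, 0.2)
print(point_a, point_b)
print(in_marton_region(0.3, 0.4, 0.6, 0.5, 0.2))
```

Both corner points lie on the sum-rate line, so they achieve the same sum rate but split it differently between the two users.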
Figure 13.6 (axes $R^{(1)}$ and $R^{(2)}$): pentagon bounded by $R^{(1)} = I\bigl(U^{(1)}; Y^{(1)}\bigr)$, $R^{(2)} = I\bigl(U^{(2)}; Y^{(2)}\bigr)$, and the sum-rate line $R^{(2)} = -R^{(1)} + I\bigl(U^{(1)}; Y^{(1)}\bigr) + I\bigl(U^{(2)}; Y^{(2)}\bigr) - I\bigl(U^{(1)}; U^{(2)}\bigr)$, with the corner points A and B.
Figure 13.7: BC encoder based on a Gelfand–Pinsker encoder.
13.8
function f : T U (1) U (2) X . Then we fix some rates R(0) , R(1) , R(2) ,
(1) and R
(2) , and some blocklength n.
R
(0)
2: Codebook Design: We generate enR codewords T m(0) QnT ,
(0)
m(0) = 1, . . . , enR (the cloud centers). For each T m(0) , we use the
(i)
code construction of Section 13.7 with binning, i.e., we generate enR
(i)
enR lengthn codewords U(i) m(0) , m(i) , v (i) QnU (i) T T m(0) ,
(i)
(the codewords
(0) , m(1) , m(2) , the encoder
3: Encoder Design: For a message
triple
m
tries to find a pair v (1) , v (2) such that
T m(0) , U(1) m(0) , m(1) , v (1) , U(2) m(0) , m(2) , v (2)
A(n)
QT,U (1) ,U (2) .
(13.257)
QT,U (i) ,Y (i) .
T m
(0) , U(i) m
(0) , m
(i) , v(i) , Y(i) A(n)
(13.259)
If there is a unique pair m
(0) , m
(i) , then the decoder (i) puts out
m
(0) , m
(i) , m
(0) , m
(i) . If there are several choices for m
(0) , m
(i)
or none, the decoder declares an error. Note that the decoder does
not
care if there are several possible v(i) for a unique pair m
(0) , m
(i) .
5: Performance Analysis: Using our standard analysis technique, we find
the following conditions:
1. The encoder cannot find appropriate codewords (corresponds to
Case 1 in binning, Section 13.7):
$$\tilde R^{(1)} + \tilde R^{(2)} > I\bigl(U^{(1)}; U^{(2)} \big| T\bigr). \tag{13.260}$$
(13.261)
(13.262)
Together with the nonnegativity constraints, this yields the following ten
conditions:
0
0 1 0 1
I U (1) ; U (2) T
I U (1) ; Y (1) T
1
1
0
0
0
0
0
1
1
1
I T, U (1) ; Y (1)
1
1
0
0
(1)
R
1
(2) ; Y (2)
0
0
1
1
I
T,
U
(1)
(13.263)
R
1 0
0
0
0 (2)
0
R
0 1 0
0
0
0
(2)
0 1 0
0
0
0
0
1
0
0
0
0
0
0 1
0
We now apply FourierMotzkin elimination (see Section 1.3) to eliminate
(1) and R
(2) . We start with R
(1) :
R
0
1
0 1
I U (1) ; Y (1) T I U (1) ; U (2) T
1
I T, U (1) ; Y (1) I U (1) ; U (2) T
1
0 1
0
(1) ; Y (1) T
1
0
0
I
U
(0)
(1)
(1)
1
1
0
0
I T, U ; Y
0
(1)
(2)
(2)
0
1
1
T
I
U
;
Y
R
(2)
(2)
(2)
1
0
1
1
I T, U ; Y
1 0
(2)
0
0
0
R
0 1 0
0
0
0 1 0
0
0
0
0 1
0
(13.264)
(2) :
Next we remove R
0
1
1
1
1
1
1
1
1
2
1
1
0
R(0)
0
1
0
1 R(1)
1
(2)
0
1
0
R
1
0
1
1 0
0
0 1 0
0
0 1
U (1) ; Y (1) T I
U (1) ; Y (1) T I
T, U (1) ; Y (1) I
T, U (1) ; Y (1) I
I
I
I
I
U (1) ; U (2) T + I
U (1) ; U (2) T + I
U (1) ; U (2) T + I
U (1) ; U (2) T + I
U (2) ; Y (2) T
T, U (2) ; Y (2)
U (1) ; Y (1) T
T, U (1) ; Y (1)
0
0
0
U (2) ; Y (2) T
T, U (2) ; Y (2)
U (2) ; Y (2) T
T, U (2) ; Y (2)
(13.265)
following
(13.266)
(13.267)
(13.268)
(13.269)
(13.270)
(13.271)
(13.272)
(13.273)
Note that
(13.268) + (13.273) = (13.271) + (13.272),
so one of these four constraints is redundant. We choose to ignore
(13.273). The remaining constraints can be simplified further if we take
Theorem 13.8 into account: We can replace R(0) by R(0) (1) (2) ,
R(1) by R(1) + (1) , and R(2) by R(2) + (2) , where we need to add the
constraints
(1) + (2) R(0) ,
(1)
(2)
0,
0.
(13.274)
(13.275)
(13.276)
Before we write down the new inequality system and again apply the
FourierMotzkin elimination to eliminate (1) and (2) , we introduce
(13.277)
(13.278)
(13.279)
(13.280)
(13.281)
0
1
0
1
0
IU (1)
0
1
0
1
IU (2)
1
1
1
1
IU (1) + IU (2) I
1
0
0 1
IY (1) + IU (1)
(0)
1
0
1
1
0
R
I
+
I
(2)
(2)
Y
U
(1)
1
1
1
0
0 R IY (1) + IU (1) + IU (2) I
(2)
1
1
0
0
0
1
1 (1)
0
1 0
(2)
1 0
0
0
0
0
0
0
0
0 1 0
0 1 0
0
0
0
0
0
1
0
0
0
0
0
0
0 1
In a first step, we
1
1
1
0
1
0
1
1
2
0
1
1
0
0
1
1 0
0
0
0
1
1
1
0
1
1
1
1
1
1
1 0
0
0 1 0
0
0 1
0
0
0
eliminate (1) :
IU (1)
0
I
+
I
+
2I
I
1
Y (2)
U (1)
U (2)
I
+
I
I
1
(1)
(2)
U
U
I
1
(2) + IU (2)
(0)
0
1 R
(1)
I
1
R
(2)
,
(2)
IY (1) + IU (1)
1 R
I
(2)
0
Y (1) + IU (1) + IU (2) I
0
0
0
0
0
0
0
1
(13.283)
2
2
2
IY (1) + IY (2) + 2IU (1) + 2IU (2) I
1
2
IY (2) + IU (1) + 2IU (2) I
2
1
IY (1) + 2IU (1) + IU (2) I
1
1
IU (1) + IU (2) I
1
I (1) + I (2) + I (1) + I (2)
1
1
Y
Y
U
U
0
1
IY (2) + IU (2)
1
0
IY (1) + IU (1)
(0)
1 0
0 R
0
(1)
. (13.284)
1
1
R
I
+
I
+
I
(1)
(1)
(2)
Y
U
U
(2)
0
1 R
IU (2)
0
1
1
IY (2) + IU (1) + IU (2)
1
0
IU (1)
0
1
1
IY (1) + IU (1) + IU (2) I
1
1
IY (2) + IU (1) + IU (2) I
1
1 0
0
0
0
1
0
0
0
0 1
0
Luckily, we can reduce these 17 inequalities further using the following
observations:
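8 equals 15 $\Longrightarrow$ drop 8
6 is implied by 10 $\Longrightarrow$ drop 6
7 is implied by 12 $\Longrightarrow$ drop 7
5 is implied by 11 $\Longrightarrow$ drop 5
9 is implied by 13 $\Longrightarrow$ drop 9
11 is implied by 14 $\Longrightarrow$ drop 11
10 + 14 equals 2 $\Longrightarrow$ drop 2
12 + 13 equals 3 $\Longrightarrow$ drop 3
1 + 4 equals 10 + 12 + 13 + 14 $\Longrightarrow$ drop 1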
This leaves us with five inequalities plus the obvious three nonnegativity
constraints. Writing 13 and 14 in a combined fashion yields the final
result given in Theorem 13.34.
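The mechanics of Fourier–Motzkin elimination are easier to follow on a smaller system. The following sketch applies it to the binning scheme without common message from Section 13.7: the variables are $\bigl(R^{(1)}, R^{(2)}, \tilde R^{(1)}, \tilde R^{(2)}\bigr)$ and the auxiliary rates are eliminated. It is a simplified illustration and not the computation of the text: the inequalities are treated as non-strict, slack terms and the nonnegativity of $R^{(1)}, R^{(2)}$ are ignored, the mutual information values are hypothetical, and all names are chosen for this example.

```python
from fractions import Fraction
from itertools import product


def eliminate(rows, j):
    """One Fourier-Motzkin step: eliminate variable j.

    Each row is (coeffs, rhs) and represents sum_i coeffs[i] * x_i <= rhs.
    """
    pos = [r for r in rows if r[0][j] > 0]
    neg = [r for r in rows if r[0][j] < 0]
    out = [r for r in rows if r[0][j] == 0]
    for (a, b), (c, d) in product(pos, neg):
        s, t = -c[j], a[j]  # positive multipliers with s*a[j] + t*c[j] = 0
        coeffs = tuple(s * ai + t * ci for ai, ci in zip(a, c))
        out.append((coeffs, s * b + t * d))
    # Remove the (now all-zero) column j.
    return [(tuple(v for i, v in enumerate(co) if i != j), rhs) for co, rhs in out]


# Hypothetical values for I(U1;Y1), I(U2;Y2), I(U1;U2).
I1, I2, I12 = Fraction(3, 5), Fraction(1, 2), Fraction(1, 5)

system = [
    ((1, 0, 1, 0), I1),       # R1 + R1~ <= I(U1;Y1)
    ((0, 1, 0, 1), I2),       # R2 + R2~ <= I(U2;Y2)
    ((0, 0, -1, -1), -I12),   # R1~ + R2~ >= I(U1;U2)
    ((0, 0, -1, 0), 0),       # R1~ >= 0
    ((0, 0, 0, -1), 0),       # R2~ >= 0
]

reduced = eliminate(eliminate(system, 3), 2)  # first R2~, then R1~
for coeffs, rhs in reduced:
    print(coeffs, "<=", rhs)
```

The three remaining inequalities are the two single-user bounds and the sum-rate bound with the penalty term $I(U^{(1)};U^{(2)})$, i.e., the shape of the region in Theorem 13.33.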
fies
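$$R^{(1)} < I\bigl(U^{(1)}; Y^{(1)} \big| T\bigr), \tag{13.285}$$
$$R^{(2)} < I\bigl(U^{(2)}; Y^{(2)} \big| T\bigr), \tag{13.286}$$
$$R^{(1)} + R^{(2)} < I\bigl(U^{(1)}; Y^{(1)} \big| T\bigr) + I\bigl(U^{(2)}; Y^{(2)} \big| T\bigr) - I\bigl(U^{(1)}; U^{(2)} \big| T\bigr), \tag{13.287}$$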
n
o
(0)
(1)
(2)
(1)
(2)
(1)
(1)
T
R
+
R
+
R
<
min
I
T
;
Y
,
I
T
;
Y
+
I
U
;
Y
13.9
There are many known outer bounds, but none has been proven to be tight
for all BCs. For all cases where the capacity region is known² the outer bound
is taken from Theorem 13.27. In the following we quickly review a few more
outer bounds.
Note that the derivations of all outer bounds are based on the Fano Inequality, the Data Processing Inequality, additionally given information (so-called
genie-aided bounds), additionally allowed cooperation, or similar.
The simplest outer bound is by Cover [Cov72]: Any rate triple $\bigl(R^{(0)}, R^{(1)}, R^{(2)}\bigr)$ not satisfying
$$R^{(0)} + R^{(1)} \le I\bigl(X; Y^{(1)}\bigr), \tag{13.289}$$
$$R^{(0)} + R^{(2)} \le I\bigl(X; Y^{(2)}\bigr), \tag{13.290}$$
$$R^{(0)} + R^{(1)} + R^{(2)} \le I\bigl(X; Y^{(1)}, Y^{(2)}\bigr) \tag{13.291}$$
cannot be achievable. The proof is quite straightforward: Every user by itself
cannot transmit more than its capacity, and the sum-rate bound assumes that
the receivers cooperate. Another proof is based on the Cut-Set Bound, see
Chapter 15.
Sato improved on the sum-rate bound (13.291):
$$R^{(0)} + R^{(1)} + R^{(2)} \le \min \max_{Q_X} I\bigl(X; Y^{(1)}, Y^{(2)}\bigr), \tag{13.292}$$
where the min is over all $Q_{Y^{(1)},Y^{(2)}|X}$ having the same conditional marginals
$Q_{Y^{(1)}|X}$ and $Q_{Y^{(2)}|X}$ as the BC.
2
Apart from the special cases discussed in Section 13.6, there is, e.g., also the case of
the deterministic BC where y (1) and y (2) are deterministic functions of x.
Another famous
outer bound is
pair R(1) , R(2) not satisfying
R(1) I
R(2) I
R(1) + R(2) I
(13.293)
(13.294)
(13.295)
(1)
(1)
(1)
I
U
;
Y
,
(13.296)
R(2) I U (2) ; Y (2) ,
(13.297)
n
o
(0)
(1)
(2)
,
(13.299)
R
min
I
T
;
Y
,
I
T
;
Y
o
(0)
(1)
(1)
(1)
(1)
(2)
R
+
R
I
U
;
Y
T
+
min
I
T
;
Y
,
I
T
;
Y
,
(13.300)
o
n
(2)
(1)
(0)
(2)
(2)
(2)
,
,
I
T
;
Y
T
+
min
I
T
;
Y
R
+
R
I
U
;
Y
(13.301)
o
(1)
(2)
+
min
I
T
;
Y
,
I
T
;
Y
,
(13.302)
o
(13.303)
+ min I T ; Y (1) , I T ; Y (2)
for some $Q_{U^{(1)},U^{(2)}} \cdot Q_{X|U^{(1)},U^{(2)}} \cdot Q_{Y^{(1)},Y^{(2)}|X}$ cannot be achievable. Note that
this bound is contained in the outer bound of Theorem 13.27, but it is not
clear whether it is strictly smaller or not.
13.10
Gaussian BC
We have already seen the definition of the most typical Gaussian BC in Example 13.17. We now generalize this definition to the general Gaussian BC.
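$$Y^{(1)} = x + Z^{(1)}, \tag{13.304}$$
$$Y^{(2)} = x + Z^{(2)}, \tag{13.305}$$
where
$$\bigl(Z^{(1)}, Z^{(2)}\bigr)^{\mathsf T} \sim \mathcal N\bigl(\mathbf 0, \mathsf K_{ZZ}\bigr) \tag{13.306}$$
$$\mathsf K_{ZZ} = \begin{pmatrix} \sigma_{(1)}^2 & \sigma_{(12)} \\ \sigma_{(12)} & \sigma_{(2)}^2 \end{pmatrix} \tag{13.307}$$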
(13.308)
(13.309)
We can easily repeat the argument of Example 13.17 to show that also
this channel is stochastically degraded. Actually, the derivation is identical
because we only need to worry about the conditional distribution of Y (i) given
X, i.e., the correlation between Z (1) and Z (2) is completely irrelevant.
We now state the capacity region of this Gaussian BC.
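Theorem 13.35 (Capacity Region of Gaussian BC).
The Gaussian BC capacity region is given by
$$R^{(0)} + R^{(2)} \le \frac{1}{2} \log\left(1 + \frac{(1-\beta)E}{\beta E + \sigma_{(2)}^2}\right), \tag{13.310}$$
$$R^{(1)} \le \frac{1}{2} \log\left(1 + \frac{\beta E}{\sigma_{(1)}^2}\right) \tag{13.311}$$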
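[Figure: capacity region of the Gaussian BC with horizontal axis $R^{(1)}$; the boundary is traced from $\beta = 0$ to $\beta = 1$ and compared with the timesharing line; the axis intercepts are $\frac{1}{2}\log\bigl(1 + E/\sigma_{(2)}^2\bigr)$ and $\frac{1}{2}\log\bigl(1 + E/\sigma_{(1)}^2\bigr)$.]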
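$$X' \sim \mathcal N(0, \beta E), \tag{13.316}$$
$$X' \perp\!\!\!\perp U, \tag{13.317}$$
$$X = U + X'. \tag{13.318}$$
Note that by this choice we have $X \sim \mathcal N(0, E)$ ($U$ is the cloud center and $X'$
the codeword in the cloud). Now we evaluate:
$$I\bigl(U; Y^{(2)}\bigr) = I\bigl(U; U + \underbrace{X' + Z^{(2)}}_{\text{noise}}\bigr) \tag{13.319}$$
$$= \frac{1}{2} \log\left(1 + \frac{(1-\beta)E}{\beta E + \sigma_{(2)}^2}\right), \tag{13.320}$$
$$I\bigl(X; Y^{(1)} \big| U\bigr) = I\bigl(U + X'; U + X' + Z^{(1)} \big| U\bigr) \tag{13.321}$$
$$= I\bigl(X'; X' + Z^{(1)} \big| U\bigr) \tag{13.322}$$
$$= I\bigl(X'; X' + Z^{(1)}\bigr) \tag{13.323}$$
$$= \frac{1}{2} \log\left(1 + \frac{\beta E}{\sigma_{(1)}^2}\right). \tag{13.324}$$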
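The two rate expressions evaluated above can be compared numerically with simple timesharing. The sketch below is an illustration under stated assumptions: the power-split parameter is called `beta` here, the values $E = 10$, $\sigma_{(1)}^2 = 1$, $\sigma_{(2)}^2 = 4$ are hypothetical, and natural logarithms (rates in nats) are used.

```python
import math


def gaussian_bc_rates(beta, power, var1, var2):
    """Superposition-coding rates for the Gaussian BC with power split beta.

    Returns (rate to the weaker receiver, rate to the stronger receiver):
      weaker:   0.5 * log(1 + (1 - beta) * E / (beta * E + var2))
      stronger: 0.5 * log(1 + beta * E / var1)
    """
    r_weak = 0.5 * math.log(1.0 + (1.0 - beta) * power / (beta * power + var2))
    r_strong = 0.5 * math.log(1.0 + beta * power / var1)
    return r_weak, r_strong


def timesharing_weak_rate(r_strong, power, var1, var2):
    """Rate left for the weaker receiver if timesharing must deliver r_strong."""
    c1 = 0.5 * math.log(1.0 + power / var1)
    c2 = 0.5 * math.log(1.0 + power / var2)
    lam = r_strong / c1  # fraction of time spent on the stronger receiver
    return (1.0 - lam) * c2


E, VAR1, VAR2 = 10.0, 1.0, 4.0
r_weak, r_strong = gaussian_bc_rates(0.5, E, VAR1, VAR2)
r_weak_ts = timesharing_weak_rate(r_strong, E, VAR1, VAR2)
print(f"superposition: ({r_weak:.4f}, {r_strong:.4f}) nats")
print(f"timesharing:   ({r_weak_ts:.4f}, {r_strong:.4f}) nats")
```

For these values the superposition scheme offers the weaker receiver a strictly larger rate than timesharing at the same rate for the stronger receiver.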
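To prove the converse, we do not simply try to show that the above choice
is optimal, but we go some steps further back. We start as follows:
$$R^{(1)} = \frac{1}{n} H\bigl(M^{(1)}\bigr) \tag{13.325}$$
$$= \frac{1}{n} I\bigl(M^{(1)}; \mathbf Y^{(1)}\bigr) + \frac{1}{n} H\bigl(M^{(1)} \big| \mathbf Y^{(1)}\bigr) \tag{13.326}$$
$$\le \frac{1}{n} I\bigl(M^{(1)}; \mathbf Y^{(1)}\bigr) + \epsilon_n^{(1)} \tag{13.327}$$
$$\le \frac{1}{n} I\bigl(M^{(1)}; \mathbf Y^{(1)}, M^{(0)}, M^{(2)}\bigr) + \epsilon_n^{(1)} \tag{13.328}$$
$$= \frac{1}{n} \underbrace{I\bigl(M^{(1)}; M^{(0)}, M^{(2)}\bigr)}_{=\,0} + \frac{1}{n} I\bigl(M^{(1)}; \mathbf Y^{(1)} \big| M^{(0)}, M^{(2)}\bigr) + \epsilon_n^{(1)} \tag{13.329}$$
$$= \frac{1}{n} I\bigl(M^{(1)}; \mathbf Y^{(1)} \big| M^{(0)}, M^{(2)}\bigr) + \epsilon_n^{(1)}, \tag{13.330}$$
and
$$R^{(0)} + R^{(2)} = \frac{1}{n} H\bigl(M^{(0)}, M^{(2)}\bigr) \tag{13.331}$$
$$= \frac{1}{n} I\bigl(M^{(0)}, M^{(2)}; \mathbf Y^{(2)}\bigr) + \frac{1}{n} H\bigl(M^{(0)}, M^{(2)} \big| \mathbf Y^{(2)}\bigr) \tag{13.332}$$
$$\le \frac{1}{n} I\bigl(M^{(0)}, M^{(2)}; \mathbf Y^{(2)}\bigr) + \epsilon_n^{(2)}, \tag{13.333}$$
where $\epsilon_n^{(1)}$ and $\epsilon_n^{(2)}$ (by the Fano Inequality) tend to zero as $n$ tends to infinity.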
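We now continue our bounding as follows:
$$\frac{1}{n} I\bigl(M^{(0)}, M^{(2)}; \mathbf Y^{(2)}\bigr) = \frac{1}{n} h\bigl(\mathbf Y^{(2)}\bigr) - \frac{1}{n} h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr) \tag{13.334}$$
$$= \frac{1}{n} \sum_{k=1}^{n} h\bigl(Y_k^{(2)} \big| Y_1^{(2)}, \ldots, Y_{k-1}^{(2)}\bigr) - \frac{1}{n} h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr) \tag{13.335}$$
$$\le \frac{1}{n} \sum_{k=1}^{n} h\bigl(Y_k^{(2)}\bigr) - \frac{1}{n} h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr) \tag{13.336}$$
$$\le \frac{1}{n} \sum_{k=1}^{n} \frac{1}{2} \log\Bigl(2\pi e \bigl(E_k + \sigma_{(2)}^2\bigr)\Bigr) - \frac{1}{n} h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr) \tag{13.337}$$
$$\le \frac{1}{2} \log\left(2\pi e \left(\frac{1}{n} \sum_{k=1}^{n} E_k + \sigma_{(2)}^2\right)\right) - \frac{1}{n} h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr) \tag{13.338}$$
$$\le \frac{1}{2} \log\Bigl(2\pi e \bigl(E + \sigma_{(2)}^2\bigr)\Bigr) - \frac{1}{n} h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr). \tag{13.339}$$
Here, the first inequality (13.336) follows because conditioning cannot increase
entropy; in the subsequent inequality (13.337) we define $E_k \triangleq \mathsf E\bigl[X_k^2\bigr]$ and
upper-bound the entropy by the Gaussian entropy of given second moment;
then in (13.338) we use the concavity of the logarithm; and the final step
(13.339) follows from the average-power constraint (13.309).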
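Now note that by conditioning that reduces entropy and by the Markovity
of encoder $\multimap\!\!-$ channel $\multimap\!\!-$ decoder we have
$$\frac{n}{2} \log\bigl(2\pi e\, \sigma_{(2)}^2\bigr) = h\bigl(\mathbf Z^{(2)}\bigr) \tag{13.340}$$
$$= h\bigl(\mathbf Y^{(2)} \big| \mathbf X\bigr) \tag{13.341}$$
$$= h\bigl(\mathbf Y^{(2)} \big| \mathbf X, M^{(0)}, M^{(2)}\bigr) \tag{13.342}$$
$$\le h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr) \tag{13.343}$$
$$\le h\bigl(\mathbf Y^{(2)}\bigr) \tag{13.344}$$
$$\le \frac{n}{2} \log\Bigl(2\pi e \bigl(E + \sigma_{(2)}^2\bigr)\Bigr), \tag{13.345}$$
where the last step follows in the same way as in (13.334)–(13.339). Hence,
$$\frac{n}{2} \log\bigl(2\pi e\, \sigma_{(2)}^2\bigr) \le h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr) \le \frac{n}{2} \log\Bigl(2\pi e \bigl(E + \sigma_{(2)}^2\bigr)\Bigr), \tag{13.346}$$
and therefore there must exist some $\beta$, $0 \le \beta \le 1$, such that
$$h\bigl(\mathbf Y^{(2)} \big| M^{(0)}, M^{(2)}\bigr) = \frac{n}{2} \log\Bigl(2\pi e \bigl(\beta E + \sigma_{(2)}^2\bigr)\Bigr). \tag{13.347}$$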
R(0) + R(2)
(13.348)
(13.349)
(13.350)
(13.351)
e n h(Y
e n h(Y
+ e n h(VM
(0) ,M (2) )
(13.352)
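and bound
$$\frac{2}{n} h\bigl(\mathbf Y^{(1)} \big| M^{(0)}, M^{(2)}\bigr) \le \log\Bigl( e^{\frac{2}{n} h(\mathbf Y^{(1)} + \mathbf V | M^{(0)}, M^{(2)})} - e^{\frac{2}{n} h(\mathbf V | M^{(0)}, M^{(2)})} \Bigr) \tag{13.353}$$
$$= \log\Bigl( e^{\frac{2}{n} h(\mathbf Y^{(2)} | M^{(0)}, M^{(2)})} - e^{\frac{2}{n} h(\mathbf V | M^{(0)}, M^{(2)})} \Bigr) \tag{13.354}$$
$$= \log\Bigl( e^{\frac{2}{n} \frac{n}{2} \log(2\pi e(\beta E + \sigma_{(2)}^2))} - e^{\frac{2}{n} \frac{n}{2} \log(2\pi e(\sigma_{(2)}^2 - \sigma_{(1)}^2))} \Bigr) \tag{13.355}$$
$$= \log\Bigl( 2\pi e \bigl(\beta E + \sigma_{(2)}^2\bigr) - 2\pi e \bigl(\sigma_{(2)}^2 - \sigma_{(1)}^2\bigr) \Bigr) \tag{13.356}$$
$$= \log\Bigl( 2\pi e \bigl(\beta E + \sigma_{(1)}^2\bigr) \Bigr), \tag{13.357}$$
where in (13.355) we have made use of (13.347). Hence, using (13.357) and
the fact that $\mathbf X$ is a function of $\bigl(M^{(0)}, M^{(1)}, M^{(2)}\bigr)$, we get from (13.330)
$$R^{(1)} \le \frac{1}{n} I\bigl(M^{(1)}; \mathbf Y^{(1)} \big| M^{(0)}, M^{(2)}\bigr) + \epsilon_n^{(1)} \tag{13.358}$$
$$= \frac{1}{n} h\bigl(\mathbf Y^{(1)} \big| M^{(0)}, M^{(2)}\bigr) - \frac{1}{n} h\bigl(\mathbf Y^{(1)} \big| M^{(0)}, M^{(1)}, M^{(2)}\bigr) + \epsilon_n^{(1)} \tag{13.359}$$
$$= \frac{1}{2} \cdot \frac{2}{n} h\bigl(\mathbf Y^{(1)} \big| M^{(0)}, M^{(2)}\bigr) - \frac{1}{n} h\bigl(\mathbf Y^{(1)} \big| M^{(0)}, M^{(1)}, M^{(2)}, \mathbf X\bigr) + \epsilon_n^{(1)} \tag{13.360}$$
$$\le \frac{1}{2} \log\Bigl( 2\pi e \bigl(\beta E + \sigma_{(1)}^2\bigr) \Bigr) - \frac{1}{n} h\bigl(\mathbf Z^{(1)}\bigr) + \epsilon_n^{(1)} \tag{13.361}$$
$$= \frac{1}{2} \log\Bigl( 2\pi e \bigl(\beta E + \sigma_{(1)}^2\bigr) \Bigr) - \frac{1}{2} \log\bigl(2\pi e\, \sigma_{(1)}^2\bigr) + \epsilon_n^{(1)} \tag{13.362}$$
$$= \frac{1}{2} \log\left(1 + \frac{\beta E}{\sigma_{(1)}^2}\right) + \epsilon_n^{(1)}. \tag{13.363}$$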
Chapter 14
Problem Setup
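[Figure 14.1 (block diagram not reproduced): three uniform sources 0, 1, 2 emit $M^{(0)}$, $M^{(1)}$, $M^{(2)}$; Enc. (1) and Enc. (2) produce $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$; the channel $Q^n_{Y|X^{(1)},X^{(2)}}$ feeds the decoder, which outputs $\hat{M}^{(0)}$, $\hat{M}^{(1)}$, $\hat{M}^{(2)}$.]
Figure 14.1: The generalized multiple-access channel with two private message
sources $M^{(i)}$, $i = 1, 2$, and a common message source $M^{(0)}$.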
Note that the common message might represent a common time reference
that lets the transmitters synchronize their transmissions. However, in this
case we have R(0) = 0 and we are actually back in the situation of Chapter 10.
More generally, the common message has a strictly positive rate. For example,
it could represent some information that two mobile stations are relaying from
one base station to the next.
The various definitions of Section 10.1 generalize very easily to the new
situation here. In particular, note that the capacity region has now become
three-dimensional, containing rate triples $\bigl(R^{(0)}, R^{(1)}, R^{(2)}\bigr)$.
14.2
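such that
\[
\Bigl(\mathbf{U}\bigl(\hat{m}^{(0)}\bigr),\, \mathbf{X}^{(1)}\bigl(\hat{m}^{(0)}, \hat{m}^{(1)}\bigr),\, \mathbf{X}^{(2)}\bigl(\hat{m}^{(0)}, \hat{m}^{(2)}\bigr),\, \mathbf{Y}\Bigr)
\in \mathcal{A}_{\epsilon}^{(n)}\bigl(Q_{U,X^{(1)},X^{(2)},Y}\bigr).
\tag{14.1}
\]
If there is exactly one such triple $\bigl(\hat{m}^{(0)}, \hat{m}^{(1)}, \hat{m}^{(2)}\bigr)$, the decoder puts
out $\bigl(\hat{M}^{(0)}, \hat{M}^{(1)}, \hat{M}^{(2)}\bigr) \triangleq \bigl(\hat{m}^{(0)}, \hat{m}^{(1)}, \hat{m}^{(2)}\bigr)$. Otherwise it declares an error.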
5: Performance Analysis: By the symmetry of the random codebook
construction, the conditional error probability does not depend on which
triple of indices is sent. Without
loss of generality, we can therefore
assume that M (0) , M (1) , M (2) = (1, 1, 1).
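We define the following events: for each $m^{(0)}$, $m^{(1)}$, $m^{(2)}$,
\begin{align*}
F_{m^{(0)}} &\triangleq \Bigl\{ \bigl(\mathbf{U}(m^{(0)}), \mathbf{Y}\bigr) \in \mathcal{A}_{\epsilon}^{(n)}(Q_{U,Y}) \Bigr\}, \tag{14.2}\\
F_{m^{(0)},m^{(1)},m^{(2)}} &\triangleq \Bigl\{ \bigl(\mathbf{U}(m^{(0)}), \mathbf{X}^{(1)}(m^{(0)}, m^{(1)}), \mathbf{X}^{(2)}(m^{(0)}, m^{(2)}), \mathbf{Y}\bigr) \in \mathcal{A}_{\epsilon}^{(n)}\bigl(Q_{U,X^{(1)},X^{(2)},Y}\bigr) \Bigr\}. \tag{14.3}
\end{align*}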
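Then, using the Union Bound, we can bound as follows:
\begin{align*}
P_e^{(n)} &\le \Pr\Biggl( F_{1,1,1}^{c} \cup \bigcup_{m^{(0)}=2}^{e^{nR^{(0)}}} F_{m^{(0)}} \cup \bigcup_{m^{(1)}=2}^{e^{nR^{(1)}}} F_{1,m^{(1)},1} \cup \bigcup_{m^{(2)}=2}^{e^{nR^{(2)}}} F_{1,1,m^{(2)}} \\
&\qquad\quad \cup \bigcup_{m^{(1)}=2}^{e^{nR^{(1)}}} \bigcup_{m^{(2)}=2}^{e^{nR^{(2)}}} F_{1,m^{(1)},m^{(2)}} \Biggm| M^{(0)} = 1, M^{(1)} = 1, M^{(2)} = 1 \Biggr) \tag{14.4}\\
&\le \Pr\bigl( F_{1,1,1}^{c} \bigm| M^{(0)} = 1, M^{(1)} = 1, M^{(2)} = 1 \bigr) \\
&\quad + \sum_{m^{(0)}=2}^{e^{nR^{(0)}}} \Pr\bigl( F_{m^{(0)}} \bigm| M^{(0)} = 1, M^{(1)} = 1, M^{(2)} = 1 \bigr) \\
&\quad + \sum_{m^{(1)}=2}^{e^{nR^{(1)}}} \Pr\bigl( F_{1,m^{(1)},1} \bigm| M^{(0)} = 1, M^{(1)} = 1, M^{(2)} = 1 \bigr) \\
&\quad + \sum_{m^{(2)}=2}^{e^{nR^{(2)}}} \Pr\bigl( F_{1,1,m^{(2)}} \bigm| M^{(0)} = 1, M^{(1)} = 1, M^{(2)} = 1 \bigr) \\
&\quad + \sum_{m^{(1)}=2}^{e^{nR^{(1)}}} \sum_{m^{(2)}=2}^{e^{nR^{(2)}}} \Pr\bigl( F_{1,m^{(1)},m^{(2)}} \bigm| M^{(0)} = 1, M^{(1)} = 1, M^{(2)} = 1 \bigr), \tag{14.5}
\end{align*}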
where in (14.4) the first event corresponds to the case that the correct
codewords are not recognized, the first union of events corresponds to the
case where some codeword from a wrong cloud is (wrongly) recognized,
and the remaining unions of events correspond to the cases where some
wrong codeword from the correct cloud is recognized. Note that we have
an inequality in front of (14.4) because we only check whether the cloud
center of a wrong cloud happens to be typical with the received sequence,
and do not bother to check whether or not there actually exist codewords
in that wrong cloud that are jointly typical with the cloud center and the
received sequence.
Of the five main terms in (14.5), the first two are standard:
c
Pr F1,1,1 (1, 1, 1) t n, , U X (1) X (2) Y ,
(0)
nR
eX
m(0) =2
(14.6)
(0)
nR
eX
en(I(U ;Y ))
(14.7)
en(I(U ;Y )) .
(14.8)
m(0) =2
enR
(0)
(n)
A (QU,X (1) ,X (2) ,Y
A
(14.9)
QU,X (1) ,X ,Y
= A(n)
(1)
(2)
(2)
n(H(U,X (1) ,X (2) ,Y )+)
e
en(H(U,X )+H(X U )+H(Y X ,U ))
(14.12)
n(H(X (2) X (1) ,U )+H(Y X (1) ,X (2) ,U )H(X (2) U )H(Y X (2) ,U )+)
(14.13)
(14.14)
(14.15)
=e
=e
=e
Here the most important step is (14.9), where we need to realize that $\mathbf{Y}$ is
generated based on the transmitted $\mathbf{X}^{(2)}$, but not on the wrong codeword
$\mathbf{X}^{(1)}$ considered here. However, since we do consider the correct cloud,
the cloud center $\mathbf{U}$ is related to the received $\mathbf{Y}$. Moreover, in (14.15) we
make use of the Markov chain $X^{(1)}$ ⊸– $U$ ⊸– $X^{(2)}$, i.e., we use that
conditionally on the cloud center $\mathbf{U}$, the codewords $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ are
generated independently.
Hence, we get
(1)
nR
eX
m(1) =2
(1)
(1)
(2)
Pr F1,m(1) ,1 (1, 1, 1) enR en(I(X ;Y X ,U )) ,
(14.16)
and, by symmetry,
(2)
nR
eX
m(2) =2
(2)
(2)
(1)
Pr F1,1,m(2) (1, 1, 1) enR en(I(X ;Y X ,U )) .
(14.17)
(n)
A
QnY U yx(2) , u
{z
}

X
wrong codewords,
correct cloud!
n(H(X (1) U ))
n(H(U ))
A
(14.18)
(2)
en(H(X U )) en(H(Y U ))
(1)
(2)
(1)
(2)
en(H(U,X ,X ,Y )+) en(H(U,X )+H(X U )+H(Y U ))
= en(
(1)
(2)
= en(I(X ,X ;Y U )) ,
U ))
(14.19)
(14.20)
(14.21)
(14.22)
i.e.,
(1)
(2)
nR
eX
nR
eX
Pr F1,m(1) ,m(2) (1, 1, 1)
m(1) =2 m(2) =2
n(R(1) +R(2) )
en(I(X
(1) ,X (2) ;Y
U ))
(14.23)
(1)
(14.24)
(14.25)
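\begin{align*}
R^{(0)} &< I(U; Y), \tag{14.26}\\
R^{(1)} &< I\bigl(X^{(1)}; Y \bigm| X^{(2)}, U\bigr), \tag{14.27}\\
R^{(2)} &< I\bigl(X^{(2)}; Y \bigm| X^{(1)}, U\bigr), \tag{14.28}\\
R^{(1)} + R^{(2)} &< I\bigl(X^{(1)}, X^{(2)}; Y \bigm| U\bigr). \tag{14.29}
\end{align*}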
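Note that (14.26) and (14.29) combined with the Markov chain
\[
U \text{ ⊸– } \bigl(X^{(1)}, X^{(2)}\bigr) \text{ ⊸– } Y
\tag{14.30}
\]
result in
\begin{align*}
R^{(0)} + R^{(1)} + R^{(2)} &< I(U; Y) + I\bigl(X^{(1)}, X^{(2)}; Y \bigm| U\bigr) \tag{14.31}\\
&= I\bigl(U, X^{(1)}, X^{(2)}; Y\bigr) \tag{14.32}\\
&= I\bigl(X^{(1)}, X^{(2)}; Y\bigr) + I\bigl(U; Y \bigm| X^{(1)}, X^{(2)}\bigr) \tag{14.33}\\
&= I\bigl(X^{(1)}, X^{(2)}; Y\bigr). \tag{14.34}
\end{align*}
Hence, instead of (14.26)–(14.29) we can equivalently write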
(14.35)
(14.36)
(14.37)
(14.38)
[Figure 14.2 (plot not reproduced; axes $R^{(1)}$ and $R^{(2)}$).]
Figure 14.2: The shape of the achievable region (14.35)–(14.38) for the MAC
with common message for a fixed choice of $Q_U\, Q_{X^{(1)}|U}\, Q_{X^{(2)}|U}$.
Since we can freely choose $Q_U$, $Q_{X^{(1)}|U}$, and $Q_{X^{(2)}|U}$, we can now take
the convex hull of the union of (14.35)–(14.38) over all such choices. Note,
however, that it can be shown (and we will prove it in the following
section) that the union of (14.35)–(14.38) already is convex, i.e., we do
not need the convex-hull operation.
Also note that the shape of the region defined in (14.26)–(14.29) is not
as shown in Figure 14.2. This is not a contradiction because the true
shape of the capacity region is given by the union of these regions, and
this union is the same irrespective of whether we use (14.26)–(14.29) or
(14.35)–(14.38).
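The chain (14.31)–(14.34) relies on the fact that, under the Markov chain (14.30), $I(U;Y) + I(X^{(1)},X^{(2)};Y|U) = I(X^{(1)},X^{(2)};Y)$. The following Python sketch checks this identity numerically for one randomly drawn distribution of the form $Q_U Q_{X^{(1)}|U} Q_{X^{(2)}|U} Q_{Y|X^{(1)},X^{(2)}}$. The binary alphabets, the helper names, and the random seed are assumptions made purely for illustration.

```python
import math
import random
from itertools import product


def mutual_information(joint):
    """Return I(A;B) in bits for a joint PMF given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def random_pmf(size, rng):
    """Draw a strictly positive PMF of the given size."""
    weights = [rng.random() + 0.1 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)  # fixed seed, hypothetical toy example
q_u = random_pmf(2, rng)
q_x1_given_u = [random_pmf(2, rng) for _ in range(2)]
q_x2_given_u = [random_pmf(2, rng) for _ in range(2)]
q_y_given_x = {x: random_pmf(2, rng) for x in product(range(2), repeat=2)}


def prob(u, x1, x2, y):
    """Joint probability Q_U Q_{X1|U} Q_{X2|U} Q_{Y|X1,X2}."""
    return (q_u[u] * q_x1_given_u[u][x1] * q_x2_given_u[u][x2]
            * q_y_given_x[(x1, x2)][y])


# I(U;Y) and I(X1,X2;Y) from the corresponding pairwise marginals.
joint_u_y, joint_x_y = {}, {}
for u, x1, x2, y in product(range(2), repeat=4):
    p = prob(u, x1, x2, y)
    joint_u_y[(u, y)] = joint_u_y.get((u, y), 0.0) + p
    joint_x_y[((x1, x2), y)] = joint_x_y.get(((x1, x2), y), 0.0) + p
i_u_y = mutual_information(joint_u_y)
i_x_y = mutual_information(joint_x_y)

# I(X1,X2;Y|U) = sum_u Q_U(u) I(X1,X2;Y|U=u).
i_x_y_given_u = 0.0
for u in range(2):
    conditional = {((x1, x2), y): prob(u, x1, x2, y) / q_u[u]
                   for x1, x2, y in product(range(2), repeat=3)}
    i_x_y_given_u += q_u[u] * mutual_information(conditional)

lhs = i_u_y + i_x_y_given_u  # right-hand side of (14.31)
rhs = i_x_y                  # right-hand side of (14.34)
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding, because $I(U;Y|X^{(1)},X^{(2)}) = 0$ for any distribution of this form.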
14.3 Converse
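The converse again relies on the Fano Inequality (Proposition 1.13) with an
observation $\mathbf{Y}$ about $\bigl(M^{(0)}, M^{(1)}, M^{(2)}\bigr)$:
\begin{align*}
H\bigl(M^{(0)}, M^{(1)}, M^{(2)} \bigm| Y_1^n\bigr)
&\le n \biggl( \frac{\log 2}{n} + P_e^{(n)} \bigl(R^{(0)} + R^{(1)} + R^{(2)}\bigr) \biggr) \tag{14.39}\\
&\triangleq n\epsilon_n, \tag{14.40}
\end{align*}
where $\epsilon_n \to 0$ as $n \to \infty$ if $P_e^{(n)} \to 0$.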
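As a small numerical illustration of (14.39)–(14.40), the sketch below evaluates $\epsilon_n = \frac{\log 2}{n} + P_e^{(n)}(R^{(0)}+R^{(1)}+R^{(2)})$ for growing $n$. The rate triple and the assumed decay $P_e^{(n)} = 1/n$ are hypothetical choices, not values from the text; rates are in nats, matching the $e^{nR}$ convention.

```python
import math


def fano_epsilon(n, error_probability, rates):
    """epsilon_n = log(2)/n + P_e^(n) * (sum of rates), with rates in nats."""
    return math.log(2) / n + error_probability * sum(rates)


rates = (0.1, 0.3, 0.2)  # hypothetical rate triple (R0, R1, R2) in nats
blocklengths = (10, 100, 1000)
# Assumed error-probability decay P_e^(n) = 1/n, for illustration only.
epsilons = [fano_epsilon(n, 1.0 / n, rates) for n in blocklengths]
for n, eps in zip(blocklengths, epsilons):
    print(n, eps)
```

Any sequence of error probabilities that tends to zero would make $\epsilon_n$ vanish in the same way.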
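So assume a given system with $P_e^{(n)} \to 0$. For such a system we have
\begin{align*}
nR^{(1)} &= H\bigl(M^{(1)}\bigr) \tag{14.41}\\
&= I\bigl(M^{(1)}; Y_1^n\bigr) + H\bigl(M^{(1)} \bigm| Y_1^n\bigr) \tag{14.42}\\
&\le I\bigl(M^{(1)}; Y_1^n\bigr) + H\bigl(M^{(0)}, M^{(1)}, M^{(2)} \bigm| Y_1^n\bigr) \tag{14.43}\\
&\le I\bigl(M^{(1)}; Y_1^n\bigr) + n\epsilon_n \tag{14.44}\\
&\le I\bigl(M^{(1)}; Y_1^n, M^{(0)}, M^{(2)}\bigr) + n\epsilon_n \tag{14.45}\\
&= I\bigl(M^{(1)}; Y_1^n \bigm| M^{(0)}, M^{(2)}\bigr) + n\epsilon_n \tag{14.46}\\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \bigm| Y_1^{k-1}, M^{(0)}, M^{(2)}\bigr) - H\bigl(Y_k \bigm| Y_1^{k-1}, M^{(0)}, M^{(1)}, M^{(2)}\bigr) \Bigr) + n\epsilon_n \tag{14.47}\\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \bigm| Y_1^{k-1}, M^{(0)}, M^{(2)}, \mathbf{x}^{(2)}(M^{(0)}, M^{(2)})\bigr) \\
&\qquad\quad - H\bigl(Y_k \bigm| Y_1^{k-1}, M^{(0)}, M^{(1)}, M^{(2)}, \mathbf{x}^{(1)}(M^{(0)}, M^{(1)}), \mathbf{x}^{(2)}(M^{(0)}, M^{(2)})\bigr) \Bigr) + n\epsilon_n \tag{14.48}\\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \bigm| Y_1^{k-1}, M^{(0)}, M^{(2)}, \mathbf{X}^{(2)}\bigr) - H\bigl(Y_k \bigm| Y_1^{k-1}, M^{(0)}, M^{(1)}, M^{(2)}, \mathbf{X}^{(1)}, \mathbf{X}^{(2)}\bigr) \Bigr) + n\epsilon_n \tag{14.49}\\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \bigm| Y_1^{k-1}, M^{(0)}, M^{(2)}, \mathbf{X}^{(2)}\bigr) - H\bigl(Y_k \bigm| X_k^{(1)}, X_k^{(2)}, M^{(0)}\bigr) \Bigr) + n\epsilon_n \tag{14.50}\\
&\le \sum_{k=1}^{n} \Bigl( H\bigl(Y_k \bigm| X_k^{(2)}, M^{(0)}\bigr) - H\bigl(Y_k \bigm| X_k^{(1)}, X_k^{(2)}, M^{(0)}\bigr) \Bigr) + n\epsilon_n \tag{14.51}\\
&= \sum_{k=1}^{n} I\bigl(X_k^{(1)}; Y_k \bigm| X_k^{(2)}, M^{(0)}\bigr) + n\epsilon_n. \tag{14.52}
\end{align*}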
Here, (14.41) follows from the assumption that $M^{(1)}$ is uniformly distributed
over $\{1, \ldots, e^{nR^{(1)}}\}$; (14.44) follows from (14.40); in (14.45) we add
$(M^{(0)}, M^{(2)})$ to the mutual information, and (14.46) then follows from the
independence of $M^{(1)}$ and $(M^{(0)}, M^{(2)})$; (14.48) follows because the codewords
are deterministic functions of the corresponding messages; in the next step
(14.49) we simplify our notation and write $\mathbf{X}^{(i)}$ for $\mathbf{x}^{(i)}\bigl(M^{(0)}, M^{(i)}\bigr)$; in (14.50)
we use the assumption that our DM-MAC is memoryless and used without
feedback; and (14.51) follows because conditioning reduces entropy.
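We next introduce a random variable $T$, which is independent of $\bigl(M^{(0)}, M^{(1)}, M^{(2)}\bigr)$
and uniformly distributed over $\{1, 2, \ldots, n\}$, and a random vector
$\mathbf{U} \triangleq \bigl(M^{(0)}, T\bigr)$. Furthermore, we define the random variables $X^{(1)} \triangleq X_T^{(1)}$,
$X^{(2)} \triangleq X_T^{(2)}$, and $Y \triangleq Y_T$, so that $Q_{\mathbf{U},X^{(1)},X^{(2)},Y}$ factors as
\[
Q_{\mathbf{U}}(\mathbf{u})\, Q_{X^{(1)}|\mathbf{U}}\bigl(x^{(1)} \bigm| \mathbf{u}\bigr)\, Q_{X^{(2)}|\mathbf{U}}\bigl(x^{(2)} \bigm| \mathbf{u}\bigr)\, Q_{Y|X^{(1)},X^{(2)}}\bigl(y \bigm| x^{(1)}, x^{(2)}\bigr)
\tag{14.53}
\]
for all $\mathbf{u}$, $x^{(1)}$, $x^{(2)}$, $y$. Hence,
\begin{align*}
R^{(1)} &\le \frac{1}{n} \sum_{k=1}^{n} I\bigl(X_k^{(1)}; Y_k \bigm| X_k^{(2)}, M^{(0)}\bigr) + \epsilon_n \tag{14.54}\\
&= \sum_{k=1}^{n} \frac{1}{n}\, I\bigl(X_T^{(1)}; Y_T \bigm| X_T^{(2)}, M^{(0)}, T = k\bigr) + \epsilon_n \tag{14.55}\\
&= I\bigl(X_T^{(1)}; Y_T \bigm| X_T^{(2)}, M^{(0)}, T\bigr) + \epsilon_n \tag{14.56}\\
&= I\bigl(X^{(1)}; Y \bigm| X^{(2)}, \mathbf{U}\bigr) + \epsilon_n. \tag{14.57}
\end{align*}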
(14.58)
Similarly, by the fact that M (1) and M (2) are independent and uniformly
distributed over their respective index set,
nR(1) + nR(2)
= H M (1) , M (2)
=
=
=
=
(1)
(2)
(14.59)
(14.60)
(14.61)
(14.62)
(14.63)
+ nn (14.64)
H Yk Y1k1 , M (0) , M (1) , M (2) , X(1) , X(2) + nn
n
X
(1) (2)
=
H Yk Y1k1 , M (0) H Yk Xk , Xk , M (0) + nn
(14.65)
(1) (2)
H Yk M (0) H Yk Xk , Xk , M (0) + nn
(14.67)
k=1
n
X
k=1
n
X
(1)
(2)
I Xk , Xk ; Yk M (0) + nn ,
k=1
(14.66)
(14.68)
and hence
\[
R^{(1)} + R^{(2)} \le I\bigl(X^{(1)}, X^{(2)}; Y \bigm| \mathbf{U}\bigr) + \epsilon_n.
\tag{14.69}
\]
Finally,
nR(0) + nR(1) + nR(2)
= H M (0) , M (1) , M (2)
=
=
=
(0)
(1)
(2)
(14.70)
(14.71)
(14.72)
(14.73)
+ nn
n
X
(1) (2)
=
+ nn
H Yk Y1k1 H Yk Xk , Xk
(14.74)
(1) (2)
H(Yk ) H Yk Xk , Xk
+ nn
(14.76)
k=1
n
X
k=1
n
X
(1)
(2)
I Xk , Xk ; Yk + nn ,
(14.75)
(14.77)
k=1
and hence
\[
R^{(0)} + R^{(1)} + R^{(2)} \le I\bigl(X^{(1)}, X^{(2)}; Y\bigr) + \epsilon_n.
\tag{14.78}
\]
So we see that the achievable region derived in Section 14.2 actually is the
best possible region. Note that for discrete alphabets there is no real difference
between a random vector and a random variable, i.e., we could also write the
scalar $U$ instead of the vector $\mathbf{U}$.
Also note that since we have not excluded the possibility of time-sharing
in the proof of the converse, the converse indirectly proves that
the region derived in Section 14.2 is convex! One could of course also check
this directly, but after the proof of the converse this is no longer necessary.
14.4
Capacity Region
is given by all rate triples R(0) , R(1) , R(2) satisfying
(14.79)
(14.80)
(14.81)
(14.82)
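for some $Q_U\, Q_{X^{(1)}|U}\, Q_{X^{(2)}|U}\, Q_{Y|X^{(1)},X^{(2)}}$. The alphabet size of the
auxiliary RV $U$ can be limited to
\[
|\mathcal{U}| \le \min\Bigl\{ |\mathcal{Y}| + 3,\; \bigl|\mathcal{X}^{(1)}\bigr| \cdot \bigl|\mathcal{X}^{(2)}\bigr| + 2 \Bigr\}.
\tag{14.83}
\]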
Proof: The achievability and converse have already been proven. It only
remains to show the bound on the cardinality of U.
Consider a given choice of U and
QU,X (1) ,X (2) ,Y = QU QX (1) U QX (2) U QY X (1) ,X (2)
(14.84)
(14.85)
uU
I X
(2)
X
QU (u) I X (2) ; Y X (1) , U = u ,
; Y X (1) , U =
(14.86)
uU
X
I X (1) , X (2) ; Y U =
QU (u) I X (1) , X (2) ; Y U = u ,
(14.87)
uU
(14.88)
uU
X
uU
QU (u)QX (1) U x(1) u QX (2) U x(2) u .
(14.89)
(i)
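For simplicity of notation and without loss of generality, assume that
$\mathcal{X}^{(i)} = \bigl\{1, 2, \ldots, |\mathcal{X}^{(i)}|\bigr\}$. Now we define the vector $\mathbf{v}$:
\begin{align*}
\mathbf{v} \triangleq \Bigl( & I\bigl(X^{(1)}; Y \bigm| X^{(2)}, U\bigr),\; I\bigl(X^{(2)}; Y \bigm| X^{(1)}, U\bigr),\; I\bigl(X^{(1)}, X^{(2)}; Y \bigm| U\bigr), \\
& Q_{X^{(1)},X^{(2)}}(1, 1),\; \ldots,\; Q_{X^{(1)},X^{(2)}}\bigl(|\mathcal{X}^{(1)}|, |\mathcal{X}^{(2)}| - 1\bigr) \Bigr), \tag{14.90}
\end{align*}
(14.91)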
QU (u) vu .
(14.92)
uU
where we have made use of the Markov chain $U$ ⊸– $\bigl(X^{(1)}, X^{(2)}\bigr)$ ⊸– $Y$.
Moreover, we also have for all $y \in \mathcal{Y}$
\[
Q_Y(y) = \sum_{u \in \mathcal{U}} Q_U(u)\, Q_{Y|U}(y|u).
\tag{14.96}
\]
(14.98)
Chapter 15
Discrete Memoryless Networks and the Cut-Set Bound
15.1
So far we have only seen multiple-user problems where we had either only
one transmitter, but several receivers, or we had several transmitters, but
only one receiver. In a much more general setting, however, we can think
of a situation with many different terminals, where each terminal potentially
can be transmitter, receiver, or even both. Moreover, a terminal might have
several messages intended for different receivers, some of them for only one
particular terminal (a private message) and some for several receivers at the
same time (a common or at least partially common message).
Such a general network can be described by a discrete memoryless network
(DMN), which is a very broad generalization of a discrete memoryless channel
(DMC).
Definition 15.1 ([CT06]). A discrete memoryless network (DMN) is a discrete-time, synchronously clocked network consisting of $\mathsf{T}$ different terminals
$t \in \mathcal{T} = \{1, \ldots, \mathsf{T}\}$. These terminals are all connected via a channel and
potentially all act simultaneously as transmitter and receiver.
In the DMN there exist $\mathsf{M}$ statistically independent messages $M^{(m)}$, $m = 1, \ldots, \mathsf{M}$,
each of which is uniformly distributed over $\bigl\{1, \ldots, e^{nR^{(m)}}\bigr\}$, i.e., every
message $M^{(m)}$ has a rate $R^{(m)}$. Each message originates at exactly one
terminal and is intended for one or more other terminals. We denote by
$\mathcal{M}(t) \subseteq \{1, \ldots, \mathsf{M}\}$ the set of (indices of the) messages originating at terminal $t$, and by $\mathcal{D}(m)$ the set of terminals that are intended receivers of the $m$th
message $M^{(m)}$.
At every time-step $k$, every terminal $t \in \mathcal{T}$ emits a channel input $X_k^{(t)}$ that
is based on its messages $M^{(m)}$, $m \in \mathcal{M}(t)$, and on the previously observed
channel outputs $Y_1^{(t)}, \ldots, Y_{k-1}^{(t)}$. Afterwards it observes a new channel output
\[
Y_k^{(t)} = f^{(t)}\bigl(X_k^{(1)}, \ldots, X_k^{(\mathsf{T})}, N_k^{(t)}\bigr)
\tag{15.1}
\]
for some function $f^{(t)}(\cdot)$ and for some noise RV $N_k^{(t)}$ that is independent of all
other random variables and IID over time. Hence, we can describe the channel
again by a conditional probability distribution $P_{Y^{(1)},\ldots,Y^{(\mathsf{T})}|X^{(1)},\ldots,X^{(\mathsf{T})}}(\cdot|\cdot)$ that
does not change over time. Note that the setup of the model is such that we
have a causal operation of the network: The channel inputs $X_k^{(t)}$ are applied
after clock tick $k-1$, but before clock tick $k$, so that they serve as current
inputs for the channel outputs $Y_k^{(t)}$.
In Figure 15.1 we have depicted an example of a DMN with five terminals.
Note that it is possible that some terminal acts as receiver only, in which
case we will omit the corresponding arrow of the nonexistent channel input.
Similarly, it is possible that the conditional channel probability distribution
PY (1) ,...,Y (T) X (1) ,...,X (T) () is such that the channel output of some terminal t
is completely independent of any messages and only depends on noise and is
therefore completely useless for that terminal. In this case we will omit this
particular channel output. The decisions of the terminals about their intended
received messages are denoted by $\hat{M}^{(m)}(t)$, $t \in \mathcal{D}(m)$.
We remark that while this definition of a DMN covers all networks considered so far, it does not accommodate the concept of a common message
between several transmitters because in Definition 15.1 we ask for all messages
to be independent and to originate at exactly one terminal only.
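Definition 15.2. The capacity region $\mathcal{C}$ of a DMN is the closure of the set
of rate tuples $\bigl(R^{(1)}, \ldots, R^{(\mathsf{M})}\bigr)$ for which, for sufficiently large $n$, there are
encoders and decoders so that the error probability
\[
P_e^{(n)} = \Pr\Biggl( \bigcup_{m=1}^{\mathsf{M}} \bigcup_{t \in \mathcal{D}(m)} \Bigl\{ \hat{M}^{(m)}(t) \ne M^{(m)} \Bigr\} \Biggr)
\tag{15.2}
\]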
15.2 Cut-Set Bound
Obviously, it is nearly impossible to find a closed-form expression for the capacity region of a DMN, seeing that we could not even solve some simple
examples of a DMN like the general BC. However, one can still say something
about the network. A particularly interesting and actually quite simple idea
is the Cut-Set Bound [CT06], [EG81]. This is an attempt to generalize the
typical proof of a converse based on the Fano Inequality (Proposition 1.13)
to the setup of a DMN. Hence, it will provide outer bounds on the capacity
region.
[Figure 15.1 (block diagram not reproduced): Terminals 1 to 5 connected to a central DMN channel, with inputs $\mathbf{X}^{(1)}$, $\mathbf{X}^{(2)}$, $\mathbf{X}^{(4)}$, $\mathbf{X}^{(5)}$, outputs $\mathbf{Y}^{(1)}$ to $\mathbf{Y}^{(4)}$, messages $M^{(1)}$ to $M^{(4)}$, and the corresponding decisions.]
Figure 15.1: A DMN with five terminals and four messages. Note that
$\hat{M}^{(m)}(t)$ denotes the decision about the $m$th message at terminal
$t$. In this example, Terminal 5 does not get any useful information back from the channel and we have therefore omitted $\mathbf{Y}^{(5)}$.
Also note that Terminal 3 acts as pure receiver without giving
any feedback into the network.
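Definition 15.3. If the set of terminals $\mathcal{T} = \{1, \ldots, \mathsf{T}\}$ is partitioned into
two sets $\mathcal{S}$ and $\mathcal{S}^{c}$, then the pair $(\mathcal{S}, \mathcal{S}^{c})$ is called a cut.
We say that the cut $(\mathcal{S}, \mathcal{S}^{c})$ separates a message $M^{(m)}$ and its decision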
(m)
(15.5)
(S)
(15.7)
(S)
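\begin{align*}
&\le \log 2 + n \Pr\Biggl( \bigcup_{m \in \mathcal{M}(\mathcal{S})} \bigcup_{t \in \mathcal{D}(m) \cap \mathcal{S}^{c}} \Bigl\{ \hat{M}^{(m)}(t) \ne M^{(m)} \Bigr\} \Biggr) \sum_{m \in \mathcal{M}(\mathcal{S})} R^{(m)} \tag{15.8}\\
&\le \log 2 + n \Pr\Biggl( \bigcup_{m=1}^{\mathsf{M}} \bigcup_{t \in \mathcal{D}(m)} \Bigl\{ \hat{M}^{(m)}(t) \ne M^{(m)} \Bigr\} \Biggr) \sum_{m=1}^{\mathsf{M}} R^{(m)} \tag{15.9}\\
&= n \Biggl( \frac{\log 2}{n} + P_e^{(n)} \sum_{m=1}^{\mathsf{M}} R^{(m)} \Biggr) \tag{15.10}\\
&\triangleq n\epsilon_n. \tag{15.11}
\end{align*}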
Here, the first inequality (15.8) follows from Proposition 1.13; in the second
inequality (15.9) we enlarge both the set in the probability expression as well
as the sum; and (15.11) should be read as the definition of $\epsilon_n$. Note that $\epsilon_n \to 0$
as $n \to \infty$ because we assume that $P_e^{(n)} \to 0$.
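So, using that $M^{(\mathcal{S})}$ is uniformly distributed over $e^{n \sum_{m \in \mathcal{M}(\mathcal{S})} R^{(m)}}$ different
values, we have
\begin{align*}
n \sum_{m \in \mathcal{M}(\mathcal{S})} R^{(m)}
&= H\bigl(M^{(\mathcal{S})}\bigr) \tag{15.12}\\
&= I\bigl(M^{(\mathcal{S})}; \mathbf{Y}^{(\mathcal{S}^{c})}\bigr) + H\bigl(M^{(\mathcal{S})} \bigm| \mathbf{Y}^{(\mathcal{S}^{c})}\bigr) \tag{15.13}\\
&\le I\bigl(M^{(\mathcal{S})}; \mathbf{Y}^{(\mathcal{S}^{c})}\bigr) + n\epsilon_n \tag{15.14}\\
&\le I\bigl(M^{(\mathcal{S})}; \mathbf{Y}^{(\mathcal{S}^{c})}, M^{(\mathcal{S}^{c})}\bigr) + n\epsilon_n \tag{15.15}\\
&= I\bigl(M^{(\mathcal{S})}; \mathbf{Y}^{(\mathcal{S}^{c})} \bigm| M^{(\mathcal{S}^{c})}\bigr) + n\epsilon_n \tag{15.16}\\
&= H\bigl(\mathbf{Y}^{(\mathcal{S}^{c})} \bigm| M^{(\mathcal{S}^{c})}\bigr) - H\bigl(\mathbf{Y}^{(\mathcal{S}^{c})} \bigm| M^{(\mathcal{S})}, M^{(\mathcal{S}^{c})}\bigr) + n\epsilon_n \tag{15.17}\\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k^{(\mathcal{S}^{c})} \bigm| Y_1^{(\mathcal{S}^{c})}, \ldots, Y_{k-1}^{(\mathcal{S}^{c})}, M^{(\mathcal{S}^{c})}\bigr) \\
&\qquad\quad - H\bigl(Y_k^{(\mathcal{S}^{c})} \bigm| Y_1^{(\mathcal{S}^{c})}, \ldots, Y_{k-1}^{(\mathcal{S}^{c})}, M^{(\mathcal{S})}, M^{(\mathcal{S}^{c})}\bigr) \Bigr) + n\epsilon_n \tag{15.18}\\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k^{(\mathcal{S}^{c})} \bigm| Y_1^{(\mathcal{S}^{c})}, \ldots, Y_{k-1}^{(\mathcal{S}^{c})}, M^{(\mathcal{S}^{c})}, X_k^{(\mathcal{S}^{c})}\bigr) \\
&\qquad\quad - H\bigl(Y_k^{(\mathcal{S}^{c})} \bigm| Y_1^{(\mathcal{S}^{c})}, \ldots, Y_{k-1}^{(\mathcal{S}^{c})}, M^{(\mathcal{S})}, M^{(\mathcal{S}^{c})}\bigr) \Bigr) + n\epsilon_n \tag{15.19}\\
&\le \sum_{k=1}^{n} \Bigl( H\bigl(Y_k^{(\mathcal{S}^{c})} \bigm| X_k^{(\mathcal{S}^{c})}\bigr) \\
&\qquad\quad - H\bigl(Y_k^{(\mathcal{S}^{c})} \bigm| Y_1^{(\mathcal{S}^{c})}, \ldots, Y_{k-1}^{(\mathcal{S}^{c})}, M^{(\mathcal{S})}, M^{(\mathcal{S}^{c})}, X_k^{(\mathcal{S})}, X_k^{(\mathcal{S}^{c})}\bigr) \Bigr) + n\epsilon_n \tag{15.20}\\
&= \sum_{k=1}^{n} \Bigl( H\bigl(Y_k^{(\mathcal{S}^{c})} \bigm| X_k^{(\mathcal{S}^{c})}\bigr) - H\bigl(Y_k^{(\mathcal{S}^{c})} \bigm| X_k^{(\mathcal{S})}, X_k^{(\mathcal{S}^{c})}\bigr) \Bigr) + n\epsilon_n \tag{15.21}\\
&= \sum_{k=1}^{n} I\bigl(X_k^{(\mathcal{S})}; Y_k^{(\mathcal{S}^{c})} \bigm| X_k^{(\mathcal{S}^{c})}\bigr) + n\epsilon_n \tag{15.22}\\
&= n \sum_{k=1}^{n} \frac{1}{n}\, I\bigl(X_Z^{(\mathcal{S})}; Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S}^{c})}, Z = k\bigr) + n\epsilon_n \tag{15.23}\\
&= n\, I\bigl(X_Z^{(\mathcal{S})}; Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S}^{c})}, Z\bigr) + n\epsilon_n \tag{15.24}\\
&= n\, H\bigl(Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S}^{c})}, Z\bigr) - n\, H\bigl(Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S})}, X_Z^{(\mathcal{S}^{c})}, Z\bigr) + n\epsilon_n \tag{15.25}\\
&= n\, H\bigl(Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S}^{c})}, Z\bigr) - n\, H\bigl(Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S})}, X_Z^{(\mathcal{S}^{c})}\bigr) + n\epsilon_n \tag{15.26}\\
&\le n\, H\bigl(Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S}^{c})}\bigr) - n\, H\bigl(Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S})}, X_Z^{(\mathcal{S}^{c})}\bigr) + n\epsilon_n \tag{15.27}\\
&= n\, I\bigl(X_Z^{(\mathcal{S})}; Y_Z^{(\mathcal{S}^{c})} \bigm| X_Z^{(\mathcal{S}^{c})}\bigr) + n\epsilon_n. \tag{15.28}
\end{align*}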
Here, the inequality (15.14) follows from (15.11); in the following inequality
(15.15) we add some argument to the mutual information; in (15.16) we make
use of the basic assumption of a DMN that all messages are independent of
each other; (15.18) follows from the chain rule.
Then in (15.19) we first note that by definition $M^{(\mathcal{S}^{c})}$ contains all messages that originate from terminals in $\mathcal{S}^{c}$ (it also contains all those messages
that originate in $\mathcal{S}$ and whose destinations are all in $\mathcal{S}$). Further we note
that all terminals in $\mathcal{S}^{c}$ generate their channel inputs at time $k$ from their
observations of the past channel outputs and their messages. Hence, from
$Y_1^{(\mathcal{S}^{c})}, \ldots, Y_{k-1}^{(\mathcal{S}^{c})}, M^{(\mathcal{S}^{c})}$, we can directly generate $X_k^{(\mathcal{S}^{c})}$. The subsequent inequality (15.20) holds because conditioning reduces entropy; in (15.21) we apply
the basic assumption about the DMN that the current channel outputs only
depend on the current channel inputs.
In (15.23) we introduce the RV $Z$ that is independent of any other RV
and uniformly distributed over $\{1, \ldots, n\}$; and (15.26) follows because of the
Markov chain
\[
Z \text{ ⊸– } \bigl(X_Z^{(1)}, \ldots, X_Z^{(\mathsf{T})}\bigr) \text{ ⊸– } \bigl(Y_Z^{(1)}, \ldots, Y_Z^{(\mathsf{T})}\bigr),
\tag{15.29}
\]
i.e., the joint distribution is as follows:
\[
Q_{Z,X^{(1)},\ldots,X^{(\mathsf{T})},Y^{(1)},\ldots,Y^{(\mathsf{T})}} = Q_Z\, Q_{X^{(1)},\ldots,X^{(\mathsf{T})}|Z}\, Q_{Y^{(1)},\ldots,Y^{(\mathsf{T})}|X^{(1)},\ldots,X^{(\mathsf{T})}},
\tag{15.30}
\]
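\[
\mathcal{R}\bigl(Q_{X^{(1)},\ldots,X^{(\mathsf{T})}}, \mathcal{S}\bigr) \triangleq \Biggl\{ \bigl(R^{(1)}, \ldots, R^{(\mathsf{M})}\bigr) \colon \sum_{m \in \mathcal{M}(\mathcal{S})} R^{(m)} \le I\bigl(X^{(\mathcal{S})}; Y^{(\mathcal{S}^{c})} \bigm| X^{(\mathcal{S}^{c})}\bigr) \Biggr\}.
\tag{15.31}
\]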
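Now note the important fact that the used distribution (15.30) is the same
for all cuts $\mathcal{S}$. Hence, for a given $Q_{X^{(1)},\ldots,X^{(\mathsf{T})}}$, the achievable rate tuples must
lie in the set
\[
\mathcal{R}\bigl(Q_{X^{(1)},\ldots,X^{(\mathsf{T})}}\bigr) = \bigcap_{\mathcal{S} \subset \mathcal{T}} \mathcal{R}\bigl(Q_{X^{(1)},\ldots,X^{(\mathsf{T})}}, \mathcal{S}\bigr)
\tag{15.32}
\]
and therefore the capacity region $\mathcal{C}$ must lie within the union of all these
regions for all possible choices of the distribution $Q_{X^{(1)},\ldots,X^{(\mathsf{T})}}$. We have shown
the following.
Theorem 15.4 (Cut-Set Bound [CT06], [EG81]).
Consider a DMN $Q_{Y^{(1)},\ldots,Y^{(\mathsf{T})}|X^{(1)},\ldots,X^{(\mathsf{T})}}$ with $\mathsf{M}$ independent messages
$M^{(m)}$ of rate $R^{(m)}$, $m = 1, \ldots, \mathsf{M}$. The capacity region $\mathcal{C}$ must satisfy
\[
\mathcal{C} \subseteq \bigcup_{Q_{X^{(1)},\ldots,X^{(\mathsf{T})}}} \bigcap_{\mathcal{S} \subset \mathcal{T}} \mathcal{R}\bigl(Q_{X^{(1)},\ldots,X^{(\mathsf{T})}}, \mathcal{S}\bigr),
\tag{15.33}
\]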
where R(, ) is defined (15.31) and where we rely on the notational con
15.3 Examples
15.3.1 Broadcast Channel
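[Figure (block diagram not reproduced): broadcast channel with transmitting Terminal 3 (input $\mathbf{X}^{(3)}$, messages $M^{(0)}$, $M^{(1)}$, $M^{(2)}$), receiving Terminals 1 and 2, and the cuts $\mathcal{S}_1$, $\mathcal{S}_2$, $\mathcal{S}_3$.]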
If we rename $X^{(3)}$ to its more usual $X$ and maximize over $Q_X$, we see that the
Cut-Set Bound corresponds to the simplest outer bound (13.289)–(13.291).
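15.3.2 Multiple-Access Channel
[Figure 15.3 (block diagram not reproduced): Terminals 1 and 2 with messages $M^{(1)}$, $M^{(2)}$ and inputs $\mathbf{X}^{(1)}$, $\mathbf{X}^{(2)}$ transmit over a MAC to Terminal 3; the cuts $\mathcal{S}_1$ and $\mathcal{S}_2$ are marked.]
Figure 15.3: Application of the Cut-Set Bound on the multiple-access channel.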
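Again there are only three interesting cuts: $\mathcal{S}_1 = \{1\}$, $\mathcal{S}_2 = \{2\}$, and
$\mathcal{S}_3 = \{1, 2\}$. The first cut separates $M^{(1)}$ from $\hat{M}^{(1)}$, the second separates
$M^{(2)}$ from $\hat{M}^{(2)}$, and the third both messages from both decisions. Hence, we
get
\begin{align*}
R^{(1)} &\le I\bigl(X^{(1)}; Y^{(3)} \bigm| X^{(2)}\bigr), \tag{15.37}\\
R^{(2)} &\le I\bigl(X^{(2)}; Y^{(3)} \bigm| X^{(1)}\bigr), \tag{15.38}\\
R^{(1)} + R^{(2)} &\le I\bigl(X^{(1)}, X^{(2)}; Y^{(3)}\bigr), \tag{15.39}
\end{align*}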
(15.40)
(15.41)
(15.42)
We now have to maximize this over all joint distributions $Q_{X^{(1)},X^{(2)}}$. Hence,
we see that the Cut-Set Bound gives the right mutual information terms (compare with Theorem 10.9!), but the resulting region is too large because we maximize over all
joint distributions instead of only over product distributions.
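The selection of "interesting" cuts can be mechanized. The following Python sketch enumerates all nontrivial subsets $\mathcal{S}$ of the terminals of the MAC example and lists the messages each cut separates. It assumes that a cut separates a message if the message originates in $\mathcal{S}$ and has at least one intended receiver outside $\mathcal{S}$; the function and variable names are hypothetical.

```python
from itertools import combinations


def separated_messages(cut, origin, destinations):
    """Messages that originate inside `cut` and have a destination outside it.

    This encodes the assumed reading of "the cut separates a message and its
    decision"; it is an illustration, not a definition quoted from the notes.
    """
    return sorted(m for m in origin
                  if origin[m] in cut
                  and any(t not in cut for t in destinations[m]))


# MAC of Section 15.3.2: Terminals 1 and 2 transmit, Terminal 3 receives.
terminals = [1, 2, 3]
origin = {1: 1, 2: 2}            # message index -> originating terminal
destinations = {1: {3}, 2: {3}}  # message index -> intended receivers

interesting_cuts = {}
for size in range(1, len(terminals)):
    for subset in combinations(terminals, size):
        messages = separated_messages(set(subset), origin, destinations)
        if messages:  # keep only cuts that constrain at least one rate
            interesting_cuts[subset] = messages

for subset, messages in interesting_cuts.items():
    print("S =", set(subset), "separates messages", messages)
```

Under this assumption only the cuts $\{1\}$, $\{2\}$, and $\{1,2\}$ survive, matching the three cuts used above.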
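15.3.3 Single-Relay Channel
Consider the channel model shown in Figure 15.4. This channel is called
[Figure 15.4 (block diagram not reproduced): Terminal 1 sends message $M$ via $\mathbf{X}^{(1)}$; Terminal 2 observes $\mathbf{Y}^{(2)}$ and transmits $\mathbf{X}^{(2)}$; Terminal 3 observes $\mathbf{Y}^{(3)}$ and decides on $\hat{M}$; the cuts $\mathcal{S}_1$ and $\mathcal{S}_2$ are marked.]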
(15.43)
(15.44)
Since both must be satisfied, we see that the capacity is bounded as follows:
\[
C \le \max_{Q_{X^{(1)},X^{(2)}}} \min\Bigl\{ I\bigl(X^{(1)}; Y^{(2)}, Y^{(3)} \bigm| X^{(2)}\bigr),\; I\bigl(X^{(1)}, X^{(2)}; Y^{(3)}\bigr) \Bigr\}.
\tag{15.45}
\]
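15.3.4 Double-Relay Channel
[Figure (block diagram not reproduced): double-relay channel with Terminals 1, 3, and 4 shown, signals $\mathbf{X}^{(1)}$, $\mathbf{X}^{(2)}$, $\mathbf{X}^{(3)}$, $\mathbf{Y}^{(3)}$, $\mathbf{Y}^{(4)}$, and the cuts $\mathcal{S}_1$ to $\mathcal{S}_4$.]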
(15.50)
Chapter 16
Problem Setup
We now turn to the simplest (and therefore very important) example of a communication setup involving at the same time several transmitters and several
receivers: the interference channel (IC). In this model, we do not have any
common messages, but only transmitter-receiver pairs with a corresponding
private message each. The channel mixes the transmitted signals such that
the transmitters unintentionally interfere with each other. We will restrict our
discussion to the case of two transmitters with their corresponding receivers
as shown in Figure 16.1.
[Figure 16.1 (block diagram not reproduced): uniform sources 1 and 2 emit $M^{(1)}$ and $M^{(2)}$; Enc. (1) and Enc. (2) produce $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$; the channel outputs $\mathbf{Y}^{(1)}$ and $\mathbf{Y}^{(2)}$; Dec. (1) and Dec. (2) deliver $\hat{M}^{(1)}$ to Dest. 1 and $\hat{M}^{(2)}$ to Dest. 2.]
Figure 16.1: A channel coding problem with two sources and two destinations:
The sources independently try to transmit their message $M^{(i)}$,
$i = 1, 2$, to their corresponding destination and by doing so interfere with each other. This channel model is called interference
channel (IC).
Encoder 1 needs to transmit the message M (1) to destination 1, and encoder 2 needs to transmit the message M (2) to destination 2. The interference
channel will produce two outputs Y (1) and Y (2) for the inputs X (1) and X (2) ,
where both Y (1) and Y (2) depend on both X (1) and X (2) .
More formally, we have the following definitions.
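Definition 16.1. A discrete memoryless interference channel (DM-IC) consists of four alphabets $\mathcal{X}^{(1)}$, $\mathcal{X}^{(2)}$, $\mathcal{Y}^{(1)}$, $\mathcal{Y}^{(2)}$ and a conditional probability
distribution $Q_{Y^{(1)},Y^{(2)}|X^{(1)},X^{(2)}}$ such that when it is used without feedback, we
have
\[
Q_{\mathbf{Y}^{(1)},\mathbf{Y}^{(2)}|\mathbf{X}^{(1)},\mathbf{X}^{(2)}}\bigl(\mathbf{y}^{(1)}, \mathbf{y}^{(2)} \bigm| \mathbf{x}^{(1)}, \mathbf{x}^{(2)}\bigr)
= \prod_{k=1}^{n} Q_{Y^{(1)},Y^{(2)}|X^{(1)},X^{(2)}}\bigl(y_k^{(1)}, y_k^{(2)} \bigm| x_k^{(1)}, x_k^{(2)}\bigr).
\tag{16.1}
\]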
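Definition 16.2. An $\bigl(e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ coding scheme for a DM-IC consists of
two sets of indices
\begin{align*}
\mathcal{M}^{(1)} &= \bigl\{1, 2, \ldots, e^{nR^{(1)}}\bigr\}, \tag{16.2}\\
\mathcal{M}^{(2)} &= \bigl\{1, 2, \ldots, e^{nR^{(2)}}\bigr\} \tag{16.3}
\end{align*}
called message sets, two encoding functions
\begin{align*}
\phi^{(1)} &\colon \mathcal{M}^{(1)} \to \bigl(\mathcal{X}^{(1)}\bigr)^n, \tag{16.4}\\
\phi^{(2)} &\colon \mathcal{M}^{(2)} \to \bigl(\mathcal{X}^{(2)}\bigr)^n, \tag{16.5}
\end{align*}
and two decoding functions
\begin{align*}
\psi^{(1)} &\colon \bigl(\mathcal{Y}^{(1)}\bigr)^n \to \mathcal{M}^{(1)}, \tag{16.6}\\
\psi^{(2)} &\colon \bigl(\mathcal{Y}^{(2)}\bigr)^n \to \mathcal{M}^{(2)}. \tag{16.7}
\end{align*}
The error probability of an $\bigl(e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ coding scheme for a DM-IC
is given as
\[
P_e^{(n)} \triangleq \Pr\Bigl( \psi^{(1)}\bigl(\mathbf{Y}^{(1)}\bigr) \ne M^{(1)} \text{ or } \psi^{(2)}\bigl(\mathbf{Y}^{(2)}\bigr) \ne M^{(2)} \Bigr).
\tag{16.8}
\]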
Definition 16.3. A rate pair $\bigl(R^{(1)}, R^{(2)}\bigr)$ is said to be achievable for the IC if
there exists a sequence of $\bigl(e^{nR^{(1)}}, e^{nR^{(2)}}, n\bigr)$ coding schemes with $P_e^{(n)} \to 0$ as
$n \to \infty$.
The capacity region of the IC is defined to be the closure of the set of all
achievable rate pairs.
Example 16.4 (Independent BSCs). Assume we have two independent BSCs
as shown in Figure 16.2. We know that $X^{(1)}$ can transmit at most at a rate
of $C_1 = 1 - H_b(\epsilon_1)$ and $X^{(2)}$ at a rate of $C_2 = 1 - H_b(\epsilon_2)$ bits. There is
no interference. Hence, the capacity region is the rectangular region shown in
Figure 16.3.
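The rectangle of Example 16.4 is easy to evaluate numerically. The Python sketch below computes $C_i = 1 - H_b(\epsilon_i)$ in bits and tests whether a rate pair lies in the rectangular region; the function names and the crossover probabilities used in the example calls are assumptions chosen for illustration.

```python
import math


def binary_entropy(p):
    """Binary entropy function H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(crossover):
    """Capacity 1 - H_b(crossover) of a BSC in bits per channel use."""
    return 1.0 - binary_entropy(crossover)


def in_rectangle(r1, r2, eps1, eps2):
    """True if (r1, r2) lies in the rectangle [0, C1] x [0, C2]."""
    return 0.0 <= r1 <= bsc_capacity(eps1) and 0.0 <= r2 <= bsc_capacity(eps2)


# Hypothetical crossover probabilities for the two independent BSCs.
print(bsc_capacity(0.1))                  # capacity of a BSC with crossover 0.1
print(in_rectangle(0.5, 0.5, 0.1, 0.1))   # both rates below capacity
print(in_rectangle(0.6, 0.5, 0.1, 0.1))   # first rate exceeds C1
```

Because the two links do not interact, membership in the region reduces to two independent single-user capacity checks.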
(2)
(1)
(1)
(1)
(1)
(2)
R C , max (2) I X ; Y X = x ,
QX (1) , x
(2)
(2)
(16.9)
(16.10)
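[Figure 16.2 (diagram not reproduced): two independent BSCs, one from $X^{(1)}$ to $Y^{(1)}$ with crossover probability $\epsilon_1$ and one from $X^{(2)}$ to $Y^{(2)}$ with crossover probability $\epsilon_2$.]
[Figure 16.3 (plot not reproduced; axes $R^{(1)}$ and $R^{(2)}$, with $C_1$ and $C_2$ marked).]
Figure 16.3: The capacity region of the IC consisting of two independent
BSCs.
[Figure 16.4 (plot not reproduced; axes $R^{(1)}$ and $R^{(2)}$, with $C^{(1)}$ and $C^{(2)}$ marked).]
Figure 16.4: An achievable rate region for a DM-IC.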
(16.11)
The capacity region of this IC is a triangle with corner points (0, 0), (0, 1 bit),
and (1 bit, 0). The achievability follows directly from the inner bound of
Figure 16.4, the converse from the CutSet Bound (see Theorem 16.7 below).
Very similarly to the BC, the capacity region of the IC depends only
on the marginal distributions.
Theorem 16.6. The capacity region of an IC depends only on the conditional
marginal distributions $Q_{Y^{(1)}|X^{(1)},X^{(2)}}$ and $Q_{Y^{(2)}|X^{(1)},X^{(2)}}$ and not on the joint
conditional channel law $Q_{Y^{(1)},Y^{(2)}|X^{(1)},X^{(2)}}$.
Proof: Define
Pe(n) , Pr (1) Y(1) 6= M (1) (2) Y(2) 6= M (2) ,
Pe(n),(1) , Pr (1) Y(1) 6= M (1) ,
Pe(n),(2) , Pr (2) Y(2) 6= M (2) .
(16.12)
(16.13)
(16.14)
Y
6= M (1) (2) Y(2) 6= M (2) ,
(16.15)
we have
Pe(n) max Pe(n),(1) , Pe(n),(2) .
(n)
(n)(1)
(16.16)
(n)(2)
(16.17)
(16.18)
16.2
16.2.1
CutSet Bound
(16.19)
(16.20)
(16.21)
(16.22)
(16.23)
(16.24)
(16.25)
(16.26)
(16.27)
(16.28)
(16.29)
(16.30)
From the expansions (16.21), (16.23), (16.28), and (16.30), we see that the
first two and the last two inequalities are implied by (16.25) and (16.26). So
they are redundant and we remain with the three inequalities (16.24)(16.26).
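Theorem 16.7 (Cut-Set Bound). On a general IC, any achievable rate
pair $\bigl(R^{(1)}, R^{(2)}\bigr)$ must satisfy
\begin{align*}
R^{(1)} &\le I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}\bigr), \tag{16.31}\\
R^{(2)} &\le I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}\bigr), \tag{16.32}
\end{align*}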
16.2.2
(16.34)
R(1) I X (1) ; Y (1) X (2) , T ,
(2)
(2)
(2) (1)
X ,T ,
(16.35)
R I X ;Y
(2)
16.3
A very simple inner bound can be found by requiring that both receivers
decode both messages. This basically changes the IC into a double-MAC and
therefore will result in the following (MAC-like) achievable rate region:
\begin{align*}
R^{(1)} &\le \min\Bigl\{ I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, T\bigr),\; I\bigl(X^{(1)}; Y^{(2)} \bigm| X^{(2)}, T\bigr) \Bigr\}, \tag{16.40}\\
R^{(2)} &\le \min\Bigl\{ I\bigl(X^{(2)}; Y^{(1)} \bigm| X^{(1)}, T\bigr),\; I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}, T\bigr) \Bigr\}, \tag{16.41}\\
R^{(1)} + R^{(2)} &\le \min\Bigl\{ I\bigl(X^{(1)}, X^{(2)}; Y^{(1)} \bigm| T\bigr),\; I\bigl(X^{(1)}, X^{(2)}; Y^{(2)} \bigm| T\bigr) \Bigr\} \tag{16.42}
\end{align*}
for some $Q_T\, Q_{X^{(1)}|T}\, Q_{X^{(2)}|T}\, Q_{Y^{(1)}|X^{(1)},X^{(2)}}\, Q_{Y^{(2)}|X^{(1)},X^{(2)}}$.
This bound can be improved if we drop the requirement that the receivers
also decode the message that is not intended for them.
Theorem 16.9. On a general IC, any rate pair $\bigl(R^{(1)}, R^{(2)}\bigr)$ is achievable that
satisfies
\begin{align*}
R^{(1)} &\le I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, T\bigr), \tag{16.43}\\
R^{(2)} &\le I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}, T\bigr), \tag{16.44}\\
R^{(1)} + R^{(2)} &\le \min\Bigl\{ I\bigl(X^{(1)}, X^{(2)}; Y^{(1)} \bigm| T\bigr),\; I\bigl(X^{(1)}, X^{(2)}; Y^{(2)} \bigm| T\bigr) \Bigr\} \tag{16.45}
\end{align*}
for some $Q_T\, Q_{X^{(1)}|T}\, Q_{X^{(2)}|T}$. The auxiliary time-sharing random variable
$T$ can be restricted to take value in an alphabet $\mathcal{T}$ with $|\mathcal{T}| = 4$.
Proof: This random coding proof follows very closely the achievability
proof we have seen for the MAC capacity region $\mathcal{C}_1$.
1: Setup: Fix $R^{(1)}$, $R^{(2)}$, $Q_T$, $Q_{X^{(1)}|T}$, $Q_{X^{(2)}|T}$, and some blocklength $n$.
2: Codebook Design: Generate one length-$n$ sequence $\mathbf{T} \sim$ IID $Q_T$. Then
generate $e^{nR^{(1)}}$ length-$n$ codewords $\mathbf{X}^{(1)}(m^{(1)}) \sim Q^n_{X^{(1)}|T}(\cdot|\mathbf{T})$, $m^{(1)} =$
(1)
(2)
(16.47)
(16.48)
(1)
for some m
(1) 1, . . . , enR
. If there is a unique such m
(2) , the decoder
(2)
puts out
m
(2) , m
(2) .
(16.49)
nR
eX
(2)
nR
eX
m(1) =1 m(2) =1
en(R
(1)
+R(2) )
Pr error(1) M (1) , M (2) = m(1) , m(2)
(16.51)
[
c
= PrFm
Fm
(1) ,m(2)
(1) ,m(2)
m
(1) 6=m(1)
(1) (2)
Fm
(1) ,m
(2) m , m
(m
(1) ,m
(2) )6=(m(1) ,m(2) )
(1)
c
Pr Fm
, m(2)
(1) ,m(2) m
[
(16.52)
(1)
nR
eX
m
(1) =1
m
(1) 6=m(1)
(1)
(1)
(2)
Pr Fm
(1) ,m(2) m , m
(2)
nR
eX
nR
eX
m
(1) =1
m
(1) 6=m(1)
m
(2) =1
m
(2) 6=m(2)
(1)
(2)
Pr Fm
m
,
m
.
(1)
(2)
,m
(16.53)
X
(t,x(1) ,x(2) ,
(n)
y(1) )A
(1) T )
(1)
(2)
(1)
= A(n)
QT,X (1) ,X (2) ,Y (1) en(H(T,X )+H(X ,Y T ))
(1)
(2)
(1)
(1)
(2)
(1)
en(H(T,X ,X ,Y )+) en(H(T,X )+H(X ,Y T ))
(16.57)
(16.58)
=e
n( H(T,X (1) )H(X (2) ,Y (1) T,X (1) )+H(T,X (1) )+H(X (2) ,Y (1) T ))
(16.59)
=e
(16.60)
=e
(16.61)
=e
(16.62)
Here, in (16.56) we use TA1b based on the fact that all sequences in the
sum are typical; in (16.58) we use TA2; and in the final step (16.62)
we rely on the conditional independence between $X^{(1)}$ and $X^{(2)}$ given $T$.
Similarly, we get for m
(1) 6= m(1) , m
(2) 6= m(2) :
(1)
(2)
Pr Fm
(1) ,m
(2) m , m
X
=
QnT (t) QnX (1) T x(1) t QnX (2) T x(2) t QnY (1) T y(1) t
(t,x(1) ,x(2) ,
(n)
y(1) )A
(16.63)
n(H(T ))
(t,x(1) ,x(2) ,
(n)
y(1) )A
en(H(Y
(1) T )
(16.64)
(n)
n(H(T,X (1) )+H(X (2) T )+H(Y (1) T ))
= A
QT,X (1) ,X (2) ,Y (1) e
(16.65)
(1)
(2)
(1)
(1)
(2)
(1)
en(H(T,X ,X ,Y )+) en(H(T,X )+H(X T )+H(Y T )) (16.66)
=e
(1)
(1)
(1)
(2)
(16.67)
(16.68)
(16.69)
(1)
(16.71)
(1)
(16.72)
Note that these error probabilities will tend to zero for $n \to \infty$ as long
as the three conditions (16.43)–(16.45) are satisfied.
The bound on the alphabet size of $T$ follows from the Fenchel–Eggleston
strengthening of Carathéodory's Theorem (Theorem 1.22).
Example 16.10 (Symmetric IC). Consider an IC that is symmetric in the
sense that QY (1) X (1) ,X (2) = QY (2) X (1) ,X (2) . The capacity region for this IC is
(2)
(2)
(2) (1)
R I X ;Y
X ,T ,
(16.74)
16.4
(16.76)
(16.77)
(16.78)
(16.79)
\begin{align*}
I\bigl(X^{(1)}; Y^{(2)}\bigr) &\le I\bigl(X^{(1)}; Y^{(2)}, X^{(2)}\bigr) \tag{16.80}\\
&= \underbrace{I\bigl(X^{(1)}; X^{(2)}\bigr)}_{=\,0} + I\bigl(X^{(1)}; Y^{(2)} \bigm| X^{(2)}\bigr) \tag{16.81}\\
&= I\bigl(X^{(1)}; Y^{(2)} \bigm| X^{(2)}\bigr), \tag{16.82}
\end{align*}
we see that any IC with very strong interference implicitly also has strong
interference (hence the naming!). However, the converse is not necessarily
true.
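Example 16.13. Consider the IC with $X^{(1)}, X^{(2)} \in \{0, 1\}$ and $Y^{(1)}, Y^{(2)} \in \{0, 1, 2\}$ where
\[
Y^{(1)} = Y^{(2)} = X^{(1)} + X^{(2)}.
\tag{16.83}
\]
\begin{align*}
I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}\bigr) &= I\bigl(X^{(1)}; Y^{(2)} \bigm| X^{(2)}\bigr) = H\bigl(X^{(1)}\bigr), \tag{16.84}\\
I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}\bigr) &= I\bigl(X^{(2)}; Y^{(1)} \bigm| X^{(1)}\bigr) = H\bigl(X^{(2)}\bigr), \tag{16.85}
\end{align*}
i.e., the channel has strong interference. However, for any nontrivial input
distributions,
\begin{align*}
I\bigl(X^{(1)}; Y^{(2)}\bigr) &= H\bigl(X^{(1)}\bigr) - H\bigl(X^{(1)} \bigm| Y^{(2)}\bigr) \tag{16.86}\\
&< H\bigl(X^{(1)}\bigr) \tag{16.87}\\
&= I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}\bigr) \tag{16.88}
\end{align*}
and
\begin{align*}
I\bigl(X^{(2)}; Y^{(1)}\bigr) &= H\bigl(X^{(2)}\bigr) - H\bigl(X^{(2)} \bigm| Y^{(1)}\bigr) \tag{16.89}\\
&< H\bigl(X^{(2)}\bigr) \tag{16.90}\\
&= I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}\bigr), \tag{16.91}
\end{align*}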
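The quantities in (16.84)–(16.88) can be checked numerically for the adder channel $Y = X^{(1)} + X^{(2)}$ of Example 16.13. The Python sketch below computes $H(X^{(1)})$, $I(X^{(1)};Y)$, and $I(X^{(1)};Y|X^{(2)})$ in bits for independent binary inputs; the helper names and the choice of uniform inputs in the example call are assumptions for illustration.

```python
import math


def entropy(pmf):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginal(joint, coords):
    """Marginal PMF of the selected coordinates of the outcome tuples."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out


def adder_ic(p1, p2):
    """Return (H(X1), I(X1;Y), I(X1;Y|X2)) for Y = X1 + X2.

    X1 and X2 are independent with Pr[X1 = 1] = p1 and Pr[X2 = 1] = p2.
    """
    joint = {}
    for x1 in (0, 1):
        for x2 in (0, 1):
            p = (p1 if x1 else 1.0 - p1) * (p2 if x2 else 1.0 - p2)
            joint[(x1, x2, x1 + x2)] = p  # coordinates: (x1, x2, y)

    def h(*coords):
        return entropy(marginal(joint, coords))

    h_x1 = h(0)
    i_x1_y = h(0) + h(2) - h(0, 2)
    i_x1_y_given_x2 = h(0, 1) + h(1, 2) - h(0, 1, 2) - h(1)
    return h_x1, i_x1_y, i_x1_y_given_x2


h_x1, i_cross, i_cond = adder_ic(0.5, 0.5)  # uniform inputs (assumed)
print(h_x1, i_cross, i_cond)
```

For uniform inputs the conditional mutual information equals $H(X^{(1)})$, while the unconditional $I(X^{(1)};Y)$ is strictly smaller, in line with (16.84) and (16.86)–(16.88).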
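\begin{align*}
R^{(1)} &\le I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, T\bigr), \tag{16.92}\\
R^{(2)} &\le I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}, T\bigr) \tag{16.93}
\end{align*}
for some $Q_T\, Q_{X^{(1)}|T}\, Q_{X^{(2)}|T}$.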
Proof: Concerning achievability, note that the two terms in the third condition of Theorem 16.9 can be bounded as follows:
\begin{align*}
I\bigl(X^{(1)}, X^{(2)}; Y^{(1)} \bigm| T\bigr) &= I\bigl(X^{(2)}; Y^{(1)} \bigm| T\bigr) + I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, T\bigr) \tag{16.94}\\
&\ge I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}, T\bigr) + I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, T\bigr), \tag{16.95}
\end{align*}
where the inequality follows from the assumption of very strong interference
(16.79); and analogously
\[
I\bigl(X^{(1)}, X^{(2)}; Y^{(2)} \bigm| T\bigr) \ge I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, T\bigr) + I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}, T\bigr).
\tag{16.96}
\]
Therefore, we see that the first two conditions of Theorem 16.9 imply the
third.
Similarly, we see that the third condition of Sato's bound (Theorem 16.8)
is
\begin{align*}
I\bigl(X^{(1)}, X^{(2)}; Y^{(1)}, Y^{(2)} \bigm| T\bigr) &\ge I\bigl(X^{(1)}, X^{(2)}; Y^{(1)} \bigm| T\bigr) \tag{16.97}\\
&\ge I\bigl(X^{(2)}; Y^{(2)} \bigm| X^{(1)}, T\bigr) + I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, T\bigr), \tag{16.98}
\end{align*}
where the first inequality follows from dropping $Y^{(2)}$ and the second inequality
from (16.95). Hence, also here the first two conditions imply the third.
Note that the capacity region of an IC with very strong interference can
be achieved by successive cancellation: each decoder decodes the unwanted
message first (which it can do since the cross link is so much better than the
direct link) and then uses the knowledge of the unwanted message to decode
the wanted message.
Theorem 16.15 (IC with Strong Interference [CEG87]).
The capacity region of an IC with strong interference is given by all rate
pairs $\bigl(R^{(1)}, R^{(2)}\bigr)$ satisfying
(2)
(2)
(2) (1)
R I X ;Y
X ,T ,
(16.100)
(16.101)
for some $Q_T\, Q_{X^{(1)}|T}\, Q_{X^{(2)}|T}$. The auxiliary time-sharing random variable $T$ can be restricted to take value in an alphabet $\mathcal{T}$ with $|\mathcal{T}| = 4$.
In order to prove this, we need the following lemma.
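Lemma 16.16. For an IC with strong interference, we have for an arbitrary
$\bigl(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}\bigr) \sim Q_{\mathbf{X}^{(1)}} \cdot Q_{\mathbf{X}^{(2)}}$:
\begin{align*}
I\bigl(\mathbf{X}^{(1)}; \mathbf{Y}^{(1)} \bigm| \mathbf{X}^{(2)}\bigr) &\le I\bigl(\mathbf{X}^{(1)}; \mathbf{Y}^{(2)} \bigm| \mathbf{X}^{(2)}\bigr), \tag{16.102}\\
I\bigl(\mathbf{X}^{(2)}; \mathbf{Y}^{(2)} \bigm| \mathbf{X}^{(1)}\bigr) &\le I\bigl(\mathbf{X}^{(2)}; \mathbf{Y}^{(1)} \bigm| \mathbf{X}^{(1)}\bigr), \tag{16.103}
\end{align*}
where the vectors have an arbitrary length $n \ge 1$.
Proof: We only prove (16.102). The other inequality follows accordingly.
The following proof is an adaptation¹ from [CEG87].
Recall that by definition, an IC with strong interference satisfies
\[
I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, U = u\bigr) \le I\bigl(X^{(1)}; Y^{(2)} \bigm| X^{(2)}, U = u\bigr)
\tag{16.104}
\]
for any $Q_{X^{(1)}|U}(\cdot|u)\, Q_{X^{(2)}|U}(\cdot|u)$, where we have conditioned on $U = u$ for
an arbitrary auxiliary RV $U$. Hence, by averaging over $U$, the IC with strong
interference also satisfies
\[
I\bigl(X^{(1)}; Y^{(1)} \bigm| X^{(2)}, U\bigr) \le I\bigl(X^{(1)}; Y^{(2)} \bigm| X^{(2)}, U\bigr),
\tag{16.105}
\]
as long as $U$ ⊸– $\bigl(X^{(1)}, X^{(2)}\bigr)$ ⊸– $\bigl(Y^{(1)}, Y^{(2)}\bigr)$ and $X^{(1)}$ ⊸– $U$ ⊸– $X^{(2)}$
form Markov chains.
¹ In [CEG87] the authors claim to prove the lemma by induction. This is, however,
strictly speaking not true. The proof is rather a recursive derivation.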
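\begin{align}
& I(X_1^{(1)},\ldots,X_k^{(1)}; Y_1^{(2)},\ldots,Y_k^{(2)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&= I(Y_k^{(1)}; Y_1^{(2)},\ldots,Y_{k-1}^{(2)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_1^{(1)},\ldots,X_{k-1}^{(1)}; Y_1^{(2)},\ldots,Y_{k-1}^{(2)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad - I(Y_k^{(1)}; Y_1^{(2)},\ldots,Y_{k-1}^{(2)} \mid \mathbf{X}^{(2)}, X_1^{(1)},\ldots,X_k^{(1)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_k^{(1)}; Y_k^{(2)} \mid \mathbf{X}^{(2)}, Y_1^{(2)},\ldots,Y_{k-1}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}), \tag{16.109}
\end{align}
where two terms are zero because the channel is memoryless, i.e., conditionally on $X_i^{(1)}$ and $X_i^{(2)}$ ($i = 1,\ldots,k-1$), $Y_i^{(2)}$ is independent of $X_k^{(1)}$.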
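Very similarly (but not exactly in the same way!), we also get
\begin{align}
& I(X_1^{(1)},\ldots,X_k^{(1)}; Y_1^{(1)},\ldots,Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&= I(X_1^{(1)},\ldots,X_k^{(1)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_1^{(1)},\ldots,X_k^{(1)}; Y_1^{(1)},\ldots,Y_{k-1}^{(1)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}) \tag{16.110}\\
&= I(X_1^{(1)},\ldots,X_k^{(1)}, Y_1^{(2)},\ldots,Y_{k-1}^{(2)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad - I(Y_1^{(2)},\ldots,Y_{k-1}^{(2)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, X_1^{(1)},\ldots,X_k^{(1)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_1^{(1)},\ldots,X_k^{(1)}; Y_1^{(1)},\ldots,Y_{k-1}^{(1)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}) \tag{16.111}\\
&= I(Y_1^{(2)},\ldots,Y_{k-1}^{(2)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_k^{(1)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_1^{(2)},\ldots,Y_{k-1}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + \underbrace{I(X_1^{(1)},\ldots,X_{k-1}^{(1)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, X_k^{(1)}, Y_1^{(2)},\ldots,Y_{k-1}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)})}_{=\,0} \nonumber\\
&\quad - I(Y_1^{(2)},\ldots,Y_{k-1}^{(2)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, X_1^{(1)},\ldots,X_k^{(1)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_1^{(1)},\ldots,X_{k-1}^{(1)}; Y_1^{(1)},\ldots,Y_{k-1}^{(1)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + \underbrace{I(X_k^{(1)}; Y_1^{(1)},\ldots,Y_{k-1}^{(1)} \mid \mathbf{X}^{(2)}, X_1^{(1)},\ldots,X_{k-1}^{(1)}, Y_k^{(1)},\ldots,Y_n^{(1)})}_{=\,0} \tag{16.112}\\
&= I(Y_1^{(2)},\ldots,Y_{k-1}^{(2)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_k^{(1)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_1^{(2)},\ldots,Y_{k-1}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad - I(Y_1^{(2)},\ldots,Y_{k-1}^{(2)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, X_1^{(1)},\ldots,X_k^{(1)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_1^{(1)},\ldots,X_{k-1}^{(1)}; Y_1^{(1)},\ldots,Y_{k-1}^{(1)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}). \tag{16.113}
\end{align}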
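Hence, subtracting (16.113) from (16.109) and noting that the first and third terms cancel, we get
\begin{align}
& I(X_1^{(1)},\ldots,X_k^{(1)}; Y_1^{(2)},\ldots,Y_k^{(2)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad - I(X_1^{(1)},\ldots,X_k^{(1)}; Y_1^{(1)},\ldots,Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&= I(X_1^{(1)},\ldots,X_{k-1}^{(1)}; Y_1^{(2)},\ldots,Y_{k-1}^{(2)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad - I(X_1^{(1)},\ldots,X_{k-1}^{(1)}; Y_1^{(1)},\ldots,Y_{k-1}^{(1)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_k^{(1)}; Y_k^{(2)} \mid \mathbf{X}^{(2)}, Y_1^{(2)},\ldots,Y_{k-1}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad - I(X_k^{(1)}; Y_k^{(1)} \mid \mathbf{X}^{(2)}, Y_1^{(2)},\ldots,Y_{k-1}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}) \tag{16.114}\\
&= I(X_1^{(1)},\ldots,X_{k-1}^{(1)}; Y_1^{(2)},\ldots,Y_{k-1}^{(2)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad - I(X_1^{(1)},\ldots,X_{k-1}^{(1)}; Y_1^{(1)},\ldots,Y_{k-1}^{(1)} \mid \mathbf{X}^{(2)}, Y_k^{(1)},\ldots,Y_n^{(1)}) \nonumber\\
&\quad + I(X_k^{(1)}; Y_k^{(2)} \mid X_k^{(2)}, U_k) - I(X_k^{(1)}; Y_k^{(1)} \mid X_k^{(2)}, U_k), \tag{16.115}
\end{align}
where we have defined
\[ U_k \triangleq \bigl(X_1^{(2)},\ldots,X_{k-1}^{(2)}, X_{k+1}^{(2)},\ldots,X_n^{(2)}, Y_1^{(2)},\ldots,Y_{k-1}^{(2)}, Y_{k+1}^{(1)},\ldots,Y_n^{(1)}\bigr). \tag{16.116} \]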
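By recursively using (16.115) starting with $k = n$ and going backwards until $k = 1$, we finally obtain
\begin{align}
& I(\mathbf{X}^{(1)};\mathbf{Y}^{(2)}\mid \mathbf{X}^{(2)}) - I(\mathbf{X}^{(1)};\mathbf{Y}^{(1)}\mid \mathbf{X}^{(2)}) \nonumber\\
&= \sum_{k=1}^{n} \Bigl( I(X_k^{(1)}; Y_k^{(2)} \mid X_k^{(2)}, U_k) - I(X_k^{(1)}; Y_k^{(1)} \mid X_k^{(2)}, U_k) \Bigr). \tag{16.117}
\end{align}
Note that
\[ U_k \multimap (X_k^{(1)},X_k^{(2)}) \multimap (Y_k^{(1)},Y_k^{(2)}) \tag{16.118} \]
and
\[ X_k^{(1)} \multimap U_k \multimap X_k^{(2)} \tag{16.119} \]
form Markov chains such that (16.105) can be applied to each summand of the sum in (16.117), showing that each summand is nonnegative. This proves the lemma.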
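Proof of Theorem 16.15: The achievability follows directly from Theorem 16.9. For the converse, we take over the derivation of the first two inequalities in Sato's outer bound (i.e., in the converse of the MAC in Section 10.4.3).
For the third inequality we adapt Sato's proof as follows:
\begin{align}
n\bigl(R^{(1)} + R^{(2)}\bigr) &= H(M^{(1)}) + H(M^{(2)}) \tag{16.120}\\
&= I(M^{(1)};\mathbf{Y}^{(1)}) + H(M^{(1)}\mid \mathbf{Y}^{(1)}) + I(M^{(2)};\mathbf{Y}^{(2)}) + H(M^{(2)}\mid \mathbf{Y}^{(2)}) \tag{16.121}\\
&\le I(M^{(1)};\mathbf{Y}^{(1)}) + I(M^{(2)};\mathbf{Y}^{(2)}) + n\epsilon_n^{(1)} + n\epsilon_n^{(2)} \tag{16.122}\\
&\le I(\mathbf{X}^{(1)};\mathbf{Y}^{(1)}) + I(\mathbf{X}^{(2)};\mathbf{Y}^{(2)}) + n\epsilon_n^{(1)} + n\epsilon_n^{(2)} \tag{16.123}\\
&\le I(\mathbf{X}^{(1)};\mathbf{Y}^{(1)},\mathbf{X}^{(2)}) + I(\mathbf{X}^{(2)};\mathbf{Y}^{(2)}) + n\epsilon_n^{(1)} + n\epsilon_n^{(2)} \tag{16.124}\\
&= I(\mathbf{X}^{(1)};\mathbf{Y}^{(1)}\mid \mathbf{X}^{(2)}) + I(\mathbf{X}^{(2)};\mathbf{Y}^{(2)}) + n\epsilon_n^{(1)} + n\epsilon_n^{(2)} \tag{16.125}\\
&\le I(\mathbf{X}^{(1)};\mathbf{Y}^{(2)}\mid \mathbf{X}^{(2)}) + I(\mathbf{X}^{(2)};\mathbf{Y}^{(2)}) + n\epsilon_n^{(1)} + n\epsilon_n^{(2)} \tag{16.126}\\
&= I(\mathbf{X}^{(1)},\mathbf{X}^{(2)};\mathbf{Y}^{(2)}) + n\epsilon_n^{(1)} + n\epsilon_n^{(2)}, \tag{16.127}
\end{align}
where the last inequality (16.126) follows from Lemma 16.16. The remainder of the proof is identical to (10.82)–(10.92).
This proves $R^{(1)} + R^{(2)} \le I(X^{(1)},X^{(2)};Y^{(2)}\mid T)$. The bound $R^{(1)} + R^{(2)} \le I(X^{(1)},X^{(2)};Y^{(1)}\mid T)$ follows accordingly.
The bound on the alphabet size of $T$ follows from the Fenchel–Eggleston strengthening of Carathéodory's Theorem (Theorem 1.22).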
16.5 Han–Kobayashi Region
So far we have seen the coding strategy where both receivers decode both
messages. Quite intuitively, this turns out to be optimal in the situation of
strong interference. However, in general this is not optimal and there are
quite a few other natural coding strategies like, e.g., treating the interference
as noise or using some kind of orthogonal transmission like TDMA or FDMA.
The random coding strategy that includes all mentioned strategies as special cases and that yields the best known achievable region to date is called the Han–Kobayashi coding scheme. It contains a fundamentally new aspect of random coding that we have not seen so far: rate splitting. The basic idea here is to be more flexible with respect to what part of a message is private (i.e., it will only be decoded by the intended receiver) and what is public in the sense that also the unintended receiver will decode it in order to help with the decoding of the wanted message.
Related to rate splitting is the concept of nonunique decoding: so far the receiver always tried to correctly decode all those messages that it was interested in. However, in an IC, we might also try to decode the unwanted message as it might help with the decoding of the wanted message. The receiver, however, does not care whether the decoding of the unwanted message turns out to be successful or not as long as it helps with the wanted message! Hence, the unwanted message need not be uniquely decoded in the end.
16.5.1
1: Setup: For both $i \in \{1, 2\}$, we split $M^{(i)}$ up into two independent parts $M'^{(i)}$ (the public part) and $M''^{(i)}$ (the private part) where $M'^{(i)}$ and $M''^{(i)}$ have the rates $R'^{(i)}$ and $R''^{(i)}$, respectively, and where $R'^{(i)} + R''^{(i)} = R^{(i)}$. The idea is that decoder 1 decodes $M^{(1)} = (M'^{(1)}, M''^{(1)})$ and $M'^{(2)}$, and that decoder 2 decodes $M^{(2)} = (M'^{(2)}, M''^{(2)})$ and $M'^{(1)}$.
Moreover, we choose some $Q_{U^{(1)}}$, $Q_{U^{(2)}}$, $Q_{X^{(1)}|U^{(1)}}$, $Q_{X^{(2)}|U^{(2)}}$, and some blocklength $n$.
0(i)
,m
. Otherwise
it declares an error.
\[ \hat{M}'^{(1)} = M'^{(1)}, \quad \hat{M}''^{(1)} = M''^{(1)}, \quad \hat{M}'^{(2)} \ne M'^{(2)} \tag{16.129} \]
is not considered an error since we do not care if the first decoder cannot
decode the unwanted message. This leaves us with 6 possible error events.
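As we have done such error bounding so many times before, we take the liberty of using a slightly more sloppy notation. We first mention that the probability that the correct codeword is not jointly typical with the cloud centers and the received sequence is very small:
\[ \Pr\bigl[\bigl(\mathbf{U}^{(1)}(\text{right}), \mathbf{X}^{(1)}(\text{right},\text{right}), \mathbf{U}^{(2)}(\text{right}), \mathbf{Y}^{(1)}\bigr) \notin \mathcal{A}_\epsilon^{(n)}\bigr] \le \delta_t. \tag{16.130} \]
Hence, we only need to check whether there exists a combination of partially wrong codewords that accidentally look jointly typical:
\begin{align}
& \Pr\bigl(M'^{(1)} \text{ in error}\bigr) \nonumber\\
&= \sum_{\text{wrong } m'^{(1)}} \Pr\bigl[\bigl(\mathbf{U}^{(1)}(\text{wrong}), \mathbf{X}^{(1)}(\text{wrong},\text{right}), \mathbf{U}^{(2)}(\text{right}), \mathbf{Y}^{(1)}\bigr) \in \mathcal{A}_\epsilon^{(n)}\bigr] \tag{16.131}\\
&= \sum_{\text{wrong } m'^{(1)}} \ \sum_{(\mathbf{u}^{(1)},\mathbf{x}^{(1)},\mathbf{u}^{(2)},\mathbf{y}^{(1)}) \in \mathcal{A}_\epsilon^{(n)}} Q_{\mathbf{U}^{(1)},\mathbf{X}^{(1)}}\bigl(\mathbf{u}^{(1)},\mathbf{x}^{(1)}\bigr)\, Q_{\mathbf{U}^{(2)},\mathbf{Y}^{(1)}}\bigl(\mathbf{u}^{(2)},\mathbf{y}^{(1)}\bigr) \tag{16.132}\\
&\le e^{nR'^{(1)}}\, e^{n\left(H(U^{(1)},X^{(1)},U^{(2)},Y^{(1)}) - H(U^{(1)},X^{(1)}) - H(U^{(2)},Y^{(1)}) + \epsilon\right)} \tag{16.133}\\
&= e^{n\left(R'^{(1)} - I(U^{(1)},X^{(1)};U^{(2)},Y^{(1)}) + \epsilon\right)} \tag{16.134}\\
&= e^{n\left(R'^{(1)} - I(U^{(1)},X^{(1)};U^{(2)}) - I(U^{(1)},X^{(1)};Y^{(1)}\mid U^{(2)}) + \epsilon\right)} \tag{16.135}\\
&= e^{n\left(R'^{(1)} - I(U^{(1)},X^{(1)};Y^{(1)}\mid U^{(2)}) + \epsilon\right)} \tag{16.136}\\
&= e^{n\left(R'^{(1)} - I(X^{(1)};Y^{(1)}\mid U^{(2)}) - I(U^{(1)};Y^{(1)}\mid U^{(2)},X^{(1)}) + \epsilon\right)} \tag{16.137}\\
&= e^{n\left(R'^{(1)} - I(X^{(1)};Y^{(1)}\mid U^{(2)}) + \epsilon\right)}. \tag{16.138}
\end{align}
Here, in (16.136) we have used that $(U^{(1)}, X^{(1)})$ is independent of $(U^{(2)}, X^{(2)})$, and in (16.138) we use the Markov structure of $U^{(1)} \multimap X^{(1)} \multimap Y^{(1)}$ (which holds irrespective of whether we condition on $U^{(2)}$ or not).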
i
U(2) (right), Y(1) A(n)
X
QU (1) ,U (2) ,Y (1) u(1) , u(2) , y(1)
(16.139)
y(1) )A
QX (1) U (1) x(1) u(1)
(16.140)
enR
en(
00(1)
(1)
(2)
(1)
(1)
= en(R I(X ;U ,Y U )+)
00(1)
00(1)
) (16.141)
H(U (1) ,X (1) ,U (2) ,Y (1) )H(U (1) ,U (2) ,Y (1) )H(X (1) U (1) )+
(16.142)
= en(R
00(1)
(1)
(1)
(1)
(2)
= en(R I(X ;Y U ,U )+) .
I(X (1) ;U (2) U (1) )I(X (1) ;Y (1) U (1) ,U (2) )+
(16.143)
(16.144)
i
U(2) (right), Y(1) A(n)
X
QU (1) ,X (1) u(1) , x(1)
(16.145)
y(1) )A
en(R
0(1)
= en(R
0(1)
+R
00(1)
) en(
+R
00(1)
(16.146)
)
H(U (1) ,X (1) ,U (2) ,Y (1) )H(U (1) ,X (1) )H(U (2) ,Y (1) )+
(16.147)
),
(16.148)
i
U(2) (wrong), Y(1) A(n)
X
(1)
QU (1) ,X (1) u , x(1)
(16.149)
y(1) )A
QU (2) u(2) QY (1) y(1)
(16.150)
0(2)
en(R +R )
(1)
(1)
(2)
(1)
(1)
(1)
(2)
(1)
en(H(U ,X ,U ,Y )H(U ,X )H(U )H(Y )+)
(16.151)
n(R0(1) +R0(2) +H(U (2) ,Y (1) U (1) ,X (1) )H(U (2) U (1) ,X (1) )H(Y (1) )+)
(16.152)
=e
(16.153)
=e
=e
(16.154)
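where in (16.152) we have used the independence of $U^{(2)}$ and $(U^{(1)}, X^{(1)})$, and in (16.154) the Markovity of $U^{(1)} \multimap X^{(1)} \multimap Y^{(1)}$.
The fifth event is as follows:
\begin{align}
& \Pr\bigl(M''^{(1)} \text{ and } M'^{(2)} \text{ in error}\bigr) \nonumber\\
&= \sum_{\text{wrong } m''^{(1)}, m'^{(2)}} \Pr\bigl[\bigl(\mathbf{U}^{(1)}(\text{right}), \mathbf{X}^{(1)}(\text{right},\text{wrong}), \mathbf{U}^{(2)}(\text{wrong}), \mathbf{Y}^{(1)}\bigr) \in \mathcal{A}_\epsilon^{(n)}\bigr] \tag{16.155}\\
&= \sum_{\text{wrong } m''^{(1)}, m'^{(2)}} \ \sum_{(\mathbf{u}^{(1)},\mathbf{x}^{(1)},\mathbf{u}^{(2)},\mathbf{y}^{(1)}) \in \mathcal{A}_\epsilon^{(n)}} Q_{\mathbf{U}^{(1)},\mathbf{Y}^{(1)}}\bigl(\mathbf{u}^{(1)},\mathbf{y}^{(1)}\bigr)\, Q_{\mathbf{U}^{(2)}}\bigl(\mathbf{u}^{(2)}\bigr)\, Q_{\mathbf{X}^{(1)}|\mathbf{U}^{(1)}}\bigl(\mathbf{x}^{(1)}\big|\mathbf{u}^{(1)}\bigr) \tag{16.156}\\
&\le e^{n(R''^{(1)}+R'^{(2)})}\, e^{n\left(H(U^{(1)},X^{(1)},U^{(2)},Y^{(1)}) - H(U^{(1)},Y^{(1)}) - H(U^{(2)}) - H(X^{(1)}|U^{(1)}) + \epsilon\right)} \tag{16.157}\\
&= e^{n\left(R''^{(1)}+R'^{(2)} + H(X^{(1)},U^{(2)}|U^{(1)},Y^{(1)}) - H(U^{(2)}) - H(X^{(1)}|U^{(1)}) + \epsilon\right)} \tag{16.158}\\
&= e^{n\left(R''^{(1)}+R'^{(2)} + H(X^{(1)},U^{(2)}|U^{(1)},Y^{(1)}) - H(U^{(2)}|U^{(1)},X^{(1)}) - H(X^{(1)}|U^{(1)}) + \epsilon\right)} \tag{16.159}\\
&= e^{n\left(R''^{(1)}+R'^{(2)} + H(X^{(1)},U^{(2)}|U^{(1)},Y^{(1)}) - H(X^{(1)},U^{(2)}|U^{(1)}) + \epsilon\right)} \tag{16.160}\\
&= e^{n\left(R''^{(1)}+R'^{(2)} - I(X^{(1)},U^{(2)};Y^{(1)}\mid U^{(1)}) + \epsilon\right)}. \tag{16.161}
\end{align}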
i
U(2) (wrong), Y(1) A(n)
X
QU (1) ,X (1) u(1) , x(1)
(16.162)
wrong
(u(1) ,x(1) ,u(2) ,
m0(1) ,m00(1) ,m0(2)
(n)
y(1) )A
QU (2) u(2) QY (1) y(1)
(1)
(2)
(1)
(1)
(1)
(16.163)
(16.164)
(16.165)
\begin{align}
R'^{(1)} &< I(X^{(1)};Y^{(1)}\mid U^{(2)}), \tag{16.166}\\
R''^{(1)} &< I(X^{(1)};Y^{(1)}\mid U^{(1)},U^{(2)}), \tag{16.167}\\
R'^{(1)} + R''^{(1)} &< I(X^{(1)};Y^{(1)}\mid U^{(2)}), \tag{16.168}\\
R'^{(1)} + R'^{(2)} &< I(X^{(1)},U^{(2)};Y^{(1)}), \tag{16.169}\\
R''^{(1)} + R'^{(2)} &< I(X^{(1)},U^{(2)};Y^{(1)}\mid U^{(1)}), \tag{16.170}\\
R'^{(1)} + R''^{(1)} + R'^{(2)} &< I(X^{(1)},U^{(2)};Y^{(1)}). \tag{16.171}
\end{align}
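Note that (16.166) and (16.169) are redundant, which leaves us with four bounds. We combine them with the corresponding four bounds of decoder 2 and replace $R''^{(i)}$ by $R^{(i)} - R'^{(i)}$:
\begin{align}
R^{(1)} - R'^{(1)} &< I(X^{(1)};Y^{(1)}\mid U^{(1)},U^{(2)}), \tag{16.172}\\
R^{(2)} - R'^{(2)} &< I(X^{(2)};Y^{(2)}\mid U^{(1)},U^{(2)}), \tag{16.173}\\
R^{(1)} &< I(X^{(1)};Y^{(1)}\mid U^{(2)}), \tag{16.174}\\
R^{(2)} &< I(X^{(2)};Y^{(2)}\mid U^{(1)}), \tag{16.175}\\
R^{(1)} - R'^{(1)} + R'^{(2)} &< I(X^{(1)},U^{(2)};Y^{(1)}\mid U^{(1)}), \tag{16.176}\\
R^{(2)} - R'^{(2)} + R'^{(1)} &< I(X^{(2)},U^{(1)};Y^{(2)}\mid U^{(2)}), \tag{16.177}\\
R^{(1)} + R'^{(2)} &< I(X^{(1)},U^{(2)};Y^{(1)}), \tag{16.178}\\
R^{(2)} + R'^{(1)} &< I(X^{(2)},U^{(1)};Y^{(2)}). \tag{16.179}
\end{align}
This finishes the first stage of the proof.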
This finishes the first stage of the proof.
16.5.2 Fourier–Motzkin Elimination
We next apply Fourier–Motzkin elimination (see Section 1.3) to get rid of the unwanted rates $R'^{(i)}$. We rewrite (16.172)–(16.179) as follows:
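\[
\begin{pmatrix}
1 & 0 & -1 & 0\\
0 & 1 & 0 & -1\\
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
1 & 0 & -1 & 1\\
0 & 1 & 1 & -1\\
1 & 0 & 0 & 1\\
0 & 1 & 1 & 0\\
-1 & 0 & 0 & 0\\
0 & -1 & 0 & 0\\
0 & 0 & -1 & 0\\
0 & 0 & 0 & -1
\end{pmatrix}
\begin{pmatrix} R^{(1)}\\ R^{(2)}\\ R'^{(1)}\\ R'^{(2)} \end{pmatrix}
\le
\begin{pmatrix} I_1\\ I_2\\ I_3\\ I_4\\ I_5\\ I_6\\ I_7\\ I_8\\ 0\\ 0\\ 0\\ 0 \end{pmatrix},
\tag{16.180}
\]
where
$I_1 \triangleq I(X^{(1)};Y^{(1)}\mid U^{(1)},U^{(2)})$,
$I_2 \triangleq I(X^{(2)};Y^{(2)}\mid U^{(1)},U^{(2)})$,
$I_3 \triangleq I(X^{(1)};Y^{(1)}\mid U^{(2)})$,
$I_4 \triangleq I(X^{(2)};Y^{(2)}\mid U^{(1)})$,
$I_5 \triangleq I(X^{(1)},U^{(2)};Y^{(1)}\mid U^{(1)})$,
$I_6 \triangleq I(X^{(2)},U^{(1)};Y^{(2)}\mid U^{(2)})$,
$I_7 \triangleq I(X^{(1)},U^{(2)};Y^{(1)})$,
$I_8 \triangleq I(X^{(2)},U^{(1)};Y^{(2)})$.
Eliminating $R'^{(2)}$ yields
\[
\begin{pmatrix}
1 & 0 & -1\\
1 & 1 & -1\\
1 & 1 & 0\\
1 & 0 & 0\\
0 & 1 & 0\\
1 & 1 & 0\\
1 & 0 & -1\\
1 & 1 & 1\\
1 & 0 & 0\\
0 & 1 & 1\\
-1 & 0 & 0\\
0 & -1 & 0\\
0 & 0 & -1
\end{pmatrix}
\begin{pmatrix} R^{(1)}\\ R^{(2)}\\ R'^{(1)} \end{pmatrix}
\le
\begin{pmatrix} I_1\\ I_2 + I_5\\ I_2 + I_7\\ I_3\\ I_4\\ I_5 + I_6\\ I_5\\ I_6 + I_7\\ I_7\\ I_8\\ 0\\ 0\\ 0 \end{pmatrix}.
\tag{16.181}
\]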
Note that $I_1 \le I_5$, i.e., the first bound implies the 7th, and similarly that $I_3 \le I_7$, i.e., the 4th bound implies the 9th. Hence, we remove the 7th and 9th bound and continue to eliminate $R'^{(1)}$:
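\[
\begin{pmatrix}
2 & 1\\
1 & 1\\
2 & 2\\
1 & 2\\
1 & 1\\
1 & 0\\
0 & 1\\
1 & 1\\
1 & 1\\
0 & 1\\
-1 & 0\\
0 & -1
\end{pmatrix}
\begin{pmatrix} R^{(1)}\\ R^{(2)} \end{pmatrix}
\le
\begin{pmatrix} I_1 + I_6 + I_7\\ I_1 + I_8\\ I_2 + I_5 + I_6 + I_7\\ I_2 + I_5 + I_8\\ I_2 + I_7\\ I_3\\ I_4\\ I_5 + I_6\\ I_6 + I_7\\ I_8\\ 0\\ 0 \end{pmatrix}.
\tag{16.182}
\]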
Note that the third bound is the sum of the 5th and 8th; that $I_4 \le I_8$ and therefore the 7th bound implies the 10th; and that the 5th implies the 9th bound because $I_2 \le I_6$. So we remove the third, the 9th, and the 10th bound. Moreover, we also omit the obvious nonnegativity constraints (last two bounds):
\[
\begin{pmatrix}
2 & 1\\
1 & 1\\
1 & 2\\
1 & 1\\
1 & 0\\
0 & 1\\
1 & 1
\end{pmatrix}
\begin{pmatrix} R^{(1)}\\ R^{(2)} \end{pmatrix}
\le
\begin{pmatrix} I_1 + I_6 + I_7\\ I_1 + I_8\\ I_2 + I_5 + I_8\\ I_2 + I_7\\ I_3\\ I_4\\ I_5 + I_6 \end{pmatrix}.
\tag{16.183}
\]
16.5.3
This almost proves the Han–Kobayashi region. The only remaining part is time-sharing. As mentioned before at the end of Section 10.5.2, the convex hull of the region defined by (16.183) might be smaller than the region obtained if we perform an additional coded time-sharing operation. Hence, in the code generation we actually should first create a random sequence $\mathbf{T} \sim Q_T^n(\cdot)$ and then create the sequences $\mathbf{U}^{(i)}(m'^{(i)}) \sim Q^n_{U^{(i)}|T}(\cdot|\mathbf{T})$ and $\mathbf{X}^{(i)}(m'^{(i)}, m''^{(i)}) \sim Q^n_{X^{(i)}|U^{(i)},T}\bigl(\cdot\big|\mathbf{U}^{(i)}(m'^{(i)}), \mathbf{T}\bigr)$. All expressions involving a typical set must be adapted to include $\mathbf{T}$. Then we would have to go through the whole derivation again and would realize that we get the same expressions as in (16.183) apart from the fact that all are conditioned on $T$.
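The elimination of Section 16.5.2 can be replayed mechanically. The following is a minimal sketch, not the notes' own procedure: it assumes the twelve inequalities of (16.180) with right-hand sides kept symbolic as integer combinations of $I_1,\ldots,I_8$, and the helper names `rhs`, `eliminate`, `SYSTEM` are hypothetical. It performs plain Fourier–Motzkin steps only; it does not apply the redundancy arguments ($I_1 \le I_5$, etc.), so its output contains the seven bounds of (16.183) together with some implied extra rows.

```python
from itertools import product

# Each inequality is (a, b): a . (R1, R2, R1', R2') <= b . (I1, ..., I8).
# a and b are integer coefficient tuples; nonnegativity rows have b = 0.
def rhs(*idx):
    """Coefficient vector of a sum of I_j, e.g. rhs(2, 5) stands for I2 + I5."""
    return tuple(idx.count(j) for j in range(1, 9))

SYSTEM = [
    ((1, 0, -1, 0), rhs(1)), ((0, 1, 0, -1), rhs(2)),
    ((1, 0, 0, 0), rhs(3)),  ((0, 1, 0, 0), rhs(4)),
    ((1, 0, -1, 1), rhs(5)), ((0, 1, 1, -1), rhs(6)),
    ((1, 0, 0, 1), rhs(7)),  ((0, 1, 1, 0), rhs(8)),
    ((-1, 0, 0, 0), rhs()),  ((0, -1, 0, 0), rhs()),
    ((0, 0, -1, 0), rhs()),  ((0, 0, 0, -1), rhs()),
]

def eliminate(rows, j):
    """One Fourier-Motzkin step: project out variable number j."""
    pos = [r for r in rows if r[0][j] > 0]
    neg = [r for r in rows if r[0][j] < 0]
    kept = [r for r in rows if r[0][j] == 0]
    for (ap, bp), (an, bn) in product(pos, neg):
        cp, cn = -an[j], ap[j]  # positive multipliers that cancel variable j
        kept.append((tuple(cp * x + cn * y for x, y in zip(ap, an)),
                     tuple(cp * x + cn * y for x, y in zip(bp, bn))))
    # Drop column j and remove duplicate rows.
    return sorted({(a[:j] + a[j + 1:], b) for a, b in kept})

after_r2p = eliminate(SYSTEM, 3)  # removes R2' (compare with (16.181))
final = eliminate(after_r2p, 2)   # removes R1'

for coeffs, b in final:
    terms = " + ".join(f"I{j + 1}" if c == 1 else f"{c} I{j + 1}"
                       for j, c in enumerate(b) if c)
    print(coeffs, "<=", terms or "0")
```

Keeping the right-hand sides symbolic makes it easy to compare each printed row with the matrices above; pruning the implied rows still requires the information-theoretic inequalities between the $I_j$.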
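Theorem 16.17 (Han–Kobayashi Achievable Rate Region [HK81], [CMGEG08]).
For a general DM-IC, an achievable rate region is given by all nonnegative rate pairs $(R^{(1)},R^{(2)})$ satisfying
\begin{align}
R^{(1)} &\le I(X^{(1)};Y^{(1)}\mid U^{(2)},T), \tag{16.184}\\
R^{(2)} &\le I(X^{(2)};Y^{(2)}\mid U^{(1)},T), \tag{16.185}\\
R^{(1)} + R^{(2)} &\le I(X^{(1)},U^{(2)};Y^{(1)}\mid T) + I(X^{(2)};Y^{(2)}\mid U^{(1)},U^{(2)},T), \tag{16.186}\\
R^{(1)} + R^{(2)} &\le I(X^{(2)},U^{(1)};Y^{(2)}\mid T) + I(X^{(1)};Y^{(1)}\mid U^{(1)},U^{(2)},T), \tag{16.187}\\
R^{(1)} + R^{(2)} &\le I(X^{(1)},U^{(2)};Y^{(1)}\mid U^{(1)},T) + I(X^{(2)},U^{(1)};Y^{(2)}\mid U^{(2)},T), \tag{16.188}\\
2R^{(1)} + R^{(2)} &\le I(X^{(1)},U^{(2)};Y^{(1)}\mid T) + I(X^{(1)};Y^{(1)}\mid U^{(1)},U^{(2)},T) \nonumber\\
&\quad + I(X^{(2)},U^{(1)};Y^{(2)}\mid U^{(2)},T), \tag{16.189}\\
R^{(1)} + 2R^{(2)} &\le I(X^{(2)},U^{(1)};Y^{(2)}\mid T) + I(X^{(2)};Y^{(2)}\mid U^{(1)},U^{(2)},T) \nonumber\\
&\quad + I(X^{(1)},U^{(2)};Y^{(1)}\mid U^{(1)},T) \tag{16.190}
\end{align}
for some
\[ Q_T \cdot Q_{U^{(1)}|T} \cdot Q_{X^{(1)}|U^{(1)},T} \cdot Q_{U^{(2)}|T} \cdot Q_{X^{(2)}|U^{(2)},T} \cdot Q_{Y^{(1)},Y^{(2)}|X^{(1)},X^{(2)}}, \tag{16.191} \]
where $Q_{Y^{(1)},Y^{(2)}|X^{(1)},X^{(2)}}$ must have the same marginals (16.17), (16.18) as the given IC. The auxiliary random variables can be restricted to take value in alphabets of size $|\mathcal{T}| \le 7$ and $|\mathcal{U}^{(i)}| \le |\mathcal{X}^{(i)}| + 4$, $i = 1, 2$, respectively.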
Proof: The only part that remains to be proven are the bounds on the alphabet sizes of the auxiliary random variables. First, note that the bound on $|\mathcal{T}|$ is a straightforward consequence of the Fenchel–Eggleston strengthening of Carathéodory's Theorem (Theorem 1.22): we have seven bounds with all terms in the bounds being conditional on $T$. Hence, the rate region can be described by a linear combination of vectors with seven components.
So, we restrict $T$ to size 7, fix some distribution $Q_T$ and condition everything on $T = t$ for a fixed $t$. We turn to $U^{(1)}$: For given distributions $Q_{X^{(1)}|U^{(1)},T}(\cdot|\cdot,t)$, $Q_{U^{(2)}|T}(\cdot|t)$, $Q_{X^{(2)}|U^{(2)},T}(\cdot|\cdot,t)$, $Q_{Y^{(1)}|X^{(1)},X^{(2)}}$, and $Q_{Y^{(2)}|X^{(1)},X^{(2)}}$, we define a vector $\mathbf{v}_{u^{(1)}}$ with the following $|\mathcal{X}^{(1)}| + 4$ components:
\begin{align}
v^{(1)}_{u^{(1)}} &\triangleq I(X^{(2)};Y^{(2)}\mid U^{(1)} = u^{(1)}, T = t), \tag{16.192}\\
v^{(2)}_{u^{(1)}} &\triangleq I(X^{(2)};Y^{(2)}\mid U^{(1)} = u^{(1)}, U^{(2)}, T = t), \tag{16.193}\\
v^{(3)}_{u^{(1)}} &\triangleq H(Y^{(2)}\mid U^{(1)} = u^{(1)}, T = t) + I(X^{(1)};Y^{(1)}\mid U^{(1)} = u^{(1)}, U^{(2)}, T = t), \tag{16.194}\\
v^{(4)}_{u^{(1)}} &\triangleq H(Y^{(2)}\mid U^{(1)} = u^{(1)}, U^{(2)}, T = t) + I(X^{(1)};Y^{(1)}\mid U^{(1)} = u^{(1)}, U^{(2)}, T = t), \tag{16.195}\\
v^{(5)}_{u^{(1)}} &\triangleq I(U^{(2)};Y^{(1)}\mid U^{(1)} = u^{(1)}, T = t) + I(X^{(2)};Y^{(2)}\mid U^{(1)} = u^{(1)}, U^{(2)}, T = t), \tag{16.196}\\
v^{(6)}_{u^{(1)}} &\triangleq Q_{X^{(1)}|U^{(1)},T}\bigl(1\big|u^{(1)},t\bigr), \tag{16.197}\\
&\ \ \vdots \nonumber\\
v^{(|\mathcal{X}^{(1)}|+4)}_{u^{(1)}} &\triangleq Q_{X^{(1)}|U^{(1)},T}\bigl(|\mathcal{X}^{(1)}| - 1\big|u^{(1)},t\bigr). \tag{16.198}
\end{align}
It can now be checked that, for any choice of $Q_{U^{(1)}|T}$, all terms on the RHS of (16.184)–(16.190) when conditioned on $T = t$ are given by the components of a $\mathbf{v}$ that is defined as a linear combination of $\mathbf{v}_{u^{(1)}}$:
\[ \mathbf{v} \triangleq \sum_{u^{(1)} \in \mathcal{U}^{(1)}} Q_{U^{(1)}|T}\bigl(u^{(1)}\big|t\bigr)\, \mathbf{v}_{u^{(1)}}. \tag{16.199} \]
16.6 Gaussian IC
Once again we will also discuss the Gaussian case even though strictly speaking our proofs do not directly generalize to continuous channel models. The
Gaussian IC is particularly illustrative because one can very nicely demonstrate the different coding strategies.
16.6.1 Channel Model
The most general model for a Gaussian interference channel looks as follows:
\begin{align}
Y'^{(1)} &= c_{11} x'^{(1)} + c_{12} x'^{(2)} + Z'^{(1)}, \tag{16.200}\\
Y'^{(2)} &= c_{21} x'^{(1)} + c_{22} x'^{(2)} + Z'^{(2)}, \tag{16.201}
\end{align}
where c11 , c12 , c21 , c22 are fixed constants, where the noise is jointly Gaussian
0
Z (1) , Z (2)
T
N (0, KZ0Z0 )
(16.202)
2
(1)
(12)
(12)
2
(2)
!
,
(16.203)
(16.204)
1 (1)0
Y
,
(1)
2 (2)0
,
Y
,
(2)
1 (1)0
Z ,
(1)
2 (2)0
,
Z .
(2)
x(1) ,
Y (1) ,
Z (1) ,
(16.205)
x(2)
Y (2)
Z (2)
(16.206)
Then we get
= x(1) +
x + Z (1) ,
Y
(1) c22
(2) c11
(16.207)
(16.208)
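Hence, we have found the so-called standard form of the Gaussian IC.
\begin{align}
Y^{(1)} &= x^{(1)} + \sqrt{a_{12}}\, x^{(2)} + Z^{(1)}, \tag{16.209}\\
Y^{(2)} &= \sqrt{a_{21}}\, x^{(1)} + x^{(2)} + Z^{(2)}, \tag{16.210}
\end{align}
where $a_{12}$ and $a_{21}$ are fixed nonnegative constants, where the noise is jointly Gaussian
\[ \bigl(Z^{(1)}, Z^{(2)}\bigr)^{\mathsf{T}} \sim \mathcal{N}\!\left(\mathbf{0}, \begin{pmatrix} 1 & \rho\\ \rho & 1 \end{pmatrix}\right), \quad \rho \in [-1, 1], \tag{16.211} \]
and where both inputs are subject to an average-power constraint
\[ \mathrm{E}\Bigl[\bigl(X^{(i)}\bigr)^2\Bigr] \le \mathsf{E}^{(i)}, \quad i = 1, 2. \tag{16.212} \]
Note that due to Theorem 16.6 the capacity region of the Gaussian IC does not depend on $\rho$.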
16.6.2 Outer Bound
We start by adapting Sato's outer bound (Theorem 16.8) to the Gaussian IC. To that goal note that
\begin{align}
I(X^{(1)};Y^{(1)}\mid X^{(2)},T) &= h(Y^{(1)}\mid X^{(2)},T) - h(Y^{(1)}\mid X^{(1)},X^{(2)},T) \tag{16.213}\\
&= h(X^{(1)} + Z^{(1)}\mid T) - h(Z^{(1)}) \tag{16.214}\\
&\le \frac{1}{2}\log\bigl(2\pi e\,(\mathsf{E}^{(1)} + 1)\bigr) - \frac{1}{2}\log(2\pi e) \tag{16.215}\\
&= \frac{1}{2}\log\bigl(1 + \mathsf{E}^{(1)}\bigr), \tag{16.216}
\end{align}
where the upper bound can be achieved if the input is chosen to be zero-mean Gaussian with variance $\mathsf{E}^{(1)}$. The second bound is accordingly
\[ I(X^{(2)};Y^{(2)}\mid X^{(1)},T) \le \frac{1}{2}\log\bigl(1 + \mathsf{E}^{(2)}\bigr). \tag{16.217} \]
(16.218)
(16.219)
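where again the inequality can be achieved with equality if the input is chosen to be zero-mean Gaussian with covariance matrix $\operatorname{diag}\bigl(\mathsf{E}^{(1)}, \mathsf{E}^{(2)}\bigr)$. Note that
\[ \det \mathsf{K}_{\mathbf{ZZ}} = \det\begin{pmatrix} 1 & \rho\\ \rho & 1 \end{pmatrix} = 1 - \rho^2 \tag{16.220} \]
and
\begin{align}
\Delta(\rho) &\triangleq \det \mathsf{K}_{\mathbf{YY}} \tag{16.221}\\
&= \det\begin{pmatrix}
\mathsf{E}^{(1)} + a_{12}\mathsf{E}^{(2)} + 1 & \sqrt{a_{21}}\,\mathsf{E}^{(1)} + \sqrt{a_{12}}\,\mathsf{E}^{(2)} + \rho\\
\sqrt{a_{21}}\,\mathsf{E}^{(1)} + \sqrt{a_{12}}\,\mathsf{E}^{(2)} + \rho & a_{21}\mathsf{E}^{(1)} + \mathsf{E}^{(2)} + 1
\end{pmatrix} \tag{16.222}\\
&= \mathsf{E}^{(1)}\bigl(1 + a_{21} - 2\rho\sqrt{a_{21}}\bigr) + \mathsf{E}^{(2)}\bigl(1 + a_{12} - 2\rho\sqrt{a_{12}}\bigr) \nonumber\\
&\quad + \mathsf{E}^{(1)}\mathsf{E}^{(2)}\bigl(1 - \sqrt{a_{12}a_{21}}\bigr)^2 + 1 - \rho^2, \tag{16.223}
\end{align}
i.e.,
\[ I(X^{(1)},X^{(2)};Y^{(1)},Y^{(2)}\mid T) \le \frac{1}{2}\log\frac{\Delta(\rho)}{1 - \rho^2}. \tag{16.224} \]
Note that since the capacity region does not depend on $\rho$, we can minimize (16.224) over $\rho$.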
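The minimization over $\rho$ can be explored numerically. The sketch below is illustrative only and rests on assumptions: it takes the standard-form model with cross gains $\sqrt{a_{12}}$, $\sqrt{a_{21}}$ acting on the inputs, uses a plain grid search over the open interval $(-1, 1)$ instead of a closed-form minimizer, and measures rates in bits. The function names are hypothetical.

```python
import math

def det_kyy(rho, e1, e2, a12, a21):
    """Determinant of the output covariance matrix, computed entry by entry."""
    k11 = e1 + a12 * e2 + 1.0
    k22 = a21 * e1 + e2 + 1.0
    k12 = math.sqrt(a21) * e1 + math.sqrt(a12) * e2 + rho
    return k11 * k22 - k12 ** 2

def delta_closed_form(rho, e1, e2, a12, a21):
    """Expanded form of the same determinant (assumed to match (16.223))."""
    return (e1 * (1 + a21 - 2 * rho * math.sqrt(a21))
            + e2 * (1 + a12 - 2 * rho * math.sqrt(a12))
            + e1 * e2 * (1 - math.sqrt(a12 * a21)) ** 2
            + 1 - rho ** 2)

def sato_sum_rate_bound(e1, e2, a12, a21, grid=2000):
    """Grid-search minimum of 0.5*log2(Delta(rho)/(1-rho^2)) over |rho| < 1."""
    best = math.inf
    for i in range(1, grid):
        rho = -1.0 + 2.0 * i / grid
        value = 0.5 * math.log2(det_kyy(rho, e1, e2, a12, a21) / (1 - rho ** 2))
        best = min(best, value)
    return best

if __name__ == "__main__":
    print(sato_sum_rate_bound(7.0, 7.0, 0.55, 0.45))
```

Comparing `det_kyy` with `delta_closed_form` for a few parameter values is a quick way to confirm the algebra of the expansion.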
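Theorem 16.19 (Sato's Outer Bound [Sat77]). For a Gaussian IC, any achievable rate pair $(R^{(1)},R^{(2)})$ must satisfy
\begin{align}
R^{(1)} &\le \frac{1}{2}\log\bigl(1 + \mathsf{E}^{(1)}\bigr), \tag{16.225}\\
R^{(2)} &\le \frac{1}{2}\log\bigl(1 + \mathsf{E}^{(2)}\bigr), \tag{16.226}\\
R^{(1)} + R^{(2)} &\le \min_{\rho}\ \frac{1}{2}\log\frac{\Delta(\rho)}{1 - \rho^2}. \tag{16.227}
\end{align}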
16.6.3
As we have already discussed in Section 16.5, there are three basic communication strategies. All three strategies are special cases of the Han–Kobayashi achievable region (Theorem 16.17).
Treating Interference as Noise
If the interference is only very weak (i.e., $a_{12}, a_{21} \ll 1$), then it is quite natural to ignore the structure of the interfering signal and treat it as equivalent to noise. Basically, the Gaussian IC is then transformed into two parallel additive noise channels with an achievable rate region
\[ \mathcal{R}_1 = \bigcup_{Q_{X^{(1)}} \cdot Q_{X^{(2)}}} \left\{ \bigl(R^{(1)},R^{(2)}\bigr) \colon \begin{array}{l} 0 \le R^{(1)} \le I(X^{(1)};Y^{(1)})\\ 0 \le R^{(2)} \le I(X^{(2)};Y^{(2)}) \end{array} \right\}. \tag{16.228} \]
will be a very bad bound anyway. Note that this bound can be proven using the Han–Kobayashi achievable region (Theorem 16.17) with $T$, $U^{(1)}$, and $U^{(2)}$ chosen to be deterministic.
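Unfortunately, it is not clear what the optimal input distribution is. Note that choosing a Gaussian input is good for the direct link, but is hurting the other receiver most via the interference (Gaussian noise is the worst noise!). However, if we do choose Gaussian inputs, the region looks as follows:
\[ \mathcal{R}_{1,\mathrm{G}} = \bigcup_{\substack{0 \le \beta^{(1)} \le 1\\ 0 \le \beta^{(2)} \le 1}} \left\{ \bigl(R^{(1)},R^{(2)}\bigr) \colon \begin{array}{l} 0 \le R^{(1)} \le \dfrac{1}{2}\log\left(1 + \dfrac{\beta^{(1)}\mathsf{E}^{(1)}}{a_{12}\beta^{(2)}\mathsf{E}^{(2)} + 1}\right)\\[2ex] 0 \le R^{(2)} \le \dfrac{1}{2}\log\left(1 + \dfrac{\beta^{(2)}\mathsf{E}^{(2)}}{a_{21}\beta^{(1)}\mathsf{E}^{(1)} + 1}\right) \end{array} \right\}. \tag{16.229} \]
Note that $\beta^{(i)}$ allows us to adapt the power of each user, thereby making it possible to reduce the interference at the cost of the direct link.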
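To get a feeling for the trade-off controlled by the power-scaling factors, the two rate expressions can be evaluated directly. This is a minimal sketch under assumptions: Gaussian inputs, rates in bits (base-2 logarithm), power-scaling factors named `beta1`, `beta2` (a hypothetical naming), and a hypothetical function name.

```python
import math

def tin_rates(e1, e2, a12, a21, beta1=1.0, beta2=1.0):
    """Rates (in bits) when each receiver treats the other user's signal as noise.

    e1, e2: power constraints; a12, a21: cross-link power gains;
    beta1, beta2 in [0, 1]: fraction of the available power actually used.
    """
    r1 = 0.5 * math.log2(1.0 + beta1 * e1 / (a12 * beta2 * e2 + 1.0))
    r2 = 0.5 * math.log2(1.0 + beta2 * e2 / (a21 * beta1 * e1 + 1.0))
    return r1, r2

if __name__ == "__main__":
    # Full power versus user 2 backing off to half its power.
    print(tin_rates(7.0, 7.0, 0.55, 0.45))
    print(tin_rates(7.0, 7.0, 0.55, 0.45, beta2=0.5))
```

Sweeping `beta1` and `beta2` over a grid in $[0,1]^2$ and collecting the resulting rate pairs traces out the boundary of the union.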
Orthogonal Coding
In the situation of a moderate interference, it might be a good strategy to try to avoid interference by means of an orthogonal coding scheme. For example, one can use TDMA to make sure that only one user is accessing the channel at one time or FDMA to separate the users by means of using different frequencies. Obviously, in such a scheme, we have two independent Gaussian channels with the optimal input being Gaussian. We use $0 \le \tau \le 1$ as the time-sharing parameter and get the following achievable region:
\[ \mathcal{R}_2 = \bigcup_{0 \le \tau \le 1} \left\{ \bigl(R^{(1)},R^{(2)}\bigr) \colon \begin{array}{l} 0 \le R^{(1)} \le \dfrac{\tau}{2}\log\left(1 + \dfrac{\mathsf{E}^{(1)}}{\tau}\right)\\[2ex] 0 \le R^{(2)} \le \dfrac{1-\tau}{2}\log\left(1 + \dfrac{\mathsf{E}^{(2)}}{1-\tau}\right) \end{array} \right\}. \tag{16.230} \]
We remind the reader that by dividing the available power $\mathsf{E}^{(1)}$ by $\tau$, we make sure that we meet the power constraint exactly (the first user only transmits a fraction $\tau$ of the time, hence it can use more power during this transmission and still achieve the average-power constraint). The same comment applies to $\mathsf{E}^{(2)}$ that is divided by $1 - \tau$.
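The power boost during the active fraction of time is easy to see numerically. The following is a small sketch, assuming rates in bits and a time-sharing parameter called `tau` (a hypothetical name); the limiting cases `tau = 0` and `tau = 1` are handled explicitly to avoid a division by zero.

```python
import math

def tdma_rates(e1, e2, tau):
    """Rates (in bits) for orthogonal time sharing with parameter tau in [0, 1].

    User 1 is active a fraction tau of the time with power e1 / tau,
    user 2 the remaining fraction with power e2 / (1 - tau).
    """
    r1 = 0.0 if tau == 0.0 else 0.5 * tau * math.log2(1.0 + e1 / tau)
    r2 = 0.0 if tau == 1.0 else 0.5 * (1.0 - tau) * math.log2(1.0 + e2 / (1.0 - tau))
    return r1, r2

if __name__ == "__main__":
    for k in range(0, 11):
        tau = k / 10
        print(tau, tdma_rates(7.0, 7.0, tau))
```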
Decoding and Canceling Interference
Finally, if the interference is much stronger than the direct link, then it is quite obvious that one should decode the unwanted message first and use this knowledge to eliminate the interference from the received signal before one decodes the wanted message. As a matter of fact, as we have seen before, for strong and very strong interference, this strategy is optimal.
Once the interference has successfully been removed from the channel, we end up with two Gaussian MACs yielding an achievable rate region
\[ \mathcal{R}_3 = \left\{ \bigl(R^{(1)},R^{(2)}\bigr) \colon \begin{array}{l} 0 \le R^{(1)} \le \dfrac{1}{2}\log\bigl(1 + \mathsf{E}^{(1)}\bigr),\\[1.5ex] 0 \le R^{(2)} \le \dfrac{1}{2}\log\bigl(1 + \mathsf{E}^{(2)}\bigr),\\[1.5ex] R^{(1)} + R^{(2)} \le \min\left\{ \dfrac{1}{2}\log\bigl(1 + \mathsf{E}^{(1)} + a_{12}\mathsf{E}^{(2)}\bigr),\ \dfrac{1}{2}\log\bigl(1 + a_{21}\mathsf{E}^{(1)} + \mathsf{E}^{(2)}\bigr) \right\} \end{array} \right\}. \tag{16.231} \]
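The pentagon described by these three constraints can be summarized by its corner values. The sketch below is illustrative and assumes the sum-rate constraint is the minimum of the two MAC sum-rate expressions at the two receivers, with rates in bits; the function names are hypothetical.

```python
import math

def cancel_region_bounds(e1, e2, a12, a21):
    """Return (max R1, max R2, max R1+R2) in bits when both receivers decode both messages."""
    c1 = 0.5 * math.log2(1.0 + e1)
    c2 = 0.5 * math.log2(1.0 + e2)
    s = min(0.5 * math.log2(1.0 + e1 + a12 * e2),
            0.5 * math.log2(1.0 + a21 * e1 + e2))
    return c1, c2, s

def max_sum_rate(e1, e2, a12, a21):
    """Largest achievable R1 + R2 in the region: the tighter of the two limits."""
    c1, c2, s = cancel_region_bounds(e1, e2, a12, a21)
    return min(c1 + c2, s)

if __name__ == "__main__":
    print(cancel_region_bounds(7.0, 7.0, 1.15, 1.15))
    print(max_sum_rate(7.0, 7.0, 1.15, 1.15))
```

For cross gains at or above $\mathsf{E}^{(i)} + 1$ the sum-rate constraint becomes inactive and the region turns into a rectangle.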
Comparison
In Figure 16.6, all three communication strategies and Sato's outer bound are depicted for a Gaussian IC with $\mathsf{E}^{(1)} = \mathsf{E}^{(2)} = 7$ and different values of $a_{12}$ and $a_{21}$. It can be clearly seen that, depending on the strength of the interference, different schemes are best.
16.6.4
The concepts of strong and very strong interference also work in the case of the Gaussian IC. Indeed, they are very intuitive in this context: For example, consider the case where $a_{12} \ge 1$ and assume we have a reliable coding scheme for some given $(R^{(1)}, R^{(2)})$. So, receiver 1 can reconstruct $\mathbf{X}^{(1)}$ reliably and can therefore compute
\begin{align}
\mathbf{Y}'^{(2)} &\triangleq \frac{\mathbf{Y}^{(1)} - \mathbf{X}^{(1)}}{\sqrt{a_{12}}} + \sqrt{a_{21}}\,\mathbf{X}^{(1)} \tag{16.232}\\
&= \sqrt{a_{21}}\,\mathbf{X}^{(1)} + \mathbf{X}^{(2)} + \frac{\mathbf{Z}^{(1)}}{\sqrt{a_{12}}}. \tag{16.233}
\end{align}
This is very similar to the received word of receiver 2: Only the noise is different and actually has a smaller variance. Hence, since receiver 2 is able to reliably reconstruct $\mathbf{X}^{(2)}$ from $\mathbf{Y}^{(2)}$, receiver 1 is also able to reconstruct $\mathbf{X}^{(2)}$ from $\mathbf{Y}'^{(2)}$.
So, we understand that if both $a_{12} \ge 1$ and $a_{21} \ge 1$, in a reliable system both receivers can always reconstruct both messages.
After this short discussion and this insight, we now quickly summarize the results for strong and very strong interference of the Gaussian IC.
Definition 16.20 ([Car75], [Sat81]). A Gaussian IC is said to have strong interference if
\[ a_{12} \ge 1, \quad a_{21} \ge 1. \tag{16.234} \]
[Figure 16.6: several panels plotting $R^{(2)}$ versus $R^{(1)}$ (both axes from 0 to 1.6), each showing the curves a), b), c); surviving panel titles: $a_{12} = 0.55$, $a_{21} = 0.45$ and $a_{12} = 1.15$, $a_{21} = 1.15$.]
Figure 16.6: Achievable regions and outer bound for the Gaussian IC with $\mathsf{E}^{(1)} = \mathsf{E}^{(2)} = 7$ for different values of the cross talk coefficients. The black dotted curve denotes Sato's outer bound (Theorem 16.19); in the red curve a) interference is treated as noise (where time-sharing is neglected on purpose, and we assume Gaussian inputs), see (16.229); blue curve b) is orthogonal coding, see (16.230); and for the black curve c) the interference is canceled, see (16.231). The units are bits.
(16.235)
(16.236)
(16.237)
(16.238)
(16.239)
(16.240)
(16.241)
(16.242)
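i.e.,
\[ \frac{1}{2}\log\bigl(1 + \mathsf{E}^{(1)}\bigr) + \frac{1}{2}\log\bigl(1 + \mathsf{E}^{(2)}\bigr) \le \frac{1}{2}\log\bigl(1 + a_{21}\mathsf{E}^{(1)} + \mathsf{E}^{(2)}\bigr). \tag{16.243} \]
\[ a_{21} \ge \mathsf{E}^{(2)} + 1. \tag{16.244} \]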
(16.246)
(16.247)
(16.248)
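The capacity regions for the Gaussian IC with strong or very strong interference, respectively, then read as follows.
Theorem 16.22 ([Ahl74], [Sat81]). The capacity region of a Gaussian IC with strong interference is
\[ \mathcal{C}_{\text{IC, strong}} = \left\{ \bigl(R^{(1)},R^{(2)}\bigr) \colon \begin{array}{l} 0 \le R^{(1)} \le \dfrac{1}{2}\log\bigl(1 + \mathsf{E}^{(1)}\bigr),\\[1.5ex] 0 \le R^{(2)} \le \dfrac{1}{2}\log\bigl(1 + \mathsf{E}^{(2)}\bigr),\\[1.5ex] R^{(1)} + R^{(2)} \le \min\left\{ \dfrac{1}{2}\log\bigl(1 + \mathsf{E}^{(1)} + a_{12}\mathsf{E}^{(2)}\bigr),\ \dfrac{1}{2}\log\bigl(1 + a_{21}\mathsf{E}^{(1)} + \mathsf{E}^{(2)}\bigr) \right\} \end{array} \right\}. \tag{16.249} \]
For very strong interference, the capacity region is
\[ \mathcal{C}_{\text{IC, very strong}} = \left\{ \bigl(R^{(1)},R^{(2)}\bigr) \colon \begin{array}{l} 0 \le R^{(1)} \le \dfrac{1}{2}\log\bigl(1 + \mathsf{E}^{(1)}\bigr),\\[1.5ex] 0 \le R^{(2)} \le \dfrac{1}{2}\log\bigl(1 + \mathsf{E}^{(2)}\bigr) \end{array} \right\}. \tag{16.250} \]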
16.6.5
We can also apply the Han–Kobayashi region of Theorem 16.17 to the Gaussian IC. The time-sharing random variable $T$ can be used as representing different transmission modes: For example, $t = 1$ might represent the mode when interference will be treated as noise, $t = 2$ the mode when only user 1 transmits, $t = 3$ is the mode when only user 2 transmits, $t = 4$ represents the mode when both users decode both messages, etc.
In general, $T \in \mathcal{T}$ with $\mathcal{T}$ being a finite alphabet. Depending on the realization of $T = t$, we then choose the other random variables $U^{(i)}(t)$ and $X^{(i)}(t)$. Actually, to allow the recovery of each of the three basic communication strategies of Section 16.6.3, we recall from Section 13.10 about the Gaussian BC that in order to implement superposition coding, we may choose $X^{(i)} = U^{(i)} + X'^{(i)}$ where $U^{(i)}$ and $X'^{(i)}$ are independent zero-mean Gaussian random variables with variance $\alpha^{(i)}\mathsf{E}^{(i)}$ and $\bigl(1 - \alpha^{(i)}\bigr)\mathsf{E}^{(i)}$, respectively. This choice then only requires choosing $\alpha^{(i)}(t)$ and $\mathsf{E}^{(i)}(t)$ as a function of the mode $T = t$. It also ensures that all seven conditions of Theorem 16.17 are restricted to Gaussian distributions:
\begin{align}
I(X^{(1)};Y^{(1)}\mid U^{(2)},T=t) &= \frac{1}{2}\log\left(1 + \frac{\mathsf{E}^{(1)}(t)}{a_{12}\bigl(1-\alpha^{(2)}(t)\bigr)\mathsf{E}^{(2)}(t) + 1}\right), \tag{16.251}\\
I(X^{(2)};Y^{(2)}\mid U^{(1)},T=t) &= \frac{1}{2}\log\left(1 + \frac{\mathsf{E}^{(2)}(t)}{a_{21}\bigl(1-\alpha^{(1)}(t)\bigr)\mathsf{E}^{(1)}(t) + 1}\right), \tag{16.252}\\
I(X^{(1)},U^{(2)};Y^{(1)}\mid T=t) &= \frac{1}{2}\log\left(1 + \frac{\mathsf{E}^{(1)}(t) + a_{12}\alpha^{(2)}(t)\mathsf{E}^{(2)}(t)}{a_{12}\bigl(1-\alpha^{(2)}(t)\bigr)\mathsf{E}^{(2)}(t) + 1}\right), \tag{16.253}\\
I(X^{(2)};Y^{(2)}\mid U^{(1)},U^{(2)},T=t) &= \frac{1}{2}\log\left(1 + \frac{\bigl(1-\alpha^{(2)}(t)\bigr)\mathsf{E}^{(2)}(t)}{a_{21}\bigl(1-\alpha^{(1)}(t)\bigr)\mathsf{E}^{(1)}(t) + 1}\right), \tag{16.254}\\
I(X^{(2)},U^{(1)};Y^{(2)}\mid T=t) &= \frac{1}{2}\log\left(1 + \frac{\mathsf{E}^{(2)}(t) + a_{21}\alpha^{(1)}(t)\mathsf{E}^{(1)}(t)}{a_{21}\bigl(1-\alpha^{(1)}(t)\bigr)\mathsf{E}^{(1)}(t) + 1}\right), \tag{16.255}\\
I(X^{(1)};Y^{(1)}\mid U^{(1)},U^{(2)},T=t) &= \frac{1}{2}\log\left(1 + \frac{\bigl(1-\alpha^{(1)}(t)\bigr)\mathsf{E}^{(1)}(t)}{a_{12}\bigl(1-\alpha^{(2)}(t)\bigr)\mathsf{E}^{(2)}(t) + 1}\right), \tag{16.256}\\
I(X^{(1)},U^{(2)};Y^{(1)}\mid U^{(1)},T=t) &= \frac{1}{2}\log\left(1 + \frac{\bigl(1-\alpha^{(1)}(t)\bigr)\mathsf{E}^{(1)}(t) + a_{12}\alpha^{(2)}(t)\mathsf{E}^{(2)}(t)}{a_{12}\bigl(1-\alpha^{(2)}(t)\bigr)\mathsf{E}^{(2)}(t) + 1}\right), \tag{16.257}\\
I(X^{(2)},U^{(1)};Y^{(2)}\mid U^{(2)},T=t) &= \frac{1}{2}\log\left(1 + \frac{\bigl(1-\alpha^{(2)}(t)\bigr)\mathsf{E}^{(2)}(t) + a_{21}\alpha^{(1)}(t)\mathsf{E}^{(1)}(t)}{a_{21}\bigl(1-\alpha^{(1)}(t)\bigr)\mathsf{E}^{(1)}(t) + 1}\right). \tag{16.258}
\end{align}
Now we can recover the region $\mathcal{R}_{1,\mathrm{G}}$ in (16.229) with $\beta^{(1)} = \beta^{(2)} = 1$ by choosing $T$ to be a constant and by setting $\alpha^{(1)} = \alpha^{(2)} = 0$ (thereby making $U^{(1)} = U^{(2)} = 0$ with probability 1). For $\mathcal{R}_2$ in (16.230), we set $T$ to be binary with $\mathcal{T} = \{1, 2\}$ and with probability $P_T(1) = 1 - P_T(2) = \tau$, and we choose $\alpha^{(1)}(1) = 1$, $\alpha^{(2)}(1) = 0$, $\mathsf{E}^{(1)}(1) = \mathsf{E}^{(1)}/\tau$, $\mathsf{E}^{(2)}(1) = 0$ for mode $T = 1$, and $\alpha^{(1)}(2) = 0$, $\alpha^{(2)}(2) = 1$, $\mathsf{E}^{(1)}(2) = 0$, $\mathsf{E}^{(2)}(2) = \mathsf{E}^{(2)}/(1-\tau)$ for mode $T = 2$.
Finally, the reader can check that $\mathcal{R}_3$ in (16.231) can be recovered by $T$ being constant, and $\alpha^{(1)} = \alpha^{(2)} = 1$.
In general, the Han–Kobayashi region (for fixed $T$) is a heptagon, see the example in Figure 16.7.
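For a fixed mode $t$, the eight Gaussian mutual information terms and the resulting seven Han–Kobayashi constraints can be evaluated directly. The following is a sketch under assumptions: a single mode (constant $T$), rates in bits, the power-split fractions named `alpha1`, `alpha2`, and the labels `I1` to `I8` following the ordering used in the Fourier–Motzkin step; the function names are hypothetical.

```python
import math

def c(snr):
    """Gaussian capacity function in bits."""
    return 0.5 * math.log2(1.0 + snr)

def hk_gaussian_terms(e1, e2, a12, a21, alpha1, alpha2):
    """Eight mutual information terms for X = U + X' with public power fraction alpha."""
    n1 = a12 * (1 - alpha2) * e2 + 1.0  # noise plus undecoded private interference at receiver 1
    n2 = a21 * (1 - alpha1) * e1 + 1.0  # the same at receiver 2
    return {
        "I1": c((1 - alpha1) * e1 / n1),
        "I2": c((1 - alpha2) * e2 / n2),
        "I3": c(e1 / n1),
        "I4": c(e2 / n2),
        "I5": c(((1 - alpha1) * e1 + a12 * alpha2 * e2) / n1),
        "I6": c(((1 - alpha2) * e2 + a21 * alpha1 * e1) / n2),
        "I7": c((e1 + a12 * alpha2 * e2) / n1),
        "I8": c((e2 + a21 * alpha1 * e1) / n2),
    }

def hk_bounds(t):
    """The seven constraints of the Han-Kobayashi region (three sum-rate bounds merged by min)."""
    return {
        "R1": t["I3"],
        "R2": t["I4"],
        "R1+R2": min(t["I7"] + t["I2"], t["I8"] + t["I1"], t["I5"] + t["I6"]),
        "2R1+R2": t["I7"] + t["I1"] + t["I6"],
        "R1+2R2": t["I8"] + t["I2"] + t["I5"],
    }

if __name__ == "__main__":
    terms = hk_gaussian_terms(1000.0, 1000.0, 0.1, 0.1, 0.9, 0.9)
    for name, value in hk_bounds(terms).items():
        print(name, "<=", value)
```

The parameters in the usage lines are the ones quoted in the caption of Figure 16.7; the printed values are the right-hand sides that define the heptagon for that choice.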
16.6.6
[Figure 16.7: plot of $R^{(2)}$ versus $R^{(1)}$, both axes up to 3.5 bits.]
Figure 16.7: This depicts the heptagon describing the Han–Kobayashi region for the Gaussian IC with the choice $\alpha^{(1)} = \alpha^{(2)} = 0.9$. The Gaussian IC has parameter $a_{12} = a_{21} = 0.1$ and the power constraints are $\mathsf{E}^{(1)} = \mathsf{E}^{(2)} = 1000$. The units are bits.
power $\mathsf{E}^{(1)} = \mathsf{E}^{(2)} \triangleq \mathsf{E}$ and then to compare the maximum sum rate at high SNR with the situation of no interference. If there were no interference ($a = 0$), then each transmitter could transmit at a rate of
\[ \frac{1}{2}\log(1 + \mathsf{E}) \approx \frac{1}{2}\log \mathsf{E} \tag{16.259} \]
(for $\mathsf{E}$ large), i.e., in total one can transmit at a sum rate of
\[ \left(\frac{1}{2} + \frac{1}{2}\right)\log \mathsf{E} = 1 \cdot \log \mathsf{E}. \tag{16.260} \]
Now we are interested in the behavior of the factor in front of the logarithm when $a$ is increased. Concretely, we investigate the symmetric degrees of freedom:
dsym (a) , lim
max
E (R(1) ,R(2) )C
(16.261)
Even though the capacity region of the Gaussian IC is not known for all values of $a$, the symmetric degrees of freedom have been derived exactly and are shown in Figure 16.8.
Note how the Gaussian IC can be split into four different regions:
• For $a \in \bigl[0, \frac{1}{2}\bigr]$ we have weak interference and it is optimal with respect to the degrees of freedom to treat the interference as noise.
[Figure 16.8: plot of $d_{\mathrm{sym}}$ with the four regions labeled weak, medium, strong, and very strong.]
Bibliography
[Abb08] Emmanuel A. Abbe, "Local to global geometric methods in information theory," Ph.D. dissertation, Massachusetts Institute of Technology (MIT), June 2008.
[AGA12]
[Ahl71] Rudolf Ahlswede, "Multi-way communication channels," in Proceedings 2nd IEEE International Symposium on Information Theory (ISIT). Tsahkadsor, Armenia, USSR: Publishing House of the Hungarian Academy of Sciences (published 1973), September 2–8, 1971, pp. 23–51.
[Ahl74]
[Ber71]
[Ber73]
[BZ83] Toby Berger and Zhen Zhang, "Minimum breakdown degradation in binary source encoding," IEEE Transactions on Information Theory, vol. 29, no. 6, pp. 807–814, November 1983.
[Car75]
[CEG87]
[CEGS80]
[CK78] Imre Csiszár and János Körner, "Broadcast channels with confidential messages," IEEE Transactions on Information Theory, vol. 24, no. 3, pp. 339–348, May 1978.
[CK81] Imre Csiszár and János Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Budapest, Hungary: Academic Press, 1981.
[CK11] Imre Csiszár and János Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge, UK: Cambridge University Press, 2011.
[CMGEG08] Hon-Fah Chong, Mehul Motani, Hari Krishna Garg, and Hesham El Gamal, "On the Han–Kobayashi region for the interference channel," IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 3188–3195, July 2008.
[Cos83]
[Cov72]
[Cov75]
[CS04] Imre Csiszár and Paul C. Shields, "Information theory and statistics: A tutorial," Foundations and Trends in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004.
[Csi84] Imre Csiszár, "Sanov property, generalized I-projection and a conditional limit theorem," The Annals of Probability, vol. 12, no. 3, pp. 768–793, August 1984.
[Csi98] Imre Csiszár, "The method of types," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2505–2523, October 1998.
[CT06]
[EG79] Abbas A. El Gamal, "The capacity of a class of broadcast channels," IEEE Transactions on Information Theory, vol. 25, no. 2, pp. 166–169, March 1979.
[EG81]
[EGC82]
[Egg58] H. G. Eggleston, Convexity. Cambridge, UK: Cambridge University Press, 1958.
[EGK10]
[EGK11]
[ETW08]
[Fan61]
[Gal68]
Robert G. Gallager, Information Theory and Reliable Communication. New York, NY, USA: John Wiley & Sons, 1968.
[GP80]
[HK81]
[Hoe56] Wassily Hoeffding, "Asymptotically optimal tests for multinomial distributions," The Annals of Mathematical Statistics, vol. 36, pp. 1916–1921, 1956.
[KM77a] János Körner and Katalin Marton, "Comparison of two noisy channels," in Topics in Information Theory (1975), Imre Csiszár and Peter Elias, Eds. North-Holland, 1977, pp. 411–423, colloquia Math. Soc. János Bolyai.
[KM77b] János Körner and Katalin Marton, "General broadcast channels with degraded message sets," IEEE Transactions on Information Theory, vol. 23, no. 1, pp. 60–64, January 1977.
[Kol56]
[Kra07]
[Lia72]
Henry HerngJiunn Liao, Multiple access channels, Ph.D. dissertation, University of Hawaii, Honolulu, USA, September 1972.
[Llo82]
[LPW08]
[Mar79]
[Mos14]
[Nai10] Chandra Nair, "A note on outer bounds for broadcast channel," in Proceedings International Zurich Seminar on Broadband Communications (IZS), Zurich, Switzerland, March 3–5, 2010.
[NEG07]
[NW08]
[Oza80]
[Pin60] Mark S. Pinsker, Information and Information Stability of Random Variables and Processes, ser. Problemy Peredaci Informacii. Moscow: Akademii Nauk SSSR, 1960, vol. 7, English translation: Holden-Day, San Francisco, 1964.
[Rio07] Olivier Rioul, "A simple proof of the entropy-power inequality via properties of mutual information," in Proceedings IEEE International Symposium on Information Theory (ISIT), Nice, France, June 24–30, 2007, pp. 46–50.
[San57]
[Sat77]
Hiroshi Sato, Two-user communication channels, IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 295-304, May
1977.
[Sat78]
Hiroshi Sato, An outer bound to the capacity region of broadcast channels, IEEE Transactions on Information Theory,
vol. 24, no. 3, pp. 374-377, May 1978.
[Sat81]
[Sha48]
Claude E. Shannon, A mathematical theory of communication, Bell System Technical Journal, vol. 27, pp. 379-423 and
623-656, July and October 1948.
[Sha59]
[Sha61]
[Sta59]
A. J. Stam, Some inequalities satisfied by the quantities of information of Fisher and Shannon, Information and Control,
vol. 2, pp. 101-112, June 1959.
[SW73a]
David S. Slepian and Jack K. Wolf, A coding theorem for multiple access channels with correlated sources, Bell System Technical Journal, vol. 52, no. 7, pp. 1037-1076, September 1973.
[SW73b]
David S. Slepian and Jack K. Wolf, Noiseless coding of correlated information sources, IEEE Transactions on Information
Theory, vol. 19, no. 4, pp. 471-480, July 1973.
[VAR11]
[VG06]
Sergio Verdú and Dongning Guo, A simple proof of the entropy-power inequality, IEEE Transactions on Information Theory,
vol. 52, no. 5, pp. 2165-2166, May 2006.
[VKG03]
[Wit80]
[WW81]
[WWZ80]
Jack K. Wolf, Aaron D. Wyner, and Jacob Ziv, Source coding for multiple descriptions, Bell System Technical Journal,
vol. 59, no. 8, pp. 1417-1426, October 1980.
[WZ76]
[Yeu08]
[ZB87]
Zhen Zhang and Toby Berger, New results in binary multiple descriptions, IEEE Transactions on Information Theory,
vol. 33, no. 4, pp. 502-521, July 1987.
List of Figures
1.1 . . . 8
1.2 . . . 14
3.1 Sanov's Theorem . . . 37
3.2 Projection of a point q onto a plane . . . 44
3.3 A point below the projection plane . . . 44
3.4 A convex set with a tangential plane . . . 45
3.5 A one-sided set with respect to Q . . . 47
3.6 Uniqueness of Q . . . 47
3.7 Representation of two PMFs Q1 and Q2 . . . 49
3.8 Illustration of the set A defined in (3.135) . . . 54
3.9 An example of a locally one-sided set F . . . 58
3.10 An example of a set F that is not locally one-sided . . . 59
4.1 . . . 79
5.1 Quantization of a square . . . 86
5.2 Reconstruction areas and points of X ~ N(0, 1) . . . 87
5.4 Reconstruction areas and points of X ~ N(0, 1) . . . 91
5.5 Test channel . . . 96
5.6 Proof failure for discontinuous rate distortion function . . . 101
5.7 Distortion-rate plane . . . 111
5.8 Tangent through R0 (q , ) . . . 111
5.9 A contradiction . . . 112
5.10 A second contradiction . . . 112
5.11 A convex function is continuous . . . 115
5.12 A convex function with a slope discontinuity . . . 118
5.13 A typical rate distortion function . . . 119
5.14 Joint source channel coding . . . 119
5.15 Lossy compression added to joint source channel coding . . . 122
5.16 Rate distortion combined with channel transmission . . . 122
5.17 The n-dimensional space of all possible received vectors . . . 127
5.18 Reverse waterfilling solution . . . 131
6.1 Mapping of source sequences x to codewords x . . . 138
6.2 Graphical explanation of Theorem 6.5 . . . 143
7.1 . . . 156
8.1 . . . 185
8.2 . . . 187
8.3 . . . 197
9.1 . . . 207
9.2 . . . 209
10.1 . . . 219
10.2 . . . 222
10.3 . . . 222
10.4 . . . 223
10.5 . . . 224
10.6 . . . 226
10.7 . . . 226
10.8 . . . 234
10.9 . . . 234
10.10 . . . 236
10.11 . . . 236
10.12 . . . 237
10.13 . . . 238
10.14 . . . 240
10.15 . . . 240
10.16 . . . 241
10.17 . . . 243
10.18 . . . 246
10.19 . . . 247
10.20 . . . 248
11.1 . . . 251
11.2 . . . 252
11.3 . . . 259
12.1 . . . 261
12.2 . . . 273
12.3 . . . 273
12.4 . . . 273
12.5 . . . 275
13.1 . . . 285
13.2 . . . 290
13.3 . . . 290
13.4 . . . 294
13.5 . . . 307
13.6 . . . 313
13.7 . . . 313
13.8 . . . 322
14.1 . . . 327
14.2 . . . 332
15.1 . . . 341
15.2 . . . 346
15.3 . . . 346
15.4 . . . 347
15.5 . . . 348
16.1 . . . 349
16.2 . . . 351
16.3 . . . 351
16.4 . . . 351
16.5 . . . 353
16.6 . . . 378
16.7 . . . 382
16.8 . . . 383
List of Tables
5.3 . . . 91
7.2 . . . 181
7.3 . . . 181
9.3 . . . 210
Index
Italic entries are to names.
Symbols
X , 2
D, 2
i , 64
t (), 65, 70
i , 64
m (), 64
H, 1
h, 2
I(; ), 3
I(x), 17, 27
I {}, 162
inf, 16, 36
L1 distance, see variational
distance
log, 20
N(x), 17, 27
P, 18
Pn , 18, 29
sup, 16
supp, 1
A
Abbe, Emmanuel A., 42, 48
AEP, 12
Ahlswede, Rudolf, 249, 360, 380
Akyol, Emrah, 183
Aminzadeh Gohari, Amin, 319
Anantharam, Venkat, 319
asymptotic equipartition property,
12
auxiliary random variable, 176
bound on alphabet size, 177,
discrete memoryless, 286
Gaussian, 291, 320
Gelfand-Pinsker coding, 312
physically degraded, 289
rate triple, 286
stochastically degraded, 290,
291
time-sharing, 287
with less noisy output, 292
with more capable output, 292
BSC, see binary symmetric
channel
BSS, see binary symmetric source
C
capacity cost function, 94
capacity region, 220, 221
BC, 286, 288
degraded, 307
degraded message set, 308
Gaussian, 321
less noisy, 307
more capable, 305
Cut-Set Bound, 344
dirty paper, 281
DMN, 340
Gaussian MAC, 245
Gelfand-Pinsker, 263, 277
IC, 350, 352
MAC, 225, 229, 241
common message, 335
Carathéodory's Theorem, 13
Fenchel-Eggleston
strengthening, 15
Carathéodory, Constantin, 13
Carleial, Aydano B., 359, 377
causality, 340
CDMA, 246
chain rule
for entropy, 2
for mutual information, 3
for type, 28
for typical set, 165
Chebyshev Inequality, 12
Chong, Hon-Fah, 371
code-division multiple-access, 246
coding scheme
broadcast channel, 286
DMC with interference, 262
interference channel, 350
multiple description, 157
multiple-access channel, 220
rate distortion, 93
Slepian-Wolf, 208
Wyner-Ziv, 186
coding theorem
auxiliary random variable, 176
broadcast channel, 312, 318
dirty paper, 281
for broadcast channel, 294,
299, 301
degraded, 307
degraded message set, 308
less noisy, 307
more capable, 305
for correlated sources over
MAC, 254, 260
for multiple description
problem, 170, 178
for multiple-access channel,
225, 229
with common message, 335
for rate distortion problem, 98
with side-information, 204
for Slepian-Wolf problem, 209
for Wyner-Ziv problem, 204
Gelfand-Pinsker, 277
common message, 327
common part, 259
Conditional Limit Theorem, 53, 59
Conditional Type Theorem, see
Type Theorem
convergence, 13
almost-sure, 12
in probability, 12
with probability 1, 12
convex combination, 13
of pentagons, 235
convexity, 42
maximization, 271
one-sided, 46
Costa, Max, 279, 281, 361
Costa, Max H. M., 361
Cover, Thomas M., x, 9, 38, 57,
170, 182, 217, 260, 285,
319, 339, 340, 344
Csiszár, Imre, x, xi, 9, 17, 50, 53,
133, 304
Csiszár-Körner identity, 304
CTT, see Type Theorem
cut, 341
Cut-Set Bound, 319, 340, 344
BC, 345
MAC, 347
relay channel, 347, 348
D
data compression
distributed, 207
distributed lossless, see
Slepian-Wolf problem
lossless, 100, 210
lossy, see rate distortion
problem
universal, 143
zero-error, 217
Data Processing Inequality, 5
data transmission
with side-information, see
Gelfand-Pinsker problem
differential entropy, 2
Dirichlet partition, 88
dirty paper coding, 262
capacity, 281
discrete memoryless broadcast
channel, see broadcast
channel
discrete memoryless channel, 64,
339
with interference, 262
discrete memoryless interference
channel, see interference
channel
discrete memoryless network, 339
cut, 341
FDMA, 247
Fenchel, Werner, 15
Fenchel-Eggleston strengthening
of Carathéodory's
Theorem, 15
Fourier-Motzkin elimination, 9,
10, 315
frequency-division multiple-access,
247
G
Gallager, Robert G., 86
Garg, Hari Krishna, 371
Gelfand, Israel M., 278
Gelfand, Sergei I., 277, 278, 283
Gelfand-Pinsker
capacity, 263
Gelfand-Pinsker problem, 261
application to broadcast
channel, 312
capacity, 271
coding scheme, 262
coding theorem, 277
achievability, 263
converse, 276
Gaussian, 281
convexity, 270
rate, 263, 269
Goyal, Vivek K., xi, 178, 183
Guo, Dongning, 9
H
Han, Te Sun, 371
Han-Kobayashi region, 365, 371,
375, 380
Hoeffding, Wassily, 17
I
IC, see interference channel
IID, see independent and
identically distributed
under random variable
indicator function, 162
indicator random variable, 51
Information Theory Inequality, see
IT Inequality
information transmission system,
121, 251
coding theorem, 254, 260
achievability, 255
interference channel, 349
capacity region, 350
dependence on marginals,
352
coding scheme, 350
discrete memoryless, 349
Gaussian, 374
degrees of freedom, 381
Han-Kobayashi region, 375,
380
inner bound, 375
outer bound, 375
strong interference, 377, 380
very strong interference,
377, 380
Han-Kobayashi region, 365,
371, 375, 380
inner bound, 355
inner region, 365
outer bound, 353, 354
rate pair, 350
Sato's outer bound, 354
strong interference, 359
Gaussian, 377, 380
symmetric, 358
very strong interference, 359
Gaussian, 377, 380
IT Inequality, 3
Exponentiated, 4
J
joint source channel coding
scheme, 119
K
Körner, János, x, 9, 17, 133, 292,
304, 308, 312, 320
Karush-Kuhn-Tucker conditions,
106
Kim, Young-Han, xi
KKT conditions, 106
Kobayashi, Kingo, 371
Kolmogorov, Andrei N., 85, 278
Kramer, Gerhard, x, xi, 178, 183
L
Lagrangian, 40, 106
law of large numbers
strong, 12
weak, 11
Levin, David A., 50
LHS, 118
Liao, Henry Herng-Jiunn, 249
Lloyd, Stuart P., 88
Log-Sum Inequality, 4
lossless data compression
coding theorem
achievability, 210
M
MAC, see multiple-access channel
Markov chain, 5, 79
Markov Inequality, 151
Markov Lemma, 193, 267
Marton, Katalin, 17, 292, 308,
312, 318, 320
Massey, James L., 3
minimum mean squared error, 199
MMSE, see minimum mean
squared error
Moser, Stefan M., x, 1, 3, 5, 6, 12,
17, 23, 48, 63, 100, 105,
106, 119, 121-123, 125,
129, 130, 199, 251
Motani, Mehul, 371
multiple description problem, 155
coding scheme, 157
coding theorem, 170, 178
achievability, 159, 170
convexity, 177
multiple description rate
distortion quintuple, 157
multiple description rate
distortion region, 158
Pinsker, Mark S., 50, 277, 278, 283
PMF, see probability mass
function
polytope, 10
probability density function, 2
probability distribution
empirical, 18
linear family, 43
PDF, 2
PMF, 1
probability mass function, 1
Pythagorean Theorem, 42
Q
quantization, 86
Lloyd's algorithm, 88
R
Rényi entropy, 9
random coding, 101, 145, 159, 170,
187, 210, 212, 225, 255,
263, 294, 309, 313, 328,
365
binning, 187, 210, 212, 263,
309
binning & superposition
coding, 313
coloring, 216
rate splitting, 365
superposition coding, 294,
328, 365
random variable
independent and identically
distributed, 11
indicator, 7
rate distortion function, 94, 134
continuity, 115
convexity, 97
KKT conditions, 109
lower bound, 114
properties, 97, 106, 110
Wyner-Ziv, see Wyner-Ziv
problem
rate distortion problem, 85
coding scheme, 93
coding theorem, 98, 105, 133
achievability, 101
converse, 99
strong converse, 135
coding theorem for Gaussian,
123
distortion rate function, 94
error exponent, 141
error probability, 135
Gaussian source, 123
information rate distortion
function, 94
multiple description problem,
see multiple description
problem
rate, 93
rate distortion function, 94
rate distortion pair, 94
rate distortion region, 94
test channel, 96, 124
with side-information, see
Wyner-Ziv problem
rate distortion region, 94, 98
multiple description, 158, 170,
178
Wyner-Ziv, 186, 204
rate region, 9, 220
MAC, 225, 229
Slepian-Wolf, 209
rate splitting, 365
relative entropy, 2
Data Processing Inequality, 5
Pythagorean Theorem, 42
relay channel, 347, 348
Cut-Set Bound, 347, 348
reverse waterfilling, 130
RHS, 23
Rioul, Olivier, 9
Rose, Kenneth, 183
RV, see random variable
S
Salehi, Masoud, 260
Sanov's Theorem, 36
Sanov, I. N., 17
Sato's outer bound, 354
Sato, Hiroshi, 320, 354, 359, 360,
375, 377, 380
Shannon, Claude E., 85, 249
Shields, Paul C., xi
side-information
data compression, 185
data transmission, 261, 281
Slepian, David S., 209, 249, 335
Slepian-Wolf problem, 208
achievable rate region, 208
coding theorem, 209
achievability, 212
converse, 215
distributed coding scheme, 208
over MAC, 251
rate pair, 208
sphere covering, 126
sphere packing, 126
Stam, A. J., 9
successive cancellation, 224, 235,
245, 361
successive refinement, 183
superposition coding, 294, 313, 365
T
TA, see Theorem A under typical
set
TB, see Theorem B under typical
set
TC, see Theorem C under typical
set
TDMA, 247
Thomas, Joy A., x, 9, 38, 57, 339,
340, 344
time-division multiple-access, 247
time-sharing, 198, 200, 208, 221,
240, 247, 255, 287, 289
coded, 240
total expectation, 7
total variation distance, 50
Tse, David N. C., 383
TT, see Type Theorem
type, 17
chain rule, 28
conditional, 28
joint, 28
type class, 18
conditional, 29
type covering lemma, 144
Type Theorem
Conditional, 29
CTT1, 29
CTT2, 29
CTT3, 30
CTT4, 32
TT1, 19, 28
TT2, 20, 28
TT3, 22, 28
TT4, 26, 28
typical set
chain rule, 165
conditionally strongly, 71
conditionally strongly,
alternative definition, 81
joint implies individual, 69
jointly strongly, 68
Markov Lemma, 193
strongly, 64
strongly conditional on letter,
71
Theorem A, 65, 70
Theorem B, 72
Theorem C, 76
weakly, 63
U
Union Bound, 7
on total expectation, 7
V
variational distance, 48
Venkataramani, Raman, xi, 178,
183
Verdú, Sergio, 9
Viswanatha, Kumar, 183
Voronoi partition, 88
W
Wang, Hua, 383
Wang, Zizhou Vincent, 305
waterfilling, 130
Weierstrass, Karl, 16
Wilmer, Elizabeth L., 50
Witsenhausen, Hans S., 182
Wolf, Jack K., 182, 209, 249, 335
Wyner, Aaron D., 182, 204
Wyner-Ziv problem, 185
coding scheme, 186
coding theorem, 204
achievability, 187
converse, 202
Gaussian source, 198
rate distortion function with
global side-information, 195
Wyner-Ziv rate distortion
function, 186, 194
properties, 201
Wyner-Ziv rate distortion
pair, 186
Wyner-Ziv rate distortion
region, 186
Y
Yeung, Raymond W., xi
Z
Zhang, Zhen, 182
Ziv, Jacob, 182, 204