Vous êtes sur la page 1sur 498

Information Theory

Lecture Notes

Stefan M. Moser

Version 3 2013

c Copyright Stefan M. Moser


Signal and Information Processing Lab
ETH Zrich
u
Zurich, Switzerland
Department of Electrical and Computer Engineering
National Chiao Tung University (NCTU)
Hsinchu, Taiwan
You are welcome to use these lecture notes for yourself, for teaching, or for any
other noncommercial purpose. If you use extracts from these lecture notes,
please make sure that their origin is shown. The author assumes no liability
or responsibility for any errors or omissions.

Version 3.2.
Compiled on 6 December 2013.

Contents
Preface

xi

1 Shannons Measure of Information


1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Uncertainty or Entropy . . . . . . . . . . . . . . . . . . . . . .
1.2.1 Denition . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.2 Binary Entropy Function . . . . . . . . . . . . . . . . .
1.2.3 The Information Theory Inequality . . . . . . . . . . .
1.2.4 Bounds on H(U ) . . . . . . . . . . . . . . . . . . . . . .
1.2.5 Conditional Entropy . . . . . . . . . . . . . . . . . . . .
1.2.6 More Than Two RVs . . . . . . . . . . . . . . . . . . .
1.2.7 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Relative Entropy and Variational Distance . . . . . . . . . . .
1.4 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . .
1.4.1 Denition . . . . . . . . . . . . . . . . . . . . . . . . . .
1.4.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . .
1.4.3 Conditional Mutual Information . . . . . . . . . . . . .
1.4.4 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . .
1.5 Jensens Inequality . . . . . . . . . . . . . . . . . . . . . . . . .
1.6 Comments on our Notation . . . . . . . . . . . . . . . . . . . .
1.6.1 General . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.6.2 Entropy and Mutual Information . . . . . . . . . . . .
1.A Appendix: Uniqueness of the Denition of Entropy . . . . . .
1.B Appendix: Entropy and Variational Distance . . . . . . . . . .
1.B.1 Estimating PMFs . . . . . . . . . . . . . . . . . . . . .
1.B.2 Extremal Entropy for given Variational Distance . . . .
1.B.3 Lower Bound on Entropy in Terms of Variational Distance

1
1
6
6
8
9
10
12
15
16
17
19
19
21
22
22
22
23
23
24
25
27
27
28
32

2 Review of Probability Theory


35
2.1 Discrete Probability Theory . . . . . . . . . . . . . . . . . . . 35
2.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . 36
2.3 Continuous Random Variables . . . . . . . . . . . . . . . . . . 40
3 Data Compression: Ecient Coding of a Single Random
Message

43

c Stefan M. Moser, v. 3.2

vi

Contents
3.1
3.2
3.3
3.4
3.5
3.6
3.7

A Motivating Example . . . . . . . . . . . . . . . . . . . . . .
A Coding Scheme . . . . . . . . . . . . . . . . . . . . . . . . .
Prex-Free or Instantaneous Codes . . . . . . . . . . . . . . .
Trees and Codes . . . . . . . . . . . . . . . . . . . . . . . . . .
The Kraft Inequality . . . . . . . . . . . . . . . . . . . . . . .
Trees with Probabilities . . . . . . . . . . . . . . . . . . . . . .
What We Cannot Do: Fundamental Limitations of Source
Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.8 What We Can Do: Analysis of Some Good Codes . . . . . . .
3.8.1 Shannon-Type Codes . . . . . . . . . . . . . . . . . . .
3.8.2 Shannon Code . . . . . . . . . . . . . . . . . . . . . . .
3.8.3 Fano Code . . . . . . . . . . . . . . . . . . . . . . . . .
3.8.4 Coding Theorem for a Single Random Message . . . . .
3.9 Optimal Codes: Human Code . . . . . . . . . . . . . . . . . .
3.10 Types of Codes . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.A Appendix: Alternative Proof for the Converse Part of the Coding Theorem for a Single Random Message . . . . . . . . . . .

43
45
46
47
51
53
59
60
61
63
67
74
76
87
92

4 Data Compression: Ecient Coding of an Information Source 95


4.1 A Discrete Memoryless Source . . . . . . . . . . . . . . . . . . 95
4.2 BlocktoVariable-Length Coding of a DMS . . . . . . . . . . 96
4.3 Arithmetic Coding . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3.2 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3.3 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.4 Variable-LengthtoBlock Coding of a DMS . . . . . . . . . . 107
4.5 General Converse . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.6 Optimal Message Sets: Tunstall Message Sets . . . . . . . . . 113
4.7 Optimal Variable-LengthtoBlock Codes: Tunstall Codes . . 115
4.8 The Eciency of a Source Coding Scheme . . . . . . . . . . . 120
5 Stochastic Processes and Entropy
5.1 Discrete Stationary Sources . . .
5.2 Markov Processes . . . . . . . .
5.3 Entropy Rate . . . . . . . . . . .

Rate
121
. . . . . . . . . . . . . . . . . 121
. . . . . . . . . . . . . . . . . 122
. . . . . . . . . . . . . . . . . 128

6 Data Compression: Ecient Coding of Sources with Memory135


6.1 BlocktoVariable-Length Coding of a DSS . . . . . . . . . . . 135
6.2 EliasWillems Universal BlockToVariable-Length Coding . . 138
6.2.1 The Recency Rank Calculator . . . . . . . . . . . . . . 139
6.2.2 Codes for Positive Integers . . . . . . . . . . . . . . . . 141
6.2.3 EliasWillems BlocktoVariable-Length Coding for a
DSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.3 LempelZiv Universal Coding Schemes . . . . . . . . . . . . . 146
6.3.1 LZ-77: Sliding Window LempelZiv . . . . . . . . . . . 146
6.3.2 LZ-78: Tree-Structured LempelZiv . . . . . . . . . . . 149

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

vii

Contents
6.4

Analysis of LZ-78 . . . . . . . . . . . . . . . . . .
6.4.1 Distinct Parsing . . . . . . . . . . . . . . .
6.4.2 Number of Phrases . . . . . . . . . . . . .
6.4.3 Maximum Entropy Distribution . . . . . .
6.4.4 Denition of a Stationary Markov Process
n
6.4.5 Distinct Parsing of U1 . . . . . . . . . . .
6.4.6 Asymptotic Behavior of LZ-78 . . . . . . .

7 Optimizing Probability Vectors over Concave


KarushKuhnTucker Conditions
7.1 Introduction . . . . . . . . . . . . . . . . . . . .
7.2 Convex Regions and Concave Functions . . . . .
7.3 Maximizing Concave Functions . . . . . . . . . .
7.A Appendix: The Slope Paradox . . . . . . . . . .
8 Gambling and Horse Betting
8.1 Problem Setup . . . . . . . . . . . .
8.2 Optimal Gambling Strategy . . . . .
8.3 The Bookies Perspective . . . . . .
8.4 Uniform Fair Odds . . . . . . . . . .
8.5 What About Not Gambling? . . . .
8.6 Optimal Gambling for Subfair Odds
8.7 Gambling with Side-Information . .
8.8 Dependent Horse Races . . . . . . .

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

150
150
150
153
155
158
160

.
.
.
.

165
165
166
168
174

.
.
.
.
.
.
.
.

177
177
178
182
184
185
186
191
193

.
.
.
.
.
.
.
.

197
197
202
204
208
211
213
215
217

.
.
.
.
.

229
229
229
233
237
240

Functions:
.
.
.
.

.
.
.
.
.
.
.
.

9 Data Transmission over a Noisy Digital Channel


9.1 Problem Setup . . . . . . . . . . . . . . . . . . . .
9.2 Discrete Memoryless Channels . . . . . . . . . . .
9.3 Coding for a DMC . . . . . . . . . . . . . . . . . .
9.4 The Bhattacharyya Bound . . . . . . . . . . . . .
9.5 Operational Capacity . . . . . . . . . . . . . . . .
9.6 Two Important Lemmas . . . . . . . . . . . . . .
9.7 Converse to the Channel Coding Theorem . . . .
9.8 The Channel Coding Theorem . . . . . . . . . . .
10 Computing Capacity
10.1 Introduction . . . . . . . . . . . . .
10.2 Strongly Symmetric DMCs . . . . .
10.3 Weakly Symmetric DMCs . . . . . .
10.4 Mutual Information and Convexity
10.5 KarushKuhnTucker Conditions .

.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

11 Convolutional Codes
245
11.1 Convolutional Encoder of a Trellis Code . . . . . . . . . . . . . 245
11.2 Decoder of a Trellis Code . . . . . . . . . . . . . . . . . . . . . 248
11.3 Quality of a Trellis Code . . . . . . . . . . . . . . . . . . . . . 252

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

viii

Contents
11.3.1 Detours in a Trellis . . . . . . . . . . . . . . . . . . . . 253
11.3.2 Counting Detours: Signalowgraphs . . . . . . . . . . . 256
11.3.3 Upper Bound on the Bit Error Probability of a Trellis
Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

12 Error Exponent and Channel Reliability Function


12.1 The Union Bhattacharyya Bound . . . . . . . . . .
12.2 The Gallager Bound . . . . . . . . . . . . . . . . . .
12.3 The Bhattacharyya Exponent and the Cut-O Rate
12.4 The Gallager Exponent . . . . . . . . . . . . . . . .
12.5 Channel Reliability Function . . . . . . . . . . . . .

.
.
.
.
.

267
268
269
271
276
280

.
.
.
.
.
.
.
.

285
285
286
287
287
288
290
292
293

.
.
.
.

301
301
305
306
310

.
.
.
.
.
.
.

313
313
315
317
318
319
324
326

16 Bandlimited Channels
16.1 Additive White Gaussian Noise Channel . . . . . . . . . . . .
16.2 The Sampling Theorem . . . . . . . . . . . . . . . . . . . . . .
16.3 From Continuous To Discrete Time . . . . . . . . . . . . . . .

331
331
334
338

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

13 Joint Source and Channel Coding


13.1 Information Transmission System . . . . . . . . . . . . . . .
13.2 Converse to the Information Transmission Theorem . . . . .
13.3 Achievability of the Information Transmission Theorem . . .
13.3.1 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . .
13.3.2 An Achievable Joint Source Channel Coding Scheme
13.4 Joint Source and Channel Coding . . . . . . . . . . . . . . .
13.5 The Rate of a Joint Source Channel Coding Scheme . . . . .
13.6 Transmission above Capacity and Minimum Bit Error Rate .
14 Continuous Random Variables and Dierential Entropy
14.1 Entropy of Continuous Random Variables . . . . . . . . . .
14.2 Properties of Dierential Entropy . . . . . . . . . . . . . .
14.3 Generalizations and Further Denitions . . . . . . . . . . .
14.4 Multivariate Gaussian . . . . . . . . . . . . . . . . . . . . .
15 The
15.1
15.2
15.3

Gaussian Channel
Introduction . . . . . . .
Information Capacity . .
Channel Coding Theorem
15.3.1 Plausibility . . . .
15.3.2 Achievability . . .
15.3.3 Converse . . . . .
15.4 Joint Source and Channel

. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
Coding Theorem

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.

17 Parallel Gaussian Channels


343
17.1 Independent Parallel Gaussian Channels: Waterlling . . . . . 343
17.2 Dependent Parallel Gaussian Channels . . . . . . . . . . . . . 348
17.3 Colored Gaussian Noise . . . . . . . . . . . . . . . . . . . . . . 351

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

ix

Contents
18 Asymptotic Equipartition Property and Weak
18.1 Motivation . . . . . . . . . . . . . . . . . . . .
18.2 AEP . . . . . . . . . . . . . . . . . . . . . . . .
18.3 Typical Set . . . . . . . . . . . . . . . . . . . .
18.4 High-Probability Sets and the Typical Set . . .
18.5 Data Compression Revisited . . . . . . . . . .
18.6 AEP for General Sources with Memory . . . .
18.7 General Source Coding Theorem . . . . . . . .
18.8 Joint AEP . . . . . . . . . . . . . . . . . . . .
18.9 Jointly Typical Sequences . . . . . . . . . . . .
18.10 Data Transmission Revisited . . . . . . . . . .
18.11 Joint Source and Channel Coding Revisited . .
18.12 Continuous AEP and Typical Sets . . . . . . .
18.13 Summary . . . . . . . . . . . . . . . . . . . . .

Typicality
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

353
353
355
357
359
361
363
364
366
367
370
372
374
377

19 Cryptography
19.1 Introduction to Cryptography . . . . . . .
19.2 Cryptographic System Model . . . . . . . .
19.3 Kerckho Hypothesis . . . . . . . . . . . .
19.4 Perfect Secrecy . . . . . . . . . . . . . . . .
19.5 Imperfect Secrecy . . . . . . . . . . . . . .
19.6 Computational vs. Unconditional Security .
19.7 Public-Key Cryptography . . . . . . . . . .
19.7.1 One-Way Function . . . . . . . . . .
19.7.2 Trapdoor One-Way Function . . . .

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

379
379
380
381
381
384
387
387
389
390

A Gaussian Random Variables


A.1 Standard Gaussian Random Variables . . .
A.2 Gaussian Random Variables . . . . . . . .
A.3 The Q-Function . . . . . . . . . . . . . . .
A.4 The Characteristic Function of a Gaussian
A.5 A Summary . . . . . . . . . . . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

395
395
397
400
406
408

B Gaussian Vectors
B.1 Positive Semi-Denite Matrices . . . . . . . . . . . . . . . . .
B.2 Random Vectors and Covariance Matrices . . . . . . . . . . .
B.3 The Characteristic Function . . . . . . . . . . . . . . . . . .
B.4 A Standard Gaussian Vector . . . . . . . . . . . . . . . . . .
B.5 Gaussian Vectors . . . . . . . . . . . . . . . . . . . . . . . . .
B.6 The Mean and Covariance Determine the Law of a Gaussian
B.7 Canonical Representation of Centered Gaussian Vectors . . .
B.8 The Characteristic Function of a Gaussian Vector . . . . . .
B.9 The Density of a Gaussian Vector . . . . . . . . . . . . . . .
B.10 Linear Functions of Gaussian Vectors . . . . . . . . . . . . .
B.11 A Summary . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.

409
409
414
418
419
420
423
427
429
429
432
433

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

x
C Stochastic Processes
C.1 Stochastic Processes & Stationarity . . . . . . .
C.2 The Autocovariance Function . . . . . . . . . . .
C.3 Gaussian Processes . . . . . . . . . . . . . . . .
C.4 The Power Spectral Density . . . . . . . . . . .
C.5 Linear Functionals of WSS Stochastic Processes
C.6 Filtering Stochastic Processes . . . . . . . . . . .
C.7 White Gaussian Noise . . . . . . . . . . . . . . .
C.8 Orthonormal and KarhunenLoeve Expansions .

Contents

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

439
439
441
443
444
445
449
450
453

Bibliography

463

List of Figures

467

List of Tables

471

Index

473

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Preface
These lecture notes started out as handwritten guidelines that I used myself
in class for teaching. As I got frequent and persistent requests from students
attending the class to hand out these private notes in spite of their awful
state (I still cannot really believe that any student was actually able to read
them!), my students Lin Gu-Rong and Lin Hsuan-Yin took matters into their
A
own hands and started to typeset my handwritten notes in L TEX. These
versions of notes then grew together with a couple of loose handouts (that
complemented the textbook by Cover and Thomas [CT06] that I had been
using as class textbook for several years) to a large pile of proper handouts and
were used several times in combination with Cover and Thomas. During this
time, the notes were constantly improved and extended. In this context I have
to acknowledge the continued help of my students, in particular of Lin HsuanYin and of Chang Hui-Ting, who typed the chapter about cryptography.
In fall 2008 my colleague Chen Po-Ning approached me and suggested to
write a coding and information theory textbook for students with only little
background in engineering or math. Together with three other colleagues we
worked on this project for over two years until it got completed and published
in 2012 [MCck]. This work had quite some impact on the presentation of
some of the material of this class. In late 2010, I nally decided to compile
all separate notes together, to add some more details in some places and
rearrange the material, and to generate a proper lecture script that could be
used as class textbook in future. Its rst edition was used in class in fall
2011/2012 during which it underwent further revisions and improvements.
For the current second edition, I then also added one more chapter about the
missing topic of error exponents.
In its current form, this script introduces the most important basics in
information theory. Depending on the range and depth of the selected class
material, it can be covered in about 20 to 30 lectures of two hours each.
Roughly, the script can be divided into three main parts: Chapters 36 cover
lossless data compression of discrete sources; Chapters 913 look at data transmission over discrete channels; and Chapters 1417 deal with topics related to
the Gaussian channel. Besides these main themes, the notes also briey cover
a couple of additional topics like convex optimization, gambling and horse
betting, typicality, cryptography, and Gaussian random variables.
More in detail, the script starts with an introduction of the main quantities of information theory like entropy and mutual information in Chapter 1,

xi

c Stefan M. Moser, v. 3.2

xii

Preface

followed by a very brief review of probability in Chapter 2. In Chapters 3


and 4, lossless data compression of discrete memoryless sources is introduced
including Human and Tunstall coding. Chapters 5 and 6 extend these results to sources with memory and to universal data compression. Two dierent universal compression schemes are introduced: EliasWillems coding and
LempelZiv coding.
As a preparation for later topics, in Chapter 7, we then discuss convex
optimization and the KarushKuhnTucker (KKT) conditions. Chapter 8 is
an interlude that introduces a quite dierent aspect of information theory:
gambling and horse betting. It is quite separate from the rest of the script
and only relies on some denitions from Chapter 1 and the KKT conditions
from Chapter 7.
Then, Chapter 9 introduces the fundamental problem of data transmission
and derives the channel coding theorem for discrete memoryless channels. In
Chapter 10, we discuss the problem of computing capacity, and in Chapter 11,
we show one practical algorithm of data transmission: convolutional codes
based on a trellis. Chapter 12 is slightly more advanced and is an introduction
to the concept of error exponents.
In Chapter 13, we combine source compression and channel transmission
and discuss the basic problem of transmitting a given source over a given
channel. In particular, we discuss the consequences of transmitting a source
with an entropy rate above capacity.
Before we extend the discussion of channel capacity to continuous alphabets (i.e., to the example of the Gaussian channel) in Chapter 15, we prepare
the required background in Chapter 14. Chapters 16 and 17 then give short
glimpses at continuous-time channels and at waterlling, respectively.
Up to Chapter 17, I have avoided the usage of weak typicality. Chapter 18
corrects this omission. It introduces the asymptotic equipartition property and
typical sets, and then uses this new tool to re-derive several proofs of previous
chapters (data compression, data transmission and joint source and channel
coding). Note that for the basic understanding of the main concepts in this
script, Chapter 18 is not necessary. Also, strong typicality is not covered here,
but deferred to the second course Advanced Topics in Information Theory
[Mos13] where it is treated in great detail.
Lastly, Chapter 19 presents a very brief overview of some of the most
important basic results in cryptography.
I have made the experience that many students are mortally afraid of Gaussian random variables, vectors, and processes. I believe that this is mainly
because they are never properly explained in undergraduate classes. Therefore
I have decided to include a quite extensive coverage of them in the appendices,
even though I usually do not teach them in class due to lack of time.
For a better understanding of the dependencies and relationships between
the dierent chapters and of the prerequisites, on page xv a dependency chart
can be found.
I cannot and do not claim authorship for much of the material covered in
here. My main contribution is the compilation, arrangement, and some more

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Preface

xiii

or less strong adaptation. There are many sources that I used as inspiration
and from where I took the ideas of how to present information theory. Most
important of all are of course my two teachers: Prof. James L. Massey during
my master study and Prof. Amos Lapidoth during the time I was working
on my Ph.D. Jim and Amos taught me most of what I know in information
theory!
More in detail, I have used the following sources:
The basic denitions in information theory (Chapters 1, 14, 18) are
based on the textbook of Cover and Thomas [CT06]. Also from there
come the treatment of horse betting (Chapter 8) and the analysis of the
LempelZiv universal data compression (Chapter 6).
The idea for the proof of the channel coding theorem (Chapters 9 and 15)
using the beautifully simple threshold decoder stems from Tobias Koch,
who got inspired by the work of Polyanskiy, Poor and Verd [PPV10].
u
In principle, the threshold decoder goes back to a paper by Feinstein
[Fei54].
Most of the material about lossless data compression (Chapters 36)
is very strongly inspired by the wonderful lecture notes of Jim Massey
[Mas96]. In particular, I fully use the beautiful approach of trees to
describe and analyze source codes. Also the chapter about computing
capacity (Chapter 10), the introduction to convolutional codes (Chapter 11), the introduction to error exponents (Chapter 12), and the
overview over cryptography (Chapter 19) closely follow Jims teaching
style.
The material about the joint source and channel coding in Chapter 13,
on the other hand, is inspired by the teaching of Chen Po-Ning [CA05a],
[CA05b].
For convex optimization I highly recommend the summary given by Bob
Gallager in his famous textbook from 1968 [Gal68]. From this book also
stem some inspiration on the presentation of the material about error
exponents.
Finally, for a very exact, but still easy-to-understand treatment of stochastic processes (in particular Gaussian processes) I do not know any
better book than A Foundation in Digital Communication by Amos
Lapidoth [Lap09]. From this book stem Appendices AC (with the
exception of Section C.8 which is based on notes from Bob Gallager).
Also the review of probability theory (Chapter 2) is strongly based on
notes by Amos.
All of these sources are very inspiring and highly recommended for anyone
who would like to learn more.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

xiv

Preface

There are several important topics that are not covered in this script.
Most notably, rate distortion theory is missing. While extremely beautiful
and fundamental, rate distortion theory still is of less practical importance,
and therefore I decided to defer it to the course Advanced Topics in Information Theory [Mos13]. In this subsequent course many more advanced topics
are covered. In particular, it introduces strong typicality including Sanovs
theorem and the conditional limit theorem, rate distortion theory, distributed
source coding, and various multiple-user transmission schemes.
This script is going to be improved continually. So if you nd typos, errors,
or if you have any comments about these notes, I would be very happy to hear
them! Write to

Thanks!
Finally, I would like to thank Yin-Tai and Matthias who were very patient
with me whenever I sat writing on my computer, with my thoughts far away. . .

Stefan M. Moser

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

xv

Preface

14

19

10

13

11

12

15

17

16

18

Dependency chart. An arrow has the meaning of is required for.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 1

Shannons Measure of
Information
1.1

Motivation

We start by asking the question: What is information?


Lets consider some examples of sentences that contain some information:
The weather will be good tomorrow.
The weather was bad last Sunday.
The president of Taiwan will come to you tomorrow and will give you
one million dollars.
The second statement seems not very interesting as you might already know
what the weather has been like last Sunday. The last statement is much
more exciting than the rst two and therefore seems to contain much more
information. But on the other hand do you actually believe it? Do you think
it is likely that you will receive tomorrow one million dollars?
Lets make some easier examples:
You ask: Is the temperature in Taiwan currently above 30 degrees?
This question has only two possible answers: yes or no.
You ask: The president of Taiwan has spoken with a certain person
from Hsinchu today. With whom?
Here, the question has about 400,000 possible answers (since Hsinchu
has about 400,000 inhabitants).
Obviously the second answer provides you with a much bigger amount of
information than the rst one. We learn the following:

c Stefan M. Moser, v. 3.2

Shannons Measure of Information

The number of possible answers r should be linked to information.

Lets have another example.


You observe a gambler throwing a fair die. There are 6 possible outcomes
{1, 2, 3, 4, 5, 6}. You note the outcome and then tell it to a friend. By
doing so you give your friend a certain amount of information.
Next you observe the gambler throwing the die three times. Again, you
note the three outcomes and tell them to your friend. Obviously, the
amount of information that you give to your friend this time is three
times as much as the rst time.
So we learn:
Information should be additive in some sense.
Now we face a new problem: Regarding the example of the gambler above
we see that in the rst case we have r = 6 possible answers, while in the second
case we have r = 63 = 216 possible answers. Hence in the second experiment
there are 36 times more possible outcomes than in the rst experiment. But
we would like to have only a 3 times larger amount of information. So how
do we solve this?
A quite obvious idea is to use a logarithm. If we take the logarithm of
the number of possible answers, then the exponent 3 will become a factor 3,
exactly as we wish: logb 63 = 3 logb 6.
Precisely these observations have been made by the researcher Ralph Hartley in 1928 in Bell Labs [Har28]. He gave the following denition.
Denition 1.1. We dene the following measure of information:

I(U )

logb r,

(1.1)

where r is the number of all possible outcomes of a random message U .


Using this denition we can conrm that it has the wanted property of
additivity:

I(U1 , U2 , . . . , Un ) = logb rn = n logb r = nI(U ).

(1.2)

Hartley also correctly noted that the basis b of the logarithm is not really
important for this measure. It only decides about the unit of information.
So similarly to the fact that 1 km is the same distance as 1000 m, b is only
a change of units without actually changing the amount of information it
describes.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.1. Motivation

For two important and one unimportant special cases of b it has been
agreed to use the following names for these units:
b = 2 (log2 ):

bit,

b = e (ln):

nat (natural logarithm),

b = 10 (log10 ):

Hartley.

Note that the unit Hartley has been chosen in honor of the rst researcher who
has made a rst (partially correct) attempt of dening information. However,
as nobody in the world ever uses the basis b = 10, this honor is questionable. . .

The measure I(U ) is the right answer to many technical problems.


Example 1.2. A village has 8 telephones. How long must the phone number
be? Or asked dierently: How many bits of information do we need to send
to the central oce so that we are connected to a particular phone?
8 phones = log2 8 = 3 bits.

(1.3)

We choose the following phone numbers:


{000, 001, 010, 011, 100, 101, 110, 111}.

(1.4)

In spite of its usefulness, Hartleys denition had no eect what so ever in


the world. Thats life. . . On the other hand, it must be admitted that Hartleys
denition has a fundamental aw. To realize that something must be wrong,
note that according to (1.1) the smallest nonzero amount of information is
log2 2 = 1 bit. This might sound like only a small amount of information,
but actually 1 bit can be a lot of information! As an example consider the
1-bit (yes or no) answer if a man asks a woman whether she wants to marry
him. . . If you still dont believe that 1 bit is a huge amount of information,
consider the following example.
Example 1.3. Currently there are 6,902,106,897 persons living on our planet
[U.S. Census Bureau, 25 February 2011, 13:43 Taiwan time]. How long must
a binary telephone number U be if we want to be able to connect to every
person?
According to Hartley we need

I(U ) = log2 (6902106897) 32.7 bits.

(1.5)

So with only 33 bits we can address every single person on this planet! Or, in
other words, we only need 33 times 1 bit in order to distinguish every human
being alive.

We see that 1 bit is a lot of information and it cannot be that this is the
smallest amount of (nonzero) information.
To understand more deeply what is wrong, consider the two hats shown in
Figure 1.1. Each hat contains four balls where the balls can be either white

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Shannons Measure of Information


black and white balls
A:

B:

Figure 1.1: Two hats with four balls each.

or black. Lets draw one ball at random and let U be the color of the ball. In

hat A we have r = 2 colors: black and white, i.e., I(UA ) = log2 2 = 1 bit. In
B ) = 1 bit. But obviously,
hat B we also have r = 2 colors and hence also I(U
we get less information if in hat B black shows up, since we somehow expect
black to show up in the rst place. Black is much more likely!
We realize the following:

A proper measure of information needs to take into account the


probabilities of the various possible events.

This has been observed for the rst time by Claude Elwood Shannon
in 1948 in his landmark paper: A Mathematical Theory of Communication
[Sha48]. This paper has been like an explosion in the research community!1
Before 1948, the engineering community was mainly interested in the behavior of a sinusoidal waveform that is passed through a communication system. Shannon, however, asked why we want to transmit a deterministic sinusoidal signal. The receiver already knows in advance that it will be a sinus,
so it is much simpler to generate one at the receiver directly rather than to
transmit it over a channel! In other words, Shannon had the fundamental
insight that we need to consider random messages rather than deterministic
messages whenever we deal with information.
Lets go back to the example of the hats in Figure 1.1 and have a closer
look at hat B:
There is one chance out of four possibilities that we draw a white ball.
Since we would like to use Hartleys measure here, we recall that the
quantity r inside the logarithm in (1.1) is the number of all possible
outcomes of a random message. Hence, from Hartleys point of view,
we will see one realization out of r possible realizations. Translated to
the case of the white ball, we see that we have one realization out of
1

Besides the amazing accomplishment of inventing information theory, at the age of 21


Shannon also invented the computer in his Master thesis! He proved that electrical circuits
can be used to perform logical and mathematical operations, which was the foundation of
digital computer and digital circuit theory. It is probably the most important Master thesis
of the 20th century! Incredible, isnt it?

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.1. Motivation

four possible realizations, i.e.,


log2 4 = 2 bits

(1.6)

of information.
On the other hand there are three chances out of four that we draw a
black ball.
Here we cannot use Hartleys measure directly. But it is possible to
translate the problem into a form that makes it somehow accessible to
Hartley. We need to normalize the statement into a form that gives
us one realization out of r. This can be done if we divide everything
by 3, the number of black balls: We have 1 chance out of 4 possibilities
3
(whatever this means), or, stated dierently, we have one realization out
of 4 possible realizations, i.e.,
3
log2

4
= 0.415 bits
3

(1.7)

of information.
So now we have two dierent values depending on what color we get. How
shall we combine them to one value that represents the information? The
most obvious choice is to average it, i.e., we weigh the dierent information
values according to their probabilities of occurrence:
1
3
2 bits + 0.415 bits = 0.811 bits
4
4

(1.8)

1
3
4
log2 4 + log2 = 0.811 bits.
4
4
3

(1.9)

or

We see the following:


Shannons measure of information is an average Hartley information:
r

pi log2
i=1

1
=
pi

pi log2 pi ,

(1.10)

i=1

where pi denotes the probability of the ith possible outcome.


We end this introductory section by pointing out that the given three
motivating ideas, i.e.,
1. the number of possible answers r should be linked to information;
2. information should be additive in some sense; and

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Shannons Measure of Information


3. a proper measure of information needs to take into account the probabilities of the various possible events,

are not sucient to exclusively specify (1.10). In Appendix 1.A we will give
some more information on why Shannons measure should be dened like
(1.10) and not dierently. However, the true justication is as always in
physics that Shannons denition turns out to be useful.

1.2
1.2.1

Uncertainty or Entropy
Denition

We now formally dene the Shannon measure of self-information of a source.


Due to its relationship with a corresponding concept in dierent areas of
physics, Shannon called his measure entropy. We will stick to this name as it
is standard in the whole literature. However note that uncertainty would be
a far more precise description.
Denition 1.4. The uncertainty or entropy of a discrete random variable
(RV) U that takes value in the set U (also called alphabet U ) is dened as
H(U )

PU (u) logb PU (u),

(1.11)

usupp(PU )

where PU () denotes the probability mass function (PMF) 2 of the RV U , and


where the support of PU is dened as
{u U : PU (u) > 0}.

supp(PU )

(1.12)

Another, more mathematical, but often very convenient form to write the
entropy is by means of expectation:
H(U ) = EU [logb PU (U )].

(1.13)

Be careful about the two capital U : one denotes the name of the PMF, the
other is the RV that is averaged over.
Remark 1.5. We have on purpose excluded the cases for which PU (u) = 0 so
that we do not get into trouble with logb 0 = . On the other hand, we also
note that PU (u) = 0 means that the symbol u never shows up. It therefore
should not contribute to the uncertainty in any case. Luckily this is the case:
lim t logb t = 0,
t0

(1.14)

i.e., we do not need to worry about this case.


So we note the following:
2

Note that sometimes (but only if it is clear from the argument!) we will drop the
subscript of the PMF: P (u) = PU (u).

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.2. Uncertainty or Entropy

We will usually neglect to mention support when we sum over


PU (u) logb PU (u), i.e., we implicitly assume that we exclude all u with
zero probability PU (u) = 0.
As in the case of the Hartley measure of information, b denotes the unit
of uncertainty:
b=2:

bit,

b=e:

nat,

b = 10 :

Hartley.

(1.15)

If the base of the logarithm is not specied, then we can choose it freely
ourselves. However, note that the units are very important! A statement
H(U ) = 0.43 is completely meaningless. Since
ln
,
(1.16)
ln b
(with ln() denoting the natural logarithm) 0.43 could mean anything as, e.g.,
logb =

if b = 2 :

H(U ) = 0.43 bits,

if b = e :
8

if b = 256 = 2 :

(1.17)

H(U ) = 0.43 nats 0.620 bits,

(1.18)

H(U ) = 0.43 bytes = 3.44 bits.

(1.19)

This is the same idea as 100 m not being the same distance as 100 km.
So remember:
If we do not specify the base of the logarithm, then the reader can
choose the unit freely. However, never forget to add the units, once you
write some concrete numbers!
Note that the term bits is used in two ways: Its rst meaning is the unit of
entropy when the base of the logarithm is chosen to be 2. Its second meaning
is binary digits, i.e., in particular the number of digits of a binary codeword.
Remark 1.6. It is worth mentioning that if all r events are equally likely,
Shannons denition of entropy reduces to Hartleys measure:
1
pi = , i :
r

H(U ) =

i=1

1
1
1
logb = logb r
r
r
r

1 = logb r.

(1.20)

i=1
=r

Remark 1.7. Be careful not to confuse uncertainty with information! For


motivation purposes, in Section 1.1 we have talked a lot about information.
However, what we actually meant there is self-information or more nicely
put uncertainty. You will learn soon that information is what you get by
reducing uncertainty.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Shannons Measure of Information

Another important observation is that the entropy of U does not depend


on the dierent possible values that U can take on, but only on the probabilities
of these values. Hence,
U

1 ,

2 ,

(1.21)

with
with
with
1
1
prob. 2 prob. 1 prob. 6
3

and
V

34 , 512 , 981

(1.22)

with
with
with
1
1
prob. 2 prob. 3 prob. 1
6

have both the same entropy, which is


1 1
1 1
1
1
H(U ) = H(V ) = log2 log2 log2 1.46 bits.
2
2 3
3 6
6
This actually also holds true if we consider a random vector:
W

1
,
1

0
13
,
5
12

(1.23)

(1.24)

with
with
with
1
prob. 1 prob. 1 prob. 6
2
3

i.e., H(W) = H(U ) = H(V ). Hence we can easily extend our denition to
random vectors.
Denition 1.8. The uncertainty or entropy of a discrete random vector W =
(X, Y )T is dened as
H(W) = H(X, Y )

EX,Y [logb PX,Y (X, Y )]


=

(1.25)

PX,Y (x, y) logb PX,Y (x, y).

(1.26)

(x,y)supp(PX,Y )

Here PX,Y (, ) denotes the joint probability mass function of (X, Y ) (see Section 2.2 for a review of discrete RVs).

1.2.2

Binary Entropy Function

One special case of entropy is so important that we introduce a specic name.


Denition 1.9. If U is binary with two possible values u1 and u2 , U =
{u1 , u2 }, such that Pr[U = u1 ] = p and Pr[U = u2 ] = 1 p, then
H(U ) = Hb (p)

(1.27)

where Hb () is called the binary entropy function and is dened as


Hb (p)

p log2 p (1 p) log2 (1 p),

p [0, 1].

(1.28)

The function Hb () is shown in Figure 1.2.


Exercise 1.10. Show that the maximal value of Hb (p) is 1 bit and is taken

on for p = 1 .
2

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.2. Uncertainty or Entropy

1
0.9

Hb (p) [bits]

0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

p
Figure 1.2: Binary entropy function Hb (p) as a function of the probability p.

1.2.3

The Information Theory Inequality

The following inequality has not really a name, but since it is so important
in information theory, we will follow Prof. James L. Massey, retired professor
at ETH in Zurich, and call it the information theory inequality or the IT
inequality.
Proposition 1.11 (IT Inequality). For any base b > 0 and any > 0,
1

logb e logb ( 1) logb e

(1.29)

with equalities on both sides if, and only if, = 1.


Proof: Actually, Figure 1.3 can be regarded as a proof. For those readers who would like a formal proof, we provide next a mathematical derivation.
We start with the upper bound. First note that
logb

=1

= 0 = ( 1) logb e

=1

(1.30)

Then have a look at the derivatives:


d
( 1) logb e = logb e
d

(1.31)

and
1
d
logb = logb e
d

> logb e
< logb e

if 0 < < 1,
if > 1.

(1.32)

Hence, the two functions coincide at = 1, and the linear function is above
the logarithm for all other values.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

10

Shannons Measure of Information

4
3
2
1
0
1
2
3
(1) logb e
logb
1 1 logb e

4
5
0

0.5

1.5

2.5

3.5

4.5

Figure 1.3: Illustration of the IT inequality.

To prove the lower bound again note that


1

= 0 = logb

logb e
=1

(1.33)

=1

and
1
d
1
d

logb e =

1
logb e
2

>
<

d
d
d
d

logb =
logb =

logb e

if 0 < < 1,

logb e

if > 1,

(1.34)

similarly to above.

1.2.4

Bounds on H(U )

Theorem 1.12. If U has r possible values, then


0 H(U ) log r,

(1.35)

where
H(U ) = 0

if, and only if,

H(U ) = log r

if, and only if,

PU (u) = 1
1
PU (u) =
r

for some u,

(1.36)

u.

(1.37)

Proof: Since 0 PU (u) 1, we have


PU (u) log2 PU (u)

=0
>0

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

if PU (u) = 1,
if 0 < PU (u) < 1.

(1.38)

1.2. Uncertainty or Entropy

11

Hence, H(U ) 0. Equality can only be achieved if PU (u) log2 PU (u) = 0 for
all u supp(PU ), i.e., PU (u) = 1 for all u supp(PU ).
To derive the upper bound we use a trick that is quite common in information theory: We take the dierence and try to show that it must be
nonpositive.
H(U ) log r =
=
=

usupp(PU )

usupp(PU )

usupp(PU )

PU (u) log PU (u) log r


PU (u) log PU (u)

PU (u) log r (1.40)


usupp(PU )

PU (u) log PU (u) r

PU (u) log

=
usupp(PU )

(1.39)

(1.41)

1
r PU (u)

(1.42)

1
1 log e
r PU (u)

PU (u)
usupp(PU )

=
usupp(PU )

PU (u)
usupp(PU )

log e

(1.43)

(1.44)

=1

1
r

usupp(PU )

1 1 log e

(1.45)

1
r

1
r 1 log e
r

(1.47)

= (1 1) log e = 0.

(1.48)

uU

1 1 log e

(1.46)

Here, (1.43) follows from the IT inequality (Proposition 1.11); and in (1.46)
we change the summation from u supp(PU ) to go over the whole alphabet
u U , i.e., we include additional (nonnegative) terms in the sum. Hence,
H(U ) log r.
Equality can only be achieved if
1. in (1.43), in the IT inequality = 1, i.e., if
1
1
= 1 = PU (u) = ,
r PU (u)
r

(1.49)

for all u; and if

2. in (1.46), the support of U contains all elements of the alphabet U , i.e.,3


3

|supp(PU )| = |U | = r.

(1.50)

By |U| we denote the number of elements in the set U.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

12

Shannons Measure of Information

Note that if Condition 1 is satised, Condition 2 is also satised.

1.2.5

Conditional Entropy

Similar to probability of random vectors, there is nothing really new about


conditional probabilities given that a particular event Y = y has occurred.
Denition 1.13. The conditional entropy or conditional uncertainty of the
RV X given the event Y = y is dened as

H(X|Y = y)

PX|Y (x|y) log PX|Y (x|y)

(1.51)

xsupp(PX|Y (|y))

= E log PX|Y (X|Y ) Y = y .

(1.52)

Note that the denition is identical to before apart from that everything
is conditioned on the event Y = y.
From Theorem 1.12 we immediately get the following.
Corollary 1.14. If X has r possible values, then
0 H(X|Y = y) log r;

H(X|Y = y) = 0

if, and only if,

H(X|Y = y) = log r

P (x|y) = 1
1
P (x|y) =
r

if, and only if,

(1.53)
for some x;

(1.54)

x.

(1.55)

Note that the conditional entropy given the event Y = y is a function of


y. Since Y is also a RV, we can now average over all possible events Y = y
according to the probabilities of each event. This will lead to the averaged
conditional entropy.
Denition 1.15. The conditional entropy or conditional uncertainty of the
RV X given the random variable Y is dened as
H(X|Y )
ysupp(PY )

PY (y) H(X|Y = y)

= EY [H(X|Y = y)]
=

(1.56)
(1.57)

PX,Y (x, y) log PX|Y (x|y)

(1.58)

(x,y)supp(PX,Y )

= E log PX|Y (X|Y ) .

(1.59)

The following observations should be straightforward:


PY (y)PX|Y (x|y) = PX,Y (x, y);
0 H(X|Y ) log r, where r is the number of values that the RV X
can take on;
H(X|Y ) = 0 if, and only if, PX|Y (x|y) = 1 for some x and y;
H(X|Y ) = log r if, and only if, PX|Y (x|y) = 1 , x and y;
r

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.2. Uncertainty or Entropy

13

H(X|Y ) = H(Y |X);


but, as we will see later, H(X) H(X|Y ) = H(Y ) H(Y |X).
Next we get the following very important theorem.

Theorem 1.16 (Conditioning Reduces Entropy).


For any two discrete RVs X and Y ,
H(X|Y ) H(X)

(1.60)

with equality if, and only if, X and Y are statistically independent, X
Y.
Proof: Again we use the same trick as above and prove the inequality
by showing that the dierence is nonpositive. We start by noting that
H(X) =
=

xsupp(PX )

PX (x) log PX (x) 1

(1.61)
PY |X (y|x)

(1.62)

PX (x)PY |X (y|x) log PX (x)

(1.63)

PX (x) log PX (x)


ysupp(PY |X (|x))

xsupp(PX )

=1

=
=

xsupp(PX ) ysupp(PY |X (|x))

PX,Y (x, y) log PX (x)

(1.64)

(x,y)supp(PX,Y )

such that
H(X|Y ) H(X) =

PX,Y (x, y) log PX|Y (x|y)


(x,y)supp(PX,Y )

PX,Y (x, y) log PX (x)

(1.65)

(x,y)supp(PX,Y )

PX,Y (x, y) log

=
(x,y)supp(PX,Y )

= E log

PX (X)
.
PX|Y (X|Y )

PX (x)
PX|Y (x|y)

(1.66)
(1.67)

Note that it is always possible to enlarge the summation of an expectation to


include more random variables, i.e.,
EX [f (X)] = EX,Y [f (X)] = E[f (X)].

(1.68)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

14

Shannons Measure of Information

Hence, the expression (1.67) could be much easier derived using the expectation notation:
H(X|Y ) H(X) = E log PX|Y (X|Y ) E[log PX (X)]
PX (X)
= E log
.
PX|Y (X|Y )

(1.69)
(1.70)

So, we have the following derivation:


H(X|Y ) H(X)
PX (X)
= E log
PX|Y (X|Y )
PX (X) PY (Y )
= E log
PX|Y (X|Y ) PY (Y )
PX (X)PY (Y )
= E log
PX,Y (X, Y )

(1.71)
(1.72)
(1.73)

PX,Y (x, y) log

=
(x,y)supp(PX,Y )

PX,Y (x, y)
(x,y)supp(PX,Y )

=
(x,y)supp(PX,Y )

(x,y)supp(PX,Y )

xX ,yY

xX

PX (x)PY (y)
1 log e
PX,Y (x, y)

PX (x)PY (y) PX,Y (x, y) log e

PX (x)PY (y)
PX,Y (x, y)

PX (x)PY (y) 1 log e

(1.74)
(1.75)
(1.76)

(1.77)

PX (x)PY (y) 1 log e

(1.78)

PY (y) 1 log e

(1.79)

PX (x)
yY

= (1 1) log e = 0.

(1.80)

Here, (1.75) again follows from the IT inequality (Proposition 1.11) and (1.78)
because we add additional terms to the sum. Hence, H(X|Y ) H(X).
Equality can be achieved if, and only if,
1. in (1.75), in the IT inequality = 1, i.e., if
P (x)P (y)
= 1 = P (x)P (y) = P (x, y)
P (x, y)
for all x, y (which means that X Y );

2. in (1.78), P (x) P (y) = 0 for P (x, y) = 0.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(1.81)

1.2. Uncertainty or Entropy

15

Note that if Condition 1 is satised, Condition 2 is also satised.


Remark 1.17. Attention: The conditioning reduces entropy rule only applies to random variables, not to events! In particular,
H(X|Y = y) H(X).

(1.82)

To understand why this is the case, consider the following example.


Example 1.18. Let X be the skin color of a randomly chosen human: yellow,
white, black. Lets assume that there are 50% yellow, 30% white, and 20%
black humans on this earth. Hence
H(X) = 0.5 log 0.5 0.2 log 0.2 0.3 log 0.3 1.49 bits.

(1.83)

Now let Y be the nationality of this randomly chosen human. If we are told
Y , our knowledge/uncertainty about X changes. Examples:
1. Y = Taiwan: In Taiwan about 100% of the population is yellow. Hence
H(X|Y = Taiwan) = 1 log 1 = 0 bits < H(X).
2. Y = Mixtany: Let Mixtany be a country where 1 are white, 1 are black,
3
3
1
and 1 are yellow. Hence H(X|Y = Mixtany) = 3 log 1 3 = log 3
3
3
1.58 bits > H(X).
So we see that depending on the value of Y , the uncertainty about X might be
reduced or increased. However, from Theorem 1.16 we know that on average
the knowledge of Y will reduce our uncertainty about X: H(X|Y ) H(X).

1.2.6

More Than Two RVs

We can now easily extend entropy and conditional entropy to more than two
RVs. We only show some examples involving three RVs. You should have no
troubles to extend it to an arbitrary number of RVs and events.
Denition 1.19. The conditional entropy of a RV X conditional on the RV
Y and the event Z = z is dened as
H(X|Y, Z = z)
E log PX|Y,Z (X|Y, z) Z = z

(x,y)supp(PX,Y |Z (,|z))

(1.84)

PX,Y |Z (x, y|z) log PX|Y,Z (x|y, z)

= EY [H(X|Y = y, Z = z) | Z = z]
=

ysupp(PY |Z (|z))

=
ysupp(PY |Z (|z))

(1.86)

PY |Z (y|z) H(X|Y = y, Z = z)
PY |Z (y|z)

(1.85)

(1.87)

PX|Y,Z (x|y, z)
xsupp(PX|Y,Z (|y,z))

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

16

Shannons Measure of Information

log PX|Y,Z (x|y, z)


=
=

(1.88)

PY |Z (y|z)PX|Y,Z (x|y, z) log PX|Y,Z (x|y, z)

ysupp(PY |Z (|z))
xsupp(PX|Y,Z (|y,z))

(1.89)

= PX,Y |Z (x,y|z)

(x,y)supp(PX,Y |Z (,|z))

PX,Y |Z (x, y|z) log PX|Y,Z (x|y, z).

(1.90)

The conditional entropy of X conditional on the RVs Y and Z is dened as


H(X|Y, Z)

EZ [H(X|Y, Z = z)]

(1.91)

= E log PX|Y,Z (X|Y, Z)

(1.92)

PX,Y,Z (x, y, z) log PX|Y,Z (x|y, z).

(1.93)

(x,y,z)supp(PX,Y,Z )

The properties generalize analogously:


H(X|Y, Z) H(X|Z);
H(X|Y, Z = z) H(X|Z = z);
but not (necessarily) H(X|Y, Z = z)

H(X|Y ).

Note that the easiest way of remembering the various variations in the
denition of an entropy is to use the notation with the expected value. For
example, H(X, Y, Z|U, V, W = w) is given as
H(X, Y, Z|U, V, W = w)

E log PX,Y,Z|U,V,W (X, Y, Z|U, V, w) W = w

(1.94)

where the expectation is over the joint PMF of (X, Y, Z, U, V ) conditional on


the event W = w.

1.2.7

Chain Rule

Theorem 1.20 (Chain Rule).


Let X1 , . . . , Xn be n discrete RVs with a joint PMF PX1 ,...,Xn . Then
H(X1 , X2 , . . . , Xn )
= H(X1 ) + H(X2 |X1 ) + + H(Xn |X1 , X2 , . . . , Xn1 )

(1.95)

=
k=1

H(Xk |X1 , X2 , . . . , Xk1 ).

(1.96)

Proof: This follows directly from the chain rule for PMFs:
PX1 ,...,Xn = PX1 PX2 |X1 PX3 |X1 ,X2 PXn |X1 ,...,Xn1 .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(1.97)

1.3. Relative Entropy and Variational Distance

17

Example 1.21. Let (X1 , X2 ) take on the values (0, 0), (1, 1), (1, 0) equally
likely with probability 1 . Then
3
H(X1 , X2 ) = log 3 1.58 bits.
We also immediately see that PX1 (0) =

1
3

(1.98)

2
and PX1 (1) = 3 . Hence,

1
1 2
2
1
H(X1 ) = log log = Hb
3
3 3
3
3

0.91 bits.

(1.99)

Moreover, PX2 |X1 (0|0) = 1, PX2 |X1 (1|0) = 0, such that


H(X2 |X1 = 0) = 0 bits,
and PX2 |X1 (0|1) =

1
2

and PX2 |X1 (1|1) =

1
2

(1.100)

such that

H(X2 |X1 = 1) = log 2 = 1 bit.

(1.101)

Using the denition of conditional entropy we then compute


H(X2 |X1 ) = PX1 (0) H(X2 |X1 = 0) + PX1 (1) H(X2 |X1 = 1) (1.102)
2
1
(1.103)
= 0 + 1 bits
3
3
2
(1.104)
= bits.
3
We nally use the chain rule to conrm the result we have computed above
already directly:
H(X1 , X2 ) = H(X1 ) + H(X2 |X1 )
1
2
= Hb
+ bits
3
3
1 2
2 2
1
= log2 log2 + bits
3
3 3
3 3
1
2
2
2
= log2 3 log2 2 + log2 3 + log2 2
3
3
3
3
= log 3.

(1.105)
(1.106)
(1.107)
(1.108)
(1.109)

1.3

Relative Entropy and Variational Distance

Beside the Shannon entropy as dened in Denition 1.4 that describes the uncertainty of a random experiment with given PMF, there also exist quantities
that compare two random experiments (or rather the PMFs describing these
experiments). In this class, we will touch on such quantities only very briey.
Denition 1.22. Let P () and Q() be two PMFs over the same nite (or
countably innite) alphabet X . The relative entropy or KullbackLeibler divergence between P () and Q() is dened as
D(P Q)

P (x) log
xsupp(P )

P (x)
P (X)
= EP log
.
Q(x)
Q(X)

(1.110)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

18

Shannons Measure of Information

Remark 1.23. Note that D(P Q) = if there exists an x supp(P ) (i.e.,


P (x) > 0) such that Q(x) = 0:
P (x) log

P (x)
= .
0

(1.111)

So, strictly speaking, we should dene relative entropy as follows:


P (x)
xsupp(P ) P (x) log Q(x)

D(P Q)

if supp(P ) supp(Q),

otherwise,

(1.112)

but again, we are lazy in notation and will usually simply write
D(P Q) =

P (x) log
x

P (x)
.
Q(x)

(1.113)

It is quite tempting to think of D(P Q) as a distance between P () and


Q(), in particular because we will show below that D(P Q) is nonnegative
and is equal to zero only if P () is equal to Q(). However, this is not correct
because the relative entropy is not symmetric,
D(P Q) = D(Q P ),

(1.114)

and because it does not satisfy the triangle inequality


D(P1 P3 ) D(P1 P2 ) + D(P2 P3 ).

(1.115)

(This is also the reason why it should not be called KullbackLeibler distance.) Indeed, relative entropy behaves like a squared distance, so one
should think of it as an energy rather than distance. (It actually describes
the ineciency of assuming that the PMF is Q() when the true distribution
is P ().)
For us the most important property of D( ) is its nonnegativity.
Theorem 1.24.
D(P Q) 0

(1.116)

with equality if, and only if, P () = Q().


Proof: In the case when supp(P ) supp(Q), we have D(P Q) = >
0 trivially. So, lets assume that supp(P ) supp(Q). Then,
D(P Q) =

P (x) log
xsupp(P )

Q(x)
P (x)

(1.117)

P (x)
xsupp(P )

=
xsupp(P )

Q(x)
1 log e
P (x)

Q(x) P (x) log e

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(1.118)
(1.119)

1.4. Mutual Information

19

=
xsupp(P )

Q(x)

P (x) log e

(1.120)

xsupp(P )
=1

(1 1) log e

(1.121)

= 0.

(1.122)

Here, (1.118) follows from the IT inequality (Proposition 1.11) and (1.121) by
adding additional terms to the sum. Hence, D(P Q) 0.
Equality can be achieved if, and only if,
1. in (1.118), in the IT inequality = 1, i.e., if
i.e., if P (x) = Q(x), for all x; and if

Q(x)
P (x)

= 1 for all x supp(P ),

2. in (1.121), supp(P ) = supp(Q).


Note that if Condition 1 is satised, Condition 2 is also satised.
Another quantity that compares two PMFs is the variational distance. In
contrast to relative entropy, this is a true distance.
Denition 1.25. Let P () and Q() be two PMFs over the same nite (or
countably innite) alphabet X . The variational distance between P () and
Q() is dened as
V (P, Q)
xX

P (x) Q(x) .

(1.123)

Here it is obvious from the denition that V (P, Q) 0 with equality if,
and only if, P () = Q(). It is slightly less obvious that V (P, Q) 2.
Since V (, ) satises all required conditions of a measure, it is correct to
think of the variational distance as a distance measure between P () and Q().
It describes how similar (or dierent) two random experiments are.
More properties of V (, ) and its relation to entropy are discussed in Appendix 1.B.

1.4
1.4.1

Mutual Information
Denition

Finally, we come to the denition of information. The following denition is


very intuitive: Suppose you have a RV X with a certain uncertainty H(X).
The amount that another related RV Y can tell you about X is the information
that Y gives you about X. How to measure it? Well, compare the uncertainty
of X before and after you know Y . The dierence is what you have learned!
Denition 1.26. The mutual information between the discrete RVs X and
Y is given by
I(X; Y )

H(X) H(X|Y ).

(1.124)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

20

Shannons Measure of Information


Note that H(X|Y ) is the uncertainty about X when knowing Y .

Remark 1.27. Note that it is a mutual information, not an information


about X provided by Y ! The reason for this name can be quickly understood
if we consider the following. Using twice the chain rule we have:
=
=

H(X, Y ) = H(X) + H(Y |X) = H(Y ) + H(X|Y )

H(X) H(X|Y ) = H(Y ) H(Y |X)


I(X; Y ) = I(Y ; X)

(1.125)
(1.126)
(1.127)

Hence, X will tell exactly the same about Y as Y tells about X. For example,
assume X being the weather in Hsinchu and Y being the weather in Taichung.
Knowing X will reduce your uncertainty about Y in the same way as knowing
Y will reduce your uncertainty about X.
The mutual information can be expressed in many equivalent forms. A
particularly nice one can be derived as follows:4
I(X; Y ) = H(X) H(X|Y )

= E[log PX (X)] E log PX|Y (X|Y )


PX|Y (X|Y )
= E log
PX (X)
PX|Y (X|Y ) PY (Y )
= E log
PX (X) PY (Y )
PX,Y (X, Y )
= E log
PX (X)PY (Y )
PX,Y (x, y)
PX,Y (x, y) log
=
,
PX (x)PY (y)

(1.128)
(1.129)
(1.130)
(1.131)
(1.132)
(1.133)

(x,y)supp(PX,Y )

i.e.,
I(X; Y ) = D(PX,Y PX PY ).

(1.134)

Hence, I(X; Y ) is the distance between the joint distribution of X and Y (i.e.,
PX,Y ) and the joint distribution of X and Y if X and Y were independent
(i.e., PX PY ).
From this form (1.134) we also immediately learn that
I(X; Y ) 0

(1.135)

with equality if, and only if, PX,Y = PX PY , i.e., if and only if X Y . Note

that this can also be derived from the fact that conditioning reduces entropy
(Theorem 1.16).
One more form:
I(X; Y ) = H(X) + H(Y ) H(X, Y ).

(1.136)

This form can also be used for a Venn-diagram, as shown in Figure 1.4, and
is particularly nice because it shows the mutual informations symmetry.
4

Recall the way we can handle expectations shown in (1.68).

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.4. Mutual Information

21
I(X; Y )

H(X)
H(Y )

H(Y |X)

H(X|Y )

Union = H(X, Y )
Figure 1.4: Diagram depicting mutual information and entropy in a set-theory
way of thinking.

1.4.2

Properties

Since mutual information is closely related to entropy, we can easily derive


some of its properties. The most important property has already been mentioned above: Mutual information cannot be negative:
I(X; Y ) 0.

(1.137)

I(X; Y ) = H(X) H(X|Y ) H(X),

(1.138)

Moreover, we have

I(X; Y ) = H(Y ) H(Y |X) H(Y ),

(1.139)

and hence
0 I(X; Y ) min{H(X), H(Y )}.

(1.140)

Also note that the mutual information of a RV X about itself is simply its
entropy:
I(X; X) = H(X) H(X|X) = H(X).

(1.141)

=0

Therefore H(X) sometimes is also referred to as self-information.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

22

Shannons Measure of Information

1.4.3

Conditional Mutual Information

Similarly to the entropy we can extend the mutual information to conditional


versions. Since these denitions are basically only repetitions of the corresponding denitions of conditional entropies, we only state a few:
I(X; Y |Z = z)
I(X; Y |Z)

H(X|Z = z) H(X|Y, Z = z);

EZ [I(X; Y |Z = z)]
z

PZ (z) H(X|Z = z) H(X|Y, Z = z)

= H(X|Z) H(X|Y, Z).

1.4.4

(1.142)
(1.143)
(1.144)
(1.145)

Chain Rule

Finally, also the chain rule of entropy directly extends to mutual information.
Theorem 1.28 (Chain Rule).
I(X; Y1 , Y2 , . . . , Yn )
= I(X; Y1 ) + I(X; Y2 |Y1 ) + + I(X; Yn |Y1 , Y2 , . . . , Yn1 ) (1.146)
n

=
k=1

I(X; Yk |Y1 , Y2 , . . . , Yk1 ).

(1.147)

Proof: From the chain rule of entropy we have


I(X; Y1 , . . . , Yn )
= H(Y1 , . . . , Yn ) H(Y1 , . . . , Yn |X)

= H(Y1 ) + H(Y2 |Y1 ) + + H(Yn |Yn1 , . . . , Y1 )

(1.148)

H(Y1 |X) + H(Y2 |Y1 , X) + + H(Yn |Yn1 , . . . , Y1 , X)

(1.149)

+ H(Yn |Yn1 , . . . , Y1 ) H(Yn |Yn1 , . . . , Y1 , X)

(1.150)

= H(Y1 ) H(Y1 |X) + H(Y2 |Y1 ) H(Y2 |Y1 , X) +


= I(X; Y1 ) + I(X; Y2 |Y1 ) + + I(X; Yn |Yn1 , . . . , Y1 )

(1.151)

=
k=1

1.5

I(X; Yk |Yk1 , . . . , Y1 ).

(1.152)

Jensens Inequality

Jensens inequality actually is not something particular of information theory,


but is a basic property of probability theory. It is often very useful. We also
refer to Chapter 7, where we will study concave and convex functions and
their properties more in detail.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.6. Comments on our Notation

23

Theorem 1.29 (Jensens Inequality). If f () is a concave function over


an interval I and X X is a RV, where X I, then
E[f (X)] f (E[X]).

(1.153)

Moreover, if f () is strictly concave (i.e., if f () is double-dierentiable it


2f
satises d dx(x) < 0), then
2
E[f (X)] = f (E[X]) X = constant.

(1.154)

If in the above concave is replaced by convex, all inequalities have to be


swapped, too.
Proof: A graphical explanation of why Jensens inequality holds is
shown in Figure 1.5. For the special case when f () is double-dierentiable,
we can prove it as follows. Since f () is double-dierentiable and concave, we
know that f () 0 for all I. Now, lets use a Taylor expansion with
around a point x0 I with correction term: for any x I and some that
lies between x and x0 , we have
f (x) = f x0 + (x x0 )

(1.155)

= f (x0 ) + f (x0 )(x x0 ) + f () (x x0 )

f (x0 ) + f (x0 )(x x0 ).

Taking the expectation over both sides and choosing x0


obtain
E[f (X)] f (x0 ) + f (x0 )(E[X] x0 )

= f (E[X]) + f (E[X])(E[X] E[X])

= f (E[X]),

(1.156)
(1.157)

E[X], we nally

(1.158)
(1.159)
(1.160)

proving the claim.


Remark 1.30. An easy way to remember the dierence between concave
and convex is to recall that a concave function looks like the entrance to
a cave.

1.6
1.6.1

Comments on our Notation


General

We try to clearly distinguish between constant and random quantities. The


basic rule here is
capital letter X : random variable,
small letter

x : deterministic value.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

24

Shannons Measure of Information

concave

f (x2 )
f (x)
f (x)
f (x1 )

convex

x1

x2

Figure 1.5: Graphical proof of Jensens inequality: Fix some x1 and x2 and
let x
px1 + (1 p)x2 for some p [0, 1]. Then f (x) is always above
f (x) pf (x1 ) + (1 p)f (x2 ).
For vectors or sequences bold face is used:
capital bold letter X : random vector,
small bold letter

x : deterministic vector.

There are a few exceptions to this rule. Certain deterministic quantities are
very standard in capital letters, so, to distinguish them from random variables,
we use a dierent font. For example, the capacity is denoted by C (in contrast
to a random variable C) or the codewords in Chapter 3 are D-ary (and not Dary!). However, as seen in this chapter, for the following well-known function
we stick with the customary shape: I(X; Y ) denotes the mutual information
between X and Y . Mutual information is of course not random, but as it
necessarily needs function arguments, its meaning should always be clear.
Moreover, matrices are also commonly depicted in capitals, but for them
we use yet another font, e.g., C. Then, sets are denoted using a calligraphic
font: C. As seen above, an example of a set is the alphabet X of a random
variable X.
Finally, also the PMF is denoted by a capital letter P : The discrete random
variable X has PMF PX (), where we normally will use the subscript X to
indicate to which random variable the PMF belongs to. Sometimes, however,
we also use P () or Q() without subscript to denote a generic PMF (see, e.g.,
Appendix 1.B or Chapters 10 and 12). To avoid confusion, we shall never use
a RV P or Q.

1.6.2

Entropy and Mutual Information

As introduced in Sections 1.2 and 1.4, respectively, entropy and mutual information are always shown in connection with RVs: H(X) and I(X; Y ). But
strictly speaking, they are functions of PMFs and not RVs: H(X) is a function
of PX () (see, e.g., (1.21)(1.23)) and I(X; Y ) is a function of PX,Y (, ) (see,

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.A. Appendix: Uniqueness of the Denition of Entropy

25

e.g., (1.134)). So in certain situations, it is more convenient to write H and I


as functions of PMFs:
H Q() or H(Q)

and

I(PX , PY |X ).

(1.161)

To emphasize the dierence in notation, in this case we drop the use of the
semicolon for the mutual information. However, for the entropy no such distinction is made. We hope that no confusion will arise as we never dene RVs
P or Q in the rst place.
Finally, sometimes we even write a PMF P () in a vector-like notation
listing all possible probability values:
PX () = (p1 , p2 , . . . , pr )

(1.162)

denotes the PMF of an r-ary RV X with probabilities p1 , . . . , pr . The entropy


of X can then be written in the following three equivalent forms
H(X) = H(PX ) = H(p1 , . . . , pr )

(1.163)

and is, of course, equal to


r

H(p1 , . . . , pr ) =

pi log
i=1

1.A

1
.
pi

(1.164)

Appendix: Uniqueness of the Denition of


Entropy

In Section 1.1 we have tried to motivate the denition of the entropy. Even
though we succeeded partially, we were not able to give full justication of Definition 1.4. While Shannon did provide a mathematical justication [Sha48,
Sec. 6], he did not consider it very important. We omit Shannons argument,
but instead we will now quickly summarize a slightly dierent result that was
presented in 1956 by Aleksandr Khinchin. Khinchin specied four properties
that entropy is supposed to have and then proved that, given these four properties, (1.11) is the only possible denition. We will now quickly summarize
his result.
We dene Hr (p1 , . . . , pr ) to be a function of r probabilities p1 , . . . , pr that
sum up to 1:
r

pi = 1.

(1.165)

i=1

We further ask this function to satisfy the following four properties:


1. For any r, Hr (p1 , . . . , pr ) is continuous (i.e., a slight change to the values
of pi will only cause a slight change to Hr ) and symmetric in p1 , . . . , pr
(i.e., changing the order of the probabilities does not aect the value of
Hr ).

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

26

Shannons Measure of Information


2. Any event of probability 0 does not contribute to Hr :
Hr+1 (p1 , . . . , pr , 0) = Hr (p1 , . . . , pr ).

(1.166)

3. Hr is maximized by the uniform distribution:


Hr (p1 , . . . , pr ) Hr

1
1
.
,...,
r
r

(1.167)

4. If we partition the m r possible outcomes of a random experiment


into m groups, each group containing r elements, then we can do the
experiment in two steps:
(a) determine the group to which the actual outcome belongs,
(b) nd the outcome in this group.
Let pj,i , 1 j m, 1 i r, be the probabilities of the outcomes in
this random experiment. Then the total probability of all outcomes in
group j is
r

pj,i ,

qj =

(1.168)

i=1

and the conditional probability of outcome i from group j is then given


by
pj,i
.
(1.169)
qj
Now Hmr can be written as follows:
Hmr (p1,1 , p1,2 , . . . , pm,r )
m

= Hm (q1 , . . . , qm ) +

qj H r
j=1

pj,1
pj,r
,...,
,
qj
qj

(1.170)

i.e., the uncertainty can be split into the uncertainty of choosing a group
and the uncertainty of choosing one particular outcome of the chosen
group, averaged over all groups.
Theorem 1.31. The only functions Hr that satisfy the above four conditions
are of the form
r

Hr (p1 , . . . , pr ) = c

pi ln pi

(1.171)

i=1

where the constant c > 0 decides about the units of Hr .


Proof: This theorem was proven by Khinchin in 1956, i.e., after Shannon had dened entropy. The article was rstly published in Russian [Khi56],
and then in 1957 it was translated into English [Khi57]. We omit the details.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.B. Appendix: Entropy and Variational Distance

1.B

27

Appendix: Entropy and Variational Distance

1.B.1

Estimating PMFs

Suppose we have a RV X X with an unknown PMF P () that we would like


to estimate, and suppose that by some estimation process we come up with

an estimate P () for the unknown P (). If we now use P () to compute the

entropy H(X), then how good an approximation is this for the entropy H(X)?

As we will show next, unfortunately, H(X) can be arbitrarily far from


() are very similar!
H(X) even if P () and P
For the following result, recall the denition of the variational distance
between two PMFs given in Denition 1.25.
Theorem 1.32 ([HY10, Theorem 1]). Suppose > 0 and > 0 are given.

Then for any PMF P () with a support of size r, there exists another PMF

P () of support size r r large enough such that

V (P, P ) <

(1.172)

H(P ) H(P ) > .

(1.173)

but

We see that if we do not know the support size r of P (), then even if our

estimate P () is arbitrarily close to the correct P () (with respect to the vari


ational distance), the dierence between H(P ) and H(P ) remains unbounded.
Proof: For some r, let

P () = (p1 , p2 , . . . , pr ),

(1.174)

where we have used the vector-like notation for the PMF introduced in Section 1.6.2. Moreover, let
P () =

p1

p1
p1
p1
p1
p1
, p2 +
, . . . , pr +
,
, ...,

log t
t log t
t log t t log t
t log t
(1.175)

be a PMF with r = t + 1 r probability masses, t N. Note that P () indeed

is a PMF:
r

P (i) =
i=1

i=1

pi

p1
p1
p1
p1
+t
=1
+
= 1. (1.176)
log t
t log t
log t
log t

For this choice of P () and P we have5


()
p1
p1
2p1

V (P, P ) =
+t
=
,
log t
t log t
log t

(1.177)

For the denition of V it is assumed that P () and P () take value in the same alphabet.
We can easily x by adding the right number zeros to the probability vector of P ().
5

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

28

Shannons Measure of Information

H(P ) = p1

p1
log t

log p1

p1
log t

p1
p1
pi +
log pi +
t log t
t log t
i=2
p1
p1
log
t+1r

t log t
t log t

t log t
p1
log
H(P ) +
p1
log t

p1
p1
log t
log t +
log
= H(P ) +
p1
log t
log t

p1
log t
log
= H(P ) + p1 log t +
p1
log t

(1.179)

H(P ) + p1

(1.182)

log t,

(1.178)

(1.180)
(1.181)

where the approximations become more accurate for larger values of t. we


If
) becomes arbitrarily small, while p1 log t
let t become very large, then V (P, P
is unbounded.
Note that if we x the support size r, then Theorem 1.32 does not hold
anymore, and we will show next that then the dierence between H(X) and

H(X) is bounded.

1.B.2

Extremal Entropy for given Variational Distance

We will next investigate how one needs to adapt a given PMF P () in order
to maximize or minimize the entropy, when the allowed changes on the PMF
are limited:
max

Q : V (P,Q)

H Q()

or

min

Q : V (P,Q)

H Q() .

(1.183)

Recall that without any limitations on the PMF, we know from Theorem 1.12
that to maximize entropy we need a uniform distribution, while to minimize
we make it extremely peaky with one value being 1 and the rest 0. Both
such changes, however, will usually cause a large variational distance. So
the question is how to adapt a PMF without causing too much variational
distance, but to maximize (or minimize) the entropy.
In the remainder of this section, we assume that the given PMF P () =
(p1 , . . . , p|X | ) is ordered such that
p1 p2 pr > 0 = pr+1 = = p|X | .

(1.184)

We again use r as the support size of P ().


We will omit most proofs in this section and refer to [HY10] instead.
Theorem 1.33 ([HY10, Theorem 2]). Let 0 2 and P () satisfying
(1.184) be given. Here we must restrict |X | to be nite. Let , R be such

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

1.B. Appendix: Entropy and Variational Distance

29

that
|X |

(1.185)

( pi )+ = ,
2

(1.186)

i=1

(pi )+ =

and
|X |
i=1

where
()+

max{, 0}.

(1.187)

If , dene Qmax () to be the uniform distribution on X ,


1
,
|X |

Qmax (i)

i = 1, . . . , |X |,

and if < , dene Qmax () as

if pi > ,

Qmax (i)
p if pi ,
i

if pi < ,

i = 1, . . . , |X |.

(1.188)

(1.189)

Then

max

Q : V (P,Q)

H Q() = H Qmax () .

(1.190)

Note the structure of the maximizing distribution: we cut the largest values
of P () to a constant level and add this probability to the smallest values to
make them all constant equal to . The middle range of the probabilities are
not touched. So, under the constraint that we cannot twiddle P () too much,
we should try to approach a uniform distribution by equalizing the extremes.
See Figure 1.6 for an illustration of this.
It is quite obvious that H Qmax depends on the given . Therefore for a
given P () and for 0 2, we dene
P ()

H Qmax ()

(1.191)

with Qmax () given in (1.188) and (1.189). One can show that P () is a
concave (and therefore continuous) and strictly increasing function in .
Theorem 1.34 ([HY10, Theorem 3]). Let 0 2 and P () satisfying

(1.184) be given (|X | can be innite). If 1 p1 2 , dene


Qmin ()

(1, 0).

(1.192)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

30

Shannons Measure of Information

p1

p2

p3

p4

p5

p6

p7
7

|X |

Figure 1.6: Example demonstrating how a PMF with seven nonzero probabilities is changed to maximize entropy under a variational distance constraint (|X | = 9, r = 7). The maximizing distribution is Qmax () =
(, , , p4 , p5 , p6 , , , ).
Otherwise, let k be the largest integer such that
r
i=k

and dene Qmin () as

p1 +

p
i
Qmin (i)
r pj
j=k

if
if
if
if

pi

i = 1,
i = 2, . . . , k 1,
i = k,
i = k + 1, . . . , |X |,

(1.193)

i = 1, . . . , |X |.

(1.194)

Then

min

Q : V (P,Q)

H Q() = H Qmin () .

(1.195)

Note that to minimize entropy, we need to change the PMF to make it more
peaky. To that goal the few smallest probability values are set to zero and the
corresponding amount is added to the single largest probability. The middle
range of the probabilities are not touched. So, under the constraint that we
cannot twiddle Q() too much, we should try to approach the (1, 0, . . . , 0)distribution by removing the tail and enlarge the largest peak. See Figure 1.7
for an illustration of this.
Also here, H Qmin () depends on the given . For a given P () and for
0 2, we dene
P ()

H Qmin ()

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(1.196)

1.B. Appendix: Entropy and Variational Distance

p1

p2

p3

p4

p5

p6

31

p7
7

|X |

Figure 1.7: Example demonstrating how a PMF with seven nonzero probabilities is changed to minimize entropy under a variational distance constraint
(r = 7). The minimizing distribution is Qmax () = (p1 +/2, p2 , p3 , p4 , p5 +
p6 + p7 /2, 0, 0, 0, 0).
with Qmin () dened in (1.192)(1.194). One can show that P () is a continuous and strictly decreasing function in .
We may, of course, also ask the question the other way around: For a
given PMF P () and a given entropy H(X), what is the choice of a PMF Q()
such that H(Q) = H(X) is achieved and such that Q() is most similar to P ()
(with respect to variational distance)?
Theorem 1.35 ([HY10, Theorem 4]). Let 0 t log |X | and P () satisfying (1.184) be given. Then

2(1 p1 )
if t = 0,

(t)
if 0 < t H P () ,
P
min
V (P, Q) =
1
if H P () < t < log |X |,
P (t)
Q : H(Q())=t

r
|X |r

if t = log |X |
p 1 +
i=1

|X |

|X |

(1.197)

1
with P () and 1 () being the inverse of the functions dened in (1.191)
P
and (1.196), respectively.

Note that this result actually is a direct consequence of Theorem 1.32 and
1.33 and the fact that P () and P () both are continuous and monotonous
functions that have a unique inverse.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

32

Shannons Measure of Information

1.B.3

Lower Bound on Entropy in Terms of Variational


Distance

In Section 1.2.4 we have found the most general lower bound on entropy:
H(X) 0. Using the results from the previous section, we can now improve
on this lower bound by taking into account the PMF of X.
Theorem 1.36. For a given r {2, 3, . . .}, consider a RV X that takes
value in an r-ary alphabet X with a PMF PX () = (p1 , p2 , . . . , pr ). Then
the entropy of X can be lower-bounded as follows:
r

H(X) = H(p1 , . . . , pr ) log r

i=1

pi

1
log r.
r

(1.198)

This lower bound has a beautiful interpretation: We know that the entropy
of a uniformly distributed RV is equal to the logarithm of the alphabet size. If
the distribution is not uniform, then the entropy is smaller. This theorem now
gives an upper bound on this reduction in terms of the variational distance
between the PMF and the uniform PMF. We rewrite (1.198) as follows: Let
X X be an arbitrary RV and let U be uniformly distributed on the same
alphabet X . Then
H(U ) H(X) V (PU , PX ) log |X |.

(1.199)

Proof: Suppose we can prove that


PU (1 ) log r,

[0, 1].

(1.200)

Since PU () is monotonously decreasing, this means that


1 1 ( log r),
PU

(1.201)

and since 0 H(PX ) log r, it then follows from Theorem 1.35 that
r
=1

1
log r = V (PU , PX ) log r
r

min

Q : H(Q)=H(PX )
1 H(PX )
PU

= 1
PU

(1.202)

V (PU , Q) log r
log r

(1.203)
(by Theorem 1.35)
(1.204)

H(PX )
log r log r
log r

= 1 ( log r)
PU
(1 )

H(PX )
log r

H(PX )
log r

log r

log r

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(1.205)
(1.206)
(by (1.201))

(1.207)

1.B. Appendix: Entropy and Variational Distance

33

H(PX )
log r
log r
= log r H(PX ),
=

(1.208)
(1.209)

from which follows (1.198).


Hence, it only remains to prove (1.200). To that goal note that for the
denition of PU in (1.196) we obtain from (1.193) that k is the largest integer
satisfying
r
i=k

1
k1

=1
,
r
r
2

(1.210)

(1.211)

i.e.,
k = r 1

+ 1,

and that
1
+
log
r 2
k1

1
r

PU () =

k2
1
+
log r
+
r 2
r
k1

log 1
.

2
r
2

(1.212)

Proving (1.200) is equivalent to proving the nonnegativity of


fr ()

PU (1 )

log r
1
1
1 1
1 1
=
+
+
log
+
log r r 2 2
r 2 2
r
1 1 r
1
1
+
(1 + ) log
+

log r 2 2 r 2
2 2

(1.213)
r
1
(1 + )
2
r
1 r

(1 + )
r 2

(1.214)

1
r
=1
1 + (1 ) log 1 +
r log r
2
r
1
r

(1 + ) (1 + ) log
r log r 2
2

r
(1 )
2
r
r
(1 + ) (1 + )
2
2
(1.215)

for all [0, 1] and for all r {2, 3, . . .}. Note that fr () is continuous in
for all [0, 1].

Choose l and r such that for all [l , r )


r
(1 + ) = constant
2

(1.216)

For such , we have


r
r
1
1 + (1 ) log 1 + (1 )
r log r
2
2
r
r
1
(1 + ) log (1 + )

r log r 2
2

fr () = 1

(1.217)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

34

Shannons Measure of Information

with
2
r(r + 1 )
fr () =
< 0.
r
r
2
4 log r 1 + 2 (t ) 2 (1 + )

(1.218)

Hence fr () is concave over the interval [l , r ), and therefore, in this interval,

fr () is lower-bounded by one of the boundary points fr (l ) or fr (r ).


be such that
So we investigate these boundary points. Let

r
(1 + ) N,
2
i.e., for some

r
2, r

(1.219)

N, we have
=

2
1.
r

(1.220)

Also note that


lim

r
r
(1 + ) (1 + )
2
2

log

r
r
(1 + ) (1 + )
2
2

= lim t log t = 0.
t0

(1.221)
Therefore, and by the continuity of fr (), we have
r
r
1
1 + (1 ) log 1 + (1 )
r log r
2
2
1
2

(1 + r ) log(1 + r ) f ().
=2
r
r log r

fr ( ) = 1

We extend the denition of f () to

r
2, r

(1.222)
(1.223)

and prove concavity:

2
1
f () =
< 0.
2
(1 + r )r log r

(1.224)

Thus, f () is lower-bounded by one of the boundary points f


where

f (r) = 0,
r
r
r = r log r 1 + 2 log 1 + 2
f
2
r log r
r
r
r log r 1 + 2 log 1 + 2

r log r

r
2

or f (r),
(1.225)
(1.226)

= 0.

(1.227)

r=2

Hence, we have nally shown that fr () 0. This completes the proof.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 2

Review of Probability Theory


Information theory basically is applied probability theory. So it is quite important to feel comfortable with the main denitions and properties of probability.
In this chapter we will very briey review some of them and also quickly talk
about the notation used in this script.

2.1

Discrete Probability Theory

Probability theory is based on three specic objects:


the sample space ,
the set of all events F, and
the probability measure Pr().
The sample space is the set of all possible outcomes of a random experiment. In discrete probability theory, the sample space is nite (i.e.,
= {1 , 2 , . . . , n }) or at most countably innite (which we will indicate
by n = ).
An event is any subset of , including the impossible event (the empty
subset of ) and the certain event . An event occurs when the outcome of
the random experiment is member of that event. All events are collected in
the set of events F, i.e., F is a set whose elements are sets themselves.
The probability measure Pr() is a mapping that assigns to each event a
real number (called the probability of that event) between 0 and 1 in such a
way that
Pr() = 1

(2.1)

and
Pr(A B) = Pr(A) + Pr(B)

if A B = ,

(2.2)

where A, B F denote any two events. Notice that (2.2) implies


Pr() = 0

35

(2.3)

c Stefan M. Moser, v. 3.2

36

Review of Probability Theory

as we see by choosing A = and B = .


The atomic events are the events that contain only a single sample point,
e.g., {i }. It follows from (2.2) that the numbers
pi = Pr({i }),

i = 1, 2, . . . , n

(2.4)

(i.e., the probabilities that the probability measure assigns to the atomic
events) completely determine the probabilities of all events.

2.2

Discrete Random Variables

A discrete random variable is a mapping from the sample space into the real
numbers. For instance, on the sample space = {1 , 2 , 3 } we might dene
the random variables X, Y and Z as
X()
1
2
3

Y ()

Z()

13.4

+5

2 102.44

(2.5)

Note that the range (i.e., the set of possible values of the random variable) of
these random variables are
X() = {5, 0, +5},

(2.6)

Z() = {102.44, , 13.4}.

(2.8)

Y () = {0, 1},

(2.7)

The probability distribution (or frequency distribution) of a random variable X, denoted PX (), is closely related to the probability measure of the underlying random experiment: It is the mapping from X() into the interval
[0, 1] such that
PX (x)

Pr({ : X() = x}),

(2.9)

i.e., it is the mapping that, for every x X(), gives the probability that the
random variable X takes on this value x. Usually, PX is called the probability
mass function (PMF) for X.
It is quite common to use a bit more sloppy notation and to neglect the
mapping-nature of a random variable. I.e., we usually write X and not X(),
and for the event { : X() = x} we simply write [X = x]. So, the PMF is
then expressed as
PX (x) = Pr[X = x].

(2.10)

Note that we use square brackets instead of the round brackets in (2.9) to
emphasize that behind X there still is a random experiment that causes the

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

2.2. Discrete Random Variables

37

randomness of the function X. The range of the random variable is commonly


denoted by
X

X()

(2.11)

and is called alphabet of X.


Note that it follows immediately from (2.9) that the PMF satises
PX (x) 0,

all x X

(2.12)

and
PX (x) = 1,

(2.13)

where the summation in (2.13) is understood to be over all x in X . Equations


(2.12) and (2.13) are the only mathematical requirements on a probability
distribution, i.e., any function which satises (2.12) and (2.13) is the PMF of
some suitably dened random variable with some suitable alphabet.
In discrete probability theory, there is no fundamental distinction between
a random variable X and a random vector X: While the random variable
takes value in R, a random n-vector takes value in Rn . So basically, the
only dierence is the alphabet of X and X! However, if X1 , X2 , . . . , Xn are
random variables with alphabets X1 , . . . , Xn , it is often convenient to consider
their joint probability distribution or joint PMF dened as the mapping from
X1 X2 Xn into the interval [0, 1] such that
PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = Pr[X1 = x1 , X2 = x2 , . . . , Xn = xn ].

(2.14)

(Note that strictly speaking we assume here that these RVs share a common
sample space such that the joint PMF actually describes the probability of
the event { : X1 () = x1 } { : X2 () = x2 } { : Xn () = xn }.)
It follows again immediately that
PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) 0

(2.15)

(2.16)

and that
x1

x2

PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = 1.


xn

More interestingly, it follows from (2.14) that


PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn )
xi

= PX1 ,...,Xi1 ,Xi+1 ,...,Xn (x1 , . . . , xi1 , xi+1 , . . . , xn ).

(2.17)

The latter is called marginal distribution of the PMF of X.


The random variables X1 , X2 , . . . , Xn are said to be statistically independent when
PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = PX1 (x1 ) PX2 (x2 ) PXn (xn )

(2.18)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

38

Review of Probability Theory

for all x1 X1 , x2 X2 , . . . , xn Xn .
If X and Y are independent, then we also write X Y .

Suppose that g is a real-valued function whose domain includes X . Then,


the expectation of g(X), denoted E[g(X)], is the real number
PX (x)g(x).

E[g(X)]

(2.19)

The term average is synonymous with expectation. Similarly, when g is a


real-valued function whose domain includes X1 X2 Xn , one denes
E[g(X1 , X2 , . . . , Xn )]
x1

x2

PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn )g(x1 , x2 , . . . , xn ).

(2.20)

xn

It is often convenient to consider conditional probability distributions. If


PX (x) > 0, then one denes
PY |X (y|x)

PX,Y (x, y)
.
PX (x)

(2.21)

It follows from (2.21) and (2.17) that


PY |X (y|x) 0 for all y Y

(2.22)

and

PY |X (y|x) = 1.

(2.23)

Thus, mathematically, there is no dierence between a conditional probability


distribution for Y (given, say, X = x) and the (unconditioned) probability
distribution for Y . When PX (x) = 0, we cannot of course use (2.21) to
dene PY |X (y|x). It is often said that PY |X (y|x) is undened in this case,
but it is better to say that PY |X (y|x) can be arbitrarily specied, provided
that (2.22) and (2.23) are satised by the specication. This latter practice is
often done in information theory to avoid having to treat as special cases those
uninteresting situations where the conditioning event has zero probability.
Note that the concept of conditioning allows a more intuitive explanation
of the denition of independence as given in (2.18). If two random variables
X and Y are independent, i.e.,
PX,Y (x, y) = PX (x)PY (y),

(2.24)

then
PY |X (y|x) =

PX,Y (x, y)
PX (x)PY (y)
=
= PY (y).
PX (x)
PX (x)

(2.25)

Hence, we see that the probability distribution of Y remains unchanged irrespective of what value X takes on, i.e., they are independent!

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

2.2. Discrete Random Variables

39

If g is a real-valued function whose domain includes X , then the conditional


expectation of g(X) given the occurrence of the event A is dened as
g(x) Pr[X = x|A].

E[g(X)|A]

(2.26)

Choosing A {Y = y0 } for some random variable Y and some value y0 Y,


we see that (2.26) implies
E[g(X) | Y = y0 ] =

g(x)PX|Y (x|y0 ).

(2.27)

More generally, when g is a real-valued function whose domain includes


X Y, the denition (2.26) implies
g(x, y) Pr {X = x} {Y = y} A .

E[g(X, Y )|A] =
x

Again, nothing prevents us from choosing A


reduces to
E[g(X, Y ) | Y = y0 ] =

(2.28)

{Y = y0 }, in which case (2.28)

g(x, y0 )PX|Y (x|y0 ),

(2.29)

as follows from the fact that Pr {X = x} {Y = y} {Y = y0 } vanishes


for all y except y = y0 , in which case it has the value Pr[X = x | Y = y0 ] =
PX|Y (x|y0 ). Multiplying both sides of (2.29) by PY (y0 ) and summing over y0
gives the relation
E[g(X, Y )] =
y

E[g(X, Y ) | Y = y]PY (y),

(2.30)

where we have used (2.20) and where we have changed the dummy variable
of summation from y0 to y for clarity. Similarly, conditioning on an event A,
we would obtain
E[g(X, Y )|A] =
y

E g(X, Y ) {Y = y} A Pr[Y = y|A],

(2.31)

which in fact reduces to (2.30) when one chooses A to be the certain event .
Both (2.30) and (2.31) are referred to as statements of the theorem on total
expectation, and are exceedingly useful in the calculation of expectations.
A sequence Y1 , Y2 , Y3 , . . . of random variables is said to converge in probability to the random variable Y , denoted
Y = plim Yn ,
n

(2.32)

if for every positive it is true that


lim Pr[|Y Yn | < ] = 1.

(2.33)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

40

Review of Probability Theory

Roughly speaking, the random variables Y1 , Y2 , Y3 , . . . converge in probability


to the random variable Y if, for every large n, it is virtually certain that the
random variable Yn will take on a value very close to that of Y . Suppose
that X1 , X2 , X3 , . . . is a sequence of statistically independent and identicallydistributed (i.i.d.) random variables, let m denote their common expectation
m = E[Xi ],

(2.34)

and let
Yn

X1 + X2 + + Xn
.
n

(2.35)

Then the weak law of large numbers asserts that


plim Yn = m,

(2.36)

i.e., that this sequence Y1 , Y2 , Y3 , . . . of random variables converges in probability to (the constant random variable whose value is always) m. Roughly
speaking, the law of large numbers states that, for every large n, it is virtually
++Xn
will take on a value close to m.
certain that X1 +X2n

2.3

Continuous Random Variables

In contrast to a discrete random variable with a nite or countably innite


alphabet, a continuous random variable can take on uncountably many values.
Unfortunately, once the sample space contains uncountably many elements,
some subtle mathematical problems can pop up. So for example, the set of
all events F can not be dened anymore as set of all subsets, but needs to
satisfy certain constraints, or the alphabet needs to be decent enough. In
this very brief review we will omit these mathematical details. Moreover, we
concentrate on a special family of random variables of uncountably innite
alphabet that satisfy some nice properties: continuous random variables.
A random variable X is called continuous if there exists a nonnegative
function fX , called probability density function (PDF), such that for every
subset of the alphabet B X , we have
Pr[X B] =

fX (x) dx.

(2.37)

This function is called density because it describes the probability per length,
similarly to a volume mass density giving the mass per volume, or the surface
charge density giving the electrical charge per surface area. So for example,
b

Pr[a X b] =

fX (x) dx.

(2.38)

By denition we have
fX (x) 0,

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

x X,

(2.39)

2.3. Continuous Random Variables

41

and from (2.37) it follows that

fX (x) dx = Pr[X R] = 1.

(2.40)

These two properties fully correspond to (2.12) and (2.13), respectively. However, note that fX (x) > 1 is possible, it is even possible that fX is unbounded.
As an example, consider the following PDF:
fX (x) =

4 x

0 < x 4,

otherwise.

(2.41)

Obviously, (2.39) is satised. We quickly check that also (2.40) holds:

4
4

1
x
4
0
dx =
fX (x) dx =
=

= 1.
(2.42)
2 x=0
2
2
0 4 x

The generalization to continuous random vectors is straightforward. We


dene the joint PDF of X1 , . . . , Xn such that for any subset B X1 Xn
Pr[X B] =

fX1 ,...,Xn (x1 , . . . , xn ) dx1 dx2 dxn .

(2.43)

The denition of expectation for continuous random variables corresponds


to (2.19):
E[g(X)]

fX (x)g(x) dx.

(2.44)

Similarly, also the denition of independence, conditional PDFs, and the total
expectation theorem can be taken over from the discrete probability theory.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 3

Data Compression: Ecient


Coding of a Single Random
Message
After having introduced the basic denitions of information theory, we now
start with the rst important practical problem: how to represent information
in an ecient manner. We rstly start with the most simple situation of a
single random message, and then extend the setup to more practical cases of
an information source without and nally with memory. We will see that some
of the basic denitions of Chapter 1 are closely related to this fundamental
problem of data compression.

3.1

A Motivating Example

You would like to set up your own phone system that connects you to your
three best friends. The question is how to design ecient binary phone numbers. In Table 3.1 you nd six dierent ways how you could choose them.
Table 3.1: Binary phone numbers for a telephone system with three friends.
Friend

Probability

Alice

1
4
1
2
1
4

Sets of Phone Numbers


(i)

Bob
Carol

(ii)

(iii)

(iv)

(v)

(vi)

0011

001101

00

10

0011

001110

11

11

1100

110000

10

10

10

11

Note that in this example the phone number is a codeword for the person
you want to talk to. The set of all phone numbers is called code. We also
assume that you have dierent probabilities when calling your friends: Bob is

43

c Stefan M. Moser, v. 3.2

44

Efficient Coding of a Single Random Message

your best friend whom you will call in 50% of the times. Alice and Carol are
contacted with a frequency of 25% each.
So let us discuss the dierent designs in Table 3.1:
(i) In this design Alice and Bob have the same phone number. The system
obviously will not be able to connect properly.
(ii) This is much better, i.e., the code will actually work. However, note
that the phone numbers are quite long and therefore the design is rather
inecient.
(iii) Now we have a code that is much shorter and at the same time we made
sure that we do not use the same codeword twice. Still, a closer look
reveals that the system will not work. The problem here is that this
code is not uniquely decodable: if you dial 10 this could mean Carol
or also Bob, Alice. Or in other words: The telephone system will
never connect you to Carol, because once you dial 1, it will immediately
connect you to Bob.
(iv) This is the rst quite ecient code that is functional. But we note
something: When calling Alice, why do we have to dial two zeros? After
the rst zero it is already clear to whom we would like to be connected!
Lets x that in design (v).
(v) This is still uniquely decodable and obviously more ecient than (iv).
Is it the most ecient code? No! Since Bob is called most often, he
should be assigned the shortest codeword!
(vi) This is the optimal code. Note one interesting property: Even though
the numbers do not all have the same length, once you nish dialing
any of the three numbers, the system immediately knows that you have
nished dialing! This is because no codeword is the prex 1 of any other
codeword, i.e., it never happens that the rst few digits of one codeword
are identical to another codeword. Such a code is called prex-free (see
Section 3.3 below). Note that (iii) was not prex-free: 1 is a prex of
10.
From this example we learn the following requirements that we impose on
our code design:
A code needs to be uniquely decodable.
A code should be short, i.e., we want to minimize the average codeword
length E[L], which is dened as follows:
r

pi l i .

E[L]

(3.1)

i=1
1

According to the Oxford English Dictionary, a prex is a word, letter, or number placed
before another.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

3.2. A Coding Scheme

45

Here pi denotes the probability that the source emits the ith symbol,
i.e., the probability that the ith codeword ci is selected; li is the length
of the ith codeword; and r is the number of codewords.
We additionally require the code to be prex-free. Note that this requirement is not necessary, but only convenient. However, we will later
see that we lose nothing by asking for it.

3.2

A Coding Scheme

As we have seen in the motivating example, we would like to discuss ways how
we can represent data in an ecient way. To say it in an engineering way, we
want to compress data without losing any information.
C = (C1 , C2 , . . . , CL )

D-ary
message
encoder

U {u1 , u2 , . . . , ur }

r-ary
random
message

Figure 3.2: Basic data compression system for a single random message U .
To simplify the problem we will consider a rather abstract setup as shown
in Figure 3.2. There, we have the following building blocks:
U : random message (random variable), taking value in an r-ary alphabet
{u1 , u2 , . . . , ur }, with probability mass function (PMF) PU ():
PU (ui ) = pi ,

i = 1, . . . , r.

(3.2)

C : codeword letter, taking value in the D-ary alphabet {0, 1, . . . , D 1}.


C: codeword, taking a dierent value depending on the value of the message
U . Actually, the codeword is a deterministic function of U , i.e., it is
only random because U is random.
L: codeword length (length of C). It is random because U is random and
because the dierent codewords do not necessarily all have the same
length.
Remark 3.1. We note the following:
For every message ui we have a codeword ci of length li . Hence, ci is a
representation of ui in the sense that we want to be able to recover ui
from ci .
The complete set of codewords is called a code for U .
For a concrete system to work, not only the code needs to be known,
but also the assignment rule that maps the message to a codeword. We
call the complete system coding scheme.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

46

Efficient Coding of a Single Random Message


The quality of a coding scheme is its average length E[L]: the shorter,
the better! (Remember, we are trying to compress the data!)
Since the code letters are D-ary, the code is called a D-ary code.

To summarize:
The codeword C is a one-to-one representation of the random message
U . We try to minimize the length of C, i.e., we compress the data U .

3.3

Prex-Free or Instantaneous Codes

Consider the following code with four codewords:


c1 = 0
c2 = 10
c3 = 110

(3.3)

c4 = 111
Note that the zero serves as a kind of comma: Whenever we receive a zero
(or the code has reached length 3), we know that the codeword has nished.
However, this comma still contains useful information about the message as
there is still one codeword without it! This is another example of a prex-free
code. We recall the following denition.
Denition 3.2. A code is called prex-free (sometimes also called instantaneous) if no codeword is the prex of another codeword.
The name instantaneous is motivated by the fact that for a prex-free code
we can decode instantaneously once we have received a codeword and do not
need to wait for later until the decoding becomes unique. Unfortunately, in
literature one also nds people calling a prex-free code a prex code. This
name is quite confusing because rather than having prexes it is the point of
the code to have no prex! We shall stick to the name prex-free codes.
Consider next the following example:
c1 = 0
c2 = 01
c3 = 011

(3.4)

c4 = 111
This code is not prex-free (0 is a prex of 01 and 011; 01 is a prex of 011),
but it is still uniquely decodable.
Exercise 3.3. Given the code in (3.4), decode the sequence 0011011110.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

3.4. Trees and Codes

47

Note the drawback of the code design in (3.4): The receiver needs to wait
and see how the sequence continues before it can make a unique decision about
the decoding. The code is not instantaneously decodable.
Apart from the fact that they can be decoded instantaneously, another
nice property of prex-free codes is that they can very easily be represented
by leaves of decision trees. To understand this we will next make a small
detour and talk about trees and their relation to codes.

3.4

Trees and Codes

The following denition is quite straightforward.


Denition 3.4. A D-ary tree consists of a root with some branches, nodes,
and leaves in such a way that the root and every node have exactly D children.
As example, in Figure 3.3 a binary tree is shown, with two branches stemming forward from every node.
parent node

with

two children

root

branch

leaves

nodes
forward
Figure 3.3: A binary tree with a root (the node that is grounded), four nodes
(including the root), and ve leaves. Note that we will always clearly
distinguish between nodes and leaves: A node always has children, while
a leaf always is an end-point in the tree!
The clue to this section is to note that any D-ary code can be represented
as a D-ary tree. The D branches stemming from each node stand for the D
dierent possible symbols in the codes alphabet, so that when we walk along
the tree starting from the root, each symbol in a codeword is regarded as
a decision which of the D branches to take. Hence, every codeword can be
represented by a particular path traversing through the tree.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

48

Efficient Coding of a Single Random Message

As an example, Figure 3.4 shows the binary (i.e., D = 2) tree of a binary


code with ve codewords, and Figure 3.5 shows a ternary (D = 3) tree with
six codewords. Note that we need to keep branches that are not used in order
to make sure that the tree is D-ary.

0010

0
1

001101
001110

110000

110

Figure 3.4: An example of a binary tree with ve codewords: 110, 0010,


001101, 001110, and 110000. At every node, going upwards corresponds
to a 0, and going downwards corresponds to a 1. The node with the
ground symbol is the root of the tree indicating the starting point.
In Figure 3.6, we show the tree describing the prex-free code given in
(3.3). Note that here every codeword is a leaf. This is no accident.
Lemma 3.5. A D-ary code {c1 , . . . , cr } is prex-free if, and only if, in its Dary tree every codeword is a leaf. (But not every leaf necessarily is a codeword.)
Exercise 3.6. Prove Lemma 3.5.
Hint: Carefully think about the denition of prex-free codes (Denition
3.2).

As mentioned, the D-ary tree of a prex-free code might contain leaves that
are not codewords. Such leaves are called unused leaves. Some more examples
of trees of prex-free and non-prex-free codes are shown in Figure 3.7.
An important concept of trees is the depths of their leaves.
Denition 3.7. The depth of a leaf in a D-ary tree is the number of steps it
takes when walking from the root forward to the leaf.
As an example consider again Figure 3.7. Tree (iv) has four leaves, all of
them at depth 2. Both tree (iii) and tree (v)-(vi) have three leaves, one at
depth 1 and two at depth 2.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

3.4. Trees and Codes

49

0
1
2

01

201
202
2
22
Figure 3.5: An example of a ternary tree with six codewords: 1, 2, 01, 22,
201, and 202. At every node, going upwards corresponds to a 0, taking
the middle branch corresponds to a 1, and going downwards corresponds
to a 2.
0
0

10
110

111
Figure 3.6: Decision tree corresponding to the prex-free code given in (3.3).

(iii)

(iv)

00

(v)-(vi)

10

10
10

11
11
not prex-free

prex-free

prex-free

Figure 3.7: Examples of codes and its trees. The examples are taken from
Table 3.1.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

50

Efficient Coding of a Single Random Message

We will now derive some interesting properties of trees. Since codes can
be represented by trees, we will then be able to apply these properties directly
to codes.
Lemma 3.8 (Leaf Counting and Leaf Depth Lemma). The number of
leaves n and their depths l1 , l2 , . . . , ln in a D-ary tree satisfy:
n = 1 + N(D 1),

(3.5)

Dli = 1,

(3.6)

i=1

where N is the number of nodes (including the root).


Proof: By extending a leaf we mean changing a leaf into a node by
adding D branches that stem forward. In that process
we reduce the number of leaves by 1,
we increase the number of nodes by 1, and
we increase the number of leaves by D,
i.e., in total we gain 1 node and D 1 leaves. This process is depicted graphically in Figure 3.8 for the case of D = 3.

1 leaf

+1 node
+3 leaves

Figure 3.8: Extending a leaf in a ternary tree (D = 3): The total number of
nodes is increased by one and the total number of leaves is increased by
D 1 = 2.
To prove the rst statement (3.5), we start with the extended root, i.e., at
the beginning, we have the root and n = D leaves. In this case, we have N = 1
and (3.5) is satised. Now we can grow any tree by continuously extending
some leaf, every time increasing the number of leaves by D1 and the number
of nodes by 1. We see that (3.5) remains valid. By induction this proves the
rst statement.
We will prove the second statement (3.6) also by induction. We again start
with the extended root.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

3.5. The Kraft Inequality

51

1. An extended root has D leaves, all at depth 1: li = 1. Hence,


n

li

=
i=1

i=1

D1 = D D1 = 1,

(3.7)

i.e., for the extended root (3.6) is satised.


2. Suppose n Dli = 1 holds for an arbitrary D-ary tree with n leaves.
i=1
Now we extend one leaf, say the nth leaf.2 We get a new tree with
n = n 1 + D leaves where the D new leaves all have depths ln + 1:
n

n1

Dli =
i=1

i=1

Dli + D D(ln +1)

unchanged
leaves
n1
li

(3.8)

new leaves
at depth
ln + 1

+ Dln

(3.9)

i=1
n

Dli = 1.

(3.10)

i=1

Here the last equality follows from our assumption that n Dli = 1.
i=1
Hence by extending one leaf the second statement continues to hold.
3. Since any tree can be grown by continuously extending some leaf, the
proof follows by induction.
We are now ready to apply our rst insights about trees to codes.

3.5

The Kraft Inequality

The following theorem is very useful because it gives us a way to nd out


whether a prex-free code exists or not.
Theorem 3.9 (Kraft Inequality). There exists a D-ary prex-free code with
r codewords of lengths l1 , l2 , . . . , lr N if, and only if,
r
i=1

Dli 1.

(3.11)

If (3.11) is satised with equality, then there are no unused leaves in the tree.
Example 3.10. Let l1 = 3, l2 = 4, l3 = 4, l4 = 4, l5 = 4, and consider a
binary code D = 2. Then
23 + 4 24 =
2

4
3
1
+
= 1,
8 16
8

(3.12)

Since the tree is arbitrary, it does not matter how we number the leaves!

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

52

Efficient Coding of a Single Random Message

i.e., there exists a binary prex-free code with the above codeword lengths.
On the other hand, we cannot nd any binary prex-free code with ve
codewords of lengths l1 = 1, l2 = 2, l3 = 3, l4 = 3, and l5 = 4 because
21 + 22 + 2 23 + 24 =

17
> 1.
16

(3.13)

But, if we instead look for a ternary prex-free code (D = 3) with ve codewords of these lengths, we will be successful because
31 + 32 + 2 33 + 34 =

43
1.
81

(3.14)

These examples are shown graphically in Figure 3.9.

3
4
4
4
4

4?

1
3

3
4

Figure 3.9: Illustration of the Kraft inequality according to Example 3.10.


Proof of the Kraft Inequality: We prove the two directions separately:
=: Suppose that there exists a D-ary prex-free code with the given codeword lengths. From Lemma 3.5 we know that all r codewords of a D-ary
prex-free code are leaves in a D-ary tree. The total number n of (used
and unused) leaves in this tree can therefore not be smaller than r,
r n.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(3.15)

3.6. Trees with Probabilities

53

Hence,
r

li

i=1

Dli = 1,

(3.16)

i=1

where the last equality follows from the leaf depth lemma (Lemma 3.8).
=: Suppose that r Dli 1. We now can construct a D-ary prex-free
i=1
code as follows:
Step 1: Start with the extended root, i.e., a tree with D leaves, set i =
1, and assume without loss of generality that l1 l2 lr .

Step 2: If there is an unused leaf at depth li , put the ith codeword


there. Note that there could be none because li can be strictly
larger than the current depth of the tree. In this case, extend
any unused leaf to depth li , and put the ith codeword to one
of the new leaves.
Step 3: If i = r, stop. Otherwise i i + 1 and go to Step 2.
We only need to check that Step 2 is always possible, i.e., that there is
always some unused leaf available. To that goal note that if we get to
Step 2, we have already put i 1 codewords into the tree. From the leaf
depth lemma (Lemma 3.8) we know that
n

1=

lj

j=1

i1

lj

j=1

D lj ,

(3.17)

j=i

used leaves

unused leaves

where j are the depths of the leaves in the tree at that moment, i.e.,
l
(1 , . . . , i1 ) = (l1 , . . . , li1 ) and i , . . . , n are the depths of the (so far)
l
l
l
l
unused leaves. Now note that by assumption i r, i.e.,
i1

Dlj <
j=1

j=1

Dlj 1,

(3.18)

where the last inequality follows by assumption. Hence, in (3.17), the


term used leaves is strictly less than 1, and thus the term unused
leaves must be strictly larger than 0, i.e., there still must be some
unused leaves available!

3.6

Trees with Probabilities

We have seen already in Section 3.1 that for codes it is important to consider
the probabilities of the codewords. We therefore now introduce probabilities
in our trees.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

54
replacemen

Efficient Coding of a Single Random Message

0.2
leaf 1

0.2
0.3

leaf 3

node 1

0.8

node 3

0.1
leaf 4

node 2

0.5
leaf 2

Figure 3.10: An example of a binary tree with probabilities.

Denition 3.11. A tree with probabilities is a nite tree with probabilities


assigned to each node and leaf such that
the probability of a node is the sum of the probabilities of its children,
and
the root has probability 1.
An example of a binary tree with probabilities is given in Figure 3.10. Note
that the node probabilities can be seen as the overall probability of passing
through this node when making a random walk from the root to a leaf. For
the example of Figure 3.10, we have an 80% chance that our path will go
through node 2, and with a chance of 30% we will pass through node 3. By
denition, the probability of ending up in a certain leaf is given by the leaf
probability, e.g., in Figure 3.10, we have a 10% chance to end up in leaf 4.
Obviously, since we always start our random walk at the root, the probability
that our path goes through the root is always 1.
To clarify our notation we will use the following conventions:
n denotes the total number of leaves;
pi , i = 1, . . . , n, denote the probabilities of the leaves;
N denotes the number of nodes (including the root, but excluding the
leaves); and
P , = 1, . . . , N, denote the probabilities of the nodes, where by denition P1 = 1 is the root probability.
Moreover, we will use q,j to denote the probability of the jth node/leaf that is
one step forward from node (the jth child of node ), where j = 0, 1, . . . , D

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

3.6. Trees with Probabilities

55

1. I.e., we have
D1

q,j = P .

(3.19)

j=0

Since in a prex-free code all codewords are leaves and we are particularly
interested in the average codeword length, we are very much interested in the
average depth of the leaves in a tree (where for the averaging operation we
use the probabilities in the tree). Luckily, there is an elegant way to compute
this average depth as shown in the following lemma.
Lemma 3.12 (Path Length Lemma). In a rooted tree with probabilities,
the average depth E[L] of the leaves is equal to the sum of the probabilities of
all nodes (including the root):
N

E[L] =

P .

(3.20)

=1

Example 3.13. Consider the tree of Figure 3.10. We


at depth l1 = 1 with a probability p1 = 0.2, one at
probability p2 = 0.5, and two at depth l3 = l4 = 3 with
and p4 = 0.1, respectively. Hence, the average depth of

have four leaves: one


depth l2 = 2 with a
probabilities p3 = 0.2
the leaves is

E[L] =
i=1

pi li = 0.2 1 + 0.5 2 + 0.2 3 + 0.1 3 = 2.1.

(3.21)

According to Lemma 3.12 this must be equal to the sum of the node probabilities:
E[L] = P1 + P2 + P3 = 1 + 0.8 + 0.3 = 2.1.

(3.22)

Proof of Lemma 3.12: The lemma is easiest understood when looking


at a particular example. Lets again consider the tree of Figure 3.10. When
computing the average depth of the leaves, we sum all probabilities of the
leaves weighted with the corresponding depth. So, e.g., the probability p1 =
0.2 of leaf 1 being at depth 1 has weight 1, i.e., it needs to be counted
once only. If we now look at the sum of the node probabilities, then we note
that p1 indeed is only counted once as it only is part of the probability of the
root P1 = 1.
Leaf 2, on the other hand, is at depth 2 and therefore its probability
p2 = 0.5 has weight 2, or in other words, p2 needs to be counted twice. Again,
in the sum of the node probabilities, this is indeed the case because p2 is
contained both the root probability P1 = 1 and also in the probability of the
second node P2 = 0.8.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

56

Efficient Coding of a Single Random Message

Finally, the probabilities of leaf 3 and leaf 4, p3 = 0.2 and p4 = 0.1, are
counted three times as they are part of P1 , P2 , and P3 :
E[L] = 2.1

(3.23)

= 1 0.2 + 2 0.5 + 3 0.2 + 3 0.1

= 1 (0.2 + 0.5 + 0.2 + 0.1) + 1 (0.5 + 0.2 + 0.1)


+ 1 (0.2 + 0.1)

= 1 P1 + 1 P2 + 1 P3

(3.24)
(3.25)
(3.26)

= P1 + P2 + P3 .

(3.27)

Since we have started talking about probabilities and since this is an information theory course, it should not come as a big surprise that we next
introduce the concept of entropy into our concept of trees.
Denition 3.14. The leaf entropy is dened as
n

Hleaf

pi log pi .

(3.28)

i=1

Denition 3.15. The branching entropy H of node is dened as


D1

j=0

q,j
q,j
log
.
P
P

(3.29)

,j
Note that P is the conditional probability of going along the jth branch
given that we are at node (normalization!). Further note that, as usual, we
implicitly exclude zero probability values. In particular, we do not consider
the situation where a node has zero probability, but assume that in this case
this node with all its children (and their children, etc.) is removed from the
tree and replaced by an unused leaf.

So far we do not see yet what the use of these denitions will be, but we
compute an example anyway.
Example 3.16. Consider the tree in Figure 3.11. We have
Hleaf = 0.4 log 0.4 0.1 log 0.1 0.5 log 0.5 1.361 bits;
0.4
0.4 0.6
0.6
H1 =
log

log
0.971 bits;
1
1
1
1
0.1
0.1 0.5
0.5
H2 =
log

log
0.650 bits;
0.6
0.6 0.6
0.6
0.1
0.1
log
= 0 bits.
H3 =
0.1
0.1

(3.30)
(3.31)
(3.32)
(3.33)

We will next prove a very interesting relationship between the leaf entropy and the branching entropy that will turn out to be fundamental for the
understanding of codes.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

3.6. Trees with Probabilities

57
0.4

1
0.1

H1
0.1
H3
0.6

H2
0.5
Figure 3.11: A binary tree with probabilities.

Theorem 3.17 (Leaf Entropy Theorem). In any tree with probabilities it


holds that
N

P H .

Hleaf =

(3.34)

=1

Proof: First, we recall that by definition of trees and trees with probabilities we have for every node ℓ:

    P_ℓ = \sum_{j=0}^{D-1} q_{ℓ,j}    (3.35)

(see (3.19)). Using the definition of branching entropy, we get

    P_ℓ H_ℓ = −P_ℓ \sum_{j=0}^{D-1} (q_{ℓ,j}/P_ℓ) \log (q_{ℓ,j}/P_ℓ)    (3.36)
            = −\sum_{j=0}^{D-1} q_{ℓ,j} \log (q_{ℓ,j}/P_ℓ)    (3.37)
            = −\sum_{j=0}^{D-1} q_{ℓ,j} \log q_{ℓ,j} + \sum_{j=0}^{D-1} q_{ℓ,j} \log P_ℓ    (3.38)
            = −\sum_{j=0}^{D-1} q_{ℓ,j} \log q_{ℓ,j} + \log P_ℓ \sum_{j=0}^{D-1} q_{ℓ,j}    (since \log P_ℓ is independent of j)    (3.39)
            = −\sum_{j=0}^{D-1} q_{ℓ,j} \log q_{ℓ,j} + P_ℓ \log P_ℓ,    (3.40)

where the last equality follows from (3.35).

[Figure 3.12: Graphical proof of the leaf entropy theorem. In this example there are three nodes. We see that all contributions cancel apart from the root node (whose contribution is 0) and the leaves.]

Hence, for every node ℓ, we see that it will contribute to \sum_{ℓ=1}^{N} P_ℓ H_ℓ twice:

• Firstly, it will add P_ℓ \log P_ℓ when the node counter passes through ℓ; and

• secondly, it will subtract q_{ℓ',j} \log q_{ℓ',j} = P_ℓ \log P_ℓ when the node counter points to the parent node ℓ' of ℓ.

Hence, the contributions of all nodes will cancel out, apart from the root, which does not have a parent. The root only contributes P_1 \log P_1 for ℓ = 1. However, since P_1 = 1, we have P_1 \log P_1 = 1 \cdot \log 1 = 0. So the root does not contribute either.

It only remains to consider the leaves. Note that the node counter will not pass through leaves by definition. Hence, a leaf only contributes when the node counter points to its parent node, and its contribution is −q_{ℓ,j} \log q_{ℓ,j} = −p_i \log p_i. Since the sum of all −p_i \log p_i equals the leaf entropy by definition, this proves the claim.

In Figure 3.12 we have tried to depict the argumentation of this proof graphically.


Example 3.18. We continue with Example 3.16:

    P_1 H_1 + P_2 H_2 + P_3 H_3 = 1 \cdot 0.971 + 0.6 \cdot 0.650 + 0.1 \cdot 0 bits    (3.41)
                                = 1.361 bits = H_leaf,    (3.42)

as we have expected.
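The leaf entropy theorem can likewise be verified numerically. Here is a minimal Python sketch (not part of the original notes) for the tree of Figure 3.11; the branching structure, i.e., which child probabilities hang below which node, is read off the figure as described in Example 3.16.

    from math import log2

    # Leaves of the tree in Figure 3.11 and the branching structure
    # (child probabilities of each node), as read off the figure.
    leaf_probs = [0.4, 0.1, 0.5]
    nodes = {                      # node probability -> child probabilities
        1.0: [0.4, 0.6],           # root
        0.6: [0.1, 0.5],
        0.1: [0.1],                # the second branch is unused (probability 0)
    }

    H_leaf = -sum(p * log2(p) for p in leaf_probs)

    def branching_entropy(P, children):
        """Entropy of the normalized branching probabilities q/P."""
        return -sum((q / P) * log2(q / P) for q in children if q > 0)

    rhs = sum(P * branching_entropy(P, ch) for P, ch in nodes.items())

    print(round(H_leaf, 3), round(rhs, 3))   # both approx. 1.361 bits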

3.7 What We Cannot Do: Fundamental Limitations of Source Coding

The main strength of information theory is that it can provide some fundamental statements about what is possible to achieve and what is not possible. So a typical information theoretic result will consist of an upper bound and a lower bound, or, as it is common to call these two parts, an achievability part and a converse part. The achievability part of a theorem tells us what we can do, and the converse part tells us what we cannot do.

Sometimes, the theorem will also tell us how to do it, but usually the result is theoretic in the sense that it only proves what is possible without actually saying how it could be done. To put it pointedly: Information theory tells us what is possible, coding theory tells us how to do it.

In this section we will now derive our first converse part, i.e., we will prove some fundamental limitation about the efficiency of source coding. Let us quickly summarize what we know about codes and their corresponding trees:

• We would like to use prefix-free codes.

• Every prefix-free code can be represented by a tree where every codeword corresponds to a leaf in the tree.

• Every codeword has a certain probability corresponding to the probability of the source symbol it represents.

• Unused leaves can be regarded as symbols that never occur, i.e., we assign probability zero to them.

From these observations we realize that the entropy H(U) of a random message U with probabilities p_1, ..., p_r and the leaf entropy of the corresponding tree are the same:

    H_leaf = H(U).    (3.43)

Note that the unused leaves do not contribute to H_leaf since they have zero probability.

Moreover, the average codeword length E[L] is equivalent to the average depth of the leaves. (Again we can ignore the unused leaves because they have probability zero and therefore do not contribute to the average.)


Now note that since we consider D-ary trees where each node branches into D different children, we know from the properties of entropy that the branching entropy can never be larger than log D (see Theorem 1.12):

    H_ℓ ≤ \log D.    (3.44)

Hence, using this together with the leaf entropy theorem (Theorem 3.17) and the path length lemma (Lemma 3.12), we get:

    H(U) = H_leaf = \sum_{ℓ=1}^{N} P_ℓ H_ℓ ≤ \sum_{ℓ=1}^{N} P_ℓ \log D = \log D \sum_{ℓ=1}^{N} P_ℓ = \log D \cdot E[L],    (3.45)

and hence

    E[L] ≥ H(U) / \log D.    (3.46)

Note that the logarithm used to compute the entropy and the logarithm used to compute log D must have the same basis; however, it does not matter which basis is chosen.

This is the converse part of the coding theorem for a single random message. It says that whatever code you try to design, the average codeword length of any D-ary prefix-free code for an r-ary random message U cannot be smaller than the entropy of U (using the correct units)!

Note that to prove this statement we have not designed any code, but instead we have been able to prove something that holds for every code that exists.

When do we have equality? From the derivation just above we see that we have equality if the branching entropy always is H_ℓ = \log D, i.e., the branching probabilities are all uniform. This is only possible if p_i is a negative integer power of D for all i:

    p_i = D^{-ν_i}    (3.47)

with ν_i ∈ N a natural number (and, of course, if we design an optimal code).

In Appendix 3.A we show an alternative proof of this result that uses the fact that relative entropy is nonnegative.
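As a quick sanity check of the converse (3.46), the following Python sketch (not part of the original notes) evaluates E[L] ≥ H(U)/log D for two binary prefix-free codes for the four-symbol message that will serve as the running example in the next section; the two sets of codeword lengths are the ones that will appear in Figures 3.13 and 3.14.

    from math import log2

    p = [0.4, 0.3, 0.2, 0.1]                 # running example of the next section
    H = -sum(pi * log2(pi) for pi in p)      # approx. 1.846 bits

    for lengths in ([2, 2, 3, 4], [2, 2, 2, 2]):
        EL = sum(pi * li for pi, li in zip(p, lengths))
        print(EL, EL >= H)                   # converse (3.46) with D = 2: E[L] >= H(U)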

3.8 What We Can Do: Analysis of Some Good Codes

In practice it is not only important to know where the limitations are, but perhaps even more so to know how close we can get to these limitations. So, as a next step, we would like to finally start thinking about how to design codes and then try to analyze their performance.

3.8.1 Shannon-Type Codes

We have already understood that a good code should assign a short codeword to a message with high probability, and use the long codewords for unlikely messages only. So we try to design a simple code following this main design principle.

Assume that P_U(u_i) = p_i > 0 for all i, since we do not care about messages with zero probability. Then, for every message u_i define

    l_i ≜ ⌈ \log(1/p_i) / \log D ⌉    (3.48)

(where ⌈ξ⌉ denotes the smallest integer not smaller than ξ) and choose an arbitrary unique prefix-free codeword of length l_i. Any code generated like that is called a Shannon-type code.

We need to show that such a code always exists. Note that by definition in (3.48)

    l_i ≥ \log(1/p_i) / \log D = \log_D(1/p_i).    (3.49)

Hence, we have

    \sum_{i=1}^{r} D^{-l_i} ≤ \sum_{i=1}^{r} D^{-\log_D(1/p_i)} = \sum_{i=1}^{r} p_i = 1,    (3.50)

and the Kraft inequality (Theorem 3.9) is always satisfied. So we know that we can always find a Shannon-type code.
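The construction is easy to automate. The following Python sketch (not part of the original notes) computes the Shannon-type codeword lengths (3.48) and checks the Kraft inequality; the probabilities are those of Example 3.19 below.

    from math import ceil, log

    def shannon_type_lengths(probs, D=2):
        """Codeword lengths l_i = ceil(log_D(1/p_i)) of a Shannon-type code."""
        return [ceil(log(1 / p, D)) for p in probs]

    probs = [0.4, 0.3, 0.2, 0.1]          # message of Example 3.19
    lengths = shannon_type_lengths(probs, D=2)
    kraft = sum(2 ** (-l) for l in lengths)

    print(lengths)        # [2, 2, 3, 4], as computed in (3.52)-(3.55)
    print(kraft <= 1)     # True: the Kraft inequality is satisfied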
Example 3.19. As an example, consider a random message U with four symbols having probabilities

    p_1 = 0.4,  p_2 = 0.3,  p_3 = 0.2,  p_4 = 0.1,    (3.51)

i.e., H(U) ≈ 1.846 bits. We design a binary Shannon-type code:

    l_1 = ⌈\log_2(1/0.4)⌉ = 2,    (3.52)
    l_2 = ⌈\log_2(1/0.3)⌉ = 2,    (3.53)
    l_3 = ⌈\log_2(1/0.2)⌉ = 3,    (3.54)
    l_4 = ⌈\log_2(1/0.1)⌉ = 4.    (3.55)

Note that

    2^{-2} + 2^{-2} + 2^{-3} + 2^{-4} = 11/16 < 1,    (3.56)

i.e., such a code does exist, as expected. A possible choice for such a code is shown in Figure 3.13.



[Figure 3.13: A Shannon-type code for the message U of Example 3.19, with codewords 00 (l_1 = 2), 01 (l_2 = 2), 100 (l_3 = 3), and 1010 (l_4 = 4).]


Next, let's see how efficient such a code is. To that goal we note that

    l_i < \log(1/p_i) / \log D + 1    (3.57)

and therefore

    E[L] = \sum_{i=1}^{r} p_i l_i    (3.58)
         < \sum_{i=1}^{r} p_i ( \log(1/p_i) / \log D + 1 )    (3.59)
         = (1/\log D) \sum_{i=1}^{r} p_i \log(1/p_i) + \sum_{i=1}^{r} p_i    (3.60)
         = H(U) / \log D + 1,    (3.61)

where we have used the definition of entropy. We see that a Shannon-type code (even though it is not an optimal code!) comes within less than 1 of the ultimate lower bound (3.46)! An optimal code will be even better than that!
Example 3.20. Let's return to Example 3.19. We compute the efficiency of the Shannon-type code designed before and compare:

    E[L] = 1 + 0.7 + 0.3 + 0.3 + 0.1 = 2.4,    (3.62)

which satisfies, as predicted,

    1.846 ≤ 2.4 < 1.846 + 1.    (3.63)

But obviously the code shown in Figure 3.14 is much better than the Shannon-type code! Its performance is E[L] = 2 < 2.4.


[Figure 3.14: Another binary prefix-free code for U of Example 3.19: all four codewords 00, 01, 10, 11 have length 2.]

3.8.2 Shannon Code

Shannon actually did not directly propose the Shannon-type codes, but a special case of the Shannon-type codes that we call Shannon code [Sha48, Sec. 9]. In Shannon's version the codeword lengths were chosen to satisfy (3.48), but in addition Shannon also gave a clear rule on how to choose the codewords. But before we can introduce the details, we need to make a quick remark about our common decimal representation of real numbers (and the corresponding D-ary representations).

Remark 3.21 (Decimal and D-ary Representation of Real Numbers). We are all used to the decimal representation of real numbers: For example, 1/2 can be written as 0.5, or 3/16 = 0.1875. We are also aware that some fractions need infinitely many digits, like 1/3 = 0.3333333... = 0.\overline{3} (we use here the overline \overline{x} to represent infinite repetition of x), or 103/135 = 0.7\overline{629}. Finally, irrational numbers need infinitely many digits without any repetitive patterns.

The problem is that, in contrast to how it might seem, the decimal representation of a real number is not necessarily unique! Take the example of 1/3: obviously, 3 \cdot 1/3 = 1, but if we write this in decimal form, we get 3 \cdot 0.\overline{3} = 0.\overline{9}, i.e., we see that 0.\overline{9} = 1. The problem is that infinitely many 9s at the tail of a decimal representation is equivalent to increasing the digit in front of the tail by 1 and removing the tail, e.g.,

    0.34\overline{9} = 0.35.    (3.64)

To get rid of this nonuniqueness, we will require that no decimal representation of a real number has a tail of infinitely many 9s.

This same discussion can directly be extended to the D-ary representation of a real number. There we do not allow any infinite tail of the digit (D − 1). Consider, for example, the following quaternary (D = 4) representation:

    (0.201\overline{3})_4 = (0.202)_4 = (0.53125)_{10}.    (3.65)

The problem is most pronounced for the binary representation because there we only have zeros and ones: We do not allow any representation that ends in a tail of infinitely many 1s.
Definition 3.22. The Shannon code is a prefix-free code that is constructed as follows:

Step 1: Arrange the symbols in order of decreasing probability.

Step 2: Compute the cumulative probability³

    F_i ≜ \sum_{j=1}^{i-1} p_j    (3.66)

and express it in its D-ary representation (where we do not allow a representation with a tail of infinitely many digits (D − 1), in order to make sure that this representation is unique⁴). For example, F_i = 0.625 in binary form is 0.101; or F_i = 55/81 in ternary form is 0.2001.

Step 3: Carry out the D-ary representation to exactly l_i positions after the comma, where l_i is defined in (3.48), i.e.,

    l_i ≜ ⌈ \log_D(1/p_i) ⌉.    (3.67)

Then remove the leading "0." and use the resulting digits as codeword c_i for u_i. E.g., the binary F_i = (0.101)_2 with l_i = 5 will give a codeword c_i = 10100; or the ternary F_i = (0.2001)_3 with l_i = 3 will give c_i = 200.

Example 3.23. Let's again consider the random message U of Example 3.19. The binary Shannon code is shown in Table 3.15, and the ternary Shannon code in Table 3.16.
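The construction of Definition 3.22 is straightforward to implement. The Python sketch below (not part of the original notes) computes the cumulative probabilities F_i and expands them to l_i D-ary digits; it uses floating-point arithmetic in place of exact D-ary expansions, which is fine for these short examples but would need care in general (cf. footnote 4). It reproduces the codewords of Tables 3.15 and 3.16.

    from math import ceil, log

    def shannon_code(probs, D=2):
        """Shannon code of Definition 3.22: digits of the D-ary expansion of
        the cumulative probabilities F_i, cut to length l_i."""
        probs = sorted(probs, reverse=True)          # Step 1
        codewords = []
        F = 0.0
        for p in probs:
            l = ceil(log(1 / p, D))                  # l_i = ceil(log_D(1/p_i))
            digits, x = [], F
            for _ in range(l):                       # D-ary expansion, l_i digits
                x *= D
                d = int(x)
                digits.append(str(d))
                x -= d
            codewords.append("".join(digits))
            F += p                                   # F_{i+1} = F_i + p_i
        return codewords

    print(shannon_code([0.4, 0.3, 0.2, 0.1], D=2))   # ['00', '01', '101', '1110']
    print(shannon_code([0.4, 0.3, 0.2, 0.1], D=3))   # ['0', '10', '20', '220']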

Lemma 3.24. The Shannon code as defined in Definition 3.22 is prefix-free.

³ Note that this definition does not exactly match the definition of the cumulative distribution function (CDF) of a random variable.

⁴ We would like to point out that here we do not consider the numerical stability of computers, which always only calculate with a limited precision. It is quite probable that the Shannon code could be adapted to show a better performance with respect to numerical precision problems.


Table 3.15: The binary Shannon code for the random message U of Example 3.23.

                                  u_1      u_2         u_3         u_4
    p_i                           0.4      0.3         0.2         0.1
    F_i                           0        0.4         0.7         0.9
    binary representation         0.0      0.01100...  0.10110...  0.11100...
    ⌈log_2(1/p_i)⌉                2        2           3           4
    shortened representation      0.00     0.01        0.101       0.1110
    c_i                           00       01          101         1110

Table 3.16: The ternary Shannon code for the random message U of Example 3.23.

                                  u_1      u_2         u_3         u_4
    p_i                           0.4      0.3         0.2         0.1
    F_i                           0        0.4         0.7         0.9
    ternary representation        0.0      0.10121...  0.20022...  0.22002...
    ⌈log_3(1/p_i)⌉                1        2           2           3
    shortened representation      0.0      0.10        0.20        0.220
    c_i                           0        10          20          220

Proof: Since we do not allow a D-ary representation with an infinite tail of digits (D − 1), the D-ary representation of F_i is unique, i.e., there exist unique a_i^{(1)}, a_i^{(2)}, ... ∈ {0, ..., D − 1} such that

    F_i = \sum_{j=1}^{∞} a_i^{(j)} D^{-j}    (3.68)

and such that

    (a_i^{(j')}, a_i^{(j'+1)}, a_i^{(j'+2)}, ...) ≠ (D − 1, D − 1, D − 1, ...)    (3.69)

for any j' ∈ N. Moreover, from (3.49) it follows (solve for p_i!) that

    p_i ≥ 1/D^{l_i}.    (3.70)

By contradiction, assume that the codeword c_i is the prefix of another codeword c_{i'}, i ≠ i'. This means that the length l_i of c_i cannot be longer than the length l_{i'} of c_{i'}, which could happen in two cases:

• i < i' and therefore by definition l_i ≤ l_{i'}, or

• i > i', but l_i = l_{i'}.

Since the latter case equals the first if we exchange the roles of i and i', we do not need to further pursue it, but can concentrate on the former case only.
Now, since c_i is the prefix of c_{i'}, we must have that

    a_{i'}^{(j)} = a_i^{(j)},   j = 1, ..., l_i.    (3.71)

Hence,

    F_{i'} − F_i = \sum_{j=1}^{∞} ( a_{i'}^{(j)} − a_i^{(j)} ) D^{-j}    (by (3.68))    (3.72)
                 = \sum_{j=l_i+1}^{∞} ( a_{i'}^{(j)} − a_i^{(j)} ) D^{-j}    (by (3.71))    (3.73)
                 < \sum_{j=l_i+1}^{∞} (D − 1) D^{-j}    (3.74)
                 = D^{-l_i}.    (3.75)

Here, the strict inequality in (3.74) holds because the difference term can only be equal to D − 1 if a_{i'}^{(j)} = D − 1 (and a_i^{(j)} = 0) for all j ≥ l_i + 1, which we have specifically excluded (see Remark 3.21). The final equality (3.75) is a well-known algebraic property (the easiest way to see this is to recall Remark 3.21 again and to consider the typical example 0.4\overline{9} = 0.5).

On the other hand, by definition of F_i and F_{i'},

    F_{i'} − F_i = p_i + p_{i+1} + ··· + p_{i'-1}    (3.76)
                 ≥ D^{-l_i} + D^{-l_{i+1}} + ··· + D^{-l_{i'-1}},    (3.77)

where the inequality follows by repeated application of (3.70).

Combining (3.75) and (3.77) yields the following contradiction:

    D^{-l_i} > D^{-l_i} + D^{-l_{i+1}} + ··· + D^{-l_{i'-1}},    (3.78)

and thereby shows that c_i cannot be a prefix of c_{i'}. This concludes the proof.⁵
Obviously, the Shannon code also is a Shannon-type code (because (3.48) and (3.67) are identical) and therefore its performance is identical to the codes described in Section 3.8.1. So we do not actually need to bother about the exact definition of the codewords specified by Shannon, but could simply construct any tree as shown for the Shannon-type codes. The importance of the Shannon code, however, lies in the fact that it was the origin of arithmetic coding. We will come back to this in Section 4.3.

⁵ Many thanks go to Tomas Kroupa for pointing out the nonuniqueness of the D-ary representation and for suggesting this proof.

3.8.3 Fano Code

Shortly before Shannon, Fano suggested a different code that actually turns out to be very easy to implement in hardware [Fan49].

Definition 3.25. The Fano code is generated according to the following algorithm:

Step 1: Arrange the symbols in order of decreasing probability.

Step 2: Split the list of ordered symbols into D parts, with the total probabilities of each part being as similar as possible.

Step 3: Assign the digit 0 to the first part, the digit 1 to the second part, ..., and the digit D − 1 to the last part. This means that the codewords for the symbols in the first part will all start with 0, and the codewords for the symbols in the second part will all start with 1, etc.

Step 4: Recursively apply Step 2 and Step 3 to each of the D parts, subdividing each part into further D parts and adding digits to the codewords until each symbol is the single member of a part.

Note that effectively this algorithm constructs a tree and that therefore the Fano code is prefix-free.
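A minimal Python sketch of the binary case (D = 2) is given below (not part of the original notes); ties between equally balanced splits are broken in favor of the first candidate, so the output is one of possibly several valid Fano codes (cf. Remark 3.28 later in this section). Applied to the five-symbol message of Example 3.26 below, it returns the codewords derived there.

    def fano_code(probs):
        """Binary Fano code (Definition 3.25): recursively split the
        probability-ordered list into two parts of most similar total weight."""
        probs = sorted(probs, reverse=True)                    # Step 1
        codewords = [""] * len(probs)

        def split(indices):
            if len(indices) <= 1:
                return
            total = sum(probs[i] for i in indices)
            best_k, best_diff, left = 1, float("inf"), 0.0
            for k in range(1, len(indices)):                   # Step 2: best split point
                left += probs[indices[k - 1]]
                diff = abs(2 * left - total)
                if diff < best_diff:
                    best_k, best_diff = k, diff
            for j, i in enumerate(indices):                    # Step 3: append one digit
                codewords[i] += "0" if j < best_k else "1"
            split(indices[:best_k])                            # Step 4: recurse
            split(indices[best_k:])

        split(list(range(len(probs))))
        return codewords

    print(fano_code([0.35, 0.25, 0.15, 0.15, 0.1]))
    # ['00', '01', '10', '110', '111']   (matches Example 3.26)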
Example 3.26. Let us generate a binary Fano code for a random message with five symbols having probabilities

    p_1 = 0.35,  p_2 = 0.25,  p_3 = 0.15,  p_4 = 0.15,  p_5 = 0.1.    (3.79)

Since the symbols are already ordered in decreasing order of probability, Step 1 can be omitted. We hence want to split the list into two parts, both having as similar total probability as possible. If we split into {1} and {2, 3, 4, 5}, we have a total probability of 0.35 on the left and 0.65 on the right; the split {1, 2} and {3, 4, 5} yields 0.6 and 0.4; and {1, 2, 3} and {4, 5} gives 0.75 and 0.25. We see that the second split is best. So we assign 0 as a first digit to {1, 2} and 1 to {3, 4, 5}.

Now we repeat the procedure with both subgroups. Firstly, we split {1, 2} into {1} and {2}. This is trivial. Secondly, we split {3, 4, 5} into {3} and {4, 5} because 0.15 and 0.25 are closer to each other than the 0.3 and 0.1 that we would have gotten by splitting into {3, 4} and {5}. Again we assign the corresponding codeword digits.

Finally, we split the last group {4, 5} into {4} and {5}. We end up with the five codewords {00, 01, 10, 110, 111}. This whole procedure is shown in Figure 3.17.

In Figure 3.18 a ternary Fano code is constructed for the same random message.

[Figure 3.17: Construction of a binary Fano code according to Example 3.26.]


[Figure 3.18: Construction of a ternary Fano code according to Example 3.26. The first split is {1}, {2}, {3, 4, 5}, resulting in the codewords 0, 1, 20, 21, 22.]


Exercise 3.27. Construct the binary Fano code for the random message U of Example 3.19 with four symbols having probabilities

    p_1 = 0.4,  p_2 = 0.3,  p_3 = 0.2,  p_4 = 0.1,    (3.80)

and compute its performance. Compare to the Shannon-type code designed in Examples 3.19 and 3.20.

Remark 3.28. We would like to point out that there are cases where the algorithm given in Definition 3.25 does not lead to a unique design: There might be two different ways of dividing the list into D parts such that the total probabilities are as similar as possible. Since the algorithm does not specify what to do in such a case, you are free to choose any possible way. Unfortunately, however, these different choices can lead to codes with different performance. As an example consider a random message U with seven possible symbols having the following probabilities:

    p_1 = 0.35,  p_2 = 0.3,  p_3 = 0.15,  p_4 = 0.05,  p_5 = 0.05,  p_6 = 0.05,  p_7 = 0.05.    (3.81)

Figures 3.19 and 3.20 show two different possible Fano codes for this random message. The first has an average codeword length of E[L] = 2.45, while the latter performs better with E[L] = 2.4.

[Figure 3.19: One possible Fano code for the random message given in (3.81), with codewords 00, 01, 100, 101, 1100, 1101, 111.]

[Figure 3.20: A second possible Fano code for the random message given in (3.81), with codewords 0, 10, 110, 11100, 11101, 11110, 11111.]


Exercise 3.29. In total there are six different possible designs of a Fano code for the random message given in Remark 3.28. Design all of them and compare their performances!

We continue with an investigation of the performance of the Fano code. To that goal, the following lemma is helpful.

Lemma 3.30. The codeword lengths l_i of a D-ary Fano code satisfy the following:

    l_i ≤ ⌈ \log_D(1/p_i) ⌉ + 1.    (3.82)

Proof: We start with the binary case D = 2 and consider the ith symbol with probability p_i. In each of the l_i times when the ith symbol is involved in a split of a subgroup of symbols into two parts (where both parts should have as equal total probability as possible), it will be assigned either to the left or to the right part. We now investigate the different possible cases.

1. The ith symbol could be assigned to the right part, i.e., the Fano split results in

       [ ... ] | [ ... p_i ... ]    (3.83)

   Since the symbols on the left are at least as likely as the symbols on the right, we see that the total probability of the left part is at least p_i and therefore the total probability of the group before the split must have been larger than or equal to 2 p_i.

2. If we have

       [ ... p_i ... ] | [ ... ]    with right-part total ≥ p_i,    (3.84)

   we again see that the total probability of the group before the split must have been larger than or equal to 2 p_i.

3. On the other hand, if

       [ ... p_i ... ] | [ ... ]    with right-part total < p_i,    (3.85)

   then the total probability of the group before the split is not necessarily larger than or equal to 2 p_i. Let's investigate this case in more detail. There are several possible scenarios:

   • More precisely, (3.85) might look as follows:

         [ ... p_i ... p_j ] | [ ... ]    with right-part total < p_i,    (3.86)

     with p_i ≥ p_j > 0. This case, however, will never occur in the Fano algorithm because the following split is more balanced:

         [ ... p_i ... ] | [ p_j ... ]    (3.87)

     This can be seen as follows. In (3.86) the left part has total probability p_left ≥ p_i + p_j, while the right part has total probability p_right < p_i. Hence the difference between the two parts satisfies

         Δ(3.86) = |p_left − p_right| = p_left − p_right > p_i + p_j − p_i = p_j.    (3.88)

     On the other hand, in (3.87) we have p_left ≥ p_i, while the right part satisfies p_right < p_i + p_j. If the left part in (3.87) were larger than the right part, then this would also be the case in (3.86), which therefore definitely was not balanced. So, in (3.87) the left part must be smaller than the right and therefore

         Δ(3.87) = |p_left − p_right| = p_right − p_left < p_i + p_j − p_i = p_j.    (3.89)

     So, we see that (3.87) is more balanced and that (3.86) therefore will never occur.

   • It might also be that p_i is the rightmost element of the left part, i.e., (3.85) might look as follows:

         [ ... p_k p_i ] | [ ... ]    with right-part total < p_i,    (3.90)

     with p_k ≥ p_i > 0. This case, however, will never occur in the Fano algorithm because the following split is more balanced:

         [ ... p_k ] | [ p_i ... ]    (3.91)

     This can be seen as follows. In (3.90) the left part has total probability p_left ≥ p_k + p_i ≥ 2 p_i, while the right part has total probability p_right < p_i. Hence the difference between the two parts satisfies

         Δ(3.90) = |p_left − p_right| = p_left − p_right > 2 p_i − p_i = p_i.    (3.92)

     On the other hand, in (3.91) we have p_left ≥ p_k ≥ p_i and p_right < 2 p_i. If the left part in (3.91) were larger than the right part, then this would also be the case in (3.90), which therefore definitely was not balanced. So, in (3.91) the left part must be smaller than the right and therefore

         Δ(3.91) = |p_left − p_right| = p_right − p_left < 2 p_i − p_i = p_i.    (3.93)

     So, we see that (3.91) is more balanced and that (3.90) therefore will never occur.

Hence, we see that the only possible way for (3.85) to occur is

       [ p_i ] | [ ... ]    with right-part total < p_i,    (3.94)

which means it must be the final step in the Fano algorithm for p_i, as afterwards p_i is alone in a part.
So, if we now undo the Fano splitting steps and reunite the two parts in each step, we note that with each reunion the probability of the group grows by a factor of at least 2, apart from the possible exception of the very first reunion. Hence, we must have

    \sum_{j=1}^{r} p_j = 1 ≥ 2^{l_i − 1} p_i.    (3.95)

Let now Δ ∈ Z be such that

    l_i = ⌈ \log_2(1/p_i) ⌉ + Δ.    (3.96)

Then it follows from (3.95) that

    1 ≥ 2^{l_i − 1} p_i    (3.97)
      = 2^{⌈\log_2(1/p_i)⌉ + Δ − 1} p_i    (3.98)
      ≥ 2^{\log_2(1/p_i) + Δ − 1} p_i    (3.99)
      = (1/p_i) \cdot 2^{Δ − 1} \cdot p_i    (3.100)
      = 2^{Δ − 1},    (3.101)

i.e., Δ ≤ 1. This proves the lemma for D = 2.


A similar argument can be given for D = 3. First of all, note that in every step it is possible that one of the three parts is empty, i.e., has total probability zero. However, this will only happen if in the other two parts we only have one symbol left, i.e., we are in the last split. Otherwise, we can assume that each part is strictly nonzero.

In the following three cases we always end up with a total probability larger than or equal to 3 p_i:

    [ ... ] | [ ... ] | [ ... p_i ... ]    (3.102)
    [ ... ] | [ ... p_i ... ] | [ ... ]    with third-part total ≥ p_i    (3.103)
    [ ... p_i ... ] | [ ... ] | [ ... ]    with second- and third-part totals each ≥ p_i    (3.104)

The cases where we have less than 3 p_i are as follows:

    [ ... ] | [ ... p_i ... ] | [ ... ]    with third-part total < p_i    (3.105)
    [ ... p_i ... ] | [ ... ] | [ ... ]    with second-part total ≥ p_i and third-part total < p_i    (3.106)
    [ ... p_i ... ] | [ ... ] | [ ... ]    with second-part total < p_i and third-part total ≥ p_i    (3.107)
    [ ... p_i ... ] | [ ... ] | [ ... ]    with second- and third-part totals each < p_i    (3.108)

We can now investigate each of these latter four cases in the same way as shown in Part 3 of the proof for D = 2. We only need to pick two out of the three parts: the part that contains p_i and one other part that has total probability less than p_i. Note that, strictly speaking, we should compare the total difference of the balancing, i.e.,

    Δ = |p_left − p_middle| + |p_left − p_right| + |p_middle − p_right|.    (3.109)

But if we only change one symbol from one part to another and the third part is left untouched, then it is sufficient to compare the difference between the total probabilities of the two changed parts.

This way we can show that all four cases (3.105)-(3.108) can only occur as the final Fano step when p_i is assigned to its own subgroup. The proof then follows from the derivation (3.97)-(3.101) with 2 replaced by 3 everywhere.

The extension to a general D is now straightforward and therefore omitted.
Exercise 3.31. Construct a binary Fano code for a random message U with eight symbols of probabilities

    p_1 = 0.44,  p_2 = 0.14,  p_3 = 0.13,  p_4 = 0.13,  p_5 = 0.13,  p_6 = 0.01,  p_7 = 0.01,  p_8 = 0.01.    (3.110)

Note that l_5 satisfies (3.82) with equality. Compute the performance of this code.

From Lemma 3.30 we now have

    l_i ≤ ⌈ \log_D(1/p_i) ⌉ + 1 < \log_D(1/p_i) + 2 = \log(1/p_i)/\log D + 2,    (3.111)

i.e., we can adapt the steps (3.58)-(3.61) accordingly to show that the Fano code satisfies

    E[L] < H(U)/\log D + 2.    (3.112)

However, I believe that the Fano code actually performs better than what this bound (3.112) suggests. Actually, I believe that the Fano code is at least


as good as the Shannon-type codes. The reason is that in (3.82) we have an inequality, i.e., it is possible that l_i is less than this bound (this is in contrast to the Shannon-type codes, which have an equality in (3.48) by definition!). Moreover, I believe that the rare case of equality in (3.82) only occurs if for some j < i we have l_j ≤ ⌈\log_D(1/p_j)⌉ − 1, so that the codeword for p_j compensates for the additional code digit of p_i. Unfortunately, so far I have only found a proof for this claim in the special case of a binary Fano code (D = 2). For the moment, I therefore only state it as a conjecture. Any suggestion of a proof would be very welcome!
Conjecture 3.32. The average codeword length E[L] of a D-ary Fano code for an r-ary random message U satisfies

    E[L] < H(U)/\log D + 1.    (3.113)

Keep in mind that neither Fano nor Shannon-type codes are optimal. The optimal code will perform even better! We will derive the optimal codes in Section 3.9.

Exercise 3.33. Construct a ternary Fano code for the random message U of Example 3.19 and compute its performance.

Exercise 3.34. Construct a ternary Fano code for a random message U with 14 symbols of probabilities

    p_1 = 0.3,   p_2 = 0.12,  p_3 = 0.11,  p_4 = 0.11,  p_5 = 0.1,
    p_6 = 0.03,  p_7 = 0.03,  p_8 = 0.03,  p_9 = 0.03,  p_10 = 0.03,
    p_11 = 0.03, p_12 = 0.03, p_13 = 0.03, p_14 = 0.02.    (3.114)

Note that l_5 satisfies (3.82) with equality. Compute the performance of this code.

3.8.4 Coding Theorem for a Single Random Message

We summarize the most important results so far.

Theorem 3.35 (Coding Theorem for a Single Random Message).
The average codeword length E[L] of an optimal D-ary prefix-free code for an r-ary random message U satisfies

    H(U)/\log D ≤ E[L] < H(U)/\log D + 1    (3.115)

with equality on the left if, and only if, P_U(u_i) = p_i is a negative integer power of D, for all i.

Moreover, this statement also holds true for Shannon-type codes and Shannon codes, and is conjectured to hold for Fano codes.


This is the first step in showing that the definition of H(U) is useful!

Remark 3.36. Note that, unfortunately, the three codes from Sections 3.8.1, 3.8.2, and 3.8.3, i.e., the Shannon-type codes, the Shannon code, and the Fano codes, are all known by the same name of Shannon-Fano code.⁶ The reason probably lies in the fact that all three codes perform very similarly and that in [Sha48], Shannon refers to Fano's code construction algorithm, even though he then proposes the Shannon code.

To add to the confusion, sometimes the Shannon code is also known as Shannon-Fano-Elias code [CT06, Sec. 5.9]. The reason here is that, as mentioned above, the Shannon code was the origin of arithmetic coding, and that this idea has been credited to Prof. Peter Elias from MIT. However, Elias actually denied this. The idea probably has come from Shannon himself during a talk that he gave at MIT.

⁶ The names used in this class have been newly invented and are not used in literature like this.

At this point, one should also ask the question what happens if we design a code according to a wrong distribution. So assume that we design a Shannon-type code for the message U (with probability distribution P_U), but mistakenly assume that U has a distribution \tilde{P}_U, i.e., we choose codeword lengths

    l_i ≜ ⌈ \log(1/\tilde{p}_i) / \log D ⌉.    (3.116)

The performance of such a wrongly designed code satisfies the following bounds.
Theorem 3.37 (Wrongly Designed Shannon-Type Code). The performance of a Shannon-type code that is designed for the message \tilde{U}, but is used to describe the message U, satisfies the following bounds:

    H(U)/\log D + D(P_U ‖ \tilde{P}_U)/\log D ≤ E[L] < H(U)/\log D + D(P_U ‖ \tilde{P}_U)/\log D + 1.    (3.117)

So we see that the relative entropy tells us how much we lose by designing according to the wrong distribution.
Proof: From (3.116), we note that

    l_i = ⌈ \log(1/\tilde{p}_i) / \log D ⌉ < \log(1/\tilde{p}_i) / \log D + 1    (3.118)

and obtain

    E[L] = \sum_{i=1}^{r} p_i l_i    (3.119)
         < \sum_{i=1}^{r} p_i ( \log(1/\tilde{p}_i) / \log D + 1 )    (3.120)
         = (1/\log D) \sum_{i=1}^{r} p_i \log( (1/\tilde{p}_i) \cdot (p_i/p_i) ) + 1    (3.121)
         = (1/\log D) \sum_{i=1}^{r} p_i \log(p_i/\tilde{p}_i) + (1/\log D) \sum_{i=1}^{r} p_i \log(1/p_i) + 1    (3.122)
         = D(P_U ‖ \tilde{P}_U)/\log D + H(U)/\log D + 1.    (3.123)

For the lower bound, we bound

    l_i = ⌈ \log(1/\tilde{p}_i) / \log D ⌉ ≥ \log(1/\tilde{p}_i) / \log D.    (3.124)

The rest is then completely analogous to (3.119)-(3.123).
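A small Python check of Theorem 3.37 (not part of the original notes); the mismatched distribution below is an arbitrary choice, used only for illustration.

    from math import ceil, log2

    p  = [0.4, 0.3, 0.2, 0.1]     # true distribution P_U (Example 3.19)
    pt = [0.1, 0.2, 0.3, 0.4]     # wrongly assumed distribution (hypothetical)

    lengths = [ceil(log2(1 / q)) for q in pt]          # designed for pt, cf. (3.116)
    EL = sum(pi * li for pi, li in zip(p, lengths))    # but used on the true source

    H = -sum(pi * log2(pi) for pi in p)
    Dkl = sum(pi * log2(pi / qi) for pi, qi in zip(p, pt))

    print(H + Dkl <= EL < H + Dkl + 1)   # True: the bounds (3.117) hold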

3.9 Optimal Codes: Huffman Code

Shannon actually did not find the optimal code. He was working together with Prof. Robert M. Fano on the problem, but neither of them could solve it. So in 1951 Fano assigned the question to his students in an information theory class at MIT as a term paper. David A. Huffman, who had finished his B.S. and M.S. in electrical engineering and also served in the U.S. Navy before he became a Ph.D. student at MIT, attended this course. Huffman tried for a long time and was about to give up when he had the sudden inspiration to start building the tree backwards from the leaves to the root instead of from the root to the leaves, as everybody had been trying so far. Once he had understood this, he was quickly able to prove that his code was the most efficient one. Naturally, Huffman's term paper was later on published.

To understand Huffman's design we firstly recall the average length of a code:

    E[L] = \sum_{i=1}^{r} p_i l_i.    (3.125)

Then we make the following general observation.

Lemma 3.38. Without loss of generality, we arrange the probabilities p_i in nonincreasing order and consider an arbitrary code. If the lengths l_i of the codewords of this code are not in the opposite order, i.e., we do not have both

    p_1 ≥ p_2 ≥ p_3 ≥ ··· ≥ p_r    (3.126)

and

    l_1 ≤ l_2 ≤ l_3 ≤ ··· ≤ l_r,    (3.127)

then the code is not optimal and we can achieve a shorter average length by reassigning the codewords to different symbols.


Proof: To prove this claim, suppose that for some i and j with i < j we have both

    p_i > p_j   and   l_i > l_j.    (3.128)

In computing the average length, originally the sum in (3.125) contains, among others, the two terms

    old:   p_i l_i + p_j l_j.    (3.129)

By interchanging the codewords for the ith and jth symbols, we get the terms

    new:   p_i l_j + p_j l_i,    (3.130)

while the remaining terms are unchanged. Subtracting the old from the new we see that

    new − old:   (p_i l_j + p_j l_i) − (p_i l_i + p_j l_j) = p_i (l_j − l_i) + p_j (l_i − l_j)    (3.131)
                                                          = (p_i − p_j)(l_j − l_i)    (3.132)
                                                          < 0.    (3.133)

From (3.128) this is a negative number, i.e., we can decrease the average codeword length by interchanging the codewords for the ith and jth symbols. Hence the new code with exchanged codewords for the ith and jth symbols is better than the original code, which therefore cannot have been optimal.
To simplify matters, we develop the optimal Huffman code firstly for the binary case D = 2.

The clue of binary Huffman coding lies in two basic observations. The first observation is as follows.

Lemma 3.39. In a binary tree of an optimal binary prefix-free code, there is no unused leaf.

Proof: Suppose that the tree of an optimal code has an unused leaf. Then we can delete this leaf and move its sibling up to their parent node, see Figure 3.21. By doing so we reduce E[L], which contradicts our assumption that the original code was optimal.

Example 3.40. As an example consider the two codes given in Figure 3.22, both of which have three codewords. Code (i) has an average length⁷ of E[L] = 2, and code (ii) has an average length of E[L] = 1.5. Obviously, code (ii) performs better.

The second observation basically says that the two most unlikely symbols must have the longest codewords.

⁷ Remember Lemma 3.12 to compute the average codeword length: summing the node probabilities. In code (i) we have P_1 = 1 and P_2 = P_3 = 0.5 (note that the unused leaf has by definition zero probability), and in code (ii) P_1 = 1 and P_2 = 0.5.


[Figure 3.21: Code performance and unused leaves: By deleting the unused leaf and moving its sibling to their parent, we can improve on the code's performance.]

[Figure 3.22: Improving a code by removing an unused leaf: code (i) places the leaves of probabilities 0.5, 0.3, 0.2 at depth 2 together with an unused leaf, while code (ii) uses the same three leaves without an unused leaf.]

Lemma 3.41. There exists an optimal binary prefix-free code such that the two least likely codewords only differ in the last digit, i.e., the two most unlikely codewords are siblings.

Proof: Since we consider an optimal code, and because of Lemma 3.38, the codewords that correspond to the two least likely symbols must be the longest codewords. Now assume for a moment that these two longest codewords do not have the same length. In this case, the sibling node of the longest codeword must be unused, which is a contradiction to Lemma 3.39. Hence, we understand that the two least likely symbols must have the two longest codewords, and these are of identical length.

Now, if they have the same parent node, we are done. If they do not have the same parent node, this means that there exist other codewords of the same maximum length. In this case we can simply swap two codewords of equal maximum length in such a way that the two least likely codewords have the same parent, and we are done.
Because of Lemma 3.39 and the path length lemma (Lemma 3.12), we see that the construction of an optimal binary prefix-free code for an r-ary random message U is equivalent to constructing a binary tree with r leaves such that the sum of the probabilities of the nodes is minimum when the leaves are assigned the probabilities P_U(u_i) for i = 1, 2, ..., r:

    E[L] = \sum_{ℓ=1}^{N} P_ℓ    (minimize!)    (3.134)

But Lemma 3.41 tells us how we may choose one node in an optimal code tree, namely as parent of the two least likely leaves u_{r−1} and u_r:

    P_N = P_U(u_{r−1}) + P_U(u_r).    (3.135)

So we have fixed one P_ℓ in (3.134) already. But if we now pruned our binary tree at this node to make it a leaf with probability P_U(u_{r−1}) + P_U(u_r), it would become one of r − 1 leaves in a new tree. Completing the construction of the optimal code would then be equivalent to constructing a binary tree with these r − 1 leaves such that the sum of the probabilities of the nodes is minimum. Again Lemma 3.41 tells us how to choose one node in this new tree. Etc. We have thus proven the validity of the following algorithm.
Huffman's Algorithm for Optimal Binary Codes:

Step 1: Create r leaves corresponding to the r possible symbols and assign their probabilities p_1, ..., p_r. Mark these leaves as active.

Step 2: Create a new node that has the two least likely active leaves or nodes as children. Activate this new node and deactivate its children.

Step 3: If there is only one active node left, root it. Otherwise, go to Step 2.
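A compact Python sketch of this procedure is given below (not part of the original notes); it only tracks the codeword lengths (which determine E[L]), not the actual digit assignment, and uses a heap to find the two least likely active entries.

    import heapq

    def huffman_lengths(probs):
        """Binary Huffman procedure, tracking only codeword lengths: repeatedly
        merge the two least likely active entries (Steps 1-3 above)."""
        heap = [(p, [i]) for i, p in enumerate(probs)]   # (probability, leaf indices below)
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        while len(heap) > 1:
            p1, ids1 = heapq.heappop(heap)               # two least likely active entries
            p2, ids2 = heapq.heappop(heap)
            for i in ids1 + ids2:                        # every leaf below the new node
                lengths[i] += 1                          # gets one digit longer
            heapq.heappush(heap, (p1 + p2, ids1 + ids2))
        return lengths

    probs = [0.4, 0.3, 0.2, 0.1]                         # message of Example 3.19
    lengths = huffman_lengths(probs)
    print(lengths)                                        # [1, 2, 3, 3]
    print(sum(p * l for p, l in zip(probs, lengths)))     # approx. 1.9, as in (3.136)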
Example 3.42. In Figure 3.23 we show the procedure of producing a binary Huffman code for the random message U of Example 3.19 with four possible symbols with probabilities p_1 = 0.4, p_2 = 0.3, p_3 = 0.2, p_4 = 0.1. We see that the average codeword length of this Huffman code is

    E[L] = 0.4 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 3 = 1.9.    (3.136)

Using Lemma 3.12 this can be computed much more easily as

    E[L] = P_1 + P_2 + P_3 = 1 + 0.6 + 0.3 = 1.9.    (3.137)


[Figure 3.23: Creation of a binary Huffman code. Active nodes and leaves are shaded. The resulting codewords are 0, 10, 110, 111.]


Remark 3.43. Note that, similarly to the construction of the Fano code, the code design process is not unique in several respects. Firstly, the assignment of the 0 or 1 digits to the codewords at each forking stage is arbitrary, but this produces only trivial differences. Usually, we will stick to the convention that going upwards corresponds to 0 and downwards to 1. Secondly, when there are more than two least likely (active) nodes/leaves, it does not matter which we choose to combine. The resulting codes can have codewords of different lengths; however, in contrast to the case of the Fano code, the average codeword length will always be identical! The reason for this is clear: We have proven that the Huffman algorithm results in an optimal code!

Example 3.44. As an example of different Huffman encodings of the same random message, let p_1 = 0.4, p_2 = 0.2, p_3 = 0.2, p_4 = 0.1, p_5 = 0.1. Figure 3.24 shows three different Huffman codes for this message. All of them have the same performance E[L] = 2.2.

Exercise 3.45. Try to generate all three codes of Example 3.44 (see Figure 3.24) yourself.

Let's now turn to the general case with D ≥ 2. What happens, e.g., if D = 3? In Example 3.42 with four codewords, we note that there does not exist any ternary code with 4 leaves (see the leaf counting lemma, Lemma 3.8!). Hence, the ternary Huffman code for a random message with four symbols must have some unused leaves! Obviously, the number of unused leaves should be minimized as they are basically a waste. Moreover, a leaf should only be unused if it has the longest length among all leaves in the tree!⁸

So we realize that, because we are growing the optimal code tree from its leaves, we need to find out how many unused leaves are needed before we start! To do this, we recall the leaf counting lemma (Lemma 3.8): For the example of D = 3 and r = 4, we have

    n = 1 + N(D − 1) = 1 + 2N,    (3.138)

i.e., n = 3, 5, 7, 9, ... Hence, the smallest number of leaves that is not less than r = 4 is n = 5.

We make the following observation.

Lemma 3.46. There are at most D − 2 unused leaves in an optimal D-ary prefix-free code, and there exists an optimal code where all unused leaves have the same parent node.

Proof: The arguments are basically the same as in the binary case. If there are D − 1 or more unused leaves in an optimal tree, then they can be deleted, which will increase the efficiency of the code. See Figure 3.25 for an illustration.

⁸ Note that an unused leaf is equivalent to a message of zero probability: It never shows up! But from Lemma 3.38 we have understood that for messages with low probability we should assign long codewords.



[Figure 3.24: Different binary Huffman codes for the same random message: {0, 10, 110, 1110, 1111}, {0, 100, 101, 110, 111}, and {00, 01, 10, 110, 111}.]


[Figure 3.25: Illustration of cases when there are more than D − 2 unused leaves in a D-ary tree.]

The argument of the second statement is identical to the argument of Lemma 3.41.
Next we want to derive a formula that allows us to directly compute the number of unused leaves. Let R be the number of unused leaves. By Lemma 3.46 we know that 0 ≤ R < D − 1, and from the leaf counting lemma (Lemma 3.8) we get:

    R = (# of leaves in the D-ary tree) − (# of used leaves)    (3.139)
      = n − r    (3.140)
      = 1 + N(D − 1) − r    (3.141)
      = D + (N − 1)(D − 1) − r,    (3.142)

and hence

    D − r = −(N − 1)(D − 1) + R,    (3.143)

where 0 ≤ R < D − 1. This looks exactly like Euclid's division theorem when dividing D − r by D − 1! Hence, using R_x(y) to denote the remainder when y is divided by x, we finally get

    R = R_{D−1}(D − r)    (3.144)
      = R_{D−1}( D − r + (r − D)(D − 1) )    (3.145)
      = R_{D−1}( (r − D)(D − 2) ),    (3.146)

where the second equality follows because R_{D−1}( ν(D − 1) ) = 0 for all ν ∈ N.

Lemma 3.47. The number of unused leaves in the tree of an optimal D-ary prefix-free code is

    R = R_{D−1}( (r − D)(D − 2) ).    (3.147)
The final observation now follows the lines of Lemma 3.41.

Lemma 3.48. There exists an optimal D-ary prefix-free code for a random message U with r possible values such that the D − R least likely codewords differ only in their last digit. Here R is given in (3.147).


The path length lemma (Lemma 3.12) tells us that we need to minimize the sum of node probabilities, and Lemma 3.48 tells us how to choose one such optimal node, namely, a node that combines the D − R least likely codewords and the R unused leaves. If we prune the tree at this new node, Lemma 3.46 tells us that there will be no unused leaves in the remaining D-ary tree. So we continue to create nodes by combining the D least likely leaves in the remaining tree and then prune the tree at this node.

We have justified the following algorithm of designing an optimal code.

Huffman's Algorithm for Optimal D-ary Codes:

Step 1: Create r leaves corresponding to the r possible symbols and assign their probabilities p_1, ..., p_r. Compute

    R ≜ R_{D−1}( (r − D)(D − 2) ),    (3.148)

create R leaves, and assign the probability 0 to them. Mark all leaves as active.

Step 2: Create a new node that has the D least likely active leaves or nodes as children. Activate this new node and deactivate its children.

Step 3: If there is only one active node left, root it. Otherwise go to Step 2.
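The sketch given after the binary algorithm extends to the D-ary case by adding the R dummy leaves of (3.148) before merging D entries at a time. This is again only an illustrative Python sketch (not part of the original notes) that tracks codeword lengths.

    import heapq

    def huffman_lengths_Dary(probs, D=3):
        """D-ary Huffman procedure: add R dummy (unused) leaves of probability 0
        according to (3.148), then repeatedly merge the D least likely entries."""
        r = len(probs)
        R = ((r - D) * (D - 2)) % (D - 1)                 # number of unused leaves
        heap = [(p, [i]) for i, p in enumerate(probs)]
        heap += [(0.0, []) for _ in range(R)]             # dummy leaves
        heapq.heapify(heap)
        lengths = [0] * r
        while len(heap) > 1:
            merged_p, merged_ids = 0.0, []
            for _ in range(D):                            # D least likely active entries
                p, ids = heapq.heappop(heap)
                merged_p += p
                merged_ids += ids
            for i in merged_ids:
                lengths[i] += 1
            heapq.heappush(heap, (merged_p, merged_ids))
        return lengths

    probs = [0.4, 0.3, 0.2, 0.1]                          # message of Example 3.19
    lengths = huffman_lengths_Dary(probs, D=3)
    print(lengths)                                        # [1, 1, 2, 2]
    print(sum(p * l for p, l in zip(probs, lengths)))     # approx. 1.3, as in Example 3.49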
Example 3.49. Let's again go back to Example 3.19 and design an optimal ternary (D = 3) code. Note that r = 4, i.e.,

    R = R_2( (4 − 3)(3 − 2) ) = 1.    (3.149)

In Figure 3.26 we show the procedure of designing this Huffman code. We get E[L] = 1.3. Note that

    H(U)/\log_2 3 ≈ 1.16 ≤ E[L] < H(U)/\log_2 3 + 1 ≈ 2.16,    (3.150)

as expected.

As a warning to the reader we show in Figure 3.27 how Huffman's algorithm does not work: It is quite common to forget about the computation of the number of unused leaves in advance, and this will lead to a very inefficient code with the most efficient codeword not being used!

Example 3.50. We consider a random message U with seven symbols having probabilities

    p_1 = 0.4,  p_2 = 0.1,  p_3 = 0.1,  p_4 = 0.1,  p_5 = 0.1,  p_6 = 0.1,  p_7 = 0.1,    (3.151)


[Figure 3.26: Creation of a ternary Huffman code. Active nodes and leaves are shaded. The resulting codewords are 0, 1, 20, and 21 (one leaf remains unused).]

[Figure 3.27: An example of a wrong application of Huffman's algorithm, with codewords 0, 10, 11, 12. Here it was forgotten to compute the number of unused leaves. We get E[L] = 1.6. Note that one of the shortest (most valuable) codewords, 2, is not used!]



[Figure 3.28: Construction of a binary Fano code for Example 3.50, resulting in the codewords 00, 01, 100, 101, 1100, 1101, 111.]

i.e., H(U) ≈ 2.52 bits. We firstly design a binary Fano code, see Figure 3.28. The corresponding tree is shown in Figure 3.29. Note that the construction algorithm is not unique in this case: In the second round we could split the second group either into {3, 4} and {5, 6, 7} or into {3, 4, 5} and {6, 7}. Both ways will result in a code of identical performance. The same situation also occurs in the third round.

The efficiency of this Fano code is

    E[L] = 1 + 0.5 + 0.5 + 0.2 + 0.3 + 0.2 = 2.7,    (3.152)

which satisfies, as predicted,

    2.52 ≤ 2.7 < 3.52.    (3.153)

Next, we design a binary Shannon code for the same random message. The details are given in Table 3.30. Its performance is not as good:

    E[L] = 0.4 \cdot 2 + 0.6 \cdot 4 = 3.2,    (3.154)

but it must, of course, still satisfy

    2.52 ≤ 3.2 < 3.52.    (3.155)

Finally, a corresponding binary Huffman code for U is shown in Figure 3.31. Its performance is E[L] = 2.6, i.e., it is better than both the Fano and the Shannon code, but it still holds that

    2.52 ≤ 2.6 < 3.52.    (3.156)


[Figure 3.29: A Fano code for the message U of Example 3.50.]


Exercise 3.51. Design binary and ternary Huffman, Shannon, and Fano codes for the random message U with probabilities

    p_1 = 0.25,  p_2 = 0.2,  p_3 = 0.2,  p_4 = 0.1,  p_5 = 0.1,  p_6 = 0.1,  p_7 = 0.05,    (3.157)

and compare their performances.

3.10 Types of Codes

Note that in Section 3.1 we have restricted ourselves to prefix-free codes. So, up to now we have only proven that Huffman codes are the optimal codes under the assumption that we restrict ourselves to prefix-free codes. We would now like to show that Huffman codes are actually optimal among all useful codes.

To that goal we need to come back to a more precise definition of useful codes, i.e., we continue the discussion that we have started in Section 3.1.

Table 3.30: The binary Shannon code for the random message U of Example 3.50.

                                  u_1    u_2         u_3     u_4         u_5         u_6         u_7
    p_i                           0.4    0.1         0.1     0.1         0.1         0.1         0.1
    F_i                           0      0.4         0.5     0.6         0.7         0.8         0.9
    binary representation         0.0    0.01100...  0.1     0.10011...  0.10110...  0.11001...  0.11100...
    ⌈log_2(1/p_i)⌉                2      4           4       4           4           4           4
    shortened representation      0.00   0.0110      0.1000  0.1001      0.1011      0.1100      0.1110
    c_i                           00     0110        1000    1001        1011        1100        1110
[Figure 3.31: A binary Huffman code for the message U of Example 3.50: the symbol with probability 0.4 receives a codeword of length 1, the remaining six symbols receive the codewords 100, 101, 1100, 1101, 1110, and 1111.]


Table 3.32: Various codes for a random message with four possible values.

    U    Code (i)    Code (ii)    Code (iii)    Code (iv)
    a    0           0            10            0
    b    0           010          00            10
    c    1           01           11            110
    d    1           10           110           111

Let us consider an example with a random message U with four different symbols, and let us design various codes for this message, as shown in Table 3.32. We discuss these different codes:
We discuss these dierent codes:
Code (i) is useless because some codewords are used for more than one symbol. Such a code is called singular.

Code (ii) is nonsingular. But we have another problem. If we receive 010, we have three different possibilities how to decode it: It could be (010), giving us b; or it could be (0)(10), leading to ad; or it could be (01)(0), corresponding to ca. Even though nonsingular, this code is not uniquely decodable and therefore in practice similarly useless as code (i).⁹

Code (iii) is uniquely decodable, even though it is not prefix-free! To see this, note that in order to distinguish between c and d we only need to wait for the next 1 to show up: If the number of zeros in between is even, we decode 11, otherwise we decode 110. Example:

    11000010 = (11)(00)(00)(10) = cbba,    (3.158)
    110000010 = (110)(00)(00)(10) = dbba.    (3.159)

So in a uniquely decodable, but not prefix-free code, we may have to delay the decoding until later.

Code (iv) is prefix-free and therefore trivially uniquely decodable.

We see that the set of all possible codes can be grouped as shown in Figure 3.33. We are only interested in the uniquely decodable codes. But so far we have restricted ourselves to prefix-free codes. So the following question arises: Is there a uniquely decodable code that is not prefix-free, but that has a better performance than the best prefix-free code (i.e., the corresponding Huffman code)?

Luckily the answer to this question is no, i.e., the Huffman codes are the best uniquely decodable codes. This can be seen from the following theorem.

⁹ Note that adding a comma between the codewords is not allowed because in this case we change the code to be ternary, i.e., the codewords contain three different letters 0, 1, and ",", instead of only two, 0 and 1.



[Figure 3.33: Set of all codes: the nonsingular codes form a subset of all codes, the uniquely decodable codes a subset of the nonsingular codes, and the prefix-free codes a subset of the uniquely decodable codes.]

Theorem 3.52 (McMillan's Theorem). The codeword lengths l_i of any uniquely decodable code must satisfy the Kraft inequality

    \sum_{i=1}^{r} D^{-l_i} ≤ 1.    (3.160)

Why does this help answering our question about the most efficient uniquely decodable code? Well, note that we know from Theorem 3.9 that every prefix-free code also satisfies (3.160). So, for any uniquely decodable, but non-prefix-free code with given codeword lengths, one can find another code with the same codeword lengths that is prefix-free! But if the codeword lengths are the same, also the performance is identical! Hence, there is no gain in designing a non-prefix-free code.
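As a small illustration (not part of the original notes), the Python snippet below evaluates the Kraft sum (3.160) for code (iii) (uniquely decodable but not prefix-free) and code (iv) (prefix-free) of Table 3.32; both satisfy McMillan's theorem.

    def kraft_sum(codewords, D=2):
        """Kraft sum of a set of codewords, cf. (3.160)."""
        return sum(D ** (-len(c)) for c in codewords)

    code_iii = ["10", "00", "11", "110"]   # uniquely decodable, not prefix-free
    code_iv  = ["0", "10", "110", "111"]   # prefix-free

    print(kraft_sum(code_iii))   # 0.875 <= 1, as McMillan's theorem requires
    print(kraft_sum(code_iv))    # 1.0   <= 1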
Proof of Theorem 3.52: Suppose we are given a random message U that takes on r possible values u ∈ U (here the set U denotes the message alphabet). Suppose further that we have a uniquely decodable code that assigns to every possible symbol u ∈ U a certain codeword of length l(u).

Now choose an arbitrary positive integer ν and design a new code for a vector of ν symbols u = (u_1, u_2, ..., u_ν) ∈ U^ν = U × ··· × U by simply concatenating the original codewords.

Example 3.53. Consider a binary message with the possible values u = a or u = b (U = {a, b}). We design a code for this message with the two codewords {01, 1011}. Choose ν = 3, i.e., now we have eight possible symbols

    {aaa, aab, aba, abb, baa, bab, bba, bbb}    (3.161)

and the corresponding codewords are

    {010101, 01011011, 01101101, 0110111011, 10110101, 1011011011, 1011101101, 101110111011}.    (3.162)

The key observation now is that, because the original code was uniquely decodable, it immediately follows that this new concatenated code also must be uniquely decodable.

Exercise 3.54. Why? Explain this key observation!

The lengths of the new codewords are

    \tilde{l}(u) = \sum_{j=1}^{ν} l(u_j).    (3.163)

Let l_max be the maximal codeword length of the original code. Then the new code has a maximal codeword length \tilde{l}_max satisfying

    \tilde{l}_max = ν l_max.    (3.164)

We now compute:

    ( \sum_{u∈U} D^{-l(u)} )^ν = \sum_{u_1∈U} D^{-l(u_1)} \sum_{u_2∈U} D^{-l(u_2)} ··· \sum_{u_ν∈U} D^{-l(u_ν)}    (3.165)
        = \sum_{u_1∈U} \sum_{u_2∈U} ··· \sum_{u_ν∈U} D^{-l(u_1)} D^{-l(u_2)} ··· D^{-l(u_ν)}    (3.166)
        = \sum_{u∈U^ν} D^{-l(u_1)-l(u_2)-···-l(u_ν)}    (3.167)
        = \sum_{u∈U^ν} D^{-\sum_{j=1}^{ν} l(u_j)}    (3.168)
        = \sum_{u∈U^ν} D^{-\tilde{l}(u)}    (3.169)
        = \sum_{m=1}^{\tilde{l}_max} w(m) D^{-m}    (3.170)
        = \sum_{m=1}^{ν l_max} w(m) D^{-m}.    (3.171)

Here (3.166) follows by writing the exponentiated sum as a product of sums; in (3.167) we combine the sums over u_1, ..., u_ν into one huge sum over the ν-vector u; (3.169) follows from (3.163); in (3.170) we rearrange the order of the terms by collecting all terms with the same exponent together, where w(m) counts the number of such terms with equal exponent, i.e., w(m) denotes the number of codewords of length m in the new code; and in the final equality (3.171) we use (3.164).

Now note that since the new concatenated code is uniquely decodable, every codeword of length m is used at most once. In total there are only D^m different sequences of length m, i.e., we know that

    w(m) ≤ D^m.    (3.172)

Thus,

    ( \sum_{u∈U} D^{-l(u)} )^ν = \sum_{m=1}^{ν l_max} w(m) D^{-m} ≤ \sum_{m=1}^{ν l_max} D^m D^{-m} = ν l_max,    (3.173)

or

    \sum_{u∈U} D^{-l(u)} ≤ (ν l_max)^{1/ν}.    (3.174)

At this stage we are back at an expression that depends only on the original uniquely decodable code. So forget about the trick with the new concatenated code, but simply note that we have shown that for any uniquely decodable code and any positive integer ν, expression (3.174) must hold! Note that we can choose ν freely here!

Note further that for any finite value of l_max we have

    \lim_{ν→∞} (ν l_max)^{1/ν} = \lim_{ν→∞} e^{(1/ν) \log(ν l_max)} = e^0 = 1.    (3.175)

Hence, by choosing ν extremely large (i.e., we let ν tend to infinity) we have

    \sum_{u∈U} D^{-l(u)} ≤ 1,    (3.176)

as we wanted to prove.

Since we have now proven that there cannot exist any uniquely decodable code with a better performance than some prefix-free code, it now also follows that the converse of the coding theorem for a single random message (Theorem 3.35), i.e., the lower bound in (3.115), holds for any uniquely decodable code.

3.A Appendix: Alternative Proof for the Converse Part of the Coding Theorem for a Single Random Message

Consider the D-ary tree of any D-ary prefix-free code for an r-ary random message U. Let w_1, ..., w_n be the leaves of this tree, ordered in such a way that w_i is the leaf corresponding to the message u_i, i = 1, ..., r, and w_{r+1}, ..., w_n are the unused leaves.

Now define two new probability distributions:

    P_W(w_i) = { P_U(u_i) = p_i,   i = 1, ..., r;
                 0,                i = r + 1, ..., n }    (3.177)

and

    \tilde{P}_W(w_i) = D^{-l_i},   i = 1, ..., n,    (3.178)

where l_i denotes the depth of leaf w_i. Note that by the leaf depth lemma (Lemma 3.8)

    \sum_{i=1}^{n} \tilde{P}_W(w_i) = \sum_{i=1}^{n} D^{-l_i} = 1.    (3.179)
Now compute the relative entropy between these two distributions and use the fact that the relative entropy is always nonnegative:

    0 ≤ D(P_W ‖ \tilde{P}_W) = \sum_{i=1}^{r} P_W(w_i) \log( P_W(w_i) / \tilde{P}_W(w_i) )    (3.180)
      = \sum_{i=1}^{r} P_W(w_i) \log P_W(w_i) − \sum_{i=1}^{r} P_W(w_i) \log D^{-l_i}    (3.181)
      = \sum_{i=1}^{r} p_i \log p_i + \log D \sum_{i=1}^{r} p_i l_i    (3.182)
      = −H(U) + \log D \cdot E[L].    (3.183)

Hence,

    E[L] ≥ H(U) / \log D.    (3.184)


Chapter 4

Data Compression: Ecient


Coding of an Information
Source
So far we have only considered a single random message, but in reality we are much more likely to encounter a situation where we have a stream of messages that should be encoded continuously. Luckily we have prepared ourselves for this situation already by considering prefix-free codes only, which make sure that a sequence of codewords can be separated easily into the individual codewords.

4.1 A Discrete Memoryless Source

In the following we will consider only the simplest case where the random source is memoryless, i.e., each symbol that is emitted by the source is independent of all past symbols. A formal definition is given as follows.

Definition 4.1. An r-ary discrete memoryless source (DMS) is a device whose output is a sequence of random messages $U_1, U_2, U_3, \ldots$, where
• each $U_k$ can take on $r$ different values with probabilities $p_1, \ldots, p_r$, and
• the different messages $U_k$ are independent of each other.

Hence, a DMS produces a sequence of independent and identically distributed (IID) r-ary random variables.
Note that the DMS is memoryless in the sense that $U_k$ does not depend on the past and also that the distribution of $U_k$ does not change with time.
The obvious way of designing a compression system for such a source is to design a Huffman code for $U$, continuously use it for each message $U_k$, and concatenate the codewords together. The receiver can easily separate the codewords (because the Huffman code is prefix-free) and decode them to recover the sequence of messages $\{U_k\}$.


However, the question is whether this is the most efficient approach. Note that it is also possible to combine two or more messages into a new vector message
$$(U_k, U_{k+1}, \ldots, U_{k+m-1})$$
and then design a Huffman code for these combined vector messages. We will show below that this approach is actually more efficient.
Figure 4.1: A coding scheme for an information source: The r-ary DMS emits source symbols $U_k$; the source parser groups the source output sequence $\{U_k\}$ into messages $\{\mathbf{V}_k\}$ of length $M$; the message encoder then assigns a codeword $\mathbf{C}_k$ of length $L$ to each possible message $\mathbf{V}_k$.
So we consider an extended compression system as shown in Figure 4.1, where before the message encoder we have added a source parser. A source parser is a device that groups several incoming source letters $U_k$ together to a new message $\mathbf{V}_k$. This grouping must be an invertible function in order to make sure that the message can be split back into its components; however, it is possible that the length $M$ of the message $\mathbf{V}_k$ (i.e., the number of letters that are combined to the message $\mathbf{V}_k$) is random: We can easily make it depend on the value of the (random) source output.
Before we consider such variable-length parsers, in the next section we start with the simplest parser that produces messages of constant length: the M-block parser that simply always groups exactly $M$ source letters together.
As an example assume $M = 3$:
$$\underbrace{U_1, U_2, U_3}_{\mathbf{V}_1},\; \underbrace{U_4, U_5, U_6}_{\mathbf{V}_2},\; \underbrace{U_7, U_8, U_9}_{\mathbf{V}_3},\; \ldots \qquad (4.1)$$

Note that since {Uk } are IID, also {Vk } are IID. So we can drop the time
index and only consider one such block message. Note that if we choose
M = 1, we are back at the situation without source parser.
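As a small illustration of the M-block parser, here is a minimal Python sketch (the function name and the use of characters for source letters are our own choices for illustration, not part of the notes):

```python
# A minimal sketch of an M-block parser: group the source output into
# blocks of exactly M letters (here M = 3, as in (4.1)).
def block_parse(source_sequence, M):
    """Split the list of source letters into messages of length M."""
    return [tuple(source_sequence[i:i + M])
            for i in range(0, len(source_sequence) - M + 1, M)]

u = ['a', 'b', 'b', 'c', 'a', 'a', 'b', 'c', 'c']
print(block_parse(u, 3))   # -> [('a', 'b', 'b'), ('c', 'a', 'a'), ('b', 'c', 'c')]
```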

4.2 Block-to-Variable-Length Coding of a DMS

Consider an M-block parser $\mathbf{V} = (U_1, U_2, \ldots, U_M)$ with $M$ fixed. Then the random message $\mathbf{V}$ takes on $r^M$ different values and has the following entropy:
$$H(\mathbf{V}) = H(U_1, U_2, \ldots, U_M) \qquad (4.2)$$
$$= H(U_1) + H(U_2|U_1) + \cdots + H(U_M|U_1, U_2, \ldots, U_{M-1}) \qquad (4.3)$$
$$= H(U_1) + H(U_2) + \cdots + H(U_M) \qquad (4.4)$$
$$= M \cdot H(U), \qquad (4.5)$$
where (4.3) follows from the chain rule; (4.4) from the first I in IID; and (4.5) from the second I and the D in IID.
Now let's use an optimal code (i.e., a Huffman code¹) or at least a good code (e.g., a Shannon or Fano code) for the message $\mathbf{V}$. From the coding theorem for a single random message (Theorem 3.35) we know that such a code satisfies the following inequality:
$$\frac{H(\mathbf{V})}{\log D} \le \mathrm{E}[L] < \frac{H(\mathbf{V})}{\log D} + 1. \qquad (4.6)$$

Next we note that it is not really fair to compare $\mathrm{E}[L]$ for different values of $M$ because for larger $M$, $\mathrm{E}[L]$ also will be larger. So, to be correct we should compute the average codeword length necessary to describe one source symbol. Since $\mathbf{V}$ contains $M$ source symbols $U_k$, the correct measure of performance is $\frac{\mathrm{E}[L]}{M}$.
Hence, we divide the whole expression (4.6) by $M$:
$$\frac{H(\mathbf{V})}{M\log D} \le \frac{\mathrm{E}[L]}{M} < \frac{H(\mathbf{V})}{M\log D} + \frac{1}{M} \qquad (4.7)$$
and make use of (4.5):
$$\frac{M\, H(U)}{M\log D} \le \frac{\mathrm{E}[L]}{M} < \frac{M\, H(U)}{M\log D} + \frac{1}{M}, \qquad (4.8)$$
i.e.,
$$\frac{H(U)}{\log D} \le \frac{\mathrm{E}[L]}{M} < \frac{H(U)}{\log D} + \frac{1}{M}. \qquad (4.9)$$

We have proven the following important result, also known as Shannon's source coding theorem.
Theorem 4.2 (Block-to-Variable-Length Coding Theorem for a DMS).
There exists a D-ary prefix-free code of an M-block message from a DMS such that the average number of D-ary code digits per source letter satisfies
$$\frac{\mathrm{E}[L]}{M} < \frac{H(U)}{\log D} + \frac{1}{M}, \qquad (4.10)$$
where $H(U)$ is the entropy of a single source letter.
Conversely, for every uniquely decodable D-ary code of an M-block message it must be true that
$$\frac{\mathrm{E}[L]}{M} \ge \frac{H(U)}{\log D}. \qquad (4.11)$$

¹Note that since we now design the Huffman code for the M-block message $\mathbf{V}$, in Huffman's algorithm we have to replace $r$ by $r^M$; see, e.g., (3.148).


We would like to briefly discuss this result. The main point to note here is that by choosing $M$ large enough, we can approach the lower bound $\frac{H(U)}{\log D}$ arbitrarily closely when using a Huffman, a Shannon-type or a Fano code. Hence, the entropy $H(U)$ is the ultimate limit of compression and precisely describes the amount of information that is packed in the output of the discrete memoryless source $\{U_k\}$! In other words we can compress any DMS to $H(U)$ bits (entropy measured in bits) or $\frac{H(U)}{\log D}$ D-ary code letters on average, but not less. This is the first real justification of the usefulness of the definition of entropy.
We also see that in the end it does not make a big difference whether we use a Huffman code or any of the suboptimal good codes (Shannon-type, Shannon, or Fano codes), as all approach the ultimate limit for $M$ large enough.
On the other hand, note the price we have to pay: By making $M$ large we not only increase the number of possible messages and thereby make the code complicated, but we also introduce delay in the system, because the encoder can only encode the message after it has received the complete block of $M$ source symbols! Basically, the more closely we want to approach the ultimate limit of entropy, the larger is our potential delay in the system.
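To get a feeling for how quickly the gap of $1/M$ in (4.10) closes, the following small Python sketch (our own illustration, not part of the notes) evaluates the lower bound $H(U)/\log D$ and the upper bound $H(U)/\log D + 1/M$ for a binary code and a few blocklengths, using the ternary DMS with probabilities 0.5, 0.3, 0.2 that will reappear in Example 4.3:

```python
from math import log2

# Evaluate both sides of Theorem 4.2 for D = 2 (so log D = 1 bit) and the
# DMS with P_U = (0.5, 0.3, 0.2).
p = [0.5, 0.3, 0.2]
H = -sum(pi * log2(pi) for pi in p)       # entropy of one source letter in bits
for M in [1, 2, 5, 10, 100]:
    print(f"M = {M:3d}:  {H:.4f} <= E[L]/M < {H + 1/M:.4f}")
```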

4.3 Arithmetic Coding

A very elegant and efficient block-to-variable-length coding scheme is the so-called arithmetic coding. Basically it is a generalization of the Shannon code to a DMS.

4.3.1 Introduction

Recall that a Shannon code will firstly order all possible messages according to probability and then generate the codeword by truncating the D-ary representation of the cumulative probability $F_j \triangleq \sum_{j'=1}^{j-1} p_{j'}$. The big problem with this approach is that in order to arrange the messages according to probabilities and to compute $F_j$, one needs to compute the probability of every message. This is infeasible for a large message blocklength $M \gg 1$. Imagine for a moment a ternary ($r = 3$) DMS $\{U_k\}$ taking value in $\{a, b, c\}$ whose output is grouped into blocks of length $M = 1000$. This will result in $3^{1000} \approx 10^{477}$ different sequences that need to be sorted according to their probabilities! No computer, however large, is capable of doing this.
To avoid this problem, in arithmetic coding we do not order the sequences anymore according to their probabilities, but according to the alphabet. For example, for the ternary source $\{U_k\}$ mentioned above, we have the following order:
$$\mathbf{u}_1 = aa\ldots aaa,\quad \mathbf{u}_2 = aa\ldots aab,\quad \mathbf{u}_3 = aa\ldots aac,\quad \mathbf{u}_4 = aa\ldots aba,\quad \mathbf{u}_5 = aa\ldots abb,$$
$$\vdots$$
$$\mathbf{u}_{3^M-2} = cc\ldots cca,\quad \mathbf{u}_{3^M-1} = cc\ldots ccb,\quad \mathbf{u}_{3^M} = cc\ldots ccc.$$

The cumulative probability $F_{\mathbf{u}_j}$ will now be computed according to this alphabetical order:
$$F_{\mathbf{u}_j} \triangleq \sum_{j'=1}^{j-1} p_{\mathbf{u}_{j'}}, \qquad j = 1, \ldots, r^M, \qquad (4.12)$$
where $p_{\mathbf{u}_j}$ is the probability of the output sequence $\mathbf{u}_j$:
$$p_{\mathbf{u}_j} \triangleq \prod_{k=1}^{M} P_U(u_{j,k}), \qquad j = 1, \ldots, r^M. \qquad (4.13)$$

So far this does not seem to help because it still looks as if we need to compute $p_{\mathbf{u}_j}$ for all $\mathbf{u}_j$. However, this is not the case: It turns out that because of the alphabetical ordering, for a given output sequence $\mathbf{u}$ one can compute $F_{\mathbf{u}}$ iteratively over the vector $\mathbf{u}$ without having to consider any other sequences at all! To understand this, consider again our ternary example from above, consider $\mathbf{u} = bbac$ and assume that we have computed $F_{bba}$ already. It is now easy to see that we can derive $F_{bbac}$ directly from $F_{bba}$ and $p_{bba}$:
$$F_{bbac} = \Pr(\{aaaa, \ldots, bbab\}) \qquad (4.14)$$
$$= \Pr(\text{all length-4 seq.\ alphabetically before } bbaa) + \Pr(\{bbaa, bbab\}) \qquad (4.15)$$
$$= (p_{aaaa} + \cdots + p_{aaac}) + (p_{aaba} + \cdots + p_{aabc}) + \cdots + (p_{baca} + \cdots + p_{bacc}) + \Pr(\{bbaa, bbab\}) \qquad (4.16)$$
$$= p_{aaa}\underbrace{(p_a + \cdots + p_c)}_{=1} + p_{aab}\underbrace{(p_a + \cdots + p_c)}_{=1} + \cdots + p_{bac}\underbrace{(p_a + \cdots + p_c)}_{=1} + \Pr(\{bbaa, bbab\}) \qquad (4.17)$$
$$= \underbrace{p_{aaa} + p_{aab} + \cdots + p_{bac}}_{=\,F_{bba}} + \Pr(\{bbaa, bbab\}) \qquad (4.18)$$
$$= F_{bba} + p_{bba}\underbrace{(p_a + p_b)}_{=\,F_c} \qquad (4.19)$$
$$= F_{bba} + p_{bba}\, F_c. \qquad (4.20)$$

Hence, we see that we can compute the cumulative probability directly by iteratively updating the values of $F$ and $p$ with the current output of the DMS: For every output $u_k$ ($k = 1, \ldots, M$) of the DMS we compute
$$p_{u_1,\ldots,u_k} = p_{u_1,\ldots,u_{k-1}} \cdot p_{u_k}, \qquad (4.21)$$
$$F_{u_1,\ldots,u_k} = F_{u_1,\ldots,u_{k-1}} + p_{u_1,\ldots,u_{k-1}} \cdot F_{u_k}. \qquad (4.22)$$

Unfortunately, if we now use the D-ary representation of $F_{\mathbf{u}}$ and truncate it to $l_{\mathbf{u}} = \bigl\lceil \log_D \frac{1}{p_{\mathbf{u}}} \bigr\rceil$ to get the codeword for $\mathbf{u}$, then the code will not be prefix-free anymore! (This happens because we no longer have the property that $l_{\mathbf{u}_j}$ is increasing with $j$, since the corresponding $p_{\mathbf{u}_j}$ are not sorted!)
To solve this problem, the second main idea of arithmetic coding is the insight that we need to make the codewords slightly larger:
$$\tilde{l}_{\mathbf{u}} \triangleq \left\lceil \log_D \frac{1}{p_{\mathbf{u}}} \right\rceil + 1 \qquad (4.23)$$
and that instead of $F_{\mathbf{u}}$ we need to use the D-ary representation of
$$\tilde{F}_{\mathbf{u}} \triangleq F_{\mathbf{u}} + D^{-\tilde{l}_{\mathbf{u}}}. \qquad (4.24)$$
Hence, in arithmetic coding the codeword is given as the D-ary representation of $\tilde{F}_{\mathbf{u}}$ truncated to length $\tilde{l}_{\mathbf{u}}$.
So we see that by the sacrifice of one additional codeword digit we can get a very efficient algorithm that can compress the outcome of a memoryless source on the fly.
Before we summarize this algorithm, prove that it really generates a prefix-free code, and compute its efficiency, let's go through a small example.
Example 4.3. Let's again assume a ternary DMS $\{U_k\}$ that emits an independent and identically distributed (IID) sequence of ternary random messages $U_k$ with the following probabilities:
$$P_U(a) = 0.5, \qquad P_U(b) = 0.3, \qquad P_U(c) = 0.2. \qquad (4.25)$$

Now let's compute the binary ($D = 2$) codeword of the length-10 ($M = 10$) output sequence $\mathbf{u} = baabcabbba$: For the original Shannon code we would have to compute the probabilities of each of the $3^{10} = 59049$ different possible output sequences and then order them according to their probabilities. This needs a tremendous amount of time and storage space.
For arithmetic coding, we do not order the sequences and do not need to bother with the probability of any other sequence, but we can directly compute the probability $p_{baabcabbba}$ and the cumulative probability $F_{baabcabbba}$ of the given sequence recursively as follows:
$$p_b = P_U(b) = 0.3, \qquad (4.26)$$
$$p_{ba} = p_b \cdot P_U(a) = 0.3 \cdot 0.5 = 0.15, \qquad (4.27)$$
$$p_{baa} = p_{ba} \cdot P_U(a) = 0.15 \cdot 0.5 = 0.075, \qquad (4.28)$$
$$p_{baab} = p_{baa} \cdot P_U(b) = 0.075 \cdot 0.3 = 0.0225, \qquad (4.29)$$
$$p_{baabc} = p_{baab} \cdot P_U(c) = 0.0225 \cdot 0.2 = 0.0045, \qquad (4.30)$$
$$p_{baabca} = p_{baabc} \cdot P_U(a) = 0.0045 \cdot 0.5 = 0.00225, \qquad (4.31)$$
$$p_{baabcab} = p_{baabca} \cdot P_U(b) = 0.00225 \cdot 0.3 = 0.000675, \qquad (4.32)$$
$$p_{baabcabb} = p_{baabcab} \cdot P_U(b) = 0.000675 \cdot 0.3 = 0.0002025, \qquad (4.33)$$
$$p_{baabcabbb} = p_{baabcabb} \cdot P_U(b) = 0.0002025 \cdot 0.3 = 0.00006075, \qquad (4.34)$$
$$p_{baabcabbba} = p_{baabcabbb} \cdot P_U(a) = 0.00006075 \cdot 0.5 = 0.000030375, \qquad (4.35)$$

and
$$F_b = f_b = 0.5, \qquad (4.36)$$
$$F_{ba} = F_b + p_b f_a = 0.5 + 0.3 \cdot 0 = 0.5, \qquad (4.37)$$
$$F_{baa} = F_{ba} + p_{ba} f_a = 0.5 + 0.15 \cdot 0 = 0.5, \qquad (4.38)$$
$$F_{baab} = F_{baa} + p_{baa} f_b = 0.5 + 0.075 \cdot 0.5 = 0.5375, \qquad (4.39)$$
$$F_{baabc} = F_{baab} + p_{baab} f_c = 0.5375 + 0.0225 \cdot 0.8 = 0.5555, \qquad (4.40)$$
$$F_{baabca} = F_{baabc} + p_{baabc} f_a = 0.5555 + 0.0045 \cdot 0 = 0.5555, \qquad (4.41)$$
$$F_{baabcab} = F_{baabca} + p_{baabca} f_b = 0.5555 + 0.00225 \cdot 0.5 = 0.556625, \qquad (4.42)$$
$$F_{baabcabb} = F_{baabcab} + p_{baabcab} f_b = 0.556625 + 0.000675 \cdot 0.5 = 0.5569625, \qquad (4.43)$$
$$F_{baabcabbb} = F_{baabcabb} + p_{baabcabb} f_b = 0.5569625 + 0.0002025 \cdot 0.5 = 0.55706375, \qquad (4.44)$$
$$F_{baabcabbba} = F_{baabcabbb} + p_{baabcabbb} f_a = 0.55706375 + 0.00006075 \cdot 0 = 0.55706375. \qquad (4.45)$$

Here instead of $F_a$, $F_b$, and $F_c$ we have introduced $f_a \triangleq 0$, $f_b \triangleq p_a = 0.5$, and $f_c \triangleq p_a + p_b = 0.8$ to distinguish them better from the $F_{\mathbf{u}}$ that is computed recursively.
Now we only need to compute the required length
$$\tilde{l}_{baabcabbba} = \left\lceil \log_2 \frac{1}{p_{baabcabbba}} \right\rceil + 1 = \left\lceil \log_2 \frac{1}{0.000030375} \right\rceil + 1 = 17, \qquad (4.46)$$
find the binary representation of the adapted cumulative probability
$$\tilde{F}_{baabcabbba} = 0.55706375 + 2^{-17} = 0.55707137939453125, \qquad (4.47)$$

and shorten it to length 17. Note that it is actually easier to change the order: Firstly find the binary representation of $F_{baabcabbba} = 0.55706375$, shorten it to 17 digits, and then add $2^{-17}$:
$$(0.55706375)_{10} + 2^{-17} \approx (0.10001110100110111\ldots)_2 + 2^{-17} \qquad (4.48)$$
$$= (0.10001110100111000\ldots)_2. \qquad (4.49)$$
This results in the codeword 10001110100111000.

4.3.2 Encoding

We summarize the encoding procedure of arithmetic coding in the following algorithmic form.
D-ary Arithmetic Encoding of a DMS:

Step 1: Fix a blocklength $M$ and decide on an alphabetical order on the source alphabet $\mathcal{U} = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$. Then define
$$f_{\alpha_i} \triangleq \sum_{i'=1}^{i-1} P_U(\alpha_{i'}), \qquad i = 1, \ldots, r, \qquad (4.50)$$
and set $k \leftarrow 1$, $p_0 \leftarrow 1$, $F_0 \leftarrow 0$.

Step 2: Observe the k-th source output letter $u_k$ of the DMS and compute
$$p_k \leftarrow p_{k-1} \cdot P_U(u_k), \qquad (4.51)$$
$$F_k \leftarrow F_{k-1} + p_{k-1} \cdot f_{u_k}. \qquad (4.52)$$

Step 3: If $k < M$, set $k \leftarrow k+1$ and go to Step 2. Otherwise, compute
$$\tilde{l} \triangleq \left\lceil \log_D \frac{1}{p_M} \right\rceil + 1, \qquad (4.53)$$
compute the D-ary representation of $F_M$, shorten it to length $\tilde{l}$, and add $D^{-\tilde{l}}$ to it. Then output the D-ary sequence after the comma as codeword.
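The following is a minimal Python sketch of Steps 1 to 3 (function and variable names are our own choices, not part of the notes). It assumes the source letters are given as characters and the probabilities as exact decimal strings, and it follows the remark after (4.47): first truncate the D-ary representation of $F_M$ to length $\tilde{l}$ and then add $D^{-\tilde{l}}$.

```python
from fractions import Fraction
from math import ceil, log

def arithmetic_encode(u, pmf, order, D=2):
    """Encode the source block u (a string of source letters)."""
    P = {a: Fraction(p) for a, p in pmf.items()}
    # f[a] = sum of the probabilities of all letters alphabetically before a, cf. (4.50)
    f, acc = {}, Fraction(0)
    for a in order:
        f[a] = acc
        acc += P[a]
    p, F = Fraction(1), Fraction(0)            # p_0 = 1, F_0 = 0
    for uk in u:                               # recursions (4.51) and (4.52)
        F = F + p * f[uk]
        p = p * P[uk]
    # codeword length (4.53); for a sketch we ignore floating-point edge
    # cases of the logarithm
    l = ceil(log(1 / p, D)) + 1
    # D-ary representation of F_M, truncated to l digits
    digits, x = [], F
    for _ in range(l):
        x *= D
        d = int(x)
        digits.append(d)
        x -= d
    # add D^{-l}, i.e., add 1 to the last digit with carry, cf. (4.24)
    i = l - 1
    digits[i] += 1
    while digits[i] == D and i > 0:
        digits[i] = 0
        digits[i - 1] += 1
        i -= 1
    return ''.join(str(d) for d in digits)

pmf = {'a': '0.5', 'b': '0.3', 'c': '0.2'}
print(arithmetic_encode('baabcabbba', pmf, order='abc'))
# -> 10001110100111000, the codeword found in Example 4.3
```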

Theorem 4.4 (Prefix-Free Property of Arithmetic Codes). The arithmetic code as defined by the encoding algorithm given in (4.50)-(4.53) is prefix-free.

Proof: For every M-block message $\mathbf{u}$, the corresponding codeword $\mathbf{c}_{\mathbf{u}}$ of length $\tilde{l}_{\mathbf{u}}$ implicitly belongs to an interval:
$$\bigl[0.\mathbf{c}_{\mathbf{u}},\; 0.\mathbf{c}_{\mathbf{u}} + D^{-\tilde{l}_{\mathbf{u}}}\bigr). \qquad (4.54)$$
(Note that here we write the numbers in D-ary format.) This interval contains all sequences $0.\mathbf{c}$ that have the first $\tilde{l}_{\mathbf{u}}$ digits identical to $0.\mathbf{c}_{\mathbf{u}}$.
Hence, to show that arithmetic codes are prefix-free, it is sufficient to show that no two such intervals overlap.
To that goal consider the j-th (in alphabetical order) message $\mathbf{u}_j$ and note that because of the additional term $D^{-\tilde{l}_{\mathbf{u}_j}}$ in (4.24), its corresponding codeword is such that $0.\mathbf{c}_{\mathbf{u}_j}$ always is larger than $F_{\mathbf{u}_j}$:
$$0.\mathbf{c}_{\mathbf{u}_j} > F_{\mathbf{u}_j}. \qquad (4.55)$$

On the other hand, due to the cutting to length $\tilde{l}_{\mathbf{u}_j}$, $0.\mathbf{c}_{\mathbf{u}_j}$ is smaller than (or equal to) $\tilde{F}_{\mathbf{u}_j}$,
$$0.\mathbf{c}_{\mathbf{u}_j} \le \tilde{F}_{\mathbf{u}_j}, \qquad (4.56)$$
and from (4.23) it follows that
$$p_{\mathbf{u}_j} \ge D^{1-\tilde{l}_{\mathbf{u}_j}} = D \cdot D^{-\tilde{l}_{\mathbf{u}_j}}. \qquad (4.57)$$

Hence,
$$0.\mathbf{c}_{\mathbf{u}_j} + D^{-\tilde{l}_{\mathbf{u}_j}} \le \tilde{F}_{\mathbf{u}_j} + D^{-\tilde{l}_{\mathbf{u}_j}} \qquad \text{(by (4.56))} \qquad (4.58)$$
$$= F_{\mathbf{u}_j} + D^{-\tilde{l}_{\mathbf{u}_j}} + D^{-\tilde{l}_{\mathbf{u}_j}} \qquad \text{(by definition of } \tilde{F}_{\mathbf{u}_j}\text{)} \qquad (4.59)$$
$$= F_{\mathbf{u}_j} + 2\, D^{-\tilde{l}_{\mathbf{u}_j}} \qquad (4.60)$$
$$\le F_{\mathbf{u}_j} + D \cdot D^{-\tilde{l}_{\mathbf{u}_j}} \qquad \text{(because } D \ge 2\text{)} \qquad (4.61)$$
$$\le F_{\mathbf{u}_j} + p_{\mathbf{u}_j} \qquad \text{(by (4.57))} \qquad (4.62)$$
$$= F_{\mathbf{u}_{j+1}} \qquad \text{(by definition of } F_{\mathbf{u}_{j+1}}\text{)}. \qquad (4.63)$$
See also Figure 4.2 for a graphical description of these relations.

Figure 4.2: The codewords of an arithmetic code are prefix-free: on the real line, $F_{\mathbf{u}_j} < 0.\mathbf{c}_{\mathbf{u}_j}$, the interval $[0.\mathbf{c}_{\mathbf{u}_j},\, 0.\mathbf{c}_{\mathbf{u}_j} + D^{-\tilde{l}_{\mathbf{u}_j}})$ lies inside $(F_{\mathbf{u}_j}, F_{\mathbf{u}_{j+1}})$, and $p_{\mathbf{u}_j} \ge D\, D^{-\tilde{l}_{\mathbf{u}_j}}$.
From (4.55) and (4.63) we see that the interval (4.54) is a strict subset of the interval $(F_{\mathbf{u}_j}, F_{\mathbf{u}_{j+1}})$:
$$\bigl[0.\mathbf{c}_{\mathbf{u}_j},\; 0.\mathbf{c}_{\mathbf{u}_j} + D^{-\tilde{l}_{\mathbf{u}_j}}\bigr) \subset \bigl(F_{\mathbf{u}_j}, F_{\mathbf{u}_{j+1}}\bigr). \qquad (4.64)$$
But since the intervals $(F_{\mathbf{u}_j}, F_{\mathbf{u}_{j+1}})$ are pairwise disjoint by definition (recall that $F_{\mathbf{u}_j}$ is monotonically strictly increasing in $j$!), this proves that all intervals (4.54) are disjoint and that therefore no codeword can be a prefix of another codeword.²
It remains to derive the performance. This is straightforward: Using (4.23) we get
$$\tilde{l}_{\mathbf{u}} = \left\lceil \log_D \frac{1}{p_{\mathbf{u}}} \right\rceil + 1 < \frac{\log\frac{1}{p_{\mathbf{u}}}}{\log D} + 2 \qquad (4.65)$$
and therefore
$$\mathrm{E}[L] < \frac{\mathrm{E}\bigl[\log\frac{1}{p_{\mathbf{U}}}\bigr]}{\log D} + 2 = \frac{H(\mathbf{U})}{\log D} + 2 = \frac{M\, H(U)}{\log D} + 2, \qquad (4.66)$$
i.e.,
$$\frac{\mathrm{E}[L]}{M} < \frac{H(U)}{\log D} + \frac{2}{M}. \qquad (4.67)$$
This is almost as good as in Theorem 4.2 that is based on Shannon, Fano, or Huffman coding, and it is asymptotically optimal! Note that both Shannon and Huffman coding are completely infeasible for large message lengths, and also Fano coding will quickly get to its limitations.

4.3.3 Decoding

Finally, let's think about the decoding. We firstly note that we face here quite a challenge: Recall that one of the main points of arithmetic coding is that we do not want (and also do not need) to keep a complete list of all codewords. But this now means that, even though we know that the code is prefix-free, we do not actually know where a codeword ends and the next starts! Moreover, even if we figure out the starting and ending of a codeword within the sequence, how shall we decode it? Once again, we emphasize that we do not have a list of codewords to consult!
Luckily, both problems can be solved in a kind of reverse process of the encoding. Assume for the moment the decoder knew the exact value of $F_M$ (instead of the truncated version of $\tilde{F}$). From (4.52) we know that
$$F_M = f_{u_1} + p_1 f_{u_2} + p_1 p_2 f_{u_3} + \cdots + p_1 \cdots p_{M-1} f_{u_M}. \qquad (4.68)$$

Let's assume that $u_1 = \alpha_i$. Hence, we see that $F_M \ge f_{\alpha_i}$. On the other hand, we must have $F_M < f_{\alpha_i} + P_U(\alpha_i) = f_{\alpha_{i+1}}$ because otherwise there would exist another source sequence starting with $\alpha_{i+1}$ instead of $\alpha_i$ that would result in the same $F_M$, contradicting the fact that our code is prefix-free.
²One could again require that no D-ary representations with an infinite tail of symbols $(D-1)$ are allowed. Strictly speaking, however, this is not necessary because we take care of the ambiguity of the D-ary representation by adding the term $D^{-\tilde{l}_{\mathbf{u}_j}}$ in (4.24).


So we see that we can gain $f_{u_1}$ back from $F_M$: We simply need to search for the largest $f_{\alpha_i}$ that is smaller than or equal to $F_M$. Once we have $f_{u_1}$, we also know $p_1 = P_U(u_1)$ and can therefore get rid of the influence of $u_1$:
$$F_{M-1} = \frac{F_M - f_{u_1}}{p_1} = f_{u_2} + p_2 f_{u_3} + \cdots + p_2 \cdots p_{M-1} f_{u_M}. \qquad (4.69)$$
We continue this way and recursively gain back $\mathbf{u}$.


Now unfortunately, the decoder does not know $F_M = F_{\mathbf{u}}$, but only $0.\mathbf{c}$, which is a truncated version of $\tilde{F}_{\mathbf{u}}$. But again this is not a problem, because we know from (4.55) that
$$F_{\mathbf{u}_j} < 0.\mathbf{c}_{\mathbf{u}_j} \qquad (4.70)$$
and from (4.59)-(4.63) that
$$0.\mathbf{c}_{\mathbf{u}_j} < 0.\mathbf{c}_{\mathbf{u}_j} + D^{-\tilde{l}_{\mathbf{u}_j}} \le F_{\mathbf{u}_{j+1}}. \qquad (4.71)$$
Hence,
$$F_{\mathbf{u}_j} < 0.\mathbf{c}_{\mathbf{u}_j} < F_{\mathbf{u}_{j+1}} \qquad (4.72)$$
and therefore
$$f_{\alpha_i} \le F_M = F_{\mathbf{u}_j} < 0.\mathbf{c}_{\mathbf{u}_j} < F_{\mathbf{u}_{j+1}} \le f_{\alpha_{i+1}}. \qquad (4.73)$$

(Here the last inequality follows again because we have proven that the code is prefix-free: were $F_{\mathbf{u}_{j+1}} < f_{\alpha_{i+1}}$, then the intervals belonging to some sequence starting with $\alpha_i$ and to some sequence starting with $\alpha_{i+1}$ would overlap, leading to prefixes!)
So we see that the truncated version of $\tilde{F}_{\mathbf{u}}$ still lies within the same boundaries as $F_M = F_{\mathbf{u}}$, and we can apply the recursive algorithm (4.69) also to $0.\mathbf{c}_{\mathbf{u}}$.
Finally, there is the problem of the end of one codeword and the start of the next. If instead of the correct codeword $0.\mathbf{c}_{\mathbf{u}}$, we use a sequence $0.\mathbf{c}'$ that is too long, then we again increase the value, $0.\mathbf{c}_{\mathbf{u}} \le 0.\mathbf{c}'$. But since $0.\mathbf{c}_{\mathbf{u}}$ is truncated to $\tilde{l}_{\mathbf{u}}$ digits, this increase is strictly less than $D^{-\tilde{l}_{\mathbf{u}}}$, i.e., we have
$$0.\mathbf{c}_{\mathbf{u}} \le 0.\mathbf{c}' < 0.\mathbf{c}_{\mathbf{u}} + D^{-\tilde{l}_{\mathbf{u}}} \qquad (4.74)$$

and we see from (4.71) that we still are in the correct boundaries:
$$f_{\alpha_i} \le 0.\mathbf{c}' < f_{\alpha_{i+1}}. \qquad (4.75)$$

So, the decoder will firstly compute the length of the longest possible codeword, then it will pick this many digits from the received sequence and start the algorithm (4.69). Once it has decoded a codeword and regained $M$ source letters, it will compute the correct length of the decoded codeword and remove it from the sequence. Then it can restart the same procedure again.
The algorithm is summarized as follows.


Decoding of a D-ary Arithmetic Code:

Step 1: Given is the blocklength $M$, the DMS $U$ with an alphabetical order on the source alphabet $\mathcal{U} = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$, and a sequence $\{c\}$ of D-ary digits representing one or several concatenated codewords. We compute
$$\tilde{l}_{\max} = \left\lceil M \log_D \frac{1}{\min_{i\in\{1,\ldots,r\}} P_U(\alpha_i)} \right\rceil + 1, \qquad (4.76)$$
and take the $\tilde{l}_{\max}$ first digits $\mathbf{c}$ from $\{c\}$. (If the sequence is shorter than $\tilde{l}_{\max}$ digits, simply take the complete sequence.) We define
$$f_{\alpha_i} \triangleq \sum_{i'=1}^{i-1} P_U(\alpha_{i'}), \qquad i = 1, \ldots, r, \qquad (4.77)$$
set $k \leftarrow 1$, $p_0 = 1$, and compute the decimal representation of $\tilde{F}_1 \triangleq (0.\mathbf{c})_{10}$.

Step 2: Find the largest $f_{\alpha_i}$ such that $f_{\alpha_i} \le \tilde{F}_k$. Put out $\alpha_i$ and compute
$$\tilde{F}_{k+1} \leftarrow \frac{\tilde{F}_k - f_{\alpha_i}}{P_U(\alpha_i)}, \qquad (4.78)$$
$$p_k \leftarrow p_{k-1} \cdot P_U(\alpha_i). \qquad (4.79)$$

Step 3: If $k < M$, set $k \leftarrow k+1$ and go to Step 2. Otherwise, compute
$$\tilde{l} \triangleq \left\lceil \log_D \frac{1}{p_M} \right\rceil + 1 \qquad (4.80)$$
and remove the first $\tilde{l}$ digits from $\{c\}$. If there are still digits remaining, repeat the algorithm for the next codeword.

Example 4.5. Let's consider again the DMS from Example 4.3. We are given a stream of encoded source symbols based on an arithmetic code of blocklength $M = 3$. The first couple of digits of the received sequence are as follows: $10001110100\ldots$
We firstly compute
$$\tilde{l}_{\max} = \left\lceil 3 \log_2 \frac{1}{0.2} \right\rceil + 1 = 8 \qquad (4.81)$$
and pick the first 8 digits from the given sequence. We compute:
$$\tilde{F}_1 \triangleq (0.10001110)_2 = (0.5546875)_{10}. \qquad (4.82)$$
Using $f_a \triangleq 0$, $f_b \triangleq p_a = 0.5$, and $f_c \triangleq p_a + p_b = 0.8$, we see that the largest $f_{\alpha_i}$ such that $f_{\alpha_i} \le \tilde{F}_1$ is $f_b$. Hence, we put out b and update the value of $\tilde{F}$:
$$\text{put out b:}\quad \tilde{F}_2 = \frac{0.5546875 - 0.5}{0.3} = 0.182291\overline{6}, \qquad p_1 = 1 \cdot 0.3 = 0.3, \qquad (4.83)$$
$$\text{put out a:}\quad \tilde{F}_3 = \frac{0.182291\overline{6} - 0}{0.5} = 0.36458\overline{3}, \qquad p_2 = 0.3 \cdot 0.5 = 0.15, \qquad (4.84)$$
$$\text{put out a:}\quad p_3 = 0.15 \cdot 0.5 = 0.075. \qquad (4.85)$$
Hence, the decoded source sequence is baa. The codeword length is
$$\tilde{l} = \left\lceil \log_2 \frac{1}{0.075} \right\rceil + 1 = 5, \qquad (4.86)$$
i.e., the first 5 digits 10001 from the received sequence should be removed. The remaining sequence then is $110100\ldots$ We could now continue to decode those digits.
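The decoding steps (4.76)-(4.80) can likewise be sketched in a few lines of Python (again with our own names; the sketch reproduces the decoding of Example 4.5):

```python
from fractions import Fraction
from math import ceil, log

def arithmetic_decode_one(c, M, pmf, order, D=2):
    """Decode one codeword from the digit string c; return (message, codeword length)."""
    P = {a: Fraction(p) for a, p in pmf.items()}
    f, acc = {}, Fraction(0)
    for a in order:
        f[a] = acc
        acc += P[a]
    l_max = ceil(M * log(1 / min(P.values()), D)) + 1          # (4.76)
    window = c[:l_max]
    # value of 0.window in base D, cf. Step 1
    F = sum(int(d) * Fraction(1, D) ** (i + 1) for i, d in enumerate(window))
    p, message = Fraction(1), []
    for _ in range(M):
        # largest f[a] with f[a] <= F, cf. Step 2
        a = max((x for x in order if f[x] <= F), key=lambda x: f[x])
        message.append(a)
        F = (F - f[a]) / P[a]                                   # (4.78)
        p = p * P[a]                                            # (4.79)
    l = ceil(log(1 / p, D)) + 1                                 # (4.80)
    return ''.join(message), l

# Decoding the stream of Example 4.5 with blocklength M = 3:
pmf = {'a': '0.5', 'b': '0.3', 'c': '0.2'}
print(arithmetic_decode_one('10001110100', M=3, pmf=pmf, order='abc'))
# -> ('baa', 5)
```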

We would like to conclude the discussion of arithmetic coding by pointing out that there exist different variations of arithmetic coding schemes. For example, instead of adding $D^{-\tilde{l}_{\mathbf{u}}}$ to $F_{\mathbf{u}}$ as in (4.24), one could also add $\frac{1}{2} p_{\mathbf{u}}$ to $F_{\mathbf{u}}$. These changes have no fundamental impact on the performance. However, we would like to mention that we have not worried about numerical precision: There are other versions of the encoding and decoding algorithm that are numerically more stable than those introduced here.
For an easy-to-read introduction to arithmetic coding including its history, the introductory chapter of the Ph.D. thesis of Jossy Sayir is highly recommended [Say99].

4.4 Variable-Length-to-Block Coding of a DMS

The variable-length codewords that we have considered to this point are sometimes inconvenient in practice. For instance, if the codewords are stored in a computer memory, one would usually prefer to use codewords whose length coincides with the computer word-length. Or if the digits in the codeword are to be transmitted synchronously (say, at the rate of 2400 bits/s), one would usually not wish to buffer variable-length codewords to assure a steady supply of digits to be transmitted.
But it was precisely the variability of the codeword lengths that permitted us to encode a single random variable efficiently, such as the message $\mathbf{V}$ from an M-block parser for a DMS. How can we get similar coding efficiency for a DMS when all codewords are forced to have the same length? The answer is that we must assign codewords not to blocks of source letters but rather to variable-length sequences of source letters, i.e., we must do variable-length parsing.
So, we consider the coding of a variable number of source letters into an L-block codeword as indicated in Figure 4.3. Note that while an M-block parser


Figure 4.3: Variable-length-to-block coding of a discrete memoryless source: The source output sequence $\{U_k\}$ of the r-ary DMS is parsed into messages $\mathbf{V} = (U_1, \ldots, U_M)$ of variable length $M$, which are then assigned a codeword $\mathbf{C} = (C_1, \ldots, C_L)$ of fixed length $L$.

in the situation of block-to-variable-length coding produces messages of equal length $M$, now the situation is more complicated with different messages of different length. We define the message set $\mathcal{V}$ to be the set of all possible (variable-length) output sequences of the source parser.
Note that since the DMS is memoryless, we may still concentrate on just one message $\mathbf{V} = (U_1, U_2, \ldots, U_M)$ and do not need to worry about the future (the future is independent of the past!). Again we want to minimize the average number of code digits per source letter. In the case of a block-to-variable-length code this was
$$\frac{\mathrm{E}[L]}{M}. \qquad (4.87)$$

However, in our new situation of variable-length-to-block codes, $L$ is constant and $M$ is variable. Hence, we now have to consider
$$\frac{L}{\mathrm{E}[M]}, \qquad (4.88)$$
as is proven in the following lemma.


Lemma 4.6. The quality criterion of a general coding scheme for a DMS
according to Figure 4.1, where the r-ary source sequence {Uk } is parsed into
messages of variable length M , which are then mapped into D-ary codewords
of variable length L, is the average number of code digits per source
letter
E[L]
.
E[M ]

(4.89)

Proof: Since $M$ source symbols are mapped into a codeword of length $L$, it is clear that we need $\frac{L}{M}$ code digits per source letter. However, $L$ and $M$ are random, i.e., we somehow need to compute an averaged version of this quality criterion. So why do we consider $\frac{\mathrm{E}[L]}{\mathrm{E}[M]}$ and not $\mathrm{E}\bigl[\frac{L}{M}\bigr]$?
First of all, note that $\frac{\mathrm{E}[L]}{\mathrm{E}[M]}$ and $\mathrm{E}\bigl[\frac{L}{M}\bigr]$ are not the same! If you are unsure about that, go back to the definition of expectation (see (2.19) and (2.44)) and think about linearity. You might also want to go back to Jensen's inequality (Theorem 1.29) and think about it.
So, since they are not the same, which one should we use in our situation? To answer this, note that what we really want to do is to use our compression

scheme many times and thereby minimize the total number of used code digits divided by the total number of encoded source letters:
$$\frac{L_1 + L_2 + \cdots + L_n}{M_1 + M_2 + \cdots + M_n}. \qquad (4.90)$$
Here $M_k$ denotes the length of the message during the k-th time we use the system, and $L_k$ is the length of the corresponding codeword. If we now use the system for a long time, i.e., if we let $n$ become large, then we know from the weak law of large numbers that
$$\frac{L_1 + L_2 + \cdots + L_n}{M_1 + M_2 + \cdots + M_n} = \frac{\frac{L_1 + L_2 + \cdots + L_n}{n}}{\frac{M_1 + M_2 + \cdots + M_n}{n}} \;\xrightarrow{\;n\to\infty\;}\; \frac{\mathrm{E}[L]}{\mathrm{E}[M]} \quad \text{in probability.} \qquad (4.91)$$
Hence, the quantity of interest is $\frac{\mathrm{E}[L]}{\mathrm{E}[M]}$.
So let's go back to the case where the codewords all have constant length $\mathrm{E}[L] = L$ and consider a simple example of a ternary DMS.

Example 4.7. Consider a ternary DMS ($r = 3$) with source alphabet $\{a, b, c\}$. We decide to use the following message set:
$$\mathcal{V} \triangleq \{aaa, aab, aac, b, ca, cb, cc\}, \qquad (4.92)$$
and assign them the following corresponding binary ($D = 2$) codewords of fixed length $L = 3$:
$$\mathcal{C} \triangleq \{001, 010, 011, 100, 101, 110, 111\}. \qquad (4.93)$$
So, for example, the source output sequence $aabcccabca\ldots$ is split as follows
$$aab\,|\,cc\,|\,ca\,|\,b\,|\,ca\,|\,\ldots \qquad (4.94)$$
and is mapped into
$$010\;111\;101\;100\;101\;\ldots \qquad (4.95)$$

Will this system perform well? This of course depends on the probability distribution of the source! However, if the long messages are very likely, i.e., if a is much more likely than b, then it should not be too bad. Concretely, if $\mathrm{E}[M]$ is large, then $\frac{L}{\mathrm{E}[M]}$ is small, as is wished for a good compression.
Example 4.8. We make another example to point out some possible problems. Consider again $r = 3$ with source alphabet $\{a, b, c\}$, but now choose $\mathcal{V} \triangleq \{aa, b, cba, cc\}$ as message set. Then the sequence of source letters $aabcccabca\ldots$ cannot be split into different messages:
$$aa\,|\,b\,|\,cc\,|\;cabca\ldots \qquad (4.96)$$
So this sequence cannot be parsed! What is wrong?


We realize that, similarly to codes, also message sets must satisfy some properties. In particular we have the following condition on any legal message set:

Every possible sequence of source symbols $U_1, U_2, \ldots$ must contain a message $\mathbf{V}$ as prefix so that we can parse every possible source sequence.

Since we have successfully been using the tool of rooted trees in order to
describe the mapping from messages to codewords, we will now try to do the
same thing also for the mapping of source sequences to messages. Obviously,
since we have an r-ary DMS, we need to consider r-ary trees!
Example 4.9. Let's once more consider the ternary ($r = 3$) source with source alphabet $\{a, b, c\}$. In Figure 4.4 you see two examples of message sets that are represented by ternary trees. The first set is not allowed because not every possible sequence from the source can be parsed into messages: Any sequence containing two c in sequence fails to be parsed. The second one, however, is a legal message set once we define a clear rule how to deal with cases where the parsing is not unique.

Figure 4.4: Two examples of message sets (shown as ternary trees, the left one illegal, the right one legal but not proper). The message set on the left cannot be used for a source parser because there is no valid parsing of the source sequence $(U_1, U_2, \ldots) = (c, c, \ldots)$. The message set on the right, however, can be used, perhaps with the rule "always choose the longest message if more than one parsing is possible." Example: $c\,|\,ca\,|\,b\,|\ldots$ instead of $c\,|\,c\,|\,a\,|\,b\,|\ldots$
Similarly to our codes, we also would like to have a prefix-free property: We prefer a message set where the encoder can use the message immediately and does not need to wait longer. The message set on the right of Figure 4.4 is an example of a set that does not have this desired prefix-free property. In

particular, if the parser sees a c at its input it cannot yet decide whether this will be a message, but first has to wait and see the following source symbol.
We give the following definition.
Definition 4.10. An r-ary source message set is called proper if, and only if, it forms a complete set of leaves in an r-ary tree (i.e., all leaves are messages, no nodes are messages).
Example 4.11. An example of a proper message set for a quaternary (r = 4)
source U is given in Figure 4.5. Note that in this tree we have already assigned
the corresponding probabilities to each message, assuming that the PMF of
the source U is PU (a) = 0.5, PU (b) = PU (c) = 0.2 and PU (d) = 0.1.

Figure 4.5: A proper message set for a quaternary source with PMF $P_U(a) = 0.5$, $P_U(b) = 0.2$, $P_U(c) = 0.2$, and $P_U(d) = 0.1$. The message $\mathbf{V}$ will take value in $\{a, ba, bb, bc, bd, c, da, db, dc, dd\}$.
Using our knowledge about trees, in particular, using the leaf entropy theorem (Theorem 3.17) and the path length lemma (Lemma 3.12), we can easily compute the entropy of the messages of a proper message set:
$$H(\mathbf{V}) = H_{\mathrm{leaf}} \qquad (4.97)$$
$$= \sum_{\ell=1}^{N} P_\ell\, H_\ell \qquad (4.98)$$
$$= \sum_{\ell=1}^{N} P_\ell\, H(U) \qquad (4.99)$$
$$= H(U) \sum_{\ell=1}^{N} P_\ell \qquad (4.100)$$
$$= H(U) \cdot \mathrm{E}[M]. \qquad (4.101)$$
Here, (4.97) follows from the fact that all possible values of $\mathbf{V}$ are leaves; (4.98) follows from the leaf entropy theorem (Theorem 3.17); in (4.99) we use the fact that the branching entropy is always the source entropy because the PMF of the source decides about the branching; and (4.101) follows from the path length lemma (Lemma 3.12).
Hence we get the following theorem.

Theorem 4.12. The entropy of a proper message set $H(\mathbf{V})$ for an r-ary DMS $U$ is
$$H(\mathbf{V}) = H(U) \cdot \mathrm{E}[M], \qquad (4.102)$$
where $\mathrm{E}[M]$ is the average message length.


Example 4.13. Consider the following ternary DMS: $P_U(a) = 0.1$, $P_U(b) = 0.3$, $P_U(c) = 0.6$. We would like to design a binary ($D = 2$) block code for this source. Note that
$$H(U) = -0.1\log 0.1 - 0.3\log 0.3 - 0.6\log 0.6 \approx 1.295\ \text{bits}. \qquad (4.103)$$
Let's use the proper message set $\mathcal{V} = \{a, b, ca, cb, cc\}$ shown in Figure 4.6.
Figure 4.6: A proper message set for the DMS $U$ of Example 4.13 (message probabilities 0.1, 0.3, 0.06, 0.18, 0.36) and a possible binary block code with codewords 000, 001, 010, 011, 100.
We see that
$$\mathrm{E}[M] = 1 + 0.6 = 1.6, \qquad (4.104)$$
$$H(\mathbf{V}) = -0.1\log 0.1 - 0.3\log 0.3 - 0.06\log 0.06 - 0.18\log 0.18 - 0.36\log 0.36 \qquad (4.105)$$
$$\approx 2.073\ \text{bits}, \qquad (4.106)$$
$$H(U)\cdot\mathrm{E}[M] \approx 1.295\ \text{bits} \cdot 1.6 \approx 2.073\ \text{bits}. \qquad (4.107)$$


Note that since we have 5 different messages, we need to assign 5 codewords of equal length. The smallest possible choice for the codeword length will therefore be $L = 3$.
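The numbers in (4.103)-(4.107) are easy to check numerically. The following is a small sketch of our own, with the message probabilities read off Figure 4.6:

```python
from math import log2

def entropy(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * log2(p) for p in probs)

H_U = entropy([0.1, 0.3, 0.6])            # (4.103): about 1.295 bits
leaves = [0.1, 0.3, 0.06, 0.18, 0.36]     # message probabilities of V = {a, b, ca, cb, cc}
H_V = entropy(leaves)                     # (4.105)-(4.106): about 2.073 bits
E_M = 1 + 0.6                             # path length lemma: root plus node c
print(H_U, H_V, H_U * E_M)                # H(V) = H(U) E[M], cf. (4.107)
```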

4.5 General Converse

Next note that from Theorem 4.12 we know
$$H(\mathbf{V}) = H(U) \cdot \mathrm{E}[M] \qquad (4.108)$$
and from the converse part of the coding theorem for a single random message (Theorem 3.35) we have
$$\mathrm{E}[L] \ge \frac{H(\mathbf{V})}{\log D}. \qquad (4.109)$$
Hence, we can derive a general converse:
$$\mathrm{E}[L] \ge \frac{H(U) \cdot \mathrm{E}[M]}{\log D}. \qquad (4.110)$$

Theorem 4.14 (Converse to the General Coding Theorem for a DMS).
For any uniquely decodable D-ary code of any proper message set for an r-ary DMS $U$, we have
$$\frac{\mathrm{E}[L]}{\mathrm{E}[M]} \ge \frac{H(U)}{\log D}, \qquad (4.111)$$
where $H(U)$ is the entropy of a single source symbol, $\mathrm{E}[L]$ is the average codeword length, and $\mathrm{E}[M]$ is the average message length.

4.6 Optimal Message Sets: Tunstall Message Sets

We now would like to find an optimum way of designing proper message sets. From Lemma 4.6 we know that in order to minimize the average number of codeword digits per source letter, we have to maximize $\mathrm{E}[M]$.

Definition 4.15. A proper message set for an r-ary DMS with $n = 1 + N(r-1)$ messages is a Tunstall message set if the corresponding r-ary tree can be formed, beginning with the extended root, by $N - 1$ applications of the rule "extend the most likely leaf".

As before, $n$ denotes the number of leaves and $N$ is the number of nodes (including the root).


Example 4.16. Consider a binary memoryless source (BMS), i.e., $r = 2$, with the following PMF:
$$P_U(u) = \begin{cases} 0.4 & u = 1, \\ 0.6 & u = 0. \end{cases} \qquad (4.112)$$
Moreover, we would like to have a message set with $n = 5$ messages, i.e.,
$$n = 5 = 1 + N(r-1) = 1 + N, \qquad (4.113)$$
i.e., we need to apply the rule "extend the most likely leaf" $N - 1 = 3$ times. The corresponding Tunstall message set is shown in Figure 4.7.

Figure 4.7: Example of a Tunstall message set for a BMS. The dashed arrows indicate the order in which the tree has been grown.

The following lemma follows directly from the definition of a Tunstall message set.

Lemma 4.17 (The Tunstall Lemma). A proper message set for an r-ary DMS is a Tunstall message set if, and only if, in its r-ary tree every node is at least as likely as every leaf.
Proof: Consider growing a Tunstall message set by beginning from the extended root and repeatedly extending the most likely leaf. The extended root trivially has the property that no leaf is more likely than its only node (the root itself). Suppose this property continues to hold for the first $\ell$ extensions. On the next extension, none of the $r$ new leaves can be more likely than the node just extended and thus none is more likely than any of the old nodes since these were all at least as likely as the old leaf just extended. But none of the remaining old leaves is more likely than any old node nor more likely than the new node since this latter node was previously the most likely of the old leaves. Thus, the desired property holds also after $\ell + 1$ extensions. By induction, this property holds for every Tunstall message set.
Conversely, consider any proper message set with the property that, in its r-ary tree, no leaf is more likely than any intermediate node. This property still holds if we prune the tree by cutting off the $r$ branches stemming from the least likely node. After enough such prunings, we will be left only with the extended root. But if we then re-grow the same tree by extending leaves in the reverse order to our pruning of nodes, we will at each step be extending the most likely leaf. Hence this proper message set was indeed a Tunstall message set, and the lemma is proved.
Based on this lemma we can now prove the following theorem.

Theorem 4.18. A proper message set with $n$ messages for an r-ary DMS maximizes $\mathrm{E}[M]$ over all such proper message sets if, and only if, it is a Tunstall message set.

Proof: The theorem follows from the path length lemma (Lemma 3.12) and the Tunstall lemma (Lemma 4.17). Consider the r-ary semi-infinite tree of the DMS. Note that all vertices of an r-ary message set are nodes in this semi-infinite tree. Since $\mathrm{E}[M] = \sum_{\ell} P_\ell$, if we want to maximize $\mathrm{E}[M]$, we need to pick the $N$ most likely nodes in this tree. This means that the $n = 1 + N(r-1)$ leaves will all be less likely than these nodes. Hence, according to the Tunstall lemma we have created a Tunstall message set.

4.7 Optimal Variable-Length-to-Block Codes: Tunstall Codes

To find the optimal variable-length-to-block codes we already have almost all ingredients. In the last section we have proven that the optimal proper message sets are Tunstall message sets. There is only one question remaining: How large should the message set be, i.e., how shall we choose $n$?
Note that
• we want $\mathrm{E}[M]$ to be large, i.e., we want to choose $n$ large;
• we can increase $n$ in steps of size $(r - 1)$ (see leaf counting lemma, Lemma 3.8);
• but since the codeword length $L$ and the size of the coding alphabet $D$ are fixed, we have at most $D^L$ possible codewords available.


So we realize that we should choose $n$ such that it satisfies
$$D^L - (r-1) < n \le D^L. \qquad (4.114)$$
We play around with this expression to get
$$n + (r-1) > D^L \ge n, \qquad (4.115)$$
$$(r-1) > D^L - n \ge 0. \qquad (4.116)$$
Using the leaf counting lemma (Lemma 3.8), i.e., $n = 1 + N(r-1)$, we get
$$D^L - n = D^L - 1 - N(r-1) \qquad (4.117)$$
or
$$N(r-1) + \underbrace{D^L - n}_{\triangleq\, R} = D^L - 1, \qquad (4.118)$$
where by (4.116) $0 \le R < r-1$. So by Euclid's division theorem, we have that $N$ is the quotient when $D^L - 1$ is divided by $r - 1$. (Note that additionally $D^L \ge r$ is required because the smallest proper message set has $r$ messages.)
Hence, we get the following algorithm.
Hence, we get the following algorithm.
Tunstall's Algorithm for Optimum D-ary L-Block Encoding of a Proper Message Set for an r-ary DMS:

Step 1: Check whether $D^L \ge r$. If not, stop because such coding is not possible. Otherwise compute the quotient $N$ when $D^L - 1$ is divided by $r - 1$:
$$N \triangleq \left\lfloor \frac{D^L - 1}{r-1} \right\rfloor. \qquad (4.119)$$

Step 2: Start with the extended root and construct the Tunstall message set of size $n = 1 + N(r-1)$ by $N - 1$ times extending the most likely leaf.

Step 3: Assign a distinct D-ary codeword of length $L$ to each message.
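A short Python sketch of Steps 1 and 2, using a heap to always pick the most likely leaf (function and variable names are our own, not from the notes):

```python
import heapq

def tunstall_message_set(pmf, D, L):
    """Return the Tunstall message set (as strings of source letters)."""
    r = len(pmf)
    if D ** L < r:
        raise ValueError('need D^L >= r')
    N = (D ** L - 1) // (r - 1)                  # (4.119)
    # max-heap of leaves keyed by probability (negated for heapq's min-heap)
    leaves = [(-p, a) for a, p in pmf.items()]   # the extended root
    heapq.heapify(leaves)
    for _ in range(N - 1):                       # extend the most likely leaf N-1 times
        neg_p, v = heapq.heappop(leaves)
        for a, p in pmf.items():
            heapq.heappush(leaves, (neg_p * p, v + a))
    return sorted(v for _, v in leaves)

# Example 4.19: BMS with P(0) = 0.6, P(1) = 0.4, binary codewords of length L = 3.
print(tunstall_message_set({'0': 0.6, '1': 0.4}, D=2, L=3))
# -> ['0000', '0001', '001', '010', '011', '100', '101', '11'], the messages of Table 4.9
```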
Example 4.19. We again consider the BMS ($r = 2$) with $P_U(0) = 0.6$ and $P_U(1) = 0.4$, and construct a binary block code of length 3, i.e., $D = 2$ and $L = 3$. Then
$$D^L = 2^3 = 8 \ge r = 2, \quad\text{so such coding is possible,} \qquad (4.120)$$
and
$$N = \left\lfloor \frac{D^L - 1}{r-1} \right\rfloor = \frac{2^3 - 1}{2 - 1} = 7, \qquad (4.121)$$

Figure 4.8: Tunstall message set for the BMS of Example 4.19. The dashed arrows indicate the order in which the tree has been grown.

i.e., starting with the extended root, we need to extend the most likely leaf six times. The corresponding Tunstall message set is shown in Figure 4.8.
We get
$$\mathrm{E}[M] = 3.056, \qquad (4.122)$$
$$\mathrm{E}[L] = L = 3, \qquad (4.123)$$
$$\frac{H(U)}{\log D} \approx 0.971, \qquad (4.124)$$
$$\frac{L}{\mathrm{E}[M]} \approx 0.982 \ge 0.971 \approx \frac{H(U)}{\log D}. \qquad (4.125)$$

And finally, our choice of the Tunstall code is shown in Table 4.9.
Notice that if we had used no coding at all (which is possible here since the source alphabet and coding alphabet coincide), then we would have achieved trivially 1 code digit per source digit. It would be an economic question whether the 2% reduction in code digits offered by the Tunstall scheme in the above example was worth the cost of its implementation. In fact, we see that no coding scheme could compress this binary source by more than 3%, so that our Tunstall code has not done too badly.



Table 4.9: Tunstall code of Example 4.19.
Message

Codewords

0000
0001
001
010
011
100
101
11

000
001
010
011
100
101
110
111

There also exists a coding theorem for Tunstall codes, but it is less beautiful than the version for the Huffman codes.
Theorem 4.20 (The Variable-Length-to-Block Coding Theorem for a DMS).
The ratio $\mathrm{E}[M]/L$ of average message length to blocklength for an optimum D-ary block-L encoding of a proper message set for an r-ary DMS satisfies
$$\frac{\log D}{H(U)} - \frac{\log\frac{2}{p_{\min}}}{L\, H(U)} < \frac{\mathrm{E}[M]}{L} \le \frac{\log D}{H(U)}, \qquad (4.126)$$
where $H(U)$ is the entropy of a single source letter and where $p_{\min} = \min_u P_U(u)$ is the probability of the least likely source letter. (We actually have assumed here that $P_U(u) > 0$ for all $u$. If this is not the case, simply remove all letters with $P_U(u) = 0$ from the source alphabet.)

Proof: The right inequality follows directly from the general converse (Theorem 4.14). It remains to prove the left inequality.
The least likely message in the message set has probability $P\, p_{\min}$ where $P$ is the probability of the node from which it stems; but since there are $n$ messages, this least likely message has probability at most $\frac{1}{n}$, so we must have
$$P\, p_{\min} \le \frac{1}{n}. \qquad (4.127)$$
By the Tunstall lemma (Lemma 4.17) no message (i.e., no leaf) has probability more than $P$, so that (4.127) implies
$$P_{\mathbf{V}}(\mathbf{v}) \le P \le \frac{1}{n\, p_{\min}}, \qquad \forall\,\mathbf{v}, \qquad (4.128)$$


which in turn implies
$$-\log P_{\mathbf{V}}(\mathbf{v}) \ge \log n - \log\frac{1}{p_{\min}}, \qquad \forall\,\mathbf{v}. \qquad (4.129)$$
Next, we recall that the number $n = 1 + N(r-1)$ of messages was chosen with $N$ as large as possible for $D^L$ codewords so that surely (see (4.114))
$$n + (r-1) > D^L. \qquad (4.130)$$
But $n \ge r$, so that (4.130) further implies
$$2n \ge n + r > D^L + 1 > D^L, \qquad (4.131)$$
which, upon substitution in (4.129), gives
$$H(\mathbf{V}) = \mathrm{E}\bigl[-\log P_{\mathbf{V}}(\mathbf{V})\bigr] \ge \log n - \log\frac{1}{p_{\min}} = \log\frac{2n}{2} - \log\frac{1}{p_{\min}} > L\log D - \log\frac{2}{p_{\min}}. \qquad (4.132)$$
Making use of Theorem 4.12, we see that (4.132) is equivalent to
$$\mathrm{E}[M]\, H(U) > L\log D - \log\frac{2}{p_{\min}}. \qquad (4.133)$$
Dividing now by $L\, H(U)$ gives the left inequality in (4.126).


We see that by making L large we can again approach the upper bound
in (4.126) arbitrarily closely. So, it is possible to approach the fundamental
limit both with Human codes and Tunstall codes. Usually a Human code
is slightly more ecient than the corresponding Tunstall code with the same
number of codewords. The most ecient way is, of course, to combine a
Tunstall message set with a Human code. This will then lead to an optimal
variable-lengthtovariable-length code (see exercises).3
In spite of its fundamental importance, Tunstall's work was never published in the open literature. Tunstall's doctoral thesis [Tun67], which contains this work, lay unnoticed for many years before it became familiar to information theorists.
Remark 4.21. Tunstall's work shows how to do optimal variable-length-to-block coding of a DMS, but only when one has decided to use a proper message set. A legal, but not proper, message set, denoted a quasi-proper message set (such as the message set on the right of Figure 4.4), in which there is at least one message on each path from the root to a leaf in its r-ary tree, but sometimes more than one message, can also be used. Quite surprisingly (as the exercises will show), one can sometimes do better variable-length-to-block coding of a DMS with a quasi-proper message set than one can do with a proper message set having the same number of messages. Perhaps even more surprisingly, nobody today knows how to do optimal variable-length-to-block coding of a DMS with a given number of messages when quasi-proper message sets are allowed.

³What happens if we do block-to-block coding? In general this will result in lossy compression!
Note, however, that one cannot encode, on average, more than $(\log D)/H(U)$ source letters per D-ary code digit, i.e., the converse (Theorem 4.14) still holds true also for quasi-proper message sets. The proof of this most general converse will be deferred to the advanced information theory course. It is based on the concept of rate distortion theory (also see the discussion on p. 298).

4.8 The Efficiency of a Source Coding Scheme

It is common to define an efficiency of a coding scheme.


Definition 4.22. The efficiency $\eta$ of a source coding scheme is defined by the ratio of the lower bound $H(U)/\log D$ in the general coding theorem (Theorem 4.14) to the achieved average number of codeword digits per source letter $\mathrm{E}[L]/\mathrm{E}[M]$:
$$\eta \triangleq \frac{H(U)}{\log D} \cdot \frac{\mathrm{E}[M]}{\mathrm{E}[L]}. \qquad (4.134)$$
Note that $0 \le \eta \le 1$, where $\eta = 1$ denotes full efficiency because in this case the code achieves the best possible theoretical lower bound exactly.
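For instance, for the Tunstall code of Example 4.19 the efficiency can be evaluated in a few lines (a sketch of our own, with the numbers taken from (4.122)-(4.124)):

```python
from math import log2

# Efficiency (4.134) of the Tunstall code of Example 4.19: D = 2, so log D = 1 bit.
H_U = -(0.6 * log2(0.6) + 0.4 * log2(0.4))   # about 0.971 bits
E_M, E_L = 3.056, 3
eta = (H_U / 1.0) * (E_M / E_L)
print(eta)   # about 0.989
```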
Remark 4.23. We end this chapter with a final remark. All the systems we have discussed here contain one common drawback: We have always assumed that the probability statistics of the source is known in advance and can be used in the design of the system. In a practical situation this is often not the case. For example, what is the probability distribution of digitized speech in a telephone system? Or of English ASCII text in comparison to French ASCII text? Or of different types of music? A really practical system should work independently of the source, i.e., it should estimate the probabilities of the source symbols on the fly and adapt to them automatically. Such a system is called a universal compression scheme. Such systems do exist, and we will discuss some examples in Chapter 6 in the even more general context of sources with memory. Beforehand, we need to discuss memory, and how to compute the entropy of sources with memory.


Chapter 5

Stochastic Processes and Entropy Rate
Most real information sources exhibit substantial memory, so that the DMS is not an appropriate model for them. We now introduce two models for sources with memory that are sufficiently general to satisfactorily model most real information sources, but that are still simple enough to be analytically tractable: discrete stationary sources and Markov sources.

5.1 Discrete Stationary Sources

We begin with the following very general definition.

Definition 5.1. A discrete stochastic process or discrete random process $\{X_k\}$ is a sequence of RVs
$$\ldots, X_{-3}, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots \qquad (5.1)$$
with a certain joint probability distribution.


Note that the word stochastic is just fancy for random. Moreover, we
would like to point out that in our notation k denotes the discrete time, {Xk }
is a discrete-time stochastic process, and Xk denotes the RV of this process
at time k.
It is quite obvious that it is dicult to handle the vast amount of all
possible discrete stochastic processes. Simply a proper denition that contains
the complete joint distribution of the innite process is a very dicult task
indeed! We need to introduce additional restrictions in order to make them
more accessible for analysis.
One such restriction is to make sure that the probability laws reigning the
process do not change over time. The process is then called stationary.
Definition 5.2. A stochastic process $\{X_k\}$ is called stationary or strict-sense stationary (SSS) if the joint distribution of any subset of the sequence does not change with time: for every shift $\tau \in \mathbb{Z}$, every $n \in \mathbb{N}$, and all $k_1, \ldots, k_n$ with $k_j \in \mathbb{Z}$ ($j = 1, \ldots, n$):
$$\Pr[X_{k_1} = \xi_1, \ldots, X_{k_n} = \xi_n] = \Pr[X_{k_1+\tau} = \xi_1, \ldots, X_{k_n+\tau} = \xi_n], \qquad \forall\,(\xi_1, \ldots, \xi_n). \qquad (5.2)$$

In particular, note that this definition means that for any stationary stochastic process we have
$$\Pr[X_k = \xi] = \Pr[X_1 = \xi], \qquad \forall\, k, \xi. \qquad (5.3)$$

Sometimes, strict-sense stationarity is too strong a restriction. Then, weak stationarity might be a solution.
Definition 5.3. A stochastic process $\{X_k\}$ is called weakly stationary or wide-sense stationary (WSS) if the first and second moments of the process do not depend on time, i.e.,
$$\mathrm{E}[X_k] = \text{constant, not depending on } k, \qquad (5.4)$$
$$K_{XX}(k, j) \triangleq \mathrm{E}\bigl[(X_k - \mathrm{E}[X_k])(X_j - \mathrm{E}[X_j])\bigr] = K_{XX}(k - j). \qquad (5.5)$$

Note that by definition any stationary process is also weakly stationary, but not vice versa.
The following definition is a straightforward application of the definition of stochastic processes and stationarity.

Definition 5.4. An r-ary discrete stationary source (DSS) is a device that emits a sequence of r-ary RVs such that this sequence forms a stationary stochastic process.

5.2 Markov Processes

In engineering we often have the situation that a RV is processed by a series of (random or deterministic) processors, as shown in Figure 5.1.

Figure 5.1: A series of processing black boxes with a random input.

In this case $Z$ depends on $W$, but only via its intermediate products $X$ and $Y$. In other words, once we know $Y$, it does not help us to also know $X$ or $W$ to improve our prediction about $Z$.
Another way of looking at the same thing is that the system has limited memory: Once we know $W$ we do not need to know anymore what happened before! The process forgets about its past.
This basic idea can be generalized to a so-called Markov process.


Definition 5.5. A discrete stochastic process $\{X_k\}$ is called a Markov process of memory $\mu$ if
$$P_{X_k|X_{k-1}, X_{k-2}, \ldots, X_{k-\mu}, \ldots}(x_k|x_{k-1}, x_{k-2}, \ldots, x_{k-\mu}, \ldots) = P_{X_k|X_{k-1}, \ldots, X_{k-\mu}}(x_k|x_{k-1}, \ldots, x_{k-\mu}), \qquad (5.6)$$
for all $k$ and for all $(x_k, x_{k-1}, \ldots)$.


Hence we only need to keep track of the last time steps and can forget
about anything older than time-steps before now!
Remark 5.6. Every Markov process of memory can be reduced to a Markov
process of memory 1 by enlarging its alphabet. To show this, we give an
instructive example.
Example 5.7. Consider $r = 2$ with the alphabet $\{a, b\}$, and let $\mu = 2$, i.e., it is sufficient to know the last two time steps, e.g., $P_{X_3|X_2,X_1}(x_3|x_2, x_1)$. Now define $V_k \triangleq (X_{2k}, X_{2k-1})$ and note that $\{V_k\}$ takes value in the larger alphabet $\{aa, ab, ba, bb\}$ (i.e., $r = 4$). Let's check how big the memory of this new process $\{V_k\}$ is (for simplicity of notation, we drop the PMFs' arguments):
$$P_{V_2|V_1, V_0, V_{-1}, \ldots} = P_{X_4, X_3|X_2, X_1, X_0, \ldots} \qquad (5.7)$$
$$= P_{X_3|X_2, X_1, X_0, \ldots} \cdot P_{X_4|X_3, X_2, X_1, X_0, \ldots} \qquad \text{(by chain rule)} \qquad (5.8)$$
$$= P_{X_3|X_2, X_1} \cdot P_{X_4|X_3, X_2, X_1} \qquad \text{(by Markovity of } \{X_k\}\text{)} \qquad (5.9)$$
$$= P_{X_4, X_3|X_2, X_1} \qquad \text{(by chain rule)} \qquad (5.10)$$
$$= P_{V_2|V_1}. \qquad (5.11)$$
Hence, $\{V_k\}$ is again Markov, but of memory $\mu = 1$!

We usually will only consider Markov processes of memory 1 and simply call them Markov processes.

Definition 5.8. A Markov process is called time-invariant (or homogeneous) if the conditional probability distribution does not depend¹ on time $k$:
$$P_{X_k|X_{k-1}}(\beta|\alpha) = P_{X_2|X_1}(\beta|\alpha), \qquad \forall\, k, \alpha, \beta. \qquad (5.12)$$

We will always assume that a Markov process is time-invariant.
Remark 5.9. We make the following observations:
• For a time-invariant Markov process the transition probability from a certain state $\alpha$ to a certain other state $\beta$ always remains the same: $p_{\alpha,\beta}$.

¹In the case of a Markov process of memory $\mu > 1$, this definition is accordingly adapted to $P_{X_k|X_{k-1}, \ldots, X_{k-\mu}}(\cdot|\cdot) = P_{X_{\mu+1}|X_\mu, \ldots, X_1}(\cdot|\cdot)$ for all $k$ and all arguments.


• These transition probabilities $p_{\alpha,\beta}$ are usually stored in the transition probability matrix
$$\mathsf{P} \triangleq \begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,m} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m,1} & p_{m,2} & \cdots & p_{m,m} \end{pmatrix}. \qquad (5.13)$$
• A time-invariant Markov process can easily be depicted by a state diagram, see the following Example 5.10.

Example 5.10. Consider a binary source ($r = 2$) with alphabet $\{a, b\}$. Assume that
$$P_{X_2|X_1}(a|a) = \frac{1}{2}, \qquad P_{X_2|X_1}(b|a) = \frac{1}{2}, \qquad (5.14)$$
$$P_{X_2|X_1}(a|b) = \frac{3}{4}, \qquad P_{X_2|X_1}(b|b) = \frac{1}{4}. \qquad (5.15)$$
Then
$$\mathsf{P} = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\[1mm] \frac{3}{4} & \frac{1}{4} \end{pmatrix} \qquad (5.16)$$
and the corresponding state transition diagram is shown in Figure 5.2.

Figure 5.2: State transition diagram of a binary ($r = 2$) Markov source with transition probability matrix given in (5.16).
So we see that in order to uniquely define a Markov process, we certainly need to specify the set of all states $\mathcal{S}$ and the transition probabilities $\mathsf{P}$, or equivalently, the state transition diagram. But is this sufficient? Is a Markov process completely specified by states and transition probabilities?
We remember that a stochastic process is specified by the joint distribution of any subset of RVs (see Definition 5.1). So let's check the joint distribution of a Markov process $\{X_k\}$ and see if $\mathsf{P}$ fully specifies it. As an example, we


pick three RVs:
$$P_{X_1, X_2, X_3}(\alpha, \beta, \gamma) = P_{X_1}(\alpha)\, P_{X_2|X_1}(\beta|\alpha)\, P_{X_3|X_2, X_1}(\gamma|\beta, \alpha) \qquad (5.17)$$
$$= P_{X_1}(\alpha)\, P_{X_2|X_1}(\beta|\alpha)\, P_{X_3|X_2}(\gamma|\beta) \qquad (5.18)$$
$$= \underbrace{P_{X_1}(\alpha)}_{\text{not specified yet!}}\, P_{X_2|X_1}(\beta|\alpha)\, P_{X_2|X_1}(\gamma|\beta). \qquad (5.19)$$
We see that the joint probability distribution is not completely specified: We are missing the starting distribution $P_{X_1}$!
Lemma 5.11 (Specification of a Markov Process). A time-invariant Markov process is completely specified by the following three items:
1. the set of all states $\mathcal{S}$;
2. the transition probabilities $\mathsf{P}$;
3. the starting distribution $P_{X_1}(\cdot)$.

Remark 5.12. Note that if we consider a Markov source of memory $\mu > 1$, then the transition probabilities are given by all possible values of
$$P_{X_{\mu+1}|X_\mu, \ldots, X_1}(\cdot|\cdot, \ldots, \cdot), \qquad (5.20)$$
and the starting distribution corresponds to the distribution of the first $\mu$ time-steps:
$$P_{X_1, \ldots, X_\mu}(\cdot, \ldots, \cdot). \qquad (5.21)$$
But, as mentioned, most of the time we will consider Markov processes of memory 1.
A very interesting choice for the starting distribution is the so-called steady-state distribution:

Definition 5.13. A distribution $\pi(\cdot)$ that has the property that a transition will keep it unchanged,
$$\sum_{\alpha} \pi(\alpha)\, P_{X_k|X_{k-1}}(\beta|\alpha) = \pi(\beta), \qquad \forall\,\beta, \qquad (5.22)$$
is called steady-state distribution. Usually, (5.22) is written in vector notation:²
$$\boldsymbol{\pi}^{\mathsf{T}} \mathsf{P} = \boldsymbol{\pi}^{\mathsf{T}}. \qquad (5.23)$$

²Note again that in the case of Markov processes with memory $\mu > 1$ the steady-state distribution is a joint distribution of $\mu$ variables: $\pi(\beta_1, \ldots, \beta_\mu) = P_{X_k, \ldots, X_{k+\mu-1}}(\beta_1, \ldots, \beta_\mu)$. In this case we will not use the vector notation.


Example 5.14. Let's look at the example of Figure 5.2. What is the steady-state distribution? We write down (5.23) explicitly:
$$P_{X_2}(a) = P_{X_1}(a)\, P_{X_2|X_1}(a|a) + P_{X_1}(b)\, P_{X_2|X_1}(a|b) \qquad (5.24)$$
$$= \frac{1}{2}\pi_a + \frac{3}{4}\pi_b \stackrel{!}{=} \pi_a, \qquad (5.25)$$
$$P_{X_2}(b) = P_{X_1}(a)\, P_{X_2|X_1}(b|a) + P_{X_1}(b)\, P_{X_2|X_1}(b|b) \qquad (5.26)$$
$$= \frac{1}{2}\pi_a + \frac{1}{4}\pi_b \stackrel{!}{=} \pi_b. \qquad (5.27)$$
Together with the normalization $\pi_a + \pi_b = 1$, this equation system can be solved:
$$\pi_a = \frac{3}{5}, \qquad (5.28)$$
$$\pi_b = \frac{2}{5}. \qquad (5.29)$$
We have learned the following.

Lemma 5.15. The steady-state distribution can be computed by the balance equations
$$\boldsymbol{\pi}^{\mathsf{T}} = \boldsymbol{\pi}^{\mathsf{T}} \mathsf{P} \qquad (5.30)$$
and the normalization equation
$$\sum_{j=1}^{m} \pi_j = 1. \qquad (5.31)$$
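Numerically, (5.30) and (5.31) form a small linear system. A sketch of our own (using numpy, with the chain of Example 5.10 as input):

```python
import numpy as np

# Solve pi^T P = pi^T together with sum_j pi_j = 1 for the transition
# matrix (5.16) of Example 5.10.
P = np.array([[0.5, 0.5],
              [0.75, 0.25]])
m = P.shape[0]
# (P^T - I) pi = 0 plus the normalization row; solve in the least-squares sense
A = np.vstack([P.T - np.eye(m), np.ones(m)])
b = np.concatenate([np.zeros(m), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)   # -> [0.6 0.4], i.e., pi_a = 3/5 and pi_b = 2/5 as in (5.28)-(5.29)
```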

We will now state the most important result concerning Markov processes that explains why they are so popular in engineering. We will only state the results and omit the proofs. They can be found in many textbooks about probability, e.g., in [BT02].
We first need two more definitions.

Definition 5.16. A Markov process is called irreducible if it is possible (i.e., with positive probability) to go from any state to any other state in a finite number of steps.

Definition 5.17. An irreducible Markov process is called periodic if the states can be grouped into disjoint subsets so that all transitions from one subset lead to the next subset.

Figure 5.3 shows some examples of irreducible, reducible, periodic and aperiodic Markov processes, respectively.
And now we are ready for the most important theorem concerning Markov processes.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

5.2. Markov Processes

127

1
1
3

1
1
3

1
3

irreducible

reducible

periodic

aperiodic

Figure 5.3: Examples of irreducible, reducible, periodic and aperiodic Markov


processes, respectively.

Theorem 5.18 (Markovity & Stationarity).


Let us consider a time-invariant Markov process that is irreducible and
aperiodic. Then
1. the steady-state distribution () is unique;
2. independently of the starting distribution PX1 (), the distribution
PXk () will converge to the steady-state distribution () for k
;
3. the Markov process is stationary if, and only if, the starting distribution PX1 () is chosen to be the steady-state distribution ():
PX1 () = (),

(5.32)

Example 5.19. Let's go back to the example of Figure 5.2. We already know

Table 5.4: The state probability distribution of a Markov source converges to its steady-state distribution.

    PXk(·)         k = 1   k = 2       k = 3          k = 4
    Pr[Xk = a]     1       1/2 = 0.5   5/8 = 0.625    19/32 = 0.59375
    Pr[Xk = b]     0       1/2 = 0.5   3/8 = 0.375    13/32 = 0.40625

    PXk(·)         k = 5                 ...   k = ∞
    Pr[Xk = a]     77/128 = 0.6015625    ...   3/5 = 0.6
    Pr[Xk = b]     51/128 = 0.3984375    ...   2/5 = 0.4

that π = (3/5, 2/5). By Theorem 5.18 we know that if we start in state a, after a while we will have a probability of about 3/5 to be in state a and a probability of about 2/5 to be in state b. This is shown in Table 5.4.

Remark 5.20. Note that time-invariance and stationarity are not the same thing! For stationarity it is necessary that a Markov process is time-invariant; however, a time-invariant Markov process is not necessarily stationary!
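The convergence shown in Table 5.4 is easy to reproduce numerically. A small Python sketch (not part of the original notes), assuming as in the example that the source starts in state a:

    import numpy as np

    P = np.array([[0.5, 0.5],       # transition matrix of Figure 5.2, states (a, b)
                  [0.75, 0.25]])
    p = np.array([1.0, 0.0])        # start in state a, i.e., PX1 = (1, 0)
    for k in range(1, 7):
        print(k, p)                 # PXk converges towards (0.6, 0.4) = (3/5, 2/5)
        p = p @ P                   # PX(k+1) = PXk * P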

5.3 Entropy Rate

So now that we have defined general sources with memory, the question arises what the uncertainty of such a source is, i.e., what is the entropy H({Xk}) of a stochastic process {Xk}?

Example 5.21. If {Xk} is IID, then defining H({Xk}) = H(X1) would make sense as this corresponds to the situation we have considered so far. On the other hand, if the source has memory, this choice will lead to a rather strange result. To see this take as an example PX1(0) = PX1(1) = 1/2, i.e., H(X1) = 1 bit. Assume further that

    PX2|X1(0|0) = 0,   PX2|X1(0|1) = 0,   PX2|X1(1|0) = 1,   PX2|X1(1|1) = 1   (5.33)

(see Figure 5.5). From this we can now compute PX2(1) = 1, which means that H(X2) = 0, H(X2|X1) = 0, and H(X1, . . . , Xn) = 1 bit. So it is clear that it is definitely not correct to define the entropy to be 1 bit, as only the first step contains any uncertainty and then we know for sure that the source remains in state 1 forever.

Figure 5.5: Transition diagram of a source with memory.

By now at the latest it should be obvious that the amount of uncertainty coming from a source with memory strongly depends on the memory. We give the following definition.

Definition 5.22. The entropy rate (i.e., the entropy per source symbol) of any stochastic process {Xk} is defined as

    H({Xk}) ≜ lim_{n→∞} (1/n) H(X1, X2, . . . , Xn)                  (5.34)

if the limit exists.


Remark 5.23. Unfortunately, this limit does not always exist, as can be seen from the following example. Let {Xk} be independent, but for each k the distribution is different:

    Pr[Xk = 1] = 1/2   if k = 1,
                 1/2   if ⌈log2 log2 k⌉ is even (k ≥ 2),             (5.35)
                 0     if ⌈log2 log2 k⌉ is odd (k ≥ 2),

    Pr[Xk = 0] = 1 − Pr[Xk = 1].                                     (5.36)

Then we see that

    H(Xk) = 1 bit    for k = 1, 2 (= 2^(2^0)), 5–16 (= 2^(2^1)+1, . . . , 2^(2^2)),
                     257–65536 (= 2^(2^3)+1, . . . , 2^(2^4)), (2^(2^5)+1)–2^(2^6), . . .   (5.37)

    H(Xk) = 0 bits   for k = 3, 4 (= 2^(2^0)+1, . . . , 2^(2^1)), 17–256 (= 2^(2^2)+1, . . . , 2^(2^3)),
                     65537–2^(2^5), (2^(2^6)+1)–2^(2^7), . . .       (5.38)

The changing points are at 2^(2^0), 2^(2^1), 2^(2^2), 2^(2^3), 2^(2^4), 2^(2^5), 2^(2^6), 2^(2^7), . . . If we now look at

    f(n) ≜ (1/n) H(X1, . . . , Xn) = (1/n) Σ_{k=1}^{n} H(Xk),        (5.39)

we note that f(1) = f(2) = 1 bit, then it will decrease, f(3) = 2/3 bits, f(4) = 2/4 = 1/2 bits, then it will increase again, f(5) = 3/5 bits, f(6) = 4/6 bits, . . . , until the next changing point f(16) = 14/16 bits, where it turns again and decreases, etc. So, f(n) is continuously alternating between increasing and slowly tending towards 1 bit and decreasing and slowly tending towards 0. Now note that the intervals between the changing points are growing double-exponentially, so that the time between each change is getting larger and larger. Therefore, f(n) has enough time to actually approach the corresponding limit 0 or 1 closer and closer each time, before turning away again. Hence, we see that f(·) oscillates between 0 and 1 forever and does not converge.

Question: Why not define the entropy rate as

    H′({Xk}) ≜ lim_{n→∞} H(Xn|Xn−1, Xn−2, . . . , X1)?               (5.40)

The (partial) answer lies in the following theorem.

Theorem 5.24 (Entropy Rate of a DSS). For a stationary stochastic process (or any DSS) the entropy rate H({Xk}) always exists and its value is identical to H′({Xk}):

    H({Xk}) = lim_{n→∞} (1/n) H(X1, . . . , Xn) = lim_{n→∞} H(Xn|Xn−1, . . . , X1).   (5.41)

Furthermore,

1. H(Xn|Xn−1, . . . , X1) is nonincreasing in n;

2. (1/n) H(X1, . . . , Xn) is nonincreasing in n;

3. H(Xn|Xn−1, . . . , X1) ≤ (1/n) H(X1, . . . , Xn),    for all n ≥ 1.

Before we can prove this important result, we need another small lemma.

Lemma 5.25 (Cesàro Mean). If

    lim_{n→∞} an = a                                                 (5.42)

then

    bn ≜ (1/n) Σ_{k=1}^{n} ak → a                                    (5.43)

as n → ∞.

Proof: By assumption, an → a. In mathematical language this means: for every ε > 0 there exists an N such that |an − a| ≤ ε for all n > N. We now want to show that for any ε′ > 0 and for n large enough we have

    |bn − a| ≤ ε′.                                                   (5.44)

To that goal note that for n > N

    |bn − a| = | (1/n) Σ_{k=1}^{n} ak − a |                          (5.45)
             = | (1/n) Σ_{k=1}^{n} ak − (1/n) Σ_{k=1}^{n} a |        (5.46)
             = | (1/n) Σ_{k=1}^{n} (ak − a) |                        (5.47)
             ≤ (1/n) Σ_{k=1}^{n} |ak − a|                            (5.48)
             = (1/n) Σ_{k=1}^{N} |ak − a| + (1/n) Σ_{k=N+1}^{n} |ak − a|   (5.49)
             ≤ (1/n) Σ_{k=1}^{N} |ak − a| + ((n − N)/n) ε            (5.50)
               (the first sum is a constant)
             ≤ ε + ε = 2ε ≤ ε′,    for n large enough,               (5.51)

where the inequality (5.48) follows from the triangle inequality. Since ε > 0 is arbitrary, also ε′ is arbitrary and therefore the claim is proven.
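As a quick numerical illustration of the lemma (a sketch, not part of the original notes): if an converges to 1, so do the running averages bn.

    import numpy as np

    n = np.arange(1, 100001)
    a = 1 + (-1) ** n / np.sqrt(n)   # a_n -> 1
    b = np.cumsum(a) / n             # b_n = (1/n) * sum_{k<=n} a_k
    print(a[-1], b[-1])              # both are close to 1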
Proof of Theorem 5.24: We start with a proof of Part 1:

    H(Xn|Xn−1, . . . , X1) = H(Xn+1|Xn, . . . , X2)                  (5.52)
                           ≥ H(Xn+1|Xn, . . . , X2, X1),             (5.53)

where (5.52) follows from stationarity and (5.53) follows from conditioning that reduces entropy.

To prove Part 3 we rely on the chain rule and on Part 1:

    (1/n) H(X1, . . . , Xn) = (1/n) Σ_{k=1}^{n} H(Xk|Xk−1, . . . , X1)    (chain rule)   (5.54)
                            ≥ (1/n) Σ_{k=1}^{n} H(Xn|Xn−1, . . . , X1)    (Part 1)       (5.55)
                            = H(Xn|Xn−1, . . . , X1).                                    (5.56)

To prove Part 2 we rely on the chain rule (equality in (5.57)), on Part 1 (inequality in (5.58)), and on Part 3 (inequality in (5.59)):

    H(X1, . . . , Xn, Xn+1) = H(X1, . . . , Xn) + H(Xn+1|Xn, . . . , X1)          (5.57)
                            ≤ H(X1, . . . , Xn) + H(Xn|Xn−1, . . . , X1)          (5.58)
                            ≤ H(X1, . . . , Xn) + (1/n) H(X1, . . . , Xn)         (5.59)
                            = ((n + 1)/n) H(X1, . . . , Xn),                      (5.60)

i.e.,

    (1/(n + 1)) H(X1, . . . , Xn+1) ≤ (1/n) H(X1, . . . , Xn).       (5.61)

Finally, to prove (5.41) note that by Parts 1 and 2, both H(Xn|Xn−1, . . . , X1) and (1/n) H(X1, . . . , Xn) are nonincreasing, but bounded below by zero (because entropies are always nonnegative). Hence, they must converge. This shows that the limit exists. Now it only remains to show that they converge to the same limit. To that goal, we rely on the Cesàro mean (Lemma 5.25):

    H({Xk}) = lim_{n→∞} (1/n) H(X1, . . . , Xn)                      (5.62)
            = lim_{n→∞} (1/n) Σ_{k=1}^{n} H(Xk|Xk−1, . . . , X1)     (5.63)
            = lim_{n→∞} H(Xn|Xn−1, . . . , X1)    (Lemma 5.25)       (5.64)
            = H′({Xk}),                                              (5.65)

where in (5.63) the terms H(Xk|Xk−1, . . . , X1) play the role of the ak and (1/n) H(X1, . . . , Xn) that of bn in Lemma 5.25.

Besides stationary processes, we can also prove for a Markov process that the entropy rate is always defined (even if the Markov process is not stationary).

Theorem 5.26 (Entropy Rate of a Markov Source). For a time-invariant, irreducible, and aperiodic Markov process {Xk} of memory μ the entropy rate H({Xk}) always exists and its value is identical to H′({Xk}). In particular, the entropy rate can be computed as

    H({Xk}) = H(Xμ+1|Xμ, . . . , X1)  with  PXμ,...,X1(·, . . . , ·) = π(·, . . . , ·),   (5.66)

where, when computing the conditional entropy H(Xμ+1|Xμ, . . . , X1), we need to choose PXμ,...,X1(·, . . . , ·) to be the steady-state distribution π(·, . . . , ·).

In the case where the memory has length μ = 1, this simplifies to

    H({Xk}) = H(X2|X1)  with  PX1(·) = π(·),                         (5.67)

with π(·) denoting the steady-state distribution.

Thus we see that the entropy rate of a Markov process does not depend on the starting distribution, but only on the transition probabilities (and the steady-state distribution that is computed from the transition probabilities).
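A minimal Python sketch of the computation in (5.67) (not part of the original notes), applied to the Markov source of Figure 5.2:

    import numpy as np

    def entropy_rate_markov(P):
        """Entropy rate (in bits) of a time-invariant, irreducible, aperiodic
        Markov chain with transition matrix P, via (5.67): H(X2|X1) evaluated
        under the steady-state distribution pi."""
        m = P.shape[0]
        # steady-state distribution: solve pi^T P = pi^T with normalization
        A = np.vstack([(P.T - np.eye(m))[:-1], np.ones(m)])
        pi = np.linalg.solve(A, np.r_[np.zeros(m - 1), 1.0])
        # H(X2 | X1 = i) for every state i, then average over pi
        row_entropies = np.array(
            [-sum(p * np.log2(p) for p in row if p > 0) for row in P])
        return float(pi @ row_entropies)

    P = np.array([[0.5, 0.5], [0.75, 0.25]])   # Figure 5.2, states (a, b)
    print(entropy_rate_markov(P))   # = 3/5 * 1 + 2/5 * Hb(3/4) = 0.9245... bits

Note that the starting distribution does not appear anywhere in this computation, in accordance with the remark above.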


Remark 5.27. Before we prove Theorem 5.26, we would like to give a warning: the reader should be aware that even though we assume a time-invariant Markov process, i.e., we assume that

    PXn|Xn−1,...,Xn−μ(·|·, . . . , ·) = PXμ+1|Xμ,...,X1(·|·, . . . , ·)   (5.68)

for all n > μ, we have

    H(Xn|Xn−1, . . . , Xn−μ) ≠ H(Xμ+1|Xμ, . . . , X1)                (5.69)

in general! The reason lies in the fact that for the computation of the conditional entropy H(Xn|Xn−1, . . . , Xn−μ) it is not sufficient to know merely the conditional PMF PXn|Xn−1,...,Xn−μ, but we actually need the joint PMF PXn,...,Xn−μ. This joint PMF, however, depends on n. The inequality in (5.69) can only be replaced by equality if the starting distribution equals the steady-state distribution, i.e., only if the Markov process happens to be stationary.

Proof of Theorem 5.26: The basic idea behind this result is the fact that by Theorem 5.18, a time-invariant, irreducible, and aperiodic Markov process will always converge to the steady-state distribution. A Markov process based on the steady-state distribution, however, looks stationary and therefore should have an entropy rate that is well-defined.

We start with H′({Xk}):

    H′({Xk}) = lim_{n→∞} H(Xn|Xn−1, . . . , X1)                                          (5.70)
             = lim_{n→∞} H(Xn|Xn−1, . . . , Xn−μ)                                        (5.71)
             = lim_{n→∞} H(Xn|Xn−μ, . . . , Xn−1)                                        (5.72)
             = lim_{n→∞} Σ_s PXn−μ,...,Xn−1(s) H(Xn | (Xn−μ, . . . , Xn−1) = s)          (5.73)
             = lim_{n→∞} Σ_s PXn−μ,...,Xn−1(s) H(Xμ+1 | (X1, . . . , Xμ) = s)            (5.74)
             = Σ_s ( lim_{n→∞} PXn−μ,...,Xn−1(s) ) H(Xμ+1 | (X1, . . . , Xμ) = s)        (5.75)
             = Σ_s π(s) H(Xμ+1 | (X1, . . . , Xμ) = s)                                   (5.76)
             = H(Xμ+1|X1, . . . , Xμ)  with  PX1,...,Xμ(·, . . . , ·) = π(·, . . . , ·). (5.77)

Here, (5.71) follows from Markovity; in (5.72) we rearrange the order of Xn−μ, . . . , Xn−1 for notational reasons; (5.73) is due to the definition of conditional entropy; (5.74) follows from time-invariance; in (5.75) we take the limit into the sum since only PXn−μ,...,Xn−1 depends on n; (5.76) is the crucial step: here we use the fact that for a time-invariant, irreducible, and aperiodic Markov process of memory μ, the distribution PXn−μ,...,Xn−1 converges to the steady-state distribution π; and in (5.77) we again use the definition of conditional entropy. This proves the first part.
Next we prove that the entropy rate is identical to the above:

    H({Xk}) = lim_{n→∞} (1/n) H(X1, . . . , Xn)                                          (5.78)
            = lim_{n→∞} (1/n) [ H(Xμ+1, . . . , Xn|X1, . . . , Xμ) + H(X1, . . . , Xμ) ] (5.79)
            = lim_{n→∞} (1/n) H(Xμ+1, . . . , Xn|X1, . . . , Xμ)
              + lim_{n→∞} (1/n) H(X1, . . . , Xμ)    (second limit = 0)                  (5.80)
            = lim_{n→∞} (1/n) H(Xμ+1, . . . , Xn|X1, . . . , Xμ)                         (5.81)
            = lim_{n→∞} (1/n) Σ_{k=μ+1}^{n} H(Xk|X1, . . . , Xk−1)                       (5.82)
            = lim_{n→∞} (1/n) Σ_{k=μ+1}^{n} H(Xk|Xk−μ, . . . , Xk−1)                     (5.83)
            = lim_{n→∞} (1/n) Σ_{k=μ+1}^{n} Σ_s PXk−μ,...,Xk−1(s) H(Xk | (Xk−μ, . . . , Xk−1) = s)   (5.84)
            = lim_{n→∞} (1/n) Σ_{k=μ+1}^{n} Σ_s PXk−μ,...,Xk−1(s) H(Xμ+1 | (X1, . . . , Xμ) = s)     (5.85)
            = Σ_s ( lim_{n→∞} (1/n) Σ_{k=μ+1}^{n} PXk−μ,...,Xk−1(s) ) H(Xμ+1 | (X1, . . . , Xμ) = s) (5.86)
            = Σ_s π(s) H(Xμ+1 | (X1, . . . , Xμ) = s)                                    (5.87)
            = H(Xμ+1|X1, . . . , Xμ)  with  PX1,...,Xμ(·, . . . , ·) = π(·, . . . , ·). (5.88)

Here, (5.79) follows from the chain rule; in (5.81) we use that H(X1, . . . , Xμ) is independent of n and finite; (5.82) again follows from the chain rule; (5.83) from Markovity; in (5.85) we swap the order of the finite summations and use the time-invariance of the Markov process; in (5.86) we use the fact that H(Xμ+1 | (X1, . . . , Xμ) = s) does not depend on k or n; and (5.87) again follows because for a time-invariant, irreducible, and aperiodic Markov process of memory μ the distribution PXk−μ,...,Xk−1 converges to the steady-state distribution π and because of the Cesàro mean (Lemma 5.25).

Chapter 6

Data Compression: Efficient Coding of Sources with Memory

In Chapter 5 we have introduced some types of sources with memory. Now, we will start thinking about ways of compressing them.

6.1 Block-to-Variable-Length Coding of a DSS

We start with a reminder.

Remark 6.1. An r-ary discrete stationary source (DSS) {Uk} is a device that emits a sequence U1, U2, . . . of r-ary RVs such that this sequence is a stationary stochastic process. Hence, every DSS has an entropy rate

    H({Uk}) = lim_{n→∞} (1/n) H(U1, . . . , Un) = lim_{n→∞} H(Un|Un−1, . . . , U1).   (6.1)

For simplicity, at the moment we will only consider block parsing of the
output of the DSS, but variable-length parsers are also often used in practice.
So, we consider now the system shown in Figure 6.1.
Figure 6.1: A general compression scheme for a source with memory: the r-ary DSS emits U1, U2, . . . , an M-block parser produces the messages Vk, Vk+1, . . . , and an encoder with memory maps them to the codewords Ck, Ck+1, . . . The D-ary code is, as usual, assumed to be prefix-free, but the encoder is allowed to have memory. For the moment, we assume an M-block parser.

We note that since the source has memory, the messages have memory, too, i.e., the different Vk depend on each other. Hence, to make the encoder efficient, it should use this additional information provided by the memory!


This means that we allow the encoder to take previous messages into account when encoding the message Vk. In principle, the encoder could use a different prefix-free code for each possible sequence of values for the previous messages V1, V2, . . . , Vk−1. Note that at the time when the decoder should decode the codeword Ck, it already has decoded C1, . . . , Ck−1 to V1, . . . , Vk−1. Hence, the decoder also knows the previous messages and therefore the code used for Vk. It therefore can decode Ck.

Claim 6.2 (Optimal Block-to-Variable-Length Coding of a DSS). We consider an M-block parser that generates the block messages

    Vk ≜ (UkM−M+1, . . . , UkM−1, UkM),    for all k,                (6.2)
         (of length M)

For every k, we take the conditional distribution of the current message Vk conditional on the past (Vk−1, . . . , V1) = (vk−1, . . . , v1),

    PVk|Vk−1,...,V1(·|vk−1, . . . , v1),                             (6.3)

to design an optimal D-ary Huffman code for Vk and then encode Vk with this code. Note that this Huffman code will be different for each k.

The receiver also knows the past (vk−1, . . . , v1) because it has decoded that beforehand, and therefore also knows which code has been used to encode Vk, and hence can decode it.

Such a system is called adaptive Huffman coding.
We immediately realize that adaptive Huffman coding is not a practical system because the effort involved is rather big: we have to redesign our code for every message!

Nevertheless we will quickly analyze the performance of such an adaptive Huffman coding system. The analysis is quite straightforward as we can base it on our knowledge of the performance of a Huffman code. From the coding theorem for a single random message (Theorem 3.35) it follows that the length Lk of the codeword Ck for the message Vk must satisfy

    H(Vk | (V1, . . . , Vk−1) = (v1, . . . , vk−1)) / log D
      ≤ E[Lk | (V1, . . . , Vk−1) = (v1, . . . , vk−1)]              (6.4)
      < H(Vk | (V1, . . . , Vk−1) = (v1, . . . , vk−1)) / log D + 1. (6.5)

Let's first consider the lower bound (6.4). Taking the expectation over V1, . . . , Vk−1 (i.e., multiplying the expression by PV1,...,Vk−1(v1, . . . , vk−1) and summing over all values v1, . . . , vk−1) gives

    E[Lk] = EV1,...,Vk−1[ E[Lk | V1, . . . , Vk−1] ]                                     (6.6)
          = Σ_{v1,...,vk−1} PV1,...,Vk−1(v1, . . . , vk−1)
              E[Lk | (V1, . . . , Vk−1) = (v1, . . . , vk−1)]                            (6.7)
          ≥ Σ_{v1,...,vk−1} PV1,...,Vk−1(v1, . . . , vk−1)
              H(Vk | (V1, . . . , Vk−1) = (v1, . . . , vk−1)) / log D                    (6.8)
          = EV1,...,Vk−1[ H(Vk | (V1, . . . , Vk−1) = (v1, . . . , vk−1)) ] / log D      (6.9)
          = H(Vk|V1, . . . , Vk−1) / log D                                               (6.10)
          = H(U(k−1)M+1, . . . , UkM | U1, . . . , UM, . . . , U(k−2)M+1, . . . , U(k−1)M) / log D   (6.11)
          = [ H(U(k−1)M+1 | U1, . . . , U(k−1)M) + · · · + H(UkM | U1, . . . , UkM−1) ] / log D      (6.12)
          ≥ M · H(UkM | U1, . . . , UkM−1) / log D                                       (6.13)
          ≥ M · H({Uk}) / log D.                                                         (6.14)

Here, (6.6) follows from the law of total expectation; (6.8) follows from (6.4); in (6.11) we use the definition of the messages (6.2); the subsequent equality (6.12) follows from the chain rule; and in the subsequent inequalities (6.13) and (6.14) we use that H(Uj|Uj−1, . . . , U1) is monotonically nonincreasing and converges to the entropy rate H({Uk}) (Theorem 5.24). Hence, we have

    E[Lk]/M ≥ H({Uk}) / log D.                                       (6.15)

Similarly, again using the monotonicity of H(Uj|Uj−1, . . . , U1), we obtain from (6.5)

    E[Lk] < [ H(U(k−1)M+1 | U1, . . . , U(k−1)M) + · · · + H(UkM | U1, . . . , UkM−1) ] / log D + 1   (6.16)
          ≤ M · H(U(k−1)M+1 | U1, . . . , U(k−1)M) / log D + 1,      (6.17)

i.e.,

    E[Lk]/M < H(U(k−1)M+1 | U1, . . . , U(k−1)M) / log D + 1/M.      (6.18)

Since H(U(k−1)M+1 | U1, . . . , U(k−1)M) converges monotonically to H({Uk}), for every ε > 0 there must exist some M0 large enough such that for all M ≥ M0, we have

    E[Lk]/M < H({Uk}) / log D + ε.                                   (6.19)

We summarize this in the following theorem.

Theorem 6.3 (Block-to-Variable-Length Coding Theorem for a DSS).
For every ε > 0, one can choose an M ∈ N large enough such that there exists a D-ary prefix-free code for the kth M-block message of a DSS {Uk} with an average codeword length per source letter satisfying

    E[Lk]/M < H({Uk}) / log D + ε.                                   (6.20)

Conversely, every uniquely decodable D-ary code of the kth M-block message of a DSS {Uk} has an average codeword length per source symbol that satisfies

    E[Lk]/M ≥ H({Uk}) / log D.                                       (6.21)

This holds even if we allow the code to depend on previous messages.

The alert reader might object that strictly speaking we have not proven that adaptive Huffman coding indeed is an optimal approach. However, we hope that our argument of choosing the optimal code at every step based on the complete knowledge at that time is convincing. Readers who would like to get more insight are referred to Section 18.7.

In spite of the success of having derived Theorem 6.3, we still feel uneasy because an adaptive Huffman coding scheme is completely infeasible in practice. Moreover, we also have to address the problem that we assume a known source distribution. Both issues can be resolved by so-called universal coding schemes.

6.2 Elias-Willems Universal Block-to-Variable-Length Coding

We have already mentioned that adaptive Huffman coding is very expensive and often even not feasible. However, there is one more very fundamental problem with Huffman codes: we need to know the exact probability distribution of the DSS. In practice we usually have no clue of the statistical distribution of the source we would like to compress.

We would like to find a compression scheme that compresses as efficiently as adaptive Huffman coding, but without the need of knowing the distribution of the source (and, if possible, in a more feasible and practical way).

The key idea here is to try to find the distribution on the fly.


Example 6.4. Consider a binary DSS Uk ∈ {a, b} and assume that it produces the following sequence:

    a a a b a a b a a a a b a b b a a a a b a                       (6.22)

We note that a seems to have a higher probability than b! So how can we make use of such observations?

6.2.1 The Recency Rank Calculator

The basic idea is shown in Figure 6.2. We use a new device called recency rank calculator. To understand how it works, firstly note that there are r^M different possible messages that can be at the input of the recency rank calculator. We assume that the system has been running for a long time, so that all possible messages have occurred at least once. Then at time k, the message that has been seen most recently has recency rank 1, the next different message before the most recent one has recency rank 2, etc.

Figure 6.2: A first idea for a universal compression system that relies on a new device called recency rank calculator: the r-ary DSS feeds an M-block parser, whose messages Vk are mapped by the recency rank calculator to the numbers Nk, which are used as codewords.

Example 6.5. Assume a binary source (r = 2) and a parser producing block messages of length M = 2. Then we have 2² = 4 possible messages {00, 01, 10, 11} at the input of the recency rank calculator. Suppose that the sequence of past messages before time k is

    . . . Vk−4 Vk−3 Vk−2 Vk−1 = . . . 11 00 11 10 01 11 01 01        (6.23)

(with the rightmost message Vk−1 = 01 being the most recent one). Then we have the recency rank list shown in Table 6.3. We note that the recency rank calculator simply orders all possible messages in the order of their last occurrence.

Let's make this definition formal.

Table 6.3: Example of a recency rank list according to the situation shown in (6.23).

    Message    Recency Rank at Time k
    00         4
    01         1
    10         3
    11         2

Definition 6.6. A recency rank calculator is a device that assigns the actual recency rank Nk to the message Vk, and afterwards updates the recency rank list.

Example 6.7. To continue with our example, assume that Vk = 10. Hence, from the current recency rank list in Table 6.3 we see that Nk = 3. So, the recency rank calculator puts out Nk = 3 and then updates the recency rank list as shown in Table 6.4.

Table 6.4: Updated recency rank list.

    Message    Recency Rank at Time k + 1
    00         4
    01         2
    10         1
    11         3
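A minimal Python sketch of such a recency rank calculator (not part of the original notes; the class name is ours) that reproduces Tables 6.3 and 6.4:

    class RecencyRankCalculator:
        """Keeps all possible messages ordered by most recent occurrence
        (rank 1 = seen most recently), cf. Definition 6.6."""

        def __init__(self, messages):
            # default starting list, in the given order (cf. Table 6.5)
            self.ranking = list(messages)

        def rank(self, message):
            # output the current recency rank of `message` (1-based) ...
            n = self.ranking.index(message) + 1
            # ... and then move it to the front of the list (rank 1)
            self.ranking.remove(message)
            self.ranking.insert(0, message)
            return n

    # reproduce Examples 6.5 and 6.7: past messages ... 11 00 11 10 01 11 01 01
    rrc = RecencyRankCalculator(["00", "01", "10", "11"])
    for v in ["11", "00", "11", "10", "01", "11", "01", "01"]:
        rrc.rank(v)
    print(rrc.ranking)     # ['01', '11', '10', '00'], i.e., the ranks of Table 6.3
    print(rrc.rank("10"))  # Vk = 10 gets Nk = 3; afterwards the list is as in Table 6.4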

According to Figure 6.2, the codeword for Vk is now simply the number Nk. Since the decoder has decoded V1, V2, . . . , Vk−1 before it needs to decode Nk, it can easily keep track of the recency rank list in the same way as the encoder and therefore can correctly decode Vk.

Note that at the start of the system, we do not yet have a recency rank list. But this is not really a problem as we simply will define a default starting value of the recency rank list. An example is shown in Table 6.5.

What is the intuition behind this system? Note that messages that occur often have a small recency rank. Hence we use a shorter codeword (small number) for them. Messages that show up only rarely will have a large Nk, i.e., we assign them a long codeword (large number). So it looks as if we are successfully compressing the source without knowing its distribution! Moreover, if the source actually were not stationary, but kept changing its distribution over time, the system would also be able to adapt to these changes because

Table 6.5: Example of a default starting recency rank list.

    Message    Recency Rank at Time 1 (by definition)
    00         1
    01         2
    10         3
    11         4

the recency rank list will reflect the change as well.¹

However, we do have one serious problem: the numbers Nk are not prefix-free! As a matter of fact, they are not even uniquely decodable!

Example 6.8. In our example we have four recency ranks: 1, 2, 3, 4. Their binary representation is then 1, 10, 11, 100, respectively. So how shall we decode a received sequence 111? It could be 1, 11 or 11, 1 or 1, 1, 1!

We realize:
We need a prefix-free code for the positive integers!

6.2.2 Codes for Positive Integers

Standard Code

We start with the usual binary representation of the positive integers and call this code the standard code B0(n). It is shown in Table 6.6.

Table 6.6: Standard code: usual binary representation of positive integers.

    n        1   2    3    4     5     6     7     8      9      10
    B0(n)    1   10   11   100   101   110   111   1000   1001   1010

As mentioned above, this code is not prefix-free (not even uniquely decodable).

We observe that every codeword has a 1 as its first digit. Moreover, we note that the length LB0(n) of the codeword B0(n) is given by

    LB0(n) = ⌊log2 n⌋ + 1.                                           (6.24)

Be careful: ⌊log2 n⌋ + 1 is not the same as ⌈log2 n⌉. To see this, compare, e.g., the values for n = 4.

¹ Here we need to assume that the change will not be too fast, as the recency rank list needs some time to adapt to a new distribution.


First Elias Code

Elias had the idea [Eli75] to make the standard code prefix-free by simply adding a prefix in front of every codeword consisting of LB0(n) − 1 zeros. We call this code the first Elias code B1(n). The first couple of codewords are listed in Table 6.7.

Table 6.7: First Elias code: add a prefix of LB0(n) − 1 zeros to B0(n).

    n            1   2     3     4       5       6       7       8         9         10        11
    B0(n)        1   10    11    100     101     110     111     1000      1001      1010      1011
    LB0(n) − 1   0   1     1     2       2       2       2       3         3         3         3
    B1(n)        1   010   011   00100   00101   00110   00111   0001000   0001001   0001010   0001011

Obviously, the code B1(n) is prefix-free because the number of leading zeros in a codeword tells us exactly how many digits will come after the inevitable leading 1 of the standard code.
Example 6.9. The sequence of codewords
0010101100110001110 . . .

(6.25)

is unambiguously recognizable as
00101|011|00110|00111|0 . . . = 5|3|6|7| . . .

(6.26)

Unfortunately, the codewords B1(n) are almost twice as long as the codewords B0(n):

    LB1(n) = LB0(n) − 1 + LB0(n) = 2 LB0(n) − 1 = 2⌊log2 n⌋ + 1.     (6.27)

Second Elias Code

To solve the problem of the inefficient length, Elias came up with another idea [Eli75]: instead of using zeros as a prefix to tell the length of the codeword, let's use B1(LB0(n)) to tell how many digits are coming! Moreover, since B1(·) is prefix-free, we do not need to use the leading 1 of B0(n). This code is called the second Elias code and is given in Table 6.8.
Example 6.10. The codeword for n = 7 in the standard representation is B0(7) = 111 with codeword length LB0(7) = 3. Now we use the first Elias code

Table 6.8: Second Elias code: the length is encoded by the first Elias code, the number by the standard code: B2(n) = B1(LB0(n)) [B0(n) without leading 1].

    n     B0(n)   LB0(n)   B1(LB0(n))   B2(n)
    1     1       1        1            1
    2     10      2        010          0100
    3     11      2        010          0101
    4     100     3        011          01100
    5     101     3        011          01101
    6     110     3        011          01110
    7     111     3        011          01111
    8     1000    4        00100        00100000
    9     1001    4        00100        00100001
    10    1010    4        00100        00100010
    11    1011    4        00100        00100011
to write 3 as B1(3) = 011, remove the leading 1 from 111, and get as a new codeword for n = 7:

    B2(7) = 011 11.                                                  (6.28)

Note that because the code B1(n) is prefix-free, we can recognize the first part B1(LB0(n)) of the codeword B2(n) as soon as we see the last digit of B1(LB0(n)). We decode this part and then immediately know the number LB0(n) of digits in the remainder of the codeword. Thus B2(n) is indeed prefix-free.
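The three codes are easy to implement. A minimal Python sketch (not part of the original notes; the function names are ours) that reproduces Tables 6.6 to 6.8:

    def b0(n):
        """Standard code: usual binary representation of n >= 1."""
        return bin(n)[2:]

    def b1(n):
        """First Elias code: prefix B0(n) with len(B0(n)) - 1 zeros."""
        return "0" * (len(b0(n)) - 1) + b0(n)

    def b2(n):
        """Second Elias code: B1(LB0(n)) followed by B0(n) without its leading 1."""
        return b1(len(b0(n))) + b0(n)[1:]

    print(b1(5), b2(7), b2(10))   # 00101  01111  00100010  (cf. Tables 6.7 and 6.8)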
How efficient is B2(n)? We get

    LB2(n) = LB1(LB0(n)) + LB0(n) − 1                                (6.29)
           = LB1(⌊log2 n⌋ + 1) + ⌊log2 n⌋ + 1 − 1                    (6.30)
           = 2⌊log2(⌊log2 n⌋ + 1)⌋ + 1 + ⌊log2 n⌋                    (6.31)
           ≈ log2 n,    for large n.                                 (6.32)

So, because log2 log2 n is extremely slowly growing, the second Elias code is, at least for large n, far better than the first. As a matter of fact we have

    LB2(n) > LB1(n),    only for n = 2, 3, . . . , 15;               (6.33)
    LB2(n) = LB1(n),    for n = 16, 17, . . . , 31;                  (6.34)
    LB2(n) < LB1(n),    for n ≥ 32.                                  (6.35)

Asymptotically, the second Elias code has the same efficiency as the standard code in spite of the overhead that we added to make it prefix-free.

We will now use the second Elias code to build a block-to-variable-length coding scheme.

6.2.3 Elias-Willems Block-to-Variable-Length Coding for a DSS

It is now straightforward to combine the idea of a recency rank calculator with the prefix-free code of Elias. See Figure 6.9 for the corresponding system.

Figure 6.9: An Elias-Willems compression scheme: the r-ary DSS feeds an M-block parser; the messages Vk of length M are mapped by the recency rank calculator to the recency ranks Nk, which are then encoded by B2(·) into codewords Ck of length Lk.


Lets compute the eciency of this system. We suppose that encoding
has been in progress for a time suciently long for all message values to have
appeared at least once in the past so that all recency ranks are genuine.
Dene k to be the time to the most recent occurrence of Vk : If Vk = v,
Vk = v, Vj = v for all j = k 1, . . . , k + 1, then k = .
Example 6.11. If
. . . Vk4 Vk3 Vk2 Vk1 Vk = . . . 11 00 11 10 01 11 01 01 10

(6.36)

then k = 5.
Note that
N k k ,

(6.37)

because only the values of Vk1 , . . . , Vkk +1 (which might not all be dierent!) could have smaller recency ranks than Vk .
We assume that the DSS is ergodic, i.e., time statistics are equal to the
probabilistic statistics. Then we know that for a very long sequence, the
number of occurrences of v divided by the whole length of the sequence is
approximately PV (v). Hence the average spacing between these occurrences
is PV1(v) :
E[k | Vk = v] =

1
.
PV (v)

(6.38)

The binary codeword Ck for the message Vk is B2 (Nk ). Therefore its


length Lk satises
Lk = LB2 (Nk )

(6.39)

LB2 (k )

(6.40)

= log2 k + 2 log2 log2 k + 1

log2 k + 2 log2 log2 k + 1 + 1,

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

+1

(6.41)
(6.42)

6.2. EliasWillems Universal BlockToVariable-Length Coding

145

where the rst inequality follows from (6.37) and the second from dropping the
ooring-operations. Taking the conditional expectation, given that Vk = v,
results in
E[Lk | Vk = v]

E[log2 k | Vk = v] + 2 E log2 log2 k + 1

Vk = v + 1

log2 (E[k | Vk = v]) + 2 log2 log2 (E[k | Vk = v]) + 1 + 1

= log2 PV (v) + 2 log2 log2 PV (v) + 1 + 1.

(6.43)
(6.44)
(6.45)

Here (6.44) follows from Jensens inequality (Theorem 1.29) and the concavity
of the logarithm; and (6.45) follows from (6.38).
Taking the expectation over Vk and invoking the total expectation theorem finally yields

    E[Lk] = EV[ E[Lk | Vk = v] ]                                                 (6.46)
          ≤ E[−log2 PV(V)] + 2 E[log2(−log2 PV(V) + 1)] + 1                      (6.47)
          ≤ E[−log2 PV(V)] + 2 log2(E[−log2 PV(V)] + 1) + 1                      (6.48)
          = H(V) + 2 log2(H(V) + 1) + 1                                          (6.49)
          = M (1/M) H(U1, . . . , UM) + 2 log2( M (1/M) H(U1, . . . , UM) + 1 ) + 1,   (6.50)

where (6.47) follows from (6.45), where in (6.48) we have used Jensen's inequality once more, and where in (6.49) we note that E[−log2 PV(V)] = H(V). Hence we have

    E[Lk]/M ≤ (1/M) H(U1, . . . , UM)
              + (2/M) log2( M (1/M) H(U1, . . . , UM) + 1 ) + 1/M.   (6.51)

Recall that the source is a DSS so that the entropy rate

    H({Uk}) = lim_{M→∞} (1/M) H(U1, . . . , UM)                      (6.52)

exists. Hence for M → ∞ we have

    lim_{M→∞} E[Lk]/M ≤ lim_{M→∞} [ (1/M) H(U1, . . . , UM)
                          + (2/M) log2( M H({Uk}) + 1 ) + 1/M ]      (6.53)
                        (the middle term grows more slowly than M)
                      = H({Uk}).                                     (6.54)

When we compare this with the converse of the coding theorem for block-to-variable-length coding for a DSS (Theorem 6.3), we see that this system is asymptotically (i.e., for large M) optimal.


Theorem 6.12 (Elias-Willems Block-to-Variable-Length Coding Theorem for a DSS).
For the Elias-Willems binary coding scheme of M-block messages from an r-ary ergodic DSS, the codeword length Lk (for all k sufficiently large) satisfies

    H({Uk}) ≤ E[Lk]/M ≤ (1/M) H(U1, . . . , UM)
              + (2/M) log2( M (1/M) H(U1, . . . , UM) + 1 ) + 1/M,   (6.55)

where all entropies are given in bits. By choosing M sufficiently large, E[Lk]/M can be made arbitrarily close to H({Uk}).

We would like to add that this theorem also holds if the DSS is nonergodic, but the proof is more complicated. Moreover, as already mentioned, it usually will even work if the source is not stationary. However, in such a case it will be very hard to prove anything.

Since this coding scheme works without knowing the source statistics, it is called a universal coding scheme. The Elias-Willems coding scheme is an example of a universal block-to-variable-length coding scheme. It is very elegant in description and also in analysis. However, in practice it is not widely used. Some of the most commonly used schemes are universal variable-length-to-block coding schemes called Lempel-Ziv coding schemes. They are also asymptotically optimal; however, as we will see next, their analysis is considerably more complicated.
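Putting the pieces together, here is a toy sketch of a complete Elias-Willems encoder for M-block messages of a binary source, reusing the hypothetical RecencyRankCalculator and b2 sketches given earlier (not part of the original notes):

    def elias_willems_encode(bits, M=2):
        """Toy Elias-Willems encoder: M-block parser -> recency rank -> B2."""
        messages = [format(i, "0{}b".format(M)) for i in range(2 ** M)]
        rrc = RecencyRankCalculator(messages)   # default starting list (Table 6.5)
        codeword = ""
        for k in range(0, len(bits) - M + 1, M):
            v = bits[k:k + M]                   # k-th M-block message
            codeword += b2(rrc.rank(v))         # encode its recency rank with B2
        return codeword

    # For such a short input the output is longer than the input; the gain only
    # appears for long sequences and for sufficiently large M.
    print(elias_willems_encode("1100111001110101"))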

6.3 Lempel-Ziv Universal Coding Schemes

In practical life, the probably most often used compression algorithms are based on the Lempel-Ziv universal coding schemes. There are many variations, but in general one can distinguish two families: the sliding window Lempel-Ziv coding schemes (usually called LZ-77 because Lempel and Ziv published it in 1977) and the tree-structured Lempel-Ziv coding schemes (called LZ-78 as it was published in 1978).

6.3.1 LZ-77: Sliding Window Lempel-Ziv

This algorithm is related to the Elias-Willems coding. We will explain it using the example of a binary source Uk ∈ {A, B}.

Description

• Fix a window size w and a maximum match length lmax. (In the following we will use the example of w = 4.)


• Parsing the source sequence, consider the length-w window before the current position,

    . . . B A A B |B A B B| A B A A . . .                            (6.56)
                   (window)  (current position)

and try to find the largest match of any sequence starting in the window that is identical to the sequence starting at the current position:

    current position:       A B B A B A A . . .                      (6.57)
                            (match: A B B A B)
    3 positions before:     A B B A B B A B A A . . .                (6.58)
                            (match: A B B A B)

• If there exists such a match, the encoder gives out the starting position of this match in the window (in the example above: 3 before the current position) and the length of the match (in the example above: 5), and then moves the current position the number of the match length positions ahead. Note that after lmax matching source letters the match is stopped even if there were further matching letters available.

• If there is no such match, the encoder puts out the uncompressed source letter at the current position, and then moves the current position one letter ahead.

• To distinguish between these two cases (i.e., either putting out a match position and a match length or putting out an uncompressed source letter), we need a flag bit: a flag 0 tells that the codeword describes an uncompressed source letter and a flag 1 tells that the codeword describes a match within the window.

• Note that in the case of a match, the match position and the match length will be encoded into two D-ary phrases of length ⌈logD(w)⌉ and ⌈logD(lmax)⌉, respectively, where a 1 is mapped to 00 · · · 0 and w or lmax is mapped to (D−1)(D−1) · · · (D−1).

• Also note that in the case of an uncompressed source letter, the source letter needs to be mapped into a D-ary phrase of length ⌈logD(|U|)⌉, where |U| denotes the alphabet size of the source. This mapping has to be specified in advance.

Example 6.13. We still keep the example of a binary source Uk ∈ {A, B} and again choose the window size w = 4. Moreover, we fix lmax = 8 and choose binary codewords (D = 2). Assume that the source produces the following sequence:

    A B B A B B A B B B A A B A B A.                                (6.59)

The window of the LZ-77 algorithm slides through the given sequence as follows:

    |A B B A B B A B B B A A B A B A
    |A|
    |A|B|
    |A|B|B|
    |A|B|B|A B B A B B|
    |A|B|B|A B B A B B|B A|
    |A|B|B|A B B A B B|B A|A|
    |A|B|B|A B B A B B|B A|A|B A|
    |A|B|B|A B B A B B|B A|A|B A|B A|                                (6.60)

This leads to the following description:

    (0, A)  (0, B)  (1, 1, 1)  (1, 3, 6)  (1, 4, 2)  (1, 1, 1)  (1, 3, 2)  (1, 2, 2).   (6.61)

We map A to 0 and B to 1. Moreover, we need ⌈log2(4)⌉ = 2 digits for the match position (where 1 → 00, 2 → 01, 3 → 10, and 4 → 11) and ⌈log2(8)⌉ = 3 digits for the match length (where 1 → 000, . . . , 8 → 111). Hence, the description (6.61) is translated into

    (0, 0)  (0, 1)  (1, 00, 000)  (1, 10, 101)  (1, 11, 001)  (1, 00, 000)  (1, 10, 001)  (1, 01, 001),   (6.62)

i.e., the codeword put out by the encoder is

    0001100000110101111001100000110001101001.                       (6.63)

Note that in this small toy example, the window size, the maximum match length, and the length of the encoded sequence are too short to be able to provide a significant compression. Indeed, the encoded sequence is longer than the original sequence.

Also note that it may be better to send short matches uncompressed and only compress longer matches (i.e., we use flag 0 more often). For example, in (6.62) the third part (1, 00, 000) would actually be more efficiently encoded as (0, 1).
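A minimal Python sketch of this parsing step (not part of the original notes; the function name is ours). It reproduces the description (6.61) for the sequence (6.59):

    def lz77_encode(source, w=4, lmax=8):
        """Sketch of the LZ-77 parsing described above (window size w, maximum
        match length lmax).  Returns the list of tuples as in (6.61):
        (0, letter) for an uncompressed letter, (1, position, length) for a match."""
        out, i, n = [], 0, len(source)
        while i < n:
            best_pos, best_len = 0, 0
            # try every starting position inside the window (1 ... w symbols back)
            for p in range(1, min(w, i) + 1):
                l = 0
                # the match may run into the not-yet-encoded part (self-reference)
                while l < lmax and i + l < n and source[i - p + l] == source[i + l]:
                    l += 1
                if l > best_len:
                    best_pos, best_len = p, l
            if best_len >= 1:
                out.append((1, best_pos, best_len))
                i += best_len
            else:
                out.append((0, source[i]))
                i += 1
        return out

    print(lz77_encode("ABBABBABBBAABABA"))
    # [(0,'A'), (0,'B'), (1,1,1), (1,3,6), (1,4,2), (1,1,1), (1,3,2), (1,2,2)]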

Similar to the Elias-Willems coding scheme, the LZ-77 algorithm relies on the repeated occurrence of strings. So its analysis is related to the analysis shown in Section 6.2.3 and we therefore omit it. In practice LZ-77 is used, e.g., in gzip and pkzip.

6.3.2 LZ-78: Tree-Structured Lempel-Ziv

We will again explain the algorithm using the example of a binary source Uk ∈ {A, B} and a binary codeword alphabet (D = 2).

Description

• Parse the source sequence and split it into phrases that have not yet appeared:

    A, B, BA, BB, AB, BBA, ABA, BAA, BAAA                            (6.64)

These distinct phrases are then numbered sequentially starting with 1.

• Encode then every phrase by the number of the phrase that matches its prefix (or by 0 if the prefix is empty) and by the additional new symbol:

    (0, A)  (0, B)  (2, A)  (2, B)  (1, B)  (4, A)  (5, A)  (3, A)  (8, A).   (6.65)

• Then the number of the prefix is mapped into its D-ary representation of length ⌊logD(c)⌋ + 1, where c denotes the number of distinct phrases found in the parsing. The source letter is mapped into a D-ary phrase of length ⌈logD(|U|)⌉, where |U| denotes the alphabet size of the source.

In the example above, we need ⌊log2(9)⌋ + 1 = 4 digits to encode the phrase number. Moreover, we map A to 0 and B to 1. Hence, the description (6.65) is translated into

    (0000, 0)  (0000, 1)  (0010, 0)  (0010, 1)  (0001, 1)  (0100, 0)  (0101, 0)  (0011, 0)  (1000, 0)   (6.66)

and the encoder puts out

    000000000100100001010001101000010100011010000.                  (6.67)

We see that for this algorithm we need two passes over the string. Firstly the string is parsed into phrases, then we count the total number of phrases so that we know how many bits are needed to describe them. And secondly we allot the pointers to the phrases. Note that it is possible to adapt the algorithm such that only one pass is necessary. Also note that the parsing actually generates a tree of phrases.

LZ-78 is used, e.g., in compress, in the GIF format of images, and in certain modems.
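A minimal Python sketch of the LZ-78 parsing step (not part of the original notes; the function name is ours). Assuming the source sequence is the concatenation of the phrases in (6.64), it reproduces the description (6.65):

    def lz78_encode(source):
        """Sketch of the LZ-78 parsing/encoding described above: returns the
        list of pairs (prefix phrase number, new symbol) as in (6.65)."""
        phrases = {}        # phrase string -> phrase number (1-based)
        out, current = [], ""
        for symbol in source:
            if current + symbol in phrases:
                current += symbol              # keep extending the current phrase
            else:
                out.append((phrases.get(current, 0), symbol))
                phrases[current + symbol] = len(phrases) + 1
                current = ""
        # (a possibly incomplete last phrase is ignored in this sketch)
        return out

    print(lz78_encode("ABBABBABBBAABABAABAAA"))
    # [(0,'A'), (0,'B'), (2,'A'), (2,'B'), (1,'B'), (4,'A'), (5,'A'), (3,'A'), (8,'A')]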

6.4 Analysis of LZ-78

We will only analyze LZ-78 because the analysis of LZ-77 is related to the Elias-Willems scheme shown before.² In the following we will assume that the tree-structured Lempel-Ziv algorithm is applied to a stationary and ergodic source {Uk}. Moreover, to simplify our notation, we will restrict ourselves to binary sources and to binary codewords (D = 2). The generalization to general sources and D-ary LZ-78 is straightforward.

6.4.1 Distinct Parsing

Definition 6.14. A parsing of a sequence is a division of the sequence into phrases separated by commas. A distinct parsing is a parsing such that no two phrases are the same.

Example 6.15. 0, 11, 11 is a parsing and 0, 1, 111 is a distinct parsing of 01111.

Note that the LZ-78 parsing is a distinct parsing. But also note that many of the results shown below hold not only for the LZ-78 parsing, but for any distinct parsing.

Definition 6.16. Let c(n) denote the number of phrases in a distinct parsing of a sequence of length n.
Lemma 6.17 (Codeword Length of LZ-78). If the LZ-78 algorithm is applied to a binary length-n sequence u1, . . . , un, then the encoded sequence has a total length l(u1, . . . , un) satisfying

    l(u1, . . . , un) = c(n) (⌊log2 c(n)⌋ + 1) + c(n)                (6.68)
                        (bits describing the prefix)  (new symbol)
                      ≤ c(n) (log2 c(n) + 2)                         (6.69)
                      = c(n) log2 c(n) + 2c(n).                      (6.70)
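As a quick check of (6.68) on the example of Section 6.3.2: there c(n) = 9 phrases, so each phrase needs ⌊log2 9⌋ + 1 = 4 bits for the prefix pointer plus 1 bit for the new symbol, i.e., 9 · 5 = 45 bits in total, which is exactly the length of the codeword (6.67). Using the hypothetical lz78_encode sketch from above:

    import math

    pairs = lz78_encode("ABBABBABBBAABABAABAAA")   # the parsing of (6.64)
    c = len(pairs)
    print(c * (math.floor(math.log2(c)) + 1 + 1))  # -> 45, the length of (6.67)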

6.4.2 Number of Phrases

Lemma 6.18 (Number of Phrases). The number of phrases c(n) in a distinct parsing of a binary sequence of length n ≥ 55 is upper-bounded as follows:

    c(n) ≤ n / ( (1 − εn) log2 n )                                   (6.71)

where

    εn ≜ ( log2(log2 n + 1) + 3 ) / log2 n.                          (6.72)

² As a matter of fact, in the analysis given in [CT06], LZ-77 is adapted in such a way that we end up with an Elias-Willems coding scheme. . .

Note that εn → 0 as n → ∞.
Proof: Let's consider all possible phrases of length not larger than l:

    0, 1,                                                            (6.73)
    00, 01, 10, 11,                                                  (6.74)
    000, 001, 010, . . . , 111,                                      (6.75)
    ...
    00 . . . 0, . . . , 11 . . . 1    (l digits each).               (6.76)

The total number of all these phrases is

    cl = 2 + 4 + 8 + · · · + 2^l = Σ_{j=1}^{l} 2^j = 2^(l+1) − 2.    (6.77)

The total length of all these phrases put together is

    nl = 1·2 + 2·4 + 3·8 + · · · + l·2^l                             (6.78)
       = Σ_{j=1}^{l} j 2^j                                           (6.79)
       = (l − 1) 2^(l+1) + 2,                                        (6.80)

where from (6.79) it also follows that

    n_{l+1} − nl = (l + 1) 2^(l+1).                                  (6.81)

We deduce from (6.80) that for l ≥ 2

    2^(l+1) = (nl − 2)/(l − 1) ≤ nl/(l − 1).                         (6.82)

Now, if we want to maximize³ the number of phrases in a distinct parsing of a sequence of length n, then we need to make the phrases as small as possible. The worst case is if all cl phrases of length not more than l are used. If n = nl for some l (that actually depends on n, i.e., we should actually write l(n)), then

    c(n) ≤ cl = 2^(l+1) − 2 ≤ 2^(l+1) ≤ nl/(l − 1),                  (6.83)
           (by (6.77))              (by (6.82))

³ Actually, since we want to compress the source, we really would like to minimize the number of phrases. But here we are looking at the worst case scenario, so we check what happens in the worst case, when we have as many phrases as possible.

If nl ≤ n < n_{l+1} for some l (or l(n), actually), then we write n = nl + Δ where

    0 ≤ Δ < n_{l+1} − nl = (l + 1) 2^(l+1)    (by (6.81)).           (6.84)

Then, in the worst case, we split the first nl bits into cl phrases of length at most l, and the remaining Δ bits into ⌊Δ/(l+1)⌋ phrases of length l + 1. However, if Δ is not divisible by l + 1, we will need to generate one phrase of length larger than l + 1 because we cannot use any phrases of length ≤ l as they are used up already for the first nl bits!

Example 6.19. Assume n = 52. Since n3 = 34 and n4 = 98, we have n = n3 + 18, i.e., Δ = 18. To maximize the number of phrases we split the first 34 bits into the c3 = 14 different phrases of length ≤ 3, thereby using up all possible phrases of length 3 or smaller. The remaining 18 bits can be split into at most 3 phrases of length 4 and one phrase of length 6. Note that if we split the remaining 18 bits into 4 phrases of length 4, then we would be left with a phrase of length 2. But we do not have any unused phrase of length 2 left!

Hence, we get at most ⌊Δ/(l+1)⌋ − 1 phrases of length l + 1 and one phrase that is possibly longer than l + 1. So,

    c(n) ≤ cl + Δ/(l + 1)                                            (6.85)
         ≤ nl/(l − 1) + Δ/(l + 1)                                    (6.86)
         ≤ nl/(l − 1) + Δ/(l − 1)                                    (6.87)
         = (nl + Δ)/(l − 1)                                          (6.88)
         = n/(l − 1).                                                (6.89)

Note that all this bounding is rather crude, but it will turn out to be good enough.
Next we need to find some bounds on l = l(n). We note that

    n ≥ nl = (l − 1) 2^(l+1) + 2 ≥ 2^l                               (6.90)

(which again is very crude) so that

    l ≤ log2 n.                                                      (6.91)

This we use to show that

    n < n_{l+1} = l 2^(l+2) + 2 ≤ (l + 1) 2^(l+2) ≤ (log2 n + 1) 2^(l+2),   (6.92)
        (by (6.80))                    (by (6.91))

such that

    l + 2 > log2( n / (log2 n + 1) )                                 (6.93)

or

    l − 1 > log2 n − log2(log2 n + 1) − 3                            (6.94)
          = ( 1 − (log2(log2 n + 1) + 3)/log2 n ) log2 n             (6.95)
          = (1 − εn) log2 n.                                         (6.96)

Before plugging this into (6.89) to get the final result, we need to make sure that this lower bound is actually positive! A quick check shows that as soon as n ≥ 55 this is the case. Note that in this case l ≥ 3 so that the requirement l − 1 > 0 is also satisfied.
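As a numerical sanity check of Lemma 6.18 (a sketch, not part of the original notes), one can count the LZ-78 phrases of a random binary string and compare with the bound (6.71), reusing the hypothetical lz78_encode sketch from Section 6.3.2:

    import math, random

    random.seed(1)
    n = 100000
    u = "".join(random.choice("01") for _ in range(n))
    c = len(lz78_encode(u))      # number of phrases of a distinct parsing (up to the last, possibly incomplete, phrase)
    eps = (math.log2(math.log2(n) + 1) + 3) / math.log2(n)
    print(c, n / ((1 - eps) * math.log2(n)))   # c lies below the bound of (6.71)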

6.4.3 Maximum Entropy Distribution

We make a short detour to a general problem: given an alphabet and some constraints on some expectations of a RV, what PMF maximizes entropy? We already know that without any moment constraints, the uniform PMF maximizes entropy for a given finite alphabet. In the following we will generalize this result.

Theorem 6.20 (Maximum Entropy Distribution).
Let the RV X with PMF p(·) take value in the finite alphabet X and satisfy the following J constraints:

    E[rj(X)] = Σ_{x∈X} p(x) rj(x) = αj,    for j = 1, . . . , J,     (6.97)

for some given functions r1(·), . . . , rJ(·), and some given values α1, . . . , αJ. Then H(X) is maximized if, and only if,

    p(x) = p*(x) ≜ e^( λ0 + Σ_{j=1}^{J} λj rj(x) ),                  (6.98)

assuming that λ0, . . . , λJ can be chosen such that (6.97) is satisfied and such that

    Σ_{x∈X} p*(x) = 1.                                               (6.99)

Proof: Let q(·) be any PMF for X satisfying (6.97), and assume that p*(·) in (6.98) exists, i.e., it satisfies (6.97) and (6.99). Then

    Hq(X) = −Σ_{x∈X} q(x) ln q(x)                                    (6.100)
          = −Σ_{x∈X} q(x) ln( (q(x)/p*(x)) p*(x) )                   (6.101)
          = −D(q ‖ p*) − Σ_{x∈X} q(x) ln p*(x)                       (6.102)
            (D(q ‖ p*) ≥ 0)
          ≤ −Σ_{x∈X} q(x) ln p*(x)                                   (6.103)
          = −Σ_{x∈X} q(x) ( λ0 + Σ_{j=1}^{J} λj rj(x) )              (6.104)
          = −λ0 − Σ_{j=1}^{J} λj Eq[rj(X)]                           (6.105)
          = −λ0 − Σ_{j=1}^{J} λj αj        (q(·) satisfies (6.97))   (6.106)
          = −λ0 − Σ_{j=1}^{J} λj Ep*[rj(X)]   (p*(·) satisfies (6.97))   (6.107)
          = −Σ_{x∈X} p*(x) ( λ0 + Σ_{j=1}^{J} λj rj(x) )             (6.108)
          = −Σ_{x∈X} p*(x) ln p*(x)                                  (6.109)
          = Hp*(X).                                                  (6.110)

Hence for any q(·) we have

    Hq(X) ≤ Hp*(X),                                                  (6.111)

i.e., p*(·) maximizes entropy!


As mentioned, Theorem 6.20 is very important in a much more general context than the analysis of LZ-78 and we will come back to it later on again. In the following, we need it only in the context of the following corollary.

Corollary 6.21. Let X be a nonnegative integer-valued RV with mean μ: X ∈ N0, E[X] = μ. Then

    H(X) ≤ (μ + 1) log(μ + 1) − μ log μ.                             (6.112)

Proof: Given X ∈ N0 and E[X] = μ, we know from Theorem 6.20 that p*(x) = e^(λ0 + λ1 x) = c e^(λ1 x) maximizes the entropy. We need to compute the constants c and λ1. From

    Σ_{x=0}^{∞} c e^(λ1 x) = 1                                       (6.113)

we get

    c = 1 / Σ_{x=0}^{∞} e^(λ1 x) = 1 / ( 1/(1 − e^λ1) ) = 1 − e^λ1.  (6.114)

From

    E[X] = Σ_{x=0}^{∞} (1 − e^λ1) e^(λ1 x) x                         (6.115)
         = (1 − e^λ1) Σ_{x=0}^{∞} x e^(λ1 x)                         (6.116)
         = (1 − e^λ1) · e^λ1 / (1 − e^λ1)²                           (6.117)
         = e^λ1 / (1 − e^λ1)                                         (6.118)
         =! μ                                                        (6.119)

we get

    λ1 = ln( μ/(μ + 1) ).                                            (6.120)

Hence,

    p*(x) = (1/(μ + 1)) ( μ/(μ + 1) )^x                              (6.121)

and therefore

    Hp*(X) = E[ −log p*(X) ]                                         (6.122)
           = log(μ + 1) − E[X] log( μ/(μ + 1) )                      (6.123)
           = log(μ + 1) − μ log( μ/(μ + 1) )                         (6.124)
           = (μ + 1) log(μ + 1) − μ log μ.                           (6.125)
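A small numerical check of the corollary (a sketch, not part of the original notes): the geometric PMF (6.121) attains the bound (6.112), while any other PMF on N0 with the same mean has smaller entropy.

    import numpy as np

    def entropy_bits(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    mu = 3.0
    x = np.arange(0, 2000)                           # truncation of N0 (tail is negligible)
    geom = (1 / (mu + 1)) * (mu / (mu + 1)) ** x     # the maximizing PMF (6.121)
    bound = (mu + 1) * np.log2(mu + 1) - mu * np.log2(mu)   # right-hand side of (6.112) in bits
    print(entropy_bits(geom), bound)                 # both are about 3.245 bits

    # a different PMF on N0 with mean 3, e.g. uniform on {0,...,6}, stays below the bound:
    unif = np.full(7, 1 / 7)
    print(sum(k * p for k, p in enumerate(unif)), entropy_bits(unif))   # mean 3.0, about 2.807 bits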

6.4.4 Definition of a Stationary Markov Process

As mentioned, we assume that we apply the Lempel-Ziv algorithm to a binary, stationary, and ergodic source {Uk}. Let the PMF of this process be denoted by P{Uk}(·).

Now, we define a new process {Ũk} to be a (time-invariant) Markov process of memory μ, for some fixed value of μ. To specify this new Markov process we need to specify the transition probabilities and the starting distribution:

• The transition probabilities are defined as follows:

    PŨμ+1|Ũ1,...,Ũμ(uμ+1|u1, . . . , uμ) ≜ PUμ+1|U1,...,Uμ(uμ+1|u1, . . . , uμ)   (6.126)
        = PU1,...,Uμ+1(u1, . . . , uμ+1) / PU1,...,Uμ(u1, . . . , uμ).            (6.127)

• The starting distribution is chosen as follows:

    PŨ1,...,Ũμ(u1, . . . , uμ) ≜ PU1,...,Uμ(u1, . . . , uμ).                      (6.128)

Hence, we define {Ũk} to be Markov with a starting distribution and transition probabilities that match the PMF of our source. Note that {Uk} and {Ũk} are not the same, though, because {Uk} might have memory much longer than the memory μ of {Ũk}. So, {Ũk} is only an approximation to {Uk}, where this approximation becomes more accurate for a larger choice of μ.

By the way, even though we will not use this property below, we can show that {Ũk} is stationary, because {Uk} is stationary:

    PŨ2,...,Ũμ+1(u2, . . . , uμ+1)
      = Σ_{u1} PŨ1,Ũ2,...,Ũμ+1(u1, u2, . . . , uμ+1)                              (6.129)
      = Σ_{u1} PŨ1,...,Ũμ(u1, . . . , uμ) PŨμ+1|Ũ1,...,Ũμ(uμ+1|u1, . . . , uμ)    (6.130)
      = Σ_{u1} PU1,...,Uμ(u1, . . . , uμ) PUμ+1|U1,...,Uμ(uμ+1|u1, . . . , uμ)     (6.131)
      = Σ_{u1} PU1,...,Uμ+1(u1, . . . , uμ+1)                                      (6.132)
      = PU2,...,Uμ+1(u2, . . . , uμ+1)                                             (6.133)
      = PU1,...,Uμ(u2, . . . , uμ+1)                                               (6.134)
      = PŨ1,...,Ũμ(u2, . . . , uμ+1).                                              (6.135)

Here, the first equality (6.129) follows by summing out the marginal distribution from the joint distribution; (6.130) follows from the chain rule; (6.131) from (6.127) and (6.128); the subsequent equalities (6.132) and (6.133) follow again from the chain rule and from summing out the marginal distribution, respectively; (6.134) is due to the stationarity of {Uk}; and (6.135) follows from (6.128). Hence, the starting distribution is the steady-state distribution and therefore the Markov process is stationary.
We next consider the following RV that is generated from the RVs U_{−μ+1}^{n}:

    Wn ≜ −(1/n) log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(U1, . . . , Un|U−μ+1, . . . , U0).   (6.136)

Be very clear that the randomness of Wn follows from U_{−μ+1}^{n} and not from the PMF of {Ũk}! In other words, we have

    Wn = f(U−μ+1, . . . , Un)                                        (6.137)

where the function f is given by

    f(u−μ+1, . . . , un) ≜ −(1/n) log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(u1, . . . , un|u−μ+1, . . . , u0).   (6.138)

We now use our knowledge about this function f (i.e., the fact that f describes the PMF of a Markov process) to change its form:

    f(u−μ+1, . . . , un)
      = −(1/n) log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(u1, . . . , un|u−μ+1, . . . , u0)        (6.139)
      = −(1/n) log Π_{j=1}^{n} PŨj|Ũ−μ+1,...,Ũj−1(uj|u−μ+1, . . . , uj−1)            (6.140)
      = −(1/n) log Π_{j=1}^{n} PŨj|Ũj−μ,...,Ũj−1(uj|uj−μ, . . . , uj−1)              (6.141)
      = −(1/n) log Π_{j=1}^{n} PŨμ+1|Ũ1,...,Ũμ(uj|uj−μ, . . . , uj−1)                (6.142)
      = −(1/n) log Π_{j=1}^{n} PUμ+1|U1,...,Uμ(uj|uj−μ, . . . , uj−1)                (6.143)
      = −(1/n) Σ_{j=1}^{n} log PUμ+1|U1,...,Uμ(uj|uj−μ, . . . , uj−1)                (6.144)
      = (1/n) Σ_{j=1}^{n} g(uj−μ, . . . , uj),                                       (6.145)

where (6.140) follows from the chain rule; (6.141) by Markovity of {Ũk}; (6.142) by the time-invariance of {Ũk}; (6.143) by (6.127); and where the last equality (6.145) should be read as the definition of the function g, i.e.,

    g(uj−μ, . . . , uj) ≜ −log PUμ+1|U1,...,Uμ(uj|uj−μ, . . . , uj−1).   (6.146)

Hence, we have

    Wn = f(U−μ+1, . . . , Un)                                        (6.147)
       = (1/n) Σ_{j=1}^{n} g(Uj−μ, . . . , Uj)                       (6.148)
       → E[g(U1, . . . , Uμ+1)]    in probability                    (6.149)
       = E[ −log PUμ+1|U1,...,Uμ(Uμ+1|U1, . . . , Uμ) ]              (6.150)
       = H(Uμ+1|U1, . . . , Uμ).                                     (6.151)

Here, in (6.149) we let n tend to infinity and note that due to the weak law of large numbers and due to our assumption of stationarity and ergodicity of {Uk}, the time average will tend to the ensemble average in probability.⁴

⁴ This means that the probability that (6.149) holds (up to an arbitrarily small deviation) tends to 1 as n → ∞. See the last part of Section 2.2 for more details.


Hence, we have the following lemma.⁵

Lemma 6.22. Consider the random sequence U_{−μ+1}^{n} and use it to generate the RV

    Wn ≜ −(1/n) log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(U1, . . . , Un|U−μ+1, . . . , U0).   (6.152)

Then the probability that Wn equals (the constant value of) the entropy H(Uμ+1|U1, . . . , Uμ) tends to one as n tends to infinity: for an arbitrary ε > 0 we have

    lim_{n→∞} Pr[ |Wn − H(Uμ+1|U1, . . . , Uμ)| < ε ] = 1.           (6.153)

6.4.5 Distinct Parsing of U_1^n

In the following consider a certain given sequence U_{−μ+1}^{n} = u_{−μ+1}^{n}. We parse u_1^n into c distinct phrases y1, . . . , yc, where phrase j starts at time kj:

    yj ≜ (ukj, . . . , ukj+1−1),                                     (6.154)

and we define

    sj ≜ (ukj−μ, . . . , ukj−1)                                      (6.155)

to be the μ bits preceding yj. Note that s1 = u_{−μ+1}^{0}.

Example 6.23. In the following we show an arbitrary distinct parsing of a binary sequence and choose μ = 3:

    011 | 01 | 1 | 10 | 0 | 101 | 0011 | 0111 |                      (6.156)
    (s1)  (y1) (y2) (y3) (y4)                (yc)

Now take the c phrases yj and group them into groups of phrases with a certain length l and with a preceding state s (l = 1, . . . , n and s ∈ {0, 1}^μ). Let cl,s be the number of such phrases in each group. Hence,

    Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s = c,                              (6.157)

    Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} l cl,s = n,                            (6.158)

where (6.157) follows because we have in total c phrases and where (6.158) follows because the total length of all phrases together is n.

⁵ This lemma is also called AEP and will be discussed in more detail in Chapter 18.

Example 6.24. We continue with (6.156):

    for l = 2, s = 011:   y1, y3          ⟹  c2,011 = 2,             (6.159)
    for l = 2, s = 111:   no phrases      ⟹  c2,111 = 0,             (6.160)
    for l = 1, s = 101:   y2 (not y4!)    ⟹  c1,101 = 1.             (6.161)

We next state a very surprising result that links the parsing of the sequence u_1^n with the Markov distribution defined in Section 6.4.4.

Lemma 6.25 (Ziv's Inequality). For any distinct parsing of u_1^n, we have

    −log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(u1, . . . , un|s1) ≥ Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s log cl,s.   (6.162)

Note that the right-hand side does not depend on P{Ũk}!

Proof: Using the Markovity of P{Ũk} we have

    −log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(u1, . . . , un|s1)
      = −log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(y1, . . . , yc|s1)                            (6.163)
      = −log Π_{j=1}^{c} PŨkj,...,Ũkj+1−1|Ũkj−μ,...,Ũkj−1(yj|sj)                    (6.164)
      = −Σ_{j=1}^{c} log PŨkj,...,Ũkj+1−1|Ũkj−μ,...,Ũkj−1(yj|sj)                    (6.165)
      = −Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} Σ_{j: yj has length l and sj=s}
           log PŨkj,...,Ũkj+l−1|Ũkj−μ,...,Ũkj−1(yj|s)                               (6.166)
      = −Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s Σ_{j: yj has length l and sj=s}
           (1/cl,s) log PŨkj,...,Ũkj+l−1|Ũkj−μ,...,Ũkj−1(yj|s)                      (6.167)
      ≥ −Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s
           log( (1/cl,s) Σ_{j: yj has length l and sj=s} PŨkj,...,Ũkj+l−1|Ũkj−μ,...,Ũkj−1(yj|s) ),   (6.168)

where the inequality follows from Jensen's inequality: we simply regard 1/cl,s as the probability of a uniform distribution over all values j such that yj has length l and such that sj = s.

Next note that for a given state s, the summation of P(yj|s) over all yj of given length l cannot exceed 1 because the parsing is distinct, i.e., we do not count any sequence twice:

    Σ_{j: yj has length l and sj=s} PŨkj,...,Ũkj+l−1|Ũkj−μ,...,Ũkj−1(yj|s) ≤ 1.     (6.169)

So we get from (6.168)

    −log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(u1, . . . , un|s1)
      ≥ −Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s log(1/cl,s) = Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s log cl,s.   (6.170)

6.4.6 Asymptotic Behavior of LZ-78

Now we are ready to combine all these results to get an insight into the asymptotic behavior of LZ-78. Firstly, based on (6.157) we define a new probability distribution

    ρl,s ≜ cl,s / c.                                                 (6.171)

Then we define a RV L ∈ {1, . . . , n} and a random vector S ∈ {0, 1}^μ with a joint distribution ρl,s such that

    Pr[L = l, S = s] = ρl,s.                                         (6.172)

Note that these RVs have no engineering meaning, but are simply defined for the convenience of our proof.

Then, using Lemma 6.25, we get

    −log PŨ1,...,Ũn|Ũ−μ+1,...,Ũ0(u1, . . . , un|s1)
      ≥ Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s log cl,s                      (6.173)
      = Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s log( (cl,s/c) c )             (6.174)
      = Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s log(cl,s/c) + Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} cl,s log c   (6.175)
                                                    (the second sum equals c log c by (6.157))
      = c Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} (cl,s/c) log(cl,s/c) + c log c   (6.176)
      = c Σ_{l,s} ρl,s log ρl,s + c log c                            (6.177)
      = −c H(L, S) + c log c                                         (6.178)
      ≥ −c H(L) − c H(S) + c log c,                                  (6.179)

where in the last inequality we have used the fact that conditioning reduces entropy such that

    H(L, S) = H(L) + H(S|L) ≤ H(L) + H(S).                           (6.180)

Next note that S takes on 2^μ different values and therefore

    H(S) ≤ log2 2^μ = μ bits.                                        (6.181)

Moreover, using (6.158), we have

    E[L] = Σ_{l=1}^{n} l Pr[L = l]                                   (6.182)
         = Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} l Pr[L = l, S = s]              (6.183)
         = Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} l ρl,s                          (6.184)
         = Σ_{l=1}^{n} Σ_{s∈{0,1}^μ} l (cl,s/c) = n/c,               (6.185)

so that by Corollary 6.21

    H(L) ≤ (E[L] + 1) log2(E[L] + 1) − E[L] log2 E[L]                (6.186)
         = (n/c + 1) log2(n/c + 1) − (n/c) log2(n/c).                (6.187)
(6.187)

Combining (6.187) and (6.181) with (6.179) then yields:
$$ -\frac{1}{n}\log_2 P_{U_1,\ldots,U_n|U_{-\kappa+1}^0}(u_1,\ldots,u_n|\mathbf{s}_1) \;\geq\; \frac{c}{n}\log_2 c - \frac{c}{n}\biggl[\Bigl(\frac{n}{c}+1\Bigr)\log_2\Bigl(\frac{n}{c}+1\Bigr) - \frac{n}{c}\log_2\frac{n}{c}\biggr] - \frac{c}{n}\,\kappa \qquad(6.188) $$
$$ = -\log_2\Bigl(\frac{n}{c}+1\Bigr) - \frac{c}{n}\log_2\Bigl(\frac{n}{c}+1\Bigr) + \log_2\frac{n}{c} + \frac{c}{n}\log_2 c - \frac{c}{n}\,\kappa \qquad(6.189) $$
$$ = \underbrace{-\log_2\Bigl(1+\frac{c}{n}\Bigr) - \frac{c}{n}\log_2\Bigl(\frac{n}{c}+1\Bigr) - \frac{c}{n}\,\kappa}_{\text{function of } c/n} + \frac{c}{n}\log_2 c, \qquad(6.190) $$
where in the last step we have combined the first and third term of (6.189). Now note that by Lemma 6.18
$$ \frac{c}{n} \;\leq\; \frac{1}{(1-\epsilon_n)\log_2 n} \qquad(6.191) $$

and that $-\log_2(1+z) - z \log_2\bigl(\tfrac{1}{z}+1\bigr) - \kappa z$ is monotonically decreasing in $z$. Hence,
$$ -\frac{1}{n}\log_2 P_{U_1,\ldots,U_n|U_{-\kappa+1}^0}(u_1,\ldots,u_n|\mathbf{s}_1) \;\geq\; \frac{c}{n}\log_2 c + \biggl[-\log_2(1+z) - z\log_2\Bigl(\frac{1}{z}+1\Bigr) - \kappa z\biggr]\bigg|_{z = \frac{1}{(1-\epsilon_n)\log_2 n}} \qquad(6.192) $$
$$ \geq \frac{c(n)\log_2 c(n)}{n} - \log_2\biggl(1 + \frac{1}{(1-\epsilon_n)\log_2 n}\biggr) - \frac{\log_2\bigl((1-\epsilon_n)\log_2 n + 1\bigr)}{(1-\epsilon_n)\log_2 n} - \frac{\kappa}{(1-\epsilon_n)\log_2 n} \qquad(6.193) $$
$$ \triangleq \frac{c(n)\log_2 c(n)}{n} - \epsilon_\kappa(n). \qquad(6.194) $$

Here (6.194) should be read as the definition of $\epsilon_\kappa(n)$. Note that $\epsilon_\kappa(n) \to 0$ as $n \to \infty$ for every fixed $\kappa$.

Hence, for any distinct parsing of a binary sequence $u_1, \ldots, u_n$ we have
$$ \frac{c(n)\log_2 c(n)}{n} \;\leq\; -\frac{1}{n}\log_2 P_{U_1,\ldots,U_n|U_{-\kappa+1}^0}(u_1,\ldots,u_n|\mathbf{s}_1) + \epsilon_\kappa(n). \qquad(6.195) $$

Turning finally again back to the concrete case of the LZ-78 algorithm and its particular distinct parsing,⁶ we know from (6.70) that
$$ \frac{l(u_1,\ldots,u_n)}{n} \;\leq\; \frac{c(n)\log_2 c(n) + 2c(n)}{n}. \qquad(6.196) $$
But since this holds for any given sequence $u_{-\kappa+1}, \ldots, u_n$, it must hold for the random sequence $U_{-\kappa+1}, \ldots, U_n$ (in which case the number of phrases $c(n)$ becomes a RV $C(n)$, too):⁷
$$ \overline{\lim_{n\to\infty}} \frac{l(U_1,\ldots,U_n)}{n} \;\leq\; \overline{\lim_{n\to\infty}} \frac{C(n)\log_2 C(n) + 2C(n)}{n} \qquad(6.197) $$
$$ \leq \overline{\lim_{n\to\infty}}\biggl[-\frac{1}{n}\log_2 P_{U_1,\ldots,U_n|U_{-\kappa+1}^0}\bigl(U_1,\ldots,U_n\big|U_{-\kappa+1}^0\bigr) + \epsilon_\kappa(n) + \frac{2}{(1-\epsilon_n)\log_2 n}\biggr] \qquad(6.198) $$
$$ = H(U_{\kappa+1}|U_\kappa,\ldots,U_1) \text{ bits} \qquad \text{in probability}. \qquad(6.199) $$

⁶Note that the only result that actually depends on LZ-78 is (6.70)! All other results are true for any distinct parsing.
⁷Note that $\overline{\lim}_{n\to\infty}$, called limit superior or lim sup in short, is a mathematical trick to avoid the problem of having to prove that a limit exists. It can be thought of as the limit of the maximum points of a (possibly oscillating) sequence. It always exists or is equal to infinity. If you don't worry about whether the limit really exists or not, then you can simply ignore it and think about it as being a normal limit $\lim_{n\to\infty}$.

Here, (6.198) follows from (6.195) and (6.191), where we have replaced the deterministic sequence $u_{-\kappa+1}^n$ by the random sequence $U_{-\kappa+1}^n$; and (6.199) follows from Lemma 6.22.

If we now let $\kappa$ tend to infinity and note that by stationarity
$$ \lim_{\kappa\to\infty} H(U_{\kappa+1}|U_\kappa,\ldots,U_1) = H(\{U_k\}), \qquad(6.200) $$

then we get the following final result.

Theorem 6.26 (Asymptotic Behavior of LZ-78).
Let $\{U_k\}$ be a binary stationary ergodic process. Let $l(U_1,\ldots,U_n)$ be the LZ-78 binary codeword length associated with $U_1^n$. Then
$$ \overline{\lim_{n\to\infty}} \frac{l(U_1,\ldots,U_n)}{n} \;\leq\; H(\{U_k\}) \text{ bits} \qquad \text{in probability}. \qquad(6.201) $$

Note that from the converse given in Theorem 6.3 we know that for any blocksize $n$ and for $D = 2$ we have
$$ \frac{\mathrm{E}[l(U_1,\ldots,U_n)]}{n} \;\geq\; H(\{U_k\}) \text{ bits}. \qquad(6.202) $$
So we see that LZ-78 is asymptotically optimal (at least the probability that the source emits a sequence that will be compressed by LZ-78 in an optimal fashion tends to 1). This works for arbitrary sources with memory and without a priori knowledge of the source distribution, as long as the source is stationary and ergodic. Note, however, that in practice it will also work for nonstationary and nonergodic sources as long as the source distribution is not changing too fast.

It is also important to realize that no compression scheme will work perfectly for every possible source sequence. Even an adaptive Huffman coding scheme that is designed for a known source has certain sequences that will be very poorly compressed. However, the probability that such a sequence will occur is very small and tends to zero as n gets large.
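To make the phrase counting and the length bound (6.196) concrete, here is a minimal Python sketch (not part of the original notes; the function and variable names are our own) that performs LZ-78 incremental parsing of a binary string, counts the phrases c(n), and compares the resulting rate bound with the entropy of an IID source. For moderate n the bound typically still lies noticeably above the entropy, reflecting the slow convergence promised by Theorem 6.26.

import math, random

def lz78_phrase_count(bits):
    # Incremental parsing: each new phrase is the shortest prefix not seen before.
    phrases, current, count = set(), "", 0
    for b in bits:
        current += b
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    return count + (1 if current else 0)   # last (possibly repeated) phrase

# IID binary source with P(1) = p (arbitrary illustrative parameters).
p, n = 0.1, 200_000
u = "".join("1" if random.random() < p else "0" for _ in range(n))
c = lz78_phrase_count(u)
rate_bound = (c * math.log2(c) + 2 * c) / n          # bits per source bit, cf. (6.196)
entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(f"c(n) = {c}, LZ-78 rate bound = {rate_bound:.3f}, H = {entropy:.3f} bits")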

Chapter 7

Optimizing Probability Vectors over Concave Functions: Karush–Kuhn–Tucker Conditions

7.1  Introduction

In information theory (and also in many situations elsewhere), we often encounter the problem that we need to optimize a function of a probability mass function (PMF). A famous example of such a problem is the computation of channel capacity (see Chapter 10), which involves maximizing mutual information over the probability distribution of the channel input.

In general, such problems are rather difficult as we are not only looking for the optimal value that maximizes a given merit function, but we need to find the optimal distribution, i.e., an optimal vector where the components of the vector need to satisfy some constraints, namely, they must be nonnegative and need to sum to one. In the lucky situation when the merit function is concave (or convex), however, the problem actually simplifies considerably. In the following we will show how to solve this special case.

So, the problem under consideration is as follows. Let f : R^L → R be a concave function of a probability vector¹ θ ∈ R^L. We want to find the maximum value of f(θ) over all possible (legal) values of θ:
$$ \max_{\substack{\boldsymbol{\theta} \text{ s.t. } \theta_\ell \geq 0,\; \ell = 1,\ldots,L,\\ \text{and } \sum_{\ell=1}^{L} \theta_\ell = 1}} f(\boldsymbol{\theta}). \qquad(7.1) $$

¹By probability vector we mean a vector whose components are nonnegative and sum to 1, see Definition 7.3.

7.2  Convex Regions and Concave Functions

Let's start by a quick review of convex regions and concave functions.

Definition 7.1. A region R ⊆ R^L is said to be convex if for each pair of points in R, the straight line between those points stays in R:
$$ \forall\, \boldsymbol{\theta}, \boldsymbol{\theta}' \in \mathcal{R}: \quad \alpha\boldsymbol{\theta} + (1-\alpha)\boldsymbol{\theta}' \in \mathcal{R} \qquad \text{for } 0 \leq \alpha \leq 1. \qquad(7.2) $$

Example 7.2. Some examples of convex or nonconvex regions are shown in Figure 7.1.

Figure 7.1: Examples of convex and nonconvex regions in R² (each region in the figure is labeled "convex" or "not convex").


We are especially interested in one particular convex region: the region of
probability vectors.
Denition 7.3. A vector is called probability vector if all its components are
nonnegative and they sum up to 1.
A probability vector is an elegant way to write a probability distribution
over a nite alphabet.
Example 7.4. If a RV X takes value in

1
2

PX (x) = 1
81

16
1

16

{1, 2, 3, 4, 5} with probabilities


if
if
if
if
if

x = 1,
x = 2,
x = 3,
x = 4,
x = 5,

(7.3)

then its distribution can be written as probability vector


p=

1 1 1 1 1
, , , ,
2 4 8 16 16

(7.4)

Definition 7.5. For any L ∈ N, we define the region of probability vectors R_L ⊂ R^L as
$$ \mathcal{R}_L \triangleq \Bigl\{ \mathbf{p} \in \mathbb{R}^L : p_\ell \geq 0,\; \forall\, \ell \in \{1,\ldots,L\}, \text{ and } \sum_{\ell=1}^{L} p_\ell = 1 \Bigr\}. \qquad(7.5) $$

Lemma 7.6. The region of probability vectors R_L is convex.


Proof: Let θ, θ′ ∈ R_L be probability vectors. Define θ″ ≜ αθ + (1 − α)θ′ for any α ∈ [0, 1]. We need to show that θ″ is also a probability vector. Obviously, θ″_ℓ ≥ 0 for all ℓ ∈ {1, . . . , L}. Moreover,
$$ \sum_{\ell=1}^{L} \theta''_\ell = \sum_{\ell=1}^{L} \bigl( \alpha\theta_\ell + (1-\alpha)\theta'_\ell \bigr) \qquad(7.6) $$
$$ = \alpha \sum_{\ell=1}^{L} \theta_\ell + (1-\alpha) \sum_{\ell=1}^{L} \theta'_\ell \qquad(7.7) $$
$$ = \alpha + (1-\alpha) = 1. \qquad(7.8) $$
Hence, all components of θ″ are nonnegative and they sum up to 1, i.e., θ″ ∈ R_L.


Definition 7.7. Let R ⊆ R^L be a convex region. A real-valued function f : R → R is said to be concave over a convex region R if for all θ, θ′ ∈ R and 0 < α < 1:
$$ \alpha f(\boldsymbol{\theta}) + (1-\alpha) f(\boldsymbol{\theta}') \;\leq\; f\bigl( \alpha\boldsymbol{\theta} + (1-\alpha)\boldsymbol{\theta}' \bigr). \qquad(7.9) $$
It is called strictly concave if the above inequality is strict.

Note that a function can only be defined to be concave over a convex region because otherwise the right-hand side of (7.9) might not be defined for some values of 0 < α < 1.

Also note that the definition of convex functions is identical apart from a reversed sign. This means that if f(·) is concave, then −f(·) is convex and vice versa. So, all results can easily be adapted for convex functions and we only need to treat concave functions here.

Example 7.8. See Figure 7.2 for an example of a concave function.

Figure 7.2: Example of a concave function.

Lemma 7.9. Concave functions have the following properties:

1. If f_1(·), . . . , f_n(·) are concave and c_1, . . . , c_n are positive numbers, then
$$ \sum_{k=1}^{n} c_k f_k(\boldsymbol{\theta}) \qquad(7.10) $$
is concave, with strict concavity if at least one f_k(·) is strictly concave.

2. If f(·) is monotonically increasing and concave over a convex region R_1, and g(·) is concave over R_2 and has an image I ⊆ R_1, then f(g(·)) is concave.

3. For a one-dimensional vector θ, if
$$ \frac{\partial^2 f(\theta)}{\partial \theta^2} \leq 0 \qquad(7.11) $$
everywhere in an interval, then f(θ) is concave in this interval.

4. Jensen's inequality: for any concave function f(·) and any random vector X ∈ R^L taking on M values,² we have
$$ \mathrm{E}[f(\mathbf{X})] \leq f(\mathrm{E}[\mathbf{X}]). \qquad(7.12) $$

Proof: We only have a short look at the last property. Let (α_1, . . . , α_M) be the probabilities of the M values of X, i.e., α_m ≥ 0 and Σ_{m=1}^{M} α_m = 1. Let θ_1, . . . , θ_M be a set of vectors in a convex region R ⊆ R^L where f(·) is concave. Then
$$ \sum_{m=1}^{M} \alpha_m f(\boldsymbol{\theta}_m) \;\leq\; f\Biggl( \sum_{m=1}^{M} \alpha_m \boldsymbol{\theta}_m \Biggr) \qquad(7.13) $$
by definition of concavity (to be precise, for M = 2 it is by definition, for M > 2 we use the definition recursively). Let X be a random vector that takes on the values θ_m with probability α_m, m = 1, . . . , M. The result now follows from (7.13).

²Actually, Jensen's inequality also holds for random vectors with an alphabet of infinite size or even with a continuous alphabet. Recall the discussion in Section 1.5.
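The following short Python snippet (our own illustration; the values and probabilities are chosen arbitrarily) numerically checks Jensen's inequality (7.12) for the concave function log₂(·).

import numpy as np

f = np.log2                             # a concave function
x = np.array([1.0, 2.0, 8.0])           # the M = 3 values of X
alpha = np.array([0.5, 0.3, 0.2])       # their probabilities

lhs = np.sum(alpha * f(x))              # E[f(X)]
rhs = f(np.sum(alpha * x))              # f(E[X])
print(f"E[f(X)] = {lhs:.4f} <= f(E[X]) = {rhs:.4f}")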

7.3  Maximizing Concave Functions

Why are concave functions interesting? We will see that they are relatively easy to maximize over a given region.

Let's firstly consider the simpler situation where we do not worry about the normalization of a probability vector's components (i.e., we ignore that the components need to sum to 1). Instead, let R ⊆ R^L be a convex region, let θ ∈ R be an L-dimensional real vector, and let f(·) be a concave function over R. We would now like to find the maximum of f over the convex region R.

In a first straightforward attempt to solve the problem we remember from calculus that an optimum is reached if the derivative is zero. So we ask the optimum vector θ* to satisfy the following conditions:
$$ \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_\ell} \overset{!}{=} 0, \qquad \forall\, \ell \in \{1,\ldots,L\}. \qquad(7.14) $$
But there is a problem with these conditions: The set of equations (7.14) might not have any solution, or, if it has a solution in R^L, this solution might not be in R!

Luckily, we have at least the following lemma.

Lemma 7.10. If there exists a solution θ* ∈ R^L that simultaneously satisfies (7.14) and the constraint θ* ∈ R, then this solution maximizes f(θ).

Proof: By contradiction assume that there exists a θ ∈ R, θ ≠ θ*, with f(θ) > f(θ*). Then the line from f(θ*) to f(θ) is increasing. But because f(·) is concave, the slope of f(·) at θ* must be larger than the slope of the line. This is a contradiction to (7.14).

So what about the situation when we cannot find a correct solution to (7.14)? Since we have assumed that f(·) is concave we note that such a situation can only happen if either f(·) has no peak where the derivative is zero, or this peak occurs outside of the given convex region R. See Figure 7.3 for an illustration. In that case, f(·) is maximized by a θ* on the boundary of the convex region R (where we can think of infinity to be a boundary, too).

Figure 7.3: Maximum is achieved on the boundary of the region.

Note that in such a case θ* can only maximize f(θ) over R if f(θ) is decreasing when going from the boundary point θ* into the region! This gives us an idea on how to fix (7.14): Instead of (7.14), an optimal vector θ* should satisfy
$$ \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_\ell} \overset{!}{=} 0, \qquad \text{such that } \theta^*_\ell \text{ is inside the region } \mathcal{R}, \qquad(7.15) $$
$$ \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_\ell} \overset{!}{\leq} 0, \qquad \text{such that } \theta^*_\ell \text{ is on the boundary of } \mathcal{R}. \qquad(7.16) $$
We will prove (7.15) and (7.16) later. By the way, why do we have "≤" in (7.16) and not "≥"? Note that the maximum might also occur on the right boundary, in which case we want the derivative to be positive! The answer here lies in the way how we compute the derivative. See Appendix 7.A for more details.
Now note that we are actually interested in the situation of probability vectors, where we have seen that the region of probability vectors is convex. Since θ_ℓ ≥ 0, its boundary is θ_ℓ = 0, ℓ ∈ {1, . . . , L}. However, beside this nonnegativity condition, probability vectors also have to satisfy
$$ \sum_{\ell=1}^{L} \theta_\ell = 1. \qquad(7.17) $$
Unfortunately, this condition is not a boundary because the conditions on the different components are highly dependent.

So how shall we include this condition into our maximization problem? We use an old trick deep in the mathematicians' box of tricks: Since we want (7.17) to hold, we also have that
$$ \sum_{\ell=1}^{L} \theta_\ell - 1 = 0 \qquad(7.18) $$
or, for any constant λ ∈ R,
$$ \lambda \Biggl( \sum_{\ell=1}^{L} \theta_\ell - 1 \Biggr) = 0. \qquad(7.19) $$
Hence, any θ that maximizes f(θ) must also maximize the function
$$ F(\boldsymbol{\theta}) \triangleq f(\boldsymbol{\theta}) - \lambda \Biggl( \sum_{\ell=1}^{L} \theta_\ell - 1 \Biggr). \qquad(7.20) $$
And this has to be true for every choice of λ! Now the trick is to try to find a θ*(λ) that maximizes F(θ) instead of f(θ) and to ignore the constraint (7.17) for the moment. Note that such an optimal θ*(λ) will depend on λ and it will in general not satisfy (7.17). But since we can choose λ freely, we can tweak it in such a way that the solution θ*(λ) also satisfies (7.17). So, then we have found one solution θ* that maximizes F(θ) and satisfies (7.17). However, in exactly this case, θ* also maximizes f(θ). We are done!

So, the problem hence reduces to finding the maximum of (7.20). Let's therefore apply (7.15) and (7.16) to F(θ) instead of f(θ). We note that
$$ \frac{\partial F(\boldsymbol{\theta})}{\partial \theta_\ell} = \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell} - \lambda \qquad(7.21) $$
and therefore we get from (7.15) and (7.16):
$$ \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell} \overset{!}{=} \lambda \qquad \text{such that } \theta_\ell > 0, \qquad(7.22) $$
$$ \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell} \overset{!}{\leq} \lambda \qquad \text{such that } \theta_\ell = 0, \qquad(7.23) $$
where λ is chosen such that (7.17) is satisfied. The mathematicians call the constant λ the Lagrange multiplier. Let us now summarize and formally prove our insights in the following theorem.

Theorem 7.11 (Karush–Kuhn–Tucker (KKT) Conditions).
Let f : R^L → R be concave over the convex region R_L of L-dimensional probability vectors. Let θ = (θ_1, . . . , θ_L)ᵀ ∈ R_L be a probability vector. Assume that the partial derivatives ∂f(θ)/∂θ_ℓ are defined and continuous over R_L with the possible exception that
$$ \lim_{\theta_\ell \downarrow 0} \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell} = +\infty \qquad(7.24) $$
is allowed. Then
$$ \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell} \overset{!}{=} \lambda \qquad \text{such that } \theta_\ell > 0, \qquad(7.25) $$
$$ \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell} \overset{!}{\leq} \lambda \qquad \text{such that } \theta_\ell = 0, \qquad(7.26) $$
are necessary and sufficient conditions on θ to maximize f(θ) over R_L. Note that the Lagrange multiplier λ needs to be chosen such that
$$ \sum_{\ell=1}^{L} \theta_\ell = 1. \qquad(7.27) $$

Proof: Sufficiency: Assume (7.25) and (7.26) hold for some θ* and λ. Then we want to show that, for any θ, f(θ) ≤ f(θ*). We know from concavity that for any α ∈ [0, 1]
$$ \alpha f(\boldsymbol{\theta}) + (1-\alpha) f(\boldsymbol{\theta}^*) \leq f\bigl(\alpha\boldsymbol{\theta} + (1-\alpha)\boldsymbol{\theta}^*\bigr). \qquad(7.28) $$
Hence,
$$ f(\boldsymbol{\theta}) - f(\boldsymbol{\theta}^*) \leq \frac{f\bigl(\alpha\boldsymbol{\theta} + (1-\alpha)\boldsymbol{\theta}^*\bigr) - f(\boldsymbol{\theta}^*)}{\alpha} \qquad(7.29) $$
$$ = \frac{f\bigl(\boldsymbol{\theta}^* + \alpha(\boldsymbol{\theta} - \boldsymbol{\theta}^*)\bigr) - f(\boldsymbol{\theta}^*)}{\alpha} \qquad(7.30) $$
$$ \to \sum_{\ell=1}^{L} \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_\ell}\,(\theta_\ell - \theta^*_\ell) \qquad \text{for } \alpha \downarrow 0 \qquad(7.31) $$
$$ \leq \sum_{\ell=1}^{L} \lambda\,(\theta_\ell - \theta^*_\ell) \qquad(7.32) $$
$$ = \lambda \Biggl( \sum_{\ell=1}^{L} \theta_\ell - \sum_{\ell=1}^{L} \theta^*_\ell \Biggr) \qquad(7.33) $$
$$ = \lambda\,(1 - 1) = 0. \qquad(7.34) $$
Here in (7.31) we choose α to tend to 0. This will cause the expression to tend to the sum of the derivatives. If this expression looks confusing to you, you might remember the one-dimensional case, which should look familiar:
$$ \lim_{\alpha \downarrow 0} \frac{f\bigl(\theta^* + \alpha(\theta - \theta^*)\bigr) - f(\theta^*)}{\alpha} = \lim_{\alpha \downarrow 0} \frac{f\bigl(\theta^* + \alpha(\theta - \theta^*)\bigr) - f(\theta^*)}{\alpha(\theta - \theta^*)}\,(\theta - \theta^*) \qquad(7.35) $$
$$ = \frac{\partial f(\theta^*)}{\partial \theta}\,(\theta - \theta^*). \qquad(7.36) $$
The subsequent inequality (7.32) follows by assumption from (7.25) and (7.26), and the last equality follows because both θ and θ* are probability vectors, which sum to 1.

Necessity: Let θ* maximize f and assume for the moment that the derivatives of f are continuous at θ*. Since θ* maximizes f, we have for any θ:
$$ f(\boldsymbol{\theta}) - f(\boldsymbol{\theta}^*) \leq 0. \qquad(7.37) $$
Now choose θ″ ≜ αθ + (1 − α)θ* for some θ and let α ↓ 0. Then similarly to above we get
$$ f\bigl(\alpha\boldsymbol{\theta} + (1-\alpha)\boldsymbol{\theta}^*\bigr) - f(\boldsymbol{\theta}^*) \leq 0 \qquad(7.38) $$
$$ \implies\; \frac{f\bigl(\alpha\boldsymbol{\theta} + (1-\alpha)\boldsymbol{\theta}^*\bigr) - f(\boldsymbol{\theta}^*)}{\alpha} \leq 0 \qquad(7.39) $$
$$ \implies\; \sum_{\ell=1}^{L} \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_\ell}\,(\theta_\ell - \theta^*_\ell) \leq 0, \qquad \text{as } \alpha \downarrow 0. \qquad(7.40) $$
Since θ* is a probability vector, its components must sum to 1. Hence, there must be at least one component that is strictly larger than zero. For convenience, let's assume that this component is θ*_1. We define i^(k) to be a unit vector with a 1 in position k and zeros everywhere else. Then we fix some k ∈ {1, . . . , L} and choose
$$ \boldsymbol{\theta} \triangleq \boldsymbol{\theta}^* + \epsilon\bigl(\mathbf{i}^{(k)} - \mathbf{i}^{(1)}\bigr). \qquad(7.41) $$
This is a probability vector as long as ε is chosen correctly, as can be seen as follows:
$$ \sum_{\ell=1}^{L} \theta_\ell = \sum_{\ell=1}^{L} \theta^*_\ell + \epsilon\Biggl( \sum_{\ell=1}^{L} i^{(k)}_\ell - \sum_{\ell=1}^{L} i^{(1)}_\ell \Biggr) = 1 + \epsilon\,(1 - 1) = 1, \qquad(7.42) $$
i.e., θ always sums to 1. So we only have to make sure that the components are nonnegative. But the components of θ* are for sure nonnegative. So for a nonnegative choice of ε we are only in danger at the first component where we subtract ε. But we have assumed that θ*_1 is strictly larger than 0. Hence, as long as we choose ε smaller than θ*_1, we are OK: we fix ε such that 0 < ε ≤ θ*_1.

Finally, we choose λ ≜ ∂f(θ*)/∂θ_1.

With these choices we can evaluate (7.40):
$$ \sum_{\ell=1}^{L} \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_\ell}\,(\theta_\ell - \theta^*_\ell) = \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_k}\,\epsilon - \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_1}\,\epsilon = \epsilon\biggl( \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_k} - \lambda \biggr) \leq 0. \qquad(7.43) $$
So, when we divide by ε (which is positive!), we get
$$ \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_k} \leq \lambda, \qquad(7.44) $$
which corresponds to (7.26).

Now note that if θ*_k > 0 then ε can also be chosen to be negative with θ still having all components nonnegative. In this case we can choose ε such that −θ*_k ≤ ε < 0. If we then divide (7.43) by ε, we get³
$$ \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_k} \geq \lambda. \qquad(7.45) $$
Equations (7.44) and (7.45) can only be satisfied simultaneously if
$$ \frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_k} = \lambda, \qquad(7.46) $$
which corresponds to (7.25).

It only remains to consider the case where the derivative is unbounded if one component tends to zero. Hence, assume that for a given k, ∂f(θ)/∂θ_k → +∞ for θ_k ↓ 0. Again, we assume θ*_1 > 0 and therefore ∂f(θ)/∂θ_1 must be continuous and well-behaved by assumption, i.e., k ≠ 1.⁴ We now show that a θ* with θ*_k = 0 cannot achieve the maximum:
$$ \frac{f\bigl(\boldsymbol{\theta}^* + \epsilon\,\mathbf{i}^{(k)} - \epsilon\,\mathbf{i}^{(1)}\bigr) - f(\boldsymbol{\theta}^*)}{\epsilon} = \frac{f\bigl(\boldsymbol{\theta}^* + \epsilon\,\mathbf{i}^{(k)} - \epsilon\,\mathbf{i}^{(1)}\bigr) - f\bigl(\boldsymbol{\theta}^* + \epsilon\,\mathbf{i}^{(k)}\bigr)}{\epsilon} + \frac{f\bigl(\boldsymbol{\theta}^* + \epsilon\,\mathbf{i}^{(k)}\bigr) - f(\boldsymbol{\theta}^*)}{\epsilon} \qquad(7.47) $$
$$ = -\,\frac{f\bigl(\boldsymbol{\theta}^* + \epsilon\,\mathbf{i}^{(k)} + (-\epsilon)\,\mathbf{i}^{(1)}\bigr) - f\bigl(\boldsymbol{\theta}^* + \epsilon\,\mathbf{i}^{(k)}\bigr)}{-\epsilon} + \frac{f\bigl(\boldsymbol{\theta}^* + \epsilon\,\mathbf{i}^{(k)}\bigr) - f(\boldsymbol{\theta}^*)}{\epsilon} \qquad(7.48) $$
$$ \xrightarrow{\epsilon \downarrow 0} \underbrace{-\,\frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_1}}_{\text{finite, because } \theta^*_1 > 0} + \underbrace{\frac{\partial f(\boldsymbol{\theta}^*)}{\partial \theta_k}}_{=\,\infty \text{ by assumption}} \;>\; 0. \qquad(7.49) $$
Hence,
$$ f\bigl(\boldsymbol{\theta}^* + \epsilon\,\mathbf{i}^{(k)} - \epsilon\,\mathbf{i}^{(1)}\bigr) > f(\boldsymbol{\theta}^*) \qquad(7.50) $$
and θ* cannot achieve the maximum.

³Note that since we divide by a negative value, we have to change the inequality sign!
⁴Note that we only allow the derivative to become infinite for θ_ℓ tending to zero. Since we have here that θ*_1 > 0 this derivative must be well-behaved!
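As a quick sanity check of Theorem 7.11, the following Python sketch (our own; the merit function and the weights are arbitrary choices, not taken from the notes) evaluates the partial derivatives of a concave function at its known maximizer over the probability simplex and confirms that they are all equal, as required by (7.25).

import numpy as np

# Concave toy merit function f(theta) = sum_l theta_l * log2(o_l / theta_l)
# with arbitrary positive weights o_l; its maximizer is theta_l = o_l / sum(o).
o = np.array([1.0, 2.0, 4.0, 8.0])
theta_star = o / o.sum()                        # candidate maximizer

# Partial derivatives: df/dtheta_l = log2(o_l / theta_l) - log2(e)
grad = np.log2(o / theta_star) - np.log2(np.e)
print("gradient at theta*:", np.round(grad, 6))
# All components are equal, so (7.25) holds with lambda = log2(sum(o)) - log2(e);
# since every theta_l > 0, the boundary condition (7.26) is not active here.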

7.A  Appendix: The Slope Paradox

The aware reader might want to adapt the example of Figure 7.3 slightly to get a situation shown in Figure 7.4.

Figure 7.4: Slope paradox: The maximum is achieved on the boundary of the region, but now the slope seems to be positive instead of negative as we have claimed in (7.16).

Now obviously the slope at the boundary point where the maximum is achieved is positive and not negative as claimed in (7.16). So it seems that (7.16) is wrong for this case. How can we save ourselves?

The clue lies in the way we compute derivatives. The actual idea behind (7.15) and (7.16) was as follows:

slope is zero — for all components θ_ℓ that lie inside the region;   (7.51)
slope is decreasing when walking from the boundary into the region — for all components θ_ℓ that are on the boundary.   (7.52)

Now, strictly speaking, our function f is only defined in the region R. Hence we need to compute the slope accordingly. In the situation of Figure 7.3 the slope is computed as follows:
$$ \lim_{\epsilon \downarrow 0} \frac{f\bigl(\boldsymbol{\theta} + \epsilon\,\mathbf{i}^{(\ell)}\bigr) - f(\boldsymbol{\theta})}{\epsilon} = \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell}, \qquad(7.53) $$
which then leads to (7.15) and (7.16).

In the case of Figure 7.4, however, this approach is not possible because θ + ε i^(ℓ) ∉ R. Instead, there we need to compute the slope as follows:
$$ \lim_{\epsilon \downarrow 0} \frac{f\bigl(\boldsymbol{\theta} - \epsilon\,\mathbf{i}^{(\ell)}\bigr) - f(\boldsymbol{\theta})}{\epsilon} = \lim_{\epsilon \downarrow 0} \frac{f\bigl(\boldsymbol{\theta} - \epsilon\,\mathbf{i}^{(\ell)}\bigr) - f(\boldsymbol{\theta})}{-\epsilon}\,(-1) = -\frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell}, \qquad(7.54) $$
i.e., condition (7.16) then actually becomes
$$ -\frac{\partial f(\boldsymbol{\theta})}{\partial \theta_\ell} \leq 0. \qquad(7.55) $$
So, our paradox is solved if we understand that by ∂f(θ)/∂θ_ℓ we mean the derivative that must be computed from the boundary towards the inside of the convex region. We learn that we always need to check the concrete case and choose the sign of the inequality accordingly!

Note that in the case of the region of probability vectors R_L, we did it correctly because the situation corresponds to Figure 7.3: The region is only bounded below by zero, i.e., the way we state equation (7.16) is correct (as it must have been, since we have proven it in Theorem 7.11).

Chapter 8

Gambling and Horse Betting

In this chapter we make an excursion into an area that at first sight seems to have little to do with information theory: gambling. However, this impression is wrong! It turns out that we can describe optimal betting strategies very elegantly using some of our information-theoretic quantities.

The setup of the gambling in this chapter is very simple. As basic example we assume a horse gambling game, where one can bet money on a horse in a horse race and will receive money if the horse wins. However, in concept the ideas that are presented here can also be applied in a much wider context, e.g., to the stock markets or to large-scale investment portfolios.

8.1  Problem Setup

Consider a set X of m horses. Let the random variable X ∈ X be the winning horse with PMF
$$ P_X(x) = p(x) = \begin{cases} p_i & \text{if } x = i,\\ 0 & \text{otherwise}, \end{cases} \qquad(8.1) $$
where p_i denotes the probability that horse i wins. Moreover, let o_i denote the return per dollar (odds) if horse i wins.

Example 8.1. Let's assume we bet 5 dollars on horse 1 with odds o_1 = 2. Then, if horse 1 wins, we will receive 10 dollars; if horse 1 does not win, our 5 dollars are lost and we receive nothing.

We use b to denote the betting strategy, i.e., b_i is the proportion of the gambler's total wealth that he bets on horse i. This means that b is a percentage, i.e., the gambler needs to decide first how much money he wants to risk in gambling and then decide how to bet with this money. Or in mathematical terms:
$$ \sum_{i=1}^{m} b_i = 1. \qquad(8.2) $$

Remark 8.2. For convenience we use two equivalent forms of notation: a vector notation and a functional notation. For example, in the case of the betting strategy we have

b_i = b(i) — the proportion of the money that is bet on horse i,
b = b(·) — the complete betting strategy.

Similarly we write p_i = p(i), p = p(·), o_i = o(i), and o = o(·).

8.2  Optimal Gambling Strategy

How shall we now invest our money? Shall we bet on the horse that has highest chance of winning? Or shall we rather favor the horse with the best odds?

Example 8.3. Consider a race with m = 3 horses, with the following winning probabilities:
$$ p_1 = 0.7, \quad p_2 = 0.2, \quad p_3 = 0.1, \qquad(8.3) $$
and with the following odds:
$$ o_1 = 1.2, \quad o_2 = 4.5, \quad o_3 = 8.5. \qquad(8.4) $$
On which horse would you put your money? On horse 1 that has the highest winning probability? Or on horse 3 that has the highest odds?

Actually, if you are a slightly more capable gambler, you will bet your money on the horse with the highest expected return, i.e., the one with the largest product p_i o_i:
$$ p_1 o_1 = 0.84, \quad p_2 o_2 = 0.9, \quad p_3 o_3 = 0.85, \qquad(8.5) $$
i.e., on the second horse! But note that also this strategy is risky because you might lose all your money in one blow! Hence, we realize that the really careful gambler should look at the long-term expected return!

So, we realize that we need to keep participating in many races. For the moment we will assume that all races are IID, so we therefore also assume that we stick with the same betting strategy during all races. Let S_n be the growth factor of our wealth after participating in n races, and let X_k denote the winning horse in race k.

Example 8.4. Let's return to the horse race of Example 8.3 and assume that you bet 50% of your money on horse 1 and 50% on horse 2. Hence we have
$$ b_1 = 0.5, \quad b_2 = 0.5, \quad b_3 = 0. \qquad(8.6) $$
Assume further that you decide to gamble with a total of Ψ_0 = 800 dollars.

It so happens that in the first race horse 2 wins, X_1 = 2. Hence, the b_1 Ψ_0 = 0.5 · 800 = 400 dollars that you have bet on horse 1 are lost. But the other 400 dollars that you have bet on horse 2 are multiplied by a factor of 4.5, i.e., you get 4.5 · 400 = 1800 dollars back. Therefore your current wealth now is
$$ \Psi_1 = \Psi_0 \cdot \underbrace{b(2)\, o(2)}_{=\,s_1} = 800 \cdot 0.5 \cdot 4.5 = 1800. \qquad(8.7) $$
Note that the growth factor turned out to be s_1 = 2.25.

Let's say that in the second race horse 1 wins, X_2 = 1. This time the b_2 Ψ_1 = 0.5 · 1800 = 900 dollars that you have bet on horse 2 are lost, but you gain o_1 · (b_1 Ψ_1) = 1.2 · (0.5 · 1800) dollars from horse 1. So now your wealth is
$$ \Psi_2 = \Psi_1 \cdot b(1)\, o(1) = 1800 \cdot 0.5 \cdot 1.2 = 1080. \qquad(8.8) $$
Note that using (8.7) we can write this also as follows:
$$ \Psi_2 = \underbrace{\Psi_0\, s_1}_{=\,\Psi_1} \cdot\, b(1)\, o(1) = \Psi_0 \cdot \underbrace{s_1\, b(1)\, o(1)}_{=\,s_2}, \qquad(8.9) $$
where s_2 denotes the growth factor of your wealth during the first two races: s_2 = 2.25 · 0.6 = 1.35.

Suppose that in the third race, horse 1 wins again, X_3 = 1. Your wealth therefore develops as follows:
$$ \Psi_3 = \Psi_2 \cdot b_1 o_1 = \Psi_0 \cdot \underbrace{s_2\, b(1)\, o(1)}_{=\,s_3}. \qquad(8.10) $$
And so on.

From Example 8.4 we now understand that
$$ S_1 = b(X_1)\, o(X_1), \qquad(8.11) $$
$$ S_2 = S_1 \cdot b(X_2)\, o(X_2), \qquad(8.12) $$
$$ S_3 = S_2 \cdot b(X_3)\, o(X_3), \qquad(8.13) $$
$$ \vdots $$
$$ \implies\quad S_n = \prod_{k=1}^{n} b(X_k)\, o(X_k), \qquad(8.14) $$
and that the wealth after n races is
$$ \Psi_n = \Psi_0 \cdot S_n. \qquad(8.15) $$
You note that Ψ_n only depends on the starting wealth Ψ_0 and the growth factor S_n. We therefore usually directly look at S_n without worrying about the exact amount of money we use for betting.

Example 8.5. To quickly go back to Example 8.4, we also would like to point out that once horse 3 wins (which is unlikely but possible!), you are finished: Since b(3) = 0, your wealth is reduced to zero and you cannot continue to bet. So it is quite obvious that the chosen betting strategy is not good in the long run.

We go back to our question of deciding about a good betting strategy. From the discussion above we conclude that we would like to find a betting strategy that maximizes S_n, or rather, since S_n is random, that maximizes the expected value of S_n for large n. Unfortunately, the expression (8.14) is rather complicated. In order to simplify it, we introduce a logarithm:
$$ \frac{1}{n} \log_2 S_n = \frac{1}{n} \log_2 \prod_{k=1}^{n} b(X_k)\, o(X_k) \qquad(8.16) $$
$$ = \frac{1}{n} \sum_{k=1}^{n} \log_2\bigl( b(X_k)\, o(X_k) \bigr) \qquad(8.17) $$
$$ \to \mathrm{E}\bigl[ \log_2\bigl( b(X)\, o(X) \bigr) \bigr] \qquad \text{in probability} \qquad(8.18) $$
by the weak law of large numbers. So let's define the following new quantity.
Definition 8.6. We define the doubling rate W(b, p, o) as
$$ W(\mathbf{b}, \mathbf{p}, \mathbf{o}) \triangleq \mathrm{E}\bigl[ \log_2\bigl( b(X)\, o(X) \bigr) \bigr] = \sum_{x} p(x) \log_2\bigl( b(x)\, o(x) \bigr). \qquad(8.19) $$

From (8.18) we see that, for n large, we have with high probability
$$ S_n \approx 2^{n\, W(\mathbf{b},\mathbf{p},\mathbf{o})}. \qquad(8.20) $$
This also explains the name of W: If W = 1, then we expect to double our wealth after each race; if W = 0.1, we expect to double our wealth after every 10 races; and if W is negative, we will be broke sooner or later. . .

So our new aim is to maximize the doubling rate by choosing the best possible betting strategy b:
$$ W^*(\mathbf{p}, \mathbf{o}) \triangleq \max_{\mathbf{b}} W(\mathbf{b}, \mathbf{p}, \mathbf{o}) \qquad(8.21) $$
$$ = \max_{\mathbf{b}} \sum_{x} p(x) \log_2\bigl( b(x)\, o(x) \bigr) \qquad(8.22) $$
$$ = \max_{\mathbf{b}} \Biggl[ \sum_{x} p(x) \log_2 b(x) + \sum_{x} p(x) \log_2 o(x) \Biggr] \qquad(8.23) $$
$$ = \max_{\mathbf{b}} \Biggl[ \sum_{x} p(x) \log_2 b(x) \Biggr] + \underbrace{\sum_{x} p(x) \log_2 o(x)}_{\substack{\text{independent of } \mathbf{b}\\ \implies \text{ we ignore the odds!}}} \qquad(8.24) $$
$$ = \max_{\mathbf{b}} \Biggl[ \sum_{x} p(x) \log_2\biggl( \frac{b(x)}{p(x)}\, p(x) \biggr) \Biggr] + \mathrm{E}[\log_2 o(X)] \qquad(8.25) $$
$$ = \max_{\mathbf{b}} \bigl[ -D(\mathbf{p} \,\|\, \mathbf{b}) - H(X) \bigr] + \mathrm{E}[\log_2 o(X)] \qquad(8.26) $$
$$ = -\underbrace{\min_{\mathbf{b}} D(\mathbf{p} \,\|\, \mathbf{b})}_{=\,0 \text{ for } \mathbf{b}^* = \mathbf{p}} - H(X) + \mathrm{E}[\log_2 o(X)] \qquad(8.27) $$
$$ = -H(X) + \mathrm{E}[\log_2 o(X)]. \qquad(8.28) $$

Hence, we get the following theorem.


Theorem 8.7 (Proportional Gambling is Log-Optimal).
Consider an IID horse race X ~ p with odds o. The optimum doubling rate is given by
$$ W^*(\mathbf{p}, \mathbf{o}) = \mathrm{E}[\log_2 o(X)] - H(X) \qquad(8.29) $$
(where the entropy must be specified in bits) and is achieved by proportional betting:
$$ \mathbf{b}^* = \mathbf{p}. \qquad(8.30) $$

Note the really interesting fact that optimal gambling ignores the odds!

Example 8.8. Consider a race with 2 horses. Horse 1 wins with probability p, 1/2 ≤ p < 1; and horse 2 wins with probability 1 − p. Assume even odds o_1 = o_2 = 2. If you like risk, you put all your money on horse 1 as this gives highest expected return. But then
$$ W\bigl([1, 0], [p, 1-p], [2, 2]\bigr) = \mathrm{E}\bigl[ \log_2\bigl( b(X)\, o(X) \bigr) \bigr] \qquad(8.31) $$
$$ = p \log_2(1 \cdot 2) + (1-p) \log_2(0 \cdot 2) \qquad(8.32) $$
$$ = -\infty, \qquad(8.33) $$
i.e., in the long run your wealth will be
$$ S_n \approx 2^{nW} = 2^{-\infty} = 0. \qquad(8.34) $$
Actually, this will happen very soon, a couple of races in general is enough. . .

If you are very much afraid of risk, put half on horse 1 and half on horse 2:
$$ S_n = \prod_{k=1}^{n} b(X_k)\, o(X_k) = \prod_{k=1}^{n} \frac{1}{2} \cdot 2 = 1, \qquad(8.35) $$
i.e., you will not lose nor gain anything, but for sure keep your money the way it is. The doubling rate in this case is
$$ W\Bigl(\Bigl[\frac{1}{2}, \frac{1}{2}\Bigr], [p, 1-p], [2, 2]\Bigr) = p \log_2\Bigl(\frac{1}{2} \cdot 2\Bigr) + (1-p) \log_2\Bigl(\frac{1}{2} \cdot 2\Bigr) = 0. \qquad(8.36) $$

The optimal thing to do is proportional betting:
$$ b_1 = p, \qquad b_2 = 1 - p, \qquad(8.37) $$
which yields a doubling rate
$$ W^*\bigl([p, 1-p], [2, 2]\bigr) = \mathrm{E}[\log_2 o(X)] - H(X) = 1 - H_{\mathrm{b}}(p). \qquad(8.38) $$
Then your wealth will grow exponentially fast to infinity:
$$ S_n \approx 2^{n(1 - H_{\mathrm{b}}(p))}. \qquad(8.39) $$
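The following small Monte-Carlo sketch (our own illustration; the parameter values are arbitrary) checks Theorem 8.7 and Example 8.8 numerically: it estimates the doubling rate of proportional betting for a two-horse race with even odds and compares it with 1 − H_b(p) from (8.38).

import numpy as np

rng = np.random.default_rng(0)

p, n = 0.7, 100_000
o = np.array([2.0, 2.0])                         # even odds
b = np.array([p, 1 - p])                         # proportional betting b = p
x = rng.choice([0, 1], size=n, p=[p, 1 - p])     # winning horse in each race

empirical_rate = np.mean(np.log2(b[x] * o[x]))   # (1/n) log2 S_n
theoretical = 1 - (-p*np.log2(p) - (1-p)*np.log2(1-p))   # 1 - H_b(p), cf. (8.38)
print(f"empirical doubling rate {empirical_rate:.4f}, theory {theoretical:.4f}")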

8.3  The Bookie's Perspective

So far we have concentrated on the gambler's side of the game and simply assumed that the odds were given. However, in an actual horse race, there must be a bookie that offers the game, i.e., the odds.

How shall the bookie decide on the odds o_i that he offers? He, of course, wants to maximize his earning, i.e., minimize the gambler's winning. However, he has no control over the gambler's betting strategy and needs to announce his odds before he learns the gambler's bets!

Instead of working with the odds o_i directly, it will turn out to be easier to work with a normalized reciprocal version of it. We define
$$ r_i \triangleq \frac{\frac{1}{o_i}}{\sum_{j=1}^{m} \frac{1}{o_j}} \qquad(8.40) $$
such that
$$ o_i = \frac{c}{r_i} \qquad(8.41) $$
with
$$ c \triangleq \frac{1}{\sum_{j=1}^{m} \frac{1}{o_j}}. \qquad(8.42) $$
Hence, (r, c) is equivalent to o, i.e., by specifying r and c the bookie also specifies the odds o.

It is important to note that
$$ \sum_{i=1}^{m} r_i = \sum_{i=1}^{m} \frac{1}{o_i} \cdot \frac{1}{\sum_{j=1}^{m} \frac{1}{o_j}} = \frac{1}{\sum_{j=1}^{m} \frac{1}{o_j}} \sum_{i=1}^{m} \frac{1}{o_i} = 1, \qquad(8.43) $$
like a proper probability distribution.

We use (8.41) to write
$$ W(\mathbf{b}, \mathbf{p}, \mathbf{o}) = \sum_{i=1}^{m} p_i \log_2(b_i o_i) \qquad(8.44) $$
$$ = \sum_{i=1}^{m} p_i \log_2\Bigl( b_i \frac{c}{r_i} \Bigr) \qquad(8.45) $$
$$ = \sum_{i=1}^{m} p_i \log_2 c + \sum_{i=1}^{m} p_i \log_2\biggl( \frac{b_i}{p_i} \cdot \frac{p_i}{r_i} \biggr) \qquad(8.46) $$
$$ = \log_2 c - D(\mathbf{p} \,\|\, \mathbf{b}) + D(\mathbf{p} \,\|\, \mathbf{r}). \qquad(8.47) $$
Note that the bookie has no control over b, but he can minimize the doubling rate W(b, p, o) by choosing r = p! We have proven the following result.

Theorem 8.9. The optimal strategy of the bookie is to offer odds that are inversely proportional to the winning probabilities.

Moreover, the bookie can influence the doubling rate by the constant c.

If c = 1, i.e., Σ_{i=1}^{m} 1/o_i = 1, we have fair odds. In this case there exists a betting strategy that contains no risk (but also gives no return): simply bet
$$ b_i = \frac{1}{o_i}. \qquad(8.48) $$
Note that we can do this because
$$ \sum_{i=1}^{m} \frac{1}{o_i} = 1. \qquad(8.49) $$
Then for all n
$$ S_n = \prod_{k=1}^{n} b(X_k)\, o(X_k) = \prod_{k=1}^{n} \frac{1}{o(X_k)}\, o(X_k) = \prod_{k=1}^{n} 1 = 1, \qquad(8.50) $$
i.e., the doubling rate is 0.

In general the doubling rate in the case of fair odds is
$$ W(\mathbf{b}, \mathbf{p}, \mathbf{o}) = D(\mathbf{p} \,\|\, \mathbf{r}) - D(\mathbf{p} \,\|\, \mathbf{b}) \qquad(8.51) $$
and has a beautiful interpretation: In reality neither the gambler nor the bookie know the real winning distribution p. But for both it is optimal to choose b = p or r = p, respectively. Hence b and r represent the gambler's and the bookie's estimate of the true winning distribution, respectively. The doubling rate is the difference between the distance of the bookie's estimate from the true distribution and the distance of the gambler's estimate from the true distribution.

A gambler can only make money if his estimate of the true winning distribution is better than the bookie's estimate!

This is a fair competition between the gambler and the bookie, and therefore the name fair odds.

If c > 1, i.e., Σ_{i=1}^{m} 1/o_i < 1, we have superfair odds. In this not very realistic scenario, the gambler can win money without risk by making a Dutch book:
$$ b_i = \frac{c}{o_i}. \qquad(8.52) $$
He then gets for all n
$$ S_n = \prod_{k=1}^{n} b(X_k)\, o(X_k) = \prod_{k=1}^{n} \frac{c}{o(X_k)}\, o(X_k) = \prod_{k=1}^{n} c = c^n. \qquad(8.53) $$
However, note that a Dutch book is not log-optimal. To maximize W, the gambler still needs to do proportional betting (in which case he ignores the odds)!

If c < 1, i.e., Σ_{i=1}^{m} 1/o_i > 1, we have subfair odds. That is often the case in reality: The bookie takes a cut from all bets. This case is more difficult to analyze. We will see in Section 8.6 that in this case it might be clever not to bet at all!

8.4  Uniform Fair Odds

Before we start to investigate the realistic situation of subfair odds, we would like to quickly point out some nice properties of the special case of uniform fair odds:
$$ o_i = m, \qquad(8.54) $$
where m denotes the number of horses. (Note that
$$ \sum_{i=1}^{m} \frac{1}{o_i} = \sum_{i=1}^{m} \frac{1}{m} = 1, \qquad(8.55) $$
i.e., these really are fair odds.) Then
$$ W^*(\mathbf{p}, \mathbf{o}) = \sum_{i=1}^{m} p_i \log_2 m - H(X) \qquad(8.56) $$
$$ = \log_2 m - H(X) \qquad(8.57) $$
and we have the following result.

Theorem 8.10 (Conservation Theorem). For uniform fair odds (8.54), we have
$$ W^*(\mathbf{p}, \mathbf{o}) + H(X) = \log_2 m = \text{constant}, \qquad(8.58) $$
i.e., the sum of doubling rate and entropy rate is constant.

Hence, we see that we can make most money from races with low entropy rate. (On the other hand, if the entropy rate is low, then the bookie won't be so stupid to offer uniform odds. . . )

8.5  What About Not Gambling?

So far we have always assumed that we use all our money in every race. However, in the situation of subfair odds, where the bookie takes some fixed amount of the gambler's bet for himself for sure, we should allow the gambler to retain some part of his money. Let b(0) = b_0 be that part that is not put into the game.

In this situation the growth factor given in (8.14) will become
$$ S_n = \prod_{k=1}^{n} \bigl( b(0) + b(X_k)\, o(X_k) \bigr) \qquad(8.59) $$
and the Definition 8.6 of the doubling rate is changed to
$$ W(\mathbf{b}, \mathbf{p}, \mathbf{o}) \triangleq \mathrm{E}\bigl[ \log_2\bigl( b(0) + b(X)\, o(X) \bigr) \bigr] = \sum_{i=1}^{m} p_i \log_2( b_0 + b_i o_i ). \qquad(8.60) $$
Let's see how our results adapt to these new definitions:

Fair odds: We have already seen that the bet b_i = 1/o_i is risk-free (but also gives no gain). Hence, instead of taking b_0 out of the game, we might also simply distribute b_0 proportionally to 1/o_i among all horses. This yields the same effect! So we see that every betting strategy with b_0 > 0 can be replaced by another betting strategy with b_0 = 0 and with identical risk and identical performance. However, obviously any betting strategy with b_0 = 0 is identical to a betting strategy without b_0 in the first place. Hence, the optimal solution that we have derived in Section 8.2 still is optimal, i.e., proportional betting b_i = p_i (and b_0 = 0) is log-optimal (but not risk-free).

Superfair odds: Here we would be silly to retain some money as we can increase our money risk-free! So the optimal solution is to have b_0 = 0 and always use all money. Hence, using the same argument as above for fair odds, proportional betting is log-optimal (but not risk-free).

Subfair odds: This case is much more complicated. In general it is optimal to retain a certain portion of the money. How much depends on the expected returns p_i o_i. Basically, we need to maximize
$$ W(\mathbf{b}, \mathbf{p}, \mathbf{o}) = \sum_{i=1}^{m} p_i \log_2( b_0 + b_i o_i ) \qquad(8.61) $$
under the constraints
$$ b_0 \geq 0, \qquad(8.62) $$
$$ b_i \geq 0, \qquad i = 1, \ldots, m, \qquad(8.63) $$
$$ b_0 + \sum_{i=1}^{m} b_i = 1. \qquad(8.64) $$
This problem can be solved using the Karush–Kuhn–Tucker (KKT) conditions (Theorem 7.11 in Chapter 7).

8.6  Optimal Gambling for Subfair Odds

We apply the KKT conditions to the optimization problem (8.61)–(8.64) and get the following:
$$ \frac{\partial W}{\partial b_0} = \sum_{i=1}^{m} \frac{p_i}{b_0 + b_i o_i} \log_2 e \;\begin{cases} = \lambda & \text{if } b_0 > 0,\\ \leq \lambda & \text{if } b_0 = 0, \end{cases} \qquad(8.65) $$
$$ \frac{\partial W}{\partial b_\ell} = \frac{p_\ell\, o_\ell}{b_0 + b_\ell o_\ell} \log_2 e \;\begin{cases} = \lambda & \text{if } b_\ell > 0,\\ \leq \lambda & \text{if } b_\ell = 0, \end{cases} \qquad \ell = 1, \ldots, m, \qquad(8.66) $$
where λ must be chosen such that (8.64) is satisfied; it turns out that this forces λ = log₂ e. Dividing by log₂ e, we can therefore rewrite (8.65) and (8.66) as follows:
$$ \sum_{i=1}^{m} \frac{p_i}{b_0 + b_i o_i} \;\begin{cases} = 1 & \text{if } b_0 > 0,\\ \leq 1 & \text{if } b_0 = 0, \end{cases} \qquad(8.67) $$
$$ \frac{p_i\, o_i}{b_0 + b_i o_i} \;\begin{cases} = 1 & \text{if } b_i > 0,\\ \leq 1 & \text{if } b_i = 0, \end{cases} \qquad i = 1, \ldots, m. \qquad(8.68) $$
We now try to solve these expressions for b. From (8.68) we learn that, if b_i > 0,
$$ b_i = \frac{1}{o_i}\,( p_i o_i - b_0 ), \qquad(8.69) $$
and if b_i = 0,
$$ b_i = 0 \;\geq\; \frac{1}{o_i}\,( p_i o_i - b_0 ). \qquad(8.70) $$

In other words, if p_i o_i ≥ b_0 we get the case b_i > 0; and if p_i o_i < b_0, we have b_i = 0. Hence,
$$ b_i = \frac{1}{o_i}\,\bigl( p_i o_i - b_0 \bigr)^+, \qquad(8.71) $$
where
$$ (\xi)^+ \triangleq \max\{\xi, 0\}. \qquad(8.72) $$
Next we will prove that (under the assumption of subfair odds!) we always have b_0 > 0. To that aim assume b_0 = 0. Then from (8.71) we see that
$$ b_i = \frac{1}{o_i}\,\bigl( p_i o_i - 0 \bigr)^+ = p_i. \qquad(8.73) $$
Using this and our assumption b_0 = 0 in (8.67) we get
$$ \sum_{i=1}^{m} \frac{p_i}{b_0 + b_i o_i} = \sum_{i=1}^{m} \frac{p_i}{0 + p_i o_i} = \sum_{i=1}^{m} \frac{1}{o_i} \;\overset{\text{subfair}}{>}\; 1, \qquad(8.74) $$
where the last inequality follows from the fact that we have assumed subfair odds. Since 1 > 1 is not possible, we have a contradiction, and therefore we must have b_0 > 0.

Hence, we have proven the following fact.

In a horse race with subfair odds, it is never optimal to invest all money.
So, b_0 > 0. Let
$$ \mathcal{B} \triangleq \bigl\{ i \in \{1,\ldots,m\} : b_i > 0 \bigr\} \qquad(8.75) $$
be the set of all horses that we bet money on. We now rewrite (8.67) as follows:
$$ 1 = \sum_{i=1}^{m} \frac{p_i}{b_0 + b_i o_i} \qquad(8.76) $$
$$ = \sum_{i \in \mathcal{B}} \frac{p_i}{b_0 + b_i o_i} + \sum_{i \in \mathcal{B}^{\mathsf{c}}} \frac{p_i}{b_0 + b_i o_i} \qquad(8.77) $$
$$ = \sum_{i \in \mathcal{B}} \frac{p_i}{b_0 + p_i o_i - b_0} + \sum_{i \in \mathcal{B}^{\mathsf{c}}} \frac{p_i}{b_0 + 0} \qquad(8.78) $$
$$ = \sum_{i \in \mathcal{B}} \frac{1}{o_i} + \frac{1}{b_0} \underbrace{\sum_{i \in \mathcal{B}^{\mathsf{c}}} p_i}_{=\, 1 - \sum_{i \in \mathcal{B}} p_i}, \qquad(8.79) $$
where in the first sum of (8.78) we have used (8.71). Hence,
$$ b_0 = \frac{1 - \sum_{i \in \mathcal{B}} p_i}{1 - \sum_{i \in \mathcal{B}} \frac{1}{o_i}}. \qquad(8.80) $$

Using (8.71) and (8.80) in (8.64) then yields
$$ 1 = b_0 + \sum_{i=1}^{m} b_i \qquad(8.81) $$
$$ = b_0 + \sum_{i \in \mathcal{B}} \frac{1}{o_i}\,( p_i o_i - b_0 ) \qquad(8.82) $$
$$ = b_0 \biggl( 1 - \sum_{i \in \mathcal{B}} \frac{1}{o_i} \biggr) + \sum_{i \in \mathcal{B}} p_i \qquad(8.83) $$
$$ = \frac{1 - \sum_{i \in \mathcal{B}} p_i}{1 - \sum_{i \in \mathcal{B}} \frac{1}{o_i}} \biggl( 1 - \sum_{i \in \mathcal{B}} \frac{1}{o_i} \biggr) + \sum_{i \in \mathcal{B}} p_i \qquad(8.84) $$
$$ = 1 - \sum_{i \in \mathcal{B}} p_i + \sum_{i \in \mathcal{B}} p_i = 1, \qquad(8.85) $$
i.e., the normalization constraint (8.64) is indeed satisfied.
So we finally get
$$ b_0 = \frac{1 - \sum_{i \in \mathcal{B}} p_i}{1 - \sum_{i \in \mathcal{B}} \frac{1}{o_i}} \qquad(8.86) $$
and
$$ b_i = \frac{1}{o_i}\,\bigl( p_i o_i - b_0 \bigr)^+ \qquad(8.87) $$
for i ∈ {1, . . . , m}.

We see that we have a recursive definition here: The values of b_i depend on b_0, but b_0 in turn depends (via B) on the b_i. So the remaining question is how to find the optimal set B of horses we do bet on. We cannot compute this optimal set directly, but we can find a recursive procedure to figure it out.

Firstly, note that from (8.87) we see that the horses with b_i > 0 must have an expected return p_i o_i larger than b_0. Hence, we will only (if at all) bet on horses with large expected return! Secondly, note that b_0 is a common threshold to all horses, i.e., if we bet on horse i and horse j has an expected return larger than horse i, p_j o_j ≥ p_i o_i, then we will also bet on horse j.

So, we start by ordering the horses according to expected return p_i o_i. Then we try recursively:

Step 1: Set B = {} and compute b_0 from (8.86).

Step 2: Check whether the best horse (i.e., the one with the largest expected return) has an expected return p_i o_i larger than b_0. If yes, add this horse to B, recompute b_0 and repeat Step 2 with the remaining horses. If no, stop.

This leads to the following algorithm that finds the optimal betting strategy for horse racing with subfair odds.
Theorem 8.11 (Optimal Gambling for Subfair Odds).
Consider an IID horse race X ~ p with odds o. If the gambler is allowed to withhold the fraction b_0 of his wealth, then an optimal betting strategy that maximizes the doubling rate can be found as follows:

Order the horses according to the expected return p_i o_i:
$$ p_1 o_1 \geq p_2 o_2 \geq \cdots \geq p_m o_m. \qquad(8.88) $$

Define
$$ \delta_t \triangleq \frac{1 - \sum_{j=1}^{t-1} p_j}{1 - \sum_{j=1}^{t-1} \frac{1}{o_j}} \qquad \text{for } t = 1, \ldots, m. \qquad(8.89) $$
Note that δ_1 = 1.

Find the smallest t such that δ_t ≥ p_t o_t. If δ_t < p_t o_t for all t = 1, . . . , m (which can only happen if the odds were fair or superfair!), then set t = m + 1 and δ_t ≜ 0.

Assign
$$ b_0 = \delta_t, \qquad b_i = p_i - \frac{\delta_t}{o_i} \quad \text{for } i = 1, \ldots, t-1, \qquad b_i = 0 \quad \text{for } i = t, \ldots, m. \qquad(8.90) $$

Remark 8.12. Note that it is very well possible that the solution is b_0 = 1 and b_i = 0 (i = 1, . . . , m), i.e., that the optimal betting strategy is not to gamble at all! This is the case if the expected return of all horses is less than (or equal to) 1: p_i o_i ≤ 1, ∀i. If the expected return of at least one horse is strictly larger than 1, then b_0 < 1 and we will gamble.

Also note that if we apply the algorithm to a game that is fair or superfair, we will in general regain our previously found solution b_0 = 0 and b_i = p_i, ∀i. The only exceptions are some cases of fair odds where the algorithm will result in b_0 > 0 and b_i = 0 for some horses. This will happen if the algorithm at a certain stage finds δ_t = p_i o_i (with equality!). However, in such a case the resulting doubling rate is identical to the doubling rate achieved by proportional betting.
Example 8.13. We go back to Example 8.3 with m = 3 horses with
$$ p_1 = 0.7, \quad p_2 = 0.2, \quad p_3 = 0.1, \qquad(8.91) $$
and
$$ o_1 = 1.2, \quad o_2 = 4.5, \quad o_3 = 8.5. \qquad(8.92) $$
Firstly we order the horses according to expected return
$$ p_1 o_1 = 0.84, \quad p_2 o_2 = 0.9, \quad p_3 o_3 = 0.85, \qquad(8.93) $$
i.e., we have the new order (2, 3, 1). But then we note that the largest expected return is less than 1, i.e., in the above algorithm the smallest t such that δ_t ≥ p_t o_t is t = 1 and we realize that the optimal solution is not to bet at all.

Let's change the example slightly by offering better odds:
$$ o_1 = 1.2, \quad o_2 = 6, \quad o_3 = 10. \qquad(8.94) $$
The game still is subfair as
$$ \sum_{i=1}^{3} \frac{1}{o_i} = 1.1 > 1, \qquad(8.95) $$
but now the expected returns are
$$ p_2 o_2 = 1.2, \quad p_3 o_3 = 1, \quad p_1 o_1 = 0.84. \qquad(8.96) $$
The computation of the algorithm is shown in Table 8.1. It turns out that the smallest t such that δ_t ≥ p_t o_t is t = 3 and we have (again using the original names of the horses):
$$ b_0 = \delta_3 = 0.955, \qquad(8.97) $$
$$ b_2 = p_2 - \frac{\delta_3}{o_2} = 0.2 - \frac{0.955}{6} = 0.041, \qquad(8.98) $$
$$ b_3 = p_3 - \frac{\delta_3}{o_3} = 0.1 - \frac{0.955}{10} = 0.00455, \qquad(8.99) $$
$$ b_1 = 0. \qquad(8.100) $$
Hence, we see that we only bet on the two unlikely horses with much higher odds. However, we also see that we gamble with only about 5% of our money.

Table 8.1: Derivation of optimal gambling for a subfair game.

  original name      2       3       1
  new name (t)       1       2       3
  p_i              0.2     0.1     0.7
  o_i                6      10     1.2
  p_i o_i          1.2     1.0    0.84
  δ_t                1    0.96   0.955

The doubling rate is now given as
$$ W^*(\mathbf{p}, \mathbf{o}) = \sum_{i=1}^{m} p_i \log_2( b_0 + b_i o_i ) \qquad(8.101) $$
$$ = 0.7 \log_2(0.955) + 0.2 \log_2(0.955 + 0.041 \cdot 6) + 0.1 \log_2(0.955 + 0.00455 \cdot 10) \qquad(8.102) $$
$$ = 0.00563. \qquad(8.103) $$
This is very small, but it is positive. I.e., if we play long enough (it might be very long!), we will make money.
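The procedure of Theorem 8.11 is easy to implement. The following Python sketch (our own; the function name is not from the notes) reproduces the numbers of Example 8.13.

import numpy as np

def subfair_betting(p, o):
    """Sketch of Theorem 8.11: returns (b0, b) when withholding money is allowed."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    order = np.argsort(-(p * o))                 # sort by expected return p_i o_i
    ps, os = p[order], o[order]
    m = len(p)
    t, delta_t = m + 1, 0.0                      # default: fair/superfair case
    for t_try in range(1, m + 1):
        delta = (1 - ps[:t_try - 1].sum()) / (1 - (1 / os[:t_try - 1]).sum())
        if delta >= ps[t_try - 1] * os[t_try - 1]:
            t, delta_t = t_try, delta
            break
    b_sorted = np.maximum(ps - delta_t / os, 0.0)
    b_sorted[t - 1:] = 0.0                       # do not bet on horses t, ..., m
    b = np.zeros(m)
    b[order] = b_sorted
    return delta_t, b

# Example 8.13 with the improved odds (8.94):
b0, b = subfair_betting([0.7, 0.2, 0.1], [1.2, 6.0, 10.0])
print(b0, b)    # roughly 0.955 and (0, 0.041, 0.0045)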

8.7  Gambling with Side-Information

Let us return to the situation where we have to bet with all money. So far we have assumed that all horse races are independent of each other and have the same distribution. This is of course not very realistic. In order to loosen this stringent assumption, as a first step, we introduce side-information about the horse race. Later on, we will be able to use this to deal with memory in the system: We simply take the past racing results as the side-information. Note that at the moment we still keep the assumption of independent races.

Assume now that we have available some side-information Y about which horse is going to win: Given is some joint probability distribution P_{X,Y}. Depending on the current value of the side-information Y = y, we can adapt our betting strategy, i.e., our betting strategy is a function of y: b(·|y). To find the optimal strategy, we need to redo all derivations from above. This basically happens in an identical manner, with the only difference that everything is now conditioned on Y = y. Hence we basically get the same results.

Since we assume that the races are still IID, i.e., (X_k, Y_k) are jointly IID, we have
$$ S_n = \prod_{k=1}^{n} b(X_k|Y_k)\, o(X_k) \qquad(8.104) $$

and
$$ \frac{1}{n} \log_2 S_n = \frac{1}{n} \log_2 \prod_{k=1}^{n} b(X_k|Y_k)\, o(X_k) \qquad(8.105) $$
$$ = \frac{1}{n} \sum_{k=1}^{n} \log_2\bigl( b(X_k|Y_k)\, o(X_k) \bigr) \qquad(8.106) $$
$$ \to \mathrm{E}_{X,Y}\bigl[ \log_2\bigl( b(X|Y)\, o(X) \bigr) \bigr] \qquad \text{in probability}. \qquad(8.107) $$
We define the conditional doubling rate
$$ W\bigl( b(\cdot|\cdot), P_{X,Y}, \mathbf{o} \bigr) \triangleq \mathrm{E}_{X,Y}\bigl[ \log_2\bigl( b(X|Y)\, o(X) \bigr) \bigr] \qquad(8.108) $$
and get
$$ W\bigl( b(\cdot|\cdot), P_{X,Y}, \mathbf{o} \bigr) = \sum_{x,y} P_{X,Y}(x,y) \log_2\bigl( b(x|y)\, o(x) \bigr) \qquad(8.109) $$
$$ = \sum_{x,y} P_{X,Y}(x,y) \log_2 b(x|y) + \sum_{x,y} P_{X,Y}(x,y) \log_2 o(x) \qquad(8.110) $$
$$ = \sum_{y} P_Y(y) \sum_{x} P_{X|Y}(x|y) \log_2\biggl( \frac{b(x|y)}{P_{X|Y}(x|y)}\, P_{X|Y}(x|y) \biggr) + \sum_{x} P_X(x) \log_2 o(x) \qquad(8.111) $$
$$ = \sum_{y} P_Y(y) \sum_{x} P_{X|Y}(x|y) \log_2 P_{X|Y}(x|y) - \sum_{y} P_Y(y) \sum_{x} P_{X|Y}(x|y) \log_2\frac{P_{X|Y}(x|y)}{b(x|y)} + \mathrm{E}[\log_2 o(X)] \qquad(8.112) $$
$$ = -H(X|Y) - \mathrm{E}_Y\bigl[ D\bigl( P_{X|Y}(\cdot|Y) \,\big\|\, b(\cdot|Y) \bigr) \bigr] + \mathrm{E}[\log_2 o(X)] \qquad(8.113) $$
$$ \leq -H(X|Y) + \mathrm{E}[\log_2 o(X)], \qquad(8.114) $$
where the inequality can be achieved with equality if, and only if, we choose b(·|·) = P_{X|Y}(·|·).

In summary, we have the following result.
Theorem 8.14 (Betting with Side-Information).
Consider an IID horse race X with odds o with access to some side-information Y about X, where (X, Y) ~ P_{X,Y}. The optimal doubling rate
$$ W^*(X|Y) \triangleq \max_{b(\cdot|\cdot)} W\bigl( b(\cdot|\cdot), P_{X,Y}, \mathbf{o} \bigr) \qquad(8.115) $$

is given by
$$ W^*(X|Y) = \mathrm{E}[\log_2 o(X)] - H(X|Y) \qquad(8.116) $$
(where the entropy must be specified in bits) and is achievable by proportional betting:
$$ b^*(x|y) = P_{X|Y}(x|y), \qquad \forall\, x, y. \qquad(8.117) $$

Here we have introduced a new notation:
$$ W^*(X) \triangleq W^*(P_X, \mathbf{o}), \qquad(8.118) $$
$$ W^*(X|Y) \triangleq W^*(P_{X,Y}, \mathbf{o}). \qquad(8.119) $$

So, how much does side-information help? To see this, we compute
$$ \Delta W \triangleq W^*(X|Y) - W^*(X) \qquad(8.120) $$
$$ = \mathrm{E}[\log_2 o(X)] - H(X|Y) - \bigl( \mathrm{E}[\log_2 o(X)] - H(X) \bigr) \qquad(8.121) $$
$$ = H(X) - H(X|Y) \qquad(8.122) $$
$$ = I(X;Y). \qquad(8.123) $$
Hence, we have shown the following beautiful theorem.

Theorem 8.15. The increase ΔW in the optimal doubling rate due to the side-information Y for an IID horse race X is
$$ \Delta W = I(X;Y). \qquad(8.124) $$
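As a quick numerical illustration of Theorem 8.15 (our own toy example; the joint PMF and the uniform fair odds are arbitrary choices), the following sketch verifies that the gain in optimal doubling rate equals the mutual information.

import numpy as np

P = np.array([[0.4, 0.1],      # P_{X,Y}(x, y): two horses, binary side-information
              [0.1, 0.4]])
Px, Py = P.sum(1), P.sum(0)
o = 2.0                        # uniform fair odds

H = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
H_X = H(Px)
H_X_given_Y = sum(Py[y] * H(P[:, y] / Py[y]) for y in range(2))

W_star_X  = np.log2(o) - H_X              # (8.29), proportional betting on P_X
W_star_XY = np.log2(o) - H_X_given_Y      # (8.116), proportional betting on P_{X|Y}
I_XY = H_X - H_X_given_Y
print(W_star_XY - W_star_X, I_XY)         # both equal I(X;Y) in bits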

8.8  Dependent Horse Races

The most common type of side-information is the past performance of the horses. Hence, we will now drop the IID assumption, but instead assume that {X_k} is a stationary stochastic process. Then from Theorem 8.14 we immediately get
$$ W^*(X_k | X_{k-1}, \ldots, X_1) = \mathrm{E}[\log_2 o(X_k)] - H(X_k | X_{k-1}, \ldots, X_1), \qquad(8.125) $$
which is achieved by proportional betting
$$ b^*(x_k | x_{k-1}, \ldots, x_1) = P(x_k | x_{k-1}, \ldots, x_1). \qquad(8.126) $$

The question is how to define a corresponding doubling rate in this new context. To answer this, we again need to consider the growth factor of our wealth. Note that the growth factor is now more complicated because our betting strategy changes in each step:
$$ S_n^* = \prod_{k=1}^{n} b^*(X_k | X_{k-1}, \ldots, X_1)\, o(X_k) \qquad(8.127) $$
$$ = \prod_{k=1}^{n} P(X_k | X_{k-1}, \ldots, X_1)\, o(X_k). \qquad(8.128) $$
Of course, as before, S_n^* is a random number. However, while in the case of an IID process we could use the weak law of large numbers to show that for large n, S_n^* will converge to a constant, we cannot do this anymore here, because the weak law of large numbers does not hold for a general stationary process {X_k}.

Instead we compute the expected value of (1/n) log_2 S_n^*:


$$ \mathrm{E}\biggl[ \frac{\log_2 S_n^*}{n} \biggr] = \frac{1}{n} \mathrm{E}\Biggl[ \sum_{k=1}^{n} \log_2\bigl( P(X_k|X_{k-1},\ldots,X_1)\, o(X_k) \bigr) \Biggr] \qquad(8.129) $$
$$ = \frac{1}{n} \sum_{k=1}^{n} \mathrm{E}[\log_2 P(X_k|X_{k-1},\ldots,X_1)] + \frac{1}{n} \sum_{k=1}^{n} \mathrm{E}[\log_2 o(X_k)] \qquad(8.130) $$
$$ = -\frac{1}{n} \sum_{k=1}^{n} H(X_k|X_{k-1},\ldots,X_1) + \mathrm{E}[\log_2 o(X_1)] \qquad(8.131) $$
$$ = -\frac{1}{n} H(X_1,\ldots,X_n) + \mathrm{E}[\log_2 o(X_1)] \qquad(8.132) $$
$$ \xrightarrow{n\to\infty} -H(\{X_k\}) + \mathrm{E}[\log_2 o(X_1)]. \qquad(8.133) $$
Here, (8.131) follows from the definition of entropy and from stationarity; (8.132) follows from the chain rule; and the last equality (8.133) is due to the definition of entropy rate.

Hence, we have shown that for n ≫ 1 we have
$$ \mathrm{E}\biggl[ \frac{\log_2 S_n^*}{n} \biggr] \approx \mathrm{E}[\log_2 o(X_1)] - H(\{X_k\}). \qquad(8.134) $$

If we are unhappy about this much weaker statement, we can fix the situation by assuming that {X_k} is also ergodic. Then, the weak law of large numbers can be applied again and we get
$$ \frac{1}{n} \log_2 S_n^* = \frac{1}{n} \sum_{k=1}^{n} \log_2 P(X_k|X_{k-1},\ldots,X_1) + \frac{1}{n} \sum_{k=1}^{n} \log_2 o(X_k) \qquad(8.135) $$
$$ \to \lim_{n\to\infty} \mathrm{E}[\log_2 P(X_n|X_{n-1},\ldots,X_1)] + \lim_{n\to\infty} \mathrm{E}[\log_2 o(X_n)] \qquad \text{in probability} \qquad(8.136) $$
$$ = -\lim_{n\to\infty} H(X_n|X_{n-1},\ldots,X_1) + \mathrm{E}[\log_2 o(X_1)] \qquad(8.137) $$
$$ = -H(\{X_k\}) + \mathrm{E}[\log_2 o(X_1)]. \qquad(8.138) $$
Here, (8.136) follows from ergodicity; in (8.137) we rely on the definition of entropy and use stationarity; and the last equality (8.138) follows from the second equivalent definition of entropy rate for stationary processes.

So we get the following result.
Theorem 8.16. Consider a stationary and ergodic horse race {X_k} with odds o. Then we have for large n
$$ S_n^* \approx 2^{n\, W^*(\{X_k\})} \qquad \text{in probability}, \qquad(8.139) $$
where
$$ W^*(\{X_k\}) \triangleq \mathrm{E}[\log_2 o(X_1)] - H(\{X_k\}) \qquad(8.140) $$
(where the entropy rate must be specified in bits).

Remark 8.17. As a matter of fact, both IID processes and stationary and ergodic processes satisfy the strong law of large numbers. So in the whole chapter we can replace "in probability" with "almost surely" (or, equivalently, "with probability 1") everywhere.
Example 8.18. Instead of horses, let's consider cards. Assume that we have 52 cards, 26 black and 26 red. The cards are well shuffled, and then opened one by one. We bet on the color, with fair odds
$$ o(\text{red}) = o(\text{black}) = 2. \qquad(8.141) $$
We offer two different strategies:

1. We bet sequentially: We know that it is optimal to bet proportionally, taking the known past into account as side-information. Hence, at the beginning we bet
$$ b(X_1 = \text{red}) = b(X_1 = \text{black}) = \frac{26}{52} = \frac{1}{2}. \qquad(8.142) $$
Then for the second card we bet either, if X_1 = red,
$$ b(X_2 = \text{red}) = \frac{25}{51}, \qquad b(X_2 = \text{black}) = \frac{26}{51}, \qquad(8.143) $$
or, if X_1 = black,
$$ b(X_2 = \text{red}) = \frac{26}{51}, \qquad b(X_2 = \text{black}) = \frac{25}{51}. \qquad(8.144) $$
Etc.

2. We bet on the entire sequence of 52 cards at once. There are $\binom{52}{26}$ different possible sequences of 26 red and 26 black cards, all equally likely. Hence we bet
$$ b(x_1, \ldots, x_{52}) = \frac{1}{\binom{52}{26}} \qquad(8.145) $$
on each of these sequences.

Both strategies are equivalent! To see this note, e.g., that half of the $\binom{52}{26}$ sequences will start with red, i.e., in the second strategy we actually bet half on red in the first card, which is identical to the first strategy.

How much are we going to win? Well, at the end one sequence must show up, so we get for this sequence 2^52 (52 times the odds 2) times the money we have bet on this sequence, which is $1/\binom{52}{26}$, and for the rest we get nothing. Hence
$$ S_{52} = 2^{52} \cdot \frac{1}{\binom{52}{26}} \approx 9.08. \qquad(8.146) $$
Note that in this special problem we actually again have a risk-free situation! We will multiply our money by a factor of 9 for sure. But this is not that surprising when you realize that once all 26 cards of one color are opened, we know the outcome of the rest for sure. This will happen latest with the second last card, but usually even earlier.
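The factor in (8.146) is easily verified numerically (a one-line check of our own):

from math import comb

# Betting on the full sequence of 52 cards with fair odds 2 multiplies the
# wealth by 2^52 / C(52, 26) with certainty, cf. (8.146).
print(2**52 / comb(52, 26))   # approx. 9.08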

Chapter 9

Data Transmission over a Noisy Digital Channel

9.1  Problem Setup

We have already mentioned that information theory can be roughly split into three main areas:

Source coding asks the question of how to compress data efficiently.

Channel coding asks the question of how to transmit data reliably over a channel.

Cryptography asks the question of how to transmit data in a secure way such that nobody unwanted can read it.

So far we have had a quite close first look onto the first topic. Now we go to the second one. We consider the general system shown in Figure 9.1.

Figure 9.1: Most general system model in information theory. (Block diagram: a source feeds a source encoder whose compressed output M enters the channel encoder, producing X; after the modulator, the noisy waveform channel, and the demodulator, the channel decoder observes Y and delivers its message estimate to the source decoder and the destination. The discrete-time channel comprises everything between channel encoder and channel decoder.)

We will simplify this general system model in several ways: Firstly, we assume that all values of the random message M that is to be transmitted over

the channel have the same equal probability. This might sound like a strong
restriction, but actually it is not. The reason is that the output of any good
data compression scheme is (almost) memoryless and uniformly distributed !
Remark 9.1 (Compressed data is IID and uniform). Note that the output of an IID uniform source cannot be compressed. The reason is that every block message of length M is equally likely, which means that the Huffman code cannot assign shorter codewords to more likely messages. This can also be seen directly from the converse of the coding theorem (e.g., Theorem 4.14 or 6.3). A similar argument can be used in the case of a Tunstall code.

Any other source can be compressed. So assume for the moment that the output of a perfect (ideal) compressor has memory and/or is not uniformly distributed. In this case we could design another compressor specifically for this output and compress the data even further. But this is a contradiction to the assumption that the first compressor already is perfect!

Note that in general the random message M emitted by the source might be a sequence of digits (in practice usually a sequence of binary digits). However, we do not really care about the form of the message, but simply assume that there is only a finite number of possible messages and that therefore we can number them from 1 to M.
Definition 9.2. The source emits a random message M that is uniformly distributed over the (finite) message set M = {1, 2, . . . , M}. This means we have |M| = M possible messages that are all equally likely.

Secondly, we simplify our system model by ignoring modulation and demodulation. This means that we think of the modulator and the demodulator to be part of the channel and only consider a discrete-time channel, i.e., the channel input and output are sequences of RVs {X_k} and {Y_k}, respectively, where k denotes the discrete time.

Even more abstractly and generally, we give the following engineering definition of a channel.

Definition 9.3. A channel is that part of a communication system that the engineer either is unwilling or unable to change.

Example 9.4. To explain this notion, we give two (quite practical) examples:

In the design of some communication system, we usually must restrict ourselves to a certain frequency band that is specified by the regulatory administration of the country. Hence, the constraint that we have to use this band becomes part of the channel. Note that there might be other frequency bands that would be better suited to the task, but we are not allowed to use them: We are unable to change that part of our communication system.

Suppose we want to set up a simple communication system between two locations. Suppose further that we have a certain cheap transmitter available and do not want to buy a new one. Hence, the fact that we use exactly this transmitter becomes part of the channel. There might be much better transmitters available, but we are unwilling to change to those.

So we see that once the channel is given, the remaining freedom can then be used to transmit information.

The only question that remains now is how we can translate this engineering definition into something that is easier for a mathematical analysis. The trick here is to abstract the channel further to its three most important features:

What can we put into the channel?

What can come out of the channel?

What is the (probabilistic) relation between the input and the output?

To simplify our life we will make a further assumption: We assume that the channel is memoryless, i.e., the output of the channel only depends on the current input and not on the past inputs. We give the following definition.

Definition 9.5. A discrete memoryless channel (DMC) is a channel specified by

an input alphabet X, which is a set of symbols that the channel accepts as input;

an output alphabet Y, which is a set of symbols that the channel can produce at its output; and

a conditional probability distribution P_{Y|X}(·|x) for all x ∈ X such that
$$ P_{Y_k|X_1,X_2,\ldots,X_k,Y_1,Y_2,\ldots,Y_{k-1}}(y_k|x_1,x_2,\ldots,x_k,y_1,y_2,\ldots,y_{k-1}) = P_{Y|X}(y_k|x_k), \qquad \forall\, k. \qquad(9.1) $$

This conditional distribution P_{Y|X}(·|·) is sometimes also called channel law. It describes the probabilistic connection between input and output. So mathematically a DMC is simply a conditional probability distribution that tells us the distribution of the output for every possible different input letter.

We would like to point out the following important observations from the definition of a DMC:

The current output Y_k depends only on the current input x_k (the channel is memoryless).

• For a particular input xk, the distribution of the output Yk does not change over time (the channel is time-invariant). This can be seen from our notation: We wrote PY|X(yk|xk) and not PYk|Xk(yk|xk)!
• For a given input x, we do not know the output for sure (due to the random noise), but we are only given a certain probability distribution. Hopefully, this distribution is not the same for different inputs, as otherwise we will never be able to recover the input from the distorted output!
Example 9.6. A very typical example of a DMC is the binary symmetric channel (BSC) shown in Figure 9.2. It is a binary channel, i.e., both input and output can only take on two different values: X = Y = {0, 1}. We see that if x = 0, the probability of the output Y being unchanged is 1 − ε, and the probability of a flip is ε. The channel is called symmetric because the probability of a flip is the same from x = 0 to Y = 1 as from x = 1 to Y = 0. We see that

[Figure 9.2: Binary symmetric channel (BSC) with crossover probability ε.]

$$P_{Y|X}(0|0) = 1-\varepsilon, \qquad P_{Y|X}(1|0) = \varepsilon, \qquad (9.2)$$
$$P_{Y|X}(0|1) = \varepsilon, \qquad P_{Y|X}(1|1) = 1-\varepsilon. \qquad (9.3)$$
Another often used channel is the binary erasure channel (BEC) shown in Figure 9.3. This channel has the same binary input as the BSC, but the output alphabet is ternary: If we receive "?", then we know that something went wrong and the input symbol was lost. The probability for such a lost symbol is δ.

Note that this seems to be a better behavior than for the BSC, where we never can be sure whether a 1 really is a 1. In the BEC we are never going to confuse Y = 0 with Y = 1.
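As a quick illustration (a minimal sketch, not part of the notes), the channel laws of the BSC and the BEC can be stored as row-stochastic matrices, one row per input letter; the parameters eps and delta are the crossover and erasure probabilities assumed above.

```python
# Sketch: representing the channel law P_{Y|X} as a row-stochastic matrix.
import numpy as np

def bsc(eps):
    """BSC: rows = inputs {0,1}, columns = outputs {0,1}."""
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

def bec(delta):
    """BEC: rows = inputs {0,1}, columns = outputs {0, '?', 1}."""
    return np.array([[1 - delta, delta, 0.0],
                     [0.0, delta, 1 - delta]])

for W in (bsc(0.1), bec(0.25)):
    assert np.allclose(W.sum(axis=1), 1.0)   # every row is a conditional PMF
```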

Adding these two simplifications to our system model in Figure 9.1, we now get the simplified, abstract system model shown in Figure 9.4. The only two devices in this model that we have not yet described are the encoder and the decoder.
[Figure 9.3: Binary erasure channel (BEC) with erasure probability δ and output alphabet {0, ?, 1}.]


[Figure 9.4: Simplified discrete-time system model: a uniform source produces a message M, the encoder maps it to a codeword X1, ..., Xn, the noisy channel produces the received sequence Y1, ..., Yn, and the decoder delivers its decision to the destination.]

Definition 9.7. The channel encoder is a device that accepts a message m as input and generates from it a sequence of channel input symbols at the output. Usually, of course, we will have a different output sequence for every different message m.

We will assume that all output sequences have the same length n. These output sequences are called codewords of blocklength n.

Mathematically, the encoder is a deterministic mapping
$$\phi\colon \mathcal{M} \to \mathcal{X}^n \qquad (9.4)$$
that assigns to every message m a codeword x = (x1, ..., xn) ∈ X^n. The set of all codewords C is called codebook or simply code.
Definition 9.8. The channel decoder is a device that receives a length-n channel output sequence (i.e., a distorted codeword) and then needs to make a decision on what message has been sent.

Of course we should design this device in a clever way so as to optimize some cost function. Usually, the decoder is designed such that the probability of a message error is minimized. For more details see Section 9.3.

Note that this optimization is in general very difficult, but it can be done during the design phase of the system. Once we have figured out the optimal way of decoding, we do not change the decoder anymore. I.e., mathematically, the decoder is a deterministic mapping
$$\psi\colon \mathcal{Y}^n \to \hat{\mathcal{M}}. \qquad (9.5)$$

Here usually M̂ = M, but sometimes we might have M̂ = M ∪ {0}, where 0 corresponds to a declaration of an error.

In short: A decoder takes any possible channel output sequence y = (y1, ..., yn) and assigns to it a guess m̂ of which message has been transmitted.

9.2 Discrete Memoryless Channels

Next, we are going to investigate the DMC in more detail and make one more important assumption.

Definition 9.9. We say that a DMC is used without feedback if
$$P(x_k|x_1,\ldots,x_{k-1},y_1,\ldots,y_{k-1}) = P(x_k|x_1,\ldots,x_{k-1}), \qquad \forall k, \qquad (9.6)$$
i.e., Xk depends only on past inputs (by choice of the encoder), but not on past outputs. In other words, there is no feedback link from the receiver back to the transmitter that would inform the transmitter about the past outputs.

Remark 9.10. It is important to realize that even though we assume the channel to be memoryless, we do not restrict the encoder to be memoryless! Actually, memory in the encoder will be crucial for our transmission system! As a very simple example consider a codebook with only two codewords of length n = 3:
$$\mathcal{C} = \{000, 111\}. \qquad (9.7)$$
(This code is called the three-times repetition code.) In this case, if x1 = 1, we know for sure that x2 = x3 = 1, i.e., we have very strong memory between the different inputs x_1^n!

In this class we will most of the time (i.e., always, unless we explicitly state it) assume that we do not have any feedback links.
We now prove the following theorem.
Theorem 9.11. If a DMC is used without feedback, then
$$P(y_1,\ldots,y_n|x_1,\ldots,x_n) = \prod_{k=1}^{n} P_{Y|X}(y_k|x_k), \qquad \forall n \ge 1. \qquad (9.8)$$

Proof: The proof only needs the definition of a DMC (9.1) and the assumption that we do not have any feedback (9.6):
$$\begin{aligned}
P(x_1,\ldots,x_n,y_1,\ldots,y_n)
&= P(x_1)\,P(y_1|x_1)\,P(x_2|x_1,y_1)\,P(y_2|x_1,x_2,y_1)\cdots\\
&\qquad\cdots P(x_n|x_1,\ldots,x_{n-1},y_1,\ldots,y_{n-1})\,P(y_n|x_1,\ldots,x_n,y_1,\ldots,y_{n-1}) &(9.9)\\
&= P(x_1)\,P(y_1|x_1)\,P(x_2|x_1)\,P(y_2|x_1,x_2,y_1)\cdots P(x_n|x_1,\ldots,x_{n-1})\,P(y_n|x_1,\ldots,x_n,y_1,\ldots,y_{n-1}) &(9.10)\\
&= P(x_1)\,P_{Y|X}(y_1|x_1)\,P(x_2|x_1)\,P_{Y|X}(y_2|x_2)\cdots P(x_n|x_1,\ldots,x_{n-1})\,P_{Y|X}(y_n|x_n) &(9.11)\\
&= P(x_1)\,P(x_2|x_1)\cdots P(x_n|x_1,\ldots,x_{n-1})\prod_{k=1}^{n} P_{Y|X}(y_k|x_k) &(9.12)\\
&= P(x_1,\ldots,x_n)\prod_{k=1}^{n} P_{Y|X}(y_k|x_k), &(9.13)
\end{aligned}$$
where the first equality (9.9) follows from the chain rule, the second (9.10) because we do not have feedback, and the third (9.11) from the definition of a DMC.

Hence
$$P(y_1,\ldots,y_n|x_1,\ldots,x_n) = \frac{P(x_1,\ldots,x_n,y_1,\ldots,y_n)}{P(x_1,\ldots,x_n)} = \prod_{k=1}^{n} P_{Y|X}(y_k|x_k). \qquad (9.14)$$
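The product form (9.8) is what makes DMCs without feedback computationally convenient. The following minimal sketch (not part of the notes; channel matrix W indexed as W[x, y], sequences given as tuples of integer symbols) evaluates this product.

```python
# Sketch: P(y_1..y_n | x_1..x_n) = prod_k P_{Y|X}(y_k|x_k) for a DMC without feedback.
import numpy as np

def sequence_likelihood(W, x_seq, y_seq):
    return float(np.prod([W[x, y] for x, y in zip(x_seq, y_seq)]))

W = np.array([[0.9, 0.1], [0.1, 0.9]])               # a BSC with eps = 0.1
print(sequence_likelihood(W, (0, 0, 1), (0, 1, 1)))  # 0.9 * 0.1 * 0.9 = 0.081
```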

While in source coding we have seen that the entropy rate H({Uk}) was the crucial quantity, it probably will not come as a big surprise that in channel coding this role is taken over by the mutual information I(X; Y), i.e., the amount of information that Y reveals about X (or vice versa). So next we will investigate the mutual information between the channel input and output of a DMC.

Lemma 9.12. The mutual information I(X; Y) between the input X and output Y of a DMC that is used without feedback is a function of the channel law PY|X(·|·) and the PMF of the input PX(·):
$$I(X;Y) = f\bigl(P_X, P_{Y|X}\bigr). \qquad (9.15)$$
To be precise we have
$$I(X;Y) = \mathrm{E}_X\Bigl[D\bigl(P_{Y|X}(\cdot|X)\,\big\|\,(P_X P_{Y|X})(\cdot)\bigr)\Bigr], \qquad (9.16)$$
where we have defined
$$(P_X P_{Y|X})(y) \triangleq \sum_{x'} P_X(x')\,P_{Y|X}(y|x'), \qquad \forall y. \qquad (9.17)$$

Proof: We remember that mutual information can be written using the relative entropy D(·‖·):
$$\begin{aligned}
I(X;Y) &= D(P_{X,Y}\,\|\,P_X P_Y) &(9.18)\\
&= \sum_{(x,y)} P_{X,Y}(x,y)\log\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)} &(9.19)\\
&= \sum_{x} P_X(x)\sum_{y} P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{P_Y(y)} &(9.20)\\
&= \sum_{x} P_X(x)\sum_{y} P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{\sum_{x'}P_X(x')P_{Y|X}(y|x')} &(9.21)\\
&= \mathrm{E}_X\Bigl[D\bigl(P_{Y|X}(\cdot|X)\,\big\|\,(P_X P_{Y|X})(\cdot)\bigr)\Bigr]. &(9.22)
\end{aligned}$$

Hence, we learn that the mutual information is a function of the channel law PY|X(·|·), which is given to us via the channel, and of the distribution on the input, which can be chosen freely by us via the design of the encoder. Based on the fact that we can design the encoder, but have to accept the channel that we get, the following definition is now quite natural.

Definition 9.13. We define the information capacity of a given DMC as
$$C_{\mathrm{inf}} \triangleq \max_{P_X(\cdot)} I(X;Y). \qquad (9.23)$$

Note that this is a purely mathematical definition without any engineering meaning so far!
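For a binary-input DMC the maximization in (9.23) is over a single parameter, so it can even be evaluated by brute force. The following sketch (not part of the notes; a simple grid search, not an efficient algorithm) illustrates the definition.

```python
# Sketch: C_inf = max_{P_X} I(X;Y) for a binary-input DMC via a grid over P_X(1).
import numpy as np

def mutual_information(p_x, W):
    """I(X;Y) in bits for input PMF p_x and channel matrix W[x, y]."""
    p_xy = p_x[:, None] * W                  # joint PMF
    p_y = p_xy.sum(axis=0)                   # output PMF
    mask = p_xy > 0
    ratio = p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask]
    return float((p_xy[mask] * np.log2(ratio)).sum())

def capacity_binary_input(W, grid=10001):
    ps = np.linspace(0.0, 1.0, grid)
    return max(mutual_information(np.array([1 - p, p]), W) for p in ps)

W_bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
print(capacity_binary_input(W_bsc))          # ~ 0.531 bits = 1 - H_b(0.1)
```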

9.3 Coding for a DMC

So we have introduced our system model (see Figure 9.4) and have understood some issues about the channel. We are now ready to think about the design of an encoder and decoder for a given DMC. Recalling our mathematical definitions of the encoder and decoder in Definitions 9.7 and 9.8, we see that such a design basically is equivalent to specifying a codebook C, an encoding function φ(·), and a decoding function ψ(·). Hence, we give the following definition.
Definition 9.14. An (M, n) coding scheme for a DMC (X, Y, PY|X) consists of

• the message set M = {1, ..., M},
• the codebook (or code) C of all length-n codewords,
• an encoding function φ(·), and
• a decoding function ψ(·).
Remark 9.15. We make a short comment about the choice of names. It is an unfortunate habit of many information theorists (including [CT06]) to call the (M, n) coding scheme an "(M, n) code". The problem is that the term "code" is already used by all coding theorists to denote the codebook, i.e., the unsorted list of codewords. We only mention the Hamming code and the

Reed–Solomon code as examples. To avoid confusion between the two groups of people, we will always try to use "codebook" for the list of codewords and "coding scheme" for the complete system consisting of the codebook and the mappings.
As one sees from its name, the two crucial parameters of interest of an (M, n) coding scheme are the number of possible messages M and the codeword length (blocklength) n. Usually we would like to make M large, because this way we can transmit more information over the channel, and the blocklength n should be short so that it does not take too long (we do not need too many discrete time-steps) to transmit the codeword. In other words,

• we have M equally likely messages, i.e., the entropy of the message is H(M) = log2 M bits;
• we need n transmissions of a channel input symbol Xk over the channel in order to transmit the complete message.

Hence, we transmit on average H(M)/n = (log2 M)/n bits per transmission of a channel symbol. This insight is crucial and leads to the following fundamental definition.

Definition 9.16. The rate¹ R of an (M, n) coding scheme is defined as
$$R \triangleq \frac{\log_2 \mathrm{M}}{n} \quad \text{bits/transmission}. \qquad (9.24)$$

¹ We define the rate here using a logarithm of base 2. However, we can use any logarithm as long as we adapt the units accordingly.

The rate is one of the most important parameters of any coding scheme. It tells us how much information is transmitted every time the channel is used, given the particular coding scheme. The unit "bits/transmission" is sometimes also called "bits/channel use" or "bits/symbol".
So far we have only worried about the encoder. However, the decoder is at least as important: If the decoder is not able to recover the transmitted message, the coding scheme is not very useful! Or, to say it in other words: The definition of the transmission rate R only makes sense if the message really arrives at the destination, i.e., if the receiver does not make too many decoding errors.

So the second crucial parameter of a coding scheme is its performance measured in its probability of making an error. To explain more about this, we give the following definitions.
Definition 9.17. Given that message M = m has been sent, we define λm to be the corresponding error probability:
$$\lambda_m \triangleq \Pr\bigl[\psi(\mathbf{Y}) \ne m \,\big|\, \mathbf{X} = \phi(m)\bigr] \qquad (9.25)$$
$$\phantom{\lambda_m} = \sum_{\mathbf{y}\in\mathcal{Y}^n} P(\mathbf{y}|\mathbf{x}(m))\,\mathrm{I}\{\psi(\mathbf{y}) \ne m\}, \qquad (9.26)$$

where I{·} is the indicator function
$$\mathrm{I}\{\text{statement}\} \triangleq \begin{cases} 1 & \text{if statement is true},\\ 0 & \text{if statement is wrong}. \end{cases} \qquad (9.27)$$
The maximum error probability λ^(n) for an (M, n) coding scheme is defined as
$$\lambda^{(n)} \triangleq \max_{m\in\mathcal{M}} \lambda_m. \qquad (9.28)$$
The average error probability Pe^(n) for an (M, n) coding scheme is defined as
$$P_e^{(n)} \triangleq \frac{1}{\mathrm{M}} \sum_{m=1}^{\mathrm{M}} \lambda_m. \qquad (9.29)$$
Now we know all main definitions to formulate our main objective.

The main objective of channel coding is to try to find an (M, n) coding scheme such that the maximum error probability λ^(n) is as small as possible for a given rate R.

Note that in order to achieve our main objective, we can not only vary the design of the codewords in the codebook C, but also the blocklength n and the number of codewords M, as long as we keep the rate R = (log M)/n constant! Usually, we formulate this in a way that chooses a blocklength n for a fixed R. The number of needed codewords is then
$$\mathrm{M} = 2^{nR}. \qquad (9.30)$$
Here we have assumed that the rate R is given in bits/channel use, i.e., the logarithm is to the base 2. If this is not the case, we will have to change the form to, e.g., e^{nR} for the case of a rate measured in nats.
Next we ask the question how the decoder needs to be designed such that the error probability is minimized. Given that the random message M = m has been fed to the encoder, the decoder receives the sequence Y = (Y1, ..., Yn), which has the conditional distribution
$$P(\mathbf{Y}|\mathbf{X}(m)) = \prod_{k=1}^{n} P_{Y|X}\bigl(Y_k\big|X_k(m)\bigr). \qquad (9.31)$$
The receiver needs to guess which message has been sent, and we want to do this in a way that minimizes the probability of a decoding error or, equivalently, maximizes the probability of a correct decoding:
$$\Pr(\text{correct decoding}) = \sum_{\mathbf{y}} P_{\mathbf{Y}}(\mathbf{y})\Pr(\text{correct decoding}\mid\mathbf{Y}=\mathbf{y}) \qquad (9.32)$$
$$\phantom{\Pr(\text{correct decoding})} = \sum_{\mathbf{y}} P_{\mathbf{Y}}(\mathbf{y})\Pr[M = \psi(\mathbf{y})\mid\mathbf{Y}=\mathbf{y}]. \qquad (9.33)$$
This is maximized if for each y we choose ψ(y) such as to maximize² the conditional probability Pr[M = ψ(y) | Y = y]:
$$\psi(\mathbf{y}) \triangleq \operatorname*{argmax}_{m\in\mathcal{M}} P_{M|\mathbf{Y}}(m|\mathbf{y}). \qquad (9.34)$$
Such a decoder is called maximum a posteriori (MAP) decoder. It is the best decoder. Its form can be further changed as follows:
$$\begin{aligned}
\psi(\mathbf{y}) &\triangleq \operatorname*{argmax}_{m\in\mathcal{M}} P_{M|\mathbf{Y}}(m|\mathbf{y}) &(9.35)\\
&= \operatorname*{argmax}_{m\in\mathcal{M}} \frac{P_{M,\mathbf{Y}}(m,\mathbf{y})}{P_{\mathbf{Y}}(\mathbf{y})} &(9.36)\\
&= \operatorname*{argmax}_{m\in\mathcal{M}} \frac{P_M(m)\,P_{\mathbf{Y}|M}(\mathbf{y}|m)}{P_{\mathbf{Y}}(\mathbf{y})} &(9.37)\\
&= \operatorname*{argmax}_{m\in\mathcal{M}} P_M(m)\,P_{\mathbf{Y}|M}(\mathbf{y}|m) &(9.38)\\
&= \operatorname*{argmax}_{m\in\mathcal{M}} P_M(m)\,P_{\mathbf{Y}|\mathbf{X}}\bigl(\mathbf{y}\big|\mathbf{x}(m)\bigr), &(9.39)
\end{aligned}$$
where (9.38) follows because PY(y) is not a function of m. If we now assume (as we usually do) that M is uniform, i.e., PM(m) = 1/2^{nR}, we get
$$\psi(\mathbf{y}) = \operatorname*{argmax}_{m\in\mathcal{M}} \frac{1}{2^{nR}} P_{\mathbf{Y}|\mathbf{X}}\bigl(\mathbf{y}\big|\mathbf{x}(m)\bigr) \qquad (9.40)$$
$$\phantom{\psi(\mathbf{y})} = \operatorname*{argmax}_{m\in\mathcal{M}} P_{\mathbf{Y}|\mathbf{X}}\bigl(\mathbf{y}\big|\mathbf{x}(m)\bigr), \qquad (9.41)$$
where (9.41) follows because 1/2^{nR} is not a function of m. A decoder that implements the rule given in (9.41) is called maximum likelihood (ML) decoder.
An ML decoder is equivalent to the optimum MAP decoder if (and only if!) the messages are uniformly distributed. However, even if this were not the case, people often prefer the suboptimal ML decoder over the optimal MAP decoder, because the MAP decoder needs to know the source distribution PM(·), which usually is not available. Moreover, if the source changes, the decoder needs to be adapted as well, which is very inconvenient for a practical system design: We definitely do not want to change our radio receiver simply because the type of transmitted music is different on Sundays than during the week...!

An ML decoder, on the other hand, does not depend on the source, and since a uniform source is kind of the most difficult case anyway, the ML decoder is like a worst-case design: The decoder is optimized for the most difficult source.

Irrespective of the type of decoder used, it is often convenient to describe it using decoding regions.
² Note that, strictly speaking, argmax_m is not defined if the maximum-achieving value m is not unique. However, in that case, the performance of the decoder does not change irrespective of which m among all maximum-achieving values is chosen. It is usual to define that argmax_m will pick a value m among all optimal values uniformly at random.

Definition 9.18. The mth decoding region is defined as
$$\mathcal{D}_m \triangleq \{\mathbf{y} : \psi(\mathbf{y}) = m\}, \qquad (9.42)$$
i.e., it is the set of all those channel output sequences that will be decoded to the message m.
Even though it is not strictly correct, it sometimes is helpful for the understanding to think of the channel as a device that adds noise to the transmitted codeword. The decoding regions can then be thought of as "clouds" around the codewords. If the noise is not too strong, then Y will be relatively close to the transmitted codeword x(m), so a good decoder should look for the closest codeword to a received Y. See Figure 9.5 for an illustration of this idea.

[Figure 9.5: Decoding regions D1, ..., D5 of a code with five codewords x(1), ..., x(5); the noise moves the received vector away from the transmitted codeword.]
With the help of the decoding regions, we can now rewrite the error probability λm as follows:
$$\lambda_m = \sum_{\mathbf{y}\notin\mathcal{D}_m} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(m)). \qquad (9.43)$$
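For very small examples both the ML decoding regions and the exact values of (9.43) can be computed by complete enumeration. The following sketch (not from the notes; tiny repetition code over a BSC with assumed parameter eps = 0.1) illustrates this.

```python
# Sketch: exhaustive ML decoding regions and exact error probabilities lambda_m.
import itertools
import numpy as np

W = np.array([[0.9, 0.1], [0.1, 0.9]])        # BSC, eps = 0.1
codebook = [(0, 0, 0), (1, 1, 1)]             # three-times repetition code

def likelihood(x, y):
    return np.prod([W[xi, yi] for xi, yi in zip(x, y)])

lam = [0.0] * len(codebook)
for y in itertools.product((0, 1), repeat=3):           # all output sequences Y^n
    m_hat = max(range(len(codebook)), key=lambda m: likelihood(codebook[m], y))
    for m, x in enumerate(codebook):                    # y lies outside D_m for m != m_hat
        if m != m_hat:
            lam[m] += likelihood(x, y)
print(lam)    # both ~ 3*eps^2*(1-eps) + eps^3 = 0.028
```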

Unfortunately, in spite of its simplicity, the evaluation of this expression for an ML decoder is rather difficult in general. So it can be useful to have a bound, particularly an upper bound, at hand.

9.4 The Bhattacharyya Bound

For the situation of a code with only two codewords, there exists a very elegant bound on the error probability λm under an ML decoder. According to its discoverer, this bound has the difficult name Bhattacharyya bound.

In order to derive it, let's assume that we have a code with only two codewords: x(1) and x(2). There are only two decoding regions, which must satisfy D2 = D1^c. So, (9.43) simplifies significantly:
$$\lambda_2 = \sum_{\mathbf{y}\in\mathcal{D}_1} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2)) \qquad (9.44)$$
and similarly for λ1.

Now recall that the ML rule looks for the codeword that maximizes the conditional probability:
$$\begin{aligned}
\mathbf{y}\in\mathcal{D}_1 \implies\; & P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(1)) \ge P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2)) &(9.45)\\
\implies\; & \sqrt{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(1))} \ge \sqrt{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2))} &(9.46)\\
\implies\; & \sqrt{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(1))\,P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2))} \ge P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2)). &(9.47)
\end{aligned}$$
(Note that the inverse direction might not hold because we have not specified what happens in the case of a tie when PY|X(y|x(1)) = PY|X(y|x(2)).) Using (9.47) in (9.44) then yields
$$\lambda_2 = \sum_{\mathbf{y}\in\mathcal{D}_1} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2)) \le \sum_{\mathbf{y}\in\mathcal{D}_1} \sqrt{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(1))\,P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2))}. \qquad (9.48)$$
A similar derivation leads to
$$\lambda_1 = \sum_{\mathbf{y}\in\mathcal{D}_2} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(1)) \le \sum_{\mathbf{y}\in\mathcal{D}_2} \sqrt{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(1))\,P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2))}, \qquad (9.49)$$
and adding (9.48) and (9.49) we obtain
$$\lambda_1 + \lambda_2 \le \sum_{\mathbf{y}} \sqrt{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(1))\,P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}(2))}. \qquad (9.50)$$
If we now assume that the code is used on a DMC without feedback, we can further simplify this as follows:
$$\begin{aligned}
\lambda_m \le \lambda_1 + \lambda_2 &\le \sum_{y_1,\ldots,y_n} \prod_{k=1}^{n} \sqrt{P_{Y|X}\bigl(y_k\big|x_k(1)\bigr)\,P_{Y|X}\bigl(y_k\big|x_k(2)\bigr)} &(9.51),\,(9.52)\\
&= \prod_{k=1}^{n} \sum_{y} \sqrt{P_{Y|X}\bigl(y\big|x_k(1)\bigr)\,P_{Y|X}\bigl(y\big|x_k(2)\bigr)}. &(9.53)
\end{aligned}$$
Here in the last step we did the following type of algebraic transformation:
$$a_1 b_1 + a_1 b_2 + a_2 b_1 + a_2 b_2 = (a_1+a_2)(b_1+b_2). \qquad (9.54)$$
We have shown the following bound.

Theorem 9.19 (Bhattacharyya Bound). The worst-case block error probability of a code with two codewords that is used over a DMC without feedback and that is decoded with ML decoding is upper-bounded as follows:
$$\lambda^{(n)} \le \prod_{k=1}^{n} \sum_{y} \sqrt{P_{Y|X}\bigl(y\big|x_k(1)\bigr)\,P_{Y|X}\bigl(y\big|x_k(2)\bigr)}. \qquad (9.55)$$
In practice, one often encounters channels with a binary input alphabet. For such channels, Theorem 9.19 can be expressed even more simply.

Definition 9.20. For a DMC with a binary input alphabet X = {0, 1}, we define the Bhattacharyya distance DB as
$$D_B \triangleq -\log_2 \sum_{y} \sqrt{P_{Y|X}(y|0)\,P_{Y|X}(y|1)}. \qquad (9.56)$$

For a code used on a binary-input DMC, we then have for xk(1) ≠ xk(2),
$$\sum_{y} \sqrt{P_{Y|X}\bigl(y\big|x_k(1)\bigr)\,P_{Y|X}\bigl(y\big|x_k(2)\bigr)} = \sum_{y} \sqrt{P_{Y|X}(y|0)\,P_{Y|X}(y|1)} = 2^{-D_B}, \qquad (9.57),\,(9.58)$$
and for xk(1) = xk(2),
$$\sum_{y} \sqrt{P_{Y|X}\bigl(y\big|x_k(1)\bigr)\,P_{Y|X}\bigl(y\big|x_k(2)\bigr)} = \sum_{y} P_{Y|X}\bigl(y\big|x_k(1)\bigr) = 1. \qquad (9.59)$$
Hence, we see that for binary-input DMCs, Theorem 9.19 can be rewritten as follows.

Corollary 9.21. For a binary-input DMC and the code {x(1), x(2)}, the worst-case block error probability (using an ML decoder) satisfies
$$\lambda^{(n)} \le 2^{-D_B\, d_H(\mathbf{x}(1),\mathbf{x}(2))}, \qquad (9.60)$$
where d_H(·, ·) denotes the Hamming distance.

Example 9.22. Consider the BEC shown in Figure 9.6. We have
$$D_B = -\log_2\Bigl(\sqrt{(1-\delta)\cdot 0} + \sqrt{\delta\cdot\delta} + \sqrt{0\cdot(1-\delta)}\Bigr) = -\log_2 \delta \qquad (9.61)$$
and therefore
$$\lambda^{(n)} \le 2^{\,d_H(\mathbf{x}(1),\mathbf{x}(2))\log_2\delta} = \delta^{\,d_H(\mathbf{x}(1),\mathbf{x}(2))}. \qquad (9.62)$$

[Figure 9.6: Binary erasure channel (BEC) with erasure probability δ.]
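The quantities in Definition 9.20 and Corollary 9.21 are easy to evaluate numerically. The sketch below (not part of the notes; BEC with assumed δ = 0.25 and two complementary codewords) reproduces the bound δ^{d_H} of Example 9.22.

```python
# Sketch: Bhattacharyya distance (9.56) and the bound (9.60) for two codewords.
import numpy as np

def bhattacharyya_distance(W):
    """D_B = -log2 sum_y sqrt(P(y|0) P(y|1)) for a binary-input channel matrix W."""
    return -np.log2(np.sum(np.sqrt(W[0] * W[1])))

def bhattacharyya_bound(W, x1, x2):
    d_hamming = sum(a != b for a, b in zip(x1, x2))
    return 2.0 ** (-bhattacharyya_distance(W) * d_hamming)

W_bec = np.array([[0.75, 0.25, 0.0], [0.0, 0.25, 0.75]])   # BEC, delta = 0.25
print(bhattacharyya_bound(W_bec, (0, 0, 0), (1, 1, 1)))    # delta^3 = 0.015625
```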

9.5 Operational Capacity

We have seen in Section 9.3 that it is our goal to make the maximum probability of error as small as possible. It should be quite obvious, though, that unless we have a very special (and very unrealistic) channel, it will never be possible to make the error probability equal to zero. It also turns out that the analysis of the optimal system is horribly difficult and not even tractable by simulation or numerical search.

Shannon's perhaps most important contribution was a way around this problem. He realized that in general a system should perform better if we make n larger. This is quite obvious if we keep the number of codewords M constant: In this case we can make the distance between the different codewords larger and larger, and thereby the probability that the decoder will mix them up due to the noise will become smaller and smaller. However, if we keep M constant while making n larger, we implicitly let the rate R tend to zero (see (9.24) or (9.30)). Hence, ultimately, for very large n we transmit almost no information anymore.

At the time of Shannon every engineer was convinced that the only way of improving the error probability was to reduce the rate. Shannon, on the other hand, had the following incredible insight:

It is possible to reduce the error probability without reducing the rate!

If you think this is trivially obvious, you probably have not understood the implications of this statement...

Using this insight, Shannon then found a way around the obstacle that it is basically impossible to analyze or even find an optimal coding scheme for a given DMC. His trick was not to try to analyze a system for a fixed rate

and blocklength, but to make the blocklength n arbitrarily large and at the same time ask the error probability to become arbitrarily small. He gave the following two crucial definitions.

Definition 9.23. A rate R is said to be achievable on a given DMC if there exists a sequence of (2^{nR}, n) coding schemes such that the maximum error probability λ^(n) tends to 0 as n → ∞.

Definition 9.24. The operational capacity C_op of a DMC is defined to be the supremum of all achievable rates.

Hence we see that the capacity is the highest amount of information (in bits) that can be transmitted over the channel per transmission reliably, in the sense that the error probability can be made arbitrarily small if we make n sufficiently large.

Note that Shannon purposely ignored two issues that actually are important in practice. Firstly, he completely ignores delay in the system. If we make n very large, this means that as a first step the encoder needs to wait until it gets the complete random message M, which usually is encoded as a binary string, i.e., if the number of messages M is very large (which will be the case if n is large!), we have to wait for a long binary sequence. Next the codeword needs to be transmitted, which takes n transmissions. The receiver can only decode once it has received the full sequence (Y1, ..., Yn). So we see that large n causes large delays in the system.³

Secondly, Shannon does not worry about the practicability of a scheme, i.e., he does not bother about the computational complexity of the system.⁴
Before we start proving Shannon's main results, we would like to recapitulate some of the important insights we have seen so far:

• If we fix n, but increase the number of messages M (i.e., we increase the rate R), we need to choose more codewords in the same n-dimensional vector space. Thereby we increase the chance that the decoder will mix them up. Hence we enlarge the error probability.

• However, if we increase both the number of messages M and n (such that the rate R remains constant), then it might be possible that we increase the space between the different codewords in the n-dimensional vector space (even though the number of codewords is increasing!). If this is the case, then we actually reduce the chance of a mix-up and thereby the error probability.
³ The delays you usually experience when making an international telephone call that is routed via satellites are not mainly caused by the distance, but by the channel coding used with its large blocklength!
⁴ To see the problem, just imagine what an ML decoder will look like for codewords of blocklength n = 100 000...

• Whatever we do, there is no way of having a system with exactly zero error probability.

9.6 Two Important Lemmas

Before we can figure out how to compute the operational capacity, we need two important lemmas. The first one relates error probability with entropy.

Lemma 9.25 (Fano's Inequality). Let U and Û be L-ary (L ≥ 2) RVs taking value in the same alphabet.⁵ Let the error probability be defined as
$$P_e \triangleq \Pr[U \ne \hat{U}]. \qquad (9.63)$$
Then
$$H_b(P_e) + P_e\log(L-1) \ge H(U|\hat{U}). \qquad (9.64)$$

Proof: Define the indicator RV (it indicates an error!)
$$Z \triangleq \begin{cases} 1 & \text{if } U \ne \hat{U},\\ 0 & \text{if } U = \hat{U}. \end{cases} \qquad (9.65)$$
Then
$$P_Z(1) = P_e, \qquad (9.66)$$
$$P_Z(0) = 1 - P_e, \qquad (9.67)$$
$$H(Z) = H_b(P_e). \qquad (9.68)$$
Now we use the chain rule to derive the following:
$$H(U,Z|\hat{U}) = H(U|\hat{U}) + \underbrace{H(Z|U,\hat{U})}_{=0} = H(U|\hat{U}) \qquad (9.69)$$
and
$$\begin{aligned}
H(U,Z|\hat{U}) &= H(Z|\hat{U}) + H(U|\hat{U},Z) &(9.70)\\
&\le H(Z) + H(U|\hat{U},Z) &(9.71)\\
&= H_b(P_e) + P_Z(0)\underbrace{H(U|\hat{U},Z=0)}_{=0\ \text{because } U=\hat{U}} + P_Z(1)\underbrace{H(U|\hat{U},Z=1)}_{\le\log(L-1)\ \text{because } U\ne\hat{U}} &(9.72)\\
&\le H_b(P_e) + P_e\log(L-1), &(9.73)
\end{aligned}$$
where the first inequality follows because conditioning cannot increase entropy.
⁵ You may think of Û as a decision or guess about U.

Example 9.26. Let H(U|Û) = 1/2 bit for binary random variables U and Û. Then Fano's inequality tells us that
$$H_b(P_e) \ge \tfrac{1}{2}\ \text{bit}, \qquad (9.74)$$
i.e.,
$$0.11 \le P_e \le 0.89. \qquad (9.75)$$
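The interval in (9.75) can be recovered numerically by solving H_b(p) = 1/2; a minimal sketch (not part of the notes; simple bisection, no special libraries assumed):

```python
# Sketch: solve H_b(p) = 1/2 on (0, 1/2) by bisection; by symmetry the interval
# of allowed P_e is [p*, 1 - p*].
import math

def hb(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

lo, hi = 1e-12, 0.5
for _ in range(100):                 # H_b is increasing on (0, 1/2)
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if hb(mid) < 0.5 else (lo, mid)
print(lo, 1 - lo)                    # ~ 0.1100 and 0.8900
```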

Often, we use Fano's inequality in a looser form.

Corollary 9.27. Let U and Û be L-ary (L ≥ 2) RVs taking value in the same alphabet and let the error probability be defined as in (9.63). Then
$$\log 2 + P_e\log L \ge H(U|\hat{U}) \qquad (9.76)$$
or
$$P_e \ge \frac{H(U|\hat{U}) - \log 2}{\log L}. \qquad (9.77)$$

Proof: This follows directly from (9.64) by bounding H_b(P_e) ≤ log 2 and log(L − 1) < log L.
The second lemma is related to our discussion of Markov processes. We give the following important definition.

Definition 9.28. We say that the three RVs X, Y, and Z form a Markov chain, written X → Y → Z, if
$$P(z|y,x) = P(z|y) \qquad (9.78)$$
or, equivalently,
$$H(Z|Y,X) = H(Z|Y). \qquad (9.79)$$

A Markov chain has a strong engineering meaning as it is implicitly given by a chain of processing boxes as shown in Figure 9.7.

[Figure 9.7: A Markov chain X → Y → Z: X enters Processor 1, its output Y enters Processor 2, which produces Z. Note that the processors can be deterministic or random.]

Remark 9.29. An equivalent definition of a Markov chain is to say that, conditional on Y, the RVs X and Z are independent. We leave it to the reader to prove that this is equivalent to (9.78) and (9.79).

Lemma 9.30 (Data Processing Inequality (DPI)). If X → Y → Z, then
$$I(X;Z) \le I(X;Y) \qquad \text{and} \qquad I(X;Z) \le I(Y;Z). \qquad (9.80)$$
Hence, processing destroys (or better: cannot create) information.

Proof: Using Markovity and the fact that conditioning cannot increase entropy, we have
$$\begin{aligned}
I(Y;Z) &= H(Z) - \underbrace{H(Z|Y)}_{=H(Z|Y,X)} &(9.81)\\
&= H(Z) - \underbrace{H(Z|Y,X)}_{\le H(Z|X)} &(9.82)\\
&\ge H(Z) - H(Z|X) &(9.83)\\
&= I(X;Z). &(9.84)
\end{aligned}$$
The first inequality in (9.80) follows in the same way because Z → Y → X is also a Markov chain.

9.7 Converse to the Channel Coding Theorem

Similarly to the chapters about source coding where we have proven a source
coding theorem, we will now derive a channel coding theorem. And similarly to
the source coding theorem, the channel coding theorem consists of a converse
part that tells what we cannot do and an achievability part that tells what is
possible. We start with the converse part.
Suppose someone gives you an achievable coding scheme, i.e., a coding scheme (or rather a sequence of coding schemes) for which Pe^(n) → 0 as n → ∞. Taking this scheme and remembering the definition of the code rate (Definition 9.16), we have
$$\begin{aligned}
R &= \frac{\log \mathrm{M}}{n} &(9.85)\\
&= \frac{1}{n} H(M) &(9.86)\\
&= \frac{1}{n}\bigl(H(M) - H(M|\hat{M})\bigr) + \frac{1}{n} H(M|\hat{M}) &(9.87)\\
&= \frac{1}{n} H(M|\hat{M}) + \frac{1}{n} I(M;\hat{M}) &(9.88)\\
&\le \frac{1}{n} H(M|\hat{M}) + \frac{1}{n} I(X_1^n;Y_1^n) &(9.89)\\
&\le \frac{\log 2}{n} + \underbrace{\frac{\log \mathrm{M}}{n}}_{=R} P_e^{(n)} + \frac{1}{n} I(X_1^n;Y_1^n) &(9.90)\\
&= \frac{\log 2}{n} + R P_e^{(n)} + \frac{1}{n}\bigl(H(Y_1^n) - H(Y_1^n|X_1^n)\bigr) &(9.91)\\
&= \frac{\log 2}{n} + R P_e^{(n)} + \frac{1}{n}\Biggl(\sum_{k=1}^{n} H\bigl(Y_k\big|Y_1^{k-1}\bigr) - \sum_{k=1}^{n} H\bigl(Y_k\big|Y_1^{k-1},X_1^n\bigr)\Biggr) &(9.92)\\
&= \frac{\log 2}{n} + R P_e^{(n)} + \frac{1}{n}\Biggl(\sum_{k=1}^{n} H\bigl(Y_k\big|Y_1^{k-1}\bigr) - \sum_{k=1}^{n} H(Y_k|X_k)\Biggr) &(9.93)\\
&\le \frac{\log 2}{n} + R P_e^{(n)} + \frac{1}{n}\Biggl(\sum_{k=1}^{n} H(Y_k) - \sum_{k=1}^{n} H(Y_k|X_k)\Biggr) &(9.94)\\
&= \frac{\log 2}{n} + R P_e^{(n)} + \frac{1}{n}\sum_{k=1}^{n} I(X_k;Y_k) &(9.95)\\
&\le \frac{\log 2}{n} + R P_e^{(n)} + \frac{1}{n}\sum_{k=1}^{n} \underbrace{\max_{P_X} I(X;Y)}_{C_{\mathrm{inf}}} &(9.96)\\
&= \frac{\log 2}{n} + R P_e^{(n)} + C_{\mathrm{inf}}. &(9.97)
\end{aligned}$$

Here, (9.86) follows because we assume that the messages are uniformly distributed; (9.89) follows by twice applying the data processing inequality (Lemma 9.30, see Figure 9.4); the subsequent inequality (9.90) follows from Fano's inequality (Corollary 9.27) with Pe^(n) = Pr[M ≠ M̂]; in (9.92) we use the chain rule; the equality in (9.93) is based on our assumption that we have a DMC without feedback; and (9.96) follows from the definition of the information capacity (Definition 9.13).
Note that (9.93) does not hold if we drop the condition that the DMC is used without feedback! While it is true for any DMC that
$$H\bigl(Y_k\big|Y_1^{k-1},X_1^k\bigr) = H(Y_k|X_k) \qquad (9.98)$$
due to (9.1), we have here
$$\begin{aligned}
H\bigl(Y_k\big|Y_1^{k-1},X_1^n\bigr) &= H\bigl(Y_k\big|Y_1^{k-1},X_1^k\bigr) - I\bigl(X_{k+1}^n;Y_k\big|Y_1^{k-1},X_1^k\bigr) &(9.99)\\
&= H(Y_k|X_k) - H\bigl(X_{k+1}^n\big|Y_1^{k-1},X_1^k\bigr) + H\bigl(X_{k+1}^n\big|Y_1^{k},X_1^k\bigr). &(9.100)
\end{aligned}$$
However, note that
$$H\bigl(X_{k+1}^n\big|Y_1^{k-1},X_1^k\bigr) = H\bigl(X_{k+1}^n\big|Y_1^{k},X_1^k\bigr) \qquad (9.101)$$
if, and only if, (9.6) is satisfied, too.


Now recall that we have assumed an achievable coding scheme, i.e., the coding scheme under investigation is by assumption such that Pe^(n) → 0 as n → ∞. Therefore, we see that for n → ∞,
$$R \le C_{\mathrm{inf}}. \qquad (9.102)$$
We have proven the following theorem.

Theorem 9.31 (Converse Part of the Channel Coding Theorem for a DMC). Any coding scheme that has an average error probability Pe^(n) going to zero as the blocklength n tends to infinity must have a rate that is not larger than the information capacity:
$$R \le C_{\mathrm{inf}}. \qquad (9.103)$$

Remark 9.32. Note that this is a weak converse. It is possible to prove a


stronger version that shows that for rates above capacity, the error probability
goes to 1 exponentially fast in n!
Remark 9.33. Note that we have stated the converse using the average error
(n)
(n)
probability Pe . However, Theorem 9.31 also holds if we replace Pe by the
maximum error probability (n) . Indeed, since the average error probability
certainly will tend to zero if the maximum error probability tends to zero,
it follows that any coding scheme with vanishing maximum error probability
also must satisfy the converse, i.e., also must have a rate R Cinf .
Next we turn to the achievability part.

9.8

The Channel Coding Theorem

We will now derive a lower bound to the achievable rates. To this aim we will
need to design a coding scheme (codebook, encoder, and decoder) and then
analyze its performance. The better this performance, the better the lower
bound.
Our main ideas are as follows:
The aim of our design is not to create a coding scheme (i.e., codebook,
encoder and decoder) that can be used in practice, but that is easy to
analyze.
Since we have no clue how to design the codebook, we choose one at
random from the huge set of all possible codebooks. Then we will analyze
the performance of such a random codebook on average, and if it is good,
then we know that there must exist at least some codebooks that are
good!
So, think of a storehouse that contains all possible codebooks. We go
there and randomly pick one and analyze its performance, i.e., compute
its average error probability. Then we pick another codebook and analyze its performance, etc. Finally, we average the performance over
all such codebooks, i.e., we compute the average of the average error
probability.


This trick was Shannons idea! It is one of the most crucial ideas in
information theory and is called random coding.

The encoder will be a simple mapping based on a lookup table. As mentioned above, such a design is highly inecient in practice, but easy to
analyze. Note that by itself the encoder is not random, but a deterministic mapping that maps any given message into a codeword. However,
since both the message and the codewords are random, the encoder output looks random, too.
The decoder should be an ML decoder as discussed in Section 9.3. Unfortunately, an ML decoder turns out to be quite dicult to analyze.
Therefore we will use a suboptimal decoder. If this decoder will perform
well, then an optimal ML decoder will perform even better!
Note that the decoder by itself is not random, but a deterministic mapping that maps a received sequence into a decision. But again, since
message, codewords and channel output all are random, the output of
the decoder looks random.
So, lets turn to the details. We go through the following steps:
1: Codebook Design: We randomly generate a codebook. To that goal we fix a rate R, a codeword length n, and a distribution PX(·). Then we design a (2^{nR}, n) coding scheme⁶ by independently generating 2^{nR} length-n codewords according to the distribution
$$P_{\mathbf{X}}(\mathbf{x}) = \prod_{k=1}^{n} P_X(x_k). \qquad (9.104)$$
Note that this codebook can be described by a matrix of 2^{nR} rows of length n:
$$\mathcal{C} = \begin{pmatrix} X_1(1) & X_2(1) & \cdots & X_n(1)\\ X_1(2) & X_2(2) & \cdots & X_n(2)\\ \vdots & \vdots & \ddots & \vdots\\ X_1(2^{nR}) & X_2(2^{nR}) & \cdots & X_n(2^{nR}) \end{pmatrix}, \qquad (9.105)$$
where each row describes a codeword. Each entry of this matrix is generated IID ∼ PX(·), i.e.,
$$P(\mathcal{C}) \triangleq \Pr\!\left[\mathcal{C} = \begin{pmatrix} x_1(1) & \cdots & x_n(1)\\ \vdots & & \vdots\\ x_1(2^{nR}) & \cdots & x_n(2^{nR}) \end{pmatrix}\right] = \prod_{m=1}^{2^{nR}} \prod_{k=1}^{n} P_X\bigl(x_k(m)\bigr). \qquad (9.106)$$

⁶ Actually, to be precise, we should write ⌈2^{nR}⌉; however, as is custom in the literature, from now on we will always omit the ceiling operator.

Once the codebook has been generated, we reveal it to both transmitter


and receiver. And of course they also know the channel law PY |X (|) of
the DMC.
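The random codebook construction (9.104)–(9.106) is straightforward to mimic in simulation. The following sketch (not part of the notes; names and the fixed random seed are illustrative choices) generates such a codebook matrix.

```python
# Sketch: random codebook with every entry drawn IID from an input PMF p_x.
import numpy as np

def random_codebook(rate, n, p_x, rng=np.random.default_rng(0)):
    """Return a (2^{nR} x n) matrix whose rows are the codewords."""
    num_codewords = int(np.ceil(2 ** (n * rate)))
    symbols = np.arange(len(p_x))
    return rng.choice(symbols, size=(num_codewords, n), p=p_x)

C = random_codebook(rate=0.5, n=8, p_x=np.array([0.5, 0.5]))
print(C.shape)      # (16, 8): 2^{nR} codewords of blocklength n
```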
2: Encoder Design: As always we assume that the source randomly picks a message M ∈ {1, ..., 2^{nR}} according to a uniform distribution
$$\Pr[M = m] = \frac{1}{2^{nR}}. \qquad (9.107)$$
The encoder is now very simple: Given the message m, it takes the mth codeword X(m) of C (the mth row of (9.105)) and sends it over the channel.
3: Decoder Design: The decoder receives the sequence Y = y, which has, conditional on the transmitted codeword X(m), the distribution
$$P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{X}(m)) = \prod_{k=1}^{n} P_{Y|X}(y_k|X_k(m)). \qquad (9.108)$$
The decoder needs to guess which message has been sent. The best thing
to do is to minimize the probability of decoding error or, equivalently,
maximizing the probability of a correct decoding as we have described in
Section 9.3, i.e., the best thing would be to design the decoder as an ML
decoder.
Unfortunately, an ML decoder is quite dicult to analyze.7 We therefore
will use another, suboptimal decoder, that is not very practical, but
allows a very elegant and simple analysis. Moreover, even though it is
suboptimal, its performance is good enough for our proof purposes.
Before we can explain the design of our decoder, we need to introduce a function of two vectors:
$$i(\mathbf{x};\mathbf{y}) \triangleq \log\frac{P_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y})}{P_{\mathbf{X}}(\mathbf{x})P_{\mathbf{Y}}(\mathbf{y})}, \qquad (\mathbf{x},\mathbf{y})\in\mathcal{X}^n\times\mathcal{Y}^n. \qquad (9.109)$$
This function is called instantaneous mutual information. Its name stems from the fact that the mutual information I(X; Y) between the random vectors X and Y is given as the average of the instantaneous mutual information over all possible realizations of these vectors:
$$I(\mathbf{X};\mathbf{Y}) = \mathrm{E}_{\mathbf{X},\mathbf{Y}}[i(\mathbf{X};\mathbf{Y})] = \mathrm{E}\!\left[\log\frac{P_{\mathbf{X},\mathbf{Y}}(\mathbf{X},\mathbf{Y})}{P_{\mathbf{X}}(\mathbf{X})P_{\mathbf{Y}}(\mathbf{Y})}\right]. \qquad (9.110)$$

Note that we will use the instantaneous mutual information in the context
where PX,Y denotes the joint PMF of our random codewords and our
DMC without feedback:
n

PX,Y (x, y) = PX (x) PY|X (y|x) =

k=1

PX (xk )PY |X (yk |xk ),

(9.111)

It is not impossible to do it, though. Shannon in his 1948 paper [Sha48] does analyze
the ML decoder.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

220

Data Transmission over a Noisy Digital Channel


where in the second equality we have made use of the memorylessness of
the DMC without feedback (see Theorem 9.11) and of the IID codebook
generation PX (x) = n PX (xk ). For the same reason we also have
k=1
n

PY (yk ),

PY (y) =

(9.112)

k=1

where PY () is the marginal distribution of PX,Y (, ) = PX () PY |X (|).


Hence, the instantaneous mutual information (9.109) can be rewritten and simplified as follows:
$$\begin{aligned}
i(\mathbf{x};\mathbf{y}) &= \log\frac{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})}{P_{\mathbf{Y}}(\mathbf{y})} &(9.113)\\
&= \log\frac{\prod_{k=1}^{n}P_{Y|X}(y_k|x_k)}{\prod_{k=1}^{n}P_Y(y_k)} &(9.114)\\
&= \sum_{k=1}^{n}\log\frac{P_{Y|X}(y_k|x_k)}{P_Y(y_k)} &(9.115)\\
&\triangleq \sum_{k=1}^{n} i(x_k;y_k), &(9.116)
\end{aligned}$$
where (9.116) should be read as the definition of i(xk; yk).


Now we are ready to describe the design of our suboptimal decoder. Our decoder is a threshold decoder and works as follows. Given a received Y = y, the decoder computes the instantaneous mutual information between every codeword X(m) and the received sequence y, i.e., i(X(m); y) for all m = 1, ..., 2^{nR}. Then it tries to find an m such that at the same time
$$i\bigl(\mathbf{X}(m);\mathbf{y}\bigr) > \log\beta \qquad (9.117)$$
and
$$i\bigl(\mathbf{X}(m');\mathbf{y}\bigr) \le \log\beta, \qquad \forall\, m' \ne m, \qquad (9.118)$$
for some specified threshold β (we will describe our choice of β below). If it can find such an m, it will put out this m. If not, it will put out m̂ = 0, i.e., it declares an error.
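The sketch below (not part of the notes; it assumes a channel matrix W, an input PMF p_x, and a threshold passed as log2(β)) implements this threshold rule literally, using the per-symbol decomposition (9.116).

```python
# Sketch: threshold decoder based on the instantaneous mutual information.
import numpy as np

def inst_mutual_info(x, y, W, p_x):
    """i(x; y) = sum_k log2( P_{Y|X}(y_k|x_k) / P_Y(y_k) ) in bits."""
    p_y = p_x @ W                                    # output marginal induced by p_x
    return float(np.sum(np.log2(W[x, y] / p_y[y])))

def threshold_decode(y, codebook, W, p_x, log_beta):
    hits = [m for m, x in enumerate(codebook, start=1)
            if inst_mutual_info(np.array(x), np.array(y), W, p_x) > log_beta]
    return hits[0] if len(hits) == 1 else 0          # 0 = declare an error
```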

4: Performance Analysis: There is a decoding error if M = M . (Note

that if M = 0, then for sure this is the case.) To analyze the error
probability we now use the trick not to try to analyze one particular
codebook, but instead the average over all possible codebooks, i.e., we

also average over the random codebook generation:
$$\begin{aligned}
\Pr(\text{error}) &= \Pr(\text{error, averaged over all codewords and codebooks}) &(9.119)\\
&= \sum_{\mathcal{C}} \underbrace{P(\mathcal{C})}_{\substack{\text{probability of}\\ \text{choosing codebook }\mathcal{C}}} \underbrace{P_e^{(n)}(\mathcal{C})}_{\substack{\text{average error prob.}\\ \text{for given codebook }\mathcal{C}}} &(9.120)\\
&= \sum_{\mathcal{C}} P(\mathcal{C})\,\frac{1}{2^{nR}}\sum_{m=1}^{2^{nR}} \lambda_m(\mathcal{C}) &(9.121)\\
&= \frac{1}{2^{nR}}\sum_{m=1}^{2^{nR}}\underbrace{\sum_{\mathcal{C}} P(\mathcal{C})\lambda_m(\mathcal{C})}_{\text{independent of }m} &(9.122)\\
&= \frac{1}{2^{nR}}\sum_{m=1}^{2^{nR}}\sum_{\mathcal{C}} P(\mathcal{C})\lambda_1(\mathcal{C}) &(9.123)\\
&= \sum_{\mathcal{C}} P(\mathcal{C})\lambda_1(\mathcal{C}) &(9.124)\\
&= \Pr(\text{error, averaged over all codebooks}\mid M=1). &(9.125)
\end{aligned}$$
Here, in (9.123) we use the fact that all codewords are IID randomly generated. Hence, since we average over all codebooks, the average error probability does not depend on which message has been sent. So without loss of generality we can assume that M = 1.
Let
$$\mathcal{F}_m \triangleq \bigl\{\, i\bigl(\mathbf{X}(m);\mathbf{Y}\bigr) > \log_2\beta \,\bigr\}, \qquad m = 1,\ldots,2^{nR}, \qquad (9.126)$$
be the event that the instantaneous mutual information of the mth codeword and the received sequence Y is above the threshold. Note that for simplicity we choose base 2 for the logarithms.

Hence an error occurs if F1^c occurs (i.e., if the instantaneous mutual information of the transmitted codeword and the received sequence is too small) or if F2 ∪ F3 ∪ ... ∪ F_{2^{nR}} occurs (i.e., if one or more wrong codewords have an instantaneous mutual information with the received sequence Y that is too big). Hence, we can write (9.125) as follows:

$$\begin{aligned}
\Pr(\text{error}) &= \Pr(\text{error}\mid M=1) &(9.127)\\
&= \Pr\bigl(\mathcal{F}_1^{\mathrm{c}} \cup \mathcal{F}_2 \cup \mathcal{F}_3 \cup \cdots \cup \mathcal{F}_{2^{nR}} \,\big|\, M=1\bigr) &(9.128)\\
&\le \Pr\bigl(\mathcal{F}_1^{\mathrm{c}}\,\big|\,M=1\bigr) + \sum_{m=2}^{2^{nR}} \Pr(\mathcal{F}_m\mid M=1), &(9.129)
\end{aligned}$$
where the inequality follows from the union bound: for any two events A and B,
$$\Pr(A\cup B) = \Pr(A) + \Pr(B) - \Pr(A\cap B) \le \Pr(A) + \Pr(B). \qquad (9.130)$$

[Figure 9.8: The received vector Y is independent of the codewords X(m), m ≥ 2, that are not transmitted: X(1) is generated and sent through the DMC to produce Y, while the other codewords X(m) are generated using the same PX(·), but independently.]
Note that Y is completely independent of X(m) for all m ≥ 2 (see Figure 9.8 for an illustration of this fact). Hence, for m ≥ 2,
$$\begin{aligned}
\Pr(\mathcal{F}_m\mid M=1) &= \Pr\Bigl[\, i\bigl(\mathbf{X}(m);\mathbf{Y}\bigr) > \log_2\beta \,\Big|\, M=1\Bigr] &(9.131)\\
&= \Pr\!\left[\log_2\frac{P_{\mathbf{X},\mathbf{Y}}(\mathbf{X}(m),\mathbf{Y})}{P_{\mathbf{X}}(\mathbf{X}(m))P_{\mathbf{Y}}(\mathbf{Y})} > \log_2\beta \,\middle|\, M=1\right] &(9.132)\\
&= \Pr\!\left[\frac{P_{\mathbf{X},\mathbf{Y}}(\mathbf{X}(m),\mathbf{Y})}{P_{\mathbf{X}}(\mathbf{X}(m))P_{\mathbf{Y}}(\mathbf{Y})} > \beta \,\middle|\, M=1\right] &(9.133)\\
&= \sum_{\substack{(\mathbf{x},\mathbf{y})\ \text{such that}\\ P_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y})>\beta P_{\mathbf{X}}(\mathbf{x})P_{\mathbf{Y}}(\mathbf{y})}} P_{\mathbf{X}}(\mathbf{x})P_{\mathbf{Y}}(\mathbf{y}) &(9.134)\\
&= \sum_{\substack{(\mathbf{x},\mathbf{y})\ \text{such that}\\ P_{\mathbf{X}}(\mathbf{x})P_{\mathbf{Y}}(\mathbf{y})<\frac{1}{\beta}P_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y})}} P_{\mathbf{X}}(\mathbf{x})P_{\mathbf{Y}}(\mathbf{y}) &(9.135)\\
&< \sum_{\substack{(\mathbf{x},\mathbf{y})\ \text{such that}\\ P_{\mathbf{X}}(\mathbf{x})P_{\mathbf{Y}}(\mathbf{y})<\frac{1}{\beta}P_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y})}} \frac{1}{\beta}P_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y}) \le \frac{1}{\beta}\sum_{(\mathbf{x},\mathbf{y})} P_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y}) &(9.136)\\
&= \frac{1}{\beta}. &(9.137)
\end{aligned}$$
Here, the most important step is (9.134): There we make use of the fact that X(m) ⫫ Y, i.e., their joint probability distribution is a product distribution. In the last inequality we upper-bound the sum by including all possible pairs (x, y) without any restriction, so that the summation adds up to 1.
In order to bound the term Pr(F1^c | M = 1), we make a clever choice for our threshold β. We fix some ε > 0 and choose
$$\beta \triangleq 2^{n(I(X;Y)-\varepsilon)}, \qquad (9.138)$$

where I(X; Y) is the mutual information between a channel input symbol X and its corresponding output symbol Y when the input has the distribution PX(·) (the same as used for the codebook generation!). Then
$$\begin{aligned}
\Pr(\mathcal{F}_1^{\mathrm{c}}\mid M=1) &= \Pr\Bigl[\, i\bigl(\mathbf{X}(1);\mathbf{Y}\bigr) \le \log_2\beta \,\Big|\, M=1\Bigr] &(9.139)\\
&= \Pr\Bigl[\, i\bigl(\mathbf{X}(1);\mathbf{Y}\bigr) \le n\bigl(I(X;Y)-\varepsilon\bigr) \,\Big|\, M=1\Bigr] &(9.140)\\
&= \Pr\!\left[\sum_{k=1}^{n} i\bigl(X_k(1);Y_k\bigr) \le n\bigl(I(X;Y)-\varepsilon\bigr) \,\middle|\, M=1\right] &(9.141)\\
&= \Pr\!\left[\frac{1}{n}\sum_{k=1}^{n} i\bigl(X_k(1);Y_k\bigr) - I(X;Y) \le -\varepsilon \,\middle|\, M=1\right] &(9.142)\\
&\le \Pr\!\left[\left|\frac{1}{n}\sum_{k=1}^{n} i\bigl(X_k(1);Y_k\bigr) - I(X;Y)\right| \ge \varepsilon \,\middle|\, M=1\right], &(9.143)
\end{aligned}$$
where in (9.141) we used (9.116).


Now recall the weak law of large numbers, which says that for a sequence Z1, ..., Zn of IID random variables of mean μ and finite variance and for any ε > 0,
$$\lim_{n\to\infty}\Pr\!\left[\left|\frac{Z_1+\cdots+Z_n}{n}-\mu\right|\ge\varepsilon\right] = 0. \qquad (9.144)$$
Applied to our case in (9.143), we have Zk ≜ i(Xk(1); Yk) and μ = E[Zk] = E[i(Xk(1); Yk)] = I(X; Y). Hence, it follows from (9.143) that
$$\Pr(\mathcal{F}_1^{\mathrm{c}}\mid M=1) \to 0 \quad \text{for } n\to\infty, \qquad (9.145)$$
i.e., for n large enough, we have Pr(F1^c | M = 1) ≤ ε.

Combining these results with (9.129) and choosing n large enough, we now get
$$\begin{aligned}
\Pr(\text{error}) &\le \underbrace{\Pr(\mathcal{F}_1^{\mathrm{c}}\mid M=1)}_{<\varepsilon} + \sum_{m=2}^{2^{nR}} \underbrace{\Pr(\mathcal{F}_m\mid M=1)}_{<\frac{1}{\beta}} &(9.146)\\
&< \varepsilon + \sum_{m=2}^{2^{nR}} 2^{-n(I(X;Y)-\varepsilon)} &(9.147)\\
&= \varepsilon + \bigl(2^{nR}-1\bigr)\,2^{-n(I(X;Y)-\varepsilon)} &(9.148)\\
&< \varepsilon + 2^{nR}\,2^{-n(I(X;Y)-\varepsilon)} &(9.149)\\
&= \varepsilon + 2^{-n(I(X;Y)-\varepsilon-R)} &(9.150)\\
&\le 2\varepsilon, &(9.151)
\end{aligned}$$


where in the last step we again need n to be sufficiently large and I(X; Y) − ε − R > 0 so that the exponent is negative. Hence we see that as long as
$$R < I(X;Y) - \varepsilon, \qquad (9.152)$$
for any ε > 0 we can choose n large enough such that the average error probability, averaged over all codewords and all codebooks, is less than 2ε:
$$\Pr(\text{error}) \le 2\varepsilon. \qquad (9.153)$$

5: Strengthening: Now we strengthen the proof in several ways.

• Firstly, note that ε is a parameter that we can choose freely. Hence, we can make it as small as we wish, so that condition (9.152) can be replaced by
$$R < I(X;Y). \qquad (9.154)$$

• Secondly, note that we have not yet clearly specified how we want to choose PX(·) in the codebook design process. So let's choose it to be the information-capacity-achieving input distribution PX*(·), i.e., the distribution that achieves the maximum in
$$C_{\mathrm{inf}} \triangleq \max_{P_X(\cdot)} I(X;Y). \qquad (9.155)$$
In this case we can replace the condition (9.154) by the weaker condition
$$R < C_{\mathrm{inf}}. \qquad (9.156)$$

• Thirdly, let's get rid of the average over the codebooks. Since our average error probability is small (≤ 2ε), there must exist at least one codebook C* which has an equally small error probability. Hence we have shown that for this codebook
$$P_e^{(n)}(\mathcal{C}^*) \le 2\varepsilon. \qquad (9.157)$$
Note that in our calculation of the average error probability we have included all possible codebooks, i.e., really stupid ones with, e.g., all codewords being the same (which means that the error probability for this codebook will be 1!) were not excluded.
• Finally, let's strengthen the proof to use the maximum error probability λ^(n) instead of the average error probability Pe^(n). We only consider the good codebook C* for which we know that
$$P_e^{(n)}(\mathcal{C}^*) = \frac{1}{2^{nR}}\sum_{m=1}^{2^{nR}}\lambda_m(\mathcal{C}^*) \le 2\varepsilon. \qquad (9.158)$$

Now we order the codewords according to their error probability. Let λ* be the error probability of the worst codeword of the better half. Then
$$\begin{aligned}
2\varepsilon &\ge \frac{1}{2^{nR}}\sum_{\substack{\text{all}\\ \text{codewords}}}\lambda_m &(9.159)\\
&\ge \frac{1}{2^{nR}}\Biggl(\sum_{\substack{\text{worse}\\ \text{half}}}\lambda_m + \sum_{\substack{\text{better}\\ \text{half}}}0\Biggr) &(9.160)\\
&\ge \frac{1}{2^{nR}}\sum_{\substack{\text{worse}\\ \text{half}}}\lambda^* &(9.161)\\
&\ge \frac{1}{2^{nR}}\cdot\frac{2^{nR}}{2}\,\lambda^* &(9.162)\\
&= \frac{\lambda^*}{2}. &(9.163)
\end{aligned}$$

Hence, λ* ≤ 4ε. So if we design a new codebook with only half the number of codewords by taking the good codebook C* and throwing away the worse half of the codewords, we get a new codebook C̃ with

• a maximum error probability that is very small:
$$\lambda^{(n)} \le 4\varepsilon; \qquad (9.164)$$

• a new rate
$$\tilde{R} = \frac{\log_2\frac{2^{nR}}{2}}{n} = \frac{\log_2 2^{nR-1}}{n} = \frac{nR-1}{n} = R - \frac{1}{n}. \qquad (9.165)$$

But for n sufficiently large, 1/n can be neglected!

Hence, we have proven the following very exciting result.


Theorem 9.34 (The Channel Coding Theorem for a DMC). Let a DMC (X, Y, PY|X) be given. Define the capacity of this DMC as
$$C \triangleq \max_{P_X(\cdot)} I(X;Y), \qquad (9.166)$$
where X and Y are the input and output RVs of this DMC. Then all rates below C are achievable, i.e., for every R < C there exists a sequence of (2^{nR}, n) coding schemes with maximum error probability λ^(n) → 0 as the blocklength n gets very large.

Conversely, any sequence of (2^{nR}, n) coding schemes with average error probability Pe^(n) → 0 must have a rate R ≤ C.

We would like to comment on this, in this class probably most important


result.
We have proven that Cinf = Cop . Therefore, henceforth we will not make
the distinction between information and operational capacity anymore,
but simply call it channel capacity C.
Note that the inequality R < C is strict for the achievable rates! If
R = C, the situation is unclear: Depending on the channel, C might be
achievable or not.
The result is counterintuitive: As long as R < C, we can increase the
amount of codewords without having a higher error probability! Or
equivalently: As long as R < C, we can make the error probability
smaller without reducing the rate!
To summarize: As long as we transmit not too much information per
channel use, we can do it as reliable as you wish (give me any , e.g.,
= 10100 is possible!), but if you are above C, you are not able to
transmit reliably at all!
This result was like a bombshell in 1948! Before 1948 the engineers believed that you need to increase your power or SNR (i.e., make the channel better!) in order to get better error probability. However, Shannon
proved that you can do it without changing the channel (in particular,
without increasing power).
Where do we pay? (There is no free lunch!) Note that the result only
holds for n . So for a very small choice of , the necessary n might
be very large. This will cause
a very complex system (how shall we decode codewords of length
1 000 000?);
very long delays (we need to wait until we have received the complete received sequence before we can decode!).
The proof only shows that there exists a good code, but unfortunately
does not give any clue on how to nd it!
In practice we have the additional problem that the encoding and decoding functions must be ecient. For example, a simple lookup table
at the decoder for all possible received sequences will not work.
Example 9.35. Assume a binary channel and a coding scheme with rate R = 1/4 and blocklength n = 1024. Then we have M = 2^{nR} = 2^{256} possible messages, i.e., we need 256 bits to store the decision for each of the 2^n = 2^{1024} different possible received sequences. Hence the decoder needs a memory of size
$$2^{1024}\cdot 256\ \text{bits} = 2^{1032}\ \text{bits} = 10^{1032\log_{10}2}\ \text{bits} \approx 10^{311}\ \text{bits} = \frac{10^{311}}{8\cdot 1024^3}\ \text{GiB} \approx 10^{301}\ \text{GiB}. \qquad (9.167)$$
Note that even if we took all existing hard-drives worldwide together, we would not reach 10^{30} GiB for many years to come...

In practice it took about 50 years before the first practical codes were found that perform close to the limit predicted by Shannon. See the discussion at the end of Chapter 15.

Chapter 10

Computing Capacity

10.1 Introduction

We have now learned the definition and meaning of the channel capacity for a DMC. Next we need to tackle the problem of how we can actually compute its value. It turns out that in many cases this is very difficult. But there is still hope, because we have a couple of tricks available that sometimes help.

We start by recalling the definition of capacity: For a given DMC with conditional probability distribution PY|X the capacity is given as
$$C = \max_{P_X(\cdot)} I(X;Y). \qquad (10.1)$$

So how can we compute it? This usually is difficult because
$$C = \max_{P_X(\cdot)} \bigl\{\underbrace{H(X)}_{\substack{\text{nice! depends}\\ \text{on }P_X\text{ only}}} - \underbrace{H(X|Y)}_{\substack{\text{very annoying! looking}\\ \text{backwards through DMC!}}}\bigr\} \qquad (10.2)$$
$$\phantom{C} = \max_{P_X(\cdot)} \bigl\{\underbrace{H(Y)}_{\substack{\text{annoying! need to compute}\\ P_Y\text{ from }P_X\text{ and }P_{Y|X}}} - \underbrace{H(Y|X)}_{\substack{\text{nice! given}\\ \text{by DMC}}}\bigr\}. \qquad (10.3)$$
We will show two different approaches on how to compute capacity. The first is based on some special symmetry assumptions about the DMC under consideration (strongly and weakly symmetric DMCs). The second approach relies on the Karush–Kuhn–Tucker conditions and convexity.

10.2 Strongly Symmetric DMCs

Definition 10.1. A DMC is called uniformly dispersive if there exist numbers p1, p2, ..., p|Y| such that
[PY|X(y1|x), PY|X(y2|x), ..., PY|X(y|Y||x)]
is a permutation of [p1, p2, ..., p|Y|] for all x ∈ X.

Example 10.2. A BEC as shown in Figure 10.1 is uniformly dispersive. The corresponding numbers are [p1, p2, p3] = [1 − δ, δ, 0].

[Figure 10.1: The BEC (|X| = 2, |Y| = 3 with erasure symbol "?") is uniformly dispersive.]

Lemma 10.3. A uniformly dispersive DMC satisfies
$$H(Y|X) = -\sum_{j=1}^{|\mathcal{Y}|} p_j\log p_j, \qquad (10.4)$$
independently of PX(·).

Proof: Using the definition of entropy, we have
$$H(Y|X=x) = -\sum_{y} P_{Y|X}(y|x)\log P_{Y|X}(y|x) = -\sum_{j=1}^{|\mathcal{Y}|} p_j\log p_j, \qquad (10.5)$$
where the second equality follows from the assumption that the DMC is uniformly dispersive. Note that this expression does not depend on x. Hence,
$$H(Y|X) = \mathrm{E}_X[H(Y|X=x)] = -\sum_{j=1}^{|\mathcal{Y}|} p_j\log p_j. \qquad (10.6)$$

Definition 10.4. A DMC is called uniformly focusing if there exist numbers r1, r2, ..., r|X| such that
[PY|X(y|x1), ..., PY|X(y|x|X|)]
is a permutation of [r1, r2, ..., r|X|] for all y ∈ Y.

Example 10.5. The uniform inverse BEC as shown in Figure 10.2 is uniformly focusing. The corresponding numbers are [r1, r2, r3] = [1, 1/2, 0].
Lemma 10.6. A uniformly focusing DMC has the following properties:

[Figure 10.2: The uniform inverse BEC (|X| = 3, |Y| = 2; the middle input reaches both outputs with probability 1/2) is uniformly focusing.]

1. If PX(·) is uniform, then also PY(·) is uniform.

2. The maximum output entropy is
$$\max_{P_X(\cdot)} H(Y) = \log|\mathcal{Y}| \qquad (10.7)$$
and is achieved by a uniform PX(·) (but possibly also by other choices of PX(·)).
Proof: We start with the derivation of PY(·) if PX(·) is uniform:
$$\begin{aligned}
P_Y(y) &= \sum_{x} P_X(x)P_{Y|X}(y|x) &(10.8)\\
&= \sum_{x} \frac{1}{|\mathcal{X}|}P_{Y|X}(y|x) &(10.9)\\
&= \frac{1}{|\mathcal{X}|}\sum_{x} P_{Y|X}(y|x) &(10.10)\\
&= \text{constant} = \frac{1}{|\mathcal{Y}|}. &(10.11)
\end{aligned}$$
Note that the r_i do not in general sum to one, but to some general nonnegative number. However, if PY(y) is constant for all values of y, then this constant must be 1/|Y| because PY(·) must sum to 1.

To prove the second statement, we note that by the properties of entropy we have
$$H(Y) \le \log|\mathcal{Y}|, \qquad (10.12)$$
with equality if, and only if, Y is uniform. However, since we have just shown that if X is uniform, Y is uniform, too, we see that max_{PX(·)} H(Y) = log |Y|.

Definition 10.7. A DMC is called strongly symmetric¹ if it is both uniformly dispersive and uniformly focusing.

Example 10.8. The BSC as shown in Figure 10.3 is strongly symmetric. The corresponding numbers are [p1, p2] = [1 − ε, ε] and [r1, r2] = [1 − ε, ε].

[Figure 10.3: The BSC (|X| = 2, |Y| = 2, crossover probability ε) as an example of a strongly symmetric DMC.]

Theorem 10.9 (Capacity of a Strongly Symmetric DMC). The capacity of a strongly symmetric DMC with input alphabet X, output alphabet Y, and transition probabilities p1, ..., p|Y| is given as
$$C = \log|\mathcal{Y}| + \sum_{j=1}^{|\mathcal{Y}|} p_j\log p_j \qquad (10.13)$$
and is achieved by a uniform input distribution
$$P_X(x) = \frac{1}{|\mathcal{X}|}, \qquad x\in\mathcal{X} \qquad (10.14)$$
(but possibly also by other distributions).

Proof: By definition of capacity we have
$$\begin{aligned}
C &= \max_{P_X(\cdot)}\{H(Y) - H(Y|X)\} &(10.15)\\
&= \max_{P_X(\cdot)}\Biggl\{H(Y) + \sum_{j=1}^{|\mathcal{Y}|} p_j\log p_j\Biggr\} &(10.16)\\
&= \log|\mathcal{Y}| + \sum_{j=1}^{|\mathcal{Y}|} p_j\log p_j, &(10.17)
\end{aligned}$$
where the second equality (10.16) follows from Lemma 10.3 and the third equality (10.17) from Lemma 10.6.

¹ In some books strongly symmetric DMCs are simply called symmetric DMCs. See, e.g., [CT06].
Example 10.10. Take again the BSC of Figure 10.3, with |Y| = 2, p1 = 1 − ε, p2 = ε. Then we have
$$C = \log 2 + (1-\varepsilon)\log(1-\varepsilon) + \varepsilon\log\varepsilon = 1 - H_b(\varepsilon)\ \text{bits}. \qquad (10.18)$$
Note that

• if ε = 0, then C = 1 bit;
• if ε = 1, then C = 1 bit, too.

In the latter case the DMC deterministically swaps zeros and ones, which can be repaired without problem (i.e., without error).
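Formula (10.13) is immediate to evaluate once the dispersion [p1, ..., p|Y|] is known; any row of the channel matrix provides it. A minimal sketch (not part of the notes):

```python
# Sketch: capacity of a strongly symmetric DMC via (10.13), checked on the BSC.
import numpy as np

def strongly_symmetric_capacity(W):
    p = W[0]                                   # any row: the dispersion [p_1, ..., p_|Y|]
    return np.log2(W.shape[1]) + np.sum(p[p > 0] * np.log2(p[p > 0]))

eps = 0.1
W_bsc = np.array([[1 - eps, eps], [eps, 1 - eps]])
print(strongly_symmetric_capacity(W_bsc))      # ~ 0.531 bits = 1 - H_b(0.1)
```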

10.3 Weakly Symmetric DMCs

The assumption of strong symmetry is pretty stringent: Most DMCs are not strongly symmetric. Luckily, there is a slightly weaker assumption for which we can compute the capacity, too.

Definition 10.11. A DMC (X, Y, PY|X) is called weakly symmetric if it can be split up into several strongly symmetric DMCs (called subchannels) in the following way:

• The input alphabets Xs of all subchannels are the same and equal to X:
$$\mathcal{X}_s = \mathcal{X}, \qquad \forall s. \qquad (10.19)$$

• The output alphabets Ys of the different subchannels are disjoint and add up to Y:
$$\mathcal{Y}_s \cap \mathcal{Y}_{s'} = \emptyset, \qquad s \ne s', \qquad (10.20)$$
$$\bigcup_{s}\mathcal{Y}_s = \mathcal{Y}. \qquad (10.21)$$

• We can find a stochastic process that randomly selects one of the subchannels in such a way that the combination of selection process and subchannel yields the same conditional probability distribution as the channel law PY|X of the DMC.
Example 10.12. Take again the BEC of Figure 10.1. We split it up into two strongly symmetric DMCs as shown in Figure 10.4. The selection process chooses the first subchannel with probability q1 = 1 − δ and the second with probability q2 = δ.

[Figure 10.4: The BEC is split into two strongly symmetric DMCs: DMC 1 (selected with probability q1 = 1 − δ) maps 0 → 0 and 1 → 1 noiselessly, while DMC 2 (selected with probability q2 = δ) maps both inputs to the erasure symbol "?".]

It is fairly easy to test a DMC for weak symmetry. One merely needs to examine each subset of output letters with the same focusing and then check whether each input letter has the same dispersion into this subset of output letters. Note that it is not hard to see that if a channel is weakly symmetric, it must be uniformly dispersive. We summarize this idea in the following algorithm; a programmatic version is sketched after Step 5.

Algorithm for checking whether a DMC is weakly symmetric:

Step 1: Check whether the channel is uniformly dispersive. If no ⟹ abort!

Step 2: Partition the output letters into disjoint sets Ys in such a way that all output letters with the same focusing are combined into one subchannel. Number these subchannels s = 1, ..., S.

Step 3: Compute the selection probabilities of the S subchannels as follows:
$$q_s \triangleq \sum_{y\in\mathcal{Y}_s} P_{Y|X}(y|x_1), \qquad s = 1,\ldots,S. \qquad (10.22)$$
Then normalize the conditional subchannel probabilities by dividing the original conditional probability by qs:
$$P_{Y|X,S=s}(y|x) = \frac{P_{Y|X}(y|x)}{q_s}, \qquad x\in\mathcal{X},\ y\in\mathcal{Y}_s, \qquad (10.23)$$
for s = 1, ..., S.

Step 4: For each subchannel s check whether all inputs have the same dispersion. If no ⟹ abort!

Step 5: The DMC is weakly symmetric.
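The following sketch (not part of the notes; a direct translation of Steps 1–5 above with floating-point tolerances) returns the subchannel output sets Ys and selection probabilities qs, or None if the DMC is not weakly symmetric.

```python
# Sketch: check weak symmetry of a channel matrix W[x, y] following Steps 1-5.
import numpy as np

def weakly_symmetric_decomposition(W, tol=1e-9):
    # Step 1: uniformly dispersive -- every row is a permutation of the same multiset.
    rows = np.sort(W, axis=1)
    if not np.allclose(rows, rows[0], atol=tol):
        return None
    # Step 2: group output letters by their focusing (the sorted column entries).
    groups = {}
    for y in range(W.shape[1]):
        key = tuple(np.round(np.sort(W[:, y]), 9))
        groups.setdefault(key, []).append(y)
    subchannels = []
    for ys in groups.values():
        q = float(W[0, ys].sum())            # Step 3: selection probability q_s
        V = W[:, ys] / q                     #         normalized subchannel law
        sub_rows = np.sort(V, axis=1)        # Step 4: subchannel uniformly dispersive?
        if not np.allclose(sub_rows, sub_rows[0], atol=tol):
            return None
        subchannels.append((ys, q))
    return subchannels                       # Step 5: the DMC is weakly symmetric.

delta = 0.25
W_bec = np.array([[1 - delta, delta, 0.0], [0.0, delta, 1 - delta]])
print(weakly_symmetric_decomposition(W_bec))   # [([0, 2], 0.75), ([1], 0.25)]
```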

Since we know how to compute the capacity of every strongly symmetric subchannel, it is not surprising that the capacity of a weakly symmetric DMC can also be derived easily.

Theorem 10.13 (Capacity of a Weakly Symmetric DMC). The capacity of a weakly symmetric DMC is
$$C = \sum_{s=1}^{S} q_s C_s, \qquad (10.24)$$
where Cs denotes the capacity of the sth strongly symmetric subchannel, and qs is the selection probability of the sth subchannel.

Proof: We decompose the weakly symmetric DMC into S strongly symmetric DMCs with the selection probabilities q1, ..., qS. Define S to be the selection RV, i.e., if S = s, then the sth subchannel is selected. Since the output alphabets are disjoint, it follows that once we know Y, we also know S. Therefore, H(S|Y) = 0, and therefore also H(S|Y, X) = 0. Moreover, since X ⫫ S (the input is independent of the channel!), we also have H(S|X) = H(S). Hence, we get
$$\begin{aligned}
H(Y) &= H(Y) + H(S|Y) &(10.25)\\
&= H(Y,S) &(10.26)\\
&= H(S) + H(Y|S) &(10.27)\\
&= H(S) + \sum_{s=1}^{S} P_S(s)\,H(Y|S=s) &(10.28)\\
&= H(S) + \sum_{s=1}^{S} q_s\,H(Y|S=s) &(10.29)
\end{aligned}$$
and
$$\begin{aligned}
H(Y|X) &= H(Y|X) + H(S|X,Y) &(10.30)\\
&= H(Y,S|X) &(10.31)\\
&= H(S|X) + H(Y|S,X) &(10.32)\\
&= H(S) + \sum_{s=1}^{S} q_s\,H(Y|X,S=s). &(10.33)
\end{aligned}$$
Combined this yields
$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y|X) &(10.34)\\
&= \sum_{s=1}^{S} q_s\bigl(H(Y|S=s) - H(Y|X,S=s)\bigr) &(10.35)\\
&= \sum_{s=1}^{S} q_s\,I(X;Y|S=s), &(10.36)
\end{aligned}$$
where I(X; Y | S = s) is the mutual information of the sth subchannel. We know that
$$I(X;Y|S=s) \le C_s \qquad (10.37)$$
with equality if X is uniform. Since the uniform distribution on X simultaneously achieves capacity for all subchannels, we see that
$$C = \max_{P_X(\cdot)} I(X;Y) = \sum_{s=1}^{S} q_s C_s. \qquad (10.38)$$
Remark 10.14. Note that the proof of Theorem 10.13 only works because
the same input distribution achieves capacity for all subchannels. In general,
if a DMC is split into several not necessarily strongly symmetric subchannels
with the same input and disjoint output alphabets, s qs Cs is an upper bound
to capacity.
Remark 10.15. Note that in [CT06, Sec. 7.2], only strongly symmetric channels are considered, and there they are called symmetric.
Example 10.16. We return to the BEC of Figure 10.1. The capacities of the
two strongly symmetric subchannels are C1 = 1 bit and C2 = 0. Hence,
CBEC = (1 )C1 + C2 = 1 bits.

(10.39)

This is quite intuitive: Since the BEC erases a fraction of of all transmitted
bits, we cannot have hope to be able to transmit more than a fraction 1
bits on average.

Example 10.17. We would like to compute the capacity of the DMC shown
in Figure 10.5.
Note that the DMC is not strongly symmetric because the focusing of
the output letter d is dierent from the focusing of the other three output
letters. However, since the dispersion is the same for all three input letters,
we check whether we can split the channel into two strongly symmetric DMCs:
According to the focusing we create two subchannels with Y1 = {a, b, c} and
Y2 = {d}. We then compute the selection probabilities:
q1 =
yY1

q2 =
yY2

PY |X (y|0) = p1 + p2 ,

(10.40)

PY |X (y|0) = 1 p1 p2 .

(10.41)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

10.4. Mutual Information and Convexity

237
p1

a
p2
p2

1
1 p1 p2

p1
p2

p1
2

1 p1 p2

1 p1 p2
d

Figure 10.5: A weakly symmetric DMC.

This yields the two subchannels shown in Figure 10.6.


p1
p1 +p2

0
p2
p1 +p2

p2
p1 +p2

p1
p1 +p2
p2
p1 +p2

p1
p1 +p2

1
2

1
1

Figure 10.6: Two strongly symmetric subchannels.


We double-check and see that both subchannels are uniformly dispersive,
i.e., the DMC of Figure 10.5 indeed is weakly symmetric. Its capacity is
C = q1 C1 + q2 C2

(10.42)

p1
p1
p2
p2
= (p1 + p2 ) log 3 +
log
+
log
p1 + p2
p1 + p2 p1 + p2
p1 + p2
+ (1 p1 p2 ) 0
p1
= (p1 + p2 ) log 3 Hb
.
p1 + p2

(10.43)
(10.44)

10.4

Mutual Information and Convexity

In this section we will show that mutual information is concave in the input
distribution. This will be used in Section 10.5 when we apply the Karush

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

238

Computing Capacity

KuhnTucker conditions (Theorem 7.11) to the problem of computing capacity. We introduce a new notation here that we will use in combination with
our notation so far: We will denote the input distribution PX () of a DMC
by Q(), and the conditional channel distribution PY |X (|) will be denoted by
W (|). Moreover, whenever a DMC has nite alphabets, we will use a vectorand matrix-notation Q and W, respectively.
Theorem 10.18. Consider a DMC with input alphabet X , output alphabet Y,
and transition probability matrix W with components W (y|x), y Y, x X .
T
Let Q = Q(1), . . . , Q(|X |) be an arbitrary input probability vector. Then
I(X; Y ) = I(Q, W)

Q(x)W (y|x) log


xX
yY

W (y|x)

x X Q(x )W (y|x )

(10.45)

is a concave function in Q (for a xed W) and a convex function in W


(for a xed Q).
Proof: We start with the rst claim. We x W and let Q0 and Q1 be
two input distributions. We want to show that
I(Q0 , W) + (1 )I(Q1 , W) I Q0 + (1 )Q1 , W .

(10.46)

To that goal we dene Z to be a binary RV with PZ (0) = and PZ (1) = 1


and dene X such that, conditional on Z,
X

Q0
Q1

if Z = 0,
if Z = 1.

(10.47)

Note that the unconditional PMF of X then is


Q = PZ (0)Q0 + PZ (1)Q1

(10.48)

= Q0 + (1 )Q1 .

(10.49)

We can think of this like two channels in series: a rst channel (that is part of
our encoder) with input Z and output X, and a second channel (the DMC)
with input X and output Y (see Figure 10.7). Hence, we have a Markov chain
Z X Y , i.e.,

I(Z; Y |X) = 0.

(10.50)

I(X; Y ) = I(X; Y ) + I(Z; Y |X)

(10.51)

= I(X, Z; Y )

(10.52)

= I(Z; Y ) + I(X; Y |Z)

(10.53)

We get

I(X; Y |Z),

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(10.54)

10.4. Mutual Information and Convexity

239

where the inequality follows because mutual information cannot be negative.


Since
I(X; Y ) = I(Q, W) = I Q0 + (1 )Q1 , W

(10.55)

I(X; Y |Z) = PZ (0)I(X; Y |Z = 0) + PZ (1)I(X; Y |Z = 1)

(10.56)

and

= I(Q0 , W) + (1 )I(Q1 , W),

(10.57)

(10.54) proves the concavity of I(Q, W) in Q.


Z
0

Q0 (1)

Q0 (2)

1
2

Q0 (|X |)

2
W

Q1 (1)
Q1 (2)
1

|Y|
|X |

Q1 (|X |)

Figure 10.7: Mutual information is concave in the input. The encoder consists
of random switch between two random distributions. Since such an encoder looks like a DMC with a binary input, the system looks like having
two DMCs in series.
To prove the second statement, x Q and let W0 and W1 be two conditional
channel distributions. We want to show that
I(Q, W0 ) + (1 )I(Q, W1 ) I Q, W0 + (1 )W1 .

(10.58)

To that aim we dene a new channel W to be a random combination of W0


and W1 as shown in Figure 10.8. The RV Z is the random switch inside this
new channel, i.e., conditional on Z,
W

W0
W1

if Z = 0,
if Z = 1.

(10.59)

Note that the channel law of this new DMC W is


W = PZ (0)W0 + PZ (1)W1

(10.60)

= W0 + (1 )W1 .

(10.61)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

240

Computing Capacity
W
1

1
2

2
W0

X
1
2

|Y|

|X |
1

1
2

2
|X |

Y
1
2
|Y|

W1

|X |

|Y|

Figure 10.8: Mutual information is convex in the channel law. We have two
parallel channels with random switch.
Also note that X (input) and Z (part of the new channel) are independent in
this case. Therefore
I(X; Y |Z) = I(X; Y |Z) + I(X; Z)

(10.62)

=0

= I(X; Y, Z)

(10.63)

= I(X; Y ) + I(X; Z|Y )

(10.64)

I(X; Y ).

(10.65)

Since
I(X; Y |Z) = PZ (0)I(X; Y |Z = 0) + PZ (1)I(X; Y |Z = 1)

(10.66)

I(X; Y ) = I(Q, W) = I Q, W0 + (1 )W1 ,

(10.68)

= I(Q, W0 ) + (1 )I(Q, W1 )

(10.67)

and

(10.65) proves the convexity of I(Q, W) in W.

10.5

KarushKuhnTucker Conditions

Recall that in Section 7.3 we have derived the KarushKuhnTucker (KKT)


conditions to describe the optimal solution of an optimization problem involv-

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

10.5. KarushKuhnTucker Conditions

241

ing the maximization of a concave function over a probability vector. We note


that the computation of capacity is exactly such a problem:
C = max I(Q, W)

(10.69)

where Q is a probability vector and where I(Q, W) is concave in Q.


Hence, we can use the KKT conditions to describe the capacity-achieving
input distribution.

Theorem 10.19 (KKT Conditions for Capacity).


A set of necessary and sucient conditions on an input probability vector
Q to achieve the capacity of a DMC (X , Y, W) is that for some number
C,
W (y|x) log

W (y|x)
=C

x Q(x )W (y|x )

x with Q(x) > 0, (10.70)

W (y|x) log

W (y|x)
C

x Q(x )W (y|x )

x with Q(x) = 0. (10.71)

Moreover, the number C is the capacity of the channel.


More concisely, the KKT conditions can also be written as
D W (|x) (QW )()

=C
C

x with Q(x) > 0,


x with Q(x) = 0.

(10.72)

Proof: From Theorem 10.18 we know that the mutual information


I(Q, W) is concave in the input probability vector Q. We are therefore allowed
to apply the KKT conditions (Theorem 7.11). Hence, we need to compute
the derivatives of I(Q, W). To that goal note that I(Q, W) can be written as
follows:
Q(x)W (y|x) log
x

W (y|x)

x Q(x )W (y|x )

Q(x)W (y|x) log

I(Q, W) =

W (y|x)
Q(x )W (y|x )
x

=
x
y
x=
x

Q()W (y|) log


x
x
y

W (y|)
x
,
)W (y|x )
x Q(x

(10.73)

(10.74)

where where have split up one summation into x = x and x = x for x being

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

242

Computing Capacity

some xed value. Then

I(Q, W)
Q()
x
=

Q(x)W (y|x)

x
y
x=
x

W (y|) log e
x
+

W (y|) log
x
y

Q()W (y|)
x
x

W (y|) log
x
y

Q(x )W (y|x )

W (y|)
x

W (y|) log
x
y

= log e +

W (y|)
x
x

Q(x )W (y|x )

Q(x )W (y|x )
W (y|x)

W (y|x)

Q(x )W (y|x )

1
W (y|) log e
x

x Q(x )W (y|x )
(10.77)

W (y|) log e
x

W (y|)
x
)W (y|x )
x Q(x

W (y|) log
x
y

(10.76)

W (y|)
x
)W (y|x )
x Q(x

x Q(x)W (y|x)

x Q(x )W (y|x )

Q(x )W (y|x )

W (y|)
x
Q(x )W (y|x )
x

Q(x)W (y|x)

W (y|) log
x

(10.75)

Q(x)W (y|x)

W (y|) log e
x

W (y|x)

W (y|)
x
)W (y|x )
x Q(x

W (y|) log e
x
=

Q(x )W (y|x )

W (y|x)

W (y|)
x
)W (y|x )
x Q(x

(10.78)
=

if Q() > 0,
x
if Q() = 0,
x
(10.79)

where in (10.79) we have applied the KarushKuhnTucker conditions with


the Lagrange multiplier . Taking the constant log e to the other side and
dening c + log e now nishes the proof of the rst part.
It remains to show that c is the capacity C. To that goal take the optimal
Q and average both sides of (10.79):
Q (x)
x

W (y|x) log
y

W (y|x)
=
Q (x )W (y|x )
x

Q (x)c

(10.80)

i.e.,
I Q , W = c.
But since I Q , W = C, we see that c must be the capacity.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(10.81)

10.5. KarushKuhnTucker Conditions

243

Unfortunately, (10.70) and (10.71) are still dicult to solve. They are
most useful in checking whether a given Q is optimal or not: We make a
guess of how the optimal input distribution might look like. Then we check
whether (10.70) and (10.71) are satised. If yes, then we are done and we
know the capacity of the channel. If not, then we at least know one example
of how the capacity-achieving input distribution does not look like. . .
Example 10.20. Lets again consider the binary symmetric channel (BSC).
From symmetry we guess that the capacity-achieving input distribution should
be uniform Q(0) = Q(1) = 1 . Lets use Theorem 10.19 to check if we are right:
2
Since Q(0) = Q(1) > 0 we only need (10.70). For x = 0 we get
W (1|0)
W (0|0)
+ W (1|0) log 1
1
1
+ 2 W (0|1)
2 W (1|0) + 2 W (1|1)

1
(10.82)
= (1 ) log 1
1 + log 1
1
2 (1 ) + 2
2 (1 ) + 2
1
1
(10.83)
= (1 ) log(1 ) + log (1 + ) log (1 ) +
2
2
= Hb () + log 2
(10.84)

W (0|0) log

1
2 W (0|0)

= 1 Hb () bits.

(10.85)

Similarly, for x = 1:
W (0|1)
W (1|1)
+ W (1|1) log 1
1
1
+ 2 W (0|1)
2 W (1|0) + 2 W (1|1)
= 1 Hb () bits.
(10.86)

W (0|1) log

1
2 W (0|0)

Since both expressions (10.85) and (10.86) are the same, our guess has been
correct and the capacity is C = 1 Hb () bits.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 11

Convolutional Codes
After having studied rather fundamental properties of channels and channel
coding schemes, in this chapter we nally give a practical example of how one
could try to actually build a channel coding scheme. Note that while we know
that there must exist many schemes that work reliably, in practice we have
the additional problem of the limited computational power available both at
transmitter and receiver. This complicates the design of a channel coding
scheme considerably.
The code that will be introduced in this chapter is called convolutional code
or trellis code. It is a quite powerful scheme that is often used in practice. We
will explain it using a particular example that will hopefully shed light on the
basic ideas behind any such design.

11.1

Convolutional Encoder of a Trellis Code

We assume that we have a message source that generates a sequence of information digits {Uk }. We will assume that the information digits are binary,
i.e., information bits, even though strictly speaking this is not necessary. But
in practice this is the case anyway. These information bits are fed into a convolutional encoder. As an example consider the encoder shown in Figure 11.1.
This encoder is a nite state machine that has (a nite) memory: Its current output depends on the current input and on a certain number of past
inputs. In the example of Figure 11.1 its memory is 2 bits, i.e., it contains
a shift-register that keeps stored the values of the last two information bits.
Moreover, the encoder has several modulo-2 adders.
The output of the encoder are the codeword bits that will be then transmitted over the channel. In our example for every information bit, two codeword
bits are generated. Hence the encoder rate is
Rt =

1
bits.
2

(11.1)

In general the encoder can take ni information bits to generate nc codeword


ni
bits yielding an encoder rate of Rt = nc bits.

245

c Stefan M. Moser, v. 3.2

246

Convolutional Codes

Uk

Uk1

parallelserial

X2k1

Uk2

X2k1 , X2k

X2k
Figure 11.1: Example of a convolutional encoder.

To make sure that the outcome of the encoder is a deterministic function of


the sequence of input bits, we ask the memory cells of the encoder to contain
zeros at the beginning of the encoding process. Moreover, once Lt information
bits have been encoded, we stop the information bit sequence and will feed
T dummy zero-bits as inputs instead, where T is chosen to be equal to the
memory size of the encoder. These dummy bits will make sure that the state
of the memory cells are turned back to zero. In the case of the encoder given
in Figure 11.1 we have T = 2.
Example 11.1. For the encoder given in Figure 11.1 let Lt = 3 and T = 2.
How many codeword bits do we generate? Answer: n = 2 (3 + 2) = 10
codeword bits. So our codewords have length n = 10. How many dierent
messages do we encode in this way? Since we use Lt = 3 information bits, we
have 2Lt = 23 = 8 possible messages. So in total, we have an actual coding
rate of
R=

log2 8
3
log2 (#messages)
=
=
bits/channel use.
n
10
10

(11.2)

In general we get the following actual coding rate:


R=

Lt
Rt .
Lt + T

(11.3)

These eight codewords can be described with the help of a binary tree.
At each node we either turn upwards (corresponding to an input bit 1) or
downwards (corresponding to an input bit 0). Since the output depends on the
current state of the memory, we include the state in the graphical depiction of
the node. The branches will then be labeled according to the corresponding
output bits. See Figure 11.2. Any code that can be described by a tree is

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

11.1. Convolutional Encoder of a Trellis Code

247

output bits (X2k1 , X2k )

01

U =1
10
U =0

10

11

11

11

10
01

01

00
01

10

11
00

00
10

00
11
00
state (Uk1 , Uk2 )

10

11

01
01

00

11
00

10

00
00

10
11
01
00
10
11
01
00

01
00
01
00
01
00
01
00

11
00
11
00
11
00
11
00

00
00
00
00
00
00
00
00

Figure 11.2: Tree depicting all codewords of the convolutional code: The bits
on the branches form the codewords. The labels inside the boxes correspond to the current state of the encoder, i.e., the contents of the memory
cells. The input bits are not depicted directly, but are shown indirectly
as the choice of the path: Going upwards corresponds to an input bit 1,
going downwards to 0. The last two input bits are by denition 0, so we
only show one path there.

called tree code. Obviously a tree code is a special case of a block code as all
codewords have the same length.
Note that at depth 3 and further in the tree, the dierent states start to
repeat itself and therefore (since the output bits only depend on the state and
the input) also the output bits repeat. Hence there is no real need to draw all
states many times, but instead we can collapse the tree into a so-called trellis;
see Figure 11.3.
So, our tree code actually also is a trellis code. Note that every trellis code
by denition is a tree code, but not necessarily vice versa. Moreover, note
that we end up in the all-zero state because of our tail of T = 2 dummy bits.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

248

Convolutional Codes

11

U =1

01

10

U =0

10
11

11

10

10
11
01 01

00 10
11
10
01 01

10
01 01

00

00

00

00

00

11

11

11
00

00

00

00

00

00

Figure 11.3: Trellis depicting all codewords of the convolutional code. A trellis
is like a tree, but with identical states being collapsed into one state only.
The nal state is called toor, the inverse of root.

11.2

Decoder of a Trellis Code

A code that is supposed to be useful in practice needs to have a decoder with


reasonable performance that can be implemented with reasonable eciency.
Luckily, for our trellis code, it is possible to implement not only a reasonable
good decoder, but actually the optimal maximum likelihood (ML) decoder (see
Section 9.3)!
Before we explain the decoder, we quickly give the following denitions.
Denition 11.2. The Hamming distance dH (x, y) between two sequences x
and y is dened as the number of positions where x dier from y.
The Hamming weight wH (x) of a sequence x is dened as the number of
nonzero positions of x.
Hence, we see that
dH (x, y) = wH (x y).

(11.4)

To explain how the ML decoder of a trellis code works, we again use a


concrete example: Let us assume that we use the trellis code of Figure 11.3
over a BSC with crossover probability 0 < 1 as shown in Figure 11.4. For
2
a given codeword x, the probability of a received vector y is then given as
PY|X (y|x) = (1 )ndH (x,y) dH (x,y)
= (1 )n

(11.5)

dH (x,y)

(11.6)

where dH (x, y) denotes the Hamming distance dened above. Since we want to
implement an ML decoder, we need to maximize this a posteriori probability

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

11.2. Decoder of a Trellis Code

249
1

X
1

Y
1

Figure 11.4: Binary symmetric channel (BSC).

given in (11.6):
max PY|X (y|x) = max (1 )n
xC

xC

dH (x,y)

(11.7)

Note that because we have assumed < 2 , we see that 1 < 1, i.e., we
achieve the maximum if we minimize dH (x, y).
So we can describe the ML decoder of a trellis code as follows:

For a trellis code that is used on a BSC, an optimal decoder needs to


nd the path in the trellis (see Figure 11.3) that maximizes the number
of positions where the received vector y and the output bits of the path
agree.

This can be done in the following way: Given a received vector y, we


go through the trellis and write over each state the number of positions that
agree with y. Once we reach depth 4, this number will not be unique anymore
because there are two paths arriving at each state. In this case we will compare
the number we will get along both incoming paths and discard the smaller one
because we know that whatever the optimal solution is, it denitely cannot
be the path with the smaller (intermediate) number! We mark the discarded
path with a cross.
If both incoming path result in an identical value, it doesnt matter which
path to take and this might result in a nonunique optimal solution.
Once we reach the toor (the nal state), we only need to go backwards
through the trellis always following the not-discarded path.
Example 11.3. Lets assume we receive y = (0 1 0 1 0 1 0 1 1 1). What is the
ML decoding decision?
In Figure 11.5 we show the trellis including the number of agreeing bits
written above each state and the discarded paths. We see that the optimal
decoding yields

x = (1 1 0 1 0 0 0 1 1 1),

(11.8)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

250

Convolutional Codes
1
11

U =1

10
1
10

U =0
11

0
00

01

00

1
00

3
11

10

2
10
11
3
01 01

4
00 10
11
10 4
01 01

10 6
01 01

2
00

11 4
00
00

11 5
00
00

00

y=

01

01

01

01

x=

11

01

00

01

u=

11 8
00
00
not unique
11

11

Figure 11.5: Decoding of a trellis code.

which corresponds to an input vector u = (1 0 1).

The algorithm that we have described just now is called Viterbi Algorithm.
It is an ecient implementation of an ML-decoder for a trellis code.
In our example of a BSC we have seen that the Hamming distance was
the right metric to be used in the Viterbi Algorithm. But how do we choose
a metric for a general DMC with binary input? To nd out, lets list some
properties such a metric should have:
The metric should be additive (in k = 1, . . . , n).
The largest metric must belong to the path that an ML-decoder would
choose.
Note that for every y, an ML-decoder chooses a x that maximizes
n

PY|X (y|x) =
k=1

PY |X (yk |xk )

(11.9)

where the equality follows because we assume a DMC without feedback. Unfortunately, this metric is not additive. To x this, lets take the logarithm:
n

log PY|X (y|x) =


k=1

log PY |X (yk |xk ).

(11.10)

Hence, we could use


log PY |X (yk |xk )

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(11.11)

11.2. Decoder of a Trellis Code

251

as metric. The only problem is that such a metric will yield very annoying
numbers, making it very inconvenient for humans to compute. So lets try to
simplify our life. We additionally to the two properties above also require the
following:
For computational simplicity, the metric should consist of small nonnegative numbers.
Moreover, as much as possible, the metric should be 0.
To nd a way to change our metric (11.11) to satisfy these two additional
constraints, note that
n

f (yk )

argmax log PY|X (y|x) = argmax log PY|X (y|x) +


x

(11.12)

k=1

for any > 0 and any function f () (note that f (yk ) does not depend on x!).
Hence, instead of (11.11) we could use
d(xk , yk ) = log PY |X (yk |xk ) + f (yk )

(11.13)

as metric for an arbitrary choice of > 0 and f ().


To make sure that d(xk , yk ) is nonnegative and equals to zero as often as
possible, we choose
f (yk )

log min PY |X (yk |x) .

(11.14)

Finally, we choose such that the dierent possible values of the metric are
(almost) integers.
Example 11.4. Consider the binary symmetric erasure channel given in Figure 11.6. For this channel we have
0.91
0

0
0.02
0.07

0.07
0.02
1

0.91

Figure 11.6: Binary symmetric erasure channel (BSEC).

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

252

Convolutional Codes

0.02 for y = 0,

min PY |X (y|x) = 0.07 for y = ?,


x

0.02 for y = 1.

(11.15)

Hence we get the (so far unscaled) metric shown in Table 11.7. We see that
Table 11.7: Viterbi metric for a BSEC, unscaled.
y=0
x=0
x=1

y=?

y=1

log 0.91 log 0.02

log 0.07 log 0.07

log 0.02 log 0.02

log 0.02 log 0.02

log 0.07 log 0.07

log 0.91 log 0.02

5.50

=0

=0

=0

5.50

=0

most entries are zero. The only two entries that are not zero, happen to have
1
the same value, so we choose the scaling factor = 5.50 and get our nal
metric shown in Table 11.8.
Table 11.8: Viterbi metric for a BSEC.
y=0

y=?

y=1

x=0

x=1

Lets use the convolutional code of Example 11.3 over the BSEC of Figure 11.6 and lets assume we receive y = (0 1 ? 1 1 ? 0 1 1 1). Now we apply
the Viterbi algorithm to the trellis of Figure 11.3 and use the metric of Table 11.8 to get the optimal decoding shown in Figure 11.9. We see that the
decoding in this case turns out to be not unique: We nd two possible optimal
solutions.

11.3

Quality of a Trellis Code

Suppose you have come up with a certain design of a convolutional encoder.


Now your boss wants to know how good your coding scheme is (of course
when using the encoder together with a Viterbi decoder). One approach would
be to implement the system and let it run for long time and then compute
the average error probability. However, from a designers point of view, it
would be much more practical to have a quick and elegant way of getting the
performance (or at least a bound on the performance) of the system without
having to build it. This is the goal of this section: We will derive an upper

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

PSfrag

11.3. Quality of a Trellis Code


1
11

U =1

10
1
10

U =0
11

0
00

253

00

1
00

01

3
11

10

2
10
11
2
01 01

2
00 10
11
10 2
01 01

10 4
01 01

1
00

11 3
00
00

11 4
00
00

11 6
00
00

00

y=

01

?1

1?

01

11

x=

11
00

01
00

00
11

01
01

11
11

u=

1
0

0
0

1
1

Figure 11.9: Viterbi decoding of convolutional code over BSEC.

bound on the bit error probability or bit error rate (BER) that can be easily
computed using nothing but the design plan of our convolutional encoder and
the specication of the channel.
While at the end the computation of this elegant bound is easy, the derivation of it is quite long as we need to make quite a few excursions to attain some
necessary auxiliary results. We start with the idea of a detour in a trellis.

11.3.1

Detours in a Trellis

The performance of a trellis code depends on how close two dierent codewords are: If they are very close, then it is much more likely that the decoder
confuses them due to the distortion caused by the channel. Since every codeword corresponds to a path in the trellis, we need to consider the distance
of dierent paths.
To that goal, assume for the moment that we transmit the all-zero codeword (which is a possible codeword of any trellis code!). If the decoder makes
a mistake, the Viterbi decoder will make a detour in the trellis, i.e., it leaves
the correct path and follows partially a wrong path.
To understand all possible detours that could occur, assume for the moment that the trellis goes on forever (i.e., assume Lt = ). In this case we
can describe all possible detours by xing the starting point of the detour to
node 1.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

254

Convolutional Codes

Denition 11.5. A detour is a path through the trellis starting at node 1


and ending as soon as we are back to the all-zero state.
An example of a detour is shown in Figure 11.10.
11
10

detour

10

10

11

11

10
01
11
00

00

00

00

00

00

00

00

00

00

00

Figure 11.10: Detour in a trellis: The detour ends as soon as we get back to
the correct path (the all-zero path in this case) for the rst time, even if
we will leave the correct path right again.
For a given detour the important questions are rstly how many bits are
dierent in comparison to the correct path, and secondly how many information bit errors do occur along the way. Ideally, a detour has a large Hamming
distance to the correct path.
Denition 11.6 (Counting Detours). For the innite trellis of a given convolutional encoder, we dene the parameter a(d, i) to denote the number of
detours starting at node 1 that have a Hamming distance d from the correct
(all-zero) path and contain i information bit errors.
Example 11.7. Considering the convolutional encoder of Figure 11.1, it is
not hard to see (compare with Figure 11.3) that
a(d, i) = 0,

for d < 5 and all i,

(11.16)

a(5, 1) = 1,

and

(11.17)

a(6, 2) = 2.

(11.18)

However, it can become rather tiring to try to nd all possible detours by


hand.

In Section 11.3.2 we will nd a way of deriving a(d, i) graphically. Beforehand, however, we should investigate what happens if the correct path is not
the all-zero path. It will turn out that a(d, i) does not change irrespective
which path through the trellis is assumed as the correct path! Concretely, we
have the following lemma.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

11.3. Quality of a Trellis Code

255

Lemma 11.8. Consider the Lt = trellis for a given convolutional encoder.


For any correct path and any integer j, the number of wrong paths beginning
at node j that have Hamming distance d from the corresponding segment of
the correct path and contain i information bits errors, is equal to a(d, i). For
every nite Lt , a(d, i) is an upper bound on this number.
Proof: Firstly note that in the innite trellis, it does not matter where
a detour starts, as from every node, the remainder of the trellis looks the
same. So without loss of generality we can assume that the deviation from
the correct path starts at node 1.
For an information sequence u denote by g(u) the corresponding codeword
generated by the convolutional encoder.

If 0 is the information sequence of the correct path, let u be the information


sequence corresponding to a wrongly decoded codeword when making a detour

beginning at node 1 (hence, u has a 1 at the rst position!). Let

i = dH (0, u) = wH ( )
u

(11.19)

be the number of bit errors caused by this wrong decoding and let
u
u
u
d = dH g(0), g( ) = dH 0, g( ) = wH g( )

(11.20)

be the Hamming distance between the two codewords.


Now, if we take a general information sequence u with corresponding codeword g(u) as the correct path, let u be the information sequence corresponding to a wrongly decoded codeword when making a deviation from the correct
path beginning at node 1 (hence, u and u dier at the rst position!). Now,
the number of information bit errors is
i = dH (u, u ) = wH (u u )

(11.21)

and the Hamming distance between the two codeword is


d = dH g(u), g(u )

(11.22)

= dH g(u) g(u), g(u ) g(u)

(11.23)

(11.24)

(11.25)

= dH 0, g(u ) g(u)

= dH 0, g(u u)

= wH g(u u) .

(11.26)

Here, in (11.23) we use the fact that adding the same sequence to both codewords will not change the Hamming distance; and in (11.25) we use that any
convolutional encoder is linear in the sense that
g(u u) = g(u ) g(u).

(11.27)

Comparing (11.21) with (11.19) and (11.26) with (11.20), we see that for every

given i, d, u and u , we can nd an u (i.e., u = u u) such that i and d remain


the same, but u is changed to 0. Hence, we can conclude that the number of

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

256

Convolutional Codes

deviations from a correct path u is the same as the number of detours (from
a correct path 0), and therefore that a(d, i) denotes the number of deviations
from the correct path with i information bit errors and Hamming distance d
for an arbitrary path u.

11.3.2

Counting Detours: Signalowgraphs

Since we have understood the signicance of a(d, i), lets try to nd a way on
how to compute it. The trick is to look at the convolutional encoder using
a state-transition diagram that depicts all possible states and shows how the
encoder swaps between them. For the example of the encoder of Figure 11.1
we have the state-transition diagram shown in Figure 11.11.
1/01
11
1/10
0/10

1/00
10

01
0/01

1/11

00
0/00

0/11

Figure 11.11: State-transition diagram of the encoder given in Figure 11.1.


Each edge u/cc is labeled with the information bit u necessary in order
to change state along this edge and the corresponding codeword bits cc
generated by this transition.
Now consider again the case when the all-zero codeword is the correct
path. Then any detour starts from the all-zero state and will end in the
all-zero state. Hence, we have the idea to break the all-zero state up into
two states, a starting state and an ending state, and to transform the statetransition diagram into a signalowgraph, where every state is transformed
into a node and every transition into a directed edge; see Figure 11.12.
Moreover, each edge has assigned some multiplicative factor, where a term
I denotes that an information bit error occurs when we pass along the corresponding edge, and a term D describes a codeword bit error caused by the
detour that follows along this edge. If we take the product of all weight factors
along a certain detour, then we will get a total factor Ii Dd , where the exponent i denotes the total number of information bit errors along this detour,

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

11.3. Quality of a Trellis Code

257

ID
ID

b
D

I
a

c
ID

D2
start

end

Figure 11.12: Signalowgraph of the state-transition diagram that is cut open


at the all-zero node.

and the exponent d denotes the total Hamming distance between the detour
and the correct path.
Example 11.9. If we take the detour startabbcend, we get the
product
ID2 ID ID D D2 = I3 D7 ,

(11.28)

which means that we have made three information bit errors and have a Hamming distance 7 to the correct all-zero path.

Before we now learn how we can make use of this signalowgraph, let us
quickly repeat the basic denitions of signalowgraphs.
Denition 11.10. A signalowgraph consists of nodes and directed edges
between nodes, where each edge has a weight factor. We dene the following:
A node i is directly connected with a node j if there exists a directed
edge pointing from i to j.
A node i is connected with node j if we can nd a sequence of nodes
such that i is directly connected with 1 ; 1 is directly connected with
2 ; . . . ; 1 is directly connected with ; and is directly connected
with j.
An open path is a connection from a node i to a node j where each node
on the way is only passed through once.
A loop is an open path that returns to the starting node.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

258

Convolutional Codes

A path from a node i to a node j is an arbitrary combination of open


paths and loops that connects the starting node with the end node. So
we can also say that node i is connected with node j if there exists a
path from i to j.
Each path has a path gain G that is the product of all weight factors of
the edges along the path.
We say that two paths touch each other if they share at least one node.
The transmission gain T of a signalowgraph between a starting node
and an ending node is the sum of the path gains of all possible paths
(including arbitrary numbers of loops) between the starting and the
ending node.
Example 11.11. Consider the signalowgraph shown in Figure 11.13. The

c
a

b 3

d
1

Figure 11.13: Example of a signalowgraph.


path 1 2 3 4 is an open path, 2 3 2 is a loop, and 1 2
3 2 3 4 is a path that is neither open nor a loop. The path gain of
this last path is
G123234 = a b2 c d.

(11.29)

To compute the transmission gain of this signalowgraph (assuming that node


1 is the starting node and node 4 the ending node), we need to list all possible
paths from 1 to 4 with their path gains:
G1234 = a b d,

(11.30)

G123234 = a b c d,

(11.31)

(11.32)

(11.33)

G12323234 = a b c d,

G1232323234 = a b c d,
.
.
.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

11.3. Quality of a Trellis Code

259

Hence, we get
T = G1234 + G123234 + G12323234
+ G1232323234 +
2

3 2

(11.34)

4 3

= abd + ab cd + ab c d + ab c d +

= abd

(11.35)

bi c i

(11.36)

i=0

= abd

1
abd
=
.
1 bc
1 bc

(11.37)

Note that we quite easily managed to compute the transmission gain for this
example by listing the paths and cleverly arrange them. However, this was
mainly due to the simplicity of the graph. In general, we need a more powerful
tool.

We will now state a general rule on how to compute the transmission gain
of an arbitrary signalowgraph.
Theorem 11.12 (Masons Rule). The transmission gain T of a signalowgraph is given by
T=

all open paths k

Gk k

(11.38)

Here the sum is over all possible open paths from starting node to end node,
and Gk corresponds to the path gain of the kth of these open paths. The factor
is denoted determinant of the signalowgraph and is computed as follows:

G +
all loops

all loops 1 , 2
1 and 2 do not touch

all loops 1 , 2 , 3
1 , 2 , 3 do not touch

G1 G2

G1 G2 G3 +

(11.39)

Finally, the cofactor k of the kth open path is the determinant of that part
of the graph that does not touch the kth path.
Example 11.13. We return to the signalowgraph of Figure 11.13. The
determinant of this graph is
= 1 b c.

(11.40)

Since there is only one open path from 1 to 4 with G = abd and since this
open path has cofactor 1 = 1 (the path touches all parts of the graph!), we
get
T=

abd
,
1 bc

exactly as we have seen already in (11.37).

(11.41)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

260

Convolutional Codes

c
b

d
2

e
g

Figure 11.14: Another example of a signalowgraph.

Example 11.14. Lets make another, slightly more complicated example.


Consider the signalowgraph in Figure 11.14.
For this graph we have two open paths from 1 to 5 and therefore two
corresponding cofactors:
G1245 = aeg,
G12345 = abdg,

1245 = 1 c,

12345 = 1.

(11.42)
(11.43)

Moreover, the determinant is


= 1 c bdf ef + cef.

(11.44)

Note that we have three loops: 3 3, 2 3 4 2, and 2 4 2; two


of which do not touch each other.
Hence, using Masons rule (Theorem 11.12) we get
T=

aeg(1 c) + abdg
.
1 c bdf ef + cef

(11.45)

So, let us now return to the signalowgraph of our convolutional encoder


from Figure 11.12. We recall from Example 11.9 that a path gain describes
the number of information bit errors and the Hamming distance of the corresponding detour with respect to the correct path. Moreover, the transmission
gain by denition is the summation of all possible path gains from starting
node to ending node. But this is exactly what we need for the analysis of our

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

11.3. Quality of a Trellis Code

261

detours:
T=
=
=

(11.46)

Gk
all paths k from
state 0 to state 0
i d

(11.47)

I D

all detours

a(d, i)Ii Dd .

(11.48)

i=1 d=1

Here the rst equality follows from the denition of the transmission gain;
the subsequent equality from the way how we transform the state-transition
diagram into a signalowgraph with the all-zero state cut open and changed
into the starting and ending node and with each edge labeled by factors I
and D depending on the caused errors; and in the nal equality we rearrange
the summation order and collect all terms with identical exponents i and d
together, using a(d, i) as the count of such identical detours.
We see that if we compute the transmission gain, we can read the wanted
values of a(d, i) directly out of the expression.
Example 11.15. We continue with Figure 11.12 and compute the transmission gain using Masons rule (note that the structure of the signalowgraph is
identical to the example in Figure 11.14):
T=

ID5 (1 ID) + I2 D6
ID5
=
.
1 2ID
1 ID I2 D2 ID + I2 D2

(11.49)

If we write this as a geometrical series, we get


T = ID5 1 + 2ID + 4I2 D2 + 8I3 D3 +
5

= ID + 2I D + 4I D + 8I D + ,

(11.50)
(11.51)

from which we see that a(5, 1) = 1, a(6, 2) = 2, a(7, 3) = 4, a(8, 4) = 8,


etc.

11.3.3

Upper Bound on the Bit Error Probability of a Trellis


Code

So now we have a way of deriving the number of wrong paths, i.e., detours,
and their impact concerning information bit errors and Hamming distance
to the correct path. Next we will use the Bhattacharyya bound derived in
Section 9.4 to nd a relation between these detours and the error probability.
This will then lead to an upper bound on the bit error probability of a trellis
code.
We make the following assumptions and denitions:
We consider a convolutional encoder that is clocked Lt times using information bits as input and T times using dummy zero bits as input.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

262

Convolutional Codes
Hence, every path in the corresponding trellis will pass through Lt + T
nodes before it ends in the nal all-zero node. We number these nodes
j = 1, . . . , Lt + T + 1. (See also Example 11.16 below.)

We will assume that at every time step, k0 information bits enter the
encoder (so far, we have always assumed k0 = 1, but we want to keep
this analysis general). Hence, we are using in total k0 Lt information bits
and k0 T dummy bits.
We denote by Pe,i the probability of a decoding error for the ith information bit. Then the average bit error probability or average bit error
rate (BER) is
Pb

1
k 0 Lt

k0 Lt

Pe,i .

(11.52)

i=1

We recall that in the innite trellis, the number of detours starting at


node j is identical to the number of detours starting at node 1. We
number these detours in the innite trellis in some (arbitrary) way: =
1, 2, . . .
For each of these detours = 1, 2, . . . let
d

Hamming distance between the th detour and the


corresponding segment of the correct path,

number of information bit errors that occur along


the th detour.

We dene the following events: For any node1 j = 1, . . . , Lt and for any
detour = 1, 2, . . .,
Bj,

{Viterbi decoder follows the th detour starting at node j},

(11.53)

and for any node j = 1, . . . , Lt ,


Bj,0

{Viterbi decoder does not follow


any detour starting at node j}.

(11.54)

Note that Bj,0 can occur for two dierent reasons: Either the Viterbi
decoder does not make an error at node j or the Viterbi decoder is
already on a detour that started earlier. Note further that for every
node j, exactly one event Bj, must occur, i.e.,

Pr(Bj, ) = 1,

for all j = 1, . . . , Lt .

(11.55)

=0
1

At nodes Lt + 1, Lt + 2, . . . no detours can start because there we will be using dummy


zero bits.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

11.3. Quality of a Trellis Code

263

11
10

detour = 3
i3 = 2
d3 = 6

10
11

detour = 1
i1 = 1
d1 = 5

10
11

10
01

01 01
11

00
1

00 00
2

00 00
3

00 00
4

00 00
5

11
00 00
6

00 00
7

00 00
8

W1 = 2
W2 = 0
W3 = 0
W4 = 0
W5 = 1
W6 = 0
W7 = 0
V1 = 1
V2 = 1
V3 = 0
V4 = 0
V5 = 1
V6 = 0
V7 = 0

Figure 11.15: Example of a path in a trellis that contains two detours: The
rst detour (detour = 3, although this numbering is arbitrary!) starts
at node 1, the second detour (detour = 1) at node 5.

Note that, in general, the events Bj, are highly dependent (e.g., if we
start a detour at node j, we cannot start another one at node j + 1).
Let Wj be the number of information bit errors that occur when the
Viterbi decoder follows a detour starting at node j. Note that Wj depends on which detour is chosen. Note further that Wj = 0 if the decoder
does not start a detour at node j or it is already on a detour that started
earlier.
We see that
Wj =

0
i

if Bj,0 occurs,
if Bj, occurs ( 1).

(11.56)

Let Vi be a binary indicator random variable such that


Vi

1 if the ith information bit is in error,


0 otherwise.

(11.57)

Note that
Pr[Vi = 1] = Pe,i

(11.58)

E[Vi ] = 1 Pr[Vi = 1] + 0 Pr[Vi = 0] = Pe,i .

(11.59)

and

Example 11.16. Consider the example shown in Figure 11.15. We see that
the following events take place: B1,3 , B2,0 , B3,0 , B4,0 , B5,1 , B6,0 , B7,0 . In total
we make three information bit errors, the rst, second, and the fth bits are
wrong.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

264

Convolutional Codes

Putting everything together, we now get the following:


1
Pb =
k 0 Lt
1
k 0 Lt

=E

k0 L t

Pe,i

(11.60)

E[Vi ]

(11.61)

i=1
k0 L t
i=1
k0 Lt

1
k0 Lt

Vi ,

(11.62)

i=1

where the last step follows because expectation is a linear operation. Note that
k0 Lt
i=1 Vi denotes the total number of information bit errors. This number can
also be computed in a dierent way by summing over Wj :
k0 Lt

Lt

Vi =
i=1

Wj .

(11.63)

j=1

Hence we have
1
Pb = E
k0 Lt

1
= E
k0 Lt
1
=
k 0 Lt
=

1
k 0 Lt
1
k 0 Lt

1
=
k 0 Lt

k0 Lt

(11.64)

Vi
i=1

Lt
j=1

Lt

Wj

(11.65)

E[Wj ]

(11.66)

j=1
Lt

j=1 =0

Pr(Bj, ) E[Wj |Bj, ]

Lt
j=1
Lt

Pr(Bj,0 ) 0 +

Pr(Bj, ) i .

Pr(Bj, ) i

(11.67)

(11.68)

=1

(11.69)

j=1 =1

Here, (11.66) follows again from the linearity of expectation; in (11.67) we use
the theorem of total expectation (see Chapter 2), based on (11.55); and in
(11.68) we apply (11.56).
Now, in order to get a grip on Bj, and Pr(Bj, ), we need to simplify things:
Firstly, we upper-bound the probability by including all (innitely many)
possible detours in the innite trellis instead of only the (nite number
of) possible detours that can start from node j in our nite trellis.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

11.3. Quality of a Trellis Code

265

Secondly, we upper-bound the probability further by completely ignoring


the dependency of Bj, on j, i.e., by ignoring the possibility that the
decoder could already be on a detour.
Thirdly, we upper-bound the probability even more by ignoring all other
detours Bj, , = , i.e., we assume that the likelihood of the th detour
must not necessarily be the largest of all detours, but it is sucient if it
is larger than the corresponding correct path. In short: We reduce the
problem to a two-codeword problem discussed in Section 9.4.
We can then bound according to Corollary 9.21:
d

Pr(Bj, ) 2DB

(11.70)

So we get
Pb

1
k0 Lt

1
k0 Lt

Lt

2DB

(11.71)

j=1 =1

i 2DB

Lt

(11.72)

1
j=1

=1

= Lt

=
=

1
k0
1
k0

i 2DB

=1

a(d, i) i 2DB

(11.73)
d

(11.74)

i=1 d=1

where in the last equality we have rearranged the summation order: Instead
of summing over all possible detours, we group all detours with i information
bit errors and Hamming distance d together.
Note that this upper bound holds for any Lt , even for Lt = , i.e., the
case when we do not insert the tail of dummy bits. Also note that DB is a
characteristic of the channel, while a(d, i) and k0 are characteristics of the
encoder.
Now we only need to recall from Section 11.3.2 that
T(D, I) =

a(d, i)Ii Dd ,

(11.75)

i=1 d=1

take the derivative of this expression with respect to I:

T(D, I) =
I

a(d, i) i Ii1 Dd ,

(11.76)

i=1 d=1

and compare it with (11.74) to get the following main result.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

266

Convolutional Codes

Theorem 11.17. The average bit error probability (BER) of a convolutional encoder that is used on a binary-input DMC without feedback is
upper-bounded as follows:
Pb

1 T(D, I)
k0
I

(11.77)
I=1,D=2DB

where T(D, I) is the transmission gain of the signalowgraph corresponding to the state transition diagram of the convolutional encoder; where
DB is the Bhattacharyya distance of the DMC; and where k0 is the number of information bits that enter the encoder at each time step.

Example 11.18. Lets return to the main example of this chapter: the encoder given in Figure 11.1. From Example 11.15 we know that
T(D, I) =

ID5
.
1 2ID

(11.78)

Hence,
T(D, I)
D5 (1 2ID) + ID5 2D
D5
=
=
.
I
(1 2ID)2
(1 2ID)2

(11.79)

Moreover, we have k0 = 1.
Lets now assume that we use this encoder on a BEC with erasure probability . Then, from Example 9.22 we know that
2DB = .

(11.80)

Using these values in (11.77) now gives the following bound:


5
.
(1 2)2

(11.81)

For example, if = 0.1, then Pb 1.56 105 .

Pb

Since (11.77) is an upper-bound, any encoder design will be performing


better than what Theorem 11.17 predicts. However, note that the bound
(11.77) is in general quite tight. This means that if we design a system according to the upper-bound instead of the exact average bit error probability,
our design will usually not result in unnecessary complexity.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 12

Error Exponent and Channel


Reliability Function
In Chapter 9 we have learned that every channel has this fundamental property
that if we do not transmit too much information in each time-step (the crucial
idea here is the concept of information per channel use, i.e., the rate of our
information transmission), then the probability of making an error can be
made as small as we wish by using an appropriate coding scheme. What we
have not considered there are two fundamental questions:
Given that we use a good coding scheme with a rate below capacity,
how quickly does the error probability tend to zero if we increase the
blocklength?
How do two dierent good coding schemes with dierent rates compare
with each other? Concretely, assume we have a DMC with capacity C =
5 bits and we use two sequences of coding schemes on that DMC: The
rst sequence of coding schemes works at a rate of 4.9 bits, the second
at a rate of only 1 bit. Since the second scheme uses far less codewords
for a given blocklength than the rst, it would be quite intuitive that
the average error probability of the second scheme should be smaller. Is
this true?
This chapter will answer these two questions. We will show that the error
probability of a good coding scheme decays exponentially with the blocklength, where the exponent depends on the rate of the scheme. Concretely,
the further away the rate is from capacity, the faster the convergence of the
error probability will be! Hence, the intuition of the second question is indeed
correct.
To arrive at these answers, we will again rely on Shannons fantastic (and
a bit crazy) idea of random coding. Moreover, we need some tools that are
based on the Bhattacharyya bound derived in Section 9.4.

267

c Stefan M. Moser, v. 3.2

268

Error Exponent and Channel Reliability Function

12.1

The Union Bhattacharyya Bound

Recall that in Denition 9.17 we dened the error probability given that the
codeword x(m) has been sent:

Pr X = x(m) x(m) is sent

PY|X (y|x(m))

(12.1)
(12.2)

yDm
/
M

PY|X (y|x(m)),

(12.3)

=1 yD
=m

where D denotes the set of all y that will be decoded to M = . Also recall
that for ML decoding, it holds that
y D = PY|X (y|x()) PY|X (y|x(m).

(12.4)

Hence, by comparison to (9.45), this is identical to a situation where we have


an ML decoder for a coding scheme with only two codewords
{x (1), x (2)} = {x(), x(m)}

(12.5)

and with the ML decoding region1 of x (1)

D1

y : PY|X (y|x (1)) PY|X (y|x (2)) .

(12.6)

Note that D1 D because for D1 we only consider the two codewords of


(12.5) and ignore all other codewords.
We can now apply the Bhattacharyya bound as follows:

PY|X (y|x(m))
yD

PY|X (y|x (2))

(by (12.5))

(12.7)

PY|X (y|x (2))

(because D D1 )

(12.8)

(by (9.48))

(12.9)

yD

yD1

PY|X (y|x (1)) PY|X (y|x (2))

yD1

PY|X (y|x (1)) PY|X (y|x (2))

(enlarge sum)

(12.10)

PY|X (y|x()) PY|X (y|x(m))

(by (12.5)).

(12.11)

=
y

Here, in (12.9) we apply the Bhattacharyya bound on the new coding scheme
with only two codewords; and in (12.10) we enlarge the sum by taking into
account all y.
Plugging this into (12.3) then yields the following.
1

Note that without loss of optimality we have chosen to resolve ties in favor of x (1).

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

12.2. The Gallager Bound

269

Theorem 12.1 (Union Bhattacharyya Bound).


For a coding scheme with M codewords and an ML decoder, the error
probability is bounded as follows:
M

PY|X (y|x(m)) PY|X (y|x()),


=1
=m

(12.12)

for m = 1, . . . , M. In the case of a DMC without feedback and a blocklength n, this can be written as
n

=1 k=1 y
=m

PY |X (y|xk (m)) PY |X (y|xk ()),

(12.13)

for m = 1, . . . , M.

Note that this bound cannot be expected to be very tight if M is large

because then D1 in (12.8) is much larger than the actual decoding region D .

12.2

The Gallager Bound

We now consider a less straightforward way to generalize the Bhattacharyya


bound. For an ML decoder, we know that
y Dm = PY|X (y|x()) PY|X (y|x(m)) for some = m,
/

(12.14)

but we cannot easily characterize such . But we can certainly be sure that
for all s 0 and 0,

y Dm =
/

M
=1
=m

PY|X (y|x())
PY|X (y|x(m))

= PY|X (y|x(m))s

M
=1
=m

(12.15)

PY|X (y|x())s 1

(12.16)

because (12.14) guarantees that at least one term in the sum is larger equal to
one and because when raising a number larger equal to one to a nonnegative
power, the result still is larger equal to one.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

270

Error Exponent and Channel Reliability Function


Hence, we get from (12.2)

m =
yDm
/

yDm
/

=
yDm
/

PY|X (y|x(m)) 1

(12.17)

PY|X (y|x(m)) PY|X (y|x(m))s

PY|X (y|x(m))1s

=1
=m

M
=1
=m

PY|X (y|x())s (12.18)

PY|X (y|x())s ,

(12.19)

where we have bounded the 1 in (12.17) by (12.16).


Finally, we weaken the bound by including all y in the summation and by
choosing2
1
.
1+

(12.20)

Theorem 12.2 (Gallager Bound).


For a coding scheme with M codewords and an ML decoder, the error
probability is bounded as follows:

PY|X (y|x(m))

1
1+

=1
=m

for m = 1, . . . , M and for all 0.

1
PY|X (y|x()) 1+ ,

(12.21)

In the case of a DMC without feedback and a blocklength n, the bound


(12.21) can be weakened to

k=1 y

PY |X (y|xk (m))

1
1+

=1
=m

1
PY |X (y|xk ()) 1+ ,

(12.22)

for m = 1, . . . , M and for all 0.


Note that if we choose = 1, then the Gallager bound coincides with
the union Bhattacharyya bound. Hence, the Gallager bound is strictly tighter
than the union Bhattacharyya bound, except when the choice = 1 minimizes
the right-hand side of (12.21).
2
Any choice s 0 is legal, so we denitely are allowed to choose it this way. However,
it can be shown that the choice (12.20) is optimal in the sense that it minimizes the upper
bound.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

12.3. The Bhattacharyya Exponent and the Cut-O Rate

271

We have to warn the reader, however, that while the Gallager bound is not
substantially more dicult to compute than the union Bhattacharyya bound,
usually both bounds are too complicated to calculate for most practical cases.
Their main use lies in the fact that they can be conveniently averaged over,
as we shall see below.

12.3

The Bhattacharyya Exponent and the


Cut-O Rate

As already mentioned, we are coming back to random coding. The situation


is very similar to Section 9.8: We would like to design a coding scheme that
has good properties, but since we have no idea on how to do it, we generate
the codebook with M codewords of length n at random using a certain chosen
probability distribution. However, while in (9.106) we chose each component
of each codeword IID according to some distribution PX (), here we will allow
a slightly more general distribution where we only require that the codewords
are pairwise IID:
PX(m),X(m ) (x(m), x(m )) = PX (x(m)) PX (x(m ))

(12.23)

for all choices of x(m) and x(m ), m = m . We then compute the error
probability given that the mth codeword has been sent, averaged over all
possible choices of codebooks:
E[m ] = E m X(1), . . . , X(M)

(12.24)

PX(1),...,X(M) x(1), . . . , x(M) m x(1), . . . , x(M) ,

=
x(1),...,x(M)

(12.25)
where m (x(1), . . . , x(M)) denotes the error probability of the mth codeword
given a particular codebook C = {x(1), . . . , x(M)}. We now upper-bound
each of these error probabilities using the union Bhattacharyya bound (Theorem 12.1):

E[m ] E

=1 y
=m

PY|X (y|X(m)) PY|X (y|X())

PY|X (y|X(m)) PY|X (y|X())

(12.26)

(12.27)

=1 y
=m
M

PX (x(m)) PX (x())

=
=1 y x(m)X n x()X n
=m

PY|X (y|x(m)) PY|X (y|x())

(12.28)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

272

Error Exponent and Channel Reliability Function


M

PX (x(m))

PY|X (y|x(m))

=1 y x(m)X n
=m

PX (x())

PY|X (y|x())

(12.29)

x()X n
2

PX (x)
=1 y
=m

PY|X (y|x)

(12.30)

x
2

= (M 1)

PY|X (y|x)

PX (x)
y

(12.31)

for m = 1, . . . , M. Here, in (12.28) we make use of our assumption (12.23); and


in (12.30) we note that x(m) and x() are only dummy summation variables
so that both sums are actually identical.
We now further assume that PX is such that
n

PX (xk ).

PX (x1 , . . . , xn ) =

(12.32)

k=1

Then, for a DMC without feedback, we have


n

PX (xk )

PY|X (y|x) =

PX (x)

k=1

PY |X (yk |xk ).

(12.33)

We substitute this back into (12.31) and obtain


2

E[m ] (M 1)

PX (xk )
y

x k=1

PY |X (yk |xk )

(12.34)
2

= (M 1)

y1

x1

yn

PX (xk )
xn k=1

PY |X (yk |xk )
2

= (M 1)

y1

PX (xk )
k=1 xk

yn

PY |X (yk |xk )

y1

PX (xk )
yn k=1

xk

PY |X (yk |xk )

PX (xk )
k=1 yk

= (M 1)

xk

PY |X (yk |xk )
2

PX (x)
y

PY |X (y|x)
2

PX (x)
y

(12.37)

= (M 1)

(12.36)
2

= (M 1)

(12.35)

PY |X (y|x)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(12.38)

(12.39)

(12.40)

12.3. The Bhattacharyya Exponent and the Cut-O Rate


n log2

= 2nR 2

PX (x)

PY |X (y|x))

273

(12.41)

where we have recognized several dummy summation variables and simplied


the expressions accordingly and used that by denition
M = 2nR .

(12.42)

Note that we have not yet specied what to choose for PX (). In Section 9.8
we chose it such as to maximize the mutual information between input and
output of the DMC (because this way we achieved the best bound) and thereby
dened the channel property capacity. Here, we will do something similar: We
choose PX () such as to minimize the upper bound (12.41) and thereby dene
a new channel property that is called cut-o rate of the DMC:

R0 max log2
(12.43)
PX (x) PY |X (y|x) .

PX ()
y
x
Note that for a given DMC, the capacity-achieving input distribution is usually
not the same as the distribution that achieves the cut-o rate.
With denition (12.43) the bound (12.41) now becomes
E[m ] 2n(R0 R) .

(12.44)

Since the right-hand side of (12.44) is independent of m, we can average over


M and still get the same bound:
E

(n)
Pe

1
=E
M
=

1
M
1
M

(12.45)

E[m ]

(12.46)

2n(R0 R)

(12.47)

m=1

M
m=1
M

m=1
n(R0 R)

=2

(12.48)

We now strengthen the result using the same trick as shown in (9.159)
(9.163) in Section 9.8: We only consider a good codebook C for which (12.48)
is satised,
(n)
Pe (C ) 2n(R0 R) ,

(12.49)

and throw away the worse half of the codewords:


(n)
Pe (C ) =

1
1
m +
M worse
M

half (n)

1
m (n) .
2

(12.50)

better 0
half

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

274

Error Exponent and Channel Reliability Function

This way we obtain a new codebook C that has rate


1
M
1

R = log2
=R
n
2
n
and that has a maximum error probability satisfying
(n)

(n) 2Pe (C )

(12.51)

(by (12.50))

(12.52)

n(R0 R)

(by (12.49))

(12.53)

1
n(R0 R n )

(by (12.51))

(12.54)

22

=22

n(R0 R)

=42

(12.55)

We have proven the following important result.


Theorem 12.3 (Bhattacharyya Error Exponent).
There exists a coding scheme with M = 2nR codewords of length n and
with an ML decoder such that the maximum error probability satises
(n) 4 2nEB (R) ,

(12.56)

where EB (R) is called the Bhattacharyya exponent and is given by


EB (R)

R0 R

with the cut-o rate R0 dened as

PX (x)
R0 max log2
PX ()
y
x

(12.57)

PY |X (y|x)

(12.58)

From (12.57) we readily see that the Bhattacharyya exponent is positive


only if R < R0 . This shows that R0 has the same dimension and unit as the
code rate and justies the inclusion of rate in its name. We also note that
EB (R) decreases linearly with R from its maximum value of R0 at R = 0.
We remind the reader that the cut-o rate is a channel parameter (i.e., it
depends exclusively on the channel itself), similar to the capacity that also
only depends on the channel. However, in contrast to capacity, the cut-o
rate does not only dene a rate region for which we can guarantee that the
error probability can be made arbitrarily small by an appropriate choice of
n, but it also determines an error exponent for every rate in this region. It
therefore is probably the most important single parameter for characterizing
the quality of a DMC.
Lemma 12.4. The cut-o rate of a DMC satises
0 R0 C,
where C denotes the capacity of the DMC.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(12.59)

12.3. The Bhattacharyya Exponent and the Cut-O Rate

275

Proof: The upper bound R0 C follows trivially from the converse


part of the channel coding theorem for a DMC (Theorem 9.31): If R0 > C,
(12.56) would contradict (9.103).
The lower bound can be proven by applying Jensens
inequality (see Theorem 1.29 and Lemma 7.9) to the concave function t t:

(12.60)
R0 = log2 min
PX (x) PY |X (y|x)

PX ()
y
x
= log2 min

PX ()

EPX

PY |X (y|X)

(12.61)

log2 min

PX ()

EPX PY |X (y|X)

= log2 min

PX ()

(12.62)

PX (x) PY |X (y|x)

(12.63)

= log2 min {1}

(12.64)

= log2 1 = 0,

(12.65)

PX ()

i.e., R0 0.
Example 12.5. It is not dicult to prove that the R0 -achieving input distribution for binary-input DMCs is uniform:
1
PX (0) = PX (1) = .
(12.66)
2
Hence, we get for the cut-o rate of binary-input DMCs:
R0 = log2
= log2

1
2

1
PY |X (y|0) +
2

1
1
PY |X (y|0) +
4
2

(12.67)

PY |X (y|1)
PY |X (y|0)

1
PY |X (y|1) + PY |X (y|1)
4
(12.68)

= log2

1 1
+
4 2

= 1 log2 1 +

PY |X (y|0) PY |X (y|1) +
PY |X (y|0) PY |X (y|1)

= 1 log2 1 + 2DB ,

1
4

(12.69)
(12.70)
(12.71)

where we have used the Bhattacharyya distance DB as dened in Definition 9.20. Hence, in the case of a BEC with DB = log2 , we get
R0 = 1 log2 (1 + )
(compare with Example 9.22).

(12.72)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

276

Error Exponent and Channel Reliability Function

12.4

The Gallager Exponent

We have already seen that the Gallager bound (Theorem 12.2) is strictly
better than the union Bhattacharyya bound (Theorem 12.1) unless = 1
happens to be the optimal value. So we expect that if we replace the union
Bhattacharyya bound by the Gallager bound in the derivation of the error
exponent in Section 12.3, we should improve our result.
However, rather than bounding E[m ] in (12.25), it turns out to be more
convenient to begin by bounding the conditional expectation of m given that
the mth codeword is some particular value x . The Gallager bound (12.21)
gives immediately
E m X(m) = x

=
y

PY|X (y|x )

1
1+

1
PY|X (y|x ) 1+ E

1
PY|X (y|x ) 1+

1
PY|X (y|X()) 1+

=1
=m

1
PY|X (y|x ) 1+ E

M
=1
=m
M

1
PY|X (y|X()) 1+

PY|X (y|X()) 1+
=1
=m

E PY|X (y|X()) 1+

=1
=m

X(m) = x (12.73)

X(m) = x (12.74)

(12.75)

(12.76)

X(m) = x

X(m) = x .

Here in (12.74) it becomes obvious why we have conditioned on X(m) = x ,


namely so that we can take the expectation operator inside of the rst sum and
rst factor. The inequality (12.75) follows from Jensens inequality because
t t is a concave function as long as we restrict
0 1.

(12.77)

Now we make the same choices of probability distribution for the selection
of our codebook as in Section 12.3, i.e., X(m) and X(), = m, are chosen
IID pairwise independent. Therefore the expectation does not depend on the
conditioning:
1

E PY|X (y|X()) 1+

X(m) = x = E PY|X (y|X()) 1+


1

PX (x) PY|X (y|x) 1+ .

=
x

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(12.78)
(12.79)

12.4. The Gallager Exponent

277

Hence, we get from (12.76),


E m X(m) = x

1
PY|X (y|x ) 1+

PY|X (y|x )

1
1+

M
=1 x
=m

1
PX (x) PY|X (y|x) 1+

(M 1)

(12.80)

PX (x) PY|X (y|x)

1
1+

(12.81)

and hence
PX (x ) E m X(m) = x

E[m ] =

(12.82)

PX (x )
x

PY|X (y|x ) 1+ (M 1)

= (M 1)

PX (x) PY|X (y|x)

1
1+

(12.83)

x
1

PX (x ) PY|X (y|x ) 1+
x

PX (x) PY|X (y|x)

1
1+

(12.84)

x
1+

= (M 1)

PX (x) PY|X (y|x)

1
1+

(12.85)

for 0 1. Here, in the last step we have noted that x and x are
just dummy summation variables and that therefore both sums are actually
identical.
Now, making the same choice for PX () as given in (12.32) and following
a derivation very similar to (12.34)(12.41), we arrive at
n

1+

E[m ] M

n log2

= 2nR 2

PX (x) PY |X (y|x)

PX (x) PY |X (y|x) 1+

for 0 1. We dene the Gallager function

E0 (, Q)

log2

1
1+

1+

(12.87)
1+

Q(x) PY |X (y|x) 1+

(12.86)

(12.88)

where it is customary to replace PX () in (12.87) by Q(), and rewrite (12.87)


as
E[m ] 2n(E0 (,Q)R) ,

0 1.

(12.89)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

278

Error Exponent and Channel Reliability Function

Finally, we strengthen this result using the argument with half of the codewords thrown away (see (12.49)(12.55)) and arrive at the following celebrated
result by Gallager.
Theorem 12.6 (Gallager Error Exponent).
There exists a coding scheme with M = 2nR codewords of length n and
with an ML decoder such that the maximum error probability satises
(n) 4 2nEG (R) ,

(12.90)

where EG (R) is called the Gallager exponent and is given by


EG (R)

max max E0 (, Q) R
Q

with the Gallager function E0 (, Q) dened as

E0 (, Q)

log2

(12.91)

01

Q(x) PY |X (y|x)

The Gallager exponent satises


EG (R) > 0 0 R < C,

1+
1
1+

. (12.92)
(12.93)

with C denoting the capacity of the DMC.

Proof: The main result (12.90) follows from our derivation (12.73)
(12.89) above with an optimal choice for and Q(). The proof of the property
(12.93) is omitted; details can be found in [Gal68, pp. 141].
It is quite obvious that
EG (R) EB (R) = R0 R

(12.94)

with equality if, and only if, = 1 is maximizing in (12.91).


A generic plot of the shape of the Gallager and the Bhattacharyya exponents is given in Figure 12.1.
While it is not surprising that for a weakly symmetric DMC a uniform
input distribution
Q(x) =

1
,
|X |

x X,

(12.95)

maximizes E0 (, Q) for all 0 (and therefore is also maximizing for the cut1
o rate), for a binary-input DMC the uniform distribution Q(0) = Q(1) = 2
does not in general maximize the Gallager function for 0 < 1! (This is in
contrast to Example 12.5 where we have seen that the uniform distribution
achieves the cut-o rate!)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

12.4. The Gallager Exponent

279

Error exponents
R0

EG (R)

EB (R)
R
0

R0

Rcrit

Figure 12.1: The general form of the Gallager exponent EG (R) and the Bhattacharyya exponent EB (R).
A careful analysis of the Gallager exponent reveals more interesting properties.
Lemma 12.7 (Properties of the Gallager Exponent).
1. The R = 0 intercept is the cut-o rate R0 and is achieved for = 1:
EG (0) = max E0 (1, Q) = R0 .

(12.96)

2. In the interval 0 R Rcrit the Gallager exponent coincides with the


Bhattacharyya exponent, i.e., it is linear with slope 1. Here,
Rcrit =

E0 (, Q)
Q : R0 -achieving

max

(12.97)

=1

where the maximum is taken only over those channel input distributions
that achieve the cut-o rate R0 . Usually, Rcrit is called critical rate.
However, this name is a bit unfortunate because the rate is not critical
to the design of a communication system, but rather to those who try to
nd bounds on error probability.
3. The Gallager exponent EG (R) is convex and positive in the interval 0
R < C, where C denotes the capacity of the DMC. The latter was already
mentioned in the statement of the main Theorem 12.6 as it is the most
important property of EG (R).
Proof: To learn more about the behavior of EG () it is instructive to
consider the function
R E0 (, Q) R

(12.98)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

280

Error Exponent and Channel Reliability Function

for a xed value of Q and dierent values of . This function is linear with a
slope . Hence, we get a picture as shown in Figure 12.2. The fact that

E0 (1, Q)
1
2, Q

E0

max {E0 (, Q) R}

01

1
3, Q
1
E0 4 , Q

E0

E0

1
8, Q

R
Figure 12.2: Understanding the properties of the Gallager exponent EG (R):
for a xed Q, max01 {E0 (, Q) R} is an upper bound on the linear
functions E0 (, Q) R.
is restricted to 1 then means that at a certain point EG (R) will become
linear.
More details of the proof can be found in [Gal68, pp. 141].
The fact that EG (R) > 0 for R < C not only provides another proof for the
achievability part of the channel coding theorem (Theorem 9.34), but it also
gives us the additional information that, for any xed code rate R less than
capacity, we can make the maximum error probability go to zero exponentially
fast in the blocklength n.

12.5

Channel Reliability Function

Both the Bhattacharyya exponent and the Gallager exponent actually only
provide an achievability result. They show what error exponents are possible
(and we actually proved this using similar ideas based on random coding as
used for the achievability part of the channel coding theorem), but they do
not claim optimality. The question therefore is what we can say about the
best such exponent. We start by dening the best exponent, which is usually
called channel reliability function.
Denition 12.8. For any rate R 0, the channel reliability function E (R)
for a DMC is dened to be the largest scalar 0 such that there exists a
sequence of (Mn , n) coding schemes such that
1
(n)
lim log2 Pe
n n

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(12.99)

12.5. Channel Reliability Function

281

and
1
log2 Mn .
n n

R lim

(12.100)

This denition is a bit technical, but its basic meaning is quite plain: The
channel reliability function E (R) is the largest function of R such that we can
nd a rate-R coding scheme with an average error probability smaller than

2nE (R) (where as usual in these denitions we allow n to become arbitrarily


large).
From the previous sections, we immediately have the following.
Corollary 12.9. For any R 0, the channel reliability function E (R) is
lower-bounded by both the Gallager exponent and the Bhattacharyya exponent:
E (R) EG (R) EB (R).

(12.101)

The question now is whether we can nd better coding schemes, i.e.,


schemes that exhibit an even faster exponential decay of the error probability. This question in this generality has not been solved yet. However,
there do exist a couple of upper bounds on the channel reliability function.
The derivation of these bounds is considerably more subtle and tedious. So we
will only state the most important one to complete the picture, but without
proof. More details can be found, e.g., in [Gal68], [SGB66], [SGB67].
Theorem 12.10 (Sphere-Packing Error Exponent).
For any (2nR , n) coding scheme on a DMC,
(n)
Pe 2n(ESP (Ro1 (n))+o2 (n))

(12.102)

where ESP (R) is called the sphere-packing exponent and is given by


ESP (R)

sup max E0 (, Q) R
0

(12.103)

with the Gallager function E0 (, Q) given in (12.92). Hence,


E (R) ESP (R).

(12.104)

The quantities o1 (n) and o2 (n) go to zero with increasing n and can be
taken as
3 + |X | log2 n
,
n
2
e2
3
log2
,
o2 (n) = +
n
n
(PY |X )min

o1 (n) =

(12.105)
(12.106)

where
(PY |X )min

min

x,y : PY |X (y|x)>0

PY |X (y|x).

(12.107)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

282

Error Exponent and Channel Reliability Function

Note that the only dierence between EG (R) and ESP (R) is that in the
latter the maximization over is not restricted to 0 1, but is allowed
for all 0. So, as long as the maximum is achieved for a 1, both
exponents agree and thereby specify the channel reliability function E (R)
precisely! Hence, for Rcrit R C we (usually) know the reliability function
exactly. The typical picture is shown in Figure 12.3.
Error exponents

ESP (R)
R0

EG (R)
E (R)

R
0

Rcrit

Figure 12.3: Upper and lower bound to the channel reliability function E (R).
Note that there might be a problem in certain situations when ESP (R) is
innite for all rates less than a given constant called R and when R > Rcrit ,
see the second case in Figure 12.4.
Error exponents

Error exponents

ESP (R)

R0

ESP (R)

R0

EG (R)

EG (R)
R
R

Rcrit

R
Rcrit R

Figure 12.4: Pathological cases where ESP (R) is innite for R < R .
In summary we have the following corollary.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

12.5. Channel Reliability Function

283

Corollary 12.11 (Partial Error Exponent Coding Theorem). For all


rates R such that
max{Rcrit , R } R C,

(12.108)

E (R) = EG (R).

(12.109)

we have

Here, Rcrit is dened in (12.97), and


R

log2 min max


Q

Q(x) I PY |X (y|x) = 0

(12.110)

Note that R = 0 unless each output is unreachable from at least one


input.
We once again would like to mention that the capacity C, in spite of its
fundamental importance, is in some sense a less informative single parameter
for characterizing a DMC than the cut-o rate R0 . The reason is that the
knowledge of C alone tells us nothing about the magnitude of the channel
reliability function, i.e., about the exact speed of decay of the error probability
for some particular R < C. On the other hand, R0 species lower bounds on the
channel reliability function (via EB (R) = R0 R for all R < R0 ). The problem
with C alone is that both situations shown in Figure 12.5 are possible, i.e., we
can have
R0 = C

(12.111)

R0 C.

(12.112)

or

So two DMCs with the same capacity can still be very dierent because for
one DMC the error probability will decay very fast, while for the other the
decay is much slower, requiring much larger blocklengths in order to achieve
an acceptable level of error probability.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

284

Error Exponent and Channel Reliability Function

EG (R)

R0 EG (R)

R0
R
Rcrit

R0

R
C

Figure 12.5: Two extremal shapes of the Gallager exponent EG (R).

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 13

Joint Source and Channel


Coding
13.1

Information Transmission System

We would now like to combine our knowledge of source compression and data
transmission for the design of an information transmission system. To that
goal consider Figure 13.1: Given is a general discrete stationary source (DSS)
with r-ary alphabet U that we would like to transmit over a discrete memoryless channel (DMC) with input and output alphabets X and Y, respectively.
We assume that the source generates a symbol every Ts seconds, i.e., the source
generates uncertainty at a rate of
H({Uk })
bits/s.
Ts

dest.

U1 , . . . , UK

decoder

Y1 , . . . , Y n

DMC

X1 , . . . , Xn

(13.1)

encoder

U1 , . . . , UK

DSS

Figure 13.1: Information transmission system using a joint source channel


coding scheme.
In our system we choose to jointly process K source symbols that are transformed into a channel codeword of length n and then sent over the channel.
Denition 13.1. A joint source channel coding scheme consists of an encoder
that maps every source sequence (U1 , . . . , UK ) into a channel input sequence
(X1 , . . . , Xn ) and of a decoder that makes for every received channel output

sequence (Y1 , . . . , Yn ) a guess (U1 , . . . , UK ) of the transmitted source sequences.


We dene the probability of a decoding error by
(K)
Pe

K
K
Pr U1 = U1 .

(13.2)

Obviously, for this system to work the channel transmission must be


clocked in the correct speed, i.e., if we denote the duration of a channel symbol

285

c Stefan M. Moser, v. 3.2

286

Joint Source and Channel Coding

by Tc , then we must have


KTs = nTc .

(13.3)

We will now derive a very fundamental result that shows when it is possible
to transmit the sources random messages over the given channel depending
on the source entropy and the channel capacity.
As so often we start with the converse result, i.e., we will prove an upper
limitation on the source entropy, for which any transmission system will not
work.

13.2

Converse to the Information Transmission


Theorem

Recall Fanos inequality (Lemma 9.25 or Corollary 9.27) and apply it to the
K
K
random vectors U1 and U1 :
K K
(K)
(K)
H U1 U1 Hb Pe
+ Pe log |U |K 1
log 2

(K)
log 2 + Pe K log |U |.

(13.4)

|U |K

(13.5)

Hence,
K
H U1
1
H({Uk })
=
lim
Ts
Ts K K
1
K
H U1

KTs
1
1
K K
=
H U1 U1 +
I
KTs
KTs
1 (K)
log 2
+ Pe log |U | +

KTs
Ts
log 2
1 (K)

+ Pe log |U | +
KTs
Ts
log 2
1 (K)

+ Pe log |U | +
KTs
Ts
1 (K)
log 2
+ Pe log |U | +
=
KTs
Ts

(13.6)
(13.7)
K K
U1 ; U1

1
K K
I U1 ; U1
KTs
1
n
I(X1 ; Y1n )
KTs
1
nC
KTs
C
.
Tc

(13.8)
(13.9)
(13.10)
(13.11)
(13.12)

Here, (13.6) follows by the denition of the entropy rate; (13.7) follows beK
cause H(U1 )/K is decreasing in K; in the subsequent equality (13.8) we use
the denition of mutual information; in (13.9) we use (13.5); the subsequent
inequality (13.10) follows from the data processing inequality (Lemma 9.30);
(13.11) relies on our assumption of the channel being a DMC without feedback
and the denition of capacity in the same fashion as in (9.89)(9.96); and the
nally equality (13.12) is due to (13.3).

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

13.3. Achievability of the Information Transmission Theorem


(K)

Hence, any joint source channel coding scheme with Pe


must satisfy

287

0 for K

H({Uk })
C
.
Ts
Tc

(13.13)

C
H({Uk })
> ,
Ts
Tc

(13.14)

In other words, if

then the error probability is bounded from below and cannot tend to zero even
if we allow for innite delay K .

13.3

Achievability of the Information


Transmission Theorem

Next we will turn to the positive side of theorem that tells us what we can do.
As a matter of fact, it will turn out that we can prove an even better result:
Not only is it possible to transmit the stationary source {Uk } reliably over the
DMC as long as (13.13) is satised, but we can do this in a way that separates
the source from the channel. We can rstly compress the source using some
data compression scheme that is completely ignorant of the DMC, and then
apply a standard channel code for the given DMC that assumes a uniform
message source and is completely ignorant of the given stationary source.

13.3.1

Ergodicity

The derivation of the achievability part makes the assumption of ergodicity.


While we do not investigate whether this assumption holds or not, we try
to motivate why it actually is only a very weak restriction that should be
satised for most practical situations: We once again point out that the output of an optimal source compression scheme is (almost) IID and therefore
automatically ergodic (compare with Remark 9.1).
Proposition 13.2. The codeword digits Ck put out by an optimal source
coding scheme are all (almost) equally likely and (almost) independent of the
past.
Proof: By contradiction, assume that the codeword digits are not
equally likely. In this case we could nd a Human code for the output of
the ideal source coding scheme that combines the least likely digits together,
resulting in a new sequence that is more eciently representing the source
than the output of the ideal source coding scheme. This is a contradiction
to our optimality assumption of the ideal source coding scheme. Similarly we
can argue if the current codeword was dependent on the past: In this case
we could nd a new adaptive Human code that can represent the current

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

288

Joint Source and Channel Coding

sequence of codeword digits using less digits based on the knowledge of the
past. Again this is a contradiction to the optimality of our source coding
scheme.
Note that this proposition and its proof are not rigorous. However, this
is not crucial, because we do not directly need the result. The only property
that we need is that the sequence {Ck } satises the law of large numbers.
Since an IID sequence denitely does do that, it is hopefully convincing that
an almost IID sequence also does. . .
We also would like to point out that ergodicity is closely related with the
AEP (see Theorem 18.2). For more details we refer to Chapter 18.

13.3.2

An Achievable Joint Source Channel Coding Scheme

{X1 , . . . , Xn }
1

channel
encoder

{C1 , . . . , C }
1

-block
parser

{Ck }
1

adaptive
Human
encoder

{U1 , . . . , UK }
1

Figure 13.2: An encoder for the information transmission system of Figure 13.1.
We x some > 0 and then design an encoder as shown in Figure 13.2. In
a rst step, this encoder applies an optimal blocktovariable-length source
encoder (e.g., an adaptive Human encoder) that maps a length-K source
sequence into a D-ary codeword Ck . We know from our analysis of source
compression that for any stationary source {Uk } and for the given > 0,
we can choose a parser length K large enough such that the average source
codeword length E[Lk ] of the optimal D-ary codewords Ck satises
H({Uk })
E[Lk ]
H({Uk })

+
log D
K
log D

(for K large enough)

(13.15)

(see Theorem 6.3, and compare with Theorem 6.12 and 6.26). As mentioned,
K denotes the parser length (of the source coding scheme), i.e., each D-ary
source codeword Ck describes K source symbols.
We now use this optimal source coding scheme many times in sequence,
say times ( 1), such that the total resulting D-ary output sequence
(C1 , . . . , C ) has a total (random) length Lk .
k=1
We next apply an -block parser to this D-ary output sequence {Ck } and
split it up into equally long strings. In order to nd out how we should
choose , we need to consider the (random) length of the output sequence
{Ck } of the adaptive Human encoder. To that goal, as already discussed in
Section 13.3.1, we assume that this output sequence satises the weak law of
large numbers.
If the weak law of large numbers can be applied, then the probability of

1
k=1 Lk being close to E[Lk ] tends to 1 for , i.e., we can choose

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

13.3. Achievability of the Information Transmission Theorem

289

large enough such that


Pr

k=1

Lk E[Lk ] + 1

(for large enough).

(13.16)

Based on this observation, we now choose

E[Lk ] + .

(13.17)

If the (random) length of {Ck } is less than , then we simply ll in random


D-ary digits at the end.1 On the other hand, it is also possible that the
random length of {Ck } is too long:

Lk > .

(13.18)

k=1

In this case the block-parser will discard some code digits, which will then
result in a decoding error. However, luckily, the probability of this event is
very small: From (13.16) we know that

Lk > = Pr

Pr
k=1

1
Pr

Lk > E[Lk ] +

(13.19)

Lk > E[Lk ] +

(13.20)

(for large enough).

(13.21)

k=1

k=1

Finally, the length- D-ary sequences are used as input for a usual channel
coding scheme for the given DMC. This coding scheme will work reliably as
long as its rate is less than the DMCs capacity.
Hence, we have two types of errors: Either the source compression scheme
generates a too long compressed sequence (see (13.18) and (13.21)) or the
channel introduces a too strong error for our channel coding scheme:
(K)
K
K
Pe
= Pr U1 = U1

(13.22)

Pr(compression error) + Pr(channel error)


by (13.21) if
is large enough

+ = 2

(13.23)

if R < C and
n is large enough

(for R < C and for and n large enough).

(13.24)

It only remains to make sure that the code rate R of our channel coding scheme
really is below capacity. For an arbitrary > 0, we have
R=

log D
n

(13.25)

Note that since the codewords Ck are prex-free and we know their number (i.e., ),
it is easily possible to discard these randomly generated lling digits at the decoder.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

290

=
=
<

=
=

Joint Source and Channel Coding


log D
n
E[Lk ] + log D
n
1
(E[Lk ] + + 1) log D
n
H({Uk })
1
K
+ + + 1 log D
n
log D
K
K log D ( + 1) log D
H({Uk }) +
+
n
n
n
Tc log D ( + 1)Tc log D
Tc
H({Uk }) +
+
Ts
Ts
Ts K
Tc
H({Uk }) + (for small and K large
Ts

(13.26)
(by (13.17))

(13.27)
(13.28)

(by (13.15), K 1) (13.29)


(13.30)
(by (13.3))

(13.31)

enough).

(13.32)

Note that, because of (13.3), if K gets large, also n gets large.


Hence, we can guarantee that R < C (and thereby that our system works)
as long as
H({Uk })
C
< .
Ts
Tc

13.4

(13.33)

Joint Source and Channel Coding

We have proven the following fundamental result.

Theorem 13.3 (Information Transmission Theorem).


Assume a DMC of channel capacity C and a nite-alphabet stationary
and ergodic stochastic source {Uk }. Then, if
C
H({Uk })
bits/s <
bits/s
Ts
Tc

(13.34)

(where Ts and Tc are the symbol durations of the source and the channel,
respectively), there exists a joint source channel coding scheme with a
(K)
probability of a decoding error Pe 0 as K .
Conversely, for any stationary stochastic source {Uk } with
H({Uk })
C
bits/s >
bits/s,
Ts
Tc

(13.35)
(K)

the probability of a decoding error is bounded away from zero, Pe > 0,


i.e., it is not possible to send the process over the DMC with arbitrarily
low error probability.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

13.4. Joint Source and Channel Coding

291

Note that in the proofs given above we actually used two dierent approaches in the converse and in the achievability part:
Approach 1: General Design: We design an encoder that directly maps
the source output sequence into a channel input sequence. The decoder
then receives a channel output sequence and needs to guess which source
sequence has been sent. This approach has been used in the converse.
Approach 2: Source Channel Separation Coding Scheme: We compress the source into its most ecient representation and then use a
standard channel code (t for the given channel and assuming a uniform message) to transmit the compressed source codewords. The receiver rstly guesses which codeword has been sent to recover the source
codeword and then decompresses this compressed source description.
This approach has been used in the achievability proof.
A couple of remarks:
Approach 2 is a special case of Approach 1.
In Approach 2 the standard channel code is designed assuming that the
input is uniformly distributed. This works even if the source is far from
being uniform because a source compression scheme produces an output
that is (almost) uniform.
Approach 2 is much easier than Approach 1 because we decouple the
channel from the source: If we exchange the source then in Approach 1
we need to redesign the whole system, while in Approach 2 we only need
to nd a new source coding scheme. As a matter of fact, if we use a
universal source coding scheme, we do not need to change anything at
all! Similarly, if we exchange the DMC, then we only need to change the
channel code design.
Note that it is not a priori clear that Approach 2 is optimal because
the data compression scheme completely ignores the channel and the
channel coding scheme completely ignores the source. Luckily, we have
been able to show that Approach 2 is as good as Approach 1.
Hence, we see that we have not only proven the information transmission
theorem, but we have actually shown that Approach 2 works! I.e., we can
separate source from channel and design a system consisting of two completely
separate parts.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

292

Joint Source and Channel Coding

Corollary 13.4 (The Source Channel Coding Separation Theorem). Consider a nite-alphabet stationary and ergodic stochastic source
{Uk } and a DMC of capacity C such that (13.34) is satised, i.e.,
H({Uk })
C
bits/s <
bits/s.
Ts
Tc

(13.36)

An optimal way of transmitting the source sequence reliably over the


DMC is to rstly apply an optimal data compression scheme (like
LempelZiv or adaptive Human) to the source output sequence, and then
to transmit the compressed source codewords using an optimal channel
coding scheme designed for the DMC. Note that the design of the source
coding scheme is independent of the channel and the design of the channel coding scheme is independent of the source.

We would like to remark that nature does usually not follow the two-stage
design of Approach 2, but uses a direct source-channel code. Examples:
Human oral languages, transmitted through air. Even with a huge
amount of noise and distortion (e.g., party with loud music and many
people talking at the same time) we are able to understand our conversation partner!
English text that is sent over an erasure channel: Even if about 50
percent of the letters are erased, we can still decipher the text!
Also note that our proof only holds for a DMC! There are examples of
channels like, e.g., some multiple-user channels, where the two-stage approach
is strictly suboptimal.

13.5

The Rate of a Joint Source Channel Coding


Scheme

The perhaps most fundamental dierence between the information transmission theorem (Theorem 13.3) and the channel coding theorem in Chapter 9
(Theorem 9.34) is that in the latter we have a free parameter R that needs to
be chosen appropriately (i.e., R < C) in order to make sure that the system
can work. Here in Theorem 13.3, however, we are given both the DMC with
capacity C and clock Tc and the source with entropy rate H({Uk }) and clock
Ts . We then cannot anymore adapt these parameters to make them match
(unless we start to change the channel or the source!), but have to face either
the situation (13.34), which works, or (13.35), which does not work.
Nevertheless, as we have seen in the derivation of Section 13.3, we still
implicitly have a rate R internally in the encoder: The second half of the

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

13.6. Transmission above Capacity and Minimum Bit Error Rate

293

encoder of the source channel separation coding scheme consists of a standard


channel code as discussed in Chapter 9 and has a rate that is, as usual, dened
as
R=

log(# of codewords)
.
n

(13.37)

Note that in this case the number of codewords depends on how strongly the
K source digits can be compressed, i.e., it depends on the entropy rate of the
source! Recalling the derivation of the rate in (13.25)(13.32) we see that, in
an ideal system and asymptotically for K , we have
R=

Tc
H({Uk }).
Ts

(13.38)

Note that this matches perfectly to the requirement that R < C, i.e., that
(13.33) must be satised.
Also note that if the source happens to be a perfectly compressed binary
source in the rst place, i.e., if {Uk } is an IID uniform binary sequence with
H({Uk }) = 1 bit, then we can omit the source compression scheme and we
have2
R=

K
,
n

(13.39)

which by (13.3) also matches to (13.38).


So, whenever we consider the situation of a joint source and channel coding
scheme, we dene the implicit rate as
R

13.6

Tc
H({Uk }).
Ts

(13.40)

Transmission above Capacity and Minimum


Bit Error Rate

We have understood that it is possible to transmit a source over a DMC with


arbitrarily small error probability if (13.34) is satised. The natural next
question to ask therefore is whether we can say something about the minimal
error probability if we try to transmit a source with an entropy rate above
capacity.
To simplify this investigation we will assume in this section that {Uk } is
an IID uniform binary source,3 i.e., H({Uk }) = 1 bit. Moreover, we assume
2

Strictly speaking, we should not write R as in (13.39) because there we do not see the
correct units of bits. The correct form actually is R = K log D, which yields (13.39) if D = 2
n
and the logarithm is to the base 2.
3
Again, this is not a big restriction as we can transform any source into such a memoryless
uniform binary source by applying an optimal compression scheme to it.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

294

Joint Source and Channel Coding

encoder

destination

V1 , . . . , VK

lossy
U1 , . . . , UK
compressor

V1 , . . . , V K

decoder

Y1 , . . . , Yn

DMC

uniform
BMS

X1 , . . . , Xn

Figure 13.3: Lossy compression added to joint source channel coding: We


insert a lossy mapping g() between source and channel encoder to control
the introduced errors.
that the parameters of the system and the DMC are such that (13.35) holds,
i.e.,
C
1
bits/s >
bits/s.
Ts
Tc

(13.41)

We know that the probability of a message (or block) error


(K)
Pe

K
K
Pr U1 = U1

(13.42)

is strictly bounded away from zero (compare again with Figure 13.1). Actually,
(K)
it can be shown that Pe tends to 1 exponentially fast in K. However, this is
not really an interesting statement. Since already a single bit error will cause
the total length-K message to be wrong, we realize that some bits still could
K
be transmitted correctly in spite of a wrong message U1 . Hence, we are much
more interested in the average bit error probability or, as it is usually called,
the average bit error rate (BER)
Pb

1
K

Pb,k

(13.43)

k=1

where
Pb,k

Pr Uk = Uk .

(13.44)

Unfortunately, as we will show below, the converse also holds when replacing
(K)
Pe by Pb . I.e., if (13.41) holds, then also the BER is strictly bounded away
from zero. Can we say more about how far away?
A typical engineering way of trying to deal with the problem that reliable
transmission is impossible is to try to take control by adding the errors purposefully ourselves! So we add an additional building block between source
K
and encoder (see Figure 13.3) that compresses the source sequence U1 in a
lossy way such that
1 H(V1K )
C
bits/s <
bits/s.
Ts K
Tc

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(13.45)

13.6. Transmission above Capacity and Minimum Bit Error Rate

295

Then we know that V1K can be transmitted reliably over the channel and the
BER is under our full control by choosing the compressor mapping g():
K
V1K = g(U1 ).

(13.46)

Example 13.5. As an example consider a compressor scheme g() as follows:


for K even
(v1 , v2 , . . . , vK1 , vK ) = g(u1 , u2 , . . . , uK1 , uK )
(u1 , u1 , u3 , u3 , . . . , uK1 , uK1 ),

(13.47)
(13.48)

i.e., every second information bit is canceled. We then obviously have


1
1
H(V1K ) = bits
K
2

(13.49)

C
1
bits/s <
bits/s,
2Ts
Tc

(13.50)

and, assuming that

we can reliably transmit V1K , i.e., for any > 0 we have

Pr Vk = Vk

(13.51)

for K large enough. To simplify our notation we therefore now write4 Vk = Vk


and get

Pb,k = Pr Vk = Uk = Pr[Vk = Uk ] =

0
1
2

for k odd,
for k even

(13.52)

and hence
Pb =

1
1
1 1
1
1
Pb,odd + Pb,even = 0 + = .
2
2
2
2 2
4

(13.53)

So we see that we have managed to design a system with average BER Pb = 1


4
C
1

subject to 2Ts bits/s being smaller than the channel capacity Tc .


Example 13.6. Another, more elegant compressor design is based on the
(7, 4) Hamming code. This code is a length-7 binary channel code with the
following 16 codewords:
CH = 0000000, 1010001, 1110010, 0100011,
0110100, 1100101, 1000110, 0010111,
1101000, 0111001, 0011010, 1001011,
1011100, 0001101, 0101110, 1111111 .

(13.54)

It can be shown that this code provides a perfect packing of the 7-dimensional
binary space: If we dene a sphere around every codeword consisting of the
4

Strictly speaking we should say that Pr[Vk = Vk ] > 1 .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

296

Joint Source and Channel Coding

codeword itself plus all length-7 binary sequences with exactly one position
being dierent from the codeword, then rstly all these spheres will be disjoint
and secondly all spheres together will contain all possible length-7 binary
sequences. Note that each sphere contains eight length-7 sequences.
Hence, we can now design a compressor using inverse Hamming coding:
7
7
g(u7 ) = v1 if v1 is a (7, 4) Hamming codeword and u7 has at most one com1
1
7
ponent dierent from v1 . In other words, every sequence u7 is mapped to the
1
sphere center of its corresponding sphere. Due to the perfect packing property, we know that every possible sequence will belong to exactly one sphere
so that this mapping is well-dened.
Since the source produces all 27 = 128 binary sequences of length 7 with
equal probability and since always eight sequences are mapped to the same
codeword, we see that the 16 codewords will show up with equal probability.
Hence,
1
log2 16
4
H(V1K ) =
= bits,
K
7
7

(13.55)

i.e., slightly larger than the rate 1/2 bits of Example 13.5. On the other hand,
we see that
Pb,k = Pr[Uk = Vk ] =

1
8

(13.56)

because only one of the 8 sequences in each sphere will dier in the kth position
from the correct sequence, and hence
1
Pb = .
8

(13.57)

This is only half the value of the compressor design given in Example 13.5!

So, we come back to our original question: What is the optimal compressor
design, i.e., the design that minimizes the BER subject to the average output
entropy being less than the channel capacity?
This question is similarly hard to answer as to nd the optimum design of
a channel code. Shannon, however, again found a way to answer part of this
question: He managed to identify the minimum BER that is possible without
specifying how it can be achieved!
In the following we will show his proof. It is again based on the system design of Figure 13.3, i.e., we split the encoder into two parts: a lossy compressor
and a standard channel encoder.
So we assume that (13.41) holds, but that a compressor is given such that
C
1 1
H V1K < .
Ts K
Tc

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(13.58)

13.6. Transmission above Capacity and Minimum Bit Error Rate

297

We then have:
1
1
1
K
H V1K = H V1K H V1K U1
K
K
K

(13.59)

=0

1
K
= I V1K ; U1
K
1
1
K
K
= H U1 H U1 V1K
K
K
1
K
= 1 H U1 V1K
K
K
1
k1
=1
H Uk U1 , V1K
K
1

1
K

(13.60)
(13.61)
(13.62)
(13.63)

k=1
K
k=1

H(Uk |Vk ) bits,

(13.64)

K
where the rst equality follows because V1K = g(U1 ); (13.62) holds because
{Uk } is IID uniform binary; the subsequent equality follows from the chain
rule; and in the nal step we use that conditioning cannot increase entropy.
Note that

H(Uk |Vk ) = Pr[Vk = 0] H(Uk |Vk = 0) + Pr[Vk = 1] H(Uk |Vk = 1)

(13.65)

= Pr[Vk = 0] Hb Pr[Uk = 1|Vk = 0]

+ Pr[Vk = 1] Hb Pr[Uk = 0|Vk = 1]

(13.66)

Hb Pr[Vk = 0] Pr[Uk = 1|Vk = 0]


+ Pr[Vk = 1] Pr[Uk = 0|Vk = 1]

(13.67)

= Hb Pr[Uk = 1, Vk = 0] + Pr[Uk = 0, Vk = 1]

(13.68)

= Hb Pr[Uk = Vk ]

(13.69)

= Hb (Pb,k ).

(13.70)

Here, the rst equality (13.65) follows by denition of conditional entropy;


the subsequent because Uk is binary; the inequality (13.67) follows from the
concavity of Hb ():
Hb (1 ) + (1 ) Hb (2 ) Hb 1 + (1 )2 ;

(13.71)

and the nal equality is because Vk Vk by reliable communication.


Plugging (13.70) into (13.64) and using concavity once more, we get
1
1
H V1K 1
K
K

Hb (Pb,k )

(13.72)

k=1

1 Hb

1
K

Pb,k

(13.73)

k=1

= 1 Hb (Pb ) bits,

(13.74)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

298

Joint Source and Channel Coding

where the last equality follows from (13.43).


Hence, from our assumption (13.58) we therefore see that
1
1 1
C
1 Hb (Pb )
H V1K < ,
Ts
Ts K
Tc

(13.75)

which, using the denition (13.40) and the assumption that H({Uk }) = 1 bit,
transforms into
Hb (Pb ) > 1

C
R

(13.76)

C
.
R

(13.77)

or
Pb > H1 1
b

Here H1 () is the inverse function of the binary entropy function Hb () for


b
[0, 1/2] (and we dene H1 () 0 for < 0).
b
We see that if R > C, then the average bit error rate is bounded away from
zero.
There are two questions that remain:
1. Can we nd a lossy compressor that actually achieves the lower bound
(13.77)?
2. Is it possible that another system design that is not based on the idea of
Figure 13.3 (but on the more general system given in Figure 13.1) could
achieve a bit error rate that is smaller than (13.77)?
The answer to the rst question is yes, however, we will not prove it in this
class.5 The proof is based on the concept of rate distortion theory, another
of the big inventions of Shannon [Sha48]. He asked the following question:
Given a certain source and given a certain acceptable distortion, how much
can we compress the source without causing more than the given acceptable
distortion? Actually, we can turn the question also around: Given a certain
source and given desired rate, what is the smallest possible distortion? If we
now choose as a measure of distortion the bit error probability, this question is
exactly our problem described here. It turns out that the smallest achievable
distortion looks exactly like the right-hand side of (13.77).
The second question can be answered negatively. The proof is quite
straightforward and is based on Fanos inequality. Consider again the gen5

A proof is given in the advanced course [Mos13, Sec. 5.7].

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

13.6. Transmission above Capacity and Minimum Bit Error Rate

299

eral system of Figure 13.1. For such a system we have the following:
1

H U1 , . . . , UK U1 , . . . , UK
K
K
1

=
H Uk U1 , . . . , Uk1 , U1 , . . . , UK
K

1
K

(13.78)

k=1
K

H Uk Uk
k=1
K
k=1

Hb

(13.79)

1
Hb (Pb,k ) +
K

1
K

K
k=1

Pb,k log(|U | 1)

Pb,k
k=1

+ log(|U | 1)

1
K

(13.80)

Pb,k

(13.81)

k=1

= Hb (Pb ) + Pb log(|U | 1),

(13.82)

where (13.78) follows from the chain rule; (13.79) follows because conditioning
reduces entropy; (13.80) follows by Fanos inequality (15.90); (13.81) follows
from concavity; and the last equality (13.82) from the denition of the bit
error rate (13.43). Actually, note that we have just derived Fanos inequality
for the situation of bit errors:
1

H U1 , . . . , UK U1 , . . . , UK Hb (Pb ) + Pb log(|U | 1).


K

(13.83)

We can now re-derive the converse to the information transmission theorem


as follows (compare with (13.6)(13.12)):
K
H U1
H({Uk })
1
=
lim
Ts
Ts K K
1
K
H U1

KTs
1
1
K K
K K
H U1 U1 +
I U1 ; U1
=
KTs
KTs
Hb (Pb ) Pb log(|U | 1)
1
K K
+
+
I U1 ; U1

Ts
Ts
KTs
Hb (Pb ) Pb log(|U | 1)
1
n

+
+
I(X1 ; Y1n )
Ts
Ts
KTs
1
Hb (Pb ) Pb log(|U | 1)
+
+
nC

Ts
Ts
KTs
Hb (Pb ) Pb log(|U | 1)
C
=
+
+ ,
Ts
Ts
Tc

(13.84)
(13.85)
(13.86)
(13.87)
(13.88)
(13.89)
(13.90)

where (13.84) follows by the denition of the entropy rate; (13.85) because
K
H U1 /K is decreasing in K; (13.86) by denition of mutual information; the
subsequent two inequalities (13.87) and (13.88) by Fano (13.82) and the data
processing inequality (Lemma 9.30); (13.89) by (15.16) and the memorylessness of the DMC; and the nal equality holds because KTs = nTc .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

300

Joint Source and Channel Coding

If we now again assume that the source is IID uniform binary, we have
H({Uk }) = 1 bit and |U | = 2, and therefore we get from (13.90):
1
Hb (Pb )
C

.
Ts
Ts
Tc

(13.91)

Hence, we see that (13.75) and therefore also (13.77) is satised for any
system.
We have once again seen Shannons incredible capability of nding fundamental insights by sacricing some answers (in particular the question of how
to design a system). We will come back to (13.77) in Chapter 15.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 14

Continuous Random
Variables and Dierential
Entropy
So far we have restricted ourselves to discrete alphabets, i.e., all random variables did only take a nite or at most countably innite number of values. In
reality, however, there are also often cases where the alphabets are uncountably innite. We will next introduce the nice special case of continuous RVs
where we can describe the probability distribution with a probability density
function (PDF) (see also the discussion in Section 2.3).

14.1

Entropy of Continuous Random Variables

Let X be a continuous random variable with continuous alphabet X and with


probability density function (PDF) fX (), i.e.,
x

Pr[X x] =

fX (t) dt.

(14.1)

What is its entropy H(X)? To see this, we need to nd the probability mass
function (PMF) PX () for the continuous random variable X, or at least
since the PMF is not dened for a continuous random variable an approximate value of the PMF. Fix some positive value 1. Then for every
x X,
PX (x)

x+
2
x
2

fX (t) dt fX (x)

x+
2
x
2

dt = fX (x).

(14.2)

This approximation will become better if we let become smaller. So lets


use this approximation in the denition of entropy:
H(X)

PX (x) log PX (x)

(14.3)

301

c Stefan M. Moser, v. 3.2

302

Continuous Random Variables and Differential Entropy


fX (x) log fX (x)

(14.4)

fX (x) log

fX (x) log fX (x)

(14.5)

fX (x) log fX (x)

(14.6)

= log

fX (x)

log

fX (x) dx

log

fX (x) log fX (x) dx

fX (x) log fX (x) dx

(14.7)
(14.8)

as 0

Hence, we realize that


H(X) = !

(14.9)

Actually, this is quite obvious if you think about it: X can take on innitely
many dierent values!
So, the original denition of uncertainty or entropy does not make sense
for continuous random variables. How can we save ourselves from this rather
unfortunate situation? Well, we simply dene a new quantity. . .
Denition 14.1. Let X X be a continuous random variable with PDF
fX (). Then the dierential entropy h(X) is dened as follows:
h(X)

fX (x) log fX (x) dx = E[log fX (X)].

(14.10)

Note that as for the normal entropy, the basis of the logarithm denes the
unit of dierential entropy. In particular, the choice of the binary logarithm
gives a dierential entropy in bits and the natural logarithm gives a dierential
entropy in nats.
From (14.8) we know that
h(X) = H(X) + log

for 0,

(14.11)

i.e., h(X) is a shift of H(X) by innity. But this also means that
h(X) can be negative!

A negative uncertainty?!? Very embarrassing. . . !


Example 14.2. Let X be uniformly distributed on the interval [0, a] for some
value a 0, X U ([0, a]):
fX (x) =

1
a

for x [0, a],


otherwise.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(14.12)

14.1. Entropy of Continuous Random Variables

303

FY (y)
1
3
4

1
4

Figure 14.1: CDF of a random variable that is a mixture of discrete and


continuous.
We now compute the dierential entropy for X:
a

h(X) =

1
1
1
log dx = log a
a
a
a

dx =
0

a
log a = log a.
a

(14.13)

Hence,
if a > 1 = h(X) > 0;

(14.14)

if a = 1 = h(X) = 0;

(14.15)

if 0 < a < 1 = h(X) < 0;

(14.16)

if a = 0 = h(X) = .

(14.17)

So we see that h(X) = basically means that X is not random enough:


If a = 0 then X only takes on one value with probability 1. This means that it
has zero normal uncertainty which corresponds to a negative innite value of
the dierential entropy. Note that, as we have seen above, the dierential entropy will be negative innite for any nite value of the corresponding normal
entropy.

Remark 14.3. Note that even if X is continuous but contains some discrete
random variable parts, i.e., if X is a mixture between continuous and discrete
random variable, then h(X) = . This can be seen directly from the
PDF fX (), which will contain some Dirac deltas in such a situation; see the
following example.
Example 14.4. As an example, consider an random variable Y with cumulative distribution function (CDF)

0
y < 0,

y+1
FY (y) Pr[Y y] =
(14.18)
0 y < 2,
4

1
y2

(see Figure 14.1). Note that Y is not continuous, because it has two values

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

304

Continuous Random Variables and Differential Entropy

with positive probability:


1
4

Pr[Y = a] =

for a = 0 or a = 2,
otherwise.

(14.19)

But between 0 and 2, Y is uniformly distributed with a constant density.


Strictly speaking, the PDF is not dened, but we engineers sometimes like to
use Dirac deltas:
0

Dirac (t)

for all t = 0,
for t = 0,

(14.20)

where this strange value has to be thought of being such that if integrated
over, it evaluates to 1:

Dirac (t) dt

1.

(14.21)

Hence, we can write the PDF of Y as follows:


fY (y) =

1
1
1
I{y (0, 2)} + Dirac (y) + Dirac (y 2).
4
4
4

(14.22)

Note that this PDF indeed integrates to 1:

fY (y) dy

1
1
1
I{y (0, 2)} + Dirac (y) + Dirac (y 2) dy
4
4
4

2
1
1
1
Dirac (y) dy +
Dirac (y 2) dy
dy +
=
4
4
0 4

=1

(14.23)
(14.24)

=1

2 1 1
= + + = 1.
4 4 4

(14.25)

If we now daringly apply the denition of dierential entropy to this (weird)


PDF, we get
h(Y ) =

fY (y) log fY (y) dy

= E[log fY (Y )]
1
1
1
I{Y (0, 2)} + Dirac (Y ) + Dirac (Y 2)
= E log
4
4
4

(14.26)
(14.27)
, (14.28)

i.e., we see that we have a Dirac delta inside of the logarithm. Again, we
have to be careful of what the meaning of such an expression is, but considering the fact, that Dirac (0) = , it does not really surprise that we get
E[log Dirac (Y )] = and therefore h(Y ) = .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

14.2. Properties of Dierential Entropy

305

Example 14.5. Let X be zero-mean Gaussian distributed with variance 2 ,


X N 0, 2 . Then
h(X) =

x2

e 22 log

2 2
X2
1
e 22
= E log
2 2
E[X 2 ]
1
log e
= log 2 2 +
2
2 2
2
1
= log 2 2 + 2 log e
2
2
1
2
= log 2e .
2

1
2 2

x2

e 22

dx

(14.29)
(14.30)
(14.31)
(14.32)
(14.33)

Here we have demonstrated that often the calculations become much simpler
when relying on the expectation-form of the denition of dierential entropy.

14.2

Properties of Dierential Entropy

So how does h(X) behave? We have already seen that h(X) can be negative,
but that a negative value does not imply anything special apart from the fact
that h(X) < 0 means X has less uncertainty than if h(X) were positive. Now
we will explore and nd some more (strange) properties.
We start with scaling.
Exercise 14.6. Let Xd be a discrete random variable with entropy H(Xd ).
What is the entropy of 2Xd ?
H(2Xd ) =

(14.34)

Why?

So what about the relation between h(X) and h(2X) for a continuous
random variable X? To compute this, we dene a new random variable Y
aX for some a > 0. Then we know that the PDF of Y is
fY (y) =

y
1
.
fX
a
a

(14.35)

Hence,
h(aX) = h(Y )
=

(14.36)
fY (y) log fY (y) dy

1
1
y
y
log fX
fX
dy
a
a
a
Y a
y
y
y
1
1
fX
fX
dy
log fX
dy
= log a
a
a
a
Y a
Y a

(14.37)

(14.38)
(14.39)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

306

Continuous Random Variables and Differential Entropy

= log a

fX (x) dx

fX (x) log fX (x) dx

(14.40)

=1

= log a + h(X),

(14.41)

where in the second last step we have changed integration variables and dened
y
x a . Again, quite embarrassing: The uncertainty changes depending on the
scaling of the random variable!
Lemma 14.7. Let X be a continuous random variable with dierential entropy h(X). Let a R \ {0} be a real nonzero constant. Then
h(aX) = h(X) + log |a|.

(14.42)

Let c R be a real constant. Then


h(X + c) = h(X).

(14.43)

Proof: For a > 0, we have proven (14.42) above. The derivation for
a < 0 is analogous. The proof of (14.43) is left to the reader as an exercise.
Remark 14.8. Many people do not like dierential entropy because of its
weird behavior. They claim that it does not make sense and is against all
engineering feeling. However, the dierential entropy has its right of existence. So, e.g., it can very precisely classify dierent types of fading channels
depending on the uncertainty in the fading process [Mos05].

14.3

Generalizations and Further Denitions

Similarly to normal entropy, we can dene a joint and a conditional dierential


entropy.
Denition 14.9. The joint dierential entropy of several continuous random
variables X1 , . . . , Xn with joint PDF fX1 ,...,Xn is dened as
h(X1 , . . . , Xn )

Xn

X1

fX1 ,...,Xn (x1 , . . . , xn ) log fX1 ,...,Xn (x1 , . . . , xn ) dx1 dxn


(14.44)

= E[log fX1 ,...,Xn (X1 , . . . , Xn )].

(14.45)

Denition 14.10. Let X and Y be two continuous random variables with


joint PDF fX,Y (, ). Then the conditional dierential entropy of X conditional
on Y is dened as follows:
h(X|Y )

fX,Y (x, y) log fX|Y (x|y) dx dy

= E log fX|Y (X|Y ) .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(14.46)
(14.47)

14.3. Generalizations and Further Denitions

307

Besides its many strange properties, dierential entropy also shows good
behavior: Since the chain rule for PDFs work the same way as for PMFs, i.e.,
fX,Y (x, y) = fY (y)fX|Y (x|y),

(14.48)

the chain rule also holds for dierential entropy:


h(X, Y ) = h(Y ) + h(X|Y ).

(14.49)

Lets now turn to mutual information. Do we have a similar embarrassing


situation as in the case of entropy vs. dierential entropy? Luckily, the answer
here is no. You can realize this as follows:
I(X; Y ) = H(Y ) H(Y |X)

(14.50)

h(Y ) log h(Y |X) + log

= h(Y ) h(Y |X),

(14.51)
(14.52)

i.e., the strange shift by innity disappears because mutual information is a


dierence of entropies. However, note that mathematically it is not possible
to dene a quantity as a dierence of two innite values. Hence, we use the
alternative form (1.133) of mutual information as the basis of our denition.
Denition 14.11. The mutual information between two continuous random
variables X and Y with joint PDF fX,Y (, ) is dened as follows:
I(X; Y )

fX,Y (x, y) log


Y

fX,Y (x, y)
dx dy.
fX (x)fY (y)

(14.53)

From this form, it is straightforward to derive the following natural alternative


forms:
fX,Y (x, y) log

I(X; Y ) =
Y

fY |X (y|x)
dx dy
fY (y)

= h(Y ) h(Y |X)

(14.54)
(14.55)

= h(X) h(X|Y ).

(14.56)

Similarly, we extend the denition of relative entropy (Denition 1.22) to


continuous RVs.
Denition 14.12. Let f () and g() be two PDFs over the same continuous
alphabet X . The relative entropy between f () and g() as
f (x) log

D(f g)
X

f (x)
dx.
g(x)

(14.57)

Hence, as before, we can write


I(X; Y ) = D(fX,Y fX fY ).

(14.58)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

308

Continuous Random Variables and Differential Entropy

Remark 14.13. We have more good news: Mutual information and relative
entropy behave exactly as expected also for continuous random variables! I.e.,
I(X; Y ) 0

(14.59)

with equality if, and only if, X Y , and

D(f g) 0

(14.60)

with equality if, and only if, f () = g().


From this then also immediately follows that
0 I(X; Y ) = h(Y ) h(Y |X) =

h(Y ) h(Y |X)

(14.61)

i.e., conditioning reduces dierential entropy.


Moreover,
I(2X; Y ) = I(X; Y )

(14.62)

h(2X) h(2X|Y ) = h(X) + log 2 h(X|Y ) log 2

(14.63)

as can be seen from

= h(X) h(X|Y ).

(14.64)

Note that from this we also learn that


h(Y ) h(Y |2X) = h(Y ) h(Y |X),

(14.65)

i.e., the dierential entropy behaves normally in its conditioning arguments:


h(Y |2X) = h(Y |X).

(14.66)

Remark 14.14. We have now understood that entropy H() only handles
discrete random variables, while dierential entropy h() only continuous random variables. Mutual information I(; ), on the other hand, is happy with
both types of random variables. So the alert reader might ask what happens
if we start to mix both types of random variables in mutual information? For
example, letting M be a RV taking value in a nite alphabet and Y R be a
continuous RV, is it meaningful to consider I(M ; Y )?
From an engineering point of view this makes perfectly sense. For example,
consider a message M that is translated into a channel input X and then sent
over a channel that adds Gaussian noise (see Chapter 15). Indeed, this works
perfectly ne and your engineering intuition will not cheat you! Nevertheless
there are some quite delicate mathematical issues here. For example, from
I(M ; Y ) = h(Y ) h(Y |M )

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(14.67)

14.3. Generalizations and Further Denitions

309

we see that we need to be able to condition dierential entropy on a discrete


random variable. This is alright because we know that we can condition a
PDF on a discrete event of positive probability:
h(Y |M ) = EM [h(Y |M = m)]
=

mM

(14.68)

Pr[M = m] h(Y |M = m)

Pr[M = m]

mM

fY |M =m (y) log fY |M =m (y) dy.

(14.69)
(14.70)

Things are more murky when we expand mutual information the other way
around:
I(M ; Y ) = H(M ) H(M |Y ).

(14.71)

How shall we dene H(M |Y = y)? Note that the event Y = y has zero
probability! So we see that we cannot avoid dening a joint distribution of M
and Y . To do this properly, one should actually go into measure theory. We
will not do this here, but instead use a heuristic argument that will lead to a
denition (compare also with the discussion in [Lap09, Section 20.4]). So, we
would like to think of Pr[M = m|Y = y] as
??

Pr[M = m|Y = y] = lim


0

Pr[M = m, Y (y , y + )]
.
Pr[Y (y , y + )]

(14.72)

Since

    \Pr[Y \in (y-\delta, y+\delta)] = \int_{y-\delta}^{y+\delta} f_Y(\tau) \, d\tau \approx f_Y(y) \cdot 2\delta   (14.73)

and

    \Pr[M=m, \, Y \in (y-\delta, y+\delta)]
        = \Pr[M=m] \cdot \Pr[Y \in (y-\delta, y+\delta) \mid M=m]                 (14.74)
        = \Pr[M=m] \int_{y-\delta}^{y+\delta} f_{Y|M=m}(\tau) \, d\tau            (14.75)
        \approx \Pr[M=m] \cdot f_{Y|M=m}(y) \cdot 2\delta,                        (14.76)

this would give

    \Pr[M=m \mid Y=y] \overset{??}{=} \lim_{\delta \downarrow 0} \frac{\Pr[M=m] \, f_{Y|M=m}(y) \cdot 2\delta}{f_Y(y) \cdot 2\delta}   (14.77)
                      = \frac{\Pr[M=m] \, f_{Y|M=m}(y)}{f_Y(y)}.                  (14.78)

This is intuitively quite pleasing! There is only one problem left: What happens if f_Y(y) = 0? Luckily, we do not really need to bother about this because
the probability of Y taking on any value y for which f_Y(y) = 0 is zero:

    \Pr[Y \in \{y : f_Y(y) = 0\}] = 0.                                   (14.79)

So we assign an arbitrary value for these cases:

    \Pr[M=m \mid Y=y] \triangleq \begin{cases} \dfrac{\Pr[M=m] \, f_{Y|M=m}(y)}{f_Y(y)} & \text{if } f_Y(y) > 0, \\[2mm] \dfrac{1}{2} & \text{otherwise.} \end{cases}   (14.80)

Now, we are perfectly able to handle mixed expressions like I(M;Y), using
either expansion (14.67) or (14.71) (both give the same result!).

One can even handle expressions like I(M, Y_1; Y_2), although here it is better
to use the chain rule

    I(M, Y_1; Y_2) = I(Y_1; Y_2) + I(M; Y_2 | Y_1)                       (14.81)

as H(M, Y_1) is not defined.¹

¹ Note that H(M) + h(Y_1|M) is defined, though.
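To make the mixed expression I(M;Y) concrete, here is a small numerical sketch (Python, with assumed parameter values; it is not part of the notes) for the engineering example mentioned above: a binary message M is mapped to X = +1 or X = -1 and corrupted by Gaussian noise. Both expansions, (14.67) and (14.71) with the posterior (14.80), are evaluated and agree up to numerical error.

# Computing the mixed mutual information I(M;Y) for a binary message M mapped
# to X = +1 or -1 and sent over additive Gaussian noise, Y = X + Z. Both
# expansions (14.67) and (14.71) are evaluated; they agree. All parameter
# values below are illustrative assumptions.
import numpy as np

sigma2 = 0.5                      # assumed noise variance
y = np.linspace(-8, 8, 20001)     # integration grid
dy = y[1] - y[0]

def gauss(y, mean, var):
    return np.exp(-(y - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# conditional densities f_{Y|M=m} and the mixture f_Y (M uniform on {+1,-1})
f_plus, f_minus = gauss(y, +1.0, sigma2), gauss(y, -1.0, sigma2)
f_y = 0.5 * f_plus + 0.5 * f_minus

# Expansion (14.67): I = h(Y) - h(Y|M), with h(Y|M) = 1/2 log2(2*pi*e*sigma2)
h_y = -np.sum(f_y * np.log2(f_y + 1e-300)) * dy
h_y_given_m = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
I_a = h_y - h_y_given_m

# Expansion (14.71): I = H(M) - H(M|Y), using the posterior (14.80)
p_plus = 0.5 * f_plus / f_y                       # Pr[M=+1 | Y=y]
hb = -(p_plus * np.log2(p_plus + 1e-300)
       + (1 - p_plus) * np.log2(1 - p_plus + 1e-300))
I_b = 1.0 - np.sum(f_y * hb) * dy                 # H(M) = 1 bit

print(I_a, I_b)   # the two values agree up to numerical error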

14.4  Multivariate Gaussian

The multivariate Gaussian distribution is one of the most important (and
easiest!) distributions.² Therefore we spend some extra time on it.

² If you do not agree that the Gaussian distribution is easy, have a look at Appendices A
and B! Many problems involving Gaussian RVs are solvable, while for general distributions
this is often not the case. Gaussians are your friends!
Theorem 14.15. Let X \in R^n be a multivariate Gaussian random vector,
X \sim N(\mu, K), where \mu denotes the mean vector and K is the n \times n covariance
matrix

    K = E[ (X - \mu)(X - \mu)^T ],                                       (14.82)

where we assume that K is positive definite (see Appendix B.1). Then

    h(X) = \frac{1}{2} \log\bigl( (2\pi e)^n \det K \bigr).              (14.83)

Proof: The PDF of X is given as follows:

    f_X(x) = \frac{1}{\sqrt{(2\pi)^n \det K}} \, e^{-\frac{1}{2}(x-\mu)^T K^{-1} (x-\mu)}   (14.84)

(see Appendix B.9). Hence, we have

    h(X) = - E[ \log f_X(X) ]                                                                          (14.85)
         = E\Bigl[ \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{1}{2}(X-\mu)^T K^{-1} (X-\mu) \log e \Bigr]   (14.86)
         = \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{1}{2} E\Bigl[ \sum_{i,j} (X_i-\mu_i) \bigl(K^{-1}\bigr)_{i,j} (X_j-\mu_j) \Bigr] \log e   (14.87)
         = \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{1}{2} \sum_{i,j} E[(X_i-\mu_i)(X_j-\mu_j)] \bigl(K^{-1}\bigr)_{i,j} \log e   (14.88)
         = \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{1}{2} \sum_{i,j} (K)_{j,i} \bigl(K^{-1}\bigr)_{i,j} \log e   (14.89)
         = \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{1}{2} \sum_{j} \bigl(K K^{-1}\bigr)_{j,j} \log e             (14.90)
         = \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{1}{2} \sum_{j} (I)_{j,j} \log e                               (14.91)
         = \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{n}{2} \log e                                                  (14.92)
         = \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{1}{2} \log e^n                                                (14.93)
         = \tfrac{1}{2}\log\bigl((2\pi e)^n \det K\bigr).                                                                     (14.94)
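A quick Monte Carlo check of (14.83) can be reassuring. The following sketch (Python; the mean vector and covariance matrix are arbitrary assumptions) estimates h(X) = -E[log f_X(X)] by sampling and compares it with the closed-form value.

# Monte Carlo check of (14.83): the differential entropy of X ~ N(mu, K)
# equals 1/2 * log((2*pi*e)^n * det(K)). All values below are in nats.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])                    # assumed mean vector
A = rng.standard_normal((3, 3))
K = A @ A.T + np.eye(3)                            # assumed positive definite covariance
n = len(mu)

X = rng.multivariate_normal(mu, K, size=200000)
diff = X - mu
quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(K), diff)
log_f = -0.5 * (n * np.log(2 * np.pi) + np.log(np.linalg.det(K)) + quad)

h_mc = -log_f.mean()                               # Monte Carlo estimate of h(X)
h_formula = 0.5 * np.log((2 * np.pi * np.e)**n * np.linalg.det(K))
print(h_mc, h_formula)                             # the two values should be close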

Theorem 14.16. Let the random vector X \in R^n have zero mean and some
fixed (positive definite) covariance matrix K, i.e., E[X X^T] = K. Then

    h(X) \leq \frac{1}{2} \log\bigl( (2\pi e)^n \det K \bigr)            (14.95)

with equality if, and only if, X \sim N(0, K).


Proof: This theorem follows directly from a generalization of Theorem 6.20 to continuous RVs: Let X be a continuous RV with PDF f(\cdot) that
satisfies the following J constraints:

    E[r_j(X)] = \int f(x) r_j(x) \, dx = \alpha_j,   for j = 1, \ldots, J,   (14.96)

for some given functions r_1(\cdot), \ldots, r_J(\cdot) and some given values \alpha_1, \ldots, \alpha_J. Then
h(X) is maximized if, and only if,

    f(x) = f^*(x) \triangleq e^{\lambda_0 + \sum_{j=1}^{J} \lambda_j r_j(x)},   (14.97)

assuming that \lambda_0, \ldots, \lambda_J can be chosen such that (14.96) is satisfied and
(14.97) is a PDF.

The proof is analogous to the proof of Theorem 6.20, so we omit it.

Applying this result to our situation, we have the following: Under the
constraints that E[X X^T] = K, i.e., E[X_i X_j] = K_{i,j} for all i and j, and that
E[X] = 0, i.e., E[X_i] = 0 for all i, we see that the maximizing distribution
must have a PDF of the form

    f_X(x_1, \ldots, x_n) = e^{\lambda_0 + \sum_i \lambda_i x_i + \sum_{i,j} \lambda_{i,j} x_i x_j},   (14.98)

where the parameters \lambda_0, \lambda_i, and \lambda_{i,j} must be chosen such that the PDF
integrates to 1 and such that the constraints are satisfied. Note that this
is not hard to do, because it is obvious that this maximizing distribution is
multivariate Gaussian. The theorem then follows from Theorem 14.15.
This result can also be shown using the fact that the relative entropy is
nonnegative. To that aim let \phi be the density of a zero-mean, covariance-K
multivariate Gaussian random vector. Then, for a general PDF f_X(\cdot),

    0 \leq D(f_X \| \phi) = \int f_X(x) \log \frac{f_X(x)}{\phi(x)} \, dx                         (14.99)
      = -h(X) - \int f_X(x) \log \phi(x) \, dx                                                    (14.100)
      = -h(X) - E\Biggl[ \log \frac{e^{-\frac{1}{2} X^T K^{-1} X}}{\sqrt{(2\pi)^n \det K}} \Biggr]   (14.101)
      = -h(X) + \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{1}{2} E\bigl[ X^T K^{-1} X \bigr] \log e   (14.102)
      = -h(X) + \tfrac{1}{2}\log\bigl((2\pi)^n \det K\bigr) + \tfrac{n}{2} \log e                 (14.103)
      = -h(X) + \tfrac{1}{2}\log\bigl((2\pi e)^n \det K\bigr),                                    (14.104)

where (14.103) is shown in (14.85)-(14.92). Hence,

    h(X) \leq \frac{1}{2} \log\bigl( (2\pi e)^n \det K \bigr).           (14.105)

Equality is only possible if we have f_X = \phi, i.e., if X is multivariate Gaussian.


We can specialize Theorem 14.16 from a vector to a single RV.

Corollary 14.17. Let X be a zero-mean continuous random variable with
variance \sigma^2. Then

    h(X) \leq \frac{1}{2} \log\bigl( 2\pi e \sigma^2 \bigr)              (14.106)

with equality if, and only if, X \sim N(0, \sigma^2).
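As a small illustration of Corollary 14.17, the following snippet (Python; the common variance is an assumed value) compares the closed-form differential entropies of three zero-mean distributions with the same variance; the Gaussian indeed attains the largest value.

# Closed-form differential entropies (in nats) of three zero-mean distributions
# with the same variance sigma^2; the Gaussian value is the largest, as
# Corollary 14.17 predicts. The chosen variance is an illustrative assumption.
import numpy as np

sigma2 = 1.7                                   # assumed common variance
h_gauss   = 0.5 * np.log(2 * np.pi * np.e * sigma2)
h_uniform = np.log(2 * np.sqrt(3 * sigma2))    # uniform on [-a, a], a = sqrt(3*sigma2)
h_laplace = 1 + np.log(np.sqrt(2 * sigma2))    # Laplace with scale b = sqrt(sigma2/2)

print(h_gauss, h_uniform, h_laplace)           # h_gauss is the maximum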

Chapter 15

The Gaussian Channel

15.1  Introduction

After our preparation in Chapter 14, we are now ready to change to a situation where the channel alphabets are continuous (i.e., the alphabet sizes are
infinite). However, note that at the moment we still stick to a discrete time k.

The most important such continuous-alphabet discrete-time channel is the
Gaussian channel.

Definition 15.1. The Gaussian channel has a channel output Y_k at time k
given by

    Y_k = x_k + Z_k,                                                     (15.1)

where x_k denotes the input of the channel at time k and is assumed to be
a real number x_k \in R, and where \{Z_k\} is IID \sim N(0, \sigma^2), denoting additive
Gaussian noise. Note that like for a DMC there is no memory in the channel.
It is therefore quite usual for the channel definition to drop the time index k:

    Y = x + Z.                                                           (15.2)

Following our notation for the DMC we describe the Gaussian channel by
the input and output alphabets

    \mathcal{X} = \mathcal{Y} = R                                        (15.3)

and the channel law (now a conditional PDF instead of a conditional PMF!)

    f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-x)^2}{2\sigma^2}}.   (15.4)

In spite of the seemingly quite different nature of this channel compared
to a DMC, it turns out that it behaves very similarly in many respects. There
are only a few changes: To realize the most important difference, consider for
a moment the channel capacity from an intuitive point of view.

Assume first that \sigma^2 = 0. Then Z_k = 0 deterministically and Y_k = x_k
without any noise. How big is the capacity in this situation? Note that x_k
can take on infinitely many different values and Y_k can perfectly reconstruct
x_k. So, it is not hard to see that C = \infty!

So let's then assume the more realistic situation \sigma^2 > 0. How big is
capacity now? Well, note that we have no restriction on x_k, i.e., we can
choose x_k as large as we want. Therefore it is still possible to choose infinitely
many different values for x_k that are infinitely far apart from each other! In
particular, choose

    x_k \in \{0, a, 2a, 3a, 4a, \ldots\}                                 (15.5)

and let a \to \infty. Since the different values are infinitely far apart from each
other, we will always be able to reconstruct x_k from the noisy observation Y_k.
And since x_k can take on infinitely many different values, we again end up
with C = \infty!

However, note that again this is not realistic: We will not be able to choose
such large values for x_k because in any practical system the available power
is limited. We see that we need to introduce one or several constraints on the
input. The most common such constraint is an average-power constraint.

Definition 15.2. We say that the input of a channel is subject to an average-power constraint E_s if we require that every codeword x = (x_1, \ldots, x_n) satisfies

    \frac{1}{n} \sum_{k=1}^{n} x_k^2 \leq E_s                            (15.6)

for some given power E_s \geq 0.


Actually, E_s is not really a power, but a symbol energy.¹ This is because we
are in a discrete-time setup and power (energy per time) is not defined.

Example 15.3. Let's assume we choose x_k \in \{+\sqrt{E_s}, -\sqrt{E_s}\}. Then the
average-power constraint is satisfied, as can be checked easily:

    \frac{1}{n} \sum_{k=1}^{n} x_k^2 = \frac{1}{n} \sum_{k=1}^{n} E_s = E_s.   (15.7)

Let's further restrict ourselves to the case n = 1. Then the decoder looks at
Y_1 and tries to guess whether +\sqrt{E_s} or -\sqrt{E_s} has been sent. What is the
optimal decision rule?

Recall our discussion of decoding in Section 9.3: The best strategy is to
design an ML decoder (assuming that both messages x_1 = +\sqrt{E_s} and x_1 = -\sqrt{E_s}
are equally likely). Due to the symmetric noise (a Gaussian distribution is
symmetric around 0), we immediately see that

    if Y_1 \geq 0  \implies  decide +\sqrt{E_s};                         (15.8)
    if Y_1 < 0     \implies  decide -\sqrt{E_s}.                         (15.9)

¹ Therefore the name E_s: E stands for "energy" and the subscript s stands for "symbol".

15.2. Information Capacity

What is the probability of error? Again assuming that both values of X_1 are
equally likely, we get the following:

    \Pr(\text{error}) = \Pr\bigl(\text{error} \bigm| X_1 = +\sqrt{E_s}\bigr) \Pr\bigl[X_1 = +\sqrt{E_s}\bigr]
                          + \Pr\bigl(\text{error} \bigm| X_1 = -\sqrt{E_s}\bigr) \Pr\bigl[X_1 = -\sqrt{E_s}\bigr]   (15.10)
                  = \tfrac{1}{2} \Pr\bigl(\text{error} \bigm| X_1 = +\sqrt{E_s}\bigr) + \tfrac{1}{2} \Pr\bigl(\text{error} \bigm| X_1 = -\sqrt{E_s}\bigr)   (15.11)
                  = \Pr\bigl(\text{error} \bigm| X_1 = +\sqrt{E_s}\bigr)                                            (15.12)
                  = \Pr\bigl[ Y_1 < 0 \bigm| X_1 = +\sqrt{E_s} \bigr]                                               (15.13)
                  = Q\Bigl( \frac{\sqrt{E_s}}{\sigma} \Bigr),                                                       (15.14)

where the Q-function is defined as

    Q(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} e^{-\frac{\tau^2}{2}} \, d\tau = \Pr[G > x]   (15.15)

for G \sim N(0,1) (see also Appendix A). Hence, we have transformed the
Gaussian channel into a BSC with cross-over probability \epsilon = Q\bigl(\sqrt{E_s}/\sigma\bigr).

Obviously, this is not optimal. So the question is how to do it better.
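The hard-decision scheme of Example 15.3 is easy to simulate. The following sketch (Python, with assumed values for E_s and \sigma) applies the decision rule (15.8)/(15.9) to many noisy observations and compares the measured error rate with the Q-function expression (15.14).

# Simulation of Example 15.3: antipodal signaling x in {+sqrt(Es), -sqrt(Es)}
# over Gaussian noise with the sign decision rule (15.8)/(15.9). The measured
# error rate matches the Q-function expression (15.14). Parameter values are
# illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
Es, sigma = 1.0, 0.6                  # assumed symbol energy and noise std
n = 1_000_000

x = np.sqrt(Es) * rng.choice([-1.0, 1.0], size=n)    # equally likely symbols
y = x + sigma * rng.standard_normal(n)               # channel (15.1)
x_hat = np.where(y >= 0, np.sqrt(Es), -np.sqrt(Es))  # ML decision rule

p_sim = np.mean(x_hat != x)
p_theory = norm.sf(np.sqrt(Es) / sigma)              # Q(sqrt(Es)/sigma), (15.14)
print(p_sim, p_theory)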

15.2  Information Capacity

Analogously to our discussion of channel coding and capacity for a DMC, we
start by defining a purely mathematical quantity: the information capacity.

Definition 15.4. The information capacity of a Gaussian channel with average-power constraint E_s is defined as

    C_{\mathrm{inf}}(E_s) \triangleq \max_{f_X(\cdot) : \, E[X^2] \leq E_s} I(X;Y),   (15.16)

where f_X(\cdot) denotes the PDF of the input X.

Note that the constraint E[X^2] \leq E_s is not the average-power constraint (15.6),
but simply a constraint on the maximization defined here.
Let's compute the value of C_{\mathrm{inf}}(E_s):

    I(X;Y) = h(Y) - h(Y|X)                                               (15.17)
           = h(Y) - h(X + Z|X)                                           (15.18)
           = h(Y) - h(Z|X)                                               (15.19)
           = h(Y) - h(Z)                                                 (15.20)
           = h(Y) - \tfrac{1}{2} \log(2\pi e \sigma^2)                   (15.21)
           \leq \tfrac{1}{2} \log\bigl(2\pi e \, \mathrm{Var}(Y)\bigr) - \tfrac{1}{2} \log(2\pi e \sigma^2),   (15.22)

where the last inequality follows from the fact that a Gaussian RV of given
variance maximizes differential entropy (Corollary 14.17). This step can be
achieved with equality if (and only if) Y is Gaussian distributed with variance
Var(Y). This is actually possible if we choose the input to be Gaussian as
well.
So we have realized that the best thing is to choose the input to be Gaussian distributed. It only remains to figure out what variance to choose. Since
our expression depends on the variance of Y, let's compute its value:

    \mathrm{Var}(Y) = E\bigl[ (Y - E[Y])^2 \bigr]                                                (15.23)
           = E\bigl[ (X + Z - E[X] - E[Z])^2 \bigr]                                              (15.24)
           = E\bigl[ (X + Z - E[X])^2 \bigr]                                                     (15.25)
           = E\bigl[ (X - E[X])^2 \bigr] + 2 E\bigl[ Z (X - E[X]) \bigr] + E\bigl[ Z^2 \bigr]    (15.26)
           = E\bigl[ (X - E[X])^2 \bigr] + 2 E[Z] \underbrace{E\bigl[ X - E[X] \bigr]}_{=0} + E\bigl[ Z^2 \bigr]   (15.27)
           = E\bigl[ (X - E[X])^2 \bigr] + E\bigl[ Z^2 \bigr]                                    (15.28)
           = E\bigl[ X^2 \bigr] - (E[X])^2 + \sigma^2                                            (15.29)
           \leq E\bigl[ X^2 \bigr] + \sigma^2                                                    (15.30)
           \leq E_s + \sigma^2.                                                                  (15.31)

Here, in (15.25) we use that Z has zero mean; (15.27) follows because Z and
X are independent; and the last step (15.31) follows by the constraint on the
maximization. Actually, we could have got from (15.23) to (15.29) directly if
we had remembered that for X \perp Z

    \mathrm{Var}(X + Z) = \mathrm{Var}(X) + \mathrm{Var}(Z).             (15.32)

Note that the upper bound (15.31) can actually be achieved if we choose X
to have zero mean and variance E_s.
Hence,

    I(X;Y) \leq \frac{1}{2} \log\bigl( 2\pi e (E_s + \sigma^2) \bigr) - \frac{1}{2} \log\bigl( 2\pi e \sigma^2 \bigr) = \frac{1}{2} \log\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr),   (15.33)

and this upper bound can be achieved if (and only if) we choose X \sim N(0, E_s).
So we have shown the following.

Proposition 15.5. The information capacity of a Gaussian channel is

    C_{\mathrm{inf}}(E_s) = \frac{1}{2} \log\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr)   (15.34)

and is achieved by a Gaussian input X \sim N(0, E_s).
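To quantify how suboptimal the hard-decision scheme of Example 15.3 is, the following sketch (Python; the SNR values are assumptions) compares the information capacity (15.34) with the capacity 1 - H_b(\epsilon) of the BSC obtained from hard decisions, where \epsilon = Q(\sqrt{E_s}/\sigma): the hard-decision channel always loses, and it saturates at 1 bit.

# Gaussian-channel information capacity (15.34) versus the capacity of the
# hard-decision BSC from Example 15.3, eps = Q(sqrt(Es)/sigma). SNR values
# are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def hb(p):                            # binary entropy function in bits
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for snr_db in [-5, 0, 5, 10, 15]:
    snr = 10 ** (snr_db / 10)                 # Es / sigma^2
    c_gauss = 0.5 * np.log2(1 + snr)          # (15.34)
    eps = norm.sf(np.sqrt(snr))               # Q(sqrt(Es)/sigma)
    c_bsc = 1 - hb(eps)
    print(snr_db, round(c_gauss, 3), round(c_bsc, 3))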

Remark 15.6. Note that it is silly to transmit a mean E[X] \neq 0 because
the mean is deterministic and does therefore not contain any information (the
decoder can always simply subtract it from the received signal without changing the mutual information) and because transmitting a mean uses (wastes!)
power:

    E[X^2] - (E[X])^2 \leq E[X^2].                                       (15.35)

15.3  Channel Coding Theorem

Next, we will prove a coding theorem for the Gaussian channel. The proof
will be very similar to the proof of the coding theorem for a DMC given in
Chapter 9. The main difference is that we have to take into account the
average-power constraint (15.6).

We quickly repeat the definition of a coding scheme and its most fundamental parameters.

Definition 15.7. An (M, n) coding scheme for the Gaussian channel with
average-power constraint E_s consists of

  • a message set \mathcal{M} = \{1, 2, \ldots, M\};

  • a codebook of M length-n codewords x(m) = (x_1(m), \ldots, x_n(m)), where
    every codeword needs to satisfy the average-power constraint

        \frac{1}{n} \sum_{k=1}^{n} x_k^2(m) \leq E_s,   m = 1, \ldots, M;   (15.36)

  • an encoding function \mathcal{M} \to R^n that maps the message m into the
    codeword x(m); and

  • a decoding function R^n \to \hat{\mathcal{M}} that maps the received sequence y
    into a guess \hat{m} of which message has been transmitted, where usually
    \hat{\mathcal{M}} \triangleq \mathcal{M} \cup \{0\} (with 0 denoting "error").


Definition 15.8. The rate R of an (M, n) coding scheme is defined as

    R \triangleq \frac{\log_2 M}{n}                                      (15.37)

and is said to be achievable if there exists a sequence of (2^{nR}, n) coding schemes
(with codewords satisfying the power constraint (15.36)) such that the maximal probability of error tends to zero. The operational capacity is the
supremum of all achievable rates.

We now have the following fundamental result.

Theorem 15.9 (Coding Theorem for the Gaussian Channel).
The operational capacity of the Gaussian channel with average-power
constraint E_s and noise variance \sigma^2 is

    C(E_s) = \frac{1}{2} \log_2\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr)   bits per transmission.   (15.38)

Hence, operational capacity and information capacity are identical, and
we simply speak of capacity.

15.3.1  Plausibility

Before we properly prove the coding theorem, we give a plausibility argument
of why it holds.

So assume we transmit a codeword x over the channel and receive a vector
Y = x + Z. Hence, Y lies with high probability in a sphere of radius r around
x, where

    r = \sqrt{E[\|Z\|^2]} = \sqrt{E[Z_1^2 + \cdots + Z_n^2]} = \sqrt{n\sigma^2}.   (15.39)

So, think of such a sphere around each of the possible codewords. As long as
there exist no codewords whose spheres overlap, we will be able to guess the
right message with high probability.

Due to the power constraint on the input, our available energy for each
component of the codewords is limited to E_s on average. Therefore, the received sequences are likely to be in a sphere of radius

    r_{\mathrm{tot}} = \sqrt{E[\|Y\|^2]} = \sqrt{E[Y_1^2 + \cdots + Y_n^2]}       (15.40)
            = \sqrt{ \sum_{k=1}^{n} E[Y_k^2] }                                    (15.41)
            = \sqrt{ \sum_{k=1}^{n} \bigl( E[X_k^2] + E[Z_k^2] \bigr) }           (15.42)
            = \sqrt{ \sum_{k=1}^{n} E[X_k^2] + n\sigma^2 }                        (15.43)
            \leq \sqrt{ n E_s + n\sigma^2 }                                       (15.44)
            = \sqrt{ n (E_s + \sigma^2) },                                        (15.45)

where in (15.44) we have used (15.36).


Hence, to make sure that we can transmit as much information as possible,
we need to try to use as many codewords as possible, but the small spheres
of the codewords should not overlap. So, the question is: How many small
spheres of radius r can be squeezed into the large sphere of radius r_{\mathrm{tot}} without
having any overlap (see Figure 15.1)?

[Figure 15.1: The large sphere depicts the n-dimensional space of all possible
received vectors. Each small sphere depicts a codeword with some noise
around it. To maximize the number of possible messages, we try to
put as many small spheres into the large one as possible, but without
having overlap such as to make sure that the receiver will not confuse
two messages due to the noise.]

Note that a sphere of radius r in an n-dimensional space has a volume of
a_n r^n, where a_n is a constant that depends on n, but not on r. So, if we do
not worry about the shape of the spheres, then at most we can squeeze the
following number of small spheres into the big sphere:

    M = \frac{\mathrm{Vol(large\ sphere)}}{\mathrm{Vol(small\ sphere)}} = \frac{a_n r_{\mathrm{tot}}^n}{a_n r^n} = \frac{\bigl(\sqrt{n(E_s + \sigma^2)}\bigr)^n}{\bigl(\sqrt{n\sigma^2}\bigr)^n} = \Bigl( \frac{E_s + \sigma^2}{\sigma^2} \Bigr)^{\frac{n}{2}}.   (15.46)

Hence, we get a rate

    R = \frac{\log M}{n} = \frac{1}{n} \cdot \frac{n}{2} \log \frac{E_s + \sigma^2}{\sigma^2} = \frac{1}{2} \log\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr) = C_{\mathrm{inf}}(E_s).   (15.47)

This is exactly what Theorem 15.9 claims.

15.3.2  Achievability

Now we have a look at a real proof. We start with a lower bound on the rate,
i.e., the achievability part of the proof. The idea is very similar to the proof
given in Section 9.8. The only difference is the additional difficulty of having
an average-power constraint to be taken care of.

We go through the following steps:

1: Codebook Design: Identically to the coding theorem for a DMC we
generate a codebook at random. To that goal we fix a rate R, a codeword
length n, and a PDF f_X(\cdot). From the information capacity calculations
above we know that the capacity-achieving input distribution is Gaussian,
hence our first attempt is to choose

    f_X(x) = \frac{1}{\sqrt{2\pi E_s}} \, e^{-\frac{x^2}{2 E_s}}.        (15.48)

But if we generate all codewords like that, what happens with the average-power constraint? There is hope: By the weak law of large numbers
we know that

    \frac{1}{n} \sum_{k=1}^{n} X_k^2 \to E[X^2] = E_s   in probability.  (15.49)

Hence, it looks like that with a high chance the randomly generated
codewords will satisfy the power constraint.

But we have to be careful with the details! What we actually need in our
analysis of the error probability is the following property: For any given
\delta > 0 we must have that

    \Pr\Bigl[ \frac{1}{n} \sum_{k=1}^{n} X_k^2 > E_s \Bigr] \leq \delta.   (15.50)

Unfortunately, (15.49) only guarantees that for given \delta > 0 and \epsilon > 0
there exists an n_0 such that for all n \geq n_0

    \Pr\biggl[ \Bigl| \frac{1}{n} \sum_{k=1}^{n} X_k^2 - E[X^2] \Bigr| > \epsilon \biggr] \leq \delta.   (15.51)

Let's choose \epsilon = \delta. Then we get from (15.51):

    \Pr\biggl[ \Bigl| \frac{1}{n} \sum_{k=1}^{n} X_k^2 - E[X^2] \Bigr| > \delta \biggr]                    (15.52)
      = \Pr\biggl[ \Bigl\{ \frac{1}{n} \sum_{k=1}^{n} X_k^2 - E[X^2] > \delta \Bigr\} \cup \Bigl\{ \frac{1}{n} \sum_{k=1}^{n} X_k^2 - E[X^2] < -\delta \Bigr\} \biggr]   (15.53)
      \geq \Pr\biggl[ \frac{1}{n} \sum_{k=1}^{n} X_k^2 - E[X^2] > \delta \biggr]                            (15.54)
      = \Pr\biggl[ \frac{1}{n} \sum_{k=1}^{n} X_k^2 > E[X^2] + \delta \biggr].                              (15.55)

This differs from (15.50) by a small \delta added to the expected energy E[X^2].
So, to make sure that (15.55) agrees with the needed (15.50), we must
choose

    E[X^2] + \delta = E_s,                                               (15.56)

i.e., for the random generation of the codewords we do not use the distribution in (15.48), but instead we reduce the variance slightly: We
generate every letter of all codewords IID \sim N(0, E_s - \delta):

    f_X(x) = \frac{1}{\sqrt{2\pi (E_s - \delta)}} \, e^{-\frac{x^2}{2 (E_s - \delta)}}.   (15.57)

Note that since \delta can be chosen arbitrarily small, we are not really concerned about this small change.

So we randomly generate the codebook IID according to (15.57), and
then show the codebook to both the encoder and the decoder.

Remark 15.10. Note that in our random design of the codebook we
do not have the guarantee that all codewords will satisfy the power constraint (15.36)! (We only have the property that a violation of the power
constraint is not very likely to happen too often.) However, in spite
of this, we do not delete a codeword that does not satisfy the power
constraint!

From a practical point of view this makes no sense at all, but it simplifies
our analysis considerably: If we checked every codeword and regenerated
some of them if they did not satisfy the power constraint, then we would
introduce a dependency between the different letters in each codeword
and our analysis would break down. Instead we will take care of the
illegal codewords at the receiver side, see below.
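The effect exploited in this codebook design is easy to see numerically. The following sketch (Python; the values of E_s and the margin are assumptions) draws random codewords with letters IID \sim N(0, E_s - \delta) and measures how often the average-power constraint (15.36) is violated; by the weak law of large numbers this fraction shrinks as the blocklength n grows.

# Random codewords with letters IID N(0, Es - delta): the fraction of codewords
# violating the average-power constraint (15.36) decays with the blocklength n.
# Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
Es, delta = 1.0, 0.05               # assumed power constraint and margin
for n in [10, 100, 1000]:
    X = rng.normal(0.0, np.sqrt(Es - delta), size=(10000, n))   # 10000 codewords
    frac_violating = np.mean(np.mean(X**2, axis=1) > Es)
    print(n, frac_violating)        # shrinks as n grows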
2: Encoder Design: Given some message m \in \{1, \ldots, 2^{nR}\} that is chosen
by the source according to a uniform distribution

    \Pr[M = m] = \frac{1}{2^{nR}},                                       (15.58)

the encoder picks the mth codeword X(m) and sends it over the channel.
3: Decoder Design: We again use a threshold decoder based on the instantaneous mutual information

    i(x; y) \triangleq \log \frac{f_{X,Y}(x, y)}{f_X(x) f_Y(y)},         (15.59)

where

    f_{X,Y}(x, y) = \prod_{k=1}^{n} f_X(x_k) f_{Y|X}(y_k | x_k)          (15.60)

with f_X(\cdot) defined in (15.57) and f_{Y|X}(\cdot|\cdot) denoting the channel law, i.e.,
the conditional PDF of the channel output given the channel input given
in (15.4).

The decoder receives a sequence Y and searches for an \hat{m} such that, for
some given threshold \xi > 0,

  • for \hat{m}:

        i(X(\hat{m}); y) > \log_2 \xi;                                   (15.61)

  • for all m' \neq \hat{m}:

        i(X(m'); y) \leq \log_2 \xi;                                     (15.62)

  • the codeword X(\hat{m}) satisfies the average-power constraint

        \frac{1}{n} \sum_{k=1}^{n} X_k^2(\hat{m}) \leq E_s.              (15.63)

If the decoder can find such an \hat{m}, then it will put out this \hat{m}; otherwise
it puts out \hat{m} = 0, i.e., it declares an error.

We choose the threshold \xi as in Section 9.8 to be

    \xi \triangleq 2^{n (I(X;Y) - \delta)}                               (15.64)

for some fixed \delta > 0. Here I(X;Y) is the mutual information between
a channel input symbol X and its corresponding output symbol Y when
the input has the distribution f_X(\cdot) given in (15.57), i.e.,

    I(X;Y) = h(Y) - h(Y|X)                                               (15.65)
           = h(X + Z) - h(Z)                                             (15.66)
           = \tfrac{1}{2} \log\bigl( 2\pi e (E_s - \delta + \sigma^2) \bigr) - \tfrac{1}{2} \log\bigl( 2\pi e \sigma^2 \bigr)   (15.67)
           = \tfrac{1}{2} \log\Bigl( 1 + \frac{E_s - \delta}{\sigma^2} \Bigr).   (15.68)

Note that if the power constraint is not satisfied by the used codeword,
the decoder will for sure declare an error, even if it actually could have
decoded correctly. With this (from a practical point of view very strange)
decoding rule we make sure that any codebook that contains codewords
that violate the power constraint will behave poorly.
4: Performance Analysis: Identically to the argument given in the proof
of the coding theorem of a DMC, without loss of generality we can assume
that M = 1 is transmitted (recall that each codeword is generated randomly and independently of all other codewords anyway, so it does not
matter which message is chosen!). We now define the following events:

    F_0 \triangleq \Bigl\{ \frac{1}{n} \sum_{k=1}^{n} X_k^2(1) > E_s \Bigr\},              (15.69)
    F_m \triangleq \bigl\{ i(X(m), Y) > \log_2 \xi \bigr\},   m = 1, \ldots, 2^{nR},       (15.70)

i.e., F_0 is the event that the transmitted codeword violates the power
constraint, and F_m is the event that the mth codeword and the received
sequence Y have an instantaneous mutual information above the given
threshold. Hence an error occurs if F_0 occurs (i.e., the power constraint is
violated), if F_1^c occurs (i.e., if the transmitted codeword and the received
sequence have a too small instantaneous mutual information), or if F_2 \cup
F_3 \cup \cdots \cup F_{2^{nR}} occurs (i.e., if one or more wrong codewords have an
instantaneous mutual information with the received sequence Y that is
too big). We get:

    \Pr(\text{error}) = \Pr\bigl(F_0 \cup F_1^c \cup F_2 \cup F_3 \cup \cdots \cup F_{2^{nR}} \bigm| M = 1\bigr)   (15.71)
               \leq \Pr(F_0 \mid M = 1) + \Pr(F_1^c \mid M = 1) + \sum_{m=2}^{2^{nR}} \Pr(F_m \mid M = 1).         (15.72)

From (15.55) and (15.56) we know that

    \Pr(F_0 \mid M = 1) \leq \delta.                                     (15.73)

Then from our choice (15.64) and from the weak law of large numbers it
follows that

    \Pr(F_1^c \mid M = 1) = \Pr\bigl[ i(X(1); Y) \leq \log_2 \xi \bigm| M = 1 \bigr]          (15.74)
        = \Pr\Bigl[ \sum_{k=1}^{n} i(X_k; Y_k) \leq n \bigl( I(X;Y) - \delta \bigr) \Bigr]    (15.75)
        = \Pr\Bigl[ \frac{1}{n} \sum_{k=1}^{n} i(X_k; Y_k) \leq I(X;Y) - \delta \Bigr]        (15.76)
        \leq \delta   (for n large enough).                                                   (15.77)

And for m \geq 2, we have

    \Pr(F_m \mid M = 1) = \Pr\bigl[ i(X(m); Y) > \log_2 \xi \bigm| M = 1 \bigr]                                 (15.78)
        = \Pr\Bigl[ \log_2 \frac{f_{X,Y}(X(m), Y)}{f_X(X(m)) f_Y(Y)} > \log_2 \xi \Bigm| M = 1 \Bigr]           (15.79)
        = \Pr\Bigl[ \frac{f_{X,Y}(X(m), Y)}{f_X(X(m)) f_Y(Y)} > \xi \Bigm| M = 1 \Bigr]                         (15.80)
        = \iint_{(x,y) \,:\, f_{X,Y}(x,y) > \xi f_X(x) f_Y(y)} f_X(x) f_Y(y) \, dx \, dy                         (15.81)
        = \iint_{(x,y) \,:\, f_X(x) f_Y(y) < \frac{1}{\xi} f_{X,Y}(x,y)} f_X(x) f_Y(y) \, dx \, dy               (15.82)
        < \iint_{(x,y) \,:\, f_X(x) f_Y(y) < \frac{1}{\xi} f_{X,Y}(x,y)} \frac{1}{\xi} f_{X,Y}(x, y) \, dx \, dy   (15.83)
        \leq \frac{1}{\xi} \underbrace{\iint f_{X,Y}(x, y) \, dx \, dy}_{=1} = \frac{1}{\xi}.                    (15.84)

Here, again, we make use of the fact that X(m) is independent of Y such
that their joint density is in product form.
Combining these results with (15.72) now yields

    \Pr(\text{error}) < \delta + \delta + \sum_{m=2}^{2^{nR}} \frac{1}{\xi}                 (15.85)
               = 2\delta + \bigl( 2^{nR} - 1 \bigr) \, 2^{-n(I(X;Y) - \delta)}              (15.86)
               \leq 2\delta + 2^{-n(I(X;Y) - \delta - R)}                                   (15.87)
               \leq 3\delta,                                                                (15.88)

if n is sufficiently large and if I(X;Y) - \delta - R > 0 so that the exponent is
negative. Hence we see that for any \delta > 0 we can choose n such that the
average error probability, averaged over all codewords and all codebooks,
is less than 3\delta as long as

    R < I(X;Y) = \frac{1}{2} \log\Bigl( 1 + \frac{E_s - \delta}{\sigma^2} \Bigr),           (15.89)

where we have used (15.68). Note that this is arbitrarily close to the
information capacity.

5: Strengthening: The strengthening part of the proof (i.e., the part where
we show that the result holds also for the maximum error probability,
and not only for the average error probability) is identical to the coding
theorem for the DMC case. We therefore omit it.

This concludes the achievability proof.

15.3.3  Converse

Consider any (2^{nR}, n) coding scheme that satisfies the average-power constraint (15.36) for all m = 1, \ldots, 2^{nR}. We remind the reader about Fano's
inequality (Corollary 9.27):

    H(M \mid \hat{M}) \leq \log 2 + P_e^{(n)} \log 2^{nR} = 1 + P_e^{(n)} n R   bits,   (15.90)

where P_e^{(n)} \triangleq \Pr[\hat{M} \neq M]. Recalling that M is uniformly distributed over
\{1, \ldots, 2^{nR}\}, i.e., H(M) = \log_2 2^{nR} = nR bits, we bound as follows:

    nR = H(M)                                                                                       (15.91)
       = I(M; \hat{M}) + H(M \mid \hat{M})                                                          (15.92)
       \leq I(M; \hat{M}) + 1 + P_e^{(n)} nR                                (by (15.90))            (15.93)
       \leq I(X_1^n; Y_1^n) + 1 + P_e^{(n)} nR                              (DPI)                   (15.94)
       = h(Y_1^n) - h(Y_1^n \mid X_1^n) + 1 + P_e^{(n)} nR                                          (15.95)
       = h(Y_1^n) - h(Z_1^n) + 1 + P_e^{(n)} nR                                                     (15.96)
       = \sum_{k=1}^{n} \bigl[ h(Y_k \mid Y_1^{k-1}) - h(Z_k \mid Z_1^{k-1}) \bigr] + 1 + P_e^{(n)} nR   (chain rule)   (15.97)
       = \sum_{k=1}^{n} \bigl[ h(Y_k \mid Y_1^{k-1}) - h(Z_k) \bigr] + 1 + P_e^{(n)} nR             (\{Z_k\} IID)       (15.98)
       \leq \sum_{k=1}^{n} \bigl[ h(Y_k) - h(Z_k) \bigr] + 1 + P_e^{(n)} nR   bits.                                     (15.99)

Here, (15.94) follows from the data processing inequality (Lemma 9.30) based
on the Markov chain M \to X \to Y \to \hat{M} (see Definition 9.28), and the
last inequality follows because conditioning cannot increase entropy.
Now let

    E_k \triangleq \frac{1}{2^{nR}} \sum_{m=1}^{2^{nR}} x_k^2(m)         (15.100)

be the average energy of the kth component of the codewords when averaged
uniformly over all codewords. Note that due to the average-power constraint
(15.36) we have

    \frac{1}{n} \sum_{k=1}^{n} E_k \leq E_s.                             (15.101)

Moreover note that

    E[Y_k^2] = E\bigl[ (X_k + Z_k)^2 \bigr]                                                       (15.102)
             = E[X_k^2] + E[Z_k^2]                          (X_k \perp Z_k and E[Z_k] = 0)        (15.103)
             = E\bigl[ x_k^2(M) \bigr] + \sigma^2           (X_k is random because of M)          (15.104)
             = \sum_{m=1}^{2^{nR}} \frac{1}{2^{nR}} x_k^2(m) + \sigma^2   (M is uniformly distributed)   (15.105)
             = E_k + \sigma^2                               (by definition (15.100)).             (15.106)

Hence, we continue with (15.99) as follows:

    R \leq \frac{1}{n} \sum_{k=1}^{n} \bigl[ h(Y_k) - h(Z_k) \bigr] + \frac{1}{n} + P_e^{(n)} R                                                   (15.107)
      \leq \frac{1}{n} \sum_{k=1}^{n} \Bigl[ \frac{1}{2} \log_2\bigl( 2\pi e \, \mathrm{Var}(Y_k) \bigr) - h(Z_k) \Bigr] + \frac{1}{n} + P_e^{(n)} R   (15.108)
      \leq \frac{1}{n} \sum_{k=1}^{n} \Bigl[ \frac{1}{2} \log_2\bigl( 2\pi e \, E[Y_k^2] \bigr) - h(Z_k) \Bigr] + \frac{1}{n} + P_e^{(n)} R        (15.109)
      = \frac{1}{n} \sum_{k=1}^{n} \Bigl[ \frac{1}{2} \log_2\bigl( 2\pi e (E_k + \sigma^2) \bigr) - \frac{1}{2} \log_2\bigl( 2\pi e \sigma^2 \bigr) \Bigr] + \frac{1}{n} + P_e^{(n)} R   (15.110)
      = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{2} \log_2\Bigl( 1 + \frac{E_k}{\sigma^2} \Bigr) + \frac{1}{n} + P_e^{(n)} R                            (15.111)
      \leq \frac{1}{2} \log_2\Bigl( 1 + \frac{1}{n} \sum_{k=1}^{n} \frac{E_k}{\sigma^2} \Bigr) + \frac{1}{n} + P_e^{(n)} R                         (15.112)
      \leq \frac{1}{2} \log_2\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr) + \frac{1}{n} + P_e^{(n)} R   bits.                                            (15.113)

Here, (15.108) follows from Theorem 14.16, which shows that for given variance and mean the differential entropy is maximized by a Gaussian RV; the
subsequent inequality (15.109) follows from

    \mathrm{Var}(Y_k) = E[Y_k^2] - (E[Y_k])^2 \leq E[Y_k^2];             (15.114)

(15.110) follows from (15.106); in (15.112) we apply Jensen's inequality (Theorem 1.29); and the final step (15.113) is due to (15.101) and the monotonicity
of \log(\cdot).

Hence, if P_e^{(n)} \to 0 as n \to \infty, we must have

    R \leq \frac{1}{2} \log\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr).       (15.115)

This concludes the proof of the converse and thereby the proof of the coding
theorem.

15.4  Joint Source and Channel Coding Theorem

Similarly to the discrete-alphabet case, next we would like to combine data
compression and channel transmission. I.e., we will look at the combined information transmission system shown in Figure 15.2, where a general discrete
stationary source shall be transmitted over the Gaussian channel.

[Figure 15.2: Information transmission system using a joint source channel
coding scheme: a DSS emits U_1, ..., U_K, the encoder maps them to
X_1, ..., X_n, the Gaussian channel outputs Y_1, ..., Y_n, and the decoder
delivers the estimates to the destination.]

If we recall our derivations of the information transmission theorem (Theorem 13.3) and the source channel coding separation theorem (Corollary 13.4)
for a DMC, we will quickly note that the derivations can be taken over one-to-one to the situation of a Gaussian channel. So we will not bother to re-derive
the results again, but simply state them. Note that instead of C we will directly
use the capacity expression of the Gaussian channel (15.38).

Theorem 15.11 (Information Transmission Theorem and Source
Channel Coding Separation Theorem for the Gaussian Channel).
Consider a Gaussian channel with average-power constraint E_s and noise
variance \sigma^2, and a finite-alphabet stationary and ergodic source \{U_k\}.
Then, if

    \frac{H(\{U_k\})}{T_s}  bits/s  <  \frac{1}{2 T_c} \log_2\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr)  bits/s   (15.116)

(where T_s and T_c are the symbol durations of the source and the channel,
respectively), there exists a joint source channel coding scheme with a
probability of a decoding error P_e^{(K)} \to 0 as K \to \infty.

Moreover, it is possible to design the transmission system such as to
firstly compress the source data in a fashion that is independent of the
Gaussian channel and then use a channel code for the Gaussian channel
that is independent of the source.

Conversely, for any stationary source \{U_k\} with

    \frac{H(\{U_k\})}{T_s}  bits/s  >  \frac{1}{2 T_c} \log_2\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr)  bits/s,   (15.117)

the probability of a decoding error is bounded away from zero, P_e^{(K)} > 0,
i.e., it is not possible to send the process over the Gaussian channel with
arbitrarily low error probability.

We can also take over our discussion from Section 13.6 about the limitations of communication when we transmit above capacity: The lower bound
on the average bit error rate (BER) given in expression (13.77) also holds for
the Gaussian channel.

So, we assume an IID uniform binary source and, in (13.77), we replace
the capacity C by the capacity expression of the Gaussian channel (15.34) and
define

    N_0 \triangleq 2\sigma^2                                     (noise energy level in Joule);²       (15.118)
    E_b \triangleq E_s \frac{T_s}{T_c} = E_s \frac{1}{R}          (energy per information bit in Joule);   (15.119)
    \gamma_b \triangleq \frac{E_b}{N_0}                           (signal-to-noise ratio).               (15.120)

Here we again have used (compare with (13.38))

    R = \frac{T_c}{T_s} H(\{U_k\}) = \frac{T_c}{T_s}   bits       (15.121)

for \{U_k\} IID uniform binary.

² For a justification for this definition, see (16.46).


Using these definitions, we transform (13.77) into

    H_b(P_b) > 1 - \frac{1}{2R} \log_2\bigl( 1 + 2 R \gamma_b \bigr)     (15.122)

or

    P_b > H_b^{-1}\Bigl( 1 - \frac{1}{2R} \log_2\bigl( 1 + 2 R \gamma_b \bigr) \Bigr).   (15.123)

Here, again, H_b^{-1}(\cdot) is the inverse function of the binary entropy function H_b(\cdot)
for arguments in [0, 1/2] (and we define H_b^{-1}(\cdot) \triangleq 0 for negative arguments).
As we have discussed in Section 13.6 already, this lower bound is tight, i.e.,
there exist schemes that can approach it arbitrarily closely.
[Figure 15.3: Shannon limit for the Gaussian channel: the smallest achievable
BER P_b as a function of the SNR \gamma_b in dB, for R = 1/2 and R = 1/3.]

The curve (15.123) is shown in Figure 15.3 as a function of the signal-to-noise ratio (SNR) \gamma_b for R = 1/2 and R = 1/3. We see that there exists a rate-1/2
system design that can achieve an average bit error rate P_b = 10^{-5} at an SNR
\gamma_b close to 0 dB, i.e., for E_b \approx N_0.

On the other hand, we are sure that no system with a rate R = 1/2 can yield
a BER less than 10^{-5} if the signal energy per information bit E_b is less than
the noise energy level N_0.
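The curve in Figure 15.3 is straightforward to reproduce. The following sketch (Python; the rate and SNR values are assumptions) evaluates the Shannon limit (15.123), computing H_b^{-1} by bisection; for R = 1/2 the bound drops to zero around E_b/N_0 = 0 dB, in agreement with the discussion above.

# Evaluating the Shannon limit (15.123): the smallest possible BER Pb as a
# function of Eb/N0 for a fixed rate R. H_b^{-1} is computed by bisection.
# Rate and SNR values are illustrative assumptions.
import numpy as np

def hb(p):                           # binary entropy function in bits
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def hb_inv(x):                       # inverse of H_b on [0, 1/2]; 0 for x <= 0
    if x <= 0:
        return 0.0
    lo, hi = 0.0, 0.5
    for _ in range(60):              # bisection (hb is increasing on [0, 1/2])
        mid = (lo + hi) / 2
        if hb(mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

R = 0.5                              # assumed rate in bits per channel use
for gamma_db in [-2, -1.6, -1, 0, 1, 2]:
    gamma = 10 ** (gamma_db / 10)    # Eb/N0
    pb_min = hb_inv(1 - np.log2(1 + 2 * R * gamma) / (2 * R))
    print(gamma_db, pb_min)          # for R = 1/2 the bound is zero from 0 dB on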
Due to its fundamental importance, the lower bound (15.123) is called the
Shannon limit. For many years after Shannon had drawn this line in 1948
[Sha48], researchers tried very hard to find a design that can actually
achieve this lower bound. Over the years and with great effort, they managed
to gradually close the gap between the existing transmission schemes and the
theoretical limit. For a long period between the 1960s and the early 1990s, the
Shannon limit could only be approached to around 2 dB. However, in 1993
a big breakthrough was presented at a conference in Geneva: The turbo code
was the first design that managed to get within 1 dB of the limit [BGT93].

A couple of years later, between 1996 and 1998, the low-density parity-check
(LDPC) codes were rediscovered [MN96], [DM98] and shown to reduce the
performance gap to even smaller values, e.g., 0.1 dB. We use "rediscover"
here because LDPC codes were originally proposed by Robert G. Gallager in
1962 [Gal62]. Yet, at that time, the research community did not realize their
potential. One might argue that due to their high complexity, the computers
at that time could not perform any simulations on LDPC codes and that
therefore this discovery went unnoticed for so long. However, interestingly
enough, Gallager actually ran quite a lot of numerical computations in the
early '60s already and demonstrated performance values of LDPC codes up to
an error probability of 10^{-4}. They were completely ignored for 35 years...

So it took the engineers about 50 years (from 1948 to 1998) until they
finally managed to catch up with the pioneering prediction of Shannon for a
DMC and for the Gaussian channel.

Chapter 16

Bandlimited Channels

In Chapter 15 we have dropped the assumption of discrete alphabets and
allowed more general continuous alphabets. Now we will also drop the simplification of discrete time and consider a continuous-time channel.

As a matter of fact, in the physical world almost every signal and system is
of continuous-time nature. Still, we engineers prefer to work with discrete-time
problems because they are so much easier to handle and control. Therefore
textbooks often start directly with a discrete-time model and simply assume
that the sampling and digitalization has been taken care of. This script is no
exception!

In this chapter, however, we will try to get a flavor of the real physical
world and how one can try to convert it into a discrete-time setup. We have
to warn the reader that this conversion from continuous time to discrete time
is mathematically very tricky. We will therefore try to give a rough overview
only and skip over quite a few mathematical details. For those readers who
are interested in a clean treatment, the book by Amos Lapidoth [Lap09] is
highly recommended.

Note that in this chapter we will be using the Fourier transform. It is
assumed that the reader has a good basic knowledge of the Fourier transform
and its properties.

16.1  Additive White Gaussian Noise Channel

In contrast to the regularly clocked discrete-time systems, in real continuous-time systems the shape of the signal is free to change at any time, but we
usually have a constraint concerning the used frequencies: Only a limited
frequency band is available, either because of technical limitations (any filter
or amplifier has implicit band limitations) or, even more importantly, because
the frequency bands are licensed. Hence, the input signal to our channel is
not only constrained with respect to its average power, but also with respect
to its frequencies.

We consider the following channel model, which is the simplest, but at
the same time also the most important continuous-time channel model in
engineering. Its name is additive white Gaussian noise (AWGN) channel.


For a given input signal at time t R, the channel output is
Y (t) = x(t) + Z(t).

(16.1)

As already mentioned, the input x() has to satisfy an average-power constraint


P and is bandlimited to W Hz, i.e., it only may contain frequencies in the
interval [W, W].
The term Z(t) denotes additive noise. We will assume that Z() is a white
Gaussian noise process with a double-sided1 power spectral density (PSD)
that is at inside of the band [W, W] and has a value N0 . Such a PSD is
2
called white in an analogy to white light that contains all frequencies equally
strongly. Since the input of the channel is restricted to lie inside the band
[W, W], it seems reasonable that we do not need to care what the PSD of
Z() looks like outside of this band. This intuition can be proven formally.
Claim 16.1. Without loss of generality we may bandlimit Y () to the band
[W, W].
Proof: We omit the proof and refer the interested reader to [Lap09].
This result is intuitively pleasing because all components of Y () outside the
frequency band only contain noise and no signal components, so it indeed
seems clear that we can ignore them.
Hence, we do not bother about specifying the PSD completely, but simply
leave its value outside of [W, W] unspecied. Note, however, that the PSD
certainly cannot be white for all frequencies because such a process would
have innite power!
Beside the already mentioned constraints on the used power and frequencies, from a practical point of view, the input signal also needs some kind
of time limitation because any practical system will only send nite-duration
signals. But here our rst mathematical problem occurs: By a fundamental property of the Fourier transform no signal can simultaneously be strictly
timelimited and bandlimited!
To deal with this issue we set up the system as follows. We design an
encoder that receives a certain number of IID uniform binary information bits
as input. Depending on the value of these information bits, it then chooses
from a given set of strictly bandlimited signals one signal that it transmits
over the channel. We could call these signals codeword-signals, since they
corresponds to the codewords used in the discrete-time setup. Recall that
beside the bandwidth restrictions all codeword-signals also need to satisfy the
average-power constraint (where the average is to be understood to be over
the time-duration of the signal).
1

We call the PSD double-sided because we dene it both for positive and negative frequencies, < f < . Some people prefer to only consider the (physically possible)
positive frequencies f 0. The PSD is then called single-sided and is simply twice the value
of the double-sided PSD, such that when integrated over its range, both version yield the
same value for the total power.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

16.1. Additive White Gaussian Noise Channel

333

The decoder will receive a distorted version of the transmitted signal. It


will observe this received signal for a certain limited time duration of T seconds
and then make a guess of which codeword signal has been transmitted. Note
that we can now dene the rate of our system in a similar fashion to the
discrete-time setup: The rate is dened as the logarithmic number of possible
codeword-signals, normalized to the duration of the signal T (or rather the
duration of the observation of the signal).
Due to the noise in the channel and because the decoder only observes
during T seconds, there will be a certain strictly positive probability of making
an error. However, we hope that as long as the rate is small enough, we
can make this error arbitrarily small by making T large. So, we see that the
signal duration (observation duration) T in the continuous-time coding system
corresponds to the blocklength n for a discrete-time coding system.
We describe this setup mathematically.
Denition 16.2. A 2RT , T) coding scheme for the AWGN channel with
average-power constraint P consists of
2RT codeword-signals x(m) () that are bandlimited to W and that satisfy
the average-power constraint
1
T

T
0

x(m) (t)

dt P,

m = 1, . . . , 2RT ;

(16.2)

an encoder that for a message M = m transmits the codeword-signal


x(m) (); and

a decoder that makes a decision M about which message has been


sent based on the received signal Y (t) for t [0, T].
Here, R denotes the rate and is dened as (compare to Denition 9.16):
R

log2 (# of codeword-signals)
,
T

(16.3)

measured in bits per second. A rate is said to be achievable if there exists a

sequence of 2RT , T) coding schemes such that the error probability Pr[M = M ]
tends to zero as T .
The alert reader will note that we are cheating here: By limiting only
the receiver to the time interval [0, T], but not the encoder, we try to cover
up the problem that a timelimited signal at the transmitter for sure is not
bandlimited. How we are supposed to transmit an innite-length signal in the
rst place is completely ignored. . . However, this issue is then resolved in
our derivation once we let T tend to innity.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

334

Bandlimited Channels

16.2  The Sampling Theorem

The basic idea of our treatment of bandlimited time-continuous signals is
to rely on so-called complete orthonormal systems (CONS),² i.e., on basis
functions that can be used to describe these functions. We are sure that you
already know a CONS that holds for all energy-limited signals: the signal
description based on the signal's samples as given by the sampling theorem.

Theorem 16.3 (Sampling Theorem (by Nyquist & Shannon)).
Let s(\cdot) be a finite-energy function that is bandlimited to [-W, W] Hz.
Then s(\cdot) is completely specified by the samples of the function spaced
1/(2W) seconds apart and can be written as

    s(t) = \sum_{k=-\infty}^{\infty} s\Bigl( \frac{k}{2W} \Bigr) \, \mathrm{sinc}(2Wt - k),   (16.4)

where

    \mathrm{sinc}(t) \triangleq \begin{cases} \dfrac{\sin(\pi t)}{\pi t} & t \neq 0, \\[1mm] 1 & t = 0. \end{cases}   (16.5)

We note that

    \phi_k(t) \triangleq \sqrt{2W} \, \mathrm{sinc}(2Wt - k)             (16.6)

are the basis functions used to describe s(\cdot) and that

    \frac{1}{\sqrt{2W}} \, s\Bigl( \frac{k}{2W} \Bigr)                   (16.7)

denote the coordinates of s(\cdot) in this basis.

² For more details, see Appendix C.8.
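Before turning to the proof, here is a small numerical illustration of (16.4) (Python; the test signal and all parameter values are assumptions): a bandlimited, finite-energy signal is reconstructed from its samples spaced 1/(2W) apart by a (truncated) sinc interpolation.

# Sampling-theorem illustration: reconstruct a bandlimited signal from its
# samples s(k/(2W)) via the sinc series (16.4), truncated to finitely many
# terms. Test signal and parameter values are illustrative assumptions.
import numpy as np

W = 4.0                                      # assumed bandwidth in Hz
def s(t):                                    # bandlimited to 3 Hz < W
    return np.sinc(2 * 3.0 * t) + 0.5 * np.sinc(2 * 3.0 * (t - 0.4))

k = np.arange(-400, 401)
samples = s(k / (2 * W))                     # samples spaced 1/(2W) seconds apart

t = np.linspace(-2.0, 2.0, 1001)
# (16.4): s(t) = sum_k s(k/(2W)) sinc(2W t - k); note np.sinc(x) = sin(pi x)/(pi x)
s_hat = samples @ np.sinc(2 * W * t[None, :] - k[:, None])

print(np.max(np.abs(s_hat - s(t))))          # tiny reconstruction error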


Proof: The proof relies on the Fourier series of a periodic function g(\cdot):
Suppose g(x) is periodic with period X, i.e.,

    g(x + \nu X) = g(x),   \forall x \in R, \; \forall \nu \in Z.        (16.8)

Then we know from the theory of Fourier series that g(\cdot) can be written as

    g(x) = \frac{1}{X} \sum_{k=-\infty}^{\infty} c_k \, e^{-i 2\pi k \frac{x}{X}}   (16.9)

with coefficients

    c_k \triangleq \int_{0}^{X} g(x) \, e^{i 2\pi k \frac{x}{X}} \, dx.   (16.10)

We now apply this idea to a periodic function \hat{s}_p(f) with period 2W. I.e.,

    \hat{s}_p(f) = \frac{1}{2W} \sum_{k=-\infty}^{\infty} c_k \, e^{-i 2\pi k \frac{f}{2W}}   (16.11)

where

    c_k = \int_{-W}^{W} \hat{s}_p(f) \, e^{i 2\pi k \frac{f}{2W}} \, df.   (16.12)

Note that we have not yet specified what \hat{s}_p(f) actually is. We correct this
lapse by choosing

    c_k \triangleq s\Bigl( \frac{k}{2W} \Bigr).                          (16.13)

Now \hat{s}_p(f) is well defined. But what is it? To find out we next compute
the inverse Fourier transform³ of the bandlimited signal s(t) using its Fourier
transform \hat{s}(f):

    s(t) = \int_{-\infty}^{\infty} \hat{s}(f) \, e^{i 2\pi f t} \, df = \int_{-W}^{W} \hat{s}(f) \, e^{i 2\pi f t} \, df.   (16.14)

Note that we can restrict the integration limits because we have assumed that
s(\cdot) is bandlimited in the first place.

Now consider the points t = k/(2W):

    s\Bigl( \frac{k}{2W} \Bigr) = \int_{-W}^{W} \hat{s}(f) \, e^{i 2\pi f \frac{k}{2W}} \, df.   (16.15)

Comparing (16.15) with (16.12), we see that we have

    \int_{-W}^{W} \hat{s}_p(f) \, e^{i 2\pi f \frac{k}{2W}} \, df = \int_{-W}^{W} \hat{s}(f) \, e^{i 2\pi f \frac{k}{2W}} \, df   (16.16)

for all k \in Z, i.e., we must have that

    \hat{s}_p(f) = \hat{s}(f),   f \in [-W, W].                          (16.17)

Since \hat{s}_p(f) is a periodic function with period 2W, it also follows that outside
the band [-W, W], the function \hat{s}_p(f) will simply repeat itself periodically.
So, finally, we understand that by our definition (16.13) we have chosen \hat{s}_p(f)
to be the periodic extension of \hat{s}(f) with period 2W!

We now also see a way of getting back our original function s(\cdot) from its
samples \{s(k/(2W))\}:

  • Using the samples s(k/(2W)) as coefficients c_k in (16.11) we get \hat{s}_p(f).

  • From \hat{s}_p(f) we can then get \hat{s}(f) back using an ideal lowpass filter
    \hat{h}_{LP}(f):

        \hat{s}(f) = \hat{h}_{LP}(f) \cdot \hat{s}_p(f) = \hat{h}_{LP}(f) \cdot \frac{1}{2W} \sum_{k=-\infty}^{\infty} s\Bigl( \frac{k}{2W} \Bigr) e^{-i 2\pi k \frac{f}{2W}}.   (16.18)

  • Finally we get s(t) back by the inverse Fourier transform.

This proves the first part of the theorem.

³ Since capital letters are already reserved for random quantities, in this script we use a
hat to denote the Fourier transform of a continuous signal. Also note that our definition of
the Fourier transform does not make use of the radial frequency \omega. The advantage is that we do
not need to bother about many factors of 2\pi. Moreover, f has a very fundamental physical
meaning, while I do not really understand the physical meaning of \omega. For more about this
choice, I highly recommend to read the first couple of chapters (in particular Chapter 6) of
[Lap09].
To derive the expression (16.4), we use the engineering trick of Dirac Delta
functions \delta(\cdot). Note that \delta(\cdot) is not a proper function in the mathematical
sense. But we sloppily describe it as a function

    \delta(t) \triangleq \begin{cases} 0 & \text{for all } t \neq 0, \\ \infty & t = 0, \end{cases}   (16.19)

where the value \infty is chosen such that

    \int_{-\infty}^{\infty} \delta(t) \, dt = 1.                         (16.20)

See also Example 14.4.

So consider (16.18) and make the following observations:

  • The Fourier transform of a Dirac Delta \delta(t - kT) is

        \int_{-\infty}^{\infty} \delta(t - kT) \, e^{-i 2\pi f t} \, dt = e^{-i 2\pi f k T},   (16.21)

    so that

        \hat{s}_p(f) = \frac{1}{2W} \sum_{k=-\infty}^{\infty} s\Bigl( \frac{k}{2W} \Bigr) e^{-i 2\pi k \frac{f}{2W}}
        \;\Longleftrightarrow\;
        s_p(t) = \frac{1}{2W} \sum_{k=-\infty}^{\infty} s\Bigl( \frac{k}{2W} \Bigr) \delta\Bigl( t - \frac{k}{2W} \Bigr).   (16.22)

  • The inverse Fourier transform of an ideal lowpass filter \hat{h}_{LP}(\cdot) with bandwidth
    W, i.e., \hat{h}_{LP}(f) = 1 for |f| \leq W and 0 otherwise, is

        h_{LP}(t) = 2W \, \mathrm{sinc}(2Wt).                            (16.23)

Hence we get

    \hat{s}(f) = \hat{h}_{LP}(f) \cdot \hat{s}_p(f)  \;\Longleftrightarrow\;  s(t) = h_{LP}(t) * s_p(t)   (16.24)
    s(t) = 2W \, \mathrm{sinc}(2Wt) * \frac{1}{2W} \sum_{k=-\infty}^{\infty} s\Bigl( \frac{k}{2W} \Bigr) \delta\Bigl( t - \frac{k}{2W} \Bigr)   (16.25)
         = \sum_{k=-\infty}^{\infty} s\Bigl( \frac{k}{2W} \Bigr) \, \mathrm{sinc}\Bigl( 2W \Bigl( t - \frac{k}{2W} \Bigr) \Bigr)   (16.26)
         = \sum_{k=-\infty}^{\infty} s\Bigl( \frac{k}{2W} \Bigr) \, \mathrm{sinc}(2Wt - k).   (16.27)

This proves the second part of the theorem.


Remark 16.4 (Sampling via Shah-Function). There is a nice engineering
way of remembering and understanding the sampling theorem. We define the
Shah-function as

    Ш_T(t) \triangleq \sum_{k=-\infty}^{\infty} \delta(t - kT).          (16.28)

This is a strange "function", but it has some cool properties. The coolest is
the following:

    Ш_T(t)  \;\Longleftrightarrow\;  \frac{1}{T} \, Ш_{1/T}(f),          (16.29)

i.e., the Fourier transform of a Shah-function is again a Shah-function! Moreover, observe the following:

  • Multiplying a function s(t) by a Shah-function Ш_T(t) is equivalent to
    sampling the function s(\cdot) in intervals of T:

        s_p(t) = s(t) \cdot T \, Ш_T(t)                                  (16.30)

    (the additional factor T is introduced here to make sure that the energy
    of s_p(\cdot) is identical to the energy of s(\cdot)).

  • Convolving a function \hat{s}(f) with a Shah-function Ш_W(f) is equivalent
    to periodically repeating the function \hat{s}(\cdot) with period W:

        \hat{s}_p(f) = \hat{s}(f) * Ш_W(f).                              (16.31)

Now, these two operations are coupled by the Fourier transform! So we understand the following beautiful relation:

    s_p(t) = s(t) \cdot T \, Ш_T(t)  \;\Longleftrightarrow\;  \hat{s}_p(f) = \hat{s}(f) * Ш_{1/T}(f).   (16.32)

[Figure 16.1: Sampling and replicating frequency spectrum: This figure depicts the relationship between sampling a bandlimited time signal s(t) and
periodically repeating the corresponding spectrum \hat{s}(f) with period 1/T.]

The relationship is also depicted in Figure 16.1.

We see that if 1/T < 2W we will have overlap in the spectrum (so-called
aliasing) and therefore we will not be able to gain the original spectrum back.

On the other hand, if 1/T = 2W, then there is no overlap and we can gain
the original function back using an ideal lowpass filter.

If we sample at a slightly higher rate than 2W, i.e., T < 1/(2W), then we can
even relax our requirement on the lowpass filter, i.e., it does not need to be
ideal anymore, but we can actually use a filter that can be realized in practice.

16.3  From Continuous To Discrete Time

So the basic idea of the sampling theorem (Theorem 16.3) or any other CONS
is that any signal s(\cdot) can be described using some basis functions:

    s(t) = \sum_{k} s_k \phi_k(t).                                       (16.33)

The basis functions have the property of being orthonormal. This means that
they are orthogonal to each other and that they are normalized to have unit
energy:

    \int_{-\infty}^{\infty} \phi_k(t) \phi_{k'}(t) \, dt = \begin{cases} 1 & \text{if } k = k', \\ 0 & \text{if } k \neq k'. \end{cases}   (16.34)

Due to this property it is easy to derive the coordinates s_k of s(\cdot):

    s_k = \int_{-\infty}^{\infty} s(t) \phi_k(t) \, dt,                  (16.35)

as can be seen when plugging (16.33) into (16.35):

    \int_{-\infty}^{\infty} s(t) \phi_k(t) \, dt = \int_{-\infty}^{\infty} \Bigl( \sum_{k'} s_{k'} \phi_{k'}(t) \Bigr) \phi_k(t) \, dt   (16.36)
        = \sum_{k'} s_{k'} \underbrace{\int_{-\infty}^{\infty} \phi_{k'}(t) \phi_k(t) \, dt}_{= 1 \text{ for } k'=k, \; = 0 \text{ otherwise}}   (16.37)
        = s_k.                                                           (16.38)

The problem for us now is that we actually do not want to describe Y(t)
for all t \in R using infinitely many samples, but actually would like to constrain
Y(t) to the interval t \in [0, T]. According to the sampling theorem we sample
every 1/(2W) seconds, i.e., we have 2WT samples in this interval. So, we would
simply love to consider only these samples and write

    Y(t) \approx \sum_{k=1}^{2WT} Y_k \phi_k(t),   t \in [0, T].         (16.39)

This, however, does not really work because \phi_k(t) are sinc-functions and therefore decay very slowly, resulting in large contributions outside of the required
time-interval [0, T]. (We keep ignoring the additional problem that cutting
any function to [0, T] will destroy its band limitations...)

However, one can save oneself from this problem if rather than relying on
the CONS based on the sampling theorem, we use another CONS based on the
family of the so-called prolate spheroidal functions \psi_k(t). These orthonormal
basis functions have some very nice properties:

  • They are strictly bandlimited.

  • Even though they are infinite-time, they concentrate most of their energy
    inside the time interval [0, T].

  • They retain their orthogonality even when they are restricted to [0, T].

Hence, a signal that is bandlimited to [-W, W] can be described in the time-interval [0, T] using only those 2WT orthonormal basis functions \psi_k(\cdot) that
mainly live inside [0, T]. The remaining basis functions only give a very small
contribution that can be neglected. For more details, see Appendix C.8 and
[Lap09, Ch. 8].

To cut these long explanations short (and ignoring many details), it is
basically possible to write

    Y(t) = Y_1 \psi_1(t) + Y_2 \psi_2(t) + \cdots + Y_{2WT} \psi_{2WT}(t),                                    (16.40)
    x^{(m)}(t) = x_1^{(m)} \psi_1(t) + x_2^{(m)} \psi_2(t) + \cdots + x_{2WT}^{(m)} \psi_{2WT}(t),   m = 1, \ldots, 2^{RT},   (16.41)
    Z(t) = Z_1 \psi_1(t) + Z_2 \psi_2(t) + \cdots + Z_{2WT} \psi_{2WT}(t).                                    (16.42)

Hence, we can describe the channel model (16.1) using the following 2WT
equations:

    Y_k = x_k + Z_k,   k = 1, \ldots, 2WT.                               (16.43)

Here the coordinates of the white Gaussian noise process Z(t),

    Z_k \triangleq \int_{0}^{T} Z(t) \psi_k(t) \, dt,                    (16.44)

are random variables that turn out to be IID \sim N(0, N_0/2) (for more details see
Appendix C.5). So we realize that (16.43) actually corresponds to a Gaussian
channel model

    Y = x + Z                                                            (16.45)

as discussed in Chapter 15, with a noise variance

    \sigma^2 = \frac{N_0}{2}                                             (16.46)

and with a blocklength n = 2WT.

By the way, note that the energy of Z(t) inside the band [-W, W] and
inside the time-interval t \in [0, T] is identical to the total energy of all 2WT
samples Z_k:

    \underbrace{2W}_{\text{bandwidth}} \cdot \underbrace{\frac{N_0}{2}}_{\text{PSD}} \cdot \underbrace{T}_{\text{duration}} = \underbrace{2WT}_{\text{\# of samples}} \cdot \underbrace{\frac{N_0}{2}}_{\text{variance}},   (16.47)

where 2W \cdot N_0/2 is the power of the noise in the band [-W, W].

It only remains to derive how the average-power constraint (16.2) translates into an average-power constraint E_s for this Gaussian channel. Recall
that for the discrete-time model (16.45) the average-power constraint is given
as

    \frac{1}{n} \sum_{k=1}^{n} x_k^2 \leq E_s.                           (16.48)

Plugging the description (16.41) into the average-power constraint (16.2) we
get

    \frac{1}{T} \int_{0}^{T} x^2(t) \, dt                                                                      (16.49)
        = \frac{1}{T} \int_{0}^{T} \bigl( x_1 \psi_1(t) + x_2 \psi_2(t) + \cdots + x_{2WT} \psi_{2WT}(t) \bigr)^2 \, dt   (16.50)
        = \frac{1}{T} \bigl( x_1^2 + x_2^2 + \cdots + x_{2WT}^2 \bigr)                                          (16.51)
        = 2W \cdot \frac{1}{2WT} \sum_{k=1}^{2WT} x_k^2                                                         (16.52)
        = 2W \cdot \frac{1}{n} \sum_{k=1}^{n} x_k^2,                                                            (16.53)

where in (16.51) we have used the orthonormality of \psi_k(\cdot). Hence, we see that

    \frac{1}{n} \sum_{k=1}^{n} x_k^2 \leq \frac{P}{2W},                  (16.54)

i.e., in (16.48) we have

    E_s = \frac{P}{2W}.                                                  (16.55)

Now, we recall from Section 15.3 that the (discrete-time) capacity per
channel use of the Gaussian channel (16.45) under an average-power constraint
E_s is

    C_d = \frac{1}{2} \log_2\Bigl( 1 + \frac{E_s}{\sigma^2} \Bigr)   bits per channel use,   (16.56)

which in our situation, using (16.46) and (16.55), translates to

    C_d = \frac{1}{2} \log_2\Biggl( 1 + \frac{\frac{P}{2W}}{\frac{N_0}{2}} \Biggr) = \frac{1}{2} \log_2\Bigl( 1 + \frac{P}{N_0 W} \Bigr)   bits per sample.   (16.57)

Since in total we have 2WT samples, the total maximum amount of information
that can be transmitted during the interval [0, T] is

    2WT \cdot C_d = WT \log_2\Bigl( 1 + \frac{P}{N_0 W} \Bigr)   bits.   (16.58)

When normalized to T to describe the amount of information that can be
transmitted per second, this finally gives

    C = W \log_2\Bigl( 1 + \frac{P}{N_0 W} \Bigr)   bits/s.              (16.59)

So we see that as long as the rate R of our coding scheme (Definition 16.2)
is below the value C given in (16.59), we can choose the signal duration (observation duration) T long enough to make the error probability as small as
wished. Once again we see the duality of the blocklength n in the discrete-time
system and the signal duration T in the continuous-time system.

Note that by letting T tend to infinity, we also get rid of all the approximations and imprecisions in our derivation that we have so assiduously swept
under the rug...

Hence, we have shown the following theorem, which is, of all his theorems, Shannon's most famous.

Theorem 16.5 (Coding Theorem for the AWGN Channel).
Consider a continuous-time channel where the output Y(\cdot) is given by

    Y(t) = x(t) + Z(t)                                                   (16.60)

with an input x(\cdot) that is bandlimited to W Hz and that is constrained to
have an average power of at most P, and with AWGN Z(\cdot) of double-sided
spectral density N_0/2. The capacity of this channel is given by

    C = W \log_2\Bigl( 1 + \frac{P}{N_0 W} \Bigr)   bits/s.              (16.61)

We would like to briefly discuss this exciting result. Note that even though
W shows up in the denominator inside the logarithm, it also shows up as a
factor outside of the log. This factor outside is dominant for values of W \ll P/N_0.
Moreover, for W \to \infty we get

    C = \frac{P}{N_0} \log_2 e   bits/s,                                 (16.62)

i.e., for infinite bandwidth the capacity grows linearly with the power. Hence,
we see that usually it is more attractive to increase bandwidth rather than
to increase power. Unfortunately, in practice bandwidth normally is far more
expensive than power.

For this very same reason, the engineers are not so much interested in how
much they can transmit per second, but rather how much they can transmit
per unit band. The figure of interest then becomes C/W [bits/s/Hz] and the
capacity formula then can be written as

    \frac{C}{W} = \log_2\Bigl( 1 + \frac{P}{N_0 W} \Bigr)   \frac{\text{bits}}{\text{s} \cdot \text{Hz}}.   (16.63)

The term C/W is known as maximum bandwidth efficiency.

So consider some coding scheme using a bandwidth W that transmits at a
rate of R \leq C bits/s. This scheme has a bandwidth efficiency of R/W. Suppose
further that this coding scheme needs an energy E_b for every information bit
it transmits, i.e., the total power P of this system is

    P = R \, E_b.                                                        (16.64)

Using this in (16.63) leads to

    \frac{R}{W} \leq \frac{C}{W} = \log_2\Bigl( 1 + \frac{R E_b}{N_0 W} \Bigr) = \log_2\Bigl( 1 + \frac{E_b}{N_0} \frac{R}{W} \Bigr),   (16.65)

i.e., this gives us a lower bound on the required signal-to-noise ratio⁴ E_b/N_0 for
the given bandwidth efficiency:

    \frac{E_b}{N_0} \geq \frac{2^{\frac{R}{W}} - 1}{\frac{R}{W}}.        (16.66)

Note that the right-hand side of (16.66) is monotonically increasing in R/W, i.e.,
the smallest value occurs for W \to \infty and is given by \ln 2. Hence,

    \gamma_b = \frac{E_b}{N_0} \geq \ln 2 = 0.693 \;\widehat{=}\; -1.6 \text{ dB},   (16.67)

i.e., it is not possible to transmit reliably over the AWGN channel at
an SNR smaller than -1.6 dB.

⁴ The ratio E_b/N_0 is often referred to as "ebno" and is the most common definition for a
signal-to-noise ratio (SNR); see (15.120).
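The interplay between bandwidth, power, and the -1.6 dB limit can be made concrete with a few numbers. The following sketch (Python; the power and noise values are assumptions) evaluates the capacity (16.61) for increasing bandwidths together with the minimum E_b/N_0 from (16.66) at R = C; as W grows, C approaches (P/N_0) \log_2 e and the required SNR approaches \ln 2, i.e., about -1.59 dB.

# AWGN capacity (16.61) and the minimum Eb/N0 from (16.66) at R = C, for
# increasing bandwidths. Power and noise-level values are assumptions.
import numpy as np

P, N0 = 1e-6, 1e-9          # assumed received power [W] and noise level [W/Hz]
for W in [1e2, 1e3, 1e4, 1e5, 1e6]:
    C = W * np.log2(1 + P / (N0 * W))          # bits per second, (16.61)
    eta = C / W                                # bandwidth efficiency R/W at R = C
    ebno = (2**eta - 1) / eta                  # minimum Eb/N0, (16.66)
    print(W, round(C, 1), round(10 * np.log10(ebno), 2), "dB")
print("limits:", (P / N0) * np.log2(np.e), "bit/s,",
      round(10 * np.log10(np.log(2)), 2), "dB")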

Chapter 17

Parallel Gaussian Channels

In this chapter we leave the continuous-time domain again and return to our
discrete-time Gaussian channel model. In Chapter 15 we have already studied
this model in detail. Now we would like to add one new ingredient: What
happens if we have several Gaussian channels available at the same time?

Of course, we can use them all at the same time, and we can simply divide
our available power equally among all of them. But this might not be the
best thing to do, in particular, if some of the Gaussian channels are better
than others (i.e., they have less noise) or if the noise processes that impair the
different signals on the different channels are actually correlated.

17.1  Independent Parallel Gaussian Channels: Waterfilling

We start with the simpler case where the available channels are independent
of each other. Consider J parallel Gaussian channels:

    Y_j = x_j + Z_j,   j = 1, \ldots, J,                                 (17.1)

with Z_j \sim N(0, \sigma_j^2) and Z_j \perp Z_{j'} for j \neq j'. We can think of this setup as
a transmitter that has J independent antennas or lines available that can be
used at the same time. Note that the channels do not all have the same noise
variance, i.e., some channels are better (smaller variances) than others.

The channel can actually be nicely written in vector form:

    Y = x + Z                                                            (17.2)

with Z \sim N\bigl(0, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_J^2)\bigr). For every random message U, the transmitter now does not choose one, but J codewords. Hence, the codebook does
not consist of 2^{nR} different length-n codewords, but rather of 2^{nR} different
size-(J \times n) code-matrices X^{(m)}, m = 1, \ldots, 2^{nR}.

We assume a total average-power constraint E, i.e., every code-matrix must
satisfy

    \frac{1}{n} \sum_{j=1}^{J} \sum_{k=1}^{n} \bigl( x_{j,k}^{(m)} \bigr)^2 \leq E,   m = 1, \ldots, 2^{nR},   (17.3)

or, in matrix form,

    \frac{1}{n} \bigl\| X^{(m)} \bigr\|_F^2 \leq E,   m = 1, \ldots, 2^{nR},   (17.4)

where \|\cdot\|_F denotes the Frobenius norm of matrices, i.e., the square root of
the sum of all squared components of the matrix. The engineering meaning
of this total average-power constraint is that the transmitter has (on average
over time) a power E available that it is free to distribute among the J different
channels.
We now ask the following question: What is the maximum total amount
of information we can transmit reliably through these J parallel channels?

To answer this question we actually need to derive another coding theorem
and prove the capacity. However, this derivation looks very much like the
derivations we have seen so far, so we omit it and simply state the result.

Theorem 17.1 (Coding Theorem for Parallel Gaussian Channels).
The capacity of the parallel Gaussian channel (17.2) under the total
average-power constraint (17.4) is given as

    C(E) = \max_{f_{X} : \, E[\|X\|^2] \leq E} I(X; Y)                                                     (17.5)
         = \max_{f_{X_1, \ldots, X_J} : \, E\bigl[\sum_{j=1}^{J} X_j^2\bigr] \leq E} I(X_1, \ldots, X_J; Y_1, \ldots, Y_J).   (17.6)

So it only remains to try to derive a closed-form expression for this capacity


expression, i.e., to compute the maximum in (17.6). Luckily, this is possible.
We start as follows:
I(X; Y) = h(Y) h(Y|X)

= h(Y) h(X + Z|X)

= h(Y) h(Z|X)

= h(Y) h(Z)

(17.7)
(17.8)
(17.9)
(17.10)

=
j=1

h(Yj |Y1 , . . . , Yj1 ) h(Zj )

(17.11)

h(Yj ) h(Zj )

(17.12)

h(Yj ) h(Zj + Xj |Xj )

(17.13)

j=1
J

=
j=1

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

17.1. Independent Parallel Gaussian Channels: Waterlling

345

J
j=1

h(Yj ) h(Yj |Xj )

(17.14)

I(Xj ; Yj )

(17.15)

Ej
1
log 1 + 2 ,
2
j

(17.16)

=
j=1
J

j=1

where we have dened


2
E Xj

Ej

(17.17)

and used the fact that a Gaussian channel of power Ej has a capacity
Ej
1
log 1 + 2 .
2
j

(17.18)

Here, (17.8)(17.10) follows from (17.2) and the assumption that the noise is
independent of the input; in (17.11) we use the chain rule and the independence of the parallel channels Zj Zj ; (17.12) follows because conditioning

cannot increase dierential entropy; (17.13) and (17.14) follow again from
(17.2); and the nal step, as mentioned, follows from the capacity of a Gaussian channel under a given power constraint.
Note that because

it must hold that

i.e.,

E E

J
j=1

j=1

2
Xj E,

(17.19)

2
Xj =

J
2
E Xj =

j=1

Ej ,

(17.20)

j=1

J
j=1

Ej E.

(17.21)

Moreover, note that the upper bound (17.16) actually can be achieved if Xj
N (0, Ej ), Xj Xj :

X N (0, diag(E1 , . . . , EJ )).

(17.22)

Hence, we now know the optimal input, and the only remaining question is
how to choose (E1 , . . . , EJ )!

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

346

Parallel Gaussian Channels


We reformulate this problem as follows:
J

C(E) =

max

E1 ,...,EJ
Ej 0, j=1,...,J j=1

Ej
1
log 1 + 2 .
2
j

(17.23)

j=1

Note that

1
2

log 1 +

Ej
2
j

Ej E

is monotonically increasing in Ej . Therefore, we

see that the maximum will be achieved for a choice of (E1 , . . . , EJ ) such that
J

Ej = E

(17.24)

j=1

with equality.
Note further that

1
2

log 1 +

Ej
2
j

is a concave function and therefore that


J

g(E1 , . . . , EJ )
j=1

Ej
1
log 1 + 2
2
j

(17.25)

is concave, too. We therefore see that (17.24) matches exactly the situation of
the KarushKuhnTucker conditions: An optimal choice of (E1 , . . . , EJ ) must
satisfy:
g(E1 , . . . , EJ ) !
=
E
g(E1 , . . . , EJ ) !

with E > 0,

(17.26)

with E = 0.

(17.27)

Lets compute the KarushKuhnTucker conditions:

J
j=1

Ej
1
log 1 + 2
2
j

1
1
1
1
1
2 log e =

2 log e, (17.28)
E
2 1 + 2
2 E +

i.e.,
1 log e
2 =
2 E +
1 log e
2
2 E +
We introduce a new parameter

log e
2

2
E =

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

if E > 0,

(17.29)

if E = 0.

(17.30)

and get
if E > 0,

(17.31)

if E = 0,

(17.32)

17.1. Independent Parallel Gaussian Channels: Waterlling

347

i.e.,
2
E =

2
if > 0,

if

E = 0

(17.33)

0,

(17.34)

or
2
E =

(17.35)

Here must be chosen such that (17.24) is satised, i.e.,


J
j=1

2
j

= E.

(17.36)

This rule can be very nicely interpreted graphically (see Figure 17.1): we deE

11111111
00000000
1
0
11111111
00000000
1
0
11111111
00000000
1
E 0
11111111
00000000
1
0 E
1
0
11111111
00000000
11111111
00000000
1
0
11111111
00000000
11111111
00000000

11111111
00000000
1

1111
0000
E
1111
0000
1111
0000
5

2
3
2
4

2
5

2
1

2
2

Figure 17.1: Waterlling solution of the power assignment of m parallel independent Gaussian channels.
pict the various channels with bars representing the value of the noise variance,
and we pour the available power into the graphic like water. The parameter
then represents the water level. Hence, if we start with E = 0 and slowly
increase the available power, we will rstly only use the best channel (the one
with the smallest noise variance). All other channels are switched o. However, because the logarithm is a concave function, the slope is decreasing with
increasing power. Therefore, once we have increased E enough, the second
best channel will yield a slope that is identical to the rst and hence we start
also pouring power into the second one. And so on.
This solution is therefore called the waterlling solution.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

348

Parallel Gaussian Channels

17.2

Dependent Parallel Gaussian Channels

We now extend the discussion from the last section to the case where the
dierent channels are dependent. We assume that Z denotes the zero-mean
Gaussian noise vector with a given nonsingular covariance matrix1 KZZ . The
channel output is now given as
Y =x+Z

(17.37)

where Z N (0, KZZ ). It is not dicult to show that Theorem 17.1 still
holds true. So we again attempt to compute a closed-form expression for the
maximization given in (17.6). We start o in a very similar way as for the
independent case:
I(X; Y) = h(Y) h(Z)
1
= h(Y) log (2e)J det KZZ
2
1
1
log (2e)J det KYY log (2e)J det KZZ
2
2
1
1
J
= log (2e) det(KXX + KZZ ) log (2e)J det KZZ
2
2
det(KXX + KZZ )
1
.
= log
2
det KZZ

(17.38)
(17.39)
(17.40)
(17.41)
(17.42)

Here the inequality in (17.40) follows from Theorem 14.16; and in (17.41) we
used the fact that X Z and therefore

KYY = KXX + KZZ .

(17.43)

Note that the upper bound (17.42) is achieved if we choose2 X N (0, KXX ).
Hence it remains to choose an optimum covariance matrix KXX . Note that
since the noise has dependencies, we might have to allow our optimal input to
have some dependencies, too. Note further that the average-power constraint
can be expressed depending on KXX as follows:

E X

= E

j=1

2
Xj =

2
E Xj =

j=1

j=1

Var(Xj ) = tr(KXX ) E, (17.44)

where we have already made use of the fact that X is zero-mean.


Hence, the capacity is given by
C(E) =

max

KXX
tr(KXX )E

det(KXX + KZZ )
1
log
.
2
det KZZ

(17.45)

For those readers who are worried about dependent Gaussian random vectors, we highly
recommend to have a look at Appendices A and B.
2
Note that we have chosen E[X] = 0 since we know that an optimum input must be of
zero mean. Otherwise the encoder wastes precious power to transmit something that the
receiver already knows in advance and therefore does not contain any information.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

17.2. Dependent Parallel Gaussian Channels

349

To nd the covariance matrix that maximizes this expression, we need the


eigenvalue decomposition of KZZ :

KZZ = UUT .

(17.46)

Note that since KZZ is positive semi-denite, this decomposition always exists,

U is orthogonal, i.e., UUT = UT U = IJ , and is diagonal,

= diag 1 , . . . , J ,
2
2

(17.47)

with only nonnegative entries on the diagonal. For more details, see the
discussion in Appendix B.1. Using that
det(AB) = det(BA),

(17.48)

we then have

det(KXX + KZZ ) = det KXX + UUT

(17.49)

= det UU KXX UU + UUT

(17.50)

= det U UT KXX U + UT

(17.51)

UT KXX U + UT U

(by (17.48)) (17.52)

= det

= det UT KXX U +

= det K +

(17.53)
(17.54)

XX

where we have introduced

UT X

(17.55)

with corresponding covariance matrix3

KXX = E XXT = E UT XXT U = UT KXX U.

(17.56)

Note that the multiplication by UT corresponds to a rotation of X and there


fore the power constraint for X is unchanged:
!

tr KXX = tr UT KXX U = tr UUT KXX = tr(KXX ) E.

(17.57)

Here we have used the fact that


tr(AB) = tr(BA).

(17.58)

To see how we have to choose KXX to maximize (17.54), we use the following

well-known lemma [HJ85, Theorem 7.8.1].


3

Again, we implicitly already use our knowledge that X is zero-mean.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

350

Parallel Gaussian Channels

Lemma 17.2 (Hadamards Inequality). For any J J positive semi-denite4 matrix A we have
J

det A

(A)j,j .

(17.59)

j=1

Furthermore, when A is positive denite, then equality holds if, and only if, A
is diagonal.
Applied to (17.54), Hadamards inequality reads
J

det(KXX + )

2
(KXX )j,j + j ,

(17.60)

j=1

which can be achieved with equality if we choose X such that KXX is diagonal:

KXX = diag E1 , . . . , EJ

(17.61)

for some choice of E1 , . . . , EJ such that (17.57) is satised:


J
j=1

Ej E.

(17.62)

Moreover,
J

det KZZ

= det UUT = det =

j ,
2

(17.63)

j=1

so that (17.45) can be written as


C(E) =

max

E1 ,...,EJ
J
j E
j=1 E

1
log
2
J

max

E1 ,...,EJ j=1
J
j E
j=1 E
J

max

E1 ,...,EJ j=1
J
j E
j=1 E

J
j=1

Ej + j
2
J
2
j=1 j

(17.64)

Ej + j
2
1
log
2
j
2

(17.65)

Ej
1
log 1 + 2 .
2
j

(17.66)

But this is exactly the problem we have solved in the last section: Compare
with (17.23)! The optimal solution is the waterlling solution with respect to
the j :
2

Ej = ( j )+
2

(17.67)

with such that (17.62) is satised with equality. Our optimal choice of KXX
then is by (17.61) and (17.56)

KXX = UKXX UT = U diag E1 , . . . , EJ UT .

(17.68)

For a denition of positive semi-denite matrices, see Section B.1 in Appendix B.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

17.3. Colored Gaussian Noise

351

Remark 17.3. Note that what really is happening here is that we whiten
the problem, i.e., we rotate the input vector in such a way that the parallel
channels become independent. Then we solve the problem using waterlling
as before and nally rotate the solution back.

17.3

Colored Gaussian Noise

Instead of thinking of several parallel Gaussian channels that are dependent,


we can use the same model (17.37), but interpret it dierently as a single
Gaussian channel that is dependent over time, i.e., that has memory. So this
is a rst step away from our memoryless models to a more practical situation,
where the noise depends on its past.
We consider the following version of a Gaussian channel
Y k = x k + Zk ,

(17.69)

where {Zk } is a Gaussian process with memory, i.e., with a certain given autocovariance function.5 In the frequency domain this dependence is expressed
by the fact that the power spectral density (PSD) 6 is not at. Such a noise
process is usually called colored noise in contrast to white noise.
We omit the mathematical details, but only quickly describe the result.
The optimal solution again will apply waterlling, however, this time not over
dierent channels, but instead in the frequency domain into the PSD (see
Figure 17.2).
N(f )

1
2

1
2

Figure 17.2: Power spectral density of colored noise with corresponding waterlling solution.
5
For a discussion about the dierences between the autocovariance function and the
autocorrelation function see Appendix C.
6
Actually, as we are in a discrete-time setting, we should speak of energy spectral density
as power is not even dened for discrete-time signals. However, it is customary to call this
function PSD in spite of the fact that when integrating over it, one obtains the average
energy of the signal, not its power. See also [Lap09, Ch. 13].

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

352

Parallel Gaussian Channels


The capacity of such a colored Gaussian channel with PSD N(f ) is given

by
C(E) =

1
2
1
2

N(f )
1
log 1 +
2
N(f )

df

(17.70)

with such that


1
2
1
2

N(f )

df = E.

(17.71)

In practice this means that we actually modulate our information in the


frequency domain of the transmitted signal. This procedure is called discrete
multitone and is the basic working principle of ADSL7 modems: For practical purposes we also discretize the PSD, i.e., instead of the z-transform, we
rely on the discrete Fourier transform (or rather its fast implementation fast
Fourier transform (FFT)). The information is modulated onto each frequency
sample where the used energy is determined by waterlling. This modulated
discrete spectrum is then converted by the inverse discrete Fourier transform
into a discrete-time signal, which goes converted by a D/A-converter to a
continuous-time signal through the channel. On the receiver side, the received signal (after the A/D-converter) is transformed back into the frequency
regime by the discrete Fourier transform. Each modulated frequency sample
suers from noise, however, because of the FFT, this noise is independent
between each frequency sample, so in eect, we have J parallel independent
channels (where J is the blocksize of the discrete Fourier transform) as discussed in Section 17.1, and we can therefore decode each frequency sample
independently.
So we see the similarities between the correlated parallel channels of Section 17.2 and the time-correlated channel of Section 17.3: In Section 17.2, we

encode the information onto X and use the matrix U to rotate it to get X,
which is then transmitted. The matrix U basically decouples the channels and
makes the noise independent.
In Section 17.3, the fast Fourier transform takes over the role of decoupling:
The FFT matrix makes sure that in spite of the memory in the channel, we
get independent parallel channels.
Note that orthogonal frequency division multiplexing (OFDM) at the
moment one of the most important modulation schemes used for wireless communication systems is basically the same thing as discrete multitone, but
applied to passband instead of baseband. However, since in a wireless channel
the spectral density changes often and quickly, one usually does not bother
with waterlling, but simply assigns the same energy to all subchannels.

ADSL stands for asymmetric digital subscriber line.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 18

Asymptotic Equipartition
Property and Weak
Typicality
In this chapter we describe a very powerful tool that is used very often in
information theory: typicality. Since the concept is rather abstract, we have
not used it so far, but actually, we could have proven several of our main
results in this class using this tool. As a matter of fact, in the advanced
information theory class [Mos13], we will make typicality our main tool! The
basis of typicality is the law of large numbers.

18.1

Motivation

To explain the basic idea of typicality, we use a simple example. Consider an


unfair coin X with PMF
PX (x) =

0.7 if x = 1,
0.3 if x = 0,

(18.1)

such that H(X) 0.881 bits.


Now Mr. Unreliable claims that he has twice observed a length-10 sequence
of this unfair coin, i.e., (X1 , . . . , X10 ) where Xi is IID PX (). He asserts
that he got the following two outputs:
1.

(X1 , . . . , X10 ) = 1101011110;

(18.2)

2.

(X1 , . . . , X10 ) = 1111111111.

(18.3)

Do we believe him?
Well, probably you will agree that the second event is somehow unexpected. Our life experience tells us that such a special outcome is not very
likely and we rather expect to see a sequence like the rst. This feeling gets
even much stronger if the sequence is longer, e.g., has length 900 instead of
only 10.

353

c Stefan M. Moser, v. 3.2

354

Asymptotic Equipartition Property and Weak Typicality

So far so good, but the whole thing becomes strange when we compute
the probabilities of the two sequences:
1.
2.

Pr[(X1 , . . . , X10 ) = 1101011110] = 0.77 0.33 0.22%,

Pr[(X1 , . . . , X10 ) = 1111111111] = 0.7

10

2.82%.

(18.4)
(18.5)

Hence, the second sequence is about 13 times more likely than the rst! So
how could we think that the rst is more likely?!?
To resolve this paradox, rstly note that we should not say that the rst
sequence is more likely (since it is not!), but rather that it is more expected
or more typical .
Secondly and this is the clue! we need to realize that even though the
second sequence is much more likely than the rst, the probability of observing
a sequence of the same type as the rst sequence (18.2) is close to 1. By the
the same type we mean a sequence that corresponds to the PMF, i.e., in our
case a sequence with about 70% ones and 30% zeros. On the other hand, the
probability of observing some special (nontypical) sequences like the all-one
sequence or the all-zero sequence is close to 0. This becomes more and more
pronounced if the length of the sequences is increased.
Since we usually see sequences like (18.2), we call these sequences typical
sequences.
We can summarize this observation roughly as follows:
When generating an IID length-n sequence, the probability of
obtaining a typical sequence with approximately nPX (0) zeros and
nPX (1) ones is almost equal to 1 for n large.
This ts well with our own experience in real life.
We will show below that we can state this observation dierently in a
slightly more abstract way as follows:1
Pr

(x1 , . . . , xn ) : PX1 ,...,Xn (x1 , . . . , xn ) 2n H(X)

1.

(18.6)

This means that we do not only almost for sure get a typical sequence, but
that actually every typical sequence has almost the same probability!
Almost certainly, we get a typical sequence. Among all typical
sequences, the probability of one particular typical sequence is almost
uniform.
In the example above we have 210 H(X) 0.22%.
In the following sections we will make these ideas a bit more concrete.
1

In this chapter we will follow the engineering notation and always assume that entropy
is measured in bits. Hence we write 2n H(X) instead of, e.g., en H(X) .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

18.2. AEP

18.2

355

AEP

The denition of convergence of a deterministic sequence a1 , a2 , . . . to some


value a is quite straightforward. Basically it says that for some n big enough,
an , an+1 , . . . will not leave any given small -environment around the convergence point. In more mathematical wording, we have the following. If
lim an = a,

(18.7)

then for any > 0 there exists a N such that for all n N
|an a| .

(18.8)

You have for sure seen this before.


Unfortunately, things are a bit more complicated if we look at the convergence of random sequences. Firstly, we might suspect that in general a
random sequence will converge not to a deterministic value, but to a random
variable.2 Secondly, there are several dierent types of convergence.
Denition 18.1. We say that a sequence of random variables X1 , X2 , . . .
converges to a random variable X
with probability 1 (also called almost surely), if
Pr lim Xn = X = 1;
n

(18.9)

in probability, if for any > 0 we have Pr[|Xn X| > ] 0 as n ,


i.e.,
lim Pr[|Xn X| ] = 1;

(18.10)

in distribution, if PXn (x) PX (x) for all x as n , i.e.,


lim PXn (x) = PX (x),

x.

(18.11)

One can prove that these three types of convergence are in order: If a
sequence converges with probability 1, then it will also converge in probability
and in distribution; and if a sequence converges in probability, it also converges
in distribution. The other way round is not always true. For more details we
refer to probability theory textbooks.
Equipped with these terms, we are now ready to state a very simple, but
very powerful result, which is a direct consequence of the weak law of large
numbers.
2
Luckily there is a large class of cases where this converging random variable turns out
to be deterministic. In particular this holds true in any situation where the weak law of
large numbers can be applied.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

356

Asymptotic Equipartition Property and Weak Typicality

Theorem 18.2 (Asymptotic Equipartition Property (AEP)).


If X1 , X2 , . . . are IID PX (), then
1
n
log PX1 ,...,Xn (X1 , . . . , Xn ) H(X)
n

in probability.

(18.12)

We remind the reader that PX1 ,...,Xn (x1 , . . . , xn ) denotes the joint probability distribution of X1 , . . . , Xn .
Proof: Since functions of independent random variables are also independent random variables, and since all Xk are IID, we realize that Yk
log PX (Xk ) are also IID. Moreover, we note that
n

PX (xk )

PX1 ,...,Xn (x1 , . . . , xn ) =

(18.13)

k=1

because all Xk are IID. Now recall the weak law of large numbers, which says
that for a sequence Z1 , . . . , Zn of IID random variables of mean and nite
variance and for any > 0,
lim Pr

Z1 + + Zn
= 0.
n

(18.14)

Hence,
1
lim log PX1 ,...,Xn (X1 , . . . , Xn )
n n
n
1
= lim log
PX (Xk )
n n

(18.15)

k=1

1
= lim
n n

log PX (Xk )

(18.16)

k=1

= E[log PX (X)]

in probability

= H(X).

(18.17)
(18.18)

Remark 18.3. The AEP says that with high probability


PX1 ,...,Xn (X1 , . . . , Xn ) 2n H(X)

(18.19)

(simply exponentiate both sides of (18.12)). Hence, for large n almost all sequences are about equally likely! This explains the name asymptotic equipartition property.
The AEP holds of course for any base of the logarithm. We will concentrate
on log2 , such that the inverse function is 2 .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

18.3. Typical Set

18.3

357

Typical Set

Based on (18.19) we now give the following denition, that turns out to be
extremely useful in proofs.
Denition 18.4. Fix > 0, a length n, and a PMF PX (). Then the typical
(n)
set A (PX ) with respect to PX () is dened as the set of all those length-n
sequences (x1 , . . . , xn ) X n that have a probability close to 2n H(X) :
A(n) (PX )

(x1 , . . . , xn ) X n :
2n(H(X)+) PX1 ,...,Xn (x1 , . . . , xn ) 2n(H(X)) .
(18.20)

Example 18.5. For n = 10, = 0.01, and PX () as in (18.1), we get


(10)

(X1 , . . . , X10 ) A0.01 (PX )

2.1 103 PX1 ,...,X10 (X1 , . . . , X10 ) 2.4 103 .

(18.21)

Hence, the rst sequence (18.2) with probability


PX1 ,...,X10 (1101011110) 2.22 103

(18.22)

is typical, while the second sequence (18.3) with probability


PX1 ,...,X10 (1111111111) 28.2 103

(18.23)

is not typical.
(n)

Theorem 18.6 (Properties of A (PX )).


(n)

1. If (x1 , . . . , xn ) A (PX ), then

1
H(X) log PX1 ,...,Xn (x1 , . . . , xn ) H(X) + .
n

(18.24)

2. If X1 , . . . , Xn are IID PX (), then

Pr (X1 , . . . , Xn ) A(n) (PX ) > 1 ,

(18.25)

for n suciently large.


3. For all n, the size of the typical set is upper-bounded as follows:
A(n) (PX ) 2n(H(X)+) .

(18.26)

4. For n suciently large, the size of the typical set is lower-bounded


as follows:
A(n) (PX ) > (1 ) 2n(H(X)) ,

n suciently large.

(18.27)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

358

Asymptotic Equipartition Property and Weak Typicality


(n)

Proof: Part 1 follows trivially from the denition of A (PX ).


(n)
Part 2 follows directly from the AEP applied to the denition of A (PX ).
To see this, lets rst rewrite the AEP in mathematical form. The statement
(18.12) says that > 0 and > 0 there exists a threshold N, such that
for n N, we have
1
Pr log PX1 ,...,Xn (X1 , . . . , Xn ) H(X) > 1 .
n

(18.28)

(n)

Note that by Part 1 any sequence A (PX ) satises this!

Now note that the expression inside of the probability in (18.28) actually
(n)
corresponds exactly to the denition of the typical set A (PX ). Hence,
written in a new form, the AEP says nothing else than
Pr (X1 , . . . , Xn ) A(n) (PX ) > 1 .

(18.29)

Since and are arbitrary, we can choose them freely. We choose and
are done.
To derive Part 3 we simplify our notation and use a vector notation for
sequences X = (X1 , . . . , Xn ).
1=

PX (x)

(18.30)

xX n

PX (x)

(drop some terms)

(18.31)

2n(H(X)+)

(by denition of A (PX ))

(n)
xA (PX )

(n)
xA (PX )

= 2n(H(X)+)

(n)

(n)
xA (PX )

= 2n(H(X)+) A(n) (PX ) .

(18.32)

(18.33)
(18.34)

Hence,
A(n) (PX ) 2n(H(X)+) .

(18.35)

Finally, Part 4 can be derived as follows. For n suciently large we have


1 < Pr X A(n) (PX )

(by Part 2, for n su. large)

PX (x)

(18.36)
(18.37)

(n)
xA (PX )

2n(H(X))
(n)
xA (PX )

(n)

(by denition of A (PX ))

= A(n) (PX ) 2n(H(X)) .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(18.38)
(18.39)

18.4. High-Probability Sets and the Typical Set

359

Hence,
A(n) (PX ) > (1 )2n(H(X))

(18.40)

for n suciently large.


Hence, we see that the typical set is a relatively small set (its size is only
about 2n H(X) , which is in contrast to the total of 2n length-n sequences), but
(n)
it contains almost all probability3 Pr A (PX ) > 1 .

18.4

High-Probability Sets and the Typical Set


(n)

We know that A (PX ) is small, but contains most of the probability. So


the question arises if there exists some other set that is even smaller than
(n)
A (PX ), but still contains most of the probability? In other words, does a
(n)
set B exist such that
(n)

A(n) (PX )

(18.41)

and
(n)

Pr X B

> 1 ?

(18.42)

(n)

Note that A (PX ) is not the smallest such set. This can be seen, for exam(n)
ple, by recalling that A (PX ) usually does not contain the most probable
sequence. However, as we will see, the smallest set has essentially the same
(n)
size as A (PX ).
In the following we will use the slightly sloppy, but common notation
Pr(B)

Pr[(X1 , . . . , Xn ) B],

(18.43)

where we assume that (X1 , . . . , Xn ) are IID PX ().


Denition 18.7 (High-Probability Set). For any r-ary alphabet X , any > 0,
(n)
and any positive integer n, let B X n be some set of length-n sequences
with
(n)

Pr B
(n)

Particularly, we could choose B


holds.

> 1 .

(18.44)

to be the smallest set such that (18.44)

Theorem 18.8. Let (X1 , X2 , . . . , Xn ) X n be a random sequence chosen IID


PX (). For 0 < < 1 and for any > 0 we then have
2
1
(n)
log B > H(X)
n

(18.45)

for n suciently large.


3

(n)

In literature, one often nds the slightly sloppy notation Pr A (PX ) instead of the
(n)
more precise Pr X A (PX ) .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

360

Asymptotic Equipartition Property and Weak Typicality


(n)

1
Proof: Fix some 0 < < 2 and consider the typical set A (PX ). We
know that for n suciently large, we have

Pr A(n) (PX ) > 1 .

(18.46)

Moreover, by assumption we have


(n)

Pr B

> 1 .

(18.47)

Using
Pr(A B) = Pr(A) + Pr(B) Pr(A B)

(18.48)

we obtain
(n)

Pr A(n) (PX ) B

(n)

= Pr A(n) (PX ) + Pr B

> 1

(n)

Pr A(n) (PX ) B

> 1

>1+11

=1

(18.49)
(18.50)
(18.51)

for n suciently large. Hence,


(n)

(18.52)

PX (x)

(18.53)

2n(H(X))

(18.54)

1 < Pr A(n) (PX ) B

(n)

(n)

xA (PX )B

(n)
(n)
xA (PX )B

(n)

= A(n) (PX ) B

(n)
B

2n(H(X))

2n(H(X)) ,

(18.55)
(18.56)

where the st inequality (18.52) follows from (18.51); the subsequent equality
(18.53) from (18.43); the subsequent inequality (18.54) from the fact that since
(n)
(n)
(n)
x A (PX ) B the sequence x must be element of A (PX ) and that
(n)
therefore according to the denition of A (PX )
PX (x) 2n(H(X)) ;

(18.57)

the subsequent equality (18.55) follows from counting the number of summands in the sum; and the nal inequality (18.56) follows because the number
(n)
(n)
of elements in A (PX ) B cannot be larger than the number of elements
(n)
in B .
Since it was assumed that < 1 and < 1 it follows that 1 > 0
2
2
and we can take the logarithm on both sides of (18.56):
(n)

log2 B

> log2 (1 ) + n(H(X) ),

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(18.58)

18.5. Data Compression Revisited

361

from which follows that


1
(n)
log2 B > H(X)
n

(18.59)

1
where = n log2 (1 ). Note that can be made arbitrarily small
by an appropriate choice of and of n large enough.

(n)
(n)
Hence B has at least 2n(H(X) ) elements. This means that A (PX )
(n)
has about the same size as B , and hence also the same size as the smallest
high-probability set.

18.5

Data Compression Revisited

(Reminder: In the following all logarithms have base 2, particularly H(U ) is


in bits!)
We now will see an example of very typical information theory: We use
typical sets to prove our fundamental result about maximum lossless compression (that we know already). The trick is to design a special compression
scheme that is not really practical, but very convenient for a proof.
Ck

binary

Vk

n-block

codewords

encoder

messages

parser

U1 , U2 , . . .
symbols

r-ary
DMS

Figure 18.1: A binary blocktovariable-length source compression scheme.


We would like to compress the output sequence generated by an r-ary
DMS with distribution PU . We use an n-block parser, so that each message
V is a sequence in U n . See Figure 18.1.
We now design a simple coding scheme. We start by grouping all possible
message sequences into two sets: the typical messages that are member of
(n)
(n)
A (PU ) and the nontypical messages that are not member of A (PU ). The
coding for the two groups of messages will then be quite dierent:
(n)

For each typical message v A (PU ), we assign a distinct xed-length


binary codeword of a certain length l1 . Since
A(n) (PU ) 2n(H(U )+) ,

(18.60)

it follows that a possible choice of l1 is


l1

log2 A(n) (PU )

log2 A(n) (PU ) + 1

n(H(U ) + ) + 1.

(18.61)
(18.62)
(18.63)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

362

Asymptotic Equipartition Property and Weak Typicality


(n)

For each nontypical message v A (PU ), we also assign a distinct


xed-length binary codeword, but of a dierent length l2 . Since
U n = rn ,

(18.64)

it follows that a possible choice of l2 is


log2 rn n log2 r + 1.

l2

(18.65)

Now we have two types of codewords, some are of length l1 and some are
of length l2 . In order to make sure that we can distinguish between them
(i.e., to make sure that the code is prex-free), we now put a 0 in front of all
codewords of length l1 , and a 1 in front of all codewords of length l2 . Hence,
the length of the codeword for a message v U n satises
(n)

l(v)

n(H(U ) + ) + 2 if v A (PU ),
n log2 r + 2
otherwise.

(18.66)

Note that the typical messages are mapped to a short codeword, while the
nontypical messages have long codewords. This seems to be a good compression scheme because we expect to see typical message most of the time.
Moreover, from Theorem 18.6 we also know that all typical sequences have approximately the same probability, so it seems reasonable to assign codewords
of identical length to them.
Lets now compute the expected codeword length:
PV (v)l(v)

E[l(V)] =

(18.67)

vU n

PV (v)l(v) +
(n)
vA (PU )

PV (v)l(v)

(18.68)

(n)
vA (PU )
/

PV (v) n(H(U ) + ) + 2
(n)
vA (PU )

PV (v) n log2 r + 2

(18.69)

(n)
vA (PU )
/

= n(H(U ) + ) + 2 Pr A(n) (PU )

+ n log2 r + 2

1 Pr

1
A(n) (PU )

(18.70)

< for n large enough


by Theorem 18.6-2

< n(H(U ) + ) + 2 + (n log2 r + 2)

= n(H(U ) + )

(18.71)
(18.72)

for n large enough, where

+ log2 r +

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

2 + 2
.
n

(18.73)

18.6. AEP for General Sources with Memory

363

Note that can be made arbitrarily small by choosing small and n large
enough.
So once again we have proven the following result that we know already
from Chapter 4.
Theorem 18.9. For any > 0, we can nd a prex-free binary code for an
r-ary DMS U such that the average codeword length per source symbol satises
E[L]
H(X) +
n

(18.74)

for n suciently large. Here n is the message length of the n-block parser.
Moreover, also note that from the properties of typical sets we once again
see that the output of a good data compression scheme will be almost uniformly distributed (compare with Remark 9.1): In order to make our scheme

almost perfect, we choose an extremely small , e.g., = 101 000 000 , and
(n)
> 1
make the blocklength (parser length) n very large such that Pr A
(Theorem 18.6 proves that this is possible!).
Now, basically all sequences that show up are typical and therefore all
get assigned a codeword of the same length l1 . Moreover, we also know from
Theorem 18.6 that all typical sequences have (almost) the same probability
2n H(X) . Hence, we see that the output codewords of this compressing scheme
are all equally likely and have the same length. Hence, the output digits indeed
are IID and uniformly distributed.

18.6

AEP for General Sources with Memory

So far the AEP and its consequences were based on the assumption that the
letters of the sequences are chosen IID. This turns out to be too restrictive.
There are many sources with memory that also satisfy the AEP.
Denition 18.10 (Source satisfying the AEP). A general source {Uk } taking
value in some discrete alphabet U is said to satisfy the AEP if the entropy
rate exists,
1
H(U1 , . . . , Un ),
n n

H({Uk }) = lim

(18.75)

and if
1
n
log PU1 ,...,Un (U1 , . . . , Un ) H({Uk })
n

in probability.

(18.76)

The class of sources that satisfy the AEP is rather large. Indeed the
following holds.
Theorem 18.11 (ShannonMcMillanBreiman Theorem). Any stationary and ergodic source taking value in a nite alphabet satises the AEP.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

364

Asymptotic Equipartition Property and Weak Typicality

Proof: The proof is quite elaborate and therefore omitted. The interested reader is referred to, e.g., [CT06, pp. 644] or [AC88].
We can now generalize the denition of typical sets as follows.
Denition 18.12. For a given > 0, a length n, and a source that satises
(n)
the AEP {Uk }, the typical set A ({Uk }) is dened as the set of all those
length-n source output sequences (u1 , . . . , un ) that have a probability close to
2n H({Uk }) :
A(n) ({Uk })

(u1 , . . . , un ) : 2n(H({Uk })+)


PU1 ,...,Un (u1 , . . . , un ) 2n(H({Uk })) . (18.77)

All the properties of Theorem 18.6 continue to hold. This is obvious because for the derivation of Theorem 18.6 only the AEP has been used. But in
Denition 18.12, we have made it the main assumption that the AEP holds.
We therefore omit the details of the proof, but only repeat the properties here.
Theorem 18.13.
(n)

1. If (u1 , . . . , un ) A ({Uk }), then


1
H({Uk }) log PU1 ,...,Un (u1 , . . . , un ) H({Uk }) + .
n

(18.78)

(n)

2. Pr (U1 , . . . , Un ) A ({Uk }) > 1 , for n suciently large.


(n)

3. A ({Uk }) 2n(H({Uk })+) , for all n.


(n)

4. A ({Uk }) > (1 ) 2n(H({Uk })) , for n suciently large.

18.7

General Source Coding Theorem

We next use our general denition of typical sets (Denition 18.12) to generalize the source coding theorem to general sources that satisfy the AEP. For
simplicity, we only state it for the case of binary (D = 2) codes. The extension
to general D is straightforward.
We weaken our requirements for source coding by allowing the coding
scheme to be unsuccessful for certain source output sequences, i.e., in certain
cases the source sequence cannot be recovered from the codeword. The success probability is then dened as the probability that the source generates a
sequence that is successfully encoded.
Specically, we consider a general encoding function
n : U n {0, 1}K

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(18.79)

18.7. General Source Coding Theorem

365

that maps a source sequences (U1 , . . . , Un ) into a binary codeword C of length


K. The eciency of the coding scheme n is given by its coding rate
Rn

K
bits per source symbol.
n

(18.80)

We can now prove the following general coding theorem.


Theorem 18.14 (General Source Coding Theorem for Sources
satisfying the AEP).
Consider a source {Uk } that satises the AEP. Then for every > 0,
there exists a sequence of coding schemes n with coding rate Rn satisfying
lim Rn < H({Uk }) +

(18.81)

such that the success probability tends to 1 as n tends to innity.


Conversely, for any source {Uk } that satises the AEP and any coding scheme n with
lim Rn < H({Uk }),

(18.82)

the success probability must tend to 0 as n tends to innity.


Proof: The direct part can been proven completely analogously to Section 18.5. For any source sequence that is typical, we assign a codeword of
length l1 , while for any nontypical sequence we allow the system to be unsuccessful. We omit the details.
For the converse, suppose that limn Rn < H({Uk }). Then for every
> 0 there exists some n0 large enough such that
Rn < H({Uk }) 2,
Using U

n n0 .

(18.83)

(U1 , . . . , Un ), we can now bound as follows:

Pr(success) = Pr success and U A(n) ({Uk })

+ Pr success and U A(n) ({Uk })

(18.84)

Pr success and U A(n) ({Uk }) + Pr U A(n) ({Uk })

(18.85)

=
(n)
uA ({Uk })

Pr[U = u] + Pr U

A(n) ({Uk })

(18.86)

u is successful

2n(H({Uk })) +

<

(18.87)

(n)
uA ({Uk })

u is successful

2K 2n(H({Uk })) +

(18.88)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

366

Asymptotic Equipartition Property and Weak Typicality


= 2nRn 2n(H({Uk })) +
n(H({Uk })Rn )

=2

<2

(18.89)

(18.90)

+ .

(18.91)

Here the inequality in (18.87) follows from (18.77) and from Theorem 18.13-2;
the subsequent inequality (18.88) holds because under the condition that all
sequences u are successful, each needs a distinct codeword, but there are only
at most 2K such binary sequences; and the last inequality (18.91) follows from
(18.83).

18.8

Joint AEP

We now return to the simpler situation of IID distributions. We next generalize the AEP (Theorem 18.2) to pairs of sequences (X, Y) where each letter
pair (Xk , Yk ) is dependent, but the sequences are still IID over k.

Theorem 18.15 (Joint AEP).


If the sequence of pairs of RVs (X1 , Y1 ), (X2 , Y2 ), . . . is IID PX,Y (, ),
then
1
n
log PX1 ,Y1 ,...,Xn ,Yn (X1 , Y1 , . . . , Xn , Yn ) H(X, Y )
n

in prob.
(18.92)

Proof: We note that


n

PX1 ,Y1 ,...,Xn ,Yn (x1 , y1 , . . . , xn , yn ) =

PX,Y (xk , yk ).

(18.93)

k=1

Hence, by the weak law of large numbers,


1
lim log PX1 ,Y1 ,...,Xn ,Yn (X1 , Y1 , . . . , Xn , Yn )
n
n
1
PX,Y (Xk , Yk )
= lim log
n n

(18.94)

k=1

1
= lim
n n

log PX,Y (Xk , Yk )

(18.95)

k=1

= E[log PX,Y (X, Y )]


= H(X, Y ).

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

in probability

(18.96)
(18.97)

18.9. Jointly Typical Sequences

18.9

367

Jointly Typical Sequences

Based on the joint AEP, we will also generalize the idea of typical sets to sets
of pairs of sequences that are jointly typical.
(n)

Denition 18.16. The set A (PX,Y ) of jointly typical sequences (x, y) with
respect to the joint distribution PX,Y (, ) is dened as
A(n) (PX,Y )

(x, y) X n Y n :
1
log P (x) H(X) < ,
n
1
log P (y) H(Y ) < ,
n
1
log P (x, y) H(X, Y ) <
n

(18.98)

where
n

PX,Y (xk , yk )

P (x, y)

(18.99)

k=1

and where P (x) and P (y) are the corresponding marginal distributions.
Remark 18.17. We see that if x and y are jointly typical, then by denition
they are also typical by themselves. However, the opposite is not true: If x
and y both are typical, then this does not necessarily mean that they are also
jointly typical. For this the additional condition
1
log P (x, y) H(X, Y )
n

(18.100)

has to be satised.
(n)

Now we can derive similar properties of the jointly typical set A (PX,Y ) as
(n)
we did for the typical set A (PX ). Like for Theorem 18.2 and Theorem 18.6,
the basic ingredient is again the weak law of large numbers.
(n)

Theorem 18.18 (Properties of A (PX,Y )).


Let (X, Y) be random vectors of length n that are drawn according to the
n
distribution P (x, y)
k=1 PX,Y (xk , yk ) (i.e., the component pairs are
IID). Then we have the following:
(n)

(n)

1. Pr (X, Y) A (PX,Y ) = Pr A (PX,Y ) > 1 , for n suciently large.


(n)

2. A (PX,Y ) 2n(H(X,Y )+) , for all n.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

368

Asymptotic Equipartition Property and Weak Typicality

(n)

3. A (PX,Y ) > (1 ) 2n(H(X,Y )) , for n suciently large.


4. Assume (X, Y) PX PY , i.e., the sequences X and Y are independently drawn according to the marginals of (18.99). Then

Pr (X, Y) A(n) (PX,Y ) 2n(I(X;Y )3)

for all n,


Pr (X, Y)

for n suciently large.


(18.102)

A(n) (PX,Y )

n(I(X;Y )+3)

> (1 ) 2

(18.101)

Proof: The proof is very similar to the proof of Theorem 18.6. The
rst part follows from the weak law of large numbers: Given > 0, we know
that

1
(18.103)
N1 s.t. n N1 : Pr log P (X) H(X) > < ,
n
3

1
(18.104)
N2 s.t. n N2 : Pr log P (Y) H(Y ) > < ,
n
3
1

N3 s.t. n N3 : Pr log P (X, Y) H(X, Y ) > < . (18.105)


n
3
By the union bound, we hence see that for all n max{N1 , N2 , N3 },
1 Pr A(n) (PX,Y )

= Pr

1
x : log P (x) H(X) >
n

1
y : log P (y) H(Y ) >
n
1
(x, y) : log P (x, y) H(X, Y ) >
n
1
Pr log P (X) H(X) >
n
1
+ Pr log P (Y) H(Y ) >
n
1
+ Pr log P (X, Y) H(X, Y ) >
n

< + + = .
3 3 3
The second part can be derived as follows:
1=

P (x, y)
(x,y)X n Y n

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(18.106)

(18.107)
(18.108)

(18.109)

18.9. Jointly Typical Sequences

369
P (x, y)

(n)
(x,y)A (PX,Y

(18.110)

2n(H(X,Y )+)

(n)
(x,y)A (PX,Y

(18.111)

A(n) (PX,Y ) ,

(18.112)

n(H(X,Y )+)

=2

where (18.111) follows from the third condition in the denition of the jointly
typical set (Denition 18.16). Hence,
A(n) (PX,Y ) 2n(H(X,Y )+) .

(18.113)

For the third part, note that for n suciently large we have from Part 1
1 < Pr A(n) (PX,Y )

(18.114)

P (x, y)

(18.115)

2n(H(X,Y ))

(18.116)

) 2n(H(X,Y )) ,

(18.117)

(n)
(x,y)A (PX,Y

(for n su. large)

(n)
(x,y)A (PX,Y

A(n) (PX,Y

(n)

where (18.116) follows again from the denition of A (PX,Y ). Hence,


A(n) (PX,Y ) > (1 ) 2n(H(X,Y ))

(18.118)

for n suciently large.


Finally, the fourth part is the only new statement. It basically says that if

we generate X and Y independently, then the chance that they accidentally


look like jointly typical is very small.

To prove this, assume that X and Y are independent, but have the same
marginals as X and Y, respectively. Then

Pr (X, Y) A(n) (PX,Y )

P ( ) P ( )
x
y

(n)
( , )A (PX,Y
xy

(n)
( , )A (PX,Y
xy

A(n) (PX,Y )

n(H(X,Y )+)

(18.119)

2n(H(X)) 2n(H(Y ))

(18.120)

2n(H(X))n(H(Y ))
n(H(X)+H(Y ))

(18.121)
(18.122)

n(H(X)+H(Y )H(X,Y )3)

(18.123)

n(I(X;Y )3)

(18.124)

=2

=2

where (18.120) follows from the rst two conditions of the denition of the
jointly typical set (Denition 18.16), and where (18.122) follows from Part 2.
The other direction can be derived in a similar fashion.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

370

Asymptotic Equipartition Property and Weak Typicality


Table 18.2: An example of a joint probability distribution.
Y
PX,Y

PX

1
2

1
4

1
4

3
4
1
4

PY

1
2

1
4

1
4

Example 18.19. Consider a joint probability distribution PX,Y given in Table 18.2.
Consider now two sequences of length n = 10:
x = (0, 1, 0, 0, 0, 1, 1, 0, 0, 0)

(18.125)

y = (0, 0, 2, 1, 0, 2, 0, 0, 1, 2).

(18.126)

We see immediately that the sequences look typical with respect to PX and
PY , respectively, however, they cannot have been generated jointly according
to PX,Y because we have various position that are not possible: e.g., (x, y) =
(1, 0) in the second position, or (x, y) = (1, 2) in position 6. So this pair of
sequences is not jointly typical.

To summarize we can say the following: There are about 2n H(X) typical
X-sequences, 2n H(Y ) typical Y -sequences, but only 2n H(X,Y ) jointly typical
sequences. Hence, the probability that a randomly chosen, but independent
pair of an X- and a Y -sequence happens to look jointly typical is only about
2nI(X;Y ) .

18.10

Data Transmission Revisited

The jointly typical set can be used to nd a very elegant proof of the achievability part of the coding theorem (Theorem 9.34). To demonstrate this, we
will now give a new version of the proof shown in Section 9.8.
Recall our IID random codebook construction based on a distribution
PX () (see (9.106)) and our encoder that, given the message m, simply transmits the mth codeword X(m) of C over the channel.
We will now design a new (also suboptimal) decoder: a so-called typicality
decoder. This decoder is quite useless for a practical system, however, it allows
a very elegant and simple analysis. It works as follows. After having received
some y, a typicality decoder looks for an m M such that

(X(m), y) is jointly typical, and

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

18.10. Data Transmission Revisited

371

there is no other message m = m such that (X(m ), y) is jointly typical.

If the decoder can nd such an m, then it will make the decision m = m,

otherwise it decides m = 0 (which is equivalent to declaring an error).

We will now analyze the performance of our scheme. We can still (without
loss of generality) assume that M = 1. Let
Fm

(X(m), Y) A(n) (PX,Y ) ,

m = 1, . . . , 2nR ,

(18.127)

be the event that the mth codeword and the received sequence Y are jointly
c
typical. Hence an error occurs if F1 occurs (i.e., if the transmitted codeword
and the received sequence are not jointly typical) or if F2 F3 F2nR occurs
(i.e., if one or more wrong codewords are jointly typical with the received
sequence Y). Hence, we can write
Pr(error) = Pr(error | M = 1)
=

c
Pr(F1

(18.128)

F2 F3 F2nR | M = 1)

(18.129)

2nR

c
Pr(F1 | M = 1) +

Pr(Fm | M = 1)

m=2

(18.130)

where the inequality follows from the union bound. From Theorem 18.15-1
we know that
Pr (X, Y) A(n) (PX,Y ) > 1

(18.131)

c
Pr(F1 | M = 1) <

(18.132)

i.e.,

for n suciently large. Furthermore, we know that Y is completely independent of X(m) for all m 2:

generate X(1)

independent

send it
through DMC

generate other
codewords X(m) using
the same PX (),
but independently

independent

receive Y
Hence, by Theorem 18.15-4 it now follows that for m 2,
Pr(Fm | M = 1) = Pr (X(m), Y) A(n) (PX,Y ) M = 1

n(I(X;Y )3)

n.

(18.133)
(18.134)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

372

Asymptotic Equipartition Property and Weak Typicality

Combining these results with (18.130) then yields


2nR

Pr(error)

c
Pr(F1

| M = 1) +

m=2

Pr(Fm | M = 1)

(18.135)

2nR

2n(I(X;Y )3)

<+

(18.136)

m=2

= + 2nR 1 2n(I(X;Y )3)

(18.137)

< 2nR
nR
n(I(X;Y )3)

(18.138)

n(I(X;Y )R3)

(18.139)

<+2
=+2

(18.140)

if n is suciently large and if I(X; Y ) R 3 > 0 so that the exponent is


negative. Hence we see that as long as
R < I(X; Y ) 3

(18.141)

for any > 0 we can choose n such that the average error probability, averaged
over all codewords and all codebooks, is less than 2:
Pr(error) < 2.

(18.142)

Note that while in Section 9.8 we relied on the weak law of large numbers
to prove our result, here we used typicality. But since typicality is based on
the AEP, which is simply another form of the weak law of large numbers, there
is no fundamental dierence between these two approaches.

18.11

Joint Source and Channel Coding Revisited

Finally, we also go back to the proof given in Section 13.3 and show another
version that is based on typicality. So lets assume that the source {Uk }
satises the AEP. We x an > 0 and consider the set of typical sequences
(K)
A ({Uk }). Our system now works in two stages. In a rst stage the encoder
will look at a sequence of K source symbols and check whether it is typical
or not. If it is typical, it assigns to it a unique number between 1 and M. If
it is not typical, it does not really matter what the encoder does because the
decoder will fail anyway. So we can assign any number in this case, e.g., we
might choose to assign the number 1.
Note that according to Theorem 18.13-3 the total number of typical sequences is upper-bounded by
M 2K(H({Uk })+) .

(18.143)

Moreover, also note that all typical sequences are by denition (almost) equally
(K)
likely: For every u A ({Uk }) we have
K
2K(H({Uk })+) Pr U1 = u 2K(H({Uk })) .

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

(18.144)

18.11. Joint Source and Channel Coding Revisited

373

Hence, we can use a standard channel code (designed for a uniform source)
to transmit these messages over the channel. So in the second stage, the
encoder will assign a unique codeword of length n to each of the M possible
messages. At the receiver side we use a joint-typicality decoder: Given the
received sequence Y1n , the decoder looks for a unique codeword x that is
jointly typical with Y1n . If it nds exactly one such codeword, it regenerates
the corresponding (typical) source sequence uK . If it nds none or more than
1
one, it declares an error.
This system will make an error if either the source sequence happens to
be nontypical or if the channel introduces a too strong error:
(K)
K
K
Pe = Pr U1 = U1

= Pr

K
U1

+ Pr

(K)
A

K
U1

(18.145)
Pr

(K)
A

K
U1

Pr

K
U1

K
U1

K
U1

=1
K
= U1

A(K)

K
(K)
U 1 A

Pr

= Pr

K
U1
K
U1

(K)
A
(K)
A

(18.146)
+ Pr

if K is
large enough
n
Pr (X1 , Y1n )

+ = 2

K
U1

K
(K)
K
= U 1 U 1 A

(n)
K
(K)
n
A(n) or (X1 , Y1n ) A
/
U 1 A
if R<C and K is large enough

(if R < C and K is large enough).

(18.147)

(18.148)
(18.149)

Here in (18.149) the rst upper bound follows from the properties of typical
sequences, and the second upper bound follows from the channel coding theorem and holds as long as the rate of the channel code is less than the channel
capacity.
The rate of our channel code can be bounded as follows:
R=

log2 M
log2 2K(H({Uk })+)

n
n
K
H({Uk }) +
=
n
Tc
Tc
=
H({Uk }) +
Ts
Ts
Tc

H({Uk }) + ,
Ts

(by (18.143))

(18.150)
(18.151)

(by (13.3))

(18.152)
(18.153)

where can be made arbitrarily small by making small.


Hence, we can guarantee that R < C as long as
Tc
H({Uk }) + < C.
Ts

(18.154)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

374

Asymptotic Equipartition Property and Weak Typicality

Since , and therefore , are arbitrary, our scheme will work if


H({Uk })
C
< .
Ts
Tc

18.12

(18.155)

Continuous AEP and Typical Sets

The AEP generalizes directly to the situation of continuous random variables.


Therefore, also all results concerning the Gaussian channel can be derived
based on typicality.
Theorem 18.20 (AEP for Continuous Random Variables).
Let X1 , . . . , Xn be a sequence of continuous random variables that are drawn
IID fX (). Then
1
n
log fX1 ,...,Xn (X1 , . . . , Xn ) E[log fX (X)] = h(X)
n

in probability.
(18.156)

Proof: The proof relies as before on the weak law of large numbers:
1
lim log fX1 ,...,Xn (X1 , . . . , Xn )
n n
n
1
fX (Xk )
= lim log
n n

(18.157)

k=1

1
= lim
n n

log fX (Xk )

(18.158)

k=1

= E[log fX (X)] = h(X)

in probability.

(18.159)
(n)

Denition 18.21. For > 0 and any n N we dene the typical set A (fX )
with respect to fX as follows:
A(n) (fX )

1
(x1 , . . . , xn ) X n : log f (x1 , . . . , xn ) h(X) (18.160)
n

where
n

fX (xk ).

f (x1 , . . . , xn )

(18.161)

k=1
(n)

Note that A (fX ) has innitely many members because X is continuous.


So it does not make sense to talk about the size of the set, but instead we
dene its volume.
Denition 18.22. The volume Vol(A) of a set A Rn is dened as
Vol(A)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

dx1 dxn .

(18.162)

18.12. Continuous AEP and Typical Sets

375

(n)

Theorem 18.23. The typical set A (fX ) has the following properties:
(n)

1. Pr A (fX ) > 1 , for n suciently large;


(n)

2. Vol A (fX ) 2n(h(X)+) , for all n;


(n)

3. Vol A (fX ) > (1 ) 2n(h(X)) , for n suciently large.


Proof: The proof is complete analogous to the proof of Theorem 18.6.
Part 1 follows from the AEP: > 0, N such that for n N we have
1
Pr log f (X1 , . . . , Xn ) h(X) < > 1 .
n

(18.163)

Now choose .
Part 2 can be derived as follows:

1=

Xn

f (x1 , . . . , xn ) dx1 dxn

(n)

(n)
A

(18.164)

f (x1 , . . . , xn ) dx1 dxn

(18.165)

2n(h(X)+) dx1 dxn

(18.166)

= 2n(h(X)+)

= 2n(h(X)+) Vol

(n)

dx1 dxn

A
A(n)

(18.167)
(18.168)

Hence,
Vol A(n) 2n(h(X)+) .

(18.169)

Part 3 follows from Part 1: For n suciently large we have


1 < Pr A(n)

(n)

(n)

(18.170)
f (x1 , . . . , xn ) dx1 dxn

(18.171)

2n(h(X)) dx1 dxn

(18.172)

A
n(h(X))

=2

Vol A(n) .

(18.173)

Hence,
Vol A(n) > (1 ) 2n(h(X))

(18.174)

for n suciently large.


The generalization to a corresponding joint AEP, jointly typical sets
(n)
A (fX,Y ), and their properties is completely analogous to Theorem 18.15,
Denition 18.16, and Theorem 18.18. We will therefore only state the results,
but omit their proofs.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

376

Asymptotic Equipartition Property and Weak Typicality

Theorem 18.24 (Joint AEP for Continuous Random Variables). If the


sequence of pairs of continuous RVs (X1 , Y1 ), (X2 , Y2 ), . . . is IID fX,Y (, ),
then
1
n
log fX1 ,Y1 ,...,Xn ,Yn (X1 , Y1 , . . . , Xn , Yn ) h(X, Y )
n

in probability.
(18.175)

(n)

Denition 18.25. The set A (fX,Y ) of jointly typical sequences (x, y) with
respect to the joint PDF fX,Y (, ) is dened as
1
(x, y) X n Y n : log f (x) h(X) < ,
n
1
log f (y) h(Y ) < ,
n
1
log f (x, y) h(X, Y ) <
n

A(n) (fX,Y )

(18.176)

where
n

fX,Y (xk , yk )

f (x, y)

(18.177)

k=1

and where f (x) and f (y) are the corresponding marginal PDFs.
(n)

Theorem 18.26 (Properties of A (fX,Y )). Let (X, Y) be random vectors


of length n with joint PDF
n

fX,Y (xk , yk )

f (x, y)

(18.178)

k=1

(i.e., (X, Y) are IID over k). Then we have the following:
(n)

(n)

1. Pr (X, Y) A (fX,Y ) = Pr A (fX,Y ) > 1 , for n suciently


large.
(n)

2. Vol A (fX,Y ) 2n(h(X,Y )+) , for all n.


(n)

3. Vol A (fX,Y ) > (1 ) 2n(h(X,Y )) , for n suciently large.


4. Assume (X, Y) fX fY , i.e., the sequences X and Y are independently


drawn according to the marginals of fX,Y . Then

Pr (X, Y) A(n) (fX,Y ) 2n(I(X;Y )3)


Pr (X, Y) A(n) (fX,Y ) > (1 ) 2n(I(X;Y )+3)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

for all n, (18.179)


for n suciently large.
(18.180)

18.13. Summary

18.13

377

Summary

Data Compression (based on AEP): There exists a small subset of all possible source sequences of size 2n H that contains almost all probability.
Hence, we can represent a source with H bits per symbol with a very
small probability of error.
Data Transmission (based on joint AEP): The output sequence is very
likely to be jointly typical with the transmitted codeword, and the probability that it is jointly typical with any other codeword is 2nI . Hence
we can use about 2nI codewords and still have a very small probability
of error.
Source Channel Separation Theorem: We can design a source code and
a channel code separately and still be optimal.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

Chapter 19

Cryptography
19.1

Introduction to Cryptography

Cryptography is Greek and means hidden writing. It is the art of transmitting


a message in hidden form such that only the intended recipient will be able to
read it. There are many ways of trying to achieve this goal. A message could
be hidden such that only a person who knows where to look for it will nd it.
Or the recipient needs to know a certain decoding scheme in order to make
sense into a scrambled text. Or the recipient needs to know or possess a key
to unlock an encrypted message.
Cryptography has been very important for a long time in history, simply
because information and knowledge often means power and money. Some
typical examples of how cryptography used to work before it became a science
are as follows:
The message is written in invisible ink that can be made visible by
adding a certain chemical or heat.
The message is tattooed into the skin on the head of a man, then the
man waits until his hair has grown back before he travels to the required
destination. There he shaves the hair o again to reveal the message.
The message is written row-wise into a square grid and then read columnwise to scramble it.
Cryptography has two very dierent goals that are often mixed up:
A cryptographic system provides secrecy if it determines who can receive
a message.
A cryptographic system provides authenticity if it determines who can
have sent a message.
As a matter of fact, it was only recently in the long history of cryptography
that people have started to appreciate this fundamental dierence.

379

c Stefan M. Moser, v. 3.2

380

Cryptography

In the following we will give a very brief overview of some main ideas
behind cryptographic systems. We start with Shannon who was the rst
person to look at cryptography in a proper scientic way. Basically, he changed
cryptography from being an art to become a scientic art.

19.2

Cryptographic System Model

In 1949, Claude E. Shannon published a paper called Communication Theory of Secrecy Systems [Sha49], which gave the rst truly scientic treatment
of cryptography. Shannon had been working on this topic due to the second
world war and the insights he gained caused him later on to develop information theory. However, after the war his work on cryptography was still
declared condent, and so he rst published his work on data compression
and transmission [Sha48]. So usually people rstly refer to the latter and
only then to cryptography, but it actually should be the other way around:
Cryptography really started the whole area of information theory.
In his 1949 paper, Shannon started by providing the fundamental way
of how one should look at a cryptographic system. His system is shown in
Figure 19.1. Actually, Shannon did not yet consider the public and private

public
random
source

enemy
cryptanalyst

message
source

private
random
source

encryptor

decryptor

public channel

destination

secure channel

Z
key source

Figure 19.1: System model of a cryptographic system the way Shannon dened
it.
random source, but their inclusion is straightforward. In this system model,
we have the following quantities:
plaintext:

X = (X1 , X2 , . . . , XJ )

(19.1)

ciphertext:

Y = (Y1 , Y2 , . . . , Yn )

(19.2)

secret key:

Z = (Z1 , Z2 , . . . , Zz )

(19.3)

public randomization:

R = (R1 , R2 , . . . , Rr )

(19.4)

private randomization:

S = (S1 , S2 , . . . , Ss )

(19.5)

Note that the length of these dierent sequences do not need to be the same.

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

19.3. Kerckho Hypothesis

381

The aim of the system is that the message X arrives at the destination in such a way that the enemy cryptanalyst cannot decipher it. To that goal the encryptor will use the secret key Z, the private randomizer S, and the public randomizer R to transform the message or plaintext X into an encrypted message Y, called ciphertext. It is then assumed that the enemy cryptanalyst can only observe Y and R, but does not know the realization of S and Z. Based on this observation, he will then try to reconstruct the original message X. The intended receiver, on the other hand, additionally knows the secret key Z, which must allow him to perfectly recover X from Y. Note, however, that S is not known to the decryptor, so it must be possible to get X back without it.
In this most common setup, we assume that the enemy cryptanalyst has access only to Y and R: This is called a ciphertext-only attack. There exist also more problematic situations where the enemy cryptanalyst also knows the original message X (and, e.g., tries to figure out the key), which is known as a known-plaintext attack, or, even worse, where he can actually choose the plaintext X himself, denoted a chosen-plaintext attack, and thereby pick cases that are particularly bad for the encryption and that will simplify the breaking.

19.3 Kerckhoff Hypothesis

There is one very important assumption implicit in our system model of Figure 19.1: We always assume that the enemy cryptanalyst has full knowledge of how our cryptosystem works. The only parts of the system that remain secret are the values of the key, the message, and the private randomizer.
This assumption cannot be proven and has been (or even sometimes still is) questioned. However, there are very good reasons for it: History has shown that any secret design will eventually be discovered. Either an insider changes sides, the enemy cryptanalyst is able to steal the plans or a machine, or he is simply able to reverse-engineer the system's design. Hence, security based on a secret design is not fail-proof at all.
This has been properly recognized towards the end of the 19th century by a man called Auguste Kerckhoff, who made the following postulate.

Postulate 19.1 (Kerckhoff's Hypothesis (1883)). The only part of a cryptographic system that should be kept secret is the current value of the secret key.

19.4 Perfect Secrecy

It is now time to investigate the system of Figure 19.1 in more detail. We see that the ciphertext Y is a function of X, Z, R, and S. Hence we have

H(Y|X, Z, R, S) = 0.     (19.6)

The decryptor has no access to S, but must be able to recover the plaintext from the ciphertext using the secret key and the public randomizer. Hence, a requirement for the system to work is that

H(X|Y, Z, R) = 0.     (19.7)

Implicit in the figure there are two more important assumptions:
• X, Z, R, and S are all statistically independent; and
• the secret key Z and the public randomizer R are to be used once only!
Based on these definitions and assumptions, Shannon now gave a simple, but very powerful definition of what a perfect cryptographic system should do.
Definition 19.2. A cryptographic system provides perfect secrecy if the enemy cryptanalyst's observation (Y, R) is statistically independent of the plaintext X:

(Y, R) ⫫ X.     (19.8)

In information theoretic notation, this is equivalent to saying that

H(X|Y, R) = H(X).     (19.9)

It is good to know that it is actually possible to design such perfect-secrecy systems!
Example 19.3 (One-Time Pad). One possible design of a system that provides perfect secrecy is shown in Figure 19.2. It is called the one-time pad.

[Figure 19.2: A cryptosystem that provides perfect secrecy: the one-time pad. A binary source emits plaintext bits Xk, which are added modulo 2 to key bits Zk from a BSS; the ciphertext bits Yk travel over the public channel, observed by the enemy cryptanalyst, to the destination, where adding the same key bits Zk, delivered over the secure channel, recovers Xk.]

The main idea is to use modulo-2 adders that scramble the plaintext with the key bits. As long as the key bits are IID uniform, the system provides perfect secrecy, independently of the distribution of the plaintext, as can be shown as follows:
Pr[Y = y | X = x] = Pr[X ⊕ Z = y | X = x]     (19.10)
                  = Pr[x ⊕ Z = y | X = x]     (19.11)
                  = Pr[Z = y ⊕ x | X = x]     (19.12)
                  = (1/2)^n,                  (19.13)


where the last step follows because we have assumed that {Zk} is IID uniform. Therefore we see that Pr[Y = y | X = x] does not depend on x, i.e., X ⫫ Y.
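To make the one-time pad concrete, here is a minimal sketch in Python (the function and variable names are our own choice and not part of the text): encryption and decryption are the same XOR operation, and the key must be as long as the message and used only once.

```python
import secrets

def one_time_pad(message: bytes, key: bytes) -> bytes:
    """XOR the message with the key; the same call also decrypts the ciphertext."""
    assert len(key) == len(message), "the key must be as long as the message"
    return bytes(m ^ k for m, k in zip(message, key))

plaintext = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(plaintext))     # IID uniform key bits
ciphertext = one_time_pad(plaintext, key)
assert one_time_pad(ciphertext, key) == plaintext
```

Since every key byte is uniform and used only once, every ciphertext value is equally likely whatever the plaintext is, which is exactly the independence required by Definition 19.2.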

Shannon did not only come up with the fundamental definition of perfect secrecy, but he was also able to prove a fundamental property of it.

Theorem 19.4 (Bound on Key Size). In any system that provides perfect secrecy, we must have

H(X) ≤ H(Z).     (19.14)

Proof: In spite of the theorem's importance, its proof is astonishingly simple. It is based on basic properties of entropy:

H(X|Y, R) ≤ H(X, Z|Y, R)                (19.15)
          = H(Z|Y, R) + H(X|Z, Y, R)    (19.16)
          = H(Z|Y, R)                   (19.17)
          ≤ H(Z).                       (19.18)

Here, (19.15) follows because adding random variables cannot decrease uncertainty; (19.16) follows from the chain rule; in (19.17) we use our assumption of perfect recovery (19.7); and the final inequality (19.18) holds because conditioning cannot increase entropy.
Hence, if we want perfect secrecy, i.e., H(X|Y, R) = H(X), then we must have that H(X) ≤ H(Z).
In other words, Theorem 19.4 says that the used key must be at least as long¹ as the message!
The alert reader will therefore now complain that instead of going through all the trouble of encrypting and decrypting the message and transmitting the secret key (which is longer than the message!) over the secure channel, we should instead simply use the secure channel to transmit the message in the first place!
This is not a completely fair observation. For example, there might be a time difference between the transmission of the secret key over the secure channel and the transmission of the ciphertext over the public channel. Consider, e.g., the situation of a ship: The key could be generated and brought aboard before the ship departs, but the secret messages are then sent via radio transmission once the ship is far out at sea and no secure channel exists anymore.

¹ Strictly speaking this is only true with the additional assumption that all message bits and all key bits are independent and uniform, such that they maximize entropy. If the message bits have strong dependencies, i.e., H(X) is not very large, we might have a shorter key than the message and still get perfect secrecy as long as (19.14) is satisfied.


Nevertheless, the alert reader has a point. His observation very clearly puts its finger on the two most fundamental problems of the system shown in Figure 19.1:
• In a practical situation, we would like to have the key size much smaller than the plaintext. In other words, we would like to reuse the same key Z for different messages X.²
• In a practical situation we often do not have a secure channel available.
We will try to address these two issues in the following sections. We start with an investigation of what happens if one shortens the key length in such a way that (19.14) is violated.

19.5 Imperfect Secrecy

We have learned that perfect secrecy is possible, but that it is rather expensive because it needs a large amount of secret key. In many practical situations perfect secrecy seems too complicated. So the question is what we can do if we are happy with imperfect secrecy. To analyze this, Shannon again introduced a quantity that very accurately describes almost every imperfect cryptographic system.
So we consider again Figure 19.1, but assume for the moment that there are no public and private randomizers R and S. Moreover, we assume that we use a system that uses a relatively short finite-length secret key Z for a long message (or equivalently, many concatenated short messages) and that therefore cannot provide perfect secrecy.
How can we characterize our system? Obviously, since the secret key Z is of finite length, if we keep using it to encode a sequence of message digits {Xk}, eventually it must be possible to compute the value of Z simply from the observation of {Yk}. So the quality of our system will be determined by
• the minimum number of ciphertext digits Y1, . . . , Yku that are needed to determine Z uniquely; and
• the difficulty of actually computing Z from such a sequence Y1, . . . , Yku.
The latter depends very strongly on the system and on our ingenuity in attacking the cryptographic system. We will discuss more about that in Section 19.6. The former, however, can be described in quite general form for any reasonably good system. To that goal, Shannon gave the following definitions.
Definition 19.5. The key equivocation function feq(·) is defined as follows:

feq(k) ≜ { H(Z)                   for k = 0,
           H(Z|Y1, . . . , Yk)    for k = 1, 2, 3, . . .     (19.19)

² Note that reusing the same key for different messages can be seen as a short-length key Z being used for a very long message X that actually consists of the concatenation of many messages.


Furthermore, the unicity distance ku is defined as the smallest k such that feq(k) ≈ 0. Hence ku is the smallest amount of ciphertext necessary to determine the key uniquely.³
Ingeniously, Shannon was able to derive a general form of feq(·) that holds basically for all reasonably good cryptographic systems. He only made two reasonable assumptions about a cryptosystem:
1. A good cryptosystem will try to make Y1, . . . , Yk look totally random for as large k as possible. Hence, for as large k as possible,

   H(Y1, . . . , Yk) ≈ k log |Y|.     (19.20)

2. By definition, when the secret key Z is known, Y = (Y1, . . . , Yn) determines X completely:

   H(Y1, . . . , Yn | Z) = H(X).     (19.21)

   For most ciphers and sources, it then holds that Y1, . . . , Yk will determine a certain percentage of X. Usually,

   H(Y1, . . . , Yk | Z) ≈ (k/n) H(X)     (19.22)

   is a good approximation.
Based on these two assumptions we now get

H(Y1, . . . , Yk, Z) = H(Z) + H(Y1, . . . , Yk | Z)     (19.23)
                     ≈ H(Z) + (k/n) H(X)                (19.24)

and

H(Y1, . . . , Yk, Z) = H(Y1, . . . , Yk) + H(Z | Y1, . . . , Yk)     (19.25)
                     ≈ k log |Y| + feq(k).                           (19.26)

Hence,

feq(k) ≈ H(Z) − k log |Y| + (k/n) H(X)                   (19.27)
       = H(Z) − k log |Y| (1 − H(X)/(n log |Y|))         (19.28)
       ≜ H(Z) − k ρ log |Y|,                             (19.29)

where the last equality should be read as the definition of ρ.


³ However, as mentioned above already, note that even though theoretically it must be possible to determine the key Z from Y1, . . . , Yku because H(Z|Y1, . . . , Yku) ≈ 0, in practice this should still be a very difficult task!

Proposition 19.6. For most reasonable ciphers, the key equivocation function is linear in k:

feq(k) ≈ H(Z) − k ρ log |Y|     (19.30)

with a slope −ρ log |Y|, where

ρ ≜ 1 − H(X)/(n log |Y|).     (19.31)

The basic behavior of the key equivocation function is shown in Figure 19.3.

[Figure 19.3: The key equivocation function: feq(k) starts at H(Z) for k = 0 and decreases with slope −ρ log |Y| until it reaches zero at the unicity distance ku.]


From Proposition 19.6 it now follows that the unicity distance is approximately given as

ku ≈ H(Z) / (ρ log |Y|).     (19.32)

Therefore, since we would like ku to be as big as possible, we need ρ to be as small as possible, i.e., H(X) should be as large as possible! For exactly this goal, the private randomizer S is useful: It helps to increase the entropy!
Example 19.7 (Usage of Private Randomizer). In English text (26 letters + space), the most likely letter is "space" (probability around 0.1859), then "e" (probability around 0.1031), etc. If we convert such text into 11-bit symbols (2^11 = 2048) in the manner that 381 (≈ 2048 · 0.1859) of the 2048 symbols represent "space", 211 (≈ 2048 · 0.1031) symbols represent "e", etc., and then choose any of the possible representations uniformly at random using the private randomizer S, then the output looks uniformly distributed! Note that the receiver does not need to know S because he can simply replace any of the 381 space-symbols by "space", etc.


So we see that if we use a private randomizer S, in (19.31) we have to replace H(X) by H(X) + H(S). Therefore the private randomizer helps to increase the unicity distance, i.e., more ciphertext will be needed before Z can (theoretically) be computed from it. The above analysis is not affected by the use of a public randomizer.
Note that if instead of a ciphertext-only attack we allow a known-plaintext attack (the enemy cryptanalyst also knows X) or, even worse, a chosen-plaintext attack (the enemy cryptanalyst can choose X himself), then we need to replace H(X) by 0, i.e., ρ = 1.
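As a rough numerical illustration (our own, not from the text), consider a simple substitution cipher on English letters: the key is a permutation of the 26-letter alphabet, so H(Z) = log2(26!) ≈ 88.4 bits, and with roughly 1.5 bits of entropy per English letter the redundancy is ρ ≈ 1 − 1.5/log2(26). A short Python sketch evaluating (19.32):

```python
import math

H_Z = math.log2(math.factorial(26))   # key entropy of a random substitution table
log_Y = math.log2(26)                 # log of the ciphertext alphabet size
H_X_per_letter = 1.5                  # rough entropy of English text, bits per letter
rho = 1 - H_X_per_letter / log_Y      # redundancy, cf. (19.31)

k_u = H_Z / (rho * log_Y)             # unicity distance, cf. (19.32)
print(f"H(Z) = {H_Z:.1f} bits, rho = {rho:.2f}, k_u = {k_u:.1f} letters")
```

This reproduces the well-known observation that a few dozen ciphertext letters already determine the key of a simple substitution cipher in principle, even though actually finding it is a separate (and harder) problem.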

19.6 Computational vs. Unconditional Security

Since perfect secrecy is difficult to achieve, most practical systems rely on practical security, which means that we do not rely on the impossibility of breaking a code, but on the difficulty of breaking it.
If the best supercomputer needs 500 years on average to break a certain cipher, then this is good enough for us (as in 500 years the encrypted message is probably useless anyway). We say that such a system is computationally secure.
However, note that we have the problem that we are not able to prove the security! There still is the possibility that someone has a brilliant idea and suddenly finds a way of cracking the code within two hours. So this means that computational security is not a guaranteed security.
So, by relying on computational security, we try to deal with the impractical requirement that the secret key has to have more entropy than the plaintext (see Theorem 19.4). Another way to put this is that if we rely on computational security, then we can reuse the same key many times. An attacker will not be able to break our code even if, according to the equivocation function feq, the secret key is already fully determined by {Yk}.
Unfortunately, in Figure 19.1 we still have another obstacle that hinders us in practical implementations: the secure channel. We have already mentioned that sometimes there might be a secure channel for a certain time period (e.g., during the time a ship is in the harbor), but not anymore afterwards. Unfortunately, in many practical situations there simply is no secure channel. Consider, e.g., the Internet: Nobody is willing to fly to the United States before being able to establish a secure connection with the amazon.com servers...
Incredibly, there is a solution to this problem: Public-key cryptography gets rid of the need for a secure channel!

19.7 Public-Key Cryptography

In 1976, Diffie and Hellman [DH76] (and independently Merkle [Mer78]) published a paper that gave a fundamentally new idea: They realized that if we are willing to rely on computationally secure systems only, then we can do without a secure channel!
So, for the remainder of this chapter we leave Figure 19.1 behind and consider a new system. We will depict the exact shape of this system later in Section 19.7.2, see Figures 19.5 and 19.6.
The basic idea can be explained by the analogy shown in Figure 19.4: Mr. A would like to send a locked suitcase to his friend B without having to send the key. How can he do it? The idea is simple. Firstly he sends the suitcase locked. The friend B cannot open it as he does not have the key. But he can add an additional lock and send the suitcase back to Mr. A. Mr. A now removes his own lock (the lock of his friend B remains there and Mr. A cannot open it because only B has the key) and then sends the suitcase again to B. B can now remove his own lock and open the suitcase. We realize: The suitcase has never been traveling without a lock, but there was no need of sending a key!

[Figure 19.4: Analogy on how one can send secret messages without exchanging keys: the suitcase travels over the public channel from A to B with A's lock, back to A with both locks, and finally to B with only B's lock, which B can open.]
To explain how this idea can be translated into the world of digital communication we need to discuss two crucial definitions introduced by Diffie, Hellman, and Merkle: the one-way function and the trapdoor one-way function.

19.7.1 One-Way Function

Definition 19.8. A one-to-one function f(·) is a one-way function if
• it is easy to compute y = f(x) for every x, but
• it is computationally impossible to compute x = f^(−1)(y) for virtually all y.
The reader may note the unclear language we have been using here: "easy", "computationally impossible", "virtually all", . . . These are all terms that mathematically do not make much sense; however, it turns out that for engineers they are quite clear and very useful.
As a matter of fact, one-way functions were not inventions of Diffie and Hellman, but they had been used in different contexts before. Best known is their use in login systems: A user i chooses a password xi that he needs to present to the login screen in order to be allowed to log into a computer system. The computer, however, does not store this password, but computes yi = f(xi) and stores (i, yi). When logging in, user i gives xi, the computer computes yi and compares it with the stored value. If they are identical, the login is approved, otherwise rejected. The advantage of this system is that if someone somehow gets hold of (i, yi), this will not help him to break into the system because he cannot compute xi = f^(−1)(yi).
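A minimal sketch of such a login check in Python, assuming a cryptographic hash (here SHA-256 with a random salt) as the one-way function; the function and variable names are our own and not part of the text:

```python
import hashlib
import secrets

def store_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, y) where y = f(salt, password); only this pair is stored."""
    salt = secrets.token_bytes(16)
    y = hashlib.sha256(salt + password.encode()).digest()
    return salt, y

def check_password(password: str, salt: bytes, y: bytes) -> bool:
    """Recompute the one-way function and compare with the stored value."""
    candidate = hashlib.sha256(salt + password.encode()).digest()
    return secrets.compare_digest(candidate, y)

salt, y = store_password("correct horse battery staple")
assert check_password("correct horse battery staple", salt, y)
assert not check_password("wrong guess", salt, y)
```

Whoever steals the stored pair still faces the problem of inverting the hash, which is believed to be computationally infeasible.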
We will now give an example of a (conjectured) one-way function.

Example 19.9. The function

y = f(x) = g^x  (mod p),     (19.33)

where g is a fixed base and p is a large prime number such that n = p − 1 has a large prime factor (ideally n = p − 1 = 2p̃ for p̃ also prime), is a (conjectured⁴) one-way function:
• x = log_g y (mod p) is very hard to compute;
• g^x (mod p) is easy to compute by squaring: For example,

  g^1300 = g^1024 · g^256 · g^16 · g^4                     (19.34)
         = g^(2^10) · g^(2^8) · g^(2^4) · g^(2^2)          (19.35)
         = (((g²)²···)²) · (((g²)²···)²) · ((g²)²)² · (g²)²,     (19.36)

  where the first factor takes 10 repeated squarings and the second 8, needs only 10 squaring operations (if we store the intermediate values after 2, 4, and 8 squaring operations) and 3 multiplications. Hence, it is very fast.

⁴ Again, this has not been proven. There is strong evidence, but theoretically there exists the possibility that one day someone has a brilliant idea on how to quickly compute discrete logarithms, changing the one-way function into a normal two-way function.
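The square-and-multiply idea can be sketched in a few lines of Python (this is only an illustration of the principle; Python's built-in pow(g, x, p) does the same thing):

```python
def mod_exp(g: int, x: int, p: int) -> int:
    """Compute g**x mod p with repeated squaring: O(log x) multiplications."""
    result = 1
    base = g % p
    while x > 0:
        if x & 1:                  # current binary digit of the exponent is 1
            result = (result * base) % p
        base = (base * base) % p   # square for the next binary digit
        x >>= 1
    return result

assert mod_exp(3, 1300, 1009) == pow(3, 1300, 1009)
```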

How can we now use one-way functions in the context of cryptography? Consider the following system of generating a common key between two users: Fix g and p in advance. User i chooses a private key xi (mod p) and computes a public key yi = g^(xi) (mod p). User j does the same thing. If users i and j want to communicate, then user i fetches yj and user j fetches yi from a trusted public database that contains the public keys of many users.
User i now computes

(yj)^(xi) = (g^(xj))^(xi) = g^(xj·xi)  (mod p)     (19.37)

and user j computes

(yi)^(xj) = (g^(xi))^(xj) = g^(xi·xj)  (mod p).     (19.38)

Both get the same value! This common value g^(xi·xj) (mod p) can now be used as a common secret key in a standard cryptographic system.
An enemy cryptanalyst can also fetch yi and yj, but he can only compute the common secret key if he manages to compute xi = log_g yi (mod p) or xj = log_g yj (mod p), which, as said, is very difficult.
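A toy version of this key agreement in Python (with an unrealistically small prime so the numbers stay readable; all names and parameter values are our own choices):

```python
import secrets

p = 2_147_483_647                    # a (far too small) prime, for illustration only
g = 7                                # a fixed public base

x_i = secrets.randbelow(p - 2) + 1   # user i's private key
x_j = secrets.randbelow(p - 2) + 1   # user j's private key
y_i = pow(g, x_i, p)                 # public keys placed in the trusted database
y_j = pow(g, x_j, p)

key_i = pow(y_j, x_i, p)             # user i computes (y_j)^(x_i) mod p
key_j = pow(y_i, x_j, p)             # user j computes (y_i)^(x_j) mod p
assert key_i == key_j                # both hold g^(x_i * x_j) mod p
```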
The problem of this system is that both users will always use the same secret key, which is not a good idea. So Diffie and Hellman expanded their idea to not only find a common key, but to directly create a complete public-key cryptosystem.

19.7.2 Trapdoor One-Way Function

Definition 19.10. A trapdoor one-way function fz(·) is a family of one-to-one functions with parameter z (the trapdoor) satisfying the following properties:
1. If z is known, it is easy to find algorithms Ez and Dz that easily compute fz and fz^(−1), respectively.
2. For virtually all z and y, it is computationally infeasible to compute x = fz^(−1)(y) even if Ez is known.
3. It is easy to pick a value of z at random.
Again note the mathematically unclear terms "easy", "computationally infeasible", and "virtually all".
Let us first assume that such a trapdoor one-way function actually exists and show how we can use it.
We can create a public-key cryptosystem in the following way. Every user i chooses a trapdoor zi. (This is easy due to Property 3.) User i then finds his encryption algorithm Ezi and his decryption algorithm Dzi. He publishes Ezi in a trusted public database (TPD). Note that usually Ezi and Dzi are known in advance apart from some parameters, i.e., we only need to publish these parameters. The decryption algorithm Dzi is kept secret. We say that
• Ezi is the public key, and
• Dzi is the private key.
This system now works both for secrecy and authenticity:

Secrecy: User i wants to send a secret message X to user j. He fetches Ezj of user j from the TPD, computes Y = fzj(X), and transmits Y. The ciphertext Y can only be decrypted by user j because only he knows Dzj to compute X = fzj^(−1)(Y).
The corresponding system is shown in Figure 19.5.

[Figure 19.5: System model of a public-key cryptosystem that provides secrecy. The message source feeds the encryptor, which uses the recipient's public key Ez fetched from the trusted public database (TPD); the ciphertext travels over the public channel, observed by the enemy cryptanalyst, to the decryptor, which uses the private key Dz from the key source to recover the message for the destination.]

Authenticity (digital signature): User i applies his private decryption algorithm Dzi to X: X̂ = fzi^(−1)(X), and then transmits (X, X̂) to user j. User j can get Ezi from the TPD and use it to compute fzi(X̂). If X = fzi(X̂), then this proves that X must come from user i, since only user i knows Dzi.
The enemy cryptanalyst might try to send a fake message X̃ to user j and to forge a corresponding signature, pretending that the message came from the legitimate user i. But even though the enemy cryptanalyst has access to Ezi, he cannot create a proper X̂ because for that he would need to know Dzi.
The corresponding system is shown in Figure 19.6.
Note that this idea of a digital signature was the most important contribution of public-key cryptography!

[Figure 19.6: System model of a public-key cryptosystem that provides authenticity. The message source feeds the encryptor, which uses the sender's private key Dz from the key source to create the signature; the result travels over the public channel, observed by the enemy cryptanalyst, to the decryptor, which checks it using the sender's public key Ez fetched from the TPD before passing the message to the destination.]

We finish our short insight into public-key cryptography by stating one of the best known (conjectured) trapdoor one-way functions. We will only give a rough big picture and omit any mathematical details.

Conjecture 19.11 (RSA Trapdoor One-Way Function). The Rivest-Shamir-Adleman (RSA) trapdoor is given by

z = (p1, p2, e),     (19.39)

where p1, p2 are large distinct prime numbers such that p1 − 1 and p2 − 1 have large prime factors and where e is a unit in Z_(p1−1)(p2−1) (i.e., an integer for which there exists a multiplicative inverse in Z_(p1−1)(p2−1)). This is easy to pick at random because powerful algorithms are available to check whether a number is prime or not.
Using

m ≜ p1 p2     (19.40)

we now define the trapdoor one-way function as follows:

fz(x) = x^e  (mod m).     (19.41)

The public key Ez is

Ez = (m, e).     (19.42)

Note that with the knowledge of (m, e), fz(x) can easily be computed by the squaring approach shown in Section 19.7.1.
The private key Dz is

Dz = (m, d),     (19.43)

where

d ≜ e^(−1)  (mod (p1 − 1)(p2 − 1)).     (19.44)

It can be shown that

(x^e)^d  (mod m) = x,     (19.45)

i.e., it is easy to compute x = fz^(−1)(y) by the squaring approach. However, even if we know (m, e), to compute d we need to factorize m = p1 p2, which is very hard!

Recovering the secret exponent d from the public key (m, e) has been proven to be equivalent to factorizing m. So as long as no one finds a way of quickly factorizing large numbers into their primes, the cryptosystems used throughout the world for banking, controlling, communicating, etc., are safe.
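A toy numerical check of (19.39)-(19.45) in Python, with tiny primes chosen only for readability (real RSA uses primes of hundreds of digits; all names and numbers here are our own illustration):

```python
from math import gcd

p1, p2 = 61, 53                # tiny primes, for illustration only
m = p1 * p2                    # m = 3233
phi = (p1 - 1) * (p2 - 1)      # (p1 - 1)(p2 - 1) = 3120
e = 17                         # a unit in Z_phi: gcd(e, phi) == 1
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # d = e^(-1) mod (p1-1)(p2-1), cf. (19.44)

x = 65                         # a message
y = pow(x, e, m)               # encryption f_z(x) = x^e mod m, cf. (19.41)
assert pow(y, d, m) == x       # decryption (x^e)^d mod m = x, cf. (19.45)
```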


Appendix A

Gaussian Random Variables


In many communication scenarios the noise is modeled as a Gaussian stochastic process. This is sometimes justied by invoking the central limit theorem
that says that many small independent disturbances add up to a random process that is approximately Gaussian. But even if no such theorem can be used,
the engineers like to use Gaussian random processes because even though
Gaussian processes might seem quite scary at rst they are actually well
understood and often allow closed-form solutions.
Rather than starting immediately with the denition and analysis of Gaussian processes, we shall take the more moderate approach and start by rst
discussing Gaussian random variables. Building on that we shall discuss Gaussian vectors in Appendix B, and only then introduce Gaussian processes in
Appendix C. Our aim is to convince you that
Gaussians are your friends!

A.1 Standard Gaussian Random Variables

We begin with the definition of a standard Gaussian random variable.

Definition A.1. We say that the random variable W has a standard Gaussian distribution if its probability density function (PDF) fW(·) is given by

fW(w) = (1/√(2π)) e^(−w²/2),   w ∈ R.     (A.1)

See Figure A.1 for a plot of this density.


For this denition to be meaningful it had better be the case that the
right-hand side of the above is a density function, i.e., that it is nonnegative
and that it integrates to one. This is indeed the case. In fact, the density is
positive, and it integrates to one because

w2
2

dw =

2.

(A.2)

395

c Stefan M. Moser, v. 3.2

396

Gaussian Random Variables

0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
5

2 1

Figure A.1: The standard Gaussian probability density function.


This latter integral can be verified by computing its square as follows:

( ∫_{−∞}^{∞} e^(−w²/2) dw )² = ∫_{−∞}^{∞} e^(−w²/2) dw · ∫_{−∞}^{∞} e^(−v²/2) dv     (A.3)
  = ∫∫ e^(−(w²+v²)/2) dw dv     (A.4)
  = ∫_{−π}^{π} ∫_{0}^{∞} r e^(−r²/2) dr dφ     (A.5)
  = 2π ∫_{0}^{∞} e^(−u) du     (A.6)
  = 2π.     (A.7)

Here we used the change of variables from Cartesian (w, v) to polar (r, φ):

w = r cos φ,   v = r sin φ,   r ≥ 0,   −π ≤ φ < π,   dw dv = r dr dφ.     (A.8)

Note that the density of a standard Gaussian random variable (A.1) is symmetric about the origin, so that if W is a standard Gaussian, then so is −W. This symmetry also establishes that the expectation of a standard Gaussian is zero:

E[W] = 0.     (A.9)

The variance of a standard Gaussian can be computed using integration by parts:

(1/√(2π)) ∫_{−∞}^{∞} w² e^(−w²/2) dw
  = (1/√(2π)) ∫_{−∞}^{∞} w (−d/dw e^(−w²/2)) dw     (A.10)
  = (1/√(2π)) [−w e^(−w²/2)]_{−∞}^{∞} + (1/√(2π)) ∫_{−∞}^{∞} e^(−w²/2) dw     (A.11)
  = (1/√(2π)) ∫_{−∞}^{∞} e^(−w²/2) dw     (A.12)
  = 1.     (A.13)

Here we have used the fact that the limit of w exp(−w²/2) as |w| tends to infinity is zero, and in the last line we used (A.2).

A.2 Gaussian Random Variables

We next define a Gaussian (not necessarily standard) random variable as the result of applying an affine transformation to a standard Gaussian.

Definition A.2. We say that a random variable X is Gaussian if it can be written in the form

X = aW + b     (A.14)

for some deterministic real numbers a, b ∈ R and for a standard Gaussian W.
If b = 0, i.e., if X can be written as X = aW for some a ∈ R, then we call X a centered Gaussian.

Remark A.3. We do not preclude a from being zero. The case a = 0 leads to X being deterministically equal to b. We thus include the deterministic random variables in the family of Gaussian random variables.
Remark A.4. From Definition A.2 it follows that the family of Gaussian random variables is closed with respect to affine transformations: If X is Gaussian and α, β ∈ R are deterministic, then αX + β is also Gaussian.

Proof: Since X is Gaussian, it can be written as

X = aW + b,     (A.15)

where W is a standard Gaussian. Consequently,

αX + β = α(aW + b) + β     (A.16)
       = (αa)W + (αb + β),     (A.17)

which demonstrates that αX + β is Gaussian, because αa is a deterministic real number, αb + β is a deterministic number, and W is a standard Gaussian.
If (A.14) holds, then the random variables on the right-hand and the left-hand sides of (A.14) must be of equal mean. But the mean of a standard Gaussian is zero, so that the mean of the right-hand side of (A.14) is b. The left-hand side is of mean E[X], and we thus conclude that in the representation (A.14) the deterministic constant b is uniquely determined by the mean of X, and in fact

b = E[X].     (A.18)

Similarly, since the variance of a standard Gaussian is one, the variance of the right-hand side of (A.14) is a². So because the variance of the left-hand side is Var(X), we conclude that

a² = Var(X).     (A.19)

Up to its sign, the deterministic constant a in the representation (A.14) is thus also unique.
Based on the above, one might mistakenly think that for any given mean μ and variance σ² there are two different Gaussian distributions corresponding to

σW + μ   and   −σW + μ,     (A.20)

where W is a standard Gaussian. This, however, is not the case, as we show in the following lemma.

Lemma A.5. There is only one Gaussian distribution of a given mean μ and variance σ².

Proof: This can be seen in two different ways. The first way is to note that both representations in (A.20) actually lead to the same distribution, because the standard Gaussian W has a symmetric distribution, so that W and −W have the same law. The other approach is based on computing the density of σW + μ and showing that it is a symmetric function of σ, see (A.22).
Having established that there is only one Gaussian distribution of a given mean μ and variance σ², we denote it by

N(μ, σ²)     (A.21)

and set out to study its PDF. If σ² = 0, then the Gaussian distribution is deterministic with mean μ and has no density¹. If σ² > 0, then the density can be computed from the density of the standard Gaussian distribution: If X is N(μ, σ²)-distributed, then since μ + σW is Gaussian of mean μ and variance σ² where W is a standard Gaussian, it follows from the fact that there is only one Gaussian distribution of given mean and variance that the distribution of X is the same as the distribution of μ + σW. The density of the latter can be computed from (A.1) to yield that the density fX of a N(μ, σ²)-distributed Gaussian random variable X is

fX(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)),   x ∈ R,     (A.22)

which is depicted in Figure A.2.

[Figure A.2: The Gaussian probability density function with mean μ = 2 and variance σ² = 4.]
Here we have used the fact that if X = g(W), where g(·) is a deterministic one-to-one function (in our case g(w) = μ + σw) and where W is of density fW(·) (in our case (A.1)), then the density fX(·) of X is given by

fX(x) = { 0                                   if for no w̃ with fW(w̃) > 0 is x = g(w̃),
          fW(w̃) / |dg(w)/dw|_{w=w̃}           if w̃ satisfies x = g(w̃) and fW(w̃) > 0.     (A.23)
From the fact that Gaussian random variables are closed under deterministic affine transformations, it follows that if X ~ N(μ, σ²) with σ² > 0, then (X − μ)/σ is also a Gaussian random variable. Since it is of zero mean and of unit variance, it follows that it must be a standard Gaussian, because there is only one Gaussian distribution of zero mean and unit variance. We thus conclude:

X ~ N(μ, σ²)   ⟹   (X − μ)/σ ~ N(0, 1),   σ² > 0.     (A.24)

Recall that the cumulative distribution function (CDF) FX(·) of a random variable X is defined as

FX(x) ≜ Pr[X ≤ x]     (A.25)
      = ∫_{−∞}^{x} fX(ξ) dξ,   x ∈ R,     (A.26)

where the second equality holds if X has a density function fX(·). If W is a standard Gaussian, then its CDF is thus given by

FW(w) = ∫_{−∞}^{w} (1/√(2π)) e^(−ξ²/2) dξ,   w ∈ R.     (A.27)

¹ Some would say that the density of a deterministic random variable is given by a Dirac Delta, but we prefer not to use Dirac Deltas in this discussion as they are not functions.

There is, alas, no closed-form expression for this integral. To handle such
expressions we next introduce the Q-function.

A.3 The Q-Function

Definition A.6. The Q-function is defined by

Q(ξ) ≜ (1/√(2π)) ∫_{ξ}^{∞} e^(−τ²/2) dτ.     (A.28)

We thus see that Q(ξ) is the probability that a standard Gaussian random variable will exceed the value ξ. For a graphical interpretation of this integral see Figure A.3.
Note the relationship of the Q-function to the error function:

Q(ξ) = (1/2) [1 − erf(ξ/√2)].     (A.29)
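In code one typically evaluates the Q-function through the complementary error function, exactly as in (A.29); a small Python sketch (the function name is our own choice):

```python
import math

def Q(xi: float) -> float:
    """Tail probability of a standard Gaussian: Q(xi) = 0.5 * erfc(xi / sqrt(2))."""
    return 0.5 * math.erfc(xi / math.sqrt(2.0))

print(Q(0.0))   # 0.5, consistent with Q(0) = 1/2
print(Q(1.0))   # approximately 0.1587
assert abs(Q(1.0) + Q(-1.0) - 1.0) < 1e-12   # the symmetry Q(xi) + Q(-xi) = 1
```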

The Q-function is a well-tabulated function, and we are therefore usually happy when we can express answers to various questions using this function. The CDF of a standard Gaussian W can be expressed using the Q-function as follows:

FW(w) = Pr[W ≤ w]     (A.30)
      = 1 − Pr[W ≥ w]     (A.31)
      = 1 − Q(w).     (A.32)

Similarly, with the aid of the Q-function we can express the probability that a standard Gaussian random variable W will take a value in some given interval [a, b]:

Pr[a ≤ W ≤ b] = Pr[W ≥ a] − Pr[W ≥ b]     (A.33)
              = Q(a) − Q(b),   a ≤ b.     (A.34)

More generally, if X ~ N(μ, σ²) with σ > 0, then

Pr[a ≤ X ≤ b] = Pr[X ≥ a] − Pr[X ≥ b]     (A.35)
              = Pr[(X − μ)/σ ≥ (a − μ)/σ] − Pr[(X − μ)/σ ≥ (b − μ)/σ]     (A.36)
              = Q((a − μ)/σ) − Q((b − μ)/σ),   a ≤ b.     (A.37)

Here the last equality follows because (X − μ)/σ is a standard Gaussian random variable, see (A.24).

[Figure A.3: Graphical interpretation of the Q-function: Q(ξ) is the area under the standard Gaussian density to the right of ξ, and Q(−ξ) is the area to the right of −ξ.]
The Q-function is usually only tabulated for nonnegative arguments. The reason has to do with the symmetry of the standard Gaussian density (A.1). If W ~ N(0, 1) then, by the symmetry of its density,

Pr[W ≥ −ξ] = Pr[W ≤ ξ]     (A.38)
           = 1 − Pr[W > ξ]     (A.39)
           = 1 − Pr[W ≥ ξ];     (A.40)

see Figure A.3. Consequently,

Q(−ξ) + Q(ξ) = 1,   ξ ∈ R,     (A.41)

and it suffices to tabulate the Q-function for nonnegative arguments. Note also that from the above it follows that Q(0) = 1/2.
There is an alternative expression for the Q-function that can be quite useful. It is an integral expression with fixed integration limits:

Q(ξ) = (1/π) ∫_{0}^{π/2} e^(−ξ²/(2 sin²φ)) dφ,   ξ ≥ 0.     (A.42)

This expression can be derived as follows. One computes a two-dimensional integral in two different ways. Let X ~ N(0, 1) and Y ~ N(0, 1) be independent random variables. Consider the probability of the event X ≥ 0 and Y ≥ ξ, where ξ > 0. Since the two random variables are independent, it follows that

Pr[X ≥ 0 and Y ≥ ξ] = Pr[X ≥ 0] · Pr[Y ≥ ξ]     (A.43)
                    = (1/2) Q(ξ).     (A.44)

We shall now proceed to compute the left-hand side of the above in polar coordinates centered at the origin (see Figure A.4). This yields

(1/2) Q(ξ) = ∫_{0}^{∞} ∫_{ξ}^{∞} (1/(2π)) e^(−(x²+y²)/2) dy dx     (A.45)
           = ∫_{0}^{π/2} ∫_{ξ/sin φ}^{∞} (1/(2π)) e^(−r²/2) r dr dφ     (A.46)
           = (1/(2π)) ∫_{0}^{π/2} ∫_{ξ²/(2 sin²φ)}^{∞} e^(−t) dt dφ     (A.47)
           = (1/(2π)) ∫_{0}^{π/2} e^(−ξ²/(2 sin²φ)) dφ,     (A.48)

where we have performed the change of variable t = r²/2.

We collect some of the properties of the Q-function in the following proposition.

[Figure A.4: Computing (1/2)Q(ξ) by the use of polar coordinates: the area of integration is the region x ≥ 0, y ≥ ξ, whose boundary in polar coordinates is r sin φ = ξ.]

Proposition A.7. Let W ~ N(0, 1) and let X ~ N(μ, σ²) for σ > 0.
1. The CDF of W is given in terms of the Q-function as

   FW(w) = 1 − Q(w).     (A.49)

2. The probability of a half-ray can be expressed as

   Pr[X > ξ] = Q((ξ − μ)/σ).     (A.50)

3. The symmetry of the standard Gaussian distribution implies that it suffices to tabulate the Q-function for nonnegative arguments:

   Q(ξ) + Q(−ξ) = 1.     (A.51)

4. In particular, the probability that W is positive is given by

   Q(0) = 1/2.     (A.52)

We next describe various approximations for the Q-function. We are particularly interested in its value for large arguments². Since Q(ξ) is the probability that a standard Gaussian random variable W exceeds ξ, it follows that

lim_{ξ→∞} Q(ξ) = 0.     (A.53)

² This corresponds in many digital communication applications to low probabilities of bit errors.

Thus, large arguments to the Q-function correspond to small values of the Q-function. The following bound will justify the approximation

Q(ξ) ≈ e^(−ξ²/2),   ξ ≫ 1.     (A.54)

Proposition A.8. The Q-function can be bounded as follows:

(1/√(2πξ²)) e^(−ξ²/2) (1 − 1/ξ²) < Q(ξ) < (1/√(2πξ²)) e^(−ξ²/2),   ξ > 0,     (A.55)

(ξ/(1 + ξ²)) (1/√(2π)) e^(−ξ²/2) < Q(ξ),   ξ ∈ R,     (A.56)

and

Q(ξ) ≤ (1/2) e^(−ξ²/2),   ξ ≥ 0.     (A.57)

Moreover, the following bounds are less tight for |ξ| ≫ 1, but tighter around ξ = 0 (and also valid for negative ξ):

max{ 1 − (1/2) e^(√(2/π) ξ), 0 }  ≤  Q(ξ)  ≤  min{ (1/2) e^(−√(2/π) ξ), 1 },   ξ ∈ R.     (A.58)

The bounds of Proposition A.8 are depicted in Figure A.5.

[Figure A.5: Various upper and lower bounds on the Q-function according to Proposition A.8: Q(ξ) together with the upper and lower bounds in (A.55), the lower bound (A.56), the upper bound (A.57), and the upper and lower bounds in (A.58).]

Proof: Using the substitution t = x²/2, the upper bound in (A.55) can be derived as follows:

Q(ξ) = ∫_{ξ}^{∞} (1/√(2π)) e^(−x²/2) dx     (A.59)
     < ∫_{ξ}^{∞} (x/ξ) (1/√(2π)) e^(−x²/2) dx        (ξ > 0)     (A.60)
     = (1/(ξ√(2π))) ∫_{ξ²/2}^{∞} e^(−t) dt     (A.61)
     = (1/√(2πξ²)) e^(−ξ²/2).     (A.62)
For the lower bound (A.56), firstly define

φ(x) ≜ (1/√(2π)) e^(−x²/2)     (A.63)

and note that

d/dx φ(x) = −x (1/√(2π)) e^(−x²/2) = −x φ(x)     (A.64)

and therefore

d/dx (−φ(x)/x) = (−x (d/dx φ(x)) + φ(x)) / x²     (A.65)
               = (x² φ(x) + φ(x)) / x²     (A.66)
               = ((x² + 1)/x²) φ(x).     (A.67)

So, for ξ > 0, we get

(1 + 1/ξ²) Q(ξ) = ∫_{ξ}^{∞} (1 + 1/ξ²) (1/√(2π)) e^(−x²/2) dx     (A.68)
                > ∫_{ξ}^{∞} (1 + 1/x²) (1/√(2π)) e^(−x²/2) dx     (A.69)
                = ∫_{ξ}^{∞} ((x² + 1)/x²) φ(x) dx     (A.70)
                = [−φ(x)/x]_{ξ}^{∞}     (A.71)
                = φ(ξ)/ξ = (1/(ξ√(2π))) e^(−ξ²/2).     (A.72)

For ξ ≤ 0, we note that the left-hand side of (A.56) is nonpositive and therefore trivially a (alas quite loose) lower bound.
The lower bound in (A.55) is strictly smaller than the lower bound (A.56) and therefore implicitly satisfied.
The upper bound (A.57) follows by replacing the integrand in (A.42) with its maximal value, namely its value at φ = π/2.
The proof of (A.58) is omitted.


A.4 The Characteristic Function of a Gaussian

Definition A.9. The characteristic function ΦX(·) of a random variable X is defined for every ω ∈ R by

ΦX(ω) ≜ E[e^(iωX)] = ∫_{−∞}^{∞} fX(x) e^(iωx) dx,     (A.73)

where the second equality holds if X has a density.
The characteristic function is intimately related to the Fourier transform of the density function. Some things to recall about the characteristic function are given in the following proposition.
Proposition A.10. Let X be a random variable with characteristic function ΦX(·).
1. If E[X^n] < ∞ for some n ≥ 1, then the r-th order derivative ΦX^(r)(ω) exists for every r ≤ n and

   E[X^r] = ΦX^(r)(0) / i^r.     (A.74)

2. If X and Y are two random variables of identical characteristic functions, then they have the same distribution.
3. If X and Y are two independent random variables of characteristic functions ΦX(·) and ΦY(·) respectively, then the characteristic function ΦX+Y(·) of the sum X + Y is given by the product of the individual characteristic functions:

   ΦX+Y(ω) = ΦX(ω) · ΦY(ω),   X ⫫ Y.     (A.75)

If X ~ N(μ, σ²), then it can be verified by direct computation that

ΦX(ω) = e^(iμω − (1/2)σ²ω²).     (A.76)

Consequently, by repeatedly taking the derivatives of the above, we obtain from (A.74) the moments of a standard Gaussian distribution:

E[W^ν] = { 1 · 3 · · · (ν − 1)   if ν is even,
           0                     if ν is odd,       W ~ N(0, 1).     (A.77)

We mention here also that

E[|W|^ν] = { 1 · 3 · · · (ν − 1)                        if ν is even,
             √(2/π) · 2^((ν−1)/2) · ((ν−1)/2)!          if ν is odd,      W ~ N(0, 1).     (A.78)

Closely related to the characteristic function is the moment generating function MX(·) of a random variable X, which is defined as

MX(s) ≜ E[e^(sX)] = ∫_{−∞}^{∞} fX(x) e^(sx) dx     (A.79)

for those values of s for which the expectation is finite. Here the second equality holds if X has a density. The moment generating function is intimately related to the double-sided Laplace transform of the density function.
Using the characteristic function we can readily show the following.
Proposition A.11. If X ~ N(μx, σx²) and Y ~ N(μy, σy²) are two independent Gaussian random variables, then also their sum X + Y is Gaussian:

X + Y ~ N(μx + μy, σx² + σy²).     (A.80)

Proof: Since X ~ N(μx, σx²) and Y ~ N(μy, σy²), it follows from (A.76) that

ΦX(ω) = e^(iμx ω − (1/2)σx²ω²),     (A.81)
ΦY(ω) = e^(iμy ω − (1/2)σy²ω²).     (A.82)

Since the characteristic function of the sum of two independent random variables is equal to the product of the individual characteristic functions, we have

ΦX+Y(ω) = ΦX(ω) · ΦY(ω)     (A.83)
        = e^(iμx ω − (1/2)σx²ω²) · e^(iμy ω − (1/2)σy²ω²)     (A.84)
        = e^(i(μx+μy)ω − (1/2)(σx²+σy²)ω²).     (A.85)

But this is also the characteristic function of a N(μx + μy, σx² + σy²) random variable. Since two random variables can have the same characteristic function only if they have the same distribution, we conclude that X + Y ~ N(μx + μy, σx² + σy²).
Using the fact that a deterministic affine transformation of a Gaussian random variable results in a Gaussian random variable, together with the above proposition, we obtain the following.

Proposition A.12. If X1, . . . , Xn are independent Gaussian random variables, and α1, . . . , αn are deterministic real numbers, then the random variable

Y ≜ Σ_{ν=1}^{n} αν Xν     (A.86)

is a Gaussian random variable with mean

E[Y] = Σ_{ν=1}^{n} αν E[Xν]     (A.87)

and variance

Var(Y) = Σ_{ν=1}^{n} αν² Var(Xν).     (A.88)
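A quick numerical sanity check of Proposition A.12 with NumPy (our own simulation sketch; the chosen means, variances, and weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([1.0, -2.0, 0.5])
variances = np.array([4.0, 1.0, 9.0])
alphas = np.array([0.5, 2.0, -1.0])

# draw many independent samples of X_1, X_2, X_3 and form Y = sum_nu alpha_nu X_nu
X = rng.normal(means, np.sqrt(variances), size=(1_000_000, 3))
Y = X @ alphas

print(Y.mean(), alphas @ means)         # both approximately -3.5
print(Y.var(), alphas**2 @ variances)   # both approximately 14.0
```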

A.5 A Summary

Top things to remember about Gaussian random variables are as follows:

• There is only one Gaussian distribution of a given mean μ and variance σ². It is denoted N(μ, σ²) and it is either deterministic (if σ² = 0) or of density

  fX(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)),   x ∈ R.     (A.89)

• If X has a Gaussian distribution, then any deterministic affine transformation of X is also Gaussian. That is, if X is Gaussian and α, β ∈ R are deterministic, then

  αX + β     (A.90)

  is also Gaussian.

• If X ~ N(μ, σ²) and σ² > 0, then (X − μ)/σ has a standard Gaussian distribution, i.e.,

  (X − μ)/σ ~ N(0, 1),   σ² > 0.     (A.91)

• The Q-function is defined as

  Q(ξ) = (1/√(2π)) ∫_{ξ}^{∞} e^(−τ²/2) dτ.     (A.92)

  It can be used to compute the probability that a Gaussian random variable lies within an interval.

• The sum of any finite number of independent Gaussian random variables is also a Gaussian random variable. The sum's mean is the sum of the means, and the sum's variance is the sum of the variances.


Appendix B

Gaussian Vectors
Before we can introduce the multivariate Gaussian distribution¹ or Gaussian vectors, we need to make a short detour and discuss some special types of matrices, so-called positive semi-definite matrices, and random vectors in general.

B.1 Positive Semi-Definite Matrices

We will see that positive semi-definite matrices are a special kind of symmetric matrices. Therefore, before we state their definition, we quickly review some properties of symmetric matrices.
Recall that a real matrix A is said to be symmetric if

A = A^T.     (B.1)

Obviously, a symmetric matrix must be square. Next, we remind the reader that square matrices can be characterized using the concept of eigenvectors and eigenvalues. An eigenvector is a vector x with a special direction particular to the given square matrix such that when it is multiplied with the matrix, it again results in a vector of the same direction (but possibly of different length):

Ax = λx.     (B.2)

The scaling factor λ is called the eigenvalue.


We also remind the reader that even a real matrix can have complex eigenvalues.² For symmetric matrices, however, things are much nicer.

Lemma B.1. All eigenvalues of a (real) symmetric matrix are real (and therefore the corresponding eigenvectors can be chosen to be real, too). Moreover, the eigenvectors of a symmetric matrix that belong to different eigenvalues are always perpendicular.

¹ Multivariate distribution is just a fancy word for the joint distribution of several random variables. It contrasts with univariate distribution, which means the distribution of a single random variable.
² At least it is not difficult to show that the eigenvalues of real matrices always come in complex conjugate pairs.
Proof: Suppose that for a symmetric A, (B.2) holds, where for the moment we need to allow λ and x to be complex. Now multiply (B.2) by x† from the left:³

x†Ax = x†λx = λ x†x = λ ‖x‖².     (B.3)

Here ‖x‖² > 0 because an eigenvector cannot be the zero vector by definition. On the other hand, taking the Hermitian conjugate of (B.2) and multiplying both sides by x from the right yields

x†A†x = x†A^T x = x†Ax = (λx)†x = λ* x†x = λ* ‖x‖²,     (B.4)

where we have used that A is real and symmetric. Comparing (B.3) and (B.4) now shows that

λ = λ*,     (B.5)

i.e., λ ∈ R. This proves the first claim.


To prove the second claim, suppose that x1 and x2 are two eigenvectors of A belonging to two eigenvalues λ1 and λ2, respectively, where we assume that λ1 ≠ λ2. Since A and the λi are real, we can assume that the xi are also real. We now have the following:

λ1 x1^T x2 = (λ1 x1)^T x2     (B.6)
           = (Ax1)^T x2     (B.7)
           = x1^T A^T x2     (B.8)
           = x1^T A x2     (B.9)
           = x1^T (Ax2)     (B.10)
           = x1^T λ2 x2     (B.11)
           = λ2 x1^T x2,     (B.12)

where we again have used the symmetry A = A^T. Since λ1 ≠ λ2 by assumption, we must have that

x1^T x2 = 0,     (B.13)

proving that the eigenvectors are orthogonal.


Note that Lemma B.1 can be generalized further to prove that a real symmetric n×n matrix always has n eigenvectors that can be chosen to be real and orthogonal (even if several eigenvectors belong to the same eigenvalue). For more details see, e.g., [Str09].
We are now ready for the definition of positive (semi-)definite matrices.

³ Note that by the Hermitian conjugation x† we denote the transposed and complex-conjugated version of the vector x, i.e., x† = (x*)^T.


Definition B.2. We say that an n×n real matrix K is positive semi-definite and write

K ⪰ 0     (B.14)

if it is symmetric and if

α^T K α ≥ 0,   ∀ α ∈ R^n.     (B.15)

We say that a matrix K is positive definite and write

K ≻ 0     (B.16)

if it is symmetric and if

α^T K α > 0,   ∀ α ∈ R^n, α ≠ 0.     (B.17)

Positive semi-definite matrices have nice properties, as is shown in the following proposition.

Proposition B.3 (Properties of Positive Semi-Definite Matrices). A symmetric n×n real matrix K is positive semi-definite if, and only if, one (and hence all) of the following equivalent conditions holds:
1. All the eigenvalues of K are nonnegative.
2. The matrix K can be written in the form K = UΛU^T, where the matrix Λ is diagonal with nonnegative entries on the diagonal and where the matrix U is orthogonal, i.e., it satisfies UU^T = I_n.
3. The matrix K can be written in the form K = S^T S for some n×n matrix S.
4. One possible choice of S is

   S = Λ^(1/2) U^T.     (B.18)

The analogous proposition for positive definite matrices is as follows.

Proposition B.4 (Properties of Positive Definite Matrices). A symmetric n×n real matrix K is positive definite if, and only if, one (and hence all) of the following equivalent conditions holds:
1. All the eigenvalues of K are positive.
2. The matrix K can be written in the form K = UΛU^T, where the matrix Λ is diagonal with positive diagonal entries and where the matrix U is orthogonal, i.e., it satisfies UU^T = I_n.
3. The matrix K can be written in the form K = S^T S for some nonsingular n×n matrix S.
4. One possible choice of S is

   S = Λ^(1/2) U^T.     (B.19)

Proof: The second statement follows directly from the fact that positive (semi-)definite matrices are symmetric and hence have n real and orthogonal eigenvectors φ1, . . . , φn: we define

U ≜ (φ1 · · · φn)     (B.20)

and note that it is orthogonal:

UU^T = U^T U = I_n.     (B.21)

Further, we define the diagonal matrix

Λ ≜ diag(λ1, . . . , λn),     (B.22)

where the λi denote the n eigenvalues (possibly some of them are identical). Then we have

KU = K (φ1 · · · φn)     (B.23)
   = (Kφ1 · · · Kφn)     (B.24)
   = (λ1φ1 · · · λnφn)     (B.25)
   = (φ1 · · · φn) diag(λ1, . . . , λn)     (B.26)
   = UΛ,     (B.27)

and hence, using (B.21),

KUU^T = K = UΛU^T.     (B.28)

To prove the first statement, note that by definition (B.15), for the choice α = φℓ,

0 ≤ φℓ^T K φℓ = φℓ^T (Kφℓ) = φℓ^T (λℓ φℓ) = λℓ φℓ^T φℓ = λℓ.     (B.29)

For positive definite matrices the inequality is strict.
To prove the remaining two statements, note that because of the nonnegativity of the eigenvalues, we can define

Λ^(1/2) ≜ diag(√λ1, . . . , √λn)     (B.30)

and

S ≜ Λ^(1/2) U^T.     (B.31)

Note that for positive definite matrices all eigenvalues are strictly positive and therefore S is nonsingular. Then

S^T S = (Λ^(1/2) U^T)^T (Λ^(1/2) U^T)     (B.32)
      = U Λ^(1/2) Λ^(1/2) U^T     (B.33)
      = U Λ U^T     (B.34)
      = K.     (B.35)

Here the last equality follows from (B.28).
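These decompositions are easy to reproduce numerically; a small NumPy sketch (our own illustration) that checks K = UΛU^T and the choice S = Λ^(1/2)U^T for a positive (semi-)definite matrix:

```python
import numpy as np

K = np.array([[4.0, 2.0],
              [2.0, 3.0]])                 # symmetric and positive definite

eigenvalues, U = np.linalg.eigh(K)         # columns of U are orthonormal eigenvectors
assert np.all(eigenvalues >= 0)            # Proposition B.3, condition 1

Lambda = np.diag(eigenvalues)
S = np.sqrt(Lambda) @ U.T                  # one possible choice of S, cf. (B.18)

assert np.allclose(U @ Lambda @ U.T, K)    # eigendecomposition, cf. (B.28)
assert np.allclose(S.T @ S, K)             # K = S^T S, condition 3
```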


As a consequence of Propositions B.3 and B.4 (and of Definition B.2) we have the following two corollaries.

Corollary B.5. If K is a real n×n positive semi-definite matrix and if α ∈ R^n is arbitrary, then α^T K α = 0 if, and only if, Kα = 0.

Proof: One direction is trivial and does not require that K be positive semi-definite: If Kα = 0, then α^T K α must also be equal to zero. Indeed, in this case we have by the associativity of matrix multiplication α^T K α = α^T (Kα) = α^T 0 = 0.
As to the other direction we note that since K ⪰ 0 it follows that there exists some n×n matrix S such that K = S^T S. Hence,

α^T K α = α^T S^T S α     (B.36)
        = (Sα)^T (Sα)     (B.37)
        = ‖Sα‖².     (B.38)

Consequently, α^T K α = 0 implies that Sα = 0, which implies that S^T S α = 0, i.e., that Kα = 0.

Corollary B.6. If K is a real n×n positive definite matrix, then α^T K α = 0 if, and only if, α = 0.

Proof: This follows immediately from Definition B.2.

B.2 Random Vectors and Covariance Matrices

A random vector is a quite straightforward generalization of a random variable.

Definition B.7. An n-dimensional random vector (also called a random n-vector) X is a (measurable) mapping from the set of experiment outcomes Ω to the n-dimensional Euclidean space R^n:

X : Ω → R^n,   ω ↦ x.     (B.39)

Thus, a random vector X is very much like a random variable, except that rather than taking value in the real line R, it takes value in R^n.
The density of a random vector is the joint density of its components. The density of a random n-vector is thus a real-valued function from R^n to the nonnegative reals that integrates (over R^n) to one.

Definition B.8. The expectation E[X] of a random n-vector

X = (X1, . . . , Xn)^T     (B.40)

is a vector whose components are the expectations of the corresponding elements of X:

E[X] ≜ (E[X1], . . . , E[Xn])^T.     (B.41)

Hence, we see that the expectation of a random vector is only defined if the expectations of all of the components of the vector are defined. The j-th element of E[X] is thus the expectation of the j-th component of X, namely E[Xj]. Similarly, the expectation of a random matrix is the matrix of expectations.
If all the components of a random n-vector X are of finite variance, then we can define its n×n covariance matrix K_XX as follows.

Definition B.9. The n×n covariance matrix K_XX of X is defined as

K_XX ≜ E[(X − E[X])(X − E[X])^T].     (B.42)

That is,

K_XX = E[ (X1 − E[X1], . . . , Xn − E[Xn])^T (X1 − E[X1], . . . , Xn − E[Xn]) ]     (B.43)

     = [ Var(X1)       Cov[X1, X2]   · · ·   Cov[X1, Xn]
         Cov[X2, X1]   Var(X2)       · · ·   Cov[X2, Xn]
         · · ·                                · · ·
         Cov[Xn, X1]   Cov[Xn, X2]   · · ·   Var(Xn)     ].     (B.44)

If n = 1, so that the n-dimensional random vector X is actually a scalar, then the covariance matrix K_XX is a 1×1 matrix whose sole component is the variance of the sole component of X.
Note that given the n×n covariance matrix K_XX of a random n-vector X, it is easy to compute the covariance matrix of a subset of its components. For example, if we are only interested in the 2×2 covariance matrix of (X1, X2)^T, all we have to do is pick the first two columns and two rows of K_XX. More generally, the r×r covariance matrix of (Xj1, Xj2, . . . , Xjr)^T for 1 ≤ j1 < j2 < · · · < jr ≤ n is obtained from K_XX by picking rows and columns j1, . . . , jr. For example,⁴ if

K_XX = [ 30  31   9   7
         31  39  11  13
          9  11   9  12
          7  13  12  26 ],     (B.45)

⁴ The alert reader might notice that this matrix is positive semi-definite: It can be written as S^T S where

S = [ 1  2  2  5
      2  3  2  1
      3  1  1  0 ].

This is no accident, as we will see soon.

then the covariance matrix of (X2, X4)^T is

[ 39  13
  13  26 ].     (B.46)

Remark B.10. At this point we would like to remind the reader that matrix multiplication is a linear transformation and therefore commutes with the expectation operation. Thus if H is a random n×m matrix and A is a deterministic ℓ×n matrix, then

E[AH] = A E[H],     (B.47)

and similarly if B is a deterministic m×ℓ matrix, then

E[HB] = E[H] B.     (B.48)

Also the transpose operation commutes with expectation: If H is a random matrix, then

E[H^T] = (E[H])^T.     (B.49)

We next prove the following important lemma.

Lemma B.11. Let X be a random n-vector with covariance matrix K_XX and let A be some (deterministic) m×n matrix. Define the random m-vector Y as

Y ≜ AX.     (B.50)

Then the m×m covariance matrix K_YY is given by

K_YY = A K_XX A^T.     (B.51)

Proof: This follows from

K_YY ≜ E[(Y − E[Y])(Y − E[Y])^T]     (B.52)
     = E[(AX − E[AX])(AX − E[AX])^T]     (B.53)
     = E[A(X − E[X]) (A(X − E[X]))^T]     (B.54)
     = E[A(X − E[X])(X − E[X])^T A^T]     (B.55)
     = A E[(X − E[X])(X − E[X])^T A^T]     (B.56)
     = A E[(X − E[X])(X − E[X])^T] A^T     (B.57)
     = A K_XX A^T.     (B.58)
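Lemma B.11 is easy to confirm numerically; a short NumPy sketch (our own illustration) comparing the sample covariance of Y = AX with A K_XX A^T:

```python
import numpy as np

rng = np.random.default_rng(1)
K_XX = np.array([[4.0, 1.0, 0.5],
                 [1.0, 3.0, 0.2],
                 [0.5, 0.2, 2.0]])
A = np.array([[1.0, -2.0, 0.0],
              [0.5,  1.0, 3.0]])

# draw samples of X with covariance K_XX (via its Cholesky factor L, with L L^T = K_XX)
L = np.linalg.cholesky(K_XX)
X = rng.standard_normal((500_000, 3)) @ L.T
Y = X @ A.T

print(np.cov(Y, rowvar=False))   # sample estimate of K_YY
print(A @ K_XX @ A.T)            # Lemma B.11: K_YY = A K_XX A^T
```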

One of the most important properties of covariance matrices is that they are all positive semi-definite.

Theorem B.12. Covariance matrices satisfy the following three properties.
1. All covariance matrices are positive semi-definite.
2. Any positive semi-definite matrix is the covariance matrix of some random vector.⁵
3. Covariance matrices are symmetric.

Proof: Part 3 follows both from the definition of covariance matrices and from Part 1, once Part 1 is proven.
To prove Part 1, let X be a random n-vector with covariance matrix K_XX and define Y ≜ α^T X for an arbitrary deterministic vector α. Then

0 ≤ Var(Y) = E[(Y − E[Y])²] = K_YY = α^T K_XX α,     (B.59)

where the last equality follows from (B.51). Since this holds for arbitrary α, it follows by definition that K_XX ⪰ 0.
To prove Part 2, let K ⪰ 0. Then we know that there exists a matrix S such that K = S^T S. Now let W be a random n-vector with independent components of variance 1, i.e., K_WW = I_n, and define X ≜ S^T W. By (B.51) we then have

K_XX = S^T I_n S = S^T S = K.     (B.60)

Proposition B.13. Let Z be a random n-vector of zero mean and of covariance matrix K_ZZ. Then one of the components of Z is a linear combination of the other components with probability 1 if, and only if, K_ZZ is singular.

Proof: One of the components of Z is a linear combination of the other components with probability 1 if, and only if, there exists some nonzero vector α ∈ R^n such that α^T Z is zero with probability 1. This, on the other hand, is equivalent to α^T Z having zero variance. By Lemma B.11 this variance is α^T K_ZZ α, which by Corollary B.5 is zero if, and only if, K_ZZ α = 0, i.e., K_ZZ is singular.

The above proposition can be extended to take account of the rank of K_ZZ.

Proposition B.14. Let Z be a zero-mean random vector of covariance matrix K_ZZ. The columns 1, . . . , r of the n×n covariance matrix K_ZZ are linearly independent and span all the columns of K_ZZ if, and only if, the random r-vector (Z1, . . . , Zr)^T has a nonsingular covariance matrix and with probability 1 each component of Z can be written as a linear combination of Z1, . . . , Zr.
Example B.15. For example, suppose that

K_ZZ = [ 1  1  1 ] [ 1  1  1 ]   [  3   5   7 ]
       [ 1  2  2 ] [ 1  2  3 ] = [  5   9  13 ].     (B.61)
       [ 1  3  3 ] [ 1  2  3 ]   [  7  13  19 ]

We note that the three columns of K_ZZ satisfy the linear relationship

     [ 3 ]       [  5 ]       [  7 ]
(−1)·[ 5 ] + 2 · [  9 ] − 1 · [ 13 ] = 0.     (B.62)
     [ 7 ]       [ 13 ]       [ 19 ]

This linear relationship implies the linear relationship

−Z1 + 2Z2 − Z3 = 0     (B.63)

with probability 1.

⁵ Actually, it is always possible to choose this random vector to be jointly Gaussian. See Definition B.19 of Gaussian random vectors below.

B.3 The Characteristic Function

Definition B.16. Let X be a random n-vector taking value in R^n. Then its characteristic function ΦX(·) is a mapping from R^n to C. It maps a vector ω = (ω1, . . . , ωn)^T to ΦX(ω), where

ΦX(ω) ≜ E[e^(iω^T X)]     (B.64)
      = E[e^(i Σ_{ν=1}^{n} ων Xν)],   ω ∈ R^n.     (B.65)

If X has the density fX(·), then

ΦX(ω) = ∫ · · · ∫ fX(x) e^(i Σ_{ν=1}^{n} ων xν) dx1 · · · dxn.     (B.66)

Note that (B.66) is a variation of the multi-dimensional Fourier transform of fX(·). It is thus not surprising that two random n-vectors X, Y have the same distribution if, and only if, they have identical characteristic functions:

X and Y have the same law   ⟺   ΦX(ω) = ΦY(ω),   ∀ ω ∈ R^n.     (B.67)

From this one can show the following lemma.
From this one can show the following lemma.


Lemma B.17. Two random variables X and Y are independent if, and only if,

E[e^{i(ϖ1 X + ϖ2 Y)}] = E[e^{i ϖ1 X}] E[e^{i ϖ2 Y}],   ∀ ϖ1, ϖ2 ∈ R.   (B.68)

Proof: One direction is straightforward: If X and Y are independent, then so are e^{i ϖ1 X} and e^{i ϖ2 Y}, so that the expectation of their product is the product of their expectations:

E[e^{i(ϖ1 X + ϖ2 Y)}] = E[e^{i ϖ1 X} e^{i ϖ2 Y}]   (B.69)
                      = E[e^{i ϖ1 X}] E[e^{i ϖ2 Y}],   ϖ1, ϖ2 ∈ R.   (B.70)


As to the other direction, suppose that X′ has the same law as X, that Y′ has the same law as Y, and that X′ and Y′ are independent. Since X′ has the same law as X, it follows that

E[e^{i ϖ1 X′}] = E[e^{i ϖ1 X}],   ϖ1 ∈ R,   (B.71)

and similarly for Y′:

E[e^{i ϖ2 Y′}] = E[e^{i ϖ2 Y}],   ϖ2 ∈ R.   (B.72)

Consequently, since X′ and Y′ are independent,

E[e^{i(ϖ1 X′ + ϖ2 Y′)}] = E[e^{i ϖ1 X′}] E[e^{i ϖ2 Y′}]   (B.73)
                        = E[e^{i ϖ1 X}] E[e^{i ϖ2 Y}],   ϖ1, ϖ2 ∈ R,   (B.74)

where the second equality follows from (B.71) and (B.72).

We thus see that if (B.68) holds, then the characteristic function of the vector (X, Y)ᵀ is identical to the characteristic function of the vector (X′, Y′)ᵀ. Consequently, the joint distribution of (X, Y) must be the same as the joint distribution of (X′, Y′). But according to the latter distribution the two components (X′ and Y′) are independent, hence the same must be true for the joint distribution of the two components of the former. Thus, X and Y must be independent.

B.4 A Standard Gaussian Vector

We are now ready to introduce Gaussian random vectors. Similarly to the Gaussian random variable situation, we start with the standard Gaussian vector.

Definition B.18. We shall say that a random n-vector W is an n-dimensional standard Gaussian vector if its n components are independent scalar standard Gaussians. In that case its density f_W(·) is given by

f_W(w) = ∏_{ℓ=1}^n (1/√(2π)) e^{−w_ℓ²/2}   (B.75)
       = (1/(2π)^{n/2}) e^{−(1/2) Σ_{ℓ=1}^n w_ℓ²}   (B.76)
       = (2π)^{−n/2} e^{−‖w‖²/2},   w ∈ Rⁿ.   (B.77)

Here we use w to denote the argument of the density function f_W(·) of W, and we denote by w1, . . . , wn the n components of w.

Note that the definition of standard Gaussian random vectors is an extension of the standard scalar Gaussian distribution. The sole component of a standard one-dimensional Gaussian vector is a scalar of a N(0, 1) distribution. Conversely, every scalar that has a N(0, 1) distribution can be viewed as a one-dimensional random vector of the standard Gaussian distribution.


A standard Gaussian random n-vector W has the following mean and covariance matrix:

E[W] = 0   and   K_WW = I_n.   (B.78)

This follows because the components of W are all of zero mean; they are independent (and hence uncorrelated); and of unit variance.

The characteristic function of a standard Gaussian n-vector W can be easily computed from the characteristic function of a N(0, 1) scalar random variable, because the components of a standard Gaussian vector are IID N(0, 1):

Φ_W(ϖ) = E[e^{i ϖᵀW}]   (B.79)
       = E[ ∏_{ℓ=1}^n e^{i ϖ_ℓ W_ℓ} ]   (B.80)
       = ∏_{ℓ=1}^n E[e^{i ϖ_ℓ W_ℓ}]   (B.81)
       = ∏_{ℓ=1}^n e^{−ϖ_ℓ²/2}   (B.82)
       = e^{−‖ϖ‖²/2},   ϖ ∈ Rⁿ.   (B.83)

Here in (B.82) we have used the characteristic function of a standard Gaussian RV.

B.5 Gaussian Vectors

Definition B.19. A random n-vector X is said to be a Gaussian vector if for some positive integer m there exist a deterministic n×m matrix A, a standard Gaussian random m-vector W, and a deterministic n-vector b such that

X = AW + b.   (B.84)

A Gaussian vector is said to be a centered Gaussian vector if b = 0, i.e., if we can write

X = AW.   (B.85)

Note that any scalar N(μ, σ²) random variable, when viewed as a one-dimensional random vector, is a Gaussian vector. Indeed, it can be written as σW + μ. Somewhat less obvious is the fact that the (sole) component of any one-dimensional Gaussian vector has a scalar Gaussian distribution. To see that, note that if X is a one-dimensional Gaussian vector, then X = aᵀW + b, where a is an m-vector, W is a standard Gaussian m-vector, and b is a deterministic scalar. Thus,

X = Σ_{ℓ=1}^m a_ℓ W_ℓ + b,   (B.86)

and the result follows because we have seen that the sum of independent Gaussian scalars is a Gaussian scalar (Proposition A.12).
The following are immediate consequences of the definition:

• Any standard Gaussian n-vector is a Gaussian vector.
  Choose in the above definition m = n; A = I_n; and b = 0.

• Any deterministic vector is Gaussian.
  Choose the matrix A as the all-zero matrix.

• If the components of X are independent scalar Gaussians (not necessarily of equal variance), then X is a Gaussian vector.
  Choose A to be an appropriate diagonal matrix.

• If X1 = (X_{1,1}, . . . , X_{1,n1})ᵀ (B.87) is a Gaussian n1-vector and X2 = (X_{2,1}, . . . , X_{2,n2})ᵀ (B.88) is a Gaussian n2-vector, which is independent of X1, then the random (n1 + n2)-vector

  (X_{1,1}, . . . , X_{1,n1}, X_{2,1}, . . . , X_{2,n2})ᵀ   (B.89)

  is Gaussian.
  Suppose that the pair (A1, b1) represents X1, where A1 is of size n1×m1 and b1 ∈ R^{n1}. Similarly let the pair (A2, b2) represent X2, where A2 is of size n2×m2 and b2 ∈ R^{n2}. Then the vector (B.89) can be represented using the (n1 + n2) × (m1 + m2) block-diagonal matrix A of diagonal components A1 and A2, and using the vector b ∈ R^{n1+n2} that results when the vector b1 is stacked on top of the vector b2.

• Assume that X is a Gaussian n-vector. If C is an arbitrary deterministic ℓ×n matrix and d is a deterministic ℓ-vector, then the random ℓ-vector CX + d is a Gaussian ℓ-vector.

  Indeed, if X = AW + b, where A is a deterministic n×m matrix, W is a standard Gaussian m-vector, and b is a deterministic n-vector, then

  CX + d = C(AW + b) + d   (B.90)
         = (CA)W + (Cb + d),   (B.91)

  which demonstrates that CX + d is Gaussian, because (CA) is a deterministic ℓ×m matrix, W is a standard Gaussian m-vector, and Cb + d is a deterministic ℓ-vector.

• If the Gaussian vector X has the representation (B.84), then

  E[X] = b   and   K_XX = AAᵀ.   (B.92)

  The behavior of the mean follows from (B.47) by viewing W as a random matrix and noting that its expectation is zero. The covariance follows from Lemma B.11 and (B.78).
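As a quick numerical sanity check of (B.92), the following sketch (Python/NumPy, illustrative only; the particular A and b are arbitrary choices) draws samples of X = AW + b and compares the empirical mean and covariance with b and AAᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 1.0]])      # n = 2, m = 3
b = np.array([3.0, -1.0])

n_samples = 100_000
W = rng.standard_normal((n_samples, A.shape[1]))   # rows are standard Gaussian m-vectors
X = W @ A.T + b                                    # rows are realizations of A W + b

print("empirical mean      :", np.round(X.mean(axis=0), 2))   # ~ b
print("empirical covariance:\n", np.round(np.cov(X, rowvar=False), 2))
print("A A^T               :\n", A @ A.T)
```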

By (B.92) we conclude that an alternative definition for a Gaussian random vector is to say that X is a Gaussian random vector if X − E[X] is a zero-mean or centered Gaussian vector. We shall thus focus our attention now on centered Gaussian vectors. The results we obtain will be easily extendable to noncentered Gaussian vectors.

Specializing (B.91) to the zero-mean case we obtain that if X is a centered Gaussian vector, then CX is also a centered Gaussian vector. This has the following consequences:
• If X is a centered Gaussian n-vector, then any subset of its components is also a centered Gaussian vector.
  The vector (X_{j1}, . . . , X_{jp})ᵀ, where 1 ≤ j1, . . . , jp ≤ n, results from applying C to X, where C is a p×n matrix of entries

  (C)_{k,ℓ} = I{ℓ = j_k}.   (B.93)

  For example,

  [X3; X1] = [0 0 1; 1 0 0] [X1; X2; X3].   (B.94)

• Similarly, if X is centered Gaussian, then any permutation of its elements is centered Gaussian.
  The proof is identical. For example,

  [X3; X1; X2] = [0 0 1; 1 0 0; 0 1 0] [X1; X2; X3].   (B.95)


By (B.92) we see that in the representation (B.84) the vector b is fully determined by the mean of the random vector. In fact, the two are the same. The matrix A, however, is not determined by the covariance matrix K_XX of the random vector. Indeed, K_XX may be written in many different ways as the product of a matrix by its transpose. One would therefore be mistakenly tempted to think that there are many different Gaussian distributions of a given mean and covariance matrix, but this is not the case. We shall see that there is only one.

B.6 The Mean and Covariance Determine the Law of a Gaussian

One way to see that the mean and covariance matrix of a Gaussian vector fully determine its law is via the characteristic function: If X = AW, where A is an n×m deterministic matrix and the m-vector W is a standard Gaussian vector, then the characteristic function Φ_X(·) of X can be calculated as follows:

Φ_X(ϖ) = E[e^{i ϖᵀX}]   (B.96)
       = E[e^{i ϖᵀAW}]   (B.97)
       = E[e^{i (Aᵀϖ)ᵀW}]   (B.98)
       = e^{−(1/2) ‖Aᵀϖ‖²}   (B.99)
       = e^{−(1/2) (Aᵀϖ)ᵀ(Aᵀϖ)}   (B.100)
       = e^{−(1/2) ϖᵀAAᵀϖ}   (B.101)
       = e^{−(1/2) ϖᵀK_XX ϖ},   ϖ ∈ Rⁿ.   (B.102)

The characteristic function of a centered Gaussian vector is thus fully determined by its covariance matrix. Since the characteristic function of a random vector fully specifies its law (see (B.67)), we conclude that any two centered Gaussians of equal covariance matrices must have identical laws, thus proving the following theorem.

Theorem B.20. There is only one multivariate centered Gaussian distribution of a given covariance matrix. Accounting for the mean, we have that there is only one multivariate Gaussian distribution of a given mean vector and a given covariance matrix.

We denote the multivariate Gaussian distribution of mean μ and covariance matrix K by N(μ, K).

This theorem has extremely important consequences. One of them has to do with the properties of independence and uncorrelatedness. Recall that any two independent random variables are also uncorrelated. The reverse is, in general, not true. It is, however, true for jointly Gaussian random variables. If X and Y are jointly Gaussian, i.e., (X, Y)ᵀ is a Gaussian 2-vector, then X


and Y are independent if, and only if, they are uncorrelated. More generally we have the following corollary.

Corollary B.21. Let X be a centered Gaussian (n1 + n2)-vector. Let X1 = (X1, . . . , X_{n1})ᵀ correspond to its first n1 components, and let X2 = (X_{n1+1}, . . . , X_{n1+n2})ᵀ correspond to the rest of its components.⁶ Then the vectors X1 and X2 are independent if, and only if, they are uncorrelated, i.e., if, and only if,

E[X1 X2ᵀ] = 0.   (B.103)

Proof: The easy direction (and the direction that has nothing to do with Gaussianity) is to show that if X1 and X2 are independent, then (B.103) holds. Indeed, by the independence and the fact that the vectors are of zero mean, we have

E[X1 X2ᵀ] = E[X1] E[X2ᵀ]   (B.104)
          = E[X1] (E[X2])ᵀ   (B.105)
          = 0 0ᵀ   (B.106)
          = 0.   (B.107)

We now prove the reverse using the Gaussianity. Let X be the (n1 + n2)-vector that results from stacking X1 on top of X2. Then

K_XX = [E[X1 X1ᵀ]  E[X1 X2ᵀ];  E[X2 X1ᵀ]  E[X2 X2ᵀ]]   (B.108)
     = [K_X1X1  0;  0  K_X2X2],   (B.109)

where we have used (B.103).


Let X1′ and X2′ be independent vectors such that the law of X1′ is identical to the law of X1 and such that the law of X2′ is identical to the law of X2. Let X′ be the (n1 + n2)-vector that results from stacking X1′ on top of X2′. Note that X′ is a centered Gaussian vector. This can be seen because, being the components of a Gaussian vector, both X1′ and X2′ must be Gaussian, and since they are independent, the vector that is formed by stacking them on top of each other must also be Gaussian. Since X1′ and X2′ are zero-mean and independent, and since the law (and hence covariance matrix) of X1′ is the same as that of X1 and similarly for X2′, we have that the covariance matrix of the vector X′ is given by

K_X′X′ = [K_X1′X1′  0;  0  K_X2′X2′]   (B.110)
       = [K_X1X1  0;  0  K_X2X2].   (B.111)

But this is identical to the covariance matrix (B.109) of the zero-mean Gaussian X. Consequently, since zero-mean Gaussian vectors of the same covariance matrix must have the same law, it follows that the law of X is identical to the law of X′. But under the law of X′ the first n1 components are independent of the last n2 components, so that the same must be true of X.

⁶ Note that we are slightly overstretching our notation here: X1 denotes the first component of X, while X1 denotes a vector (whose first component by definition also happens to be X1).
A variation on the same theme:
Corollary B.22. If the components of the Gaussian vector X are uncorrelated, so that the matrix K_XX is diagonal, then the components of X are independent.

Proof: This follows by a repeated application of the previous corollary.

Another consequence of the fact that there is only one multivariate Gaussian distribution with a given mean and covariance is the following corollary.

Corollary B.23. If the components of a Gaussian vector are independent in pairs, then they are independent. (Recall that in general X, Y, Z can be such that X, Y are independent; Y, Z are independent; X, Z are independent; and yet X, Y, Z are not independent. This cannot happen if X, Y, Z are the components of a Gaussian vector.)

Proof: To see this, consider the covariance matrix. If the components are independent in pairs, the covariance matrix must be diagonal. But this is also the structure of the covariance matrix of a Gaussian random vector whose components are independent. Consequently, since the covariance matrix of a Gaussian vector fully determines the vector's law, it follows that the two vectors have the same law, and the result follows.
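To see why Gaussianity matters here, it is instructive to contrast a jointly Gaussian pair with a pair that is merely uncorrelated. The sketch below (Python/NumPy, illustrative; the construction Y = S·X with a random sign S is a standard counterexample, not taken from the text) produces a pair that is uncorrelated yet clearly dependent, while for a jointly Gaussian pair with diagonal covariance the components are also independent.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Counterexample: X standard Gaussian, Y = S*X with an independent random sign S.
# X and Y are uncorrelated, but |Y| = |X|, so they are clearly dependent.
X = rng.standard_normal(n)
S = rng.choice([-1.0, 1.0], size=n)
Y = S * X
print("corr(X, Y)     =", round(np.corrcoef(X, Y)[0, 1], 3))                    # ~ 0
print("corr(|X|, |Y|) =", round(np.corrcoef(np.abs(X), np.abs(Y))[0, 1], 3))    # = 1

# Jointly Gaussian with diagonal covariance: uncorrelated and independent.
G = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=n)
print("corr(G1, G2)   =", round(np.corrcoef(G[:, 0], G[:, 1])[0, 1], 3))        # ~ 0
```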
The following corollary is extremely useful in the study of signal detection.

Corollary B.24. Suppose that Z is a Gaussian vector of covariance matrix σ²I_n, where I_n is the n×n identity matrix. This is just another way of saying that Z has n components that are IID, each distributed N(0, σ²). Let {φ_ℓ}_{ℓ=1}^n be n orthonormal deterministic vectors, so that

φ_ℓᵀ φ_{ℓ′} = { 1 if ℓ = ℓ′,  0 if ℓ ≠ ℓ′. }   (B.112)

Then the random variables ⟨Z, φ_1⟩, . . . , ⟨Z, φ_n⟩ are IID N(0, σ²).

Proof: Let Z̃ denote the vector

Z̃ ≜ (⟨Z, φ_1⟩, . . . , ⟨Z, φ_n⟩)ᵀ.   (B.113)

Note that Z̃ can be represented as

Z̃ = [φ_1ᵀ; . . . ; φ_nᵀ] Z,   (B.114)

thus demonstrating that Z̃ is a Gaussian vector. Its covariance matrix can be computed according to Lemma B.11 to yield

K_Z̃Z̃ = [φ_1ᵀ; . . . ; φ_nᵀ] E[ZZᵀ] [φ_1, . . . , φ_n]   (B.115)
      = [φ_1ᵀ; . . . ; φ_nᵀ] σ²I_n [φ_1, . . . , φ_n]   (B.116)
      = σ² [φ_1ᵀ; . . . ; φ_nᵀ] [φ_1, . . . , φ_n]   (B.117)
      = σ² I_n.   (B.118)

Here the last equality follows by (B.112).
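Corollary B.24 is easy to check by simulation: project IID N(0, σ²) vectors onto an orthonormal basis and verify that the resulting coefficients are again IID N(0, σ²). The following sketch (Python/NumPy, illustrative; the orthonormal basis is obtained from a QR decomposition, which is one convenient choice) does exactly that.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, n_samples = 4, 2.5, 100_000

# A random orthonormal basis: the columns of Q from a QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Z has IID N(0, sigma2) components; each row of Z is one realization.
Z = np.sqrt(sigma2) * rng.standard_normal((n_samples, n))

# Coefficients <Z, phi_l> for all l at once.
coeffs = Z @ Q

print("empirical covariance of the coefficients:")
print(np.round(np.cov(coeffs, rowvar=False), 2))   # ~ sigma2 * I_n
```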


This same corollary can also be stated a little differently as follows.

Corollary B.25. If W is a standard Gaussian n-vector and U is orthogonal, i.e.,

UUᵀ = UᵀU = I_n,   (B.119)

then UW is also a standard Gaussian vector.

Proof: By the definition of a Gaussian vector we conclude that UW is Gaussian. Its covariance matrix is

U K_WW Uᵀ = U I_n Uᵀ = UUᵀ = I_n.   (B.120)

Thus, UW has the same mean (zero) and the same covariance matrix (I_n) as a standard Gaussian. Since the mean and covariance of a Gaussian fully specify its law, UW must be standard.
The next lemma shows that in the definition of Gaussian vectors X = AW + b we can restrict ourselves to square n×n matrices A.

Lemma B.26. If X is an n-dimensional centered Gaussian vector, then there exists a square deterministic n×n matrix A such that the law of X is the same as the law of AW, where W is an n-dimensional standard Gaussian.

Proof: Let K_XX denote the n×n covariance matrix of X. Being a covariance matrix, it must be positive semi-definite, and therefore there exists some n×n matrix S such that K_XX = SᵀS. Consider now the random vector SᵀW, where W is a standard n-dimensional Gaussian. Being of the form of a matrix multiplying a Gaussian vector, SᵀW is a Gaussian vector, and since W is centered, so is SᵀW. By Lemma B.11 and by the fact that the covariance matrix of W is the identity, it follows that the covariance matrix of SᵀW is SᵀS, i.e., K_XX. Thus X and SᵀW are centered Gaussians of the same covariance, so that they must be of the same law. The law of X is thus the same as the law of the product of a square matrix by a standard Gaussian.

This idea will be extended further in the following section.

B.7 Canonical Representation of Centered Gaussian Vectors

As we have seen, the representation of a centered Gaussian vector as the result of the multiplication of a deterministic matrix by a standard Gaussian vector is not unique. Indeed, whenever the n×m deterministic matrix A satisfies AAᵀ = K, it follows that AW has a N(0, K) distribution. (Here W is a standard m-vector.) This follows because AW is a random vector of covariance matrix AAᵀ; it is, by the definition of the multivariate centered Gaussian distribution, a centered Gaussian; and all centered Gaussians of a given covariance matrix have the same law. In this section we shall focus on a particular choice of the matrix A that is particularly useful in the analysis of Gaussians. In this representation A is a square matrix and can be written as the product of an orthogonal matrix by a diagonal matrix. The diagonal matrix acts on W by stretching and shrinking its components, and the orthogonal matrix then rotates the result.
Theorem B.27. Let X be a centered Gaussian n-dimensional vector of covariance matrix K_XX. Then X has the same law as the centered Gaussian

U Λ^{1/2} W,   (B.121)

where W is a standard n-dimensional Gaussian vector, the columns of the orthogonal matrix U are orthonormal eigenvectors of the covariance matrix K_XX, and the matrix Λ is a diagonal matrix whose diagonal entries correspond to the eigenvalues of K_XX, with the ℓth diagonal element corresponding to the eigenvector in the ℓth column of U.

Remark B.28. Because Λ is diagonal, we can explicitly write Λ^{1/2}W as

Λ^{1/2} W = (√λ1 W1, . . . , √λn Wn)ᵀ.   (B.122)

Thus, in the above theorem the random vector Λ^{1/2}W has independent components, with the ℓth component being N(0, λ_ℓ) distributed. This demonstrates that a centered Gaussian vector is no other than the rotation (multiplication


by an orthogonal matrix) of a random vector whose components are independent centered univariate Gaussians of different variances.
Proof of Theorem B.27: The matrix K_XX is positive semi-definite, so that it can be written as K_XX = SᵀS, where S = Λ^{1/2}Uᵀ with K_XX U = UΛ; U is orthogonal; and Λ is a diagonal matrix whose diagonal elements are the (nonnegative) eigenvalues of K_XX. If W is a standard Gaussian, then SᵀW is of zero mean and covariance SᵀS, i.e., the same covariance as X. Since there is only one centered multivariate Gaussian distribution of a given covariance matrix, it follows that the law of SᵀW is the same as the law of X.
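In NumPy the canonical factor U Λ^{1/2} is obtained directly from an eigendecomposition. The sketch below (Python/NumPy, illustrative and not from the text; the covariance matrix K is an arbitrary example) builds U and Λ, generates samples as U Λ^{1/2} W, and checks the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(4)

K = np.array([[2.5, 1.5],
              [1.5, 2.5]])

# Eigendecomposition K U = U Lambda (eigh: symmetric input, real output).
lam, U = np.linalg.eigh(K)
A = U @ np.diag(np.sqrt(lam))        # A = U Lambda^{1/2}

n_samples = 100_000
W = rng.standard_normal((n_samples, 2))
X = W @ A.T                          # rows are realizations of U Lambda^{1/2} W

print("eigenvalues  :", np.round(lam, 3))                           # 1.0 and 4.0
print("empirical cov:\n", np.round(np.cov(X, rowvar=False), 2))     # ~ K
```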
Example B.29. Figures B.1 and B.2 demonstrate this canonical representation. There the contour and mesh plots of four two-dimensional Gaussian vectors are shown:

X1 = [1 0; 0 1] W,   K_X1X1 = I_2,   (B.123)
X2 = [2 0; 0 1] W,   K_X2X2 = [4 0; 0 1],   (B.124)
X3 = [1 0; 0 2] W,   K_X3X3 = [1 0; 0 4],   (B.125)
X4 = (1/√2) [1 −1; 1 1] [1 0; 0 2] W,   K_X4X4 = (1/2) [5 −3; −3 5],   (B.126)

where W is a standard Gaussian 2-vector,

W ~ N(0, [1 0; 0 1]).   (B.127)

The above proposition can also be used in order to demonstrate the linear transformation one can perform on a Gaussian to obtain a standard Gaussian. Thus, it is the multivariate version of the univariate result showing that if X ~ N(μ, σ²), where σ² > 0, then (X − μ)/σ has a N(0, 1) distribution.

Proposition B.30. Let X ~ N(μ, K) and let U and Λ be as above. Assume that K is positive definite, so that its eigenvalues are strictly positive. Then

Λ^{−1/2} Uᵀ (X − μ) ~ N(0, I),   (B.128)

where Λ^{−1/2} is the matrix inverse of Λ^{1/2}, i.e., it is a diagonal matrix whose diagonal entries are the reciprocals of the square roots of the eigenvalues of K.

Proof: Since the product of a deterministic matrix by a Gaussian vector results in a Gaussian vector, and since the mean and covariance of a Gaussian vector fully specify its law, it suffices to check that the left-hand side of (B.128) is of the specified mean (namely, zero) and of the specified covariance matrix (namely, the identity matrix).
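Proposition B.30 is the standard whitening recipe. The sketch below (Python/NumPy, illustrative only; μ and K are arbitrary examples) applies Λ^{−1/2}Uᵀ(X − μ) to correlated samples and checks that the result has zero mean and identity covariance.

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([1.0, -2.0])
K = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Correlated samples X ~ N(mu, K).
X = rng.multivariate_normal(mu, K, size=100_000)

# Whitening transform Lambda^{-1/2} U^T (X - mu).
lam, U = np.linalg.eigh(K)
T = np.diag(1.0 / np.sqrt(lam)) @ U.T
Z = (X - mu) @ T.T

print("mean of Z:", np.round(Z.mean(axis=0), 2))             # ~ (0, 0)
print("cov of Z :\n", np.round(np.cov(Z, rowvar=False), 2))  # ~ identity matrix
```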

B.8 The Characteristic Function of a Gaussian Vector

Let X ~ N(μ, K), where K = SᵀS. Then X can be written as

X = SᵀW + μ.   (B.129)

Using the characteristic function of a standard Gaussian vector (B.83), we can now easily derive the characteristic function of X:

Φ_X(ϖ) = E[e^{i ϖᵀX}]   (B.130)
       = E[e^{i ϖᵀ(SᵀW + μ)}]   (B.131)
       = E[e^{i ϖᵀSᵀW + i ϖᵀμ}]   (B.132)
       = E[e^{i (Sϖ)ᵀW}] e^{i ϖᵀμ}   (B.133)
       = Φ_W(Sϖ) e^{i ϖᵀμ}   (B.134)
       = e^{−(1/2) ‖Sϖ‖²} e^{i ϖᵀμ}   (B.135)
       = e^{−(1/2) (Sϖ)ᵀ(Sϖ) + i ϖᵀμ}   (B.136)
       = e^{−(1/2) ϖᵀSᵀSϖ + i ϖᵀμ}   (B.137)
       = e^{−(1/2) ϖᵀKϖ + i ϖᵀμ},   ϖ ∈ Rⁿ.   (B.138)

B.9 The Density of a Gaussian Vector

As we have seen, if the covariance matrix of a centered random vector is singular, then with probability 1 at least one of the components of the vector can be expressed as a linear combination of the other components. Consequently, random vectors with singular covariance matrices cannot have a density.

We shall thus only discuss the density of the multivariate Gaussian distribution for the case where the covariance matrix K is nonsingular, i.e., is positive definite. In this case all its eigenvalues are strictly positive. To derive this density we shall use Theorem B.27 to represent the N(0, K) distribution as the distribution of

U Λ^{1/2} W,   (B.139)

where U is an orthogonal matrix and Λ is a diagonal matrix satisfying KU = UΛ. Note that Λ is nonsingular because its diagonal elements are the eigenvalues of the positive definite matrix K.

The following example motivates the discussion about the rank of the matrix A in the representation (B.84).
Example B.31. Suppose that the random 3-vector

X = (X1, X2, X3)ᵀ   (B.140)

can be represented in terms of the standard Gaussian 4-vector W as

[X1; X2; X3] = [−1 1 2 3; 1 5 3 1; −1 7 7 7] [W1; W2; W3; W4].   (B.141)

Hence, X is centered Gaussian. Since

(−1, 7, 7, 7) = 2·(−1, 1, 2, 3) + (1, 5, 3, 1),   (B.142)

it follows that

X3 = (−1, 7, 7, 7) [W1; W2; W3; W4]   (B.143)
   = [ 2·(−1, 1, 2, 3) + (1, 5, 3, 1) ] [W1; W2; W3; W4]   (B.144)
   = 2·(−1, 1, 2, 3) [W1; W2; W3; W4] + (1, 5, 3, 1) [W1; W2; W3; W4]   (B.145)
   = 2X1 + X2,   (B.146)

irrespective of the realization of W! Consequently, the random vector X can also be written as:

[X1; X2] = [−1 1 2 3; 1 5 3 1] [W1; W2; W3; W4];   (B.147)

and

X3 = 2X1 + X2.   (B.148)

The reader is encouraged to verify that X can also be written as:

[X2; X3] = [1 5 3 1; −1 7 7 7] [W1; W2; W3; W4];   (B.149)

and

X1 = −(1/2) X2 + (1/2) X3.   (B.150)
431

This example is closely related to Example B.15: Also here we have a


singular covariance matrix

KXX

15 13 43

= 13 36 62 ,
43 62 148

(B.151)

and therefore at least (in this case actually exactly) one component of X can
be written as a linear combination of the others with probability 1. However,
note the subtle dierence here: Since the Gaussian vector is fully specied
by mean and covariance, the fact that the covariance matrix shows a relation
between components will not only imply a corresponding relation between the
components with probability 1, but actually deterministically! So, we emphasize that (B.148) holds deterministically and not only with probability 1.
The above example generalizes to the following proposition.
Proposition B.32. If the zero-mean Gaussian n-vector X has the representation (B.84) with an n m matrix A of rank r, then it can also be represented in the following way. Choose r linearly independent rows of A, say

1 j1 , . . . , jr n. Express the vector X = (Xj1 , . . . , Xjr )T as AW where A is


an rm matrix consisting of the r rows numbered j1 , . . . , jr of A. The remaining components of X can be described as deterministic linear combinations of
X j 1 , . . . , Xj r .
Note that it is possible to reduce m to r using the canonical representation

of X.
If one of the components of a random vector is a linear combination of the others, then the random vector has no density function. Thus, if one wishes to work with densities, one needs to keep the linear dependency on the side.

We are now ready to use the change-of-density formula to compute the density of

X = U Λ^{1/2} W,   (B.152)

where as discussed we assume that the diagonal matrix Λ has only positive entries. Let

A ≜ U Λ^{1/2},   (B.153)

so that the density we are after is the density of AW. Using the formula for computing the density of AW from that of W and noting that

AAᵀ = U Λ^{1/2} (U Λ^{1/2})ᵀ = U Λ^{1/2} Λ^{1/2} Uᵀ = U Λ Uᵀ = K,   (B.154)

we have
f_X(x) = f_W(A^{−1}x) / |det(A)|   (B.155)
       = exp(−(1/2) (A^{−1}x)ᵀ(A^{−1}x)) / ((2π)^{n/2} |det(A)|)   (B.156)
       = exp(−(1/2) xᵀ(A^{−1})ᵀA^{−1}x) / ((2π)^{n/2} |det(A)|)   (B.157)
       = exp(−(1/2) xᵀ(AAᵀ)^{−1}x) / ((2π)^{n/2} |det(A)|)   (B.158)
       = exp(−(1/2) xᵀK^{−1}x) / ((2π)^{n/2} |det(A)|).   (B.159)

We now write |det(A)| in terms of the determinant of the covariance matrix K as

|det(A)| = √(det(A) det(A))   (B.160)
         = √(det(A) det(Aᵀ))   (B.161)
         = √(det(AAᵀ))   (B.162)
         = √(det(K))   (B.163)

to obtain that if X ~ N(μ, K), where K is nonsingular, then

f_X(x) = exp(−(1/2) (x − μ)ᵀ K^{−1} (x − μ)) / √((2π)ⁿ det(K)),   x ∈ Rⁿ.   (B.164)
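As a numerical check of (B.164), the sketch below (Python/NumPy plus SciPy, illustrative only; it assumes SciPy is available and uses arbitrary μ, K, and x) evaluates the formula directly and compares it with scipy.stats.multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x = np.array([0.3, 0.7])

# Direct evaluation of (B.164).
d = x - mu
n = len(mu)
quad = d @ np.linalg.solve(K, d)                   # (x-mu)^T K^{-1} (x-mu)
f_direct = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(K))

# Reference value from SciPy.
f_scipy = multivariate_normal(mean=mu, cov=K).pdf(x)

print(f_direct, f_scipy)    # the two values agree
```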

B.10 Linear Functions of Gaussian Vectors

From the definition of the multivariate Gaussian distribution, it follows quite easily that if X is an n-dimensional Gaussian vector and if C is an m×n deterministic matrix, then the m-dimensional vector CX is also a Gaussian vector (see also (B.91)). By considering the case m = 1 we obtain that if α is any deterministic n-dimensional vector, then αᵀX is a one-dimensional Gaussian vector. But the components of a Gaussian vector are univariate Gaussians, and we thus conclude that the sole component of αᵀX is a univariate Gaussian, i.e., that the random variable αᵀX has a univariate Gaussian distribution. Note that the mapping x ↦ αᵀx is a linear mapping from the n-dimensional Euclidean space to the real line. Such a mapping is called a linear function.

We shall next show that the reverse is also true: If X is an n-dimensional random vector such that for every deterministic α ∈ Rⁿ the random variable αᵀX has a univariate Gaussian distribution, then X must be a Gaussian vector.

We have the following theorem.


Theorem B.33. The n-dimensional random vector X is Gaussian if, and only if, for every deterministic n-vector α ∈ Rⁿ the random variable αᵀX has a univariate Gaussian distribution.

Proof: We thus assume that X is an n-dimensional random vector such that for every deterministic α ∈ Rⁿ the random variable αᵀX has a univariate Gaussian distribution, and proceed to prove that X must have a multivariate Gaussian distribution. Let K_XX be the covariance matrix of X, and let μ_X denote its mean. Let α ∈ Rⁿ be arbitrary, and consider the random variable αᵀX. This is a random variable of mean αᵀμ_X and of variance αᵀK_XX α; see Lemma B.11. But we assumed that any linear functional of X has a Gaussian distribution, so

αᵀX ~ N(αᵀμ_X, αᵀK_XX α).   (B.165)

Consequently, by the characteristic function of the univariate Gaussian distribution (A.76) we conclude that

E[e^{i αᵀX}] = e^{−(1/2) αᵀK_XX α + i αᵀμ_X},   α ∈ Rⁿ.   (B.166)

But this also represents the characteristic function of X. According to (B.138) this is identical to the characteristic function of a multivariate Gaussian of mean μ_X and covariance matrix K_XX. Consequently, since two random vectors that have the same characteristic function must have the same law (B.67), it follows that X must be a Gaussian vector.

B.11 A Summary

Top things to remember about the multivariate centered Gaussian distribution are as follows:

1. The multivariate centered Gaussian distribution is a generalization of the univariate centered Gaussian distribution in the following sense: A random vector that has only one component, i.e., that takes value in R, has a centered multivariate Gaussian distribution if, and only if, its sole component, when viewed as a scalar random variable, has a univariate N(0, σ²) distribution for some σ² ≥ 0.

2. Moreover, every component of a multivariate centered Gaussian vector has a centered univariate Gaussian distribution.

3. The reverse, however, is not true. Just because each component of a random vector has a univariate centered Gaussian distribution, it does not follow that the vector has a multivariate centered Gaussian distribution. (ATTENTION!) Compare also with Points 11 and 12 below.

4. Given any n×n positive semi-definite matrix K ⪰ 0, there exists a unique multivariate centered Gaussian distribution of the prescribed covariance matrix K. This distribution is denoted N(0, K).


5. To generate a random n-vector X of the law N(0, K) one can start with a random n-vector W whose components are IID N(0, 1) and set

X = U Λ^{1/2} W,   (B.167)

where the matrix U is orthogonal, the matrix Λ is diagonal, and

KU = UΛ.   (B.168)

Here Λ^{1/2} is a diagonal matrix whose diagonal elements are the nonnegative square roots of the corresponding diagonal entries of Λ.

6. If X ~ N(0, K), where K is positive definite (K ≻ 0), and U and Λ are as above, then

Λ^{−1/2} Uᵀ X ~ N(0, I_n).   (B.169)

7. If X ~ N(0, K), where K is positive definite (K ≻ 0), then X has the density

f_X(x) = (1/√((2π)ⁿ det(K))) e^{−(1/2) xᵀK^{−1}x}.   (B.170)

This can be interpreted graphically as follows. Let W, U and Λ be as above. Note that the condition K ≻ 0 is equivalent to the condition that the diagonal entries of Λ and Λ^{1/2} are strictly positive. The level lines of the density of W are centered circles; those of the density of Λ^{1/2}W are ellipsoids of principal axes parallel to the Cartesian axes; and those of U Λ^{1/2}W are rotated versions of those of Λ^{1/2}W. Compare with Figures B.1 and B.2.
8. If the covariance matrix K ⪰ 0 is not positive definite (but only positive semi-definite), then the N(0, K) distribution does not have a density. However, in this case, if X ~ N(0, K), then a subset of the components of X have a jointly Gaussian distribution with a density, and the rest of the components of X can be expressed as deterministic linear combinations of the components in the subset. Indeed, if columns ℓ1, . . . , ℓd are linearly independent and span all the columns of K, then every component of X can be expressed as a deterministic linear combination of X_{ℓ1}, . . . , X_{ℓd}, and the random d-vector X̃ = (X_{ℓ1}, . . . , X_{ℓd})ᵀ has a centered Gaussian distribution with a density. In fact, X̃ has a N(0, K̃) distribution, where the d×d covariance matrix K̃ is positive definite and can be computed from K by picking from it rows ℓ1, . . . , ℓd and columns ℓ1, . . . , ℓd.
9. If C is a deterministic m×n matrix and the random n-vector X has a N(0, K) distribution, then the random m-vector CX is also a centered Gaussian vector. Its m×m covariance matrix is CKCᵀ.

10. Consequently:


• Any random vector that results from permuting the components of a centered Gaussian vector is also a centered Gaussian vector.

• Picking a subset of the components of a Gaussian vector and stacking them into a vector results in a centered Gaussian vector.

• If the covariance matrix of a centered Gaussian X is a scalar multiple of the identity matrix I, then the distribution of X is invariant under multiplication by a deterministic orthogonal matrix U.
11. Let the random n1-vector X1 be N(0, K1) distributed and let the random n2-vector X2 be N(0, K2) distributed. Then if X1 and X2 are independent, the (n1 + n2)-vector that results when the vector X1 is stacked on top of the vector X2 has a multivariate centered Gaussian distribution:

[X1; X2] ~ N(0, K),   K = [K1 0; 0 K2].   (B.171)

(Here K is an (n1 + n2) × (n1 + n2) matrix.)


12. The above extends to the case where more than two independent Gaussian vectors are stacked into a longer vector. In particular, if the components of a random vector are independent and they each have a univariate Gaussian distribution, then the random vector has a multivariate Gaussian distribution with a diagonal covariance matrix.

13. If a subset of the components of a Gaussian vector are uncorrelated with a different subset of the components of the same vector, then the two subsets are statistically independent.

14. If the components of a Gaussian vector are independent in pairs, then they are independent.

[Figure B.1: Contour plots of the joint densities of the four different two-dimensional Gaussian vectors given in Example B.29: from left to right and top to bottom X1, . . . , X4. The axes are x_{k,1} and x_{k,2}, each ranging from −4 to 4.]

[Figure B.2: Mesh plots of the densities of the four two-dimensional Gaussian vectors given in Example B.29: from left to right and top to bottom X1, . . . , X4.]

Appendix C

Stochastic Processes

Stochastic processes are not really difficult to understand when viewed as a family of random variables that is parametrized by the time index t. However, there are a few mathematical subtleties concerning measurability, integrability, existence, and certain nasty events of zero Lebesgue measure. Our treatment here will mostly ignore these issues and try to concentrate on the main engineering picture. We highly recommend, however, to have a closer look at the exact treatment in [Lap09, Ch. 25].

C.1 Stochastic Processes & Stationarity

A stochastic process X(t, ω) is a generalization of a random variable: It is a collection of random variables that is indexed by the parameter t. So, for every fixed value t, X(t, ·) is a random variable, i.e., a function that maps the outcomes ω of a random experiment to a real number:

X(t, ·): Ω → R,   ω ↦ X(t, ω).   (C.1)

On the other hand, for a fixed ω, X(·, ω) denotes the realizations of all these random variables, i.e., we have now a function of time:

X(·, ω): R → R,   t ↦ X(t, ω).   (C.2)

This function is usually called sample-path, sample-function, or realization. Similarly to the case of random variables, it is usual to drop ω and simply write X(t) for t ∈ R. Here the capitalization of X points to the fact that a random experiment is behind this function.

More formally, we have the following definition.

Definition C.1. A stochastic process {X(t)}_{t∈T} is an indexed family of random variables that are defined on a common probability space (Ω, F, P). Here T denotes the indexing set that usually will be T = R, denoting continuous time, or T = Z, denoting discrete time.


In the following we will concentrate on continuous-time stochastic processes, i.e., we will assume that T = R.

In order to make sure that a stochastic process {X(t)} is clearly defined, we need to specify a rule from which all joint distribution functions can, at least in principle, be calculated. That is, for all positive integers n and all choices of n epochs t1, t2, . . . , tn, it must be possible to find the joint cumulative distribution function (CDF)

F_{X(t1),...,X(tn)}(x1, . . . , xn) ≜ Pr[X(t1) ≤ x1, . . . , X(tn) ≤ xn].   (C.3)

Once such a rule is given, any other property of the stochastic process can be computed. For example, the mean E[X(t)] = μ_X(t) and the covariance Cov[X(t), X(u)] of the process are specified by the joint distribution functions. In particular,

Cov[X(t), X(u)] ≜ E[(X(t) − μ_X(t)) (X(u) − μ_X(u))].   (C.4)

Representing {X(t)} in terms of its mean μ_X(·) and its fluctuation

X̃(t) ≜ X(t) − μ_X(t),   t ∈ R,   (C.5)

the covariance simplifies to

Cov[X(t), X(u)] = E[X̃(t) X̃(u)].   (C.6)

Most noise functions have zero mean, in which case X̃(t) = X(t) and the covariance is therefore Cov[X(t), X(u)] = E[X(t)X(u)].
Example C.2. Let . . . , S−1, S0, S1, . . . be a sequence of IID random variables and define the pulse h(t) ≜ I{0 ≤ t < T}. Then

X(t) = Σ_ℓ S_ℓ h(t − ℓT)   (C.7)

is a stochastic process where the sample functions are piecewise constant over intervals of size T. The mean is E[X(t)] = E[S_ℓ] for t ∈ [ℓT, (ℓ+1)T). Since {S_ℓ} is IID, this means that

E[X(t)] = E[S],   ∀ t.   (C.8)

The covariance is nonzero only for t and u in the same interval, i.e.,

Cov[X(t), X(u)] = Var(S_ℓ)   (C.9)

if, for some integer ℓ, both t and u lie in the interval [ℓT, (ℓ+1)T). Otherwise Cov[X(t), X(u)] = 0:

Cov[X(t), X(u)] = { Var(S) if ∃ ℓ ∈ Z s.t. t, u ∈ [ℓT, (ℓ+1)T);  0 otherwise. }   (C.10)
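The covariance structure (C.10) is easy to visualize by simulation. The sketch below (Python/NumPy, illustrative; the pulse duration T = 1, the variance of S, and the chosen time pairs are arbitrary) generates sample paths of the process in Example C.2 and estimates Cov[X(t), X(u)] for two pairs (t, u).

```python
import numpy as np

rng = np.random.default_rng(6)
T = 1.0
n_paths, n_intervals = 50_000, 4

# S_l are IID with zero mean and variance 2 (any IID law would do).
S = np.sqrt(2.0) * rng.standard_normal((n_paths, n_intervals))

def X(t):
    """Value of the piecewise-constant process at time t, for all paths."""
    l = int(np.floor(t / T))          # index of the interval containing t
    return S[:, l]

def cov(t, u):
    return np.cov(X(t), X(u))[0, 1]

print("Cov[X(0.25T), X(0.75T)] =", round(cov(0.25 * T, 0.75 * T), 2))  # ~ Var(S) = 2
print("Cov[X(0.75T), X(1.25T)] =", round(cov(0.75 * T, 1.25 * T), 2))  # ~ 0
```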


We see from Definition C.1 that it is not a simple matter to properly define a general stochastic process: We need to specify infinitely many different joint distributions! To ease our life, we will have to make some more assumptions that will allow us to simplify the specification and analysis of a stochastic process.

Definition C.3. A stochastic process {X(t)} is called stationary or strict-sense stationary (SSS) if for every n ∈ N, any epochs t1, . . . , tn ∈ R, and every τ ∈ R,

(X(t1 + τ), . . . , X(tn + τ)) =ᴸ (X(t1), . . . , X(tn)),   (C.11)

where "=ᴸ" stands for "equal in law", i.e., the random vectors on both sides have the same probability distribution.

In words this means that a stationary stochastic process will not change its probability distribution over time. In particular this means that, e.g., the mean μ_X(t) must be constant (i.e., not depend on t) and the covariance Cov[X(t), X(u)] must be a function of the difference t − u only.

A weaker condition that nevertheless proves useful in many situations is wide-sense stationarity.

Definition C.4. A stochastic process {X(t)} is called weakly stationary or wide-sense stationary (WSS) if the following conditions hold:

1. The variance of X(t) is finite.

2. The mean of X(t) is constant for all t ∈ R:

E[X(t)] = E[X(t + τ)],   ∀ t, τ ∈ R.   (C.12)

3. The covariance of X(t) satisfies

Cov[X(t), X(u)] = Cov[X(t + τ), X(u + τ)],   ∀ t, u, τ ∈ R.   (C.13)

In the following we will usually use "stationary" to refer to SSS processes, but "WSS" to refer to weakly stationary processes.

From our discussion above we immediately get the following proposition.

Proposition C.5. Every finite-variance stationary stochastic process is WSS.

However, note that the reverse is not true: Some WSS processes are not stationary.

C.2 The Autocovariance Function

We next restrict ourselves to WSS processes. For such processes we can define the autocovariance function.


Definition C.6. The autocovariance function K_XX: R → R of a WSS stochastic process {X(t)} is defined for every τ ∈ R by

K_XX(τ) ≜ Cov[X(t), X(t + τ)],   (C.14)

where the right-hand side does not depend on t because {X(t)} is assumed to be WSS.

Remark C.7. Many people like to call K_XX(·) the autocorrelation function. This, however, is strictly speaking not correct, because correlation is normalized to have values between −1 and 1, i.e., the autocorrelation function is the autocovariance function divided by the variance of the process:

ρ_XX(τ) ≜ Cov[X(t), X(t + τ)] / Var(X(t)),   τ ∈ R.   (C.15)

(Note again that the right-hand side does not depend on t because {X(t)} is assumed to be WSS.) To increase the confusion even further, the term autocorrelation function is also used to describe the function

R_ss: τ ↦ ∫_{−∞}^{∞} s(t + τ) s(t) dt   (C.16)

for some deterministic function s(·). I prefer to call the function (C.16) the self-similarity function (following the suggestion of Prof. Amos Lapidoth and Prof. James L. Massey).
The following properties of the autocovariance function are analogous to Theorem B.12.

Theorem C.8 (Properties of the Autocovariance Function).

1. The autocovariance function K_XX(·) of a continuous-time WSS process {X(t)} is a positive definite function, i.e., for every n ∈ N and for every choice of coefficients α1, . . . , αn ∈ R and epochs t1, . . . , tn ∈ R:

Σ_{i=1}^n Σ_{i′=1}^n α_i α_{i′} K_XX(t_i − t_{i′}) ≥ 0.   (C.17)

2. The autocovariance function K_XX(·) of a continuous-time WSS process {X(t)} is a symmetric function:

K_XX(−τ) = K_XX(τ),   ∀ τ ∈ R.   (C.18)

3. Every symmetric positive definite function is the autocovariance function of some WSS stochastic process.¹

¹ Actually, it is always possible to choose this WSS process to be stationary Gaussian. See Definition C.9 of a Gaussian process below.


Proof: To prove Part 2 we calculate

K_XX(τ) = Cov[X(t), X(t + τ)]   (C.19)
        = Cov[X(t + τ), X(t)]   (C.20)
        = Cov[X(t′), X(t′ − τ)]   (C.21)
        = K_XX(−τ),   (C.22)

where we have defined t′ ≜ t + τ and used that Cov[X, Y] = Cov[Y, X].

To prove Part 1 we compute

Σ_{i=1}^n Σ_{i′=1}^n α_i α_{i′} K_XX(t_i − t_{i′}) = Σ_{i=1}^n Σ_{i′=1}^n α_i α_{i′} Cov[X(t_i), X(t_{i′})]   (C.23)
= Cov[ Σ_{i=1}^n α_i X(t_i), Σ_{i′=1}^n α_{i′} X(t_{i′}) ]   (C.24)
= Var( Σ_{i=1}^n α_i X(t_i) )   (C.25)
≥ 0.   (C.26)

We omit the proof of Part 3 and refer to [Lap09, Ch. 25] for more details.

C.3 Gaussian Processes

The most useful stochastic processes for representing noise are the Gaussian processes.

Definition C.9. A stochastic process is Gaussian if, for every n ∈ N and all choices of epochs t1, . . . , tn ∈ R, the random vector (X(t1), . . . , X(tn))ᵀ is jointly Gaussian.

From our discussion about Gaussian vectors in Appendix B we know that a Gaussian vector has a finite variance and is completely specified by its mean and covariance matrix. More precisely, we have the following.

Corollary C.10 (Properties of Gaussian Processes).

1. A Gaussian process has a finite variance.

2. A Gaussian process is stationary if, and only if, it is WSS.

3. A stationary Gaussian process is fully specified by its autocovariance function and its mean.

Proof: The proof follows in a straightforward way from our results in Appendix B. We omit the details.

Example C.11. The process in Example C.2 is a Gaussian process if the underlying variables {S_ℓ} are Gaussian. However, note that this process is


neither stationary nor WSS. This can be seen if we realize that from (C.10) we have

Cov[X(T/4), X(3T/4)] = Var(S),   (C.27)

but

Cov[X(3T/4), X(5T/4)] = 0,   (C.28)

even though for both cases

3T/4 − T/4 = 5T/4 − 3T/4 = T/2.   (C.29)

Thus, Part 3 of Definition C.4 is violated.

C.4 The Power Spectral Density

It is quite common among engineers to define the power spectral density (PSD) of a WSS stochastic process as the Fourier transform of its autocovariance function. There is nothing wrong with this; however, it does not really explain why it should be so. Note that the PSD has a clear engineering meaning: It is a power density in the frequency domain! This means that when integrated over a frequency interval, it describes the amount of power within this interval. So a practical, operational definition of the PSD should state that the PSD is a function that describes the amount of power contained in the signal in various frequency bands.

Luckily, it can be shown that for WSS processes this (operational) power spectral density coincides with the Fourier transform of the autocovariance function.
Theorem C.12. Let {X(t)} be a zero-mean WSS stochastic process with continuous autocovariance function K_XX(τ), τ ∈ R. Let S_XX(f), f ∈ R, denote the double-sided power spectral density function of {X(t)}. Then:

1. S_XX(·) is the Fourier transform of K_XX(·).

2. For every linear filter with integrable impulse response h(·), the power of the filter output {Y(t) = (X ★ h)(t)} when fed at the input by {X(t)} is

Power of (X ★ h)(·) = ∫_{−∞}^{∞} S_XX(f) |ĥ(f)|² df,   (C.30)

where ĥ(·) denotes the Fourier transform² of h(·).

² Since capital letters are already reserved for random quantities, in this script we use a hat to denote the Fourier transform.


Proof: We ignore the mathematical details and refer to [Lap09, Ch. 15 & 25] again. The proof of Part 2 follows from Theorem C.16 below.

Note that in (C.30) we can choose the filter to be a bandpass filter of very narrow band and thereby measure the power contents of {X(t)} inside this band: For

ĥ(f) = { 1 if f ∈ [f0, f1],  0 otherwise }   (C.31)

we have that the total power of the output signal {Y(t) = (X ★ h)(t)} is

∫_{f0}^{f1} S_XX(f) df.   (C.32)

Hence, S_XX(f) represents the density of power inside a frequency band. For a more detailed description of linear filtering of stochastic processes we refer to Section C.6.

C.5 Linear Functionals of WSS Stochastic Processes

Most of the processing that must be done on noise waveforms involves either linear filters or linear functionals. We start by looking at the latter. Concretely, we are interested in integrals of the form

∫_{−∞}^{∞} X(t) s(t) dt   (C.33)

for some WSS process {X(t)} and for a (properly well-behaved) deterministic function s(·). Ignoring some mathematical subtleties,³ we think of (C.33) as a random variable, i.e., we define

Y ≜ ∫_{−∞}^{∞} X(t) s(t) dt.   (C.34)

Heuristically, we can derive the mean of Y as follows:

E[Y] = E[ ∫ X(t) s(t) dt ]   (C.35)
     = ∫ E[X(t)] s(t) dt   (C.36)
     = ∫ E[X(0)] s(t) dt   (C.37)
     = E[X(0)] ∫ s(t) dt,   (C.38)

³ For example, we have to worry about the measurability of the stochastic process and the integrability of the deterministic function. Moreover, even for such nice assumptions, there might be some realizations of the stochastic process for which the integral is unbounded. However, one can show that such events have zero probability. For more see [Lap09, Ch. 25].


where we have used the linearity of expectation and the WSS assumption on {X(t)}. Similarly, writing μ ≜ E[X(0)] and X̃(t) ≜ X(t) − μ, we get

Var(Y) = Var( ∫ X(t) s(t) dt )   (C.39)
       = Var( ∫ (X̃(t) + μ) s(t) dt )   (C.40)
       = Var( ∫ X̃(t) s(t) dt + μ ∫ s(t) dt )   (C.41)
       = Var( ∫ X̃(t) s(t) dt )   (C.42)
       = E[ ( ∫ X̃(t) s(t) dt )² ]   (C.43)
       = E[ ∫ X̃(t) s(t) dt ∫ X̃(τ) s(τ) dτ ]   (C.44)
       = E[ ∫∫ X̃(t) s(t) X̃(τ) s(τ) dt dτ ]   (C.45)
       = ∫∫ E[X̃(t) X̃(τ)] s(t) s(τ) dt dτ   (C.46)
       = ∫∫ s(t) K_XX(t − τ) s(τ) dt dτ.   (C.47)

Here in (C.42) we use the fact that adding a constant to a random variable does not change its variance, and the last equality follows from the definition of the autocovariance function.
Note that using the definition of the self-similarity function (C.16), we can write this as follows:

Var(Y) = ∫∫ s(t) K_XX(t − τ) s(τ) dt dτ   (C.48)
       = ∫∫ s(σ + τ) K_XX(σ) s(τ) dσ dτ   (C.49)
       = ∫ K_XX(σ) ∫ s(σ + τ) s(τ) dτ dσ   (C.50)
       = ∫ K_XX(σ) R_ss(σ) dσ.   (C.51)

Recalling Parseval's theorem,

∫ g(t) h(t) dt = ∫ ĝ(f) ĥ*(f) df,   (C.52)

we can express this in the frequency domain as

Var(Y) = ∫ S_XX(f) |ŝ(f)|² df.   (C.53)

Even though these derivations are heuristic and actually need some justification, they are basically correct. We have the following theorem.

Theorem C.13. Let {X(t)} be a WSS stochastic process with autocovariance function K_XX(·). Let s(·) be some decent deterministic function. Then the random variable Y in (C.34) is well-defined and has a mean given in (C.38) and a variance given in (C.51) or (C.53).
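The mean and variance formulas of Theorem C.13 can be checked numerically by discretizing time. The sketch below (Python/NumPy, illustrative; the exponential autocovariance K_XX(τ) = e^{−|τ|}, the rectangular s(·), and the grid spacing are arbitrary example choices) simulates a stationary Gaussian process on a grid, forms the linear functional Y ≈ Σ X(tₖ) s(tₖ) Δ, and compares its empirical variance with the double integral in (C.47).

```python
import numpy as np

rng = np.random.default_rng(7)

dt = 0.05
t = np.arange(0.0, 10.0, dt)
s = ((t >= 2.0) & (t < 5.0)).astype(float)        # example weighting function s(t)
Kxx = lambda tau: np.exp(-np.abs(tau))            # example autocovariance function

# Covariance matrix of the sampled process and Gaussian sample paths.
K = Kxx(t[:, None] - t[None, :])
X = rng.multivariate_normal(np.zeros(len(t)), K, size=20_000)

# Monte Carlo estimate of Var(Y) with Y ~ integral of X(t) s(t) dt.
Y = X @ s * dt
var_mc = Y.var()

# Direct evaluation of the double integral in (C.47).
var_int = s @ K @ s * dt**2

print(round(var_mc, 3), round(var_int, 3))        # the two values agree closely
```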

We continue with integrals of the form ∫ X(t) s(t) dt, but with the additional assumption that {X(t)} is Gaussian. Actually, we even consider a slightly more general form of a linear functional:

Y ≜ ∫ X(t) s(t) dt + Σ_{i=1}^n α_i X(t_i),   (C.54)

for some stationary Gaussian process {X(t)}, a decent deterministic function s(·), some n ∈ N, and arbitrary coefficients α1, . . . , αn ∈ R and epochs t1, . . . , tn ∈ R. Following a similar derivation as shown above, one can prove the following theorem.

Theorem C.14. Consider the setup of (C.54) with {X(t)} being a stationary Gaussian stochastic process. Then Y is a Gaussian random variable with mean

E[Y] = E[X(0)] ( ∫ s(t) dt + Σ_{i=1}^n α_i )   (C.55)

and variance

Var(Y) = ∫ K_XX(σ) R_ss(σ) dσ + Σ_{i=1}^n Σ_{i′=1}^n α_i α_{i′} K_XX(t_i − t_{i′}) + 2 Σ_{i=1}^n α_i ∫ s(t) K_XX(t − t_i) dt.   (C.56)

Here, R_ss(·) denotes the self-similarity function of s(·); see (C.16).

Proof: This theorem is a special case of Theorem C.15.

As a matter of fact, this theorem can be extended as follows.
Theorem C.15. Let the random vector Y = (Y1, . . . , Ym)ᵀ have components

Y_j ≜ ∫ X(t) s_j(t) dt + Σ_{i=1}^{n_j} α_{j,i} X(t_{j,i}),   j = 1, . . . , m,   (C.57)

where {X(t)} is a stationary Gaussian process, {s_j(·)}_{j=1}^m are m (decent) deterministic functions, {n_j}_{j=1}^m are nonnegative integers, and the coefficients {α_{j,i}} and the epochs {t_{j,i}} are deterministic real numbers for all j ∈ {1, . . . , m} and all i ∈ {1, . . . , n_j}. Then Y is a Gaussian vector with a covariance


matrix whose components can be computed as follows:

Cov[Y_j, Y_k] = ∫ K_XX(σ) ∫ s_j(t) s_k(t − σ) dt dσ
              + Σ_{i=1}^{n_j} α_{j,i} (s_k ★ K_XX)(t_{j,i})
              + Σ_{i′=1}^{n_k} α_{k,i′} (s_j ★ K_XX)(t_{k,i′})
              + Σ_{i=1}^{n_j} Σ_{i′=1}^{n_k} α_{j,i} α_{k,i′} K_XX(t_{j,i} − t_{k,i′})   (C.58)

or, equivalently,

Cov[Y_j, Y_k] = ∫ S_XX(f) ŝ_j(f) ŝ_k*(f) df
              + Σ_{i=1}^{n_j} α_{j,i} ∫ S_XX(f) ŝ_k(f) e^{i2πf t_{j,i}} df
              + Σ_{i′=1}^{n_k} α_{k,i′} ∫ S_XX(f) ŝ_j(f) e^{i2πf t_{k,i′}} df
              + Σ_{i=1}^{n_j} Σ_{i′=1}^{n_k} α_{j,i} α_{k,i′} ∫ S_XX(f) e^{i2πf(t_{j,i} − t_{k,i′})} df.   (C.59)

Proof: Since by Definition C.9 for any choice of time epochs t1, . . . , tn the vector (X(t1), . . . , X(tn))ᵀ is Gaussian, and since the components Y_j all are linear functionals of such vectors, it is quite intuitive that Y is a Gaussian vector. We ignore the mathematical subtleties of this proof and refer to [Lap09, Ch. 25]. The expression (C.59) follows from (C.58) using the Parseval theorem; and (C.58) can be justified as follows:
Cov[Y_j, Y_k]
= Cov[ ∫ X(t) s_j(t) dt + Σ_{i=1}^{n_j} α_{j,i} X(t_{j,i}),  ∫ X(τ) s_k(τ) dτ + Σ_{i′=1}^{n_k} α_{k,i′} X(t_{k,i′}) ]   (C.60)
= Cov[ ∫ X(t) s_j(t) dt, ∫ X(τ) s_k(τ) dτ ]
  + Σ_{i=1}^{n_j} α_{j,i} Cov[ X(t_{j,i}), ∫ X(τ) s_k(τ) dτ ]
  + Σ_{i′=1}^{n_k} α_{k,i′} Cov[ ∫ X(t) s_j(t) dt, X(t_{k,i′}) ]
  + Σ_{i=1}^{n_j} Σ_{i′=1}^{n_k} α_{j,i} α_{k,i′} Cov[ X(t_{j,i}), X(t_{k,i′}) ]   (C.61)
= ∫∫ Cov[X(t), X(τ)] s_j(t) s_k(τ) dt dτ
  + Σ_{i=1}^{n_j} α_{j,i} ∫ Cov[X(t_{j,i}), X(τ)] s_k(τ) dτ
  + Σ_{i′=1}^{n_k} α_{k,i′} ∫ Cov[X(t), X(t_{k,i′})] s_j(t) dt
  + Σ_{i=1}^{n_j} Σ_{i′=1}^{n_k} α_{j,i} α_{k,i′} K_XX(t_{j,i} − t_{k,i′})   (C.62)
= ∫∫ K_XX(t − τ) s_j(t) s_k(τ) dt dτ
  + Σ_{i=1}^{n_j} α_{j,i} ∫ K_XX(t_{j,i} − τ) s_k(τ) dτ
  + Σ_{i′=1}^{n_k} α_{k,i′} ∫ K_XX(t_{k,i′} − t) s_j(t) dt
  + Σ_{i=1}^{n_j} Σ_{i′=1}^{n_k} α_{j,i} α_{k,i′} K_XX(t_{j,i} − t_{k,i′}).   (C.63)

C.6 Filtering Stochastic Processes

Next we discuss the result of passing a WSS stochastic process through a stable⁴ linear filter with impulse response h(·). Figure C.1 illustrates the situation.

[Figure C.1: Filtered stochastic process: the input {X(t)} is passed through the filter h(·) to produce the output {Y(t) = (X ★ h)(t)}.]

Since for a fixed t the output can be written as

Y(t) = ∫ X(σ) h(t − σ) dσ,   (C.64)

we can (more or less) directly apply our results from Section C.5 about linear functionals. Again we will ignore some mathematical subtleties here and refer to [Lap09, Ch. 25] for any proofs.

⁴ A linear filter is called stable if its impulse response h(·) is integrable.


Theorem C.16. Let {Y(t)} be the result of passing the centered (zero-mean) WSS stochastic process {X(t)} of autocovariance function K_XX(·) through the linear filter with (decent) impulse response h(·). Then we have the following:

1. {Y(t)} is a centered WSS stochastic process with autocovariance function

K_YY(τ) = (K_XX ★ R_hh)(τ),   τ ∈ R,   (C.65)

where R_hh(·) is the self-similarity function of h(·) (see (C.16)).

2. If {X(t)} has PSD S_XX, then {Y(t)} has PSD

S_YY(f) = |ĥ(f)|² S_XX(f),   f ∈ R.   (C.66)

3. For every t, τ ∈ R,

E[X(t) Y(t + τ)] = (K_XX ★ h)(τ),   (C.67)

where the right-hand side does not depend on t.

4. If {X(t)} is Gaussian, then so is {Y(t)}. Even better, for every choice of n, m ∈ N and epochs t1, . . . , tn, t_{n+1}, . . . , t_{n+m} ∈ R, the random vector

(X(t1), . . . , X(tn), Y(t_{n+1}), . . . , Y(t_{n+m}))ᵀ   (C.68)

is a centered Gaussian random vector. The covariance matrix can be computed using K_XX(·), (C.65), and (C.67).
Note that the joint Gaussianity of the random variables in (C.68) follows from Theorem C.15. Indeed, we can express X(t_j) as

X(t_j) = ∫ X(t) s_j(t) dt + α_j X(t_j),   j = 1, . . . , n,   (C.69)

for s_j(·) the zero function and α_j = 1, and Y(t_j) similarly as

Y(t_j) = ∫ X(t) s_j(t) dt + α_j X(t_j),   j = n + 1, . . . , n + m,   (C.70)

for s_j: t ↦ h(t_j − t) and α_j = 0.

Also note that Theorem C.12 actually is a consequence of Theorem C.16.
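A discrete-time analogue of Part 1 of Theorem C.16 can be checked by simulation. The sketch below (Python/NumPy, illustrative; it replaces integrals by sums on a unit-spaced grid, and the impulse response h is an arbitrary example) filters white noise with a short filter and compares the empirical autocovariance of the output with the convolution of the input autocovariance (a unit impulse) with the self-similarity function R_hh.

```python
import numpy as np

rng = np.random.default_rng(8)

h = np.array([1.0, 0.6, 0.3, 0.1])        # example impulse response
N = 2_000_000

# Input: discrete-time white Gaussian noise, K_XX[k] = delta[k].
x = rng.standard_normal(N)
y = np.convolve(x, h, mode="full")[:N]

# Empirical autocovariance of the output at lags 0..4.
lags = range(5)
K_yy_emp = [np.mean(y[: N - k] * y[k:]) for k in lags]

# Theory: K_YY = K_XX * R_hh = R_hh (since K_XX is a unit impulse).
R_hh = [np.sum(h[: len(h) - k] * h[k:]) for k in lags]

print(np.round(K_yy_emp, 3))
print(np.round(R_hh, 3))
```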

C.7 White Gaussian Noise

The most important stochastic process in digital communication is so-called white Gaussian noise. This noise is often used to model additive noise in communication systems and in many other places, too. Sometimes its use is actually not really justified from a practical point of view, but is motivated merely by the fact that white Gaussian noise can be handled analytically.


Our definition differs from the usual definition found in most textbooks in the sense that we specify a certain bandwidth and define the white Gaussian noise only with respect to this given band. The reason for this is that it is impossible to have white Gaussian noise over the whole infinite spectrum, as such a process would have infinite power and its autocovariance function would not be defined.⁵

Definition C.17. We say that {N(t)} is white Gaussian noise of double-sided power spectral density N0/2 with respect to the bandwidth W if {N(t)} is a stationary centered Gaussian random process that has PSD S_NN(·) satisfying

S_NN(f) = N0/2,   f ∈ [−W, W].   (C.71)

Note that our definition on purpose leaves the PSD unspecified outside the given band [−W, W].

We can now apply our knowledge from Sections C.5 and C.6 to this definition and get the following key properties of white Gaussian noise.

Corollary C.18. Let {N(t)} be white Gaussian noise of double-sided power spectral density N0/2 with respect to the bandwidth W.

1. If s(·) is a properly well-behaved function that is bandlimited to W Hz, then

∫ N(t) s(t) dt ~ N(0, (N0/2) ⟨s, s⟩),   (C.72)

where we define the inner product of two real functions u(·) and v(·) as

⟨u, v⟩ ≜ ∫ u(t) v(t) dt.   (C.73)

2. If s1(·), . . . , sm(·) are properly well-behaved functions that are bandlimited to W Hz, then the m random variables

Y1 ≜ ∫ N(t) s1(t) dt,   (C.74)
⋮
Ym ≜ ∫ N(t) sm(t) dt   (C.75)

are jointly centered Gaussian random variables of covariance matrix

K_YY = (N0/2) [⟨s1, s1⟩ ⟨s1, s2⟩ ··· ⟨s1, sm⟩;  ⟨s2, s1⟩ ⟨s2, s2⟩ ··· ⟨s2, sm⟩;  ⋮ ⋮ ⋱ ⋮;  ⟨sm, s1⟩ ⟨sm, s2⟩ ··· ⟨sm, sm⟩].   (C.76)

⁵ The common definition tries to circumvent this issue by using Dirac deltas. While this might be convenient from an engineering point of view, it causes quite a few mathematical problems because Dirac deltas are not functions.

3. If φ1(·), . . . , φm(·) are properly well-behaved functions that are bandlimited to W Hz and that are orthonormal,

⟨φ_i, φ_{i′}⟩ = { 1 if i = i′,  0 otherwise, }   (C.77)

then the m random variables

∫ N(t) φ1(t) dt,   . . . ,   ∫ N(t) φm(t) dt   (C.78)

are IID N(0, N0/2).

4. If s(·) is a properly well-behaved function that is bandlimited to W Hz, and if K_NN(·) is the autocovariance function of {N(t)}, then

(K_NN ★ s)(t) = (N0/2) s(t),   t ∈ R.   (C.79)

5. If s(·) is a properly well-behaved function that is bandlimited to W Hz, then for every epoch t ∈ R

Cov[ ∫ N(τ) s(τ) dτ, N(t) ] = (N0/2) s(t).   (C.80)

Proof: Parts 1 and 3 are special cases of Part 2, so we directly prove Part 2. Using (C.59) of Theorem C.15 with α_{j,i} = 0, we get

Cov[Y_j, Y_k] = ∫ S_NN(f) ŝ_j(f) ŝ_k*(f) df   (C.81)
              = ∫_{−W}^{W} S_NN(f) ŝ_j(f) ŝ_k*(f) df   (C.82)
              = (N0/2) ∫_{−W}^{W} ŝ_j(f) ŝ_k*(f) df   (C.83)
              = (N0/2) ∫ ŝ_j(f) ŝ_k*(f) df   (C.84)
              = (N0/2) ∫ s_j(t) s_k(t) dt,   (C.85)

where in (C.82) we use that s1(·), . . . , sm(·) are bandlimited.


For Part 4 we write K_NN(·) as the inverse Fourier transform of S_NN(·):

(K_NN ★ s)(t) = ∫ s(τ) K_NN(t − τ) dτ   (C.86)
              = ∫ s(τ) ∫ S_NN(f) e^{i2πf(t−τ)} df dτ   (C.87)
              = ∫ S_NN(f) ( ∫ s(τ) e^{−i2πfτ} dτ ) e^{i2πft} df   (C.88)
              = ∫ S_NN(f) ŝ(f) e^{i2πft} df   (C.89)
              = ∫_{−W}^{W} S_NN(f) ŝ(f) e^{i2πft} df   (C.90)
              = (N0/2) ∫_{−W}^{W} ŝ(f) e^{i2πft} df   (C.91)
              = (N0/2) s(t).   (C.92)

Finally, Part 5 follows from a derivation similar to (C.60)-(C.63) in combination with Part 4.

C.8 Orthonormal and Karhunen-Loeve Expansions

The following material is pretty advanced and is presented only for those readers who want to go a little more deeply into the topic of representing stochastic processes by orthonormal expansions. Recall that a set of (complex) functions φ1(·), φ2(·), . . . is called orthonormal if

∫ φ_i(t) φ_{i′}*(t) dt = I{i = i′}   (C.93)

for all integers i, i′. These functions can be complex, but we will mostly stick to real examples. The limits of integration throughout this section will be taken as (−∞, ∞), but usually the functions of interest are nonzero only over some finite interval [T0, T1].

The best known such set of orthonormal functions are those in the Fourier series,

φ_n(t) = (1/√T) e^{i2πnt/T}   (C.94)

for the interval [−T/2, T/2]. We can then take any continuous square-integrable function x(·) over [−T/2, T/2] and represent it by

x(·) = Σ_i x_i φ_i(·),   where   x_i = ∫ x(t) φ_i*(t) dt.   (C.95)

For real functions, one can avoid the complex orthonormal functions by using sines and cosines in place of the complex exponentials.

The nature of this transformation is not due to the special nature of sinusoids, but only due to the fact that the function is being represented as a series of orthonormal functions. To see this, let {φ_i(·)} be any set of real orthonormal functions, and assume that a function x(·) can be represented as

x(·) = Σ_i x_i φ_i(·).   (C.96)

c Copyright Stefan M. Moser, version 3.2, 6 Dec. 2013

454

Stochastic Processes

Multiplying both sides of (C.96) by j () and integrating yields

x(t)j (t) dt =

xi i (t)j (t) dt.

(C.97)

Interchanging the order of integration and summation and using (C.93), we


get

x(t)j (t) dt = xj .

(C.98)

Thus, if a function can be represented by orthonormal functions as in (C.96),


then the coecients {xi } must be determined as in (C.98), which is the same
pair of relations as in (C.95). We can also represent the energy in x() in terms
of the coecients {xi }. Since

x2 (t) =

xi i (t)

for all t, we get

x2 (t) dt =

xj j (t)

(C.99)

x2 .
i

xi xj i (t)j (t) dt =
i

(C.100)
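As an illustration of (C.95)–(C.100) — not from the original notes; it assumes NumPy and replaces all integrals by Riemann sums on a grid — the following sketch builds a few real orthonormal functions on [−T/2, T/2], recovers the coefficients as in (C.98), and checks the energy identity (C.100) for a function inside their span.

```python
import numpy as np

T, n = 2.0, 4000
t = np.linspace(-T / 2, T / 2, n, endpoint=False)
dt = T / n

def ip(u, v):                         # inner product: integral of u(t) v(t) dt
    return np.sum(u * v) * dt

# Real orthonormal functions on [-T/2, T/2]: 1/sqrt(T), sqrt(2/T) cos, sqrt(2/T) sin
basis = [np.ones(n) / np.sqrt(T)]
for k in range(1, 4):
    basis.append(np.sqrt(2 / T) * np.cos(2 * np.pi * k * t / T))
    basis.append(np.sqrt(2 / T) * np.sin(2 * np.pi * k * t / T))

x = 0.7 * basis[1] - 1.3 * basis[4] + 0.2 * basis[6]   # a function in the span
coeffs = np.array([ip(x, phi) for phi in basis])       # x_j as in (C.98)

print("coefficients x_j :", np.round(coeffs, 3))
print("energy  ∫ x²(t)dt:", ip(x, x))
print("Σ_j x_j²         :", np.sum(coeffs ** 2))       # matches (C.100)
```

The printed energy and the sum of squared coefficients both come out as 0.7² + 1.3² + 0.2² = 2.22 up to numerical error.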

Next suppose x(·) is any real square-integrable function and {φi(·)} is an orthonormal set. Let \( x_i \triangleq \int x(t)\,\phi_i(t)\,dt \) and let

\[ \epsilon_k(t) \triangleq x(t) - \sum_{i=1}^{k} x_i\,\phi_i(t) \]  (C.101)

be the error when x(·) is represented by the first k of these orthonormal functions. First we show that εk(·) is orthogonal to φj(·) for 1 ≤ j ≤ k:

\[ \int \epsilon_k(t)\,\phi_j(t)\,dt = \int x(t)\,\phi_j(t)\,dt - \sum_{i=1}^{k} \int x_i\,\phi_i(t)\,\phi_j(t)\,dt \]  (C.102)
\[ = x_j - x_j = 0. \]  (C.103)

Viewing functions as vectors, εk(·) is the difference between x(·) and its projection on the linear subspace spanned by {φi(·)}, 1 ≤ i ≤ k. The integral of the squared error is given by

\[ \int x^2(t)\,dt = \int \Bigl(\epsilon_k(t) + \sum_{i=1}^{k} x_i\,\phi_i(t)\Bigr)^{2} dt \]  (C.104)
\[ = \int \epsilon_k^2(t)\,dt + \sum_{i=1}^{k}\sum_{j=1}^{k} x_i x_j \int \phi_i(t)\,\phi_j(t)\,dt \]  (C.105)
\[ = \int \epsilon_k^2(t)\,dt + \sum_{i=1}^{k} x_i^2, \]  (C.106)

where the cross terms between εk(·) and the φi(·) vanish because of (C.102)–(C.103). Since \( \epsilon_k^2(t) \ge 0 \), we have Bessel's inequality,

\[ \sum_{i=1}^{k} x_i^2 \le \int x^2(t)\,dt. \]  (C.107)

We see from (C.106) that \( \int \epsilon_k^2(t)\,dt \) is nonincreasing with k. Thus, in the limit k → ∞, either the energy in εk(·) approaches 0 or it approaches some positive constant. A set of orthonormal functions is called complete over some class C of functions if this error energy approaches 0 for all x(·) ∈ C. For example, the Fourier series set of functions in (C.94) is complete over the set of functions that are square integrable and zero outside of [−T/2, T/2]. There are many other countable sets of functions that are complete over this class of functions. Such sets are called complete orthonormal systems (CONS).

The question now arises whether a countable set of functions exists that is complete over the square-integrable functions in the interval (−∞, ∞). The answer is yes. One example starts with the Fourier series functions over [−T/2, T/2]. This set is extended by adding the time shifts of these functions, shifting by kT for each integer k. Note that the Fourier integral functions e^{i2πft} over all f ∈ R do not work here, since there are an uncountably infinite number of such functions and they cannot be made orthonormal (i.e., they each have infinite energy).
Parseval's relationship is another useful relationship among functions

\[ x(\cdot) = \sum_i x_i\,\phi_i(\cdot) \]  (C.108)

and

\[ y(\cdot) = \sum_i y_i\,\phi_i(\cdot). \]  (C.109)

This relationship states that

\[ \int x(t)\,y(t)\,dt = \sum_i x_i\,y_i. \]  (C.110)

To derive this, note that

\[ \int \bigl(x(t) - y(t)\bigr)^2\,dt = \sum_i (x_i - y_i)^2. \]  (C.111)

Squaring the terms inside the brackets and canceling out the terms in x² and y² (using (C.100) for x(·) and for y(·)), we get (C.110).
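A corresponding numerical sketch of (C.110) — again an illustration that is not part of the original notes, assuming NumPy, Riemann-sum integrals, and a real Fourier-type basis chosen purely for the example:

```python
import numpy as np

T, n = 2.0, 4000
t = np.linspace(-T / 2, T / 2, n, endpoint=False)
dt = T / n
ip = lambda u, v: np.sum(u * v) * dt          # integral of u(t) v(t) dt

basis = [np.ones(n) / np.sqrt(T)]
for k in range(1, 5):
    basis.append(np.sqrt(2 / T) * np.cos(2 * np.pi * k * t / T))
    basis.append(np.sqrt(2 / T) * np.sin(2 * np.pi * k * t / T))

x = 1.0 * basis[1] + 0.5 * basis[3]
y = 2.0 * basis[1] - 1.0 * basis[2]
xi = np.array([ip(x, phi) for phi in basis])  # coefficients of x
yi = np.array([ip(y, phi) for phi in basis])  # coefficients of y

print("∫ x(t) y(t) dt =", ip(x, y))           # ≈ 2.0: only the first basis function is shared
print("Σ_i x_i y_i    =", np.sum(xi * yi))
```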
These same relationships work for stochastic processes. That is, consider a segment of a stochastic process, X(t), T1 ≤ t ≤ T2. Assume that the covariance function

\[ K_{XX}(t,\tau) \triangleq \operatorname{Cov}[X(t), X(\tau)] \]  (C.112)

is continuous, and represent X(·) by an orthonormal set that is complete over finite-energy functions in the interval [T1, T2]. Then we have

\[ X(\cdot) = \sum_i X_i\,\phi_i(\cdot), \qquad X_i = \int X(t)\,\phi_i(t)\,dt. \]  (C.113)

The Karhunen–Loève expansion is an orthonormal expansion using a particular set of orthonormal functions. These orthonormal functions are chosen so that for some particular zero-mean stochastic process {X(t)}, T1 ≤ t ≤ T2, of interest, the random variables {Xi} in (C.113) are uncorrelated.

It will be easiest to understand this expansion if we first look at the related problem for a discrete-time, zero-mean stochastic process over a finite interval, say {X(1), X(2), ..., X(T)}. This is simply a set of T random variables with a covariance matrix KXX. We have seen that KXX has a set of T orthonormal eigenvectors qi, 1 ≤ i ≤ T. The corresponding eigenvalues are nonnegative, and they are positive if KXX is positive definite.
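This discrete-time case is easy to reproduce numerically. The following sketch is not from the original notes; it assumes NumPy and uses an arbitrary exponential covariance matrix as a stand-in for KXX. It eigendecomposes KXX and confirms by simulation that the coefficients q_i^T X are uncorrelated, with variances equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 6
idx = np.arange(T)
K = 0.8 ** np.abs(idx[:, None] - idx[None, :])   # K_XX(i,j) = 0.8^{|i-j|}, positive definite

lam, Q = np.linalg.eigh(K)                       # orthonormal eigenvectors, nonnegative eigenvalues
X = rng.multivariate_normal(np.zeros(T), K, size=200000)
coeffs = X @ Q                                   # X_i = q_i^T X for every sample path

emp_cov = coeffs.T @ coeffs / X.shape[0]
print("eigenvalues         :", np.round(lam, 3))
print("empirical cov diag  :", np.round(np.diag(emp_cov), 3))   # ≈ eigenvalues
print("max |off-diagonal|  :", np.max(np.abs(emp_cov - np.diag(np.diag(emp_cov)))))
```

The off-diagonal entries of the empirical covariance shrink toward zero as the number of samples grows, which is exactly the decorrelation property that the Karhunen–Loève expansion generalizes to continuous time.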
If we visualize sampling the finite-duration continuous-time process {X(t)}, T1 ≤ t ≤ T2, very finely, at intervals of width δ, then we have the same problem as the discrete-time case above. If we go to the continuous-time limit without worrying about mathematical details, then the eigenvector equation KXX q = λq changes into the integral eigenvalue equation

\[ \int_{T_1}^{T_2} K_{XX}(t,\tau)\,\phi(\tau)\,d\tau = \lambda\,\phi(t), \qquad T_1 \le t \le T_2. \]  (C.114)

The following theorem is a central result of functional analysis, and is proved, e.g., in [RSN90, p. 242].

Theorem C.19. Assume that KXX(·,·) is continuous and square integrable with a value greater than 0, i.e.,

\[ 0 < \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} K_{XX}^2(t,\tau)\,dt\,d\tau < \infty. \]  (C.115)

Then all solutions of (C.114) have nonnegative eigenvalues λ. There is at least one solution with eigenvalue λ > 0 and at most a finite number of orthonormal eigenvectors with the same positive eigenvalue. Also, there are enough eigenfunctions {φi(·)} with positive eigenvalues so that for any square-integrable continuous function x(·) over [T1, T2],

\[ x(\cdot) = \sum_i x_i\,\phi_i(\cdot) + h(\cdot), \]  (C.116)

where h(·) is an eigenfunction of eigenvalue 0.

Note that T1 can be taken to be −∞ and/or T2 can be taken to be +∞ in the above theorem. (Unfortunately, this cannot be done for a stationary process since KXX(·,·) would not be square integrable then.) What this theorem says is that the continuous-time case here is almost the same as the matrix case that we studied before. The only difference is that the number of eigenvalues can be countably infinite, and the number of orthonormal eigenvectors with eigenvalue 0 can be countably infinite.

The eigenvectors of different eigenvalues are orthonormal (by essentially the same argument as in the matrix case), so that what (C.116) says is that the set of orthonormal eigenvectors is complete over the square-integrable functions in the interval [T1, T2]. This indicates that there are many possible choices for complete orthonormal sets of functions.
There are a few examples where (C.114) can be solved with relative ease, but usually it is quite difficult to find solutions. This is not important, however, because the Karhunen–Loève expansion is usually used to solve detection and estimation problems abstractly in terms of this orthonormal expansion; the solution is then converted into something that is meaningful without actually knowing the concrete values of the expansion.

We now proceed to derive some of the properties of this expansion. So, we assume that KXX(·,·) is continuous and satisfies

\[ \int_{T_1}^{T_2} K_{XX}(t,\tau)\,\phi_i(\tau)\,d\tau = \lambda_i\,\phi_i(t), \qquad t \in [T_1, T_2], \]  (C.117)

and define the random variables {Xi} as

\[ X_i \triangleq \int_{T_1}^{T_2} X(t)\,\phi_i(t)\,dt. \]  (C.118)

To see that these random variables are uncorrelated, we note that

\[ \mathrm{E}[X_i X_j] = \mathrm{E}\biggl[\int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} X(t)\,X(\tau)\,\phi_i(t)\,\phi_j(\tau)\,dt\,d\tau\biggr] \]  (C.119)
\[ = \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} K_{XX}(t,\tau)\,\phi_i(t)\,\phi_j(\tau)\,dt\,d\tau \]  (C.120)
\[ = \int_{T_1}^{T_2} \lambda_i\,\phi_i(t)\,\phi_j(t)\,dt \]  (C.121)
\[ = \lambda_i\,\mathbb{I}\{i = j\}, \]  (C.122)

where in (C.121) we have used (C.117) and where the last equality (C.122) follows from orthonormality.
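A Monte Carlo sketch of (C.118)–(C.122) follows; it is an illustration that is not part of the original notes. Assuming NumPy, the integral equation (C.114) is discretized on a grid, with the Wiener-process kernel KXX(t, τ) = min(t, τ) on [0, 1] chosen purely as an example, and the empirical correlations of the resulting coefficients Xi are compared with λi I{i = j}.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
dt = 1.0 / n
t = dt * np.arange(1, n + 1)
K = np.minimum.outer(t, t)                  # K_XX(t_j, t_k) = min(t_j, t_k)

lam, U = np.linalg.eigh(K * dt)             # discretization of (C.114)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
phi = U / np.sqrt(dt)                       # eigenfunctions, normalized so ∫ φ_i² dt = 1

X = rng.multivariate_normal(np.zeros(n), K, size=50000, method="cholesky")
coeff = X @ phi * dt                        # X_i = ∫ X(t) φ_i(t) dt (Riemann sum)

emp = coeff.T @ coeff / X.shape[0]
print("largest eigenvalues λ_i   :", np.round(lam[:4], 4))
print("empirical Var[X_i]        :", np.round(np.diag(emp)[:4], 4))
print("largest |E[X_i X_j]|, i≠j :", np.max(np.abs(emp[:4, :4] - np.diag(np.diag(emp[:4, :4])))))
```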
Next we want to show that

\[ K_{XX}(t,\tau) = \sum_i \lambda_i\,\phi_i(t)\,\phi_i(\tau), \]  (C.123)

where equality holds almost everywhere. This is called Mercer's theorem. To show this, assume that the eigenvalues of (C.114) are arranged in descending order and consider approximating KXX(t, τ) by \( \sum_{i=1}^{k}\lambda_i\,\phi_i(t)\,\phi_i(\tau) \). Letting Δk be the integral of the squared error, we have

\[ \Delta_k = \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} \Bigl(K_{XX}(t,\tau) - \sum_{i=1}^{k}\lambda_i\,\phi_i(t)\,\phi_i(\tau)\Bigr)^{2} dt\,d\tau \]  (C.124)
\[ = \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} K_{XX}^2(t,\tau)\,dt\,d\tau - 2\sum_{i=1}^{k}\lambda_i \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} \phi_i(t)\,\phi_i(\tau)\,K_{XX}(t,\tau)\,dt\,d\tau + \sum_{i=1}^{k}\lambda_i^2 \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} \phi_i^2(t)\,\phi_i^2(\tau)\,dt\,d\tau \]  (C.125)
\[ = \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} K_{XX}^2(t,\tau)\,dt\,d\tau - 2\sum_{i=1}^{k}\lambda_i^2 + \sum_{i=1}^{k}\lambda_i^2 \underbrace{\int_{T_1}^{T_2}\phi_i^2(t)\,dt}_{=1}\,\underbrace{\int_{T_1}^{T_2}\phi_i^2(\tau)\,d\tau}_{=1} \]  (C.126)
\[ = \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} K_{XX}^2(t,\tau)\,dt\,d\tau - \sum_{i=1}^{k}\lambda_i^2, \]  (C.127)

where the cross terms for different indices in the expansion of the square vanish because of the orthonormality (C.93).

We see from (C.124) that Δk is nonnegative, and from (C.127) that Δk is nonincreasing with k. Thus Δk must reach a limit. We now argue that this limit must be 0. It can be shown that \( K_{XX}(t,\tau) - \sum_{i=1}^{k}\lambda_i\,\phi_i(t)\,\phi_i(\tau) \) is a covariance function for all k, including the limit k → ∞. If \( \lim_{k\to\infty}\Delta_k > 0 \), then this limiting covariance function has a positive eigenvalue and a corresponding eigenvector which is orthogonal to all the eigenvectors in the expansion \( \sum_{i=1}^{\infty}\lambda_i\,\phi_i(t)\,\phi_i(\tau) \). Since we have ordered the eigenvalues in decreasing order and \( \lim_{i\to\infty}\lambda_i = 0 \), this is a contradiction, so \( \lim_{k\to\infty}\Delta_k = 0 \). Since \( \lim_{k\to\infty}\Delta_k = 0 \), and since Δk is the integral of the squared difference between KXX(t, τ) and \( \sum_{i=1}^{k}\lambda_i\,\phi_i(t)\,\phi_i(\tau) \), we see that

\[ K_{XX}(t,\tau) = \sum_{i=1}^{\infty}\lambda_i\,\phi_i(t)\,\phi_i(\tau) \]  (C.128)

with equality almost everywhere. As an added bonus, we see from (C.127) that

\[ \int_{T_1}^{T_2}\!\!\int_{T_1}^{T_2} K_{XX}^2(t,\tau)\,dt\,d\tau = \sum_{i=1}^{\infty}\lambda_i^2. \]  (C.129)

Finally, we want to show that

\[ X(t) = \sum_i X_i\,\phi_i(t), \qquad t \in [T_1, T_2], \]  (C.130)

where the sum is over those i for which λi > 0 and with equality in the sense that the variance of the difference of the two sides of (C.130) is 0. To see this, note that

\[ \mathrm{E}\biggl[\int_{T_1}^{T_2}\Bigl(X(t) - \sum_i X_i\,\phi_i(t)\Bigr)^{2} dt\biggr] = \mathrm{E}\biggl[\int_{T_1}^{T_2} X^2(t)\,dt\biggr] - 2\sum_i \mathrm{E}\biggl[X_i \underbrace{\int_{T_1}^{T_2} X(t)\,\phi_i(t)\,dt}_{=\,X_i}\biggr] + \sum_i\sum_{i'} \underbrace{\mathrm{E}[X_i X_{i'}]}_{=\,\lambda_i\mathbb{I}\{i=i'\}}\,\underbrace{\int_{T_1}^{T_2}\phi_i(t)\,\phi_{i'}(t)\,dt}_{=\,\mathbb{I}\{i=i'\}} \]  (C.131)
\[ = \mathrm{E}\biggl[\int_{T_1}^{T_2} X^2(t)\,dt\biggr] - 2\sum_i \mathrm{E}\bigl[X_i^2\bigr] + \sum_i \lambda_i \]  (C.132)
\[ = \mathrm{E}\biggl[\int_{T_1}^{T_2} X^2(t)\,dt\biggr] - \sum_i \lambda_i. \]  (C.133)

We know that the left side of (C.131) is zero if the eigenvectors of zero eigenvalue are included, but (C.133) shows that the equality holds without the eigenvectors of zero eigenvalue. Since E[X²(t)] = KXX(t, t), this also verifies the useful identity

\[ \int_{T_1}^{T_2} K_{XX}(t,t)\,dt = \sum_{i=1}^{\infty}\lambda_i. \]  (C.134)
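Mercer's theorem (C.128) and the identities (C.129) and (C.134) can be checked on the same discretized example kernel. This is a sketch under the same assumptions as before (NumPy, Riemann sums, and the illustrative choice KXX(t, τ) = min(t, τ) on [0, 1]); it is not part of the original notes.

```python
import numpy as np

n = 300
dt = 1.0 / n
t = dt * np.arange(1, n + 1)
K = np.minimum.outer(t, t)

lam, U = np.linalg.eigh(K * dt)
phi = U / np.sqrt(dt)                      # discretized eigenfunctions of (C.114)

K_mercer = phi @ np.diag(lam) @ phi.T      # Σ_i λ_i φ_i(t) φ_i(τ), cf. (C.128)
print("max |K - Σ λ_i φ_i φ_i|   :", np.max(np.abs(K - K_mercer)))
print("∫ K(t,t) dt   vs  Σ λ_i   :", np.sum(np.diag(K)) * dt, np.sum(lam))       # (C.134)
print("∫∫ K²(t,τ)dtdτ vs Σ λ_i²  :", np.sum(K ** 2) * dt * dt, np.sum(lam ** 2)) # (C.129)
```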

These results have all assumed that KXX(·,·) is continuous. We now look at two examples, the first of which helps us to understand what happens if KXX(·,·) is not continuous, and the second of which helps us to understand a common engineering notion about degrees of freedom.

Example C.20 (Pathological Nonexistent Noise). Consider a Gaussian process {X(t)} where for each t ∈ R, X(t) ∼ N(0, 1). Assume also that

\[ \mathrm{E}[X(t)\,X(\tau)] = 0, \qquad t \ne \tau. \]  (C.135)

Thus KXX(t, τ) is 1 for t = τ and 0 otherwise. Since KXX(·,·) is not continuous, Theorem C.19 does not necessarily hold, but because KXX(·,·) is zero almost everywhere, any orthonormal set of functions can be used as eigenvectors of eigenvalue 0. The trouble occurs when we try to define the random variables \( X_i \triangleq \int_{T_1}^{T_2} X(t)\,\phi_i(t)\,dt \). Since the sample functions of {X(t)} are discontinuous everywhere, it is hard to make sense out of this integral. However, if we try to approximate this integral as

\[ Y_\delta = \delta \sum_{k=T_1/\delta}^{T_2/\delta} X(k\delta)\,\phi_i(k\delta), \]  (C.136)

then we see that

\[ Y_\delta \sim \mathcal{N}\Bigl(0,\ \delta^2 \sum_{k=T_1/\delta}^{T_2/\delta} \phi_i^2(k\delta)\Bigr). \]  (C.137)

This variance goes to zero as δ ↓ 0, so it is reasonable to take Xi = 0. This means that we cannot represent X(·) = Σi Xi φi(·). It also means that if we filter X(·), the filter output is best regarded as 0. Thus, in a sense, this kind of process cannot be observed. We can also view this process as white Gaussian noise in the limit as the power spectral density approaches 0.

The point of this example is to show how strange a stochastic process {X(t)} is when KXX(t, τ) is discontinuous at t = τ. It is not hard to show that if a WSS process has an autocovariance function KXX(·) that is continuous at 0, then it is continuous everywhere, so that in some sense the kind of phenomenon in this example is typical of discontinuous covariance functions.
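A tiny numerical sketch of (C.136)–(C.137) — not from the original notes; it assumes NumPy and picks, purely for illustration, the constant orthonormal function on [0, 1] as φi — shows the variance of Y_δ collapsing with δ:

```python
import numpy as np

rng = np.random.default_rng(3)
for delta in [0.1, 0.01, 0.001]:
    k = np.arange(0.0, 1.0, delta)
    phi = np.ones_like(k)                        # φ_i(t) = 1 on [0, 1], unit energy
    X = rng.standard_normal((2000, k.size))      # IID N(0,1): K_XX is 1 only on the diagonal
    Y = delta * X @ phi                          # Y_δ of (C.136)
    print(f"delta = {delta:6.3f}: empirical Var[Y_δ] = {np.var(Y):.5f}, "
          f"predicted δ²Σφ² = {delta**2 * np.sum(phi**2):.5f}")
```

Both columns shrink like δ, so in the limit the coefficient is 0, as argued above.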

Example C.21 (Prolate Spheroidal Wave Functions). We usually view the set of waveforms that are essentially timelimited to [−T/2, T/2] and essentially bandlimited to [0, W] as having about 2WT degrees of freedom if WT is large. For example, there are 2WT + 1 orthonormal sine and cosine waves with frequency f ≤ W in the Fourier series over [−T/2, T/2]. This notion of degrees of freedom is inherently imprecise since strictly timelimited functions cannot be strictly bandlimited and vice versa. Our objective in this example is to make this notion as close to precise as possible.

Suppose {X(t)} is a stationary process with power spectral density

\[ S_{XX}(f) = \begin{cases} 1 & |f| \le W, \\ 0 & \text{elsewhere.}\end{cases} \]  (C.138)

The corresponding autocovariance function is

\[ K_{XX}(\tau) = \frac{\sin(2\pi W\tau)}{\pi\tau}, \qquad \tau \in \mathbb{R}. \]  (C.139)

Consider truncating the sample functions of this process to [−T/2, T/2]. Before being truncated, these sample functions are bandlimited, but after truncation, they are no longer bandlimited. Let {λi, φi(·)} be the eigenvalues and eigenvectors of the integral equation (C.114) for KXX(·) given in (C.139) for −T/2 ≤ t, τ ≤ T/2. Each eigenvalue/eigenvector pair λi, φi(·) has the property that when φi(·) is passed through an ideal lowpass filter and then truncated again to [−T/2, T/2], the same function φi(·) appears, attenuated by λi (this is in fact what (C.114) says). With a little extra work, it can be seen that the output of this ideal lowpass filter before truncation has energy λi and after truncation has energy λi².

The eigenvectors φi(·) have an interesting extremal property (assuming that the eigenvalues are ordered in decreasing order). In particular, φi(·) is the unit-energy function over [−T/2, T/2] with the largest energy in the frequency band [0, W], subject to the constraint of being orthogonal to φj(·) for each j < i. In other words, the first i eigenfunctions are the orthogonal timelimited functions with the largest energies in the frequency band [0, W].

The functions φi(·) defined above are called prolate spheroidal wave functions and arise in many areas of mathematics, physics, and engineering. They are orthogonal, and also their truncated versions to [−T/2, T/2] are orthogonal. Finally, when WT is large, it turns out that λi is very close to 1 for i ≪ 2WT, λi is very close to 0 for i ≫ 2WT, and the transition region has a width proportional to ln(4WT). In other words, as WT gets large, there are about 2WT orthogonal waveforms that are timelimited to [−T/2, T/2] and approximately bandlimited to [0, W].
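The 2WT degrees-of-freedom statement can be observed numerically. The following sketch (not part of the original notes) assumes NumPy, discretizes the integral equation (C.114) with the sinc kernel (C.139) on [−T/2, T/2], and counts how many eigenvalues are close to 1; the computed spectrum is only an approximation of the true prolate spheroidal eigenvalues, and the parameter values are arbitrary.

```python
import numpy as np

W, T = 2.0, 10.0                         # so 2WT = 40
n = 800
dt = T / n
t = -T / 2 + dt * (np.arange(n) + 0.5)
diff = t[:, None] - t[None, :]
K = 2 * W * np.sinc(2 * W * diff)        # np.sinc(x) = sin(pi x)/(pi x), so this is sin(2πWd)/(πd)

lam = np.linalg.eigvalsh(K * dt)[::-1]   # eigenvalues in decreasing order
print("2WT                        =", 2 * W * T)
print("number of eigenvalues > 0.5 =", int(np.sum(lam > 0.5)))
print("λ_1 and λ_38 ... λ_43       =", np.round(np.r_[lam[0], lam[37:43]], 3))
```

The count of near-unity eigenvalues comes out close to 2WT = 40, with a sharp drop around index 2WT, matching the description above.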


Bibliography

[AC88] Paul H. Algoet and Thomas M. Cover, "A sandwich proof of the Shannon–McMillan–Breiman theorem," The Annals of Probability, vol. 16, no. 2, pp. 899–909, April 1988.

[BGT93] Claude Berrou, Alain Glavieux, and Punya Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proceedings IEEE International Conference on Communications (ICC), Geneva, Switzerland, May 23–26, 1993, pp. 1064–1070.

[BT02] Dimitri P. Bertsekas and John N. Tsitsiklis, Introduction to Probability. Belmont, MA, USA: Athena Scientific, 2002.

[CA05a] Po-Ning Chen and Fady Alajaji, "Lecture notes on information theory," vol. 1, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan, and Department of Mathematics & Statistics, Queen's University, Kingston, Canada, August 2005.

[CA05b] Po-Ning Chen and Fady Alajaji, "Lecture notes on information theory," vol. 2, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan, and Department of Mathematics & Statistics, Queen's University, Kingston, Canada, August 2005.

[CT06] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, 2nd ed. New York, NY, USA: John Wiley & Sons, 2006.

[DH76] Whitfield Diffie and Martin E. Hellman, "New directions in cryptography," IEEE Transactions on Information Theory, vol. 22, no. 6, pp. 644–654, November 1976.

[DM98] Matthew C. Davey and David J. C. MacKay, "Low-density parity check codes over GF(q)," IEEE Communications Letters, vol. 2, no. 6, pp. 165–167, June 1998.

[Eli75] Peter Elias, "Universal codeword sets and representations of the integers," IEEE Transactions on Information Theory, vol. 21, no. 2, pp. 194–203, March 1975.

[Fan49] Robert M. Fano, "The transmission of information," Research Laboratory of Electronics, Massachusetts Institute of Technology (MIT), Technical Report No. 65, March 17, 1949.

[Fei54] Amiel Feinstein, "A new basic theorem of information theory," IRE Transactions on Information Theory, vol. 4, no. 4, pp. 2–22, September 1954.

[Gal62] Robert G. Gallager, Low Density Parity Check Codes. Cambridge, MA, USA: MIT Press, 1962.

[Gal68] Robert G. Gallager, Information Theory and Reliable Communication. New York, NY, USA: John Wiley & Sons, 1968.

[Har28] Ralph Hartley, "Transmission of information," Bell System Technical Journal, vol. 7, no. 3, pp. 535–563, July 1928.

[HJ85] Roger A. Horn and Charles R. Johnson, Matrix Analysis. Cambridge, UK: Cambridge University Press, 1985.

[HY10] Siu-Wai Ho and Raymond W. Yeung, "The interplay between entropy and variational distance," IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 5906–5929, December 2010.

[Khi56] Aleksandr Y. Khinchin, "On the fundamental theorems of information theory" (Russian), Uspekhi Matematicheskikh Nauk XI, vol. 1, pp. 17–75, 1956.

[Khi57] Aleksandr Y. Khinchin, Mathematical Foundations of Information Theory. New York, NY, USA: Dover Publications, 1957.

[Lap09] Amos Lapidoth, A Foundation in Digital Communication. Cambridge, UK: Cambridge University Press, 2009.

[Mas96] James L. Massey, Applied Digital Information Theory I and II, Lecture notes, Signal and Information Processing Laboratory, ETH Zürich, 1995/1996. Available: http://www.isiweb.ee.ethz.ch/archive/massey_scr/

[MCck] Stefan M. Moser and Po-Ning Chen, A Student's Guide to Coding and Information Theory. Cambridge, UK: Cambridge University Press, January 2012. ISBN: 9781107015838 (hardcover) and 9781107601963 (paperback).

[Mer78] Ralph Merkle, "Secure communication over insecure channels," Communications of the Association for Computing Machinery (ACM), pp. 294–299, April 1978.

[MN96] David J. C. MacKay and Radford M. Neal, "Near Shannon limit performance of low density parity check codes," Electronics Letters, vol. 32, no. 18, pp. 1645–1646, August 1996; reprinted with printing errors corrected in vol. 33, no. 6, pp. 457–458.

[Mos05] Stefan M. Moser, Duality-Based Bounds on Channel Capacity, ser. ETH Series in Information Theory and its Applications. Konstanz, Germany: Hartung-Gorre Verlag, January 2005, vol. 1, edited by Amos Lapidoth. ISBN 3896499564. Available: http://moser.cm.nctu.edu.tw/publications.html

[Mos13] Stefan M. Moser, Advanced Topics in Information Theory (Lecture Notes), version 2, Signal and Information Processing Laboratory, ETH Zürich, and Department of Electrical & Computer Engineering, National Chiao Tung University (NCTU), 2013. Available: http://moser.cm.nctu.edu.tw/scripts.html

[PPV10] Yury Polyanskiy, H. Vincent Poor, and Sergio Verdú, "Channel coding rate in the finite blocklength regime," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.

[RSN90] Frigyes Riesz and Béla Sz.-Nagy, Functional Analysis. New York, NY, USA: Dover Publications, 1990.

[Say99] Jossy Sayir, "On coding by probability transformation," Ph.D. dissertation, ETH Zurich, 1999, Diss. ETH No. 13099. Available: http://e-collection.ethbib.ethz.ch/view/eth:23000

[SGB66] Claude E. Shannon, Robert G. Gallager, and Elwyn R. Berlekamp, "Lower bounds to error probability for coding on discrete memoryless channels," Information and Control, pp. 65–103, December 1966, part I.

[SGB67] Claude E. Shannon, Robert G. Gallager, and Elwyn R. Berlekamp, "Lower bounds to error probability for coding on discrete memoryless channels," Information and Control, pp. 522–552, May 1967, part II.

[Sha48] Claude E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379–423 and 623–656, July and October 1948.

[Sha49] Claude E. Shannon, "Communication theory of secrecy systems," Bell System Technical Journal, vol. 28, no. 4, pp. 656–715, October 1949.

[Str09] Gilbert Strang, Introduction to Linear Algebra, 4th ed. Wellesley, MA, USA: Wellesley-Cambridge Press, 2009.

[Tun67] Brian P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Institute of Technology, Atlanta, September 1967.

List of Figures

Dependency chart, xv

1.1 Two hats with four balls each, 4
1.2 Binary entropy function Hb(p), 9
1.3 Illustration of the IT inequality, 10
1.4 Diagram depicting mutual information and entropy, 21
1.5 Graphical proof of Jensen's inequality, 24
1.6 Example demonstrating how to maximize entropy, 30
1.7 Example demonstrating how to minimize entropy, 31

3.2 Basic data compression system for a single random message U, 45
3.3 Definition of a binary tree, 47
3.4 A binary tree with five codewords, 48
3.5 A ternary tree with six codewords, 49
3.6 A decision tree corresponding to a prefix-free code, 49
3.7 Examples of codes and its trees, 49
3.8 Extending a leaf in a ternary tree, 50
3.9 Illustration of the Kraft inequality, 52
3.10 An example of a binary tree with probabilities, 54
3.11 A binary tree with probabilities, 57
3.12 Graphical proof of the leaf entropy theorem, 58
3.13 A Shannon-type code for the message U, 62
3.14 Another binary prefix-free code for U, 63
3.17 Construction of a binary Fano code, 68
3.18 Construction of a ternary Fano code, 68
3.19 One possible Fano code for a random message, 69
3.20 A second possible Fano code for the random message, 69
3.21 Code performance and unused leaves, 78
3.22 Improving a code by removing an unused leaf, 78
3.23 Creation of a binary Huffman code, 80
3.24 Different Huffman codes for the same random message, 82
3.25 More than D − 2 unused leaves in a D-ary tree, 83
3.26 Creation of a ternary Huffman code, 85
3.27 Wrong application of Huffman's algorithm, 85
3.28 Construction of a binary Fano code, 86
3.29 A Fano code for the message U, 87
3.31 A binary Huffman code for a message U, 88
3.33 Set of all codes, 90

4.1 A coding scheme for an information source, 96
4.2 The codewords of an arithmetic code are prefix-free, 103
4.3 Variable-length–to–block coding of a DMS, 108
4.4 Two examples of message sets, 110
4.5 A proper message set for a quaternary source, 111
4.6 A proper message set for a DMS, 112
4.7 Example of a Tunstall message set for a BMS, 114
4.8 Tunstall message set for a BMS, 117

5.1 A series of processing black boxes, 122
5.2 State transition diagram of a binary Markov source, 124
5.3 Examples of irreducible and/or periodic Markov processes, 127
5.5 Transition diagram of a source with memory, 129

6.1 A general compression scheme for a source with memory, 135
6.2 A first idea for a universal compression system, 139
6.9 An Elias–Willems compression scheme, 144

7.1 Examples of convex and nonconvex regions in R², 166
7.2 Example of a concave function, 168
7.3 Maximum is achieved on the boundary of the region, 169
7.4 Slope paradox: maximum on boundary of region, 174

9.1 Most general system model in information theory, 197
9.2 Binary symmetric channel (BSC), 200
9.3 Binary erasure channel (BEC), 201
9.4 Simplified discrete-time system model, 201
9.5 Decoding regions, 208
9.6 Binary erasure channel (BEC), 211
9.7 A Markov chain, 214
9.8 Independence of Y and X(m) for m ≥ 2, 222

10.1 The BEC is uniformly dispersive, 230
10.2 The uniform inverse BEC is uniformly focusing, 231
10.3 Example of a strongly symmetric DMC, 232
10.4 The BEC is split into two strongly symmetric DMCs, 234
10.5 A weakly symmetric DMC, 237
10.6 Two strongly symmetric subchannels, 237
10.7 Mutual information is concave in the input, 239
10.8 Mutual information is convex in the channel law, 240

11.1 Convolutional encoder, 246
11.2 Tree code, 247
11.3 Trellis code, 248
11.4 Binary symmetric channel (BSC), 249
11.5 Decoding of trellis code, 250
11.6 Binary symmetric erasure channel (BSEC), 251
11.9 Viterbi decoding of convolutional code over BSEC, 253
11.10 Detour in a trellis, 254
11.11 State-transition diagram, 256
11.12 Signalflowgraph of the state-transition diagram, 257
11.13 Example of a signalflowgraph, 258
11.14 Another example of a signalflowgraph, 260
11.15 Example of detours, 263

12.1 Gallager and Bhattacharyya exponents, 279
12.2 Understanding the properties of the Gallager exponent, 280
12.3 Upper and lower bound to the channel reliability function, 282
12.4 Pathological cases for E_SP(R), 282
12.5 Two extremal shapes of the Gallager exponent, 284

13.1 Joint source channel coding, 285
13.2 Encoder for information transmission system, 288
13.3 Lossy compression added to joint source channel coding, 294

14.1 CDF of noncontinuous RV, 303

15.1 The n-dimensional space of all possible received vectors, 319
15.2 Joint source channel coding, 326
15.3 Shannon limit for the Gaussian channel, 328

16.1 Sampling and replicating frequency spectrum, 338

17.1 Waterfilling solution, 347
17.2 Power spectral density of colored noise, 351

18.1 A binary block–to–variable-length source compression scheme, 361

19.1 System model of a cryptographic system, 380
19.2 Perfect secrecy: one-time pad, 382
19.3 The key equivocation function, 386
19.4 Secret messages without key exchange, 388
19.5 System model of a public-key cryptosystem: secrecy, 391
19.6 System model of a public-key cryptosystem: authenticity, 392

A.1 The standard Gaussian probability density function, 396
A.2 The Gaussian probability density function, 399
A.3 Graphical interpretation of the Q-function, 401
A.4 Computing 1 − Q(·), 403
A.5 Bounds on Q(·), 404

B.1 Contour plot of joint Gaussian density, 436
B.2 Mesh plot of joint Gaussian density, 437

C.1 Filtered stochastic process, 449

List of Tables

3.1 Binary phone numbers for a telephone system, 43
3.15 The binary Shannon code for a random message U, 65
3.16 The ternary Shannon code for a random message U, 65
3.30 The binary Shannon code for a random message U, 88
3.32 Various codes for a random message with four possible values, 89

4.9 A Tunstall code, 118

5.4 Convergence of Markov source to steady-state distribution, 128

6.3 Example of a recency rank list, 140
6.4 Updated recency rank list, 140
6.5 Example of a default starting recency rank list, 141
6.6 Standard code, 141
6.7 First Elias code, 142
6.8 Second Elias code, 143

8.1 Derivation of optimal gambling for a subfair game, 191

11.7 Viterbi metric for a BSEC, unscaled, 252
11.8 Viterbi metric for a BSEC, 252

18.2 An example of a joint probability distribution, 370

Index
Italic entries are to names.
Symbols
| · |, 11
⟨· , ·⟩, 451
$\stackrel{\mathcal{L}}{=}$, 441
$A_\epsilon^{(n)}$, 357, 364, 374
C, 226
D, 17, 307
$E_b$, 327, 342
$E_b/N_0$, 327, 342
$E_s$, 314
H, 6
h, 302
$H_b$, 8
I, 19
I{·}, 206
$N_0$, 327
Q, 315, 400
$R_x(y)$, 83
V, 19

alphabet, 37
arithmetic coding, 66, 98
decoding, 106
encoding, 102
asymptotic equipartition property,
158, 356, 363
for continuous random
variables, 374
joint, 366, 376
attack
chosen-plaintext, 381, 387
ciphertext-only, 381
known-plaintext, 381, 387
authenticity, 379
autocorrelation function, 442
autocovariance function, 351, 442
properties, 442
average, see expectation
AWGN, see additive white
Gaussian noise under
noise
B
bandwidth, 332, 334, 341
bandwidth eciency, 342
BEC, see binary erasure channel
under channel
BER, see bit error rate
Berlekamp, Elwyn R., 281
Berrou, Claude, 329
Bertsekas, Dimitri P., 126
Bessels inequality, 455
betting

A
achievability part, 59, 278
ADSL, 352
AEP, see asymptotic equipartition
property
Alajaji, Fady, xiii
Algoet, Paul H., 364
aliasing, 338

bookies strategy, 183
dependent races, 193
doubling rate, 180, 185, 192
increase, 193
Dutch book, 184
expected return, 178
fair odds, 183
growth factor, 178, 185, 191
odds, 177
optimal, 181, 189, 192, 193
proportional, 181, 193
risk-free, 183
strategy, 177
subfair odds, 184
superfair odds, 184
uniform fair odds, 184
conservation theorem, 185
with side-information, 191
Bhattacharyya bound, 210
union, 269
Bhattacharyya distance, 210, 275
Bhattacharyya exponent, 274
binary digit, 7
bit, 3, 7
codeword, 245
information, 245, 295
bit error rate, 253, 294, 327
average, 262, 266
lower bound, 298, 328
minimum, 296
blocklength, 201
Breiman, Leo, 363
BSC, see binary symmetric
channel under channel
BSEC, see binary symmetric
erasure channel under
channel
C
capacity, 226
information, 204, 315
KarushKuhnTucker
conditions, 241
of AWGN channel, 341

of colored Gaussian channel,
352
of DMC, 225
of Gaussian channel, 313, 316
of parallel Gaussian channel
dependent, 348
independent, 346
of strongly symmetric DMC,
232
of weakly symmetric DMC,
235
operational, 212, 317
CDF, see cumulative distribution
function
central limit theorem, 395
Cesàro mean, 130
chain rule, 16, 22, 307
channel, 198
additive white Gaussian noise,
332
AWGN, see additive white
Gaussian noise channel
under channel
binary erasure, 200, 211, 230,
233, 266
inverse, 231
binary symmetric, 200, 232,
243, 249
binary symmetric erasure, 251
capacity, 226
conditional probability
distribution, 199
decoder, 201
discrete memoryless, 199
discrete-time, 198
encoder, 201
Gaussian, 313
with memory, 351
input alphabet, 199
input constraints, 314
law, 199
output alphabet, 199
parallel Gaussian
independent, 343
strongly symmetric, 232
uniformly dispersive, 229

uniformly focusing, 230
weakly symmetric, 233
test algorithm, 234
channel coding, 197
main objective, 206
channel coding theorem, 215, 225
channel reliability function, 280
characteristic function, 406, 418
Chen, Po-Ning, xi, xiii
ciphertext, 380
code, 43, 201
adaptive Human, 136
arithmetic, 98
decoding, 106
encoding, 102
block, 247
blocktovariable-length, 96
convolutional, 245
eciency, 120
EliasWillems, 144
for positive integers
rst Elias code, 142
second Elias code, 142
standard code, 141
Hamming, 295
Human, 76
binary, 79
D-ary, 84
instantaneous, 46
LDPC, see low-density
parity-check code under
code
LempelZiv
analysis, 150
asymptotic behavior, 163
LZ-77, see sliding window
LZ-78, see tree-structured
sliding window, 146
tree-structured, 149
low-density parity-check, 329
LZ-77, see LempelZiv under
code
LZ-78, see LempelZiv under
code
prex-free, 44, 46, 89
quality criterion, 108

rate, 205
Shannon, 64, 74
Shannon-type, 61, 74
wrong design, 75
singular, 89
three-times repetition, 202
tree, 247
trellis, see convolutional code
turbo, 329
typicality, 361
uniquely decodable, 44, 89, 90
universal, 138, 146
variable-lengthtoblock, 107
variable-lengthtovariablelength,
119
codebook, 201, 204, 205
codeword, 43, 45, 201
average length, 44
code-matrix, 343
coding scheme, 45, 204, 205
eciency, 120
for Gaussian channel, 317
joint source channel, 285
rate, 205, 317
coding theorem
channel, 215
error exponent, 283
for a DMC, 225
converse, 217
for a DMS
blocktovariable-length, 97
converse, 113
variable-lengthtoblock,
118
for a DSS, 138
blocktovariable-length,
146
EliasWillems, 146
for a single random message,
74
converse, 60
for a source satisfying the
AEP, 365
for parallel Gaussian channels,
344

for the AWGN channel, 341
for the Gaussian channel, 318
information transmission
theorem, 290
joint source and channel, 290
source, 97, 215
source channel coding
separation, 291
complete orthonormal system,
334, 455
concave function, 167
concavity, see convexity
CONS, see complete orthonormal
system
convergence, 355
almost surely, 355
in distribution, 355
in probability, 39, 157, 355
with probability one, 355
converse part, 59, 281
convexity, 23, 167, 297
of a region, 166
of mutual information, 238
convolutional code, 245
decoder, 248
encoder, 245
covariance matrix, 415
properties, 416
Cover, Thomas M., xi, xiii, 75,
150, 204, 232, 236, 364
critical rate, 279
cryptography, 197, 379
attack, see attack
basic assumptions, 382
Kerckhos hypothesis, 381
public-key, see public-key
cryptography
system model, 380
cumulative distribution function,
303, 399, 440
cut-o rate, 273, 274, 283
D
data compression, 45, 135, 197,
377

optimal, 198, 287
data processing inequality, 215
data transmission, 197, 377, see
also information
transmission system
above capacity, 293
delay, 212
reliable, 212
Davey, Matthew C., 329
decoder, 201
convolutional code, see trellis
decoder under decoder
MAP, 207
ML, 207, 248
threshold, 220
trellis code, 248
typicality, 370
decoding region, 208
delay, 98, 212, 226
dierential entropy, 302
conditional, 306
conditioning reduces entropy,
308
joint, 306
maximum entropy
distribution, 311
of Gaussian vector, 310
properties, 306
Diffie, Whitfield, 387
Dirac delta, 303, 304, 336, 337
discrete memoryless channel, 199,
285
capacity, 225
strongly symmetric, 232
uniformly dispersive, 229
uniformly focusing, 230
weakly symmetric, 233
test algorithm, 234
without feedback, 202
discrete memoryless source, 95
discrete multitone, 352
discrete stationary source, 122,
135, 285
encoding, 136
entropy rate, 130
discrete time, 198

distribution, see probability
distribution
multivariate, 409
univariate, 409
DMC, see discrete memoryless
channel
DMS, see discrete memoryless
source
doubling rate, 180, 183, 185, 192
fair odds, 183
increase, 193
DPI, see data processing
inequality
DSS, see discrete stationary source
E
ebno, 342
eciency, 120
eigenvalue decomposition, 349
Elias code, 142
Elias, Peter, 75, 142
encoder
channel, 201
convolutional, 245
signalowgraph, see
signalowgraph
state-transition diagram,
256
rate, 245
source, 45, 96
trellis code, see convolutional
encoder under encoder
energy
noise, 327
per information bit, 327, 342
per symbol, 314
entropy, 6, 129
binary entropy function, 8
conditional
conditioned on event, 12
conditioned on RV, 12
conditioning reduces entropy,
13, 308
dierential, see dierential
entropy

joint, 8
maximum entropy
distribution, 28, 153, 311
minimum entropy
distribution, 29, 32
more than two RVs, 15
of continuous RV, 302, see
also dierential entropy
properties, 10, 12, 27, 28, 32
relative, see relative entropy
unit, 7
entropy rate, 129, 130
properties, 130
ergodicity, 144
error exponent
Bhattacharyya, 274
Gallager, 278
sphere-packing, 281
error function, 400
error probability, 205, 211, 268
average, 206
Bhattacharyya bound, 210
bit, see bit error rate
channel reliability function,
280
Gallager bound, 270
maximum, 206
union Bhattacharyya bound,
269
worst case, 206
Euclids division theorem, 83, 116
event, 35
atomic, 36
certain, 35
impossible, 35
expectation, 38, 41, 414
conditional, 39
total, 39
expected return, 178
F
Fanos inequality, 213, 214, 324
for bit errors, 299
Fano, Robert M., 67, 76, 213, 214
feedback, 202

Feinstein, Amiel, xiii
FFT, see fast Fourier transform
under Fourier transform
fish, see under friends, not under food
Fourier series, 334
Fourier transform, 335, 444
fast, 352
friends, see Gaussian distribution
G
Gallager bound, 270
Gallager exponent, 278
Gallager function, 277, 278, 281
Gallager, Robert G., xiii, 277, 278,
280, 281, 329
gambling, 177
Gaussian distribution, 305, 395
Q-function, 315, 400
properties, 403, 404
centered Gaussian RV, 397
centered Gaussian vector, 420
characteristic functions, 406
Gaussian process, 443
Gaussian RV, 397
Gaussian vector, 420
PDF, 432
standard Gaussian RV, 395
standard Gaussian vector, 419
white, 451
Glavieux, Alain, 329
growth factor, 178, 185, 191
H
Hadamards inequality, 350
Hamming code, 295
Hamming distance, 248
Hamming weight, 248
Hartley, 7
Hartley information, 2
Hartley, Ralph, 2
Hellman, Martin E., 387
Ho, Siu-Wai, 27–29, 31
Horn, Roger A., 349

horse betting, 177
Huffman code, 76
binary, 79
D-ary, 84
Huffman's algorithm
for D-ary codes, 84
for binary codes, 79
Huffman, David A., 76
I
IID, see independent and
identically distributed
under independence
independence, 37, 41
independent and identically
distributed, 95
indicator function, 206
information, 1, 7, see also mutual
information
information theory inequality, 9
information transmission system,
197, 285, see also data
transmission
information transmission theorem,
290
inner product, 451
orthonormal, 452, 453
IT inequality, 9
J
Jensens inequality, 23, 168
Johnson, Charles R., 349
K
Karhunen–Loève expansion, 456
Karush–Kuhn–Tucker conditions,
171
for betting, 186
for capacity, 241
for parallel Gaussian channels,
346
Kerckhos hypothesis, 381
Kerckho, Auguste, 381

key equivocation function, 384,
386
Khinchin, Aleksandr Y., 25, 26
KKT conditions, see
KarushKuhnTucker
conditions
Koch, Tobias, xiii
Kraft inequality, 51
Kroupa, Tomas, 66
Kullback, Solomon, 17
KullbackLeibler divergence, see
relative entropy
L
Lagrange multiplier, 171
Lapidoth, Amos, xiii, 309, 331,
332, 335, 339, 439, 443,
445, 448, 449
law of large numbers, 40, 109, 157,
194, 223, 288, 356
leaf counting lemma, 50
leaf depth lemma, 50
leaf entropy theorem, 57
Leibler, Richard, 17
LempelZiv code
LZ-77, see sliding window
under LempelZiv code
LZ-78, see tree-structured
under LempelZiv code
sliding window, 146
tree-structured, 149
analysis, 150
asymptotic behavior, 163
linear lter, 449
linear functional, 445
LZ-77, see under LempelZiv code
LZ-78, see under LempelZiv code
M
MacKay, David, 329
MAP, see maximum a posteriori
Markov chain, 214
Markov process
balance equations, 126
convergence, 127

enlarging alphabet, 123
entropy rate, 132
homogeneous, 123
irreducible, 126
normalization equation, 126
of memory , 123
periodic, 126
specication, 125
state diagram, 124
stationary, 127, 128
steady-state distribution, 125
time-invariant, 123, 128
transition probability matrix,
124
Masons rule, 259
Massey, James L., xiii, 9
matrix
orthogonal, 349, 411
positive denite, 411
positive semi-denite, 411
symmetric, 409
maximum a posteriori, 207
maximum likelihood, 207
McMillans theorem, 90
McMillan, Brockway, 90, 363
Mercers theorem, 457
Merkle, Ralph, 387
message encoder, 45, 96
message set, 108
legal, 110, 119
optimal, 115
proper, 111
quasi-proper, 119
Tunstall, 113
metric, 250
for a BSEC, 252
ML, see maximum likelihood
moment generating function, 407
Moser, Stefan M., xi, xii, xiv, 298,
306, 353
mutual information, 19, 307
conditional, 22
convexity, 238
instantaneous, 219
properties, 21, 308
self-information, 7, 21

N
nat, 7
Neal, Radford, 329
noise
additive, 313, 332
additive colored Gaussian, 351
additive white Gaussian, 332,
451
energy, 327
Gaussian, 313
normal distribution, see Gaussian
distribution
Nyquist, Harry T., 334
O
odds, 177
fair, 183
subfair, 184
superfair, 184
uniform fair, 184
OFDM, see orthogonal frequency
division multiplexing
one-time pad, 382
one-way function, 389
orthogonal frequency division
multiplexing, 352
orthonormality, 452, 453
complete, 455
expansion, 456
P
packing, 295
parser, 96
block, 96, 135, 288
Tunstall, 116
variable-length, 107
Parsevals theorem, 446, 455
parsing, 150
distinct, 150
path length lemma, 55
PDF, see probability density
function
perfect packing, 295
plaintext, 380

PMF, see probability mass
function
Polyanskiy, Yury, xiii
Poor, H. Vincent, xiii
positive denite function, 442
positive denite matrix, 411
properties, 411
positive semi-denite matrix, 411
properties, 411
power constraint
average, 314
total average, 343
power spectral density, 332, 351,
444
white, 332
prex-free code, 44, 46, 89
probability density function, 40,
301
conditional, 41
joint, 41
probability distribution, see
probability mass function
or probability density
function
probability mass function, 36, 301
conditional, 38
joint, 37
marginal, 37
probability measure, 35
probability vector, 166
region of, 167
prolate spheroidal function, 339,
460
proportional betting, 181, 193
PSD, see power spectral density
public-key cryptography, 388
one-way function, 389
private key, 390, 391
public key, 390, 391
RSA, 392
squaring approach, 389
system model
authenticity, 392
secrecy, 391
trapdoor one-way function,
390

trusted public database, 390
R
random coding, 218, 271
random message, 45
random process, see stochastic
process
random variable
continuous, 40, 301
discrete, 36
Gaussian, see under Gaussian
distribution
indicator, 213
random vector, 414
Gaussian, see under Gaussian
distribution
randomization
private, 380
public, 380, 386
rate, 205, 211, 293, 317
achievable, 212, 317
critical, 279
cut-o, 273, 274, 283
rate distortion theory, 120, 298
recency rank calculator, 140
relative entropy, 17, 307
reliability function, see channel
reliability function
remainder, 83
Riesz, Frigyes, 456
root, see tree or trellis
RSA, 392
RV, see random variable
S
sample space, 35
sampling theorem, 334
Sayir, Jossy, 107
secrecy, 379
computationally secure, 387
imperfect, 384
perfect, 382
one-time pad, 382
properties, 383
secret key, 380

self-information, see under mutual
information
self-similarity function, 442
separation theorem, 291, 377
set volume, 374
Shah-function, 337
Shannon code, 64, 74
Shannon limit, 328
Shannon, Claude E., 4, 25, 63, 75,
76, 211, 219, 267, 281,
298, 328, 334, 363, 380
Shannon–McMillan–Breiman Theorem, 363
Shannon-type code, 61, 74
wrong design, 75
signal-to-noise ratio, 327, 328, 342
signalowgraph, 257
cofactor, 259
determinant, 259
loop, 257
Masons rule, 259
open path, 257
path gain, 258
transmission gain, 258
signalowgraphs, 257
sinc-function, 334
SNR, see signal-to-noise ratio
source coding, 197
source coding theorem, 97, 215
source parser, see parser
sphere-packing exponent, 281
SSS, see strict-sense under
stationarity
state-transition diagram, 256
stationarity, 121, 441
strict-sense, 121, 441
weak, 122, 441
wide-sense, 122, 441
stochastic process, 121, 439
stationary, see stationarity
Strang, Gilbert, 410
support, 6
system model, 197, 201
Sz.-Nagy, Béla, 456

T
Thitimajshima, Punya, 329
Thomas, Joy A., xi, xiii, 75, 150,
204, 232, 236, 364
toor, see trellis
total expectation, 39
TPD, see trusted public database
under public-key
cryptography
transmission, see data
transmission
transmission gain, see under
signalowgraph
trapdoor one-way function, 390
RSA, 392
tree, 47
branching entropy, 56, 57
extended root, 50
extending a leaf, 50
leaf, 47
leaf entropy, 56, 57
node, 47
root, 47
unused leaf, 48, 51, 77, 81, 83
trellis, 247
detour, 254
number of detours, 254
root, 248
toor, 248
trellis code, see convolutional code
Tsitsiklis, John N., 126
Tunstall lemma, 114
Tunstalls algorithm, 116
Tunstall, Brian P., 119
typical sequence, 354

jointly, 367, 376
typical set, 357, 364, 374
high-probability set, 359
properties, 357, 364, 367, 375,
376
volume, 374
typicality, 353
U
uncertainty, see entropy
unicity distance, 385, 386
uniform distribution, 302
union Bhattacharyya bound, 269
union bound, 221
universal compression, 120
V
variational distance, 19, 27, 31
Verdú, Sergio, xiii
Viterbi algorithm, 250
volume, 374
W
waterfilling, 347
WSS, see wide-sense under
stationarity
Y
Yeung, Raymond W., 27–29, 31
Z
Ziv's inequality, 159
