
Rate distortion theory

An introduction
Paulo J S G Ferreira
SPL Signal Processing Lab / IEETA
Univ. Aveiro, Portugal

April 23, 2010

Paulo J S G Ferreira (SPL)

Rate distortion

April 23, 2010

1 / 80

Contents

Preliminaries

Weak and strong typicality

Rate distortion function

Rate distortion theorem

Computation algorithms



Entropy

The entropy of a discrete random variable X with probability mass function p(x) is

    H(X) = − Σ_x p(x) log p(x)

Base-two logarithms are tacitly used throughout the lecture.
The term −log p(x) is associated with the uncertainty of X.
The entropy is the average uncertainty:

    H(X) = E_{p(x)} log (1/p(x))

Exercise: show that 0 ≤ H(X) ≤ log n and identify the distributions for which the bounds are attained.
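As a quick numerical illustration (a Python sketch, not part of the original slides), the definition and the bounds of the exercise can be checked directly:

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)) + 0.0)  # + 0.0 normalises -0.0

# The bounds 0 <= H(X) <= log2(n) of the exercise:
print(entropy([1, 0, 0, 0]))   # 0.0: a degenerate distribution has no uncertainty
print(entropy([0.25] * 4))     # 2.0: the uniform distribution attains log2(4)
```

The degenerate and uniform distributions attain the lower and upper bounds, respectively.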

Relative entropy

The relative entropy measures the distance between two probability distributions:

    D(p||q) = Σ_x p(x) log (p(x)/q(x)) = E_{p(x)} log (p(x)/q(x))

It is nonnegative, and zero if and only if p = q.
It is not symmetric and it does not satisfy the triangle inequality; hence, it is not a metric.
Sometimes called the Kullback-Leibler distance.
Exercise: show that D(p||q) ≥ 0.
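A small Python sketch (not from the slides) makes the asymmetry and the nonnegativity concrete:

```python
import numpy as np

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q), kl(q, p))   # two different values: D(p||q) is not symmetric
print(kl(p, p))             # 0.0: zero if and only if p = q
```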


Conditional entropy

H(X|Y) is the expected value of the entropies of the conditional distributions p(x|y), averaged over the conditioning variable Y:

    H(X|Y) = Σ_y p(y) H(X|Y = y)
           = − Σ_y p(y) Σ_x p(x|y) log p(x|y)
           = − Σ_{x,y} p(y) p(x|y) log p(x|y)
           = − Σ_{x,y} p(x, y) log p(x|y)
           = − E_{p(x,y)} log p(x|y)
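The last line of the derivation translates directly into code; this is a Python sketch (not from the slides) that computes H(X|Y) from a joint table:

```python
import numpy as np

def cond_entropy(pxy):
    """H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), computed from a joint table pxy[x, y]."""
    pxy = np.asarray(pxy, float)
    py = pxy.sum(axis=0)                 # marginal p(y)
    pxgy = pxy / py[None, :]             # p(x|y) = p(x,y) / p(y)
    mask = pxy > 0                       # restrict to nonzero entries: 0 log 0 = 0
    return float(-np.sum(pxy[mask] * np.log2(pxgy[mask])))

# X independent of Y: knowing Y leaves all the uncertainty, H(X|Y) = H(X) = 1 bit
print(cond_entropy(np.outer([0.5, 0.5], [0.5, 0.5])))
# X determined by Y: H(X|Y) = 0
print(cond_entropy([[0.5, 0.0], [0.0, 0.5]]))
```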


Mutual information

It is the Kullback-Leibler distance between the joint distribution p(x, y) and the product p(x)p(y):

    I(X, Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x)p(y)) ] = E_{p(x,y)} log [ p(x, y) / (p(x)p(y)) ]

Meaning:
How far X and Y are from being independent
The information that X contains about Y

Exercise: show that I(X, Y) ≥ 0.
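Both meanings can be checked numerically; this is a Python sketch (not part of the slides):

```python
import numpy as np

def mutual_information(pxy):
    """I(X, Y) = D(p(x,y) || p(x)p(y)) in bits, from a joint table pxy[x, y]."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

# Independent variables: I(X, Y) is (numerically) zero
print(mutual_information(np.outer([0.3, 0.7], [0.6, 0.4])))
# Perfectly correlated bits: I(X, Y) = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
```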


Mutual information and entropy

    I(X, Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x)p(y)) ]
            = Σ_{x,y} p(x, y) log [ p(x|y)p(y) / (p(x)p(y)) ]
            = Σ_{x,y} p(x, y) log [ p(x|y) / p(x) ]
            = − Σ_{x,y} p(x, y) log p(x) + Σ_{x,y} p(x, y) log p(x|y)
            = − Σ_x p(x) log p(x) + Σ_{x,y} p(x, y) log p(x|y)
            = H(X) − H(X|Y)

Meaning: it measures the reduction in the uncertainty of X due to knowing Y.

Conditional entropy and entropy

We have seen that

    I(X, Y) = H(X) − H(X|Y)

We have also seen that I(X, Y) is nonnegative.
It follows that

    H(X|Y) ≤ H(X)

Meaning: on the average, conditioning on more information does not increase the uncertainty.


Chain rules

The simplest is

    H(X, Y) = H(X) + H(Y|X)

Compare with p(x, y) = p(y|x) p(x).
By repeated application,

    H(X, Y, Z) = H(X, Y) + H(Z|X, Y) = H(X) + H(Y|X) + H(Z|X, Y)

The general case should be apparent:

    H(X₁, X₂, …, Xₙ) = Σᵢ H(Xᵢ | X₁, …, Xᵢ₋₁)

Differential entropy

The continuous case is mathematically much more subtle.
The differential entropy of X with probability density f(x) is

    h(X) = − ∫ f(x) log f(x) dx

Log: base e; unit: nats.
Example: the differential entropy of a uniform random variable on (a, b) is

    h(X) = − ∫_a^b (1/(b−a)) log (1/(b−a)) dx = log (b − a)

It can be negative!

Gaussian variable

The differential entropy of a Gaussian variable with density

    g(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

is

    h(X) = − ∫ g(x) log g(x) dx
         = − ∫ g(x) [ −log √(2πσ²) − ((x − μ)²/(2σ²)) log e ] dx
         = (1/2) log 2πσ² + (1/2) log e          (using E[(X − μ)²] = σ²)
         = (1/2) log 2πeσ²

Gaussian variable

The differential entropy of a Gaussian variable is maximal among all densities with the same variance.
The proof depends on

    ∫ f(x) log [ f(x)/g(x) ] dx ≥ 0

Recall the relative entropy / Kullback-Leibler distance to see why.



Main ideas

Typicality: describe typical sequences


Forms of the asymptotic equipartition property
They are to information theory what the law of large
numbers is to probability
In fact, they depend on the law of large numbers
Weak typicality requires that empirical entropy
approaches true entropy
Strong typicality requires that the relative frequency of
each outcome approaches the probability


Weak typicality

Let Xⁿ be an i.i.d. sequence X₁, X₂, …, Xₙ.
It follows that

    p(Xⁿ) = p(X₁) p(X₂) ⋯ p(Xₙ)

Given ε > 0, the weakly typical set is formed by the sequences such that

    | −(1/n) log p(xⁿ) − H(X) | ≤ ε

The term −(1/n) log p(xⁿ) is the empirical entropy.
The probability of the typical set approaches 1 as n → ∞.
This does not mean that most sequences are typical, but rather that the non-typical sequences have small probability.
The most likely sequence may not be weakly typical.
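The convergence of the empirical entropy is easy to observe numerically; a Python sketch (the alphabet and pmf below are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.8])                     # a two-symbol source
H = float(-np.sum(p * np.log2(p)))           # true entropy, about 0.722 bits

for n in (100, 10_000, 1_000_000):
    x = rng.choice(2, size=n, p=p)           # i.i.d. sequence x^n
    emp = float(-np.mean(np.log2(p[x])))     # empirical entropy -(1/n) log2 p(x^n)
    print(n, round(abs(emp - H), 4))         # the gap shrinks as n grows
```

By the law of large numbers the gap decays roughly like 1/√n.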

Strong typicality

Given ε > 0, a sequence is strongly typical with respect to p(x) if the relative frequency of each symbol a in the sequence deviates from its probability by less than ε:

    | N(a)/n − p(a) | < ε

Also required: symbols of zero probability do not occur, that is, N(a) = 0 if p(a) = 0.
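The definition is short enough to state as code; a Python sketch (not from the slides):

```python
from collections import Counter

def strongly_typical(seq, p, eps):
    """Check strong typicality of seq with respect to the pmf p (dict: symbol -> prob)."""
    n = len(seq)
    counts = Counter(seq)
    if any(p.get(a, 0) == 0 for a in counts):   # zero-probability symbols must not occur
        return False
    return all(abs(counts.get(a, 0) / n - pa) < eps for a, pa in p.items())

p = {"a": 0.75, "b": 0.25}
print(strongly_typical("aaab" * 25, p, eps=0.05))   # True: relative frequencies match
print(strongly_typical("ab" * 50, p, eps=0.05))     # False: freq of 'a' is 0.5, not 0.75
```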


Strong typicality

It is stronger than weak typicality.
It allows better probability bounds.
The probability of a non-typical sequence not only goes to zero, it satisfies an exponential bound:

    P[Xⁿ ∉ T(ε)] ≤ 2^{−n φ(ε)}

where φ is a positive function.
The proof depends on a Chernoff-type bound: u(x − a) ≤ 2^{b(x−a)} yields

    P(X ≥ a) = E[u(X − a)] ≤ E[2^{b(X−a)}] = 2^{−ba} E[2^{bX}]



Rate distortion theory

Lossy and lossless compression


Lossy compression implies distortion
Rate distortion theory describes the trade-off between
lossy compression rate and the corresponding distortion


Representation of continuous variables


Shannon wrote in Part V of A Mathematical Theory of
Communication:
...a continuously variable quantity can assume an
infinite number of values and requires, therefore, an
infinite number of bits for exact specification.
This means that...
...to transmit the output of a continuous source with
exact recovery at the receiving point requires, in
general, a channel of infinite capacity (in bits per
second). Since, ordinarily, channels have a certain
amount of noise, and therefore a finite capacity, exact
transmission is impossible.


Is everything lost?

Still quoting Shannon:


Practically, we are not interested in exact
transmission when we have a continuous source, but
only in transmission to within a certain tolerance.
This leads to the real issue:
The question is, can we assign a definite rate to a
continuous source when we require only a certain
fidelity of recovery, measured in a suitable way.


Yet another angle

Consider a source with entropy rate H


Source coding theorem: there are good source codes of
rate R if R > H
What if the available rate is below H?
Source coding theorem converse: the error probability tends to 1 as n increases, which is bad news
What is necessary is a rate distortion code that
reproduces the sequence with a certain allowed distortion


The problem

Find a way of measuring distortion


Determine the minimum necessary rate for a certain
given distortion
As the distortion requirement is made stricter, the rate will increase
It turns out that it is possible to define a rate such that:
Proper encoding makes possible transmission over a
channel with capacity equal to that rate, at that distortion
A channel of smaller capacity is insufficient

This lecture revolves around this problem


Specific cases

Lossless case
Special case, zero distortion, source coding theorem, etc.

Low rate / high compression


If the maximum distortion has been fixed, what is the
minimum rate?

Low distortion
What is the minimum distortion that can be achieved for a
certain channel?


Rate distortion plane

[Figure: the rate distortion plane, showing the trade-off between minimising the distortion and minimising the rate; the lossless regime lies at zero distortion.]

Terminology

There is a source alphabet and a reproduction alphabet.
xⁿ is the source sequence, with n symbols x₁, x₂, …, xₙ.
x̂ⁿ is the reproduced sequence, also with n symbols.

[Diagram: xⁿ → Encoder → Decoder → x̂ⁿ]

Distortion measures: symbols

How far are the data from their representations?
Distortion measures answer this question.
They are functions d(x, x̂) on the pairs of symbols x, x̂.
They take nonnegative values.
They are zero on the diagonal: d(x, x) = 0.
They measure the cost of replacing x with x̂.


Average distortion

The average distortion is

    E_{p(x,x̂)} d(x, x̂)

Note that

    Σ_{x,x̂} p(x, x̂) d(x, x̂) = Σ_{x,x̂} p(x) p(x̂|x) d(x, x̂)

There are three factors:
d(x, x̂), determined by the per-symbol distortion measure
p(x), determined by the source
p(x̂|x), determined by the coding procedure
Corollary: vary p(x̂|x) to find interesting codes.
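The three factors can be seen side by side in a small hypothetical Python example (binary source, Hamming distortion, an arbitrary test channel; none of the numbers come from the slides):

```python
import numpy as np

p_x = np.array([0.5, 0.5])                # source p(x): fixed
d = np.array([[0.0, 1.0], [1.0, 0.0]])    # per-symbol distortion d(x, xhat): fixed
p_xh_given_x = np.array([[0.9, 0.1],      # test channel p(xhat|x): the factor
                         [0.1, 0.9]])     # the code designer is free to vary

# Average distortion: sum_{x, xhat} p(x) p(xhat|x) d(x, xhat)
avg_d = float(np.sum(p_x[:, None] * p_xh_given_x * d))
print(avg_d)   # 0.1: the average Hamming distortion, i.e. P(X != Xhat)
```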


Distortion measures: example

The absolute value distortion measure is

    d(x, x̂) = |x − x̂|

The average absolute value distortion is

    E_{p(x,x̂)} |x − x̂|


Distortion measures: example

The squared error distortion measure is

    d(x, x̂) = (x − x̂)²

The average squared distortion is

    E_{p(x,x̂)} (x − x̂)²


Distortion measures: example

The Hamming distortion measure is

    d(x, x̂) = 0 if x = x̂,   1 if x ≠ x̂

The average Hamming distortion is

    E_{p(x,x̂)} d(x, x̂)

Easy to simplify: it equals P(X ≠ X̂).


Distortion measures: sequences

One way of measuring the distortion between sequences is

    d(xⁿ, x̂ⁿ) = (1/n) Σᵢ d(xᵢ, x̂ᵢ)

This is simply the distortion per symbol.
Another possibility: the max norm of the d(xᵢ, x̂ᵢ).
Other norms are possible, but seldom used.


Rate distortion code

A (2^{nR}, n)-rate distortion code consists of:
An encoding function fₙ : Xⁿ → {1, 2, …, 2^{nR}}
A decoding function gₙ : {1, 2, …, 2^{nR}} → X̂ⁿ

[Diagram: xⁿ → Encoder → fₙ(xⁿ) → Decoder → x̂ⁿ]

Why sequences?

We are representing n i.i.d. symbols with nR bits.
The encoder replaces the sequence with an index.
The decoder does the opposite.
There are, of course, 2^{nR} possibilities.
Surprisingly, it is better to represent entire sequences in one go than to treat each symbol separately, even though the symbols are i.i.d.


Code distortion

The code distortion is

    D = E[d(Xⁿ, X̂ⁿ)]
      = E[d(Xⁿ, gₙ(fₙ(Xⁿ)))]
      = Σ_{xⁿ} p(xⁿ) d(xⁿ, gₙ(fₙ(xⁿ)))

The average is over the probability distribution on the sequences.


The rate distortion region

A point (R, D) in the rate distortion plane is achievable if there exists a sequence of rate distortion codes (fₙ, gₙ) such that

    lim_{n→∞} E[d(Xⁿ, gₙ(fₙ(Xⁿ)))] ≤ D

The rate distortion region is the closure of the set of achievable (R, D).
Its boundary is important (why?).
There are two (equivalent) ways of looking at it.


Rate distortion function

Roughly speaking: given D, search for the smallest achievable rate.
This defines a function of D (it yields a rate for every given D).
This function is the rate distortion function.
More precisely: the rate distortion function R(D) is the infimum of the rates R such that (R, D) is in the rate distortion region C for the given distortion D:

    R(D) = inf_{(R,D)∈C} R


Distortion rate function

The rate distortion function, with a twist:
Given R, search for the smallest achievable distortion.
This defines a function of R (it yields a distortion for every given R).
More precisely: the distortion rate function D(R) is the infimum of the distortions D such that (R, D) is in the rate distortion region C for the given rate R:

    D(R) = inf_{(R,D)∈C} D


Information rate distortion

Consider a source X with a distortion measure d(x, x̂).
The information rate distortion function R_I(D) for X is

    R_I(D) = min I(X, X̂)

Minimisation: over all conditional distributions p(x̂|x) such that E[d(X, X̂)] ≤ D.
Recall that p(x, x̂) = p(x) p(x̂|x).
We thus want the minimum over all p(x̂|x) such that

    Σ_{x,x̂} p(x) p(x̂|x) d(x, x̂) ≤ D


Information rate distortion

R_I(D) is obviously nonnegative.
It is also nonincreasing.
It is convex.

[Figure: a typical R_I(D) curve, nonincreasing and convex.]


Rate distortion theorem

If the rate R is above R(D), there exists a sequence of codes X̂ⁿ(Xⁿ) with at most 2^{nR} codewords and with average distortion approaching D:

    E[d(Xⁿ, X̂ⁿ)] → D

If the rate R is below R(D), no such codes exist.


Rate distortion theorem

The rate distortion and the information rate distortion functions are equal:

    R(D) = R_I(D)

More explicitly,

    R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X, X̂)

Approach: show that R(D) ≤ R_I(D) and R(D) ≥ R_I(D).


Example: binary source

Consider a binary source with

    P(X = 1) = p,   P(X = 0) = q

Distortion: the Hamming distance.
Need to find

    R(D) = min I(X, X̂)

The minimum is computed with respect to all p(x̂|x) that satisfy the distortion constraint:

    Σ_{x,x̂} p(x) p(x̂|x) d(x, x̂) ≤ D


R(D) for the binary source

The rate distortion function for the Hamming distance is

    R(D) = H(p) − H(D)   for 0 ≤ D ≤ min(p, q)
    R(D) = 0             otherwise

[Figure: R(D) curves for p = 0.1, 0.2, 0.5.]
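The formula is simple to evaluate; a Python sketch (not from the slides) for a few numerical points:

```python
import numpy as np

def H2(x):
    """Binary entropy function in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return float(-x * np.log2(x) - (1 - x) * np.log2(1 - x))

def R_binary(D, p):
    """R(D) = H(p) - H(D) for 0 <= D <= min(p, 1-p); zero otherwise."""
    if D >= min(p, 1 - p):
        return 0.0
    return H2(p) - H2(D)

print(R_binary(0.0, 0.5))   # 1.0: lossless coding needs the full entropy H(1/2)
print(R_binary(0.1, 0.5))   # ~0.531: allowing 10% symbol errors saves almost half the rate
print(R_binary(0.5, 0.5))   # 0.0: D = 1/2 is achievable with no information at all
```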

Proof

The idea is to lower bound I(X, X̂), then show that the bound is achievable.
We know that

    I(X, X̂) = H(X) − H(X|X̂)

The first term is just the binary entropy

    H(X) = H(p) = −p log p − q log q

How to handle the second term, H(X|X̂)?
Assume without loss of generality that p ≤ 1/2.
If D > p, rate zero is sufficient (why?).
Therefore assume D < p ≤ 1/2.


Proof (continued)

Let ⊕ denote addition modulo 2 (that is, exclusive-or).
Note that (why?)

    H(X|X̂) = H(X ⊕ X̂ | X̂)

Conditioning cannot increase entropy:

    H(X ⊕ X̂ | X̂) ≤ H(X ⊕ X̂)

H(X ⊕ X̂) equals H(a), where a is the probability that X ⊕ X̂ = 1.
X ⊕ X̂ = 1 if and only if X ≠ X̂.
The probability of X ≠ X̂ does not exceed D ≤ 1/2 due to the distortion constraint; since H is increasing on [0, 1/2],

    H(X ⊕ X̂) ≤ H(D)


Proof (continued)

Putting it all together: if D < p ≤ 1/2,

    I(X, X̂) = H(X) − H(X|X̂) ≥ H(p) − H(D)

The bound is achieved for

    P(X = 0 | X̂ = 1) = P(X = 1 | X̂ = 0) = D


Example: Gaussian source

Source: Gaussian, zero mean, variance σ².
Quadratic distortion:

    d(x, y) = (x − y)²

Need to find

    R(D) = min I(X, X̂)

The minimum must be computed subject to the constraint

    E[(X − X̂)²] ≤ D


Solution for D ≥ σ²

Claim: for D ≥ σ² the minimum rate is zero.
How to get rate zero: set X̂ = 0.
Then no bits need to be transmitted, and I(X, X̂) = 0 (why?).
This leads to a distortion of E[(X − X̂)²] = E[X²] = σ².
This value does not exceed D, and so it satisfies the constraint.
It remains to consider the case D < σ².


Solution for D < σ²

    I(X, X̂) = h(X) − h(X|X̂)
             = h(X) − h(X − X̂ | X̂)
             ≥ h(X) − h(X − X̂)

To minimise this, maximise h(X − X̂); thus X − X̂ should be Gaussian.
Constraint: E[(X − X̂)²] ≤ D, hence the variance of X − X̂ can be at most D.
Result:

    I(X, X̂) ≥ (1/2) log 2πeσ² − (1/2) log 2πeD = (1/2) log (σ²/D)

The bound is met if X − X̂ is Gaussian, zero mean, with variance D.



R(D) for the Gaussian source

Thus, the rate distortion function for the Gaussian source is

    R(D) = (1/2) log (σ²/D)   for D < σ²
    R(D) = 0                  otherwise

[Figure: R(D) curves for variances 1, 2, 3.]

Distortion rate for the Gaussian source

Solve for D in

    R = (1/2) log (σ²/D)

This leads to

    D(R) = σ² 2^{−2R}

Increasing R by one bit decreases the distortion by a factor of 4.
Define the SNR as

    SNR = 10 log₁₀ (σ²/D)

Then

    SNR = 10 log₁₀ 2^{2R} ≈ 6R dB

Hence the SNR varies at a rate of about 6 dB per bit.
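A small numerical check of the 6 dB-per-bit rule, sketched in Python (not part of the slides):

```python
import numpy as np

sigma2 = 1.0
for R in range(1, 6):
    D = sigma2 * 2.0 ** (-2 * R)          # distortion rate function D(R)
    snr = 10 * np.log10(sigma2 / D)       # 10 log10 2^(2R) = 2R * 10 log10 2
    print(R, D, round(float(snr), 2))     # SNR grows by ~6.02 dB per extra bit
```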


Example: multiple Gaussian variables

How to distribute R bits among n variables N(0, σᵢ²)?
In this case

    R(D) = min I(Xⁿ, X̂ⁿ)

The minimum is over all density functions f(x̂ⁿ|xⁿ) such that

    E[d(Xⁿ, X̂ⁿ)] ≤ D

Here, d(·, ·) is the sum of the per-symbol distortions:

    d(xⁿ, x̂ⁿ) = Σᵢ (xᵢ − x̂ᵢ)²

Arguments similar to those previously used lead to

    I(Xⁿ, X̂ⁿ) ≥ Σᵢ [ (1/2) log (σᵢ²/Dᵢ) ]⁺

where (·)⁺ denotes the positive part.

Example: multiple Gaussian variables

It turns out that this lower bound can be met.
We still need to find

    R(D) = min_{Σ Dᵢ = D} Σᵢ [ (1/2) ln (σᵢ²/Dᵢ) ]⁺

A simple variational problem; the Lagrangian is

    L = Σᵢ (1/2) ln (σᵢ²/Dᵢ) + λ Σᵢ Dᵢ

This leads to

    ∂L/∂Dᵢ = −1/(2Dᵢ) + λ = 0

Thus Dᵢ = constant, and the optimal bit allocation assigns the same distortion to each variable.

Example: multiple Gaussian variables

Based on the previous result, it can be shown that the rate distortion function for multiple N(0, σᵢ²) variables is

    R(D) = Σᵢ (1/2) log (σᵢ²/Dᵢ)

where

    Dᵢ = c     if c < σᵢ²
    Dᵢ = σᵢ²   if c ≥ σᵢ²

and c must be chosen so that Σᵢ Dᵢ = D is satisfied.
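The water level c can be found numerically; this is a Python sketch of one way to do it (bisection is my choice here, not a method prescribed by the slides; it assumes D does not exceed the total variance):

```python
import numpy as np

def reverse_waterfill(sigma2, D, tol=1e-12):
    """Find c with sum_i min(c, sigma_i^2) = D by bisection; return the
    water level, the per-variable distortions and the total rate in bits."""
    sigma2 = np.asarray(sigma2, float)
    lo, hi = 0.0, float(sigma2.max())
    while hi - lo > tol:
        c = (lo + hi) / 2
        if np.minimum(c, sigma2).sum() < D:
            lo = c                          # water level too low
        else:
            hi = c
    Di = np.minimum(c, sigma2)              # D_i = c, capped at sigma_i^2
    R = float(0.5 * np.sum(np.log2(sigma2 / Di)))
    return c, Di, R

c, Di, R = reverse_waterfill([4.0, 1.0, 0.25], D=1.5)
print(c, Di, R)   # the variable with variance below c gets D_i = sigma_i^2, zero rate
```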

Example: multiple Gaussian variables

Reverse water-filling: fix c, and describe the variables (such as X₁, X₂, X₃) whose variance is greater than c.
Ignore the variables (such as X₄) whose variance is smaller than c.

[Figure: reverse water-filling. The water level c cuts the variances σᵢ²; D₁ = D₂ = D₃ = c, while D₄ = σ₄².]

Rate distortion theorem

The proof of the theorem depends on certain typical


sequences
The idea behind typical sequences is that the typical set
has high probability
Thus if the rate distortion code works well for typical
sequences it will work well on the average
These ideas enter the proof of the rate distortion theorem
Before going on it is necessary to define the typical
sequences that will be used


Distortion typical sequences

Given ε > 0, a pair of sequences (xⁿ, x̂ⁿ) is distortion typical if

    | −(1/n) log p(xⁿ) − H(X) | < ε
    | −(1/n) log p(x̂ⁿ) − H(X̂) | < ε
    | −(1/n) log p(xⁿ, x̂ⁿ) − H(X, X̂) | < ε
    | d(xⁿ, x̂ⁿ) − E[d(X, X̂)] | < ε


Distortion typical sequences

According to the law of large numbers,

    −(1/n) log p(Xⁿ) → E[−log p(X)] = H(X)

as n → ∞.
The same argument applies to the other conditions.
As n → ∞ the conditions will be met.


Strongly typical sequences

Given ε > 0, a sequence is strongly typical with respect to p(x) if the relative frequency of each alphabet symbol a in the sequence deviates from its probability by less than ε divided by the alphabet size:

    | N(a)/n − p(a) | < ε / |A|

Also required: symbols of zero probability do not occur, that is, N(a) = 0 if p(a) = 0.


Strongly typical pairs

Very similar definition, but for pairs of sequences.
How many symbol pairs (a, b) exist in a sequence pair?
The definition involves the joint probability p(a, b):

    | N(a, b)/n − p(a, b) | < ε / (|A| |B|)

for all possible a, b.
Again, pairs of zero probability must not occur.


Set size

A typical sequence with respect to a distribution follows


that distribution well
Intuitively, if n is large enough, this is very likely
This can be made rigorous: due to the law of large
numbers, the probability of the strongly typical set
approaches 1 as n
Thus, roughly speaking, if a system works well for the
typical set it will work well on average


Rate distortion theorem

Select a p(x̂|x) and ε > 0.
The codebook consists of 2^{nR} sequences X̂ⁿ.
To encode, map Xⁿ to an index k.
To decode, map the index k to X̂ⁿ(k).
Pick k = 1 if there is no k such that (Xⁿ, X̂ⁿ(k)) is in the strongly jointly typical set.
Otherwise pick, for example, the smallest such k.
In this way, only 2^{nR} values of k are needed.


Rate distortion theorem

It is necessary to compute the average distortion, averaging over the random codebook choice:

    D = Σ_{xⁿ} p(xⁿ) E[d(xⁿ, X̂ⁿ)]

The sequences xⁿ can be divided into three sets:
Non-typical sequences
Typical sequences for which an X̂ⁿ(k) exists that is jointly typical with them
Typical sequences for which no such X̂ⁿ(k) exists
What is the contribution of each set to the distortion?


Rate distortion theorem

The total probability of the non-typical sequences can be made as small as desired by increasing n.
Hence, they contribute a vanishingly small amount to the distortion if d(·, ·) is bounded and n is sufficiently large.
More precisely, they contribute at most ε dmax, where dmax is the maximum possible distortion.


Rate distortion theorem

Consider now the case of typical sequences that have codewords jointly typical with them.
This means that the relative frequencies of the pairs (a, b) in the source and reproduced sequences are close to the joint probability.
The distortion is a continuous function of the joint probability, thus it will also be close to D.
If d(·, ·) is bounded, the distortion will be bounded by D + ε dmax.
The total probability of this set is below 1, hence it contributes at most D + ε dmax to the expected distortion.


Rate distortion theorem

Finally, consider the typical sequences without jointly typical codewords.
Let the total probability of this set be Pₜ.
If d(·, ·) is bounded, the contribution due to this set is at most Pₜ dmax.
It is possible to derive a bound for Pₜ that shows that it converges to zero as n → ∞.


Putting it all together

The three sets contribute differently to the expected distortion.
The non-typical set contributes at most ε dmax.
The typical set with jointly typical codewords contributes at most D + ε dmax.
Finally, the typical set without jointly typical codewords contributes a vanishingly small amount.
These terms add up to at most D + ε′, where ε′ can be made arbitrarily small.
This completes the argument.


Rate distortion theorem (converse)

Draw independent samples from a source X with distribution p(x).
The rate R of any (2^{nR}, n) rate distortion code with distortion D satisfies R ≥ R(D).
To show this, assume that E[d(Xⁿ, X̂ⁿ)] ≤ D.
That R ≥ R(D) then follows from a chain of inequalities.


Rate distortion theorem (converse)

    nR ≥ H(fₙ(Xⁿ))                                  (fₙ assumes at most 2^{nR} values)
       ≥ H(fₙ(Xⁿ)) − H(fₙ(Xⁿ) | Xⁿ)                 (entropy is nonnegative)
       = I(Xⁿ, fₙ(Xⁿ))                              (definition)
       ≥ I(Xⁿ, X̂ⁿ)                                  (X̂ⁿ is a function of fₙ(Xⁿ))
       = H(Xⁿ) − H(Xⁿ | X̂ⁿ)
       = Σᵢ H(Xᵢ) − H(Xⁿ | X̂ⁿ)                      (independence)
       = Σᵢ H(Xᵢ) − Σᵢ H(Xᵢ | X̂ⁿ, X₁, …, Xᵢ₋₁)      (chain rule)
       ≥ Σᵢ H(Xᵢ) − Σᵢ H(Xᵢ | X̂ᵢ)                   (conditioning reduces entropy)
       = Σᵢ I(Xᵢ, X̂ᵢ)
       ≥ Σᵢ R(E[d(Xᵢ, X̂ᵢ)])                         (rate distortion definition)
       = n · (1/n) Σᵢ R(E[d(Xᵢ, X̂ᵢ)])
       ≥ n R((1/n) Σᵢ E[d(Xᵢ, X̂ᵢ)])                 (convexity of R(·))
       = n R(E[d(Xⁿ, X̂ⁿ)])                          (definition of sequence distortion)
       ≥ n R(D)                                      (E[d(Xⁿ, X̂ⁿ)] ≤ D; R is nonincreasing)


Alternating projections

Consider two subspaces A and B of a common parent space.
A point in the intersection can be found by alternating projections.

[Figure: alternating projections between two subspaces, converging to a point x in the intersection.]

POCS

The method depends on the possibility of defining a projection.
Subspaces are obviously convex.
Everything works for convex sets, because projections onto them are well defined.

[Figure: projections of points onto a convex set A.]

Distance between convex sets

POCS works to find a point in the intersection of two convex sets A, B.
If the sets are disjoint, one can instead find the distance between them:

    dmin = min_{a∈A, b∈B} d(a, b)

Project zᵢ onto A to find zᵢ₊₁.
Project zᵢ₊₁ onto B to find zᵢ₊₂.
Repeat to obtain z₁, z₂, z₃, …
The distance between successive points converges to dmin.

POCS and rate distortion

Need to recast rate distortion as a convex optimisation.
The mutual information I(X, X̂) is a Kullback-Leibler distance D(p||q).
Hence the rate distortion function is obtained by minimising this distance.
How to proceed?


POCS and rate distortion

The Kullback-Leibler distance from p(x)p(y|x) to p(x)q(y) is minimised when q(y) is the marginal

    q₀(y) = Σ_x p(x) p(y|x)

Proof: compute the distances from p(x)p(y|x) to p(x)q(y) and from p(x)p(y|x) to p(x)q₀(y), and show that the first is no smaller than the second.
This is an elementary consequence of the nonnegativity of D(·||·).


POCS and rate distortion

    R(D) = min_{p(x̂|x): Σ p(x)p(x̂|x)d(x,x̂) ≤ D} I(X, X̂)

         = min_{p(x̂|x): Σ p(x)p(x̂|x)d(x,x̂) ≤ D} Σ_{x,x̂} p(x)p(x̂|x) log [ p(x̂|x) / p(x̂) ]

         = min_{p(x̂|x): Σ p(x)p(x̂|x)d(x,x̂) ≤ D} D( p(x)p(x̂|x) || p(x)p(x̂) )

         = min_{q(x̂)} min_{p(x̂|x): Σ p(x)p(x̂|x)d(x,x̂) ≤ D} D( p(x)p(x̂|x) || p(x)q(x̂) )

         = min_{a∈A} min_{b∈B} D(a||b)

with A the set of joint distributions p(x)p(x̂|x) satisfying the distortion constraint, and B the set of product distributions p(x)q(x̂).

Blahut-Arimoto

Alternating minimisation, as in POCS:
Start with q(x̂); find the p(x̂|x) that minimises I(X, X̂) (subject to the distortion constraint).
For this p(x̂|x), find the q(x̂) that minimises the mutual information, which is simply the marginal

    q₀(x̂) = Σ_x p(x) p(x̂|x)
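A compact numerical sketch of this alternating minimisation in Python (not code from the lecture; the distortion constraint is handled through a Lagrange multiplier s that selects the slope of the R(D) curve, which is the standard parametrisation and an assumption of this sketch):

```python
import numpy as np

def blahut_arimoto(p_x, d, s, n_iter=500):
    """One point of the R(D) curve by alternating minimisation.

    p_x : source distribution p(x), shape (nx,)
    d   : distortion matrix d[x, xhat], shape (nx, nxh)
    s   : Lagrange multiplier > 0 (selects the slope of the curve)
    Returns (D, R), with R in bits.
    """
    nxh = d.shape[1]
    q = np.full(nxh, 1.0 / nxh)                    # initial marginal q(xhat)
    A = np.exp(-s * d)
    for _ in range(n_iter):
        cond = A * q                               # minimise over the test channel:
        cond /= cond.sum(axis=1, keepdims=True)    # p(xhat|x) ∝ q(xhat) exp(-s d)
        q = p_x @ cond                             # minimise over the marginal
    joint = p_x[:, None] * cond
    D = float(np.sum(joint * d))
    R = float(np.sum(joint * np.log2(cond / q)))
    return D, R

# Symmetric binary source with Hamming distortion: should recover R = 1 - H(D)
p = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
D, R = blahut_arimoto(p, d, s=2.0)
print(D, R)
```

Sweeping s traces out the whole R(D) curve, one (D, R) point per value of the multiplier.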
