
ELL714: INFORMATION THEORY

ENTROPY, RELATIVE ENTROPY, AND MUTUAL INFORMATION

Manav Bhatnagar

ENTROPY
Entropy is a measure of the uncertainty of a random variable.
The entropy H(X) of a discrete random variable X is defined by
H(X) = -\sum_{x} p(x) \log p(x)

We denote expectation by E. Thus, if X ~ p(x), the expected value of the
random variable g(X) is written

E_p g(X) = \sum_{x} g(x) p(x)

The entropy of X can also be interpreted as the expected value of the random
variable log 1/p(X), where X is drawn according to the probability mass
function p(x). Thus,

H(X) = E_p \log \frac{1}{p(X)}
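As an illustrative sketch (not part of the original slides; the function name and example pmfs are my own choices), entropy in bits can be computed directly from a probability mass function:

```python
import math

def entropy_bits(pmf):
    """H(X) = -sum_x p(x) log2 p(x); zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# The pmf (1/2, 1/4, 1/8, 1/8) has entropy 7/4 = 1.75 bits.
print(entropy_bits([0.5, 0.25, 0.125, 0.125]))   # 1.75
# A uniform pmf over 32 outcomes has entropy log2(32) = 5 bits.
print(entropy_bits([1 / 32] * 32))               # 5.0
```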


Lemma 2.1.1: H(X) \geq 0.
Proof: 0 \leq p(x) \leq 1 implies that \log \frac{1}{p(x)} \geq 0.

Lemma 2.1.2: H_b(X) = (\log_b a) H_a(X).
Proof: \log_b p = (\log_b a)(\log_a p).

Properties of entropy
a) H(X) \geq 0.
   Since 1 \geq p(x) \geq 0, we have -\log p(x) \geq 0.
b) H_b(X) = (\log_b a) H_a(X).
   Since \log_b p = (\log_b a)(\log_a p),

   E_p[-\log_b p(X)] = E_p[(\log_b a)(-\log_a p(X))]
                     = (\log_b a) E_p[-\log_a p(X)]
                     = (\log_b a) H_a(X).

EXAMPLE
Let

X = 1 with probability p,
    0 with probability 1 - p.

Then

H(X) = -p \log p - (1-p) \log(1-p) =: H(p),

the binary entropy function.


\frac{d}{dp} H(p) = -\log p - \frac{p}{p} + \log(1-p) + \frac{1-p}{1-p} = -\log p + \log(1-p)

Setting \frac{d}{dp} H(p) = 0 gives p = \frac{1}{2}, and

H\left(\frac{1}{2}\right) = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = 1 bit.

Hence the binary entropy is maximized at p = 1/2, where it equals 1 bit.
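A small numerical sketch (mine, not from the slides): evaluating the binary entropy on a grid confirms the maximum of 1 bit at p = 1/2.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Evaluate on a grid: the maximum of 1 bit is attained at p = 1/2.
grid = [i / 1000 for i in range(1001)]
p_star = max(grid, key=binary_entropy)
print(p_star, binary_entropy(p_star))   # 0.5 1.0
```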


Example 1.1.1
Consider a random variable that has a uniform distribution over 32 outcomes. To
identify an outcome, we need a label that takes on 32 different values. Thus, 5-bit
strings suffice as labels. The entropy of this random variable is

H(X) = -\sum_{i=1}^{32} p(i) \log p(i) = -\sum_{i=1}^{32} \frac{1}{32} \log \frac{1}{32} = \log 32 = 5 bits.

Example 1.1.2
Suppose that we have a horse race with eight horses taking part. Assume that
the probabilities of winning for the eight horses are 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64. Then

H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{16}\log\frac{1}{16} - 4 \cdot \frac{1}{64}\log\frac{1}{64} = 2 bits.


Example 2.1.2
Let

X = a with probability 1/2,
    b with probability 1/4,
    c with probability 1/8,
    d with probability 1/8.

The entropy of X is

H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4} bits.


JOINT ENTROPY AND CONDITIONAL ENTROPY

Definition: The joint entropy H(X,Y) of a pair of discrete random
variables (X, Y) with a joint distribution p(x, y) is defined as

H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)

Definition: If (X, Y) ~ p(x, y), the conditional entropy H(Y|X) is defined as

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X=x)
       = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x)
       = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x)
       = -E \log p(Y|X)

Theorem 2.2.1 (Chain rule)

H(X,Y) = H(X) + H(Y|X).
Equivalently, we can write
log p(X,Y) = log p(X) + log p(Y|X).
Corollary
H(X,Y|Z) = H(X|Z) + H(Y|X,Z).


Example 2.2.1
Let (X, Y) have the following joint distribution p(x, y):

        X=1     X=2     X=3     X=4
Y=1     1/8     1/16    1/32    1/32
Y=2     1/16    1/8     1/32    1/32
Y=3     1/16    1/16    1/16    1/16
Y=4     1/4     0       0       0

The marginal distribution of X is (1/2, 1/4, 1/8, 1/8) and the marginal distribution
of Y is (1/4, 1/4, 1/4, 1/4); hence H(X) = 7/4 bits and H(Y) = 2 bits. Also

H(X|Y) = \sum_{i=1}^{4} p(Y=i) H(X|Y=i)
       = \frac{1}{4} H(\tfrac{1}{2},\tfrac{1}{4},\tfrac{1}{8},\tfrac{1}{8})
       + \frac{1}{4} H(\tfrac{1}{4},\tfrac{1}{2},\tfrac{1}{8},\tfrac{1}{8})
       + \frac{1}{4} H(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})
       + \frac{1}{4} H(1,0,0,0)
       = \frac{1}{4}\cdot\frac{7}{4} + \frac{1}{4}\cdot\frac{7}{4} + \frac{1}{4}\cdot 2 + \frac{1}{4}\cdot 0
       = \frac{11}{8} bits.

Similarly, H(Y|X) = 13/8 bits and H(X,Y) = 27/8 bits. Note that
H(X|Y) \neq H(Y|X), but H(X) - H(X|Y) = H(Y) - H(Y|X).
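As an illustrative check (the code and names are mine, not from the slides), the quantities of Example 2.2.1 can be recomputed directly from the joint table:

```python
import math
from fractions import Fraction as F

# Joint pmf p(x, y) of Example 2.2.1; rows are y = 1..4, columns are x = 1..4.
P = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def H(pmf):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(float(p) * math.log2(float(p)) for p in pmf if p > 0)

px = [sum(row[j] for row in P) for j in range(4)]   # marginal of X: (1/2, 1/4, 1/8, 1/8)
py = [sum(row) for row in P]                        # marginal of Y: (1/4, 1/4, 1/4, 1/4)
HX, HY = H(px), H(py)                               # 1.75 and 2.0 bits
HXY = H(p for row in P for p in row)                # H(X,Y) = 27/8 = 3.375 bits
print(HXY - HY, HXY - HX)                           # H(X|Y) = 11/8, H(Y|X) = 13/8 bits
```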


RELATIVE ENTROPY AND MUTUAL INFORMATION

The relative entropy is a measure of the distance between two distributions.
The relative entropy D(p||q) is a measure of the inefficiency of assuming
that the distribution is q when the true distribution is p.
For example, if we knew the true distribution p of the random variable, we could
construct a code with average description length H(p). If, instead, we used the
code for a distribution q, we would need H(p) + D(p||q) bits on the average to
describe the random variable.
*Verify this for a D-adic distribution with four states.
Definition: The relative entropy or Kullback-Leibler distance between two
probability mass functions p(x) and q(x) is defined as

D(p||q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)}

If there is any symbol x \in \mathcal{X} such that p(x) > 0 and q(x) = 0, then D(p||q) = \infty.
Relative entropy is always non-negative and is zero if and only if p = q.
However, it is not a true distance between distributions, since it is not symmetric
and does not satisfy the triangle inequality.
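A minimal sketch of the definition (my own illustration, using base-2 logarithms); comparing D(p||q) with D(q||p) shows the asymmetry mentioned above:

```python
import math

def kl_divergence_bits(p, q):
    """D(p||q) = sum p(x) log2(p(x)/q(x)); infinite if q(x) = 0 where p(x) > 0."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        d += pi * math.log2(pi / qi)
    return d

p = [0.75, 0.25]
q = [0.50, 0.50]
print(kl_divergence_bits(p, q))   # ~0.1887 bits
print(kl_divergence_bits(q, p))   # ~0.2075 bits: D is not symmetric
```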

Mutual Information
Consider two random variables X and Y with a joint probability mass function
p(x, y) and marginal probability mass functions p(x) and p(y). The mutual
information I(X;Y) is the relative entropy between the joint distribution and
the product distribution p(x)p(y):

I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
       = D(p(x,y) || p(x)p(y))
       = E_{p(x,y)} \log \frac{p(X,Y)}{p(X)p(Y)}


RELATIONSHIP BETWEEN ENTROPY AND MUTUAL INFORMATION

I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
       = \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)}
       = -\sum_{x,y} p(x,y) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y)
       = -\sum_{x} p(x) \log p(x) - \left(-\sum_{x,y} p(x,y) \log p(x|y)\right)
       = H(X) - H(X|Y)

By symmetry, it also follows that
I(X;Y) = H(Y) - H(Y|X).
Since H(X,Y) = H(X) + H(Y|X), we have
I(X;Y) = H(X) + H(Y) - H(X,Y).
Finally, we note that
I(X;X) = H(X) - H(X|X) = H(X).

Theorem 2.4.1 (Mutual information and entropy)

I(X;Y) = H(X) - H(X|Y)
I(X;Y) = H(Y) - H(Y|X)
I(X;Y) = H(X) + H(Y) - H(X,Y)
I(X;Y) = I(Y;X)
I(X;X) = H(X)
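The identities of Theorem 2.4.1 are easy to confirm numerically; the sketch below (my own, reusing the joint table of Example 2.2.1) computes I(X;Y) three ways:

```python
import math

# Joint pmf of Example 2.2.1 (rows y, columns x).
P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0],
]
px = [sum(row[j] for row in P) for j in range(4)]
py = [sum(row) for row in P]

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

HX, HY = H(px), H(py)
HXY = H([p for row in P for p in row])

# I(X;Y) computed as the relative entropy D(p(x,y) || p(x)p(y)).
I_kl = sum(P[i][j] * math.log2(P[i][j] / (px[j] * py[i]))
           for i in range(4) for j in range(4) if P[i][j] > 0)

print(I_kl)              # 0.375 bits
print(HX + HY - HXY)     # same: H(X) + H(Y) - H(X,Y)
print(HX - (HXY - HY))   # same: H(X) - H(X|Y)
```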


CHAIN RULES FOR ENTROPY, RELATIVE ENTROPY, AND MUTUAL INFORMATION

Theorem 2.5.1 (Chain rule for entropy)
Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1)

Proof: by repeated application of the two-variable chain rule H(X,Y) = H(X) + H(Y|X).

Theorem 2.5.2 (Chain rule for information)

I(X_1, X_2, ..., X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, ..., X_1)


Definition: For joint probability mass functions p(x, y) and q(x, y), the
conditional relative entropy D(p(y|x)||q(y|x)) is the average of the relative
entropies between the conditional probability mass functions p(y|x) and
q(y|x), averaged over the probability mass function p(x):

D(p(y|x) || q(y|x)) = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)}
                    = E_{p(x,y)} \log \frac{p(Y|X)}{q(Y|X)}

Theorem 2.5.3 (Chain rule for relative entropy)

D(p(x,y) || q(x,y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x)).
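An illustrative numeric check of Theorem 2.5.3 (the joint pmfs below are arbitrary examples of mine, not from the slides):

```python
import math

def kl(p, q):
    """D(p||q) in bits for aligned lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two joint pmfs on a 2x2 alphabet, stored as p[x][y] (arbitrary example values).
P = [[0.4, 0.2], [0.1, 0.3]]
Q = [[0.25, 0.25], [0.25, 0.25]]

px = [sum(row) for row in P]
qx = [sum(row) for row in Q]

d_joint = kl([p for row in P for p in row], [q for row in Q for q in row])
d_marginal = kl(px, qx)
# Conditional term: sum_x p(x) * D(p(y|x) || q(y|x)).
d_conditional = sum(px[i] * kl([P[i][j] / px[i] for j in range(2)],
                               [Q[i][j] / qx[i] for j in range(2)])
                    for i in range(2))

print(d_joint, d_marginal + d_conditional)   # both ~0.1536 bits
```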


JENSEN'S INEQUALITY AND ITS CONSEQUENCES

Definition: A function f(x) is said to be convex over an interval (a, b) if for every
x_1, x_2 \in (a, b) and 0 \leq \lambda \leq 1,

f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2).

A function f(x) is said to be strictly convex if equality holds only if \lambda = 0 or \lambda = 1.

Theorem 2.6.1: If the function f has a second derivative that is non-negative
(positive) over an interval, the function is convex (strictly convex) over that
interval.

Convex functions: x^2, |x|, e^x, x \log x (for x \geq 0).
Concave functions: \log x, \sqrt{x} (for x \geq 0).
Convexity test: f''(x) \geq 0.
Strict convexity: f''(x) > 0.
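As a quick worked instance of the second-derivative test (my own illustration), for f(x) = x \ln x on x > 0:

f(x) = x \ln x, \qquad f'(x) = \ln x + 1, \qquad f''(x) = \frac{1}{x} > 0 \quad (x > 0),

so x log x is strictly convex on (0, \infty) in any base (changing the base only multiplies f by a positive constant); similarly, f(x) = \ln x has f''(x) = -1/x^2 < 0, so log x is concave.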


Theorem 2.6.2 (Jensen's inequality)

If f is a convex function and X is a random variable,
E f(X) \geq f(E X).
Proof:
For a two-mass-point distribution, the inequality becomes
p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2),
which follows directly from the definition of convex functions. Suppose that
the theorem is true for distributions with k - 1 mass points. Then writing
p_i' = p_i / (1 - p_k) for i = 1, 2, ..., k - 1, we have

\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i)
                       \geq p_k f(x_k) + (1 - p_k) f\left(\sum_{i=1}^{k-1} p_i' x_i\right)
                       \geq f\left(p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i\right)
                       = f\left(\sum_{i=1}^{k} p_i x_i\right)
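A small numerical illustration (mine, not from the slides): for the convex function f(x) = x^2, E[f(X)] \geq f(E[X]) for any pmf.

```python
import random

def expectation(values, pmf, f=lambda x: x):
    return sum(p * f(v) for v, p in zip(values, pmf))

random.seed(0)
values = [1, 2, 5, 9]
for _ in range(3):
    w = [random.random() for _ in values]
    pmf = [x / sum(w) for x in w]
    lhs = expectation(values, pmf, f=lambda x: x ** 2)   # E[f(X)]
    rhs = expectation(values, pmf) ** 2                  # f(E[X])
    print(lhs >= rhs)   # True for the convex function f(x) = x^2
```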


Theorem 2.6.3 (Information inequality)

Let p(x), q(x), x \in \mathcal{X}, be two probability mass functions. Then
D(p||q) \geq 0
with equality if and only if p(x) = q(x) for all x.
Proof:
Let A = {x : p(x) > 0} be the support set of p(x). Then

-D(p||q) = -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}
         = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)}
         \leq \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)}      (Jensen's inequality, log is concave)
         = \log \sum_{x \in A} q(x)
         \leq \log \sum_{x \in \mathcal{X}} q(x)
         = \log 1
         = 0.

Equality in the Jensen step holds iff q(x)/p(x) = c (a constant) on A, and equality
in the last step holds iff \sum_{x \in A} q(x) = 1; together these give c = 1, i.e.,
p(x) = q(x) for all x.

Corollary (Non-negativity of mutual information)

For any two random variables X, Y,
I(X;Y) \geq 0,
with equality if and only if X and Y are independent.

Proof:
I(X;Y) = D(p(x,y) || p(x)p(y)) \geq 0,
with equality if and only if p(x,y) = p(x)p(y) (i.e., X and Y are independent).

Corollary
D(p(y|x) || q(y|x)) \geq 0,
with equality if and only if p(y|x) = q(y|x) for all y and x such that p(x) > 0.

Corollary
I(X;Y|Z) \geq 0,
with equality if and only if X and Y are conditionally independent given Z.

Theorem 2.6.4 (Upper bound on entropy)

H(X) \leq \log |\mathcal{X}|, where |\mathcal{X}| denotes the number of elements in the range of X,
with equality if and only if X has a uniform distribution over \mathcal{X}.
Proof:
Let u(x) = 1/|\mathcal{X}| be the uniform probability mass function over \mathcal{X}, and let p(x)
be the probability mass function for X. Then

D(p || u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X).

Hence, by the non-negativity of relative entropy,

0 \leq D(p || u) = \log |\mathcal{X}| - H(X),

so H(X) \leq \log |\mathcal{X}|.
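Illustrative check (mine): randomly generated pmfs on an 8-symbol alphabet never exceed the uniform entropy log |X| = 3 bits.

```python
import math
import random

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

random.seed(1)
n = 8
for _ in range(5):
    w = [random.random() for _ in range(n)]
    pmf = [x / sum(w) for x in w]
    print(round(H(pmf), 4), "<=", math.log2(n))   # never exceeds log2(8) = 3 bits
```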


Theorem 2.6.5 (Conditioning reduces entropy)

H(X|Y) \leq H(X)
with equality if and only if X and Y are independent.
Proof:
I(X;Y) \geq 0
\Rightarrow H(X) - H(X|Y) \geq 0
\Rightarrow H(X|Y) \leq H(X).

Theorem 2.6.6
Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) \leq \sum_{i=1}^{n} H(X_i)

with equality if and only if the X_i are independent.
Proof:

H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1)
                     \leq \sum_{i=1}^{n} H(X_i)

(by the chain rule and because conditioning reduces entropy).


LOG SUM INEQUALITY AND ITS APPLICATIONS

Theorem 2.7.1 (Log sum inequality)
For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}

with equality if and only if a_i / b_i = constant.

Proof:
Let A = \sum_{i=1}^{n} a_i, B = \sum_{i=1}^{n} b_i, and define the normalized quantities
a_i' = a_i / A and b_i' = b_i / B. Then

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} = A \sum_{i=1}^{n} a_i' \log \frac{a_i' A}{b_i' B}
    = A \sum_{i=1}^{n} a_i' \log \frac{a_i'}{b_i'} + A \sum_{i=1}^{n} a_i' \log \frac{A}{B}
    = A\, D(a' || b') + A \log \frac{A}{B}
    \geq A \log \frac{A}{B}
    = \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},

since D(a' || b') \geq 0, with equality iff a_i' = b_i' for all i, i.e., a_i / b_i = A/B = constant.


We now use the log sum inequality to prove various convexity results.
We begin by reproving Theorem 2.6.3, which states that D(p||q) \geq 0 with
equality if and only if p(x) = q(x). By the log sum inequality,

D(p||q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
        \geq \left(\sum_{x} p(x)\right) \log \frac{\sum_x p(x)}{\sum_x q(x)}
        = 1 \cdot \log \frac{1}{1} = 0,

with equality if and only if p(x)/q(x) = c. Since both p and q are probability
mass functions, c = 1, and hence we have D(p||q) = 0 if and only if p(x) =
q(x) for all x.
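A quick numerical sketch of the log sum inequality (the numbers below are arbitrary, chosen here for illustration):

```python
import math

def log_sum_lhs(a, b):
    """sum_i a_i log2(a_i / b_i), with the convention 0 log 0 = 0."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def log_sum_rhs(a, b):
    A, B = sum(a), sum(b)
    return A * math.log2(A / B)

a, b = [3.0, 1.0, 4.0], [2.0, 2.0, 1.0]
print(log_sum_lhs(a, b), ">=", log_sum_rhs(a, b))   # strict inequality here

# Equality when a_i / b_i is constant:
a2, b2 = [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]
print(log_sum_lhs(a2, b2), "==", log_sum_rhs(a2, b2))   # both equal 12.0
```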


1) Theorem 2.7.2 (Convexity of relative entropy)
D(p||q) is convex in the pair (p, q); that is, if (p_1, q_1) and (p_2, q_2) are two
pairs of probability mass functions, then

D(\lambda p_1 + (1-\lambda) p_2 || \lambda q_1 + (1-\lambda) q_2) \leq \lambda D(p_1||q_1) + (1-\lambda) D(p_2||q_2)

for all 0 \leq \lambda \leq 1.
Proof:
Apply the log sum inequality to each term of the left-hand side:

(\lambda p_1(x) + (1-\lambda) p_2(x)) \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)}
   \leq \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda) p_2(x) \log \frac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)}
   = \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}.

Summing over x \in \mathcal{X} gives

D(\lambda p_1 + (1-\lambda) p_2 || \lambda q_1 + (1-\lambda) q_2) \leq \lambda D(p_1||q_1) + (1-\lambda) D(p_2||q_2).

2) D(p||q) \geq 0.
Proof:
D(p||q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \geq \left(\sum_{x} p(x)\right) \log \frac{\sum_x p(x)}{\sum_x q(x)} = 1 \cdot \log 1 = 0.



Theorem 2.7.3 (Concavity of entropy)

H(p) is a concave function of p.
Proof:
H(p) = \log |\mathcal{X}| - D(p||u),
where u is the uniform distribution on |\mathcal{X}| outcomes. The concavity of H
then follows directly from the convexity of D.


DATA-PROCESSING INEQUALITY
Definition:
Random variables X, Y, Z are said to form a Markov chain in that order (denoted
by X \to Y \to Z) if the conditional distribution of Z depends only on Y and is
conditionally independent of X. Specifically, X, Y, and Z form a Markov chain
X \to Y \to Z if the joint probability mass function can be written as
p(x, y, z) = p(x) p(y|x) p(z|y) = p(x,y) p(z|y).
Some simple consequences are as follows:
X \to Y \to Z if and only if X and Z are conditionally independent given Y.
Markovity implies conditional independence because

p(x, z | y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y) p(z|y)}{p(y)} = p(x|y) p(z|y).

X \to Y \to Z implies that Z \to Y \to X. Thus, the condition is sometimes written
X \leftrightarrow Y \leftrightarrow Z.
If Z = f(Y), then X \to Y \to Z.

Theorem 2.8.1 (Data-processing inequality)

If X \to Y \to Z, then I(X;Y) \geq I(X;Z).
Proof: By the chain rule, we can expand mutual information in two
different ways:
I(X; Y,Z) = I(X;Z) + I(X;Y|Z)
          = I(X;Y) + I(X;Z|Y).
Since X and Z are conditionally independent given Y, we have I(X;Z|Y) = 0.
Since I(X;Y|Z) \geq 0, we have
I(X;Y) \geq I(X;Z).
We have equality if and only if I(X;Y|Z) = 0 (i.e., X \to Z \to Y forms
a Markov chain). Similarly, one can prove that I(Y;Z) \geq I(X;Z).
Corollary: In particular, if Z = g(Y), we have I(X;Y) \geq I(X;g(Y)).
Proof: X \to Y \to g(Y) forms a Markov chain.
Thus functions of the data Y cannot increase the information about X.
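An illustrative numeric check (my own construction, not from the slides): X is a fair bit, Y is X passed through a binary symmetric channel, and Z is Y passed through a second one; the cascade forms X \to Y \to Z, and I(X;Y) \geq I(X;Z) as the theorem states.

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def cascade_mi(crossovers):
    """I(X; output) for a fair input bit passed through a cascade of BSCs."""
    # Cascading BSC(e1) and BSC(e2) gives an equivalent BSC with
    # crossover probability e1(1 - e2) + e2(1 - e1).
    eps = 0.0
    for e in crossovers:
        eps = eps * (1 - e) + e * (1 - eps)
    # For a uniform input bit, I = H(output) - H(output|input) = 1 - H(eps).
    return 1 - H([eps, 1 - eps])

I_XY = cascade_mi([0.1])        # Y: X through one BSC(0.1)
I_XZ = cascade_mi([0.1, 0.2])   # Z: Y through a further BSC(0.2), so X -> Y -> Z
print(I_XY, ">=", I_XZ)         # ~0.531 >= ~0.173
```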

Corollary
If X \to Y \to Z, then I(X;Y|Z) \leq I(X;Y).
Proof:
From the two expansions of I(X; Y,Z) above, I(X;Z|Y) = 0 by Markovity, and I(X;Z) \geq 0; thus
I(X;Y|Z) = I(X;Y) - I(X;Z) \leq I(X;Y).
Thus, the dependence of X and Y is decreased (or remains unchanged) by the
observation of a downstream random variable Z. Note that it is also possible
that I(X;Y|Z) > I(X;Y) when X, Y, and Z do not form a Markov chain. For
example, let X and Y be independent fair binary random variables, and let
Z = X + Y. Then I(X;Y) = 0, but I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(X|Z)
= P(Z = 1) H(X|Z = 1) = 1/2 bit.
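A quick check of the non-Markov counterexample above (illustrative code of mine): with X, Y independent fair bits and Z = X + Y, I(X;Y) = 0 while I(X;Y|Z) = 1/2 bit.

```python
import math
from itertools import product

def Hbits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X, Y independent fair bits, Z = X + Y; each of the four triples has probability 1/4.
triples = [(x, y, x + y) for x, y in product([0, 1], repeat=2)]

def marginal_entropy(indices):
    """Entropy of the marginal over the chosen coordinates of (x, y, z)."""
    probs = {}
    for t in triples:
        key = tuple(t[i] for i in indices)
        probs[key] = probs.get(key, 0.0) + 0.25
    return Hbits(probs.values())

I_xy = marginal_entropy([0]) + marginal_entropy([1]) - marginal_entropy([0, 1])
I_xy_given_z = (marginal_entropy([0, 2]) + marginal_entropy([1, 2])
                - marginal_entropy([0, 1, 2]) - marginal_entropy([2]))
print(I_xy, I_xy_given_z)   # 0.0 and 0.5 bit
```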


Sufficient statistics
Let {f_\theta(x)} be a family of probability mass functions indexed by \theta,
let X be a sample from a distribution in this family, and let T(X) be any
statistic (function of the sample) X. Then \theta \to X \to T(X), and by the
data-processing inequality,

I(\theta; T(X)) \leq I(\theta; X).

T(X) is a sufficient statistic for \theta if

I(\theta; X) = I(\theta; T(X)),

equivalently, if \theta \to T(X) \to X, i.e., \theta and X are conditionally
independent given T(X).


The Fisher-Neyman Factorization Theorem

T(X) is a sufficient statistic for \theta if and only if the density factors as

f_\theta(x) = a(x) b_\theta(t), where t = T(x),

a(x) does not depend on \theta, and b_\theta(t) depends on x only through t.

Example: Let X_1, X_2, ..., X_N ~ N(\theta, 1) i.i.d., where \theta is unknown. Prove that
T(X) = \frac{1}{N}\sum_{n=1}^{N} X_n = \bar{X} is a sufficient statistic for \theta.

f_\theta(X) = \prod_{n=1}^{N} f_\theta(X_n)
           = \left(\frac{1}{2\pi}\right)^{N/2} e^{-\frac{1}{2}\sum_{n=1}^{N}(X_n - \theta)^2}
           = \left(\frac{1}{2\pi}\right)^{N/2} e^{-\frac{1}{2}\sum_{n=1}^{N}\left((X_n - \bar{X}) + (\bar{X} - \theta)\right)^2}
           = \left(\frac{1}{2\pi}\right)^{N/2} e^{-\frac{1}{2}\sum_{n=1}^{N}\left((X_n - \bar{X})^2 + 2(X_n - \bar{X})(\bar{X} - \theta) + (\bar{X} - \theta)^2\right)}.

Since \sum_{n=1}^{N}(X_n - \bar{X})(\bar{X} - \theta) = (\bar{X} - \theta)\sum_{n=1}^{N}(X_n - \bar{X}) = 0, the cross term vanishes and

f_\theta(X) = \left(\frac{1}{2\pi}\right)^{N/2} e^{-\frac{1}{2}\sum_{n=1}^{N}(X_n - \bar{X})^2} \cdot e^{-\frac{N}{2}(\bar{X} - \theta)^2}
           = a(X)\, b_\theta(\bar{X}),

so by the factorization theorem \bar{X} is sufficient for \theta.


QUIZ (10 min.)

For the joint distribution p(x, y), find:

(a) H(X), H(Y).
(b) H(X|Y), H(Y|X).
(c) H(X,Y).
(d) H(Y) - H(Y|X).
(e) I(X;Y).


FANO'S INEQUALITY
Suppose that we know a random variable Y and we wish to guess the value of a
correlated random variable X.
Fano's inequality relates the probability of error in guessing the random
variable X to its conditional entropy H(X|Y).
The conditional entropy of a random variable X given another random variable Y
is zero if and only if X is a function of Y.
Hence we can estimate X from Y with zero probability of error if and only if
H(X|Y) = 0.
We expect to be able to estimate X with a low probability of error only if the
conditional entropy H(X|Y) is small.
Fano's inequality quantifies this idea.

Let us wish to estimate a random variable X with a distribution p(x).
We observe a random variable Y that is related to X by the conditional
distribution p(y|x).
From Y, we calculate a function g(Y) = \hat{X}, where \hat{X} is an estimate of X.
We wish to bound the probability that \hat{X} \neq X.
We observe that X \to Y \to \hat{X} forms a Markov chain.
Define the probability of error

P_e = Pr\{\hat{X} \neq X\}.


Theorem 2.10.1 (Fano's inequality)
For any estimator \hat{X} such that X \to Y \to \hat{X}, with P_e = Pr(\hat{X} \neq X), we have

H(P_e) + P_e \log |\mathcal{X}| \geq H(X|\hat{X}) \geq H(X|Y).

This inequality can be weakened to

1 + P_e \log |\mathcal{X}| \geq H(X|Y)
or
P_e \geq \frac{H(X|Y) - 1}{\log |\mathcal{X}|}.

Proof:
Define an error random variable

E = 1 if \hat{X} \neq X,
    0 if \hat{X} = X.


Then, using the chain rule for entropies to expand H(E, X | \hat{X}) in two different
ways, we have

H(E, X | \hat{X}) = H(X | \hat{X}) + H(E | X, \hat{X})
                 = H(E | \hat{X}) + H(X | E, \hat{X}).

Since E is a function of X and \hat{X}, the conditional entropy H(E | X, \hat{X}) is equal
to zero. Since conditioning reduces entropy, H(E | \hat{X}) \leq H(E) = H(P_e), because E is
a binary-valued random variable. The remaining term, H(X | E, \hat{X}), can be bounded
as follows:

H(X | E, \hat{X}) = Pr(E = 0) H(X | \hat{X}, E = 0) + Pr(E = 1) H(X | \hat{X}, E = 1)
                \leq (1 - P_e) \cdot 0 + P_e \log |\mathcal{X}|,

since given E = 0, X = \hat{X}, and given E = 1, we can upper bound the conditional
entropy by the log of the number of possible outcomes. Combining these results,
we obtain

H(P_e) + P_e \log |\mathcal{X}| \geq H(X | \hat{X}).


By the data-processing inequality, we have I(X; \hat{X}) \leq I(X; Y), since
X \to Y \to \hat{X} is a Markov chain, and therefore H(X | \hat{X}) \geq H(X | Y). Thus, we have

H(P_e) + P_e \log |\mathcal{X}| \geq H(X | \hat{X}) \geq H(X | Y).

Corollary:
For any two random variables X and Y, let p = Pr(X \neq Y). Then
H(p) + p \log |\mathcal{X}| \geq H(X|Y).

Corollary:
Let P_e = Pr(\hat{X} \neq X) and let \hat{X} : \mathcal{Y} \to \mathcal{X}; then
H(P_e) + P_e \log(|\mathcal{X}| - 1) \geq H(X|Y).
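A numerical sanity check (the joint pmf and the MAP estimator below are my own illustration, not from the slides): for the estimator \hat{X}(y) = argmax_x p(x|y), the bound H(P_e) + P_e log(|X| - 1) \geq H(X|Y) holds.

```python
import math

def Hbits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# An arbitrary joint pmf p(x, y) on a 3x3 alphabet (rows x, columns y).
P = [
    [0.20, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.10, 0.05, 0.25],
]
py = [sum(P[x][y] for x in range(3)) for y in range(3)]

# H(X|Y) and the error probability of the MAP estimator xhat(y) = argmax_x p(x|y).
H_X_given_Y = sum(py[y] * Hbits([P[x][y] / py[y] for x in range(3)]) for y in range(3))
Pe = 1 - sum(max(P[x][y] for x in range(3)) for y in range(3))

def Hb(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(Hb(Pe) + Pe * math.log2(3 - 1), ">=", H_X_given_Y)   # ~1.284 >= ~1.260
```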


Remark: Suppose that there is no knowledge of Y. Thus, X must be guessed
without any information. Let X \in \{1, 2, ..., m\} and p_1 \geq p_2 \geq ... \geq p_m.
Then the best guess of X is \hat{X} = 1, and the resulting probability of error is
P_e = 1 - p_1. Fano's inequality becomes

H(P_e) + P_e \log(m - 1) \geq H(X).

The probability mass function

(p_1, p_2, ..., p_m) = \left(1 - P_e, \frac{P_e}{m-1}, ..., \frac{P_e}{m-1}\right)

achieves this bound with equality.

While we are at it, let us introduce a new inequality relating probability of error
and entropy. Let X and X' be two independent identically distributed random
variables with entropy H(X). The probability that X = X' is given by

Pr(X = X') = \sum_{x} p^2(x).


Lemma 2.10.1
If X and X' are i.i.d. with entropy H(X), then

Pr(X = X') \geq 2^{-H(X)},

with equality if and only if X has a uniform distribution.

Proof:
Suppose that X ~ p(x). By Jensen's inequality (2^t is convex), we have

2^{E \log p(X)} \leq E\, 2^{\log p(X)},

which implies that

2^{-H(X)} = 2^{\sum_x p(x) \log p(x)} \leq \sum_x p(x)\, 2^{\log p(x)} = \sum_x p^2(x) = Pr(X = X').


Corollary
Let X, X' be independent with X ~ p(x), X' ~ r(x), x, x' \in \mathcal{X}. Then

Pr(X = X') \geq 2^{-H(p) - D(p||r)},
Pr(X = X') \geq 2^{-H(r) - D(r||p)}.

Proof:
2^{-H(p) - D(p||r)} = 2^{\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{r(x)}{p(x)}}
                   = 2^{\sum_x p(x) \log r(x)}
                   \leq \sum_x p(x)\, 2^{\log r(x)}          (Jensen's inequality)
                   = \sum_x p(x) r(x)
                   = Pr(X = X').

Example:
Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn
according to probability mass functions p1(·) and p2(·) over the respective
alphabets \mathcal{X}_1 = {1, 2, ..., m} and \mathcal{X}_2 = {m + 1, ..., n}. Let

X = X1 with probability \alpha,
    X2 with probability 1 - \alpha.

Find H(X) in terms of H(X1), H(X2), and \alpha.

Solution:
Since X1 and X2 have disjoint support sets, we can tell from X which of the two
alphabets it came from. Define the indicator

f(X) = 1 when X \in \mathcal{X}_1 (i.e., X = X1),
       2 when X \in \mathcal{X}_2 (i.e., X = X2),

so that Pr(f(X) = 1) = \alpha. Since f(X) is a function of X,

H(X) = H(X, f(X)) = H(f(X)) + H(X | f(X))
     = H(\alpha) + p(f(X) = 1) H(X | f(X) = 1) + p(f(X) = 2) H(X | f(X) = 2)
     = H(\alpha) + \alpha H(X1) + (1 - \alpha) H(X2),

where H(\alpha) = -\alpha \log \alpha - (1 - \alpha) \log(1 - \alpha).
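An illustrative numeric check (the pmfs and \alpha below are arbitrary choices of mine): building the mixture explicitly and comparing with the formula H(\alpha) + \alpha H(X1) + (1 - \alpha) H(X2).

```python
import math

def Hbits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

alpha = 0.3
p1 = [0.5, 0.5]              # pmf of X1 on {1, 2}
p2 = [0.25, 0.25, 0.5]       # pmf of X2 on {3, 4, 5}

# pmf of the mixture X on the combined (disjoint) alphabet {1, ..., 5}.
mixture = [alpha * p for p in p1] + [(1 - alpha) * p for p in p2]

lhs = Hbits(mixture)
rhs = Hbits([alpha, 1 - alpha]) + alpha * Hbits(p1) + (1 - alpha) * Hbits(p2)
print(lhs, rhs)   # both ~2.2313 bits
```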
