
ELL714: INFORMATION THEORY

ENTROPY, RELATIVE ENTROPY, AND MUTUAL INFORMATION

Manav Bhatnagar

ENTROPY
Entropy is a measure of the uncertainty of a random variable.
The entropy H(X) of a discrete random variable X is defined by
H(X) = -\sum_{x} p(x) \log p(x)

We denote expectation by E. Thus, if X ~ p(x), the expected value of the
random variable g(X) is written

E_p g(X) = \sum_{x} g(x) p(x)

The entropy of X can also be interpreted as the expected value of the random
variable log 1/p(X), where X is drawn according to the probability mass
function p(x). Thus,

H(X) = E_p \log \frac{1}{p(X)}
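As an illustrative sketch (not part of the original slides; the function name and example pmfs are my own choices), entropy in bits can be computed directly from a probability mass function:

```python
import math

def entropy_bits(pmf):
    """H(X) = -sum_x p(x) log2 p(x); zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# The pmf (1/2, 1/4, 1/8, 1/8) has entropy 7/4 = 1.75 bits.
print(entropy_bits([0.5, 0.25, 0.125, 0.125]))   # 1.75
# A uniform pmf over 32 outcomes has entropy log2(32) = 5 bits.
print(entropy_bits([1 / 32] * 32))               # 5.0
```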


Lemma 2.1.1: H(X) \geq 0.
Proof: 0 \leq p(x) \leq 1 implies that \log \frac{1}{p(x)} \geq 0.

Lemma 2.1.2: H_b(X) = (\log_b a) H_a(X).
Proof: \log_b p = (\log_b a)(\log_a p).

Properties of entropy
a) H(X) \geq 0.
   Since 1 \geq p(x) \geq 0, we have -\log p(x) \geq 0.
b) H_b(X) = (\log_b a) H_a(X).
   Since \log_b p = (\log_b a)(\log_a p),

   E_p[-\log_b p(X)] = E_p[(\log_b a)(-\log_a p(X))]
                     = (\log_b a) E_p[-\log_a p(X)]
                     = (\log_b a) H_a(X).

EXAMPLE
Let

X = 1 with probability p,
    0 with probability 1 - p.

Then

H(X) = -p \log p - (1-p) \log(1-p) =: H(p),

the binary entropy function.


\frac{d}{dp} H(p) = -\log p - \frac{p}{p} + \log(1-p) + \frac{1-p}{1-p} = -\log p + \log(1-p)

Setting \frac{d}{dp} H(p) = 0 gives p = \frac{1}{2}, and

H\left(\frac{1}{2}\right) = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = 1 bit.

Hence the binary entropy is maximized at p = 1/2, where it equals 1 bit.
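A small numerical sketch (mine, not from the slides): evaluating the binary entropy on a grid confirms the maximum of 1 bit at p = 1/2.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Evaluate on a grid: the maximum of 1 bit is attained at p = 1/2.
grid = [i / 1000 for i in range(1001)]
p_star = max(grid, key=binary_entropy)
print(p_star, binary_entropy(p_star))   # 0.5 1.0
```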


Example 1.1.1
Consider a random variable that has a uniform distribution over 32 outcomes. To
identify an outcome, we need a label that takes on 32 different values. Thus, 5-bit
strings suffice as labels. The entropy of this random variable is

H(X) = -\sum_{i=1}^{32} p(i) \log p(i) = -\sum_{i=1}^{32} \frac{1}{32} \log \frac{1}{32} = \log 32 = 5 bits.

Example 1.1.2
Suppose that we have a horse race with eight horses taking part. Assume that
the probabilities of winning for the eight horses are 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64. Then

H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{16}\log\frac{1}{16} - 4 \cdot \frac{1}{64}\log\frac{1}{64} = 2 bits.


Example 2.1.2
Let

X = a with probability 1/2,
    b with probability 1/4,
    c with probability 1/8,
    d with probability 1/8.

The entropy of X is

H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4} bits.


JOINT ENTROPY AND CONDITIONAL ENTROPY

Definition: The joint entropy H(X,Y) of a pair of discrete random
variables (X, Y) with a joint distribution p(x, y) is defined as

H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)

Definition: If (X, Y) ~ p(x, y), the conditional entropy H(Y|X) is defined as

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X=x)
       = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x)
       = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x)
       = -E \log p(Y|X)

Theorem 2.2.1 (Chain rule)

H(X,Y) = H(X) + H(Y|X).
Equivalently, we can write
log p(X,Y) = log p(X) + log p(Y|X).
Corollary
H(X,Y|Z) = H(X|Z) + H(Y|X,Z).


Example 2.2.1
Let (X, Y) have the following joint distribution p(x, y):

        X=1     X=2     X=3     X=4
Y=1     1/8     1/16    1/32    1/32
Y=2     1/16    1/8     1/32    1/32
Y=3     1/16    1/16    1/16    1/16
Y=4     1/4     0       0       0

The marginal distribution of X is (1/2, 1/4, 1/8, 1/8) and the marginal distribution
of Y is (1/4, 1/4, 1/4, 1/4); hence H(X) = 7/4 bits and H(Y) = 2 bits. Also

H(X|Y) = \sum_{i=1}^{4} p(Y=i) H(X|Y=i)
       = \frac{1}{4} H(\tfrac{1}{2},\tfrac{1}{4},\tfrac{1}{8},\tfrac{1}{8})
       + \frac{1}{4} H(\tfrac{1}{4},\tfrac{1}{2},\tfrac{1}{8},\tfrac{1}{8})
       + \frac{1}{4} H(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})
       + \frac{1}{4} H(1,0,0,0)
       = \frac{1}{4}\cdot\frac{7}{4} + \frac{1}{4}\cdot\frac{7}{4} + \frac{1}{4}\cdot 2 + \frac{1}{4}\cdot 0
       = \frac{11}{8} bits.

Similarly, H(Y|X) = 13/8 bits and H(X,Y) = 27/8 bits. Note that
H(X|Y) \neq H(Y|X), but H(X) - H(X|Y) = H(Y) - H(Y|X).
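As an illustrative check (the code and names are mine, not from the slides), the quantities of Example 2.2.1 can be recomputed directly from the joint table:

```python
import math
from fractions import Fraction as F

# Joint pmf p(x, y) of Example 2.2.1; rows are y = 1..4, columns are x = 1..4.
P = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def H(pmf):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(float(p) * math.log2(float(p)) for p in pmf if p > 0)

px = [sum(row[j] for row in P) for j in range(4)]   # marginal of X: (1/2, 1/4, 1/8, 1/8)
py = [sum(row) for row in P]                        # marginal of Y: (1/4, 1/4, 1/4, 1/4)
HX, HY = H(px), H(py)                               # 1.75 and 2.0 bits
HXY = H(p for row in P for p in row)                # H(X,Y) = 27/8 = 3.375 bits
print(HXY - HY, HXY - HX)                           # H(X|Y) = 11/8, H(Y|X) = 13/8 bits
```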


RELATIVE ENTROPY AND MUTUAL INFORMATION

The relative entropy is a measure of the distance between two distributions.
The relative entropy D(p||q) is a measure of the inefficiency of assuming
that the distribution is q when the true distribution is p.
For example, if we knew the true distribution p of the random variable, we could
construct a code with average description length H(p). If, instead, we used the
code for a distribution q, we would need H(p) + D(p||q) bits on the average to
describe the random variable.
*Verify this for a D-adic distribution with four states.
Definition: The relative entropy or Kullback-Leibler distance between two
probability mass functions p(x) and q(x) is defined as

D(p||q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)}

If there is any symbol x \in \mathcal{X} such that p(x) > 0 and q(x) = 0, then D(p||q) = \infty.
Relative entropy is always non-negative and is zero if and only if p = q.
However, it is not a true distance between distributions, since it is not symmetric
and does not satisfy the triangle inequality.
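A minimal sketch of the definition (my own illustration, using base-2 logarithms); comparing D(p||q) with D(q||p) shows the asymmetry mentioned above:

```python
import math

def kl_divergence_bits(p, q):
    """D(p||q) = sum p(x) log2(p(x)/q(x)); infinite if q(x) = 0 where p(x) > 0."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        d += pi * math.log2(pi / qi)
    return d

p = [0.75, 0.25]
q = [0.50, 0.50]
print(kl_divergence_bits(p, q))   # ~0.1887 bits
print(kl_divergence_bits(q, p))   # ~0.2075 bits: D is not symmetric
```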

Mutual Information
Consider two random variables X and Y with a joint probability mass function
p(x, y) and marginal probability mass functions p(x) and p(y). The mutual
information I(X;Y) is the relative entropy between the joint distribution and
the product distribution p(x)p(y):

I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
       = D(p(x,y) || p(x)p(y))
       = E_{p(x,y)} \log \frac{p(X,Y)}{p(X)p(Y)}


RELATIONSHIP BETWEEN ENTROPY AND MUTUAL INFORMATION

I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
       = \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)}
       = -\sum_{x,y} p(x,y) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y)
       = -\sum_{x} p(x) \log p(x) - \left(-\sum_{x,y} p(x,y) \log p(x|y)\right)
       = H(X) - H(X|Y)

By symmetry, it also follows that
I(X;Y) = H(Y) - H(Y|X).
Since H(X,Y) = H(X) + H(Y|X), we have
I(X;Y) = H(X) + H(Y) - H(X,Y).
Finally, we note that
I(X;X) = H(X) - H(X|X) = H(X).

Theorem 2.4.1 (Mutual information and entropy)

I(X;Y) = H(X) - H(X|Y)
I(X;Y) = H(Y) - H(Y|X)
I(X;Y) = H(X) + H(Y) - H(X,Y)
I(X;Y) = I(Y;X)
I(X;X) = H(X)
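The identities of Theorem 2.4.1 are easy to confirm numerically; the sketch below (my own, reusing the joint table of Example 2.2.1) computes I(X;Y) three ways:

```python
import math

# Joint pmf of Example 2.2.1 (rows y, columns x).
P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0],
]
px = [sum(row[j] for row in P) for j in range(4)]
py = [sum(row) for row in P]

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

HX, HY = H(px), H(py)
HXY = H([p for row in P for p in row])

# I(X;Y) computed as the relative entropy D(p(x,y) || p(x)p(y)).
I_kl = sum(P[i][j] * math.log2(P[i][j] / (px[j] * py[i]))
           for i in range(4) for j in range(4) if P[i][j] > 0)

print(I_kl)              # 0.375 bits
print(HX + HY - HXY)     # same: H(X) + H(Y) - H(X,Y)
print(HX - (HXY - HY))   # same: H(X) - H(X|Y)
```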


CHAIN RULES FOR ENTROPY, RELATIVE ENTROPY, AND MUTUAL INFORMATION

Theorem 2.5.1 (Chain rule for entropy)
Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1)

Proof: by repeated application of the two-variable chain rule H(X,Y) = H(X) + H(Y|X).

Theorem 2.5.2 (Chain rule for information)

I(X_1, X_2, ..., X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, ..., X_1)


Definition: For joint probability mass functions p(x, y) and q(x, y), the
conditional relative entropy D(p(y|x)||q(y|x)) is the average of the relative
entropies between the conditional probability mass functions p(y|x) and
q(y|x), averaged over the probability mass function p(x):

D(p(y|x) || q(y|x)) = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)}
                    = E_{p(x,y)} \log \frac{p(Y|X)}{q(Y|X)}

Theorem 2.5.3 (Chain rule for relative entropy)

D(p(x,y) || q(x,y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x)).
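An illustrative numeric check of Theorem 2.5.3 (the joint pmfs below are arbitrary examples of mine, not from the slides):

```python
import math

def kl(p, q):
    """D(p||q) in bits for aligned lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two joint pmfs on a 2x2 alphabet, stored as p[x][y] (arbitrary example values).
P = [[0.4, 0.2], [0.1, 0.3]]
Q = [[0.25, 0.25], [0.25, 0.25]]

px = [sum(row) for row in P]
qx = [sum(row) for row in Q]

d_joint = kl([p for row in P for p in row], [q for row in Q for q in row])
d_marginal = kl(px, qx)
# Conditional term: sum_x p(x) * D(p(y|x) || q(y|x)).
d_conditional = sum(px[i] * kl([P[i][j] / px[i] for j in range(2)],
                               [Q[i][j] / qx[i] for j in range(2)])
                    for i in range(2))

print(d_joint, d_marginal + d_conditional)   # both ~0.1536 bits
```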


JENSEN'S INEQUALITY AND ITS CONSEQUENCES

Definition: A function f(x) is said to be convex over an interval (a, b) if for every
x_1, x_2 \in (a, b) and 0 \leq \lambda \leq 1,

f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2).

A function f(x) is said to be strictly convex if equality holds only if \lambda = 0 or \lambda = 1.

Theorem 2.6.1: If the function f has a second derivative that is non-negative
(positive) over an interval, the function is convex (strictly convex) over that
interval.

Convex functions: x^2, |x|, e^x, x \log x (for x \geq 0).
Concave functions: \log x, \sqrt{x} (for x \geq 0).
Convexity test: f''(x) \geq 0.
Strict convexity: f''(x) > 0.
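As a quick worked instance of the second-derivative test (my own illustration), for f(x) = x \ln x on x > 0:

f(x) = x \ln x, \qquad f'(x) = \ln x + 1, \qquad f''(x) = \frac{1}{x} > 0 \quad (x > 0),

so x log x is strictly convex on (0, \infty) in any base (changing the base only multiplies f by a positive constant); similarly, f(x) = \ln x has f''(x) = -1/x^2 < 0, so log x is concave.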


Theorem 2.6.2 (Jensen's inequality)

If f is a convex function and X is a random variable,
E f(X) \geq f(E X).
Proof:
For a two-mass-point distribution, the inequality becomes
p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2),
which follows directly from the definition of convex functions. Suppose that
the theorem is true for distributions with k - 1 mass points. Then writing
p_i' = p_i / (1 - p_k) for i = 1, 2, ..., k - 1, we have

\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i)
                       \geq p_k f(x_k) + (1 - p_k) f\left(\sum_{i=1}^{k-1} p_i' x_i\right)
                       \geq f\left(p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i\right)
                       = f\left(\sum_{i=1}^{k} p_i x_i\right)
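A small numerical illustration (mine, not from the slides): for the convex function f(x) = x^2, E[f(X)] \geq f(E[X]) for any pmf.

```python
import random

def expectation(values, pmf, f=lambda x: x):
    return sum(p * f(v) for v, p in zip(values, pmf))

random.seed(0)
values = [1, 2, 5, 9]
for _ in range(3):
    w = [random.random() for _ in values]
    pmf = [x / sum(w) for x in w]
    lhs = expectation(values, pmf, f=lambda x: x ** 2)   # E[f(X)]
    rhs = expectation(values, pmf) ** 2                  # f(E[X])
    print(lhs >= rhs)   # True for the convex function f(x) = x^2
```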


Theorem 2.6.3 (Information inequality)

Let p(x), q(x), x \in \mathcal{X}, be two probability mass functions. Then
D(p||q) \geq 0
with equality if and only if p(x) = q(x) for all x.
Proof:
Let A = {x : p(x) > 0} be the support set of p(x). Then

-D(p||q) = -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}
         = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)}
         \leq \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)}      (Jensen's inequality, log is concave)
         = \log \sum_{x \in A} q(x)
         \leq \log \sum_{x \in \mathcal{X}} q(x)
         = \log 1
         = 0.

Equality in the Jensen step holds iff q(x)/p(x) = c (a constant) on A, and equality
in the last step holds iff \sum_{x \in A} q(x) = 1; together these give c = 1, i.e.,
p(x) = q(x) for all x.

Corollary (Non-negativity of mutual information)

For any two random variables X, Y,
I(X;Y) \geq 0,
with equality if and only if X and Y are independent.

Proof:
I(X;Y) = D(p(x,y) || p(x)p(y)) \geq 0,
with equality if and only if p(x,y) = p(x)p(y) (i.e., X and Y are independent).

Corollary
D(p(y|x) || q(y|x)) \geq 0,
with equality if and only if p(y|x) = q(y|x) for all y and x such that p(x) > 0.

Corollary
I(X;Y|Z) \geq 0,
with equality if and only if X and Y are conditionally independent given Z.

Theorem 2.6.4 (Upper bound on entropy)

H(X) \leq \log |\mathcal{X}|, where |\mathcal{X}| denotes the number of elements in the range of X,
with equality if and only if X has a uniform distribution over \mathcal{X}.
Proof:
Let u(x) = 1/|\mathcal{X}| be the uniform probability mass function over \mathcal{X}, and let p(x)
be the probability mass function for X. Then

D(p || u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X).

Hence, by the non-negativity of relative entropy,

0 \leq D(p || u) = \log |\mathcal{X}| - H(X),

so H(X) \leq \log |\mathcal{X}|.
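Illustrative check (mine): randomly generated pmfs on an 8-symbol alphabet never exceed the uniform entropy log |X| = 3 bits.

```python
import math
import random

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

random.seed(1)
n = 8
for _ in range(5):
    w = [random.random() for _ in range(n)]
    pmf = [x / sum(w) for x in w]
    print(round(H(pmf), 4), "<=", math.log2(n))   # never exceeds log2(8) = 3 bits
```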


Theorem 2.6.5 (Conditioning reduces entropy)

H(X|Y) \leq H(X)
with equality if and only if X and Y are independent.
Proof:
I(X;Y) \geq 0
\Rightarrow H(X) - H(X|Y) \geq 0
\Rightarrow H(X|Y) \leq H(X).

Theorem 2.6.6
Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) \leq \sum_{i=1}^{n} H(X_i)

with equality if and only if the X_i are independent.
Proof:

H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1)
                     \leq \sum_{i=1}^{n} H(X_i)

(by the chain rule and because conditioning reduces entropy).


LOG SUM INEQUALITY AND ITS APPLICATIONS

Theorem 2.7.1 (Log sum inequality)
For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}

with equality if and only if a_i / b_i = constant.

Proof:
Let A = \sum_{i=1}^{n} a_i, B = \sum_{i=1}^{n} b_i, and define the normalized quantities
a_i' = a_i / A and b_i' = b_i / B. Then

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} = A \sum_{i=1}^{n} a_i' \log \frac{a_i' A}{b_i' B}
    = A \sum_{i=1}^{n} a_i' \log \frac{a_i'}{b_i'} + A \sum_{i=1}^{n} a_i' \log \frac{A}{B}
    = A\, D(a' || b') + A \log \frac{A}{B}
    \geq A \log \frac{A}{B}
    = \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},

since D(a' || b') \geq 0, with equality iff a_i' = b_i' for all i, i.e., a_i / b_i = A/B = constant.


We now use the log sum inequality to prove various convexity results.
We begin by reproving Theorem 2.6.3, which states that D(p||q) \geq 0 with
equality if and only if p(x) = q(x). By the log sum inequality,

D(p||q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
        \geq \left(\sum_{x} p(x)\right) \log \frac{\sum_x p(x)}{\sum_x q(x)}
        = 1 \cdot \log \frac{1}{1} = 0,

with equality if and only if p(x)/q(x) = c. Since both p and q are probability
mass functions, c = 1, and hence we have D(p||q) = 0 if and only if p(x) =
q(x) for all x.
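A quick numerical sketch of the log sum inequality (the numbers below are arbitrary, chosen here for illustration):

```python
import math

def log_sum_lhs(a, b):
    """sum_i a_i log2(a_i / b_i), with the convention 0 log 0 = 0."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def log_sum_rhs(a, b):
    A, B = sum(a), sum(b)
    return A * math.log2(A / B)

a, b = [3.0, 1.0, 4.0], [2.0, 2.0, 1.0]
print(log_sum_lhs(a, b), ">=", log_sum_rhs(a, b))   # strict inequality here

# Equality when a_i / b_i is constant:
a2, b2 = [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]
print(log_sum_lhs(a2, b2), "==", log_sum_rhs(a2, b2))   # both equal 12.0
```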


1) Theorem 2.7.2 (Convexity of relative entropy)
D(p||q) is convex in the pair (p, q); that is, if (p_1, q_1) and (p_2, q_2) are two
pairs of probability mass functions, then

D(\lambda p_1 + (1-\lambda) p_2 || \lambda q_1 + (1-\lambda) q_2) \leq \lambda D(p_1||q_1) + (1-\lambda) D(p_2||q_2)

for all 0 \leq \lambda \leq 1.
Proof:
Apply the log sum inequality to each term of the left-hand side:

(\lambda p_1(x) + (1-\lambda) p_2(x)) \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)}
   \leq \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda) p_2(x) \log \frac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)}
   = \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}.

Summing over x \in \mathcal{X} gives

D(\lambda p_1 + (1-\lambda) p_2 || \lambda q_1 + (1-\lambda) q_2) \leq \lambda D(p_1||q_1) + (1-\lambda) D(p_2||q_2).

2) D(p||q) \geq 0.
Proof:
D(p||q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \geq \left(\sum_{x} p(x)\right) \log \frac{\sum_x p(x)}{\sum_x q(x)} = 1 \cdot \log 1 = 0.



Theorem 2.7.3 (Concavity of entropy)

H(p) is a concave function of p.
Proof:
H(p) = \log |\mathcal{X}| - D(p||u),
where u is the uniform distribution on |\mathcal{X}| outcomes. The concavity of H
then follows directly from the convexity of D.


DATA-PROCESSING INEQUALITY
Definition:
Random variables X, Y, Z are said to form a Markov chain in that order (denoted
by X \to Y \to Z) if the conditional distribution of Z depends only on Y and is
conditionally independent of X. Specifically, X, Y, and Z form a Markov chain
X \to Y \to Z if the joint probability mass function can be written as
p(x, y, z) = p(x) p(y|x) p(z|y) = p(x,y) p(z|y).
Some simple consequences are as follows:
X \to Y \to Z if and only if X and Z are conditionally independent given Y.
Markovity implies conditional independence because

p(x, z | y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y) p(z|y)}{p(y)} = p(x|y) p(z|y).

X \to Y \to Z implies that Z \to Y \to X. Thus, the condition is sometimes written
X \leftrightarrow Y \leftrightarrow Z.
If Z = f(Y), then X \to Y \to Z.

Theorem 2.8.1 (Data-processing inequality)

If X \to Y \to Z, then I(X;Y) \geq I(X;Z).
Proof: By the chain rule, we can expand mutual information in two
different ways:
I(X; Y,Z) = I(X;Z) + I(X;Y|Z)
          = I(X;Y) + I(X;Z|Y).
Since X and Z are conditionally independent given Y, we have I(X;Z|Y) = 0.
Since I(X;Y|Z) \geq 0, we have
I(X;Y) \geq I(X;Z).
We have equality if and only if I(X;Y|Z) = 0 (i.e., X \to Z \to Y forms
a Markov chain). Similarly, one can prove that I(Y;Z) \geq I(X;Z).
Corollary: In particular, if Z = g(Y), we have I(X;Y) \geq I(X;g(Y)).
Proof: X \to Y \to g(Y) forms a Markov chain.
Thus functions of the data Y cannot increase the information about X.
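An illustrative numeric check (my own construction, not from the slides): X is a fair bit, Y is X passed through a binary symmetric channel, and Z is Y passed through a second one; the cascade forms X \to Y \to Z, and I(X;Y) \geq I(X;Z) as the theorem states.

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def cascade_mi(crossovers):
    """I(X; output) for a fair input bit passed through a cascade of BSCs."""
    # Cascading BSC(e1) and BSC(e2) gives an equivalent BSC with
    # crossover probability e1(1 - e2) + e2(1 - e1).
    eps = 0.0
    for e in crossovers:
        eps = eps * (1 - e) + e * (1 - eps)
    # For a uniform input bit, I = H(output) - H(output|input) = 1 - H(eps).
    return 1 - H([eps, 1 - eps])

I_XY = cascade_mi([0.1])        # Y: X through one BSC(0.1)
I_XZ = cascade_mi([0.1, 0.2])   # Z: Y through a further BSC(0.2), so X -> Y -> Z
print(I_XY, ">=", I_XZ)         # ~0.531 >= ~0.173
```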

Corollary
If X \to Y \to Z, then I(X;Y|Z) \leq I(X;Y).
Proof:
From the two expansions of I(X; Y,Z) above, I(X;Z|Y) = 0 by Markovity, and I(X;Z) \geq 0; thus
I(X;Y|Z) = I(X;Y) - I(X;Z) \leq I(X;Y).
Thus, the dependence of X and Y is decreased (or remains unchanged) by the
observation of a downstream random variable Z. Note that it is also possible
that I(X;Y|Z) > I(X;Y) when X, Y, and Z do not form a Markov chain. For
example, let X and Y be independent fair binary random variables, and let
Z = X + Y. Then I(X;Y) = 0, but I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(X|Z)
= P(Z = 1) H(X|Z = 1) = 1/2 bit.
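A quick check of the non-Markov counterexample above (illustrative code of mine): with X, Y independent fair bits and Z = X + Y, I(X;Y) = 0 while I(X;Y|Z) = 1/2 bit.

```python
import math
from itertools import product

def Hbits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X, Y independent fair bits, Z = X + Y; each of the four triples has probability 1/4.
triples = [(x, y, x + y) for x, y in product([0, 1], repeat=2)]

def marginal_entropy(indices):
    """Entropy of the marginal over the chosen coordinates of (x, y, z)."""
    probs = {}
    for t in triples:
        key = tuple(t[i] for i in indices)
        probs[key] = probs.get(key, 0.0) + 0.25
    return Hbits(probs.values())

I_xy = marginal_entropy([0]) + marginal_entropy([1]) - marginal_entropy([0, 1])
I_xy_given_z = (marginal_entropy([0, 2]) + marginal_entropy([1, 2])
                - marginal_entropy([0, 1, 2]) - marginal_entropy([2]))
print(I_xy, I_xy_given_z)   # 0.0 and 0.5 bit
```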


Sufficient statistics
Let {f_\theta(x)} be a family of probability mass functions indexed by \theta,
let X be a sample from a distribution in this family, and let T(X) be any
statistic (function of the sample) X. Then \theta \to X \to T(X), and by the
data-processing inequality,

I(\theta; T(X)) \leq I(\theta; X).

T(X) is a sufficient statistic for \theta if

I(\theta; X) = I(\theta; T(X)),

equivalently, if \theta \to T(X) \to X, i.e., \theta and X are conditionally
independent given T(X).


The Fisher-Neyman Factorization Theorem

T(X) is a sufficient statistic for \theta if and only if the density factors as

f_\theta(x) = a(x) b_\theta(t), where t = T(x),

a(x) does not depend on \theta, and b_\theta(t) depends on x only through t.

Example: Let X_1, X_2, ..., X_N ~ N(\theta, 1) i.i.d., where \theta is unknown. Prove that
T(X) = \frac{1}{N}\sum_{n=1}^{N} X_n = \bar{X} is a sufficient statistic for \theta.

f_\theta(X) = \prod_{n=1}^{N} f_\theta(X_n)
           = \left(\frac{1}{2\pi}\right)^{N/2} e^{-\frac{1}{2}\sum_{n=1}^{N}(X_n - \theta)^2}
           = \left(\frac{1}{2\pi}\right)^{N/2} e^{-\frac{1}{2}\sum_{n=1}^{N}\left((X_n - \bar{X}) + (\bar{X} - \theta)\right)^2}
           = \left(\frac{1}{2\pi}\right)^{N/2} e^{-\frac{1}{2}\sum_{n=1}^{N}\left((X_n - \bar{X})^2 + 2(X_n - \bar{X})(\bar{X} - \theta) + (\bar{X} - \theta)^2\right)}.

Since \sum_{n=1}^{N}(X_n - \bar{X})(\bar{X} - \theta) = (\bar{X} - \theta)\sum_{n=1}^{N}(X_n - \bar{X}) = 0, the cross term vanishes and

f_\theta(X) = \left(\frac{1}{2\pi}\right)^{N/2} e^{-\frac{1}{2}\sum_{n=1}^{N}(X_n - \bar{X})^2} \cdot e^{-\frac{N}{2}(\bar{X} - \theta)^2}
           = a(X)\, b_\theta(\bar{X}),

so by the factorization theorem \bar{X} is sufficient for \theta.


QUIZ (10 min.)

For the joint distribution p(x, y), find:

(a) H(X), H(Y).
(b) H(X|Y), H(Y|X).
(c) H(X,Y).
(d) H(Y) - H(Y|X).
(e) I(X;Y).


FANO'S INEQUALITY
Suppose that we know a random variable Y and we wish to guess the value of a
correlated random variable X.
Fano's inequality relates the probability of error in guessing the random
variable X to its conditional entropy H(X|Y).
The conditional entropy of a random variable X given another random variable Y
is zero if and only if X is a function of Y.
Hence we can estimate X from Y with zero probability of error if and only if
H(X|Y) = 0.
We expect to be able to estimate X with a low probability of error only if the
conditional entropy H(X|Y) is small.
Fano's inequality quantifies this idea.

Let us wish to estimate a random variable X with a distribution p(x).
We observe a random variable Y that is related to X by the conditional
distribution p(y|x).
From Y, we calculate a function g(Y) = \hat{X}, where \hat{X} is an estimate of X.
We wish to bound the probability that \hat{X} \neq X.
We observe that X \to Y \to \hat{X} forms a Markov chain.
Define the probability of error

P_e = Pr\{\hat{X} \neq X\}.


Theorem 2.10.1 (Fano's inequality)
For any estimator \hat{X} such that X \to Y \to \hat{X}, with P_e = Pr(\hat{X} \neq X), we have

H(P_e) + P_e \log |\mathcal{X}| \geq H(X|\hat{X}) \geq H(X|Y).

This inequality can be weakened to

1 + P_e \log |\mathcal{X}| \geq H(X|Y)
or
P_e \geq \frac{H(X|Y) - 1}{\log |\mathcal{X}|}.

Proof:
Define an error random variable

E = 1 if \hat{X} \neq X,
    0 if \hat{X} = X.


Then, using the chain rule for entropies to expand H(E, X | \hat{X}) in two different
ways, we have

H(E, X | \hat{X}) = H(X | \hat{X}) + H(E | X, \hat{X})
                 = H(E | \hat{X}) + H(X | E, \hat{X}).

Since E is a function of X and \hat{X}, the conditional entropy H(E | X, \hat{X}) is equal
to zero. Since conditioning reduces entropy, H(E | \hat{X}) \leq H(E) = H(P_e), because E is
a binary-valued random variable. The remaining term, H(X | E, \hat{X}), can be bounded
as follows:

H(X | E, \hat{X}) = Pr(E = 0) H(X | \hat{X}, E = 0) + Pr(E = 1) H(X | \hat{X}, E = 1)
                \leq (1 - P_e) \cdot 0 + P_e \log |\mathcal{X}|,

since given E = 0, X = \hat{X}, and given E = 1, we can upper bound the conditional
entropy by the log of the number of possible outcomes. Combining these results,
we obtain

H(P_e) + P_e \log |\mathcal{X}| \geq H(X | \hat{X}).


By the data-processing inequality, we have I(X; \hat{X}) \leq I(X; Y), since
X \to Y \to \hat{X} is a Markov chain, and therefore H(X | \hat{X}) \geq H(X | Y). Thus, we have

H(P_e) + P_e \log |\mathcal{X}| \geq H(X | \hat{X}) \geq H(X | Y).

Corollary:
For any two random variables X and Y, let p = Pr(X \neq Y). Then
H(p) + p \log |\mathcal{X}| \geq H(X|Y).

Corollary:
Let P_e = Pr(\hat{X} \neq X) and let \hat{X} : \mathcal{Y} \to \mathcal{X}; then
H(P_e) + P_e \log(|\mathcal{X}| - 1) \geq H(X|Y).
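A numerical sanity check (the joint pmf and the MAP estimator below are my own illustration, not from the slides): for the estimator \hat{X}(y) = argmax_x p(x|y), the bound H(P_e) + P_e log(|X| - 1) \geq H(X|Y) holds.

```python
import math

def Hbits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# An arbitrary joint pmf p(x, y) on a 3x3 alphabet (rows x, columns y).
P = [
    [0.20, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.10, 0.05, 0.25],
]
py = [sum(P[x][y] for x in range(3)) for y in range(3)]

# H(X|Y) and the error probability of the MAP estimator xhat(y) = argmax_x p(x|y).
H_X_given_Y = sum(py[y] * Hbits([P[x][y] / py[y] for x in range(3)]) for y in range(3))
Pe = 1 - sum(max(P[x][y] for x in range(3)) for y in range(3))

def Hb(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(Hb(Pe) + Pe * math.log2(3 - 1), ">=", H_X_given_Y)   # ~1.284 >= ~1.260
```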


Remark: Suppose that there is no knowledge of Y. Thus, X must be guessed
without any information. Let X \in \{1, 2, ..., m\} and p_1 \geq p_2 \geq ... \geq p_m.
Then the best guess of X is \hat{X} = 1, and the resulting probability of error is
P_e = 1 - p_1. Fano's inequality becomes

H(P_e) + P_e \log(m - 1) \geq H(X).

The probability mass function

(p_1, p_2, ..., p_m) = \left(1 - P_e, \frac{P_e}{m-1}, ..., \frac{P_e}{m-1}\right)

achieves this bound with equality.

While we are at it, let us introduce a new inequality relating probability of error
and entropy. Let X and X' be two independent identically distributed random
variables with entropy H(X). The probability that X = X' is given by

Pr(X = X') = \sum_{x} p^2(x).


Lemma 2.10.1
If X and X' are i.i.d. with entropy H(X), then

Pr(X = X') \geq 2^{-H(X)},

with equality if and only if X has a uniform distribution.

Proof:
Suppose that X ~ p(x). By Jensen's inequality (2^t is convex), we have

2^{E \log p(X)} \leq E\, 2^{\log p(X)},

which implies that

2^{-H(X)} = 2^{\sum_x p(x) \log p(x)} \leq \sum_x p(x)\, 2^{\log p(x)} = \sum_x p^2(x) = Pr(X = X').


Corollary
Let X, X' be independent with X ~ p(x), X' ~ r(x), x, x' \in \mathcal{X}. Then

Pr(X = X') \geq 2^{-H(p) - D(p||r)},
Pr(X = X') \geq 2^{-H(r) - D(r||p)}.

Proof:
2^{-H(p) - D(p||r)} = 2^{\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{r(x)}{p(x)}}
                   = 2^{\sum_x p(x) \log r(x)}
                   \leq \sum_x p(x)\, 2^{\log r(x)}          (Jensen's inequality)
                   = \sum_x p(x) r(x)
                   = Pr(X = X').

Example:
Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn
according to probability mass functions p1(·) and p2(·) over the respective
alphabets \mathcal{X}_1 = {1, 2, ..., m} and \mathcal{X}_2 = {m + 1, ..., n}. Let

X = X1 with probability \alpha,
    X2 with probability 1 - \alpha.

Find H(X) in terms of H(X1), H(X2), and \alpha.

Solution:
Since X1 and X2 have disjoint support sets, we can tell from X which of the two
alphabets it came from. Define the indicator

f(X) = 1 when X \in \mathcal{X}_1 (i.e., X = X1),
       2 when X \in \mathcal{X}_2 (i.e., X = X2),

so that Pr(f(X) = 1) = \alpha. Since f(X) is a function of X,

H(X) = H(X, f(X)) = H(f(X)) + H(X | f(X))
     = H(\alpha) + p(f(X) = 1) H(X | f(X) = 1) + p(f(X) = 2) H(X | f(X) = 2)
     = H(\alpha) + \alpha H(X1) + (1 - \alpha) H(X2),

where H(\alpha) = -\alpha \log \alpha - (1 - \alpha) \log(1 - \alpha).
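An illustrative numeric check (the pmfs and \alpha below are arbitrary choices of mine): building the mixture explicitly and comparing with the formula H(\alpha) + \alpha H(X1) + (1 - \alpha) H(X2).

```python
import math

def Hbits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

alpha = 0.3
p1 = [0.5, 0.5]              # pmf of X1 on {1, 2}
p2 = [0.25, 0.25, 0.5]       # pmf of X2 on {3, 4, 5}

# pmf of the mixture X on the combined (disjoint) alphabet {1, ..., 5}.
mixture = [alpha * p for p in p1] + [(1 - alpha) * p for p in p2]

lhs = Hbits(mixture)
rhs = Hbits([alpha, 1 - alpha]) + alpha * Hbits(p1) + (1 - alpha) * Hbits(p2)
print(lhs, rhs)   # both ~2.2313 bits
```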
