
Information Bottleneck Method

Azamat Berdyshev

University of Toronto

February 22, 2018


Outline: Some Information Theory basics · Information Bottleneck Method · Applications in Deep Learning

Entropy

Let X ∈ 𝒳 be a discrete random variable distributed as X ∼ P. Then

    H(X) = − ∑_{x∈𝒳} P(x) log P(x) = − E[log P(X)]

Intuition: How much “uncertainty” is in random variable X.
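
As a quick illustration (my own sketch, not from the slides), here is a minimal NumPy version of this definition; using the base-2 logarithm gives the answer in bits.

    import numpy as np

    def entropy(p):
        """Shannon entropy H(X) = -sum_x P(x) log2 P(x), in bits; zero-probability outcomes are skipped."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
    print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits, less uncertainty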


Conditional Entropy

Let (X, Y) ∈ 𝒳 × 𝒴 be a pair of discrete random variables jointly distributed as (X, Y) ∼ P_XY. Then

    H(X|Y) = ∑_{x∈𝒳} ∑_{y∈𝒴} P_XY(x, y) log [ P_Y(y) / P_XY(x, y) ]
           = ∑_{y∈𝒴} P_Y(y) H(X|Y = y)

Intuition: How much “uncertainty” is left in the random variable X once the random variable Y is observed.
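
A small sketch of the same quantity in code (my own illustration, not part of the slides): H(X|Y) computed directly from a joint probability table, with rows indexed by x and columns by y.

    import numpy as np

    def conditional_entropy(p_xy):
        """H(X|Y) in bits from a joint table P_XY with rows = x, columns = y."""
        p_xy = np.asarray(p_xy, dtype=float)
        p_y = p_xy.sum(axis=0, keepdims=True)                  # marginal P_Y(y)
        ratio = np.divide(p_y, p_xy, out=np.ones_like(p_xy), where=p_xy > 0)
        return np.sum(np.where(p_xy > 0, p_xy * np.log2(ratio), 0.0))

    p_xy = np.array([[0.25, 0.25],
                     [0.25, 0.25]])                            # X and Y independent and uniform
    print(conditional_entropy(p_xy))                           # 1.0 bit: observing Y tells nothing about X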


Mutual Information

As before, let (X, Y) ∈ 𝒳 × 𝒴 be a pair of discrete random variables jointly distributed as (X, Y) ∼ P_XY. Then

    I(X; Y) = H(X) − H(X|Y)
            = ∑_{x∈𝒳} ∑_{y∈𝒴} P_XY(x, y) log [ P_XY(x, y) / (P_X(x) P_Y(y)) ]

Intuition: How much “information” (in bits) about the random variable X is contained in the random variable Y.
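
Again as a hedged illustration (not from the slides), the same formula evaluated on a joint table:

    import numpy as np

    def mutual_information(p_xy):
        """I(X;Y) in bits from a joint table P_XY with rows = x, columns = y."""
        p_xy = np.asarray(p_xy, dtype=float)
        p_x = p_xy.sum(axis=1, keepdims=True)                  # marginal P_X(x)
        p_y = p_xy.sum(axis=0, keepdims=True)                  # marginal P_Y(y)
        ratio = np.divide(p_xy, p_x * p_y, out=np.ones_like(p_xy), where=p_xy > 0)
        return np.sum(np.where(p_xy > 0, p_xy * np.log2(ratio), 0.0))

    print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))    # 0.0: independent variables
    print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))        # 1.0: Y determines X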


Data Processing Inequality

- Let X → Y → Z be a Markov chain; then

    I(X; Y) ≥ I(X; Z)

- Reparametrization invariance trick: for any invertible φ, ψ,

    I(X; Y) = I(φ(X); ψ(Y))
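
A toy numerical check (my own example, not from the slides): pass a fair bit X through two noisy binary channels to obtain Y and then Z, so that X → Y → Z is a Markov chain, and verify that I(X; Y) ≥ I(X; Z).

    import numpy as np
    from scipy.stats import entropy            # used as KL divergence: I(A;B) = D_KL(P_AB || P_A P_B)

    def mi(p_joint):
        p_a = p_joint.sum(axis=1, keepdims=True)
        p_b = p_joint.sum(axis=0, keepdims=True)
        return entropy(p_joint.ravel(), (p_a * p_b).ravel(), base=2)

    def bsc(eps):
        """Binary symmetric channel: flips its input bit with probability eps."""
        return np.array([[1 - eps, eps], [eps, 1 - eps]])

    p_x = np.array([0.5, 0.5])
    p_xy = np.diag(p_x) @ bsc(0.1)             # joint of (X, Y)
    p_xz = np.diag(p_x) @ bsc(0.1) @ bsc(0.2)  # joint of (X, Z): Z only sees X through Y
    print(mi(p_xy), ">=", mi(p_xz))            # ≈0.53 >= ≈0.17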


Information Bottleneck Problem


(N. Tishby, F. Pereira, W. Bialek, 1999)

- Consider the information channel X --f(x)--> T --g(t)--> Y. The Information Bottleneck problem is

    minimize over P_T|X(t|x):   I(X; T)
    subject to:                 I(T; Y) ≥ ε                        (1)

- Letting β be the Lagrange multiplier, this becomes

    min over P_T|X(t|x):   I(X; T) − β I(T; Y)                     (2)
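
A hedged sketch of how (2) can be optimized in practice, using the self-consistent update equations of Tishby, Pereira and Bialek (1999): alternately re-estimate p(t|x), p(t) and p(y|t). The toy joint distribution, cluster count and β value below are my own illustrative choices.

    import numpy as np

    def iterative_ib(p_xy, n_t=2, beta=5.0, n_iter=200, seed=0):
        """Return a (soft) encoder p(t|x) minimizing I(X;T) - beta * I(T;Y) for the given joint P_XY."""
        rng = np.random.default_rng(seed)
        p_x = p_xy.sum(axis=1)                                           # P_X(x)
        p_y_given_x = p_xy / p_x[:, None]                                # P(y|x)
        p_t_given_x = rng.dirichlet(np.ones(n_t), size=p_xy.shape[0])    # random initial encoder
        for _ in range(n_iter):
            p_t = p_t_given_x.T @ p_x                                    # P(t)
            p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x / p_t[:, None]   # P(y|t)
            # KL( p(y|x) || p(y|t) ) for every pair (x, t); small epsilon avoids log(0)
            kl = np.sum(p_y_given_x[:, None, :] *
                        np.log((p_y_given_x[:, None, :] + 1e-12) / (p_y_given_t[None, :, :] + 1e-12)),
                        axis=2)
            p_t_given_x = p_t[None, :] * np.exp(-beta * kl)              # exponential-family update
            p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
        return p_t_given_x

    # Toy joint: X takes 4 values (a relevant bit plus a noise bit), Y equals the relevant bit.
    p_xy = np.array([[0.25, 0.00], [0.25, 0.00], [0.00, 0.25], [0.00, 0.25]])
    print(iterative_ib(p_xy).round(2))   # T should keep the relevant bit and compress away the noise

Sweeping β (equivalently, the constraint level ε in (1)) over its range traces the full trade-off between compression I(X; T) and prediction I(T; Y), i.e. the Information Bottleneck bound on the information plane.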


Deep Neural Nets and Markov Chains


Given an input X ∈ ℝ^d, consider training a Multilayer Perceptron, a.k.a. a Deep Neural Net (DNN), with Stochastic Gradient Descent.

[1] Ravid Shwartz-Ziv and Naftali Tishby. “Opening the Black Box of Deep Neural Networks via Information”. arXiv preprint arXiv:1703.00810 (2017).

Deep Neural Nets and Markov Chains

- Training is a Markov process, thus the DNN forms the Markov chain

    X → T_1 → · · · → T_{i−1} → T_i → T_{i+1} → · · · → T_k → Ŷ

- The data processing inequality tells us that

    I(X; T_1) ≥ I(X; T_2) ≥ · · · ≥ I(X; T_k) ≥ I(X; Ŷ),

  hence adding further layers of activation functions (a.k.a. neurons) cannot create any new information about X (a sketch of how these layer-wise quantities can be estimated follows below).
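
As a rough sketch in the spirit of Shwartz-Ziv and Tishby's estimation procedure: discretize the layer's activations into bins and treat each distinct binned activation pattern as one symbol. The bin count, random weights and single tanh layer below are illustrative assumptions, not the paper's exact setup.

    import numpy as np

    def mi_with_input(activations, n_bins=30):
        """Estimate I(X;T) in bits when every input x is distinct and equiprobable, so I(X;T) = H(T_binned)."""
        edges = np.linspace(activations.min(), activations.max(), n_bins)
        binned = np.digitize(activations, edges)                   # discretize each unit's activation
        _, counts = np.unique(binned, axis=0, return_counts=True)  # distinct discretized layer patterns
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))                  # 1000 distinct inputs
    T = np.tanh(X @ rng.normal(size=(10, 5)))        # stands in for one hidden layer's activations
    print(mi_with_input(T))                          # upper-bounded by log2(1000) ≈ 9.97 bits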


Key Observation

- For a given ε, the Information Bottleneck (IB) tells you how well you can do, but not how to achieve it; to find that out we still need to train the network.

- The cost function in Deep Neural Net training is highly nonlinear, but the Information Bottleneck optimization (2) is convex and thus has a unique optimal solution!

- Finding the global optimum in DNN training is NP-hard, but optimizing the Information Bottleneck is very efficient (O(n³))!


Main Takeaway
Solving the Information Bottleneck optimization (1) for all levels of ε gives the so-called Information Bottleneck bound in the information plane, the plane with coordinates I(X; T) and I(T; Y).

[2] Ravid Shwartz-Ziv and Naftali Tishby. “Opening the Black Box of Deep Neural Networks via Information”. arXiv preprint arXiv:1703.00810 (2017).

Thank you for your attention!

Questions?

