
NeuroTheory II: Networks.

Homework 6: Boltzmann Machines


instructor: xaq pitkow

1 Boltzmann Machine
In this homework, you will create two Boltzmann machines, and match one to the other. Guidance follows.

1.1 Create a Boltzmann Machine


Choose a bias vector h and a symmetric coupling matrix J for the ‘data’ distribution P over the multivariate
binary state x ∈ {±1}^N,

    P(x) = e^{−E(x)} / Z    (1)

based on the energy

    E(x) = −h·x − x·J·x    (2)
You can choose these biases and coupling matrices however you like, though you may want to try a few
different ones to get some interesting behavior. For now select N = 5 neurons, which gives 2^5 = 32 distinct
binary states. (You may find it convenient to create an array of all possible binary vectors in MATLAB as
2*de2bi(0:2^N-1)-1. How does that trick work?)
Numerically compute and plot the energy E(x) and the probability P(x) = e^{−E(x)}/Z as bar charts over the
2^N possible states.
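For concreteness, here is a minimal MATLAB sketch of one way to set this up. The specific h and J below are placeholders to be replaced by your own choices, and de2bi assumes the Communications Toolbox is available.

% Sketch for Section 1.1. The bias h and coupling J are arbitrary
% placeholders; choose your own. de2bi requires the Communications Toolbox.
N = 5;                            % number of neurons
h = randn(N,1);                   % example bias vector
J = randn(N);  J = (J + J')/2;    % example symmetric coupling matrix

X = 2*de2bi(0:2^N-1) - 1;         % 2^N x N array of all +/-1 states
E = zeros(2^N,1);
for s = 1:2^N
    x = X(s,:)';                  % one state as a column vector
    E(s) = -h'*x - x'*J*x;        % energy, Eq. (2)
end
P = exp(-E);  P = P/sum(P);       % Boltzmann probabilities, Eq. (1)

figure;
subplot(2,1,1); bar(E); ylabel('E(x)');
subplot(2,1,2); bar(P); ylabel('P(x)'); xlabel('state index');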

1.2 Adjust temperature


Scale your energies by a range of different ‘temperatures’ T by plotting the probabilities P(x) ∝ e^{−E(x)/T}.
Describe how the distribution P changes with T. At very low temperatures, what does the distribution look
like? How does it relate to the minimum of the energy? At very high temperatures, what does the distribution
look like?
Plot the entropy H of the distribution, H(P) = −Σ_x P(x) log P(x), for a range of temperatures. How
does this temperature parameter influence the uncertainty in the probability distribution?
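As one possible sketch (reusing E from the previous snippet; the temperature grid is an arbitrary choice):

% Sketch for Section 1.2: distributions and entropy over a temperature sweep.
Ts = logspace(-1, 1, 7);                  % temperatures from 0.1 to 10
H  = zeros(size(Ts));
figure;
for k = 1:numel(Ts)
    PT = exp(-E/Ts(k));  PT = PT/sum(PT); % P(x) at temperature T
    H(k) = -sum(PT .* log(PT + eps));     % entropy (eps guards log(0))
    subplot(numel(Ts),1,k); bar(PT); ylabel(sprintf('T = %.2g', Ts(k)));
end
figure; semilogx(Ts, H, 'o-'); xlabel('T'); ylabel('entropy H(P)');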

1.3 Sampling
Use Gibbs sampling to draw examples x from your Boltzmann distribution. In Gibbs sampling, you repeatedly
iterate through each node, one at a time, and draw from the conditional distribution P(x_i | x_{−i}), where x_{−i} is the
vector of all states except for the i-th component (so if x = (−1, −1, 1, 1), then x_1 = −1 and x_{−1} = (−1, 1, 1)).
This is just like updating the Hopfield net neurons, except that instead of picking the minimum-energy state,
you draw states randomly from the probability P(x_i | x_{−i}). This conditional distribution is given by

    P(x_i | x_{−i}) = e^{−E(x_i | x_{−i})} / Z = σ(−E(x_i | x_{−i}))    (3)
where σ(y) = 1/(1 + e^{−2y}) and

    E(x_i | x_{−i}) = x_i ∆E/2 = (x_i/2) [E(x_i = +1 | x_{−i}) − E(x_i = −1 | x_{−i})]    (4)
Express this energy mathematically in terms of h, J, and x_{−i} so you can use it for sampling P(x_i | x_{−i}), which
is conditioned on x_{−i}. (You may want to use component-wise sums instead of vector notation here. Be careful
to count terms properly in the x·J·x term: x_i appears in both the first and the second x.)
Note that since x_i is only a single binary number, the probability of it being +1 is determined by just
one scalar p = P(x_i = +1 | x_{−i}). For the other binary state x_i = −1, we have probability 1 − p. This means
that you can sample x_i as 2*(rand()<p)-1. (Of course, p depends on i, as you just derived above.)
You need to start Gibbs sampling with a particular sample x: your single-neuron updates are all conditional
on the other states x_{−i}, so they must be defined. This somewhat arbitrary choice of initial state can influence
the samples for a long time, so we usually drop the first K of them, calling this a ‘burn-in’ period.
Draw T = 1000 samples, each updating all N neurons, dropping an extra K = 100 burn-in samples
from the beginning. Plot the empirical distribution P̂(x) of your samples, i.e. the fraction of samples for each
state, and compare it to the exact probability distribution P(x) (Eq. 1). How close are the empirical and true
distributions if you draw many more samples, say, 10^6? (An easy way to compare two distributions visually is
to plot one of them upside down on the same axes.)
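Here is a minimal sketch of a Gibbs sampler. To avoid giving away the derivation requested above, it evaluates the conditional by brute force from the full energy E(x); your expression in terms of h, J, and x_{−i} should produce the same probabilities much more cheaply. It reuses h, J, X, and P from the earlier snippets, and bi2de again assumes the Communications Toolbox.

% Sketch for Section 1.3: Gibbs sampling with a burn-in period.
Tsamp = 1000;  K = 100;                   % sweeps to keep, burn-in sweeps
Efun  = @(x) -h'*x - x'*J*x;              % energy, Eq. (2)
x = 2*(rand(N,1) < 0.5) - 1;              % arbitrary random initial state
samples = zeros(N, Tsamp);
for t = 1:(Tsamp + K)
    for i = 1:N                           % one sweep updates every neuron
        xp = x;  xp(i) = +1;              % state with x_i = +1
        xm = x;  xm(i) = -1;              % state with x_i = -1
        p = exp(-Efun(xp)) / (exp(-Efun(xp)) + exp(-Efun(xm)));
        x(i) = 2*(rand() < p) - 1;        % sample x_i ~ P(x_i | x_{-i})
    end
    if t > K, samples(:, t-K) = x; end    % drop the burn-in sweeps
end
idx  = bi2de((samples' + 1)/2) + 1;       % state index of each sample
Phat = accumarray(idx, 1, [2^N 1]) / Tsamp;
figure; bar(Phat); hold on; bar(-P);      % empirical vs. true (upside down)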
For low temperatures or wide energy ranges, the empirical distribution obtained from Gibbs sampling started
at a single point will typically differ from the target Boltzmann distribution P, whereas for high temperatures
or narrow energy ranges the match starting from a single initial point will be better.
Why? Can you explain why restarting the Gibbs sampling from a new random state might improve the match
at low temperatures?

1.4 Training
Now create a second Boltzmann machine, with bias vector h′ and coupling matrix J′ that define P′(x). Use
gradient descent to minimize the Kullback–Leibler divergence D_KL[P || P′], using the gradients

    ∂D/∂h′_i = ⟨x_i⟩_{P′} − ⟨x_i⟩_P    (5)

    ∂D/∂J′_{ij} = ⟨x_i x_j⟩_{P′} − ⟨x_i x_j⟩_P    (6)

You could calculate these averages using your samples, but for the small N we’re using you can quickly compute
the probabilities P and P′ exactly. Measure the D_KL at each iteration, and plot its decrease over time.
At different times while the network is learning (early, middle, and late), plot the two distributions P and P′
on the same axes to see how well the second Boltzmann machine (model, P′) is matching the first one (‘data’,
P).
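A rough sketch of the training loop, computing the averages exactly from the enumerated states X and probabilities P above (the learning rate eta and iteration count are arbitrary choices):

% Sketch for Section 1.4: gradient descent on D_KL[P || P'].
eta = 0.1;  nIter = 500;
hp  = zeros(N,1);  Jp = zeros(N);          % h' and J' of the model P'
mP  = X' * P;                              % <x_i>_P     (data means)
CP  = X' * (X .* P);                       % <x_i x_j>_P (data correlations)
DKL = zeros(nIter,1);
for it = 1:nIter
    Ep = -X*hp - sum((X*Jp) .* X, 2);      % model energies for all states
    Pp = exp(-Ep);  Pp = Pp/sum(Pp);       % model distribution P'
    DKL(it) = sum(P .* log(P ./ Pp));      % KL divergence at this iteration
    mPp = X' * Pp;                         % <x_i>_P'
    CPp = X' * (X .* Pp);                  % <x_i x_j>_P'
    hp  = hp - eta * (mPp - mP);           % gradient step, Eq. (5)
    Jp  = Jp - eta * (CPp - CP);           % gradient step, Eq. (6)
end
figure; plot(DKL); xlabel('iteration'); ylabel('D_{KL}[P || P'']');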
Comment: In this homework, the model distribution is from the exact same family as the data distribution,
so the two can, in principle, match perfectly. For other data distributions, there is no guarantee that this
particular Boltzmann machine can fit the data. Hidden nodes or additional structure may be required.
