
Hello, Comp-neuro fans.

In the previous two lectures, we made friends with Hebb and learned that his learning rule implements principal component analysis. We then learned about unsupervised learning, which tells us how the brain can learn models of its inputs with no supervision at all. I left you with a question: how do we learn models of natural images? What does the brain do? Well, as the saying goes, when in doubt, trust your eigenvectors. So can we use eigenvectors, or equivalently principal component analysis, to represent natural images? Let's see. Here is a famous example from Turk and Pentland. They took a bunch of face images, each with N pixels, and computed the eigenvectors of the input covariance matrix. When they did that, they found eigenvectors that look like this, and they called them eigenfaces. Now, we can represent any face image, such as this one, as a linear combination of all of our eigenfaces. Here is the equation that captures this relationship. Why can we do that? Well, remember that the eigenfaces are the eigenvectors of the input covariance matrix, and since the covariance matrix is real and symmetric, its eigenvectors form an orthonormal basis with which we can represent these input vectors. Now, here's something interesting: you can use only the first M principal eigenvectors. What do we mean by the first M principal eigenvectors? These are the eigenvectors associated with the M largest eigenvalues of the covariance matrix. If we use only the first M principal eigenvectors to represent the image, we get an equation that looks like this. This equation tells you that there will be some difference between the actual image and its reconstruction from only the first M principal eigenvectors, so we model that difference using a noise term.
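
For reference, here is a reconstruction of the two equations being referred to on the slides (the symbols u for the image, e_i for the eigenfaces, v_i for the coefficients, and epsilon for the noise term are my notation; the slides may use slightly different symbols):

```latex
% Full eigenface expansion of an N-pixel image u
u = \sum_{i=1}^{N} v_i \, e_i , \qquad v_i = e_i^{\top} u
% (the e_i are orthonormal eigenvectors of the input covariance matrix)

% Truncated expansion using only the first M principal eigenvectors
u = \sum_{i=1}^{M} v_i \, e_i + \epsilon
% where \epsilon is a noise term modeling the reconstruction error
```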

So, why is this a useful model? It's a useful model because you can use it for image compression. Suppose your input images were of size 1,000 by 1,000 pixels, which means N equals 1,000,000 pixels. Now suppose the first, let's say, ten principal eigenvectors are sufficient, meaning the ten largest eigenvalues explain most of the variance in your data; then M equals ten. That means just ten numbers are enough to represent an image: ten coefficients are sufficient to represent any image consisting of one million pixels. So what we have done is a tremendous dimensionality reduction, or compression, from one million pixels down to just ten numbers per image. Now, wait a minute. Not so fast, eigenvectors. The eigenvector representation may be good for compression, but it's not very good if you want to extract the local components, or parts, of an image. For example, if you want to extract the parts of a face, such as the eyes, the nose, the ears, you're not going to get that from an eigenvector analysis, or equivalently a principal component analysis, of the face images. And likewise, you're not going to be able to extract local components, such as edges, from natural scenes. Now, this is certainly a sad day for the course, because eigenvectors have let us down for the first time. But maybe we can resurrect the linear model so beloved by the eigenvectors. So, here's the linear model: a natural scene, for example, is represented by a linear combination of a set of basis vectors, or features. These do not have to be eigenvectors anymore. Here is the equation again that captures this relationship. The difference now from the eigenvector case on the previous slide is that we are allowing M, the number of basis vectors or features, to be larger than the number of pixels.
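
As a concrete illustration of the compression idea, here is a minimal NumPy sketch of eigenface-style compression. The function names, the image size, and the choice of M are my own illustrative choices, not code from the lecture:

```python
import numpy as np

def eigenface_compress(images, M=10):
    """Compress images (one flattened image per row) down to M numbers each,
    using the top-M principal eigenvectors ("eigenfaces") of the pixel covariance."""
    mean = images.mean(axis=0)
    centered = images - mean
    cov = np.cov(centered, rowvar=False)      # N x N covariance of the pixels
    _, eigvecs = np.linalg.eigh(cov)          # columns = eigenvectors, eigenvalues ascending
    E = eigvecs[:, -M:]                       # N x M: the M principal eigenvectors
    coeffs = centered @ E                     # each image is now just M coefficients
    return mean, E, coeffs

def eigenface_reconstruct(mean, E, coeffs):
    """Approximate each image from its M coefficients (plus the mean)."""
    return coeffs @ E.T + mean

# Toy usage with random arrays standing in for real face images:
rng = np.random.default_rng(0)
faces = rng.normal(size=(200, 32 * 32))       # 200 "images", N = 1024 pixels each
mean, E, v = eigenface_compress(faces, M=10)
recon = eigenface_reconstruct(mean, E, v)
print(v.shape)                                # (200, 10): ten numbers per image
```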

So why does that make sense? Well, consider the fact that the number of possible parts of objects and scenes can be much larger than the number of pixels. So it does make sense to allow M, the number of basis vectors or features, to be larger than the number of pixels. Here's another way of writing the same equation: we replace the summation with a matrix multiplication, G times v, where the columns of the matrix G are the different basis vectors or features, and the vector v has as its elements the coefficients for each of those basis vectors or features. The challenge before us now is to learn this matrix G, the set of basis vectors, and also, for any given image, to estimate the coefficients, the vector v. In order to learn the basis vectors G and estimate the causes v, we need to specify a generative model for images. As you recall, we can define a generative model by specifying a prior probability distribution for the causes, as well as a likelihood function. Let's first look at the likelihood function. We start with our linear model from the previous slide and assume that the noise vector is Gaussian white noise, meaning there are no correlations across the different components of the noise vector, and that the Gaussian has zero mean. Then the likelihood function is also a Gaussian distribution, with mean G times v and covariance equal to the identity matrix, so the likelihood is proportional to this exponential function. Finally, if we take the logarithm of the likelihood, we obtain the log likelihood, which is simply a quadratic term: negative one half of the squared length of the vector given by the difference between the input image and the reconstruction, or prediction, of the image using your basis vectors. Now here's an interesting observation: a lot of algorithms in engineering and in machine learning attempt to minimize the squared reconstruction error, which is just this term here.
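
Written out, the generative model and likelihood being described look something like this (a reconstruction from the verbal description; the slide notation may differ slightly):

```latex
% Linear generative model with zero-mean Gaussian white noise
u = G v + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Likelihood of an image u given the causes v and the basis matrix G
p(u \mid v; G) \propto \exp\!\left(-\tfrac{1}{2}\,\lVert u - G v \rVert^{2}\right)

% Log likelihood: maximizing it is the same as minimizing the squared reconstruction error
\log p(u \mid v; G) = -\tfrac{1}{2}\,\lVert u - G v \rVert^{2} + \text{const.}
```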

So now you can see that minimizing the reconstruction error is the same thing as maximizing the log likelihood, or equivalently, maximizing the likelihood of the data. Isn't that interesting? Now, let's define the prior probability distribution for the causes. One assumption you can make is that the causes are independent of each other. If you make that assumption, then the prior probability for the vector v is equal to just the product of the individual prior probabilities for each of the causes. Now, this assumption might not strictly hold for natural images, because some of the components might depend on other components, but let's start off with this simplifying assumption and see where it takes us. If you take the logarithm of the prior probability distribution for v, then instead of a product we have a summation of all the individual log prior probabilities for the causes. The question now is: how do we define these individual prior probabilities for the causes? Here's one answer. We can begin with the observation that, for any input, we want only a few of these causes v_i to be active. Why does this make sense? Well, if we assume that these causes represent individual parts or components of natural scenes, then for any given input containing, for example, a particular object, only a few of these causes are going to be activated in that image, because those are the parts of that particular object, and the rest of the v_i are going to be zero. So what we have is that v_i, for any particular i, is going to be zero most of the time, but is going to be high for some inputs. This leads to the notion of a sparse distribution for p(v_i). What this means is that the distribution p(v_i) is going to have a peak at zero, since v_i is going to be zero most of the time, but it is also going to have a heavy tail, which means that for some inputs it is going to take a high value.
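
In symbols, the independence assumption and its log form are simply (again, my reconstruction of what the slide presumably shows):

```latex
% Independent causes: the prior factorizes
p(v) = \prod_{i=1}^{M} p(v_i)

% Taking the logarithm turns the product into a sum
\log p(v) = \sum_{i=1}^{M} \log p(v_i)
```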

This kind of distribution is also called a super-Gaussian distribution. Here are some examples of super-Gaussian, or sparse, prior distributions. This plot shows three distributions, all of which can be expressed as p(v) = exp(g(v)); the dotted curve is the Gaussian distribution, and the other two, the dashed and the solid curves, are examples of sparse distributions. If you take the log of p(v), we get a clearer picture of what these distributions look like. You can see that when g(v) equals minus the absolute value of v, we get an exponential distribution, and when g(v) equals minus the logarithm of 1 plus v squared, we get something called the Cauchy distribution. To summarize, the prior probability p(v) is equal to the product of these exponential functions, and therefore the logarithm of p(v) is going to equal the sum of the values g(v_i), plus some constant. Okay, after all that hard work, we've finally arrived at the grand mathematical finale: figuring out how to find v for any particular image, and how to learn G. We're going to use a Bayesian approach to do that. By Bayesian we mean that we are going to maximize the posterior probability of the causes, p of v given u. From Bayes' rule we can write p of v given u as the product of the likelihood and the prior, where k is just a normalization constant. We can maximize the posterior by maximizing the log posterior, which is the same thing as maximizing the posterior itself. So here is the function F, the log posterior, and you can see that F has two terms: one is a term containing the reconstruction error, and the other is a term containing the sparseness constraint. We can maximize this function by essentially doing two things: minimizing the reconstruction error while at the same time maximizing the sparseness term.
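
Putting the pieces together, the sparse priors and the log posterior F being described can be written roughly as follows (my reconstruction; the constant absorbs the normalization terms):

```latex
% Sparse (super-Gaussian) priors of the form p(v_i) = exp(g(v_i)), for example
g(v) = -\,\lvert v \rvert \quad \text{(exponential prior)}, \qquad
g(v) = -\log\!\left(1 + v^{2}\right) \quad \text{(Cauchy prior)}

% Log posterior = log likelihood + log prior (up to a constant)
F(v, G) = \log p(v \mid u)
        = -\tfrac{1}{2}\,\lVert u - G v \rVert^{2} + \sum_{i=1}^{M} g(v_i) + \text{const.}
```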

So you can see how this function F trades off the reconstruction error against the sparseness constraint. We would like our representation to be sparse, with only a few of these components active, while at the same time we would also like to preserve the information in the images, and that's enforced by the reconstruction error term. One way of maximizing F with respect to v and G is to alternate between two steps. The first step is maximizing F with respect to v, keeping G fixed. The second step is maximizing F with respect to G, keeping v fixed at the value obtained in the previous step. This should remind you of the EM algorithm: just as in the E step, where we computed the posterior probability of v, here we're computing a value of v that maximizes F; and just as in the M step, where we updated the parameters, here we're updating the parameter G, the matrix G, to maximize the function F. Now, the big question is: how do we maximize F with respect to v and G? One answer is to use something called gradient ascent, which means that we change v, for example, according to the gradient of F with respect to v. Why does this make sense? Here's why. Let me draw F as a function of v. Suppose F is this function; you can see that the value of v which maximizes F is some value here, let's call it v star. If the current value of v is, let's say, to the left of v star, you can look at the gradient of F with respect to v: it's the slope of this tangent here. Do you think the gradient is positive or negative at this particular value? If you answered positive, you would be correct. So what does that mean? It means that if you update v according to this equation, you're going to add a small positive value to v, which moves v in the right direction, toward v star. Similarly, if your current value of v is on the other side, let's call it v prime, then you can see that the gradient is, as you might guess, negative, which means you're going to subtract a small value from the current value of v, and that again moves v toward the optimal value. So either way, gradient ascent does the right thing.
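
Here is a tiny sketch of that intuition in code. The particular function F, the step size, and the starting point are all made up for illustration; the point is just that following the sign of the gradient moves v toward the maximum:

```python
# A made-up concave function F(v) with its maximum at v* = 2.
def F(v):
    return -(v - 2.0) ** 2

def dF_dv(v):
    return -2.0 * (v - 2.0)       # gradient of F with respect to v

v = -1.0        # start to the left of v*: the gradient there is positive, so v increases
eta = 0.1       # small step size
for _ in range(100):
    v = v + eta * dF_dv(v)        # gradient ascent update
print(v)         # close to 2.0, the value v* that maximizes F
```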

Okay, let's apply the idea of gradient ascent to our problem. We would like to take the derivative of F with respect to v, and here's the expression that we get; g prime here denotes the derivative of our sparseness function g. We can now write down the way in which we should update the vector v, and that's given by this differential equation, with some time constant. The interesting thing to note is that we can interpret this differential equation for v as simply the firing rate dynamics of a recurrent network. So what does this network do? It takes the reconstruction error and uses it to update the activities of the recurrent network, and it also takes into account the sparseness constraint that encourages the output activities to be sparse. Here is the recurrent network that implements our differential equation for v. You can see that it has an input layer of neurons and an output layer of neurons. The interesting observation is that the network makes a prediction of what it expects the input to be: G times v is a prediction, or reconstruction, of the input. We then take an error, u minus Gv, which is the reconstruction error or prediction error, and that is passed back to the output layer. The output layer neurons then use this error to correct the estimates they have of the causes of the image, as given by the vector v. For any given image, the network iterates by predicting and correcting, and eventually converges to a stable value of v.
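
Written out, the gradient and the corresponding network dynamics being described take roughly this form (a reconstruction from the description; tau is the time constant and g prime acts element-wise on v):

```latex
% Gradient of F with respect to v
\frac{\partial F}{\partial v} = G^{\top}\,(u - G v) + g'(v)

% Firing-rate dynamics of the recurrent network (gradient ascent on F)
\tau \,\frac{dv}{dt} = G^{\top}\,(u - G v) + g'(v)
```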

We can learn the synaptic weight matrix G, which contains the basis vectors, or features, that we are trying to learn, by again applying gradient ascent. We set dG/dt proportional to the gradient of F with respect to G. When we take the derivative of F with respect to G, we get an expression that looks like this: it's u minus Gv, times v transpose. What we end up with, then, is this learning rule for updating the synaptic weights G. It has a time constant, tau G, which specifies the timescale at which we update the weights G. If we set tau G to be bigger than the time constant we had for v, that ensures that v converges faster than G, so we have the desired property that, for any given image, v converges quickly to some particular value, and we can then use that value of v to update the weights of the network. Now, if you look closely at the right-hand side of the learning rule, you'll see that it's actually Hebbian: it contains the term u times v transpose, which is the Hebbian term. It also contains a subtractive term, and that makes this rule very similar, in fact almost identical, to Oja's learning rule. So if the learning rule is almost identical to Oja's rule, why doesn't this network just compute the eigenvectors? Why isn't it just doing principal component analysis? The answer lies in the fact that the network is also trying to compute a sparse representation of the image, and that ensures that the network does not simply learn the eigenvectors of the covariance matrix; it learns a set of basis vectors that can represent the input in a sparse manner. Okay, here is a pop quiz question. If you feed your network patches from natural images, what do you think the network will learn in its matrix G? What kind of basis vectors would you predict are learned for natural image patches? Time for the drum roll.
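
(While the drum rolls, here is the learning rule just described, written out explicitly; this is a reconstruction from the verbal description, not the slide itself.)

```latex
% Gradient of F with respect to G (the sparseness term does not depend on G)
\frac{\partial F}{\partial G} = (u - G v)\, v^{\top}

% Learning rule for the synaptic weights, with \tau_G \gg \tau
\tau_G \,\frac{dG}{dt} = (u - G v)\, v^{\top}
      = \underbrace{u\, v^{\top}}_{\text{Hebbian term}}
        \;-\; \underbrace{G\, v\, v^{\top}}_{\text{subtractive term}}
```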

The answer, as first discovered by Olshausen and Field, is that the basis vectors remarkably resemble the receptive fields in the primary visual cortex, as originally discovered by Hubel and Wiesel. Each of these square images is one basis vector, one column of the matrix G: you can obtain the vector from the square image by collapsing the rows of the square image into one long vector, and that is one column of G. So what is this result telling us? It's telling us that the brain is perhaps optimizing its receptive fields to code for natural images in an efficient manner. You can look at this model as an example of an interpretive model; this goes back to the first week of our course, where we discussed the three different kinds of models in computational neuroscience. This would be an example of an interpretive model that provides an ecological explanation for the receptive fields one finds in the primary visual cortex. The sparse coding network that we have been discussing is in fact a special case of a more general class of networks known as predictive coding networks. Here's a schematic diagram of a predictive coding network. The main idea is to use feedback connections to convey predictions of the input, and to use the feedforward connections to convey the error signal between the prediction and the input. The box labelled Predictive Estimator maintains an estimate of the hidden causes of the input, the vector v. Now here are some more details of the predictive coding network. As in the sparse coding network, there are a set of feedforward weights and a set of feedback weights, but we can also include a set of recurrent weights, which allow the network to model time-varying inputs. For example, if the input is not a static image but a natural movie, then we can model the dynamics of the hidden causes of the movie by allowing the estimates of the hidden causes to change over time.
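
To make the whole procedure concrete, here is a minimal sketch of the kind of experiment Olshausen and Field describe, put together from the equations above. The patch size, learning rates, number of iterations, the Cauchy-style sparseness function, and the column normalization are illustrative choices of mine, not the exact settings from the paper or the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for whitened natural image patches: each column is one 8x8 patch (N = 64 pixels).
N, M, num_patches = 64, 128, 1000            # overcomplete: M > N basis vectors
patches = rng.normal(size=(N, num_patches))

G = rng.normal(scale=0.1, size=(N, M))       # basis vectors (columns), learned over time

def g_prime(v):
    # Derivative of the Cauchy-style sparseness function g(v) = -log(1 + v^2)
    return -2.0 * v / (1.0 + v ** 2)

def infer_v(u, G, steps=50, dt=0.1):
    """Fast dynamics: gradient ascent on F with respect to v, with G held fixed."""
    v = np.zeros(G.shape[1])
    for _ in range(steps):
        error = u - G @ v                    # reconstruction (prediction) error
        v += dt * (G.T @ error + g_prime(v))
    return v

eta_G = 0.01                                 # slow timescale for learning G (tau_G >> tau)
for t in range(num_patches):
    u = patches[:, t]
    v = infer_v(u, G)                        # E-step-like: estimate the causes for this patch
    G += eta_G * np.outer(u - G @ v, v)      # M-step-like: Hebbian-like update of G
    G /= np.maximum(np.linalg.norm(G, axis=0), 1e-8)   # keep basis vectors from blowing up

# Trained on real natural image patches, the columns of G (reshaped back to 8x8 squares)
# would be expected to look like localized, oriented, V1-like receptive fields.
```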

That's modeled using a set of recurrent synapses. Additionally, one can include a gain on the sensory error, which allows you to model effects such as visual attention. Well, this brings back some fond memories for me, because I worked on these predictive coding networks as a graduate student and as a postdoc. If you're interested in more details of these predictive coding models, I would encourage you to visit the supplementary materials on the course website, where you'll find some papers that I wrote as a graduate student and as a postdoc. And finally, the predictive coding model suggests an answer to a longstanding puzzle about the anatomy of the visual cortex. Here is a diagram by Gilbert and Li of the connections between different areas of the visual cortex. The puzzle is this: every time you see a feedforward connection, such as the one from V1 to V2 given by the blue arrow, you almost always also find a feedback connection from the second area back to the first, in this case from V2 back to V1. So why is there always a feedback connection for every feedforward connection between two cortical areas? Here is a schematic depiction of the puzzle. Information from the retina, as you know, is passed on to the LGN, or lateral geniculate nucleus, and then on to cortical area V1, cortical area V2, and so on. But for every set of feedforward connections, there seems to be a corresponding set of feedback connections. So what could be the role of these feedforward and feedback connections? The predictive coding model suggests interesting functional roles for them: according to the model, the feedback connections convey predictions of the activities in a lower cortical area from a higher cortical area, and the feedforward connections from one cortical area to the next convey the error signal between the predictions and the actual activities.
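
Here is a minimal sketch of that idea for two areas, in the spirit of the hierarchical predictive coding models mentioned here (Rao and Ballard style). All variable names, dimensions, and step sizes are my own illustrative choices, not the exact model from the lecture or the papers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Dimensions of the input and of the estimates in the two "cortical areas".
n_u, n_v1, n_v2 = 64, 32, 16
U = rng.normal(scale=0.1, size=(n_u, n_v1))    # feedback weights: area 1 -> prediction of input
W = rng.normal(scale=0.1, size=(n_v1, n_v2))   # feedback weights: area 2 -> prediction of area 1

def infer(u, U, W, steps=100, dt=0.05):
    """Settle the estimates v1, v2 for one input u by exchanging predictions and errors."""
    v1 = np.zeros(n_v1)
    v2 = np.zeros(n_v2)
    for _ in range(steps):
        e0 = u - U @ v1        # error at the input level: input minus top-down prediction
        e1 = v1 - W @ v2       # error at area 1: activity minus prediction from area 2
        # The feedforward pathway carries the errors; each area corrects its estimate.
        v1 += dt * (U.T @ e0 - e1)
        v2 += dt * (W.T @ e1)
    return v1, v2

u = rng.normal(size=n_u)        # stand-in for an input image patch
v1, v2 = infer(u, U, W)
print(np.linalg.norm(u - U @ v1))   # residual prediction error at the input after settling
```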

It turns out that this model can explain certain interesting phenomena that people have observed in the visual cortex, known as contextual effects, surround suppression, or surround effects. These effects can be explained in an interesting manner by the hierarchical predictive coding model we have here, when it is trained on natural images. So I'd encourage you to go to the supplementary materials on the course website if you're interested in more details. Okay, amigos and amigas, that wraps up this lecture. Next week, we'll learn how neurons can act as classifiers, and how the brain can learn from rewards using reinforcement learning. Until then, adios and goodbye.
