0:01

Hello, again. So now we're moving on to calculating

information in spike trains. And in this section of the lecture, we're

going to be talking about two methods, one of which is how to compute information in

spike patterns. And the other is how to compute

information in single spikes. So let's go back to our grandma's

information recipe. So remember that we're calculating the

mutual information, which is the difference between the total response

entropy and the mean noise entropy. So, what was the strategy we're going

to test? We're going to take a single stimulus S,

repeat it many times to obtain the probability of the responses given S, and

from that response distribution, the noise entropy.

We're going to repeat that for all s, and then average it over s.

Finally, we'll compute the probability of response, and from that the total

response entropy. So now, let's go ahead and compute

information in spike patterns. So far we've really only dealt with

single spikes or firing rates, so what we'd like to ask here is, what

information is carried by patterns of spikes?

By these interesting sequences of 0s and 1s that occur here in the code.

What this allows us to do is to analyze patterns of the code and to ask how

informative they are. So the way we're going to turn our

spike train into a pattern code is that we're going to chop up segments of

these responses: we take our voltage trace and divide it into time bins of

size delta t. If there's a spike in that bin, then

we'll put a one. If there's no spike, we'll put a zero.

And now we'll chunk up these zeros and ones into words of some length, big T.

So now that we've defined these binary words,

with letter size delta t and length T, we can now walk through our data.
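
As a rough sketch of this chopping procedure (the spike times, bin size, and word length below are made up for illustration):

```python
import numpy as np

def spikes_to_words(spike_times, t_max, dt=0.002, bins_per_word=8):
    """Discretize a spike train into binary letters of width dt,
    then chunk consecutive letters into words."""
    n_bins = int(round(t_max / dt))
    letters = np.zeros(n_bins, dtype=int)
    # A bin gets a 1 if at least one spike falls inside it.
    idx = (np.asarray(spike_times) / dt).astype(int)
    letters[idx[idx < n_bins]] = 1
    # Group the letter sequence into non-overlapping words.
    n_words = n_bins // bins_per_word
    words = letters[:n_words * bins_per_word].reshape(n_words, bins_per_word)
    return [tuple(w) for w in words]

# Three spikes in 32 ms of recording -> two 8-letter words at dt = 2 ms.
words = spikes_to_words([0.003, 0.005, 0.021], t_max=0.032)
```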

So, here's a raster plot produced by a stimulus that was randomly chosen on

every trial. And so, if one converts such a raster

plot into sequences of zeros and ones, you can look through that and pull out

many, many examples of these words, again of length T and bin size delta t.

2:18

So now, one can form a distribution over these words.

So here, the most common word was silence: there was no spike in this set

of eight consecutive time bins, the next most common was that one spike appeared

and of course, we can have that one appearing at different locations

throughout the word. These are the next most common set of

words. Then one starts to get combinations of

spikes occurring at different locations throughout the word.

So now we can walk through our data and calculate these probabilities and then

calculate the entropy of that word distribution.
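
A minimal sketch of forming that word distribution and computing its entropy, using a hypothetical toy word stream:

```python
import numpy as np
from collections import Counter

def word_entropy(words):
    """Entropy (bits) of the empirical distribution over words."""
    counts = Counter(words)
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

# A toy word stream: silence is most common, then single-spike words.
words = [(0, 0, 0, 0)] * 5 + [(0, 0, 1, 0)] * 2 + [(1, 0, 0, 0)] * 1
H = word_entropy(words)  # total response entropy of this sample
```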

Now the information is the difference between that entropy and the

variability due to noise, averaged over stimuli.

So here was our total entropy. Here's how we're going to compute our

noise entropy. So, in this case, the same stimulus was

given every time, and what one sees over many repetitions of that stimulus

is that on the first trial, you see a word: zero, zero, one, zero, zero, zero,

zero. On the next trial, you have the same

word, but now you see that there are some times when there was no spike, and some

times when that spike appeared in a different bin.

What that's going to do is generate a distribution of different words.

Now that distribution is going to be considerably narrower than the total

distribution. And it's exactly this reduction in the

entropy from knowing nothing about the stimulus, to knowing something about the

stimulus that the information will be capturing.

Alright, so let's go ahead and apply Grandma's recipe.

We'll take a stimulus sequence and repeat it many times. But how are we sampling

this probability of stimuli? We're going to use a bit of a trick,

which is that instead of averaging over all possible stimuli, we're going to take

a long random stimulus and average it over time.

4:23

So now, time is standing in for the average over stimuli.

So now, for each time in the repeated stimulus, we're going to get a set of

words, p of w given the stimulus at time t. And our average noise

entropy is now going to be averaged over those different time points.
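
Putting the two pieces of grandma's recipe together for words — a minimal sketch, assuming responses are stored as a trials-by-time array of word labels:

```python
import numpy as np
from collections import Counter

def entropy_bits(samples):
    """Entropy (bits) of the empirical distribution of the samples."""
    counts = Counter(samples)
    p = np.array(list(counts.values()), dtype=float) / len(samples)
    return -np.sum(p * np.log2(p))

def word_information(words):
    """words[trial][t]: word observed at time t of a given repeat.
    Total entropy pools all words; noise entropy is the across-trial
    entropy at each time t, averaged over t."""
    words = np.asarray(words)  # hashable word labels, e.g. strings
    H_total = entropy_bits(words.ravel().tolist())
    H_noise = np.mean([entropy_bits(words[:, t].tolist())
                       for t in range(words.shape[1])])
    return H_total - H_noise

# Toy example: at t=0 the response is reliable, at t=1 it is random.
trials = [["00", "01"], ["00", "10"], ["00", "01"], ["00", "10"]]
I = word_information(trials)  # -> 1.0 bit
```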

So if we choose the length of the repeated sequence to be long enough, that will allow us

to sample the noise entropy adequately. So let's have a look at the application

of this idea to data from the LGN in a classic paper by Pam Reinagel and Clay

Reid. They carried out this exact procedure, so

as you saw before, they ran a random stimulus over many trials.

Then they ran a fixed stimulus, called frozen white noise, which has some

structure; in fact, here it is. It's the stimulus as a function of time,

and you can see that in response to the stimulus, spikes appeared in a time-locked

sequence. And now, if one averages across those

repeats, one finds a PSTH, that is, a post-stimulus time histogram, where these

events show large modulations in the time-varying firing rate produced in

response to that stimulus. Now, if one zooms in on a tiny piece of

these responses, you'll see something like this.

So, at very fine time scales, there's quite a bit of jitter in those

responses. Now our goal in computing the

information, and what the authors examined in this paper, was to ask: on what

time scale do these responses continue to convey information about the stimulus?

So one can see by looking at this picture that there's quite a bit of variability

in the spike train, and so that defines some kind of window around which a spike

can jitter and still signal the same information about the input.

So the question we'd like to understand is: how finely do we have to bin our

spike train and pay attention to the individual timings of spikes in order to

extract all that the neural code has to tell us about the stimulus?

So one can do that by exploring the information produced by the spike train

as a function of these two parameters: delta t, the binning time

width, and the length of the word. As the word gets longer, our coding symbol

is able to capture more and more of the correlations in the input.

And so, to what extent does increasing L continue to capture more and more

information about the stimulus? So here's what the authors found in the

LGN: they varied both delta t, the temporal resolution of their words, and

the total word length, here drawn as a function of 1 over L.

And they have plotted here the information that they calculated for different

choices of those parameters of the definition of the word.

7:22

So, clearly, there's going to be a problem in going to this limit of very

large word lengths. So, as the word gets longer an longer,

for a finite amount of data, you're going to have very few samples of a word of

that length. And so when one tries to estimate the

entropy of the distribution of words of this length, it's very unlikely that you

will have seen them all. And so not surprisingly, if you now look

at the entropy, plotted as one over the word length The entropy drops off at this

limit indicating that the information is not completely sampled.

So what can be done is to compute the entropy for different lengths of words

and you can see that these form almost a line.

And so one can simply extrapolate the tendency of this line back toward

infinite word length and extract an estimated value for the

entropy in that limit. That's not what was done in this figure;

this was purely the information as directly calculated.
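
The extrapolation described here can be sketched as a straight-line fit of the entropy rate against 1 over the word length, read off at the intercept; the numbers below are invented for illustration, not the paper's data:

```python
import numpy as np

# Hypothetical entropy-rate estimates (bits/s) for words of total length T (s).
# The rate drops for the longest words because they are undersampled.
T = np.array([0.016, 0.032, 0.048, 0.064])
H_rate = np.array([100.0, 95.0, 93.3, 92.5])

# Fit the entropy rate as a linear function of 1/T; the intercept estimates
# the entropy rate in the infinite-word-length limit (1/T -> 0).
slope, intercept = np.polyfit(1.0 / T, H_rate, 1)
H_inf = intercept  # extrapolated entropy rate, bits/s
```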

And so one can look over different delta t's and different word lengths to see how

information depended on these parameters. So what you should notice is that there

is some limit to delta t, beyond which the information

doesn't grow anymore, as one looks at the words at higher and

higher temporal resolution. So one takes into account finer and finer

details about how those spike patterns are generated.

And so that's what's being quantified as we move down this axis.

As the time discretization of the word, the bin size, gets smaller and

smaller, it's able to capture more and more of the variability in the spike

train that's actually signaling something different about the stimulus.

But at some point, it seems that the information stops increasing.

So, on this red curve, we're at between about 80 and 100 bits per second

for the information rate. And you see that it stops increasing

with delta t at a delta t of about 2 milliseconds.

So hopefully you'll remember, from the jitter in the spike trains that we looked

at, that they seem to be repeatable on a time scale of about one or two

milliseconds. So that time scale delta t corresponds to the

time scale at which the jitter in the spike train

still allows one to read it off as an encoding of the same stimulus.

It's going to quantify approximately the temporal width at which one can

discretize this spike train and still extract all the information about the

stimulus that distinguishes it from other stimuli.

So in this example we've seen one case where we didn't have enough data to be

able to sample, say, very long words. In general, this is always true.

When one's trying to calculate information theoretic quantities, one

needs to know the full distribution of responses, and the full distribution of

stimuli. And there's simply never enough data to

come up with really reliable estimates for information, unless one has very

simple experimental setups. And so a lot of effort has been put into

finding ways to correct the sampled distributions for the fact that there is

a finite amount of data. And there's been some very interesting

work by a number of groups over the last 15 years or so, that has made significant

advances in being able to compute information theoretic quantities from

finite amounts of data. Now we're going to turn to a different

approach, this one proposed by [UNKNOWN] Brenner and [UNKNOWN].

How much does the observation of a single spike tell us about the stimulus?

Now this is similar to the case that we started with at the beginning of this

lecture, but now we're going to address the question that we noted then: what if

we don't know exactly what it is about the stimulus that triggered the spike?

It turns out that, as in the case we just went through, it is straightforward to

compute information without any explicit knowledge of what exactly in the input is

being encoded. This is because the mutual information

gives us a way to quantify the relationship between input and output

without needing to make any particular model of that relationship.

So, the paradigm is exactly the same as before.

We're going to compute the entropy of responses, when the stimulus is random,

and the entropy, when given a specific stimulus.

So here, things are a little simpler than in the case of words. Without

knowing the stimulus, the probability that a single spike occurred is given by

the average firing rate times the bin size.

Similarly, the probability of no spike is just 1 minus that.

Now, the probability of a spike at a given time during the presentation of a

stimulus is r of t times the time bin, where now r of t is the time-varying rate

caused by the changing stimulus. We can get an estimate of that time-varying rate

by repeating the input over and over again.

The variability in these responses means that these events show a continuous

variation and have some width, as we saw before, depending on the jitter in the

spike times. So let's go ahead and compute the

entropy. We're going to define, for the moment, p

equals r bar delta t and p of t to be r of t delta t.

The information will simply be the difference between the total entropy,

which we've already computed at the beginning of the lecture for this

binomial case to be minus p log p minus (1 minus p) log (1 minus p), and we need to

subtract from that the noise entropy. Now, the noise entropy will take on a

value at every time t, depending on the time-varying firing rate.

Now again every time t represents a sample of stimulus S.

And averaging over time is equivalent to averaging over the distribution of s.

This ability to swap an average over the ensemble of stimuli for an average over

time is known as ergodicity: the different values of S are visited in

time with a frequency that's equivalent to their probability.

So now that we have our expression for the information between response and

stimulus, we can do some manipulations on it.

So we're replacing p by r delta t. We can take the time-averaged firing rate

to be equal to the mean firing rate, so that's equivalent here to the

integral over the probability as a function of time, in the mean, going

toward that mean firing rate. And getting rid of some small terms, we

have here a couple of extra pieces that turn out to be small, and we end up with

a rather neat expression for the information per spike.
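
In the small-delta-t limit, this works out to the information per spike: (1/T) times the integral of (r(t)/r-bar) log2 (r(t)/r-bar) dt. A sketch of evaluating it from a sampled time-varying rate:

```python
import numpy as np

def info_per_spike(r):
    """Per-spike information (bits): the time average of
    (r/rbar) * log2(r/rbar), with 0 * log 0 taken as 0.
    The bin width dt cancels, so only the sampled rate is needed."""
    r = np.asarray(r, dtype=float)
    ratio = r / r.mean()
    terms = np.zeros_like(ratio)
    mask = ratio > 0
    terms[mask] = ratio[mask] * np.log2(ratio[mask])
    return terms.mean()

# A rate that is silent half the time and doubled the other half:
r = np.array([0.0, 0.0, 20.0, 20.0])  # spikes/s; mean rate is 10
bits = info_per_spike(r)  # -> 1.0 bit per spike
```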

Let's take a closer look at this expression. As we've emphasized already,

this method of computing information has no explicit stimulus dependence.

Meaning no need for any explicit coding or decoding model.

It relies on the repeated part of the stimulus being a good estimate of the

distribution of possible stimuli. Note also that although we computed this

for the arrival or not of a single spike, this formalism could be applied to the

rate of any event. For example the occurrence of a specific

symbol in the code. So this is a way to evaluate how much

information might be conveyed by a particular pattern of spikes, for example

a certain interspike interval. We can also examine what determines the

amount of information in the spike train.

So looking again at this expression, we can see that it's going to be determined

by two things. One is timing precision: jitter in the spike times

is going to blur this function r of t.

So if events are blurred so that R of T increases and decreases slowly, without

reaching large values, this will reduce the information.

At the extreme, let's imagine that the response is barely modulated at all by

this particular stimulus. In that case, r of t goes towards the

average firing rate. And one gets no information.

The more sharply and strongly modulated r of t is the more information it contains.

The other factor is the mean firing rate. If the spike rate is very low, then the

average firing rate is small and the information per spike is likely to be large.

The intuition is that a low firing rate signifies that the neuron responds to a

very small number of possible stimuli, so that when it does spike, it's extremely

informative about the stimulus. Note that this is the information per

spike. The information transmitted as a function

of time, or the information rate, is going to be small for such a neuron.

So let's look at some hypothetical examples.

Rat hippocampal neurons have what's known as a place field such that when the rat

runs through that region in space, the cell fires.

Let's imagine the place field looks like this.

As the rat runs around, it's going to pass through that place field,

and what's the firing rate going to look like?

Here, as it moves through the field, the rate is going to go from zero, ramp up kind of

slowly, and go down again. Because that place field is quite large,

the rat is likely to pass through it fairly often.

So we're going to get some R of T of that form.

Now let's imagine that the place field is very small.

Now, the rat runs around, and very, very rarely passes through

that place field. And so now we're going to get almost no

firing, and then some blip of firing as it passes through that field.

Now, what if the edges of the place field are very sharp?

So now, again, as the rat runs around, it passes through that place field very

rarely, but when it does, the firing rate increases very sharply toward its

maximum. So that's going to increase the

information we get from such a receptive field.
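
That intuition can be checked numerically with the per-spike expression from earlier in the lecture: compare a broad, shallow place field crossed often against a narrow, sharp one crossed rarely (both rate profiles are invented for illustration):

```python
import numpy as np

def info_per_spike(r):
    """Per-spike information (bits): time average of (r/rbar) log2 (r/rbar)."""
    r = np.asarray(r, dtype=float)
    ratio = r / r.mean()
    terms = np.zeros_like(ratio)
    mask = ratio > 0
    terms[mask] = ratio[mask] * np.log2(ratio[mask])
    return terms.mean()

t = np.arange(0.0, 100.0, 0.01)  # 100 s of exploration in 10 ms bins

# Broad field crossed often: smooth, shallow modulation of the rate.
broad = 5.0 + 5.0 * np.sin(2 * np.pi * t / 10.0) ** 2

# Narrow, sharp-edged field crossed rarely: mostly silent, brief strong bursts.
narrow = np.where((t % 20.0) < 0.5, 40.0, 0.0)

# The rare, sharply modulated response carries more bits per spike.
assert info_per_spike(narrow) > info_per_spike(broad)
```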

Okay, so now we're done with computing information in spike trains.

Next up we'll be talking about information and coding efficiency.

We'll be looking at natural stimuli. What are the challenges posed to our

nervous systems by natural stimuli? What do information theoretic concepts

suggest that neural systems should do when they encode such stimuli?

And finally, what principles seem to be at work in shaping the neural code?