In this case, when we give that same stimulus, the noise now

stretches over a much wider range of the response distribution.

So much more of the variability in R is accounted for by variability in the

responses to specific stimuli. And so I hope you can see that this

set of response distributions is going to encode much more information

about S, that the mutual information between S and R is much larger

in this case than it is in the previous case.

Let's play a little bit with these distributions, because I want to

demonstrate a couple of things that I think really illustrate why information

is a useful and intuitive measure of the relationship between two variables.

I'm using capital letters to denote the random variable, and lower case letters

to denote a specific sample from that random variable.

So, what I'd like to show you is that the information quantifies how far from

independent these two random variables R and S are.

To demonstrate that, I'm going to use the Kullback-Leibler divergence, a measure of

the difference between probability distributions that we introduced earlier.

If the mutual information measures independence, then we'd like to quantify

the difference between the joint distribution of R and S and the

distribution these two variables would have if they were independent.

That is, that joint distribution would simply be the

product of their marginal distributions. So first, to refresh your memory,

let's restate the definition: D KL between two probability

distributions, say P and Q, is equal to the integral

over x of P of x times the log of P of x over Q of x.
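Written out, the D KL definition just stated, and its application to the joint distribution versus the product of marginals (which is the mutual information), are:

```latex
D_{KL}(P \,\|\, Q) = \int dx \, P(x) \log \frac{P(x)}{Q(x)}

I(S;R) = D_{KL}\bigl(P(s,r) \,\|\, P(s)\,P(r)\bigr)
       = \int ds \, dr \, P(s,r) \log \frac{P(s,r)}{P(s)\,P(r)}
```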

So now let's apply that to these two distributions.

Let's compute it: we have an integral over ds and over dr of the

joint distribution, times the log of the joint distribution divided by the

product of the marginal distributions. Now we can rewrite that using the

conditional distribution, in the following form: the probability of r

given s times the probability of s is just equivalent to the joint

distribution, divided by P of r times P of s.

And now you can see that P of s cancels out, and we can rewrite this as a

difference of two terms; we'll just expand that log.

All right. Now let's concentrate on this term.

It's going to be equal to minus the integral ds dr of the probability of

s and r, times the log of P of r, plus the integral ds dr of the joint

distribution. Now let's break that joint distribution up into P of s times

P of r given s, dividing it again into its marginal and

conditional, times the log of P of r given s.

Now, let's look at the terms that we've

developed here. In the first term, we can see that we can just integrate

over ds, integrating the s part out of the joint distribution,

and this part is simply going to be the entropy of P of r.

Whereas the second term is going to be the entropy of P of r given s,

averaged over s with weight P of s.
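Collecting these steps, the derivation just walked through is:

```latex
\begin{align}
D_{KL}\bigl(P(s,r)\,\|\,P(s)\,P(r)\bigr)
  &= \int ds\, dr\, P(s,r) \log \frac{P(r \mid s)\,P(s)}{P(r)\,P(s)} \\
  &= -\int ds\, dr\, P(s,r) \log P(r)
     + \int ds\, dr\, P(s)\,P(r \mid s) \log P(r \mid s) \\
  &= H(R) - \int ds\, P(s)\, H(R \mid s),
\end{align}
```

where H(R) is the response entropy, minus the integral of P(r) log P(r), and H(R | s) is the noise entropy of the responses to a particular stimulus s.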

And so what I've shown you is that this form, in terms of the Kullback-Leibler

divergence, gives us back the form that we've already seen:

the entropy of the responses minus the average over the

stimuli of the noise entropy for a given

stimulus. What I hope you realize is that

everything we've done here in terms of response and stimulus we could simply

flip: swap response and stimulus, redo the same calculation, and instead end up with

the entropy of the stimulus minus an average over the responses of the entropy of

the stimulus given the response. So information is completely symmetric

in the two variables it's being computed between.

The mutual information between response and stimulus is the same as the mutual

information between stimulus and response.
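As a quick numerical check of both the entropy-difference form and this symmetry, here is a small sketch using a made-up discrete joint distribution (the values in `p_sr` are arbitrary, chosen only for illustration):

```python
import numpy as np

# Made-up joint distribution p(s, r): rows index stimuli s, columns responses r.
p_sr = np.array([[0.30, 0.10, 0.05],
                 [0.05, 0.20, 0.30]])

def entropy(p):
    """Entropy in bits of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """I(X;Y) in bits, computed as D_KL( p(x,y) || p(x) p(y) )."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over columns
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over rows
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

# Entropy-difference form from the derivation: H(R) - sum_s p(s) H(R | s).
p_s = p_sr.sum(axis=1)
p_r = p_sr.sum(axis=0)
p_r_given_s = p_sr / p_s[:, None]
noise = np.sum(p_s * np.array([entropy(row) for row in p_r_given_s]))
i_entropy_form = entropy(p_r) - noise

# Both forms agree, and transposing the joint (swapping S and R) changes nothing.
print(np.isclose(mutual_information(p_sr), i_entropy_form))              # True
print(np.isclose(mutual_information(p_sr), mutual_information(p_sr.T)))  # True
```

Swapping S and R is just transposing the joint distribution, which is why the two mutual informations come out identical.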