Named after the mathematician Bayes, whose work was then further extended by the

famous mathematician Laplace. And Laplace raised a very interesting

question, what's the chance that the sun will rise tomorrow if you have observed

that it has been rising every morning in the past, say, 100 million days?

Now, you may think this is a funny question.

But, for now, ignore the fact that you know something about the underlying

physics of why the sun rises. View this purely as a question of what you

observe. Something has been coming up every morning

the last 100 million days, and then what's the chance you predict it will come up

again tomorrow? Can you say, well, it must be exactly 100

percent, mathematically? Again, we're not talking about the underlying

physics, just that this is such an overwhelming number of observations.

So, what should it be? And this is an interesting thought

experiment that we can simplify a little further: suppose I give you a

sequence of n experiments and I say, in s out of the n experiments, for s smaller than n,

I observed one. Okay?

And I ask you the question, what's the chance the next experiment also returns

one? Now, without going through the foundations

of probability theory, intuitively, the answer is just s over n.

That is, in the past, s out of n runs gave us this result of one as, well, our

observation. So, call this question number one.
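Question one's frequency answer is simple enough to state as code; here is a minimal sketch (the function name is ours, for illustration, not from the lecture):

```python
def frequency_estimate(s, n):
    """Question one: predict the chance that the next run returns one
    as the observed frequency of ones, s out of n."""
    return s / n

# If 7 of the last 10 runs returned one, predict 0.7 for the next run.
print(frequency_estimate(7, 10))
```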

Suppose now, I switch to another question and say that I also run n experiments.

And the experiment is actually that there's a coin that's loaded, and I flip

it. If it's heads, then I return one.

If it's tails, then I return zero. And again, in s out of the last n runs, I

observed one. I'll ask you the question, what's the

chance that next time you'll also see one? You may say, hold on, isn't this question

the same as the last question? Shouldn't the answer also be s over n?

Well, not according to the Bayesian view of probability.

The Bayesian view is a very powerful and, for some time, a controversial view in

the probability theory community. But, what it says is that, now that I've

given you some prior information that the experiment consists of flipping a loaded

coin, then you'll be able to make some model of it based on the observations

in the last n rounds. And therefore, your answer may be

different. What is that answer?

Let's derive that in the next five minutes and I will come back to answer this

question: why is it different from s over n?

The essence of the Bayesian view, the philosophical underpinning is captured in

this picture. You've got an underlying model.

Later in the course, we'll also see different kinds of latent factor models,

we'll see hidden models, we'll see reverse engineering of network topology, as well

as protocols. Philosophically, they follow a similar

spirit. In this simple question, the underlying

model is just captured by a parameter p. It is the chance that a single coin flip

of this loaded coin will result in heads, and therefore the observation one.

Now, different p's will clearly lead to different observations.

That goes without saying, but the Bayesian view also says that different observations

tell me something about p. And, the more observations I have, the better I

can build a model for p from which I can then do forward engineering to predict the

future outcome. You first reverse engineer p before

you make a prediction. So, in our case, we say that, if you know

the value of p, then we know that the chance of seeing s heads out of n flips is

simply a binomial distribution. It is p to the power s, cuz observed s

such cases. One minus p to the power n minus s, cuz we

observed n minus one of such cases, and there are edge whose asked possible

[unknown] arrangements of that sequence of s out of n being one.

So, this is just a binomial distribution, and we all know that for a fixed p.
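The binomial chance just described can be sketched in Python; this is a minimal check using the standard library's `math.comb` for the n-choose-s count (the parameter values are arbitrary):

```python
from math import comb

def binomial_pmf(s, n, p):
    """Chance of seeing exactly s heads (ones) in n flips of a coin with bias p:
    (n choose s) * p**s * (1 - p)**(n - s)."""
    return comb(n, s) * p ** s * (1 - p) ** (n - s)

# For a fixed p, the chances over all possible values of s sum to one.
n, p = 10, 0.3
print(binomial_pmf(3, n, p))  # chance of exactly 3 heads in 10 flips
print(sum(binomial_pmf(s, n, p) for s in range(n + 1)))
```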

What is now flipping things around and turning the tables is that, since that's

the case, then the probability distribution of p, and let's call that f

of p, should be proportional to this observation frequency.

If we count the frequency in the observation of heads, then that will tell

us something about the underlying probability distribution of this value of

p. And let's just pause for one second

because while it sounds intuitive, it's actually counter-intuitive at first sight.

Because we're now saying that, before we go ahead and predict, let's build

a model of p, and that model says that p's distribution should be proportional to our

observation. This proportionality principle is what

Laplace did to extend Bayes' understanding of the relationship between observation

and model. And I say it's proportional because this

by itself is not a distribution. We have to normalize it by integrating

over all the possible p's, where p can range from zero to one.

And now, this is indeed a probability distribution, okay?
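In symbols, the normalization just described reads as follows, using f for the density as in the transcript (a sketch consistent with the derivation above, with q as the integration variable):

```latex
f(p) \;=\; \frac{p^{s}\,(1-p)^{\,n-s}}{\displaystyle\int_{0}^{1} q^{s}\,(1-q)^{\,n-s}\,dq},
\qquad 0 \le p \le 1 .
```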

And as a function of p, that's the probability distribution f of p.

So, all we need to do now is to evaluate this integral and then do the division

skipping those detailed steps, because that's not what we care about in this course.

We get the following answer. It's n plus one factorial, over s factorial

times n minus s factorial, times p to the s, times one minus p to the n minus s.
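As a quick sanity check on that normalizing constant, the integral of p to the s times one minus p to the n minus s over [0, 1] should equal s factorial times n minus s factorial over n plus one factorial. A minimal numerical sketch (midpoint rule; the function names and parameter values are ours):

```python
from math import factorial

def normalizer_exact(s, n):
    """Closed form of the integral of p**s * (1-p)**(n-s) over [0, 1]:
    s! * (n-s)! / (n+1)!."""
    return factorial(s) * factorial(n - s) / factorial(n + 1)

def normalizer_numeric(s, n, steps=200_000):
    """Midpoint-rule approximation of the same integral, for comparison."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** s * (1 - (i + 0.5) * h) ** (n - s)
                   for i in range(steps))

s, n = 7, 10
print(normalizer_exact(s, n))    # closed form
print(normalizer_numeric(s, n))  # numerical approximation, should agree closely
```

Dividing the binomial kernel by this constant is exactly what turns the proportionality into the probability distribution f of p stated above.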