0:02

We previously defined the notion of Bayesian estimation and showed how it could be applied in the context of a single random variable, say a multinomial random variable. Now we're going to step back to the world of probabilistic graphical models and think about the application of these ideas to the problem of estimating parameters in a Bayesian network.

So let's draw again the probabilistic graphical model that represents Bayesian estimation in a Bayesian network. Just as before, in the single-variable case, we're going to inject into the model explicit random variables that define the parameters. And so here we have two random variables: theta X, which represents the CPD of X, and theta Y given X, which represents the CPD P of Y given X. Now, notice that each of these is actually vector-valued, because there are multiple actual numbers in each of these CPDs, but we're going to draw them as single circles.

Now, once again we can look at this network and read out certain important conclusions. The first important conclusion is that the instances, these X,Y pairs, are independent given the parameters. We can see that by noticing that if I condition on both theta X and theta Y given X, then the X,Y pairs become d-separated from each other, and so we have conditional independence following as a consequence of the structure of the graphical model. Another explicit property that we can read from this diagram is that theta X and theta Y given X are marginally independent. So, a priori, the prior over all of the parameters theta can be written as the product, over the random variables Xi in the network, of the prior over the CPD of Xi; that is, the prior is a product of little priors, one per CPD.
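Written out, this factorization over the parameters is:

    P(\theta) = \prod_i P\left(\theta_{X_i \mid \mathrm{Pa}_{X_i}}\right)

so for the two-variable network here, P(\theta_X, \theta_{Y \mid X}) = P(\theta_X)\, P(\theta_{Y \mid X}).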

2:17

Now, it follows from this, just by writing down the graphical model and looking at the implications of its structure, that the posteriors over the parameters are also independent given complete data. And the reason for that is that complete data d-separates the parameters for the two CPDs. If you look at this network over here and assume that all of these variables are observed, then you can see that there is no active trail between theta X and theta Y given X, because, for example, if we look at this trail, we can see that X[2], being observed, blocks the trail from theta X to theta Y given X.

And so, again following directly from the structure of this network, we can see that the posterior distribution over theta X and theta Y given X, given the data D, decomposes as the product of the posterior over theta X given D times the posterior over theta Y given X, given D. Which means that, just as in maximum likelihood estimation, where we could break up the estimation problem into one of estimating each CPD separately, we can do the same here. Only now we do it using Bayesian estimation, where instead of just picking a single parameter setting for each CPD, we compute the separate posteriors and then put them together into a single joint posterior.
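In symbols, the posterior decomposition just described is:

    P(\theta_X, \theta_{Y \mid X} \mid \mathcal{D}) = P(\theta_X \mid \mathcal{D})\, P(\theta_{Y \mid X} \mid \mathcal{D})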

4:04

Now, it turns out that we can do an even finer-grained breakdown in the context of table CPDs. Here we're looking at the binary case, where X is a binary-valued random variable. So now we have two multinomials in our CPD for Y given X: one corresponding to the case of Y given X = x1 and the other to Y given X = x0. And it turns out that if, as we're assuming in this model, these are independent a priori, which is what this diagram says, because you notice there are no edges between theta Y given x1 and theta Y given x0, so they are marginally independent, then they are also independent in the posterior. Now, that is a little bit trickier to show, because you cannot read it directly from the diagram: in fact, even given complete data, there appears to be an active trail that goes from theta Y given x1 through Y[1], which, since it's observed, activates the V-structure, into theta Y given x0. But it turns out that if we go back to some of the examples of context-specific independence that we had, specifically in the case of a multiplexer CPD, we can derive that these are in fact, despite the appearance of an activated V-structure, conditionally independent in the posterior as well. And so, once again, we can compute the posterior as a product of posteriors of the form P of theta X given D, times the probability of theta Y given x1, given D, times the probability of theta Y given x0, given D.
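For this binary case, the finer-grained decomposition reads:

    P(\theta \mid \mathcal{D}) = P(\theta_X \mid \mathcal{D})\, P(\theta_{Y \mid x^1} \mid \mathcal{D})\, P(\theta_{Y \mid x^0} \mid \mathcal{D})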

6:08

So we can generalize this to a general Bayesian network. Let's assume that we have a Bayesian network with table CPDs, specified in terms of multinomial parameters of the form theta X given u, where u is some assignment to X's parents U. Then if, for each such multinomial parameter, we have a Dirichlet prior with appropriate hyperparameters, we can show, using the kind of analysis that we just did, combined with the analysis of the posterior for a single multinomial, that the posterior is also a Dirichlet, with hyperparameters that represent the prior that we had for that multinomial plus the sufficient statistics: the counts, in the data, of the particular combinations of the parent and the child.
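As a sketch of this update for one table CPD P(X | U): the posterior hyperparameters are the prior hyperparameters plus the sufficient-statistic counts. The function and argument names below are illustrative, not from the lecture:

```python
from collections import Counter

def dirichlet_posterior(alpha, data, x_values, u_values):
    """Posterior hyperparameters for one table CPD P(X | U).

    alpha: dict mapping (x, u) -> prior hyperparameter alpha_{x|u}
    data:  complete-data observations, as a list of (x, u) pairs
    Returns a dict mapping (x, u) -> alpha_{x|u} + M[x, u].
    """
    counts = Counter(data)  # sufficient statistics M[x, u]
    return {(x, u): alpha[(x, u)] + counts[(x, u)]
            for x in x_values for u in u_values}
```

For instance, with a uniform prior alpha_{x|u} = 1 over a binary child and binary parent, each posterior hyperparameter is just one plus the matching count in the data.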

And so, for example, for the entry in the multinomial representing the value little x1 for X and the assignment little u for the parents U, we have this prior hyperparameter plus the count in the data for that combination of x and u.

So now we know how to take a set of priors and use the data to update them to form posteriors. Now, let's think about where the priors might come from. A priori, it might seem very daunting to construct a set of priors for all of the nodes in a Bayesian network.

It turns out, however, that there is a general-purpose scheme for doing that, which is both easy and has some good theoretical properties. That scheme works as follows. What we're going to do is define a prior Bayesian network that has some set of parameters theta zero, and we're going to define a single equivalent sample size alpha.

8:24

This equivalent sample size is going to be applied to all of the nodes in the Bayesian network. And so, in order to specify the hyperparameter alpha x given u, for an assignment X equals little x and U equals little u, we're simply going to compute the probability, in this parameterized prior network, of x and u, and multiply it by the equivalent sample size alpha. Now, in many cases you're just going to use theta zero to be the uniform parameters, which makes it all a very easy computation. But this provides a simple, coherent way to specify all of the hyperparameters simultaneously.
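A minimal sketch of this scheme, assuming the prior-network joint P0 is supplied as a function (the names here are hypothetical, for illustration only):

```python
def bde_hyperparameters(alpha, p0, x_values, u_values):
    """Hyperparameters alpha_{x|u} = alpha * P0(x, u).

    alpha: equivalent sample size (a single number for the whole network)
    p0:    function (x, u) -> joint probability P0(X = x, U = u)
           under the prior network's parameters theta_0
    Returns a dict mapping (x, u) -> hyperparameter.
    """
    return {(x, u): alpha * p0(x, u) for x in x_values for u in u_values}
```

With uniform theta zero over a binary X and a binary parent, P0(x, u) = 1/4, so every hyperparameter is alpha over four, and together they sum back to alpha.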

And so, let's look at an example. Here is a network over X and Y that has no edge, and let's imagine that that is our prior network.

9:47

Now let's look at what we would get in a network where X is a parent of Y, which is the network whose parameters we actually want to estimate. Let's assume that X and Y are both binary. Then theta X is going to be distributed as a Dirichlet with hyperparameters alpha over two, alpha over two. And theta Y given x0 is going to be distributed as a Dirichlet with hyperparameters alpha times the probability of X, Y, which under the uniform distribution is a quarter. And similarly, theta Y given x1 is going to have the same Dirichlet distribution.

And if you think about this, it makes perfect sense, because it tells us that we have seen the same number of examples of X as we have of Y. It's just that for Y, we had to partition the examples over those where we had X = x0 and those where we had X = x1. If, on the other hand, we had, say, Dirichlet of alpha over two, alpha over two for the two multinomials corresponding to Y, this one and this one, it would imply that we've seen twice as many Ys as we've seen Xs.

So let's see what kind of effect using Bayesian estimation has on a pseudo-real-world example.

This is actually a real network. It was developed for monitoring patients in an ICU, and we call it the ICU-Alarm network. The ICU-Alarm network has 37 different variables that represent things like whether the patient was intubated, the patient's blood pressure, heart rate, and various other medical events that might happen. And it turns out that, overall, the network has 504 parameters. Now, there aren't actually data cases here; this was a hand-constructed network. So what we're going to do is sample instances from the network, then pretend that we don't know the network parameters, and see the extent to which we can recover them by learning from the instances that we sampled.

12:54

I should say that this is a pseudo-realistic learning problem, because the instances that one samples from a network are always cleaner than the instances that one gets in a real-world data set. In a real-world scenario, it is rarely the case that the network whose parameters you're trying to learn has the exact same structure as the true underlying distribution from which the data were generated. And so this is a much cleaner scenario, but it's still useful and indicative.

So what we see here are the results of learning, as a function of the x-axis, which is the number of samples; the y-axis is a distance function between the true distribution and the learned distribution. That distance function, which we're not going to discuss in detail at the moment, is a notion called the relative entropy, also known as KL divergence. What we need to know about it for the purposes of the current discussion is that when the two distributions are identical it's zero, and otherwise it's non-negative.
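A direct transcription of that definition for discrete distributions, as a sketch (the lecture does not define it formally here):

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) = sum_x p(x) * log(p(x) / q(x)).

    p, q: discrete distributions over the same outcomes, given as
    sequences of probabilities. Zero when p equals q, positive
    otherwise (assuming q(x) > 0 wherever p(x) > 0).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```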

So what we see here is that the blue line corresponds to maximum likelihood estimation. And we can see several things about that line. First of all, it's very jagged; there are a lot of bumps in it. And second, it's consistently higher than all of the other lines, which means that maximum likelihood estimation, although it does continue to get lower as we get more data, even with as many as five thousand data points still hasn't gotten close to the true underlying distribution.

Conversely, let's see what happens with Bayesian estimation. This is all Bayesian estimation with a uniform prior and different equivalent sample sizes; that is, using a prior network with uniform parameters and different values of alpha. And what we see here is that the lines for alpha equals five, that's the green line, and alpha equals ten are almost sitting directly on top of each other, and they're both considerably lower than all of the other lines, including the one for maximum likelihood estimation. As we increase the prior strength, so that we have a firmer belief in the uniform prior, we can see that we move a little bit away, and the performance becomes a little worse. But notice that by around 2,000 data points we're already pretty close to where we were for an equivalent sample size of five. For alpha equals 50, which is this dark blue line, it takes a little bit longer to converge, and it doesn't quite make it. But even with an equivalent sample size of 50, which is pretty high, you get convergence to the correct distribution much faster than you do with maximum likelihood estimation.

So, to summarize: in Bayesian networks, if we're doing Bayesian parameter estimation, and we're willing to stipulate that the parameters are independent a priori, then they're also independent in the posterior, which allows us to maintain the posterior as a product of posteriors over individual parameters. For multinomial Bayesian networks, we can perform Bayesian estimation using the exact same sufficient statistics that we used for maximum likelihood estimation, which are the counts corresponding to a value of the variable and a value of its parents.

And whereas in the context of maximum likelihood estimation we would simply use the formula on the left, in the case of Bayesian estimation we're going to use the formula on the right, which has exactly the same form, only it also accounts for the hyperparameters. And in order to do this kind of process, we need a choice of prior, and we showed how that can be effectively elicited using a prior distribution, specified, say, as a Bayesian network, together with an equivalent sample size.
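The two formulas just mentioned are, presumably, the count ratio and its hyperparameter-smoothed counterpart; here is a sketch of both for a single parent assignment u (function names are illustrative, not from the lecture):

```python
def mle_estimate(counts, x_values):
    """Maximum likelihood estimate of P(X = x | U = u): M[x, u] / M[u].

    counts: dict mapping x -> count M[x, u], for one fixed parent
    assignment u. Returns a dict x -> estimated probability.
    """
    total = sum(counts[x] for x in x_values)  # M[u]
    return {x: counts[x] / total for x in x_values}

def bayesian_estimate(counts, alpha, x_values):
    """Bayesian (posterior mean) estimate:
    (alpha_{x|u} + M[x, u]) / (alpha_u + M[u]),
    where alpha_u = sum_x alpha_{x|u}. Same form as the MLE, but with
    the hyperparameters added to the counts.
    """
    total = sum(counts[x] + alpha[x] for x in x_values)
    return {x: (counts[x] + alpha[x]) / total for x in x_values}
```

With counts of 1 and 3 for a binary X, the MLE gives 0.25 and 0.75, while adding hyperparameters of 2 each pulls the estimate toward uniform: 3/8 and 5/8.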