We previously discussed the problem of learning the parameters of a Bayesian network whose structure was given to us. But that is not always the situation: we don't always have a domain expert who understands the structure of the domain, the relationships between the variables, well enough to write down a network that we think is good enough.

So, what are some cases where this might happen?

The first is when we actually want to use a network for answering new queries, of whatever kind, for example, new medical queries in a medical domain, or in other settings. In this case, even if there is some domain expertise, it might not be sufficient to come up with a model that is good enough to use, and we might do better by taking data and actually learning the dependencies that the data tells us are most significant. A second setting is where we don't necessarily even care about using the network; we just want to discover the network. This kind of use model occurs, for example, in scientific or biological data sets, where the goal is to discover the interrelationships between the variables just so that we understand the domain better.

So, let's focus for a moment on the first of these two cases, where our goal is to learn a network for the purpose of using it, and let's think about the kinds of mistakes that might happen and how they affect the network's ability to apply well to new instances. Imagine that the true network is this one over here, and consider the different kinds of mistakes that a learning algorithm might make.

In one case, for example, the learning algorithm might learn a network that is missing an arc. On the other hand, it might learn a network to which an arc has been added. Now, of course, one can also have cases with a bit of each, but let's think about these two types of errors separately. If we're missing an arc, the independencies that the learned network asserts are effectively incorrect. That is, relative to the ground-truth network, which is this one, and which, if you remember, we called G*, the learned network is making independence assumptions that are not correct relative to G*. Conversely, when we have added an arc, we have a spurious dependency, in this case between A and B.

Now, what are the implications of this? If we're missing an arc, then the correct distribution P* cannot be learned. That is, if P* is associated with this network over here, so that G* is a perfect map for P*, then in general there is no way to learn P* using a network that is missing an edge. Now, that might seem like a very bad outcome, and we might therefore prefer the error mode on the right, which, even though it has these excess edges, does allow us to correctly learn P*: even with this additional edge between A and B, there is a setting of the parameters for which the correct distribution P* can still be estimated using this network structure.
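To make this concrete, here is a minimal sketch, not from the lecture itself, of why the added arc is harmless in principle: if A and B are truly independent, we can set every row of the CPD P(B | A) equal to P(B), so B effectively ignores its extra parent and the joint distribution is exactly P*. The variable names and probabilities below are illustrative assumptions. Note that no analogous trick exists for a missing arc: with the edge absent, no parameter setting can introduce a dependency that P* requires.

```python
# Illustrative sketch: a spurious arc A -> B can be parameterized away.
p_a = {0: 0.7, 1: 0.3}  # P(A); probabilities are made up for illustration
p_b = {0: 0.6, 1: 0.4}  # P(B) in the true network G*, where A and B are independent

# CPD for B in the learned network with the extra arc A -> B:
# one row per value of A. Making every row equal to P(B) renders
# the added dependency vacuous.
p_b_given_a = {a: dict(p_b) for a in p_a}

for a in p_a:
    for b in p_b:
        joint_true = p_a[a] * p_b[b]                # P*(a, b)
        joint_learned = p_a[a] * p_b_given_a[a][b]  # joint in the learned structure
        assert abs(joint_true - joint_learned) < 1e-12
```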

So, it might seem that this second error mode is actually better than the one where we're missing an arc. But empirically, that turns out to be an oversimplified view. Specifically, the model with excess arcs gives rise to an increase in the number of parameters. And the more parameters we have to estimate, the harder it is to estimate them correctly from a limited amount of data. We've already talked about this when we talked about fragmentation: the data fragments as the number of parents increases. And so, one of the important conclusions, as we've already seen, is that this can give rise to worse generalization, which means worse performance on unseen instances.
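Here is a rough sketch of the fragmentation effect, with an assumed data-set size: for a binary variable with k binary parents, the CPD has 2^k rows, so with a fixed data set the average number of samples available to estimate each row shrinks exponentially.

```python
# Fragmentation sketch: parameters grow, data per parameter shrinks.
n_samples = 1000  # hypothetical data set size (an assumption, not from the lecture)

for k in range(6):
    n_rows = 2 ** k               # parent configurations for k binary parents
    per_row = n_samples / n_rows  # average samples available per CPD row
    print(f"{k} parents: {n_rows} CPD rows, ~{per_row:.1f} samples per row")
```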

And so, the fact is that sometimes having fewer edges can actually generalize better than having more edges, even though the correct distribution cannot be encoded. The tradeoffs between missing arcs and extra arcs are therefore subtle, and it's not always clear which is going to give us the best performance.

So, with that introduction, what is the way in which we typically learn Bayesian network structure from data? The general paradigm, which we will refine in subsequent parts of this segment of the course, is that we define a scoring function that evaluates, for each structure, how well that structure matches the data. So here, we have a data set D and, in this case, three example network structures. We're going to define a scoring function that tells us, for each of these network structures, how good it is relative to the data that we've seen. And then, we have the goal of searching for a network structure that maximizes the score. We have now turned the learning problem into an optimization problem: an optimization over a combinatorial space, the space of network structures.
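As a small illustration of what such a score might look like, here is a sketch of the simplest option, a maximum-likelihood score that decomposes into per-family terms. This is one possible score, not necessarily the one the course will settle on; the data set and variable names are hypothetical.

```python
from collections import Counter
from math import log

def family_log_likelihood(data, child, parents):
    """Log-likelihood contribution of one family (child given its parents),
    under maximum-likelihood CPD estimates."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(n * log(n / parent_counts[pa]) for (pa, _), n in joint.items())

def log_likelihood_score(data, structure):
    """structure maps each variable to its list of parents; the score
    decomposes as a sum over families."""
    return sum(family_log_likelihood(data, x, ps) for x, ps in structure.items())

# Hypothetical usage on a tiny binary data set over variables A and B:
data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
print(log_likelihood_score(data, {"A": [], "B": ["A"]}))  # structure with arc A -> B
print(log_likelihood_score(data, {"A": [], "B": []}))     # structure with no arcs
```

Note that raw likelihood never decreases when an arc is added, which is exactly the excess-arc failure mode discussed above; practical scores therefore penalize complexity, a point to keep in mind when scores are discussed in detail.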

By defining a score, we have the objective that we're trying to optimize, and now we need to come up with an algorithm that optimizes that score.
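As one possible sketch of this optimization component, here is a simple greedy hill-climbing search over single-arc additions and deletions. This is just an illustration of searching the combinatorial space of structures, not necessarily the algorithm the course will present, and it assumes a score function such as the log_likelihood_score sketch above.

```python
def creates_cycle(structure):
    """True if following parent links from some variable revisits it."""
    def visit(x, stack):
        if x in stack:
            return True
        return any(visit(p, stack | {x}) for p in structure[x])
    return any(visit(x, set()) for x in structure)

def neighbors(structure):
    """All structures reachable by adding or deleting a single arc."""
    for child in structure:
        for parent in structure:
            if parent == child:
                continue
            s = {x: list(ps) for x, ps in structure.items()}
            if parent in s[child]:
                s[child].remove(parent)  # arc deletion
            else:
                s[child].append(parent)  # arc addition
                if creates_cycle(s):
                    continue             # keep the graph acyclic
            yield s

def greedy_search(data, variables, score):
    """Hill-climb from the empty graph, taking any improving single-arc move."""
    current = {x: [] for x in variables}
    best = score(data, current)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current):
            s = score(data, cand)
            if s > best:
                current, best, improved = cand, s, True
    return current, best
```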

So, we've separated this out into the score component and the optimization component, and we'll talk about each of these separately in subsequent parts.