We previously described the problem of Bayes net structure learning as an

optimization problem, where we separated the problem into two pieces: one is

defining a scoring function and the second is coming up with an algorithm

that optimizes that scoring function. We talked about how the optimization

problem can be solved in the case where we are trying to learn a network

structure as a tree or rather a forest where each variable has at most one

parent. But what about the more general case

where we are trying to learn an unrestricted network or perhaps one that

has weaker restrictions than simply a tree.
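As a quick reminder of the tree case: once pairwise edge weights are computed (for the likelihood score these are mutual informations, as in the Chow-Liu algorithm), a greedy maximum-weight spanning forest is optimal. A minimal sketch, assuming the weights are already given; all names here are illustrative:

```python
# Kruskal-style maximum-weight spanning forest over precomputed edge scores.
# This is an illustrative sketch of the tree/forest case, not the lecture's code.

def max_weight_spanning_forest(n_vars, edge_scores):
    """edge_scores: dict mapping undirected pairs (i, j) to a weight."""
    parent = list(range(n_vars))            # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    chosen = []
    # Greedily consider edges from highest weight downward.
    for (i, j), w in sorted(edge_scores.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                        # keeps the structure acyclic
            parent[ri] = rj
            chosen.append((i, j))
    return chosen
```

(A score-based variant would also skip edges whose weight is negative, which is what can yield a forest rather than a full tree.)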

So how do we address that problem? First, as a reminder, our input is a

training set, a scoring function, and some set of

possible network structures. And the output, the desired output, is

some kind of network structure that maximizes the score within the set of

structures that we're willing to consider.

So what happens if we try to solve this optimization problem in the case where

the restricted network structure is not the class of trees?

Well, in the context of trees we could apply a simple greedy algorithm for

finding the maximum weight spanning tree but when we have a more general network,

even ones where we allow at most two parents per node, this greedy algorithm

that builds up a tree no longer works. It's no longer guaranteed to find the

optimal scoring network. Now in fact, it turns out to be not

surprising that this simple algorithm is not going to work, because of

this following theorem. The theorem states that the problem of

finding the best network, the one that has the highest score, with at most k

parents is an NP-hard problem for any k bigger than 1.

So for k = 1 we're back in the context of trees or forests, so we have the

polynomial-time algorithm, but even for k = 2 the problem turns out to be NP-hard.

And so no efficient algorithm is likely to be found for this problem.

So what do we do? It turns out that the standard solution

is some kind of heuristic hill climbing search.

So we have a current network, which is the one that we're trying to improve, and

we'll talk later about where we start this process.

And now we look at different changes that we could make to this network.

So we can consider, for example, making local perturbations that may or may not

improve the quality of the network relative to the score.

What we see here, this bar represents the score.

So, for example, this transition adds an edge from B to D, and we can see here

that turned out to improve the score because this bar as we see is larger than

this bar. We can consider other changes, for

example, this transition removes the edge from B to C and also gives rise to a

slight improvement in the score but not as large as the one where we added the

edge from B to D. On the other hand, reversing the edge

from B to C, which is this transition, gives rise to a

decrease in the score. And so we can look around us

and see which changes to the network improve the score, and which ones don't,

and then try to go in the direction that generally improves the

score. There are two main design choices that

one needs to decide on when designing a heuristic search algorithm.

The first is the set of operators. That is, the set of steps that we can

take in perturbing one network to another.

So, for example in the, in the example that we just saw, we were taking local

steps. We were adding edges, deleting edges,

or reversing edges, and these are fairly small perturbations of the network.

There are also algorithms that use much larger global steps, where we take an

entire node out of the network and stick it in somewhere else.

And that's a larger step that is more costly but also takes larger moves in the

space of networks. So this is one set of design choices.

The second set of design choices is which search technique we end up using, so do

we do what is called greedy hill climbing where we are just trying to climb up as

quickly as we can, or do we adopt a different type of local combinatorial

search algorithm, of which there are many different kinds, some of which are listed

here; there are many others, and we are not going to talk about those very much.

So, let's start by talking about greedy hill-climbing, which is the basis for most

of the Bayes network learning algorithms that are in use today.

So, how does greedy hill-climbing work? We start with a given network; initially

this is often just the empty network, and the reason for starting with the empty

network is that it induces sparsity, so we don't get too complicated a network,

one that's likely to overfit. We could also start with the best tree,

which we get from one of the efficient tree-learning algorithms that we talked

about. We can also start with a random

network; this is a way of exploring the space more broadly, instead of starting

off from a fixed starting point. We can start from multiple random

starting points, or, in cases where we have an expert with some knowledge of

the domain, we might use prior knowledge in order to initialize the process.

Having initialized, we now iteratively try and improve the network.

So we consider the different steps that we can take based on the operators that

we defined as being legitimate. And then for each of them we consider the

score that we obtain from all possible changes.

And then we apply the change that most improves the score; this is the greedy

component. That is we look at all possible changes

and we greedily pick the one that improves the score.

That gives us a new network which we can now again update by again considering

possible next step changes that we could apply to this new network.

And this process continues until we get stuck, that is, when none of the

modifications that we have at our disposal improves the score. When that

happens, we are at what is called a local maximum, that is, a place in the space of

networks where no change gives rise to an improvement in the score.
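That loop might be sketched as follows; this is a toy illustration rather than the lecture's implementation. Networks are represented as frozen sets of directed edges, `score` is a caller-supplied black box, and the operators are the local ones just described: edge addition, deletion, and reversal.

```python
# Illustrative greedy hill-climbing over Bayes net structures.
# A structure is a frozenset of directed edges (i, j) meaning i -> j.

import itertools

def neighbors(edges, n_vars):
    """Generate structures one local operator away (add / delete / reverse)."""
    for i, j in itertools.permutations(range(n_vars), 2):
        if (i, j) in edges:
            yield edges - {(i, j)}                   # delete edge i -> j
            yield (edges - {(i, j)}) | {(j, i)}      # reverse edge i -> j
        elif (j, i) not in edges:
            yield edges | {(i, j)}                   # add edge i -> j

def is_acyclic(edges, n_vars):
    """Kahn's algorithm: acyclic iff every node can be topologically sorted."""
    indeg = [0] * n_vars
    for _, j in edges:
        indeg[j] += 1
    queue = [v for v in range(n_vars) if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for a, b in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == n_vars

def greedy_hill_climb(edges, n_vars, score):
    current, current_score = edges, score(edges)
    while True:
        # Greedy step: evaluate every legal one-operator change.
        best, best_score = None, current_score
        for cand in neighbors(current, n_vars):
            if is_acyclic(cand, n_vars):
                s = score(cand)
                if s > best_score:
                    best, best_score = cand, s
        if best is None:          # local maximum: no change improves the score
            return current, current_score
        current, current_score = best, best_score
```

Because each accepted step strictly improves the score, the loop is guaranteed to terminate, at which point we are at a local maximum with respect to these operators.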

What are some of the pitfalls of this greedy hill-climbing process?

Well, the first is that this maximum is potentially only a local maximum.

That is, one that is not the global maximum.

So if we pictorially draw the space of possible networks.

We're going to draw it as continuous, even though it's discrete.

What we have is, sort of, a space that looks like this.

And if we start out from a point over here and we just take small hill-climbing

steps, we might end up in a position that is a very poor local maximum, compared to

the global maximum which could be considerably better.

And this is quite common in the kind of combinatorial search that we use in

this type of situation. A second problem is what's called a

plateau. A plateau, if we go back to this kind of

visualization that we have here, is a case where, looking around us,

there's a whole bunch of networks that all have effectively the same score.

And, while some of the directions might eventually, after a certain

number of steps, give rise to a hill, others might not.

And the problem is that we don't know which direction to go because they all

look exactly the same in terms of the score.

This happens quite often in the case of Bayesian network structure learning,

because whenever you have a network, for example, like this one,

there are other networks that are equivalent to it, for example the one where we just

reverse this edge, and we have the parent that has

two children, as opposed to a chain. This network is equivalent to this one,

I-equivalent. And therefore it's going to have the same

score. And in general we're going to have a lot

of neighbours when we have more complicated graphs.

A lot of networks that are equivalent and therefore have the same score.
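This score equality is easy to see concretely for a likelihood-based score: with maximum-likelihood parameters, the I-equivalent two-node networks X → Y and Y → X achieve exactly the same log-likelihood. A small illustrative check on toy data (not from the lecture):

```python
# Illustrative check that I-equivalent structures get the same likelihood score.
from collections import Counter
from math import log

# Toy binary dataset over two variables (X, Y); purely illustrative.
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]

def loglik_edge(data, parent_idx, child_idx):
    """Maximum log-likelihood of the two-node network parent -> child."""
    n = len(data)
    parent_counts = Counter(d[parent_idx] for d in data)
    pair_counts = Counter((d[parent_idx], d[child_idx]) for d in data)
    total = 0.0
    for (pv, _), c in pair_counts.items():
        total += c * log(parent_counts[pv] / n)   # log P(parent = pv)
        total += c * log(c / parent_counts[pv])   # log P(child | parent = pv)
    return total
```

Both directions reduce to the log-likelihood of the joint distribution, which is why the scores coincide, and why hill climbing sees a plateau of equally scoring neighbors.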

And moving around between them, it's not clear which of those is going to

eventually give rise to a move that breaks us out of the plateau,

into a network that actually has a higher score.

So, these are significant issues in terms of using greedy hill-climbing search for

optimizing the score. The issue of

local maxima and plateaus also relates to the question of the search operators that we

used. We talked about edge addition, edge deletion, and edge reversal, and the

question one might ask is: why on earth would we need edge reversal?

Because, after all, edge reversal can be subsumed by an edge deletion and then an

edge addition, so why do we need to add a whole new set of operators into

the pool that we have to evaluate?

This, it turns out, has to do with the greedy nature of our search.

So to understand that, let's look at this example.

Let's assume that this is our graph, G star, that generated the data, and now

let's assume that we're going to try and learn a network from the

distribution derived from this graph. And so here we have initially our empty

network which we're going to start out with.

And we have two strong correlations in this network.

The one between A and C and the one between B and C.

And let's say that we pick the one between B and C as being stronger and

therefore we're going to add it first. But now there is complete symmetry

between the edge that might go from B to C and the edge that might go from C to B.

these two networks are equivalent and they're going to have the same score, and

so, we can arbitrarily choose the edge that goes from C to B because it's as good

as the other one. Now that that edge is in, we might go

ahead and add the other one. Say we are going to add the edge from A

to C. And the question is now what?

The edge from C to B is wrong. And really we'd like to reverse it.

But if we don't have an edge reversal operation the only way to do that would

be to remove that edge. And then add in the edge in the other

direction. But removing that edge is going to cost

us a lot, in terms of score, because this was a really good edge.

In fact it was the first one that we added.

So removing it is going to cause a hit, in terms of the score.

And it's going to be a sub-optimal move. It's not going to be a great move.

And so if we don't have an edge reversal operation this red network, the one that

has the edge from A to C and the edge from C to B, is now going to be a local

maximum. And we're not going to be able to break

out of it by looking just greedily within one step, unless we have an edge reversal

step. So edge reversal is a way of avoiding these

very common local maxima. But it doesn't address all local maxima,

specifically not the bigger ones that we have also discussed.

So, what is the algorithm that people typically use for addressing this problem

without falling into these local maxima that frequently?

So, it turns out that a pretty good simple algorithm is the following.

We do use greedy hill climbing, but we augment it with two strategies that are,

attempts at avoiding local maxima and plateaus.

The first is what's called random restarts, which is: if you're climbing

up and you get stuck, you take a certain number of random steps

and then start climbing again. And the hope is that if we're at a local

maximum that's fairly shallow then the small number of random steps will move us

into the basin of attraction of a better maximum and we'll continue to climb to a

better optimum. A second strategy, which is useful for

both local maxima and plateaus is what's called a tabu list.

A tabu list is a way of avoiding treading the same ground over and over

again. So, here we have a set of steps that we

have taken. For example, adding an edge from A to B.

or deleting an edge from C to B, and these are the steps that we've recently

taken. We keep a list of the k steps that we've

recently taken and we constrain our search so that it cannot reverse any of

these steps. So, specifically, if add A to B is on our

list, then on our taboo list we have that delete,

the edge from a to b is not something that we can do while the add step is

still in the list of the k most recent steps.

And that forces us to make forward progress rather than doing the same step,

reversing it, trying a different version of it, reversing that.

And that turns out to be helpful both in avoiding local maxima as well as making

some progress in the context of plateaus.

So how well does this work in practice? Let's first look at a synthetic example.

This is the ICU alarm network that we've seen before in the context of Bayes net

parameter estimation. And what we see here are two learning

scenarios. One in the blue line, down here, is

learning parameters only. With the true structure given.

And the green line. Is where we're trying to do both

parameters and structure at the same time, so trying to learn both.


of data instances, M. And on the Y axis the distance between,

the distance are measured as KL divergence or relative entropy, between

the learned network. And the true network.