So let's think about how to actually perform this maximum likelihood estimation.

So let's go back to the log likelihood function. Now we're going to multiply it by one over M, so that we don't have a scaling issue where the log likelihood continues to grow as the number of data instances grows. And that gives us this expression over here, where, note, the M has disappeared as a multiplier from the log partition function, and now we have a one-over-M term in the linear component as well. And now if we go back and take the derivative of this expression relative to a particular parameter theta_i, we have two pieces.

We have already said that the derivative of the log partition function relative to a parameter theta_i is the expectation, under P_theta, of f_i, which is the feature that theta_i multiplies.

On the other hand, if we look here at the derivative of the linear term relative to theta_i, we can see that what we have is also an expectation, an empirical expectation: the average of f_i in the data distribution. And so the derivative of the log likelihood relative to theta_i is the difference of two expectations: an empirical expectation, and an expectation in the parametric distribution defined by my current set of parameters, theta.

The one immediate consequence of this is that a parameter setting theta-hat is the maximum likelihood estimate if and only if we have equality of expectations; that is, the expectation relative to the model defined by theta-hat is equal to the expectation in the data D. So the expectation in the model has to be equal to the expectation in the data, and that has to hold for every one of the features in my log-linear model.
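To make this gradient concrete, here is a minimal sketch (not from the lecture) that computes E_D[f_i] minus E_theta[f_i] by brute-force enumeration for a made-up log-linear model over two binary variables; the features and data are invented for the example.

```python
import math
from itertools import product

# Hypothetical log-linear model over two binary variables:
# P(x) is proportional to exp(theta_0 * f_0(x) + theta_1 * f_1(x)).
# The features and data below are made up for illustration.
features = [
    lambda x: float(x[0]),          # f_0: value of the first variable
    lambda x: float(x[0] == x[1]),  # f_1: do the two variables agree?
]

def gradient(theta, data):
    """Gradient of the (1/M-scaled) log likelihood: E_D[f_i] - E_theta[f_i]."""
    states = list(product([0, 1], repeat=2))
    unnorm = {s: math.exp(sum(t * f(s) for t, f in zip(theta, features)))
              for s in states}
    Z = sum(unnorm.values())  # the partition function couples all parameters
    model_exp = [sum(unnorm[s] / Z * f(s) for s in states) for f in features]
    emp_exp = [sum(f(x) for x in data) / len(data) for f in features]
    return [e - m for e, m in zip(emp_exp, model_exp)]

data = [(1, 1), (0, 0), (1, 0), (1, 1)]
grad = gradient([0.0, 0.0], data)  # at theta = 0, model expectations are 0.5
```

The gradient is zero exactly when the model expectations match the empirical ones, which is the moment-matching condition above.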

Now that we've computed the gradient, let's think about how one might use it to actually do the optimization. So it turns out that a very commonly used approach to doing this kind of likelihood optimization, in the context of CRFs for example, is a variant of gradient ascent.

So here we have a likelihood surface, which in this case is a fairly nice function. And what we're going to do is use a gradient ascent process where, at each point, we compute the gradient, take a certain step in that direction, and continue to climb up. And the gradient that we computed on the previous slide is what we're using.

Now it turns out that plain vanilla gradient ascent is not actually the most efficient way to go. Typically one uses a variant of gradient ascent called L-BFGS, which is a quasi-Newton method. So, very briefly, a Newton method is something that constructs a second-order approximation to the function at each point, as opposed to just a linear approximation, which looks only at the gradient. A quasi-Newton method says: we don't want to spend the computational cost, in general, to compute a second-order, or quadratic, approximation to the function, so it makes an approximate guess at what the step would be if we were to compute a second-order approximation. We're not going to go into the details of L-BFGS here, but there's plenty of literature on that that you can find. Now, in either case, for computing the

gradient in this context, the critical computational cost is the fact that we need the expected feature counts. And the expected feature counts come up in two places in the gradient: once when we compute the empirical expectation, E relative to the data D of the feature f_i, but also in the second term of the derivative, the expectation relative to the current model, E under the parameter vector theta of the feature f_i.

Now, this second piece is really the rate-limiting step, because it requires that, at each point in the process, we conduct inference on the model in order to compute the gradient. And so this is an expensive computation, because it requires that we run inference in a graphical model, which we know to be a not-inexpensive step, depending on the complexity of the graphical model.
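As an illustration of the optimization loop (again with made-up features and data), here is a sketch that runs plain gradient ascent and re-computes the model expectations, the "inference" step, at every iteration. A real implementation would substitute L-BFGS and proper graphical-model inference for the brute-force enumeration used here.

```python
import math
from itertools import product

# Made-up model and data: two binary variables, features f_0(x) = x_0
# and f_1(x) = x_0 * x_1. Real code would use L-BFGS plus clique-tree
# inference; here, plain gradient ascent and brute-force enumeration.
features = [lambda x: float(x[0]), lambda x: float(x[0] * x[1])]
data = [(1, 1), (1, 0), (0, 0), (1, 1)]
states = list(product([0, 1], repeat=2))

def model_expectations(theta):
    # The expensive step: every gradient evaluation requires inference
    # (here, summing over the whole joint) to get E_theta[f_i].
    w = [math.exp(sum(t * f(s) for t, f in zip(theta, features)))
         for s in states]
    Z = sum(w)
    return [sum(wi * f(s) for wi, s in zip(w, states)) / Z for f in features]

emp = [sum(f(x) for x in data) / len(data) for f in features]  # E_D[f_i]

theta = [0.0, 0.0]
for _ in range(2000):  # fixed step size; L-BFGS would converge much faster
    mod = model_expectations(theta)
    theta = [t + 0.5 * (e - m) for t, e, m in zip(theta, emp, mod)]

mod = model_expectations(theta)  # at the optimum these match emp
```

At convergence the model expectations equal the empirical ones, as the moment-matching condition requires.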

So to make this concrete, let's look at an actual example.

Let's return to our example of the Ising model, where the energy function of a set of random variables X_1 up to X_n is a sum of a bunch of pairwise terms, w_ij x_i x_j, and a bunch of singleton terms, u_i x_i; and it's an energy function, so it's negated.

And so, plugging into the gradient equation that we had before, let's now look at what the derivative looks like for the two types of parameters that we have here: the singleton parameters u_i and the pairwise parameters w_ij.

For the singleton parameter u_i, if we just go ahead and plug into this expression over here, the empirical expectation is simply the average of x_i[m] over the data instances m. That is, assuming that each variable x_i has its own parameter u_i, we sum up the value of the variable x_i over all data instances m, and that's our empirical expectation once we average it over all of the data instances. Now, what about the model expectation, the expectation relative to theta?

Well, here we have two cases: we need to consider the probability that x_i is equal to +1 and the probability that x_i is equal to -1. The case where x_i is +1 makes a contribution of +1 to the expectation, and the case where x_i is -1 makes a contribution of -1 to the expectation. And so altogether we have P_theta(x_i = 1) minus P_theta(x_i = -1), and that's because the two cases have these two different coefficients, +1 and -1.
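To see the singleton term concretely, here is a small sketch (parameters made up for the example) that checks E_theta[x_i] = P_theta(x_i = 1) - P_theta(x_i = -1) on a tiny three-node Ising chain by enumeration.

```python
import math
from itertools import product

# Tiny 3-node Ising chain with made-up parameters; variables take values +-1.
# The exponent below is the already-negated energy, so P(x) is
# proportional to exp(score(x)).
w = {(0, 1): 0.5, (1, 2): -0.3}  # pairwise parameters w_ij (hypothetical)
u = [0.2, 0.0, -0.1]             # singleton parameters u_i (hypothetical)

def weight(x):
    score = sum(wij * x[i] * x[j] for (i, j), wij in w.items())
    score += sum(ui * xi for ui, xi in zip(u, x))
    return math.exp(score)

states = list(product([-1, 1], repeat=3))
Z = sum(weight(s) for s in states)

# Model term of the derivative with respect to u_0:
# E_theta[x_0] = P_theta(x_0 = 1) - P_theta(x_0 = -1).
p_plus = sum(weight(s) for s in states if s[0] == 1) / Z
p_minus = sum(weight(s) for s in states if s[0] == -1) / Z
e_x0 = sum(weight(s) * s[0] for s in states) / Z  # direct expectation
```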

A slightly more complicated version of this comes up when we take the derivative relative to the pairwise parameter w_ij. Here, once again, the empirical expectation is simply the average of x_i[m] x_j[m] over the data instances m, and the model term has four different probabilities, corresponding to the four cases: x_i and x_j are both +1, both -1, or one takes the value +1 and the other -1 and vice versa. The two cases (+1, +1) and (-1, -1) have a coefficient of +1, because the product x_i x_j is +1 in those cases, and the other two have coefficients of -1. And so we end up with the expression P_theta(x_i = 1, x_j = 1) plus P_theta(x_i = -1, x_j = -1), minus the probability of each of the other two cases.

So, to summarize: we've seen that the partition function in undirected graphical models couples all of the parameters in our likelihood function, and that introduces complexities into this setting that we did not see in the context of Bayesian networks. Specifically, we've seen that there's no closed-form solution for optimizing the likelihood of these undirected graphical models.

On the other hand, we have also seen that the problem is a convex optimization problem, which allows it to be solved using gradient ascent, usually via the L-BFGS algorithm. And because of the convexity of the function, we're guaranteed that this is going to give us the global optimum, regardless of the initialization.

The further piece of bad news, beyond the fact that there is no closed-form solution, is that in order to perform this gradient computation, we need to perform probabilistic inference, and we need to do that at each gradient step in order to compute the expected feature counts. And we've already seen that inference is a costly thing, so this really dramatically increases the cost of parameter estimation in Markov networks, say, in comparison to Bayesian networks.

On the other hand, the features whose expectations we are computing are always within clusters in a cluster graph, or within cliques in a clique tree, because of the family preservation property. And that implies that a single calibration step suffices for all of the feature expectations at once. So a single calibration, say of the clique tree, is going to be enough for us to compute all of the expected feature counts, and then we can go from there and compute the gradient.