1:23

And so if we now take that and put it through the log, in order to define the log-likelihood, and then sum that over multiple data instances, we see that we have the following expression: a summation over instances m of log phi1(a, b) for that instance, plus log phi2(b, c) for that instance, minus log Z.

And here, importantly, notice that I've included the parameters of the model, theta, as an argument to the partition function, which explains why we call it a partition function rather than a partition constant. It is a function of the parameters: as we change the parameters, the partition function changes, and that's going to turn out to be very important.

So now let's continue deriving this expression.

We can see that, just as in the context of Bayesian networks, we can gather like terms and consider, for example, how many times a particular entry of the factor phi1(a, b) is used: it's used in every instance where the variable A takes the value little a and the variable B takes the value little b.

And so we can again accumulate like terms, and we end up with a summation over all possible value combinations little a, little b of the log of that factor entry, phi1(a, b), multiplied by the counts M[a, b].

And similarly, we have exactly the same form for the second factor, over B and C: for every entry in the factor phi2(b, c), we have a number of occurrences of that, which is simply the count M[b, c] in our data set.

And finally, we have this partition function hanging off the end, which is accumulated once for every data instance. So this looks great so far, because, as in the context of Bayesian networks, we've decomposed the likelihood function, or in this case the log-likelihood function rather, into a bunch of these terms, each of which involves one parameter and a count of how many times that parameter is used. So, a beautiful decomposition, and maybe we can just go ahead and optimize each parameter separately.

Except that we have that nasty partition function hanging off the end.

And if we look at the form of the partition function, that form is a sum over all possible values of a, b, and c of the product phi1(a, b) times phi2(b, c).

And one important observation is that when you stick this summation inside a log, the log doesn't flow through the summation, and so you don't get any nice decomposition of the parameters into isolated pieces.

And that is the killer, because the partition function is the term that couples the parameters, and so there's no decomposition of the likelihood into separate independent pieces: parts of it decompose, but the partition function kills that decomposition.

And as a consequence, it turns out that there is no closed-form solution for this optimization problem, unlike in the case of Bayesian networks, where maximum likelihood estimation had a closed form that you can derive directly from the sufficient statistics, which are the data counts.
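To make the decomposition and the role of the partition function concrete, here is a small numerical sketch of the A-B-C chain. The factor values and data instances are made up for illustration; the point is that the log-likelihood splits into count-weighted log factor entries, minus M times log Z.

```python
import math
from itertools import product

# Hypothetical factor tables for the chain A - B - C (values are arbitrary)
phi1 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}  # phi1(a, b)
phi2 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}  # phi2(b, c)

def log_Z():
    # Partition function: sum over all joint assignments of the factor product
    return math.log(sum(phi1[a, b] * phi2[b, c]
                        for a, b, c in product([0, 1], repeat=3)))

def log_likelihood(data):
    # Decomposed form: counts times log factor entries, minus M * log Z
    M_ab, M_bc = {}, {}
    for a, b, c in data:
        M_ab[a, b] = M_ab.get((a, b), 0) + 1
        M_bc[b, c] = M_bc.get((b, c), 0) + 1
    ll = sum(n * math.log(phi1[k]) for k, n in M_ab.items())
    ll += sum(n * math.log(phi2[k]) for k, n in M_bc.items())
    return ll - len(data) * log_Z()

data = [(0, 0, 1), (1, 1, 0), (1, 1, 1)]
# Sanity check: the decomposed form agrees with summing log P(a, b, c) directly
direct = sum(math.log(phi1[a, b] * phi2[b, c]) - log_Z() for a, b, c in data)
assert abs(log_likelihood(data) - direct) < 1e-9
```

Note that log Z appears once per instance, and it is the only term that mixes entries of phi1 with entries of phi2.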

So, to see this in action, here's a picture of a projection of this probability distribution. Here we have written the probability distribution as a function of two log-linear features: the indicator function of A=1, B=1, and the indicator function of B=0, C=1. One of them is an entry in the first potential, and the other is an entry in the second potential. And obviously, you could have other features than that, but I can only draw a two-dimensional feature space, and so we focus on just two parameters. And so what you see here for the two axes: one axis is the first of these parameters, and the second axis is the second of these parameters, that is, the thetas that are the coefficients of these indicator functions. And what you see here is the log

likelihood. So this is the surface of the log-likelihood. There are several things that can be seen

from looking at the surface. First, it seems, and we'll show this in a moment, that there is a single global optimum of this function, which occurs right here in the middle of this contour. But we can also see the coupling between the parameters coming up: the function doesn't decompose nicely as a product, or in this case a summation, of a function over one of the dimensions plus a function over the second dimension.

So now let's delve into the math a little bit more and consider what the log-likelihood function looks like in the general case of a log-linear model. So, just as a reminder, here is the definition of a log-linear model: it's defined via the sum of a set of coefficients, which are parameters theta_i, multiplied by a set of features f_i, which could be indicator features as we had on the previous slide. But we also talked about other forms of features for log-linear models in a previous module, so we're not going to go back there; any function here would work. We add up all of these log-linear terms and exponentiate that, and then we have the partition function that makes sure that we have a distribution: P(X; theta) = (1/Z(theta)) exp(sum_i theta_i f_i(X)).
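As a sketch of this definition, here is a tiny log-linear model over three binary variables, using the two indicator features from the earlier slide; the particular theta values are arbitrary assumptions.

```python
import math
from itertools import product

# A minimal log-linear model over three binary variables (A, B, C).
# The two features are the indicator functions from the slide;
# their coefficients below are illustrative, not fitted.
features = [
    lambda a, b, c: 1.0 if (a, b) == (1, 1) else 0.0,  # f1 = 1{A=1, B=1}
    lambda a, b, c: 1.0 if (b, c) == (0, 1) else 0.0,  # f2 = 1{B=0, C=1}
]
theta = [0.5, -1.0]  # one coefficient theta_i per feature f_i

def unnormalized(x):
    # exp( sum_i theta_i * f_i(x) )
    return math.exp(sum(t * f(*x) for t, f in zip(theta, features)))

def prob(x):
    # Z(theta) normalizes the exponentiated feature sum into a distribution
    Z = sum(unnormalized(y) for y in product([0, 1], repeat=3))
    return unnormalized(x) / Z

total = sum(prob(x) for x in product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-9  # probabilities sum to one
```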

18:16

So let's think about how to actually perform this maximum likelihood estimation.

So let's go back to the log-likelihood function. Now we're going to multiply by 1/M, so that we don't have a scaling issue where the log-likelihood continues to grow as the number of data instances grows, and that gives us this expression over here. Note that the M has disappeared as a multiplier from the log partition function, and now we have a 1/M term in the linear component as well. And now let's plug in and take the derivative of this expression relative to a particular parameter theta_i; we have two pieces. We have already said that the derivative of the log partition function relative to a parameter theta_i is the expectation, in P_theta, of f_i, the feature that theta_i multiplies.

On the other hand, if we look here at the derivative of the linear term relative to theta_i, we can see that what we have here is also an expectation: an empirical expectation, the average of f_i in the data distribution. And so the derivative of the log-likelihood relative to theta_i is the difference of two expectations: an empirical expectation, and an expectation in the parametric distribution defined by my current set of parameters theta. One immediate consequence of this

is that a parameter setting theta-hat is the maximum likelihood estimate if and only if we have equality of expectations: the expectation of each feature relative to the model defined by theta-hat is equal to its expectation in the data D. And that has to hold for every one of the features in my log-linear model.
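This gradient, the difference between the empirical and model expectations of each feature, can be sketched directly; the features, parameters, and data below are toy assumptions.

```python
import math
from itertools import product

# Gradient of the (averaged) log-likelihood for a log-linear model:
# d/d theta_i = E_D[f_i] - E_theta[f_i]. Model and data are toy values.
features = [lambda a, b: float(a == b), lambda a, b: float(a)]
theta = [0.3, -0.2]
states = list(product([0, 1], repeat=2))

def model_probs():
    # P_theta(x) proportional to exp(sum_i theta_i f_i(x))
    w = [math.exp(sum(t * f(*x) for t, f in zip(theta, features))) for x in states]
    Z = sum(w)
    return [wi / Z for wi in w]

def gradient(data):
    p = model_probs()
    grad = []
    for f in features:
        emp = sum(f(*x) for x in data) / len(data)          # empirical expectation
        mod = sum(pi * f(*x) for pi, x in zip(p, states))   # model expectation
        grad.append(emp - mod)
    return grad

g = gradient([(0, 0), (1, 1), (1, 0)])
```

At the maximum likelihood estimate this gradient is zero componentwise, which is exactly the moment-matching condition above.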

Now that we've computed the gradient, let's think about how one might use it to actually do the optimization.

So it turns out that a very commonly used approach to this kind of likelihood optimization, for example in the context of CRFs, is a variant of gradient ascent.

So here we have a likelihood surface, which in this case is a fairly nice function. And what we're going to do is use a gradient process where, at each point, we compute the gradient, take a certain step in that direction, and continue to climb up. And the gradient that we computed on the previous slide is what we're using. Now, it turns out that plain vanilla gradient ascent is not actually the most efficient way to go; typically one uses a variant of gradient ascent called L-BFGS, which is a quasi-Newton method.

So, very briefly: a Newton method is something that constructs a second-order approximation to the function at each point, as opposed to just the linear approximation that looks only at the gradient. A quasi-Newton method says we don't want to spend the computational cost, in general, to compute a second-order, or quadratic, approximation to the function; instead, it makes an approximate guess at what the step would be if we were to compute a second-order approximation. We're not going to go into the details of L-BFGS here, but there's plenty of literature on it that you can find. Now, in either case, for computing the gradient in this context, the critical computational cost is the fact that we need the expected feature counts.
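As a sketch of the optimization itself, here is the same kind of toy log-linear model fit with an off-the-shelf L-BFGS routine (SciPy's L-BFGS-B, assuming NumPy and SciPy are available). We minimize the negative average log-likelihood and supply the gradient derived above; at the optimum the expected feature counts match the empirical ones.

```python
import math
from itertools import product

import numpy as np
from scipy.optimize import minimize  # L-BFGS-B is SciPy's L-BFGS variant

# Toy model: two binary variables, features f1 = 1{a==b}, f2 = 1{a==1}
features = [lambda a, b: float(a == b), lambda a, b: float(a)]
states = list(product([0, 1], repeat=2))
data = [(0, 0), (1, 1), (1, 0), (0, 0)]

F = np.array([[f(*x) for f in features] for x in states])   # feature matrix
emp = np.array([[f(*x) for f in features] for x in data]).mean(axis=0)

def neg_ll(theta):
    # Negative average log-likelihood: log Z(theta) - mean of theta . f(x[m])
    logw = F @ theta
    logZ = np.log(np.exp(logw).sum())
    idx = [states.index(x) for x in data]
    return logZ - logw[idx].mean()

def neg_grad(theta):
    # Gradient of neg_ll: E_theta[f_i] - E_D[f_i]
    w = np.exp(F @ theta)
    p = w / w.sum()
    return p @ F - emp

res = minimize(neg_ll, np.zeros(2), jac=neg_grad, method="L-BFGS-B")

# Moment matching at the optimum: model expectations equal empirical ones
w = np.exp(F @ res.x)
p = w / w.sum()
assert np.allclose(p @ F, emp, atol=1e-3)
```

For this particular data set the moment conditions can be solved by hand (E[f1] = 3/4 gives theta1 = ln 3, E[f2] = 1/2 gives theta2 = 0), so the optimizer's answer can be checked exactly.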

And the expected feature counts come up in two places: first, when we compute this empirical expectation, E relative to D, of the feature f_i; but also in the second term in this derivative, relative to the current model, that is, E relative to the parameter vector theta of the feature f_i.

Now, this second piece is really the rate-limiting step, because it requires that at each point in the process we conduct inference on the model in order to compute the gradient. And so this is an expensive computation, because it requires that we run inference in a graphical model, which we know to be a not inexpensive step, depending on the complexity of the graphical model.

So, to make this concrete, let's look at an actual example. Let's return to our example of the Ising model, where the energy function of a set of random variables X1 up to Xn is a sum of a bunch of pairwise terms, w_ij x_i x_j, and a bunch of singleton terms, u_i x_i; and it's an energy function, so it's negated.

And so, plugging into the gradient equation that we had before, relative to a particular parameter theta_i, let's now look at what this looks like for the two types of parameters that we have here: the singleton parameter u_i and the pairwise parameter w_ij. For the singleton parameter u_i, if we just go ahead and plug into this expression over here, we're going to have for the empirical expectation simply the average of x_i[m] over the data instances m. So, assuming that each variable x_i has its own parameter u_i, we're going to sum up the value of the variable x_i over all data instances m, and once we average over all of the data instances, that's going to be our empirical expectation. Now, what about the model expectation, the expectation relative to theta?

Well, here we have two cases: we need to consider the probability that x_i equals +1 and the probability that x_i equals -1. The case where x_i equals +1 makes a contribution of +1 to the expectation, and the case where x_i equals -1 makes a contribution of -1. And so altogether we have P_theta(x_i = 1) minus P_theta(x_i = -1); and that's because these two cases have the two different coefficients, +1 and -1.

A slightly more complicated version of this comes up when we take the derivative relative to the pairwise parameter w_ij. Here, once again, the empirical expectation is simply the average of x_i[m] x_j[m] over the data instances m, and the probability term has four different probabilities, corresponding to the four cases: x_i and x_j are both +1, both -1, or one takes the value +1 and the other -1, and vice versa. The two cases +1, +1 and -1, -1 have a coefficient of +1, because the product x_i x_j is 1 in those cases, and the other two have coefficients of -1. And so we end up with this expression: P_theta(x_i = 1, x_j = 1) plus P_theta(x_i = -1, x_j = -1), minus the probability of the other two cases.
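Both the singleton and pairwise expectations just derived can be checked on a tiny two-variable Ising model; the parameter values are arbitrary.

```python
import math
from itertools import product

# A two-variable Ising model with x_i in {-1, +1}.
# Energy E(x1, x2) = -(w12 * x1 * x2 + u1 * x1 + u2 * x2); toy parameters.
w12, u1, u2 = 0.4, 0.1, -0.2
states = list(product([-1, 1], repeat=2))

def prob(x1, x2):
    Z = sum(math.exp(w12 * a * b + u1 * a + u2 * b) for a, b in states)
    return math.exp(w12 * x1 * x2 + u1 * x1 + u2 * x2) / Z

# Model expectation of x1 equals P(x1 = 1) - P(x1 = -1), as derived above
E_x1 = sum(prob(a, b) * a for a, b in states)
p1 = sum(prob(1, b) for b in [-1, 1])
assert abs(E_x1 - (p1 - (1 - p1))) < 1e-9

# Model expectation of x1*x2 equals P(++) + P(--) - P(+-) - P(-+)
E_x1x2 = sum(prob(a, b) * a * b for a, b in states)
agree = prob(1, 1) + prob(-1, -1)
assert abs(E_x1x2 - (agree - (1 - agree))) < 1e-9
```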

So, to summarize: we've seen that the partition function in undirected graphical models couples all of the parameters in our likelihood function, and that introduces complexities into this setting that we did not see in the context of Bayesian networks. Specifically, we've seen that there's no closed-form solution for optimizing the likelihood for these undirected graphical models.

On the other hand, we have also seen that the problem is a convex optimization problem, which allows it to be solved using gradient ascent, usually via an L-BFGS algorithm. And because of the convexity of the function, we're guaranteed that this is going to give us the global optimum, regardless of the initialization.

The further piece of bad news, other than the fact that there is no closed-form solution, is that in order to perform this gradient computation, we need to perform probabilistic inference, and we need to do that at each gradient step in order to compute the expected feature counts. And we've already seen that inference is a costly thing, and so this really dramatically increases the cost of parameter estimation in Markov networks, say, in comparison to Bayesian networks. On the other hand, the features that we

are computing expectations of are always contained within clusters in a cluster graph or in a clique tree, because of the family preservation property. And that implies that a single calibration step suffices for all of the feature expectations at once. And so a single calibration, say, of the clique tree, is going to be enough for us to compute all of the expected feature counts, and then we can go from there and compute the gradient.
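As a sketch of that last point: below, a brute-force enumeration of the joint stands in for a clique-tree calibration, and a single pass over it yields the marginals for every factor scope at once (factor values are the made-up ones from the earlier chain example).

```python
from itertools import product

# Chain A - B - C with toy factor tables
phi1 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi2 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}

# Unnormalized joint; in a real model, calibration would avoid building this
joint = {}
for a, b, c in product([0, 1], repeat=3):
    joint[a, b, c] = phi1[a, b] * phi2[b, c]
Z = sum(joint.values())

# One pass yields both marginals needed for the expected feature counts
marg_ab, marg_bc = {}, {}
for (a, b, c), w in joint.items():
    marg_ab[a, b] = marg_ab.get((a, b), 0.0) + w / Z
    marg_bc[b, c] = marg_bc.get((b, c), 0.0) + w / Z

assert abs(sum(marg_ab.values()) - 1.0) < 1e-9
assert abs(sum(marg_bc.values()) - 1.0) < 1e-9
```

Each feature's scope lies inside one of these marginals, so the gradient can be assembled from them without any further inference.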