Okay. So, let's talk a little bit about estimation of

exponential random graph models now and see what some of the issues are.

So, in particular, what we saw, just to remind you,

we have the probability of a graph now depends on

some set of statistics of the graph and multiplied by some parameters.

In order to make this a probability,

then we have to sum across what this would look like for all possible graphs.

The power of such models, we can put pretty

much anything you want to know the statistics.

So, if you wanted to keep track of links,

triangles, you could make these based on covariates.

So, as we talked about in the block models,

you could have links among blue nodes and links among green nodes,

links between blue and green nodes,

triangles involving blue, greens,

and yellows, et cetera.

So, you could have all kinds of different statistics, keep track of those.

The idea then is that using this model would allow

us to test to see whether certain things are going to be significant or not.

So, are the parameters non-zero,

are they positive, are they negative, what do they look like?

That gives us some idea of how these different things matter in the formation of a graph.

Now, what's the difficulty? The challenge in these things is,

how are we going to estimate these?

So, let me start with just an example of an estimation of one of these things.

So, if you remember our Florentine families example.

So there's a paper by Robins, Pattison,

Kalish and Lusher in 2007.

They're fitting an exponential random graph model to Padgett and Ansel's Florentine data.

In particular, they're going to look at the business ties between the major 16 families.

What they're going to do is try and see whether or

not to what extent this depends on links.

Two stars.

So, two individuals connected to a given individual.

So, we've got links, two stars,

three stars and triangles.

So, these are the different things.

We can count up how many of each one one these

different shapes is present in a network and

see to what extent each one matters in the determination of a graph.

So, we've got the probability of a graph depending on each one one of these things.

Then, we can ask what do these different parameters look like.

So, what do the parameters look like on these?

Okay. So, this is the network.

In this case, this' a very small set of nodes,

small set of links.

So, we see a few triangles, we see some links,

we see some two stars,

three stars, and so forth.

We can go ahead and estimate this.

Here, you can use programs,

we'll do one of these in a problem set.

You can use statnet for instance,

which is a program package in R. You stick in

your dataset and end up just directly trying to fit an exponential random graph model.

It asks which kinds of things you want to throw in,

and then it'll spit out an answer.

So, in this case, the software will go ahead and estimate things.

We'll talk about how it's done in a moment.

What does it give us? It gives us, on links,

we get a perimeter of minus 4.27,

standard error of about 1.13.

So, this is statistically significantly negative.

So, that's a network that doesn't like links.

So, if we go back to this,

it's a fairly sparse network.

So, networks that have fewer links are going to be

more probable under this model with the negative parameter there.

Networks that have lots of links would be less probable.

So, this is a fairly sparse network.

So, it's not surprising that it gets a negative parameter here.

It likes two stars, 1.09.

It dislikes three stars,

minus 0.67 and it likes triangles 1.32.

So, we end up with some different parameters.

Now, exactly how we interpret those is a little complicated.

It's complicated by two facts.

One, we're not quite sure the combinations of these different things.

So, links are going to be inside two stars,

inside three stars, and so forth.

We also haven't given us any theory of exactly why

the business dealing should have one type of structure or another.

So, going at this with a social theory would then allow us to make sense of these.

You can go to the original article for more discussion of that.

But what I want to do is talk a little bit about how these kinds of things are estimated,

and then what we should take away from those estimates.

So, what's the challenge in estimating an exponential random graph model?

So, we put down a formulation.

We got a bunch of statistics.

We want to see if they matter and to what extent they matter, positively or negatively.

In order to make this a probability,

we had the sum over all the possible graphs of what this could be.

So, the idea is how likely graph g is depends on

comparing it to how likely all the other graphs would

be in order to figure out why this one got picked out and not the other one.

So, in order to go ahead and estimate this,

to figure out, for instance,

if we want to do what's known as maximum likelihood estimation,

what we would want to do is let's pick the parameters that actually make

the given graph that we saw as probable as possible.

So, which parameters would make this probability as high as possible?

Well, the difficulty is,

in order to figure out which parameters maximize this function,

we have to be able to do this sum in the denominator, right?

That's going to be tricky. Why is it gonna be tricky?

There's a lot of graphs to sum over.

So, a technique of instead of actually doing an explicit calculation here,

would be to try and estimate this instead.

We'll estimate this, the standard technique is using MCMC,

Markov Chain Monte Carlo techniques for estimation.

This became popularized by Tom Snijders and Mark Hancock.

Made some breakthroughs that allowed this to

be applied to exponential random graph models.

That's what underlies the software.

So, instead of summing over all g primes,

what we'll do is we'll sample some g primes.

Then, try and extrapolate and estimate that denominator based on that.

The program is a little more complicated.

But essentially, what it's doing is walking a space and

trying to pick out different g primes.

Then, based on that, it can get a relative likelihood,

and then figure out by searching across different parameters,

which parameters would meet the highest likelihood of seeing the data we actually saw.

So, what's the difficulty with this?

Well, remember, if we had just 30 nodes in our graph,

there's two to the 435th possible networks,

which is more than-

it's estimated there's less than two to the 258th atoms in the universe.

So, we're way above the number of atoms in the universe.

So, it's going to be impossible for us to actually look at

that space and even pick a small subset of that space.

So, if we're just randomly picking some g primes in the denominator,

the chance that we're actually picking a representative sample is tiny because,

unless we have an algorithm that really walks

a space very well given how many different networks there are,

it's going to cause difficulty.

In particular, there's recent theorem by Bhamidi, Bresler and Sly.

There's related theorems elsewhere in this statistics literature.

What it shows is,

if you've got a dense enough exponential random graph model,

this type of MCMC technique, and in particular,

using what's known as global dynamics,

which is a form of Gibbs sampling,

the estimates are actually going to become efficient.

Only if you're able to look at an exponential number of graphs,

and unless you have approximately independent links.

Okay. So, that's a mouthful.

But what it's saying is,

if you want to be able to do this with less than exponential number of graphs,

the network is going to have to be one that has almost independent links.

Of course, the reason we went to

exponential random graph models was because we didn't want to have independent links,

we want to do allow for triangles,

we want to allow for different stars to be more likely than one would see at random.

So, exponential random graph models that are really interesting,

are going to be ones where we have dependencies,

and those are going to be ones,

which according to this theorem,

are not going to be estimable,

in less than exponential time.

Okay. Now, this applies for dense enough graphs,

but we can also do some simulations to see whether there's going

to be problems with sparser ones.

Let's turn to a simulation next.

So, this is actually a simulation that comes

out of work I'd been doing with Arun Chandrasekhar,

and here what we're doing is looking at the probability of a graph.

Let's keep things simple,

we're going to look at links, triangles, and isolates.

Part of the reason we're going to look at isolates,

is when you look at graphs you tend to,

you often find more isolated nodes than you would find at random.

So, in order to really estimate things well,

this is something you often have to take into account.

So, let's look at 50 nodes and we are going to fit a very simple model,

just an exponential random graph model where we keep

track of how many isolated nodes there are?

How many links there are and how many triangles?

Okay. So, a very simple model.

What we're going to do, is we're actually going to draw a bunch of graphs.

So, we'll give it graphs that have basically,

we're going to feed into the program,

a bunch of networks,

that have roughly 20 isolated nodes.

So, each node is isolated with probability about point four.

We'll feed 10 triangles in, so,

we'll put 10 triangles down at random,

45 links down at random.

So, we're going to draw links so that on average you get 45 links.

We'll draw our triangles, on average you get 10 triangles,

and we drew these by picking binomials so we'll just do this at random,

and then 20 isolated nodes.

Okay. So, we'll start by picking 20 isolated nodes and what remains,

we'll put down a series of triangles,

and then what remains we'll put down, a series of links.

Okay. We do this a 1,000 times.

So, we're drawing very similar graphs to each other,

a 1,000 different graphs,

and each time we'll fit it into the program and then ask it to estimate this function.

Okay. So, we'll give it just a bunch of different graphs that all have roughly,

20 isolated nodes, 10 triangles,

45 links and see what comes back.

So, here's the link parameter estimates.

So, it gives us estimates,

often in a range somewhere between zero and four,

but actually sometimes it even gives in the negative three range,

so it's popping all over in terms of the link parameter.

When you look at the estimates that it's giving back to us,

in terms of what it's standard errors are,

it thinks that it's picking things very accurately.

So, part of the way that it picks standard errors,

is by looking back at the graphs that it wandered through,

and trying to figure out what the variation in possible parameters could be,

given the variation that you could see across different possible graphs,

and the noise that would be

present if the parameters were in the vicinity that you're looking at.

Okay. So, it's using a bootstrap method to basically estimate these confidence intervals,

and when we look at this it thinks that it's very accurate.

So, the median left and right confidence intervals are right on top of each other.

So, it would tell you that it's very sure that the thing should be in a certain range,

even though it's popping all over the place.

If we do the isolates parameter,

the isolates parameter has much less variation in it.

So, it's always coming out somewhere in the range between 15 and 19 basically.

Again, it overestimates how accurate it's being,

because if the confidence intervals are accurate,

they should be capturing the value of variation we're getting here.

Finally, then when we look at the triangle parameter,

the triangle parameter ranges somewhere between zero and about five and a half,

and again, we see sometimes it's very close to zero,

sometimes it's closer to five and a half,

so it's popping all over the place.

Okay. So, it's giving us unstable parameters estimates

and it's coming up with inaccurate standard error estimates.

One thing we can do, is then recreate the network,

so, take the parameters that we get out.

So, the exponential random graph model gives us a parameter,

a set of parameters, so each time we estimate it,

we feed it in a network with 20 isolates,

10 triangles, and 45 links,

it spits out some parameters.

Then you can actually allow the program to then

randomly draw a network from using those parameters.

Hopefully, if you randomly draw a network with those parameters,

you ought to get one back that has

fairly similar characteristics to what you started with.

So, we'll look, the isolates parameter was pretty tight.

It should hopefully give us back something with the right number of isolates.

But the link parameter was all over the place,

the triangle parameter was all over

the place maybe it's not going to be so good at doing that.

Well indeed, if you recreate the number of isolates,

most of the time it gives you something in the range of around 20 isolates which is good.

So, it's recreating the isolates fairly well.

Occasionally, it gets stuck near 50,

so it gives you an empty graph back.

So, isolates are doing okay.

When it does triangles it does miserably.

So, remember we fed it graphs that pretty much have on average 10 triangles,

somewhere between 5 to 15 most of the time.

So, very few triangles,

and when it recreates them occasionally it gets 10 triangles,

but most of the time,

it's somewhere between 1,000 and 3,500.

So, it's packing these networks full of triangles,

even though it shouldn't be.

When we do links, something similar is happening.

So, the actual number of links that should be generating is about 45,

and yet it either gets very close to zero,

or somewhere between 150 and 500.

So, it's having problems estimating these parameters,

and then when you try and recreate,

and draw out random networks from it,

the networks that it's giving back to us,

look nothing like the ones we fed in.

Okay. That shouldn't be surprising given the theorem that we had by Bhamidi and all,

that said that there's going to be difficulties,

and actually wandering through and estimating accurately

where that denominator is in the exponential random graph model.

Okay. So, there's some fundamental difficulties here.

So, MCMC techniques can be

inaccurate here and so we're faced with this problem of how we compute parameters.

There's also questions about even if we could do the computations,

are we confident that for a reasonable class of models,

that if we had large numbers of nodes,

that these parameters would be accurate.

So, there's a computational issue,

and then there's also just an open question of how

consistent are these estimators of these.

It's also important for us to be able to generate networks back.

So, if we want to do questions of counterfactual.

So, what would happen if we had a program to encourage certain ties to form.

So, we want to subsidize research and development and we

put in subsidies to try and create research parks,

that's going to create certain ties in a research and development network.

What networks would emerge?

We want to be able to estimate parameters and then

see what would happen if we did counterfactuals,

change some of those parameters.

Okay. So, that's a question we'd like to be able to do.

Next, we will look at a class of models where we can

begin to actually do that estimation.