0:00

Hi, it's Matt again. And now, we're going to be talking a

little bit about the diameter of a network.

So again, we're still in our basic definitions, trying to understand how we

represent networks, things we can talk about in terms of their characteristics

and diameter is a very important one. Diameter and average path length refer to

how close nodes are to each other. So, how long will it take to get from one

node to another in some sense? How fast will information spread?

How quickly can we get from one part of the graph to another?

These are going to be questions that are going to be answered by things like

diameter and average path length. So, when we look at the diameter, the

definition here is the largest geodesic in the network.

So, the largest shortest path. If we have a network which is not connected, often when people talk about the diameter, they'll talk about the diameter of the largest component. It gets a little iffy otherwise; a lot of times, you're going to be stuck with a few isolated nodes in a network. And so diameter often will then just refer to the largest component. One thing about diameter is that it can be prone to outliers. It could be that there happens to be one pair of nodes which are very far from each other, but a lot of the other ones are relatively well connected to each other.

So, average path length is going to be a different measure than the diameter: instead of the maximum geodesic, it's going to be the average geodesic in the network. And if we look at different network structures, diameter begins to tell us things about them.

So, for instance, if we look here at the network on the left, we've got a ring or a circle of different nodes connected to each other. And in that situation, when we look at these nodes, we end up seeing that some of them can be quite far from each other. And the diameter in this case is on the order of half the number of nodes: either n/2 or (n-1)/2, depending on whether we've got an even or an odd number of nodes. Whereas, if we look at a tree, the

situation on the right here in this picture, what we have is a much more efficient way of getting from one node to another. So, if we think about the number of levels that this tree has: in this case it has three different sets of links, so one, two, three, depth 3. In this case, we're going to end up covering a number of nodes where n = 2^(k+1) - 1. You can just go through and keep track of that yourself. When you look at the relationship between how many levels you have and how many nodes you have, you end up, in this case, having k = log base 2 of (n+1), minus 1.

So, a very simple formula. And with that, in this case the diameter is 6; it's twice what this k was. So, getting from node 25 here to node 30, it takes a path length of 6 to go up 3 and then down 3. So, 2 times k is going to be the longest shortest path in this network. And what this tells us is that the order of the diameter is going to be roughly 2 times the log of (n+1).

So, that's giving us sort of how quickly we reach out.

And the idea here is that a tree is a much more efficient way of connecting nodes in terms of short paths than what we saw with the circle. Because now we end up connecting nodes to each other in a manner which is quite efficient in terms of the branching process. And so, even with a small number of links, we're able to more efficiently connect things. Instead of scaling linearly with the number of nodes, this is going to scale with the log of the number of nodes, which is going to tend to be a much smaller number.
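The ring-versus-tree comparison can be checked directly with a little code. Here is a minimal sketch in plain Python (my own construction, not the specific picture from the slides): it builds both networks as adjacency lists and computes the diameter with breadth-first search.

```python
from collections import deque

def bfs_distances(adj, source):
    """Geodesic (shortest-path) distances from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Largest geodesic over all pairs (assumes the graph is connected)."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

def ring(n):
    """A circle of n nodes, each linked to its two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def binary_tree(k):
    """A balanced binary tree with k levels of links, so n = 2**(k+1) - 1 nodes."""
    n = 2 ** (k + 1) - 1
    adj = {i: [] for i in range(n)}
    for i in range(1, n):
        parent = (i - 1) // 2
        adj[i].append(parent)
        adj[parent].append(i)
    return adj

print(diameter(ring(10)))        # 5, which is n/2 for an even ring
print(diameter(binary_tree(3)))  # 6, which is 2k: up 3 levels, down 3
```

So a ring on 10 nodes has diameter 5, linear in n, while a tree with k = 3 levels connects 15 nodes with diameter only 6, logarithmic in n.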

Now, in terms of small average path lengths and diameters, one thing that's been found quite extensively in looking at social networks in the real world is that they tend to have short average path lengths and short diameters. And what do we mean by short? It means, intuitively, that we connect a large number of nodes with a fairly small number of connections.

The most famous example of this was Stanley Milgram's experiment, where he had people sending letters from one part of the US to another.

I believe it was part of the Midwest, say, Kansas and Nebraska, and you were trying to get a letter to somebody on the East Coast, say, in Massachusetts. He would give them a person's name and the rough location of where this person was, and maybe some basic background information about the person. So, you know, I'm supposed to get something to John Smith, who lives in Massachusetts and who is a lawyer in Amherst. And I live in Lincoln, Nebraska. The idea was that I should send the letter to somebody I know. And then, the instructions in the letter tell them to send it to somebody they know, and so forth, with the intent of eventually getting the letter to the person that was named. And the amazing thing that Milgram found

was that, out of the letters that were originally mailed from the initial points, the median number of hops that it took to get to the end point was 5. And in fact, 25% of the letters made it. That's, first of all, fairly high in terms of response rate for this kind of participation in an experiment. And secondly, the median of five seemed to be quite small, given that you're starting with somebody in one part of the country who had no knowledge of the other individual in another part of the country, and it took them only five steps to get to that individual. This kind of experiment has been replicated in more advanced settings, working with e-mail and other kinds of things. One thing that's sort of interesting about this experiment is that, if we think about the median of 5, we have to be a little bit careful about how we interpret that result.

Why? Well, think about the 75% that didn't make it. If there's some probability at each step that people won't send the letter on, then the longer the path it would actually take to go from one person to another, the more likely it becomes that the letter won't appear as having made it in the experiment. The letter will never reach the final point where Stanley Milgram could then collect it and get all the names. So, the letters that actually reach are going to tend to follow shorter paths than the others. In experiments that have followed up on this, you can begin to try and estimate how many didn't make it, and those estimates are going to be biased upwards. But it's still sort of an important

number, because people are getting it there in only five steps and they didn't get to see the network. It's not as if they saw the network of all the connections in the US and were able to figure out what the most efficient path was. They were just trying to send it to somebody they thought might know somebody.

You know, if this person's a lawyer, then if I know a lawyer in Massachusetts, maybe I send it to them, or if I know a lawyer in Chicago who might have a friend in Massachusetts. So, I'm trying to navigate the network without actually knowing the explicit structure of the network.

7:53

There are a number of other studies that have found small path lengths in different settings. So, Grossman's study of co-authorships among mathematicians: average path length 7.6, maximum, in a giant component, 27. Newman's study of physics co-authorships: average 5.9, max 20. Goyal, [UNKNOWN], and Gonzales's study of economics co-authorships: 9.5 as an average, max 29. So here, you're seeing things again where the average is somewhere on the order of 5 to 10, with the max closer to 30. There is a study by Lada Adamic and

Pitkow on the World Wide Web. There, they found an amazing number: a mean of 3.1 on a giant component reaching 85% of about 50 million pages. For Facebook, a mean of about 4.74 was recently found by Backstrom and a set of co-authors. This is 721 million users.

So, actually, when you start working with networks of that size, even trying to calculate the average path length is going to take some doing. You're going to have to work carefully to find algorithms to calculate these things. So, there are some very interesting findings there in terms of the average path length. In order to say a little bit more about diameters and understand why we see these short average path lengths, first it's going to be useful to have a couple more definitions in hand.

The neighborhood of a node i in a network g, N_i(g), is going to be the set of other nodes that i is linked to in g. And again, our usual convention is that we won't allow for i to be linked to i. And then the degree of a node, which is a very important concept: the degree of node i is just simply counting up how many neighbors a given node i has in the network g. So, degree is a basic idea which can be very important in keeping track of things in networks.
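These two definitions translate directly into code. Here is a minimal sketch with an adjacency-list representation (the little four-node network is my own toy example):

```python
def neighborhood(g, i):
    """N_i(g): the set of nodes that i is linked to in g (i itself excluded)."""
    return {j for j in g[i] if j != i}

def degree(g, i):
    """The degree of node i: the number of neighbors i has in g."""
    return len(neighborhood(g, i))

# A small example network as an adjacency list (nodes 1..4, purely illustrative)
g = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
print(neighborhood(g, 2))  # {1, 3, 4}
print(degree(g, 2))        # 3
```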

10:16

And in understanding why we have short average path lengths, one thing that's going to be very useful is understanding the most basic form of random graph, which is due to Erdos and Renyi in a set of papers in the 1950s. There was also independent work by [UNKNOWN] around the same time. And the model is the simplest possible network you could imagine in terms of a random graph. We start with n nodes, and each link is just formed independently with some probability p.

So, what do we do? We have a coin that, say, has a three-quarters chance of coming up heads. And for every possible link, we look: okay, should node one be linked to node two? Flip the coin: three-quarters probability of yes, one-quarter probability of no. If it comes up heads, we draw the link; if not, we don't. And we do that for every possible link; we just keep repeatedly flipping the coin, okay? And the coin comes up heads with some probability p. This is known as a G(n, p) graph: a graph on n nodes with a link probability p. It's a very important benchmark in understanding things.
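The coin-flipping procedure is short to write down. A minimal sketch in plain Python (the particular n, p, and seed below are just illustrative choices of mine):

```python
import random

def gnp(n, p, seed=None):
    """Erdos-Renyi G(n, p): for each of the n*(n-1)/2 possible links,
    flip a coin that comes up heads with probability p."""
    rng = random.Random(seed)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # heads: draw the link
                adj[i].append(j)
                adj[j].append(i)
    return adj

g = gnp(6, 0.75, seed=1)  # the three-quarters coin from the lecture
links = sum(len(nbrs) for nbrs in g.values()) // 2
print(links, "links formed out of 15 possible")
```

With p = 0 no links ever form, with p = 1 every link forms, and in between the number of links concentrates around p times n(n-1)/2.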

Why is it an important benchmark? Because if we see some network out there in the real world, and we know what the properties of a random network on n nodes with some probability p are, then we can compare the real-world network that we see to this benchmark network and ask: does it look like something systematic is going on? Does this network look systematically different than if nature had just picked links at random? So, that's going to be an important benchmark.

And in order to talk about properties of these kinds of networks, it's going to be useful to talk about large networks. Why large networks? Well, because if we're dealing with very small networks, every single network has some possibility of arising if we're just flipping coins. And so, making statements about what's going to happen on a small network is going to be a very complicated probabilistic statement. But when we get to large numbers, laws of large numbers are going to allow us to say things like: with a probability close to one, we should observe that the network is connected, something like that. So, we can talk about, you know, the average path length ought to be 6 with a probability very close to 1 on a network of such and such a size. By looking at sequences of networks on larger and larger sets of nodes, we can begin to make statements like that.

So, in particular, let's look at Erdos-Renyi networks. And I'm going to talk about large networks. In particular, we're going to look at networks that have two different properties. One is, we're going to be thinking about the degree, now as a function of n. So, we have to talk about, you know, if it has a million nodes, what's the average degree; if it has two million nodes, and so forth. We want that to be something which is at least as large as the log of the number of nodes.

And we'll talk about that later. But what this is going to make sure is that there are actually paths between any two nodes with a high probability. So, with a high probability, you can get from any node to any other node; then, we can actually calculate the diameter. Okay, so we'll take d(n) to be at least log n. And we're also going to make sure that the network isn't too complete. So, d(n) ought to be a small number relative to the overall size n, so that it's not as if every node just reaches every other node in a path of length one. We'll want to have something interesting going on. So, it's not as if I'm connected to everybody in the world; I'm connected to some fraction of people in the world, and we'll make sure that that fraction is at least log n in terms of size. Okay.

So, what's a theorem? A basic theorem on basic structure is that if we have these two conditions, then when we start looking at large enough n, the average path length and the diameter of these random graphs, these G(n, p) graphs, are going to be approximately proportional to the log of the number of nodes over the log of the degree. So, a very simple formula should give you a very accurate approximation, with a high probability, to what the average path length and diameter are in these kinds of networks.

So, these kinds of theorems have been proven for the original Erdos-Renyi random graphs that were studied through the 50s and 60s: a paper by Moon and Moser, a set of calculations by Bollobas, a series of papers which look at these kinds of calculations. It's been looked at in richer and richer models. Chung and Lu have a paper in 2001 looking at a different set of models. I have a paper in 2008 that allows for node types and other kinds of complications to this model. You still get the same formula in this kind of setting. So, what does this tell us?

This gives us an idea of why we end up with very short average path lengths and short diameters in a random graph. And that'll tell us something about real-world networks, to the extent that random graphs are going to be part of those networks, or embedded in them.

So when we look at this, we can state the theorem more succinctly in terms of mathematical notation. A formal statement of the theorem is that, under these conditions, if we're looking at this G(n, p), then the ratio of the average distance to log n over log d is converging in probability to 1. So basically, average distance is going to be proportional to log n over log d, and the same for the diameter.

So, the next thing we're going to do is just try and understand the logic behind this. And then, I'll also have a separate lecture, a more advanced lecture, that'll give part of the proof of this for people who really want to dig into the mathematics. But the intuition behind this is fairly simple. We'll go through that intuition, and then people who want the more advanced part can actually go through pieces of the proof as well.