Okay, now, we are going to talk about a different class of random graph models that begins to bring in attributes of individuals or nodes into the picture. And as we have looked in terms of random graph models so far, we have talked about Erdos-Renyi really basic uniform at random, used for benchmark models. And we saw that they missed a lot of real world features. Then we talked about other kinds of link by link models that began to bring in features. We looked at, at modes that kept clustering. They could match different kinds of degree distributions. They could get some aspect of co, correlation. And what we are going to start to do now is, is bring in attributes of nodes and, so, we'll start with stochastic blocks models which are sort of the natural enrichment of Erdos-Renyi random networks. So here they are going to allow the probability of a link to actually depend on some characteristics of the nodes. And this could be characteristics attributes that, that could be, that could be latent. So it might be that that we are not fully observing all these things. We're trying to infer them or there could be observed characteristics. And so we are going to start with simple block models that, where everything is, is observed. And so here, the idea now is that nodes have characteristics, things like age, gender, religion, profession, and so forth. And links between nodes depend on those characteristics. So people of of similar ages are more likely to link than people of different ages. People of similar religions and so forth. And so now let's start with a simple situation. We have three types of nodes in this example, blue, green, and yellow nodes. They're connected in a network. And we can imagine, then, that there's, the model is that there's different probabilities of connections. So if there were homophily here, we think of the blues being more likely to connect to blues. There's a probability that yellows connect with yellows. There's a probability that greens connect with greens. Probability that blues connect with greens, and so forth. So we have different probabilities for every different possible combination of types of nodes in this situation. But then everything else about this model is independent. So, the links are formed independently, it's just the probabilities are varying depending on node attributes. And any two pair of blue nodes have the same probability of linking to each other. Any, pair of a green and a blue node has the same probability of, of linking as any other pair of, of green and blue. It's just that we're allowing the probabilities to vary with types of nodes. Okay, so if this were the, the resulting network that we observed, this is a very easy kind of model to go ahead and, and estimate probabilities for. And in this case, for instance, if we're trying to estimate what's the probability of, of blue links links from, between two blue nodes, then what do we see? Well we see that that there is five actual links present among blues right, one, two, three, four, five and out of four possible nodes we have six possible total so we end up with an estimate of five six. When we go to the yellow and green. What do we have? Well, we have 3 times 4 equals 12 possible links between green and yellow nodes. We have one link which is present between green and, and yellow. So our estimate of, of what the probability is between yellow and green nodes would be a 12. Okay? So very simple model, it's just that instead of having an Erdős-Rényi random network where we estimate all of the probabilities just as being the same, now we allow them to, to vary with with node attributes. More generally, these kinds of models are going to be worked with, often with continuous variables. And, so if we have continuous covariates so, for instance, we keep track of age in, in, in days or some for instance or in fractions of years, then we would have, instead of just green yellow and blue, we might have people that are 34 and half years old and people that are 60 years old and so forth. And so then when we start to think about a, links between two nodes, we could allow them to, to depend in more complicated ways on, on the characteristics. So if we have some vector of characteristics that i has, some vector of characteristics that j has, then we could have a, a vector of parameters beta i. Vector of parameters could be the j. It could also be that, that there's, that the probability of the link depends on differences, so people that are close in age are more likely to link than people that have distance between ages or it could be geographic locations, GPS data. If you live closer together you're more likely to, to link together than if you live far apart. And then we have basically a function, which we're going to base the probability on. And now the standard formulation for this, since this could be depending on the parameters and so forth, this might be positive or negative. The idea would then be that, that often one uses a logistic form where you look at the log of the probability that there's a link between i and j compared to the probability that there's not a link between i and j. Right, so we take the odds ratios here. What's the ratio of the odds that there are a link, compared to not. And we just say that the log of that is proportional to this. So this would be a standard way of fitting this kind of model. And what that's going to do is then give you an ability to continuously estimate probabilities based on different attributes. And so, so now it's a more complicated stochastic model, but one that still very easy to go ahead and estimate. And you can use this in most any standard statistics package or allow you to do logistic regressions. And estimate this quite easily. Now one possibility to using this kind of model would be to test for homophily for instance. So is there real difference between the probability of, of certain types linking compared to linking across types? So for instance, here is an example. This is from the data set I talked about earlier in, in the course from work that I did with Abijeet Bannerjee, Arun Chandrasekar, and Esther du Fleau. So this is the 2013 Science paper and, and this is what I did here is I pulled out one of the villages, village 26 in particular. And here the nodes are, are color-coded by caste. And in particular, the blue nodes are what are known as scheduled castes and scheduled tribes, so these are castes that are under affirmative action by the Indian government. The reds are general castes and otherwise backward castes. So these are not affirmative action castes. So. This is a differences in caste, and then we could look here, and we could say okay, let's do the very simple block model. We're just going to see what's the probability of linking with somebody of your same caste compared to linking with somebody of a different caste. So here we can go through and just directly count how many links there are among members of the same caste and going across castes, where caste here is just this dichotomous version actually, the particular caste. There is quite a number of castes in these villages. So what do we end up with, if you do the count here the probability of having a link. Between a red and a blue node in this graph is 0.006, the probability of having a link between either two reds or two blues is 0.089. So more than ten times more likely to have a link within the same caste as a caste designation in this case, as oppose to going across this boundary. And so we see that it indeed there is a substantial homophily based on this kind of simple block model. And that block model then allows us to, to, keep track of these things. And if we hadn't colored these nodes, it's not so obviously that we would see that there is a, a strong dichotomy here as once we start keeping track of this and then estimating directly. Now, we could have not had the pictures of the, of the colors here, and we could have tried to discover that. That's known as community detection or clustering algorithms. So there's ways in which we could begin to look for things. And we might begin to say, okay, look actually this, even the blues look like they're segmented into different groups. We could try and, and fit, and see whether there's additional blocks, and whether there is more density in here than over here and your algorithms and techniques people use for doing that. So what's the important aspect that's missed from block models? Well, the likelihood does depend on node attributes, either observed or, or latent. But in practice. Often that the probability that that two nodes have been interacting, so people would been interacting could depend on whether they have friends in common. So there is real social structure to it. So let's you know, the fact that, that A and B both have a common friend C makes it more likely that they are going to be linked to each other than if, if they didn't have a friend in common. And so that could be an aspect, which independently of their characteristics, people tend to meet people through other, through friends or have reasons for spending time together in, in triples and so forth, and that's going to lead to extra things, so we're going to need richer models to capture that. So next up we'll be talking about classes of models that allow us to keep track of these explicit dependencies. Which is going to make our life a little more difficult statistically, because things aren't independent any longer, but will allow us to capture a lot of things in networks that are going to be important.