We've talked about cluster samples in a simple form, where we take all elements within a cluster, a sample of clusters, and two-stage sampling. It's time for us now to look at how we might design such samples. And we'll deal with the more complex case, the two-stage sample, in our design. What do we need to do to prepare ourselves to select a sample that will have certain properties, that will meet the requirements of a client or meet our own expectations? So we're talking about design from the point of view of being able to come up with a way of selecting the sample that meets certain kinds of requirements. And we're going to, in this particular context, talk about three things. These deal with design from the point of view of projecting what would happen under a particular approach. So we'll be talking about projecting design effects for new designs. We're going to need to understand what the design effect looks like when we do the next study. Then we're going to look at design effects and their impact on sample size, before we come back to projecting standard errors and confidence intervals for our designs. All of these things are part of this package, this set of considerations, when we want to design a new sample for a new problem. Now, what we've done up until now has been estimation; we've mainly been estimating quantities. Here I'm going to concentrate on estimating standard errors or variances. So if I talk about variances, as you know, I'm also thinking about standard errors; it's a simple transformation, a square root. And so when we had a two-stage cluster sample, what we talked about doing was estimating the variance of a statistic, in this case a proportion, for our current design, let's call it design one. And we would calculate that by plugging the available data into the formula shown here.
One minus the sampling fraction f, that combination (1-f) being the finite population correction, divided by the number of primary selections in our variance calculation, lowercase a, the number of clusters in the sample, times s sub a squared, the variability of a cluster characteristic across the clusters. We won't go through the formulas in detail, but that was one part of the estimation. From there we would take a square root and go down one path, calculating a standard error and a confidence interval. But we also did something a little bit different, and that was to say we would also like to understand the impact of the design on our outcomes. So let's compare it back to a simple random sample, the idea being to build a design effect. To estimate a design effect, then, what we're going to do is calculate from the same data the variance of that same statistic, in this case a proportion, for the existing data, design one if you will, under simple random sampling assumptions. Now for a proportion, that variance calculation, which we looked at very briefly, was 1 minus the sampling fraction, 1-f, that finite population correction, times p(1-p) over the sample size minus 1, from that existing design. And then we combine those two things to come up with a design effect, a ratio. It's just an arbitrary definition, a way of doing the comparison. By the standard definition, by the way, we always put the simple random sampling variance in the denominator; it's the base against which we're comparing the numerator. So it's a fairly standard way to do that kind of calculation, all using the existing data for our current sample design. And then finally, from that design effect, we said we also note that there's something that drives it. Let's estimate that driving factor, that rate of homogeneity, starting by taking that design effect minus 1.
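The estimation steps described so far can be sketched in a few lines of Python. This is an illustration, not code from the lecture; the input values (`f`, `s2_a`) are hypothetical stand-ins, and the variable names are my own.

```python
# Illustrative sketch of estimating a design effect from an existing
# two-stage cluster sample. All numeric inputs are hypothetical.

def clustered_variance_p(f, a, s2_a):
    """Variance of a proportion under the cluster design: (1-f)/a * s_a^2."""
    return (1 - f) / a * s2_a

def srs_variance_p(f, p, n):
    """Variance under simple random sampling: (1-f) * p(1-p) / (n-1)."""
    return (1 - f) * p * (1 - p) / (n - 1)

# Hypothetical survey estimates: sampling fraction, clusters, sample size,
# estimated proportion, and between-cluster variability.
f, a, n, p = 0.01, 60, 2400, 0.4
s2_a = 0.013  # assumed value for s_a^2

v_clu = clustered_variance_p(f, a, s2_a)
v_srs = srs_variance_p(f, p, n)
deff = v_clu / v_srs  # SRS variance always goes in the denominator
```

With these made-up inputs the ratio comes out a bit above 2, the kind of design effect the lecture's example uses.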
Stripping out from the design effect, if you will, the base, the simple random sampling, leaving only the added effect from the clustering. And then dividing, taking out the effect of the sub-sample size, the b, from our existing design: dividing our design effect minus 1 by b minus 1. So we have a value now for homogeneity. Now this is a kind of second thread that we might follow. The first was to get to confidence intervals; that's the one that's most useful for us analytically. But here, these pieces are building up to the next step in the process, because what we're going to do is think about our next design. We're going to build on this one. We may do a new application in a different population. We may go back to the same population at a different point in time. We may go back to the same population at a different point in time and change the sample design in some way: change the sub-sample size, the number of clusters, the overall sample size. What we're going to need to do then is projection. What is going to happen in a new setting, without having drawn the sample? This is an essential design activity that comes up in any number of applied areas, whether in statistics, in this case the branch of statistics dealing with surveys, or in engineering, in civil engineering, or some other area. What's going to be the outcome? Well, what's the key outcome here? There are two that are primary in our thinking: one is the mean, or in this case the proportion, and the second is its standard error. So what kind of standard error would we actually get if we changed the design? In order to do this, what we're going to need to do is build up from the past information we have, our second thread, starting with that roh value, that rate of homogeneity. That rate of homogeneity is portable, in a certain sense. It becomes the foundation, the building block upon which we're going to build our projection.
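The roh calculation just described is a one-liner; here it is as a small Python sketch (my own illustration), plugged with the design-effect and sub-sample-size values the lecture's later example uses:

```python
# Rate of homogeneity (roh): strip the SRS base from the design effect
# (subtract 1), then remove the sub-sample-size effect (divide by b - 1).

def rate_of_homogeneity(deff, b):
    return (deff - 1) / (b - 1)

# Values from the lecture's example: deff = 2.1795 with b = 40
roh = rate_of_homogeneity(2.1795, 40)
print(round(roh, 4))  # → 0.0302
```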
We're going to use the roh value that we've calculated from the past survey to project a design effect, but now we're not calculating the design effect as a ratio of variances. We're calculating a design effect as a combination: 1, plus the new sub-sample size, b sub 2, the one that we're going to use in our new design, minus 1, times roh. Well, we're not going to invent a new roh; we're going to borrow roh from past data. Because we are using very similar clusters: last time we used schools, now we're using schools. Last time we used blocks, we're using blocks now. Last time we used enumeration areas, we're using enumeration areas now. And we're also measuring a very similar characteristic; roh is specific to that variable. So we can build a new design effect for that variable, for that particular design. But we also know that if we were to calculate, or have available, a simple random sampling variance, we could use the combination of the design effect and the simple random sampling variance to project an actual variance, to come up with a numeric representation of our anticipated uncertainty under the new design. So, we can calculate a simple random sampling variance. Now here, this simple random sampling variance ignores the 1-f. I probably could have ignored it on the other side as well, but here, just to simplify the calculations, we're rounding the 1-f to 1. We're using a proportion again, p(1-p), giving us essentially our element variance, and we're dividing by the new sample size, n sub 2. So we've got a new sub-sample size, a new sample size, an old value of roh, and maybe a past value of the proportion, or a new value of the proportion if we think it's going to change in a particular way. And now we have the elements to build our final projected sampling variance.
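The projection formulas described above can be collected into a small sketch. Again this is my own illustration of the lecture's algebra, with the finite population correction rounded to 1 as stated:

```python
# Projecting a sampling variance for a new design from a borrowed roh.
# Finite population correction (1-f) is rounded to 1, as in the lecture.

def projected_deff(b2, roh):
    """Projected design effect: 1 + (b2 - 1) * roh."""
    return 1 + (b2 - 1) * roh

def projected_srs_var(p, n2):
    """SRS variance of a proportion, ignoring (1-f): p(1-p) / n2."""
    return p * (1 - p) / n2

def projected_var(p, n2, b2, roh):
    """Projected sampling variance: the product of the two pieces."""
    return projected_srs_var(p, n2) * projected_deff(b2, roh)
```

The inputs are exactly the "elements" the lecture lists: a new sample size, a new sub-sample size, an old roh, and a past (or assumed) value of the proportion.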
A sampling variance that tells us what uncertainty we can expect, especially when we couple it with a t-statistic or a z-statistic to give us a 95% confidence interval. We're going to take that projected design effect and that projected simple random sampling variance and take the product of the two to get our new variance. We're manipulating the same expressions that we did in the estimation process, working backwards, if you will, to project what that sampling variance would be. Let's look at an example. In fact, we're going to look at two examples. Recall that we had a sample before with 2,400 observations in 60 clusters of 40 observations each. And suppose that we're going to go back to the same population this time, but we do not have the resources to do what we did last time. We can only afford half as many cases, 1,200 of them. And in this particular case we're going to get to the 1,200 by taking half as many clusters as before. Now that's one way to get to that half. Our alternative A here cuts sample size by cutting the number of clusters. Our alternative B, which we'll look at next, cuts sample size by cutting the sub-sample size. But the question here is, what standard error, what sampling variance, can we expect for our sample proportion p under this new design? That's in contrast to our alternative design B: the same reduced sample size, 1,200, but now we're going to keep the same number of clusters, which implies that the sub-sample size goes from 40 down to 20. What would happen in this case? Do we have sufficient tools to do this projection? In a way, this is like what goes on with climate change projections: projecting what would happen under alternative models.
We have one basic model here that involves the design effect, homogeneity, and sub-sample size, along with a simple random sampling variance that we can inflate, or adjust, for the clustering effect reflected in our design effect. So how would we do this? For A, in step one, we compute the simple random sampling variance (we could just as well compute the design effect first), using our new sample size and a proportion that we're going to have to make an assumption about. In step two, we compute a design effect, using a past value of roh and our new sub-sample size, which for alternative A is exactly the same as before, b = 40. And in step three, we compute the product of the simple random sampling variance and the design effect to give us our projected sampling variance under the new design. For alternative B, we'd repeat these steps, replacing b = 40 with b = 20, with the same roh value and the same proportion, and that would allow us to compare sampling variances between the two designs. For design A, for example, with 1,200 as our sample size and 30 clusters of 40 elements each, our design effect is 2.1795, the same one that we had before. We haven't changed anything: we're using the same value of roh and the same value of b, so it's the same. So all we need to do in this case is compute a new simple random sampling variance. Ignoring the finite population correction, which we actually did anyway in the illustration we had done before, we have p(1-p), that is 0.4, the value we had before, times 0.6, divided by 1,200, or a simple random sampling variance of the proportion of 0.0002. The product of that with our design effect gives us a sampling variance of 0.0004359.
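The three steps for alternative A work out as follows in a short Python sketch (my own illustration, using the lecture's numbers):

```python
# Alternative A: n2 = 1,200 cases in 30 clusters of b2 = 40 elements each.

p, n2, b2 = 0.4, 1200, 40
roh = (2.1795 - 1) / (40 - 1)   # borrowed from the original design (b was 40)

deff_A = 1 + (b2 - 1) * roh     # = 2.1795, unchanged since b is still 40
srs_var = p * (1 - p) / n2      # = 0.0002, finite population correction ignored
var_A = deff_A * srs_var        # ≈ 0.000436
```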
We would take a square root to get a standard error, of course, but we'll stop there, because we're going to compare this variance to the one under design B. Under design B we have a design effect that is different. With b cut from 40 to 20, the design effect goes from about 2.18 down to 1.575. The design effect itself is not cut in half when we cut the sub-sample size in half; it's the clustering effect, the excess over 1, that is roughly cut in half. And so in this case, when we multiply this projected design effect, 1.575 under the new design, by the simple random sampling variance, which hasn't changed (it's still the same sample size and the same proportion), we get a variance of 0.000315. Well, we can compare these. And here's the table comparing what we had before. In our original design, 2,400 cases, that's the first row of numbers, we had 60 clusters of 40 elements each, a design effect of about 2.18, and a variance of 0.000218. In our projected alternative A, with 1,200 cases in the sample and 30 clusters of 40 elements each, the design effect is the same, as we've noted, but our variance is now twice as large; it's doubled. That is, if we take the sample size and cut it in half by cutting the number of clusters in half, we double the sampling variance. It goes the other way too: if we were to double the sample size by doubling the number of clusters, our sampling variance would decrease by one-half. Okay, so that's one way to get to our reduced sample size. But the other alternative, design B, gets there with 60 clusters of 20 elements each: cut the sub-sample size in half, but retain the same number of clusters. And now our design effect has gone down, and our variance has changed. Now we can begin to see alternatives here in a quantitative way, compare them, and make decisions about the best approach to use. We're going to do a more refined version of this, because one of the key issues will be, which of these designs should we use?
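The whole comparison table can be reproduced with the same projection formulas. Again a sketch of my own, not the lecture's code; the design descriptions (n, number of clusters a, sub-sample size b) come from the example:

```python
# Comparing the original design with alternatives A and B,
# all built from the same borrowed roh and assumed p = 0.4.

p = 0.4
roh = (2.1795 - 1) / 39

designs = {            # name: (n, clusters a, sub-sample size b)
    "original": (2400, 60, 40),
    "A":        (1200, 30, 40),
    "B":        (1200, 60, 20),
}

results = {}
for name, (n, a, b) in designs.items():
    deff = 1 + (b - 1) * roh
    var = p * (1 - p) / n * deff   # FPC ignored throughout
    results[name] = (deff, var)
    print(f"{name:8s}  n={n}  a={a}  b={b}  deff={deff:.4f}  var={var:.6f}")
```

Running this shows the pattern discussed above: A has the same design effect as the original but double its variance, while B's design effect drops to about 1.575 and its variance to about 0.000315.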
Should we use one that has sub-samples of size 40, or sub-samples of size 20, or some other number? What's the best sub-sample size? As we look at alternative sub-sample sizes, we're going to need to choose one that's appropriate, but we'll look at that in lecture 6. Here we have two more things to look at before we wrap up this lecture: one, the impact of the design effect on sample size, and two, the impact of the design effect, or these projected variances, on confidence intervals and their width. Let's look at those in the second part of our lecture four on designing two-stage samples. Thank you.