Now, let's take a look at how you actually implement poststratification using software in R. The package I'm going to use is the R survey package, which has a lot of capabilities. This is the one written by Thomas Lumley at the University of Auckland in New Zealand. And there's a data set in there called the academic performance index, api, which I'll use. So I require the survey package, and then I tell R that this is the data I want by saying data(api). And then you define a design object. With any of the software that's going to handle survey data, you have to tell it what the design features are. I'll talk more about this in course six, the meaning of the svydesign function and how you do it in other packages, but I'll sketch it here. The first thing you need to tell R is what the first-stage units are. The parameter in R is id. In this case, I'm saying that dnum is the psu, or cluster, definition. And dnum is short for district number. So I'm sampling schools, I have a sample of schools, but the first-stage units are districts. The weight variable has got to be specified, and in this data set pw is the field that holds the weights. And the data set itself I'm using is called apiclus1. It does use an fpc, so I supply that here. Now, the one thing to notice is this use of the tilde in front of these variables. The svydesign function in the survey package wants these definitions to be formulas, and the way you specify a formula in R is you preface it with a tilde. On the US keyboard, that's in the upper left-hand corner. And notice that the first-stage unit is treated as a formula, the weights are also, the fpc is also, but the data set itself is specified without a tilde. Now, the next thing you have to do is specify the totals for the population auxiliary variables that I'm going to use. So what I've done here is I created a data frame using this data.frame statement.
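The setup described above can be sketched like this (following the standard survey-package api example; dnum, pw, and fpc are the actual field names in the apiclus1 data):

```r
# Load the survey package and the Academic Performance Index data
library(survey)
data(api)

# Define the design object: dnum identifies the first-stage units
# (districts), pw holds the weights, fpc the finite population correction.
# Note the tilde: each of these is given as a formula; data is not.
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc,
                    data = apiclus1)

# Population control totals for school type:
# E = elementary, H = high school, M = middle
pop.types <- data.frame(stype = c("E", "H", "M"),
                        Freq  = c(4421, 755, 1018))
```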
The first column is going to be school type. The labels for that are E, H, and M, which stand for elementary, high school, and middle, and these are just different grade ranges that are used in the US. And then I give the counts of schools, 4421, 755, and 1018, in those three school types. Now, how do I poststratify? I just invoke this command postStratify. Notice here that that's a capital S. R is case sensitive, so if you used a lowercase s there, it would bark at you, saying it couldn't find the function. You've got to be careful to see exactly how your function name is spelled. So I'm operating on this design object, dclus1, I am poststratifying by stype, and notice that's a formula again, tilde in the front. And then I give it the control totals here, pop.types, and I defined those back in the previous line. And that's all there is to it. It goes through, it creates these poststratified weights, and that information is saved into this new object called dclus1p, p for poststratification. So just take a look at the weights that came out of this. What I've done is I've rbind-ed a summary of the weights for the non-poststratified design object dclus1, and then the weights for the dclus1p object. This weights function right here is an extractor kind of function; it'll pull the weights out of that design object and show them to you. So what have I got if I look at the first row here, the non-poststratified object? You see all the weights are the same, so I've got an equal probability sample. In the second row, those are the poststratified weights, and you can see those are spread out from 30.7 up to 53.93. And why is that? It's because the sample itself is not proportionally allocated among these school types. So poststratifying, in a sense, corrects that, and we hope that that will reduce variances. If I had coverage errors, we also hope that it will reduce those.
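The poststratification step and the weight comparison described above might look like this (dclus1 and pop.types as defined earlier):

```r
# (setup as before)
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))

# Poststratify on school type -- note the capital S in postStratify
dclus1p <- postStratify(dclus1, ~stype, pop.types)

# Compare the weights before and after; weights() is an extractor
rbind(original       = summary(weights(dclus1)),
      poststratified = summary(weights(dclus1p)))
```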
So let's look at a couple of results just to see how the point estimates can change. The first thing that I called for here is the mean of a variable called enrollment; svymean is the function that I use to do that. enroll is the variable, and it has a tilde there again, it has to be specified as a formula, and here's the design object, dclus1. So this is the before-poststratification version. I get a point estimate of 549.72 students enrolled per school, standard error 45.19. If I do the same thing on the poststratified object, you can see my point estimate of the mean changed some; I'm up to 594.27, and the standard error got bigger too. So note this: things can go in any direction. If I compare the standard error before and after, poststratification actually made things worse in terms of the standard error. The mean got bigger, so the coefficient of variation could actually be smaller, but in this case it's not. There's no guarantee that you're going to improve estimates of the mean, or of the total, with poststratification, although you may. Now, let's take a look at the total, and we can do that with svytotal on enroll again, same variable. So here's the answer: I've got a total of about 3.4 million, standard error of 932 thousand and some. And if I use the poststratified version, dclus1p, then what I get is this line right here. And so you see, the total changed a bit, not tremendously, but the standard error did change quite a lot if I compare these two values. Before poststratification I'm at 932,000; after poststratification I go to 406,000. So I cut the standard error by over 50% by poststratifying, and that's on the estimated total. Now, I can look at cvs, and the function that will do that is called cv, little c little v, in the survey package. So here, I just collect together the coefficient of variation for the mean of enrollment from the non-poststratified object dclus1, and then from the poststratified object.
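The mean, total, and cv calls walked through above can be sketched as follows:

```r
# (design objects as defined earlier)
library(survey)
data(api)
dclus1  <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
dclus1p <- postStratify(dclus1, ~stype, pop.types)

svymean(~enroll, dclus1)      # mean before poststratification
svymean(~enroll, dclus1p)     # mean after

svytotal(~enroll, dclus1)     # total before
svytotal(~enroll, dclus1p)    # total after

# Coefficients of variation for the means and the totals
cv(svymean(~enroll, dclus1));  cv(svymean(~enroll, dclus1p))
cv(svytotal(~enroll, dclus1)); cv(svytotal(~enroll, dclus1p))
```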
So you see right here, I go from a cv of 0.0822 to 0.1103, so either in terms of standard error or cv, I made things worse by poststratifying here, for the mean. If I look at totals, on the other hand, and compare the poststratified and the non-poststratified objects, I go from about 0.2737 to 0.1103. In other words, I gained quite a lot in terms of cv and standard error by poststratifying. So notice also that these two are the same: the cv on the mean and the cv on the total, after poststratification. Now, why is that? It's because, when I divide by the sum of the weights to get the mean, I force the sum of the weights in each poststratum to equal the population count. That's going to be true in every sample, so there's no extra variation; that estimated population count is like a constant after I poststratify. What that leads to is that the standard error, relative to what you're estimating, is exactly the same for the mean and the total. Now, how do you decide whether this poststratifying is a good idea or not? There are different ways of doing it, but one way to think about it is that every estimator has an implied model behind it. And by model, I mean a structural model that relates y to whatever covariates you're using in your estimator. In the poststratification case, it's really simple: a common mean in every poststratum, call it beta sub gamma, and a common variance for every element in a given poststratum, call it sigma squared sub gamma. If the model's approximately right, you get an efficient estimator, efficient in the sense of low variance. Think of it as the way I would predict an additional school within poststratum gamma: I take the mean of what I saw and predict that the next value would be equal to that mean. If the common mean is a good predictor, then you get variance reduction. If it's not, you don't, and that's what this bullet says.
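The implied model described above can be written out; here $y_k$ is the value for unit $k$, $\beta_\gamma$ the common mean and $\sigma^2_\gamma$ the common variance in poststratum $\gamma$:

```latex
y_k \;=\; \beta_\gamma + \varepsilon_k, \qquad
\mathrm{E}(\varepsilon_k) = 0, \qquad
\mathrm{Var}(\varepsilon_k) = \sigma^2_\gamma,
\qquad k \in \text{poststratum } \gamma .
```

And the equal-cv observation follows because the poststratified mean is the poststratified total divided by a fixed population count $N$, so the constant cancels out of the cv:

```latex
\hat{\bar{Y}}_{ps} \;=\; \frac{\hat{Y}_{ps}}{N}
\quad\Longrightarrow\quad
cv\bigl(\hat{\bar{Y}}_{ps}\bigr)
\;=\; \frac{\sqrt{\mathrm{Var}(\hat{Y}_{ps})}/N}{\hat{Y}_{ps}/N}
\;=\; cv\bigl(\hat{Y}_{ps}\bigr).
```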
The one thing about the poststratified estimator is that it'll be approximately design-unbiased, meaning in repeated sampling, if you do it over and over again, you'll average out to the right thing even if the model is wrong. That will be true. But if you've got a poorly specified model, you're going to be inefficient in the sense of not having a small variance for your estimators of the y's. So it's a good idea to check the model; anytime you're doing something that implicitly involves a model, you want to see if the model is any good. One way that you could have model failure here is that you left out some covariates that are very important. So for example, suppose we define our post strata as a cross of age by gender. I cross age by gender, and that creates a number of age groups times two post strata. But suppose that there were two other variables that you should have considered, race-ethnicity and income level, because those are good predictors of the y's that you're analyzing from your data. Then you'll have the wrong model, and poststratification will be less efficient than it could be. How do you take up the slack for that, or improve your estimator? You could think about using raking, where you include race-ethnicity and income level as margins to rake to. Or you could use GREG, which, if you had a quantitative income value, would accommodate both that and the categorical age, gender, and race-ethnicity variables. So you can make poststratification fairly flexible, but if you do want to include both qualitative and quantitative variables, then your best choice may be this thing GREG.
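The raking idea mentioned above can be sketched with the survey package's rake() function on the same api data. Here stype is the real school-type margin used earlier; the second margin uses the sch.wide variable that exists in apiclus1, but its Freq totals below are illustrative placeholders, not verified population counts:

```r
# (setup as before)
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
# Illustrative totals for a second margin -- replace with real counts
pop.schwide <- data.frame(sch.wide = c("No", "Yes"), Freq = c(1072, 5122))

# Rake to both margins instead of poststratifying on one
dclus1r <- rake(dclus1,
                sample.margins     = list(~stype, ~sch.wide),
                population.margins = list(pop.types, pop.schwide))

summary(weights(dclus1r))
```

Raking adjusts the weights iteratively so they reproduce each margin's totals; GREG (via the survey package's calibrate() function) generalizes this to mixes of categorical and quantitative auxiliaries.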