So, in this series of lectures we'll focus on outcomes that only take on one of two values: binary data outcomes. Unlike continuous measures, which span a whole range, these only take on two values, which we can code numerically as a one or a zero. So, for example, we may ask whether or not somebody has had a cold in this past winter. A yes gets a value of one, and a no gets a value of zero. Or we may ask whether or not somebody has quit smoking, among people in a particular group who used to smoke or currently smoke. The answer will be a one if they have quit smoking, a yes, and a zero if not, and so on and so forth.

So, we'll find out that summarizing binary measures for single samples is less complex than summarizing continuous data, where we had to take in measures of center, spread, and position via percentiles. We'll see that a single summary statistic gives us all of that: for a single sample of binary data, we only need one number. But when we start comparing populations via comparing samples, things get more complicated. But let's first talk about the easier side of things, the sample proportion as a summary statistic.

So, upon completion of this lecture section, you will be able to summarize a binary outcome across a group of individual observations via the sample proportion; explain why, with binary data, the sample proportion is the only summary statistic besides the sample size n necessary to describe characteristics of the sample; and compute the sample proportion based on the results of a study.

So, let's look at our first example. We'll look at response to therapy in a random sample of 1,000 HIV positive patients from a citywide clinical population. Of the 1,000, 206 responded to the therapy given. The summary measure we're going to use to quantify this is called the sample proportion, represented by a p with a little caret on top of it, or a hat if you will, and this is actually pronounced p hat.
It's given by the following formula: p hat in this sample is the proportion of those responding, which is the number of persons who responded divided by the total number in the sample. So, we had 206 responders out of a thousand people, for a value of 0.206 or 20.6 percent. Why do we put the hat on the p? Other than to dress it up nicely, what we want is to distinguish this sample estimate from the underlying truth, which we'll represent with a p without the hat. We can only estimate the underlying truth via this sample proportion.

So, let's think about this. The sample proportion p hat that we just computed, you can think of this as a mean when our outcome takes on two values, zero or one. Generally, binary data values are given a value of one for observations that have the outcome, and zero for observations that do not. So, in this sample, with 206 of the 1,000 responding, we have x equals one for 206 observations, and x equals zero for the remaining 794 observations. If I were to take the mean of these zeros and ones, I'd be adding up, in the numerator, 206 ones plus 794 zeros. The sum of 206 ones is 206, nothing is added by adding 794 zeros, and the total sample size is 1,000. So, the sample mean of the ones and zeros is 206 out of 1,000, or the sample proportion we've talked about.

There is also a formula for the standard deviation of binary data, and this is the formula here. But we won't be using this or calculating this because, unlike with continuous sample values, where a measure of spread is interesting and useful, especially for comparing samples from different populations, this quantity for binary data is not particularly useful in understanding the distribution. It doesn't give us any insights. However, the one thing I do want you to note is that the variability of the sample values is dependent on the value of the summary statistic p hat. So, essentially, if we have p hat and n, we have all the information about variability in our values in the sample.
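To make the "p hat is a mean" idea concrete, here is a minimal Python sketch using the 206-out-of-1,000 example. The transcript doesn't reproduce the standard deviation formula from the lecture, so the two closed forms below are the standard textbook results rather than necessarily the one shown: dividing by n gives the square root of p hat times (1 minus p hat), and dividing by n minus 1 scales that by n over (n minus 1). The variable names are just illustrative.

```python
import math
import statistics

n = 1000
responders = 206

# Code each responder as 1 and each non-responder as 0.
x = [1] * responders + [0] * (n - responders)

# The mean of the zeros and ones is the sample proportion.
p_hat = sum(x) / n

# Standard deviation dividing by n, and its closed form in terms of p_hat.
sd_div_n = statistics.pstdev(x)
sd_div_n_formula = math.sqrt(p_hat * (1 - p_hat))

# Standard deviation dividing by n - 1, and its closed form (assumed convention).
sd_div_n_minus_1 = statistics.stdev(x)
sd_div_n_minus_1_formula = math.sqrt(n / (n - 1) * p_hat * (1 - p_hat))

print(f"p_hat = {p_hat:.3f}")
print(f"SD (divide by n)     = {sd_div_n:.4f} vs formula {sd_div_n_formula:.4f}")
print(f"SD (divide by n - 1) = {sd_div_n_minus_1:.4f} vs formula {sd_div_n_minus_1_formula:.4f}")
```

Either way, the standard deviation is fully determined by p hat and n, which is the point being made above.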
Furthermore, if we know the value of p hat, we have information regarding the sample percentile values. Again, because the data is only yes/no or 0/1 data, all of our percentiles will be either zeros or ones. So, for example, with the thousand HIV positive clinic patients, the proportion responding was 0.206. If I were to actually write out the data in numerical order from the 1,000 observations, there would be 794 zeros representing the 794 persons who did not respond, and then there would be 206 ones. So, we immediately know all the percentiles of these data, and they're only going to be the value zero or the value one. For example, 50 percent, or 500, of the observations are less than or equal to zero, and 500 are greater than or equal to zero, so the 50th percentile, the median of these data, is zero. Similarly, the 60th percentile is also zero. It's not until we actually get past the 79th percentile that we'd start to see the value one, as roughly 79 percent of the sample did not have the outcome and the remaining 21 percent did. The 90th percentile is one and the 95th percentile is one.

In any case, I don't want you to worry too much about this, because percentiles are not particularly informative when they can only be zeros or ones. If we know the percentage of ones in our sample, we've got the whole story. So, again, the percentile values for binary data are not particularly useful in characterizing the sample distribution. They do not give us any information above and beyond our p hat value.

What about visual displays of binary data? Well, these are not so useful either. Here's a histogram of the response to antiretroviral therapy among the 1,000 persons in our data set, and you can see it only has two bars, essentially at the values 0 and 1. These are not really bins here because there are only two values in our data, and these bars just show the proportion who didn't respond and the proportion who did.
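As a quick check of the percentile argument above, here is a minimal Python sketch. The `percentile` helper is hypothetical and uses a nearest-rank definition (the smallest value with at least q percent of the observations at or below it). That is an assumption, since the lecture doesn't say which percentile convention it uses, and software that interpolates between values can give slightly different answers right at the 79.4 percent boundary.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    n = len(sorted_values)
    # Ceiling of q * n / 100, computed with integers to avoid rounding issues.
    rank = max(1, (q * n + 99) // 100)
    return sorted_values[rank - 1]


# 794 non-responders coded 0 and 206 responders coded 1, in numerical order.
data = sorted([0] * 794 + [1] * 206)

for q in (50, 60, 79, 80, 90, 95):
    print(f"{q}th percentile: {percentile(data, q)}")
```

Every percentile is a zero or a one, and where the switch happens is determined entirely by p hat.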
In that histogram, the bar for those who didn't respond goes up to 79.4 percent, and the bar for those who responded goes up to 20.6 percent. This graphic doesn't really add much above and beyond having the p hat value of 20.6 percent, because we have all the information there. Certainly, a box plot doesn't make a lot of sense either. The majority of values are zeros, and so the box plot algorithm deems the ones to be outliers. But there are only two points represented on this box plot, and no information about what proportion are zeros and what proportion are ones. So, what I am getting at is that, unlike with continuous data, where measures like standard deviation, percentiles, and visual displays helped us better understand the distribution of values, with binary data, if we have the sample proportion we pretty much have the entire story.

Let's look at another example of computing sample proportions. These are results from a randomized trial. HIV positive pregnant women were randomized to receive either AZT or a placebo. Then, after birth, their children were followed for up to 18 months to see who contracted HIV. So, what they found was that from April 1991 through December 20th, 1993, which was the cut-off date for the interim analysis of efficacy, 477 pregnant women were enrolled. During the study period, 409 gave birth to 415 live born infants. HIV status was known for a subset of 363 of those infants: 180 in the AZT, or zidovudine, group and 183 in the placebo group. Thirteen infants in the AZT group and 40 in the placebo group were HIV infected.

So, if we were to display this in tabular format, we'd get what is sometimes called a two-by-two table, because it has two rows and two columns. In the rows we have the outcome of interest, whether or not a child was diagnosed as HIV positive, and in the columns we have the two groups of mothers who gave birth, those who were given AZT during pregnancy and those who were given placebo.
In the AZT group there were 180 children born to 180 mothers given AZT during pregnancy, and 13 of these children were diagnosed with HIV, which is 0.07 or seven percent. So, if I represented this as a p hat here, I'd add a little subscript AZT to indicate that we're computing the proportion with the outcome from the group of children whose mothers were given AZT. We do the same thing for the placebo group. There are 183 children born to 183 mothers given placebo during their pregnancy, and 40 of the children contracted HIV, which is 0.22 or 22 percent. That's a much larger proportion of children who contracted HIV among mothers who were untreated, or given placebo, when compared with children whose mothers were given AZT. We'll explore this relationship continually as we move along in the course.

So, in summary, for quantifying the distribution of binary outcomes in a sample and, hence, estimating the distribution in the population from which the sample was taken, the sample proportion p hat is paramount. It summarizes the percentage of subjects or observations in the sample that have the outcome, which we can also call the probability or risk of the outcome. But it also gives information about the variability of individual sample observations and about the sample percentiles. So, all the information about the data is essentially given in that single summary measure. We can think of p hat as the sample mean of observations that take on the value of one for observations with the outcome, and zero for observations without the outcome.

In the next sections, we'll talk about several different ways to actually compare binary outcomes between two or more samples from two or more populations. We'll see that even though taking a single proportion on a single sample is pretty straightforward, when we start comparing these, things can get a little trickier.
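For reference, the two group proportions from the AZT trial can be reproduced with a short Python sketch. The counts come from the example above; the dictionary layout and the names `table` and `p_hat_by_group` are only illustrative.

```python
# Two-by-two table from the trial: HIV status of infants by the mother's treatment group.
table = {
    "AZT": {"hiv_positive": 13, "total": 180},
    "placebo": {"hiv_positive": 40, "total": 183},
}

# Sample proportion with the outcome in each group: p hat with a group subscript.
p_hat_by_group = {
    group: counts["hiv_positive"] / counts["total"]
    for group, counts in table.items()
}

for group, p_hat in p_hat_by_group.items():
    hiv_negative = table[group]["total"] - table[group]["hiv_positive"]
    print(
        f"{group}: {table[group]['hiv_positive']} HIV positive, "
        f"{hiv_negative} HIV negative, p_hat = {p_hat:.2f}"
    )
```

Each group is summarized by its own single number, and the comparison of those two numbers is what the coming sections take up.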