Up to now, we've looked at a few examples where we have by default assumed a normally distributed population. Now, for many situations, such as product characteristics, e.g. the volume of water in these mineral water bottles, a normal distribution could indeed feasibly be an approximate fit, and it serves as a very reasonable model. However, thinking back to the tail end of week two, we introduced a few different families of probability distributions, such as the Bernoulli, the Binomial, and the Poisson distributions, and these clearly exhibit very different characteristics from the normal distribution. So in this section, we're going to consider a remarkable mathematical result known as the Central Limit Theorem, or CLT for short.

Now, what is this all about? Well, recall that when we had a normally distributed population, we noted the sampling distribution of the sample mean x bar, whereby x bar followed a normal distribution with a mean of mu and a variance of sigma squared over n, such that this sampling distribution would be centered on the true population mean mu. So on average, the sample mean would correctly estimate the population mean. There was then some variation in the observed sample mean values, but that variation would reduce as we increased the sample size n. However, there will be many other situations in life where a normal distribution could not reasonably be supposed or assumed, and then we can defer to the Central Limit Theorem. This says, with one or two minor technical caveats which we will gloss over here, that when sampling from any non-normally distributed population, the result holds asymptotically. That's a big word, but it simply means that as the sample size n tends to infinity, i.e. as the sample gets larger and larger and larger, the sampling distribution of x bar converges to the same normal distribution with mean mu and variance sigma squared over n.

So I'd like us to consider one special application of the CLT result, one which is very useful in opinion surveys, for example opinion polls in the political science sphere. Let's imagine we have a population which follows a Bernoulli distribution. Remember the Bernoulli distribution, one of my personal favorites, whereby we divide the population into successes and failures, ones and zeros. In any member of the Bernoulli family, there is a proportion of successes in the population given by some success probability parameter phi, and the remaining failures, coded as zero, occur with probability one minus phi. So here we have a two-point probability distribution, whereby if we were to draw a sample from this population, then each time we would either get a success, coded as one, or a failure, coded as zero. Therefore, our entire sample dataset simply consists of all of the ones and zeros, reflecting the number of successes and failures we observed respectively. So this is very different from a continuous, smooth normal distribution, but there is one way we can usefully apply the Central Limit Theorem.

So what is this saying? Let's take a very simple example and consider a sample of size five, so n equal to five, drawn from this Bernoulli distribution. Suppose we are doing some opinion polling, asking people whether they intend to vote for the governing party or some other party. Even though there could be multiple parties in this democracy, it's very easy to reduce this to a dichotomy.
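We will return to the polling example in a moment. First, purely as a notational summary of the two ingredients just described in words, they can be stated compactly as:

```latex
% CLT: for a random sample from any population with mean mu and finite
% variance sigma^2 (the minor technical caveat glossed over above):
\[
\bar{X} \;\sim\; N\!\left(\mu,\ \frac{\sigma^{2}}{n}\right)
\quad \text{approximately, as } n \to \infty .
\]
% The Bernoulli population with success probability phi:
\[
P(X = 1) = \phi, \qquad P(X = 0) = 1 - \phi .
\]
```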
Returning to the example: we could say you're voting for the governing party, or for some other party, which covers all of the opposition parties, and we don't need to make a distinction between them. So let's imagine five people are randomly selected. Suppose they are giving us honest answers, so there is no response bias, and they all give us an answer, hence there is no non-response bias. Let's say the first person says, no, I will not vote for the governing party; we will denote this as a failure and record it as an observation of zero. The second person says, yes I will, an observed value of one. The third person says, yes I will. The fourth person says, no I won't, so a failure, recorded as zero. And let's say the final person says, yes I will, and hence we get a one. So our pattern of observations would be a failure, a success, a success, a failure, a success, denoted by zero, one, one, zero, and one.

So let's imagine now we'd like to take the sample mean of all observations. Well, we learned back in our earlier work on descriptive statistics that to calculate the sample mean we just add up all of the observations and divide by the number of observations, and we have no need to deviate from that principle here. So we simply add up our data values: zero plus one plus one plus zero plus one gives us a grand total of three, divided by our sample size n of five. So three over five gives us a value of nought point six. So here we have calculated the sample mean, but this has a very special interpretation, because effectively we have calculated the sample proportion of successes. When we add up the zeros and ones in the numerator term of our sample mean statistic, remember, the sum of the x i divided by n, the zeros do not contribute anything to the aggregate total. So the numerator will simply represent the number of successes we've observed, in this instance three, divided by our sample size n of five, giving us a sample proportion of nought point six, i.e. 60% of the respondents indicated that they would vote for the governing party and hence count as successes.

So given we're starting from a Bernoulli-distributed population, which is very non-normal, we can now appeal to the Central Limit Theorem. Now, we previously derived that the expected value of a Bernoulli random variable was equal to phi, on the back of that probability-weighted calculation. So here phi, the expectation of X, really serves as the Bernoulli case of the population mean mu. We'll also just note the result that the population variance sigma squared for a Bernoulli distribution is phi times one minus phi. So if we now invoke our Central Limit Theorem result, we can say our sample mean x bar, which in this particular application is the sample proportion, which we may wish to denote by the letter P, for proportion, is our special case of the sample mean here. As n tends to infinity, so asymptotically, the sample proportion P will be approximately normally distributed with a mean of mu and a variance of sigma squared over n. But since the population from which we are drawing our sample is the Bernoulli distribution, its mean is phi and its variance is phi times one minus phi, and hence we can now use the Central Limit Theorem to derive the sampling distribution of the sample proportion: P is approximately, i.e. for a large sample size, normally distributed with a mean of phi and a variance of phi times one minus phi over n.
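As a minimal sketch of this arithmetic in code (the value of phi used at the end is purely an illustrative assumption, since in practice the true phi is unknown):

```python
# The five poll responses, coded 0 for a failure and 1 for a success.
observations = [0, 1, 1, 0, 1]

n = len(observations)      # sample size, n = 5
p = sum(observations) / n  # sample mean of zeros and ones = sample proportion
print(f"Sample proportion P = {p}")  # 0.6, i.e. 60% successes

# By the CLT, the sampling distribution of P is approximately normal with
# mean phi and variance phi * (1 - phi) / n. The phi below is an assumed
# illustrative value, not something estimated from the data.
phi = 0.5
print(f"P is approximately N({phi}, {phi * (1 - phi) / n})")
```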
So again, we note that the expectation of the sample mean, specifically this sample proportion, is equal to the true parameter it is trying to estimate, i.e. phi. That means on average, our sample proportion will be equal to the true population proportion, the phi in this Bernoulli distribution. We also see that the variance of the sampling distribution will decrease as our sample size n increases, since n appears in the denominator of the variance of P. So this serves as an excellent illustration of how we can still use the normal distribution, yet again another example of its usefulness to us as statisticians, in being able to approximate the sampling distribution of a sample mean when sampling from a non-normally distributed population.

So, to round off this section, let's run a few simulations to see this Central Limit Theorem, this approximation to normality, really come to life. I said this was an asymptotic result, i.e. one which holds increasingly well as the sample size gets larger and larger and larger. But how large does n need to be for this approximation to be reasonable? So, as an illustration, here's one I prepared earlier. I used a computer to simulate many random samples of different sample sizes from the Bernoulli distribution with the success parameter phi equal to nought point two. In the simplest case, where n is equal to one, this just means taking multiple random observations, random drawings, from this Bernoulli nought point two distribution. As we have a very large number of samplings from this distribution, the proportion of successes that we observe should be approximately equal to the true proportion of successes in this population, i.e. 20% successes and 80% failures. So here, we can produce a histogram of those simulated results, and clearly this looks nothing like a normal distribution. Unsurprising, because as we only have samples of size one, each observation is either a success, and hence a value of one, which is then itself the sample mean, or a failure, and hence a value of zero, which again is itself the sample mean. But now see what happens as we increase the sample sizes across all of these simulations. By the time we reach sample sizes of about 50, the histogram, which represents the sample means calculated across this large number of randomly simulated samples, is converging to a normal distribution. And if we increase the sample size n further still, so asymptotically, as n tends to infinity, you really do see this histogram converging very nicely to a normal distribution, and hence a great example of the Central Limit Theorem; a code sketch of this kind of simulation follows below. So in our final section to come, I'd just like to consider a couple of statistical inference examples related to the sample proportion, namely an example of a confidence interval and a hypothesis test.
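For reference, here is a minimal sketch of the kind of simulation just described, assuming NumPy and Matplotlib are available; the number of replications and the grid of sample sizes are illustrative choices, not the exact values used in the lecture.

```python
# Draw many random samples from a Bernoulli(0.2) population at several
# sample sizes, compute the sample mean (i.e. the sample proportion) of
# each sample, and plot a histogram of those means for each sample size.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
phi = 0.2                       # true success probability
replications = 10_000           # simulated samples per sample size
sample_sizes = [1, 5, 50, 500]  # watch the histograms approach normality

fig, axes = plt.subplots(1, len(sample_sizes), figsize=(16, 3))
for ax, n in zip(axes, sample_sizes):
    # Each row of draws is one sample of size n; each row mean is one
    # observed sample proportion.
    draws = rng.binomial(1, phi, size=(replications, n))
    sample_means = draws.mean(axis=1)
    ax.hist(sample_means, bins=30)
    ax.set_title(f"n = {n}")
    ax.set_xlabel("sample proportion")
plt.tight_layout()
plt.show()
```

At n = 1 the histogram has just two bars, at zero and one, exactly as described above; by n = 50 and beyond, the bell shape of the normal approximation is clearly visible.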