So here we are in case study number two. What I want to do here is go over everything that we've learned up until now, as a good review. Now, we could take a paper from a journal, but we're never going to find one that covers exactly what we've done up to this point. So what I've decided to do is create a fake study. We're going to take some data and analyze them, going through all the steps that we've looked at before. This is going to be data from the World Bank. You can go to the World Bank's website, and there is a lot of data there that you can download and analyze yourself. The spreadsheet that I've got here is about health expenditure per capita. That is going to be the research project that we work with here. You see the first column there, country name, with all the countries in alphabetical order, and then you see these columns: 1995, 96, 97. Those hold the health expenditure, and you see the numerical values there in the columns for every country. That's what we're going to look at. So this is the type of values we're looking at. These are the categories that we're looking at. We're dealing with columns representing different years, and down the rows we have all the countries. So let's consider a little study. We're going to compare the healthcare expenditure per capita, that's per person, between two years. Not the whole range of years, not every year in between, just the two years, 1995 and 2010. That is going to be our research question: what is the difference in spend between 1995 and 2010 for all of these countries? So all of those countries make up the participants in our study, and we have these two groups that we form, 1995 and 2010. That's how we can see it. And although this is a simplified example and probably not real world, it is a legitimate thing to do, to look at finances in medicine.
It's a very important aspect of healthcare. So let's then just consider what type of study this is going to be. Now remember, we had the two main types, observational and experimental studies. As far as observational goes, remember we do not interfere in any way. That's simple data collection in the normal day-to-day work. So we have case series, case-control studies, cross-sectional studies and cohort studies. In a case series, we just take a set of values and describe them. In a case-control study we're forming two groups. But remember there is a point in time that we refer to in the study. Remember, we talked about wound infection after surgery, and then we gather data about what happened before that wound infection set in. So we look backwards in time, comparing two groups. In a cross-sectional study, we're just taking a survey, a snapshot in time, and if you think about it, if I take all that health expenditure for 1995, that is a snapshot in time, and 2010 is also a snapshot in time. And you'll remember cohort studies, where we have a group that shares a common trait and we look at what happens to them from a certain point in time going forward. So the data points, again if we look at the wound infection, would run from the day that we diagnosed the wound infection and cover what happens after that. And then we had experimental studies. Remember, that is where we stopped the bus and said we're going to intervene in some way. We're going to look forward. We're going to change at least our management, and we're going to treat or manage different groups differently. And we can either blind the patient to the treatment, or we can blind the worker to the treatment, or at least have the person who is doing the data collection blinded to what group the patient is in. And usually we have different types of controls.
In medicinal studies that will usually be a group that just takes a placebo drug. Now, let's consider for a moment the types of sampling. Remember, there were four that we spoke about. Simple random sampling, where we have a master list and everyone on that list has an equal likelihood of being chosen for our sample. Systematic random sampling, where we have this master list and we know exactly how many participants we want. We divide those two numbers, and that leaves us with a number, say ten, and I would systematically choose every tenth person or participant, or country or year, from that list. Cluster random sampling is where participants are grouped together, and I can choose everyone inside a group as part of my sample. So we could perhaps say that everyone is clustered in 1995. I'm taking 1995 and 2010, and all the countries and their values are clustered around those years. I take them all. So I suppose we could see it that way. And then stratified random sampling. Remember, that is where there's a mutually exclusive trait. So what we could have done is look down the rows of our data and ask something else. Let's compare Northern European to Southern European countries. Let's compare European countries to the Americas. In other words, all the countries in a group form a stratum, the mutually exclusive trait being that they are in some geographical area. So those would be the different ways that we could stratify. But we're going down the columns. We're looking at two clusters, 1995 and 2010. Let's just quickly remind ourselves of the data types that we're dealing with. Remember we said we either had discrete data types, where the quintessential example was the rolling of a die, or continuous data types, the types of numerical values that we can, at least practically, infinitely divide into smaller and smaller values. The other classification system we looked at is categorical versus numerical.
Now remember, categorical comes in nominal and ordinal types. Nominal is just words. So if we think about 1995 and the year 2010, there is some order to that, but it's not really a natural order; it depends on which way we are looking at it. I suppose we could argue that it might be ordinal in some way. But remember the quintessential examples of ordinal: a pain scale, a Likert-style question, even if it's converted to numbers. And numerical comes in interval and ratio types. Remember, interval does not have a true zero, like degrees Fahrenheit or degrees Celsius. And then there is ratio-type numerical. Now, if you think about it, our data point values are ratio-type numerical, because they are numbers representing how much money was spent per person. That is our data point, and the data type of the data point value is ratio-type numerical. And that tells us what kind of statistical analysis we'll eventually be able to do. Let's have a quick look again at descriptive statistics. We can take all the values we have in the 1995 column and in the 2010 column, and we can just describe them. In other words we can look at point estimates: measures of central tendency like the mean and median, and measures of dispersion, in other words variance and standard deviation. We can look at the range, the quartiles and the interquartile range. Let's have a look. What I want you to see: that's the 1995 column, all the countries in 1995, and I'm just asking for the mean, and you see it turned out to be 458. Look at this though, the median is 94. What I want you to see is the big discrepancy between the median and the mean, and that must tell us something. Let's look at the variance, or better yet the standard deviation. Look at the size of that standard deviation. If we go that far on either side of the mean, we end up with negative values, which of course is nonsensical. No country spends a negative amount of money on its healthcare. The lowest possible value would be zero.
But what this data tells us, and what it must tell you, is that there is some skewness to this data, that there are some larger values which pull that mean up. Because if we look at the median, it means half of the values must be less than 94 and the other half must be more than 94. But look at that mean. That must mean that there are a lot of very large values pulling that mean up, and that means there is some skewness to the data. Let's look at the results. In the 1995 column we actually have 216 entries. So we have 216 participants in that group, the 1995 group. That's quite a lot. You see the mean there, we looked at it, 458, and you see the 50th percentile there, and that is the median, 94. We also know it is the second quartile, so we can talk about quartiles and percentiles. If we look at the 25th percentile, you remember that it's the first quartile. We see the minimum spend there was only 2. That's the zeroth quartile. We see the 75th percentile there, 341. That is the third quartile, and we see the maximum there. See that the maximum is way, way above. So it really gives us another indication here that this data is quite skewed. Now, a good way to look at your data always, and that is why you'll see so much of it in the published literature, is visualizing your data. Always a very good idea. Let's have a look at a boxplot. And here we have visual proof of the skewness of this data. So 1995, and you see the poor little boxplot squashed at the bottom there. We can't even see the bottom whisker. Now the green dots all over the show: we've just added some jitter so that they're not all on the same line. Those are the actual values, but look at the blue and the black. That is the boxplot. You see the top whisker there. And see on the straight line up there, all the data points which are statistical outliers. Again, that gives us good proof that there is a lot of skewness in the data. There are a lot of large values pulling the mean way up above the median.
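As a minimal sketch of the descriptive statistics discussed above, assuming Python's standard `statistics` module: the list `spend` is a small made-up, right-skewed set of values chosen for illustration, not the World Bank figures, so only the pattern (mean far above median, standard deviation larger than the mean) carries over.

```python
import statistics

# Hypothetical per-capita spend values (NOT the World Bank data),
# deliberately right-skewed: many small values, a few very large ones.
spend = [2, 5, 9, 14, 20, 35, 60, 94, 150, 341, 900, 2500, 3800]

mean = statistics.mean(spend)      # pulled up by the large values
median = statistics.median(spend)  # the 50th percentile / second quartile
sd = statistics.stdev(spend)       # sample standard deviation

# Quartiles (25th, 50th, 75th percentiles). method="inclusive" treats the
# minimum and maximum of the list as the 0th and 4th quartiles.
q1, q2, q3 = statistics.quantiles(spend, n=4, method="inclusive")

print("n      :", len(spend))
print("min    :", min(spend), " max:", max(spend))
print("mean   :", mean, " median:", median)
print("sd     :", round(sd, 1))
print("Q1, Q3 :", q1, q3)
print("mean - sd is negative:", mean - sd < 0)
```

The last line mirrors the point about the standard deviation: one standard deviation below the mean falls below zero, which cannot be a real spend, and that is a hint that the values are skewed.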
Now why were we so interested in the interquartile range? I want to remind you that we can use it to determine which values we could view as statistical outliers, and I'm going to remove them from the dataset. I'm not suggesting that it's the right thing to do in this analysis. I just want to remind you of the fact that we can remove what we feel are statistical outliers, and the way that is done is to take the value at the third quartile, that's the 75th percentile. We subtract from that the value of the first quartile, that's the 25th percentile, and we multiply the result by 1.5. Now, we can add this value to the third quartile and we can subtract it from the first quartile, and that gives the upper and the lower limits beyond which the statistical outliers lie. Now, we saw that in our data set we just have outliers at the top. So we're going to add this value to the third quartile, and that is going to give us a maximum level, and anything beyond that we could remove. And we could suggest that those values really are just statistical outliers. That would be one way to go about it. Later on in this course we'll see that we can just leave them in, because there are sets of statistical tests that we can do just based on the median. So again, this is what I've done here. You'll remember the values for 1995, so there's the 341, and to that I add 1.5 times the difference between the first and third quartiles. So let's have a look at the data now. Now you can actually see the bottom whisker there. The box has been pulled up. But this is a whole new data set, so it in and of itself will also have statistical outliers. I just wanted to show you how the values have changed, how the boxplot has changed, now that we've gotten rid of those initial statistical outliers. Now let's look at 2010. There's also this huge difference between the mean, 1,029, and the median, 304. We're sitting with exactly the same problem here.
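The 1.5 times interquartile range rule described above can be sketched as follows. This is an illustration under stated assumptions: the helper names `iqr_fences` and `drop_outliers` are hypothetical, the quartile method (`"inclusive"`) is one of several conventions and may not match the software used in the lecture, and `spend` is a made-up list rather than the World Bank column.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: third quartile minus first quartile
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that lie inside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if lower <= v <= upper]

# Hypothetical right-skewed values (NOT the World Bank data).
spend = [2, 5, 9, 14, 20, 35, 60, 94, 150, 341, 900, 2500, 3800]

lower, upper = iqr_fences(spend)
trimmed = drop_outliers(spend)

print("fences       :", lower, upper)
print("after 1 pass :", trimmed)

# The trimmed list is a new data set with its own quartiles,
# so applying the rule again can flag further values.
print("after 2 passes:", drop_outliers(trimmed))
```

In this made-up list the lower fence is negative, so only the large values are removed, and a second pass removes one more value, which is the "new data set, new outliers" effect mentioned in the lecture.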
We can assume from this data right here, just these values for 2010, the simple descriptive statistics, that there must be some skewness in this data. Once again, look at that. You see all those little black dots in a straight line up above the top, which are statistical outliers. You'll remember the values; we're doing the exact same thing. Third quartile value minus first quartile value, multiplied by 1.5, and we're going to add that to the third quartile. Anything beyond that we can throw away. Look at that again, we can see the bottom of our little boxplot. This is a new data set, a new set of data values, so it will have its own set of outliers there. Now, we've just got this intuition from the data that it is really skewed, but let's look at the distributions. Remember the two types of distribution we talked about: the distribution of the data point values from our samples themselves, and then the sampling distribution through the central limit theorem, which states that if I could do this over and over again and get all the means, for instance, those means would be approximately normally distributed. Some means would occur very commonly, some would occur less commonly. All we sit with here are two means, the mean of 1995 and the mean of 2010. But I can construct, through the central limit theorem, a distribution plot of the possible sample means that could exist. Now, look at 1995. This is called a kernel density estimate. You see that there is this right-sided tail. There's a positive skewness to this. Remember skewness? Most of the data values are toward the left, but it trails out towards the right. So we can certainly see the skewness visually here. We saw it in the boxplot, but now, with a distribution of just the actual data points, we can see it too. Let's look at 2010. You can also see the skewness, and it's a positive skewness because it tails off towards the right.
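To illustrate the contrast between the distribution of the data values and the sampling distribution of the mean, here is a small simulation sketch. Everything in it is an assumption made for illustration: the "population" is randomly generated lognormal data rather than the World Bank values, the sample size of 200 and the 1,000 repetitions are arbitrary, and `skewness` is a hand-written helper for the third standardized moment.

```python
import random
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed standardized values."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population (NOT the World Bank data).
population = [rng.lognormvariate(5, 1) for _ in range(5000)]

# "Do this over and over again": draw many samples, keep each sample mean.
sample_means = [
    statistics.fmean(rng.sample(population, 200))
    for _ in range(1000)
]

raw_skew = skewness(population)
means_skew = skewness(sample_means)

print("skewness of the data values :", round(raw_skew, 2))
print("skewness of the sample means:", round(means_skew, 2))
```

The data values themselves are strongly right-skewed, while the collection of sample means is much closer to symmetric, which is the behaviour the central limit theorem describes.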
Now, we can ask the computer to look at the skewness and kurtosis. If you remember, skewness is what we've just talked about, and kurtosis is how peaked the central peak is, compared really to a normal distribution. So I can ask for skewness, and you see 2.1 for 1995 and 1.6 for 2010. So both of them are positive values. They're larger than zero. That is positive skewness, with more skewness in the 1995 data. And if I look at kurtosis, how peaked it is, there was a sharper peak, in other words a narrower peak, in the 1995 data than there was in the 2010 data. And now let's look at the sampling distribution. Remember, I now have a mean for 1995 and I have a mean for 2010. There is a difference between those two means, but that is just one possible difference of many. I want to show you the difference between the distributions we've just seen, with the skewness in our data point values, and what we get if we construct a t-distribution of all the possible differences in means that we could get. If we repeated this over and over again, we would get a sampling distribution. Let's look at it. I plugged the values from the dataset that we're dealing with now into that equation, and look at the beautiful normal distribution. Really, because the numbers that we're dealing with here are so big, we have this nice normal distribution. And this is a sampling distribution. So notice how skewed our actual values were. But the difference in means is just one of many possible differences, and through the mathematics of this sampling distribution, the t-distribution here, we can draw a beautiful symmetric graph, and from this graph we can do statistical analysis. We know where our difference falls and we know whether it's going to be a statistically significant difference or not. So I think that's a good summary of what we've learned up until now. Let's move forward.
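As a rough sketch of how a difference in means is placed on a t-distribution, the snippet below computes the difference, its standard error and the resulting t statistic. Several things here are assumptions rather than the lecture's method: the lecture does not say which form of the t statistic was used, so the unequal-variance (Welch) form is swapped in; the two years are treated as independent groups, as the lecture frames them; the helper name `welch_t` is hypothetical; and the two lists are made-up values, not the World Bank columns.

```python
import math
import statistics

def welch_t(a, b):
    """t statistic for the difference in means (b minus a), unequal variances."""
    # Standard error of the difference between the two sample means.
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.fmean(b) - statistics.fmean(a)) / se

# Hypothetical per-capita spend for two years (NOT the World Bank data).
group_1995 = [2, 14, 35, 60, 94, 150, 341, 900]
group_2010 = [10, 40, 90, 150, 304, 420, 800, 2100]

diff = statistics.fmean(group_2010) - statistics.fmean(group_1995)
t_stat = welch_t(group_1995, group_2010)

print("difference in means:", diff)
print("t statistic        :", round(t_stat, 3))
```

The t statistic says how many standard errors the observed difference lies from zero; where that value falls on the symmetric t-distribution is what decides whether the difference is called statistically significant.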