So in this section, we're going to explore some visual displays to complement our numerical summary measures for continuous data. Upon completion of this lecture section, you will be able to utilize histograms and boxplots to visualize the distribution of samples of continuous data measures, identify key summary statistics on any boxplot, and name and describe basic characteristics of some common distribution shapes for continuous data. We're going to look at histograms and boxplots because, while means, standard deviations, and percentile values are helpful for numerically characterizing certain aspects of the distribution of a set of continuous data, they do not necessarily tell the whole story, and sometimes it would be nice to be able to easily ascertain and compare differences in the shapes of distributions for different samples. While we can get at that to some degree by comparing numerical measures, as we'll see later on, visual displays are much more helpful for such purposes. Histograms are a way of displaying the distribution of a set of data by charting the number or percentage of observations whose values fall within predefined numerical ranges, and then plotting these numbers or percentages in a bar graph. Boxplots are graphics that are a little less detailed than the aforementioned histograms but that display key characteristics of a dataset. These are especially nice tools, as we'll see later in this set of lectures, for comparing data from multiple samples visually in one graphic. So let's start with some data on a clinical sample of 113 men randomly selected from a clinical population; we have measurements of their systolic blood pressures.
So we have 113 systolic blood pressures. We can create a histogram by breaking the range of blood pressure measurements into bins of equal width, counting the number of the 113 observations whose blood pressure values fall within each bin, and then plotting this number, or the relative frequency of observations that fall within each bin, as a bar graph. Here's a basic, no-frills histogram of these 113 measurements. I put the data into a computer and let it do all the work for me, and it decided to break the data into equal bins of width 10 millimeters of mercury; what it's doing is counting the number of observations that fall within each of these 10-millimeter-of-mercury ranges and plotting them in bar graph form. So, for example, there is only one person in the dataset whose blood pressure falls between 80 and 90 millimeters of mercury, and another single person whose blood pressure is between 90 and 100. When we get to the range from 100 to 110, we see a lot more persons, and we can see that the number of persons falling into these ranges increases until the range from 120 to 130, after which it starts decreasing again. So we get a sense that these data are somewhat, if not perfectly, symmetrically distributed around a central high point. The sample mean for these data is 123.6 millimeters of mercury, and the median is similar in value at 123 millimeters of mercury; those values fall roughly here. So these data appear somewhat symmetric around that measure of center, and the numerical counterpart of the variability we can visualize here is the standard deviation of 12.9 millimeters of mercury. One final thing we can do, to make histograms comparable across samples of different sizes, is instead of presenting the raw observed number of observations that fall within each bin of blood pressure measurements, we can present the relative proportion of the total sample size on the vertical axis.
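As an aside, the bin-and-count recipe just described takes only a few lines of code. This is a minimal sketch in Python; the values here are simulated stand-ins drawn from a normal distribution with the sample's mean and standard deviation, not the actual 113 clinical measurements:

```python
import numpy as np

# Simulated stand-in for the 113 systolic blood pressures (mmHg);
# NOT the real clinical data, just a sample with similar mean and SD
rng = np.random.default_rng(0)
bp = rng.normal(123.6, 12.9, size=113)

# Break the data range into bins of width 10 mmHg and count how many
# of the 113 observations fall within each bin
edges = np.arange(np.floor(bp.min() / 10) * 10, bp.max() + 10, 10)
counts, edges = np.histogram(bp, bins=edges)

# Relative frequencies make histograms comparable across sample sizes
rel_freq = counts / len(bp)
print(counts)
print(rel_freq)
```

Plotting `counts` (or `rel_freq`) as bar heights over the bin edges gives exactly the histogram described above.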
Since we have 113 men in the sample and 113 is close to 100, these percentages will be very similar to the actual observed counts, but that won't always be the case, depending on the sample size. If we go back, we can tell the computer, "Look, I actually want the bins to be of a different width." For example, instead of 10-millimeter-of-mercury bins, I want to make them wider: 20-millimeter-of-mercury bins. What will that do to our graph? Well, our graph still captures the essence of what we saw in the previous displays, but more crudely so. We don't get as much detail about the spread of the values, how they're centered in the middle, and how the proportion of them decreases the farther we get from that center. It's still there, but not in as much detail as before. You can think of it this way: the wider we make the bins, the more detail we lose, and the ultimate bin width would be one spanning the entire data range, in which case we'd have a single bar at 100 percent, which would not be informative at all about the shape of the underlying sample distribution of values. On the other side, we can go overkill in the other direction and make the bins really small. In this case, I told the computer to make the histogram with bins one millimeter of mercury wide, and we still see the shape we saw with bins 10 millimeters of mercury wide, but perhaps in more detail than is necessary, at least for this single sample. If we wanted to compare this histogram to histograms from much larger samples of systolic blood pressures presented with narrow bars, then we might scale these the same way for comparison purposes; but on its own, for this sample of 113 measurements, perhaps this is too granular a presentation. Now, let's look at a boxplot presentation of the same data. We will see this is a less detailed visual graphic for describing the distribution.
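The two extremes of that bin-width trade-off can be demonstrated directly. Again using a simulated stand-in sample, one all-encompassing bin collapses everything into a single uninformative bar, while one-unit bins preserve the shape in finer detail than this sample size really needs:

```python
import numpy as np

# Simulated stand-in sample, not the actual clinical measurements
rng = np.random.default_rng(1)
bp = rng.normal(123.6, 12.9, size=113)

# One bin spanning the whole data range: a single bar holding 100%
# of the observations -- no information about distribution shape
counts_one, _ = np.histogram(bp, bins=1)
print(counts_one)  # [113]

# Very narrow 1-mmHg bins: the shape is still there, but each bar
# holds only a few observations, a noisier picture than needed
fine_edges = np.arange(np.floor(bp.min()), np.ceil(bp.max()) + 1, 1)
counts_fine, _ = np.histogram(bp, bins=fine_edges)
print(counts_fine)
```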
So certainly, it's probably pretty obvious looking at this why it's called a boxplot: the shaded piece in the middle is what is sometimes called the box, and this box contains information about the middle 50 percent of the values in our dataset. The bold line between the two sides of the box is the sample median; the lower side of the box is the 25th percentile, sometimes called the lower hinge; and the upper side of the box is the 75th percentile, sometimes called the upper hinge. Just to note here, notice that the distance between the median and the 25th percentile is similar to the distance between the median and the 75th percentile, so at least the middle 50 percent of these data are roughly symmetric around that median value. If we trace the dotted line down to the solid horizontal line at the bottom of the graph, this line represents the smallest value in the dataset, the minimum. Lastly, if we go from the 75th percentile, or upper hinge, up to the horizontal line at the top, we get the largest value. Notice that the distances of the median from the largest value and the smallest value, respectively, are similar as well. So even though only five summary statistics are represented in this visual display, we get a sense of the symmetry, though not in as much detail as we saw in the histogram. Let's look at another sample and compare these displays again. Here is the Heritage Health length of stay data, where we have 12,928 length of stay values for persons who had at least one inpatient day in the year 2011. If you look at this histogram, it looks decidedly different from the one for the blood pressure measurements. I've labeled the vertical axis in terms of the relative percentage of observations falling in each bin, and it certainly shows that the majority, over 40 percent, fall in this first bin. That bin spans zero to one for binning purposes, but all the values in it are ones.
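The five summary statistics a basic boxplot draws, the minimum, lower hinge, median, upper hinge, and maximum, can be computed in one line. A minimal sketch, again on a simulated stand-in sample:

```python
import numpy as np

# Simulated stand-in for the blood pressure sample
rng = np.random.default_rng(2)
bp = rng.normal(123.6, 12.9, size=113)

# The five statistics a basic boxplot displays: minimum, 25th
# percentile (lower hinge), median, 75th percentile (upper hinge),
# and maximum
minimum, q25, median, q75, maximum = np.percentile(bp, [0, 25, 50, 75, 100])

# Rough symmetry check: for symmetric data the median sits near the
# middle of the box, so these two distances are similar
print(median - q25, q75 - median)
```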
More than 40 percent of the sample had a length of stay value of one. Then, as we move away from that most frequent value, the proportion of observations taking on larger values decreases as we go up the length of stay axis. So this distribution is heavily what we might call skewed, and in the language of visual displays, it's right skewed or positively skewed, because the extreme values, the values occurring less frequently, are much larger than the more commonly occurring values. The mean of this sample is 4.4 days versus a median of 2.0 days. The reason there is such a discrepancy between the mean and the median is that the mean is heavily influenced by these outlying or extreme values; they tend to bring it up in value, whereas the median is only affected by their relative position. The sample standard deviation of these 12,000-plus values is 4.7 days. Again, how would we characterize this distribution? The majority of values are small relative to a minority of larger values. So the larger values are pulling out the tail, or the skew, and we call this right or positively skewed. So, how will this present on the boxplot? Just as these data look very different on the histogram from symmetric data like the blood pressure measurements, the boxplot display will be very different as well. First of all, you can see right off the bat that the box is no longer in the middle of the graphic but is down on the lower end. Then, if we look carefully at this box, the median is not equidistant between the sides of the box, that is, between the 25th percentile and the 75th percentile. In fact, there is no bar below the box indicating the lowest value of the dataset as being distinct from the 25th percentile.
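The pull of a long right tail on the mean, relative to the median, is easy to see with simulated skewed data. This sketch uses a rounded-up exponential sample as a hypothetical stand-in for length-of-stay values, not the actual Heritage Health data:

```python
import numpy as np

# Hypothetical right-skewed stand-in for length-of-stay data:
# exponential draws rounded up to whole days (NOT the real sample)
rng = np.random.default_rng(3)
los = np.ceil(rng.exponential(scale=3.0, size=12928))

# With a right skew, the large tail values pull the mean up, while
# the median only registers their relative position
print(los.mean(), np.median(los))
```

Run on data like this, the mean lands noticeably above the median, mirroring the 4.4-day versus 2.0-day gap seen in the real sample.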
That's because, you may recall, the lowest possible value in these data was a length of stay of one day, and the 25th percentile is also one day, because we have so many repeated values. So that lower hinge, or 25th percentile, also stands in for the lowest observed value in the dataset. With this box representing the middle 50 percent of the observations, we already see that lack of symmetry and some of that skew, because the 75th percentile is much further from the median than the 25th percentile is. If we go up here, this line, which before marked the largest value in our dataset, now, you'll notice, has dots beyond it representing other data points. It now stands for the largest non-outlier value. The points beyond it are what are called outliers, and you can see there are some values considered to be extreme, the largest of which is 40 days. You might say, "Well, what makes something an outlier versus not, and how does the boxplot algorithm decide whether there are any outliers and, if so, what constitutes them?" So let's look at the criteria for outlier cutoffs on a boxplot: how it decides whether there are outliers and, if so, which values constitute them. I don't want you to memorize this, and you won't actually use it either, the computer will take care of it, but I just want to give you some sense of where this comes from. It all has to do with the size of the box, the distance between the two hinges. The criterion for large outliers, and maybe there are none if no value in the dataset meets it, is values that are greater than the upper hinge of the data, the 75th percentile, plus 1.5 times what's called the interquartile range: the difference in value between the 75th percentile and the 25th percentile.
On the flip side, the criterion is complementary on the low end of things: small outliers are deemed to be values that are less than the lower hinge, or 25th percentile, minus 1.5 times the same interquartile range. Certainly, in the blood pressure data there were no outliers, small or large, and so none were represented in that boxplot. In the length of stay data, however, there weren't any small outliers, but there certainly were some large outliers as per this criterion. So, we've seen two types of distribution from real datasets; let's just talk about some common shapes we might see. Certainly, for things like blood pressures and heights in populations, it's not uncommon to see distributions that are roughly symmetric and bell-shaped. When we look at the histogram, the most common values are in the center of the distribution, and the further away we get from that center, the less common they are. If we were to fit a smooth curve to this, that is, if we were able to take larger and larger samples until the histogram smoothed out, you'd expect to see something like this: roughly symmetric and bell-shaped. Some numerical characteristics of this shape include that the mean is generally equal to the median, which in turn equals something we haven't defined yet called the mode, the most frequently occurring value. So everything is centered around that spike in the middle, which corresponds to the mean and median. Certainly, things aren't always exactly symmetric, but they may be approximately so, like the blood pressure data we looked at. Another common type of distribution, perhaps more so than symmetric and bell-shaped, is right-skewed data. This shows up for things like length of stay, health care costs, CD4 counts, etc. This is where the majority of our values are small relative to the remainder, so the most frequently occurring values are small relative to the less frequently occurring values, which tend to be larger or are the outliers.
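The two outlier fences just described, the upper and lower hinges plus or minus 1.5 times the interquartile range, can be written out in a short function. This is a sketch of that standard boxplot rule, applied to a made-up toy sample:

```python
import numpy as np

def boxplot_outliers(x):
    """Return values outside the 1.5 * IQR fences used by standard boxplots."""
    x = np.asarray(x)
    q25, q75 = np.percentile(x, [25, 75])   # lower and upper hinges
    iqr = q75 - q25                          # interquartile range
    lower = q25 - 1.5 * iqr                  # small-outlier cutoff
    upper = q75 + 1.5 * iqr                  # large-outlier cutoff
    return x[(x < lower) | (x > upper)]

# Toy sample: most values cluster low, one value (40) sits far above
# the upper fence, so the boxplot would draw it as a separate dot
data = [1, 1, 1, 2, 2, 3, 4, 5, 6, 40]
print(boxplot_outliers(data))  # [40]
```

If no value falls outside the fences, as with the blood pressure data, the function returns an empty array and the boxplot shows no outlier points.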
So, in data like these, the mean tends to be larger, sometimes by a fair amount, than the sample median, because the mean is disproportionately affected by these larger outlying values. Sometimes we'll see left or negatively skewed data; an example of this might be test scores in a class where the majority of people do well and some people have more difficulty. Here the majority of values are larger than the minority of extreme or outlier values, so the tail of this distribution extends to the left, or in the negative direction on a number line, and this is called left or negatively skewed data. For these data, the mean tends to be less than the median, because it is again disproportionately affected by this tail of values that are smaller than the majority. Another type of distribution, which is actually symmetric but only sometimes pops up, is the uniform distribution. This is where all values occur with similar frequency across the range of the data, so if you were to make a histogram, the bars would all be of similar height across the entire data range. This is certainly symmetric, but it certainly doesn't look bell-shaped like the blood pressure data we saw, or like the general idea of a symmetric bell-shaped distribution. We'll identify and point these shapes out as they come up with real datasets throughout the rest of the course. So, in summary, histograms and boxplots are useful visual tools for characterizing the shape of a data distribution above and beyond the information given by summary statistics. Relatively common shapes for samples of continuous data measures include symmetric and bell-shaped, right skewed, left skewed, and, occasionally and perhaps less commonly, uniform.
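Two of the claims just summarized, that a left skew pulls the mean below the median and that uniform data gives similar-height bars, can be checked with simulated data. Both samples here are hypothetical illustrations:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical left-skewed sample: exam-like scores capped near 100,
# with a tail of lower scores pulling the mean below the median
scores = 100 - np.ceil(rng.exponential(scale=5.0, size=10000))
print(scores.mean(), np.median(scores))

# Uniform sample: every bin catches a similar share of observations,
# so a histogram shows bars of roughly equal height
uniform = rng.uniform(0, 100, size=10000)
counts, _ = np.histogram(uniform, bins=10)
print(counts)
```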