So in this section, we're going to explore some visual displays to

complement our numerical summary measures for continuous data.

So upon completion of this lecture section,

you will be able to utilize histograms and boxplots to

visualize the distribution of samples of continuous data measures,

identify key summary statistics on any boxplot and name and

describe basic characteristics of some common distribution shapes for continuous data.

So, we're going to look at histograms and boxplots because while means,
standard deviations, and percentile values are helpful for numerically
characterizing certain aspects of the distribution of a set of continuous data,
they do not necessarily tell the whole story of a data distribution,
and sometimes it would be nice to be able to easily
ascertain and compare differences in the shapes of distributions for different samples.
While we can get at that to some degree, as we'll see later on,
by comparing some numerical measures,
the visual displays are much more helpful for such purposes.

So histograms are a way of displaying the distribution

of a set of data by charting the number or percentage of

observations whose values fall within

predefined numerical ranges and then plotting

these numbers or percentages in a bar graph.

Boxplots are graphics that are a little less detailed than
the aforementioned histograms but that display key characteristics of a dataset.

These are especially nice tools as we'll see later in this set

of lectures for comparing data from multiple samples visually in one graphic.

So let's start with some data on a clinical sample of

113 men randomly selected from a clinical population.

We have measurements or data on their systolic blood pressures.

So we have 113 systolic blood pressures,

we can create a histogram by taking the range of these blood pressure measurements,
breaking that range into bins of equal width,
counting the number of the 113 observations whose blood pressure values fall within each bin,

and then plotting this number or the relative frequency of

observations that fall within each bin as a bar graph.
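The binning-and-counting procedure just described can be sketched in a few lines of Python. The blood pressure values here are made up for illustration, not the actual 113 measurements from the lecture:

```python
def bin_counts(values, bin_start, bin_width, n_bins):
    """Count how many observations fall in each half-open bin
    [bin_start + i*width, bin_start + (i+1)*width) -- the counts
    a histogram plots as bar heights."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - bin_start) // bin_width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical systolic blood pressures (mmHg)
sbp = [85, 95, 104, 108, 112, 118, 121, 124, 126, 133, 141]
counts = bin_counts(sbp, bin_start=80, bin_width=10, n_bins=7)
print(counts)  # [1, 1, 2, 2, 3, 1, 1]
```

In practice a plotting library (for example, `matplotlib.pyplot.hist`) does this binning and the bar drawing for you; the sketch just shows what the bar heights mean.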

So here's a basic no frills histogram of these 113 measurements.

So, I put this into a computer,

I let it do all the work for me,

and it decided to break the data into equal bins

of width 10 millimeters of mercury and what it's

doing is counting the number of observations that fall within each of

these 10 millimeter of mercury ranges and plotting them in bar graph form.

So for example, there is only one person in

the dataset whose blood pressure falls between 80 and 90 millimeters of mercury.

Another single person whose blood pressure is between 90 and 100,

and then when we get to 100 between 100 and 110,

we see a lot more persons and we can see then that the distribution or the number of

persons falling into ranges increases and then after we get to the range from 120 to 130,

it starts decreasing again.

So we get a sense that these data are somewhat, though not perfectly,
symmetrically distributed around a central high point here.
The sample mean for these data is
123.6 millimeters of mercury, and the median is similar in value at 123 millimeters of mercury;
those values fall roughly here.

So these data appear somewhat symmetric around that measure of center, and
the numerical counterpart of the variability we can visualize here
is the standard deviation of 12.9 millimeters of mercury.

A final thing we can do to make the histograms comparable across
samples of different sizes is, instead of presenting
the raw observed number of observations that fall within
each bin of blood pressure measurements,
to present the relative proportion
of the total sample size on the vertical axis.

Since we have 113 men in the sample and 113 is close to the value of 100,

these percentages will be very similar to the actual observed numbers,

but that won't always be the case depending on the sample size.
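That conversion from raw counts to relative frequencies is just a division by the sample size; a minimal sketch, with made-up counts:

```python
def relative_freq(counts):
    """Convert per-bin counts to percentages of the total sample size,
    so histograms from samples of different sizes are comparable."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]

# e.g., 5 observations split 1 / 1 / 3 across three bins
pcts = relative_freq([1, 1, 3])
print(pcts)  # [20.0, 20.0, 60.0]
```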

If we go back, we can tell the computer, "Look,

I actually want the bins to be of different width."

For example, I don't want 10 millimeter mercury bins,

I want to make them wider.

I want to make them 20 millimeter mercury bins,

and what will that do to our graph?

Well, our graph still captures maybe the essence of what we

saw in the previous displays but more crudely so.

We don't get as much detail about the spread of

the values and how they're centered in the middle

here and how the proportion of them decreases the farther we get away from that center.

It's still here, but not in as much detail as before. You can think of it this way:
the wider we make these bins, the more detail we lose, and the ultimate bin width
would be one spanning the entire data
range, in which case we'd have one bar at
100 percent, which would not be informative at all about
the shape of the underlying sample distribution of values.

On the other side, we can go overkill in

the other direction and make the bins really small.

In this case, I told the computer to make

the histogram with bins of one millimeter of mercury wide,

and we still see the shape that we saw when we had bins
10 millimeters of mercury wide, but perhaps in
more detail than is necessary, at least for this single sample.

If we wanted to compare this histogram to histograms from
much larger samples of systolic blood pressures
that we wanted to present with narrow bars,
then we might scale these to the same bin width for comparison purposes; but on its
own, just for this sample of 113 measurements, perhaps this is too granular a presentation.
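The bin-width trade-off just described can be seen directly by binning the same (hypothetical) values at several widths: wide bins collapse detail, and one bin spanning the whole range tells us nothing about shape.

```python
def bin_counts(values, bin_start, bin_width, n_bins):
    """Count observations per half-open bin of the given width."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - bin_start) // bin_width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Six hypothetical blood pressures spanning 80-140 mmHg
sbp = [85, 95, 104, 112, 121, 133]
narrow = bin_counts(sbp, 80, 10, 6)   # [1, 1, 1, 1, 1, 1] -- most detail
wide = bin_counts(sbp, 80, 30, 2)     # [3, 3]             -- cruder
one_bar = bin_counts(sbp, 80, 60, 1)  # [6]                -- no shape left
print(narrow, wide, one_bar)
```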

Now, let's look at a boxplot presentation of the same data.

We will see this is a less detailed visual graphic to describe the distribution.

So certainly, it's probably pretty obvious looking at this why it's called a boxplot:
the shaded piece in the middle is what's sometimes called the box,
and this box contains information about
the middle 50 percent of the values in our dataset.

So the bold line between the two sides of the box
is the sample median, and
the lower side of the box is the 25th percentile,
sometimes called the lower hinge.

The upper side of the box is the 75th percentile,

sometimes called the upper hinge.

Just to note here, notice that the distance between the median and

the 25th percentile is

similar to the distance between the median and the 75th percentile.

So, at least the middle 50 percent of

these data are roughly symmetric around that median value.

As we trace this dotted line down from the lower hinge to
the solid horizontal line here on the bottom of the graph,
this line represents the smallest value in the dataset, the minimum.
Lastly, if we trace from the 75th percentile or upper hinge up to the horizontal line at the top,
we get the largest value, the maximum.

Notice that the distances of the median from
the largest value and the smallest value, respectively, are similar as well.
Again, even though there are only five points of data represented in this visual display,
we get a sense of the symmetry, though not in as much detail as we saw in that histogram.
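Those five statistics are often called the five-number summary, and Python's standard library can compute them. A minimal sketch with made-up data; note that `statistics.quantiles` uses its default "exclusive" method, so the hinges may differ slightly from what other software draws on its boxplots:

```python
import statistics

def five_number_summary(values):
    """Minimum, lower hinge (Q1), median, upper hinge (Q3), maximum --
    the five statistics drawn in a basic boxplot."""
    q1, med, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return min(values), q1, med, q3, max(values)

print(five_number_summary([1, 2, 3, 4, 5, 6, 7]))  # (1, 2.0, 4.0, 6.0, 7)
```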

Let's look at another sample and compare these displays again.

So, here is the Heritage health length of stay data,

where we have 12,928 length of stay values for persons who had

at least one day inpatient length of stay in the year 2011.

But if you look at this histogram,

it looks decidedly different than those blood pressure measurements.

It certainly shows that the majority of observations, over 40 percent
(I've labeled this axis in terms of the relative percentage of observations),
fall in this first bin. This bin actually represents the value one;
it spans zero to one for binning purposes,
but all the values in it are ones.

Then you see that as we move away from that most frequent value,
the proportion of observations
decreases as we go up in value on the length of stay axis.

So, this distribution is heavily what we might call skewed,
and in the language of visual displays,
it's right skewed or positively skewed, because the extreme values,
the values less frequently occurring,
are much larger than the more commonly occurring values.

The mean of this sample is 4.4 days versus a median of 2.0 days.

The reason there is such a discrepancy between the mean and the median
is that the mean is heavily influenced by these outlying or extreme values;
they tend to bring it up in
value, whereas the median is only affected by their relative position.

Then the sample standard deviation of these 12,000 plus values is 4.7 days.
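The mean-versus-median behavior just described is easy to demonstrate. The lengths of stay below are hypothetical, right-skewed values (mostly short stays plus a few long ones), not the actual Heritage Health data:

```python
import statistics

# Hypothetical right-skewed lengths of stay, in days
los = [1, 1, 1, 1, 2, 2, 3, 4, 10, 25]
m = statistics.mean(los)      # pulled upward by the extreme values
med = statistics.median(los)  # only the position of the extremes matters
print(m, med)  # 5.0 2.0
```

Replacing the 25-day stay with a 250-day stay would raise the mean dramatically but leave the median at 2.0, which is exactly why the mean exceeds the median in right-skewed data.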

Again, how would we characterize this distribution?

Again, the majority of values are
small, with a smaller percentage of larger values.
So, the larger values are pulling out the tail, or
the skew, and we call this right skewed,
or positively skewed.