A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

238 ratings

From this lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

In this lecture section, we will discuss and

define two commonly used graphical techniques, histograms and boxplots,

that add richness to the story and

understanding about the distribution of continuous data,

above and beyond the standard summary statistics

of the mean, standard deviation and median.

In this section, we'll build on what we did previously

in terms of quantifying certain key characteristics of the distributions

of continuous data.

Here we'll look at how to explore them visually.

So upon completion of this section, you should be able to utilize

histograms and boxplots, which we will define, to visualize

the distribution of samples of continuous data;

identify key summary statistics that appear on the boxplot; and name and

describe some basic characteristics of some

common distribution shapes for continuous data samples.

So let's start by talking about histograms and

boxplots, two very commonly used visual display tools.

So, in the last section, we looked at quantifying certain

key characteristics of data like the mean and standard deviation.

Well, means, standard deviations, and percentile values don't

tell the whole story, if you will, of data distributions.

There are also potential differences in shape: where is the bulk of the values

concentrated, are there extreme values, et cetera.

We're not necessarily going to be able to

absorb that just by looking at summary statistics.

So, histograms for example, are a way of displaying the distribution of a set of

data by charting the number or percentage of

observations, whose values fall within predefined numerical ranges.

And boxplots are graphics that display key characteristics

of a dataset.

And these, we'll see, are especially nice tools for

comparing data from multiple samples visually in one graphic.

And we'll define both of these as we move forward and we'll

bring back some of the examples we've looked at in the last section.

So again, our data on this clinical population of

113 men randomly selected, how could we create a histogram?

We've already summarized it with the

mean, standard deviation et cetera.

Well, what we could do, for example, to create a histogram, or let the

computer do and then interpret the results

of, is take the blood pressure data,

take the range of it, and then create bins of equal width.

Say, for example, five millimeters in width.

And then we count the number of the 113

observations, whose blood pressure values fall within each bin.

So for example, if one of our bins

goes from 110 millimeters of mercury to 115, we count the number of the 113 men

who have a blood pressure value between 110 and 115, and that would populate that bin.

And then we plot the number or relative frequency of

observations, that fall within each bin as a bar graph.
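As a rough sketch of the binning steps just described, here is one way to count observations per equal-width bin. The blood pressure values are hypothetical, not the lecture's 113 observations, and the function name `histogram_counts` is made up for illustration.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each equal-width bin [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)  # index of the bin containing v
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


# Hypothetical systolic blood pressures in mmHg (assumed values).
sbp = [98, 112, 114, 118, 121, 123, 123, 126, 127, 131, 134, 142]
counts = histogram_counts(sbp, start=95, width=5, n_bins=10)

# Print a crude text histogram: one '#' per observation in the bin.
for i, c in enumerate(counts):
    lo = 95 + 5 * i
    print(f"{lo:3d}-{lo + 5:3d}: {'#' * c}")
```

In this sketch each bin includes its lower edge and excludes its upper edge; statistical packages may handle edges differently.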

So let's get to this and look at an example.

So what we have here is a histogram

of systolic blood pressures from a random clinical

sample of 113 men, what we've been talking about.

Here are the summary statistics over here that we showed

how to compute and looked at in the last section.

And then this graphic really gives us a little more insight as to what's going on.

You can see these bins here, and this is

computer chosen, but they're about five millimeters in width.

And what these bars track, the height of

each bar, is the number of observations whose values

fall within that bin. So for the bin that runs from roughly 139 or 140

millimeters to 144, there are roughly ten men who fall in that category.

And so you can see that most of the data values are populated

in here, but then there are some values sort of off to the sides as well.

Another way to present this, which is more commonly used: instead of putting

the absolute number of observations on the

vertical axis, we put the relative percentage.

So for example, between 125 or so and 130, almost 25%,

roughly 22 or 23%, of the men in this dataset have values in that range.

A very small percentage have values

between 90 and 95.
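Moving from counts to relative percentages is just a division by the total. A minimal sketch, assuming hypothetical bin counts (the name `relative_frequencies` is illustrative only):

```python
def relative_frequencies(counts):
    """Convert bin counts to percentages of the total number of observations."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


# Hypothetical bin counts for 12 observations (assumed values).
counts = [1, 0, 0, 2, 1, 3, 2, 2, 0, 1]
percents = relative_frequencies(counts)

for c, p in zip(counts, percents):
    print(f"count={c}  percent={p:.1f}%")
```

The shape of the histogram is the same either way; only the scale on the vertical axis changes.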

We could muck around if we were playing on the

computer and change the width of the bars. For example,

instead of making them five millimeters of mercury wide, we could

make them 25 millimeters of mercury wide and plot that.

You can see we lose a fair amount of

the nuance in the data distribution by doing this.

Of course, we could make the bar 100 millimeters of mercury wide, and

our histogram would be one bar that hits 100%; it wouldn't be very insightful.

Of course, we could play around in the opposite direction and

make the bars really thin, maybe one millimeter of

mercury, and we'd get something that was probably a

little more detailed than we needed to understand the distribution.

So

[INAUDIBLE]

to handle this.

The general rule is to take the number of observations,

take the square root of the number of observations, the square root of n,

and that's how many bins we'll have. So we have 113 observations.

The square root is

between 10 and 11, so the computer figures out we need 10 or 11 bins,

takes the range, divides it by that number, and then creates bins of that width.

You don't have to know how to do that.

It's just some insight as to what the optimal visual

presentation is, and most computer packages will give that to you.
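The square root rule can be sketched in a few lines. This assumes the bin count is rounded to the nearest whole number (packages differ on rounding) and uses a hypothetical data range of 90 to 145 mmHg, since the exact minimum and maximum are not stated here.

```python
import math


def sqrt_rule_bins(n, data_min, data_max):
    """Return (number of bins, bin width) using the square-root-of-n rule."""
    n_bins = round(math.sqrt(n))          # sqrt(113) is about 10.63
    width = (data_max - data_min) / n_bins
    return n_bins, width


# 113 observations; the range endpoints are assumed for illustration.
n_bins, width = sqrt_rule_bins(113, data_min=90, data_max=145)
print(f"bins: {n_bins}, width: {width} mmHg")
```

With these assumed endpoints the rule gives bins about five millimeters of mercury wide, in line with the computer-chosen histogram.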

And if you were critiquing a paper, you might advise somebody

to go with the standard bins as opposed to the really wide

[INAUDIBLE]

if you saw something like that.

Okay, let's look at the data on the systolic blood pressures

from a random clinical sample of 113 men in another form.

This is what's called the boxplot.

This is less detailed than a histogram but still gives us

some insight as to what's going on with these data.

So what we have on this graphic here, well, here's

the box in the middle, that's where the name comes from.

And it's sometimes called the box

and whiskers plot, because these things

out here are called the whiskers. But let's just dissect this for a moment.

The box in the center of the picture, gives us

information about sort of the middle 50% of the data.

So this bar right here, between the

two sides of the box, represents the median

for the dataset.

So that median of 123 millimeters of mercury is this line

right here. You can't draw right on top of it.

And then the sides of the box, the lower side and the upper side, are percentiles.

The lower side is the 25th percentile.

And the upper side is the 75th percentile.

So this box in the middle contains information about the middle 50%

of values, those values between the 25th and the 75th percentile.

Then we have these lines here.

These horizontal lines here at the bottom and the top,

respectively, are the smallest value in the dataset and the largest.
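The five numbers this boxplot displays can be computed directly. A minimal sketch using Python's standard `statistics` module, with hypothetical blood pressure values chosen so the median is 123; percentile conventions vary between packages, so the quartiles from other software may differ slightly.

```python
import statistics


def five_number_summary(values):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical systolic blood pressures in mmHg (assumed values).
sbp = [100, 110, 115, 120, 123, 126, 131, 136, 146]
summary = five_number_summary(sbp)
print(summary)
```

In this made-up sample the quartiles sit 8 mmHg on either side of the median and the extremes sit 23 mmHg away, which is the kind of balance a symmetric boxplot shows.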

So looking at this picture, what do you see?

Well, let's go back to the median.

It really splits that box in half, right?

So the 25th percentile and the 75th percentile are

relatively similar distances above and below the median.

And the largest and smallest values are

similar distances from their closest sides of the box.

So what we're getting a sense of here is

what we might call a pretty symmetric dataset.

And the histogram gave some insight to that too, that the

distribution of values is pretty symmetric around the center of the data.

Let's go on and look at another example that may look different.

So here's our length of stay data.

And let's look at a histogram of the length

of stay claims from the Heritage Health dataset.

Again, these are claims with an inpatient stay of at least one day.

And this is

data on 12,900 claims, and remember, we saw this had a mean of 4.3 days, a standard

deviation of 4.9 days, but a median of 2 days, and I asked you to think about that.

Well, look at the picture of these data.

Let's try and make some sense of this. Here's the

histogram. What do you notice about this?

Well, look at this first bar here.

This reaches over 40%, and if you look at it carefully, it's

the bin from one to 2 days. It's a little hard to

see on this scale, but what this suggests is that 40% of

the data values, the length of stay values, are between one and 2 days.

And then what happens

after that?

Well, the relative frequency decreases pretty quickly.

So what do we see here?

We see something where the bulk of the data is concentrated in smaller values.

But then there's an increasingly smaller percentage

of larger length of stay values.

So how might we characterize this?

Well, this type of distribution is not symmetric.

Clearly, you can see that if we looked at the center,

either measured by the mean, about right here, or the median,

in neither case is the data similarly distributed on both sides.

So we wouldn't call this symmetric. The term we'd use for this,

as you'll notice, is sort of a skewed distribution,

with a tail kind of heading out positively, or to the right.

We would characterize this as what's

called a right-skewed or positively skewed distribution.

And I had asked you in the previous

section to compare the mean and median values.

What do you notice about this?

Well, the median of 2 days is

substantially less than the mean of 4.3 days.

And something we talked

about as well is that the mean is more

likely to be influenced by extreme values than the median,

because the median depends only on the relative positioning of the values,

while the mean actually depends on the values themselves.

Well, what we have here

is a mean larger than the median, because the mean

is being influenced by these larger length of stay values.

And so this is called right skewed, or positively skewed, because

the extreme values relative to the rest of the data

are larger, and we end up with a mean that's larger than the median.
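To see why the mean reacts to extreme values while the median does not, here is a small sketch with made-up length-of-stay values (not the Heritage Health data). Stretching only the largest stay moves the mean a lot and leaves the median where it was.

```python
import statistics

# Hypothetical length-of-stay values in days (assumed, right skewed).
stays = [1, 1, 1, 2, 2, 2, 3, 5, 9, 30]
mean_before = statistics.mean(stays)
median_before = statistics.median(stays)

# Make the single largest stay ten times longer and recompute.
stays_extreme = stays[:-1] + [300]
mean_after = statistics.mean(stays_extreme)
median_after = statistics.median(stays_extreme)

print(f"before: mean={mean_before}, median={median_before}")
print(f"after:  mean={mean_after}, median={median_after}")
```

The median only cares that the largest value is still the largest; the mean uses its actual size.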

So here's what it looks like when we

put right skewed data into a boxplot representation.

It looks very different than what we saw with that blood pressure data.

First of all, look at the box relative to the entire picture.

It's only a small portion down here, and it's sort of scrunched down.

And furthermore, look at the sides of the box versus the line between them.

The line between them is that median

of 2 days, and notice it's a lot closer to the lower side,

the 25th percentile of one day, than

it is to the 75th percentile of 5 days.

So this gives us some insight into that skew we were seeing: there's the same

percentage of observations on this side of the median as there is on this side,

but the range taken on by the upper portion

is substantially wider.

And so that's giving some insight into that right skewness, that the

positive values are more spread out than the values closer to 0.

And then these values out here are identified by the boxplot as

[INAUDIBLE]

outliers, things that are outside the, quote unquote, normal pattern of data.

And we'll show you how that's determined in a minute.

But when I look at this picture, what I see

is that there's a lot more space given to the

larger values than there is to the smaller values, which

sort of jars my memory and reminds me that the

least likely values, ironically, are

the ones that get the most play in this visual,

the ones with the smallest percentage of representation.

And so this gives me some insight that

the tail is positively skewed or right skewed, that

most of the data is concentrated on smaller

values, and then there are these extremes that are larger.

So here's sort of the anatomy without my chicken scratch writing.

This is a little more fine-tuned, but in any boxplot this is the

general rule.

As we talked about before, when focusing on the

box, the line between the two sides represents the median.

And of the two sides of the box, the lower

side represents the 25th percentile, the upper side the 75th percentile.

This line above the 75th percentile (this arrow is

a little too long here) represents the largest non-outlying value.

This is sometimes called the upper tail. In

the blood pressure data, we had no outliers,

so this was not only the largest non-outlying

value, it was the largest value in the dataset.

On the low end, you see we don't have one of these lower tails, and it turns out that,

because of the nature of these data, the smallest value of one day also

corresponds to the 25th percentile, because there

were so many observations that had one day.

And so this 25th percentile is also the smallest value in the dataset.

There are no small outliers; all the outliers are concentrated

on the upper side.

So, how does it determine what's an outlier and what's not?

Well, there's an algorithm that does this, and here are the cutoffs for this.

It designates boundaries such that values beyond them in

one direction or the other are considered to be outliers.

And the rule that's used is that large outliers are values that

are larger than the upper hinge, which is also the 75th percentile,

plus 1.5 times what's called the interquartile range, or IQR.

And that's just the difference between the 75th percentile and the 25th percentile.

So this rule says if you have data points that are greater than

the 75th percentile plus 1.5 times this difference between the

75th and 25th, flag them as outliers, positive outliers, large outliers.

Similarly, if you have smaller values that are less

than the lower hinge, which is the 25th percentile,

minus 1.5 times

this interquartile range, then these would

be considered negative or smaller outliers.

We didn't see any of these in this dataset, but we could in future datasets.
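The outlier rule can be worked through with the length of stay quartiles quoted in this lecture (25th percentile of 1 day, 75th percentile of 5 days). The function names and the sample stays below are illustrative assumptions, not part of the course materials.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower cutoff, upper cutoff) using the k * IQR rule."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3, k=1.5):
    """Return the values that fall beyond either cutoff."""
    lower, upper = outlier_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Quartiles from the length of stay data: 1 day and 5 days, so IQR = 4.
lower, upper = outlier_fences(1, 5)
print(f"lower cutoff: {lower} days, upper cutoff: {upper} days")

# Hypothetical stays (assumed values) to show which ones get flagged.
print(flag_outliers([1, 2, 5, 11, 12, 30], q1=1, q3=5))
```

The lower cutoff works out to a negative number of days, which no stay can reach, so with these quartiles the rule can only flag large outliers.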

I don't want you to know this rule, what I want you to be able to do,

and we'll get more practice with this as

we go along, is to be able to identify outliers

visually, as given by the boxplot.

So where do these cutoffs for determining outliers come from?

Well, historically they're traced to the bell-shaped

Gaussian curve, where these cutoffs fall about 2.7 standard deviations from the center,

near the 0.35th and 99.65th percentiles; that's a distribution

we'll deal with in the next lecture set.

But there's no hard and fast rule as to why these are the cutoffs.

They're just things that can be

universally applied to datasets from different

populations and establish a standard for what are considered to be outlying values.

Here's one more type of dataset that we'll see

sometimes, which has slightly different characteristics than the previous two.

It's daily temperature measurements in Fahrenheit, collected over

a 14-year period in the city of Philadelphia in the U.S. This

is pretty rich data, actually, and this

histogram shows the distribution of these temperature values.

So, this is a city on the East Coast of the United States.

So we

experience all seasons here, you know, anywhere from winter to summer.

And so there would be expected to be a fair amount

of variation in the temperatures if they were accumulated over multiple years.

The mean temperature across these 14 years

of data was 54.3 degrees Fahrenheit.

The median was slightly higher, 55.3 degrees Fahrenheit.

But there is a fair amount of

variability, as we'd expect if we were using all seasons of data, and the standard

deviation is 17.8 degrees Fahrenheit. So, how do you characterize this shape?

Well, let's compare and contrast it to what we saw before.

The middle of the data, whether measured by the median or the mean, is

on the order of about 54 or 55, so those two are close in value.

And, it does look like perhaps roughly half the data is

concentrated above, well, certainly above the median, and the other half below.

But

[INAUDIBLE]

the distribution of those halves is slightly different.

On the upper half, there's a little bit of increase as

we get to the higher values, picking up the summer months, ostensibly.

But on the whole, it's relatively consistent.

On the lower half though, we see that, as we

get into lower and lower temperatures, they become less frequent.

Most of the data is concentrated between, say, 35 or 40 degrees and 80 degrees.

And then there's some, if you will, less likely colder temperatures on this side.

Some would characterize the shape as being left skewed, meaning that the least likely

values, the less frequent values, are lower than the bulk of the values.

Numerically, this left skew sort of shows up. Remember, the mean is more influenced

by extremes, large or small, and these less frequent but extreme

points are small relative to the rest. The mean here

is less than the median, albeit slightly, but if we had more of a left

skew, we'd expect more of a discrepancy between the

mean and the median, with the mean being smaller.

If we look at this in a boxplot representation,

there certainly aren't any outliers that appear in the

[UNKNOWN].

Even with that left skew, those values don't fall

outside of the range considered to be regular.

But if you look at the middle 50%, the box, the 25th percentile

and the 75th percentile are pretty much equidistant from the median here.

But you can see that the largest value is closer to the median than

the smallest value is to the median.

So that gives some insight into the left-tailed nature of this.

This is not an extreme left-tailed

distribution; we don't have any small outliers

[UNKNOWN]

as per that algorithm.

But we have a little bit of a tendency

[UNKNOWN]

towards the small values for some extremes.

So, some common distribution shapes that we'll see with continuous data.

Well, something that's not all that common but will actually play a big role in

what we do further down the line, are

distributions that are relatively symmetric and bell shaped.

And the blood pressure data that we first looked at

isn't a perfect example of this, but is something getting at that.

Symmetric and bell shaped, well, would be something like this.

That's a bell,

or at least my representation

of this, and what we'd have with these types

of data is that they're perfectly balanced: the median

splits the middle, obviously, and what we have is

that the mean and the median are equal in value,

because the distribution on both sides of the median is equivalent,

so the mean balances out to be that same point.

It's uncommon to get a perfectly

symmetric, bell-shaped distribution in real life,

but some things do have some evidence of

this characteristic, blood pressure being an example.

Something we will see commonly with public health and medical data is what

we saw with that length of stay data, something called right skewed,

where the majority of the data is low in value, but there are some positive extremes.

And in these types of data, the right skew

comes from the fact that the extreme values are larger

than the majority of the values.

And what we'll see almost unequivocally with data like this is that the mean is

going to be greater than the median,

because it's influenced by those extreme positive values.

That tends to pull the mean up.

We also saw that temperature data, which sort of fit this next criterion.

If we were looking at things like course test scores and that sort of

thing, we might see a left skewed distribution,

where we have a majority of the data

concentrated around higher values, and then there are some small extremes.

And in this type of data, the tail, if

you will, where the least frequent measures are, is to

the left and smaller than the bulk of the data.

So this might be called left skewed, and in these types of

datasets, the mean would be expected to be less than the median,

because the mean is influenced by those smaller values, which tend to pull it down.
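The mean versus median comparison for these shapes can be written as a small rule of thumb. This is only a heuristic sketch: the function name `skew_hint` and its `tolerance` parameter are hypothetical, and a histogram or boxplot remains the better check on shape.

```python
def skew_hint(mean, median, tolerance=0.0):
    """Suggest a skew direction from how the mean compares with the median."""
    if mean - median > tolerance:
        return "right skewed"       # mean pulled up by large extremes
    if median - mean > tolerance:
        return "left skewed"        # mean pulled down by small extremes
    return "roughly symmetric"


# Summary statistics quoted in this lecture.
print(skew_hint(mean=4.3, median=2.0))     # length of stay, in days
print(skew_hint(mean=54.3, median=55.3))   # Philadelphia temperatures, in F

# Hypothetical equal mean and median, as in a perfectly symmetric sample.
print(skew_hint(mean=123.0, median=123.0))
```

A nonzero `tolerance` would let small gaps, like the one degree gap in the temperature data, count as roughly symmetric; what tolerance makes sense depends on the units and spread of the data.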

And then finally, a type of distribution that we'll occasionally see, and it's

just kind of fun to name it, is something called the uniform distribution.

And we could argue that the

Philadelphia temperature data was close to it,

with a small appearance of the left tail, but this is where we'd expect

[UNKNOWN]

all values in our dataset to have similar frequency.

Something where there is no clear tail, where things didn't diminish as

you got away from the center, but all had similar frequency.

An example like this, and it sounds kind

of artificial, would be if you were rolling dice.

If you were rolling a 12-sided die, for example, from some fantasy game,

and you kept rolling it and then did a histogram

of the number of times you got one, two, three, four,

et cetera, we'd expect that to be roughly uniform.
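The die-rolling example is easy to simulate. A minimal sketch, assuming a fair 12-sided die and an arbitrary choice of 12,000 rolls; the seed is fixed only so the run is repeatable.

```python
import random
from collections import Counter


def roll_counts(n_rolls, sides=12, seed=0):
    """Roll a fair die n_rolls times and count how often each face appears."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, sides) for _ in range(n_rolls))


counts = roll_counts(12_000)

# Each face should land near 1,000 of the 12,000 rolls, with no tail.
for face in range(1, 13):
    print(f"{face:2d}: {counts[face]}")
```

The counts will not be exactly equal, but no face should stand out, which is what a roughly uniform histogram looks like.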

All right, so we're going to have a lot of fun looking at visuals

as we go throughout this course, and this is our first scratch.

But I want you to take away the message that histograms

and boxplots are useful visual tools

for characterizing the shape of data distributions,

above and beyond the information given by summary statistics.

And relatively common shapes for samples

of continuous data measures include symmetric and bell shaped, right skewed,

left skewed, and uniform. In the next section, we'll

look at visuals,

we'll look at sample statistics and we'll

start talking about the role that sample

size has on these quantities.