is from the sample mean.

So the further on average any single observation is from the sample mean,

the more variable the values are around that sample mean.

So let's just compute one by hand, just to show the operations. Again, the math here is something you're all capable of, but we'll generally leave this to the computer so we can work on the harder things, like interpreting this and using it in other situations.

So we had these five systolic blood pressures as we had before.

The sample mean, we said was 99 millimeters of mercury.

So, in order to get the sample standard deviation, we first sum up the squared differences between each observation and this mean of 99.

So for example, 120 minus 99 is 21,

so we take that and square it.

The next observation, 80, minus 99 gives a difference of negative 19.

Now you can maybe start to see why we square these things first: if we added up the differences unsquared, we'd be adding together positive and negative differences, and the result would actually always be zero.

You can show that the sum of unsquared differences from the sample mean always equals zero.

So we wouldn't be able to quantify

the variability if we didn't square these things before adding.

So if we add these up, the total squared distance of these five measures from that sample mean of 99 is 1,020 millimeters of mercury squared.

We take that and divide it by the sample size less one, which is essentially averaging; again, just think of this as averaging.

We get an average squared distance of 255 millimeters of mercury squared.

If we take the square root of that to get the standard deviation, it turns out to be 15.97, or approximately 16, millimeters of mercury.

So on average, any one of these five data points falls within plus or minus 16 millimeters of mercury of the mean of this sample.
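The steps above can be sketched in a few lines of code. A caveat: the lecture only names the values 120 and 80 explicitly, so the full five-value list below is an assumption chosen to reproduce the stated results (mean of 99, total squared distance of 1,020).

```python
import math

# Hypothetical five systolic blood pressures (mmHg); 120 and 80 appear in
# the lecture, and the other three are assumed values consistent with the
# stated mean of 99 and summed squared deviations of 1,020
bp = [120, 80, 90, 110, 95]

n = len(bp)
xbar = sum(bp) / n                       # sample mean: 99.0 mmHg
ss = sum((x - xbar) ** 2 for x in bp)    # summed squared deviations: 1020.0 mmHg^2
s2 = ss / (n - 1)                        # sample variance: 255.0 mmHg^2
s = math.sqrt(s2)                        # sample standard deviation: ~15.97 mmHg
print(xbar, ss, s2, round(s, 2))
```

Note the division by n minus one rather than n; the lecture returns to the reason for that shortly.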

So a couple of things to note about standard deviation.

First of all, the more variability there is in the sample of data, the larger the value of s, and we said this before: what s measures is this variability, or spread, of individual sample values around the sample mean.

It can only equal zero if there's no variability, that is, if all n sample observations have the same value.

The units of the sample standard deviation are the same as the units of the data measurements in the sample, for example, millimeters of mercury.

The standard deviation can also be abbreviated SD or sd, but we'll generally use s to represent the sample standard deviation for our purposes.

S squared, the sample variance, is the best estimate of an underlying population quantity we can't directly observe: the variance of all values in the population.

Hence s is the best estimate of the population standard deviation, and we will represent this unknowable quantity, which measures the variation of values in the population from which we took the sample, with the Greek letter sigma.

So we want sigma, but we can only estimate it via s.

I just want to talk very briefly about why we don't directly average, but divide by n minus one instead of n. Really, this has very little influence on the results in larger samples, but it could make a difference in smaller samples.

But here's the reason.

What we'd really like to know is the average distance of our sample points from the true population mean, mu.

That's a poorly drawn mu, but that's what we want to know.

But we don't know Mu,

we only have a sample and we can only estimate it through x bar.

So we replace Mu with x bar,

but x bar doesn't depend on all points in the population like Mu does.

It only depends on the points we have in our sample.

So it can be shown mathematically that this squared distance of our points from x bar is

systematically smaller than it would be if we

replaced x bar with the true population mean.

Slightly smaller.

So in order to correct for that, when we compute the variance and standard deviation, we divide by a number slightly smaller than the sample size, to get a slightly larger value than if we divided by n alone.

We're just correcting slightly for something that can be shown with rather complex mathematics: the numerator underestimates what we'd really like to know by a slight amount.

Again, the impact of this on

the estimated standard deviation is minimal especially in larger samples.
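That underestimation can be seen in a quick simulation. This is a hypothetical sketch, not part of the lecture: we draw many small samples from a population whose variance is known exactly, and compare the two ways of averaging the squared distances.

```python
import random

random.seed(1)

# Population: uniform on [0, 10], whose true variance is 10^2 / 12
true_var = 100 / 12
n, reps = 5, 20_000

avg_div_n, avg_div_n1 = 0.0, 0.0
for _ in range(reps):
    sample = [random.uniform(0, 10) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    avg_div_n += ss / n          # dividing by n: systematically too small
    avg_div_n1 += ss / (n - 1)   # dividing by n - 1: corrected

avg_div_n /= reps
avg_div_n1 /= reps
print(true_var, avg_div_n, avg_div_n1)
# the divide-by-n average falls below the true variance of about 8.33,
# while the divide-by-(n - 1) average lands close to it
```

With a sample size of only 5 the gap is noticeable; with hundreds of observations the two divisors give nearly identical answers, which is the "minimal impact in larger samples" point above.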

Certainly, we would want to relegate

these computations to the computers especially when we have large samples.

So if I had 113 blood pressure measurements and I wanted to compute the mean and standard deviation, here are the first 50 values.

I'm only showing this so you can see that these data are relatively large compared to what we did before; we'd need two more slides to show the rest of the sample of 113.

While we could compute these summary measures by hand,

it would be quite cumbersome.

So, we're going to let the computer tell us what these are.

The estimated mean, the sample mean for these data, is 123.6 millimeters of mercury. That's an estimate of the true underlying population mean.

The sample standard deviation is 12.9 millimeters of mercury.

That's an estimate of the underlying population variability in

blood pressures in all persons in this population,

and the sample median is 123 millimeters of mercury.

So something else that will help us quantify characteristics of a distribution of continuous data is percentiles.

We've already seen an example of a percentile, the sample median,

and let's just talk about defining these in situations where

our samples have all unique values and where we have some repeated values.

So, let's first look at the case with all unique values.

In general, if our sample values are unique, the pth sample percentile is that value in the sample such that p percent of the sample values are less than or equal to this value, and 100 minus p percent are greater than this value.

So, that's why the median is the 50th percentile,

50 percent of the values are less than or equal to

the median and the remaining 50 percent are greater than the median.

The 25th percentile, for example, would be the value such that 25 percent of the sample data points are less than or equal to it, and the remaining 75 percent of the values are greater.

These could be done by hand; we could line up our values from smallest to largest and read these off, but again, it's much easier and more effective to have them done by the computer.
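As a sketch of the definition just given, a percentile for unique values can be found by sorting and counting. This is hypothetical illustration code, using a bare counting rule rather than the interpolation rules real statistical packages apply.

```python
# Hypothetical helper: for a sample with unique values, return the value
# such that at least p percent of the observations are less than or equal
# to it (software uses fancier interpolation rules; this is the bare idea)
def percentile(values, p):
    xs = sorted(values)
    n = len(xs)
    for i, x in enumerate(xs):
        if 100 * (i + 1) / n >= p:   # percent of values at or below x
            return x
    return xs[-1]

data = [101, 107, 113, 120, 126, 131, 140]   # seven unique made-up values
print(percentile(data, 50))   # the middle value, 120: the sample median
print(percentile(data, 25))   # 107
```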

If not all sample values are unique,

in other words some are repeated,

then the pth sample percentile is that value in the sample such that p percent of the sample values are less than or equal to this value, and 100 minus p percent are greater.

We'll show an example of this: sometimes we have data where a single value is repeated a large number of times, and so we can have multiple percentiles with that same value.
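A small hypothetical example (made-up numbers, not the actual claims data discussed later) illustrates how repeats make several percentiles coincide:

```python
import math

# Made-up data in the spirit of length-of-stay counts: many 1-day values
data = sorted([1] * 5 + [2, 2, 2, 3, 4, 7, 20])   # 12 observations

def pctl(xs, p):
    # smallest value with at least p percent of observations at or below it
    k = math.ceil(p / 100 * len(xs))
    return xs[max(k, 1) - 1]

print(pctl(data, 2.5), pctl(data, 25))   # both 1: repeats share percentiles
print(pctl(data, 50), pctl(data, 97.5))  # 2 and 20
```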

So, let's talk about percentiles in that systolic blood pressure data set we looked at, taken from a random sample of 113 adult men drawn from a larger clinical population.

The 10th percentile for these 113 measurements is 107 millimeters of mercury, meaning that approximately 10 percent of the men in the sample have systolic blood pressures less than or equal to 107, and 100 minus 10 percent, or 90 percent, have systolic blood pressures greater than that value.

I'm going to generally say "greater than or equal to" here; it sounds kind of counterintuitive to have "or equal to" on both ends, but that's just to cover the situations where we have repeated values in our samples.

So, 90 percent of the men have systolic blood pressures greater than or equal to 107 millimeters of mercury.

The 75th percentile for

these 113 blood pressure measurements is 132 millimeters of mercury.

Meaning that approximately 75 percent of the men in the sample have

systolic blood pressures less than or equal to 132 millimeters of mercury,

and 25 percent of the men have systolic blood pressures greater than or equal to 132 millimeters of mercury.

Here are some other percentiles taken from the sample as well, including the 2.5th percentile, the value such that only 2.5 percent of the sample values are less than it and the remaining 97.5 percent are greater; that's 100.7 millimeters of mercury.

We have other examples here as well, like the 25th percentile and the 97.5th percentile.

Let's look at one more example.

We're going to see different results because here

in these data we have a lot of repeated values.

These are length-of-stay claims at Heritage Health Plan for all enrollees who had an inpatient hospital stay of at least one day in the year 2011.

There are 12,928 claims, and I'm clearly not going to show each individual value in these slides, because that would take a lot of PowerPoint.

I'm just showing you the first 50 measurements to give you a sense of the flavor of these data, and you'll notice that there are a lot of repeated values: a lot of persons in the sample ended up having a one-day inpatient length of stay, the minimum.

Others had two days but that was repeated as well.

So there were a lot of repeated values in these data.

So let's talk about the summary statistics of center first, and then spread.

The mean for these data is 4.3 days; contrast that with the median, which is only two days, and maybe you can start thinking about why that may be (we'll investigate it later in another section).

The sample standard deviation is 4.9 days.
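One hypothetical illustration of that mean-versus-median gap (made-up numbers, not the actual claims): a few very long stays pull the mean upward, while the median depends only on the middle of the ordered data.

```python
import statistics

# Made-up stays (days): mostly short, with one extreme 30-day stay
stays = [1, 1, 1, 2, 2, 3, 30]

print(statistics.mean(stays))    # pulled upward by the 30-day outlier
print(statistics.median(stays))  # 2: unaffected by how extreme the tail is
```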

So here are some of the sample percentiles for the 12,928 claims.

Remember, in order to be in this data,

you had to have an inpatient stay of at least one day.

So the minimum value in the data set is one, and it turns out that the 2.5th percentile is also one: 2.5 percent of the data values are less than or equal to one, and the remaining 97.5 percent are greater than or equal to one.

Because there are so many repeated ones in this data set, one is also the 25th percentile, so that tells us right off the bat that at least 25 percent of persons in the sample had a one-day length of stay in 2011.

So one tells us about the 25th percentile too: 25 percent of the data values are less than or equal to one, and the remaining 75 percent are greater than or equal to it.

Then we see the median is actually two days.

So now we can tell that somewhere between 25 and 50 percent of the sample had a value of one, because the percentile value jumps to two by the time we get to the median.

But notice that the difference between the median and the 2.5th percentile is only one day.

Contrast that with the difference between this median of two days and the 97.5th percentile, the value on the other end, the complement (not quite the mirror image) of the 2.5th percentile: the value such that 97.5 percent of the data values are less than or equal to it and 2.5 percent are greater. That's 20 days.

So this spread in the second half of the data set is a lot larger than in the first half.

So just think about that and when we start looking at visual displays,

we will see how this plays out.

So in general, summary measures that can be computed for a sample of

continuous data include the mean, standard deviation,

the median or 50th percentile,

as well as other percentiles,

and these sample-based estimates are the best estimates of

unknown underlying population quantities.

For example, x-bar, our sample mean, is our best estimate, based on the data we have, of the underlying population mean.

S, our sample standard deviation, is the best estimate of the population standard deviation.

Soon, about halfway through the course,

we'll start talking about how to address the uncertainty in

these estimates as they relate to the unknown thing they're estimating.

Certainly the sample mean is an imperfect estimate of the population

mean because it's only based on the values in

our sample and not all values in the population.

In the next section,

we'll continue with how to look at and

investigate continuous data by introducing some visual summary measures as well.