0:06
Greetings, and welcome back!
This is going to be our third lecture
in the Statistical Reasoning One series, and today
we're going to talk about a famous, some
might say infamous, distribution called the normal distribution.
Many of you have heard of the normal distribution.
You may even be familiar with some of its key characteristics.
It's bell shaped, it's symmetric around its center.
And the tails die off quickly.
In other words, most of the observations that are described by a normal distribution fall close to the center of the distribution.
Now we're going to spend a little time trying to understand the properties of it.
You might say but why are we doing that?
Is it because most data that we'll see
in public health and medicine is normally distributed?
And the answer is no, not necessarily.
We'll see, for some types of data, types of
continuous data, the normal distribution is a reasonable working model.
And we can use its properties to better flesh
out the distribution of the data from the
population from which the sample we have is taken.
But in other situations, these properties that are specific to the normal curve aren't going to gain us much ground.
However, when we focus on our next unit, statistical estimation of confidence regions and inference, the normal distribution is going to prove invaluable.
So far, and up to this point, including in this
lecture set, we take our estimates from samples as is, that is, we look at
a sample mean and we say this is
our best estimate of some underlying population truth.
And we know it may not be exactly equal to that unknown underlying truth.
Well, in the next set of lectures post
this, we're going to get into the idea of, can
we put uncertainty bounds on this estimate to
get a range of possibilities for this unknown truth?
And that's
where the normal distribution is going to be invaluable.
1:50
So, what we're going to do, there's three sections, didactic sections
to this lecture and then one set of practice problems.
And what we're going to do is first
define the properties of the normal distribution, and
show how we can define it perfectly just by knowing its center, or its mean, and the spread of the values under the distribution, the standard deviation.
And there's some general rules about it.
In the second section,
section B, we're going to look at some
data examples where the normal distribution is a
reasonable model for the individual observations in
the population from which our samples are taken.
And show how we can exploit the
properties to better understand the underlying population distribution.
In Section C we're going to see, well you know, sometimes in many cases the
data we get in public health and medicine is not well described by this perfectly
symmetric, theoretical distribution.
And we'll see that if we actually apply the properties of said normal distribution to these data, we're going to end up with useless results.
And it's just to remind you that, you
know, some things are only applicable under certain conditions.
So in this section, this first section A,
we're going to actually define some characteristics of the
normal curve, and hopefully upon completion of the
lecture, you'll be able to actually describe the basic
properties of the normal curve.
3:21
And hopefully feel comfortable, or be on your way to feeling comfortable, working with standard normal tables.
Now, I'll be honest, I'm not going to require a lot of that in this course, and it's not a big focus, but it gives you some appreciation for how quickly the likelihood of observations falls off under a normal curve the further you get away from the center.
4:33
Normal distributions are uniquely defined by two qualities.
All we need to know, if we know data comes from a normal distribution and we want to completely characterize the distribution of that data, is its mean and standard deviation.
I'll generically represent these with the symbols mu for the mean and sigma for the standard deviation, to indicate population-level quantities.
There are literally an infinite number of possible normal curves, one for every possible combination of mean and standard deviation.
So here I'm showing some pictures of curves
that have different means and different standard deviations.
You could keep adding to this ad infinitum, making them wider or skinnier, at different centers.
But you'll notice that these three different examples I have here, and any other examples of a normal curve, would all have the same proportional structure; that is, they're centered about their mean and evenly distributed about it.
Okay, so for this next slide, I'm just showing
you this, not to scare you off of math.
Many of you like math and are comfortable with it.
But if you're not comfortable with it, don't worry about this.
I just want to show you sort of the beauty of mathematics, and I get
to have the opportunity to have it over my shoulder, which is always a nice perk.
But I want to show you, for any given value
under a normal curve, the proportion of values that take
on that number, the probability of observing a value
equal to that is described by this function here.
And this function is sort of a math major's dream in some sense; it's got all kinds of symbols and notation in it.
Two of the symbols I want to point out are the pi symbol, which actually represents a constant, a number roughly 3.14, and also e, which also represents a constant, roughly 2.718, sometimes called the natural constant.
So, once we deal with those constants, the only other two symbols in here are mu and sigma, and the only reason I'm showing this equation to you is to make you appreciate that this curve is completely specified.
We can figure out where a particular value falls under the curve, only by
knowing that value and the mean and
standard deviation of the distribution it comes from.
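(For reference, the function being described on the slide is the normal density. The slide itself isn't reproduced in the transcript, so this is my own reconstruction of the standard form:)

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
```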
6:40
So again, all normal distributions, regardless of the mean and
standard deviation values, have the same structural properties, mean equals median
equals mode, the values are symmetrically distributed about the mean,
and values closer to the mean are more frequent, or likely, than values further from the mean.
The entire distribution of values described by a normal distribution, again, I've said this before, can be completely specified by knowing just the mean and standard deviation.
Since all normal distributions have the same structural properties, we can use a reference distribution called the standard normal distribution to elaborate on some of these properties.
We'll define the standard normal distribution in a minute, and in Section B we'll show that any normal distribution, with any mean and standard deviation, can easily be scaled to this reference distribution.
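(The rescaling alluded to here is the usual z-score transformation, previewed here as my own addition ahead of Section B. Under the assumption that x comes from a normal distribution with mean mu and standard deviation sigma:)

```latex
z = \frac{x - \mu}{\sigma}
```

(The resulting z then follows the standard normal distribution, with mean 0 and standard deviation 1.)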
So, here's the first one.
This is just something you are going to have to memorize.
The only characteristics of the normal curve that I
want you to take to heart, and you can
always look these up in the table, but hopefully
you will be able to internalize these pretty quickly.
So, I'm just telling you, and I'll show you where this comes from, but
if I'm dealing with a normal distribution,
regardless of the mean and the standard deviation, if I'm standing at the mean, in the center, and I go one standard deviation in either direction from that center, I
encapsulate 68% of the observations under that curve.
So this shaded red area here is 68% of the entire curve.
8:14
There are several ways to actually state this.
We could say for data whose distribution is approximately normal, 68%
of the observations fall within one standard deviation of the mean.
We could also say the same thing, rephrased in terms of a probability: the probability that any randomly selected value is within one standard deviation of the mean is 0.68, or 68%.
Those are two ways of saying the same thing.
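(As a quick check, not part of the lecture itself: the 68% figure can be reproduced with scipy, which is my own illustration since the course works from printed tables.)

```python
from scipy.stats import norm  # standard normal distribution: mean 0, standard deviation 1

# Proportion of normally distributed values within one standard deviation of the mean
within_one_sd = norm.cdf(1) - norm.cdf(-1)
print(round(within_one_sd, 4))  # 0.6827, i.e. roughly 68%
```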
8:41
Let's get to the second part of this rule.
This is one you may be familiar with, but 95% of the
observations under a normal curve fall
within two standard deviations of the mean.
Truthfully, it's 1.96, and computers will use that number.
You can look it up in a table, but for quick and dirty, back-of-the-envelope computations, it's absolutely fine to use two.
So if we're staring at the mean of a normal curve, and we go
two standard deviations above and two standard
deviations below, we'll capture 95% of the curve.
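(To see why the lecture distinguishes 1.96 from 2, here is another scipy illustration of my own, not part of the lecture materials.)

```python
from scipy.stats import norm

# The exact multiplier (1.96) versus the quick-and-dirty one (2)
print(round(norm.cdf(1.96) - norm.cdf(-1.96), 4))  # 0.95   -> exactly 95%
print(round(norm.cdf(2.00) - norm.cdf(-2.00), 4))  # 0.9545 -> about 95.5%, quoted later in the lecture
```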
9:49
Okay, let's just think about this for a moment.
So, what would that mean about the proportion of observations that are more than two standard deviations above the mean?
Let's think about this. Can we use the logic of the normal curve and its symmetry?
We're encapsulating 95% in the middle, this red area, and the entire area of the curve is 100%.
So that means what we haven't covered with this middle territory encapsulates a total of 5% of the observations.
Right?
And because the curve is symmetric, that 5% that we have not captured in that middle 95% should distribute itself equally on both sides.
So, the proportion of observations that are greater than two
standard deviations above the mean, is half of 5% or 2.5%.
Similarly, the proportion of observations under a normal curve that falls more than
two standard deviations below the mean, is also 2.5%.
So just to recap what we've done with the number
two standard deviations in reference to the normal curve, we've
said that the middle 95% of values that take on
a normal distribution fall within two standard deviations of the mean.
They fall within the interval, mean minus two
standard deviations, and the mean plus two standard deviations.
If we were randomly sampling data points from data that followed a normal distribution, the probability of getting a value in this interval would be 95%.
So what does this mean in terms of percentiles?
Well, let's look at this.
The lower endpoint here is the point that 2.5% of the values under the distribution are smaller than, and hence 97.5% are greater than.
So, the 2.5th percentile of the normal curve
is equal to the mean minus two standard deviations.
Conversely, 97.5% of the values are smaller than, and 2.5% are greater than, this upper endpoint of mu plus two standard deviations.
So, this upper end point is the 97.5th percentile of the normal curve.
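(Those percentile statements can be checked with the inverse of the normal cumulative distribution function. This is my own scipy sketch, using a made-up mean and standard deviation purely for illustration.)

```python
from scipy.stats import norm

mu, sigma = 100, 15  # hypothetical mean and standard deviation, purely for illustration

# The 2.5th and 97.5th percentiles sit about two standard deviations from the mean
print(round(norm.ppf(0.025, loc=mu, scale=sigma), 1))  # 70.6  ~= mu - 1.96*sigma
print(round(norm.ppf(0.975, loc=mu, scale=sigma), 1))  # 129.4 ~= mu + 1.96*sigma
```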
12:25
Let's again look at the 68% part for a minute.
We know that 68% of the observations in the normal
distribution are in the interval within one standard deviation, 68% here.
Okay.
So let's just think about this for a minute.
What percentage of the observations that follow a normal distribution are more than one standard deviation above the mean?
This can also be phrased as: what is the probability that an individual observation is more than one standard deviation above the mean of a normal distribution?
Well, what are we talking about here? We're talking about this area here.
Let's just see if we can figure that out using the logic of the normal curve.
Well, we know from the rule I've given you that
68% fall in that red area within one standard deviation.
So the total outside of that red area, on either side, is a hundred percent minus 68%, which is 32%.
But of course, by symmetry of the normal curve, those two tails here, which together contain 32% of the distribution, will contain it equally.
So in each of these tails, roughly 32% divided by 2, or 16%, of the observations fall. So 16%
of the observations that take on a normal
distribution are beyond one standard deviation above the mean.
If we wanted to look at what percentage of observations in the normal distribution fall more than one standard deviation away from the mean in either direction, either above the mean or below the mean, we've sort of already answered that: the percentage beyond one standard deviation in either direction is 100% minus 68%, or that 32%.
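(The same tail arithmetic, spelled out step by step as code; again my own illustration, not part of the lecture.)

```python
# Tail percentages implied by the 68% rule, using only arithmetic and symmetry
within_one_sd = 68                         # percent of observations within one SD of the mean
outside_both_tails = 100 - within_one_sd   # 32 percent, split across the two tails
each_tail = outside_both_tails / 2         # 16 percent above, and 16 percent below
print(outside_both_tails, each_tail)       # 32 16.0
```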
So where did this rule come from? Did I just make this up?
14:14
No, in other words, how did I know these relationships?
Okay, well, it turns out that there are tables that exist for this.
Actually figuring this out would be difficult to do with that formula I presented to you before, because it would require integrating over ranges of the data.
So it's nice that people have come up with tables
for us to look at, and you might say, well, that's great John, this
rule is useful, but what about other percentages under the curve, for other standard deviation distances from the mean, you know?
Not just one, two, or three.
Well, all the information I quoted and much more
can be found in what's called the standard normal table.
So here is an example of the standard normal table.
This is just maybe the greatest hits of a table just to get you thinking about it.
The tables represent themselves in different ways, depending on where you find one.
And we'll speak more to this in a minute.
But you'll notice I've got three columns here.
Most of them will only include one of these descriptions, but we've shown already, by the logic of the normal curve, its symmetry, et cetera, that if we're given one piece of information about a standard deviation and the area under the curve, we can figure out the other bits by employing that logic.
So this table here actually has three columns, most tables
won't be so ornate, but in this first
column, it just shows you what percentage falls within
Z standard deviations of the mean, Z is this column here,
so for example, if we're looking at one standard deviation, we can
see that 68%, I mean I rounded it in my lecture,
but it's really 68.3% fall within one standard deviation of the mean.
Another way of saying the same thing is, if we were to go to one standard deviation above the mean and look at the percentage of observations that are greater than that, it would be that 16% whose logic we've already shown before.
And if we were to actually look at the percentage that falls outside of the middle one-standard-deviation range, it would be that 32% we showed before.
Similarly, you could check this for different numbers; you could see that for two standard deviations, well, truthfully, it's 95.5% that fall within.
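(If you'd rather generate those "greatest hits" rows than look them up, a small scipy sketch of my own, not part of the course materials, reproduces all three columns.)

```python
from scipy.stats import norm

# Recreate the table's three columns: % within z SDs, % above +z SDs, % outside +/- z SDs
for z in (1, 2, 3):
    within = norm.cdf(z) - norm.cdf(-z)   # area within z standard deviations of the mean
    above = norm.sf(z)                    # upper tail beyond +z standard deviations
    outside = 2 * above                   # both tails, by symmetry
    print(f"z={z}: within={within:.2%}, above={above:.2%}, outside={outside:.2%}")
# z=1: within=68.27%, above=15.87%, outside=31.73%
# z=2: within=95.45%, above=2.28%, outside=4.55%
# z=3: within=99.73%, above=0.13%, outside=0.27%
```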
16:34
So, you know, you might say, well, where do I find one of these standard normal tables, in case I need to do this?
Well, reminder, we're in an online course, which means you have access to what?
[LAUGH]
The internet.
And if you type in standard normal table
on the internet, you can get multiple hits.
You can even find calculators where you can plug in a number of
standard deviations of interest and it will tell you something about the curve.
And so I'm just going to show you two examples of tables, just to work through the logic of these.
You can also find these in the back of any statistics textbook.
But there are many ways to tell the story of this same curve, and
so you have to pay heed to what a particular table is telling you.
So this is one I went and searched on standard normal tables.
This is one of the hits I got. Here is the URL.
Hopefully it's still working by the time you look
at this lecture, but if not, there's multiple other ones.
Clearly you can't see this on the slide, so I'm going to zoom in a little bit.
Move over to the side and just let me show you what it's telling you with reference to the values in the table.
For any given, and you have to pay attention to the fine print here, for any given standard deviation, what this table is going to tell us is the percentage of the observations that fall between the mean and that number of standard deviations above the mean.
17:50
So it's not telling us about the full range within that number of standard deviations, only part of what we looked at before.
But we'll see if we can use that to map to numbers we're comfortable with.
So for example if I go to this table, let's just
see what it tells us about some of the numbers we know.
Let's go with 1.96, just to be exact when we're looking at this table.
So the way to follow this is, you see it's got this column here called Z,
18:15
And this goes in tenths, intervals of one tenth of a number, and then this other set of columns goes in hundredths.
And the way to piece this together is that, for the root of the number we're looking for, 1.96, we're going to look for the value 1.9 in this column here, and then where it intersects the value of 0.06 in the column over here.
So if we look at this, if we go 1.96, and I'm just going to
circle this and highlight it, the value we're given here for 1.96 is 0.4750.
So let's see if that makes sense.
Remember, 1.96 is the number we say literally cuts off 95% in the middle.
Does this information jibe with what I've told you?
Well, let's see what we're looking at.
What are we looking at with this number?
If we go 1.96 standard deviations above the mean, or you can think of it as roughly two, that cuts off 0.475, or 47.5%, of that curve.
Does that jibe with what we've said? Well, let's think about this.
By the symmetry of the normal curve, if we actually go 1.96 standard deviations below the mean, what should that area be?
That should also be 0.475, or 47.5%, and the sum of these two is 0.95, or 95%.
Okay, so if we have this one piece of information
about the upper half, encapsulated by going that far above the
mean, we have the story for the rest of the curve.
And now we can also figure out, you know, that the remaining 5% in the tail areas is equally distributed, so this would be 2.5%.
And you can do this for any value.
And we'll look at some other values in our next lecture set.
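(To connect this first table's convention to the cumulative probabilities a computer reports, here is another scipy check of my own.)

```python
from scipy.stats import norm

# This table reports the area between the mean and +z; a cumulative probability also
# includes the 50% below the mean, so subtract 0.5 to match the table entry.
z = 1.96
print(round(norm.cdf(z) - 0.5, 4))        # 0.475, the table entry for 1.96
print(round(2 * (norm.cdf(z) - 0.5), 4))  # 0.95, the middle 95% by symmetry
```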
And then here's another exhibit,
here's another table we got. Okay, just by searching the interwebs.
And you'll see, and you're going to want to pay attention to what the table's telling you, that for a given standard deviation value, it's telling you something slightly different than the previous table.
Instead of telling you how much falls between that and the mean, it tells you what percentage of the curve, or values, are below that number of standard deviations away from the mean.
Okay?
So let's see if we could use this.
So here's a first snippet, but it's still kind of hard to read.
So let's cut to this, just to give you an example of what we'd be looking at.
From this table here, we have that Z column.
21:00
So, this is similar to the previous table, and then there's the hundredths unit here, so we can get down to the second decimal place in terms of standard deviations.
So,
I'm just blowing this up here because we want to look at the story of one standard deviation.
I'm just showing you a piece of the table where the Z column is at negative 1, because this table only has negative values in it, and the hundredths column is 0.
So, what is this telling us?
It tells us that, if we are under a normal curve, and here's the mean, and we go to one standard deviation below the mean, then the percentage of observations that are further away in the negative direction, that is, more than one standard deviation below the mean, is 15.87%, or what we'll round to 16%.
And so once we have that, we have
the entire story of one standard deviation, right?
We know by symmetry that if we went one standard deviation above the mean, we'd also get 16%, so the total area in these two portions is 32%, which must mean that what's in the middle is 100% minus 32%, or 68%.
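(And this second table's convention corresponds directly to the cumulative probability itself; again a scipy check of my own, not from the lecture.)

```python
from scipy.stats import norm

# This table reports the area below z (the cumulative probability), here for z = -1.00
below = norm.cdf(-1.0)
print(round(below, 4))          # 0.1587, about 16% fall more than one SD below the mean
print(round(1 - 2 * below, 4))  # 0.6827, the middle 68% recovered by symmetry
```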
Okay, so let's just think about this for a minute.
22:20
What have we covered here?
We've defined the normal curve, showing that it's symmetric and bell shaped.
We've shown that it can be completely defined by knowing its mean and standard deviation, and that most of the observations that follow the normal distribution fall within two standard deviations of the center.
Although the tails go on infinitely, the majority
of the data, 95%, is encapsulated within that range.
We've also gone into looking at how to use a table to find these respective ranges and cutoffs, and we'll do some more examples of that in the subsequent portions.