In this lecture we're going to talk about what it means to log data and what impact

that has when you do things like take arithmetic means of logged data and create

confidence intervals and this sort of thing.

So we'll talk about logs, we'll talk about the geometric mean, which is intrinsically

related to taking logs of data and taking arithmetic means.

And we'll talk about the geometric mean and its relationship with the law of large

numbers and the central limit theorem. And then we'll go through some of the

existing techniques that we've already gone through, like creating T confidence

intervals. But just go over how they're interpreted

with respect to log data. And then we'll finish talking about the

log normal distribution. So just to remind everyone a little bit

about logs. Log base b, of the number x, is the number

y, such that b to the y, equals x. And log base b of one is always going to

be zero because b to the zero is going to equal one.

And then log base b always travels to minus infinity as x travels to

zero, at least when b is bigger than one. And then, you know, for the class, we've

always been writing just log when the base is e, Euler's number, and

sometimes people tend to write ln in that case.

There's basically only three bases for logs that people ever use.

Base E has a lot of nice mathematical properties.

Base ten is nice because then log speaks of orders of magnitude, right.

Log base ten of ten is one. Log base ten of 100 is two.

Log base ten of 1,000 is three and so on. And then log base two is often very useful

as well. Because two is a smaller base than ten,

each unit is a doubling rather than a tenfold change, so it's often useful. And just to remind everyone, log AB is log

A plus log B, log A raised to the B power is B log A, and log A divided by B is log

A minus log B. In other words, log turns multiplication

into addition, division into subtraction, and powers into multiplication.
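
A minimal sketch in Python of the three rules just listed, plus the base ten and base two examples. The numbers a and b are arbitrary positive values picked for illustration, not anything from the lecture.

```python
import math

# a and b are arbitrary positive numbers chosen for illustration.
a, b = 8.0, 2.0

# log(ab) = log(a) + log(b)
product_rule = math.isclose(math.log(a * b), math.log(a) + math.log(b))
# log(a^b) = b * log(a)
power_rule = math.isclose(math.log(a ** b), b * math.log(a))
# log(a/b) = log(a) - log(b)
quotient_rule = math.isclose(math.log(a / b), math.log(a) - math.log(b))

# Base ten counts orders of magnitude; base two counts doublings.
orders_of_magnitude = [math.log10(x) for x in (10, 100, 1000)]
doublings = math.log2(8)

print(product_rule, power_rule, quotient_rule)
print(orders_of_magnitude, doublings)
```

Here math.log is the natural log, matching the convention in this class of writing log for base e.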

So hopefully none of this is news to you. So that's sort of the mathematical

properties of the log. But statistically, why do we take logs of

data? The most common reason to take a log of

data is if the data is sort of skewed high.

And what I mean by that, for example, is incomes are a great traditional example of

things that tend to be skewed high. You have a lot of people making very

little money and a handful of people making a lot of money.

And so that distribution looks like a hump towards zero.

And it spreads out with a long tail towards high values.

And so you might take logs of

income data to try and make it look more bell-shaped.

This occurs frequently in biostatistics for example, health expenditures.

A lot of people tend to spend very little on healthcare until healthcare becomes a

problem, then they spend a lot. So distributions like healthcare

expenditures and other things like that tend to be right skewed especially because

they're bounded from below by zero. In settings where errors are feasibly

multiplicative, like when dealing with things like concentrations

and rates, then it's natural to take logs because then it turns

that multiplication into addition. Whenever you're considering ratios,

it's useful to take logs because then you have differences rather than

ratios. And then if you are dealing with something

where you're not so concerned about the specific number but more concerned about

orders of magnitude, say something like using log base ten, as an example if you

are considering astronomical distances you might be just more concerned with the

orders of magnitude rather than the actual specific number.

Then you might often take logs. And then, counts are often logged if your

data are the number of say, infections at a hospital or something like that.

You might log data like that. Notice, if you have several counts

and one of them is zero, then you have a problem with taking logs, because the log of zero is undefined, so you have to

come up with some solution for that. So, let me talk a little bit about the

geometric mean. I say the sample geometric

mean just so we're using the same terminology as when we talked about the sample

mean of data. The sample geometric mean of a data set X1

to XN is the product of the observations, the product over i from one to n of Xi,

raised to the one over n power.

And notice, if all the Xs are positive, which is generally the case if you're

thinking about geometric means. Then the log of the geometric mean, is

then the arithmetic mean. One over N summation log XI.

So it's the arithmetic mean of the log of observations.
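
A small sketch of that identity in Python. The data values are hypothetical positive numbers made up for illustration, and geometric_mean is a helper name introduced here, not something from the lecture.

```python
import math
import statistics

def geometric_mean(xs):
    """Product of the observations raised to the 1/n power (all xs must be > 0)."""
    return math.prod(xs) ** (1.0 / len(xs))

# Hypothetical positive data, made up for illustration.
xs = [1.5, 2.0, 8.0, 3.2, 0.7]

gm = geometric_mean(xs)
mean_of_logs = statistics.fmean([math.log(x) for x in xs])
am = statistics.fmean(xs)

# The log of the geometric mean equals the arithmetic mean of the logs.
print(math.log(gm), mean_of_logs)
# The geometric mean never exceeds the arithmetic mean.
print(gm, am)
```

For long data sets the exp-of-mean-of-logs form is the safer way to compute this, since a raw product can overflow or underflow; Python's own statistics.geometric_mean gives the same value.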

So let me just repeat that. The log of the geometric mean is the

arithmetic mean of the log observations. So because the log of the geometric mean

is the arithmetic mean of the log observations,

on the log scale the geometric mean has all the properties that we already

talked about associated with sample arithmetic means.

So the law of large numbers applies and the central limit theorem applies.

I have a parenthetical here that asks, under what assumptions?

Under whatever assumptions the arithmetic mean needs

for the law of large numbers and the central limit theorem.

The geometric mean is always less than or equal to the sample arithmetic

mean, just as a general property. So let me just give you a quick example of

using geometric means. In some domains people use the geometric

mean so frequently that when they talk about the mean, they're referring to

the geometric mean, not the arithmetic mean.

So, as an example, when what you're thinking about is inherently

multiplicative, you would often think of the geometric mean.

So suppose that in a population of interest, the prevalence of the disease

rose two percent one year. And then the next year it fell one percent, and

then the year after that it rose two percent. And then it rose one percent.

Well, if you were thinking about the end prevalence of the disease given the

starting prevalence, inherently you would multiply the starting prevalence by 1.02,

by 0.99, by 1.02, and by 1.01, and you would get the ending prevalence.

So the geometric mean of this collection of increases and decreases would be a

relevant quantity to study. And so that geometric mean would be the

product of them raised to the one-fourth power.

And what's interesting about that, then, is this.

If you take the starting prevalence and you multiply it by 1.02, 0.99, 1.02 and

1.01, you get the ending prevalence after the four years.

If you take the starting prevalence and multiply it by the geometric mean

four times, you get the same number. So that's what the geometric mean is,

considered by analogy with the arithmetic mean.

The arithmetic mean is the number you would have to add four times to get the

same total as adding the four numbers. The geometric mean is the one you have to

multiply four times to get the same result. And that's why it's useful.

So if you're thinking about things that are inherently multiplicative, like

percent increases and decreases, then it's common to take the geometric

mean. So if you work in certain financial

sectors for example, if they say mean, they are referring to the geometric mean

because it's obviously more natural to talk about.

Okay, so just rehashing some of these points.

The geometric mean here works out to about 1.01. Multiplying the initial prevalence by 1.01 to the fourth power,

in other words multiplying it by 1.01 four times, is the same thing as multiplying by

the original four numbers in sequence. So 1.01 is the constant factor by which

you would need to multiply the initial prevalence each year to achieve the same

overall increase or decrease in prevalence over a four year period.

Take that in contrast to the arithmetic mean, which is the amount you

would have to add each time to achieve the same total.

And in this case, it's clear to me at least that the geometric mean makes a lot

more sense than the arithmetic mean to talk about.
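
The prevalence example can be checked numerically with a short Python sketch. The four yearly factors are the ones from the example; the starting prevalence of 0.10 is a hypothetical value chosen only so there is something to multiply.

```python
import math

# Yearly multipliers from the example: up 2%, down 1%, up 2%, up 1%.
factors = [1.02, 0.99, 1.02, 1.01]
# Hypothetical starting prevalence, not given in the lecture.
start = 0.10

# Ending prevalence from applying the four changes in sequence.
end = start * math.prod(factors)

# Geometric mean: the product raised to the one-fourth power.
gm = math.prod(factors) ** (1 / len(factors))
# Arithmetic mean of the same factors, for comparison.
am = sum(factors) / len(factors)

# Applying the geometric mean four times reproduces the ending prevalence.
print(end, start * gm ** 4)
print(round(gm, 4), round(am, 4))
```

The geometric mean comes out a shade under 1.01, which is why the lecture quotes 1.01 as the constant yearly factor.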

On the next slide, I was thinking about how to explain this.

I googled the geometric mean and the arithmetic mean and I found this great

example at the University of Toronto's website and it has a really fun geometric

interpretation of the arithmetic mean and the geometric mean.

So if you have a rectangle and A and B are the lengths of the sides of a rectangle.

Then the arithmetic mean A plus B over two is the length of the sides of the square

that has the same perimeter as the rectangle.

The geometric mean A times B to the one-half is the length of the side of the

square that has the same area. So if you're sort of interested in

multiplicative things like areas, you want the geometric mean of the sides.

If you're interested in additive things like perimeters, you want the arithmetic

means. I thought that was really cool

when I read that. So, back to statistics, the log of the

sample geometric mean is just an average. And so provided the expected value of log

x exists, that average has to converge, just by

the law of large numbers, to what I'm defining here as mu, equal to the expected

value of log x. So remember, the log of the geometric mean

is itself just an arithmetic mean. We have the law of large numbers, which

tells us what the arithmetic mean converges to.

It converges to the population mean. So therefore, the log geometric mean

converges to the expected value of log x, where x is a draw from the original

population on the natural unit scale, not on the log scale.

Therefore, if you want to know what the geometric mean converges to,

the geometric mean is the exponential of the log of the geometric mean,

of course, because e to the log x is x. So the geometric mean converges to e to the

expected value of log x. It would be nice if that worked out to

be the expected value of x. But it doesn't, because in e to the expected

value of log x, the exponential can't move inside the expected value.

So we get something which is exactly e to

the mu. And this is not the expected value of x.

And this quantity, e to the mu, which is the exponential of the expected value of log

x, doesn't really have a name.

But I like to call it the population geometric mean cuz you know, if you have

that the sample arithmetic mean converges to the population mean.

The sample variance converges to the population variance.

The sample median converges to the population median.

Then by that logic, the sample geometric mean should converge to something that's

called the population geometric mean. So, I'm going to call it that.

I don't see that too often in books, but what the heck, I'm going to do it.

So to reiterate, the exponential of the expected value of log X is not equal to

the expected value of the exponential of log X, which is just the expected value of X.

So, what I'm referring to as the population geometric mean, is not equal to

the population mean that we defined earlier.
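
A small simulation sketch of that gap in Python. The population here is a made-up one, not from the lecture: X is 1 or 4 with equal probability, so the expected value of X is 2.5, while the expected value of log X is log 2 and the population geometric mean is exactly 2.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: X equals 1 or 4, each with probability one half.
n = 100_000
xs = [random.choice((1.0, 4.0)) for _ in range(n)]

# Sample geometric mean, computed as exp of the mean of the logs.
sample_gm = math.exp(statistics.fmean([math.log(x) for x in xs]))
# Sample arithmetic mean.
sample_am = statistics.fmean(xs)

# Population values worked out by hand for this two-point population.
pop_gm = math.exp(0.5 * math.log(1.0) + 0.5 * math.log(4.0))  # e^{E[log X]} = 2
pop_mean = 0.5 * 1.0 + 0.5 * 4.0                              # E[X] = 2.5

print(sample_gm, pop_gm)
print(sample_am, pop_mean)
```

With a large sample, the sample geometric mean settles near 2 and the sample arithmetic mean settles near 2.5, so the two are estimating different population quantities.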

It is however interesting to note, that if the distribution of log X is symmetric.

Which, remember, that was one of the reasons at the beginning of the lecture,

we stated for taking logs of data is to turn skewed data to data that's more

symmetric. Then if the distribution of log X is

symmetric, then consider the median. The median is the point that has probability point five at or below it, and here point five

is equal to the probability that log X is less than or equal to mu.

And in this case, because log X is symmetric, mu, the mean on the log scale

is in fact also the median. So this first statement, point five equal

to the probability of log X less than or equal to mu, is just reiterating the

statement that for this distribution that's symmetrical on the log scale, the

mean and the median on the log scale are equal.

But now, on the interior of this probability statement, because

everything's positive and because the exponential function is monotonic,

we can exponentiate both sides of this inequality and get a statement about

x on the natural scale, not on the log scale:

the probability that x is less than or

equal to e to the mu is also 50%. So the conclusion is that, for log

symmetric distributions, the geometric mean is estimating the median.
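
To illustrate with a simulation sketch in Python, assume a made-up population, not from the lecture, in which log X is uniform on an interval centered at mu = 1. That is symmetric on the log scale without being normal.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical log-symmetric population: log X is uniform on (mu - 2, mu + 2).
mu, half_width = 1.0, 2.0
n = 100_000
xs = [math.exp(random.uniform(mu - half_width, mu + half_width)) for _ in range(n)]

sample_gm = math.exp(statistics.fmean([math.log(x) for x in xs]))
sample_median = statistics.median(xs)
sample_am = statistics.fmean(xs)

# Fraction of draws at or below e^mu; should be close to one half.
frac_below = sum(x <= math.exp(mu) for x in xs) / n

print(sample_gm, sample_median, math.exp(mu))
print(frac_below, sample_am)
```

The sample geometric mean and the sample median both land near e to the mu, while the arithmetic mean is pulled well above it by the long right tail.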

So why am I saying all this? I am making fairly simple ideas rather complicated.

The idea is you have data, you log it, and you just do all the normal stuff you do

with your data; you're just doing it on the logged data.

And what I'm trying to say is, I'm trying to relate the quantities that you get from

doing that. They have interpretations back on the

natural scale, that's what we're trying to say.

You don't have to discard the natural scale units when you log data, you get a

lot of interesting interpretations back on the natural scale.

So, at any rate, if you use the central limit theorem to create a confidence

interval for the log measurements. Then your interval is estimating mu, the

expected value of the log measurements, the expected value of log x, in log units.

Then if you exponentiate the interval, you're estimating e to the mu,

the population geometric mean, as I'm calling it.

And then in the event the distribution of the log data is itself symmetric.

Then your exponentiated interval is also estimating the median.

So this is kind of a backhanded way of getting a confidence interval for the

median. If you're willing to assume that the population

from which your data is drawn is

symmetric on the log scale, then when you take the log of the data,

create the confidence interval and then exponentiate the end points, then you wind

up with a confidence interval for the median.
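
The recipe of logging, building the interval, and exponentiating the endpoints might be sketched in Python as follows. Two things here are assumptions and not from the lecture: the function name exponentiated_log_interval, and the use of a normal quantile (a central limit theorem interval) in place of a t quantile, since the Python standard library has no t distribution. The expenditure numbers are invented.

```python
import math
import statistics

def exponentiated_log_interval(xs, level=0.95):
    """CLT-based interval for e^{E[log X]}, the population geometric mean.

    Assumption: uses a normal quantile, not a t quantile. For a t interval,
    swap z for the t quantile with len(xs) - 1 degrees of freedom.
    """
    logs = [math.log(x) for x in xs]          # requires every x > 0
    center = statistics.fmean(logs)
    se = statistics.stdev(logs) / math.sqrt(len(logs))
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    lo, hi = center - z * se, center + z * se
    return math.exp(lo), math.exp(hi)         # back on the natural scale

# Hypothetical right-skewed data (say, yearly health expenditures).
expenditures = [120, 80, 45, 300, 60, 1500, 95, 210, 30, 75, 410, 55]

lo, hi = exponentiated_log_interval(expenditures)
gm = statistics.geometric_mean(expenditures)
print(lo, gm, hi)
```

The interval is centered, multiplicatively, on the sample geometric mean, and it reads as an interval for the median only under the added assumption of symmetry on the log scale.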

And remember, we also talked about getting a confidence interval for the median using

bootstrapping, but this is a lot easier, it just uses the ordinary T confidence

interval. And then this is especially useful for

paired data when their ratio is of interest.

So, let's just quickly go through an example.

So, remember, I quoted before this book by Rosner, Fundamentals of Biostatistics,

which I like. It's very thorough.

And covers a huge chunk of biostatistical topics.

At any rate, on page 298 of the version that I have, which unfortunately I think is

the edition before the current one, it gives a paired design where it compared

systolic blood pressure for people taking oral contraceptives and matched controls.

And so a paired design is where you have a person and you have a bunch of covariates

that you're concerned with, when you want to compare, say, oral contraceptive users to

controls. You're worried that the group of people

that take oral contraceptives is different from the group of people who

don't take oral contraceptives. So what you might do is take

this list of things that you think might explain that difference and match

on them, so that the person taking the oral contraceptive has a twin, in a

sense, in the control group, who, at least insofar as the other variables you

can measure, is very close. That's this idea of matching.

Take matching to the extreme. You know, you couldn't do it in this

experiment, but imagine if you were investigating aspirin.

You would, say, give a person an aspirin and then after a suitable wash out period,

give them a placebo. And then that person would be perfectly

matched to themselves as their own control.

So that's the extreme version of this case, but let's suppose you're in a

circumstance like this where you can't really randomize people to contraceptive

use. You couldn't do a crossover experiment like

that. So you could match people as closely as

you could on all these other things that you think might differentiate

contraceptive users from controls.

Anyway, that's a matched design. But the point for our discussion is that

person one, who is in the oral contraceptive group,

and person one, who is in the control group,

are tied together. And so we want to utilize that information

that they're similar. So what we might do is take the systolic blood

pressure for person one from the oral contraceptive group

and the systolic blood pressure from person one in the control group

and analyze their ratio, right? And so we might be interested in ratios

because we just might be interested in the interpretation of, well, what percent

increase or decrease does a person in the contraceptive group have over their

associated control? So imagine if we took ratios, and then

logged the ratio. Well, that would just be the difference

of the logs of the two measurements. Then we could just do an ordinary one

sample t confidence interval for the log of the ratios, done matched pair by matched

pair. And so in this case, the geometric mean of

the ratios works out to be 1.04, which, given the order in which I

was dividing, implies a four percent increase in systolic blood pressure for the oral

contraceptive users. And then the t interval on the log scale.

So when I took the log of each oral contraceptive user's measurement, the log of the control's,

and took the difference, pair by pair, I wound up with n measurements where

I started with 2n total measurements, each in pairs.

I had my n measurements on the log scale, I did an ordinary t interval, and I

calculated it and I got 0.010 to 0.067. In this case, the units would be log

millimeters of mercury. What we're interested in on the log scale is

whether zero is in this interval or not, right?

Zero is the important thing on the log scale.

If we exponentiate the interval, we get 1.01 to 1.069.

So the 95% confidence interval estimates a one percent to seven percent increase in

systolic blood pressure for the oral contraceptive users relative to the

controls. And so on the exponentiated scale we're

interested in whether one is in the interval.

On the log scale, we're interested in whether zero is in the interval.
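
A short Python sketch of that back-transformation, using the log-scale endpoints quoted in the lecture. The pair of blood pressures at the end is hypothetical and is there only to show that the log of a ratio is the difference of the logs.

```python
import math

# Log-scale interval reported in the lecture (log mmHg).
log_interval = (0.010, 0.067)
# Exponentiate the endpoints to get an interval for the ratio.
ratio_interval = tuple(math.exp(v) for v in log_interval)

# "Zero not in the log interval" and "one not in the ratio interval"
# are the same statement, because exp is monotonic and exp(0) = 1.
excludes_zero = not (log_interval[0] <= 0.0 <= log_interval[1])
excludes_one = not (ratio_interval[0] <= 1.0 <= ratio_interval[1])

# Hypothetical single matched pair (systolic pressure in mmHg).
oc_bp, control_bp = 120.0, 115.0
log_ratio = math.log(oc_bp / control_bp)
diff_of_logs = math.log(oc_bp) - math.log(control_bp)

print(ratio_interval, excludes_zero, excludes_one)
print(log_ratio, diff_of_logs)
```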

By the way, if your numbers are kind of small, like in this case 0.01 and 0.067,

exponentiating is about like adding one. If you are a math person, you take the Taylor

expansion of e to the x, go out one term, and you see that it is pretty close

to one plus x. So you can actually exponentiate things very

quickly just by adding one. And then if the number that you're

looking at is pretty close to one and you want to log it, you can just subtract one.

Same thing: take the Taylor expansion for log around one and go out one term.

So if the number is close to zero and you want to exponentiate it, adding one

works pretty well as an approximation. If the number is pretty close to one

and you want to log it, subtracting one does pretty well as well.

That's a trick that's very useful, like when you do logistic regression and

things like this where you need to take exponents quickly.
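
A quick numerical check of this shortcut in Python, using the endpoints from the blood pressure example. The approximations being checked are the one-term Taylor expansions: e to the x is about 1 + x for x near zero, and log r is about r - 1 for r near one.

```python
import math

# Values near zero: compare exp(x) with 1 + x.
small = (0.010, 0.067)
exp_table = [(x, math.exp(x), 1 + x) for x in small]

# Values near one: compare log(r) with r - 1.
near_one = (1.01, 1.069)
log_table = [(r, math.log(r), r - 1) for r in near_one]

for x, exact, approx in exp_table:
    print(f"exp({x}) = {exact:.4f}, 1 + x = {approx:.4f}")
for r, exact, approx in log_table:
    print(f"log({r}) = {exact:.4f}, r - 1 = {approx:.4f}")
```

The error grows roughly like x squared over two, so the shortcut is good to two or three decimals for values like these and gets worse quickly as you move away from zero or one.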

So let me just talk about this example just a little bit more.

This estimate, 1.01 to 1.07, this one percent to seven percent estimated increase

between the two groups, is a confidence interval for this

sort of paired ratio of geometric means. And that's why it's useful, in that we're

estimating a ratio here. So now let's just go through the same

exact exercise, but instead of paired observations, we have two

independent groups. If you log the data from group one, log

the data from group two, create a confidence interval for the

difference in the group means on the log scale, and then exponentiate it, then

that confidence interval is an estimate of e to the mu one divided

by e to the mu two. That confidence interval is exactly an estimate of the

ratio of the population geometric means. Of course, on the log

scale it's an estimate of the difference in the expected values, the means on the log scale.

But if you exponentiate it, then you get exactly an interval for the ratio of the

population geometric means. And if you're willing to assume that the

data is symmetric in the log scale, then this is also equal to a ratio in the

population medians. There's one distribution, where you take logs

of things and they wind up Gaussian, that is so important that we give it a name. We

call it the log normal distribution. A random variable is log normally

distributed if its log is a normally distributed random variable.

Note, it's not the log of a normal random variable, as its name kind of implies.

You can't take the log of a normal random variable because those can be negative and

you can't take the log of them. So if you want to remember what's a log

normal random variable, remember this phrase.

I am log normal means take logs of me and then I'll be normal. And then you'll

remember the correct order. But then also think, when you are working out

what the log normal is, if you are taking the log of something that's possibly

negative, then you're doing it wrong. Okay, so again, log normal random variables

are not logs of normal random variables. As I say here, you can't even take the log

of a normal random variable because it can be negative.

So formally, X is log normal, depending on two parameters, mu and sigma squared,

if log of X is normal with mean mu and variance sigma squared. And again that mirrors kind of what

we're often doing with logs. We're trying to take logs of things

so that, on the log scale, the data is symmetric,

and then hopefully the population distribution is also symmetric.

And just as log of X is normal for X being log normal,

if Y is normal with mean mu and variance sigma squared,

then e to the Y is log normal. So you can generate a log normal by

generating a normal random variable and exponentiating it.
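
That generating recipe can be sketched in a few lines of Python. The values of mu and sigma are arbitrary choices for illustration, not from the lecture.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Arbitrary log-scale parameters chosen for illustration.
mu, sigma = 0.5, 0.4

# Draw normals, then exponentiate: each x is a log normal draw.
ys = [random.gauss(mu, sigma) for _ in range(50_000)]
xs = [math.exp(y) for y in ys]

# Taking logs of the xs recovers draws with mean mu and sd sigma.
logs = [math.log(x) for x in xs]
print(min(xs) > 0)
print(statistics.fmean(logs), statistics.stdev(logs))
```

Python's random.lognormvariate(mu, sigma) draws from the same distribution directly, with mu and sigma again being the mean and standard deviation on the log scale.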

I give you the log normal density here, if you want it. It depends on the mu and

the sigma squared. Its mean is e to the quantity mu

plus sigma squared over two, where mu and sigma squared are the mean and variance

on the log scale. And the variance is e to the quantity two mu plus sigma

squared, times the quantity e to the sigma squared minus one.

And its median is e to the mu. And of course, what I'm

calling its population geometric mean is e to the mu as well.

So you can see here, this gives you an exact example where the expected value of X

and e to the expected value of log X are two different things.

The expected value of X in this case, when X is log normal, is e to the quantity mu plus sigma

squared over two. And e to the expected value of log X is e to

the mu. Okay.
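
The slide itself is not reproduced in this transcript, so for reference, here are the standard log normal formulas written out, with mu and sigma squared being the mean and variance of log X:

```latex
\begin{align*}
f(x) &= \frac{1}{x\,\sigma\sqrt{2\pi}}
        \exp\!\left\{-\frac{(\log x - \mu)^2}{2\sigma^2}\right\},
        \quad x > 0,\\
E[X] &= e^{\mu + \sigma^2/2},\\
\mathrm{Var}(X) &= e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right),\\
\mathrm{median}(X) &= e^{\mu} = e^{E[\log X]}.
\end{align*}
```

Since sigma squared is positive, the mean always sits above the median and the population geometric mean, which matches the right skew.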

So if X1 to XN are log normal with parameters mu and sigma squared,

then log X1 to log XN, which I'm calling Y1 up to YN, are normally

distributed with mean mu and variance sigma squared.

So they satisfy the conditions to create a T confidence interval.

And then mu is the log of the median of the XI.

E to the mu then gives the median on the original scale.

It also gives you the population geometric mean.

And then, again, assuming log normality and exponentiating a t confidence interval for

the difference in two log-scale means again implies that your

confidence interval is estimating a ratio of geometric means.

So, let's just go through a quick example of doing this.

Now I'm assuming you can do the arithmetic of this because you already know how to

create two group T confidence intervals. So all that we're doing is logging the

data and doing something you already know how to do.
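
As a rough sketch of "log the data, then do the usual two group interval" in Python. Several things here are assumptions rather than the lecture's method: the function name, the made-up volumes, and the use of a normal quantile with an unpooled standard error in place of the two group t interval (the Python standard library has no t distribution).

```python
import math
import statistics

def geometric_mean_ratio_interval(group1, group2, level=0.95):
    """Estimate and interval for (geometric mean of group1) / (geometric mean of group2).

    Assumption: normal-quantile interval with an unpooled standard error,
    standing in for the two group t interval described in the lecture.
    """
    logs1 = [math.log(x) for x in group1]
    logs2 = [math.log(x) for x in group2]
    diff = statistics.fmean(logs1) - statistics.fmean(logs2)
    se = math.sqrt(statistics.variance(logs1) / len(logs1)
                   + statistics.variance(logs2) / len(logs2))
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    # Exponentiate the log-scale difference and its endpoints.
    return math.exp(diff - z * se), math.exp(diff), math.exp(diff + z * se)

# Hypothetical volumes in cubic centimeters, invented for illustration.
young = [601, 588, 612, 595, 607, 590]
old = [570, 562, 581, 575, 566, 579]

lo, est, hi = geometric_mean_ratio_interval(young, old)
print(lo, est, hi)
```

The point estimate is exactly the ratio of the two sample geometric means, and an interval that excludes one on this scale corresponds to a log-scale interval that excludes zero.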

So I just want to go through the interpretation real quick.

So imagine if you took gray matter volumes.

I actually did this for some data that I have.

I have brain gray matter volumes for a young and an old group defined as younger

than 65 and older than 65. But of course this doesn't account for

being young at heart or whatever. Young and old, as per my definition, but

if you're 65, rest assured, I don't think you're old.

It's just the definition I'm using here. So we did two separate one-group intervals.

For the old group we got 13.24 to 13.27, and for the younger group we got 13.29 to

13.31. Both of them are in units of log cubic

centimeters. If you exponentiate those intervals you

get about 564 to 578 for the old group

and about 592 to 606 for the young group, in cubic centimeters.

So both of these

intervals estimate the population geometric mean gray matter volume among

the older and younger groups respectively. If we're willing to assume that the

population of brain volumes on the log scale are symmetric then both of these

intervals estimate the population median gray matter volume for old and young

respectively. Then if we were to take the two groups and

do a two group t interval on the log measurements, that yields 0.032 to 0.066 log

cubic centimeters. Exponentiate this and you get an interval of 1.032 to 1.068. You

know, again, remember the trick: you add one when you exponentiate a number close to

zero. You wind up with about a three percent to seven percent

higher geometric mean brain volume among the

younger group than the older group. Or, if we're talking about medians, if we're

willing to assume that the individual populations are symmetrically distributed on the log scale,

then that would be an estimated three to seven percent

increase in median grey matter volume for the younger group.

This, of course, being the case because as we age, we start to lose a little bit of

grey matter volume over time. Of course, you develop more neuronal

connections, so you get wiser. So you have, maybe, more neuronal

connections, but decrease in volume. So, anyway, what I hope you learned from

this was when you take logs of measurements and do what we talked about

in terms of creating confidence intervals, and exponentiate the intervals.

I hope you know what the estimates are then referring to.

And it's a common problem, people do this all the time.

But I'm not sure if people always understand exactly what they're doing.

And that's why I devote an entire lecture to the subject of logging, which, in

practice, is a trivial extension of what we've already done.

Take logs of your data, do what we already do, and then exponentiate the intervals.

So no change in what we're doing. But I wanted everyone to understand

exactly what the implications of those things were.

And why log is sort of special, in the sense that it yields uniquely interpretable

results as opposed to using other functions.

You could, say, take the cube root of the data, create the confidence interval on the cube

root scale, and then raise the interval to the third power.

And you wouldn't get the same nice interpretations like you do with log.

Log is special that way. Alright, well thanks troops.

This was our last lecture. I hope you enjoyed the class.

And hope you survived the intense biostatistical training.

And I hope you go on to do great things with this knowledge.

And all the other courses you take from Coursera.