An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

123 ratings

From this lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

The first thing that you'll do when you get any genomic dataset is do

an exploratory analysis.

So an exploratory analysis consists of graphing and making tables, and

sort of looking at the data that you have.

And so why do you do this sort of exploration?

The reason is that you need to understand the properties of the data:

the scale on which the data were measured, whether there are any missing

data, and what other patterns you might discover.

It also might suggest modeling strategies.

So you might notice that the data look like they have a particular

distribution, so it looks like a certain kind of model might work.

It also helps you to debug analyses.

So if you identify, for example, a clustering in the data set

that you hadn't seen before or that you weren't expecting,

you can use that to try to update the way that you might do your analysis.

Or if you have problems in your analysis,

you might be able to figure out some of the reasons why those might be.

And then finally, to communicate your results, to tell people here's what I did,

here are the associations that I found, here's what the data look like.

So I thought I'd start off with a little background perceptual task, so

you get an idea of what kind of plots are the most useful to make.

Because the best and

most widely used way to do exploratory analysis is with plotting.

So here's an example of a task that was shown to a bunch of different people, and

then they measured how well people could do comparisons.

So in each case, for each of these types, TYPE 1, TYPE 2, TYPE 3, TYPE 4, and TYPE 5,

you see that you're trying to compare the measurements in the two blocks

that have a dot.

So for example here, there's a dot in this block, and the dot in that block.

You're trying to compare which of those two is bigger.

Here you try to do it again.

For TYPE 4, you're trying to compare this dot to this dot.

For TYPE 3, you compare this dot to this dot,

that is, the relative sizes of the rectangles there.

So it turns out, if you do this sort of experiment,

then the one that people do the best on is TYPE 1.

So here, I'm plotting the error of the person making the judgement.

So they actually showed it to human beings and had them check which one they think is

longer, and the error gets smaller as you go to the left of this graph.

And so you see that the TYPE 1 is the lowest, and so,

what are the characteristics of that TYPE 1 graph?

Well, first of all, the two bars are plotted right next to each other.

The other thing is that they're plotted on a common scale, so this scale is the same

for both measurements, and then, you're comparing position here, and

so, you're not comparing anything about the relative size of the box.

You're just saying, is this length longer than this length?

And so, humans tend to be better at making those very simple comparisons

on common scales.

Another thing that they did was an experiment like this.

They tried to compare the relative size of the shapes for

a pie chart, that looks like this, versus a bar chart that looks like this.

And so, if you make the same sort of plot, where

the error again gets smaller as you go over here to the left side,

you see that there's lower error for the position judgment.

So basically,

it means that people find it much easier to judge whether this bar is taller than that bar

than to judge whether the angle here is bigger than the angle there.

And so, this gets worse the more angles that you include in the chart.

And that's the reason why statisticians in general don't like pie charts,

is that it's very hard for people to actually read them off.

Another reason why you want to do exploratory analysis is that,

once you've put the plots on the right scale and made them easy to read,

you can see that even when the calculations that people typically report agree,

you might get very different results when you look at the data.

So here what I've shown is four different plots.

This is called Anscombe's Quartet, and so for these four different plots, you see

very different associations between the two measurements.

The variables themselves aren't anything you need to care about;

you can just look at the pattern.

So here you can see a nice scatter plot, like you hope to see.

Here you can see a whole bunch of measurements clustered right here, and

then one big outlier over there.

Here, the measurements are very much in a line, with one outlier there,

and then you see kind of a curvilinear relationship here.

It turns out if you fit best-fit lines to all four of these data sets,

you get essentially the same estimates for the best-fit line.

If you do a statistical significance calculation, which we'll talk about later

in the class, you get essentially the same statistical significance, and

the R squared is the same as well.

So these are all measurements that you might report say on a paper, but they're

all the same, even though the data are telling you very, very different things.

This is the reason why you want to make sure that you plot your data.
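
As a rough check of the Anscombe's Quartet point, the sketch below fits a least-squares line to each of the four data sets using only the Python standard library. It assumes the standard published Anscombe values, typed in here by hand; the helper name `fit_line` is made up for this illustration and is not from the lecture.

```python
# Anscombe's quartet: four small data sets with nearly identical summaries.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x values for sets I-III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; also returns R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


FITS = {name: fit_line(x, y) for name, (x, y) in QUARTET.items()}
for name, (slope, intercept, r2) in FITS.items():
    print(f"{name}: slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.2f}")
```

All four fits should come out close to a slope of 0.5, an intercept of 3, and an R squared of about 0.67. They agree to roughly two decimal places rather than digit for digit, and only a plot reveals how different the four data sets are.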

One thing you need to be aware of that often happens,

especially with genomic data, is that the data are very high dimensional and there are

often major efforts made to try to visualize the complexity of genomic data.

But that can very easily go awry.

And a sort of funny way of talking about that is calling it a ridiculogram.

And so a ridiculogram is a really visually impressive display of data,

that doesn't really communicate much information.

So it's very hard to read one of these network plots if it's really dense

and really connected.

It's very hard to actually extract any information out.

So the goal of visualization is to communicate information to people.

And while this is pretty and maybe visually arresting,

it's not necessarily very communicative.

So here's some other information you should use.

I showed you bar charts, showed you position charts, previously, and so

this is a very common type of chart that you see in a paper,

particularly in a genomics paper.

But it turns out, that this type of plot, can obscure quite a bit of information.

So for example,

all of these four data sets here give you these exact same bar charts.

So it turns out, here, we have two data sets that look like they have sort of

symmetric-looking distributions, and everything is nice.

Here we have two distributions, where one is sort of symmetric looking and

one has a big outlier.

Here we have two bimodal distributions,

here we have totally different sample sizes.

All of those give exactly the same bar chart.

So one principle of making data visualizations is to

make sure that you show all of the data.
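
To make the bar chart point concrete, here is a small sketch with made-up numbers (they are not the data from the slide). The four hypothetical samples share the same mean, so a bar chart of means would draw four identical bars, yet they differ in shape, outliers, and sample size. The names `SAMPLES` and `summarize` are illustrative only.

```python
from statistics import mean

# Hypothetical samples, invented for illustration only.
SAMPLES = {
    "symmetric": [4.0, 5.0, 5.0, 5.0, 6.0],
    "outlier": [3.0, 3.0, 3.0, 3.0, 13.0],
    "bimodal": [1.0, 1.0, 1.0, 9.0, 9.0, 9.0],
    "small_n": [4.0, 6.0],
}


def summarize(values):
    """Summaries that a bar of means hides: sample size and range."""
    return {
        "n": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


SUMMARIES = {name: summarize(values) for name, values in SAMPLES.items()}
for name, summary in SUMMARIES.items():
    print(name, summary)
```

Every sample has the same mean, so the bar heights match, while the sample sizes and ranges are quite different. Plotting the individual points, for example as a dot plot next to each bar, would show this directly.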

Another principle is to make sure that you know what scale you're plotting on.

So this is a plot here of two technical replicates.

Here it's a gene expression experiment.

And so the thing that you can't see from this plot if you look at it

is that 99% of the data are right down here in this lower

corner below this light blue bar.

Why are 99% of the data down in that corner?

Because the scale that we've chosen

is actually the wrong scale to measure these things on.

So you can imagine doing some kind of data transformation

to make the data more visible.

So that's one of the things to be aware of when making these plots,

is to be sure you're showing all the data and you're showing it on the right scale.
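
One transformation commonly used for this kind of skewed measurement is a log transform. The sketch below is only an illustration: it uses simulated log-normal values (an assumption, not the lecture's data) and counts how many points land in the lowest tenth of the axis range before and after taking log2(x + 1). The function name and the +1 offset are illustrative choices.

```python
import math
import random


def fraction_in_lowest_tenth(values):
    """Fraction of points that fall in the lowest 10% of the plotted axis range."""
    lo, hi = min(values), max(values)
    cutoff = lo + 0.1 * (hi - lo)
    return sum(v < cutoff for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
raw = [rng.lognormvariate(5, 2) for _ in range(10_000)]  # simulated skewed values
logged = [math.log2(v + 1) for v in raw]  # +1 avoids taking the log of zero

frac_raw = fraction_in_lowest_tenth(raw)
frac_log = fraction_in_lowest_tenth(logged)
print(f"raw scale:  {frac_raw:.1%} of points in the lowest tenth of the axis")
print(f"log2 scale: {frac_log:.1%} of points in the lowest tenth of the axis")
```

On the raw scale nearly all of the simulated points are squeezed into one corner of the axis, and on the log scale they spread out across it.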

The other thing is that changing the way that you make the plot can

have a big impact.

So for example,

that was a plot of two technical replicates on the previous slide.

And now I'm showing those same two things,

only now I'm changing the measurements a little bit.

I'm just going to average the measurements on the X-axis, and

I'm going to take the difference between the two measurements on the Y-axis.

And so, this kind of plot, called an MA plot or Bland-Altman plot, is very commonly

used when comparing replicates, because it shows you a couple of different things.

First, it's now centered things on 0, so instead of having to judge

the distance off of a 45 degree line, you just have to judge the distance up and

down from the 0 line to see how far away the two replicates are.

The other thing that you can do, is you can often visualize intensity dependent

effects or effects that depend on the size of the measurement.

So going from left to right here,

is the average measurement between the two replicates.

So as it gets bigger, that just means that, overall,

it's measuring a higher quantity toward the right-hand side.

And again, on the Y-axis it's the difference.

So what are we seeing here?

Measurements down here that are low measurements

have bigger differences than measurements up here that are high measurements.

So this sort of intensity-dependent effect might mean that

there aren't necessarily more real differences down here; it's just that

measurements that are on the low end of the scale are more variable by nature.

So they would be different just because they're more variable.
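
The MA (Bland-Altman) coordinates described here can be sketched in a few lines: the average of the two replicates goes on the X-axis and their difference on the Y-axis. This sketch assumes the replicates are already on a log scale. The simulated replicates, the noise model (noisier at low intensity), and the names `ma_values` and `mean_abs` are illustrative assumptions, not the lecture's data or code.

```python
import random


def ma_values(rep1, rep2):
    """Return (A, M): per-feature average and difference of two replicates."""
    a = [(x + y) / 2 for x, y in zip(rep1, rep2)]
    m = [x - y for x, y in zip(rep1, rep2)]
    return a, m


def mean_abs(values):
    """Average absolute value, used here as a crude measure of spread around 0."""
    return sum(abs(v) for v in values) / len(values)


# Simulated log-scale replicates: same true signal, noise shrinking with intensity.
rng = random.Random(1)
truth = [rng.uniform(0, 16) for _ in range(2000)]
rep1 = [t + rng.gauss(0, 2 / (1 + t)) for t in truth]
rep2 = [t + rng.gauss(0, 2 / (1 + t)) for t in truth]

a, m = ma_values(rep1, rep2)

# Split the differences by average intensity to look for intensity dependence.
low = [mi for ai, mi in zip(a, m) if ai < 8]
high = [mi for ai, mi in zip(a, m) if ai >= 8]
print(f"mean |M| at low A:  {mean_abs(low):.3f}")
print(f"mean |M| at high A: {mean_abs(high):.3f}")
```

In this simulation there are no real differences between the replicates at all, yet the differences are larger at low average intensity, which is the same pattern the plot shows.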

So another thing to keep in mind is that,

the best way to make these plots is to show them centered or starting at zero.

And to see the reason why, imagine plot number one.

Plot number 1 shows three different methods, and

it shows it on a scale from 0 to 100 because it's a percentage, and so

you can see that, they all look very, very similar.

99%, 98%, 96%. You can make that same plot but

if the scale starts at 95 instead of starting at 0, all of a sudden,

you see exaggerated differences between these groups.

So the interesting thing here is that this is a way that you can make it very unclear to

people what the real differences between the groups that we care about are on

a meaningful scale.

So almost always what you want to do is start any graph with a percentage or

any graph where there's a quantity that starts at zero,

the graph should start at zero as opposed to starting up at higher values.

So one of the most common ways to mislead people with graphs is to

choose the wrong starting point for the axis.
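
The arithmetic behind the exaggeration is simple: the drawn length of a bar is its value minus the axis starting point. Here is a small sketch using the 99%, 98%, and 96% values from the example; the method labels and the function name `drawn_lengths` are placeholders.

```python
def drawn_lengths(values, axis_start):
    """Length of each bar as drawn when the axis starts at axis_start."""
    return [v - axis_start for v in values]


accuracies = {"method A": 99.0, "method B": 98.0, "method C": 96.0}
values = list(accuracies.values())

from_zero = drawn_lengths(values, axis_start=0.0)
from_95 = drawn_lengths(values, axis_start=95.0)

# Compare the tallest drawn bar with the shortest under each axis choice.
ratio_zero = max(from_zero) / min(from_zero)
ratio_95 = max(from_95) / min(from_95)
print(f"axis from 0:  tallest bar is {ratio_zero:.2f}x the shortest")
print(f"axis from 95: tallest bar is {ratio_95:.2f}x the shortest")
```

With the axis starting at 0, the tallest bar is drawn about 3% longer than the shortest. With the axis starting at 95, the same numbers are drawn with a four-to-one difference in bar length.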

So I've only scratched the surface here of exploratory analysis.

I hope I've taught you about showing all the data, showing position as often as you

can, starting graphs at zero, and not showing too much complication in your graphs.

But there's a lot more of this.

And so, I think Karl Broman's guide to displaying data is one of the best

places to go.

It has sort of information from a statistical viewpoint.

There's also a large number of articles on data visualization at Nature that can be

very useful for learning how to visualize different types of biological data.