An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

124 ratings

From this lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

So this is the second lecture on exploratory data analysis.

And so I'm going to go through the setup a little more quickly this time, just

to get up and going.

But I'm going to set up the colors like I did in the previous lecture.

I'm going to set the plotting parameters so

that the plots look the way I want them to as well.

And then I'm going to load all the packages that we need to load.

And then after I do that I'm going to load in the data set and get us up to speed.

Okay.

So if you want to see more about how I did all that and what I'm doing there,

you can watch the first lecture on exploratory data analysis.

So, after we check to make sure that all the dimensions match, and

that we're not missing any values or anything like that,

the next thing I do is make lots of plots.

So the first kind of plot that you can make is a box plot.

And so if I make a box plot of, say, the first set of values, I'm going to look at

the expression data, the first column of that data set, and make a box plot.

You can see that most of the values are down here at zero, so

I might apply some kind of transform to the data.

I might try taking the log transform of the data.

And then here I'm going to add one first, because the log of zero is undefined and the zero counts would otherwise be lost.

And so it didn't really help me this time, but in general you

can try different transforms to see if they make the data easier to look at.
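
A minimal sketch of that first box plot in R. The lecture's data set is not reproduced here, so `edata` below is a simulated stand-in (a hypothetical 1000 x 4 matrix of zero-heavy counts), and the choice of base-2 log is an assumption, since the lecture only says "log transform".

```r
# Hypothetical stand-in for the lecture's expression matrix:
# 1000 genes (rows) x 4 samples (columns) of zero-heavy counts.
set.seed(1)
edata <- matrix(rnbinom(4000, mu = 5, size = 0.1), nrow = 1000)

# Raw counts for the first sample: the box is squashed against zero.
boxplot(edata[, 1])

# log2(count + 1): adding 1 keeps the zero counts finite,
# because log2(0) evaluates to -Inf in R.
log_first <- log2(edata[, 1] + 1)
boxplot(log_first)
```

With data this heavy in zeros, the transformed box plot can still sit near zero, which matches the remark that the transform did not help much here.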

You can also plot more than one sample at once, so here I'm going to do that

same box plot, but now instead of just selecting out the first column,

I'm going to apply the box plot to the entire matrix.

And so when I do that, I actually get one box plot per sample, so

one box plot per column of that data set.

And then I can set the color to be a color that I like.

And then, in order to not have all those outlying points drawn individually,

I can set range equal to zero, and then it will make the whiskers extend

all the way out to the most extreme point.

So here I can already see again that the data is pretty skewed for

all these different samples.

And you can see it because most of the values are squished down here near zero.
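
The same idea can be sketched for the whole matrix. As before, `edata` is simulated stand-in data rather than the lecture's data set, and `col = 2` simply picks the second color of whatever palette is active.

```r
# Hypothetical stand-in for the lecture's expression matrix.
set.seed(1)
edata <- matrix(rnbinom(4000, mu = 5, size = 0.1), nrow = 1000)

# Passing a matrix to boxplot() draws one box per column (one per sample).
# range = 0 extends the whiskers to the data extremes, so no points are
# drawn separately as outliers.
bp <- boxplot(log2(edata + 1), col = 2, range = 0)

# boxplot() invisibly returns the summary statistics it drew.
ncol(bp$stats)   # number of boxes
length(bp$out)   # number of points flagged as outliers
```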

The other thing that you can do is you can make histograms.

So I'm actually going to show two histograms.

And if I want to show plots side by side, I can use par(mfrow=c(1,2)).

This is basically saying set up the screen so

that it has one row and two columns of plots.

So it's side by side plots.

So I can make a histogram of the values from the first sample.

So again I'm going to do this transform here because I

think it makes it a little easier to see.

And then I'm going to set the color equal to two, so

this is going to make this nice blue histogram.

And so you can see here right away that almost all the values are equal to zero.

And then you see some values out to the right there.

And so then I can make that same plot,

but for the second sample, and when I do that,

you can see that it has a very similar distribution.

It also has many zero values, and then some values out to the side there.
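
A short sketch of the side-by-side histograms, again on simulated stand-in data (`edata` here is hypothetical, and the log2 base is an assumption):

```r
# Hypothetical stand-in for the lecture's expression matrix.
set.seed(1)
edata <- matrix(rnbinom(4000, mu = 5, size = 0.1), nrow = 1000)

# One row, two columns of plots: the next two plots appear side by side.
par(mfrow = c(1, 2))
h1 <- hist(log2(edata[, 1] + 1), col = 2)
h2 <- hist(log2(edata[, 2] + 1), col = 2)

# Go back to a single plot per screen for whatever comes next.
par(mfrow = c(1, 1))
```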

So, that's one way you can start looking sample by sample, but

sometimes it's hard to see a lot of samples that way.

So if you wanted to see a bunch of samples,

you want to be able to overlay the plots on top of each other.

So here I'm going to set it to be back to just a one by one plot so

it's a little bit easier to see.

And so, the first thing I could do, is I could make a density plot.

And so the density plot is going to basically show the distribution of

values sort of similar to a histogram but with a line plot.

And so one way that I can do that is I'm going to

apply the density function to that first sample.

So here I've got the first sample. I'm going to take the log transform of

the first sample, then I'm going to calculate a density of that sample.

And then I'm going to make the color two here.

So here I can see that density, and it looks a little bit like

the histogram. I've got the x values here, and then I've got how frequently

they appear on the y axis, and so you can kind of see this distribution here.

And so the next thing that you could do is add on top of that and

start layering other samples on top. If I just plotted the next sample,

if I just typed the same plot command,

so I pressed up to get the command back and plotted

the second column of that data set, it would just overwrite the plot.

So instead of doing that,

what I'm going to do is I'm going to use the lines command.

So the lines command will let me overlay

another plot on top of the plot that I just had there.

So I'm going to do the second sample now, and

I'm going to give it a different color.

So now I can see sample one is in blue,

sample two is in pink, and

I can overlay a bunch of samples this way, so I can see if the distributions are all

similar or if I see any kind of bulk changes in the distributions.

That can be a sign that there's an artifact going on.

So that's a plot that I usually make to check.
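
The overlay can be sketched like this, using simulated stand-in data for `edata`. The colors you get from `col = 2` and `col = 3` depend on the active palette, so they may not be the blue and pink seen in the lecture.

```r
# Hypothetical stand-in for the lecture's expression matrix.
set.seed(1)
edata <- matrix(rnbinom(4000, mu = 5, size = 0.1), nrow = 1000)

# Estimate the density of each sample on the log2(count + 1) scale.
d1 <- density(log2(edata[, 1] + 1))
d2 <- density(log2(edata[, 2] + 1))

# plot() starts a new figure; calling plot() again would replace it.
plot(d1, col = 2)

# lines() adds to the existing figure instead of starting a new one.
lines(d2, col = 3)
```

Further samples can be added with more `lines()` calls, one per column.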

The other thing that I do to check to make sure the samples are sort

of consistent is, I make what's called a q-q plot.

So, if I do a q-q plot of the first sample, doing this transform again,

against the second sample with the same transform,

I can see a bunch of data points here, and one thing to keep in mind is that

this q-q plot is making one dot

for every quantile of these two distributions.

And so

how long it takes depends on the number of quantiles that it has to calculate.

And so for example this dot right here is, say, the 5th percentile,

so the 5th percentile for the second sample is on the y axis.

And the fifth percentile for the first sample is on the x axis.

So we can see that it's above the 45 degree line here, and so

you can see that the second sample has a higher fifth percentile

than the first sample, so its values at the low end are higher.

If you want to be able to see that a little more clearly you can

add a 45 degree line.

And one way you can do that is with the abline command.

And you give it an intercept and a slope, and

so you can see here this has an intercept of zero and a slope of one.

That's the 45 degree line.

And so, if I add that on top so there's the 45 degree line,

you can see here the quantiles are a little bit larger for

the second sample down low and they're a little bit lower up high.

And so you can kind of see how the two distributions compare to each other.
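
A sketch of the q-q plot with the 45 degree line, on simulated stand-in data (`edata` is hypothetical, so the pattern above and below the line will differ from the lecture's plot):

```r
# Hypothetical stand-in for the lecture's expression matrix.
set.seed(1)
edata <- matrix(rnbinom(4000, mu = 5, size = 0.1), nrow = 1000)

x <- log2(edata[, 1] + 1)  # first sample, x axis
y <- log2(edata[, 2] + 1)  # second sample, y axis

# qqplot() plots the sorted values of x against the sorted values of y,
# so each dot pairs matching quantiles of the two samples.
qq <- qqplot(x, y, pch = 19, col = 2)

# Intercept 0, slope 1: dots above this line mean the second sample has
# the larger value at that quantile.
abline(a = 0, b = 1)
```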

The other thing you can do when comparing samples,

is you can make what's called an MA or a Bland-Altman plot.

So here I'm going to take the difference between the two samples,

sample one minus sample two,

and I'm going to make that the y axis, and

then I'm going to add the two samples up.

I'm going to make that the x axis.

So I'll explain that in a minute.

So now I plot the sum of the two samples on the x axis and

the difference of the two samples on the y axis.

Again, now it's going to have to plot 52,000 data points, so

this might take a little while.

So on the X axis we have the sum of the two samples.

So basically moving from left to right you go from lower expression, or

lower counts, to higher counts.

And on the y axis, you have the difference.

So if it's at zero, that means there's no difference between the two samples.

So you can see, for example, trends here.

You can see as you get to higher and higher counts there's a trend that

appears: the samples get closer and closer together.

Each dot here is one gene, and

I'm just taking the difference between the two samples for that gene.

So this MA plot is one people make very often.

Especially for technical replicates,

or any kind of replicates that should be similar,

you want to see it centered on the zero line, and

you'd like to see it with low variability and

no trends that depend on the total number of counts for those samples.

And so, that's a way that you can sort of make those plots to compare them.
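
A sketch of the MA plot as described, with the sum on the x axis and the difference on the y axis. The data are a simulated stand-in, the variable names `mm` and `aa` are illustrative, and working on the log2(count + 1) scale is an assumption carried over from the earlier plots.

```r
# Hypothetical stand-in for the lecture's expression matrix.
set.seed(1)
edata <- matrix(rnbinom(4000, mu = 5, size = 0.1), nrow = 1000)

s1 <- log2(edata[, 1] + 1)
s2 <- log2(edata[, 2] + 1)

mm <- s1 - s2   # difference between the two samples (y axis)
aa <- s1 + s2   # sum of the two samples (x axis)

# One dot per gene; a horizontal line at zero marks "no difference".
plot(aa, mm, col = 2)
abline(h = 0)
```

Because these two simulated columns are independent draws rather than replicates, the cloud will be wider than what you would hope to see for real technical replicates.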

Now, the next thing has to do with making these plots easier to read.

We saw these box plots, and many of these plots are very skewed.

And so, one thing that you can often do to make the plots a little bit better,

especially for count-based data,

is to remove the low expression or the low count

features, so that you can really see the distribution of the data.

So to do that I'm going to make the data set a data frame so

that I can use the dplyr filtering commands.

And I'm going to filter the data set.

So I'm going to take that data frame here,

I'm going to apply the filter command to it, and I'm going to say

keep only the rows that have a mean greater than one.

So what is this doing?

It's taking the mean for each row of the data set.

It's going to ask, is that mean greater than one?

If it is greater than one, it's going to keep that row. After I filter here,

the dimensions of this new data set are smaller.

I've removed a large number of the features.

I only have about 12,000 left, and

now I can make that same box plot that I did before.

So again, I converted it to a data frame so I could do the filtering.

Now I have to convert it back to a matrix so I can do the plotting;

that's a little bit of a difficulty with this filtering approach.

But now you can see, oops.

What did I do here?

Oh, I've got edata here instead of the filtered data, there we go.

So now I've got one box plot for every sample, and so

you can see the distribution a lot better now. Basically,

there were a ton of values that were just equal to zero.

I removed all of those, and so this is the distribution of the remaining values,

the values for the genes that have an average greater than one on this scale.

And so you can do these sorts of transforms and these sorts of filtering to

basically get an idea of what the real distribution of the data will look like.
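
A sketch of the filtering step. To keep it free of package dependencies, this uses base R row indexing in place of the dplyr `filter` command used in the lecture; it keeps the data as a matrix, so no conversion to and from a data frame is needed. `edata` is simulated stand-in data and `filt_edata` is an illustrative name.

```r
# Hypothetical stand-in for the lecture's expression matrix.
set.seed(1)
edata <- matrix(rnbinom(4000, mu = 5, size = 0.1), nrow = 1000)

# Keep only the rows (genes) whose mean count across samples exceeds 1.
keep <- rowMeans(edata) > 1
filt_edata <- edata[keep, ]

# Fewer rows than before; the number of samples is unchanged.
dim(edata)
dim(filt_edata)

# Same box plot as before, now on the filtered matrix.
boxplot(log2(filt_edata + 1), col = 2, range = 0)
```

The cutoff of one is the value used in the lecture; a different threshold may suit other data sets.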