This is the third lecture on exploratory data analysis, so again I'm going to set things up a little bit quickly. You can watch the first lecture on exploratory data analysis if you want to see why I'm setting the parameters so that the plots are pretty and then loading all the basic libraries. Then I load the libraries this lecture depends on, and then I load the data set in from the web. Now that I'm up to speed: in the third part of exploring a data set, I usually check for consistency. You usually want to check for consistency using some kind of external data, if you have it available. As an example, I'm going to show you how I compare a data set that I have to some metadata or annotation that I have from somewhere else, to make sure they agree. So here, I'm going to get the IDs for the features, and these IDs are Ensembl IDs. Then I'm going to extract the chromosome information for those Ensembl IDs. If I look at that information, I've now got a chromosome label for each of the Ensembl IDs in the data set. Once I've done that, hopefully it has the same number of rows as the data set, but it doesn't. So we've already run into a problem. One reason why is that some of the IDs are labeled more than once. So I'm going to remove all of the duplicated values from the chromosome data set. Once I remove those duplicated values, I can see that the chromosome data set has the same number of rows as the expression data set. But we want to go one step further and make sure that the chromosome data matches the information in the expression data, so that's what I'm going to check next.
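As a rough illustration of the de-duplication step, here is a minimal sketch in base R. The objects `edata`, `chr`, and `chr_unique`, the IDs, and the chromosome labels are all made-up stand-ins, not the lecture's actual data or annotation lookup.

```r
# Toy stand-ins (hypothetical): a small expression matrix with IDs as row names
set.seed(1)
edata <- matrix(rpois(12, lambda = 5), nrow = 4,
                dimnames = list(c("ENSG01", "ENSG02", "ENSG03", "ENSG04"),
                                c("s1", "s2", "s3")))

# Hypothetical annotation table in which one ID appears twice
chr <- data.frame(
  ENSEMBL = c("ENSG01", "ENSG02", "ENSG02", "ENSG03", "ENSG04"),
  CHR     = c("1", "Y", "X", "7", "Y"),
  stringsAsFactors = FALSE
)

# The row counts disagree because of the repeated ID
nrow(chr) == nrow(edata)

# Keep only the first annotation row for each ID
chr_unique <- chr[!duplicated(chr$ENSEMBL), ]

# Now there is one annotation row per gene
nrow(chr_unique) == nrow(edata)
```

Keeping the first occurrence is only one possible choice; if the duplicated rows disagree about the chromosome, it is worth looking at them before dropping any.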
I'm going to use the all command to check whether all of the Ensembl IDs in this chromosome data set match the row names of the expression data set. That's true, so that's good. So first I check that the Ensembl IDs in the annotation match the Ensembl IDs in the expression data, so that the IDs all line up. Now I'm going to filter to just the chromosome Y genes. To do that, I have to convert the expression data to a data frame, because the filtering command works on data frames. Then I can apply the filtering command, so now I've got a smaller data set that consists of just the genes measured on the Y chromosome. The first thing I want to do is take the sum of each column. So I take the genes from the Y chromosome and sum them for each sample; that's the total count for the Y chromosome genes in each sample. I'm going to plot that versus gender with a boxplot. When I do that, I can see that the males have way more counts than the females, so that's a good independent check. I can also overlay the data points if I want to. I make that boxplot without coloring it and then overlay the data points with points, which, like lines from the previous lecture, just overlays points on top of the current plot. Again, I plot the column sums, the total counts for the Y chromosome. Here I'm using the jitter command; you'll see that in a lot of the plots I make like this, with boxplots. jitter adds random noise to the value for gender, so the points don't all land on top of each other. Then I color the points by gender, and you can see that the females have counts that are mostly zero, and the males have some positive counts.
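The check described above might look roughly like the following sketch. It uses simulated data, and it swaps the filtering command for plain base R logical indexing so that it needs no extra packages; `edata`, `chr`, `sex`, and `y_totals` are hypothetical names.

```r
# Simulated stand-ins (hypothetical): 6 genes, 10 samples, 5 female and 5 male
set.seed(2)
genes <- paste0("ENSG", sprintf("%02d", 1:6))
sex   <- factor(rep(c("female", "male"), each = 5))
edata <- matrix(rpois(60, lambda = 20), nrow = 6,
                dimnames = list(genes, paste0("s", 1:10)))
chr <- data.frame(ENSEMBL = genes,
                  CHR = c("1", "Y", "X", "Y", "7", "Y"),
                  stringsAsFactors = FALSE)

# Assumption for the toy data: females have no counts on chrY genes
edata[chr$CHR == "Y", sex == "female"] <- 0

# The annotation IDs must line up with the expression row names
stopifnot(all(chr$ENSEMBL == rownames(edata)))

# Keep only the chrY genes, then total the counts for each sample
edata_y  <- edata[chr$CHR == "Y", , drop = FALSE]
y_totals <- colSums(edata_y)

# Boxplot of chrY totals by sex, with jittered points overlaid
boxplot(y_totals ~ sex, ylab = "Total chrY counts")
points(y_totals ~ jitter(as.numeric(sex)),
       col = as.numeric(sex), pch = 19)
```

The boxes are drawn at x = 1 and x = 2, which is why jittering `as.numeric(sex)` puts each point over its own box.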
So that gives us a kind of external validation that the data look okay. The next thing that I might want to do is look at some kind of multivariate plot of the data. The way people often do that is with the heatmap command. The first thing I'm going to do is filter the data set down to just the genes that have very high row means; I do that because it makes the heatmap quicker to draw. You can obviously make it with a much larger number of values if you have time to wait for the plot. Then I apply the heatmap command to that matrix. When I do that, I can see that I've got the samples in the columns and the genes in the rows. What it's done already is cluster these data, and we're going to talk about clustering in a bit. But one thing you might want to do first is make the colors different from the default heat colors. You can do that by defining a new color ramp with the colorRampPalette function. So here, I'm going to blend the colors from the third color in the tropical palette that I defined, to white, to the second color. And I'm going to ask for a nine-color ramp. You can see this has defined the colors in hexadecimal format. Next, I can make the heatmap again, but tell it to use these new colors instead of the ones it was using before. So now you get this nice white, blue, and pink color scheme. The other thing you can do, if you don't want to cluster and you actually want to see the samples in the order they were originally in, is use the heatmap command again with the pretty colors I just defined. If you set Rowv=NA and Colv=NA, it removes the clustering dendrograms and keeps the rows and columns in their original order.
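Here is a small sketch of those heatmap steps on simulated counts. The `tropical` vector is an assumed stand-in for the palette defined in the first lecture, and the row-mean cutoff of 1000 is an arbitrary value chosen for the toy data, not the lecture's threshold.

```r
# Simulated counts (hypothetical): 200 genes by 8 samples,
# where the last 20 genes are very highly expressed
set.seed(3)
gene_means <- rep(c(5, 5000), times = c(180, 20))
edata <- matrix(rpois(200 * 8, lambda = gene_means), nrow = 200,
                dimnames = list(paste0("g", 1:200), paste0("s", 1:8)))

# Keep only the genes with very high row means (cutoff is an assumption)
edata_high <- edata[rowMeans(edata) > 1000, ]

# Assumed stand-in for the "tropical" palette from the first lecture
tropical <- c("darkorange", "dodgerblue", "hotpink", "limegreen", "yellow")

# Nine-color ramp: third color -> white -> second color
hmcol <- colorRampPalette(c(tropical[3], "white", tropical[2]))(9)
hmcol   # a character vector of hexadecimal color codes

# Default heatmap: rows and columns are clustered and reordered
heatmap(edata_high, col = hmcol)

# Same colors, but no clustering: original row and column order is kept
heatmap(edata_high, col = hmcol, Rowv = NA, Colv = NA)
```

Note that `colorRampPalette` returns a function, which is why the call is followed by `(9)` to ask for nine colors.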
So now these are the samples in the order that they appear in the data set. One thing you can already detect from this is that these samples, right here, have somewhat higher values than the other samples. It's helpful to be able to see that immediately. The other thing you can do, with the gplots package, is use the heatmap.2 function to add, for example, a scale to this plot. I said that these values were higher, but how do you know what's higher or lower in terms of the colors? So I use almost exactly the same syntax as with heatmap. But now I also tell it not to draw the dendrograms by using dendrogram="none", and I tell it to scale each row. Now it makes the same plot as before, but it has added this color key with a histogram. It says most of the values in the matrix are near zero, and that blue is high values and pink is low values. And so you can see that right there in the data set. So that's the multivariate way I go about looking at the data. After I've done all of those checks and all of that exploration, I start thinking about plots that I'm going to make related to modeling.
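A hedged sketch of the heatmap.2 call follows, again on simulated data with hypothetical names. It assumes the gplots package may or may not be installed, so the plotting call is guarded. It also passes `Rowv = FALSE` and `Colv = FALSE`, which is the documented way to stop heatmap.2 from reordering; `dendrogram = "none"` on its own only controls whether the trees are drawn. The `row_scaled` line shows the arithmetic that row scaling amounts to.

```r
# Simulated highly expressed genes (hypothetical): 20 genes by 8 samples
set.seed(4)
edata_high <- matrix(rpois(20 * 8, lambda = 5000), nrow = 20,
                     dimnames = list(paste0("g", 1:20), paste0("s", 1:8)))

# Pink -> white -> blue ramp, as in the earlier heatmaps
hmcol <- colorRampPalette(c("hotpink", "white", "dodgerblue"))(9)

# Row scaling: subtract each row's mean and divide by its standard deviation,
# so every gene is centered at zero
row_scaled <- t(scale(t(edata_high)))

# Draw the heatmap with a color key only if gplots is available
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(edata_high, col = hmcol,
                    Rowv = FALSE, Colv = FALSE,   # keep the original order
                    dendrogram = "none",          # draw no trees
                    scale = "row",                # scale each gene (row)
                    trace = "none")               # no trace lines in the cells
}
```

Because each row is centered before plotting, the values pile up around zero, which is consistent with the histogram in the color key described above.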