Our first lecture isn't about statistics or about genomics. It's about reproducible research. Reproducible research means that you can take someone's data and their code, rerun the analysis, and get exactly the same answer every time. It turns out that this idea, although it seems very simple, is a source of major problems.

Here's an example of a paper that people were incredibly excited about in the area of genomics. If you've heard about personalized medicine or precision medicine, this seemed to be one of the huge success stories for that area. The paper claimed that you could use genomic signatures, specifically gene expression signatures, to guide the use of chemotherapeutics. In other words, you could tell which chemotherapy would work best for which person based on their gene expression profiles. So obviously this paper generated a lot of excitement, and tons of people rushed out to analyze the data and see if they could apply that same signature to the patients at their institution.

Two of the people who took this problem on at M.D. Anderson were Keith Baggerly and Kevin Coombes. They're two statisticians who attempted to obtain the data and rerun the analysis, just so they could check that they knew how to do it before applying it to their own patients. But during the process of trying to do this, they discovered many problems. For example, they noticed that a lot of the data and a lot of the code weren't available. Once they did get the data and code, it turned out that the code didn't produce the results in the paper. They couldn't reproduce the pictures, and they couldn't reproduce the reported results. Maybe even more troubling, if you ran the code on different days at different times, you would get different answers, even if the same data were put through the code.
That was because there was a random component to the predictions. This ended up playing out over a long saga, and you can read a lot more about that saga in the paper linked at the bottom of the slide. But the basic idea is that Keith and Kevin attempted to get the data from the investigators and had trouble getting it. Once they did get the data, they identified these problems. Despite that, some clinical trials were launched. Those clinical trials assigned people to chemotherapy, possibly erroneously, based on the original analysis, and that actually led to some lawsuits.

This shows just how important it can be that all of your data and all of your code can be reproduced. Obviously most problems that you're going to be working on won't have this level of importance and won't rise to this level of difficulty. But it's important to keep in mind that having the data and code available to reproduce exactly what you've done, for this class and for your future statistical genomic analyses, is really important. If you want to hear a lot more about this particular case, and about reproducible research in general, I highly recommend the talk I'm linking to from Keith Baggerly. Keith is an amazing speaker, very funny, and he'll tell you a lot more about this problem if you're interested.

It's not just in genomics that reproducibility is a big issue. Anywhere you're analyzing data, if the code and data aren't available and you can't check them, you can have problems. And even if the data and code are available, there are always little mistakes that come up. An example is a paper called Growth in a Time of Debt, which was actually quite influential. Many people credit it with starting the austerity movement in many countries around the globe, and it's often cited for that purpose.
But it turns out that a couple of graduate students actually went through the paper, and once they got the code, which was in Excel, they found some Excel errors. These errors didn't change the results terribly much, but they did lead to the somewhat ridiculous sight of Stephen Colbert discussing reproducibility on his show. I hope this drives home the point that it's critical that all the data and the code are available, and that you try to make all of your analyses for this class reproducible.
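One concrete failure mode described above, a random component in the predictions that makes reruns disagree, is easy to demonstrate and easy to fix by setting a random seed before any stochastic step. Here is a minimal sketch of the idea; the `predict` function is hypothetical toy code standing in for any analysis with a random step, not the actual code from the paper:

```python
import random

def predict(expression, n_genes=10):
    """Toy 'signature' predictor: scores a random subset of genes.
    (Hypothetical; stands in for any analysis with a stochastic step.)"""
    genes = random.sample(range(len(expression)), n_genes)
    score = sum(expression[g] for g in genes)
    return "drug A" if score > 0 else "drug B"

# Fake expression data, just for illustration.
expression = [((-1) ** i) * (i % 7) / 7.0 for i in range(100)]

# Without a fixed seed, reruns of predict() on the same data can
# disagree, which is exactly the irreproducibility described above.
# Fixing the seed before each run makes the result deterministic.
random.seed(42)
first = predict(expression)
random.seed(42)
second = predict(expression)
assert first == second  # same seed, same data, same answer
```

The fix costs one line, but without it the "same" analysis can hand back a different treatment recommendation on a different day.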