All right, in this lecture we're going to look at RNA-seq analysis, particularly mapping RNA-seq data using Galaxy.

First we need to get some data sets, so we're going to create a new history. Then, from the data library Demonstration Data sets, we'll get several sets of reads produced from an RNA-seq experiment. So go to Galaxy. I'm already logged in and I already have a new history. Let's go to Share Data, then Data Libraries, and load the data library called Demonstration Data sets. This time we want the folder called Human RNA-seq, so let's go ahead and grab all five of the data sets in here. We have four sequence data sets, which are two replicates for each of two different cell types, and we also have a set of gene annotations. Select all five of those and import them into the current history.

Okay, so if we return to Analyze Data, we now see that we have these five data sets in our history. Each of the four read data sets is a FASTQ-formatted data set containing sequencing reads from an RNA-seq experiment. Once again we're going to ignore quality control. However, this would be a very sensible place to run FastQC and evaluate the quality of the reads.

So, just a little background on RNA-seq data. In this particular case, we're using RNA-seq data that looks at levels of messenger RNAs. First, RNA molecules are isolated and fragmented. Random priming is then used to create a set of short DNA sequences that are complementary to the original RNA, and these sequences are sequenced, in this case using an Illumina sequencer, to give us our RNA-seq data set.

So now we need to actually analyze this RNA-seq data. There are many ways we can do this, depending on the goals of the experiment.
For transcriptomes in particular, there are two paths we can imagine taking: the align-then-assemble approach, if we have a reference genome we can align our reads back to, or we can try to assemble transcripts de novo from the RNA-seq data. We're going to use the align-then-assemble approach, meaning that we need to take all of these reads and align them back to a reference genome. If you have a reference genome, the align-then-assemble approach is potentially more sensitive. Depending on whether there are certain kinds of variation in your genome, you may not want to use this approach, but generally, if you have a reference genome, this is going to be more sensitive.

So the problem we now have is that we need to align reads back to the genome in a way that's aware of splicing. When we were doing ChIP-seq, for example, we expected every read to align exactly back to our genome, maybe with some small differences, primarily due to sequencing error. But transcripts have been spliced. This means that a read may align to two regions with a large gap in between them, representing the spliced-out intron. The tool we're going to use for this is TopHat, a spliced read aligner designed for aligning RNA-seq data from spliced transcripts back to the reference genome.

So if we go to Galaxy and look for TopHat, we can align using the TopHat for Illumina tool. Click on TopHat for Illumina and select our first data set. This is Cd20_Rep1. We need to use a built-in genome. This is human data, so the appropriate genome is the human genome, and we're going to use build hg19. If you type hg19 into the search box, you should see Human hg19. And for this particular example, we have just single-end data.
So we're going to leave this at single end, and we're going to leave the settings at their defaults. If you click Execute, you'll get four new data sets from TopHat: insertions, deletions, splice junctions, and accepted hits. The first three are BED-format data sets that represent the locations where TopHat is predicting insertions, deletions, and splice junctions. The last is the mapped reads themselves, and so these are spliced mapped reads in BAM format.

This will take a few minutes to run. While it's running, I'm going to go ahead and queue up TopHat mapping jobs for the additional three data sets. We need to map each of the four data sets independently with TopHat. Later, we'll be able to combine them during assembly. If we select TopHat again, there's a feature of Galaxy that makes it easy to run this tool over multiple data sets. Under RNA-seq FASTQ file, where we were previously selecting a single data set, you see three options. The second allows you to select multiple data sets simultaneously, so click on the multiple data sets button. Now, we've already run our mapping on data set one. We want to run on two, four, and five. Using Shift-click, I can select all three of those at the same time. Again, select hg19 as the reference genome. Once again, our data is single-end and we're going to use the default settings. If I click Execute, I can now run three additional TopHat jobs simultaneously, and the results come into my history.

All right, so once your TopHat jobs are completed, you'll see that you have four sets of accepted hits in BAM format, each of which is the result of doing spliced read alignment. You'll also have BED files, including this one of the splice junctions, if you go ahead and look at that. These are all of the cases where, based on the spliced mapping data,
TopHat believes that there is a splice junction present. This is in BED format again, so you can see the positions of these predicted splice junctions.

So, to summarize, RNA-seq analysis using a reference genome requires an aligner that is splicing-aware, meaning it can handle what appear to be long deletions in the reads. TopHat is one such aligner, based on Bowtie, that's available in Galaxy and that we can use for RNA-seq analysis.
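
To make "what appear to be long deletions" a little more concrete, here is a small Python sketch of how a spliced alignment can be read from a SAM/BAM CIGAR string. In the SAM format, the `N` operation marks a skipped stretch of the reference, which spliced aligners use for the intron between two aligned parts of a read. The function name `aligned_blocks` and the example coordinates are invented for illustration. This is not TopHat's own code, just one way to interpret the alignments it reports.

```python
import re
from typing import List, Tuple

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(start: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 0-based, half-open reference intervals covered by a read.

    An 'N' operation (skipped reference region, i.e. a putative intron)
    closes the current block and starts a new one after the gap.
    """
    ops = CIGAR_OP.findall(cigar)
    if "".join(length + op for length, op in ops) != cigar:
        raise ValueError(f"unrecognized CIGAR string: {cigar!r}")

    blocks: List[Tuple[int, int]] = []
    block_start = start
    pos = start
    for length, op in ops:
        n = int(length)
        if op in "M=XD":
            # Match, mismatch, or small deletion: stays within the block.
            pos += n
        elif op == "N":
            # Intron-sized gap: finish this block and jump over the gap.
            if pos > block_start:
                blocks.append((block_start, pos))
            pos += n
            block_start = pos
        # I, S, H and P do not advance along the reference.
    if pos > block_start:
        blocks.append((block_start, pos))
    return blocks


# A hypothetical 50-base read spanning a 2,000-base intron.
print(aligned_blocks(1000, "30M2000N20M"))  # [(1000, 1030), (3030, 3050)]
# The same read length aligned without any splicing.
print(aligned_blocks(1000, "50M"))          # [(1000, 1050)]
```

The gap between the two blocks in the first example, from 1030 to 3030, is the kind of region that would show up as a predicted junction in the splice junctions data set.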