Okay. Welcome to Bioinformatic Methods I, the first lecture. I'm your instructor, Nicholas Provart. The material for this course was developed by Ryan Austin, David Guttman, Laura Hug, Momoko Price, and myself. Just as a note, please use the Coursera tools to discuss lecture content and the labs.
This course will cover the basics of searching one of the main repositories of nucleotide and protein sequence information, GenBank at NCBI, the National Center for Biotechnology Information, using GQuery / Entrez / Search and BLAST, which stands for Basic Local Alignment Search Tool. We'll also create multiple sequence alignments and phylogenies, and cover selection analysis, next-gen sequence analysis, and metagenomics. Most of the tools we'll be using are web-based. So here's an overview of the modules that we'll be undertaking. Just as a note, we do have a Bioinformatic Methods II, where we cover protein motifs, protein-protein interactions, protein structure, gene expression analysis, as well as cis-regulatory elements.
The format of the course will be as follows. It will consist of 20-minute mini-lectures, two-minute summary videos, and the weekly labs that are the focus of the course, which will take between one and two hours to complete. There will be lab quizzes that go along with these, as well as optional lab discussion videos. So if you get stuck during the lab, you can turn to these videos for help. There are two sectional quizzes, one after the first three weeks and one after the last three weeks. Then there's one final assignment, which is due at the end of the course.
What is bioinformatics? Basically, it's the use of computational tools to manage all kinds of biological data. Here we use computers to store, retrieve, manipulate, and distribute information related to biological molecules, such as DNA, RNA, proteins, and metabolites.
Here, we're generally talking about sequence information, structural information, functional analysis of genes and genomes, and their corresponding products such as transcripts, so gene expression levels. It's sometimes called computational molecular biology. This field has really developed in the past 10 years due to the efforts of genome sequencing projects, such as the human genome sequencing project, which you may have heard of ;-)
So why do we need bioinformatics? Well, if you can imagine three billion letters in the human genome, three billion nucleotides, how do you really make sense of that without using computers? This is just a small section of a human genome encompassing the human globin gene. We would like to know which parts of the genome are important, the parts that code for proteins, for instance. Without using computers, we would never know that this region here or this region here actually comprises an exon of the globin gene, that is, a piece of the gene that actually codes for protein.
The other thing that bioinformatics is about is biological databases, how we can store these biological data. We'll talk a little bit about what a database is, about data structures, and about flat file databases versus relational databases. We'll also talk about accession numbers and identifiers, we'll go over the GenBank flat file format, and we'll touch briefly on a practical example using NCBI's Entrez / GQuery / Search.
So if we look at the planet Earth from afar, we see a lot of green, which means that life's present. This background that I've placed here is actually an output from a next-generation sequencing machine: you see small spots, which are clusters in an Illumina flow cell. It's now possible to generate a universe of information about the organisms that are on our planet.
We might be interested in genome and genomic sequences, gene sequences and mutations, and gene regulation, that is, where a given gene is expressed and when. That can tell us about the function of the gene, and about what happens when introns aren't spliced properly, or when they are spliced properly but create variants. We can think about protein sequences and some post-translational modifications, such as phosphorylation of proteins, and look at how the proteins fold up to create small machines that basically do the things that we need them to do inside our bodies. These machines don't operate in isolation. Often they operate in networks, so we're interested in how proteins function together in networks, where the proteins are localized, the kinetics of enzymes, which are a subclass of proteins, the metabolites that some of these proteins produce, and, when things go awry, what kinds of diseases are caused by defects in genes and proteins. Of course, we would like to tie all of these together with some academic framework, so we want to have access to the literature. So basically, we need databases to archive accumulated knowledge and to provide scientists with easy access to biological data.
How can we store these data? You can store them in a flat-file format with the fields separated by some kind of delimiter. Here we've got four records of professors, at the University of Toronto in this case, some of them former professors. Basically, that's the first name, separated by a pipe character from the last name, and then the department, the university, and the address in this case. We could also store those data in a spreadsheet. This is maybe a more familiar way for you to think about storing data, and you're all familiar with Excel, I'm sure. Here we've got columns that contain the first_name, the last_name, the institution, the department, and the address. There are problems with this kind of flat file database. One of these problems is redundancy.
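To make the pipe-delimited layout and its redundancy concrete, here is a minimal sketch. The four records below are invented placeholders (the names and street addresses are not from the lecture slide); only the field order and the pipe delimiter follow the description above.

```python
from collections import Counter

# Hypothetical records: first name | last name | department | university | address.
# All names and addresses are made up for illustration.
FLAT_FILE = """\
Ada|Example|Department of Botany|University of Toronto|1 Sample Street
Ben|Placeholder|Department of Botany|University of Toronto|1 Sample Street
Cam|Sample|Department of Zoology|University of Toronto|2 Sample Street
Dee|Demo|Department of Botany|University of Toronto|1 Sample Street
"""

FIELDS = ["first_name", "last_name", "department", "institution", "address"]


def parse_flat_file(text):
    """Split each non-empty line on the pipe delimiter and label the fields."""
    return [dict(zip(FIELDS, line.split("|"))) for line in text.splitlines() if line]


records = parse_flat_file(FLAT_FILE)

# Count how many times each (department, address) pair is stored.
copies = Counter((r["department"], r["address"]) for r in records)
for (department, address), n in copies.items():
    print(f"{department} at {address}: stored {n} time(s)")
```

Every repeated pair is a copy that would have to be edited by hand if the department moved.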
For instance, if we look at this record here and this record here in the flat file, the same information is entered twice, which takes up extra storage space. If the physical building where these professors are housed changes, we'll have to update all of the records in this flat file database. If we miss one of them, that would be an error.
Relational databases offer a solution, and they are commonly used in biology. What we've got is a series of tables, or relations, that contain attributes, which are the fields or columns of the table. Each row in a table is known as a tuple or a record, and the information in these tables should be normalized so that it's non-redundant. We can do this in a couple of different ways. One common way is to use a foreign key to link tables. In the first table here, the table of Professors, we've got a link to another table, Contacts, down here: a foreign key in Professors points to the primary key of the Contacts table. So in fact we would only represent the Department of Botany once, in the table of Contacts, instead of having it entered multiple times as we did in the flat file. SQL can be used to query relational databases, and there's a very large body of research and development on SQL databases, on how to index things efficiently and query these databases efficiently.
When we create biological databases, we often use different identifiers to index records. A couple of different ways of identifying records in a database, in GenBank for instance, are identifiers and accession codes. Identifiers are typically a string of letters and digits that's understandable in some meaningful way by a human. They're not as stable as accession numbers, mainly because they can be changed by curators if the presumed function of the protein is updated as research advances. In the case of GenBank, the identifier for human alcohol dehydrogenase six looks like this: HUMADH6A01.
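The Professors and Contacts tables described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the column names, the sample rows, and the addresses are assumptions, not the contents of the lecture slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE contacts (
    contact_id  INTEGER PRIMARY KEY,
    department  TEXT NOT NULL,
    institution TEXT NOT NULL,
    address     TEXT NOT NULL
);
CREATE TABLE professors (
    professor_id INTEGER PRIMARY KEY,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    contact_id   INTEGER NOT NULL REFERENCES contacts(contact_id)  -- foreign key
);
""")

# The department is stored once (hypothetical address).
conn.execute(
    "INSERT INTO contacts VALUES (1, 'Department of Botany', "
    "'University of Toronto', '1 Sample Street')"
)
# Each professor row only carries the key of its contact record (hypothetical names).
conn.executemany(
    "INSERT INTO professors (first_name, last_name, contact_id) VALUES (?, ?, ?)",
    [("Ada", "Example", 1), ("Ben", "Placeholder", 1)],
)

# If the building changes, a single UPDATE fixes it for every professor.
conn.execute("UPDATE contacts SET address = '9 New Building Road' WHERE contact_id = 1")

rows = conn.execute("""
    SELECT p.first_name, p.last_name, c.department, c.address
    FROM professors AS p
    JOIN contacts AS c ON c.contact_id = p.contact_id
    ORDER BY p.professor_id
""").fetchall()

for row in rows:
    print(row)
```

The join reassembles the flat view on demand, while the address itself lives in exactly one row.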
With an identifier like HUMADH6A01, you can imagine what the function would be just by looking at it. This might be useful if you're creating a sequence alignment. You would want to see these identifiers because they can guide you as to what the sequences are, as opposed to some random string of characters which doesn't mean anything to you. But note that an identifier can change if the curators decide that the identifier for an entry is no longer appropriate. This doesn't happen very often.
Another way to identify records in the database is by an accession code, and these accession codes are typically arbitrarily assigned. In the case of ADH6_HUMAN in UniProt, the accession is P28332. In the case of GenBank, the accession code for the human ADH6 gene is AH001409. So it's not particularly intuitive what that might be if we just looked at the accession code.
All right, how are sequences versioned within GenBank? Records typically contain the Accession.Version identifier, such as AH001409.2, in the VERSION line of the record. This identifier used to be mapped to its corresponding GI number, which was like the primary key of GenBank. To specify a sequence exactly in GenBank, you use its Accession.Version, while to retrieve the most up-to-date sequence, you use the accession number without the version. The most up-to-date sequence will then be retrieved automatically. The GenInfo Identifier system, the GI system, was deprecated by GenBank in 2016. You used to be able to retrieve a specific sequence record using a specific GI number, but now you simply use the Accession.Version to retrieve a specific sequence. I'm just telling you about GIs because you do still see them in GenBank records.
All right. So now we're going to look at the GenBank record for human alcohol dehydrogenase six by going to this link here. This is the start of the record. This record is stored, and presented to us, in the GenBank flatfile format.
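As a small illustration of the Accession.Version convention described above, the sketch below splits an identifier into its accession and version. The function name `split_accession_version` is made up for this example and is not part of any NCBI tool.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when the identifier has no
    '.version' suffix (the form used to ask for the most up-to-date sequence).
    """
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


exact = split_accession_version("AH001409.2")   # names one specific version
latest = split_accession_version("AH001409")    # no version: most up-to-date record

print(exact)
print(latest)
```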
The GenBank flatfile format is one of the most commonly used formats for nucleotide sequences. It contains all of the information associated with the sequence along with the sequence itself. It's got three parts: the header, the features, and the sequence itself.
In the header, we've got the first line, and in that first line we see the identifier here, the length of the sequence, the source (whether it's from DNA or mRNA), the type of sequence (whether it's linear or circular), the NCBI taxonomic group (it's deprecated now; this PRI refers to primate), and the date of the entry. Also in the header we have a definition, which is the biology of the molecule in a sentence. We've got some accession codes and version numbers, as well as, sometimes, the GI number that maps to the version. We've got some keywords that have been entered by researchers. Just keep in mind when you're thinking about the keywords that these are entered on a free-text basis. There's no ontology of keywords that researchers use, no standard terminology. We also see the segment for multi-exon records. In the header, we can also see the source, which contains the organism name. We've got the complete taxonomic information from NCBI. We've got reference information, the publication details about the sequence. We also have a comments section where we can dump some miscellaneous information and revision details.
The features section is a direct representation of the biological information in the record. The source feature has to be present in all GenBank records, and again it tells us where the sequence came from, i.e. the organism. It also tells us some map, chromosome, and tissue type information. The exon feature tells us that, in the case of the ADH6 record, the sequence from 287 to 396 comprises an exon. All right. In some GenBank records, you'll see the CDS, or coding sequence, feature present.
In the case of this ADH6 record, we see that we have to join these parts of the nucleotide sequence to get the coding sequence. Then you'll often see a computational translation of the coding sequence into its corresponding protein sequence. So here in this case we start with a methionine, then there's a serine followed by two threonines, and so on. Also, in some records, you can see things like signal peptides or transmembrane regions defined. These computational predictions are very useful for follow-up protein work. The last part of the GenBank flatfile format is the sequence itself. It's here where you'll see the actual nucleotide sequence for that record listed, starting with TGT in the case of the ADH6 gene.
The growth of GenBank has been dramatic over the years, and that's really been driven by improvements in next-generation sequencing technology, which we will touch on in a later lecture. Currently we're at about 400 billion base pairs within GenBank and around 200 million sequences. In order to be able to search that, we obviously need some computational methods. We can search by keyword, which is one of the examples we'll be talking about in a little bit and what you'll be doing today in the lab, or we can search by sequence similarity using BLAST.
Now, just to point out, we couldn't use Google to do sequence searches, because Google doesn't really understand how sequences work. Say, in the case of searching protein sequences in the Protein Data Bank, it doesn't know what the relationship is between amino acids, or that some amino acids are actually somewhat equivalent and could be allowed as substitutions in a sequence over evolutionary time. Google also doesn't really do gaps well. If some sequences have acquired insertions over time, Google wouldn't understand that it should group those sequences and maybe put some gaps in the smaller sequence to make it line up better with the larger sequence.
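The idea of joining parts of the nucleotide sequence to get a coding sequence, and then translating it computationally, can be sketched as follows. Everything here is a toy: the 30-nucleotide sequence and its `join(...)` location are invented (they are not taken from the ADH6 record), coordinates are treated as 1-based and inclusive, and the parser ignores real-world location syntax such as `complement(...)` or partial-end markers. The toy was chosen so that its translation starts the same way as the example in the lecture: methionine, serine, two threonines.

```python
import re
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def parse_join(location):
    """Extract (start, end) pairs from a simple 'join(a..b,c..d)' location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def splice(sequence, segments):
    """Concatenate the 1-based, inclusive segments of the sequence."""
    return "".join(sequence[start - 1:end] for start, end in segments)


def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented toy gene: flank + exon 1 + intron + exon 2 + flank.
toy_sequence = "CC" + "ATGTCTAC" + "GTAAGTTTCAG" + "TACATAA" + "GG"
toy_location = "join(3..10,22..28)"

cds = splice(toy_sequence, parse_join(toy_location))
protein = translate(cds)
print(cds)
print(protein)
```

Note that the second codon boundary falls inside the intron gap, which is why the segments have to be joined before translating.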
Substitutions and gaps are why we need to use specialized tools, rather than a general-purpose search engine, to search sequences.
Also, in terms of talking about DNA sequences, we should think about some definitions. Let's consider the case of an early globin gene. What we've got here is an early globin gene in an ancestor of all of these species: frog, chick, and mouse. Then imagine that some gene duplication event has happened here in this ancient organism, leading to an alpha chain globin and a beta chain globin. Then, over time, there are a couple of speciation events leading to frogs, chicks, and mice. Now, the frog alpha, chick alpha, and mouse alpha genes are all orthologs of one another. The mouse beta, chick beta, and frog beta globin genes are all orthologs of one another, and the mouse alpha and the mouse beta are paralogs of one another. Together, all of these are homologs of one another. So when you're talking about sequences, the terms homologous sequences, orthologous sequences, and paralogous sequences all have very specific meanings, so try not to get them confused.
If we want to search across the various databases at NCBI, we can use the NCBI Search tool, which used to be called Entrez or GQuery. It allows us to search all the links between the many databases at NCBI. What we'll do is go through an example using NCBI Search. Our sample problem is to identify some SNPs, single nucleotide polymorphisms, that is, DNA changes in individuals, which could potentially cause early onset breast cancer, and to design oligos to amplify them using PCR, the polymerase chain reaction, in samples of human genomic DNA, so that we can then sequence them. We can use the OMIM function of GQuery / Search. OMIM is Online Mendelian Inheritance in Man, and it provides links to everything that is known about a given disease across the various databases at NCBI.
So what we can do is enter "early onset breast cancer" into the search box here in the GQuery / Entrez / Search interface. The system will then go and retrieve all of the information across all of the databases about that particular search term. One part of NCBI is OMIM, Online Mendelian Inheritance in Man. We can see that it flags BRCA1 as involved in early onset breast cancer.
From there we can quickly zoom in to dbSNP. What we would do is click on the "Gene summaries" link from the OMIM record, and we would be taken to the Gene page, where we can see information about the gene. In the Table of Contents part of the Gene page, we can see the "SNP GeneView" entry. By selecting the link next to that entry, we're taken to a genomic region that contains the BRCA1 gene. Here we're looking at genetic variations in parts of the gene that could cause non-synonymous changes. In terms of the SNPs, all the non-synonymous changes are flagged in red, like that part in red down here.
Then we could design primers in that specific region so we could sequence DNA from an individual to see if those polymorphisms, the typical polymorphisms across samples in the database, are present. What we would do is go to the contig, which is stored in a different part of NCBI, specify a region, and get the nucleotide sequence. We could then use a tool called Primer3 to very rapidly design primers to amplify that region from a person's genomic DNA sample. That process would take less than 10 to 15 minutes, and this is just using one set of tools to get at the incredible information that's stored in NCBI.
We'll also be using BLAST in the lab to access the sequence information that's stored in NCBI. We'll be talking about BLAST in the next lecture. Just as an aside, we could also use the Variation Viewer at NCBI to get at the non-synonymous SNP information if we wanted to.
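The walkthrough above uses the web interface. For anyone who would rather script the same keyword search, NCBI also offers the E-utilities web service, which the lecture does not cover. The sketch below only builds an ESearch request URL and does not contact NCBI; the helper name `build_esearch_url` and the choice of `retmax=20` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, retmax=20):
    """Build (but do not send) an ESearch request for a keyword search.

    database: name of the NCBI database to search, e.g. "omim"
    term:     free-text query, as you would type it into the search box
    retmax:   maximum number of record IDs to ask for
    """
    params = {"db": database, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_esearch_url("omim", "early onset breast cancer")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a list of matching record IDs rather than the formatted pages shown in the lecture.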
Just to finish up, let's think about what databases to use for what. As the course progresses, we'll be accessing different kinds of data in different databases, and we'll be talking about some of the different databases that researchers use to access bioinformatic data. All right. I hope you enjoy the lab.