All right, in this week's lab we'll be doing selection analysis, which attempts to identify whether natural selection is acting on DNA sequences. And if natural selection is in fact acting, is the selection in a positive direction or a negative direction? That is, is there selection for or against a site? We're also trying to address how the sequences are changing because of natural selection.

So there are three main flavors of natural selection. Positive or Darwinian selection occurs when a beneficial mutation arises in a population and increases in frequency. An example of this would be antibiotic resistance arising in a bacterial population, which would quickly spread throughout the population. So if someone has been treated with an antibiotic, that mutation would quickly spread to all of the bacteria.

The other kind of natural selection is negative or purifying selection, which is the opposite of positive selection. This occurs when a detrimental mutation is selected out of a population. A detrimental mutation is typically anything that would inactivate a protein or an enzyme, and these kinds of mutations are quickly eliminated from a population.

The third flavor of natural selection is balancing or diversifying selection. This is the kind of selection that favors the maintenance of genetic variation at a locus. You can imagine that this kind of selection might occur in a population that exists in different environmental niches, where one allele might be advantageous under one set of conditions and another allele might be advantageous under another set of conditions.

How do we measure selection? There are two commonly used measures, Tajima's D and the dN/dS test. There are many ways to quantify genetic variation. Typically population geneticists focus on two metrics, theta and pi. Theta is based on the number of segregating, or variable, sites in the sample.
And pi is based on the average number of differences between all pairwise combinations of sequences. This is also called pairwise nucleotide diversity. So both theta and pi measure variation. Pi is more sensitive to the frequency of genetic variants, while theta considers all variants equal, regardless of whether they're found in only one sequence or in half of all of the sequences. Pi takes these differences into account.

So this is just a cartoon of what I just said. Basically, we can have rare polymorphisms occurring in just one sequence, and we can have frequent polymorphisms occurring in several of the sequences from a population. Theta just counts the number of polymorphic sites across the sequence, whereas pi measures the average difference across all possible pairwise comparisons of the sequences. So theta is based on the number of segregating sites, and pi is the nucleotide diversity.

Natural selection acts directly on mutations that change fitness, and this is typically manifest at the protein level. But genetic regions surrounding a selected site are also influenced. The reason is that the regions surrounding a site are dragged along by genetic hitchhiking, and the size of the piece that hitchhikes along depends on the rate of recombination. Most of the genetic variation in the surrounding regions is neutral. These are, for example, third-position substitutions in the codon that don't change the protein sequence, and we'll talk about this in a little bit. But we can still see signs of selection in this neutral variation because of its linkage to a selected polymorphism.

Tajima's D looks at theta and pi and asks which one predominates. If a gene is neutrally evolving, with no selection acting, theta would be roughly equal to pi. In the case of positive selection, what happens is that genetic variation is swept from a population, and this is called a selective sweep.
Very little genetic variation will remain in a population after the beneficial mutation spreads through it. But then new genetic variation will begin to accumulate, and this new variation is accumulating on a more or less homogeneous background. Therefore, most mutations will be at a low frequency, and in this case theta is greater than pi.

In the case of balancing selection, what's happening is that we're retaining alleles, that is, genetic variation, longer than would be expected. An allele neither sweeps through the population nor is eliminated from it. Since that's the case, those alleles will rise to intermediate frequencies, and in this case theta is less than pi.

This is just a cartoon again of what I was just saying. Basically, if the genes are neutrally evolving, theta is roughly equal to pi and our Tajima's D is approximately 0. If there's been a selective sweep due to positive selection, what we'll see is an accumulation of rare variants in members of the population at a given locus. In this case theta is greater than pi, and D is less than 0. And if alleles have been retained for a longer period, we'll actually see accumulation of variation in multiple branches of the locus. Theta in this case is less than pi, and D is greater than 0.

There are some caveats when using Tajima's D. You can see that it could be a powerful statistic for detecting selection, but it can be fooled by a couple of factors. One of these is the population having gone through a bottleneck, which would very much resemble the case of positive selection. So it's very difficult to distinguish between a population having gone through a bottleneck and positive selection. The other issue arises if the locus is in a region of low recombination.
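
As a rough illustration of how theta, pi, and Tajima's D relate, here is a minimal Python sketch. It is not the tool used in the lab. It assumes equal-length, gap-free aligned sequences, counts pi as the average number of pairwise differences over the whole sequence rather than per site, and uses the normalizing constants from Tajima's 1989 paper. The function names and the two toy alignments (`rare_variants`, `intermediate_variants`) are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(seqs):
    """Count alignment columns where at least two sequences differ (S)."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differing positions over all pairs of sequences (pi)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def tajimas_d(seqs):
    """Tajima's D for equal-length, gap-free aligned sequences."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4 or s == 0:
        # With fewer than 4 sequences the variance term is zero.
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    theta = s / a1  # estimate based on the number of segregating sites
    pi = mean_pairwise_differences(seqs)
    return (pi - theta) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignments, six sequences each, four variable sites each.
rare_variants = [          # every variant is found in only one sequence
    "AAAAAAAA", "TAAAAAAA", "ATAAAAAA",
    "AATAAAAA", "AAATAAAA", "AAAAAAAA",
]
intermediate_variants = [  # every variant is found in half of the sequences
    "AAAAAAAA", "AAAAAAAA", "AAAAAAAA",
    "TTTTAAAA", "TTTTAAAA", "TTTTAAAA",
]

print(tajimas_d(rare_variants))          # negative: theta exceeds pi
print(tajimas_d(intermediate_variants))  # positive: pi exceeds theta
```

Both toy alignments have the same number of segregating sites, so theta is identical. Only pi changes with the frequency of the variants, and that is what pushes D below or above zero.
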
So basically there are some other tests that we need to apply, looking at population substructure with other tools, in order to be able to interpret Tajima's D properly.

The other kind of test we can do for natural selection is the dN/dS ratio test, also known as the Ka/Ks ratio test. It's perhaps the most widely used method for detecting the kind of natural selection that's occurring from nucleotide sequence data. And it's quite useful because we can actually infer selection all the way down to the level of the codon, to ask whether or not there are specific sites that are being selected for.

In order to understand the dN/dS test, we need to understand nonsynonymous versus synonymous nucleotide changes. Nonsynonymous substitutions result in a change in the protein sequence. Synonymous substitutions change the DNA sequence but not the protein sequence, due to the degeneracy of the genetic code. So if we look at the codons for amino acids, we can see that, for instance, in the case of leucine, there are actually six codons that encode leucine: TTA, TTG, CTT and so on. Typically it's the third position, and occasionally the first position, that can change without changing the amino acid that's being coded for, whereas a change at the second position of a codon always causes a nonsynonymous change.

The dN/dS ratio test actually calculates the ratio of the rate of nonsynonymous substitutions, dN, which is the number of nonsynonymous substitutions per possible nonsynonymous site, to the rate of synonymous substitutions, dS, which is the number of synonymous substitutions per possible synonymous site. Now, synonymous substitutions are not exposed to strong selective pressures because they don't result in a change to the protein sequence.
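
To make the synonymous versus nonsynonymous distinction concrete, here is a small Python sketch. It builds the standard genetic code from the usual TCAG ordering, lists the codons for an amino acid, and classifies a change between two codons. The helper names (`codons_for`, `classify_substitution`, `synonymous_changes_by_position`) are hypothetical, and the sketch assumes the standard code only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def classify_substitution(codon_a, codon_b):
    """Label a codon change as identical, synonymous, or nonsynonymous."""
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "nonsynonymous"


def synonymous_changes_by_position():
    """For each codon position, count the single-base changes (from any
    amino-acid-encoding codon) that leave the amino acid unchanged."""
    counts = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # ignore stop codons as starting points
        for pos in range(3):
            for base in BASES:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if base != codon[pos] and CODON_TABLE[mutant] == aa:
                    counts[pos] += 1
    return counts


print(codons_for("L"))                      # the six leucine codons
print(classify_substitution("CTT", "CTC"))  # leucine to leucine
print(classify_substitution("CTT", "CCT"))  # leucine to proline
print(synonymous_changes_by_position())     # first, second, third position
```

Running the last line shows how strongly synonymous changes are concentrated at the third codon position in the standard code.
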
And typically, selective pressure acts at the level of the proteins, which are the manifestation of the phenotype that arises from the DNA sequence. Thus synonymous substitutions accumulate at a roughly constant rate, so we can use the rate of synonymous substitutions as a baseline against which to compare the substitutions that do change the protein sequence, which are the nonsynonymous substitutions.

So, when we do a dN/dS ratio test, it's critical to have our DNA actually broken up into codons and aligned by codons. We talked about this when we were doing earlier labs, where we can translate our sequence from the DNA into protein, do the alignment at the protein level, and then back-translate into the DNA sequence. Once we've got the DNA sequences in a codon-based alignment, we can do the dN/dS ratio test with those data.

So how do we interpret the dN/dS ratio? In the case of a completely neutral sequence, which is free to change without constraints, you would expect dN to be the same as dS, so the rate of nonsynonymous substitutions is the same as the rate of synonymous substitutions. In this case the ratio dN/dS will be equal to one. That's what's shown over here: Ka/Ks would be equal to one.

When there are selective constraints on a sequence, so if there's negative selection, you'd expect fewer substitutions that change the protein sequence. We would have a lower dN, and therefore dN/dS would be less than one. That's shown over here: Ka/Ks would be down here, closer to zero.

In the case of positive selection, we'd expect to see a greater proportion of amino acid substitutions in the population because they're being increased by positive selection. So we would have a higher number of nonsynonymous substitutions relative to the number of synonymous substitutions. Therefore our dN/dS ratio will be greater than one.
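
The interpretation above can be tried out on a pair of sequences with a small Python sketch. This is a simplified counting approach in the style of Nei and Gojobori, not the model-based method a server would use. It assumes two codon-aligned, gap-free coding sequences with no stop codons, treats changes to stop codons as nonsynonymous when counting sites, and applies a Jukes-Cantor correction that only works when fewer than 75% of sites differ. All function names and the three toy sequences are hypothetical.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Synonymous sites in a codon: per position, the fraction of the three
    possible base changes that leave the amino acid unchanged."""
    aa = CODON_TABLE[codon]
    sites = 0.0
    for pos in range(3):
        for base in BASES:
            mutant = codon[:pos] + base + codon[pos + 1:]
            if base != codon[pos] and CODON_TABLE[mutant] == aa:
                sites += 1 / 3
    return sites


def count_differences(codon_a, codon_b):
    """(synonymous, nonsynonymous) differences between two codons, averaged
    over every order of single-base steps; paths through a stop are skipped."""
    positions = [i for i in range(3) if codon_a[i] != codon_b[i]]
    syn_total = nonsyn_total = 0.0
    n_paths = 0
    for order in permutations(positions):
        current, syn, nonsyn = codon_a, 0, 0
        for pos in order:
            step = current[:pos] + codon_b[pos] + current[pos + 1:]
            if CODON_TABLE[step] == "*":
                break
            if CODON_TABLE[step] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = step
        else:  # only reached when the path never hit a stop codon
            syn_total += syn
            nonsyn_total += nonsyn
            n_paths += 1
    if n_paths == 0:
        return 0.0, 0.0  # simplification: ignore codon pairs with no valid path
    return syn_total / n_paths, nonsyn_total / n_paths


def jukes_cantor(p):
    """Correct a proportion of differences for multiple hits (needs p < 0.75)."""
    return -0.75 * log(1 - 4 * p / 3)


def dn_ds(seq_a, seq_b):
    """Return (dN, dS) for two codon-aligned, gap-free coding sequences."""
    codons_a = [seq_a[i:i + 3] for i in range(0, len(seq_a), 3)]
    codons_b = [seq_b[i:i + 3] for i in range(0, len(seq_b), 3)]
    syn_sites = sum(
        (synonymous_sites(a) + synonymous_sites(b)) / 2
        for a, b in zip(codons_a, codons_b)
    )
    nonsyn_sites = 3 * len(codons_a) - syn_sites
    syn_diffs = nonsyn_diffs = 0.0
    for a, b in zip(codons_a, codons_b):
        syn, nonsyn = count_differences(a, b)
        syn_diffs += syn
        nonsyn_diffs += nonsyn
    return jukes_cantor(nonsyn_diffs / nonsyn_sites), jukes_cantor(syn_diffs / syn_sites)


def interpret(dn, ds):
    """Return (omega, label) following the dN/dS reading described above."""
    if ds == 0:
        raise ValueError("dS is zero, so the ratio is undefined")
    omega = dn / ds
    if omega > 1:
        return omega, "positive selection"
    if omega < 1:
        return omega, "purifying (negative) selection"
    return omega, "neutral"


# Hypothetical toy sequences of eight codons each.
reference = "CTTGCTGGTAAACCTGTTACTTCT"
conserved = "CTCGCCGGTAAACCTGTTACTTCT"  # two synonymous changes only
divergent = "CTCGATGATGAACATGTTACTTCT"  # one synonymous, four nonsynonymous

print(interpret(*dn_ds(reference, conserved)))
print(interpret(*dn_ds(reference, divergent)))
```

With sequences this short the numbers are only illustrative. A real analysis works on many sequences, a phylogeny, and an explicit codon substitution model.
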
Now it's important to keep in mind that most functional genes, or protein-coding genes, are under some level of selective constraint. So dN/dS ratios are typically well below one for coding regions. Certain sites within those sequences, however, can be under positive selection. This figure over here to the right just shows rat-mouse orthologous genes plotted in bins according to their Ka/Ks ratio. A Ka/Ks ratio of one is over here, and it goes down to 0.05 on this side. We see that by far the greatest number of genes fall into this bin here, with a very small Ka/Ks or dN/dS ratio.

So how can we use this information to study the world around us? Well, basically, evolution is an arms race, and certainly a lot of the drugs that we take are applying selective pressure onto the bacteria or viruses that infect us, or infect the plants that we like to eat. And this is certainly of concern. Changes to viral DNA sequences can happen over time, and changes to bacterial DNA sequences can happen over time, simply through new mutations that arise. Then these mutations, if they are beneficial, can spread to other members of the population, and that could lead to outbreaks.

So for instance this paper here actually models the effect of antiretroviral drugs on HIV and shows that some of the strains of HIV that are now resistant to antiretroviral drugs could pose a threat and cause a new epidemic, which would not be good. So perhaps we can use this information to understand the kind of selective pressure that's being applied to DNA sequences, and to get a better understanding of how we can perhaps modify drugs so they could still work against an evolved strain of HIV, for instance. And here is an example with the HIV-1 protease using the SELECTON server out of Tel Aviv University.
Basically, the HIV-1 protease is an essential enzyme for viral replication, and several drugs target it. One of these is Ritonavir. What the creators of this tool, the SELECTON server, Go and colleagues, did is they took 70 HIV-1 protease gene sequences from Ritonavir-treated HIV-1 patients, extracted from a database, the Stanford HIV Drug Resistance Database. They then fed these sequences, along with the PDB file, the tertiary structure of the protease bound to Ritonavir, into the SELECTON server. What the SELECTON server does is examine, on a residue-by-residue basis, whether the sequence for the HIV protease is under positive selection or purifying selection.

Positively selected residues in this case are shown in this orangey color, and negatively selected residues, those under purifying selection, are in this purple color. When we map that onto the 3D structure of the HIV-1 protease complexed with Ritonavir, we can see quite nicely that residues in the area where the Ritonavir binds are actually being positively selected for. So they're being selected to change from what they were before, perhaps so that Ritonavir can't bind as well in the HIV patients who are now resistant to Ritonavir. So by understanding which residues are being acted upon, we can start to think about how evolution is acting, and perhaps modify the drug so that it could work in the cases where it currently doesn't work.

Another site-specific dN/dS analysis comes from Professor Guttman, who's coauthor on this course. His interest is in plant-pathogen interactions, and plants are pretty interesting. They do actually have an immune system. There's an innate immune system, and because of it the vast majority of bacteria actually can't infect plants.
And this innate immune system is triggered by molecular patterns called MAMPs. These molecular patterns are found on structures like the peptidoglycan cell wall of bacteria or the flagella of bacteria. They trigger innate immunity, which commonly shows up as callose deposition, an increase in a kind of cell wall component that makes it difficult for those particular bacteria to enter the plant cell and attack it. Some kinds of bacteria, however, have evolved mechanisms that shut down this innate immunity by injecting components through a secretion apparatus, effectors that actually block the innate immunity. Plants in turn have evolved mechanisms to shut down the bacterial mechanisms, and then the arms race continues. Other bacteria have, in turn, acquired new mechanisms for blocking this compensation mechanism here. So it really is an evolutionary arms race.

We can start to use sequences across many bacterial varieties, different strains of bacteria, and ask the question: are there any particular sites that are under positive selection, which might indicate that those sequences are being forced to change by the action of plant defense systems? There's a very nice paper published a couple of years ago by Honour McCann, one of Dave Guttman's grad students, and Professor Guttman, where they took the genomes of three Pseudomonas syringae plant pathogens and three Xanthomonas campestris pathovars, scanned 1322 orthologous core genes for evidence of positive selection, and came up with several candidates.

So what we're looking at here are four genes and the signatures of positive selection along those genes. The red line here is a Ka/Ks ratio of one. Anything above that line indicates a positively selected residue or region, and anything below that line indicates a negatively selected region, or a region under purifying selection.
And as you can see, as I mentioned before for the rat-mouse orthologs, many of the residues are indeed under purifying selection, as you would expect for proteins that are important for certain core functions: a DNA processing protein, elongation factor Tu, flagellin. You need to have the flagellin to be able to move around. However, there are some residues that do seem to be under positive selection.

What McCann and her coworkers did is they took those residues and asked the question: are those residues being detected by the plant, and are they able to initiate this kind of innate immune response? If that's the case, then you could imagine that those residues would be under positive selection to change, to actually reduce the innate immune response. And indeed, several of the peptides that they chose from this positive selection analysis were able to initiate a positive response in the innate immune assay that we use, which is accumulation of callose, as you can see in these panels here. This is callose deposition, which is measured in the graph below.

So in today's lab, we'll be examining HrpZ, one of the components of the type III secretion system that allows bacteria to insert a set of proteins into the plant cell that can help subvert the innate immune response. It's not clear whether or not parts of it should be under positive selection, but one suspects that they might be, and we'll be investigating that.

Now in order to be able to do this kind of analysis, we do need to first perform a phylogenetic analysis, and also, of course, to have our sequences aligned by their codons. We'll also need some evolutionary models, which we touched on briefly in the phylogenetics lecture. We'll be letting a tool called Datamonkey help us choose the correct model, and we'll also upload our phylogenetic information to this tool in order to be able to do the positive selection analysis.
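
For the codon-aligned input mentioned above, the back-translation idea can be sketched in a few lines of Python. This is only an illustration, not what the lab's tools do internally. It assumes you already have a protein alignment row that uses `-` for gaps and the matching unaligned coding sequence, and it does not check that the DNA actually translates to that protein. The function name `codon_align` is hypothetical.

```python
def codon_align(aligned_protein, dna):
    """Thread an unaligned coding sequence onto its aligned protein row.

    Each amino acid consumes the next three bases of the DNA, and each
    gap character '-' in the protein row becomes a three-base gap '---'.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(dna) < 3 * n_residues:
        raise ValueError("DNA is too short for the aligned protein")
    pieces = []
    position = 0  # index of the next unused base in the DNA
    for aa in aligned_protein:
        if aa == "-":
            pieces.append("---")
        else:
            pieces.append(dna[position:position + 3])
            position += 3
    return "".join(pieces)


# Hypothetical two-sequence protein alignment and the matching coding DNA.
print(codon_align("MLGK", "ATGCTCGGTAAA"))
print(codon_align("ML-K", "ATGCTTAAA"))
```

Because gaps are always inserted three bases at a time, every column of the resulting DNA alignment stays in the same codon position, which is what the dN/dS calculation relies on.
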
I hope you can appreciate, it's becoming a bit of a detective game here, trying to figure out what's going on in biology, all the way down to the level of individual amino acids. The availability of large numbers of sequences is really allowing this detective game to become easier, and we'll talk more about that next time. I hope you enjoyed the lab.