Welcome back to Bioinformatics: Introduction and Methods.
Let’s continue with our lessons on Functional Prediction of Genetic Variations.
In Unit 4, let’s take a close look at SAPRED, a classifier-based method.
SAPRED stands for Single Amino acid Polymorphisms disease-association Predictor.
It formulates the question as a supervised classification problem.
We first collected variations that are known to cause diseases and those that are not,
as the positive and negative training datasets.
Then we calculated a comprehensive set of structural and sequence features of each variation.
We used feature selection methods to select 10 groups of features, 60 attributes in total, to build an SVM classifier.
The classifier can then be applied to predict the likelihood of disease association of a new amino acid variation.
Known protein three-dimensional structures are retrieved from PDB, the Protein DataBank.
As you can see, different proteins can have very different structures.
As in PolyPhen, it is easy to imagine that
amino acid variations at different locations within different protein structures may have very different effects.
When the structure is not available for a protein of interest but it has a homologue with known structure,
we can use Homology Modeling to predict its structure.
First, we use the sequence of our protein of interest as input to BLAST to identify homologues
whose 3D structure has been solved.
We then use the structure of the most similar homologue as the template.
Then we align our protein sequence to the template structure to build a preliminary structure.
Next we run energy minimization to fine-tune and evaluate the model.
This process is usually done iteratively until the best possible model is selected.
The supplementary student presentation will discuss homology modeling in greater detail.
SAPRED also used sequence-based features similar to those of SIFT and PolyPhen, including residue frequencies and conservation scores.
It also used structure-based features similar to those of PolyPhen, including solvent accessibility, C beta density, secondary structure, and so on.
Variations on the inside of a protein structure tend to be more damaging than those on the surface,
so solvent accessibility has powerful predictive value.
The novelty of SAPRED is that it defined a number of new, biologically intuitive features,
some of which proved to have high predictive value and were later adopted by new prediction methods developed by other groups.
One interesting and powerful feature defined by SAPRED is called “Structural neighbor profile”.
It is based on the observation that some structural microenvironments seem to be more tolerant of changes at the center,
whereas other structural microenvironments seem to be intolerant of changes.
A 20-D vector is defined by taking the C alpha of the variant residue as the center and drawing a sphere with a specific radius.
The residues inside the sphere are counted to get the number of each of the 20 kinds of amino acid residues.
Each number is a component of the vector.
In this example shown here, the variant site is Histidine 128, placed at the center of a sphere with a radius of 10 Angstroms.
If we walk from the N-terminus to the C-terminus of the protein, we will find the following amino acids that fall within the sphere.
We count the number of each of the 20 types of amino acids, and create a 20-D vector.
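To make the counting concrete, here is a minimal Python sketch of a structural neighbor profile. The function name `neighbor_profile`, the toy coordinates, and the choice to leave the variant residue itself out of the count are assumptions for illustration, not SAPRED's actual implementation.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def neighbor_profile(residues, center_index, radius=10.0):
    """Count each amino acid type whose C alpha lies within `radius` of the center.

    `residues` is a list of (one_letter_code, (x, y, z)) tuples,
    ordered from the N-terminus to the C-terminus.
    """
    center = residues[center_index][1]
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for i, (aa, coord) in enumerate(residues):
        if i == center_index:
            continue  # assumption: the variant residue itself is not counted
        if aa in counts and math.dist(center, coord) <= radius:
            counts[aa] += 1
    # One component per amino acid type, in a fixed order
    return [counts[aa] for aa in AMINO_ACIDS]

# Toy C alpha coordinates in Angstroms (made up, not from a real structure)
toy = [
    ("H", (0.0, 0.0, 0.0)),   # variant site, placed at the center
    ("A", (3.8, 0.0, 0.0)),
    ("L", (0.0, 7.0, 0.0)),
    ("A", (0.0, 0.0, 9.5)),
    ("K", (12.0, 0.0, 0.0)),  # falls outside a 10 Angstrom sphere
]
profile = neighbor_profile(toy, center_index=0, radius=10.0)
print(dict(zip(AMINO_ACIDS, profile)))
```

Calling the same function with a different `radius` gives a different vector, which is how profiles at several radii could be compared.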
We found that different radii had different predictive power.
Based on the simulation here we can see that 13 Angstroms seems to be the optimal radius.
Another intuitive and useful feature we defined in SAPRED is called nearby functional sites.
Amino acid variations located exactly on functional, active, and binding sites tend to have a large effect on protein function.
However, these sites are few and far between, and thus the coverage of this feature is low.
In SAPRED we considered that variants in the vicinity of important sites could also affect protein function.
This can significantly increase coverage of this feature.
Indeed, our statistical analysis of the variations showed that the shorter the distance to functional sites, the more likely the variations are to be deleterious.
This is true when “distance” is measured in terms of either sequence or 3D structure.
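The two notions of distance can be sketched in a few lines of Python. This is only an illustration: the function name `distance_to_functional_sites`, the residue numbers, and the coordinates are hypothetical, and SAPRED's own definition of the feature may differ.

```python
import math

def distance_to_functional_sites(ca_coords, variant_pos, site_positions):
    """Return (sequence_distance, spatial_distance) to the nearest functional site.

    `ca_coords` maps a residue number to the (x, y, z) of its C alpha atom.
    The nearest site in sequence and the nearest site in space may differ.
    """
    seq_dist = min(abs(variant_pos - s) for s in site_positions)
    spatial_dist = min(
        math.dist(ca_coords[variant_pos], ca_coords[s]) for s in site_positions
    )
    return seq_dist, spatial_dist

# Toy data (made up): residue 50 is far away in sequence but close in space
ca_coords = {10: (0.0, 0.0, 0.0), 12: (6.0, 8.0, 0.0), 50: (3.0, 4.0, 0.0)}
seq_d, spatial_d = distance_to_functional_sites(ca_coords, 10, [12, 50])
print(seq_d, spatial_d)
```

In the toy data the closest site in sequence is residue 12, while the closest site in 3D is residue 50, which is why it can help to measure both.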
Another useful feature that surprised us was whether a variation is located in disordered regions that do not have a fixed structure.
Contrary to our initial expectations, we found that 114 out of 122, or 93% of amino acid variations in disordered regions are associated with disease.
The next feature is consistent with our intuition.
The more hydrogen bonds lost when a variant amino acid replaces the wild-type,
the higher the probability of the variant being associated with disease.
Also intuitively, 94% of variants in transmembrane regions are disease-associated.
87% of variants that alter β-aggregation properties are disease-associated.
In contrast, among 435 variants from HLA families, all except one are neutral.
Using these features we trained a Support Vector Machine classifier.
Your TA Meng Wang has filmed a supplementary video to tell you more about Support Vector Machines.
To evaluate the accuracy of SAPRED we performed five-fold cross-validation, meaning that
we divided the training data into five portions.
We performed five rounds of training-and-test, each round keeping one portion of the data for testing
while using the other four portions for training.
We made sure that the data was balanced.
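The five rounds of training-and-test can be sketched as follows. This is a minimal illustration in plain Python: the function name `stratified_k_fold` and the toy labels are hypothetical, and dealing each class evenly across the folds is only one assumed way of keeping the portions balanced.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for k rounds.

    Indices of each class are shuffled and dealt round-robin into k folds,
    so every fold keeps roughly the same class ratio as the full data.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for t in range(k):
        test = sorted(folds[t])  # one portion held out for testing
        train = sorted(i for f in range(k) if f != t for i in folds[f])
        yield train, test

# Toy labels: 1 = disease-associated, 0 = neutral
labels = [1] * 10 + [0] * 10
splits = list(stratified_k_fold(labels, k=5))
for round_no, (train, test) in enumerate(splits, start=1):
    print(round_no, len(train), len(test))
```

Each variation is used for testing exactly once and for training in the other four rounds.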
Performance was measured using both standard accuracy and the Matthews correlation coefficient (MCC), which is defined here.
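For reference, both measures can be computed from the four counts of a confusion matrix. The sketch below uses the standard textbook definitions; the counts in the example are made up and are not SAPRED's results.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient:
    (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal total is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up counts for illustration only
tp, tn, fp, fn = 40, 45, 5, 10
print(accuracy(tp, tn, fp, fn), round(mcc(tp, tn, fp, fn), 3))
```

Unlike plain accuracy, MCC uses all four counts, so it stays informative when one class is much larger than the other.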
The results showed that, other than residue frequencies, the 13 Angstrom structural neighbor profile had the highest predictive power.
Nearby functional sites also had higher predictive power than solvent accessibility.
SAPRED achieved an accuracy of 82.60%, which was higher than other contemporary methods on the data sets tested.
SAPRED can be accessed from a web interface.
It takes as input the protein sequence, variant site, and 3D structure or structural model,
and outputs the predicted likelihood of the variant being associated with disease.
It also outputs the structural features that explain why the variant was predicted to be disease-associated or neutral,
as well as relevant sequence features.
When homology models cannot be built, SAPRED has a sequence-only mode whose accuracy is slightly lower but still quite good.
We have now finished learning about the genetic variation databases and three types of predictive methods.
Let’s now go back to look at Angelina Jolie’s decision.
What we have covered in this week’s lectures is only one of the questions involved in her decision.
There are lots of other factors to consider in making this complicated decision,
many, but certainly not all, of which can be formulated as different conditional probabilities.
For instance, even after she had her breasts removed,
what is the likelihood of her still getting breast cancer?
If she didn’t have her breasts removed, could frequent check-ups detect early signs of cancer if it develops?
If cancer can be detected early, is there an effective treatment or cure?
Even if there is no effective treatment now,
new effective treatments continue to be developed for all sorts of diseases every year.
What is the probability that an effective cure will be developed before her cancer develops?
Another factor that is important to consider is that no surgery is 100% safe.
What is the probability of death from the breast removal surgery?
Other factors that are more important for some people than for others include the age at which the cancer may develop.
Which emotional stress is worse, the fear of cancer or the emotional distress over loss of breasts?
Last year, in my class at Peking University,
a young woman said that having her breasts removed would make her feel no longer a woman, and that she would rather die.
So you see, different people may have very different opinions and preferences.
Finally, a factor that is not important for Angelina Jolie but important for most others is the cost of the surgery.
One effective approach to increase prediction power is to incorporate family history.
Angelina has a strong family history of breast and ovarian cancer with early onset and poor prognosis.
Jolie's mother died from ovarian cancer at 56 after a 10-year struggle.
Her aunt died of breast cancer in 2013 at age 61 after a 9-year struggle.
Her grandmother also died of cancer at 45.
Her great-grandmother died of ovarian cancer at 53.
Her uncle also died of cancer in 2009.
Two questions that I’m sure her doctors have addressed, but that are too important not to mention here, are:
Has the causal mutation been identified in her affected family members,
and does it co-segregate with cancer in her family?
Does Jolie carry the same mutation?
So as you can see, there are lots of unanswered questions and remaining challenges.
How can we further improve prediction accuracy?
How can we integrate multiple sources of evidence?
We have focused on missense variations this week, but how can we make predictions for noncoding variations?
Having more high-quality training data can significantly improve a classifier-based method’s accuracy.
So how can we generate more training data?
And finally, when it comes to human genetics, as responsible scientists we need to keep all the ethical issues in mind.
Challenges are not necessarily a bad thing.
Whenever there are challenges, there are opportunities for innovation, and it could be you.
Thank you for your interest in bioinformatics and for your participation in this MOOC.
See you next time!