0:00
[MUSIC]
Alright, in this week's lab we are
going to be exploring protein-protein interactions, and in
this lecture we'll talk
about why we want to study protein-protein interactions.
We'll talk about methods for determining
protein-protein interactions, protein-protein interaction databases, the
properties of protein-protein interaction networks and
tools for investigating protein-protein interaction networks.
So if we consider the cell as a city, how could we go about describing the city?
We could describe where the people live, how they get from A to B,
how they interact, the methods that they use to get from A to B...
these kinds of things, these kinds of parameters. We could also consider the
cell as a circuit and here I've
depicted a Colpitts oscillator as an electronic diagram.
And you can see that we've got a transistor and
some various other parts, a couple of resistors, capacitor, power supply.
And it's important to know how these parts are connected in order
to be able to figure out what that circuit is going to do.
Now ultimately, all we care about in terms of an electronic circuit, in
terms of our iPhone is what happens when we turn it on.
Can we talk into
it, can we hear music from it?
And this is actually what happens when you turn on the Colpitts oscillator.
We get an oscillating signal coming out of it.
And we're sort of approaching that level
of understanding for biology and that's almost the
area of systems biology, understanding biological systems as
a collection of parts in order to understand how
things respond, how we respond, to
environmental cues.
Now an important step however, to being able
to get there is to know how things are
wired, and that's why we want to figure out
how the parts of the cell are wired together,
so that hopefully, down the road, we
can actually understand how the response happens.
So proteins do not typically operate in isolation, but rather as part of
larger complexes or as signal transduction cascades or metabolic modules.
So we need to elucidate PPIs, protein-protein interactions,
to understand the biological system in question.
And such studies can tell us whether a given protein is
a key player or peripheral member of a given
system, how many interactors that protein has.
2:55
where people have used such methods to determine protein-protein interactions.
We can do yeast two hybrid studies,
followed by clone sequencing. We can do affinity purification,
such as TAP tagging, followed by mass spectrometry to determine the interactors.
We can infer interaction based on orthology and we can also use some
other high-throughput methods that I'll talk about in a second.
So, in the case of yeast two-hybrid, what we're doing is we're taking two proteins:
we're taking
3:27
the GAL4 protein and we're splitting it into two parts,
the activation domain and the binding domain.
So the activation domain here is shown as this pink
dot and the binding domain is shown as this green blob.
And what we're doing is, we're attaching one protein, here called the bait, to
the binding domain, and we're attaching another protein (the prey)
to the activation domain.
And if the bait and the prey interact,
then we basically reconstitute the function of the GAL4
protein, bringing the activation domain in proximity to the start of transcription,
and we get a transcriptional read-out from a reporter gene such as lacZ.
So in the case of TAP tagging, what we're doing is we're
adding a small tag to the end of a protein,
the target protein.
And we can actually add a couple of tags. So here's the calmodulin-binding peptide
and this protein A. And that's actually the tandem part of the acronym,
5:28
So there are several advantages and disadvantages to yeast
two-hybrid systems.
So the advantages are that it's very amenable to automation.
We can do a lot of screens in a high throughput manner.
It's a yeast-based system, so
creating the clones is very rapid, and doing the tests is quite rapid.
The disadvantages are that it's a somewhat artificial system in the sense
that we're targeting our proteins to the nucleus of yeast, typically they're
over-expressed, so that if the protein
is inherently sticky, we might get a lot of false positives out of the system.
6:16
It's not a great way to determine non-binary interactions.
So it's great for determining binary protein-protein interactions.
But if we want to determine the membership of
a complex, we might want to use TAP tagging.
One other advantage of yeast two-hybrid is that it
tends to be quite good for detecting transient interactions,
so the kinds of interactions that occur in signaling pathways.
Now, the advantages of TAP tagging are that
it's performed typically in in vivo systems.
We introduce this construct back into the
organism from where the protein was originally isolated.
We might express it at endogenous levels, native levels.
So we're not over-expressing it and therefore
we can hopefully avoid this problem of stickiness that might occur with
yeast two-hybrid. The disadvantage of this method is that we have to
7:58
There are other experimental methods for determining protein-protein
interactions, and they are listed in this table here.
So there's yeast two-hybrid, as I mentioned. (This table is sorted
by whether or not the methods are high throughput or low throughput.)
High-throughput methods include yeast two-hybrid and affinity purification
mass spectrometry, the two I just told you about, as well as
DNA microarrays
and gene coexpression.
We'll talk a little bit about that in the gene expression analysis lecture and lab.
Protein microarrays are another method whereby proteins are spotted
onto membranes and then we can wash a different protein
over those membranes to see which protein it binds to
on the array. We can use synthetic lethality or phage display.
Low throughput methods include x-ray crystallography, so if we can
co-crystallize two proteins and see how they interact that's great.
But definitely a very low throughput method.
We can use FRET, we can use surface plasmon resonance,
atomic force microscopy, and electron microscopy.
These are all quite low throughput methods.
9:14
Right.
So, as I mentioned, we can also use the
interacting orthologs, the concept of
interacting orthologs, to predict whether or
not two proteins interact and that's what we've done, my lab
has done, in collaboration with Matt Geisler at Southern Illinois University.
And what we did here is we took the genome sequence databases from four organisms:
yeast, fly, worm and human. We took the Arabidopsis genome, and we
computed the orthologs using a piece of software called InParanoid to come up with
a list of orthologs between Arabidopsis genes and genes in these other species.
And then we took the interactome databases of those four species and did a match-and-replace
for the orthologs to come up
with a predicted Arabidopsis interactome, and we
could have a score associated with each predicted interaction.
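The match-and-replace step described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the pipeline the lab actually used: the gene identifiers, the `orthologs` and `interactions` tables, and the scoring rule (the number of species that support a predicted interaction) are all hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical ortholog table: species -> {gene in that species: [Arabidopsis orthologs]}
orthologs = {
    "yeast": {"YGR1": ["AT1G01"], "YGR2": ["AT2G02"]},
    "human": {"HSA1": ["AT1G01"], "HSA2": ["AT2G02"], "HSA3": ["AT3G03"]},
}

# Hypothetical known interactions reported in each species
interactions = {
    "yeast": [("YGR1", "YGR2")],
    "human": [("HSA1", "HSA2"), ("HSA2", "HSA3")],
}


def predict_interologs(orthologs, interactions):
    """Map each known interaction onto Arabidopsis via the ortholog table.

    Returns {(gene_a, gene_b): score}, where the score is simply the number
    of species supporting the pair (an assumed, illustrative scoring rule).
    """
    support = defaultdict(set)
    for species, pairs in interactions.items():
        mapping = orthologs.get(species, {})
        for a, b in pairs:
            # Every combination of orthologs of a and b is a candidate pair.
            for x, y in product(mapping.get(a, []), mapping.get(b, [])):
                if x != y:  # simplification: self-interactions are skipped
                    support[tuple(sorted((x, y)))].add(species)
    return {pair: len(species_set) for pair, species_set in support.items()}


predicted = predict_interologs(orthologs, interactions)
for pair, score in sorted(predicted.items()):
    print(pair, score)
```

In this toy data the pair supported by both yeast and human gets a higher score than the pair supported by human alone, which is the intuition behind attaching a score to each predicted interaction.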
And then we validated this predicted interactome by, on the one hand, doing
10:39
Just in point form here and in terms of the colocalization of
the Arabidopsis interologs, if we look at the network or part of the
network of the interacting orthologs, and we color the nodes (the
nodes represent the proteins), we color
the nodes according to their subcellular localization,
we can already start to see that these nodes tend to cluster together
when they share interactions.
So this means that, just visually, from a visual observation
of the network, again where the proteins are represented
by the nodes, the interactions are represented by the edges,
we see that the proteins that are in the same compartment
do tend to interact, or, to put it
another way, proteins which are predicted to interact
do seem to be in the same compartment. We can actually test this statistically.
11:36
And that's described in this paper here.
But we see a definite enrichment for
interacting orthologs to be in the same compartment as
shown along the diagonal here, by this red
coloring where we have a p less than 0.01,
an enrichment for them being in the same compartment.
And in fact, we actually see depletion in the case
of the interacting orthologs being in other compartments.
12:29
the co-expression score.
And we used a compendium of gene expression data across about 1,000
different conditions (tissues, responses to abiotic stress and so on)
to calculate this Pearson Correlation Coefficient.
And then we compared this to three random data sets: random from the entire
proteome, random from the interolog dataset, random
with the same topology as the interolog network.
And these are the results. There are two important curves on this graph,
where we're looking at the Pearson correlation coefficient on this axis;
it's a
13:10
distribution graph. The first is the blue line here,
which denotes the distribution of Pearson correlation coefficient scores
for our predicted interologs, the predicted interactors.
And we see that the Pearson correlation score
on average is around 0.8 for these predicted interactors.
So a score of one means
that the genes are perfectly coexpressed.
They're always on at the same time at the same place.
A score of zero means that the genes are not
at all correlated in terms of their expression pattern.
And a score of minus one would mean that
the genes are anti-correlated in terms of their expression pattern.
The other important line here is this purple line and that is the
random network that we generated.
And we see that the Pearson correlation coefficient
is approximately 0.2 on average for this random network.
So we do see a distinct difference between our predicted interactors
and our random network in terms of the co-expression scores.
So, this gives us another level of support for our predicted interactors.
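As a rough illustration of the co-expression score discussed above, here is a minimal sketch of the Pearson correlation coefficient between two expression profiles. The expression values are made up for illustration; in practice each vector would hold one gene's expression level across the same set of conditions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the mean (unnormalised covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)


# Hypothetical expression levels for three genes across four conditions
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises and falls together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # does the opposite of gene_a

print(pearson(gene_a, gene_b))  # close to +1: co-expressed
print(pearson(gene_a, gene_c))  # close to -1: anti-correlated
```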
14:27
So how can we use this predicted interaction network?
We can extend known pathways, so this is a small
subset of the SNARE-syntaxin network, which is involved in vesicle trafficking,
from an interaction database called BIND.
The original network
is based on these literature examples in BIND.
And we can extend what's known
quite dramatically with our predictions.
15:00
In terms of the protein-protein interaction
network topology, the PPI networks tend to be
scale-free and follow a power-law distribution with
respect to the connectivity distribution of the nodes.
So what this means is that there are relatively few nodes
with a high degree of connectivity, and these are the hub
proteins, and there are many more nodes with a low degree of connectivity.
So connectivity is just the number of
interactions radiating out from a given node.
So in this case,
this node here in gray would have one,
two, three, four, five, six, connections to it.
Whereas this node in white would only have one connection,
so a degree of connectivity of one.
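To make the idea of degree concrete, here is a minimal sketch that counts the connections of each node in a small, hypothetical network and then tallies how many nodes have each degree. The node names and edges are invented; the sketch assumes an undirected network in which each interaction is listed once and there are no self-interactions.

```python
from collections import Counter

# Hypothetical undirected interaction list: one tuple per interaction
edges = [
    ("hub", "p1"), ("hub", "p2"), ("hub", "p3"),
    ("hub", "p4"), ("hub", "p5"), ("hub", "p6"),
    ("p6", "p7"),
]


def degrees(edges):
    """Return the degree (number of interactions) of every node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


deg = degrees(edges)

# Degree distribution: how many nodes have degree 1, 2, 3, ...
distribution = Counter(deg.values())

for k in sorted(distribution):
    print(f"degree {k}: {distribution[k]} node(s)")
```

In a scale-free network, plotting this distribution on log-log axes would give roughly a straight line: many nodes with a low degree and a few highly connected hubs.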
15:50
So there are some other terms for describing networks, including
betweenness, so how many network paths pass through a given node.
Connectivity, how many nodes or edges need to be
removed to disconnect the remaining nodes from each other.
And the degree, which I just told you about.
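Betweenness can be computed in several ways. The sketch below uses Brandes' algorithm for unweighted, undirected graphs, which is one common choice and not necessarily what any particular tool uses internally. It counts, for each node, the shortest paths between other pairs of nodes that pass through it, with paths shared fractionally when several shortest paths exist. The example network is hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Turn an undirected edge list into {node: set of neighbours}."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm, BFS version)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in adj}      # predecessors on shortest paths
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:           # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both ends in an undirected graph
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical network: a hub with three partners, one of which has its own partner
adj = build_adjacency([("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")])
scores = betweenness(adj)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, scores[node])
```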
We can also generate clusters of protein-protein
interactions using some of these parameters as
cut-offs.
And then with those clusters we can actually ask the question,
is there enrichment for any particular term associated with that cluster?
16:27
So more than 50 protein-protein
interaction databases exist for several model organisms.
It's important to know how the data in the
database are generated. If the database
aggregates protein-protein interaction data,
does it provide a link back to the
primary data source, the literature reference?
There are several tools available for
analyzing protein-protein interaction data, and
we'll be using one called Cytoscape in the lab today.
It's a fairly powerful tool.
17:50
as to whether or not they're enriched are called Gene Ontology terms.
So an ontology is a controlled vocabulary for describing
a knowledge system, and in the case of the Gene Ontology for
classifying genes or gene products, actually,
there are three main organizing principles.
And those are molecular function,
biological process, and cellular component.
18:24
like cell growth and/or maintenance. Such as nuclear division.
And nuclear division in turn, would
contain some subset of genes of a given organism.
So Gene Ontology is organism agnostic.
And that's actually quite nice, because then
you can actually start to compare between organisms.
Now, GO was initially proposed
in 1998, by Michael Ashburner for an ISMB bioinformatics conference in Montreal.
And its aims were to develop a set of shared vocabularies of terms that describe
aspects of molecular biology and that are common to more than one life form.
To describe gene products held in each contributing model organism database.
And to provide a scientific resource
for access to the vocabularies, the annotations
and the associated data, and to provide
a software resource to assist in the
curation of GO term assignments to biological objects.
So the structure of GO is a directed acyclic graph,
and I'll show you what that
means in a minute, and it simply means that
one child term can have more than one parent term.
We can explore GO using godatabase.org. And
19:46
here is an example of the GO-Biological Process for DNA metabolism.
We see several sub-categories underneath that, such as DNA degradation,
DNA replication, DNA recombination. And under,
say, the category DNA ligation, we could have genes from yeast in blue.
We could have genes from Drosophila in
magenta and the corresponding genes from mouse in red.
So really it doesn't matter what the organism is, they
can all use the Gene Ontology term for DNA ligation,
to ascribe a particular function to a gene or set
of genes. The other aspect of the GO, the
Gene Ontology that I was just mentioning, is this directed acyclic
nature of the graph. And we see that DNA ligation actually has
three parent terms: DNA recombination, DNA repair and DNA-dependent DNA replication.
So it's not a strict hierarchy. It is this directed acyclic graph,
21:00
which makes it very flexible.
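One way to picture a term with several parents is to store the graph as a child-to-parents table and walk upward. The sketch below is only illustrative: the three parents of DNA ligation come from the example above, but the remaining links in the `parents` table are assumptions made for the sake of a small, self-contained example and are not taken from the real ontology files.

```python
# Hypothetical fragment of GO: child term -> list of parent terms.
# Only the three parents of "DNA ligation" are taken from the lecture example;
# the other links are assumed for illustration.
parents = {
    "DNA ligation": [
        "DNA recombination",
        "DNA repair",
        "DNA-dependent DNA replication",
    ],
    "DNA recombination": ["DNA metabolism"],
    "DNA repair": ["DNA metabolism"],
    "DNA-dependent DNA replication": ["DNA replication"],
    "DNA replication": ["DNA metabolism"],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:          # a term can be reached by several routes
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


for term in sorted(ancestors("DNA ligation", parents)):
    print(term)
```

Because a child can have more than one parent, a gene annotated to DNA ligation is implicitly annotated to every ancestor found this way, which matters when counting genes per category for enrichment.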
So in order to assess whether or not
any categories, in particular GO categories, are
enriched in particular sets of proteins that
interact, we can use a hypergeometric P value.
And this is just a bit of information, from Excel, actually.
We can calculate the
hypergeometric test score, the P value, using this function in Excel, HYPGEOMDIST.
And we need four parameters. We need sample_s,
which is the number of genes in our list with a
given function. We need number_sample, which is
the total number of genes in our list of interest.
We also need population_s,
which is the total number of genes with a given function.
And we need number_pop, which is the total number of genes,
that is, the total number of genes available for sampling.
22:12
And out of the total number of genes in the list, and then in
the overall set of genes, or gene products, what is
the total number of genes with that function, and then
what is the number of genes in total?
If that makes sense.
So we can use the P-value to assess whether
or not a particular GO category is enriched for a set
of protein-protein interactions.
And that will help us to get an idea of what that set of proteins might do.
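The hypergeometric calculation can be sketched in a few lines. In the sketch below, `k`, `n`, `K` and `N` are my own shorthand for the four quantities described above (Excel calls them sample_s, number_sample, population_s and number_pop). Note that Excel's HYPGEOMDIST returns the probability of seeing exactly `k` genes; for an enrichment P value one usually sums the probabilities of seeing `k` or more, which is what `enrichment_p_value` does here. The gene counts in the example are invented.

```python
from math import comb


def hypergeom_pmf(k, n, K, N):
    """Probability of exactly k annotated genes in a list of n genes.

    k: genes in the list with the function   (Excel: sample_s)
    n: total genes in the list of interest   (Excel: number_sample)
    K: genes in the genome with the function (Excel: population_s)
    N: total genes in the genome             (Excel: number_pop)
    """
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_p_value(k, n, K, N):
    """Probability of seeing k or more annotated genes by chance (upper tail)."""
    upper = min(n, K)
    return sum(hypergeom_pmf(i, n, K, N) for i in range(k, upper + 1))


# Hypothetical numbers: 8 of the 40 proteins in a cluster carry a GO term
# that annotates 200 of the 20,000 genes in the genome.
p = enrichment_p_value(k=8, n=40, K=200, N=20000)
print(f"P = {p:.3g}")
```

A small P value suggests the term is over-represented in the cluster; when many GO terms are tested at once, some form of multiple-testing correction would normally be applied as well.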
22:47
And you might be asking, well, why do we care about that?
And the reason is that oftentimes these protein-protein interactions are
determined in the absence of any information about the biology.
So in the case of the yeast two-hybrid
system, we're not really focused on any particular aspect of biology; we
just determine whether or not a pair of proteins interacts.
And so, then, using such methods as GO enrichment can really be helpful
to make sense of the large data sets that are generated.