So one of the things that we are supposed to do as the BD2K-LINCS Data Coordination
and Integration Center is to find solutions for
putting together all the big data that is collected in biomedical research,
and specifically molecular data at the genome-wide scale,
trying to make sense of it, and developing tools that can be used
to further extract knowledge from all this data.
So we have made great strides in this direction, and
recently published two large reviews that lay out our plan
for how we can leverage all of this data and put it together.
So in the first review, we listed the most comprehensive resources
of experimental data collected in the field, and
categorized the data into seven subsections.
The first is Drug and Gene Knockdown followed by Genome-Wide Expression,
and this is what the LINCS Program is all about:
the Connectivity Map that we discussed,
the work that is done by many of the LINCS centers,
as well as the data that is deposited in the Gene Expression Omnibus.
The next category is Transcription Factors and
Histone Modifications Profiled by ChIP-seq.
There are two large-scale NIH projects, called ENCODE and
the Roadmap Epigenomics Project, that systematically use
ChIP-seq analysis to profile the binding of proteins
onto the DNA in different human cell types and conditions.
The next type of data is cell viability data after single-gene knockdowns,
or drug perturbations, of many different human cell lines.
The next type of data is gene knockout or
mutation data and their associations with disease.
We now have more and more gene expression data from individual patients,
and from different tissues of those patients.
The GTEx project provides such data, and combined with genomic sequencing
projects like The Cancer Genome Atlas, we have large collections
of data at different regulatory layers from cancer patients,
including genomics, transcriptomics, and
proteomics from individual tumors across many different types of cancer, for
large cohorts of hundreds of patients for each cancer.
There is also accumulated knowledge of protein-protein interactions and
metabolic and cell signaling pathways, which are continually extracted from
the literature, or can now be profiled with high-content screens.
Finally, there is accumulated knowledge about drugs and
toxic chemicals that cause adverse events and toxicity.
Those provide links between small molecules and human phenotypes.
All this data can be converted to what we call attribute tables;
single-entity node networks, for example gene-gene association networks
or functional association networks;
gene set libraries, which we will discuss when we cover enrichment analysis;
and bipartite graphs that connect genes to other entities.
Those are just different views of the same data.
So by collecting many of those data sets and abstracting them into those attribute
tables, bipartite graphs, gene sets, and single-node networks,
the challenge of integrating all of those resources becomes easier and indeed possible.
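To make this concrete, here is a minimal sketch in Python, using made-up gene and attribute names, of how the same underlying associations can be viewed as an attribute table, a gene set library, a bipartite graph, or a gene-gene network. The data and names are purely illustrative assumptions, not from any LINCS resource.

```python
import pandas as pd

# Toy binary attribute table (hypothetical genes and attributes):
# rows are genes, columns are attributes, 1 means "gene is associated with attribute".
attribute_table = pd.DataFrame(
    [[1, 0, 1],
     [1, 1, 0],
     [0, 1, 1]],
    index=["GeneA", "GeneB", "GeneC"],
    columns=["Pathway1", "Phenotype1", "Drug1"],
)

# The same data viewed as a gene set library: each attribute (term) maps to
# the set of genes associated with it.
gene_set_library = {
    term: set(attribute_table.index[attribute_table[term] == 1])
    for term in attribute_table.columns
}

# The same data as bipartite-graph edges connecting genes to attributes.
bipartite_edges = [
    (gene, term)
    for term, genes in gene_set_library.items()
    for gene in genes
]

# And as a single-node-type (gene-gene) network: connect two genes
# whenever they share at least one attribute.
gene_gene_edges = {
    (g1, g2)
    for genes in gene_set_library.values()
    for g1 in genes for g2 in genes
    if g1 < g2
}

print(gene_set_library)
print(bipartite_edges)
print(gene_gene_edges)
```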
The ensemble of bipartite graphs, gene sets, and networks
allows us to form connections between biological entities
that are typically not identifiable by standard methods.
And those could be of great interest to biomedical researchers,
because they can reveal interesting relationships
that are not obviously found when you're looking at one data set alone.
Graph theory algorithms and machine learning methods can now be applied
to draw novel inferences from this integrated data.
So the data integration opens an opportunity to discover new connections
among drugs, genes, diseases, tissues, and other biological entities,
and this will become gradually clearer as we advance in the course.
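As a simple, hedged illustration of the kind of inference such integration enables, the sketch below uses invented gene sets from two hypothetical resources and connects a drug to a disease when their associated genes overlap, scored here with the Jaccard index; this is one plausible scoring choice, not the specific method used by any particular center.

```python
# Toy gene set libraries from two different (hypothetical) resources.
drug_signatures = {
    "drugX": {"GeneA", "GeneB", "GeneC", "GeneD"},
    "drugY": {"GeneE", "GeneF"},
}
disease_signatures = {
    "disease1": {"GeneB", "GeneC", "GeneD", "GeneG"},
    "disease2": {"GeneF", "GeneH"},
}

def jaccard(set1, set2):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(set1 & set2) / len(set1 | set2)

# Score every drug-disease pair by the overlap of their gene sets.
connections = {
    (drug, disease): jaccard(drug_genes, disease_genes)
    for drug, drug_genes in drug_signatures.items()
    for disease, disease_genes in disease_signatures.items()
}

# Rank the inferred drug-disease connections from strongest to weakest.
for pair, score in sorted(connections.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```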
So let's look at those data structures that can be used for
this unification of representation and analysis.
So the first data structure, which is relatively obvious, is the bipartite graph.
In this data structure, you have two types of nodes.
In our case, those are genes and the biological properties,
or functions, that are associated with those genes.
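Here is a small sketch, using networkx and invented names, of such a bipartite graph, where one node type holds genes and the other holds the attributes associated with them:

```python
import networkx as nx

B = nx.Graph()
# Two node types: genes and biological attributes (terms).
B.add_nodes_from(["GeneA", "GeneB", "GeneC"], node_type="gene")
B.add_nodes_from(["apoptosis", "cell cycle"], node_type="attribute")

# Edges only connect genes to attributes, never gene-gene or attribute-attribute.
B.add_edges_from([
    ("GeneA", "apoptosis"),
    ("GeneB", "apoptosis"),
    ("GeneB", "cell cycle"),
    ("GeneC", "cell cycle"),
])

# Recover the genes associated with one attribute.
print(sorted(B.neighbors("cell cycle")))  # ['GeneB', 'GeneC']
```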
The most typical way that data is represented in biology is
through attribute tables.
And in most cases, we have genes as the rows, and the biological conditions
that measure the state of those genes, or those variables,
either gene expression or protein expression, as the columns;
these could be measurements of different cell types,
or different drugs applied to those cell types.
And the numbers in those tables don't have to be zeros and ones; they can be
an overall absolute change, or a differential change when compared to the control.
We can always apply a threshold and decide when there is a connection.
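For example, a minimal sketch, with made-up fold-change values and an arbitrary cutoff, of turning a continuous attribute table into binary gene-condition connections by thresholding:

```python
import pandas as pd

# Toy attribute table of log2 fold changes: genes as rows, conditions as columns.
fold_changes = pd.DataFrame(
    [[ 2.3, -0.1,  0.4],
     [-1.8,  0.2,  1.1],
     [ 0.3,  2.7, -2.2]],
    index=["GeneA", "GeneB", "GeneC"],
    columns=["drug1_cellType1", "drug2_cellType1", "drug1_cellType2"],
)

# Apply a threshold on the absolute change to decide when a gene is
# "connected" to a condition; the cutoff of 1.0 is an arbitrary choice.
binary_table = (fold_changes.abs() >= 1.0).astype(int)
print(binary_table)
```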
In other cases, the binary representation is the only possibility.
This is, for example, if we knock out a gene in a mouse and
we look at the possible phenotypes that the knockout can cause.
The representation of gene set libraries is
basically the same; it's just a transformation of the data,
where now the term, or label, of each set in the library
is the common function of the genes, and the genes are the members of each set.
The label, for example, could be a pathway, and
the genes in the pathway comprise the actual set.
In this example, we show how this can be transposed, where the genes become the labels,
and the terms become the set members.
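A minimal sketch of this transposition, assuming a toy gene set library keyed by hypothetical pathway labels:

```python
from collections import defaultdict

# Gene set library: each term (e.g., a pathway) maps to its member genes.
gene_set_library = {
    "pathway1": {"GeneA", "GeneB"},
    "pathway2": {"GeneB", "GeneC"},
}

# Transposed library: each gene now labels a set whose members are the terms
# (pathways) that the gene belongs to.
transposed = defaultdict(set)
for term, genes in gene_set_library.items():
    for gene in genes:
        transposed[gene].add(term)

print(dict(transposed))
# {'GeneA': {'pathway1'}, 'GeneB': {'pathway1', 'pathway2'}, 'GeneC': {'pathway2'}}
```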
From this data we can also construct a network.
So if the nodes in the network are the genes,
they can be connected based on their common shared attributes,