Assistant Professor, Principle Investigator Center for Bioinformatics, School of Life Science

Liping Wei 魏丽萍, Ph.D.

Professor, Director Center for Bioinformatics, School of Life Sciences

I'm Yuqi Meng from Center for Bioinformatics, Peking University. Let's start our computer lab.

I will introduce the basic steps of feature selection and clustering analysis.

This diagram is a summary for feature selection methods made in 1997.

Here the methods are classified into three types: Complete on the left, Heuristic in the middle, and Random on the right.

Next I will introduce heuristic search and sequential backward selection, and their applications.

The basic principle of heuristic algorithm is to evaluate the results of calculation at each step, and choose the best one to do the next calculation.

Sequential forward selection (SFS) starts with the feature set being empty and adds one feature at a time to get the optimal solution with respect to the evaluation.

Briefly, SFS chooses at each step the feature that [after addition] can optimize the evaluation function. This is in fact a simple greedy algorithm.

Sequential backward selection is just the same as SFS except that it starts with the universe (i.e. the set with all possible features included).

SFS is problematic in the sense that it cannot delete features, making it very easy to choose features whose information overlaps.

SBS is problematic in the opposite way as SFS is. These two algorithms are subject to getting (only) locally optimal solutions.

Bidirectional search (BDS) applies both SFS (starting with an empty set) and SBS (starting with the universe) at the same time, and stop when they end up with the same feature subset C.

Although BDS do not resolve the problems of SFS and SBS theoretically, it saves time. However, it is possible for BDS to get different results by SFS and SDS, and the optimal result is thus impossible to obtain.

Plus-L minus-R selection is essentially the same as the previous methods except that it adds L features and then removes R features at each training step.

The L and R are fixed, and the final result depends largely on these two values. [But the advantage of this method is that] it can avoid choosing some features with redundant information.

Sequential floating selection is derived from the LRS method. It differs from LRS as the number of features to add or to remove each time is not fixed.

This is a workflow of SFS. SFS will add to the empty set one feature each time. Then it will train all possible feature subsets after such addition and evaluate the results.

In each step of addition, SFS will keep the feature subset with the optimal result, and determine whether the final result is also optimal, i.e. it cannot be improved any more by introducing other features.

SFS will repeat this process again and again until it finds the (possibly locally) optimal solution.

This is a workflow of SBS, which is almost the same with that of SFS. The only difference is that SBS starts with the universe and removes one feature at a time; SBS and SFS are the same in all other aspects.

This is the workflow of feature selection when using SVM to do machine learning. This is a workflow of SBS.

Here are three Perl scripts.

The first one can add a feature and then remove another feature, the second one can add a feature, and the third one can remove a feature.

As we are doing SBS here, we only need to remove features.

The â10featuresâ is the original file with 10 features.

Let's run the first step.

The first step will remove the first feature of all the 10 features, and store it in the 1mis1 file in the 1miss folder.

In this 1miss fold are the files each of which is produced by removing a feature.

If all goes right, we will then need to execute the grid.py file to find the optimal C value and G value.

The C value and G value are one optimal solution selected in SVM training. Due to time limitations [of this video] and the extremely slow speed of running this, I will not train it here.

Next we assume we have got the optimal C value and G value, and we will need to train the features.

We use the 1mis1 file for the training.

Here we can see that we have got a 1mis1 model

which will be used in subsequent prediction.

Then we use this model to predict

to get the accuracy and prediction result of [machine learning] after removing a feature,

which is the 1mis.predict file here.

Then we get the â9featuresâ file after removing the âORF typeâ feature, remove the first feature in it, and store it in the â2mis1â file in the â2missâ folder.

Then we get the â9featuresâ file after removing the âORF typeâ feature, remove the first feature in it, and store it in the â2mis1â file in the â2missâ folder.

This is the result produced by the method I have just described.

This is the result after selecting features over the 10 features.

Here the blue, red, and gray [blocks] represent for positive accuracy, negative accuracy, and total accuracy, respectively. The yellow line is the harmonic mean of all results.

We can see that the one whose harmonic mean is the highest after removing the âORF typeâ variable. Therefore, we remove the âORF typeâ variable here to get the result [we want].

Then we get the â9variablesâ file after removing the âORF typeâ variable, remove the first variable in it, and store it in the â2mis1â file in the â2missâ folder. Repeat this again and again and,

after removing variables for 9 times, we can get the 9 files in this 2miss folder, each of which is the result of [machine learning] after removing a variable.

This is the result of removing each of the 9 variables.

Here the one with the highest harmonic mean has removed the âORF scoreâ variable. Removing âORF [score]â variable will increase the harmonic mean.

The positive accuracy is increased considerably. The negative accuracy is decreased. The total accuracy is increased.

The harmonic mean is increased. Therefore, according to our criteria, we remove the âORF scoreâ variable to do further variable selection.

Remove âORF scoreâ variable and we can get the â8featuresâ file. Repeat the previous process again

to store the result after removal of each of the 8 variables in the 3mis folder.

This folder should have the [result of prediction after] removing each of the 8 variables.

Let's check the result of removing each of the 8 variables.

We can see that

although we can still get the optimal choice of variable to remove, this optimal choice will also decrease the harmonic mean.

Therefore, we assume that we have got a (globally) optimal solution, and stop and end the training process. Thus the 8 variable subsets is an optimal solution.

Next let's review the basic ideas of clustering.

Let's first make it clear what it means by saying âclusteringâ. Clustering will classify [elements that are] similar into one group and study them altogether.

We can use clustering to scale down the scope of study or subject. By [treating] one class instead of one individual at a time, [we can study] more efficiently.

Similar to the way we classify similar proteins or genes into one family in biological researches,

clustering is just a process of distribute the data into different families.

There are many clustering methods. Here I will introduce distance-based clustering and hierarchical clustering.

Let me first explain some basic knowledge of distance.

The data obtained at the beginning do not denote the distance between two [samples], but the values of these two [samples].

Therefore, we need to construct a function to transform them into the value [of distance between them].

The first four calculate the distance between two points in a standard Euclidean space. The third one is a generalized one, while the first two are specialized forms of the third one.

The first is called as âabsolute distanceâ (Manhattan distance), i.e. sum of the absolute difference of the coordinates of the two points over all dimensions. The second one is Euclidean distance, which is the notion of distance we talk about in daily life.

The third one is a generalization of the first two.

The fourth one is derived as the q in the third one approaches positive infinity.

The fifth one is the so-called âMahalanobis distanceâ. Mahalanobis distance is different from the Euclidean distance in the sense that it is [based on] distance covariance. It represents whether the two points have similar trends.

The one at the bottom is the âLance and Williams distanceâ. It is good in the sense that it is immune to the effects of dimensions.

Here we use R to calculate the distances.

The dist function is used to calculate distances.

X denotes the data to use. The âmethodâ is the method we choose to use, and defaults to the Euclidean distance.

The following arguments describe how we use the data in the table.

The following two arguments denote whether we should output the diagonal and the upper traingle of the distance matrix, respectively.

P is the power of the Minkowski distance (i.e. the third formula in the previous slide). It is 2 for Euclidean distance.

One of the major concerns in clustering analysis is the distance between clusters.

Therefore, we need to consider how to transform the distance [from the distance between points to the distance between clusters] during clustering.

Here K and L are two points before merging. M is the result of merging. J is another point out of the cluster, which is used to calculate the distance.

The first two methods choose from K and L the one whose distance to the point J is the smallest or largest, and use that distance as the distance of the new point M to J.

The third method uses the midpoint of K and L as the M point to calculate the distance to J.

The fourth method averages all distances between all sample pairs to get the final distance.

The fifth one takes the center of gravity of K and L to calculate the distance.

The sixth method (Ward's method)

assumes that the distance within a cluster is relatively short,

and the distance out of a cluster is large. Therefore, we calculate the sum of squares of deviations to calculate the distance. This method will result in this formula.

In R, Hclust is the command used to do clustering.

The âdâ denotes the input data. The âmethodâ is the method [of transforming distances between points into distances between clusters] described earlier.

The âmemberâ describes whether we should use covariance matrix to replace the general similarity measure to calculate the distances. This can be used to reconstruct the phylogeny tree.

Next we will introduce how to use R to cluster transcriptome data from RNA-Seq.

Next we will introduce how to use R to cluster transcriptome data from RNA-Seq.

First let's read a dataset.

This is a dataset describing expression profiles from RNA-Seq. Then we transform this dataset into the matrix format.

After transformed into matrix, the data can be used to draw a heatmap.

Here, a red block have its row and column the most correlated, while a green block have its row and column the most uncorrelated.

We can also draw this heatmap in other ways.

Here we calculate the correlation first, and use the correlation to calculate distances.

Then we use hclust function to do clustering as described just now.

Here the clustering result is visualized as a tree-like graph.

The reason why we draw this tree-like graph is that

it could be used as a marker at the left [of the heatmap].

We can see that the âRowvâ argument is set to this [variable]. This means that the left part [of the heatmap] is represented by a tree-like graph [described in this variable].

Here the heatmap is based on correlation.

Likewise, we can use distance directly to draw another heatmap.

The method is essential the same as the previous ones. We do not calculate the correlation, but use the dist function to calculate the distances directly.