I'm Yuqi Meng from the Center for Bioinformatics, Peking University. Let's start our computer lab.
I will introduce the basic steps of feature selection and clustering analysis.
This diagram is a summary of feature selection methods from 1997.
Here the methods are classified into three types: Complete on the left, Heuristic in the middle, and Random on the right.
Next I will introduce heuristic search and sequential backward selection, and their applications.
The basic principle of a heuristic algorithm is to evaluate the results at each step and choose the best one to carry into the next step of the calculation.
Sequential forward selection (SFS) starts from an empty feature set and adds one feature at a time, aiming for the subset that is optimal with respect to the evaluation function.
Briefly, at each step SFS chooses the feature whose addition best improves the evaluation function. It is, in fact, a simple greedy algorithm.
Sequential backward selection (SBS) works just like SFS, except that it starts from the universe (i.e. the set with all possible features included).
SFS is problematic in the sense that it cannot delete features once they are added, so it easily ends up choosing features whose information overlaps.
SBS has the opposite problem: once a feature is removed, it cannot be added back. Both algorithms are prone to finding only locally optimal solutions.
Bidirectional search (BDS) runs SFS (starting from the empty set) and SBS (starting from the universe) at the same time, and stops when they arrive at the same feature subset C.
Although BDS does not theoretically resolve the problems of SFS and SBS, it saves time. However, the SFS and SBS halves may converge to different subsets, in which case the optimal result cannot be obtained.
Plus-L minus-R selection (LRS) is essentially the same as the previous methods, except that it adds L features and then removes R features at each training step.
L and R are fixed, and the final result depends heavily on these two values. The advantage of this method is that it can avoid choosing features with redundant information.
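To make the idea concrete, here is a minimal Python sketch of LRS, assuming L > R so the subset grows from the empty set, and a generic evaluate() function (for example, SVM accuracy). The function and parameter names are illustrative, not part of the lab's scripts.

```python
# A minimal sketch of plus-L minus-R selection (LRS), assuming L > R and a
# generic evaluate() that scores a candidate feature subset.
def lrs(all_features, evaluate, L=2, R=1, target_size=5):
    target_size = min(target_size, len(all_features))
    selected = []
    while len(selected) < target_size:
        for _ in range(L):                      # greedily add the L best single features
            remaining = [f for f in all_features if f not in selected]
            if not remaining:
                break
            selected.append(max(remaining, key=lambda f: evaluate(selected + [f])))
        if len(selected) >= target_size:
            break
        for _ in range(R):                      # then drop the R least useful features
            if len(selected) <= 1:
                break
            selected.remove(max(selected, key=lambda f: evaluate([x for x in selected if x != f])))
    return selected
```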
Sequential floating selection is derived from LRS. It differs from LRS in that the number of features to add or remove at each step is not fixed in advance.
This is the workflow of SFS. Starting from the empty set, SFS adds one feature at a time; it then trains on every candidate subset produced by such an addition and evaluates the results.
At each addition step, SFS keeps the feature subset with the best result and checks whether that result is already optimal, i.e. whether it can no longer be improved by introducing another feature.
SFS repeats this process again and again until it finds a (possibly only locally) optimal solution.
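As a concrete illustration, here is a minimal Python sketch of this greedy loop, assuming a generic evaluate() function that scores a feature subset (for example, SVM accuracy); the toy scoring function at the end is made up purely to show the loop stopping.

```python
# A minimal sketch of sequential forward selection (SFS).
def sfs(all_features, evaluate):
    selected = []                          # start from the empty feature set
    best_score = float("-inf")
    improved = True
    while improved and len(selected) < len(all_features):
        improved = False
        best_candidate = None
        for f in all_features:
            if f in selected:
                continue
            score = evaluate(selected + [f])   # train/evaluate with f added
            if score > best_score:
                best_score, best_candidate = score, f
        if best_candidate is not None:
            selected.append(best_candidate)    # keep the best single addition
            improved = True
    return selected, best_score


# Toy usage: a made-up scoring function (sum of feature "weights" minus a size
# penalty), just to show the loop stopping once no addition improves the score.
weights = {"f1": 3.0, "f2": 1.5, "f3": 0.2}
subset, score = sfs(list(weights), lambda s: sum(weights[f] for f in s) - 0.5 * len(s))
print(subset, score)   # -> ['f1', 'f2'] 3.5
```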
This is the workflow of SBS, which is almost the same as that of SFS. The only difference is that SBS starts from the universe and removes one feature at a time; in all other respects the two are identical.
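Reversing the same loop gives SBS. Again this is only a sketch with a generic evaluate() stand-in, not the lab's actual script.

```python
# A minimal sketch of sequential backward selection (SBS).
def sbs(all_features, evaluate):
    selected = list(all_features)          # start from the universe
    best_score = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        best_removal = None
        for f in selected:
            score = evaluate([x for x in selected if x != f])   # score with f removed
            if score > best_score:
                best_score, best_removal = score, f
        if best_removal is not None:
            selected.remove(best_removal)  # drop the feature whose removal helps most
            improved = True
    return selected, best_score
```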
This is the workflow of feature selection when using an SVM for machine learning; it is the SBS workflow.
Here are three Perl scripts.
The first one can add a feature and then remove another feature, the second one can add a feature, and the third one can remove a feature.
As we are doing SBS here, we only need to remove features.
The “10features” file is the original file with all 10 features.
Let's run the first step.
The first step removes the first of the 10 features and stores the result in the 1mis1 file in the 1miss folder.
The 1miss folder contains the files, each of which is produced by removing one feature.
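The Perl scripts themselves are not shown, but the remove-one-feature step can be sketched in a few lines of Python, assuming the feature files are in LIBSVM's sparse "label index:value ..." format and follow the 10features and 1miss/1misN naming used above.

```python
# A hedged sketch of the "remove one feature" step (the lab's actual scripts are in Perl).
import os

def remove_feature(in_path, out_path, index):
    """Copy a LIBSVM-format file, dropping the column whose 1-based index matches."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            tokens = line.split()
            if not tokens:
                continue
            kept = [tokens[0]]  # the class label
            kept += [t for t in tokens[1:] if int(t.split(":")[0]) != index]
            fout.write(" ".join(kept) + "\n")

# Produce 1miss/1mis1 ... 1miss/1mis10, each missing one of the 10 features.
os.makedirs("1miss", exist_ok=True)
for i in range(1, 11):
    remove_feature("10features", os.path.join("1miss", f"1mis{i}"), i)
```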
If all goes right, we then need to execute grid.py to find the optimal C value and g value.
The C value and g value are the optimal parameters selected for SVM training. Due to the time limitations of this video and the extremely slow speed of this step, I will not run the training here.
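For completeness, this is roughly what that grid-search step can look like when driven from Python. It assumes LIBSVM's grid.py and svm-train are reachable from the working directory; the -gnuplot null option, the -svmtrain option for pointing at svm-train, and the format of the final output line are assumptions about the standard LIBSVM tools, so check your local copy.

```python
# A hedged sketch of the C/g grid search with LIBSVM's grid.py.
# Assumption: grid.py can find svm-train (otherwise point to it with -svmtrain),
# and "-gnuplot null" disables the contour plot so gnuplot is not required.
import subprocess

result = subprocess.run(
    ["python", "grid.py", "-gnuplot", "null", "1miss/1mis1"],
    capture_output=True, text=True, check=True)

# Assumption: grid.py prints "best_C best_gamma cv_accuracy" as its last line.
best_c, best_g, cv_acc = result.stdout.strip().splitlines()[-1].split()
print("best C =", best_c, " best g =", best_g, " CV accuracy =", cv_acc)
```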
Next, assuming we have obtained the optimal C and g values, we need to train on the features.
We use the 1mis1 file for the training.
Here we can see that we have obtained a 1mis1 model,
which will be used in the subsequent prediction.
Then we use this model to do the prediction
and get the accuracy and prediction results of the machine learning after removing one feature,
which is the 1mis.predict file here.
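The training and prediction steps correspond roughly to the following LIBSVM command-line calls, driven here via subprocess. The C and g values are placeholders for whatever grid.py reported, and test_file stands in for the held-out data set, whose actual name is not shown here.

```python
# A sketch of the training and prediction steps using LIBSVM's command-line tools.
import subprocess

best_c, best_g = "8.0", "0.125"   # placeholders: use the values grid.py reported

# svm-train with the chosen C and g (RBF kernel by default) -> 1mis1.model
subprocess.run(["svm-train", "-c", best_c, "-g", best_g,
                "1miss/1mis1", "1miss/1mis1.model"], check=True)

# svm-predict applies the model to held-out data, writes the predictions,
# and prints the overall accuracy; "test_file" is a placeholder name.
subprocess.run(["svm-predict", "test_file", "1miss/1mis1.model",
                "1miss/1mis1.predict"], check=True)
```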
Then we take the “9features” file obtained after removing the “ORF type” feature, remove the first feature in it, and store the result in the “2mis1” file in the “2miss” folder.
This is the result produced by the method I have just described.
This is the result of running the selection over the 10 features.
Here the blue, red, and gray blocks represent the positive accuracy, negative accuracy, and total accuracy, respectively. The yellow line is the harmonic mean of these results.
We can see that the harmonic mean is highest for the run in which the “ORF type” variable is removed. Therefore, we remove the “ORF type” variable here to get the result we want.
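The selection rule itself is simple to state in code. A small sketch, assuming the harmonic mean is taken over the positive, negative, and total accuracies of each run (the exact quantity behind the yellow line is not spelled out here); the feature names and accuracy numbers below are purely hypothetical.

```python
# Pick the feature whose removal gives the highest harmonic mean of the accuracies.
from statistics import harmonic_mean

# Hypothetical (positive, negative, total) accuracies for two removed features.
results = {
    "feature_A": (0.92, 0.81, 0.86),
    "feature_B": (0.88, 0.83, 0.85),
}

best = max(results, key=lambda f: harmonic_mean(results[f]))
print("remove", best, "-> harmonic mean", round(harmonic_mean(results[best]), 3))
```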
Then we take the “9features” file obtained after removing the “ORF type” variable, remove the first variable in it, and store the result in the “2mis1” file in the “2miss” folder. Repeating this again and again,
after removing each of the 9 variables in turn, we get the 9 files in this 2miss folder, each of which gives the machine-learning result after removing one more variable.
This is the result of removing each of the 9 variables.
Here the run with the highest harmonic mean is the one in which the “ORF score” variable was removed; removing the “ORF score” variable increases the harmonic mean.
The positive accuracy increases considerably, the negative accuracy decreases, the total accuracy increases,
and the harmonic mean increases. Therefore, according to our criterion, we remove the “ORF score” variable and continue the variable selection.
Remove the “ORF score” variable and we get the “8features” file. Then we repeat the previous process again