Feature selection, together with hyperparameter tuning, is probably the **most important part** of building a machine learning model. How can we select the right set of features? And is that choice related to the model's hyperparameters?

Let’s see an example in Python.

## The purpose of feature selection

Feature selection is one of the **most fascinating** and probably most underestimated fields in machine learning. Many people give too much importance to the model and think that a **complex model** will automatically learn which are the **most important variables** to use.

My experience as a data scientist has shown that **simple algorithms** often **generalize** better than complicated ones and that feature selection is often **more important** than the model itself. If you choose the wrong features, no model will learn anything. If you choose the **right features**, even a simple model can achieve good results.

## Unsupervised or supervised?

Unsupervised feature selection covers techniques that don’t rely on a model’s performance but **only on the data**. They are applied **before** any model training, so they are **model-free**. Examples include selecting the variables most correlated with the target using **Pearson’s correlation coefficient**, the chi-square statistic, mutual information and so on. This kind of feature selection is quite powerful, but it can be unreliable if it’s not followed by a proper model. For example, Pearson’s correlation coefficient measures linear correlation, so if the model is non-linear, the features selected by a linear approach may not be the best possible set.
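As a minimal sketch of such a model-free (filter) step, the hypothetical helper below ranks features by their absolute Pearson correlation with the target and keeps the top *k*; it assumes `X` and `y` are NumPy arrays and is only meant to illustrate the idea:

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    # Pearson correlation between each column of X and the target y
    correlations = np.array(
        [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
    )
    # indices of the k features with the largest |r|
    return np.argsort(np.abs(correlations))[::-1][:k]
```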

Supervised feature selection chooses the set of input features that maximizes the **model’s performance**. This kind of feature selection is **model-dependent**, because different models may *think* about the input data in **different ways** and assess feature importance from different points of view. We want a good model, so searching for a feature set that maximizes our model’s performance is **reasonable**, but we must choose the model *a priori* and this choice can introduce a **bias** into our analysis.
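A common model-dependent technique is recursive feature elimination, available in scikit-learn as `RFE`. The snippet below is just an illustrative sketch of that idea on synthetic data; it is not part of the procedure developed later in this article:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Toy data just for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest features
# until only n_features_to_select remain, so the result depends on the model.
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the selected features
```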

## A mixed approach

If your model has **hyperparameters** (e.g. Random Forests), things become more difficult. How do you choose the hyperparameter values and the features? Do you choose the features **before** the hyperparameter values, or do you first optimize the hyperparameters on all the features and **then keep** only the most relevant inputs?

I don’t think that feature selection is independent of hyperparameter tuning. The same model with different hyperparameter values is actually **another model**, so it may consider the input features in a different way and show **different performance** even with the same features. Conversely, changing the features of a model while keeping the same hyperparameter values may affect performance due to **collinearity** or, more generally, due to the curse of dimensionality.

So, I think that the right answer is choosing the features and the values of the hyperparameters during the **same search procedure**. This becomes possible if we consider feature selection as part of the **hyperparameter tuning** process.

Everything will become clearer with this example. We’ll start with an **unsupervised** approach based on Pearson’s correlation coefficient. We’ll calculate the correlation coefficient between every feature and the target variable, convert it into an **F-score** and sort the variables by this score. Then we’ll take the first *k* variables with the **highest score**, optimize the hyperparameters of our model trained with these features and repeat the process, changing *k*, until **every combination** of variables and hyperparameter values has been checked. The features/hyperparameters combination that **maximizes** the average performance in a 5-fold cross-validation is the one we are looking for.

## Example in Python

Here follows an example of this procedure in Python. You can find the whole code on **GitHub**.

The selection of the *k* best variables is done by the `SelectKBest` module of scikit-learn. This object selects the *k* most important features according to a given correlation metric.
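As a quick illustration of how `SelectKBest` works on its own, before we wire it into a pipeline, here is a minimal example on synthetic data (the data and the choice of `k=3` are purely illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data just to show the API
X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

# Keep the 3 features with the highest F-score against the target
selector = SelectKBest(f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                     # (200, 3)
print(selector.get_support(indices=True))   # column indices of the kept features
```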

Then we combine `SelectKBest` and our supervised model in a `Pipeline` object. Finally, we perform the hyperparameter tuning with `GridSearchCV`, considering *k* as a hyperparameter of our pipeline. Remember, pipelines in Python work exactly like models, so every parameter of each object included in the pipeline is treated as a pipeline hyperparameter.

For our example, we’ll use the **Boston dataset** included in scikit-learn. Since the target variable is a real number, we are facing a **regression** problem. The models we’re going to use in this example are **Linear Regression** and **Random Forest** regression.

Let’s import some libraries first.

```python
import numpy as np
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
```

Then we’ll load the Boston dataset.

```python
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this example requires an older scikit-learn version.
data, target = load_boston(return_X_y=True)
```

Now we can perform our analysis with the models.

### Linear Regression

The first thing to do is to define a **pipeline** that contains the **feature selector** and the **model**.

```python
pipeline = Pipeline(
    [
        ('selector', SelectKBest(f_regression)),
        ('model', LinearRegression())
    ]
)
```

The `f_regression` function passed to the `SelectKBest` constructor tells the selector to score the variables according to an F-score calculated from **Pearson’s correlation coefficient** between each feature and the target variable. After the feature selection, a Linear Regression on the **selected features** is performed.
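For reference, the conversion from Pearson’s correlation *r* to this F-score is straightforward; the hypothetical helper below sketches it for a single feature and is only meant to make the scoring step concrete:

```python
import numpy as np

def f_score_from_correlation(x, y):
    """F-score of a single feature, derived from Pearson's r (illustrative)."""
    r = np.corrcoef(x, y)[0, 1]
    dof = len(y) - 2                       # degrees of freedom
    return (r ** 2 / (1.0 - r ** 2)) * dof
```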

Then we define the `GridSearchCV` object that performs a grid search on the number of features to use.

```python
search = GridSearchCV(
    estimator=pipeline,
    param_grid={'selector__k': [3, 4, 5, 6, 7, 8, 9, 10]},
    n_jobs=-1,
    scoring="neg_mean_squared_error",
    cv=5,
    verbose=3
)
```

This object acts exactly like a model (so it has *fit* and *predict* methods). When we fit it, it calculates the **average value** of the scoring metric (the mean squared error with a minus sign) in a **5-fold cross-validation** (*cv=5*) for each value of *k*, which is the number of the most relevant variables to consider. Finally, the grid search chooses the *k* value that **maximizes** the average score across the folds.

As you can see, the *param_grid* value contains a dictionary with one key, *selector__k*. Note the **double underscore** inside this name. This is a **special syntax** of `GridSearchCV` that makes it possible to specify the grid for the *k* parameter of the object called *selector* in the pipeline.
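If you are unsure which names are available for the grid, you can list the pipeline’s parameters; every key of the form `step__parameter` is a valid entry (shown here purely as an illustration):

```python
# Prints keys such as 'selector__k' or 'model__fit_intercept'
# for the pipeline defined above.
print(sorted(pipeline.get_params().keys()))
```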

We can now fit the grid search and check the best value for *k* and the best score achieved.

```python
search.fit(data, target)

search.best_params_
# {'selector__k': 3}

search.best_score_
# -36.4236890153343
```

As you can see, the selector has chosen the **3 most relevant** variables.
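If you want to see exactly which columns were kept, you can inspect the fitted selector inside the best pipeline (the actual indices depend on the data, so none are shown here):

```python
# The refitted best pipeline is stored in best_estimator_;
# get_support returns the indices of the columns kept by SelectKBest.
best_selector = search.best_estimator_.named_steps['selector']
print(best_selector.get_support(indices=True))
```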

Let’s see what happens with a model that has hyperparameters.

### Random Forest

For the Random Forest model, we can define another pipeline.

```python
pipeline = Pipeline(
    [
        ('selector', SelectKBest(f_regression)),
        ('model', RandomForestRegressor(random_state=0))
    ]
)
```

We’ll perform the hyperparameter tuning only on the number of trees, which is the *n_estimators* parameter of the `RandomForestRegressor` object. This value will span from 10 to 190 in steps of 10.

The grid search is then:

```python
search = GridSearchCV(
    estimator=pipeline,
    param_grid={
        'selector__k': [3, 4, 5, 6, 7, 8, 9, 10],
        'model__n_estimators': np.arange(10, 200, 10)
    },
    n_jobs=-1,
    scoring="neg_mean_squared_error",
    cv=5,
    verbose=3
)
```

This new search spans the *k* values and the *n_estimators* values **simultaneously**. With 8 values of *k*, 19 values of *n_estimators* and 5 fits per combination, it performs 8 × 19 × 5 = 760 different fits.

The final result is:

```python
search.fit(data, target)

search.best_params_
# {'model__n_estimators': 110, 'selector__k': 6}

search.best_score_
# -22.170138432624004
```

So the grid search has found **6 features** to consider and a model with **110 trees**.

## Conclusions

In this article, I’ve described a technique to **mix** feature selection and hyperparameter tuning in the same procedure, considering the feature set as a hyperparameter itself. We cannot know in advance which variables are the most important according to our model, and this matters even more when the model has hyperparameters. Different hyperparameter values may work differently with different sets of features, so feature selection should be done **together** with hyperparameter tuning.

Of course, feature selection introduces a new dimension into hyperparameter tuning, which increases the number of iterations in a grid search. A **random search** instead of a grid search may be more useful to find a good solution quickly or, if you have one, a **Spark cluster** can parallelize the calculations and speed up the program.
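As a sketch of the random-search alternative, the same pipeline can be plugged into `RandomizedSearchCV`, which samples a fixed number of combinations instead of trying them all (the `n_iter` value below is just an illustrative choice):

```python
from sklearn.model_selection import RandomizedSearchCV

# Sample 50 random (k, n_estimators) combinations instead of all 152
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions={
        'selector__k': [3, 4, 5, 6, 7, 8, 9, 10],
        'model__n_estimators': np.arange(10, 200, 10)
    },
    n_iter=50,
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
    random_state=0
)
random_search.fit(data, target)
```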